
MoST: Mixture of Speech and Text Model

MoST (Mixture of Speech and Text) is a multimodal foundation model built on the DeepSeek-V2 Lite architecture, enhanced with a Modality-Aware Mixture of Experts (MAMoE) approach for handling both speech and text modalities.

Model Description

MoST is designed for speech-text tasks with the following key features:

  • Based on the DeepSeek-V2 Lite architecture
  • Mixture of Experts (MoE) layers for modality-aware processing
  • Specialized routing for both speech tokens and text tokens
  • Modality-aware attention mechanisms
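The modality-aware routing described above can be sketched as follows. This is an illustrative assumption, not the actual MoST implementation: it supposes that token ids at or above some offset are speech tokens, and routes each position's hidden state to a speech or text expert group accordingly. The names `SPEECH_TOKEN_OFFSET`, `split_by_modality`, and `mamoe_forward` are hypothetical.

```python
import torch

# Hypothetical: ids at or above this offset are treated as speech tokens.
SPEECH_TOKEN_OFFSET = 100_000

def split_by_modality(input_ids: torch.Tensor):
    """Return boolean masks selecting text vs. speech token positions."""
    speech_mask = input_ids >= SPEECH_TOKEN_OFFSET
    return ~speech_mask, speech_mask

def mamoe_forward(hidden, input_ids, text_experts, speech_experts):
    """Route each position's hidden state through its modality's experts."""
    text_mask, speech_mask = split_by_modality(input_ids)
    out = torch.empty_like(hidden)
    if text_mask.any():
        out[text_mask] = text_experts(hidden[text_mask])
    if speech_mask.any():
        out[speech_mask] = speech_experts(hidden[speech_mask])
    return out
```

In this sketch the router is deterministic (by token modality) rather than learned, which is the distinguishing idea of modality-aware routing compared with a standard learned top-k MoE gate.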

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Yuxuan98/MoST-speech-text-moe"
# trust_remote_code=True is required because the model uses custom code
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Text-only example
text_inputs = tokenizer("Hello, this is an example", return_tensors="pt")
text_output = model.generate(**text_inputs, max_new_tokens=50)
print(tokenizer.decode(text_output[0], skip_special_tokens=True))

# Speech-text example will require audio encoding
# See documentation for details on audio-text inputs
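As a rough illustration of what a combined speech-text input could look like, the sketch below assumes the audio has already been discretized into speech token ids by a separate codec or audio encoder (the actual audio front end is described in the documentation). The helper name `build_speech_text_prompt` and the token layout are assumptions for illustration only.

```python
import torch

def build_speech_text_prompt(speech_token_ids, text_token_ids):
    """Concatenate pre-computed speech token ids with text token ids
    into a single (1, seq_len) input tensor for generation.

    Hypothetical layout: speech tokens first, then the text prompt."""
    return torch.tensor([list(speech_token_ids) + list(text_token_ids)])
```

The resulting tensor could then be passed to `model.generate(input_ids=...)` in the same way as the text-only example above.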

Limitations

  • The model is still experimental and under development
  • Performance may vary across different speech and text tasks
  • May not generalize well to all domains and languages

Model Details

  • Format: Safetensors
  • Model size: 16B params
  • Tensor type: F32