# MoST: Mixture of Speech and Text Model
MoST (Mixture of Speech and Text) is a multimodal foundation model built on the DeepSeek-V2 Lite architecture and enhanced with a Modality-Aware Mixture of Experts (MAMoE) approach for handling both speech and text modalities.
## Model Description
MoST is designed for speech-text tasks with the following key features:
- Built on the DeepSeek-V2 Lite architecture
- Mixture of Experts (MoE) layers for modality-aware processing
- Specialized routing that directs speech tokens and text tokens to the appropriate experts
- Modality-aware attention mechanisms
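The model card does not spell out how MAMoE routing works internally, but the core idea of modality-aware expert routing can be sketched as follows: each token carries a modality label, and tokens are dispatched to expert pools reserved for that modality before the outputs are merged back in order. This is a hypothetical illustration, not the model's actual implementation; the expert functions here are toy stand-ins for MoE feed-forward experts.

```python
import numpy as np

def modality_aware_route(tokens, speech_mask, text_expert, speech_expert):
    """Dispatch each token to the expert matching its modality,
    then merge the outputs back into the original token order.
    (Illustrative sketch only; not MoST's actual routing code.)"""
    out = np.empty_like(tokens)
    out[speech_mask] = speech_expert(tokens[speech_mask])
    out[~speech_mask] = text_expert(tokens[~speech_mask])
    return out

# Toy experts standing in for MoE feed-forward experts
text_expert = lambda x: x * 2.0
speech_expert = lambda x: x + 1.0

tokens = np.array([1.0, 2.0, 3.0, 4.0])
speech_mask = np.array([False, True, False, True])  # True = speech token
print(modality_aware_route(tokens, speech_mask, text_expert, speech_expert))
```

In a real MAMoE layer the router would also balance load within each modality's expert pool; the point of the sketch is only that the modality label constrains which experts a token can reach.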
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Yuxuan98/MoST-speech-text-moe"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Text-only example
text_inputs = tokenizer("Hello, this is an example", return_tensors="pt")
text_output = model.generate(**text_inputs, max_length=50)
print(tokenizer.decode(text_output[0], skip_special_tokens=True))

# Speech-text examples require audio encoding;
# see the documentation for details on audio-text inputs.
```
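The card does not specify the audio front end, so the exact speech-input format is unknown. Purely as an illustration of how discrete speech units are often fed to speech-text LMs, the sketch below renders hypothetical unit IDs as special tokens interleaved with text; the token names (`<speech_start>`, `<speech_N>`, `<speech_end>`) are assumptions, not MoST's actual vocabulary.

```python
# Hypothetical sketch: discrete speech units rendered as special tokens.
# The actual unit tokens and audio encoder are NOT specified by this model card.
def build_speech_text_prompt(text_before, unit_ids, text_after):
    """Interleave assumed speech-unit tokens with surrounding text."""
    units = "".join(f"<speech_{u}>" for u in unit_ids)
    return f"{text_before} <speech_start>{units}<speech_end> {text_after}"

prompt = build_speech_text_prompt("Transcribe:", [12, 407, 88], "Transcript:")
print(prompt)
# Transcribe: <speech_start><speech_12><speech_407><speech_88><speech_end> Transcript:
```

Such a string would then be tokenized like any other input, provided the units exist as special tokens in the model's vocabulary.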
## Limitations
- The model is still experimental and under development
- Performance may vary across different speech and text tasks
- May not generalize well to all domains and languages