
MoST: Mixture of Speech and Text Model

MoST (Mixture of Speech and Text) is a multimodal foundation model built on the DeepSeek-V2 Lite architecture, enhanced with a Modality-Aware Mixture of Experts (MAMoE) approach for handling both speech and text modalities.

Model Description

MoST is designed for speech-text tasks with the following key features:

  • Based on the DeepSeek-V2 Lite architecture
  • Mixture of Experts (MoE) layers for modality-aware processing
  • Specialized routing for both speech tokens and text tokens
  • Modality-aware attention mechanisms
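The modality-aware routing described above can be sketched as follows. This is an illustrative assumption, not the actual MoST implementation: it supposes that token ids at or above some offset are speech tokens, and routes each position's hidden state to a speech or text expert group accordingly. The names `SPEECH_TOKEN_OFFSET`, `split_by_modality`, and `mamoe_forward` are hypothetical.

```python
import torch

# Hypothetical: ids at or above this offset are treated as speech tokens.
SPEECH_TOKEN_OFFSET = 100_000

def split_by_modality(input_ids: torch.Tensor):
    """Return boolean masks selecting text vs. speech token positions."""
    speech_mask = input_ids >= SPEECH_TOKEN_OFFSET
    return ~speech_mask, speech_mask

def mamoe_forward(hidden, input_ids, text_experts, speech_experts):
    """Route each position's hidden state through its modality's experts."""
    text_mask, speech_mask = split_by_modality(input_ids)
    out = torch.empty_like(hidden)
    if text_mask.any():
        out[text_mask] = text_experts(hidden[text_mask])
    if speech_mask.any():
        out[speech_mask] = speech_experts(hidden[speech_mask])
    return out
```

In this sketch the router is deterministic (by token modality) rather than learned, which is the distinguishing idea of modality-aware routing compared with a standard learned top-k MoE gate.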

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Yuxuan98/MoST-speech-text-moe"
# trust_remote_code=True is required because the model uses custom code
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Text-only example
text_inputs = tokenizer("Hello, this is an example", return_tensors="pt")
text_output = model.generate(**text_inputs, max_new_tokens=50)
print(tokenizer.decode(text_output[0], skip_special_tokens=True))

# Speech-text example will require audio encoding
# See documentation for details on audio-text inputs
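As a rough illustration of what a combined speech-text input could look like, the sketch below assumes the audio has already been discretized into speech token ids by a separate codec or audio encoder (the actual audio front end is described in the documentation). The helper name `build_speech_text_prompt` and the token layout are assumptions for illustration only.

```python
import torch

def build_speech_text_prompt(speech_token_ids, text_token_ids):
    """Concatenate pre-computed speech token ids with text token ids
    into a single (1, seq_len) input tensor for generation.

    Hypothetical layout: speech tokens first, then the text prompt."""
    return torch.tensor([list(speech_token_ids) + list(text_token_ids)])
```

The resulting tensor could then be passed to `model.generate(input_ids=...)` in the same way as the text-only example above.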

Limitations

  • The model is still experimental and under development
  • Performance may vary across different speech and text tasks
  • May not generalize well to all domains and languages

Model Details

  • Format: Safetensors
  • Model size: 16B params
  • Tensor type: F32