Voxtral Mini 4B Realtime
This is a 4-bit quantized MLX conversion of mistralai/Voxtral-Mini-4B-Realtime-2602, Mistral AI's streaming speech-to-text model.
Runs via mlx-audio.
Key Details
| Property | Value |
|---|---|
| Parameters | 4B (~3.4B LM + ~0.6B Audio Encoder) |
| Quantization | int4 |
| Base model | mistralai/Voxtral-Mini-4B-Realtime-2602 |
| Languages | 13 (Arabic, German, English, Spanish, French, Hindi, Italian, Dutch, Portuguese, Chinese, Japanese, Korean, Russian) |
| License | Apache 2.0 |
See also: fp16 variant
Usage
pip install "mlx-audio[stt]"
from mlx_audio.stt.utils import load
model = load("mlx-community/Voxtral-Mini-4B-Realtime-2602-4bit")
# Transcribe audio
result = model.generate("audio.wav")
print(result.text)
# Streaming transcription
for chunk in model.generate("audio.wav", stream=True):
    print(chunk, end="", flush=True)
# Adjust transcription delay (lower = faster but less accurate)
result = model.generate("audio.wav", transcription_delay_ms=480)
Recommended Settings
| Setting | Value | Notes |
|---|---|---|
| Temperature | 0.0 | Always use greedy decoding |
| Transcription delay | 480ms | Sweet spot of accuracy vs. latency |
| Delay range | 80ms – 2400ms | Multiples of 80ms |
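The delay constraints in the table can be checked before a value is passed to `transcription_delay_ms`. The helper below is a minimal sketch and not part of mlx-audio: `snap_delay_ms` is a hypothetical name, and clamping plus rounding to the nearest valid step is an assumption about how an off-grid request should be handled.

```python
STEP_MS = 80      # delay granularity from the settings table
MIN_MS = 80       # smallest supported delay
MAX_MS = 2400     # largest supported delay


def snap_delay_ms(requested_ms: float) -> int:
    """Clamp to [80, 2400] ms and round to the nearest multiple of 80 ms."""
    clamped = min(max(requested_ms, MIN_MS), MAX_MS)
    # Round half up explicitly; built-in round() uses banker's rounding.
    steps = int(clamped / STEP_MS + 0.5)
    return steps * STEP_MS


if __name__ == "__main__":
    for requested in (10, 500, 520, 5000):
        print(requested, "->", snap_delay_ms(requested))
```

The snapped value could then be passed as `transcription_delay_ms=snap_delay_ms(500)` in the call shown under Usage.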
Benchmarks (from upstream)
FLEURS (WER%; AVG covers all 13 languages, 7 shown)
| Delay | AVG | EN | FR | DE | ES | ZH | JA | KO |
|---|---|---|---|---|---|---|---|---|
| 160ms | 12.60 | 6.46 | 9.75 | 9.50 | 5.34 | 17.67 | 19.17 | 19.81 |
| 480ms | 8.72 | 4.90 | 6.42 | 6.19 | 3.31 | 10.45 | 9.59 | 15.74 |
| 960ms | 7.70 | 4.34 | 5.68 | 4.87 | 2.98 | 8.99 | 6.80 | 14.90 |
| 2400ms | 6.73 | 4.05 | 5.23 | 4.15 | 2.71 | 8.48 | 5.50 | 14.30 |
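To make the accuracy/latency trade-off in the FLEURS table easier to read, the sketch below computes the relative WER reduction between adjacent delay settings. It uses only the AVG column from the table; the function name is illustrative.

```python
# Average FLEURS WER (%) by transcription delay in ms, copied from the table.
FLEURS_AVG_WER = {160: 12.60, 480: 8.72, 960: 7.70, 2400: 6.73}


def relative_wer_reduction(from_ms: int, to_ms: int) -> float:
    """Relative WER reduction (in %) when moving from one delay to another."""
    before = FLEURS_AVG_WER[from_ms]
    after = FLEURS_AVG_WER[to_ms]
    return 100.0 * (before - after) / before


for lo, hi in [(160, 480), (480, 960), (960, 2400)]:
    gain = relative_wer_reduction(lo, hi)
    print(f"{lo}ms -> {hi}ms: {gain:.1f}% relative WER reduction")
```

On these numbers, going from 160ms to 480ms removes roughly 31% of the average error, while each later step removes roughly 12 to 13%, which is consistent with 480ms being the recommended default.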
Long-form English (WER%)
| Delay | Meanwhile | Earnings-21 | Earnings-22 | TEDLIUM |
|---|---|---|---|---|
| 480ms | 5.05 | 10.23 | 12.30 | 3.17 |
Architecture
- Causal audio encoder (~0.6B) with sliding window attention, which enables true streaming
- Language model decoder (~3.4B) based on Ministral-3B with adaptive RMS norm conditioned on transcription delay
- 4x downsampling from encoder to decoder (frame rate = 12.5 Hz)
- Both components use sliding window attention for unbounded audio length
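The frame-rate figures above tie directly to the 80ms delay granularity: one frame at 12.5 Hz lasts 80ms. The sketch below works through that arithmetic. It assumes the 12.5 Hz figure is the decoder-side rate, so the encoder-side rate is derived as 4 × 12.5 Hz; none of the values are read from the model config, and the function names are illustrative.

```python
DECODER_FRAME_RATE_HZ = 12.5   # from the architecture notes
DOWNSAMPLING = 4               # encoder-to-decoder downsampling factor

# Derived by arithmetic only (assumption: 12.5 Hz is the decoder-side rate).
ENCODER_FRAME_RATE_HZ = DECODER_FRAME_RATE_HZ * DOWNSAMPLING   # 50.0 Hz
DECODER_FRAME_MS = 1000.0 / DECODER_FRAME_RATE_HZ              # 80.0 ms


def delay_in_frames(delay_ms: int) -> int:
    """Number of decoder frames covered by a transcription delay."""
    frames, remainder = divmod(delay_ms, int(DECODER_FRAME_MS))
    if remainder:
        raise ValueError(
            f"{delay_ms} ms is not a multiple of {DECODER_FRAME_MS:.0f} ms"
        )
    return frames


def decoder_steps(audio_seconds: float) -> int:
    """Approximate decoder steps for a clip, ignoring padding and delay offset."""
    return int(audio_seconds * DECODER_FRAME_RATE_HZ)


if __name__ == "__main__":
    print(delay_in_frames(480))   # frames of look-ahead at the default delay
    print(decoder_steps(60.0))    # decoder steps for one minute of audio
```

Under these assumptions, the default 480ms delay corresponds to 6 decoder frames, and one minute of audio corresponds to about 750 decoder steps.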