Voxtral Mini 4B Realtime

This is a 4-bit quantized MLX conversion of mistralai/Voxtral-Mini-4B-Realtime-2602, Mistral AI's streaming speech-to-text model.

Runs via mlx-audio.

Key Details


Property	Value
Parameters	4B (~3.4B LM + ~0.6B Audio Encoder)
Quantization	int4
Base model	mistralai/Voxtral-Mini-4B-Realtime-2602
Languages	13 (Arabic, German, English, Spanish, French, Hindi, Italian, Dutch, Portuguese, Chinese, Japanese, Korean, Russian)
License	Apache 2.0

Usage

# Quote the extra so shells such as zsh do not treat the brackets as a glob pattern
pip install "mlx-audio[stt]"

from mlx_audio.stt.utils import load

model = load("mlx-community/Voxtral-Mini-4B-Realtime-2602-4bit")

# Transcribe audio
result = model.generate("audio.wav")
print(result.text)

# Streaming transcription
for chunk in model.generate("audio.wav", stream=True):
    print(chunk, end="", flush=True)

# Adjust transcription delay in ms (lower = lower latency but less accurate)
result = model.generate("audio.wav", transcription_delay_ms=480)

Recommended Settings

Setting	Value	Notes
Temperature	`0.0`	Always use greedy decoding
Transcription delay	`480ms`	Sweet spot of accuracy vs. latency
Delay range	`80ms` – `2400ms`	Multiples of 80ms
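
The delay constraints in the table above (80 ms to 2400 ms, in multiples of 80 ms) can be checked before a value is passed to the model. The following is a minimal sketch in plain Python; `is_valid_delay_ms` and `snap_delay_ms` are hypothetical helper names for illustration and are not part of mlx-audio.

```python
# Hypothetical helpers, not part of mlx-audio: they only encode the
# range and step listed in the Recommended Settings table.
MIN_DELAY_MS = 80
MAX_DELAY_MS = 2400
STEP_MS = 80


def is_valid_delay_ms(delay_ms: int) -> bool:
    """Return True if delay_ms is within range and a multiple of 80 ms."""
    return MIN_DELAY_MS <= delay_ms <= MAX_DELAY_MS and delay_ms % STEP_MS == 0


def snap_delay_ms(requested_ms: int) -> int:
    """Clamp to [80, 2400] and round to the nearest multiple of 80 ms."""
    clamped = max(MIN_DELAY_MS, min(MAX_DELAY_MS, requested_ms))
    # Integer rounding to the nearest step (ties round up).
    steps = (clamped + STEP_MS // 2) // STEP_MS
    return steps * STEP_MS


if __name__ == "__main__":
    for requested in (0, 500, 480, 5000):
        print(requested, "->", snap_delay_ms(requested))
```

A snapped value could then be supplied as the `transcription_delay_ms` argument shown in the Usage section.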

Benchmarks (from upstream)

FLEURS (WER%; AVG is over all 13 languages, selected languages shown)

Delay	AVG	EN	FR	DE	ES	ZH	JA	KO
160ms	12.60	6.46	9.75	9.50	5.34	17.67	19.17	19.81
480ms	8.72	4.90	6.42	6.19	3.31	10.45	9.59	15.74
960ms	7.70	4.34	5.68	4.87	2.98	8.99	6.80	14.90
2400ms	6.73	4.05	5.23	4.15	2.71	8.48	5.50	14.30
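
To see how much each step up in delay buys, the AVG column above can be turned into relative WER reductions. This is a small illustrative calculation using only the numbers in the table; the function names are made up for this sketch.

```python
# AVG WER% per transcription delay, copied from the FLEURS table above.
AVG_WER = {160: 12.60, 480: 8.72, 960: 7.70, 2400: 6.73}


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which WER drops when going from `before` to `after`."""
    return (before - after) / before


def stepwise_reductions(wer_by_delay: dict) -> list:
    """Return (lower_delay, higher_delay, relative_reduction) per adjacent pair."""
    delays = sorted(wer_by_delay)
    return [
        (lo, hi, relative_reduction(wer_by_delay[lo], wer_by_delay[hi]))
        for lo, hi in zip(delays, delays[1:])
    ]


if __name__ == "__main__":
    for lo, hi, gain in stepwise_reductions(AVG_WER):
        print(f"{lo} ms -> {hi} ms: {gain:.1%} relative WER reduction")
```

On these figures the largest single gain comes from moving from 160 ms to 480 ms (about a 31% relative reduction), with smaller gains for each further increase, which is consistent with 480 ms being recommended as the default.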

Long-form English (WER%)

Delay	Meanwhile	Earnings-21	Earnings-22	TEDLIUM
480ms	5.05	10.23	12.30	3.17

Architecture

Causal audio encoder (~0.6B) with sliding window attention — enables true streaming
Language model decoder (~3.4B) based on Ministral-3B with adaptive RMS norm conditioned on transcription delay
4x temporal downsampling from encoder to decoder (decoder frame rate = 12.5 Hz, i.e. one frame per 80 ms of audio)
Both components use sliding window attention for unbounded audio length
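
The frame-rate figures above imply some simple arithmetic, sketched here. It assumes that the 12.5 Hz rate applies on the decoder side, so that the encoder runs 4x faster; the constant and function names are illustrative only and do not come from the model code.

```python
import math

# Values from the Architecture notes above. Treating 12.5 Hz as the
# decoder-side rate is an assumption of this sketch.
DECODER_FRAME_RATE_HZ = 12.5
DOWNSAMPLING_FACTOR = 4


def decoder_frame_ms() -> float:
    """Audio duration covered by one decoder frame, in milliseconds."""
    return 1000.0 / DECODER_FRAME_RATE_HZ


def encoder_frame_rate_hz() -> float:
    """Encoder frame rate before the 4x downsampling."""
    return DECODER_FRAME_RATE_HZ * DOWNSAMPLING_FACTOR


def decoder_frames(audio_seconds: float) -> int:
    """Number of whole decoder frames for a clip of the given length."""
    return math.floor(audio_seconds * DECODER_FRAME_RATE_HZ)


def delay_in_frames(delay_ms: int) -> float:
    """Express a transcription delay as a count of decoder frames."""
    return delay_ms / decoder_frame_ms()


if __name__ == "__main__":
    print("ms per decoder frame:", decoder_frame_ms())
    print("encoder frame rate (Hz):", encoder_frame_rate_hz())
    print("decoder frames in 60 s:", decoder_frames(60))
    print("480 ms delay in frames:", delay_in_frames(480))
```

Under this reading, one decoder frame spans 80 ms, which matches the 80 ms step of the allowed delay values, and the recommended 480 ms delay corresponds to six decoder frames.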
