Kyutai STT 1B (en/fr) -- GGUF
GGUF conversions and quantisations of kyutai/stt-1b-en_fr for use with CrispStrobe/CrispASR.
Available variants
| File | Quant | Size | Notes |
|---|---|---|---|
| kyutai-stt-1b.gguf | F16 | 2.0 GB | Unquantised (16-bit floats) |
| kyutai-stt-1b-q8_0.gguf | Q8_0 | 1.1 GB | High quality |
| kyutai-stt-1b-q4_k.gguf | Q4_K | 636 MB | Best size/quality tradeoff |
All variants transcribe the test audio (JFK speech) correctly.
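As a rough way to compare the variants, the file sizes can be turned into an effective number of bits per weight. The sketch below is only an estimate: it assumes exactly 1e9 parameters (the card says "~1B"), decimal units (1 GB = 1e9 bytes), and that every byte in the file is a weight, which ignores metadata and any tensors kept at higher precision.

```python
# Rough effective bits per weight implied by the file sizes listed above.
# Assumptions (not from the model card): exactly 1e9 parameters,
# decimal units, and all file bytes attributed to weights.
PARAMS = 1.0e9

SIZES_BYTES = {
    "F16": 2.0e9,    # 2.0 GB
    "Q8_0": 1.1e9,   # 1.1 GB
    "Q4_K": 636e6,   # 636 MB
}


def bits_per_weight(size_bytes: float, params: float = PARAMS) -> float:
    """Average number of bits stored per parameter."""
    return size_bytes * 8 / params


for quant, size in SIZES_BYTES.items():
    print(f"{quant}: ~{bits_per_weight(size):.1f} bits/weight")
```

Under these assumptions the three files come out at about 16, 8.8 and 5.1 bits per weight. The figures are only as accurate as the rounded parameter count and sizes.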
Model details
- Architecture: Mimi neural audio codec encoder (SEANet CNN + 8-layer transformer + RVQ with 32 codebooks at 12.5 Hz) + 16-layer causal transformer LM (2048d, RoPE, SwiGLU, RMSNorm)
- Parameters: ~1B
- Languages: English, French
- Audio input: 24 kHz mono (auto-resampled from 16 kHz)
- License: MIT
- Reference: moshi.cpp (MIT)
This is a novel codec-based ASR architecture: audio is first encoded into discrete tokens via the Mimi neural audio codec, then a causal language model autoregressively predicts text tokens from the audio codes. Unlike encoder-decoder models (Whisper, Parakeet), the entire pipeline is autoregressive.
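The numbers in the model details imply some simple arithmetic about how much audio each codec frame covers. The following sketch works that out from the stated 24 kHz input rate, 12.5 Hz frame rate and 32 codebooks. The function names are illustrative, and whether the runtime pads or drops a trailing partial frame is an assumption here (the sketch drops it).

```python
# Frame arithmetic derived from the figures in "Model details".
MIMI_SAMPLE_RATE = 24_000  # Hz, codec input rate
FRAME_RATE = 12.5          # Hz, codec frames per second
NUM_CODEBOOKS = 32         # RVQ codebooks per frame


def samples_per_frame() -> int:
    """Audio samples covered by one codec frame (24000 / 12.5)."""
    return int(MIMI_SAMPLE_RATE / FRAME_RATE)


def resampled_length(num_samples: int, src_rate: int = 16_000) -> int:
    """Sample count after resampling from src_rate to 24 kHz (rounded down)."""
    return num_samples * MIMI_SAMPLE_RATE // src_rate


def codec_frames(duration_s: float) -> int:
    """Complete codec frames in a clip; a trailing partial frame is dropped."""
    return int(duration_s * FRAME_RATE)


def codes_per_second() -> int:
    """Discrete codes produced per second across all codebooks."""
    return int(FRAME_RATE * NUM_CODEBOOKS)


print(samples_per_frame(), codec_frames(10.0), codes_per_second())
```

Each frame therefore spans 1920 samples (80 ms), and a 10-second clip yields 125 frames.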
Usage with CrispASR
git clone https://github.com/CrispStrobe/CrispASR && cd CrispASR
cmake -S . -B build && cmake --build build -j8
# Auto-detect backend from GGUF
./build/bin/crispasr -m kyutai-stt-1b-q4_k.gguf -f audio.wav
# Explicit backend
./build/bin/crispasr --backend kyutai-stt -m kyutai-stt-1b-q4_k.gguf -f audio.wav -osrt
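For batch work, the commands above can be wrapped in a small Python helper. This is a minimal sketch that uses only the flags shown above (`-m`, `-f`, `--backend`, `-osrt`). The helper names are hypothetical, and it assumes the transcript is printed to standard output, which this card does not state.

```python
import subprocess


def build_command(model, audio, binary="./build/bin/crispasr",
                  backend=None, srt=False):
    """Assemble the crispasr argument list from the flags shown above."""
    cmd = [binary]
    if backend is not None:
        cmd += ["--backend", backend]   # e.g. "kyutai-stt"
    cmd += ["-m", str(model), "-f", str(audio)]
    if srt:
        cmd.append("-osrt")
    return cmd


def transcribe(model, audio, **kwargs) -> str:
    """Run crispasr and return whatever it writes to stdout.

    Assumption: the transcript goes to stdout; check the CrispASR
    documentation for where output actually lands.
    """
    result = subprocess.run(build_command(model, audio, **kwargs),
                            capture_output=True, text=True, check=True)
    return result.stdout


# Example (requires a built crispasr binary and the files to exist):
# print(transcribe("kyutai-stt-1b-q4_k.gguf", "audio.wav"))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.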
Conversion
python models/convert-kyutai-stt-to-gguf.py --input kyutai/stt-1b-en_fr --output kyutai-stt-1b.gguf
crispasr-quantize kyutai-stt-1b.gguf kyutai-stt-1b-q4_k.gguf q4_k
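After converting or quantising, it can be useful to confirm that the output is a well-formed GGUF file before loading it. The sketch below reads only the fixed-size header. It assumes GGUF version 2 or later in little-endian layout (a 4-byte magic, a 32-bit version, then 64-bit tensor and metadata counts), and the function name is illustrative.

```python
import struct


def read_gguf_header(path):
    """Return version and counts from a GGUF header, or raise ValueError.

    Assumes GGUF v2+ little-endian layout:
    magic "GGUF" | uint32 version | uint64 tensor count | uint64 KV count.
    """
    with open(path, "rb") as f:
        header = f.read(24)
    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    version, n_tensors, n_kv = struct.unpack("<IQQ", header[4:24])
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}


# Example (requires the file to exist):
# print(read_gguf_header("kyutai-stt-1b-q4_k.gguf"))
```

A truncated download or a file in another format fails the magic check immediately, which is cheaper than waiting for the loader to reject it.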