---
license: mit
base_model: UsefulSensors/moonshine-streaming-tiny
language:
- en
tags:
- onnx
- int8
- fp16
- quantized
- optimized
- speech-recognition
- asr
- streaming
- moonshine
library_name: onnxruntime
pipeline_tag: automatic-speech-recognition
---

# Moonshine Streaming Tiny — Optimized

Optimized variants of [UsefulSensors/moonshine-streaming-tiny](https://huggingface.co/UsefulSensors/moonshine-streaming-tiny), a 34M-parameter streaming ASR model designed for real-time, on-device English speech recognition.

Based on: [Moonshine v2: Ergodic Streaming Encoder ASR](https://arxiv.org/abs/2602.12241)

## Optimized Variants

| Variant | Total Size | Size Reduction | Best For |
|---------|-----------|----------------|----------|
| **Original FP32** | 168.1 MB | — | Reference |
| **ONNX INT8** | 79.8 MB | **52%** | CPU deployment, edge devices |
| **FP16 SafeTensors** | 88.1 MB | **48%** | GPU inference |
| **ONNX FP32** | 297 MB | — | ONNX Runtime without quantization |

## Benchmark Results

Tested with 5 seconds of audio, generating up to 64 tokens:

| Variant | Avg Latency | RTF | Speedup vs FP32 CPU |
|---------|-------------|-----|---------------------|
| **PyTorch FP16 (GPU)** | 47.7 ms | 0.0095 | **1.71x** |
| PyTorch INT8 (CPU) | 78.6 ms | 0.0157 | 1.03x |
| PyTorch FP32 (CPU) | 81.3 ms | 0.0163 | 1.00x (baseline) |
| ONNX FP32 (CPU) | 115.5 ms | 0.0231 | 0.70x |
| ONNX INT8 (CPU) | 153.2 ms | 0.0306 | 0.53x |

> **Note**: The ONNX numbers include session overhead and come from a single test run. For production deployment with session reuse on real audio, ONNX Runtime typically provides better throughput, especially for long-running services. The Moonshine team reports 50 ms response latency on Apple M3 with their C++ ONNX Runtime backend.

## File Structure

```
├── onnx_int8/                             # ONNX INT8 quantized (recommended for CPU)
│   ├── encoder_model_int8.onnx            # 9.8 MB
│   ├── decoder_model_int8.onnx            # 36 MB
│   ├── decoder_with_past_model_int8.onnx  # 32 MB
│   ├── tokenizer.json
│   ├── config.json
│   └── quantize_config.json
├── onnx/                                  # ONNX FP32
│   ├── encoder_model.onnx + .data
│   ├── decoder_model.onnx + .data
│   ├── decoder_with_past_model.onnx + .data
│   └── ...
└── fp16/                                  # FP16 SafeTensors (for GPU)
    ├── model.safetensors                  # 88.1 MB
    ├── config.json
    └── tokenizer.json
```
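If you only need one variant, `huggingface_hub` can download just those files by path. A minimal sketch (assuming the `felixem/moonshine-streaming-tiny-optimized` repo id used in the examples below):

```python
# Minimal sketch: download only the ONNX INT8 files from this repo.
# Requires: pip install huggingface_hub
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    "felixem/moonshine-streaming-tiny-optimized",
    allow_patterns=["onnx_int8/*"],  # skip the FP32 ONNX and FP16 SafeTensors variants
)
# Point MODEL_DIR in the example below at f"{local_dir}/onnx_int8".
```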
## Usage

### ONNX INT8 Inference (CPU — Recommended for Edge)

```bash
pip install onnxruntime numpy tokenizers
```

```python
import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

MODEL_DIR = "onnx_int8"  # or download from this repo
BOS, EOS = 1, 2

# Load models
opts = ort.SessionOptions()
opts.intra_op_num_threads = 4
providers = ["CPUExecutionProvider"]
encoder = ort.InferenceSession(f"{MODEL_DIR}/encoder_model_int8.onnx", opts, providers=providers)
decoder = ort.InferenceSession(f"{MODEL_DIR}/decoder_model_int8.onnx", opts, providers=providers)
decoder_past = ort.InferenceSession(f"{MODEL_DIR}/decoder_with_past_model_int8.onnx", opts, providers=providers)
tokenizer = Tokenizer.from_file(f"{MODEL_DIR}/tokenizer.json")

# Prepare audio (16 kHz float32, padded to a multiple of 80 samples)
audio = np.random.randn(16000 * 5).astype(np.float32)  # replace with real audio
remainder = len(audio) % 80
if remainder:
    audio = np.pad(audio, (0, 80 - remainder))
audio_input = audio[np.newaxis, :]
attention_mask = np.ones_like(audio_input, dtype=np.int64)

# Encode audio
(enc_out,) = encoder.run(None, {
    "input_values": audio_input,
    "attention_mask": attention_mask,
})

# First decode step (no KV cache yet)
outs = decoder.run(None, {
    "decoder_input_ids": np.array([[BOS]], dtype=np.int64),
    "encoder_hidden_states": enc_out,
})
logits, past_kvs = outs[0], outs[1:]
token = int(np.argmax(logits[0, -1, :]))

# Build KV cache mapping: present_* outputs feed back in as past_* inputs
dec_out_names = [o.name for o in decoder.get_outputs()][1:]
past_in_names = {i.name for i in decoder_past.get_inputs()} - {"decoder_input_ids", "encoder_hidden_states"}
kv_dict = {}
for name, tensor in zip(dec_out_names, past_kvs):
    mapped = name.replace("present_", "past_", 1)
    if mapped in past_in_names:
        kv_dict[mapped] = tensor

# Autoregressive decode loop
past_out_names = [o.name for o in decoder_past.get_outputs()][1:]
tokens = [token]
for _ in range(255):
    if token == EOS:
        break
    inputs = {
        "decoder_input_ids": np.array([[token]], dtype=np.int64),
        "encoder_hidden_states": enc_out,
    }
    inputs.update(kv_dict)
    outs = decoder_past.run(None, inputs)
    token = int(np.argmax(outs[0][0, -1, :]))
    tokens.append(token)
    kv_dict = {}
    for name, tensor in zip(past_out_names, outs[1:]):
        mapped = name.replace("present_", "past_", 1)
        if mapped in past_in_names:
            kv_dict[mapped] = tensor

text = tokenizer.decode(tokens)
print(text)
```

### FP16 PyTorch Inference (GPU)

```python
import torch
from transformers import MoonshineStreamingForConditionalGeneration, AutoProcessor

model = MoonshineStreamingForConditionalGeneration.from_pretrained(
    "felixem/moonshine-streaming-tiny-optimized",
    subfolder="fp16",
    torch_dtype=torch.float16,
).to("cuda")
processor = AutoProcessor.from_pretrained(
    "felixem/moonshine-streaming-tiny-optimized",
    subfolder="fp16",
)

# Process audio (audio_array: a 1-D float32 waveform sampled at 16 kHz)
inputs = processor(audio_array, return_tensors="pt", sampling_rate=16000)
# Cast floating-point inputs to FP16; keep integer tensors (e.g. masks) as-is
inputs = {
    k: v.to("cuda", torch.float16) if v.is_floating_point() else v.to("cuda")
    for k, v in inputs.items()
}

generated_ids = model.generate(**inputs, max_new_tokens=128)
text = processor.decode(generated_ids[0], skip_special_tokens=True)
```

### PyTorch Dynamic INT8 (CPU — Quick Setup)

```python
import torch
from transformers import MoonshineStreamingForConditionalGeneration, AutoProcessor

model = MoonshineStreamingForConditionalGeneration.from_pretrained(
    "UsefulSensors/moonshine-streaming-tiny"
).eval()

# Quantize Linear layers to INT8
model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

processor = AutoProcessor.from_pretrained("UsefulSensors/moonshine-streaming-tiny")
inputs = processor(audio_array, return_tensors="pt", sampling_rate=16000)
generated_ids = model.generate(**inputs, max_new_tokens=128)
text = processor.decode(generated_ids[0], skip_special_tokens=True)
```

## ONNX Export Details

- **Encoder**: Exported with `torch.onnx.export(dynamo=True)` to handle vmap-based sliding-window attention masking
- **Decoder**: Separate models for the first step (no KV cache) and autoregressive steps (with KV cache)
- **Quantization**: `onnxruntime.quantization.quantize_dynamic` with symmetric INT8, per-channel weights, and `reduce_range=True` (a sketch of the call follows this list)
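The exact export script isn't included here, but the settings above map directly onto onnxruntime's quantization API. A minimal sketch, assuming the FP32 exports sit under `onnx/` as shown in the file structure (paths and the loop are illustrative):

```python
# Minimal sketch: produce *_int8.onnx files from the FP32 ONNX exports with
# the settings listed above. QuantType.QInt8 gives signed, symmetric weights.
from onnxruntime.quantization import QuantType, quantize_dynamic

for name in ["encoder_model", "decoder_model", "decoder_with_past_model"]:
    quantize_dynamic(
        f"onnx/{name}.onnx",               # FP32 input model
        f"onnx_int8/{name}_int8.onnx",     # INT8 output model
        weight_type=QuantType.QInt8,
        per_channel=True,                  # one scale per output channel
        reduce_range=True,                 # 7-bit weight range
    )
```

Note that `reduce_range=True` trades one bit of weight precision for compatibility with older x86 INT8 kernels that lack VNNI.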
### KV Cache Structure

Each decoder layer produces 4 KV tensors:

- `present_{layer}_self_key` / `present_{layer}_self_value`: Self-attention cache [B, 8, S, 40]
- `present_{layer}_cross_key` / `present_{layer}_cross_value`: Cross-attention cache [B, 8, T, 40]

For `decoder_with_past_model`, feed these back as `past_{layer}_*` inputs.

## Quantization Impact

Based on the [Edge-ASR paper](https://arxiv.org/abs/2507.07877) (Table 14), INT8 quantization on Moonshine Tiny has negligible WER impact:

| Config | Avg WER | vs FP32 |
|--------|---------|---------|
| FP32 baseline | 12.72% | — |
| **W8-A8 (INT8)** | **12.81%** | **+0.09%** |
| W4-A16 (SpQR) | 13.61% | +0.89% |

INT8 is the sweet spot for Moonshine Tiny — virtually no accuracy loss with ~50% model size reduction.

## Limitations

- English only
- Optimized for short utterances (streaming chunks of 1-5 seconds)
- The FP32 ONNX models store weights in external data files (`.onnx.data`)
- The decoder generates autoregressively, so output latency scales with transcript length

## Citation

```bibtex
@article{kudlur2025moonshine,
  title={Moonshine v2: Ergodic Streaming Encoder ASR},
  author={Kudlur, Manjunath and King, Evan and Wang, James and Warden, Pete},
  journal={arXiv preprint arXiv:2602.12241},
  year={2025}
}
```

## License

MIT (same as base model)