# MOSS-TTS-Realtime ONNX
Pure ONNX Runtime inference pipeline for MOSS-TTS-Realtime, enabling streaming text-to-speech without any PyTorch or Hugging Face Transformers dependency at runtime.
This repository provides:
- `inferencer_onnx.py`: Core streaming TTS engine that orchestrates four ONNX models (backbone LLM, local transformer, codec encoder, codec decoder) using only NumPy and ONNX Runtime.
- `moss_text_tokenizer.py`: Lightweight Qwen3-compatible tokenizer wrapping the `tokenizers` library, with no `transformers` dependency.
- `test_basic_streaming-onnx.py`: End-to-end test script that simulates LLM streaming text and produces a WAV file.

```
Reference Audio ──► Codec Encoder ──► RVQ Audio Codes (voice clone context)
                                                 │
                                                 ▼
Text Deltas ──► Backbone LLM (Qwen3-1.7B) ──► Hidden States
                                                 │
                                                 ▼
                    Local Transformer ──► 16-codebook Audio Tokens
                                                 │
                                                 ▼
                       Codec Decoder ──► 24 kHz Waveform
```
| Component | ONNX Model | Description |
|---|---|---|
| Backbone LLM | `backbone_llm.onnx` | Qwen3-based causal LM mapping interleaved text+audio tokens to hidden states. Maintains a growing KV-cache across the entire generation. |
| Local Transformer | `backbone_local.onnx` | Depth-wise decoder generating 16 RVQ codebook entries per frame from backbone hidden states. Creates and discards a fresh KV-cache per frame. |
| Codec Encoder | `codec_encoder.onnx` | Encodes the reference speaker waveform into RVQ codes for voice cloning. Run once per session. |
| Codec Decoder | `codec_decoder.onnx` | Decodes RVQ audio codes back to a 24 kHz waveform. Maintains four hierarchical KV-caches for streaming decode. |
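The two cache disciplines in the table (one backbone cache that grows for the whole turn, a local-transformer cache rebuilt and discarded every frame) can be illustrated with a toy NumPy stand-in. `toy_attention_step` below is an invented stub for illustration only, not the real models:

```python
import numpy as np

def toy_attention_step(x, cache):
    """Stand-in for one transformer step: returns an output and the grown KV-cache."""
    cache = x if cache is None else np.concatenate([cache, x], axis=1)
    return cache.mean(axis=1, keepdims=True), cache

# Backbone discipline: a single cache, growing across the whole turn.
backbone_cache = None
for _ in range(3):                              # three streamed text chunks
    chunk = np.random.randn(1, 4, 8).astype(np.float32)
    hidden, backbone_cache = toy_attention_step(chunk, backbone_cache)

# Local-transformer discipline: fresh cache per frame, dropped afterwards.
for _ in range(2):                              # two audio frames
    local_cache = None                          # re-created every frame
    for _ in range(16):                         # 16 RVQ codebooks per frame
        step = np.random.randn(1, 1, 8).astype(np.float32)
        _, local_cache = toy_attention_step(step, local_cache)
    # local_cache goes out of scope here; nothing persists across frames
```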
Dependencies:

- numpy
- onnxruntime
- soundfile
- librosa
- tokenizers

Install with:

```bash
pip install numpy onnxruntime soundfile librosa tokenizers
```
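To confirm the install and see which execution providers your build of ONNX Runtime offers (this is standard onnxruntime API):

```python
import onnxruntime as ort

print(ort.__version__, ort.get_available_providers())
```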
Repository layout:

```
.
├── inferencer_onnx.py              # Core ONNX inference engine
├── moss_text_tokenizer.py          # Lightweight Qwen3 tokenizer
├── test_basic_streaming-onnx.py    # End-to-end test script
├── README.md
├── onnx_models/                    # FP32
│   ├── backbone_f32/
│   │   └── backbone_f32.onnx
│   ├── local_transformer_f32/
│   │   └── local_transformer_f32.onnx
│   ├── codec_decoder/
│   │   └── codec_decoder.onnx
│   └── codec_encoder/
│       └── codec_encoder.onnx
├── onnx_models_quantized/          # INT8
│   └── codec_decoder_int8/
│       └── codec_decoder_int8.onnx
├── configs/
│   ├── config_backbone.json
│   └── config_codec.json
├── tokenizers/
│   ├── tokenizer.json
│   └── tokenizer_config.json
├── audio_ref/
│   └── <reference_speaker>.[wav|mp3|flac]
└── audio_synth/
    └── <synthesized_example>.wav
```
Prompt and token-sequence assembly is handled internally by MOSSTTSRealtimeProcessor.

Run the end-to-end test with the INT8-quantized codec decoder:

```bash
python test_basic_streaming-onnx.py \
  --tokenizer_vocab_path tokenizers/tokenizer.json \
  --tokenizer_config_path tokenizers/tokenizer_config.json \
  --backbone_llm_path onnx_models/backbone_f32/backbone_f32.onnx \
  --backbone_local_path onnx_models/local_transformer_f32/local_transformer_f32.onnx \
  --codec_decoder_path onnx_models_quantized/codec_decoder_int8/codec_decoder_int8.onnx \
  --codec_encoder_path onnx_models/codec_encoder/codec_encoder.onnx \
  --backbone_config_path configs/config_backbone.json \
  --codec_config_path configs/config_codec.json \
  --prompt_wav audio_ref/male_stewie.mp3 \
  --out_wav output.wav
```

Or with the FP32 codec decoder:

```bash
python test_basic_streaming-onnx.py \
  --tokenizer_vocab_path tokenizers/tokenizer.json \
  --tokenizer_config_path tokenizers/tokenizer_config.json \
  --backbone_llm_path onnx_models/backbone_f32/backbone_f32.onnx \
  --backbone_local_path onnx_models/local_transformer_f32/local_transformer_f32.onnx \
  --codec_decoder_path onnx_models/codec_decoder/codec_decoder.onnx \
  --codec_encoder_path onnx_models/codec_encoder/codec_encoder.onnx \
  --backbone_config_path configs/config_backbone.json \
  --codec_config_path configs/config_codec.json \
  --prompt_wav audio_ref/male_stewie.mp3 \
  --out_wav output.wav
```
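If a path is wrong or a model was re-exported with different I/O names, ONNX Runtime fails at session creation or at `run`. Each exported model's expected inputs and outputs can be listed with the standard onnxruntime API:

```python
import onnxruntime as ort

sess = ort.InferenceSession("onnx_models/codec_encoder/codec_encoder.onnx",
                            providers=["CPUExecutionProvider"])
for i in sess.get_inputs():
    print("input :", i.name, i.shape, i.type)
for o in sess.get_outputs():
    print("output:", o.name, o.shape, o.type)
```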
Programmatic usage from Python:

```python
import json

import onnxruntime as ort

from inferencer_onnx import MossTTSRealtimeInferenceONNX
from moss_text_tokenizer import MOSSTextTokenizer

# Load tokenizer and ONNX sessions
tokenizer = MOSSTextTokenizer("tokenizers/tokenizer.json",
                              "tokenizers/tokenizer_config.json")
backbone_llm = ort.InferenceSession(
    "onnx_models/backbone_f32/backbone_f32.onnx",
    providers=["CPUExecutionProvider"])
backbone_local = ort.InferenceSession(
    "onnx_models/local_transformer_f32/local_transformer_f32.onnx",
    providers=["CPUExecutionProvider"])
codec_decoder = ort.InferenceSession(
    "onnx_models/codec_decoder/codec_decoder.onnx",
    providers=["CPUExecutionProvider"])
codec_encoder = ort.InferenceSession(
    "onnx_models/codec_encoder/codec_encoder.onnx",
    providers=["CPUExecutionProvider"])

with open("configs/config_backbone.json") as f:
    backbone_config = json.load(f)
with open("configs/config_codec.json") as f:
    codec_config = json.load(f)

# Create the inferencer
inferencer = MossTTSRealtimeInferenceONNX(
    tokenizer, backbone_llm, backbone_local,
    codec_decoder, codec_encoder,
    backbone_config, codec_config,
)

# Encode the reference speaker for voice cloning (run once per session)
prompt_tokens = inferencer._encode_reference_audio("audio_ref/male_stewie.mp3")
input_ids = inferencer.processor.make_ensemble(prompt_tokens.squeeze(1))
inferencer.reset_turn(input_ids=input_ids, include_system_prompt=False,
                      reset_cache=True)

# Stream text and collect audio
for delta in your_llm_stream():
    audio_frames = inferencer.push_text(delta)
    for frame in audio_frames:
        # consume the frame here; see the completed loop below
        ...
```
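The body of that loop depends on what `push_text` returns. Assuming each frame is a float32 waveform chunk at 24 kHz (verify against `inferencer_onnx.py`), one way to complete it and write a WAV:

```python
import numpy as np
import soundfile as sf

chunks = []
for delta in your_llm_stream():        # your_llm_stream: your own text generator
    for frame in inferencer.push_text(delta):
        # ASSUMPTION: frame is a float32 waveform chunk, not token IDs
        chunks.append(np.asarray(frame, dtype=np.float32))

if chunks:
    sf.write("output.wav", np.concatenate(chunks), 24000)
```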
Command-line arguments for `test_basic_streaming-onnx.py`:

| Argument | Type | Default | Description |
|---|---|---|---|
| `--tokenizer_vocab_path` | str | required | Path to `tokenizer.json` |
| `--tokenizer_config_path` | str | required | Path to `tokenizer_config.json` |
| `--backbone_llm_path` | str | required | Path to backbone LLM ONNX model |
| `--backbone_local_path` | str | required | Path to local transformer ONNX model |
| `--codec_decoder_path` | str | required | Path to codec decoder ONNX model |
| `--codec_encoder_path` | str | required | Path to codec encoder ONNX model |
| `--backbone_config_path` | str | required | Path to `config_backbone.json` |
| `--codec_config_path` | str | required | Path to `config_codec.json` |
| `--prompt_wav` | str | required | Reference speaker audio for voice cloning |
| `--out_wav` | str | `out_streaming.wav` | Output WAV file path |
| `--sample_rate` | int | `24000` | Output sample rate (Hz) |
| `--temperature` | float | `0.725` | Sampling temperature |
| `--top_p` | float | `0.6` | Nucleus sampling threshold |
| `--top_k` | int | `34` | Top-k sampling cutoff |
| `--repetition_penalty` | float | `1.9` | Repetition penalty coefficient |
| `--repetition_window` | int | `50` | Window for repetition penalty |
| `--max_length` | int | `5000` | Maximum generation steps |
| `--delta_chunk_chars` | int | `1` | Characters per simulated LLM delta |
| `--delta_delay_s` | float | `0.0` | Delay between simulated deltas (seconds) |
| `--assistant_text` | str | (Russian text) | Text to synthesize |
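As a reference for how these flags typically combine, here is a toy NumPy sampler wiring temperature, top-k, top-p, and a windowed repetition penalty together. This is a sketch of the standard technique; the exact order of operations in `inferencer_onnx.py` may differ:

```python
import numpy as np

def sample_token(logits, history, temperature=0.725, top_p=0.6, top_k=34,
                 repetition_penalty=1.9, repetition_window=50, seed=None):
    """Toy re-implementation of the sampling flags above."""
    rng = np.random.default_rng(seed)
    logits = np.asarray(logits, dtype=np.float64).copy()

    # Repetition penalty (CTRL-style) over the last `repetition_window` tokens.
    for tok in set(history[-repetition_window:]):
        if logits[tok] > 0:
            logits[tok] /= repetition_penalty
        else:
            logits[tok] *= repetition_penalty

    logits /= temperature

    # Top-k: mask everything below the k-th largest logit.
    if 0 < top_k < logits.size:
        kth = np.partition(logits, -top_k)[-top_k]
        logits[logits < kth] = -np.inf

    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Top-p (nucleus): smallest prefix of tokens whose mass reaches top_p.
    order = np.argsort(-probs)
    keep = order[:np.searchsorted(np.cumsum(probs[order]), top_p) + 1]
    p = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=p))

# Example: pick the next token from a toy logit vector.
next_tok = sample_token(np.random.randn(1024), history=[3, 7, 7, 42])
```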
This work builds upon the MOSS-TTS-Realtime model by the OpenMOSS Team and the MOSS-Audio-Tokenizer codec.
Copyright 2026 Patrick Lumbantobing, Vertox-AI
Licensed under the Apache License, Version 2.0. See LICENSE for details.