# MOSS-TTS-Realtime ONNX
Pure ONNX Runtime inference pipeline for MOSS-TTS-Realtime, enabling streaming text-to-speech without any PyTorch or Hugging Face Transformers dependency at runtime.
This repository provides:
- `inferencer_onnx.py`: Core streaming TTS engine that orchestrates four ONNX models (backbone LLM, local transformer, codec encoder, codec decoder) using only NumPy and ONNX Runtime.
- `moss_text_tokenizer.py`: Lightweight Qwen3-compatible tokenizer wrapping the `tokenizers` library, with no `transformers` dependency.
- `test_basic_streaming-onnx.py`: End-to-end test script that simulates LLM streaming text and produces a WAV file.

```
Reference Audio ──► Codec Encoder ──► RVQ Audio Codes (voice clone context)
                                                 │
                                                 ▼
Text Deltas ──► Backbone LLM (Qwen3-1.7B) ──► Hidden States
                                                 │
                                                 ▼
                    Local Transformer ──► 16-codebook Audio Tokens
                                                 │
                                                 ▼
                       Codec Decoder ──► 24 kHz Waveform
```
| Component | ONNX Model | Description |
|---|---|---|
| Backbone LLM | `backbone_llm.onnx` | Qwen3-based causal LM mapping interleaved text+audio tokens to hidden states. Maintains a growing KV-cache across the entire generation. |
| Local Transformer | `backbone_local.onnx` | Depth-wise decoder generating 16 RVQ codebook entries per frame from backbone hidden states. Creates and discards a fresh KV-cache per frame. |
| Codec Encoder | `codec_encoder.onnx` | Encodes the reference speaker waveform into RVQ codes for voice cloning. Run once per session. |
| Codec Decoder | `codec_decoder.onnx` | Decodes RVQ audio codes back to a 24 kHz waveform. Maintains four hierarchical KV-caches for streaming decode. |
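The two cache disciplines in the table (one backbone cache that grows for the whole turn, a local-transformer cache rebuilt and discarded every frame) can be illustrated with a toy NumPy stand-in. `toy_attention_step` below is an invented stub for illustration only, not the real models:

```python
import numpy as np

def toy_attention_step(x, cache):
    """Stand-in for one transformer step: returns an output and the grown KV-cache."""
    cache = x if cache is None else np.concatenate([cache, x], axis=1)
    return cache.mean(axis=1, keepdims=True), cache

# Backbone discipline: a single cache, growing across the whole turn.
backbone_cache = None
for _ in range(3):                              # three streamed text chunks
    chunk = np.random.randn(1, 4, 8).astype(np.float32)
    hidden, backbone_cache = toy_attention_step(chunk, backbone_cache)

# Local-transformer discipline: fresh cache per frame, dropped afterwards.
for _ in range(2):                              # two audio frames
    local_cache = None                          # re-created every frame
    for _ in range(16):                         # 16 RVQ codebooks per frame
        step = np.random.randn(1, 1, 8).astype(np.float32)
        _, local_cache = toy_attention_step(step, local_cache)
    # local_cache goes out of scope here; nothing persists across frames
```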
Dependencies:

- numpy
- onnxruntime
- soundfile
- librosa
- tokenizers

Install with:

```bash
pip install numpy onnxruntime soundfile librosa tokenizers
```
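To confirm the install and see which execution providers your build of ONNX Runtime offers (this is standard onnxruntime API):

```python
import onnxruntime as ort

print(ort.__version__, ort.get_available_providers())
```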
Repository layout:

```
.
├── inferencer_onnx.py              # Core ONNX inference engine
├── moss_text_tokenizer.py          # Lightweight Qwen3 tokenizer
├── test_basic_streaming-onnx.py    # End-to-end test script
├── README.md
├── onnx_models/                    # FP32
│   ├── backbone_f32/
│   │   └── backbone_f32.onnx
│   ├── local_transformer_f32/
│   │   └── local_transformer_f32.onnx
│   ├── codec_decoder/
│   │   └── codec_decoder.onnx
│   └── codec_encoder/
│       └── codec_encoder.onnx
├── onnx_models_quantized/          # INT8
│   └── codec_decoder_int8/
│       └── codec_decoder_int8.onnx
├── configs/
│   ├── config_backbone.json
│   └── config_codec.json
├── tokenizers/
│   ├── tokenizer.json
│   └── tokenizer_config.json
├── audio_ref/
│   └── <reference_speaker>.[wav|mp3|flac]
└── audio_synth/
    └── <synthesized_example>.wav
```
Prompt and token-sequence assembly is handled internally by MOSSTTSRealtimeProcessor.

Run the end-to-end test with the INT8-quantized codec decoder:

```bash
python test_basic_streaming-onnx.py \
  --tokenizer_vocab_path tokenizers/tokenizer.json \
  --tokenizer_config_path tokenizers/tokenizer_config.json \
  --backbone_llm_path onnx_models/backbone_f32/backbone_f32.onnx \
  --backbone_local_path onnx_models/local_transformer_f32/local_transformer_f32.onnx \
  --codec_decoder_path onnx_models_quantized/codec_decoder_int8/codec_decoder_int8.onnx \
  --codec_encoder_path onnx_models/codec_encoder/codec_encoder.onnx \
  --backbone_config_path configs/config_backbone.json \
  --codec_config_path configs/config_codec.json \
  --prompt_wav audio_ref/male_stewie.mp3 \
  --out_wav output.wav
```

Or with the FP32 codec decoder:

```bash
python test_basic_streaming-onnx.py \
  --tokenizer_vocab_path tokenizers/tokenizer.json \
  --tokenizer_config_path tokenizers/tokenizer_config.json \
  --backbone_llm_path onnx_models/backbone_f32/backbone_f32.onnx \
  --backbone_local_path onnx_models/local_transformer_f32/local_transformer_f32.onnx \
  --codec_decoder_path onnx_models/codec_decoder/codec_decoder.onnx \
  --codec_encoder_path onnx_models/codec_encoder/codec_encoder.onnx \
  --backbone_config_path configs/config_backbone.json \
  --codec_config_path configs/config_codec.json \
  --prompt_wav audio_ref/male_stewie.mp3 \
  --out_wav output.wav
```
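If a path is wrong or a model was re-exported with different I/O names, ONNX Runtime fails at session creation or at `run`. Each exported model's expected inputs and outputs can be listed with the standard onnxruntime API:

```python
import onnxruntime as ort

sess = ort.InferenceSession("onnx_models/codec_encoder/codec_encoder.onnx",
                            providers=["CPUExecutionProvider"])
for i in sess.get_inputs():
    print("input :", i.name, i.shape, i.type)
for o in sess.get_outputs():
    print("output:", o.name, o.shape, o.type)
```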
Programmatic usage from Python:

```python
import json

import onnxruntime as ort

from inferencer_onnx import MossTTSRealtimeInferenceONNX
from moss_text_tokenizer import MOSSTextTokenizer

# Load tokenizer and ONNX sessions
tokenizer = MOSSTextTokenizer("tokenizers/tokenizer.json",
                              "tokenizers/tokenizer_config.json")
backbone_llm = ort.InferenceSession(
    "onnx_models/backbone_f32/backbone_f32.onnx",
    providers=["CPUExecutionProvider"])
backbone_local = ort.InferenceSession(
    "onnx_models/local_transformer_f32/local_transformer_f32.onnx",
    providers=["CPUExecutionProvider"])
codec_decoder = ort.InferenceSession(
    "onnx_models/codec_decoder/codec_decoder.onnx",
    providers=["CPUExecutionProvider"])
codec_encoder = ort.InferenceSession(
    "onnx_models/codec_encoder/codec_encoder.onnx",
    providers=["CPUExecutionProvider"])

with open("configs/config_backbone.json") as f:
    backbone_config = json.load(f)
with open("configs/config_codec.json") as f:
    codec_config = json.load(f)

# Create the inferencer
inferencer = MossTTSRealtimeInferenceONNX(
    tokenizer, backbone_llm, backbone_local,
    codec_decoder, codec_encoder,
    backbone_config, codec_config,
)

# Encode the reference speaker for voice cloning (run once per session)
prompt_tokens = inferencer._encode_reference_audio("audio_ref/male_stewie.mp3")
input_ids = inferencer.processor.make_ensemble(prompt_tokens.squeeze(1))
inferencer.reset_turn(input_ids=input_ids, include_system_prompt=False,
                      reset_cache=True)

# Stream text and collect audio
for delta in your_llm_stream():
    audio_frames = inferencer.push_text(delta)
    for frame in audio_frames:
        # consume the frame here; see the completed loop below
        ...
```
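The body of that loop depends on what `push_text` returns. Assuming each frame is a float32 waveform chunk at 24 kHz (verify against `inferencer_onnx.py`), one way to complete it and write a WAV:

```python
import numpy as np
import soundfile as sf

chunks = []
for delta in your_llm_stream():        # your_llm_stream: your own text generator
    for frame in inferencer.push_text(delta):
        # ASSUMPTION: frame is a float32 waveform chunk, not token IDs
        chunks.append(np.asarray(frame, dtype=np.float32))

if chunks:
    sf.write("output.wav", np.concatenate(chunks), 24000)
```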
Command-line arguments for `test_basic_streaming-onnx.py`:

| Argument | Type | Default | Description |
|---|---|---|---|
| `--tokenizer_vocab_path` | str | required | Path to `tokenizer.json` |
| `--tokenizer_config_path` | str | required | Path to `tokenizer_config.json` |
| `--backbone_llm_path` | str | required | Path to backbone LLM ONNX model |
| `--backbone_local_path` | str | required | Path to local transformer ONNX model |
| `--codec_decoder_path` | str | required | Path to codec decoder ONNX model |
| `--codec_encoder_path` | str | required | Path to codec encoder ONNX model |
| `--backbone_config_path` | str | required | Path to `config_backbone.json` |
| `--codec_config_path` | str | required | Path to `config_codec.json` |
| `--prompt_wav` | str | required | Reference speaker audio for voice cloning |
| `--out_wav` | str | `out_streaming.wav` | Output WAV file path |
| `--sample_rate` | int | `24000` | Output sample rate (Hz) |
| `--temperature` | float | `0.725` | Sampling temperature |
| `--top_p` | float | `0.6` | Nucleus sampling threshold |
| `--top_k` | int | `34` | Top-k sampling cutoff |
| `--repetition_penalty` | float | `1.9` | Repetition penalty coefficient |
| `--repetition_window` | int | `50` | Window for repetition penalty |
| `--max_length` | int | `5000` | Maximum generation steps |
| `--delta_chunk_chars` | int | `1` | Characters per simulated LLM delta |
| `--delta_delay_s` | float | `0.0` | Delay between simulated deltas (seconds) |
| `--assistant_text` | str | (Russian text) | Text to synthesize |
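As a reference for how these flags typically combine, here is a toy NumPy sampler wiring temperature, top-k, top-p, and a windowed repetition penalty together. This is a sketch of the standard technique; the exact order of operations in `inferencer_onnx.py` may differ:

```python
import numpy as np

def sample_token(logits, history, temperature=0.725, top_p=0.6, top_k=34,
                 repetition_penalty=1.9, repetition_window=50, seed=None):
    """Toy re-implementation of the sampling flags above."""
    rng = np.random.default_rng(seed)
    logits = np.asarray(logits, dtype=np.float64).copy()

    # Repetition penalty (CTRL-style) over the last `repetition_window` tokens.
    for tok in set(history[-repetition_window:]):
        if logits[tok] > 0:
            logits[tok] /= repetition_penalty
        else:
            logits[tok] *= repetition_penalty

    logits /= temperature

    # Top-k: mask everything below the k-th largest logit.
    if 0 < top_k < logits.size:
        kth = np.partition(logits, -top_k)[-top_k]
        logits[logits < kth] = -np.inf

    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Top-p (nucleus): smallest prefix of tokens whose mass reaches top_p.
    order = np.argsort(-probs)
    keep = order[:np.searchsorted(np.cumsum(probs[order]), top_p) + 1]
    p = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=p))

# Example: pick the next token from a toy logit vector.
next_tok = sample_token(np.random.randn(1024), history=[3, 7, 7, 42])
```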
This work builds upon the MOSS-TTS-Realtime model by the OpenMOSS Team and the MOSS-Audio-Tokenizer codec.
Copyright 2026 Patrick Lumbantobing, Vertox-AI
Licensed under the Apache License, Version 2.0. See LICENSE for details.