auden-asr-zh-stream: Streaming Chinese ASR

This model card describes AudenAI/auden-asr-zh-stream, a Chinese ASR model that supports both streaming and non-streaming inference. The model is a pruned RNN-T ASR system with a Zipformer encoder. It is a small model (~170M parameters) designed for fast training and inference, and optimized for low-latency Mandarin transcription while maintaining strong accuracy on common Chinese ASR benchmarks. The streaming chunk size can be as low as 16 (≈450 ms) with only minor accuracy degradation; smaller chunk sizes are theoretically supported, but performance is not guaranteed. Training uses 138,189 hours of speech data, as summarized below.

πŸ” What Can This Model Do?

  • 🎙️ Streaming Chinese ASR (real-time transcription)
  • ⏱️ Low-latency decoding with greedy search
  • 🧩 Robust performance across diverse Chinese datasets

Quick Start

Non-streaming Usage

from auden.auto.auto_model import AutoModel

# 1) Load a model checkpoint directory (contains config.json + weights)
model_dir = "AudenAI/auden-asr-zh-stream"  # HF repo id or exported directory
model = AutoModel.from_pretrained(model_dir)
model = model.to("cuda")
model.eval()

# 2) Prepare input features (x, x_lens). If you have raw audio, you can use
#    model.speech_encoder.extract_feature(wav) to get (x, x_lens).
x, x_lens = ...  # Tensor shapes: (B, T, F), (B,)

inputs = (x, x_lens)
# Alternatively, you can pass WAV inputs directly:
# - List of WAV paths (str):
#   inputs = ["/abs/a.wav", "/abs/b.wav"]
# - List of mono waveforms (Tensor/ndarray), 16 kHz:
#   inputs = [torch.randn(16000*5), torch.randn(16000*3)]

# 3) Non-streaming ASR (greedy)
hyp = model.generate(inputs)
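
For reference, here is a minimal end-to-end variant of the snippet above that transcribes a single WAV file. It is a sketch only: torchaudio is assumed for audio loading and resampling, and the exact structure of the returned hypotheses is not specified on this card; only AutoModel.from_pretrained and model.generate come from the example above.

import torch
import torchaudio

from auden.auto.auto_model import AutoModel

model = AutoModel.from_pretrained("AudenAI/auden-asr-zh-stream")
model = model.to("cuda")
model.eval()

# Load a recording and convert it to the recommended 16 kHz mono format.
wav, sr = torchaudio.load("/abs/path/to/test.wav")  # (channels, samples)
wav = wav.mean(dim=0)                                # downmix to mono
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)

# generate() also accepts a list of mono 16 kHz waveforms (or WAV paths).
with torch.no_grad():
    hyp = model.generate([wav])
print(hyp)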

Streaming Usage

Streaming setup

Use the streaming script in this repo for real-time style decoding:

python examples/asr/decode_streaming.py \
  --model-dir AudenAI/auden-asr-zh-stream \
  --wav /abs/path/to/test.wav \
  --chunk-size 16 \
  --left-context 128
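
If you prefer to drive the streaming script from Python, one option is a thin subprocess wrapper. This is illustrative only: the script path and flags come from the command above, while reading the transcript from stdout is an assumption about how the script reports its result.

import subprocess

def stream_decode(wav_path: str, chunk_size: int = 16, left_context: int = 128) -> str:
    # Invoke the streaming decoding script shown above with the given settings.
    cmd = [
        "python", "examples/asr/decode_streaming.py",
        "--model-dir", "AudenAI/auden-asr-zh-stream",
        "--wav", wav_path,
        "--chunk-size", str(chunk_size),
        "--left-context", str(left_context),
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout

print(stream_decode("/abs/path/to/test.wav"))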

Common parameters

  • chunk-size: internal streaming chunk size in frames; the effective audio duration per chunk is (2 × chunk-size + 13) × 10 ms (see the recommended settings below)
  • left-context: left context frames; larger values improve stability but add latency

Recommended streaming settings

  • chunk-size can be as small as 16 (≈450 ms effective chunk duration, i.e., (2 × chunk_size + 13) × 10 ms). This yields lower latency with only minor degradation in accuracy; see the arithmetic sketch after this list.
  • left-context is configurable; increasing it improves stability/accuracy.
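
As a quick sanity check on the latency figure above, the sketch below simply evaluates the stated formula for a few chunk sizes (illustrative only; the formula is the one quoted on this card):

def effective_chunk_ms(chunk_size: int) -> int:
    # Effective chunk duration per the formula above: (2 * chunk_size + 13) * 10 ms.
    return (2 * chunk_size + 13) * 10

for cs in (8, 16, 32, 64):
    print(f"--chunk-size {cs:>2} -> ~{effective_chunk_ms(cs)} ms per chunk")
# chunk-size 16 -> ~450 ms, matching the recommended low-latency setting.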

📌 Model Characteristics

  • Model ID: AudenAI/auden-asr-zh-stream
  • Input: Raw audio waveform (16 kHz recommended)
  • Output: Chinese transcription
  • Decoding: Greedy search (streaming and non-streaming)
  • Task: transcribe

📚 Training Data Composition

The streaming ASR model is trained on the following data composition:

| Language | Data Source | Type | Hours | Total Hours |
|---|---|---|---|---|
| Chinese (Zh) | WenetSpeech | Open Source | 10,005 | 129,265 |
| | AISHELL-2 | Open Source | 1,000 | |
| | AISHELL-1 | Open Source | 150 | |
| | Common Voice | Open Source | 237 | |
| | Yodas | Open Source | 222 | |
| | In-house Data | In-house | 117,651 | |
| Code-Switch | TALCS | Open Source | 555 | 8,924 |
| | In-house Data | In-house | 8,369 | |

📊 Evaluation

Chinese ASR (WER ↓, non-streaming decoding)

| Dataset | WER |
|---|---|
| FLEURS zh-CN | 7.05 |
| CommonVoice20 zh-CN | 10.87 |
| AISHELL-1 | 1.72 |
| AISHELL-2 | 3.15 |
| Wenet Test Meeting | 6.87 |
| Wenet Test Net | 6.26 |
| KeSpeech | 7.36 |
| TALCS | 9.70 |
| SpeechIO 0 | 2.16 |
| SpeechIO 1 | 1.37 |
| SpeechIO 2 | 3.89 |
| SpeechIO 3 | 2.36 |
| SpeechIO 4 | 2.71 |
| SpeechIO 5 | 2.45 |
| SpeechIO 6 | 6.62 |
| SpeechIO 7 | 6.35 |
| SpeechIO 8 | 6.91 |
| SpeechIO 9 | 4.35 |
| SpeechIO 10 | 3.96 |
| SpeechIO 11 | 2.04 |
| SpeechIO 12 | 2.39 |
| SpeechIO 13 | 5.28 |
| SpeechIO 14 | 6.63 |
| SpeechIO 15 | 7.45 |
| SpeechIO 16 | 4.76 |
| SpeechIO 17 | 3.79 |
| SpeechIO 18 | 3.60 |
| SpeechIO 19 | 3.80 |
| SpeechIO 20 | 4.13 |
| SpeechIO 21 | 3.72 |
| SpeechIO 22 | 4.89 |
| SpeechIO 23 | 3.86 |
| SpeechIO 24 | 7.09 |
| SpeechIO 25 | 4.34 |
| SpeechIO 26 | 4.21 |

⚠️ Limitations

  • Performance depends on audio quality and recording conditions.
  • For long-form audio, chunking and post-processing may be required; a minimal chunking sketch follows this list.
  • Not designed for safety-critical applications.
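
A minimal sketch of the chunking approach mentioned above, assuming torchaudio for audio I/O: the fixed 30-second window, the absence of overlap handling, and the naive joining of partial results are simplifications for illustration, not recommendations from this card.

import torch
import torchaudio

from auden.auto.auto_model import AutoModel

model = AutoModel.from_pretrained("AudenAI/auden-asr-zh-stream")
model = model.to("cuda")
model.eval()

wav, sr = torchaudio.load("/abs/path/to/long_recording.wav")
wav = wav.mean(dim=0)
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)

# Split the waveform into fixed 30 s windows and decode each window separately.
# Overlap handling and boundary post-processing are intentionally left out.
window = 30 * 16000
segments = [wav[i:i + window] for i in range(0, wav.numel(), window)]
with torch.no_grad():
    partial_hyps = [model.generate([seg]) for seg in segments]
print(partial_hyps)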