# auden-asr-zh-stream: Streaming Chinese ASR
This model card describes AudenAI/auden-asr-zh-stream, a Chinese ASR model that supports both streaming and non-streaming inference. The model is a pruned RNN-T ASR system with a Zipformer encoder. It is a small model (~170M parameters) designed for fast training and inference, optimized for low-latency Mandarin transcription while maintaining strong accuracy on common Chinese ASR benchmarks. The streaming chunk size can be as low as 16 (≈450 ms) with only minor degradation; smaller chunk sizes are theoretically supported, but performance is not guaranteed. Training uses 138,189 hours of speech data, as summarized below.
## What Can This Model Do?
- Streaming Chinese ASR (real-time transcription)
- Low-latency decoding with greedy search
- Robust performance across diverse Chinese datasets
## Quick Start
### Non-streaming Usage
```python
from auden.auto.auto_model import AutoModel

# 1) Load a model checkpoint directory (contains config.json + weights)
model_dir = "AudenAI/auden-asr-zh-stream"  # HF repo id or exported directory
model = AutoModel.from_pretrained(model_dir)
model = model.to("cuda")
model.eval()

# 2) Prepare input features (x, x_lens). If you have raw audio, you can use
#    model.speech_encoder.extract_feature(wav) to get (x, x_lens).
x, x_lens = ...  # Tensor shapes: (B, T, F), (B,)
inputs = (x, x_lens)

# Alternatively, you can pass WAV inputs directly:
# - List of WAV paths (str):
#     inputs = ["/abs/a.wav", "/abs/b.wav"]
# - List of mono waveforms (Tensor/ndarray), 16 kHz:
#     inputs = [torch.randn(16000*5), torch.randn(16000*3)]

# 3) Non-streaming ASR (greedy)
hyp = model.generate(inputs)
```
### Streaming Usage
**Streaming setup.** Use the streaming script in this repo for real-time-style decoding:
```shell
python examples/asr/decode_streaming.py \
  --model-dir AudenAI/auden-asr-zh-stream \
  --wav /abs/path/to/test.wav \
  --chunk-size 16 \
  --left-context 128
```
**Common parameters**

- `chunk-size`: internal streaming chunk size (frames, ~10 ms per frame)
- `left-context`: number of left-context frames; larger values improve stability but add latency
**Recommended streaming settings**

- `chunk-size` can be as small as 16 (≈450 ms effective chunk duration, i.e. `(2 * chunk_size + 13) * 10 ms`). This yields lower latency with only minor degradation in accuracy.
- `left-context` is configurable; increasing it improves stability and accuracy.
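The chunk-duration formula above can be checked directly in Python (a minimal sketch; the function name `effective_chunk_ms` is illustrative, not part of the toolkit, and the constant 13 is taken verbatim from the formula):

```python
def effective_chunk_ms(chunk_size: int, frame_ms: int = 10) -> int:
    # Effective chunk duration per the formula (2 * chunk_size + 13) * 10 ms.
    return (2 * chunk_size + 13) * frame_ms

print(effective_chunk_ms(16))  # 450 -> the recommended low-latency setting
print(effective_chunk_ms(32))  # 770 -> larger chunks trade latency for accuracy headroom
```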
## Model Characteristics
- Model ID: `AudenAI/auden-asr-zh-stream`
- Input: raw audio waveform (16 kHz recommended)
- Output: Chinese transcription
- Decoding: greedy search (streaming and non-streaming)
- Task: transcribe
## Training Data Composition
The streaming ASR model is trained on the following data composition:
| Language | Data Source | Type | Hours | Total Hours |
|---|---|---|---|---|
| Chinese (Zh) | WenetSpeech | Open Source | 10,005 | 129,265 |
| | AISHELL-2 | Open Source | 1,000 | |
| | AISHELL-1 | Open Source | 150 | |
| | Common Voice | Open Source | 237 | |
| | Yodas | Open Source | 222 | |
| | In-house Data | In-house | 117,651 | |
| Code-Switch | TALCS | Open Source | 555 | 8,924 |
| | In-house Data | In-house | 8,369 | |
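As a quick consistency check, the per-source hours sum to the subtotals in the table and to the 138,189-hour overall training total stated above:

```python
zh_hours = [10_005, 1_000, 150, 237, 222, 117_651]  # Chinese (Zh) sources
code_switch_hours = [555, 8_369]                    # Code-Switch sources

assert sum(zh_hours) == 129_265          # Chinese subtotal
assert sum(code_switch_hours) == 8_924   # Code-Switch subtotal
print(sum(zh_hours) + sum(code_switch_hours))  # 138189
```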
## Evaluation
### Chinese ASR (WER, lower is better; non-streaming decoding)
| Dataset | WER |
|---|---|
| FLEURS zh-CN | 7.05 |
| CommonVoice20 zh-CN | 10.87 |
| AISHELL-1 | 1.72 |
| AISHELL-2 | 3.15 |
| Wenet Test Meeting | 6.87 |
| Wenet Test Net | 6.26 |
| KeSpeech | 7.36 |
| TALCS | 9.70 |
| SpeechIO 0 | 2.16 |
| SpeechIO 1 | 1.37 |
| SpeechIO 2 | 3.89 |
| SpeechIO 3 | 2.36 |
| SpeechIO 4 | 2.71 |
| SpeechIO 5 | 2.45 |
| SpeechIO 6 | 6.62 |
| SpeechIO 7 | 6.35 |
| SpeechIO 8 | 6.91 |
| SpeechIO 9 | 4.35 |
| SpeechIO 10 | 3.96 |
| SpeechIO 11 | 2.04 |
| SpeechIO 12 | 2.39 |
| SpeechIO 13 | 5.28 |
| SpeechIO 14 | 6.63 |
| SpeechIO 15 | 7.45 |
| SpeechIO 16 | 4.76 |
| SpeechIO 17 | 3.79 |
| SpeechIO 18 | 3.60 |
| SpeechIO 19 | 3.80 |
| SpeechIO 20 | 4.13 |
| SpeechIO 21 | 3.72 |
| SpeechIO 22 | 4.89 |
| SpeechIO 23 | 3.86 |
| SpeechIO 24 | 7.09 |
| SpeechIO 25 | 4.34 |
| SpeechIO 26 | 4.21 |
## Limitations
- Performance depends on audio quality and recording conditions.
- For long-form audio, chunking and post-processing might be required.
- Not designed for safety-critical applications.
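The long-form limitation above usually means segmenting audio before decoding. Below is a minimal sketch of naive fixed-length chunking; the helper `split_waveform` and the 30-second segment length are illustrative choices, not part of the toolkit, and in practice VAD-based segmentation with overlap handles words cut at segment boundaries better:

```python
def split_waveform(wav, sample_rate: int = 16000, segment_seconds: float = 30.0):
    # Split a mono waveform (any sliceable 1-D sequence, e.g. a torch.Tensor)
    # into fixed-length segments; the final segment may be shorter.
    seg_len = int(sample_rate * segment_seconds)
    return [wav[i:i + seg_len] for i in range(0, len(wav), seg_len)]

# Usage sketch, assuming the AutoModel API from Quick Start
# (a list of mono waveforms is an accepted input format):
# segments = split_waveform(wav)
# hyps = model.generate(segments)
# transcript = "".join(hyps)
```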