auden-asr-zh-stream: Streaming Chinese ASR

This model card describes AudenAI/auden-asr-zh-stream, a Chinese ASR model that supports both streaming and non-streaming inference. The model is a pruned RNN-T ASR system with a Zipformer encoder. It is a small model (~170M parameters) designed for fast training and inference, and optimized for low-latency Mandarin transcription while maintaining strong accuracy on common Chinese ASR benchmarks. The streaming chunk size can be as low as 16 (≈450 ms) with only minor accuracy degradation; smaller chunk sizes are theoretically supported, but performance is not guaranteed. Training uses 138,189 hours of speech data, as summarized below.

πŸ” What Can This Model Do?

  • 🎙️ Streaming Chinese ASR (real-time transcription)
  • ⏱️ Low-latency decoding with greedy search
  • 🧩 Robust performance across diverse Chinese datasets

Quick Start

Non-streaming Usage

from auden.auto.auto_model import AutoModel

# 1) Load a model checkpoint directory (contains config.json + weights)
model_dir = "AudenAI/auden-asr-zh-stream"  # HF repo id or exported directory
model = AutoModel.from_pretrained(model_dir)
model = model.to("cuda")
model.eval()

# 2) Prepare input features (x, x_lens). If you have raw audio, you can use
#    model.speech_encoder.extract_feature(wav) to get (x, x_lens).
x, x_lens = ...  # Tensor shapes: (B, T, F), (B,)

inputs = (x, x_lens)
# Alternatively, you can pass WAV inputs directly:
# - List of WAV paths (str):
#   inputs = ["/abs/a.wav", "/abs/b.wav"]
# - List of mono waveforms (Tensor/ndarray), 16 kHz:
#   inputs = [torch.randn(16000*5), torch.randn(16000*3)]

# 3) Non-streaming ASR (greedy)
hyp = model.generate(inputs)
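
For reference, here is a minimal end-to-end variant of the snippet above that transcribes a single WAV file. It is a sketch only: torchaudio is assumed for audio loading and resampling, and the exact structure of the returned hypotheses is not specified on this card; only AutoModel.from_pretrained and model.generate come from the example above.

import torch
import torchaudio

from auden.auto.auto_model import AutoModel

model = AutoModel.from_pretrained("AudenAI/auden-asr-zh-stream")
model = model.to("cuda")
model.eval()

# Load a recording and convert it to the recommended 16 kHz mono format.
wav, sr = torchaudio.load("/abs/path/to/test.wav")  # (channels, samples)
wav = wav.mean(dim=0)                                # downmix to mono
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)

# generate() also accepts a list of mono 16 kHz waveforms (or WAV paths).
with torch.no_grad():
    hyp = model.generate([wav])
print(hyp)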

Streaming Usage

Streaming setup

Use the streaming script in this repo for real-time style decoding:

python examples/asr/decode_streaming.py \
  --model-dir AudenAI/auden-asr-zh-stream \
  --wav /abs/path/to/test.wav \
  --chunk-size 16 \
  --left-context 128
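
If you prefer to drive the streaming script from Python, one option is a thin subprocess wrapper. This is illustrative only: the script path and flags come from the command above, while reading the transcript from stdout is an assumption about how the script reports its result.

import subprocess

def stream_decode(wav_path: str, chunk_size: int = 16, left_context: int = 128) -> str:
    # Invoke the streaming decoding script shown above with the given settings.
    cmd = [
        "python", "examples/asr/decode_streaming.py",
        "--model-dir", "AudenAI/auden-asr-zh-stream",
        "--wav", wav_path,
        "--chunk-size", str(chunk_size),
        "--left-context", str(left_context),
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout

print(stream_decode("/abs/path/to/test.wav"))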

Common parameters

  • chunk-size: internal streaming chunk size in frames; the effective audio duration per chunk is (2 × chunk-size + 13) × 10 ms (see the recommended settings below)
  • left-context: left context frames; larger values improve stability but add latency

Recommended streaming settings

  • chunk-size can be as small as 16 (≈450 ms effective chunk duration, i.e., (2 × chunk_size + 13) × 10 ms). This yields lower latency with only minor degradation in accuracy; see the arithmetic sketch after this list.
  • left-context is configurable; increasing it improves stability/accuracy.
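
As a quick sanity check on the latency figure above, the sketch below simply evaluates the stated formula for a few chunk sizes (illustrative only; the formula is the one quoted on this card):

def effective_chunk_ms(chunk_size: int) -> int:
    # Effective chunk duration per the formula above: (2 * chunk_size + 13) * 10 ms.
    return (2 * chunk_size + 13) * 10

for cs in (8, 16, 32, 64):
    print(f"--chunk-size {cs:>2} -> ~{effective_chunk_ms(cs)} ms per chunk")
# chunk-size 16 -> ~450 ms, matching the recommended low-latency setting.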

📌 Model Characteristics

  • Model ID: AudenAI/auden-asr-zh-stream
  • Input: Raw audio waveform (16 kHz recommended)
  • Output: Chinese transcription
  • Decoding: Greedy search (streaming and non-streaming)
  • Task: transcribe

📚 Training Data Composition

The streaming ASR model is trained on the following data composition:

| Language | Data Source | Type | Hours | Total Hours |
|---|---|---|---|---|
| Chinese (Zh) | WenetSpeech | Open Source | 10,005 | 129,265 |
| | AISHELL-2 | Open Source | 1,000 | |
| | AISHELL-1 | Open Source | 150 | |
| | Common Voice | Open Source | 237 | |
| | Yodas | Open Source | 222 | |
| | In-house Data | In-house | 117,651 | |
| Code-Switch | TALCS | Open Source | 555 | 8,924 |
| | In-house Data | In-house | 8,369 | |

📊 Evaluation

Chinese ASR (WER ↓, non-streaming decoding)

| Dataset | WER |
|---|---|
| FLEURS zh-CN | 7.05 |
| CommonVoice20 zh-CN | 10.87 |
| AISHELL-1 | 1.72 |
| AISHELL-2 | 3.15 |
| Wenet Test Meeting | 6.87 |
| Wenet Test Net | 6.26 |
| KeSpeech | 7.36 |
| TALCS | 9.70 |
| SpeechIO 0 | 2.16 |
| SpeechIO 1 | 1.37 |
| SpeechIO 2 | 3.89 |
| SpeechIO 3 | 2.36 |
| SpeechIO 4 | 2.71 |
| SpeechIO 5 | 2.45 |
| SpeechIO 6 | 6.62 |
| SpeechIO 7 | 6.35 |
| SpeechIO 8 | 6.91 |
| SpeechIO 9 | 4.35 |
| SpeechIO 10 | 3.96 |
| SpeechIO 11 | 2.04 |
| SpeechIO 12 | 2.39 |
| SpeechIO 13 | 5.28 |
| SpeechIO 14 | 6.63 |
| SpeechIO 15 | 7.45 |
| SpeechIO 16 | 4.76 |
| SpeechIO 17 | 3.79 |
| SpeechIO 18 | 3.60 |
| SpeechIO 19 | 3.80 |
| SpeechIO 20 | 4.13 |
| SpeechIO 21 | 3.72 |
| SpeechIO 22 | 4.89 |
| SpeechIO 23 | 3.86 |
| SpeechIO 24 | 7.09 |
| SpeechIO 25 | 4.34 |
| SpeechIO 26 | 4.21 |

⚠️ Limitations

  • Performance depends on audio quality and recording conditions.
  • For long-form audio, chunking and post-processing may be required; a minimal chunking sketch follows this list.
  • Not designed for safety-critical applications.
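
A minimal sketch of the chunking approach mentioned above, assuming torchaudio for audio I/O: the fixed 30-second window, the absence of overlap handling, and the naive joining of partial results are simplifications for illustration, not recommendations from this card.

import torch
import torchaudio

from auden.auto.auto_model import AutoModel

model = AutoModel.from_pretrained("AudenAI/auden-asr-zh-stream")
model = model.to("cuda")
model.eval()

wav, sr = torchaudio.load("/abs/path/to/long_recording.wav")
wav = wav.mean(dim=0)
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)

# Split the waveform into fixed 30 s windows and decode each window separately.
# Overlap handling and boundary post-processing are intentionally left out.
window = 30 * 16000
segments = [wav[i:i + window] for i in range(0, wav.numel(), window)]
with torch.no_grad():
    partial_hyps = [model.generate([seg]) for seg in segments]
print(partial_hyps)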