auden-asr-zh-en: Chinese-English ASR

This model card describes AudenAI/auden-asr-zh-en, a bilingual ASR model for Chinese and English speech. It is a pruned-RNN-T system with a Zipformer encoder and supports non-streaming transcription only. At roughly 170M parameters, it is compact enough for fast training and inference. It was trained on 245,815 hours of Chinese, English, and code-switching data (summarized below) and targets robust accuracy across common Chinese and English benchmarks.

πŸ” What Can This Model Do?

  • πŸŽ™οΈ Chinese ASR (Mandarin transcription)
  • 🌍 English ASR (English transcription)
  • 🧩 Robust performance across mixed Chinese/English data

Quick Start

Non-streaming Usage

import torch

from auden.auto.auto_model import AutoModel

# 1) Load a model checkpoint directory (contains config.json + weights)
model_dir = "AudenAI/auden-asr-zh-en"  # HF repo id or exported directory
model = AutoModel.from_pretrained(model_dir)
model = model.to("cuda" if torch.cuda.is_available() else "cpu")
model.eval()

# 2) Prepare input features (x, x_lens). If you have raw audio, you can use
#    model.speech_encoder.extract_feature(wav) to get (x, x_lens).
x, x_lens = ...  # Tensor shapes: (B, T, F), (B,)

inputs = (x, x_lens)
# Alternatively, you can pass WAV inputs directly:
# - List of WAV paths (str):
#   inputs = ["/abs/a.wav", "/abs/b.wav"]
# - List of mono waveforms (Tensor/ndarray), 16 kHz:
#   inputs = [torch.randn(16000*5), torch.randn(16000*3)]

# 3) ASR (greedy)
hyp = model.generate(inputs)
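The exact feature extractor is `model.speech_encoder.extract_feature`; the arithmetic below is only an illustrative sketch of how the `(x, x_lens)` shapes relate to audio length, assuming a typical Kaldi-style fbank frontend (16 kHz input, 25 ms window, 10 ms hop, 80 mel bins). These frontend parameters are assumptions, not confirmed values for this model.

```python
# Illustrative frame-count arithmetic for (x, x_lens), assuming a typical
# Kaldi-style fbank frontend: 16 kHz audio, 25 ms window, 10 ms hop, 80 mels.
# The model's actual frontend is model.speech_encoder.extract_feature.
SAMPLE_RATE = 16000
WIN = int(0.025 * SAMPLE_RATE)  # 400 samples per 25 ms window
HOP = int(0.010 * SAMPLE_RATE)  # 160 samples per 10 ms hop
N_MELS = 80

def num_frames(num_samples: int) -> int:
    """Frames produced by a sliding window with no padding."""
    return max(0, 1 + (num_samples - WIN) // HOP)

# Two utterances: 5 s and 3 s of audio at 16 kHz.
lens = [num_frames(16000 * 5), num_frames(16000 * 3)]
# Batched features would be padded to (B, T, F) with T = max(lens).
shape = (len(lens), max(lens), N_MELS)
print(shape, lens)  # (2, 498, 80) [498, 298]
```

This is why `x` is `(B, T, F)` and `x_lens` is `(B,)`: each row of `x` is padded to the longest utterance in the batch, and `x_lens` records each utterance's true frame count.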

πŸ“Œ Model Characteristics

  • Model ID: AudenAI/auden-asr-zh-en
  • Input: Raw audio waveform (16 kHz recommended)
  • Output: Chinese and English transcription
  • Decoding: Greedy search (non-streaming)
  • Task: transcribe

πŸ“š Training Data Composition

The model was trained exclusively on Chinese, English, and code-switching data:

| Language | Data Source | Type | Hours | Total Hours |
|---|---|---|---|---|
| Chinese (Zh) | WenetSpeech | Open Source | 10,005 | 129,265 |
| | AISHELL-2 | Open Source | 1,000 | |
| | AISHELL-1 | Open Source | 150 | |
| | Common Voice | Open Source | 237 | |
| | Yodas | Open Source | 222 | |
| | In-house Data | In-house | 117,651 | |
| English (En) | Libriheavy | Open Source | 45,751 | 107,626 |
| | Multilingual LibriSpeech (MLS) | Open Source | 44,659 | |
| | GigaSpeech | Open Source | 10,000 | |
| | Yodas | Open Source | 3,426 | |
| | Common Voice | Open Source | 1,778 | |
| | LibriSpeech | Open Source | 960 | |
| | VoxPopuli | Open Source | 522 | |
| | TED-LIUM | Open Source | 453 | |
| | AMI Corpus | Open Source | 77 | |
| Code-Switch | TALCS | Open Source | 555 | 8,924 |
| | In-house Data | In-house | 8,369 | |
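As a sanity check, the per-language subtotals and the 245,815-hour grand total quoted in the introduction can be recomputed from the hours listed in the table:

```python
# Cross-check the per-language totals and the grand total from the table.
zh = [10_005, 1_000, 150, 237, 222, 117_651]                     # Chinese
en = [45_751, 44_659, 10_000, 3_426, 1_778, 960, 522, 453, 77]   # English
cs = [555, 8_369]                                                # code-switch

totals = {"zh": sum(zh), "en": sum(en), "cs": sum(cs)}
print(totals, sum(totals.values()))
# {'zh': 129265, 'en': 107626, 'cs': 8924} 245815
```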

πŸ“Š Evaluation

Chinese & English ASR (WER↓, greedy search)

| Dataset | WER |
|---|---|
| librispeech-test-clean | 1.81 |
| librispeech-test-other | 3.63 |
| fleurs-en | 7.41 |
| commonvoice20-en | 10.33 |
| fleurs-zh-CN | 6.35 |
| commonvoice20-zh-CN | 6.63 |
| aishell | 1.51 |
| aishell2 | 2.60 |
| wenet_test_meeting | 5.19 |
| wenet_test_net | 5.51 |
| kespeech | 9.72 |
| talcs | 9.86 |
| speechio_0 | 2.15 |
| speechio_1 | 0.80 |
| speechio_2 | 3.10 |
| speechio_3 | 1.36 |
| speechio_4 | 2.36 |
| speechio_5 | 1.70 |
| speechio_6 | 5.83 |
| speechio_7 | 4.89 |
| speechio_8 | 4.73 |
| speechio_9 | 3.56 |
| speechio_10 | 3.72 |
| speechio_11 | 1.65 |
| speechio_12 | 1.98 |
| speechio_13 | 3.84 |
| speechio_14 | 4.51 |
| speechio_15 | 6.03 |
| speechio_16 | 3.76 |
| speechio_17 | 2.77 |
| speechio_18 | 2.29 |
| speechio_19 | 2.72 |
| speechio_20 | 3.19 |
| speechio_21 | 3.33 |
| speechio_22 | 3.94 |
| speechio_23 | 3.72 |
| speechio_24 | 5.25 |
| speechio_25 | 3.99 |
| speechio_26 | 3.84 |
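For reference, word error rate is the Levenshtein edit distance between reference and hypothesis word sequences, normalized by reference length. A minimal sketch (note that Chinese test sets are conventionally scored on characters rather than words; the scoring setup used for the numbers above is not specified here):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance over reference length."""
    r, h = ref.split(), hyp.split()
    # DP table: d[i][j] = edits to turn r[:i] into h[:j].
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # deletions
    for j in range(len(h) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)

print(wer("the cat sat", "the cat sat"))    # 0.0
print(wer("the cat sat", "a cat sat down")) # 1 sub + 1 ins = 2/3 β‰ˆ 0.667
```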

⚠️ Limitations

  • Performance depends on audio quality and recording conditions.
  • Not designed for safety-critical applications.