# auden-asr-zh-en: Chinese-English ASR
This model card describes AudenAI/auden-asr-zh-en, a bilingual ASR model for Chinese and English speech. It is a pruned RNN-T system with a Zipformer encoder, designed for non-streaming transcription only. At ~170M parameters it is a small model, built for fast training and inference. Training uses 245,815 hours of Chinese, English, and code-switching data (summarized below), and the model targets robust accuracy across common Chinese and English benchmarks.
## What Can This Model Do?

- Chinese ASR (Mandarin transcription)
- English ASR (English transcription)
- Robust performance across mixed Chinese/English data

## Quick Start

### Non-streaming Usage
```python
from auden.auto.auto_model import AutoModel

# 1) Load a model checkpoint directory (contains config.json + weights)
model_dir = "AudenAI/auden-asr-zh-en"  # HF repo id or exported directory
model = AutoModel.from_pretrained(model_dir)
model = model.to("cuda")
model.eval()

# 2) Prepare input features (x, x_lens). If you have raw audio, you can use
#    model.speech_encoder.extract_feature(wav) to get (x, x_lens).
x, x_lens = ...  # Tensor shapes: (B, T, F), (B,)
inputs = (x, x_lens)

# Alternatively, you can pass WAV inputs directly:
# - List of WAV paths (str):
#   inputs = ["/abs/a.wav", "/abs/b.wav"]
# - List of mono waveforms (Tensor/ndarray), 16 kHz:
#   inputs = [torch.randn(16000*5), torch.randn(16000*3)]

# 3) Transcribe (greedy decoding)
hyp = model.generate(inputs)
```
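If you want to pass raw waveforms rather than paths, the audio must be mono. The sketch below uses only the Python standard library to load a 16-bit mono WAV into float samples in [-1, 1]; the `write_test_wav` helper and file name are illustrative, only so the example is self-contained (in practice you would read your own files, or simply hand the model a list of WAV paths as shown above).

```python
import math, struct, wave

def write_test_wav(path, sr=16000, seconds=0.1, freq=440.0):
    """Write a short 16-bit mono sine wave so the example is self-contained."""
    n = int(sr * seconds)
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 16-bit PCM
        w.setframerate(sr)
        frames = b"".join(
            struct.pack("<h", int(32767 * 0.5 * math.sin(2 * math.pi * freq * i / sr)))
            for i in range(n)
        )
        w.writeframes(frames)

def load_mono_16k(path):
    """Read a 16-bit mono WAV and return (samples scaled to [-1, 1], sample_rate)."""
    with wave.open(path, "rb") as w:
        assert w.getnchannels() == 1, "expected mono audio"
        sr = w.getframerate()
        raw = w.readframes(w.getnframes())
    samples = [s / 32768.0 for s in struct.unpack(f"<{len(raw) // 2}h", raw)]
    return samples, sr

write_test_wav("example.wav")
samples, sr = load_mono_16k("example.wav")
print(sr, len(samples))  # 16000 1600
```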
## Model Characteristics
- Model ID: AudenAI/auden-asr-zh-en
- Input: raw audio waveform (16 kHz recommended)
- Output: Chinese and English transcription
- Decoding: greedy search (non-streaming)
- Task: transcribe
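Since 16 kHz input is recommended, audio at other rates should be resampled first. The snippet below is a minimal linear-interpolation resampler to illustrate the idea; it is not part of this model's API, and for real use a proper DSP library (e.g. torchaudio or soxr) is the better choice.

```python
def resample_linear(samples, sr_in, sr_out=16000):
    """Naive linear-interpolation resampling; illustrative only."""
    if sr_in == sr_out:
        return list(samples)
    n_out = int(len(samples) * sr_out / sr_in)
    out = []
    for i in range(n_out):
        pos = i * sr_in / sr_out           # fractional index into the input
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

audio_48k = [0.0] * 48000                  # one second of silence at 48 kHz
audio_16k = resample_linear(audio_48k, 48000)
print(len(audio_16k))  # 16000
```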
## Training Data Composition

The model is trained on Chinese, English, and code-switching data only:
| Language | Data Source | Type | Hours | Total Hours |
|---|---|---|---|---|
| Chinese (Zh) | WenetSpeech | Open Source | 10,005 | 129,265 |
| | AISHELL-2 | Open Source | 1,000 | |
| | AISHELL-1 | Open Source | 150 | |
| | Common Voice | Open Source | 237 | |
| | Yodas | Open Source | 222 | |
| | In-house Data | In-house | 117,651 | |
| English (En) | Libriheavy | Open Source | 45,751 | 107,626 |
| | Multilingual LibriSpeech (MLS) | Open Source | 44,659 | |
| | GigaSpeech | Open Source | 10,000 | |
| | Yodas | Open Source | 3,426 | |
| | Common Voice | Open Source | 1,778 | |
| | LibriSpeech | Open Source | 960 | |
| | VoxPopuli | Open Source | 522 | |
| | TED-LIUM | Open Source | 453 | |
| | AMI Corpus | Open Source | 77 | |
| Code-Switch | TALCS | Open Source | 555 | 8,924 |
| | In-house Data | In-house | 8,369 | |
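The per-language subtotals and the 245,815-hour grand total quoted in the introduction can be checked directly from the per-source hours in the table:

```python
# Hours per source, copied from the table above.
zh = [10_005, 1_000, 150, 237, 222, 117_651]                    # Chinese
en = [45_751, 44_659, 10_000, 3_426, 1_778, 960, 522, 453, 77]  # English
cs = [555, 8_369]                                               # Code-switch

print(sum(zh), sum(en), sum(cs), sum(zh) + sum(en) + sum(cs))
# 129265 107626 8924 245815
```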
## Evaluation

Chinese & English ASR (WER ↓, greedy search)
| Dataset | WER |
|---|---|
| librispeech-test-clean | 1.81 |
| librispeech-test-other | 3.63 |
| fleurs-en | 7.41 |
| commonvoice20-en | 10.33 |
| fleurs-zh-CN | 6.35 |
| commonvoice20-zh-CN | 6.63 |
| aishell | 1.51 |
| aishell2 | 2.60 |
| wenet_test_meeting | 5.19 |
| wenet_test_net | 5.51 |
| kespeech | 9.72 |
| talcs | 9.86 |
| speechio_0 | 2.15 |
| speechio_1 | 0.80 |
| speechio_2 | 3.10 |
| speechio_3 | 1.36 |
| speechio_4 | 2.36 |
| speechio_5 | 1.70 |
| speechio_6 | 5.83 |
| speechio_7 | 4.89 |
| speechio_8 | 4.73 |
| speechio_9 | 3.56 |
| speechio_10 | 3.72 |
| speechio_11 | 1.65 |
| speechio_12 | 1.98 |
| speechio_13 | 3.84 |
| speechio_14 | 4.51 |
| speechio_15 | 6.03 |
| speechio_16 | 3.76 |
| speechio_17 | 2.77 |
| speechio_18 | 2.29 |
| speechio_19 | 2.72 |
| speechio_20 | 3.19 |
| speechio_21 | 3.33 |
| speechio_22 | 3.94 |
| speechio_23 | 3.72 |
| speechio_24 | 5.25 |
| speechio_25 | 3.99 |
| speechio_26 | 3.84 |
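For reference, WER as reported in tables like this one is the token-level Levenshtein edit distance between reference and hypothesis, divided by the reference length; for Chinese test sets scoring is typically per character rather than per word. A minimal sketch (this helper is illustrative, not this repo's scoring script):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance over token sequences, with a rolling DP row."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # min over deletion, insertion, and (mis)match
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def wer(ref, hyp):
    ref_toks, hyp_toks = ref.split(), hyp.split()
    return edit_distance(ref_toks, hyp_toks) / len(ref_toks)

print(wer("the quick brown fox", "the quick brown fox"))  # 0.0
print(wer("the quick brown fox", "the quack fox"))        # 0.5
```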
## Limitations
- Performance depends on audio quality and recording conditions.
- Not designed for safety-critical applications.