Qwen3-ASR-1.7B-th-fleurs

A Thai automatic speech recognition (ASR) model: a full fine-tune of Qwen/Qwen3-ASR-1.7B on the Thai split of google/fleurs.

Evaluation

Results on the FLEURS Thai test split (1,021 utterances), computed with the evaluate library on raw model output against the reference, with no text normalisation.
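
For readers who want to see what the two metrics measure, here is a minimal pure-Python sketch of corpus-level CER and WER based on edit distance. It is an illustration only, not the evaluate library's implementation; the function names `edit_distance` and `error_rate` are hypothetical, and the sketch assumes that spaces count as characters for CER and that words are split on whitespace for WER.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(references, hypotheses, unit="char"):
    """Corpus-level error rate in percent: total edits / total reference units."""
    edits = 0
    total = 0
    for ref, hyp in zip(references, hypotheses):
        # Assumption: characters include spaces; words are whitespace-separated.
        ref_units = list(ref) if unit == "char" else ref.split()
        hyp_units = list(hyp) if unit == "char" else hyp.split()
        edits += edit_distance(ref_units, hyp_units)
        total += len(ref_units)
    return 100.0 * edits / total


# Made-up example pair, not taken from FLEURS.
refs = ["the cat sat"]
hyps = ["the cat sit"]
print(error_rate(refs, hyps, unit="char"), error_rate(refs, hyps, unit="word"))
```

A single wrong character inside one word costs one character edit but a whole word error, which is one reason WER can be far higher than CER on the same output.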

| Model | CER (%) | WER (%) |
| --- | --- | --- |
| Qwen/Qwen3-ASR-1.7B (base) | 8.32 | 79.56 |
| This model | 7.02 | 61.00 |
| Δ relative improvement | +15.7% | +23.3% |

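Lower is better for both metrics. Thai is written without spaces between words, so WER depends heavily on how the text is segmented; CER is usually the more informative metric.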
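
The relative-improvement row can be checked with a short calculation. The sketch below assumes the usual definition, (baseline - fine-tuned) / baseline, and uses the rounded scores from the table; the helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(baseline, finetuned):
    """Relative error reduction in percent; positive means the fine-tune is better."""
    return 100.0 * (baseline - finetuned) / baseline


# Rounded scores copied from the evaluation table.
cer_gain = relative_reduction(8.32, 7.02)
wer_gain = relative_reduction(79.56, 61.00)
print(f"CER: {cer_gain:.1f}%  WER: {wer_gain:.1f}%")
```

From the rounded table values this gives about 15.6% for CER and 23.3% for WER. The CER figure is slightly below the reported 15.7%, presumably because the reported number was computed from unrounded scores.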

Usage

import torch
from qwen_asr import Qwen3ASRModel

# Load the fine-tuned checkpoint in bfloat16 on the first GPU.
model = Qwen3ASRModel.from_pretrained(
    "PogusTheWhisper/Qwen3-ASR-1.7B-th-fleurs",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    max_inference_batch_size=16,
    max_new_tokens=256,
)

# transcribe() returns a list of results; a single file yields one entry.
results = model.transcribe(audio="path/to/audio.wav", language="Thai")
print(results[0].text)

For maximum throughput, use the vLLM backend:

from qwen_asr import Qwen3ASRModel

# vLLM backend: larger batches and a longer generation budget.
model = Qwen3ASRModel.LLM(
    model="PogusTheWhisper/Qwen3-ASR-1.7B-th-fleurs",
    gpu_memory_utilization=0.7,
    max_inference_batch_size=128,
    max_new_tokens=4096,
)
results = model.transcribe(audio="path/to/audio.wav", language="Thai")
print(results[0].text)
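
To transcribe several files with either backend, one simple option is to loop over them with the same call shown above. The helper below is a hypothetical sketch, not part of qwen_asr: it relies only on `transcribe(audio=..., language=...)` and `results[0].text` as used in the examples, and it makes one call per file rather than assuming anything about batched inputs.

```python
def transcribe_files(model, paths, language="Thai"):
    """Transcribe each file in turn and return {path: text}.

    `model` is any object exposing the transcribe(audio=..., language=...)
    call used in the usage examples; this helper is hypothetical.
    """
    transcripts = {}
    for path in paths:
        results = model.transcribe(audio=str(path), language=language)
        transcripts[str(path)] = results[0].text
    return transcripts
```

With a loaded model, `transcribe_files(model, ["a.wav", "b.wav"])` would return a dictionary keyed by file path.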

Training

Full fine-tune (FFT) of the entire 1.7B-parameter base model using the official QwenLM/Qwen3-ASR qwen3_asr_sft.py script.

| Hyperparameter | Value |
| --- | --- |
| Base model | Qwen/Qwen3-ASR-1.7B |
| Method | Full fine-tune (FFT, not LoRA) |
| Dataset | google/fleurs (th_th; 2,602 train / 1,021 test utterances) |
| Label format | language Thai<asr_text>{{transcript}} |
| Optimizer | AdamW (HF Trainer defaults) |
| Learning rate | 2e-5 (the rate the official script was tested with at effective batch size 32) |
| LR scheduler | cosine (smoother decay, better final epochs) |
| Warmup ratio | 0.05 (0.1 was too aggressive with cosine decay) |
| Weight decay | 0.01 (to discourage memorisation) |
| Effective batch size | 32 (per-device 1 × gradient accumulation 32) |
| Epochs | 5 |
| Precision | bfloat16 |
| Hardware | 1× NVIDIA RTX 3090 (24 GB) |
| Training time | ~25 minutes |
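
To make the schedule concrete, the sketch below derives approximate step counts from the hyperparameters above and evaluates a linear-warmup, cosine-decay learning rate. It is a standalone approximation, not the training script: it assumes the last partial batch of each epoch counts as one update, so the exact counts used by the HF Trainer may differ by a step or two, and `lr_at` is a hypothetical helper name.

```python
import math

NUM_TRAIN = 2602       # FLEURS th_th training utterances
EFFECTIVE_BATCH = 32   # per-device 1 x gradient accumulation 32
EPOCHS = 5
PEAK_LR = 2e-5
WARMUP_RATIO = 0.05

# Assumption: a trailing partial batch still produces one optimizer update.
steps_per_epoch = math.ceil(NUM_TRAIN / EFFECTIVE_BATCH)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(WARMUP_RATIO * total_steps)


def lr_at(step):
    """Linear warmup to PEAK_LR, then half-cosine decay to zero."""
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(steps_per_epoch, total_steps, warmup_steps)
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, lr_at(step))
```

Under these assumptions the whole run is only a few hundred optimizer updates, which is consistent with the short training time on a single GPU.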

Limitations

  • Trained on read-speech only — FLEURS is broadcast / audiobook style. Conversational, noisy, and telephony-audio performance is not measured.
  • 16 kHz mono audio required (matches the base model).
  • Code-switched audio (Thai + English) inherits base-model behaviour; not specifically tuned.
  • Small training set (~7 hours). For production use, consider mixing with Common Voice Thai or your own labelled data.
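
Given the 16 kHz mono requirement, a quick pre-flight check on input files can catch format mismatches early. The sketch below uses only Python's standard `wave` module, so it handles uncompressed PCM WAV files only; `check_wav_format` is a hypothetical helper and is not part of qwen_asr.

```python
import wave


def check_wav_format(path_or_file, expected_rate=16000, expected_channels=1):
    """Return (ok, sample_rate, channels) for a PCM WAV file or file object."""
    with wave.open(path_or_file, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    ok = rate == expected_rate and channels == expected_channels
    return ok, rate, channels
```

Files that fail the check would need to be resampled or downmixed with an audio tool of your choice before transcription.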

Citation

@misc{qwen3asr,
  title  = {Qwen3-ASR},
  author = {Qwen Team},
  year   = {2025},
  url    = {https://huggingface.co/Qwen/Qwen3-ASR-1.7B}
}

@inproceedings{conneau2023fleurs,
  title     = {FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech},
  author    = {Conneau, Alexis and Ma, Min and Khanuja, Simran and Zhang, Yu and Axelrod, Vera and Dalmia, Siddharth and Riesa, Jason and Rivera, Clara and Bapna, Ankur},
  booktitle = {SLT},
  year      = {2023}
}