Qwen3-ASR-1.7B-th-fleurs

A Thai automatic speech recognition (ASR) model: a full fine-tune of Qwen/Qwen3-ASR-1.7B on the Thai split of google/fleurs.

Evaluation

FLEURS Thai test split (1,021 utterances). Metrics are computed with the evaluate library on raw model output against the reference, with no text normalisation.

Model	CER (%)	WER (%)
Qwen/Qwen3-ASR-1.7B (base)	8.32	79.56
This model	7.02	61.00
Relative error reduction	15.7%	23.3%

Lower is better.
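For readers unfamiliar with the metrics, here is a minimal pure-Python sketch of per-utterance CER, WER, and relative error reduction. It is not the evaluate library's implementation (which aggregates over the whole test set); the function names and the example sentence are illustrative assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate over whitespace-delimited tokens."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def relative_reduction(base, tuned):
    """Relative error reduction in percent (positive means the tuned model is better)."""
    return (base - tuned) / base * 100


# Illustrative example (not taken from FLEURS): one wrong character in a two-token sentence.
ref = "สวัสดีครับ ผมชื่อสมชาย"
hyp = "สวัสดีครับ ผมชื่อสมชัย"
print(f"CER = {cer(ref, hyp):.3f}, WER = {wer(ref, hyp):.3f}")
print(f"WER reduction = {relative_reduction(79.56, 61.00):.1f}%")
```

Because Thai is normally written without spaces between words, a whitespace-delimited token can span many characters, so a single character error can invalidate a long token. This may account for part of the gap between CER and WER in the table, although the card itself does not say so. Recomputing the CER reduction from the rounded table values gives roughly 15.6%, so the reported figure was presumably derived from unrounded scores.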

Usage

import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "PogusTheWhisper/Qwen3-ASR-1.7B-th-fleurs",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    max_inference_batch_size=16,
    max_new_tokens=256,
)
results = model.transcribe(audio="path/to/audio.wav", language="Thai")
print(results[0].text)

For maximum throughput, use the vLLM backend:

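from qwen_asr import Qwen3ASRModel

# Load the same checkpoint through the vLLM backend.
model = Qwen3ASRModel.LLM(
    model="PogusTheWhisper/Qwen3-ASR-1.7B-th-fleurs",
    gpu_memory_utilization=0.7,
    max_inference_batch_size=128,
    max_new_tokens=4096,
)
# The transcribe call is the same as with the default backend.
results = model.transcribe(audio="path/to/audio.wav", language="Thai")
print(results[0].text)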
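To transcribe several files, a simple loop over the call shown above is enough. The sketch below uses only the `transcribe(audio=..., language=...)` signature and the `results[0].text` access from the snippets above; the helper name `transcribe_files` is hypothetical and not part of qwen_asr.

```python
def transcribe_files(model, paths, language="Thai"):
    """Return {path: transcript}, calling model.transcribe once per file."""
    transcripts = {}
    for path in paths:
        results = model.transcribe(audio=str(path), language=language)
        transcripts[str(path)] = results[0].text
    return transcripts


# Usage, assuming `model` was loaded with either backend above:
# texts = transcribe_files(model, ["a.wav", "b.wav"])
```

Calling the model one file at a time does not take advantage of `max_inference_batch_size`; check the qwen_asr documentation for its supported batched input form if throughput matters.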

Training

Full fine-tune (FFT) of the entire 1.7B-parameter base model using the official QwenLM/Qwen3-ASR qwen3_asr_sft.py script.
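As a rough illustration of the label format listed in the table below, the following sketch turns (audio path, transcript) pairs into training labels and writes them as JSON Lines. The `audio` and `text` keys and the helper names are assumptions made for illustration; consult the qwen3_asr_sft.py documentation for the exact input format it expects.

```python
import json


def make_label(transcript, language="Thai"):
    """Build a label of the form 'language Thai<asr_text>...'."""
    return f"language {language}<asr_text>{transcript}"


def write_manifest(rows, out_path):
    """Write one JSON object per line; the key names are assumed, not confirmed."""
    with open(out_path, "w", encoding="utf-8") as f:
        for audio_path, transcript in rows:
            record = {"audio": audio_path, "text": make_label(transcript)}
            # ensure_ascii=False keeps Thai text readable in the output file.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```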

Hyperparameter	Value
Base model	Qwen/Qwen3-ASR-1.7B
Method	Full fine-tune (FFT, not LoRA)
Dataset	google/fleurs (`th_th`, 2,602 train / 1,021 test utterances)
Label format	`language Thai<asr_text>{{transcript}}`
Optimizer	AdamW (HF Trainer defaults)
Learning rate	2e-5 (the official script was tested at this rate with an effective batch size of 32)
LR scheduler	cosine (smoother decay, better results in the final epochs)
Warmup ratio	0.05 (0.1 was too aggressive with the cosine schedule)
Weight decay	0.01 (regularisation against memorisation)
Effective batch size	32 (per-device 1 × grad_acc 32)
Epochs	5
Precision	bfloat16
Hardware	1× NVIDIA RTX 3090 (24 GB)
Training time	~25 minutes
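The hyperparameters above imply a short schedule. The sketch below works out the approximate step counts and the learning rate at any optimizer step, assuming linear warmup followed by a half-cosine decay to zero (the usual shape of a cosine schedule with warmup). Exact step counts in the actual run may differ by a step or so depending on how the trainer rounds.

```python
import math

TRAIN_UTTERANCES = 2602   # FLEURS th_th train split
EFFECTIVE_BATCH = 32      # per-device 1 x grad_acc 32
EPOCHS = 5
PEAK_LR = 2e-5
WARMUP_RATIO = 0.05

steps_per_epoch = math.ceil(TRAIN_UTTERANCES / EFFECTIVE_BATCH)  # 82
total_steps = steps_per_epoch * EPOCHS                           # 410
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)             # 21


def lr_at(step):
    """Learning rate at a given optimizer step (0-based)."""
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return PEAK_LR * step / warmup_steps
    # Half-cosine decay from the peak down to 0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1 + math.cos(math.pi * progress))


for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{lr_at(step):.2e}")
```

With only about 410 optimizer steps in total, the warmup covers roughly the first 21 steps, which is consistent with the short training time reported above.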

Limitations

Trained on read speech only: FLEURS is broadcast / audiobook style. Performance on conversational, noisy, and telephony audio has not been measured.
16 kHz mono audio is required (this matches the base model).
Code-switched audio (Thai + English) inherits base-model behaviour; the model was not specifically tuned for it.
Small training set (~7 hours). For production use, consider mixing with Common Voice Thai or your own labelled data.
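Given the 16 kHz mono requirement above, it can be worth checking files before transcription. This is a minimal sketch using Python's standard `wave` module, so it only handles PCM WAV files; the helper names are hypothetical, and whether qwen_asr resamples other formats internally is not stated here.

```python
import wave

REQUIRED_RATE = 16_000   # Hz
REQUIRED_CHANNELS = 1    # mono


def wav_format(path):
    """Return (sample_rate, channels) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate(), wav.getnchannels()


def is_model_ready(path):
    """True if the file is already 16 kHz mono."""
    return wav_format(path) == (REQUIRED_RATE, REQUIRED_CHANNELS)
```

Files that fail the check would need to be resampled and downmixed with an audio tool of your choice before being passed to the model.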

Citation

@misc{qwen3asr,
  title  = {Qwen3-ASR},
  author = {Qwen Team},
  year   = {2025},
  url    = {https://huggingface.co/Qwen/Qwen3-ASR-1.7B}
}

@inproceedings{conneau2023fleurs,
  title     = {FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech},
  author    = {Conneau, Alexis and Ma, Min and Khanuja, Simran and Zhang, Yu and Axelrod, Vera and Dalmia, Siddharth and Riesa, Jason and Rivera, Clara and Bapna, Ankur},
  booktitle = {SLT},
  year      = {2023}
}
