OmniVoice 🌍

This OmniVoice variant was trained exclusively on the Chinese and English subsets of the Emilia dataset and corresponds to the "OmniVoice-Emilia" model described in our paper. It is intended for researchers who want to reproduce the experimental results reported there. For regular use where the best performance matters, we recommend the OmniVoice checkpoint trained on the full dataset instead.

When using this checkpoint, set denoise = False and lang_id = None: the model was trained without prompt denoising or language-ID conditioning.
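The two settings above can be captured in a small guard that runs before inference. This is a minimal sketch, not part of the OmniVoice package: only the names denoise and lang_id come from this card, while EMILIA_SETTINGS and check_emilia_settings are hypothetical helpers introduced for illustration, and how the values are actually passed to the model is an assumption to verify against the OmniVoice code.

```python
from typing import Any, Dict, List, Optional

# Settings this model card asks for with the OmniVoice-Emilia checkpoint.
# The dict itself is a hypothetical convenience, not an OmniVoice API.
EMILIA_SETTINGS: Dict[str, Any] = {"denoise": False, "lang_id": None}


def check_emilia_settings(denoise: bool, lang_id: Optional[str]) -> List[str]:
    """Return a list of problems; an empty list means the settings match the card."""
    problems: List[str] = []
    if denoise:
        problems.append(
            "denoise must be False: the model was trained without prompt denoising"
        )
    if lang_id is not None:
        problems.append(
            "lang_id must be None: the model was trained without language-ID conditioning"
        )
    return problems


if __name__ == "__main__":
    # Recommended settings: no problems are reported.
    print(check_emilia_settings(**EMILIA_SETTINGS))
    # Settings that conflict with how this checkpoint was trained.
    for problem in check_emilia_settings(denoise=True, lang_id="en"):
        print(problem)
```

If the inference entry point you use accepts these two options as keyword arguments (an assumption), the same dict can be unpacked into that call once the check returns an empty list.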

Citation

@article{zhu2026omnivoice,
      title={OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models},
      author={Zhu, Han and Ye, Lingxuan and Kang, Wei and Yao, Zengwei and Guo, Liyong and Kuang, Fangjun and Han, Zhifeng and Zhuang, Weiji and Lin, Long and Povey, Daniel},
      journal={arXiv preprint arXiv:2604.00688},
      year={2026}
}

Model size: 0.6B params (Safetensors; tensor types: I64, F32)

Paper for k2-fsa/OmniVoice-Emilia: OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models (arXiv:2604.00688, published Apr 1)