OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models
Paper • arXiv:2604.00688
This OmniVoice variant was trained exclusively on the Chinese and English subsets of the Emilia dataset and corresponds to the "OmniVoice-Emilia" model described in our paper. It is intended for researchers who want to reproduce the experimental results reported there. For regular use, we recommend the OmniVoice checkpoint trained on the full dataset, which performs better.
When using this checkpoint, set denoise = False and lang_id = None: the model was trained without prompt denoising or language-ID conditioning.
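As a minimal sketch of the settings above, the snippet below pins the two values in one place and rejects conflicting overrides. The helper name `emilia_generation_kwargs` and the extra option `num_step` are hypothetical and used only for illustration; only `denoise` and `lang_id` come from this card, and how they are passed to the actual OmniVoice inference call is not shown here.

```python
from typing import Any

# Settings this card requires for the Emilia-only checkpoint.
EMILIA_REQUIRED: dict[str, Any] = {"denoise": False, "lang_id": None}


def emilia_generation_kwargs(**options: Any) -> dict[str, Any]:
    """Hypothetical helper: merge caller options with the required settings.

    Raises ValueError if the caller asks for prompt denoising or a language
    ID, since this checkpoint was trained without either.
    """
    for key, required in EMILIA_REQUIRED.items():
        if key in options and options[key] != required:
            raise ValueError(
                f"{key}={options[key]!r} is not supported by this checkpoint; "
                f"use {key}={required!r}"
            )
    # Required settings are applied last so they always take effect.
    return {**options, **EMILIA_REQUIRED}


# 'num_step' is an assumed option name, included only to show pass-through.
kwargs = emilia_generation_kwargs(num_step=32)
```

The resulting dictionary could then be unpacked into whichever inference entry point the OmniVoice code exposes.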
@article{zhu2026omnivoice,
title={OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models},
author={Zhu, Han and Ye, Lingxuan and Kang, Wei and Yao, Zengwei and Guo, Liyong and Kuang, Fangjun and Han, Zhifeng and Zhuang, Weiji and Lin, Long and Povey, Daniel},
journal={arXiv preprint arXiv:2604.00688},
year={2026}
}