According to Beating Monitor, Xiaomi AI Lab’s new Kaldi team has open-sourced OmniVoice, a zero-shot voice cloning TTS (text-to-speech) model supporting 646 languages. With just a few seconds of reference audio, the model can clone a voice and generate speech in multiple languages: given a Chinese recording, it can speak Japanese, Korean, or other languages in the same voice. The code, weights, and training data are fully open-sourced under the Apache-2.0 license.
Architecturally, OmniVoice adopts a minimalist design. The entire model consists of a single bidirectional Transformer that maps text directly to multi-codebook acoustic tokens (discrete encodings of sound), eliminating the traditional two-stage pipeline that first converts text to semantic tokens and then to acoustic tokens. Two key innovations make this simple architecture work: a full-codebook random masking strategy improves training efficiency, while initializing the model with pre-trained parameters from a large language model improves pronunciation accuracy. Inference runs at 40x real-time speed on PyTorch without requiring additional optimization.
The training data was sourced entirely from 50 open-source speech datasets, totaling 580,000 hours after noise reduction and quality filtering. Low-resource languages are dynamically upsampled so that they are adequately represented during training.
In evaluations across 24 languages, OmniVoice outperformed multiple commercial systems in both voice similarity and intelligibility. Across 102 languages, intelligibility approached or exceeded that of real human recordings. Even for low-resource languages with fewer than 10 hours of training data, high-quality synthesis is achievable.
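The report does not spell out how the full-codebook random masking works. As a rough illustration only, the sketch below assumes it means hiding positions at random across every codebook at once, rather than one codebook at a time, and training the model to predict the hidden tokens. The names `MASK_ID` and `mask_all_codebooks`, the per-example mask ratio, and the toy token values are all hypothetical and are not taken from the OmniVoice code.

```python
import random

MASK_ID = -1  # hypothetical placeholder id standing in for a masked token


def mask_all_codebooks(tokens, mask_ratio, rng):
    """Hide positions independently across every codebook and frame.

    tokens: list of codebooks, each a list of integer token ids (one per frame).
    Returns (masked_tokens, mask), where mask[c][t] is True if that position
    was hidden and would be a prediction target during training.
    """
    masked, mask = [], []
    for codebook in tokens:
        row_mask = [rng.random() < mask_ratio for _ in codebook]
        masked.append(
            [MASK_ID if hidden else tok for tok, hidden in zip(codebook, row_mask)]
        )
        mask.append(row_mask)
    return masked, mask


rng = random.Random(0)
# Toy input: 4 codebooks x 6 frames of made-up token ids.
tokens = [[rng.randrange(1024) for _ in range(6)] for _ in range(4)]
# Assumption: a fresh mask ratio is drawn for each training example.
ratio = rng.uniform(0.0, 1.0)
masked_tokens, mask = mask_all_codebooks(tokens, ratio, rng)
```

In a real training loop, the masked grid and the text would be fed to the bidirectional Transformer, and the loss would be computed only at positions where `mask` is true.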
Beyond voice cloning, the model supports text-based voice customization (e.g., “male, middle-aged, very low pitch” or “female, young, Sichuan dialect”), automatic noise reduction for noisy reference audio, insertion of prosodic markers such as laughter and sighs, and correction of polyphonic characters and proper nouns in both Chinese and English.
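The report says low-resource languages are dynamically upsampled during training but gives no formula. One common way to do this in multilingual training is to sample each language in proportion to its data volume raised to a power below 1, and the sketch below illustrates that idea. The exponent `alpha`, the function name `sampling_weights`, and the hour figures are assumptions made for illustration, not details from OmniVoice.

```python
def sampling_weights(hours_by_language, alpha=0.5):
    """Return per-language sampling probabilities proportional to hours ** alpha.

    alpha = 1.0 reproduces each language's natural share of the data;
    smaller values shift probability toward languages with less data.
    """
    scaled = {lang: hours ** alpha for lang, hours in hours_by_language.items()}
    total = sum(scaled.values())
    return {lang: value / total for lang, value in scaled.items()}


# Made-up figures: one large language and one with only 10 hours of audio.
hours = {"high_resource": 10000.0, "low_resource": 10.0}
weights = sampling_weights(hours, alpha=0.5)
natural = sampling_weights(hours, alpha=1.0)
print(f"natural share: {natural['low_resource']:.4f}")
print(f"upsampled share: {weights['low_resource']:.4f}")
```

Under this scheme the small language is drawn far more often than its raw share of hours would suggest, at the cost of repeating its recordings more frequently.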
Xiaomi open-sources OmniVoice: a 646-language voice cloning model trained on open data
Source: MarsBit
Xiaomi AI Lab’s Kaldi team has open-sourced OmniVoice, a zero-shot text-to-speech model supporting 646 languages. The model can replicate a speaker’s timbre from only a few seconds of reference audio and carry it across languages. Trained on open-source data, it outperforms commercial systems in voice similarity and intelligibility. On-chain data indicates rising interest in AI tools, with open interest in related projects steadily increasing. The model employs a single bidirectional Transformer and is built for fast inference.