Xiaomi open-sources OmniVoice: a 646-language voice cloning model trained on open data

According to Beating Monitor, Xiaomi AI Lab’s new Kaldi team has open-sourced OmniVoice, a zero-shot voice-cloning text-to-speech (TTS) model that supports 646 languages. Given only a few seconds of reference audio, the model clones the speaker’s voice and can then speak other languages with it: feed in a Chinese recording, and the model will speak Japanese, Korean, or other languages in the same voice. The code, weights, and training data are all open-sourced under the Apache-2.0 license.
Architecturally, OmniVoice is deliberately minimal. The entire model is a single bidirectional Transformer that maps text directly to multi-codebook acoustic tokens (discrete encodings of sound), eliminating the traditional two-stage pipeline that first converts text to semantic tokens and then converts those to acoustic tokens. Two techniques make this simple architecture work: a full-codebook random masking strategy improves training efficiency, and initializing the model from pre-trained large language model parameters improves pronunciation accuracy. Inference runs at 40 times real-time speed in PyTorch with no additional optimization.
The training data comes entirely from 50 open-source speech datasets, totaling 580,000 hours after noise reduction and quality filtering. Low-resource languages are dynamically upsampled during training so that the model still learns them effectively.
In evaluations across 24 languages, OmniVoice outperformed multiple commercial systems in both voice similarity and intelligibility. Across 102 languages, its intelligibility approached or exceeded that of real human recordings. Even low-resource languages with fewer than 10 hours of training data reach high-quality synthesis.
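The report does not spell out how the full-codebook random masking strategy works. As a rough illustration only, the sketch below assumes it means masking every (frame, codebook) position independently at a randomly drawn ratio, rather than masking one codebook layer at a time, with the model then trained to predict the masked tokens. The names `MASK_ID` and `mask_all_codebooks` are hypothetical and do not come from the OmniVoice code; plain Python is used instead of PyTorch to keep the example self-contained.

```python
import random

MASK_ID = -1  # hypothetical placeholder id standing in for a masked token


def mask_all_codebooks(tokens, mask_ratio, rng):
    """Mask positions independently across every frame and every codebook.

    tokens: list of frames, each frame a list with one token id per codebook.
    Returns (masked_tokens, is_masked), both shaped like `tokens`.
    """
    masked, is_masked = [], []
    for frame in tokens:
        # One independent coin flip per codebook entry in this frame.
        flags = [rng.random() < mask_ratio for _ in frame]
        masked.append([MASK_ID if f else t for t, f in zip(frame, flags)])
        is_masked.append(flags)
    return masked, is_masked


rng = random.Random(0)
# Toy input: 6 frames x 4 codebooks of made-up token ids.
tokens = [[rng.randrange(1024) for _ in range(4)] for _ in range(6)]
# Assumption: a fresh masking ratio is drawn for each training example.
mask_ratio = rng.uniform(0.0, 1.0)
masked, is_masked = mask_all_codebooks(tokens, mask_ratio, rng)
for row in masked:
    print(row)
```

In a training loop built on this idea, the loss would be computed only at positions where `is_masked` is true, so a single forward pass can supervise tokens from all codebooks at once.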
Beyond voice cloning, the model supports voice design from a text description (e.g., “male, middle-aged, very low pitch” or “female, young, Sichuan dialect”), automatic noise reduction for noisy reference audio, insertion of prosodic markers such as laughter and sighs, and pronunciation correction for polyphonic characters and proper nouns in both Chinese and English.