Xiaohongshu open-sources the 2B-parameter TTS model dots.tts with zero-shot voice cloning

MarsBit

Summary
Xiaohongshu's Hi Lab has open-sourced dots.tts, a 2B-parameter text-to-speech (TTS) model that supports zero-shot voice cloning. The release is licensed under Apache 2.0 and includes full inference and fine-tuning code along with several sets of pre-trained weights. dots.tts uses a fully continuous, end-to-end autoregressive flow matching approach instead of the discrete audio tokens that many TTS models rely on. It reports state-of-the-art results among open-source models on the Seed-TTS-Eval benchmark, and a live demo is available on Hugging Face.

According to Beating Monitor, Xiaohongshu's Hi Lab has open-sourced dots.tts, a 2-billion-parameter end-to-end autoregressive text-to-speech (TTS) model, and released the complete inference and fine-tuning code under the Apache 2.0 license. The released weights come in three versions: the base pre-trained model, a Self-Correction Alignment (SCA) fine-tuned model, and a distilled model for low-latency inference.

Unlike TTS architectures that rely on discrete audio codec tokens (such as VALL-E, CosyVoice, and ChatTTS), dots.tts uses a fully continuous, end-to-end autoregressive flow matching architecture with no discrete tokens at all. The model pairs a 48 kHz AudioVAE, which supplies continuous latent features, with a semantic encoder, a backbone language model, and an autoregressive flow matching acoustic head. The backbone is initialized from Qwen2.5-1.5B-Base and processes BPE text directly, with no pinyin input required. The acoustic head predicts continuous latent variables, which a generator then reconstructs into audio. By predicting continuous features directly, dots.tts avoids the quality degradation caused by discrete quantization and preserves phonetic detail, voice similarity, and emotional expressiveness.

dots.tts was pre-trained on approximately 1.5 million hours of speech data. On the Seed-TTS-Eval benchmark, it achieved word error rates (WER) of 0.94% / 1.30% / 6.60% on the Chinese, English, and challenging Chinese test sets, respectively, with speaker similarity (SIM) scores of 81.0 / 77.1 / 79.5, all state-of-the-art among open-source models. On the MiniMax Multilingual benchmark, which covers 24 languages, the average speaker similarity reached 83.9. Xiaohongshu has also set up a Gradio demo space on Hugging Face where users can try zero-shot voice cloning online.
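For readers unfamiliar with the idea, the sketch below illustrates in toy form what "autoregressive flow matching" over continuous latents can look like: each new latent frame starts as Gaussian noise and is moved toward a value by integrating a velocity field, conditioned on the frames generated so far. This is not the dots.tts code. Every name here (`toy_condition`, `toy_velocity`, `sample_frame`, `generate`, `LATENT_DIM`) is hypothetical, the conditioning function merely stands in for a backbone language model, and the velocity field is a hand-written formula rather than a trained network.

```python
import random

LATENT_DIM = 4  # hypothetical size of one continuous latent frame


def toy_condition(history):
    """Stand-in for a backbone LM: summarise previous frames into a
    conditioning vector (here simply the per-dimension mean)."""
    if not history:
        return [0.0] * LATENT_DIM
    n = len(history)
    return [sum(frame[i] for frame in history) / n for i in range(LATENT_DIM)]


def toy_velocity(x, t, cond):
    """Stand-in for a learned velocity field v(x, t | cond).

    For straight-line paths from noise to a target, the ideal velocity is
    (target - x) / (1 - t). The target is a made-up function of the
    conditioning vector, not real acoustics.
    """
    target = [c + 1.0 for c in cond]
    return [(g - xi) / (1.0 - t) for g, xi in zip(target, x)]


def sample_frame(cond, steps=8, rng=random):
    """Euler-integrate dx/dt = v(x, t | cond) from t=0 (noise) to t=1."""
    x = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays below 1, so 1 - t never reaches zero
        v = toy_velocity(x, t, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def generate(num_frames, steps=8, seed=0):
    """Autoregressive loop: each frame is conditioned on all earlier ones."""
    rng = random.Random(seed)
    frames = []
    for _ in range(num_frames):
        cond = toy_condition(frames)
        frames.append(sample_frame(cond, steps, rng))
    return frames


if __name__ == "__main__":
    for frame in generate(3):
        print([round(v, 3) for v in frame])
```

Because the toy velocity field points at a single target, every frame lands on that target regardless of the starting noise. A trained model would instead learn the velocity field from data, so different noise samples would yield different plausible latents, and a decoder would turn the resulting sequence into a waveform. The point of the sketch is only that no step involves picking a token from a discrete codebook.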
