Xiaohongshu open-sources the 2B-parameter TTS model dots.tts with zero-shot voice cloning

According to Beating Monitor, Xiaohongshu's Hi Lab has open-sourced dots.tts, a 2-billion-parameter end-to-end autoregressive text-to-speech (TTS) model, and released the complete inference and fine-tuning code under the Apache 2.0 license. The released weights include a base pre-trained version, a version fine-tuned with Self-Correction Alignment (SCA), and a distilled version for low-latency inference.

Unlike TTS architectures that rely on discrete audio codec tokens (such as VALL-E, CosyVoice, and ChatTTS), dots.tts uses a fully continuous, end-to-end autoregressive flow matching architecture with no discrete tokens. The model combines continuous features extracted by a 48 kHz AudioVAE with a semantic encoder, a backbone language model (initialized from Qwen2.5-1.5B-Base, processing BPE text directly without requiring pinyin input), and an autoregressive flow matching acoustic head. The head predicts continuous latent variables, which a generator then reconstructs into audio. By predicting continuous features directly, dots.tts avoids the quality degradation caused by discrete quantization, preserving phonetic detail, voice similarity, and emotional expressiveness.

dots.tts was pre-trained on approximately 1.5 million hours of speech data. On the Seed-TTS-Eval benchmark, it achieved word error rates (WER) of 0.94% / 1.30% / 6.60% on the Chinese, English, and challenging Chinese test sets, respectively, with speaker similarity scores (SIM) of 81.0 / 77.1 / 79.5, all state-of-the-art among open-source models. On the MiniMax Multilingual benchmark, which covers 24 languages, the average speaker similarity reached 83.9. Xiaohongshu has also provided a Gradio demo space on Hugging Face where users can test zero-shot voice cloning online.
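
For readers unfamiliar with the WER figures quoted above, the metric is the number of word substitutions, deletions, and insertions needed to turn a transcript of the synthesized audio into the reference text, divided by the number of reference words. The following is a minimal sketch of that formula only. The function name `word_error_rate` is hypothetical, the whitespace tokenization is an assumption (Chinese text is usually scored per character instead, which this sketch does not handle), and it does not reproduce the benchmark's own transcription or text-normalization steps.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words.

    Illustrative helper (hypothetical name); assumes whitespace-separated words.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words (word-level Levenshtein distance).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to reach an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match (cost 0) or substitution (cost 1)
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For example, comparing the reference "the cat sat on the mat" with the transcript "the cat sat on a mat" gives one substitution over six reference words, a WER of about 16.7%. A reported WER of 0.94% therefore corresponds to fewer than one error per hundred reference units.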