StepAudio 2.5 TTS Launches with Fine-Grained Emotional Control

StepAudio 2.5 TTS launched on April 16 (UTC+8), offering fine-grained emotional control through natural-language descriptions. Users can specify a tone such as "restrained sadness, no sobbing, slight trembling." The system supports global context control, in-text control, and zero-shot voice cloning. On the same day, Google released Gemini 3.1 Flash TTS, which also uses natural-language instructions for fine-grained speech control.

ME News reports that on April 16 (UTC+8), according to monitoring by Beating, StepFun (Jieyue Xingchen) officially launched StepAudio 2.5 TTS. Traditional TTS systems rely on predefined emotion tags; this version instead accepts natural-language descriptions to control how speech is delivered. Where a tag can only express “sadness,” a natural-language description can specify “restrained sadness, no sobbing, slight trembling,” and the model synthesizes the corresponding vocal tone.

Control is structured in three layers. Global context control sets the overall emotional tone and atmosphere of the entire speech, keeping a character's expression consistent across multi-turn dialogues. In-text context control adjusts tone, rhythm, pauses, emphasis, and breathiness at the sentence level, and can reflect a character’s psychological state and subtext. Zero-shot voice cloning (zero-shot TTS) requires no retraining: any reference recording can be used to replicate a voice, with emotion and style adjustable independently. All three features are now fully available on StepFun’s open platform and Step Plan.

On the same day, Google released Gemini 3.1 Flash TTS, which similarly replaces SSML tags with natural-language instructions for fine-grained control and reportedly topped third-party evaluations. Two companies releasing the same approach on the same day points to an industry-wide shift in how TTS is controlled, from “selecting tags” to “stating requests.” For audio content creators and voice directors, emotional adjustments that once took multiple recording iterations can now be specified in a single sentence describing the desired vocal qualities. (Source: BlockBeats)
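To make the three control layers concrete, here is a minimal Python sketch of how a request combining them might be organized. It is an illustration only: the class names, field names (`global_context`, `segments`, `direction`, `reference_audio`), and payload shape are hypothetical and are not taken from StepFun's or Google's actual APIs, which the report does not describe.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Segment:
    """One sentence plus an optional sentence-level direction (hypothetical)."""
    text: str
    direction: str = ""  # e.g. "restrained sadness, no sobbing, slight trembling"


@dataclass
class SpeechRequest:
    """Hypothetical container for the three control layers in the article."""
    global_context: str                       # layer 1: overall tone and setting
    segments: List[Segment] = field(default_factory=list)  # layer 2: in-text control
    reference_audio: Optional[str] = None     # layer 3: recording to clone (path)

    def to_payload(self) -> dict:
        """Build a plain dict; the key names are assumptions, not a real API schema."""
        payload = {
            "global_context": self.global_context,
            "segments": [
                {"text": s.text, "direction": s.direction} for s in self.segments
            ],
        }
        # Only include the cloning reference when one was supplied.
        if self.reference_audio is not None:
            payload["reference_audio"] = self.reference_audio
        return payload


request = SpeechRequest(
    global_context="A quiet farewell at a train station; the speaker stays composed.",
    segments=[
        Segment("I suppose this is goodbye.",
                "restrained sadness, no sobbing, slight trembling"),
        Segment("Take care of yourself.", "softer, short pause before speaking"),
    ],
    reference_audio="narrator_sample.wav",  # hypothetical file name
)
payload = request.to_payload()
```

Keeping the voice reference separate from the emotional directions mirrors the article's point that, with zero-shot cloning, emotion and style are adjustable independently of the cloned voice.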
