StepAudio 2.5 TTS Launches with Fine-Grained Emotional Control

ME News reports that on April 16 (UTC+8), according to monitoring by Beating, StepFun (Jieyue Xingchen) officially launched StepAudio 2.5 TTS. Where traditional TTS systems rely on predefined emotion tags, this version lets users steer delivery with natural-language descriptions. A tag can only say “sadness”; a natural-language prompt can specify “restrained sadness, no sobbing, slight trembling,” and the model synthesizes a matching vocal tone.

Control is organized in three layers. Global context control sets the overall emotional tone and setting for the whole utterance and keeps a character’s delivery consistent across multi-turn dialogue. In-text context control adjusts tone, rhythm, pauses, emphasis, and breathiness at the sentence level, and can convey a character’s psychological state and subtext. Zero-shot voice cloning (Zeroshot TTS) needs no retraining: any reference recording can be used to replicate a voice, and emotion and style can then be adjusted independently of it. All three features are now available on StepFun’s open platform and in Step Plan.

On the same day, Google released Gemini 3.1 Flash TTS, which likewise replaces SSML tags with natural-language instructions for fine-grained control and, according to the report, topped third-party evaluations. Two companies shipping the same approach on the same day suggests an industry-wide shift in how TTS is controlled, from “selecting tags” to “stating requests.” For audio content creators and voice directors, an emotional adjustment that once took several recording passes can now be specified in a single sentence describing the desired vocal qualities. (Source: BlockBeats)
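
To make the three control layers concrete, the following is a minimal sketch of how such a request might be assembled. It is not the StepAudio 2.5 TTS API: the class names, the field names (`global_context`, `segments`, `direction`, `reference_audio`), and the file name `reference.wav` are all assumptions made for illustration, and no network call is made.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Segment:
    """One sentence plus an optional natural-language direction (hypothetical shape)."""
    text: str
    direction: str = ""  # in-text control: tone, rhythm, pauses, emphasis


@dataclass
class SpeechRequest:
    """Hypothetical request combining the three layers described in the article."""
    segments: List[Segment] = field(default_factory=list)
    global_context: str = ""               # global control: overall mood and setting
    reference_audio: Optional[str] = None  # zero-shot cloning: path to a reference clip

    def to_payload(self) -> dict:
        # Build a plain dict; the key names are illustrative, not a real schema.
        payload = {
            "global_context": self.global_context,
            "segments": [
                {"text": s.text, "direction": s.direction} for s in self.segments
            ],
        }
        # Only include the cloning reference when one was supplied.
        if self.reference_audio is not None:
            payload["reference_audio"] = self.reference_audio
        return payload


request = SpeechRequest(
    global_context="A late-night phone call; the speaker is exhausted but calm.",
    segments=[
        Segment("I got your message.",
                "restrained sadness, no sobbing, slight trembling"),
        Segment("We can talk tomorrow.",
                "softer, with a short pause before 'tomorrow'"),
    ],
    reference_audio="reference.wav",
)
payload = request.to_payload()
print(json.dumps(payload, indent=2))
```

The point of the sketch is the contrast with tag-based control: instead of choosing one value from a fixed list such as “sad,” each layer carries a free-text description, and the voice reference is kept separate from the emotional direction.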