ChainThink reports that, according to monitoring by Beating, Google released its next-generation text-to-speech model, Gemini 3.1 Flash TTS, on April 16. The model's key feature is that it lets developers precisely control the style, speed, and emotional expression of AI-generated speech. It is now available via the Gemini API, Google AI Studio (developer preview), Vertex AI (enterprise preview), and Google Vids (for Workspace users).
The model's core control capabilities rely on "audio tags": natural language instructions that developers embed in the input text to adjust the AI voice's tone, rhythm, and accent, and even to switch expression styles mid-sentence. In Google AI Studio, Google provides a "director's chair"-style configuration interface with three levels of control: scene guidance, character-level parameter tuning, and one-click export.
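As a rough illustration of what embedding style instructions in the input text could look like, the Python sketch below assembles one string in which each passage is preceded by a bracketed instruction. The bracket syntax, the `Segment` class, and the `render_tagged_text` helper are assumptions made for this example and are not confirmed by the report; the actual tag format is defined in Google's documentation. The sketch only builds the text and does not call the Gemini API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    """One passage of the script plus an optional style instruction.

    Hypothetical helper type, not part of any Google SDK.
    """
    text: str
    style: Optional[str] = None


def render_tagged_text(segments: List[Segment]) -> str:
    """Join segments into one input string, prefixing styled ones with a tag.

    The "[style] text" bracket format is an assumed placeholder syntax.
    """
    parts = []
    for seg in segments:
        if seg.style:
            parts.append(f"[{seg.style}] {seg.text}")
        else:
            parts.append(seg.text)
    return " ".join(parts)


# The style changes between the two sentences of the same script.
script = render_tagged_text([
    Segment("Welcome back to the show.", style="cheerful"),
    Segment("But tonight's story is a serious one.", style="slow, somber"),
])
print(script)
```

Keeping the style next to each passage, rather than as one global setting, mirrors the mid-sentence style switching described above: the same structure can be rendered again for another product with only the text changed.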
On the TTS leaderboard maintained by third-party evaluator Artificial Analysis, Gemini 3.1 Flash TTS ranks first with an Elo score of 1,211 and also appears in the “Most Appealing Quadrant.” The model supports more than 70 languages and native multi-character dialogue, and all generated audio carries an embedded SynthID watermark so that it can be identified as AI-generated. For developers, the model turns TTS from a simple text-to-speech tool into a programmable voice performance engine, making it possible to reuse a consistent voice style across product lines.
