ME News reports that on April 18 (UTC+8), according to monitoring by Beating, xAI launched two standalone audio APIs: Grok Speech-to-Text (STT) and Grok Text-to-Speech (TTS). Both are built on the same audio stack that powers Grok Voice, Tesla's in-car system, and Starlink customer service, and are now available as separate endpoints that developers can integrate into voice assistants, real-time transcription, accessibility tools, podcasts, and more.

The STT API offers two modes: a REST API for batch transcription of large audio files with millisecond-level response times, and a WebSocket API designed for real-time audio streams. Additional capabilities include word-level timestamps, speaker diarization, multi-channel recognition, and Inverse Text Normalization, which automatically converts spoken numbers, dates, and currency amounts into standardized structured text. The STT API supports more than 25 languages and can switch between them mid-conversation.

xAI also released a set of Word Error Rate (WER) comparisons (lower is better). Overall, Grok achieves 6.9%, compared with 9.0% for ElevenLabs, 11.0% for Deepgram, and 12.9% for AssemblyAI. The gap widens in "phone call entity recognition," where Grok reaches 5.0%, versus 12.0%, 13.5%, and 21.3% for the other three providers. Grok also shows a slight advantage in three common use cases: meetings, video podcasts, and phone calls. These figures were published by xAI and have not yet been independently verified by third parties.

Pricing: STT costs $0.10 per hour for batch processing and $0.20 per hour for streaming. TTS costs $4.20 per 1 million characters. TTS supports inline Speech Tags that control emotion and prosody, for example `[laugh]`, `[sigh]`, `[whisper]`, and more. (Source: BlockBeats)
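For readers unfamiliar with the WER metric cited above, the following is a minimal sketch of the standard textbook definition: the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript. The function name `word_error_rate` and the lowercase-and-split normalization are assumptions made for illustration; the report does not describe xAI's scoring pipeline, which may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference word count.

    Illustrative only: normalization here is a simple lowercase + whitespace
    split, which is an assumption and not xAI's published methodology.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match (cost 0) or substitution (cost 1)
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a transcript that gets one word wrong and drops one word from a six-word reference scores 2/6, or roughly 33%. A reported 6.9% therefore corresponds to about seven word-level errors per hundred reference words.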
xAI Launches Grok STT and TTS APIs with a 6.9% Word Error Rate

On April 18 (UTC+8), xAI launched the Grok STT and TTS APIs, reporting a 6.9% word error rate, lower than the rates it measured for ElevenLabs, Deepgram, and AssemblyAI. The STT API supports batch transcription via REST and real-time transcription via WebSocket. TTS includes emotion and prosody controls. STT pricing is $0.10 per hour for batch and $0.20 per hour for streaming; TTS costs $4.20 per 1 million characters. The launch coincides with rising interest rate news and increased on-chain activity.
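To put the quoted prices in concrete terms, here is a small cost-estimation sketch. The three rates come from the report; everything else is an assumption made for illustration, including the function names `stt_cost` and `tts_cost`, strictly linear billing with no minimums, rounding, or volume tiers, and the question of whether Speech Tags count toward billed characters, which the report does not address.

```python
# Rates as quoted in the report (USD).
STT_BATCH_USD_PER_HOUR = 0.10
STT_STREAMING_USD_PER_HOUR = 0.20
TTS_USD_PER_MILLION_CHARS = 4.20


def stt_cost(audio_hours: float, streaming: bool = False) -> float:
    """Estimate STT cost, assuming billing is linear in audio duration."""
    if audio_hours < 0:
        raise ValueError("audio_hours must be non-negative")
    rate = STT_STREAMING_USD_PER_HOUR if streaming else STT_BATCH_USD_PER_HOUR
    return audio_hours * rate


def tts_cost(characters: int) -> float:
    """Estimate TTS cost, assuming billing is linear in character count."""
    if characters < 0:
        raise ValueError("characters must be non-negative")
    return characters / 1_000_000 * TTS_USD_PER_MILLION_CHARS
```

Under these assumptions, 100 hours of audio would cost about $10 in batch mode or about $20 when streamed, and synthesizing 250,000 characters would cost about $1.05.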
Disclaimer: The information on this page may have been obtained from third parties and does not necessarily reflect the views or opinions of KuCoin. This content is provided for general informational purposes only, without any representation or warranty of any kind, nor shall it be construed as financial or investment advice. KuCoin shall not be liable for any errors or omissions, or for any outcomes resulting from the use of this information.
Investments in digital assets can be risky. Please carefully evaluate the risks of a product and your risk tolerance based on your own financial circumstances. For more information, please refer to our Terms of Use and Risk Disclosure.