xAI Launches Grok STT and TTS APIs with a 6.9% Word Error Rate

ME News reported on April 18 (UTC+8) that, according to monitoring by Beating, xAI has launched two independent audio APIs: Grok Speech-to-Text (STT) and Grok Text-to-Speech (TTS). Both are built on the same audio stack that powers Grok Voice, Tesla’s in-car system, and Starlink customer service, and both are now available as standalone endpoints that developers can integrate into voice assistants, real-time transcription, accessibility tools, podcasts, and more.

The STT API offers two modes: a REST API for batch transcription of large audio files with millisecond-level response times, and a WebSocket API designed for real-time audio streams. Additional capabilities include word-level timestamps, speaker diarization, multi-channel recognition, and Inverse Text Normalization, which automatically converts spoken numbers, dates, and currency amounts into standardized written form. The API supports more than 25 languages and can handle switching between languages within a single conversation.

xAI also released a set of Word Error Rate (WER) comparisons, where lower is better. Overall, Grok achieves 6.9%, compared with 9.0% for ElevenLabs, 11.0% for Deepgram, and 12.9% for AssemblyAI. The gap widens in “phone call entity recognition,” where Grok reaches 5.0%, versus 12.0%, 13.5%, and 21.3% for the other three providers, listed in the same order. Grok also shows slight advantages in three common use cases: meetings, video podcasts, and phone calls. These figures were published by xAI and have not yet been independently verified by third parties.

Pricing: STT batch processing costs $0.10 per hour of audio, and streaming costs $0.20 per hour. TTS is priced at $4.20 per 1 million characters. TTS also supports inline Speech Tags that control emotion and prosody, for example `[laugh]`, `[sigh]`, `[whisper]`, and more. (Source: BlockBeats)
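For readers who want to make the numbers above concrete, here is a minimal Python sketch of two pieces of arithmetic: how a Word Error Rate is conventionally computed (word-level edit distance divided by the number of reference words) and what the quoted prices imply for a given workload. It does not call any xAI API. The function names, the lowercase-and-split tokenization, and the sample sentences are illustrative assumptions; xAI has not published how its benchmark normalizes text, so results from this sketch are not directly comparable to the reported figures.

```python
# Prices as quoted in the report (USD). Everything else here is illustrative.
STT_BATCH_PER_HOUR = 0.10
STT_STREAM_PER_HOUR = 0.20
TTS_PER_MILLION_CHARS = 4.20


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping one row of the table at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def stt_cost(audio_hours: float, streaming: bool = False) -> float:
    """Estimated STT cost in USD at the quoted per-hour rates."""
    rate = STT_STREAM_PER_HOUR if streaming else STT_BATCH_PER_HOUR
    return audio_hours * rate


def tts_cost(characters: int) -> float:
    """Estimated TTS cost in USD at the quoted per-million-character rate."""
    return characters / 1_000_000 * TTS_PER_MILLION_CHARS


if __name__ == "__main__":
    reference = "please transfer fifty dollars to my savings account"
    hypothesis = "please transfer fifteen dollars to savings account"
    # One substitution and one deletion over 8 reference words.
    print(word_error_rate(reference, hypothesis))  # 0.25
    print(round(stt_cost(100), 2))                  # 100 hours, batch mode
    print(round(stt_cost(100, streaming=True), 2))  # 100 hours, streaming mode
    print(round(tts_cost(2_500_000), 2))            # 2.5 million characters
```

By this definition, a 6.9% WER corresponds to roughly 7 word-level errors per 100 reference words. Benchmark results depend heavily on the test set and on text normalization (casing, punctuation, number formatting), which is one reason vendor-published comparisons are usually treated as provisional until reproduced independently.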