ME News reports that on April 21 (UTC+8), according to monitoring by BlockBeats, customer service AI company Sierra has open-sourced μ-Bench, a multilingual automatic speech recognition (ASR) benchmark dataset. The data consists of 250 real customer service phone recordings and 4,270 manually annotated audio clips, sampled at 8 kHz in mono.

Previously available ASR benchmarks either focused solely on English or used studio-recorded read speech, making them nearly irrelevant for teams deploying voice agents in multilingual customer service environments. μ-Bench fills this gap with real-world call data. The release is a subset of Sierra's full internal benchmark suite, which covers 42 languages, 79 regional variants, and more than 13 vendors. The open-sourced portion includes five languages/regions (English, Spanish, Turkish, Vietnamese, and Mandarin) along with performance results from five vendors: Deepgram Nova-3, Google Chirp-3, Microsoft Azure Speech, ElevenLabs Scribe v2, and OpenAI GPT-4o Mini Transcribe. The code, the dataset (hosted on Hugging Face), and an open leaderboard are now publicly available, and other vendors are invited to submit results.

The most novel aspect of the evaluation lies in its metrics. Sierra introduces UER (Utterance Error Rate), a metric that distinguishes errors that alter meaning from those that are inconsequential. Traditional WER (Word Error Rate) treats missing a filler word like "uh" the same as mishearing a phone number, but for a voice agent executing actions based on the transcript, only the latter causes operational failures. Sierra notes that two vendors with similar WER scores can have vastly different UER scores because they make fundamentally different kinds of errors.

In terms of results, Google Chirp-3 leads in accuracy but has slower inference speed, while Deepgram Nova-3 achieves roughly 8x faster p50 latency but ranks lowest in multilingual accuracy. Mandarin recognition error rates can reach five times those of English, and Vietnamese performance varies significantly across vendors; such differences are invisible when evaluating on English benchmarks alone. (Source: BlockBeats)
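The article does not publish Sierra's exact UER formula, but the WER-versus-UER distinction it describes can be sketched in a few lines. In the toy example below, `content_wer` is a hypothetical, simplified stand-in for a meaning-aware metric: it ignores filler-word mistakes but still penalizes errors on content words such as digits. All function names here are illustrative assumptions, not Sierra's implementation.

```python
def levenshtein(ref, hyp):
    """Word-level edit distance (substitutions + insertions + deletions)."""
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + cost)  # substitution/match
    return dp[m][n]

def wer(ref, hyp):
    """Classic word error rate: every token mismatch counts equally."""
    r, h = ref.lower().split(), hyp.lower().split()
    return levenshtein(r, h) / max(len(r), 1)

FILLERS = {"uh", "um", "er", "hmm"}

def content_wer(ref, hyp):
    """Toy meaning-aware rate (NOT Sierra's UER): strip filler words
    before scoring, so dropping 'uh' is free but a wrong digit counts."""
    strip = lambda s: [w for w in s.lower().split() if w not in FILLERS]
    r, h = strip(ref), strip(hyp)
    return levenshtein(r, h) / max(len(r), 1)

ref = "uh my number is five five five one two three four"
hyp_a = "my number is five five five one two three four"    # harmless: lost filler
hyp_b = "uh my number is five five five one two three nine"  # harmful: wrong digit

print(wer(ref, hyp_a), wer(ref, hyp_b))                  # identical WER
print(content_wer(ref, hyp_a), content_wer(ref, hyp_b))  # only hyp_b penalized
```

Both hypotheses make exactly one word error, so plain WER scores them the same, yet only the misheard digit would derail an agent reading back a phone number. This is the gap the article says UER is designed to expose.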
Sierra open-sources μ-Bench for multilingual ASR evaluation
Sierra, a customer service AI company, has open-sourced μ-Bench, a multilingual ASR benchmark featuring 250 real call recordings and 4,270 annotated samples. The dataset uses 8 kHz audio and introduces UER, a metric for tracking meaning-altering errors. Results show Mandarin error rates up to five times higher than English.
Disclaimer: The information on this page may have been obtained from third parties and does not necessarily reflect the views or opinions of KuCoin. This content is provided for general informational purposes only, without any representation or warranty of any kind, nor shall it be construed as financial or investment advice. KuCoin shall not be liable for any errors or omissions, or for any outcomes resulting from the use of this information.
Investments in digital assets can be risky. Please carefully evaluate the risks of a product and your risk tolerance based on your own financial circumstances. For more information, please refer to our Terms of Use and Risk Disclosure.