StepFun's StepAudio 2.5 Realtime Tops Voice AI Benchmarks in April 2026

A Shanghai-based AI lab just quietly embarrassed some of the biggest names in tech. StepFun’s StepAudio 2.5 Realtime, released around May 24, swept all five major voice AI benchmarks from April 2026 testing, beating out both GPT Realtime 1.5 and Gemini Live in the process.

The model doesn’t just understand what you say. It understands how you say it, interpreting tone, emotion, and speech rate in ways that make most competing voice assistants sound like they’re reading a script in a monotone.

The numbers behind the noise

StepAudio 2.5 Realtime posted top scores across every benchmark category tested. In human evaluation, it scored 80.41. General dialogue performance hit 86.36. Automotive scenario testing, which measures how well the model handles voice interaction in driving contexts, landed at 84.80.

The spoken question-and-answer benchmark, spanning 11 separate tasks, came in at 79.80. And the paralinguistic comprehension score, arguably the most interesting metric here, reached 82.18.

For context, the model’s predecessor, StepAudio 2, had already turned heads with an MMAU benchmark score of 77.4%. The jump to 2.5 Realtime represents a meaningful leap, not just an incremental version bump dressed up in marketing language.

How it actually works

The architecture is what sets this apart from the pack. StepAudio 2.5 Realtime uses a unified audio-in, audio-out design that combines three core capabilities into a single framework: Automatic Speech Recognition (ASR), Text-to-Speech (TTS), and real-time dialogue processing.

Think of it like this: most voice AI systems work in stages. They transcribe your speech to text, process the text, generate a response in text, then convert that back to audio. Each handoff introduces latency and loses nuance. StepFun’s approach collapses those steps into one cohesive system.
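A toy sketch of that contrast, in Python, may make the point concrete. Everything here is hypothetical and for illustration only: the `Audio` type, the stage functions, and the `unified` stand-in are assumptions, not StepFun's architecture or API. The point is simply that once speech is reduced to a transcript, the tone is no longer available to later stages.

```python
from dataclasses import dataclass


@dataclass
class Audio:
    words: str  # what was said
    tone: str   # how it was said, e.g. "frustrated"


def asr(audio: Audio) -> str:
    # Transcription keeps the words and drops everything else.
    return audio.words


def text_model(transcript: str) -> str:
    # Stand-in for a text-only language model; it never sees the tone.
    return f"reply to: {transcript}"


def tts(reply: str) -> Audio:
    # Synthesis has no tone information to react to, so it uses a default.
    return Audio(words=reply, tone="neutral")


def cascaded(audio: Audio) -> Audio:
    # Three handoffs: audio -> text -> text -> audio.
    return tts(text_model(asr(audio)))


def unified(audio: Audio) -> Audio:
    # Toy stand-in for an audio-in, audio-out model: the tone stays
    # available from input to output, so the reply can adapt to it.
    style = "calm" if audio.tone == "frustrated" else "neutral"
    return Audio(words=f"reply to: {audio.words}", tone=style)


caller = Audio(words="where is my order", tone="frustrated")
staged_reply = cascaded(caller)
direct_reply = unified(caller)
```

In this sketch both paths return the same words, but only the single-stage path can choose a speaking style based on how the caller sounded.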

The secret sauce is what StepFun calls persona-specific Reinforcement Learning from Human Feedback, or RLHF. Standard RLHF trains a model to give better responses based on human preferences. StepFun’s version goes further by tailoring that feedback loop to specific personas, which means the model can maintain consistent character traits during extended roleplay or customer service scenarios.

The model currently supports both Chinese and English. It connects over a WebSocket API under the model string ‘step-2.5-realtime’ and is accessible through StepFun’s platform API and a dedicated realtime console. A technical report detailing the architecture was published on arXiv under identifier 2605.23463.
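For readers planning an integration, a minimal Python sketch of the connection settings might look like the following. Only the model string comes from the announcement. The placeholder endpoint, the idea of passing the model as a query parameter, and bearer-token authentication are all assumptions made for illustration, so check StepFun's API documentation for the real values before using any of it.

```python
from urllib.parse import urlencode

MODEL = "step-2.5-realtime"  # model string reported in the article


def realtime_url(base_url: str, model: str = MODEL) -> str:
    """Build a WebSocket URL with the model as a query parameter.

    Passing the model this way is an assumed convention, not a
    documented StepFun requirement.
    """
    if not base_url.startswith(("ws://", "wss://")):
        raise ValueError("expected a ws:// or wss:// URL")
    return f"{base_url}?{urlencode({'model': model})}"


def auth_headers(api_key: str) -> dict[str, str]:
    # Bearer-token authentication is an assumption for this sketch.
    return {"Authorization": f"Bearer {api_key}"}


# Placeholder host; the real endpoint is not given in the article.
url = realtime_url("wss://example.invalid/realtime")
headers = auth_headers("YOUR_API_KEY")
```

The resulting `url` and `headers` can then be handed to whichever WebSocket client library the project already uses.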

Why paralinguistic comprehension matters

StepAudio 2.5 Realtime’s 82.18 score in paralinguistic comprehension suggests StepFun has made real progress on reading how something is said, not just what is said. A voice assistant that can detect frustration in a caller’s tone and escalate to a human agent, or slow down its speech when it senses confusion, is a fundamentally different product from one that just processes words accurately.
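How an application might act on such signals can be sketched in a few lines of Python. This is a hypothetical policy layer, not anything StepFun has published: the `ToneSignals` fields, the 0-to-1 scoring scale, and the default threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"
    SLOW_DOWN = "slow_down"
    ESCALATE = "escalate_to_human"


@dataclass
class ToneSignals:
    # Hypothetical scores between 0.0 and 1.0 derived from the caller's voice.
    frustration: float
    confusion: float


def choose_action(signals: ToneSignals, threshold: float = 0.7) -> Action:
    # Frustration takes priority: hand the call to a human agent.
    if signals.frustration >= threshold:
        return Action.ESCALATE
    # A confused caller gets slower speech instead of a handoff.
    if signals.confusion >= threshold:
        return Action.SLOW_DOWN
    return Action.CONTINUE


angry_caller = choose_action(ToneSignals(frustration=0.9, confusion=0.2))
lost_caller = choose_action(ToneSignals(frustration=0.1, confusion=0.8))
calm_caller = choose_action(ToneSignals(frustration=0.1, confusion=0.1))
```

Checking frustration first is a design choice in this sketch: a caller who is both frustrated and confused is routed to a person instead of being given a slower automated reply.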

The automotive scenario benchmark score of 84.80 hints at another lucrative application. In-car voice assistants need to handle noisy environments, interpret commands quickly, and ideally understand when a driver sounds stressed versus relaxed.