Xiaomi has released MiMo-V2.5-Pro-UltraSpeed, an accelerated inference version of its trillion-parameter flagship model. The company states that the new version achieves over 1,000 tokens per second on a standard server equipped with eight general-purpose GPUs, with peak performance nearing 1,200 tokens per second.
This update is not a new model; its focus is inference efficiency. Unlike solutions that rely on custom chips, Xiaomi emphasizes general-purpose hardware, with the speed gains coming from software and model-side optimizations. This could further lower the barrier to deploying large models at high speed.
Two technologies drive the speedup
Xiaomi relies mainly on two technologies. The first is FP4 quantization: the company compressed the expert layers, which hold the majority of the model's parameters, to 4-bit precision while keeping the remaining components at higher precision. This reduces memory usage and bandwidth pressure, which in turn improves inference speed.
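To give a rough sense of why 4-bit expert layers matter, the back-of-the-envelope sketch below estimates weight storage before and after quantization. Several inputs are assumptions rather than figures from Xiaomi: the parameter count is taken as exactly one trillion, the expert share is set to a hypothetical 95% (the article says only "the majority"), the higher-precision components are assumed to be 16-bit, and quantization scale factors and the KV cache are ignored.

```python
def weight_memory_gb(total_params, expert_fraction, expert_bits, other_bits):
    """Approximate weight storage in GB (10**9 bytes).

    Ignores quantization scale factors, activations, and the KV cache.
    """
    expert_params = total_params * expert_fraction
    other_params = total_params - expert_params
    total_bits = expert_params * expert_bits + other_params * other_bits
    return total_bits / 8 / 1e9


TOTAL_PARAMS = 1e12      # assumption: "trillion-parameter" taken as exactly 1e12
EXPERT_FRACTION = 0.95   # hypothetical: the article only says "the majority"
HIGH_PRECISION_BITS = 16 # assumption: the article only says "higher precision"

# Everything stored at the assumed 16-bit precision.
baseline_gb = weight_memory_gb(TOTAL_PARAMS, EXPERT_FRACTION,
                               HIGH_PRECISION_BITS, HIGH_PRECISION_BITS)
# Expert layers at 4 bits, everything else left at 16 bits.
mixed_gb = weight_memory_gb(TOTAL_PARAMS, EXPERT_FRACTION,
                            4, HIGH_PRECISION_BITS)

print(f"16-bit everywhere: {baseline_gb:.0f} GB")
print(f"FP4 experts:       {mixed_gb:.0f} GB")
```

Under these assumed inputs the weights shrink from about 2,000 GB to about 575 GB. The real figures depend on the model's actual expert share and precision choices, which the article does not specify.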
The second is DFlash speculative decoding. Traditional speculative decoding typically has a smaller model predict a few tokens first, which a larger model then verifies in parallel. DFlash instead proposes an entire block of tokens at once for the main model to verify. On coding tasks, the main model accepts an average of 6.3 out of 8 candidate tokens per round.
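The general idea of verifying a proposed block can be illustrated with a toy sketch. This is not DFlash's actual algorithm, which the article does not detail; it is a generic greedy prefix-acceptance scheme, and `verify_block` and `toy_target` are hypothetical names. A real system would score all positions in one parallel forward pass rather than in a loop.

```python
from typing import Callable, List


def verify_block(context: List[int],
                 draft_block: List[int],
                 target_next: Callable[[List[int]], int]) -> List[int]:
    """Return the tokens emitted in one round.

    Accepts the longest prefix of draft_block that matches the target
    model's own greedy choice, then appends one token from the target
    (a correction at the first mismatch, or a bonus token if all match).
    """
    emitted: List[int] = []
    for tok in draft_block:
        expected = target_next(context + emitted)
        if tok != expected:
            emitted.append(expected)  # correct the first mismatch and stop
            return emitted
        emitted.append(tok)           # draft token accepted
    emitted.append(target_next(context + emitted))  # whole block accepted
    return emitted


def toy_target(tokens: List[int]) -> int:
    """Hypothetical stand-in for the main model: always predicts last + 1."""
    return (tokens[-1] + 1) % 100


# The draft gets the first three tokens right, then diverges at position four.
partial = verify_block([0], [1, 2, 3, 9, 5, 6, 7, 8], toy_target)
# The draft gets all eight tokens right.
full = verify_block([0], [1, 2, 3, 4, 5, 6, 7, 8], toy_target)

print(partial)
print(len(full))
```

In this toy scheme, the more draft tokens survive verification, the fewer main-model rounds are needed per generated token. The reported average of 6.3 accepted tokens out of 8 corresponds to an acceptance rate of roughly 79%.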
Xiaomi and its inference partner TileRT have also optimized execution by keeping the computation pipeline continuously resident on the GPU, which reduces the overhead of launching operators one after another.
Comparison of mainstream model speeds
According to data cited by Artificial Analysis, the current output speeds of mainstream general-purpose models are generally below this level. The report notes that typical interaction speeds for the GPT series are around 68 tokens per second, Claude Opus 4.6 is approximately 71 tokens per second, and Gemini Flash is about 192 tokens per second.
The report also noted that companies like Cerebras and Groq have long focused on high-throughput inference and rely on their proprietary chip architectures to improve speed. In contrast, Xiaomi achieved this result on general-purpose GPU nodes, highlighting performance gains driven by software optimization.
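The gap between the figures quoted above can be made concrete with a little arithmetic. The sketch below uses only the speeds cited in this article and takes Xiaomi's "over 1,000 tokens per second" as a flat 1,000. The 2,000-token response length is a hypothetical value chosen for illustration, and the numbers were not necessarily measured under identical conditions, so the ratios are indicative only.

```python
ULTRASPEED_TPS = 1000  # "over 1,000 tokens per second", per Xiaomi

# Output speeds cited in the article, in tokens per second.
reported_tps = {
    "GPT series": 68,
    "Claude Opus 4.6": 71,
    "Gemini Flash": 192,
}


def seconds_for(tokens: int, tps: float) -> float:
    """Time to stream `tokens` output tokens at a steady `tps`."""
    return tokens / tps


# How many times faster the claimed UltraSpeed rate is than each cited figure.
ratios = {name: round(ULTRASPEED_TPS / tps, 1)
          for name, tps in reported_tps.items()}

RESPONSE_TOKENS = 2000  # hypothetical response length, for illustration only

for name, tps in reported_tps.items():
    print(f"{name}: {ratios[name]}x slower, "
          f"{seconds_for(RESPONSE_TOKENS, tps):.1f} s per response")
print(f"UltraSpeed: {seconds_for(RESPONSE_TOKENS, ULTRASPEED_TPS):.1f} s per response")
```

On these numbers, the claimed speed is roughly 5 to 15 times the cited figures, which turns a response that streams for about half a minute into one that arrives in about two seconds.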
Limited trial launch on June 9
Xiaomi stated that UltraSpeed accelerates the original MiMo-V2.5-Pro, not a simplified, lightweight variant. The model was previously described as performing close to Claude Opus on coding benchmarks.
The company plans to open a limited API trial from June 9 to June 23 on an application basis, with priority access granted to enterprise users and professional developers. Pricing for the UltraSpeed version is approximately three times the standard MiMo rate, but it offers a roughly tenfold increase in generation speed.
Xiaomi also stated that the model checkpoint using FP4 and DFlash has been open-sourced on Hugging Face for community testing.
