Xiaomi Launches Accelerated MiMo Version That Exceeds 1,000 Tokens per Second

CoinMarketCap reports:

Xiaomi has released MiMo-V2.5-Pro-UltraSpeed, an accelerated inference version of its trillion-parameter flagship model. The company states that the new version achieves over 1,000 tokens per second on a standard server equipped with eight general-purpose GPUs, with peak performance nearing 1,200 tokens per second.

The focus of this update is inference efficiency rather than a new model. Unlike solutions that rely on custom chips, Xiaomi uses general-purpose hardware and gets its speedup from software and model-side optimizations. This could further lower the barrier to deploying large models at high speed.

Two technologies driving faster speeds

Xiaomi relies mainly on two techniques. The first is FP4 quantization: the company compressed the expert layers, which hold the majority of the model's parameters, to 4-bit precision while keeping the remaining components at higher precision. This reduces memory usage and memory-bandwidth pressure, which in turn improves inference speed.
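As a rough illustration of what 4-bit weight quantization involves, the Python sketch below rounds a small block of weights onto a 4-bit floating-point grid with one shared scale per block. The E2M1 value grid, the per-block scaling scheme, and the names quantize_block_fp4, dequantize_block, and weight_bytes are assumptions made for illustration; the article does not describe the exact FP4 format or Xiaomi's implementation.

```python
# Magnitudes representable by an E2M1 4-bit float (1 sign, 2 exponent,
# 1 mantissa bit). ASSUMPTION: the article does not say which FP4 variant
# is used; this grid is only one common choice.
FP4_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block_fp4(weights):
    """Return (scale, codes): one shared scale plus one FP4 grid value per weight."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    if max_abs == 0.0:
        return 1.0, [0.0] * len(weights)
    # Map the largest magnitude in the block onto the largest grid value.
    scale = max_abs / FP4_MAGNITUDES[-1]
    codes = []
    for w in weights:
        target = abs(w) / scale
        nearest = min(FP4_MAGNITUDES, key=lambda m: abs(m - target))
        codes.append(nearest if w >= 0 else -nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate weights from the shared scale and the FP4 codes."""
    return [scale * c for c in codes]


def weight_bytes(n_params, bits_per_param):
    """Storage needed for n_params weights at the given precision (scales ignored)."""
    return n_params * bits_per_param / 8


scale, codes = quantize_block_fp4([0.6, -0.3, 0.1, 0.0])
restored = dequantize_block(scale, codes)
```

At 4 bits per weight, a layer needs a quarter of the storage of the same layer held in 16-bit precision, which is where the memory and bandwidth savings described above come from.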

The second is DFlash speculative decoding. Traditional speculative decoding typically has a smaller model predict a few tokens first, which a larger model then verifies in parallel. DFlash instead proposes an entire block of tokens at once, which the main model then verifies. On coding tasks, the main model accepts an average of 6.3 out of 8 candidate tokens per round.
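The propose-then-verify loop can be sketched in a few lines of Python. This is a generic greedy verification step, not DFlash itself: the function names verify_block and toy_target are hypothetical, the toy target model is a stand-in, and a real system would score all candidate positions in a single parallel forward pass rather than in a Python loop.

```python
def toy_target(context):
    """Stand-in for the main model: always continues with (last token + 1) mod 10."""
    return (context[-1] + 1) % 10


def verify_block(draft_tokens, target_next_token, prefix):
    """Keep draft tokens while they match the target model's own choice.

    At the first mismatch, substitute the target's token and stop, so the
    output is always identical to what the target alone would have produced.
    """
    accepted = []
    context = list(prefix)
    for tok in draft_tokens:
        expected = target_next_token(context)
        if tok != expected:
            accepted.append(expected)  # correction from the main model
            break
        accepted.append(tok)
        context.append(tok)
    return accepted


# The draft gets the first three tokens right and the fourth wrong.
result = verify_block([1, 2, 3, 9, 5], toy_target, prefix=[0])
```

The more draft tokens survive verification, the fewer passes of the large model are needed per generated token. The reported 6.3 accepted tokens out of 8 corresponds to roughly 79% of each proposed block being kept.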

Xiaomi and its inference partner TileRT have also optimized the execution process by keeping the computation pipeline continuously resident within the GPU to reduce the overhead caused by sequential operator launches.
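A back-of-the-envelope calculation suggests why launch overhead matters at this speed. The Python sketch below combines two figures from the article (1,000 tokens per second and 6.3 accepted tokens per round) with two purely hypothetical ones (1,000 kernel launches per round at 5 microseconds each). It assumes launches are paid one after another with no overlap, so it is an upper bound and not a measurement of any real system; the function names are likewise illustrative.

```python
def round_budget_us(tokens_per_second, tokens_per_round):
    """Microseconds available per verification round at a target output speed."""
    return tokens_per_round / tokens_per_second * 1_000_000


def launch_overhead_us(kernels_per_round, launch_cost_us):
    """Total launch cost if every kernel launch is paid sequentially."""
    return kernels_per_round * launch_cost_us


# Figures from the article. Note that 6.3 tokens per round was reported for
# coding tasks, so pairing it with the headline speed is itself an approximation.
budget = round_budget_us(tokens_per_second=1000, tokens_per_round=6.3)

# HYPOTHETICAL figures, chosen only to show the order of magnitude.
overhead = launch_overhead_us(kernels_per_round=1000, launch_cost_us=5.0)

# Fraction of each round's time budget that launches alone would consume.
share = overhead / budget
```

Under these assumed numbers, launches alone would use most of the time available per round, which is the kind of overhead a pipeline that stays resident on the GPU is meant to avoid.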

Comparison of mainstream model speeds

According to Artificial Analysis data cited in the report, mainstream general-purpose models currently generate output well below this level. Typical interactive speeds are around 68 tokens per second for the GPT series, approximately 71 tokens per second for Claude Opus 4.6, and about 192 tokens per second for Gemini Flash.
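To put those figures side by side, the short Python snippet below divides Xiaomi's claimed 1,000 tokens per second by each reported speed. It uses only numbers quoted in this article; the variable and function names are illustrative, and the ratios compare a vendor claim with third-party measurements of different models, so they indicate scale and are not a controlled benchmark.

```python
# Output speeds quoted in the article, in tokens per second.
REPORTED_TOKENS_PER_SECOND = {
    "GPT series": 68,
    "Claude Opus 4.6": 71,
    "Gemini Flash": 192,
}
ULTRASPEED_TOKENS_PER_SECOND = 1000  # Xiaomi's stated lower bound


def speed_ratios(reference, others):
    """How many times faster `reference` is than each entry in `others`."""
    return {name: round(reference / tps, 1) for name, tps in others.items()}


ratios = speed_ratios(ULTRASPEED_TOKENS_PER_SECOND, REPORTED_TOKENS_PER_SECOND)
```

On these numbers the claimed speed is roughly 14 to 15 times the GPT and Claude Opus figures and about 5 times the Gemini Flash figure.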

The report also noted that companies like Cerebras and Groq have long focused on high-throughput inference and rely on their proprietary chip architectures to improve speed. In contrast, Xiaomi achieved this result on general-purpose GPU nodes, highlighting performance gains driven by software optimization.

Limited trial launch on June 9

Xiaomi stated that UltraSpeed accelerates the original MiMo-V2.5-Pro, not a simplified lightweight version. The base model was previously described as performing close to Claude Opus on coding benchmarks.

The company plans to open a limited API trial from June 9 to June 23, with access granted by application and priority given to enterprise users and professional developers. The UltraSpeed version is priced at approximately three times the standard MiMo rate and offers a roughly tenfold increase in generation speed.
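The trade-off between the higher price and the faster output can be worked through with a small Python sketch. The 3x and 10x multipliers come from the article; the base price, base speed, and job size are hypothetical placeholders, since the article gives no absolute figures, and the function name compare_job is illustrative.

```python
def compare_job(tokens, base_price_per_token, base_tokens_per_second,
                price_multiplier=3.0, speed_multiplier=10.0):
    """Cost and generation time for one job on the standard and accelerated tiers."""
    standard_cost = tokens * base_price_per_token
    standard_seconds = tokens / base_tokens_per_second
    return {
        "standard_cost": standard_cost,
        "standard_seconds": standard_seconds,
        "ultraspeed_cost": standard_cost * price_multiplier,
        "ultraspeed_seconds": tokens / (base_tokens_per_second * speed_multiplier),
    }


# HYPOTHETICAL inputs: 1,000 output tokens, a base price of 1 unit per token,
# and a base speed of 100 tokens per second.
job = compare_job(tokens=1000, base_price_per_token=1.0, base_tokens_per_second=100)
```

With these placeholder inputs the same job costs three times as much but finishes in one tenth of the time, so the accelerated tier makes sense mainly where latency matters more than cost per token.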

Additional information: Xiaomi stated that the model checkpoints using FP4 and DFlash have been open-sourced on Hugging Face for community testing.
