ME News reports that on May 27 (UTC+8), according to monitoring by Beating, Luo Fuli, head of Xiaomi’s large model team, disclosed on X the algorithmic cost-reduction mechanisms behind the permanent price cut for the APIs of Xiaomi’s proprietary MiMo-V2.5 large model series. Luo revealed that even after aligning API pricing with DeepSeek, Xiaomi’s inference engine still breaks even under high load. The cost reductions stem primarily from a hybrid attention architecture and hierarchical KV cache optimizations.

To achieve the 99% reduction in cache hit costs, Xiaomi’s inference framework implemented hierarchical KV cache optimizations tailored for Sliding Window Attention (SWA). In production testing, this hierarchical optimization increased token cache capacity 5x and reduced cache costs by 80%. Cache Read Overlap, a technique that overlaps cache reads between global attention modules, further lowered the actual overhead of cache hits.

Luo attributed the 60% to 80% reduction in base input and output costs to the model’s 1:7 inter-layer sparsity ratio, described as the proportion of Global Attention (GA) layers to Sliding Window Attention (SWA) layers. (Going by the layer counts given, 60 of the 70 layers use SWA, so the remaining 10 GA layers make up one in seven of the total.) During long-text prefilling, the 60 SWA layers compute only local sliding windows, so the 70-layer MiMo-V2.5-Pro model performs attention computation equivalent to that of a traditional global GQA model with only 10 layers. This very low computational load significantly reduced the original inference cost, which gave Xiaomi a 2x to 3x profit margin before the price adjustment. The price cut therefore reflects structural cost reduction, not loss-leading competition.

Luo emphasized that low-cost inference services help stimulate demand for end-user intelligence. Large model companies should avoid blind price wars and instead bring operating costs below the break-even point through coordinated, bottom-up design of algorithms and inference systems. (Source: BlockBeats)
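The "10-layer equivalent" claim above can be sanity-checked with a back-of-the-envelope count of the query-key pairs scored during causal prefill. The sketch below is illustrative only: the layer counts (10 GA, 60 SWA) come from the report, while the window size of 128 tokens, the 131,072-token prompt, and the function names are assumptions made for this example, not disclosed MiMo-V2.5-Pro settings. It counts attention-score work only and ignores projections, MLP blocks, and any kernel-level effects.

```python
from typing import Optional


def causal_pairs(seq_len: int, window: Optional[int] = None) -> int:
    """Count (query, key) pairs scored under a causal mask.

    With no window, query i attends to keys 0..i (global attention).
    With a window, query i attends to at most `window` most recent keys.
    """
    if window is None or window >= seq_len:
        return seq_len * (seq_len + 1) // 2
    # The first `window` queries see 1..window keys; every later query
    # sees exactly `window` keys.
    return window * (window + 1) // 2 + (seq_len - window) * window


def equivalent_global_layers(
    seq_len: int,
    ga_layers: int = 10,    # from the report: 70 layers minus 60 SWA layers
    swa_layers: int = 60,   # from the report
    window: int = 128,      # ASSUMPTION: hypothetical window size
) -> float:
    """Express total attention-score work as a number of global layers."""
    full = causal_pairs(seq_len)
    local = causal_pairs(seq_len, window)
    return ga_layers + swa_layers * local / full


if __name__ == "__main__":
    # ASSUMPTION: a 131,072-token prompt as the "long-text" case.
    print(round(equivalent_global_layers(131_072), 3))
```

Under these assumptions the 60 SWA layers add only a small fraction of one global layer's work on a long prompt, so the total stays close to 10. For prompts shorter than the window, the same function returns 70, since every layer then behaves like global attention.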
Xiaomi's MiMo-V2.5 Model Reduces Costs Using 10-Layer Equivalent Attention Computation
Xiaomi's MiMo-V2.5 model reduces costs by cutting attention computation to the equivalent of 10 global layers. The 70-layer Pro version cuts cache hit costs by 99% and input/output costs by 60–80%. According to the report, the model uses a 1:7 sparsity ratio between global and sliding window attention layers.
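The cache figures reported above can be put in context with a rough footprint estimate. The sketch below assumes that a GA layer keeps key/value state for every token while an SWA layer keeps only the most recent `window` tokens. The window size, the sequence length, and the function names are hypothetical, and the sketch does not model Xiaomi's hierarchical cache tiers or Cache Read Overlap. It gives an idealized bound, not a reproduction of the reported production numbers.

```python
def kv_slots(seq_len: int, ga_layers: int, swa_layers: int, window: int) -> int:
    """Token slots of KV state retained for one sequence, summed over layers."""
    ga = ga_layers * seq_len                 # global layers keep every token
    swa = swa_layers * min(seq_len, window)  # SWA layers keep only the window
    return ga + swa


def capacity_gain(
    seq_len: int,
    ga_layers: int = 10,    # from the report: 70 layers minus 60 SWA layers
    swa_layers: int = 60,   # from the report
    window: int = 128,      # ASSUMPTION: hypothetical window size
) -> float:
    """How many more tokens fit in the same memory versus an all-global model."""
    dense = (ga_layers + swa_layers) * seq_len
    hybrid = kv_slots(seq_len, ga_layers, swa_layers, window)
    return dense / hybrid


if __name__ == "__main__":
    # ASSUMPTION: a 32,768-token sequence as the example workload.
    print(round(capacity_gain(32_768), 2))
```

With these assumed values the idealized gain comes out somewhat above the reported 5x, which is plausible for a real system with extra overheads. Arithmetically, a 5x capacity increase is consistent with the reported 80% reduction in cache cost per token, since 1 - 1/5 = 0.8.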