Xiaomi's MiMo-V2.5 Cuts Costs With Attention Computation Equivalent to a 10-Layer Model

ME News reports that on May 27 (UTC+8), according to monitoring by Beating, Luo Fuli, head of Xiaomi's large model team, explained on X the algorithmic mechanisms behind the permanent price cut for the company's proprietary MiMo-V2.5 series model APIs. Luo said that even after matching DeepSeek's API pricing, Xiaomi's high-load inference engine still operates at break-even. The savings come mainly from a hybrid attention architecture and hierarchical KV cache optimizations.

To cut the cost of cache hits by 99%, Xiaomi's inference framework implemented hierarchical KV cache optimizations tailored to Sliding Window Attention (SWA). In production testing, this hierarchical optimization increased token cache capacity by 5x and reduced cache costs by 80%. Combined with Cache Read Overlap, a technique that overlaps cache reads between global attention modules, the system further lowered the actual overhead of a cache hit.

Luo attributed the 60% to 80% reduction in base input and output costs to the model's 1:7 inter-layer sparsity ratio, the proportion of Global Attention (GA) layers to Sliding Window Attention (SWA) layers. During long-text prefilling, the 60 SWA layers compute attention only over local sliding windows, so the 70-layer MiMo-V2.5-Pro model performs attention computation equivalent to that of a traditional global GQA model with only 10 layers. This very low computational load substantially reduced inference costs and had given Xiaomi a 2x to 3x profit margin before the price adjustment. The price cut therefore reflects a structural cost reduction, not loss-leading competition.

Luo emphasized that low-cost inference services help stimulate demand for end-user intelligence. Large model companies should avoid blind price wars and instead bring operating costs below the break-even point through coordinated, bottom-up design of algorithms and inference systems. (Source: BlockBeats)
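
The "10-layer equivalent" figure can be checked with a rough count of attention score computations during prefill. The sketch below is illustrative only: it takes the reported split of 10 GA layers and 60 SWA layers, and it assumes a hypothetical window of 128 tokens, since the report does not state the window size. It counts query-key pairs only and ignores projections, feed-forward layers, and GQA head grouping, none of which change the ratio between layer types. The function names are made up for this example and do not come from Xiaomi's framework.

```python
def attention_pairs_global(seq_len: int) -> int:
    """Causal global attention: token i attends to tokens 0..i."""
    return seq_len * (seq_len + 1) // 2


def attention_pairs_sliding(seq_len: int, window: int) -> int:
    """Causal sliding-window attention: each token attends to at most `window` tokens."""
    if seq_len <= window:
        return attention_pairs_global(seq_len)
    # The first `window` tokens see a growing prefix; every later token sees exactly `window`.
    return window * (window + 1) // 2 + (seq_len - window) * window


def equivalent_global_layers(seq_len: int, n_global: int, n_sliding: int, window: int) -> float:
    """Express the hybrid stack's attention work as a number of full global layers."""
    total = (n_global * attention_pairs_global(seq_len)
             + n_sliding * attention_pairs_sliding(seq_len, window))
    return total / attention_pairs_global(seq_len)


def cached_token_slots(seq_len: int, n_global: int, n_sliding: int, window: int) -> int:
    """Per-layer token slots kept in the KV cache: GA keeps everything, SWA keeps one window."""
    return n_global * seq_len + n_sliding * min(seq_len, window)


if __name__ == "__main__":
    WINDOW = 128  # assumed window size, not taken from the report
    for n in (4_096, 65_536, 1_000_000):
        layers = equivalent_global_layers(n, n_global=10, n_sliding=60, window=WINDOW)
        kv_share = cached_token_slots(n, 10, 60, WINDOW) / (70 * n)
        print(f"seq_len={n}: ~{layers:.2f} global-layer equivalents, "
              f"KV slots at {kv_share:.1%} of an all-global stack")
```

As the sequence grows, the SWA layers' share of the work shrinks toward zero, so the total approaches the cost of the 10 global layers alone. The same reasoning applies to cache size, because an SWA layer only needs to retain one window of keys and values. The production figures quoted above (5x capacity, 80% lower cache cost) are measurements reported by Luo and are not reproduced by this estimate.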