Xiaomi MiMo API Reduces Prices by 99% Through Engineering Breakthroughs

Article by Xiang Xianzhi

Luo Fuli posted on X to put an end to the controversy surrounding Xiaomi MiMo's price cut.

On May 26, the official MiMo account on X announced that MiMo-V2.5 series APIs are permanently discounted by up to 99%, that all context lengths are now priced uniformly, and that token packages have been enlarged 5 to 8 times.

This announcement dominated China’s AI community for an entire week. The industry’s initial reactions split into several camps. The largest group called it “another round of price wars”—over the past two years, Chinese large models from companies like Zhipu, DeepSeek, ByteDance’s Doubao, and Alibaba’s Tongyi have been taking turns cutting prices, with no one staying out of the race.

Another perspective is pessimistic: Xiaomi just announced that its profits this year have halved, yet it’s still pouring 60 billion into AI and slashing API costs by 90%—a classic case of “selling at a loss to capture market share.” Others believe this is a continuation of the DeepSeek effect—DeepSeek has dragged the entire industry’s pricing benchmark to rock bottom, and anyone who doesn’t follow will be left behind.

As the head of MiMo, Luo Fuli published a 5,000-word technical blog last night that openly lays out the engineering arithmetic behind the price reduction.

Look, this is real engineering capability, not marketing hype.

To understand what Luo Fuli is saying, you first need to know what exactly was reduced by 99%.

This is not a full model price reduction. The 99% discount applies exclusively to the pricing tier called Input (Cache Hit)—the portion related to users repeatedly reading historical context in long conversations. Discounts for new inputs (No Cache Hit) are much smaller, and the discount for model output (Output) is the smallest.

If you think of the model as a coffee shop, this becomes easier to understand.

You order a half-sugar latte, and the coffee shop has two ways to make it. It can grind the beans, measure the syrup, and pour the milk from scratch every time, paying for ingredients and labor on each order. Or, since the shop knows you are drinking the same half-sugar latte every day this week, it can make one large batch, store it in the fridge, and scoop out one serving at a time. That is what MiMo did this time: it changed the handling of repeated user requests from "calculate on demand" to "retrieve on demand," bringing the real cost of these operations close to zero, which is why a 99% discount follows naturally.

To achieve "retrieve on demand," the technical blog outlines six engineering components, each of which is essential. Let's examine them one by one.

Project One: Reduce the model's memory footprint to 1/7

During conversation, the model computes and stores an intermediate state for each token, known as KVCache—essentially the model’s "short-term memory notebook." With every sentence, the model records a summary of that sentence in its notebook, allowing it to refer back to the notes directly next time, rather than reprocessing all previously spoken content from the beginning.

Traditional models apply "Full Attention" at every layer—meaning each token examines all tokens in the entire conversation, causing the notebook to grow thicker and thicker. MiMo-V2.5-Pro changes this architecture: out of 70 layers, 60 layers only attend to the most recent 128 tokens (Sliding Window Attention), while just 10 layers act as "archivists" that view the full context.

As a result, the KVCache size is reduced to just 1/7 of Full Attention, and the computational load is also reduced to 1/7.
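The 1/7 figure can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes every layer stores the same amount of KV data per token, and the function names are hypothetical. The layer counts (10 full, 60 sliding-window) and the 128-token window come from the article.

```python
def hybrid_kv_entries(context_len: int, full_layers: int = 10,
                      swa_layers: int = 60, window: int = 128) -> int:
    """KV entries (token-layer pairs) one sequence needs in the hybrid design."""
    full_part = full_layers * context_len              # full layers keep every token
    swa_part = swa_layers * min(context_len, window)   # SWA layers keep only the window
    return full_part + swa_part


def full_attention_kv_entries(context_len: int, total_layers: int = 70) -> int:
    """KV entries if all layers used Full Attention."""
    return total_layers * context_len


def hybrid_ratio(context_len: int) -> float:
    """Hybrid KVCache size as a fraction of the Full Attention size."""
    return hybrid_kv_entries(context_len) / full_attention_kv_entries(context_len)
```

At 128 tokens the ratio is 1.0, because nothing has slid out of the window yet. As the context grows, the 60 SWA layers stop growing and the ratio approaches 10/70, that is, 1/7.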

This is the first cornerstone of cost reduction. As an analogy: previously, all 70 employees were required to remember the minutes of every meeting, which overloaded everyone and slowed the company down. Under the new policy, 60 employees only keep the most recent notes in mind, while 10 archivists remain responsible for the complete historical record. The company's overall memory is preserved, but the total amount that has to be remembered falls to roughly 1/7, for about a sevenfold gain in efficiency.

Project Two: Make the space saved by SWA actually usable

Shrinking the notebook to 1/7 is the architectural first step, but closing the gap between the theoretical 1/7 and an actual 1/7 remains a challenge.

Traditional KVCache systems allocate memory uniformly across all layers based on the "maximum possible usage." This means that even though the 60 SWA layers only need a small notebook, the system gives every layer space the size of an "archivist's large ledger." The space SWA saves is merely reserved and effectively wasted.

The Luo Fuli team's approach is to split the KVCache into two independent pools. The 10 layers of Full Attention use the "large pool," allocated for the full sequence length, while the 60 layers of SWA use the "small pool," allocated only for a 128-token window.

For example, imagine the company previously gave every employee a filing cabinet large enough for 100 years of documents, even though 60 employees only needed a small cabinet for a week's worth of files, so 99% of the space in those large cabinets sat empty. The new approach allocates cabinets by actual need, allowing over five times as many colleagues to work in the same office space. In the same way, a single GPU can now serve five times as many concurrent users.
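To see why splitting the pools matters, it helps to count slots. The following is a minimal sketch under simplifying assumptions: one "slot" holds one token's KV for one layer, all slots are the same size, and the names `PoolConfig`, `slots_uniform`, and `slots_dual_pool` are hypothetical rather than anything from the MiMo blog.

```python
from dataclasses import dataclass


@dataclass
class PoolConfig:
    full_layers: int = 10   # Full Attention layers (from the article)
    swa_layers: int = 60    # Sliding Window Attention layers (from the article)
    window: int = 128       # SWA window in tokens (from the article)


def slots_uniform(max_len: int, cfg: PoolConfig) -> int:
    """Single pool: every layer reserves room for the full sequence."""
    return (cfg.full_layers + cfg.swa_layers) * max_len


def slots_dual_pool(max_len: int, cfg: PoolConfig) -> int:
    """Two pools: full layers reserve the sequence, SWA layers only the window."""
    large_pool = cfg.full_layers * max_len
    small_pool = cfg.swa_layers * min(max_len, cfg.window)
    return large_pool + small_pool


def max_concurrent(budget_slots: int, per_sequence_slots: int) -> int:
    """How many sequences fit in a fixed slot budget."""
    return budget_slots // per_sequence_slots


cfg = PoolConfig()
budget = 4 * slots_uniform(32_768, cfg)   # memory sized for 4 sequences of 32K tokens
print(max_concurrent(budget, slots_uniform(32_768, cfg)))    # 4
print(max_concurrent(budget, slots_dual_pool(32_768, cfg)))  # 27
```

This idealized count gives an upper bound close to 7x. The article's figure for concurrency per GPU is lower, and a real server also spends memory on weights, activations, and fragmentation, none of which this sketch models.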

This step may seem simple, but without it, the advantages of the previous SWA architecture would be meaningless.

Project Three: Ensure that "repeat reads by existing users" actually hit the cache

With the notebook compressed to 1/7 and the space savings actually realized, the next step is to address a longstanding issue: the hit rate of the prefix cache.

Many user conversations share the same beginning—the same system prompt, the same codebase, or the same long document. The system stores previously computed results for these and reuses them directly when a match occurs. This mechanism is called prefix caching.

However, there is a pitfall in SWA mode: two requests sharing the same tokens does not guarantee that the KV cache is still valid. The prefix may have been computed, but the parts outside the SWA window could already have been evicted. If the system still follows the old rule of "same tokens means cache hit," it may retrieve invalid or overwritten data, and the model's output quality collapses immediately.

The Luo Fuli team upgraded the rule to a "window-safe length": the cache only commits to the portion of the prefix it can fully serve.

For example, imagine a library with one million books, and you want to borrow the complete three-volume set of "The Three-Body Problem." Under the old system, you’d be told, "This book is available," only to rush over and find that the shelf holds only the cover and the first volume—the other two volumes have already been checked out. This "false hit" forces you to make a wasted trip and request the set again. The new system changes the rule: it only commits to what you can fully borrow—first giving you Volume One, then retrieving the remaining two volumes for you.

It may sound like it would be stricter and reduce the hit rate, but the opposite is true: because SWA reduces the KVCache size to 1/7, the same storage space can hold several times more content, significantly increasing the actual hit rate.
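One way to picture the rule is as a lookup that rounds a token match down to the nearest point where the sliding-window state is still intact. The sketch below is a guess at the shape of such a check, not the MiMo implementation: the idea of `swa_checkpoints` (prefix lengths at which the SWA window's KV is still stored) and both function names are assumptions made for illustration.

```python
from bisect import bisect_right
from typing import Sequence


def naive_hit_length(matched_tokens: int) -> int:
    """Old rule: every matching token counts as a cache hit."""
    return matched_tokens


def window_safe_length(matched_tokens: int,
                       swa_checkpoints: Sequence[int]) -> int:
    """Longest reusable prefix whose SWA window is still in the cache.

    swa_checkpoints must be sorted ascending. Returns 0 when no
    checkpoint lies at or below the matched length.
    """
    idx = bisect_right(swa_checkpoints, matched_tokens)
    return swa_checkpoints[idx - 1] if idx else 0


# 5,000 tokens match, but intact window state exists only at these lengths.
checkpoints = [1024, 4096, 8192]
safe = window_safe_length(5000, checkpoints)
print(safe)          # 4096 tokens are served from cache
print(5000 - safe)   # 904 tokens are recomputed
```

Under these assumptions the cache promises only the first 4,096 tokens and recomputes the rest, which mirrors the library analogy of handing over Volume One first and fetching the others afterward.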

Luo Fuli's blog gives measured figures: under mainstream harness frameworks, the server-side cache hit rate averages 93%, and high-frequency, long-term users reach over 95%.

What this number means: roughly 95% of "repeated read" input needs no fresh GPU computation and is retrieved directly from the cache. This is the physical basis for a 99% discount.

Project Four: Put the "cache" on the built-in SSDs of the GPU machines

The hit rate has improved; the next question is: where are these caches stored?

VRAM (HBM memory on GPUs) is expensive and limited: a single eight-GPU H100 server has only 640 GB of VRAM, while MiMo may need tens of terabytes of KVCache storage. A tiered approach is therefore essential: recently used data stays in VRAM (L1), slightly older data in CPU memory (L2), and cold data in a distributed cache (L3).

It’s the same as managing your money. Cash in your wallet is like video memory—available instantly but limited in capacity. Your bank account balance is like CPU memory—takes 30 seconds to access but can hold much more. A time deposit is like L3 distributed cache—takes two minutes to withdraw but is significantly cheaper.
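The tiering can be sketched as a lookup that tries the fastest tier first and promotes whatever it finds. This is a toy model only: the class name `TieredKVCache` is hypothetical, plain dictionaries stand in for GPU memory, CPU memory, and the distributed cache, and there is no eviction or capacity limit.

```python
from typing import Dict, Optional, Tuple


class TieredKVCache:
    """Toy three-tier lookup: L1 (VRAM), L2 (CPU memory), L3 (distributed cache)."""

    TIER_ORDER = ("L1", "L2", "L3")

    def __init__(self) -> None:
        self.tiers: Dict[str, Dict[str, bytes]] = {
            name: {} for name in self.TIER_ORDER
        }

    def put(self, tier: str, key: str, value: bytes) -> None:
        """Store a KV block in the named tier."""
        self.tiers[tier][key] = value

    def get(self, key: str) -> Tuple[Optional[bytes], Optional[str]]:
        """Return (value, tier it was found in), or (None, None) on a miss."""
        for name in self.TIER_ORDER:
            if key in self.tiers[name]:
                value = self.tiers[name][key]
                if name != "L1":
                    # Promote so the next read is served from the fastest tier.
                    self.tiers["L1"][key] = value
                return value, name
        return None, None
```

A block that starts in L3 is found there on the first read and in L1 on the second, which is the wallet, bank account, and time deposit pattern from the analogy above.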

The industry standard is to set up a separate storage cluster for L3, using dedicated hardware and a dedicated data center, with monthly rental payments.

The Xiaomi storage team took a different approach. They developed their own distributed cache called GCache, which is deployed directly on the SSDs built into the GPU machines—co-located with training and inference tasks on the same machines.

Others rent a dedicated warehouse to store large amounts of data; Xiaomi noticed spare room in the garage that already houses its GPU machines and simply stored the data there, saving the monthly rent.

The additional storage cost is $0.

The impact of this is greater than it appears. In conventional cost models for an AI company's compute, storage is a recurring expense: the larger your model and the more users you have, the bigger your storage bill. GCache eliminates this entirely. Combined with SWA's small footprint and the 93–95% hit rate, the TTL (time-to-live) of KVCache in L3 extends from minutes to hours or even days. The longer the TTL, the wider the window for hitting historical context, the higher the cache hit rate, and the more sustainable the 99% discount becomes.

Project Five: Route cached requests along the shortest path

The cache can now be stored, hit, and kept cheaply; the final step is routing each request to the right machine.

Xiaomi developed its own scheduling system called LLM-Router, which performs three tasks:

First, affinity scheduling: requests with the same prefix are routed to the same server to maximize cache reuse.

Second, length-based bucketing: short requests (0–64 KB), medium requests (64 KB–256 KB), and long requests (256 KB–1 MB) are routed to separate processing channels so that short requests are not slowed down by long ones.

Third, TTFT optimization: Prioritize scheduling requests with minimal actual computation (i.e., those that heavily hit the cache) in the queue waiting for inference—preventing them from being blocked by requests requiring heavy computation from fresh inputs.

For example, in conventional airport scheduling, all passengers flying to the same destination are grouped in the same lounge and share the baggage claim process—this is affinity scheduling. Passengers with carry-on luggage and those with three large checked bags use separate security checkpoints, so the fast aren’t held back by the slow—this is length-based bucketing. Boarding prioritizes passengers with only carry-on luggage, as they board quickly, enabling the plane to depart earlier—this is TTFT optimization.
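The three scheduling tasks can be sketched in a few small functions. This is an illustration under stated assumptions, not LLM-Router's actual design: the `Request` fields, the hash-based server choice, and the function names are all hypothetical. The bucket boundaries reuse the article's 64K, 256K, and 1M figures, with `length` measured in whatever unit the router uses.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    prefix: str   # shared leading content, e.g. a system prompt
    length: int   # total input length
    cached: int   # portion of the input expected to hit the cache


def pick_server(req: Request, num_servers: int) -> int:
    """Affinity scheduling: the same prefix always maps to the same server."""
    digest = hashlib.sha256(req.prefix.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_servers


def length_bucket(req: Request) -> str:
    """Length-based bucketing into separate processing channels."""
    if req.length <= 64 * 1024:
        return "short"
    if req.length <= 256 * 1024:
        return "medium"
    return "long"  # 256K and above


def order_queue(queue: List[Request]) -> List[Request]:
    """TTFT optimization: requests with the least uncached work go first."""
    return sorted(queue, key=lambda r: r.length - r.cached)
```

A production router would more likely use consistent hashing, so that adding or removing a server does not remap most prefixes, and would balance affinity against load. The modulo here is the simplest choice that shows the idea.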

In testing, this scheduling strategy raised the L2 cache hit rate by 25% and single-machine input throughput by 30%, and cut P90 latency for long requests by 30%.

The same GPU can serve more users. The other half of the price reduction logic lies here—higher effective output per unit of computing power and lower cost per user.

Project Six: Make the model type faster

The first five items optimize the "read" side—reducing the cost for users to reread historical context to nearly zero. The sixth item optimizes the "write" side—the process by which the model generates the next token.

Traditional models generate only one token at a time. MiMo natively supports three-layer MTP (Multi-Token Prediction)—predicting the next three tokens at once, and skipping intermediate computations if the predictions in between are correct.

For example, traditional typing involves pressing a key for each character—you’d need to press four keys to type "今天天气." MTP is like having an auto-complete feature that predicts your next one or two characters—if it guesses correctly, you don’t need to press those extra keys.
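The "guess ahead, keep what is right" idea can be sketched as a draft-and-verify step. The code below assumes MTP's extra predictions are checked against what the main model would have produced, which is how the description above reads. The function names are hypothetical and the token IDs are made up.

```python
from typing import Sequence


def accepted_tokens(draft: Sequence[int], verified: Sequence[int]) -> int:
    """Count drafted tokens that match the verified tokens, stopping at the first miss."""
    count = 0
    for drafted, correct in zip(draft, verified):
        if drafted != correct:
            break
        count += 1
    return count


def tokens_per_step(draft: Sequence[int], verified: Sequence[int]) -> int:
    """Tokens produced by one model step.

    The step always yields one token from the main model itself,
    plus however many drafted tokens were accepted.
    """
    return 1 + accepted_tokens(draft, verified)


# Three drafted tokens; the first two are right, the third is wrong.
print(tokens_per_step([5, 7, 9], [5, 7, 2]))  # 3
```

With three drafted tokens, a step yields between one and four tokens. The realized speedup depends on how often the guesses are right.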

In agentic scenarios, MiMo's MTP measured a 2.3x speedup for the first 128 tokens and a 1.5x speedup for tokens 128–256.

The significance is this: the 99% discount applies specifically to Input (Cache Hit), but input and output occur within the same request when the model serves users. If the cost of output is not also reduced, only the input half of the request gets cheaper. MTP brings down the cost of the output half as well, completing the cost-reduction picture.

Link the six pieces into one cost-reduction chain:

SWA architecture → KVCache 1/7 → Dual-pool truly frees up capacity → One GPU can handle 5x more concurrent requests → Prefix cache hit rate of 93–95% → 95% of requests require almost no computation → GCache reduces storage costs to zero → Scheduling prioritizes cached requests → MTP also saves generation costs → GPU time per request drops by an order of magnitude → Cost per unit drops by over 95% → Pricing reduced by 99%, while maintaining positive gross margins.

If any single link is missing, the entire chain breaks at that point. The 99% discount is not a marketing figure—it’s the cumulative effect of six engineering pillars combined with real-world online validation.

Looking back at the initial interpretations from the industry, each had some truth to it. The price war among Chinese large model companies over the past two years is real; Xiaomi cutting its profits in half while still investing heavily in AI is real; and DeepSeek dragging industry pricing down to rock-bottom levels is also real.

But by publicly releasing this technical blog and thoroughly breaking down the technical details, Luo Fuli is clearly aiming to counter claims about a price war, emphasizing that "technical issues belong to technology, and marketing issues belong to marketing."

In her blog, she wrote that the inference efficiency of the MiMo-V2.5 series models does not stem from a single breakthrough in one component, but rather from multidimensional, coordinated optimizations. Hybrid SWA benefits both prefill and decode phases, but an inadequately optimized KVCache implementation can increase costs across all stages. To achieve this goal, the MiMo team systematically restructured KVCache management, hierarchical caching, and prefix caching trees, resolved core issues with SWA KVCache, optimized scheduling strategies and the prefill/decode pipeline, and validated these improvements in real-world online scenarios, ultimately translating theoretical efficiency gains into production performance. Only then did Hybrid SWA fully realize its architectural advantages in long-context inference—delivering both strength and efficiency. Combined with MoE configurations and various optimizations for multimodal inference, these enhancements significantly improved the performance of online inference services.

This is a systematic approach to AI engineering and a cost-reduction strategy worthy of industry-wide reference and adoption.

You don't need to write a blog for a price war; you need one for engineering delivery.