Zhipu AI's engineering optimizations enhance cost efficiency and market confidence.


The first trading day after the May Day holiday saw Zhipu and MiniMax surge dramatically.

On May 4, Zhipu rose over 10%, with its stock price once again approaching the HK$1,000 mark, while MiniMax surged 12.62% to close at HK$803.

According to Morgan Stanley's report, the surge in stock prices is driven by China's unique AI narrative of value for money.

In its report titled "China’s AI Path: More Bang For The Buck," Morgan Stanley stated that, under constraints on computing power, the intelligence levels of top AI models in China and the U.S. are rapidly converging, with the gap narrowing to three to six months.

The report also notes that the true strength of Chinese models lies in their ability to achieve intelligence levels comparable to their U.S. counterparts at only 15% to 20% of the inference cost.

This is actually quite easy to understand. Not everyone needs the most powerful model, but most people want to use a cost-effective one.

The market is not buying a simple story of "domestic replacement," but rather China's AI industry converting cost efficiency into actual usage volumes, real revenue, and tangible valuation flexibility.

But the question arises: where does this cost-effectiveness come from?

If it's only about acquiring customers at low prices, it will quickly turn into a price war.

And if it's just model distillation—well, Anthropic and OpenAI have already closed off distillation pathways, so shouldn't valuations be cut rather than raised?

In fact, what truly made this narrative more compelling was Zhipu's technical blog post, "Scaling Pain: Practical Insights into Large-Scale Coding Agent Inference," released before Labor Day.

This blog doesn't discuss grand AGI visions; instead, it lays out the underlying engineering aspects—KV Cache, throughput, scheduling, and anomalous outputs—for the market to see.

Most importantly, it has exposed the secret behind China's AI value-for-money advantage.

01

In this blog, Zhipu explains how to enable the same GPU to handle more work with fewer errors by optimizing caching, scheduling, and anomaly monitoring.

Zhipu found that AI not performing well isn't always due to an unintelligent model—it could also be caused by a disorganized backend system. It fixed issues with cached data corruption, optimized GPU scheduling and cache reuse, and added an alert system that can detect abnormal outputs in advance.

As a result, the same model on the same GPU can serve more users with a lower error rate. Its value proposition isn't simply about lowering prices—it's about engineering optimizations that extract more stable, usable compute power from each GPU.

Through underlying engineering optimizations, the GLM-5 series has achieved a maximum 132% increase in system throughput in Coding Agent scenarios, with the system's abnormal output rate decreasing from approximately 10 per ten thousand to 3 per ten thousand.

For example, previously, one GPU could handle 100 tasks per hour; after optimization, it can now handle up to 232 tasks per hour.

Individually, none of these factors is decisive. But combined, they deliver more than double the throughput on the same compute and a more than threefold improvement in stability.

The model hasn't changed. What has changed is how the model is being put into use.

Specifically, since March, Zhipu has observed three types of anomalies in the online monitoring and user feedback for GLM-5: garbled text, repetition, and rare characters. These phenomena appear similar to the common “intelligence degradation” seen in long-context scenarios.

However, the Zhipu team has not deployed any optimizations that reduce model accuracy. So, is the anomaly caused by the model itself or by the inference pipeline?

After repeatedly analyzing and reasoning through the logs, they discovered an unexpected entry point: speculative sampling metrics could serve as a reference signal for anomaly detection.

Speculative sampling was originally just a performance optimization technique. It first generates candidate tokens using a draft model, then verifies and decides whether to accept them using the target model, thereby improving decoding efficiency without altering the final output distribution.

In plain terms: let the small model quickly draft a batch of candidate answers, then have the large model check them and keep the correct ones—fast and accurate at once.
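For intuition, here is a toy, runnable sketch of that draft-then-verify shape. It is purely illustrative: a real speculative sampler accepts or rejects against the target model's probabilities so the output distribution is provably unchanged, whereas this toy just matches characters against a fixed string, and every name in it is an assumption.

```python
# Toy illustration of speculative decoding's draft-then-verify loop.
# TARGET stands in for "what the target model would say"; the draft is
# cheap but sometimes wrong. All names here are illustrative assumptions.

TARGET = "the quick brown fox jumps"

def draft_propose(pos: int, k: int) -> str:
    # Hypothetical cheap draft model: right on consonants, wrong on vowels.
    return "".join("?" if c in "aeiou" else c for c in TARGET[pos:pos + k])

def target_verify(pos: int, proposal: str) -> str:
    # One parallel "forward pass": keep the longest agreeing prefix.
    accepted = ""
    for i, ch in enumerate(proposal):
        if ch != TARGET[pos + i]:
            break
        accepted += ch
    return accepted

def speculative_decode(k: int = 4) -> str:
    out = ""
    while len(out) < len(TARGET):
        accepted = target_verify(len(out), draft_propose(len(out), k))
        # If even the first draft token is rejected, fall back to a single
        # token from the target model so progress is always made.
        out += accepted if accepted else TARGET[len(out)]
    return out

print(speculative_decode())  # reconstructs TARGET, several tokens per step
```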

The Zhipu team discovered that when anomalies occur, two metrics of speculative sampling exhibit stable patterns. As a result, they expanded speculative sampling from a mere performance optimization to a real-time monitoring signal for output quality.

When spec_accept_length remains below 1.4 for an extended period and the generated length exceeds 128 tokens, or when spec_accept_rate exceeds 0.96, the system actively aborts the current generation and redirects the request to the load balancer for retry.
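Expressed as a minimal sketch: the metric names spec_accept_length and spec_accept_rate and the thresholds come from the blog, while the dataclass and the retry hook are illustrative assumptions.

```python
# Minimal sketch of the abort-and-retry rule described above. The metric
# names and thresholds are from the text; the wrapper types are assumed.

from dataclasses import dataclass

@dataclass
class SpecMetrics:
    spec_accept_length: float  # avg. draft tokens accepted per verify step
    spec_accept_rate: float    # fraction of draft tokens the target accepts
    generated_tokens: int      # tokens emitted so far for this request

def should_abort(m: SpecMetrics) -> bool:
    """Flag a generation as anomalous per the thresholds in the text:
    an accept length below 1.4 once past 128 generated tokens points to
    garbled output (in production this condition would be checked over a
    sustained window); an accept rate above 0.96 points to degenerate
    repetition that even the draft model predicts perfectly."""
    garbled = m.spec_accept_length < 1.4 and m.generated_tokens > 128
    repetitive = m.spec_accept_rate > 0.96
    return garbled or repetitive

# In serving code, a scheduler would poll this each step and, on True,
# cancel the generation and resend the request to the load balancer.
```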

These two numbers act like health indicators—when they turn abnormal, it means the current generation has gone "ill" and needs to be cut short and retried.

Users never perceive this process, but behind the scenes the request has indeed been aborted and re-dispatched.

The root cause of the anomaly is a KV cache reuse conflict.

It's like a kitchen during peak meal hours, when many people come in to place orders at the same time.

The system needs to temporarily store each user's context—this is the KV Cache. Think of a server remembering what each table ordered earlier, who wants less chili, who wants no cilantro: manageable with one or two tables, but as the guests multiply, the server starts mixing orders up.


During high concurrency, the order in which certain caches are reclaimed, reused, or read can become disordered. As a result, the model may retrieve the wrong context, leading to garbled output, repetition, or obscure characters.

In the inference engine, under the PD (prefill-decode) separation architecture, the request lifecycle and the timing of KV Cache recycling and reuse can fall out of sync. Under high concurrency these conflicts are amplified, surfacing to users as garbled text and repetition.

Multiple requests competing for the same memory block resulted in corrupted data, causing users to see garbled text.

The Zhipu team identified this bug and fixed it.
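To make the failure mode concrete, here is a simplified sketch—one standard guard against premature block reuse, not Zhipu's actual fix or SGLang's code—using reference counting so a block is only recycled once its last reader has finished:

```python
# Simplified illustration of the bug class: if a KV block returns to the
# free pool while another request can still read it, the next writer
# overwrites the block and the reader decodes someone else's context,
# i.e. garbled output. Reference counting is one standard guard.

import threading

class KVBlock:
    def __init__(self, block_id: int):
        self.block_id = block_id
        self.refcount = 0
        self.data = None  # stands in for the cached key/value tensors

class BlockPool:
    def __init__(self, num_blocks: int):
        self._free = [KVBlock(i) for i in range(num_blocks)]
        self._lock = threading.Lock()

    def acquire(self) -> KVBlock:
        with self._lock:
            block = self._free.pop()
            block.refcount = 1
            return block

    def share(self, block: KVBlock) -> None:
        # Prefix reuse: another request starts reading the same cached prefix.
        with self._lock:
            block.refcount += 1

    def release(self, block: KVBlock) -> None:
        # The bug class is recycling here unconditionally. Safe version:
        # only return the block once the last reader has let go.
        with self._lock:
            block.refcount -= 1
            if block.refcount == 0:
                block.data = None
                self._free.append(block)

pool = BlockPool(num_blocks=8)
b = pool.acquire()   # request A caches its prefix in this block
pool.share(b)        # request B reuses the same prefix
pool.release(b)      # A finishes; block is NOT recycled, B still reads it
pool.release(b)      # B finishes; only now does the block rejoin the pool
```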

In addition, digging into the source code of the mainstream open-source inference framework SGLang, they identified and fixed a load-timing bug in its HiCache module—the read-before-ready problem, in which a cache entry can be read before its load has completed.

The fix was submitted to the SGLang community via Pull Request #22811 and was accepted.
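The shape of that bug class and its remedy can be sketched like this—an assumed async loader for illustration, not the actual HiCache internals:

```python
# Illustrative sketch of a read-before-ready guard: a reader must not see
# a cache entry until its load has completed. Names here are assumptions.

import asyncio

class CacheEntry:
    def __init__(self) -> None:
        self._ready = asyncio.Event()
        self._value = None

    def publish(self, value) -> None:
        # Writer: store the loaded KV data first, then signal readers.
        self._value = value
        self._ready.set()

    async def read(self):
        # Reader: wait until the load has actually finished, instead of
        # reading a partially loaded (read-before-ready) entry.
        await self._ready.wait()
        return self._value

async def demo() -> None:
    entry = CacheEntry()

    async def loader() -> None:
        await asyncio.sleep(0.1)   # simulate fetching KV data from storage
        entry.publish("kv-tensors")

    asyncio.create_task(loader())
    print(await entry.read())      # blocks until publish; never reads early

asyncio.run(demo())
```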

SGLang is an open-source project—an inference and serving framework for large language models. It is not a large model itself, nor an AI company, but foundational software that lets large models run efficiently.

While running SGLang in production, Zhipu uncovered this high-concurrency caching bug.

It didn't just patch the issue internally; it also submitted the fix upstream to the open-source project.

After review by the project maintainers, the fix was accepted and merged. As a result, this improvement became available in the public release, allowing other developers and companies using SGLang to benefit from it as well.

What does this mean?

If any deployment path of Qwen uses SGLang+HiCache, Alibaba will also benefit from Zhipu's discovery and resolution of this issue.

As I mentioned earlier, the model itself hasn’t changed, but through engineering optimizations, it’s now smarter when in use.

02

What Zhipu's blog truly exposes runs a level deeper.

In the chatbot era, low cost stemmed largely from low training costs, with part of the training data derived from distilling leading models.

In the agent era, this trick doesn't work.

This year, Anthropic and OpenAI have gradually shut down their distillation channels, explicitly prohibiting the use of their model outputs to train competing models. The path of relying on distillation to gain an advantage is becoming increasingly narrow.

However, the narrative around the cost-effectiveness of Chinese AI companies has not weakened; instead, the market is reinforcing this story.

The reason is that the definition of cost-performance ratio has changed.

In the chatbot era, the average context was around 55K tokens per conversation, and concurrency was low.

In the agent era, the average context exceeds 70K tokens, tasks run long (on the scale of 8 hours), concurrency is high, and prefix reuse is heavy.

In the chatbot era, the metric for AI cost-effectiveness was simple: for the same question, whose model is cheaper, and whose answer comes closer to first-tier quality?

The industry discussed cost per million tokens, parameter counts, and leaderboard rankings.

In the agent era, no one asks these questions—that arithmetic no longer works.

The user is not buying just an answer—they are buying the completed outcome of an entire task.

A Coding Agent must read code, understand context, plan steps, invoke tools, modify files, run tests, and retry on failure. The tokens it consumes are not an incremental increase from a single Q&A, but rather the total accounting of an entire workflow.

OpenRouter, the world's largest model-invocation platform, saw its weekly token volume rise from 6.4 trillion in the first week of January 2026 to 13 trillion in the week of February 9—doubling in roughly a month.

According to OpenRouter, the incremental demand is concentrated in the 100K-to-1M context-length range—a typical signature of agent workflows.

Users' approach to AI has shifted from "conversational" to "workflow-based." Consequently, the unit of AI cost-effectiveness has changed from "cost per token" to "cost per task."

This means that some models may have low token costs, but due to poor performance, they frequently fail during tasks or produce results that don’t meet requirements, making their overall agent cost anything but cheap.

For example, if a coding task on an 8-hour timeframe encounters a single corruption, the entire workflow may need to be restarted. The savings from a lower token price cannot make up for the wasted time.
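A back-of-the-envelope calculation makes the point, using the anomaly rates cited earlier and an assumed step count standing in for a long workflow:

```python
# Back-of-the-envelope only. The per-step anomaly rates (10 and 3 per ten
# thousand) are the figures cited earlier in this article; the step count
# is an assumption, not a measured number.

steps = 2000  # assumed model calls in one long-running coding-agent task

for per_step_error in (10 / 10_000, 3 / 10_000):
    clean_run = (1 - per_step_error) ** steps
    print(f"{per_step_error:.4f} errors/step -> "
          f"{clean_run:.1%} chance the task completes without an anomaly")

# 0.0010 errors/step -> 13.5% chance the task completes without an anomaly
# 0.0003 errors/step -> 54.9% chance the task completes without an anomaly
```

Per-step error rates compound over a long workflow, so a seemingly small gap in stability translates into a large gap in how often the whole task finishes cleanly—and thus in cost per task.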

China's AI value-for-money narrative is escalating.

Previously, it was about "delivering answers of the same quality, but at a lower price." Now, it's about "completing equally complex tasks at a lower cost."

Open-source infrastructure is also becoming China's new moat in AI.

SGLang, as discussed above, is a case in point. China's AI engineering capability is beginning to radiate upstream into the community.

The value of this lies not just in Zhipu fixing a bug, but in Chinese AI companies reverse-engineering real-world business challenges—such as high concurrency, long context handling, and agent invocation—into capabilities for public infrastructure.

As mentioned earlier, when a fix is integrated into an open-source framework like SGLang, it no longer serves only Zhipu’s own models. All teams deploying large models using this framework gain access to more stable caching, lower inference costs, and improved agent experiences.

Model capabilities can be caught up with, and prices can be driven down, but once infrastructure enters the open-source ecosystem, it becomes a standard, an interface, and a development habit.

The earlier someone incorporates their engineering experience into these underlying systems, the better positioned they will be for the next wave of AI applications.

03

Back to the capital markets.

Stocks tied to large AI models surged across the board—will capital revalue AI companies? What exactly are investors buying?

The answer is that capital markets are rewarding the narrative that Chinese AI companies can achieve near-top-tier intelligence at lower inference costs.

Again drawing on OpenRouter's data:

The token consumption share of China’s leading AI companies rose rapidly from 5% in April 2025 to 32% in March 2026, while the share of U.S. leading models declined significantly from 58% to 19%.

MiniMax, Zhipu, and Alibaba saw their token usage grow 4 to 6 times in February-March 2026 compared with December 2025.

Beyond token calls, Chinese AI is developing a growth logic entirely distinct from that of overseas giants.

Overseas leading models are selling a "capability premium."

The stronger the model's capabilities, the higher the cost per invocation, as users pay for the most intelligent performance. Claude, GPT-5, and Gemini are all moving in this direction.

Chinese AI is selling "engineering."

The model's capabilities approach those of top-tier models, but with lower costs, reduced latency, and lower access barriers, making it better suited for the needs of most high-frequency use cases.

Morgan Stanley's report notes that input prices for China's leading models run around $0.3 per million tokens, versus roughly $5 for some comparable overseas products—a gap of more than tenfold.

When AI transitions from a novelty tool to a productivity tool, cost-effectiveness will directly determine usage frequency.

The cheaper the model, the more willing companies are to entrust it with more customer service, code, marketing, and data analysis tasks. The more tasks it runs, the greater the token consumption, allowing the platform to better amortize its infrastructure costs.


I believe that, at this stage, it has the potential to create a flywheel effect.

Round one: attract developers and enterprises with lower API prices and capabilities close to those of the top tier.

Round two: higher call volumes bring more real-world scenarios, pushing the model and the inference system to keep optimizing.

Round three: as in Zhipu's technical blog, engineering optimizations drive down cost per token and per task, letting vendors keep cutting prices to gain volume—or raise prices in high-value scenarios.

Round four: as token consumption becomes the traffic of the AI era, whoever can serve more tokens at lower cost comes closest to becoming the next-generation platform company.

If only the model's price is lowered, the market may worry it is a subsidy or a price war—cash burning until someone's wallet can no longer keep up.

Moreover, a price war cannot sustain a high valuation.

But if the price reduction is driven by increased throughput, cache reuse, lower error rates, and improved scheduling efficiency, then the lower price is not about sacrificing profit for growth—it’s the cost savings unlocked by enhanced engineering capabilities.

The results of a price war and this kind of engineering optimization may both make the model cheaper and appear similar on financial statements, but they are vastly different in valuation models.

The former is a subsidy, which leads to a market discount. The latter is an engineering barrier, which leads to a market premium.

All of this ultimately comes down to one judgment.

In the past, AI companies were valued based on the upper limits of their model capabilities and how close they were to AGI. The market was paying for "the strongest intelligence," but the definition of the strongest intelligence became increasingly vague, and each inference call grew more expensive.

In the age of agents, valuation instead hinges on the cost floor—on who can deliver intelligence that is stable, affordable, and deployable at scale.

For those seeking the most advanced "intelligence," this may not be where Chinese AI excels.

China's AI, however, is best positioned to turn the word "intelligence" into infrastructure that everyone and every business can afford.

And the market is only willing to pay for companies that can clearly articulate their logic.

This article is from the WeChat public account "Letter Board" (ID: wujicaijing), authored by Miao Zheng.
