Zhipu AI's stock surges 26% after launching 400 tokens/s API

Text | AIDeepDive

Today, Zhipu (02513.HK), known as the "world's first listed large model company," surged again.

The intraday gain briefly exceeded 30%. The stock closed at HK$1,282, up more than 26% on the day, for a market capitalization of HK$571.57 billion and another all-time high.

What triggered this surge was a specific technical indicator: 400 tokens/s.

On May 22, Zhipu officially launched the GLM-5.1 High-Speed API (GLM-5.1-highspeed) for enterprise customers, with one headline parameter: the model's output speed reaches 400 tokens per second, setting a new global record for large model API speed.

I originally thought this was just another PR stunt by a domestic large model, but after carefully examining the technical details, I finally understood the logic behind the capital market.

What does 400 tokens/s mean?

The model can generate approximately 200 Chinese characters per second, equivalent to a professional writer’s intense one-minute output, compressed into just one second.

The amount of text a creator might spend several days writing at a desk can be delivered by GLM-5.1 High-Speed Edition in just one minute; a system refactoring task that an engineer might spend three days on can be completed while enjoying a cup of coffee.

01 Speed matters more than you think

Speed has always been the most overlooked dimension in the competition among AI models.

Over the past three years, the large model arms race has focused on two tracks: parameter scale (larger, smarter models) and price wars (cheaper, more accessible tokens). "Speed" has never been the main character.

This is because, in the past, "speed" was typically achieved by reducing model parameters. To increase speed, smaller and more streamlined models had to be used, at the cost of reduced capability.

The significance of GLM-5.1 High-Speed Edition lies in its ability to achieve a speed of 400 tokens/s while retaining the full-capability flagship base model.

For the first time, both "flagship capabilities" and "ultra-low latency" are achieved without compromise—whether viewed from a domestic or global perspective.

Why is speed so critical? Because the main battlefield for AI is undergoing a fundamental shift.

When AI transitions from ChatBot to Agent, Q&A is no longer the primary scenario. To accomplish a task, an Agent often has to call the model dozens or even hundreds of times: writing code, calling APIs, searching for information, invoking tools...

Under this workflow, delays between each round of calls are ruthlessly compounded. For a task requiring 50 rounds of calls, saving just 1 second per call results in nearly a minute faster overall. For AI programming assistants, voice interactions, and business decision systems, this difference can be life-or-death.
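The compounding is plain multiplication. Here is a minimal sketch, assuming the calls run strictly one after another and that each call generates a hypothetical 500 tokens (the article does not give a figure); prefill, tool execution, and network time are ignored. The 100 tokens/s baseline is the lower end of the GPT-4o range quoted later in the article, and the function name is illustrative.

```python
def agent_generation_time(rounds: int, tokens_per_call: int, tokens_per_second: float) -> float:
    """Seconds spent purely on token generation across sequential model calls."""
    return rounds * tokens_per_call / tokens_per_second

ROUNDS = 50            # from the article's example
TOKENS_PER_CALL = 500  # hypothetical, for illustration only

baseline = agent_generation_time(ROUNDS, TOKENS_PER_CALL, 100)  # a 100 tokens/s API
fast = agent_generation_time(ROUNDS, TOKENS_PER_CALL, 400)      # a 400 tokens/s API

print(f"100 tokens/s: {baseline:.1f} s of generation")
print(f"400 tokens/s: {fast:.1f} s of generation")
print(f"saved per call: {(baseline - fast) / ROUNDS:.2f} s")
```

Under these assumptions the per-call saving is several seconds, so the article's "1 second per call" example is on the conservative side.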

On a deeper level, faster inference within a fixed time budget means the model can traverse longer reasoning paths and perform more rounds of self-verification. Speed is evolving from a system metric into the very limit of intelligence.

02 How difficult is speed?

What is the current industry standard for speed?

Among leading vendors, OpenAI’s GPT-4o operates at approximately 100–150 tokens/s, Anthropic’s Claude Sonnet series at around 80–120 tokens/s, and most domestic mainstream flagship model APIs fall within the 50–100 tokens/s range. 400 tokens/s is roughly 3 to 5 times the industry average.

More importantly, this gap cannot be bridged simply by investing more computing power.

A server equipped with eight H200 GPUs has a theoretical aggregate memory bandwidth of up to 38 TB per second. For GLM-5.1, generating a single token requires reading approximately 42 GB of activated parameters, so the theoretical ceiling is 38 TB ÷ 42 GB, roughly 900 tokens/s, approaching 1,000.

But real-world systems often only achieve dozens of tokens per second.

This is an order-of-magnitude gap. GPUs aren't slow; they waste vast amounts of time waiting, idling, and being scheduled inefficiently.
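The theoretical ceiling comes from dividing memory bandwidth by the bytes read per token. The sketch below redoes that division with the article's figures, assuming decimal units, a single request at a time, and that weight reads are the only memory traffic (KV cache reads are ignored). The function name is illustrative.

```python
TB = 1e12
GB = 1e9

def bandwidth_bound_tokens_per_s(bandwidth_bytes_per_s: float, bytes_read_per_token: float) -> float:
    """Upper bound on decode speed when every token must stream the active weights once."""
    return bandwidth_bytes_per_s / bytes_read_per_token

ceiling = bandwidth_bound_tokens_per_s(38 * TB, 42 * GB)
achieved = 400  # tokens/s, the figure reported for GLM-5.1-highspeed

print(f"theoretical ceiling: {ceiling:.0f} tokens/s")
print(f"400 tokens/s uses {achieved / ceiling:.0%} of that ceiling")
```

By this estimate, a system running at dozens of tokens per second uses only a few percent of the available bandwidth.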

Zhipu has achieved a breakthrough in final speed by innovating simultaneously at three levels: the inference engine, parallel strategy, and network architecture.

03 Three-layer technology stacking approaches the physical limits of hardware

Large models operate in this way: they are broken down into individual operators, each of which launches a compute kernel once, completes its calculation, then pauses and synchronizes before launching the next one.

During training, each computation takes seconds or even minutes, so the overhead of startup and waiting is negligible. During inference, however, a single operator may run for only tens of microseconds, so the same startup and waiting overhead becomes a significant share of the total time.
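To see why the same overhead matters in one case and not the other, compare it with the useful work it wraps. This is a rough sketch assuming a hypothetical fixed cost of 10 µs per launch-and-synchronize cycle; the article gives no overhead figure, and the real cost depends on hardware and driver.

```python
def overhead_share(work_time_us: float, launch_overhead_us: float) -> float:
    """Fraction of wall-clock time spent on launch and synchronization rather than computing."""
    return launch_overhead_us / (work_time_us + launch_overhead_us)

LAUNCH_OVERHEAD_US = 10.0  # hypothetical fixed cost per launch

training_share = overhead_share(1_000_000.0, LAUNCH_OVERHEAD_US)  # 1 s of work
inference_share = overhead_share(30.0, LAUNCH_OVERHEAD_US)        # 30 µs of work

print(f"training:  {training_share:.4%} of time is overhead")
print(f"inference: {inference_share:.0%} of time is overhead")
```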

TileRT's core idea: Compile the entire model into a continuously running engine, launched once and never stopped.

TileRT statically unrolls all of the model's computational logic into a single continuous pipeline at compile time, so that the GPU stays busy at runtime, with computation, data movement, and communication proceeding in parallel. Intermediate results are kept in the GPU's high-speed cache as much as possible, avoiding repeated writes back to slower video memory and the reads that follow.

There is a key design detail: Warp specialization.

To understand Warp, you first need to understand how GPUs work. The biggest difference between a GPU and a CPU is that a GPU contains thousands of relatively simple computing units, grouped together in sets of 32—each group is called a Warp.

The 32 units within the same warp must always act in sync, executing the same instruction, like a squad in an army where the squad leader gives a command and everyone moves together.

In traditional frameworks, all Warps execute the same instruction sequence; TileRT assigns different responsibilities to different Warp groups: one group specializes in pre-fetching the next batch of data, another focuses solely on mathematical computations, and a third handles communication with other GPUs. These three groups work simultaneously in a pipelined fashion, without waiting for each other.

It’s like transitioning from "one worker moving bricks, laying walls, and inspecting tasks sequentially," to "brick-moving team, wall-laying team, and inspection team working simultaneously."
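The benefit of the three-team arrangement can be estimated with a simple timing model. This sketch is plain Python rather than GPU code. It assumes an ideal pipeline with no stalls, and the per-tile stage times are hypothetical, not measurements from TileRT.

```python
def sequential_time(n_tiles: int, load: float, compute: float, comm: float) -> float:
    """One worker does load, compute, and communicate for each tile in turn."""
    return n_tiles * (load + compute + comm)

def pipelined_time(n_tiles: int, load: float, compute: float, comm: float) -> float:
    """Three specialized groups overlap: after the pipeline fills,
    each extra tile costs only as much as the slowest stage."""
    return (load + compute + comm) + (n_tiles - 1) * max(load, compute, comm)

# Hypothetical per-tile stage times in microseconds.
N, LOAD, COMPUTE, COMM = 100, 2.0, 3.0, 1.0

seq = sequential_time(N, LOAD, COMPUTE, COMM)
pipe = pipelined_time(N, LOAD, COMPUTE, COMM)
print(f"sequential: {seq:.0f} us, pipelined: {pipe:.0f} us, speedup: {seq / pipe:.2f}x")
```

In this model the speedup is bounded by the ratio of total stage time to the slowest stage, which is why balancing the three roles matters.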

Once efficiency within a single card is resolved, parallel processing across multiple cards presents new challenges.

The industry standard practice is tensor parallelism: splitting the model's weight matrices into multiple parts, with each GPU responsible for one part, and then aggregating the results via high-speed interconnects (NVLink) after individual computations are completed.

This solution works exceptionally well for regular, dense computations like matrix multiplication and is the standard multi-GPU approach used by nearly all large model inference frameworks today.

GLM-5.1 uses MLA (Multi-head Latent Attention), an attention mechanism introduced by DeepSeek.

Traditional attention mechanisms require storing all intermediate data (KV Cache) from each step for later use, consuming significant GPU memory; MLA compresses these intermediate data into a compact "latent vector" for storage and expands them back when needed, greatly reducing memory usage and improving inference efficiency.
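The memory saving is easy to estimate per token and per layer. The dimensions below are hypothetical and chosen only for illustration; they are not GLM-5.1's actual configuration. The sketch also ignores any additional per-token state a real MLA implementation may keep alongside the latent vector.

```python
def kv_bytes_per_token_standard(n_heads: int, head_dim: int, bytes_per_elem: int = 2) -> int:
    """Standard multi-head attention caches one key and one value vector per head."""
    return 2 * n_heads * head_dim * bytes_per_elem

def kv_bytes_per_token_latent(latent_dim: int, bytes_per_elem: int = 2) -> int:
    """A latent-compressed cache stores one compact vector instead."""
    return latent_dim * bytes_per_elem

# Hypothetical dimensions, 2 bytes per element (16-bit storage).
standard = kv_bytes_per_token_standard(n_heads=64, head_dim=128)
latent = kv_bytes_per_token_latent(latent_dim=512)

print(f"standard: {standard} bytes, latent: {latent} bytes, ratio: {standard / latent:.0f}x")
```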

But the MLA calculation process includes a special step: performing sparse indexing on a large volume of historical data—similar to quickly identifying the most relevant books in a vast library before closely reading those few.

The "locate book" step relies on global information and is not suitable for distribution across multiple GPUs; "close reading" is the dense computation that benefits from parallel processing across multiple GPUs. Forcing all 8 GPUs to participate in "locate book" would waste significant time on synchronization and communication between GPUs.

TileRT's solution is heterogeneous GPU operation: GPU 0 acts as the "librarian," handling sparse indexing and routing decisions; GPUs 1–7 serve as "close readers," performing dense attention computations and matrix operations. Each type of worker uses the parallel strategy best suited to its role, and together they complete the entire computation layer.
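The division of labor can be mocked up in a single Python process. This is not TileRT's implementation: the "librarian" here simply picks the top-k positions by the attention logits themselves (a real indexer would use a cheaper relevance score), values are scalars, and the "readers" are ordinary function calls rather than GPUs. The sketch only shows that a global selection step followed by sharded dense work, merged with a log-sum-exp style combine, gives the same answer as doing everything in one place.

```python
import math

def select_top_k(scores, k):
    """'Librarian' role: needs a global view to pick the k most relevant positions."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def partial_attention(indices, logits, values):
    """'Close reader' role: softmax-weighted partial sum over one shard.
    Returns (local_max, sum_of_exp, weighted_sum) so shards can be merged exactly."""
    if not indices:
        return (-math.inf, 0.0, 0.0)
    m = max(logits[i] for i in indices)
    exps = [math.exp(logits[i] - m) for i in indices]
    weighted = sum(e * values[i] for e, i in zip(exps, indices))
    return (m, sum(exps), weighted)

def merge(partials):
    """Rescale each shard to a common maximum, then combine."""
    m = max(p[0] for p in partials)
    denom = sum(p[1] * math.exp(p[0] - m) for p in partials)
    numer = sum(p[2] * math.exp(p[0] - m) for p in partials)
    return numer / denom

# Toy history of 32 positions with deterministic, made-up logits and values.
logits = [3.0 * math.sin(i) for i in range(32)]
values = [float(i) for i in range(32)]

selected = select_top_k(logits, k=16)          # done once, by the "librarian"
shards = [selected[w::7] for w in range(7)]    # split across 7 "readers"
result = merge([partial_attention(s, logits, values) for s in shards])
print(f"attention output over selected positions: {result:.6f}")
```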

Next, TileRT directly embeds communication operations between GPUs into the execution pipeline, eliminating them as separate steps. From an external perspective, the entire 8-GPU system requires only a single kernel launch to complete one layer of attention computation, with all internal communication and calculation seamlessly performed within the continuous pipeline.

The above two layers address issues within a single machine. When the cluster scales to hundreds or even thousands of GPUs, data transfer between GPUs themselves becomes the new bottleneck.

The industry standard practice is ROFT (Rail-Optimized Fat-Tree), the solution officially recommended by NVIDIA.

Its structure is a tree: servers first connect to Leaf switches (access layer, directly facing servers), and Leaf switches then connect upward to Spine switches (core layer, responsible for interconnecting different Leaf switches, like highway interchanges). Data transmitted between two GPUs must "first ascend to the Spine, then descend to the target Leaf," requiring at least three hops.

To prevent traffic from concentrating on a few links, this architecture relies on the ECMP algorithm to distribute data across multiple paths. This works well under the assumption that traffic is statistically uniform, as internet traffic typically is.

But traffic in inference scenarios is completely uneven. The context lengths of different requests can vary by factors of tens, the direction of KV Cache transmission between GPUs is nearly random, and certain leaf switches periodically become hotspots, triggering backpressure mechanisms that spread congestion from local areas to the entire chain. This congestion is not solvable by tuning protocol parameters—it is an inherent product of the topology itself.
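The hotspot mechanism can be illustrated with a toy model of hash-based path selection. This sketch uses MD5 as a stand-in for a switch's hardware hash and made-up flow identifiers; it compares many small flows with a few large ones carrying the same total volume. Real ECMP implementations hash packet header fields and differ in detail.

```python
import hashlib
from collections import Counter

def ecmp_path(flow, n_paths):
    """Pick a path by hashing the flow identifier (stand-in for a hardware hash)."""
    digest = hashlib.md5(repr(flow).encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_paths

def path_loads(flows, n_paths):
    """Total volume landing on each path; every flow stays on its hashed path."""
    loads = Counter({p: 0 for p in range(n_paths)})
    for flow, size in flows:
        loads[ecmp_path(flow, n_paths)] += size
    return loads

def imbalance(loads):
    """Busiest path relative to the mean; 1.0 means perfectly even."""
    mean = sum(loads.values()) / len(loads)
    return max(loads.values()) / mean

N_PATHS = 8
# Hypothetical flows: (source id, destination id, flow id), with a size.
many_small = [((i % 64, (i * 7) % 64, i), 1) for i in range(8000)]
few_large = [((i, 63 - i, i), 1000) for i in range(8)]

print("many small flows:", round(imbalance(path_loads(many_small, N_PATHS)), 2))
print("few large flows: ", round(imbalance(path_loads(few_large, N_PATHS)), 2))
```

With thousands of small flows the hash averages out; with a handful of large transfers, two of them landing on the same path is enough to double that link's load, and no hash tuning removes the possibility.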

ZCube's fundamental breakthrough: physically preventing this type of congestion at the architectural level.

The core design consists of two steps:

Step 1: Remove the Spine backbone layer and flatten the network. Divide all Leaf switches into two groups based on odd and even numbering, and fully interconnect the two groups—each odd-numbered switch connects to every even-numbered switch, and vice versa. Any two GPUs can reach each other through at most two switches, reducing the hop count from three to two.

Step 2, and the most ingenious part: each GPU network card connects to two separate sets of switches in two entirely different ways. This unique topology delivers a critical mathematical property: between any two GPUs in the entire network, there is exactly one optimal path.
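The two-switch bound from Step 1 can be checked on a small model. The article does not spell out ZCube's exact wiring, so the sketch below makes a loud assumption: each GPU attaches to one even-numbered and one odd-numbered Leaf, using an arbitrary assignment rule invented here. It verifies only the hop bound and does not attempt to reproduce the single-optimal-path property.

```python
from collections import deque
from itertools import combinations

def build_topology(n_leaves, n_gpus):
    """Odd and even Leaf switches fully interconnected; GPU wiring is hypothetical."""
    evens = [l for l in range(n_leaves) if l % 2 == 0]
    odds = [l for l in range(n_leaves) if l % 2 == 1]
    adj = {}

    def link(a, b):
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    for o in odds:
        for e in evens:
            link(("leaf", o), ("leaf", e))
    for g in range(n_gpus):
        # Assumed assignment: one even Leaf and one odd Leaf per GPU.
        link(("gpu", g), ("leaf", evens[g % len(evens)]))
        link(("gpu", g), ("leaf", odds[(g // len(evens)) % len(odds)]))
    return adj

def switches_on_shortest_path(adj, src, dst):
    """Breadth-first search; only switches forward traffic, GPUs are endpoints."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return dist[node] - 1  # edges minus one equals switches traversed
        if node != src and node[0] == "gpu":
            continue
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return None

adj = build_topology(n_leaves=8, n_gpus=16)
gpus = [("gpu", g) for g in range(16)]
worst = max(switches_on_shortest_path(adj, a, b) for a, b in combinations(gpus, 2))
print("worst-case switches between any two GPUs:", worst)
```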

The "single path" directly eliminates the root cause of congestion. Traditional architectures are prone to hotspots precisely because multiple paths are available—load balancing algorithms can mistakenly direct traffic to concentrate. ZCube eliminates the very concept of "choice" by design: no load balancing is needed because there are no branching paths.

04 Under the same hardware conditions, how is the accounting done?

After upgrading its GLM-5.1 production cluster from traditional ROFT to ZCube, Zhipu obtained three numbers: throughput up 15%, tail latency down 40.6%, and network equipment costs down by one-third.

In summary, with the same GPU investment, the cluster can serve more users; with the same user experience requirements, the cluster needs one-third fewer network devices. Efficiency and cost both improve.

Specifically, a 15% increase in throughput is equivalent to gaining 15% additional computing power for free. With the same number of GPUs, a 15% increase in throughput translates to approximately a 13% reduction in the amortized hardware cost per token, or the ability to serve 15% more users at the same cost.

If a cluster has 1,000 GPUs, this upgrade is equivalent to adding 150 cards out of nowhere, representing computing power worth hundreds of millions of yuan at current market prices for high-end inference cards.
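Both figures follow from the 15% throughput gain. This is a minimal sketch of the arithmetic, assuming hardware cost is fixed and amortized evenly over tokens served; the function names are illustrative.

```python
def cost_per_token_reduction(throughput_gain: float) -> float:
    """Same hardware cost spread over (1 + gain) times as many tokens."""
    return 1 - 1 / (1 + throughput_gain)

def equivalent_extra_gpus(n_gpus: int, throughput_gain: float) -> float:
    """How many additional GPUs would deliver the same extra throughput."""
    return n_gpus * throughput_gain

reduction = cost_per_token_reduction(0.15)
extra = equivalent_extra_gpus(1000, 0.15)

print(f"cost per token falls by {reduction:.1%}")
print(f"equivalent to about {extra:.0f} extra GPUs in a 1,000-GPU cluster")
```

The cost reduction is smaller than the throughput gain because the gain enlarges the denominator: 1 − 1/1.15 ≈ 13%, not 15%.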

Tail latency decreased by 40.6%, addressing stability rather than average speed. For an Agent task requiring 50 rounds of calls, if tail latency is reduced by 1 second per round, the worst-case completion time is compressed by nearly one minute.

Costs are reduced by one-third through direct savings at the infrastructure level. ZCube eliminates the Spine layer, directly reducing the number of switches and optical modules required by one-third for the same cluster size. According to Zhipu’s estimates, this single change alone can save between 210 million and 640 million yuan in a 10,000-GPU cluster.

In the long term, as cluster scale grows exponentially, the complexity of GPU-to-GPU communication increases several-fold, simultaneously amplifying the probability and impact of congestion. This means the value of architecture-level innovations like ZCube will grow faster as inference clusters continue to expand. The benefits for tomorrow's ten-thousand-GPU clusters may far exceed today's 15%.

05 Final Thoughts

After reading Zhipu's technical report, I wonder if this will stir up the industry just like DeepSeek did.

Think about it carefully—the impacts seem to be on different fronts. When DeepSeek was released, it demonstrated that the same level of intelligence can be achieved with far less computing power. The market worried that "fewer GPUs would be needed," causing NVIDIA’s market value to drop nearly $600 billion that day.

But today, Zhipu's technical proof shows: with the same computing power, more can be produced. It is reimagining what other infrastructure should look like beyond GPUs.

In the short term, NVIDIA will not be affected, but in the long term, the moat formed by GPUs, NVLink interconnects, InfiniBand networks, and the CUDA software ecosystem is being undermined. In particular, the network premium of the InfiniBand assets NVIDIA gained through its $6.9 billion acquisition of Mellanox, announced in 2019, will be significantly eroded.

In addition, ZCube has eliminated the Spine layer, but it has increased the port density requirements for Leaf switches. Manufacturers capable of producing high-density, high-port Leaf switches (Ruijie, Arista, Broadcom switching chips) benefit, while manufacturers primarily reliant on high-end Spine switches to capture premium pricing are disadvantaged.

In 2025, Celestica and NVIDIA together accounted for approximately 50% of the AI backend network switch market share, a landscape that will be reshuffled following the spread of the ZCube paradigm.

Optical modules are the most direct beneficiary of this industry chain transformation, with a very clear rationale. For domestic optical module manufacturers (such as InnoLight and TF Technology), this represents a structural tailwind: not only is the overall demand increasing, but under the ZCube paradigm, demand for high-speed optical modules (800G, 1.6T) is more concentrated and urgent than in traditional architectures.

Neither TileRT nor ZCube relies on NVIDIA's proprietary hardware features: TileRT is a pure software inference engine running on standard GPUs, and both are theoretically portable to domestic chips such as Huawei Ascend. If this direction succeeds, it will significantly lower the software stack barrier for domestic AI chips in inference scenarios.

This may be the greater significance behind this technological innovation.