Zhipu AI Discovers Two Critical Bugs in the GLM-5 Coding Agent System

AIMPACT Message, April 30 (UTC+8): According to monitoring by Beating, Zhipu has published a post reviewing anomalies in the GLM-5 series models in Coding Agent scenarios, including garbled text, repetition, and rare characters. Users began reporting these anomalies in March. They occurred only in Coding Agent tasks under high concurrency and long contexts (averaging over 70K tokens) and could not be reproduced in standard inference environments. Zhipu said its inference system handles hundreds of millions of Coding Agent calls per day. After weeks of investigation, the team identified two independent low-level race conditions. The first occurred in the PD-disaggregated architecture (a deployment method that splits prefill and decode across different nodes): after the decode side timed out and aborted a request, it reclaimed that request's KV Cache (cached attention states that avoid redundant computation) while the RDMA write from the prefill side was still in flight. The same GPU memory was then assigned to a new request, and the stale write overwrote the new request's data. The fix adds explicit synchronization before memory is reclaimed, so that all writes complete before release. After deployment, the anomaly rate dropped from several per ten thousand to below three per ten thousand. The second bug occurred in HiCache (a multi-level KV Cache): while cache was being loaded asynchronously from CPU memory, there was no synchronization point between the loading and computation pipelines, so the computation side could begin reading before the data had fully loaded. After this fix, this type of anomaly disappeared entirely, and the patch has been submitted to the SGLang community (PR #22811).
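For readers who want a concrete picture of the two fixes, the sketch below models them with ordinary Python threads. It is an illustration under stated assumptions, not Zhipu's or SGLang's code: the names `KVBlock`, `BlockPool`, and `HostToGpuLoader` are hypothetical, a sleeping thread stands in for the RDMA write and for the CPU-to-GPU copy, and the real systems synchronize on GPU and network primitives rather than Python locks.

```python
import threading
import time


class KVBlock:
    """Hypothetical KV Cache block that counts in-flight remote writes."""

    def __init__(self):
        self.data = None
        self._pending_writes = 0
        self._cond = threading.Condition()

    def begin_write(self):
        with self._cond:
            self._pending_writes += 1

    def complete_write(self, payload):
        with self._cond:
            self.data = payload
            self._pending_writes -= 1
            self._cond.notify_all()

    def wait_for_writes(self):
        with self._cond:
            self._cond.wait_for(lambda: self._pending_writes == 0)


class BlockPool:
    """Hypothetical free list of KV Cache blocks on the decode side."""

    def __init__(self):
        self.free = []

    def release(self, block):
        # Fix 1 (as described in the article): synchronize before reclaiming,
        # so no late write can land after the block is handed to a new request.
        block.wait_for_writes()
        block.data = None
        self.free.append(block)


def demo_release_after_abort():
    pool, block = BlockPool(), KVBlock()
    block.begin_write()  # request A's prefill-side write is in flight

    def slow_prefill_write():
        time.sleep(0.05)  # stands in for a slow RDMA write
        block.complete_write("kv-of-request-A")

    writer = threading.Thread(target=slow_prefill_write)
    writer.start()
    pool.release(block)  # decode side aborts A, but waits for the write
    reused = pool.free.pop()
    reused.data = "kv-of-request-B"  # request B now owns the memory
    writer.join()
    return reused.data


class HostToGpuLoader:
    """Hypothetical stand-in for loading cached KV pages from CPU memory."""

    def __init__(self):
        self.gpu_buffer = []
        self.loaded = threading.Event()  # Fix 2: explicit synchronization point

    def load_async(self, host_pages):
        def _copy():
            for page in host_pages:
                time.sleep(0.01)  # stands in for a host-to-device copy
                self.gpu_buffer.append(page)
            self.loaded.set()

        thread = threading.Thread(target=_copy)
        thread.start()
        return thread


def compute_after_load(loader):
    # Without this wait, computation could read a partially filled buffer.
    loader.loaded.wait()
    return list(loader.gpu_buffer)
```

In this model, removing the `wait_for_writes()` call lets request A's late write land on top of request B's data, and removing `loaded.wait()` lets the compute step see only part of the buffer, which mirrors the two failure modes described above.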
During the investigation, the team made an unexpected discovery: the acceptance rate of speculative sampling (a speed-up technique in which a small model guesses tokens and a larger model verifies them) could serve as an anomaly detection signal. When garbled text occurred, draft tokens were almost entirely rejected; during repetition, the acceptance rate was abnormally high. The team built real-time monitoring on this insight: when a threshold is triggered, generation is automatically halted and retried. After resolving these bugs, the team went on to address a performance bottleneck with LayerSplit KV Cache, which stores the KV Cache of only some layers on each GPU rather than the full set and uses broadcasting to coordinate computation. At a 90% cache hit rate, for request lengths between 40K and 120K tokens, throughput improved by 10% to 132%, with larger gains at longer context lengths. (Source: BlockBeats)
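The acceptance-rate monitoring described above might look roughly like the following sketch. Everything specific here is an assumption for illustration: the report does not give thresholds or window sizes, so `window=8`, `low=0.10`, and `high=0.98` are placeholder values, and `AcceptanceRateMonitor` and `generate_with_retry` are hypothetical names rather than Zhipu's or SGLang's API.

```python
from collections import deque


class AcceptanceRateMonitor:
    """Sliding-window check on the speculative-sampling acceptance rate."""

    def __init__(self, window=8, low=0.10, high=0.98):
        # Thresholds are illustrative assumptions, not published values.
        self.low = low
        self.high = high
        self.steps = deque(maxlen=window)

    def observe(self, accepted, proposed):
        """Record one speculative step and classify the recent window."""
        self.steps.append((accepted, proposed))
        if len(self.steps) < self.steps.maxlen:
            return "ok"  # not enough evidence yet
        total_accepted = sum(a for a, _ in self.steps)
        total_proposed = sum(p for _, p in self.steps)
        if total_proposed == 0:
            return "ok"
        rate = total_accepted / total_proposed
        if rate <= self.low:
            return "suspect_garbled"  # drafts almost entirely rejected
        if rate >= self.high:
            return "suspect_repetition"  # acceptance abnormally high
        return "ok"


def generate_with_retry(run_attempt, max_retries=2):
    """Halt an attempt when the monitor fires, then retry from scratch.

    `run_attempt(attempt)` is a hypothetical callable that yields
    (accepted, proposed) pairs, one per speculative step. Returns the index
    of the first attempt that finished cleanly, or None if all were halted.
    """
    for attempt in range(max_retries + 1):
        monitor = AcceptanceRateMonitor()
        for accepted, proposed in run_attempt(attempt):
            if monitor.observe(accepted, proposed) != "ok":
                break  # halt this generation
        else:
            return attempt
    return None
```

A window is used rather than a single step because one fully rejected or fully accepted draft is normal; only a sustained extreme in either direction is treated as a signal.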