CPU Becomes the New Bottleneck in the AI Era as Demand Surpasses GPU Focus

Over these years of rapid AI advancement, the industry has been almost entirely governed by one logic: computing power sets the upper limit, and GPUs are at the core of computing power.

However, entering 2026, this logic began to shift: model inference is no longer the sole bottleneck, and system performance is increasingly determined by execution and scheduling capability. GPUs remain important, but the key factor in whether AI can actually run is gradually shifting toward the long-overlooked CPU.

On April 9 (U.S. time), Google and Intel reached a multi-year agreement to deploy Intel's Xeon processors widely across global AI data centers, specifically to address this bottleneck. Intel CEO Lip-Bu Tan stated explicitly that AI runs across the entire system, and that CPUs and IPUs are key to performance, efficiency, and flexibility. In other words, the CPU, treated as a supporting player over the past two years, is now constraining the scaling of AI.

Intel CEO Lip-Bu Tan stated on social media that Intel is deepening its collaboration with Google, expanding from traditional CPUs to AI infrastructure (such as IPUs), to jointly advance AI and cloud computing capabilities.

The CPU is no longer just a passive supporting component; it is becoming one of the key variables in AI infrastructure.

01

A "quiet" supply crisis

While everyone is focused on GPU delivery times, the CPU market has already become tightly constrained.

According to the latest reports from multiple IT distributors, the average selling price of server CPUs increased by approximately 30% in the fourth quarter of 2025. Such a rise is extremely rare in the relatively mature CPU market.

Forrest Norrod, head of AMD’s data center business, revealed that CPU demand has grown at an unprecedented rate over the past three quarters. AMD’s delivery lead times have now extended from eight weeks to over ten weeks, with some models facing delays of up to six months.

This shortage is primarily caused by resource congestion triggered by "secondary effects." Industry insiders note that due to TSMC’s 3nm production lines being extremely tight, wafer capacity originally allocated to CPUs is increasingly being displaced by more profitable GPU orders. This has created a highly ironic situation: AI labs have secured ample GPUs but find it difficult to purchase sufficient high-end CPUs to "drive" these graphics cards.

In this round of CPU buying frenzy, there’s also Elon Musk.

Intel CEO Lip-Bu Tan confirmed on social media that Musk has commissioned Intel to design and manufacture custom chips for his "Terafab" project in Texas. This large-scale initiative aims to provide a unified computing foundation for xAI, SpaceX, and Tesla.

Musk’s trust in Intel stems largely from Intel’s efforts to embed itself at every level, from ground-based data centers to space-orbit computing.

For Intel, this is undoubtedly a boost. While industry analysts predict that AMD’s revenue share in the server CPU market will surpass Intel’s by 2026, Intel’s deep ecosystem inertia and manufacturing capabilities in the x86 landscape remain a significant advantage that major customers like Musk cannot afford to ignore.

This deep cross-industry integration is transforming CPU market competition from a simple parameter showdown into a battle over ecosystems and supply chain stability.

02

Why has the CPU become a bottleneck?

The CPU has suddenly become a bottleneck because the workload it must handle has fundamentally changed in the age of agents.

In traditional chatbot architectures, the CPU primarily handles scheduling and data processing, while the GPU performs the core inference computations. Since the compute-intensive tasks are concentrated on the GPU side, overall latency is typically dominated by the GPU, and the CPU rarely becomes a performance bottleneck.

However, agent workloads are entirely different. An agent must perform multi-step reasoning, call APIs, read and write databases, orchestrate complex business flows, and integrate intermediate results into a final output. Tasks such as searching, API calls, code execution, file I/O, and result orchestration primarily fall on the CPU and host system. The GPU handles token generation (i.e., "thinking"), while the CPU transforms those "thoughts" into actionable steps.
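The division of labor described above can be sketched as a minimal agent loop. Everything here is illustrative: the `llm_generate` stub, the `tools` dict, and the action schema are assumptions for this sketch, not any real framework's API. The point is simply that one call per step runs on the GPU (token generation), while every tool invocation in between runs on the CPU and host system.

```python
def run_agent(llm_generate, tools, task, max_steps=8):
    """Minimal agent loop illustrating the CPU/GPU split.

    llm_generate: stand-in for GPU-side token generation; returns the
        model's next action as a dict.
    tools: dict of CPU-side functions (search, API calls, file I/O, ...).
    """
    context = [task]
    for _ in range(max_steps):
        # GPU side: the model "thinks" and emits the next action.
        action = llm_generate(context)
        if action["type"] == "final":
            return action["answer"]
        # CPU side: execute the tool call and feed the result back.
        # In real agent workloads, this is where 50-90% of total
        # latency accumulates while the GPU sits idle.
        result = tools[action["tool"]](**action["args"])
        context.append(result)
    return None  # step budget exhausted without a final answer
```

With a real model behind `llm_generate`, each iteration alternates between a GPU-bound generation call and CPU-bound tool execution, which is exactly the alternation the Georgia Tech measurements break down.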

In their November 2025 paper, "A CPU-Centric Perspective on Agentic AI," researchers at Georgia Tech quantified the latency distribution in agentic workloads. The study found that CPU-side tool handling accounts for 50% to 90.6% of total latency. In some scenarios, the GPU is ready to process the next batch of tasks while the CPU is still waiting for tool call responses.

Another key factor is the rapid expansion of context windows. In 2024, mainstream models generally supported 128K to 200K tokens. By 2025, models such as Gemini 2.5 Pro, GPT-4.1, and Llama 4 Maverick have begun supporting over 1 million tokens. The Key-Value (KV) cache, used to accelerate Transformer model inference, scales linearly with the number of tokens, reaching approximately 200 GB at 1 million tokens—far exceeding the 80 GB VRAM capacity of a single H100 GPU.

One solution to this issue is to offload part of the KV cache to CPU memory. This means the CPU must not only manage orchestration and tool calls but also assist in storing data that cannot fit in GPU memory. CPU memory capacity, memory bandwidth, and the interconnect speed between the CPU and GPU thus become critical to system performance.
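The arithmetic behind the ~200 GB figure, and the resulting spill into CPU memory, can be checked with a back-of-envelope calculation. The model shape below (48 layers, 8 KV heads, head dimension 128, fp16) is a hypothetical grouped-query-attention configuration chosen to land near the article's number; real models vary widely:

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, dtype_bytes=2):
    """Per-request KV cache size: keys + values (factor of 2) stored
    for every layer, KV head, and token."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

# Hypothetical GQA model: 48 layers, 8 KV heads, head dim 128, fp16.
total = kv_cache_bytes(1_000_000, layers=48, kv_heads=8, head_dim=128)
print(f"KV cache at 1M tokens: {total / 1e9:.0f} GB")  # ~197 GB

# With 80 GB of HBM on a single GPU, the remainder must spill
# to CPU DRAM (or be sharded across more GPUs).
hbm = 80e9
offload = max(0, total - hbm)
print(f"Spilled to CPU memory: {offload / 1e9:.0f} GB")  # ~117 GB
```

Under these assumptions, well over half of a 1M-token KV cache lives in host memory, which is why CPU memory capacity, bandwidth, and the CPU-GPU interconnect become first-order performance variables.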

Therefore, CPUs suited for the agent era require low latency, consistent memory access, and stronger system-level collaboration, rather than merely expanding single-core scale.

03

What are manufacturers doing? Some are competing for market share, while others are redesigning their products.

Faced with this sudden surge in CPU demand, several major companies have adopted completely different strategies.

Intel has long been the leader in traditional server CPUs. According to data from Mercury Research, Intel still held a 60% share of the server CPU market in the fourth quarter of 2025, with AMD at 24.3% and NVIDIA at 6.2%. However, Intel has spent recent years playing catch-up on newer technologies, and this surge in CPU demand presents both an opportunity and a challenge.

Intel's current strategy is two-pronged. On one hand, it continues to sell Xeon processors through deep partnerships with hyperscale customers like Google. On the other, it has partnered with SambaNova on a combined solution pairing Xeon processors with SambaNova's proprietary RDU accelerator, marketed on the promise of "running agent inference without GPUs." The roadmap for Xeon 6 (Granite Rapids) and the 18A process will be critical in determining whether Intel can turn things around.

AMD is one of the biggest beneficiaries of this surge in CPU demand. In the fourth quarter of 2025, AMD’s data center revenue reached $5.4 billion, a 39% year-over-year increase. Fifth-generation EPYC Turin accounted for more than half of server CPU revenue, and cloud instances running EPYC saw over 50% year-over-year growth. AMD’s market share in server CPU revenue surpassed 40% for the first time.

AMD CEO Lisa Su attributed the growth directly to the development of agents—agent workloads are pushing tasks back onto traditional CPUs.

In February 2026, AMD also announced a potential transaction with Meta worth over $100 billion, supplying MI450 GPUs and Venice EPYC CPUs.

However, AMD still has room for improvement in system-level collaboration, lacking mature high-speed CPU-GPU interconnect capabilities similar to NVLink C2C. As agent systems demand higher efficiency in data exchange and coordination, the importance of this aspect is steadily increasing.

NVIDIA's approach to designing CPUs is completely different from that of Intel and AMD.

NVIDIA's Grace CPU has only 72 cores, while AMD EPYC and Intel Xeon parts typically have 128 or more. Dion Harris, NVIDIA's head of AI infrastructure, explained: "If you're a hyperscaler, you want to maximize the number of cores per CPU, which essentially reduces the dollar cost per core. So it's a business model."

In other words, within the AI computing architecture, the CPU is no longer the primary general-purpose processor but rather acts as a "coordination hub" serving the GPU. If the CPU cannot keep up, the expensive GPU will be forced to wait, ultimately reducing overall efficiency.

Therefore, NVIDIA's design prioritizes efficient collaboration between the CPU and GPU. Through the NVLink C2C interconnect, for example, CPU-GPU bandwidth rises to approximately 1.8 TB/s, far exceeding traditional PCIe. This lets the CPU directly access GPU memory, which significantly simplifies KV cache management.
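A rough calculation shows why that bandwidth gap matters once KV cache spills into host memory. The figures below use the article's ~1.8 TB/s aggregate for NVLink C2C and a nominal ~64 GB/s for a PCIe 5.0 x16 link; real achievable throughput is lower than the nominal rate in both cases:

```python
# Time to move a 100 GB slice of KV cache between CPU memory and the GPU,
# at nominal link bandwidths (achievable throughput will be lower).
data_gb = 100
nvlink_gbps = 1800   # GB/s, article's aggregate NVLink C2C figure
pcie_gbps = 64       # GB/s, nominal PCIe 5.0 x16

t_nvlink = data_gb / nvlink_gbps
t_pcie = data_gb / pcie_gbps
print(f"NVLink C2C:   {t_nvlink:.2f} s")  # ~0.06 s
print(f"PCIe 5.0 x16: {t_pcie:.2f} s")    # ~1.56 s
```

At these nominal rates the PCIe path is roughly 28x slower, turning what would be a sub-100 ms shuffle over NVLink into well over a second of GPU idle time per transfer.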

NVIDIA now sells the Vera CPU as a standalone product, with CoreWeave as the first customer. The deal with Meta is even more significant, marking the first large-scale "pure Grace deployment": CPUs deployed at scale without being paired with GPUs.

Ben Bajarin, Chief Analyst at research firm Creative Strategies, noted that in high-intensity system collaboration, CPU processing power must keep pace with the iteration speed of accelerators. Even a one percent delay in the data pathway can significantly undermine the economic efficiency of an entire AI cluster. This pursuit of extreme system efficiency is compelling all major companies to reevaluate CPU performance metrics.

Holger Mueller, Vice President and Chief Analyst at Constellation Research, said that as AI workloads shift toward agent-driven architectures, the role of the CPU is becoming increasingly central. He noted, “In the agent world, agents need to invoke APIs and various business applications—tasks that are best suited for the CPU.”

He added, "Currently, there is no consensus on whether GPUs or CPUs are better suited for inference tasks. GPUs hold an advantage in model training, and custom ASICs like TPUs also have their strengths. But one thing is clear: Google needs to adopt a hybrid processor architecture. Therefore, it is reasonable for Google to partner with Intel."

04

Conclusion: In the Age of Agents, the Balance of Computing Power Shifts Back

One recent data point deserves attention: in Amazon Web Services' $38 billion partnership with OpenAI, the official announcement explicitly mentions an expansion on the scale of "tens of millions of CPUs."

Over the past few years, the industry’s focus has typically been on “hundreds of thousands of GPUs.” However, leading labs like OpenAI have proactively treated CPU scale as a critical planning variable, sending a clear signal that scaling agent workloads must be built on a massive CPU infrastructure.

Bank of America predicts that the global CPU market could more than double from its current $27 billion to $60 billion by 2030, with nearly all of that growth driven by AI.

We are witnessing the expansion of an entirely new infrastructure: major companies are no longer just deploying GPUs, but are simultaneously scaling up an entire layer of "CPU orchestration infrastructure" specifically designed to support AI agents.

The collaboration between Intel and Google, along with Musk's substantial investment in custom chips, points to one truth: the decisive front in the AI race is shifting. When raw computing power is no longer scarce, those who first resolve system-level bottlenecks will emerge victorious in this trillion-dollar game.

Special contribution to this article was provided by Jinlu.

This article is from the WeChat public account "Tencent Technology," authored by Li Helen and edited by Xu Qingyang.
