Cerebras' Wafer-Scale AI Chip Breaks Through the Memory Barrier in the Inference Era

In 2026, the global development of AI reached a landmark turning point—cloud providers’ capital expenditure on inference surpassed training capital expenditure for the first time in history. The industry’s focal point shifted from “training large models” to “utilizing large models,” fundamentally reversing the structure of compute demand.

In the training era, the core compute challenge was "floating-point performance and cluster scale"; in the inference era, it has shifted to "memory bandwidth and communication latency."

The bottleneck in large model inference is no longer just computation, but data movement—model weights, intermediate activations, and KV Cache require frequent interactions between off-chip DRAM (such as HBM) and GPUs. The larger the model, the higher the energy consumption and latency of data transfer, ultimately far exceeding the energy cost of computation itself, creating a memory wall.
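A rough sketch of why data movement dominates: if generating each token requires reading every weight once, the single-stream decode rate cannot exceed memory bandwidth divided by the size of the weights. The 70-billion-parameter model, FP16 storage, and batch size of 1 are illustrative assumptions, not figures from the article; 8 TB/s is the HBM bandwidth implied by the article's "21 PB/s is 2,625 times the B200" comparison. KV Cache traffic is ignored.

```python
def max_tokens_per_second(params: float, bytes_per_param: float,
                          bandwidth_bytes_per_s: float) -> float:
    """Upper bound on single-stream decode rate when every token
    requires reading all model weights once from memory."""
    weight_bytes = params * bytes_per_param
    return bandwidth_bytes_per_s / weight_bytes


# Hypothetical model: 70e9 parameters stored in FP16 (2 bytes each).
# 8e12 B/s is the HBM bandwidth implied by the article's 2,625x ratio.
hbm_bound = max_tokens_per_second(70e9, 2, 8e12)
print(f"HBM-bound decode ceiling: {hbm_bound:.1f} tokens/s per stream")
```

Under these assumptions the ceiling is set entirely by bandwidth; adding more compute units does not raise it, which is the memory wall in miniature.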

NVIDIA GPUs have built a strong fortress with CUDA and NVLink, but they still cannot avoid GPU idle time caused by bandwidth bottlenecks.

The Chinese large model company Zhipu conducted a simple experiment: keeping the GPU, model, and code unchanged in a 512-GPU inference cluster, they increased the network bandwidth limit from 200 GB/s to 400 GB/s—resulting in a 10% increase in inference throughput and a 19% reduction in first-token latency. The principle is straightforward: widening the road allows vehicles to move faster.
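One way to read these numbers is through an Amdahl-style model. This is an assumption layered on top of the article, not something Zhipu reported: suppose a fraction f of the time is bandwidth-bound and shrinks in proportion to the bandwidth increase, while the rest is unchanged. The observed speedup then implies a value for f.

```python
def bandwidth_bound_fraction(speedup: float, bandwidth_ratio: float) -> float:
    """Solve speedup = 1 / ((1 - f) + f / bandwidth_ratio) for f.

    f is the share of the original time that scales inversely
    with bandwidth under this simplified model.
    """
    return (1 - 1 / speedup) / (1 - 1 / bandwidth_ratio)


# Bandwidth doubled from 200 GB/s to 400 GB/s.
throughput_fraction = bandwidth_bound_fraction(1.10, 2.0)       # +10% throughput
ttft_fraction = bandwidth_bound_fraction(1 / (1 - 0.19), 2.0)   # -19% first-token latency

print(f"Bandwidth-bound share of steady-state time: {throughput_fraction:.0%}")
print(f"Bandwidth-bound share of first-token time:  {ttft_fraction:.0%}")
```

Under this model, roughly 18% of steady-state time and 38% of first-token time in the original configuration would have been spent waiting on the network.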

However, non-GPU architectures, such as those represented by Cerebras, appear to be breaking through the memory wall.

Figure: Size of the Cerebras WSE-3 chip compared with the NVIDIA B200 GPU

The essence of Cerebras: a near-memory computing machine based on SRAM

Cerebras Systems was founded in Silicon Valley by Andrew Feldman and others, with the original founding team entirely coming from SeaMicro, a low-power microserver company later acquired by AMD. Subsequently:

In 2015, the founding team set out its "wafer-scale computing" roadmap;

In 2016, the company completed registration and Series A funding and entered stealth development;

In 2019, it released its first products, the WSE-1 chip and CS-1 system, built on TSMC's 16nm process;

In 2021, it launched the second-generation product, built on TSMC's 7nm process;

In 2024, it launched the third-generation product (WSE-3 / CS-3), built on TSMC's 5nm process, with both the chip and the system manufactured entirely in the United States, making it a genuinely American-made chip system.

Figure: CS-3 system configuration, featuring one WSE-3 chip

Cerebras’ wafer-scale engine (WSE) architecture philosophy is simple yet brutally effective: maximize physical space to drastically reduce data movement latency.

Traditional chips, NVIDIA's GPUs among them, are made by dicing a single 300mm-diameter wafer into hundreds of smaller dies. Cerebras does the opposite: it does not cut the wafer at all, and instead turns nearly the entire wafer into one massive chip called the Wafer-Scale Engine, or WSE.

The latest WSE-3 features 4 trillion transistors and 900,000 AI cores, each equipped with 48KB of local SRAM, for a total on-chip SRAM capacity of 44GB. It delivers 21 PB/s of on-chip memory bandwidth, thousands of times greater than traditional HBM bandwidth, and 214 Pb/s (petabits per second) of fabric bandwidth.

Figure: The memory bandwidth of the Cerebras WSE is 2,625 times that of NVIDIA's B200 packaged chip, breaking the memory bandwidth bottleneck in large-model inference.
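The capacity and bandwidth figures quoted in this section can be cross-checked with simple arithmetic. The sketch below assumes "48KB" means 48 KiB (48 × 1024 bytes); the constant names are illustrative.

```python
CORES = 900_000
SRAM_PER_CORE_BYTES = 48 * 1024          # assumption: 48 KB = 48 KiB

# Total on-chip SRAM implied by the per-core figure.
total_sram_bytes = CORES * SRAM_PER_CORE_BYTES
total_sram_gb = total_sram_bytes / 1e9   # decimal gigabytes

WSE3_MEMORY_BANDWIDTH = 21e15            # 21 PB/s, from the article
QUOTED_RATIO = 2_625                     # "2,625 times that of the B200"

# Per-package HBM bandwidth that the quoted ratio implies for the B200.
implied_b200_bandwidth = WSE3_MEMORY_BANDWIDTH / QUOTED_RATIO

print(f"Total SRAM: {total_sram_gb:.1f} GB")
print(f"Implied B200 bandwidth: {implied_b200_bandwidth / 1e12:.0f} TB/s")
```

The per-core figure multiplies out to about 44.2 GB, consistent with the quoted 44GB, and the 2,625x ratio corresponds to 8 TB/s of HBM bandwidth per B200 package.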

In Cerebras' architecture, model weights are not kept resident in on-chip SRAM; they reside in the off-chip MemoryX unit and are streamed onto the wafer layer by layer. This works because the storage of neural network weights is separated from the compute units.

All model weights are stored externally in the MemoryX memory expansion module. Weights required for each layer of the network are transmitted on-demand, layer by layer, to the CS-3 system. These weights are stored in the DRAM and flash memory of MemoryX and are transferred to the CS-3 system at full bandwidth. The weights are never stored within the CS-3 system—not even in temporary caches—and the CS-3 relies on its core dataflow architecture to perform computations.
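The layer-by-layer flow described above can be illustrated with a toy sketch. It is not Cerebras code: the dictionary standing in for MemoryX, the function names, and the tiny matrices are all hypothetical. The point is only that the compute side holds one layer's weights at a time and keeps nothing once that layer is done.

```python
from typing import Dict, Iterator, List

Matrix = List[List[float]]


def stream_layers(external_store: Dict[int, Matrix]) -> Iterator[Matrix]:
    """Yield one layer's weights at a time, in layer order.
    The dict is a stand-in for an external weight store."""
    for layer_index in sorted(external_store):
        yield external_store[layer_index]


def matvec(weights: Matrix, activations: List[float]) -> List[float]:
    """Plain matrix-vector product for a single layer."""
    return [sum(w * a for w, a in zip(row, activations)) for row in weights]


def forward(external_store: Dict[int, Matrix], inputs: List[float]) -> List[float]:
    """Run the network while keeping only activations on the compute side."""
    activations = inputs
    for weights in stream_layers(external_store):
        # Only the current layer's weights are referenced here; nothing
        # is cached once the loop moves on to the next layer.
        activations = matvec(weights, activations)
    return activations


# Hypothetical two-layer network.
store = {0: [[1.0, 2.0], [0.0, 1.0]], 1: [[1.0, 1.0]]}
output = forward(store, [1.0, 1.0])
print(output)  # [4.0]
```

Only the activations persist between layers, which is why the size of the model is decoupled from the amount of memory on the compute side.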

Cerebras, with its wafer-scale architecture, demonstrates a decisive advantage in LLM inference constrained by memory bandwidth. During token-by-token generation, weights are streamed layer by layer from external MemoryX to the CS-3, achieving a token throughput 1.5 to 5 times higher than NVIDIA’s B200 across various models.

Figure: Token throughput of NVIDIA DGX B200 GPUs versus Cerebras CS-3 systems across different large models

Its key advantage lies in the CS-3's 44 GB of on-chip SRAM, which delivers an ultra-high bandwidth of 21 PB/s (2,625 times that of the B200), together with a 214 Pb/s on-wafer fabric, freeing weight streaming from HBM interface limitations. As a result, it excels in TTFT (Time To First Token, the time from request initiation to the model returning the first token), long-context processing, and agent workloads.
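For readers unfamiliar with the metric, TTFT can be measured against any streaming endpoint by timing the arrival of the first token. The sketch below uses a hypothetical `fake_model` generator in place of a real inference service; both function names are illustrative.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_ttft(token_stream: Iterable[str]) -> Tuple[float, str]:
    """Return (seconds until the first token arrives, the first token)."""
    start = time.perf_counter()
    first_token = next(iter(token_stream))
    return time.perf_counter() - start, first_token


def fake_model(prefill_seconds: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming inference endpoint.
    The sleep imitates the work done before the first token is ready."""
    time.sleep(prefill_seconds)
    yield "Hello"
    yield "world"


ttft, token = measure_ttft(fake_model(0.05))
print(f"First token {token!r} after {ttft * 1000:.0f} ms")
```

Because the generator does no work until its first `next()` call, the simulated prefill delay falls inside the timed region, just as request processing would with a real service.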

Although the weights live externally in MemoryX and are loaded layer by layer on demand without on-chip caching, the CS-3 performs fully lossless FP16 computation within SRAM using its core dataflow architecture; thanks to linear performance scaling, it delivers remarkable total throughput even under concurrent multi-user inference.

In addition to bandwidth, there are advantages in power efficiency. Recently, Liu Sheng, Chairman of InnoLight, mentioned that customers require optical modules to achieve 1 pJ/bit, whereas the current level is 10 pJ/bit. In Cerebras chips, the interconnect power consumption is only 0.15 pJ/bit, compared to 10 pJ/bit for current GPUs.

Figure: Bandwidth and power consumption of Cerebras interconnect versus GPU interconnect architectures
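To put the pJ/bit figures in perspective, energy per bit multiplied by bits per second gives the power spent purely on moving data. The sustained traffic of 1 TB/s used below is a hypothetical workload chosen for illustration, not a number from the article.

```python
def interconnect_power_watts(pj_per_bit: float, bytes_per_second: float) -> float:
    """Power drawn by data movement alone: energy per bit times bit rate."""
    bits_per_second = bytes_per_second * 8
    return pj_per_bit * 1e-12 * bits_per_second


TRAFFIC = 1e12  # hypothetical sustained traffic: 1 TB/s

gpu_watts = interconnect_power_watts(10.0, TRAFFIC)      # current GPU / optical level
target_watts = interconnect_power_watts(1.0, TRAFFIC)    # customers' stated target
wafer_watts = interconnect_power_watts(0.15, TRAFFIC)    # Cerebras on-wafer figure

print(f"10 pJ/bit:   {gpu_watts:.1f} W")
print(f"1 pJ/bit:    {target_watts:.1f} W")
print(f"0.15 pJ/bit: {wafer_watts:.1f} W")
```

At the same traffic level, the article's figures imply that on-wafer interconnect uses about 1/67 of the energy of a 10 pJ/bit link.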

Thus, if Cerebras’ wafer-scale large chip architecture becomes mainstream for AI inference and even training, it could significantly suppress and structurally alter the shipment volumes of traditional optical modules and CPO (Co-Packaged Optics). The core logic is this: the high demand for optical modules and CPO fundamentally stems from addressing bandwidth bottlenecks in GPU clusters related to “inter-chip interconnects” and “inter-node interconnects”; whereas Cerebras’ architecture solves this by eliminating distributed interconnects altogether.

Counterintuitive: The Real and Imagined Flaws of Wafer-Scale Chips

The core of any chip always lies in trade-offs. Cerebras' pursuit of extreme on-chip SRAM bandwidth has also introduced some challenges.

Low yield?

On the contrary, the size of each AI core has been reduced to just 0.05 square millimeters (1% of the size of a single compute core in the H100), resulting in higher yield. On-chip routing enables defective cores to be disabled and bypassed, increasing defect tolerance by 100 times compared to traditional multi-core processors. Although the entire chip contains one million AI cores, the advertised count is 900,000 AI cores after accounting for yield.
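The yield argument can be illustrated with the textbook Poisson yield model, Y = exp(-D × A). The model, the defect density of 0.1 defects per cm², and the 800 mm² monolithic die used for comparison are all assumptions for illustration; the article gives only the core area and the core counts.

```python
import math


def poisson_yield(defects_per_mm2: float, area_mm2: float) -> float:
    """Probability that a region of the given area contains no defect."""
    return math.exp(-defects_per_mm2 * area_mm2)


DEFECT_DENSITY = 0.001        # hypothetical: 0.1 defects/cm^2 = 0.001 per mm^2
CORE_AREA_MM2 = 0.05          # from the article
PHYSICAL_CORES = 1_000_000    # from the article
ADVERTISED_CORES = 900_000    # from the article

# Expected number of cores hit by a defect across the whole wafer.
p_core_bad = 1 - poisson_yield(DEFECT_DENSITY, CORE_AREA_MM2)
expected_bad_cores = PHYSICAL_CORES * p_core_bad
spare_cores = PHYSICAL_CORES - ADVERTISED_CORES

# For contrast: a hypothetical 800 mm^2 monolithic die with no redundancy.
large_die_yield = poisson_yield(DEFECT_DENSITY, 800.0)

print(f"Expected defective cores: {expected_bad_cores:.0f} of {PHYSICAL_CORES:,}")
print(f"Spare cores available:    {spare_cores:,}")
print(f"Yield of an 800 mm^2 die without redundancy: {large_die_yield:.0%}")
```

Under these assumptions only a few dozen cores are expected to be defective, far fewer than the 100,000 spares, whereas a large die with no way to route around defects would be discarded more than half the time.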

Good at inference, but not at training?

In the first few years after Cerebras' founding, training was the industry's dominant focus, so the company invested heavily in training; after demand for inference surged, however, it became clear that its advantages in inference were even more pronounced.

In fact, simplified distributed computing also brings a range of advantages, including reduced code complexity and lower communication overhead.

Training a 175-billion-parameter model on 4,000 GPUs typically requires approximately 20,000 lines of distributed training code.

Cerebras achieved the equivalent training with 565 lines of code: the entire model is handled by a single wafer, with no need for the complexity of partitioning it across devices.

SRAM scaling is dead: the core advantage faces physical limits.

The third-generation product is built on TSMC’s 5nm process, and its SRAM capacity is only 10% higher than that of the second-generation product based on TSMC’s 7nm process; after 5nm, SRAM cell area hardly shrinks further with process advancements.

This means Cerebras can no longer significantly increase its core advantage (SRAM capacity) by upgrading TSMC’s process technology, such as moving from 5nm to 3nm, as it did in the past.

Limited by wafer size, thermal dissipation, and manufacturing costs, on-chip SRAM and other memory resources cannot scale linearly in tandem with computational cores, creating a bottleneck in resource allocation. This has nearly blocked its path of evolution.

Figure: Cerebras third-generation product specifications

The triple trial of cooling, manufacturing, and ecosystem.

The entire wafer generates concentrated heat with high thermal flux density, requiring customized data centers and dedicated liquid cooling systems. Additionally, limited ecosystem compatibility means customers must adapt to its proprietary software stack, resulting in weak compatibility with existing general-purpose programming frameworks like CUDA and high costs for software porting and adaptation.

Low off-chip bandwidth creates expansion "islands."

Due to limitations in wafer-level physical design, the WSE has an extremely limited number of I/O pins, resulting in an I/O bandwidth of only 150 GB/s. NVIDIA's NVLink, by comparison, offers 1.8 TB/s of bidirectional bandwidth, roughly 12 times as much, which severely hampers the WSE's ability to scale out at high speed. Although Cerebras' SwarmX interconnect performs reasonably well in linking multiple systems, the extremely low off-chip bandwidth becomes a structural physical bottleneck for ultra-large models that require high-speed multi-chip interconnects.
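A quick calculation shows what the gap means in practice. The 140 GB payload is a hypothetical example (roughly a 70-billion-parameter model in FP16), and the NVLink figure is the bidirectional number quoted above, so the comparison is indicative rather than exact.

```python
def transfer_seconds(payload_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time to move a payload across a link at its full rated bandwidth."""
    return payload_bytes / bandwidth_bytes_per_s


WSE_IO_BANDWIDTH = 150e9      # 150 GB/s, from the article
NVLINK_BANDWIDTH = 1.8e12     # 1.8 TB/s bidirectional, from the article
PAYLOAD = 140e9               # hypothetical: ~70B parameters in FP16

bandwidth_ratio = NVLINK_BANDWIDTH / WSE_IO_BANDWIDTH
wse_seconds = transfer_seconds(PAYLOAD, WSE_IO_BANDWIDTH)
nvlink_seconds = transfer_seconds(PAYLOAD, NVLINK_BANDWIDTH)

print(f"NVLink / WSE I/O bandwidth: {bandwidth_ratio:.0f}x")
print(f"140 GB over WSE I/O: {wse_seconds:.2f} s")
print(f"140 GB over NVLink:  {nvlink_seconds:.3f} s")
```

Moving the same payload takes close to a second over the wafer's external I/O versus under a tenth of a second over NVLink, which is why anything that must leave the wafer becomes the limiting step.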

The Battle of Roadmaps: How Much Time Remains in Cerebras' Window of Opportunity?

Large companies are not relying solely on wafer-scale solutions to address the need for higher bandwidth and lower latency in inference; they are simultaneously pursuing three parallel paths to encircle the technological advantages of startups.

① Proprietary ASIC chip

Google's TPU v8 has been split into training-specific and inference-specific versions; AWS Trainium 4 is on the way; Microsoft's Maia is already in use internally on Azure, built on TSMC's 3nm process, featuring native FP8/FP4 tensor cores, a redesigned memory system, and 216GB of HBM3e with 272MB of on-chip SRAM; even Anthropic has begun evaluating its own inference chip.

This path is highly likely to materialize, and it would directly compress the ceiling of the third-party inference procurement TAM (total addressable market) by 10% to 25% by 2028.

② Generalization of the standard packaging process

This is the most direct blow to Cerebras.

TSMC's System-on-Wafer (SoW) is now widely available to customers, and the CoWoS 9.5x interposer is scheduled to launch in 2027.

What these two products do—stitching multiple dies at the wafer level—is essentially to generalize and democratize Cerebras’s physical manufacturing process.

NVIDIA's Vera Rubin will enter this ecosystem in the second half of 2026.

Although Cerebras' proprietary cross-reticle stitching is currently exclusive, the exclusivity window lasts at most two to three years; after 2027–2028, its process advantages will be eroded by TSMC’s advanced packaging technologies.

③ Breakthrough in Optical Interconnects/Optical Computing

The interconnects of electronic chips and the memory wall have reached their limits; photonics, with its high bandwidth, low latency, and zero crosstalk, is the ultimate solution.

The optical pathway represented by Lumentum is on the rise. The greatest advantage of wafer-scale is on-chip computing, but as models inevitably grow larger, high-speed interconnects beyond wafer scale become essential.

As CPO (Co-Packaged Optics) and optical interconnects mature, we are very likely to see optical I/O integrated directly into WSE wafers, breaking the constraints of electrical interconnects; NVIDIA may also acquire companies with specific architectural advantages, such as Groq, and combine them with optical interconnects to develop wafer-scale systems compatible with its existing super-node software.

Running on the Cliff: Cerebras’ Business and Delivery

Cerebras is currently running along the edge of a cliff, its pace forced by massive orders.

Transactions with major clients like OpenAI have forced Cerebras to transform from a chip company into a new type of cloud service provider. It is no longer just selling hardware but must rapidly secure and build massive data center power and infrastructure.

According to the contract, Cerebras is required to deliver 250 MW of data center capacity annually from 2026 to 2028. However, wafer-scale systems have extremely high facility requirements and cannot be directly accommodated in traditional air-cooled IDCs. Currently, Cerebras is significantly behind schedule in preparing the required data center capacity.

From silicon fabrication to factory construction, from power approval to cooling system deployment, this is a capital-intensive, long-cycle quagmire.

Epilogue: Left or Right?

Returning to the original proposition: now that the inflection point for inference compute has arrived, the core of compute architecture remains, as always, a matter of trade-offs.

There is no absolute right or wrong—only relative optimal solutions under the heaviest load. The load is already changing.

Cerebras went left, opting for extreme physical optimization—trading an entire wafer and massive amounts of SRAM for unparalleled low latency on single tasks, making it unbeatable in scenarios where first-token latency is critically sensitive.

NVIDIA went right, maintaining versatility by leveraging HBM, NVLink, and massive cluster throughput to handle diverse workloads with a consistent approach.

The winds are rising, the clouds are swirling, and the path ahead is uncertain. It is precisely this dual uncertainty of technology and business that gives rise to the potential for disruption. In the torrent of computing power leading toward AGI, it is still too early to draw conclusions—because uncertainty is where opportunity lies.

This article is from the WeChat public account "Garlic Granule Lab," authored by Thunderbolt Ranger.