DeepSeek’s Strategy: Building a $10 Trillion AI Hardware Ecosystem

DeepSeek's $10 trillion grand strategy

Original author: @bookwormengr

Peggy, BlockBeats

Editor’s Note: Over the past year, discussions around DeepSeek have largely focused on model performance, open-source strategy, and price wars. But if you understand DeepSeek only through the lens of “whether it offers subscriptions,” “whether it has multimodal capabilities,” or “whether it can act as a coding agent,” you may be underestimating what it truly aims to change.

This article presents a more radical perspective: DeepSeek’s goal may not be short-term monetization through the application layer, but rather to reshape the cost structure of AI training and inference through a series of foundational architectural innovations, indirectly fostering the emergence of a new hardware ecosystem. From MoE and MLA to DSA, CSA, mHC, Engram, and onward to Dual Path and TileLang, DeepSeek’s technical roadmap consistently revolves around one core question: how to run stronger models with less high-end compute, given constraints on HBM, advanced process nodes, packaging, and the CUDA ecosystem.

What’s most noteworthy in this article isn’t whether DeepSeek can generate hundreds of millions of dollars through APIs or subscriptions, but whether it’s integrating model capabilities, memory architectures, and domestic hardware ecosystems together. KV Cache compression reduces reliance on HBM, NAND and SSD can handle long-term caching, LPDDR can be used for weight streaming and Engram storage, and TileLang aims to weaken the CUDA moat. If these innovations continue to spread, the beneficiaries won’t just be DeepSeek itself, but also storage, ASICs, GPUs, network chips, and the entire AI infrastructure chain.

Of course, the claims regarding a "$10 trillion industry ecosystem" and a "$1 trillion valuation" remain highly speculative. However, they offer an important lens for understanding DeepSeek: open source does not necessarily mean abandoning commercialization, and low pricing does not always equate to market subsidies. For DeepSeek, the real business may not lie at the application layer, but in enabling more hardware to become usable and making lower-cost AI supply possible. In other words, it may not be selling the models themselves, but the viability of the next-generation AI infrastructure.

The following is the original text:

Have you ever wondered how DeepSeek plans to make money—and possibly make a lot of it?

It has not launched competitive coding subscription plans like GLM, Moonshot, and MiniMax; nor does it offer multimodal, audio, or video models. So far, it hasn’t even developed its own harness (the outer runtime framework for model invocation, tool integration, and task execution), though it has recently begun hiring for roles to build one.

Meanwhile, DeepSeek also appears to have long-standing, firm support for open source, even willingly sharing its own “secrets.” Isn’t that crazy? Isn’t it just burning money? Are those investors preparing to invest $10 billion throwing their money down the drain?

I personally believe the answer is exactly the opposite.

Next, I’ll offer some observations based on what DeepSeek has accomplished so far and analyze the strategy it appears to be following. DeepSeek’s CEO, Liang Wenfeng, may have ambitions that extend far beyond the current model competition. He may be targeting something much larger: an opportunity for DeepSeek to reach a $1 trillion valuation while helping to catalyze a new industry worth $10 trillion.

TechInAsia's report on DeepSeek's latest funding round

Revisiting DeepSeek's "Hero's Journey"

DeepSeek has been sailing against the wind. Instead of continuously releasing slightly stronger models and rushing to package them into directly monetizable applications, such as coding subscription plans, it has taken a different path. On January 27, 2025, I posted a widely shared tweet outlining what I saw as DeepSeek’s “hero’s journey.” Today, this story has become even more compelling.

While others were still trying to build dense models, DeepSeek chose the more difficult-to-train Mixture of Experts (MoE) model.

They applied a first-principles approach to invent a new GRPO algorithm, replacing the then-dominant but more costly-to-implement PPO reinforcement learning algorithm.

They found that Reinforcement Learning with Verifiable Rewards (RLVR) is a key strategy for enhancing model reasoning capabilities.

They also proposed "Multi-Token Prediction," which makes training signals denser and doubles as a simple speculative decoding strategy.

They refined the "zero-bubble" pipeline to improve the utilization of limited GPU resources.

They have released an expert load balancer that makes it easier for everyone to deploy MoE models. In particular, through the "Wide Expert Parallel" strategy, models can be served with larger batch sizes, significantly reducing inference costs.

They developed mechanisms such as MLA, DSA, CSA, and HCA to reduce the demand for KV Cache and keep the increasing computational requirements associated with longer context lengths as close to constant as possible.

They invented Engram, trading memory for computational efficiency.

They also invented mHC, enabling stable training even as model scale increases. There are many similar examples.
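Of the items above, GRPO's central idea is compact enough to sketch. Instead of training a separate value network as PPO does, it samples a group of answers for the same prompt and scores each answer relative to the group. The snippet below is a minimal illustration of that group-relative advantage only; the function name and the example rewards are my own assumptions, and the full algorithm (clipped policy ratio, KL penalty) is omitted.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical example: four sampled answers to one prompt,
# scored 1.0 if a verifier accepts the answer and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Answers that beat the group average get a positive advantage and the rest get a negative one, so no learned critic is needed to provide a baseline.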

In the most universal narrative structure, the hero’s journey, the hero never begins with a clear understanding of where the journey will lead. Instead, through learning along the way, they gradually discover their true, great purpose—and fulfill it despite overwhelming obstacles. They encounter many doubters, yet choose to ignore them. They also face numerous malicious actors. They have clear flaws or weaknesses, but ultimately overcome them to complete their mission. They confront seemingly insurmountable challenges, yet find ways to form alliances and learn how to wisely use limited and precious resources. It is precisely this that inspires audiences to root for the hero. This is also why DeepSeek has won followers, global respect, and adversaries.

As I will detail shortly, DeepSeek has been on this path for a long time and has gradually discovered its ultimate destiny: its goal is not to sell coding subscriptions, but to drive a $10 trillion Chinese AI hardware ecosystem and achieve a $1 trillion valuation. In doing so, it will also create opportunities for many new entrants in the Western hardware ecosystem.

Start with some interesting KV Cache calculations

Please see this timely tweet from @SemiAnalysis_:

DeepSeek has already solved this problem better than anyone else!

Let’s start with some fun KV Cache calculations. Don’t worry—even if you’re not into math, we’ll use the recently released KV Cache calculator to see how much KV Cache savings DeepSeek V4 Pro offers, and compare it with the latest GLM and Qwen models.

Here, I assume a context length of 1 million tokens, 8-bit KV precision, and 16-bit indexer precision. You can try the calculator yourself: https://kvcache.ai/tools/kv-cache-calculator/

With a context length of 1 million:

·DeepSeek V4 Pro requires only 5.48 GB of HBM;

·GLM-5 requires 60 GB of HBM;

·Qwen3-235B-A22B requires up to 89 GB of HBM.

Note that:

·DeepSeek V4 Pro is a 1.6 trillion parameter model;

·GLM-5 has approximately 700 billion parameters and has adopted DeepSeek's MLA and DSA, but has not yet implemented the latest compressed attention mechanism;

·Qwen3-235B-A22B has approximately 235 billion parameters and uses the GQA attention mechanism.

DeepSeek has made foundational contributions to alleviating memory pressure. If such innovations are widely adopted, they will significantly reduce the operational costs of long-context agents and unlock the next wave of new use cases.
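For readers who want to see where numbers like these come from, here is a rough back-of-the-envelope sketch. It is not the calculator's code. The GQA formula is the standard one (one key and one value vector per KV head, per layer, per token). The Qwen3-235B-A22B shape used here (94 layers, 4 KV heads, head dimension 128) is my assumption based on the public model configuration, and the "latent" shape is purely hypothetical, included only to show how storing one compressed vector per token changes the arithmetic.

```python
GIB = 1024 ** 3
TOKENS = 1_000_000

def gqa_kv_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=1):
    """KV cache size for grouped-query attention: K and V per KV head, per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

def latent_kv_bytes(tokens, layers, latent_dim, bytes_per_value=1):
    """KV cache size when each layer stores one compressed latent vector per token."""
    return layers * latent_dim * bytes_per_value * tokens

# Assumed shape for Qwen3-235B-A22B, 8-bit KV values.
qwen = gqa_kv_bytes(TOKENS, layers=94, kv_heads=4, head_dim=128)

# Hypothetical latent-cache model with the same depth (not a real configuration).
latent = latent_kv_bytes(TOKENS, layers=94, latent_dim=576)

print(f"GQA cache:    {qwen / GIB:.1f} GiB")
print(f"Latent cache: {latent / GIB:.1f} GiB")
```

Under these assumptions the GQA figure lands at about 89.6 GiB, in line with the 89 GB quoted above. Reaching single-digit gigabytes requires further reductions on top of a latent cache, such as compressing along the token dimension, which this sketch does not model.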

Comparison of KV Cache occupancy under 1 million tokens context and model size

The methodology behind the "madness"

The reason KV Cache can be so small without sacrificing model quality is precisely why DeepSeek can offer long-term caching at an extremely low price—less than 3% of Sonnet 4.6’s cache hit price—while retaining caches for hours.

For long-duration tasks, a smaller KV Cache enables more cost-effective offloading to SSD and reloading when needed, reducing reliance on HBM. From the perspective of China’s AI hardware industry, HBM is not only in short supply but also one of the most difficult types of memory to manufacture.

In addition, DeepSeek has developed technology to load the KV Cache faster from SSD, as described in their Dual Path paper.

DeepSeek V4 compresses the KV Cache to such a large extent that this step may no longer be necessary.

So, who benefits most directly from KV cache compression?

Who is supplying SSDs at scale? Don’t forget that YMTC (Yangtze Memory Technologies) is emerging as a giant in the 3D NAND space. NAND can help DeepSeek avoid redundant KV computations. In turn, DeepSeek is creating a massive market for NAND and SSDs—benefiting not only YMTC but also other related manufacturers.

However, this is not just about NAND and SSD.

LPDDR memory also has significant potential. It can serve as a storage location for model weights and stream these weights into HBM as needed, thereby alleviating pressure on HBM. The SGLang team has published an excellent blog post detailing this approach. The diagram below illustrates how this solution works.

Although DeepSeek was not specifically designed for this solution, its MoE architecture, inherent large number of expert models, and 4-bit weight characteristics make this solution easier to implement.

This diagram illustrates how memory may be utilized and how model weights are streamed from LPDDR to HBM. We strongly recommend reading SGLang’s blog post.
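To make the weight-streaming idea concrete, here is a toy sketch of a two-tier expert store. It is not SGLang's implementation: the class name, the least-recently-used eviction policy, and the dictionaries standing in for LPDDR and HBM are all my own assumptions. It only illustrates why an MoE model, which touches a few experts per token, suits a small fast tier backed by a large slow one.

```python
from collections import OrderedDict

class ExpertCache:
    """Keeps recently used experts in a small fast tier (a stand-in for HBM)."""

    def __init__(self, capacity, slow_tier):
        self.capacity = capacity
        self.slow_tier = slow_tier      # stand-in for LPDDR: expert_id -> weights
        self.fast_tier = OrderedDict()  # expert_id -> weights, least recent first
        self.loads = 0                  # transfers from the slow tier so far

    def get(self, expert_id):
        if expert_id in self.fast_tier:
            self.fast_tier.move_to_end(expert_id)  # mark as most recently used
            return self.fast_tier[expert_id]
        if len(self.fast_tier) >= self.capacity:
            self.fast_tier.popitem(last=False)     # evict the least recently used
        weights = self.slow_tier[expert_id]        # the "stream into HBM" step
        self.fast_tier[expert_id] = weights
        self.loads += 1
        return weights

# Eight hypothetical experts, room for only two in the fast tier.
slow = {i: f"weights-{i}" for i in range(8)}
cache = ExpertCache(capacity=2, slow_tier=slow)
for expert_id in [0, 1, 0, 2, 0, 1]:
    cache.get(expert_id)
print(cache.loads, list(cache.fast_tier))
```

The fewer bytes each expert occupies (for example with 4-bit weights), the cheaper each miss becomes, which is why the low-precision weights mentioned above matter for this scheme.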

This innovation, when combined with an extremely compact and lossless KV Cache, will significantly reduce the demand for HBM.

So, who in China is producing LPDDR? The answer is CXMT (ChangXin Memory Technologies). They are only about half a generation behind in LPDDR speed and one generation behind in density, so the gap is not significant.

In addition to ample NAND, China’s AI ecosystem will also have sufficient LPDDR supply in the near future. Can this alleviate computing power pressure? The answer is: yes. Keep reading.

Intelligent memory usage can also reduce the burden on GPU/ASIC.

Using NAND to store the KV Cache is easy to understand: it allows the KV Cache to be retained longer, reduces pressure on HBM, and avoids recomputing the KV Cache, thereby easing the computational load on GPUs and ASICs.

So, can LPDDR also function in a similar way? In addition to serving as a storage location that can "on-demand, instantly" stream weights to HBM, can it further reduce computational pressure?

The answer is: Yes.

LPDDR can be used to store large amounts of content known as Engrams. In DeepSeek’s Engram paper, they note that MoE can expand model capacity through conditional computation, but the Transformer itself lacks a native “knowledge retrieval” mechanism. As a result, Transformers often have to inefficiently simulate retrieval through computation.

To address this issue, DeepSeek introduced the Engram module, which modernizes the classic N-gram embedding into a hash-based O(1) lookup mechanism, creating a complementary sparsification pathway they call conditional memory.

This approach saves computation but requires memory to store the embedding table, which itself can be very large.

Essentially, this is a classic "trade memory for computation" approach. But its key insight is that, in terms of cost per bit of data accessed, the "memory" side is significantly cheaper—a single LPDDR lookup is far less expensive than passing the full data through multiple layers of a Transformer for one forward pass. Therefore, at scale, this is an extremely favorable trade-off.
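A toy version of the lookup makes the trade concrete. The sketch below is only an illustration of a hash-based N-gram table, not the Engram module itself, which has more machinery than this. The table size, embedding width, hash constants, and function names are all hypothetical.

```python
import random

TABLE_ROWS = 4096   # hypothetical; a real table would be far larger
EMBED_DIM = 4       # tiny on purpose, for illustration

# Stand-in for a learned embedding table that would live in cheap memory.
_rng = random.Random(0)
TABLE = [[_rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(TABLE_ROWS)]

def ngram_slot(token_ids, table_rows=TABLE_ROWS):
    """Map an N-gram of token ids to a table row with a polynomial rolling hash."""
    h = 0
    for t in token_ids:
        h = (h * 1_000_003 + t + 1) % (1 << 61)
    return h % table_rows

def lookup(context, n=3):
    """Fetch the embedding for the last n tokens: one hash plus one table read."""
    return TABLE[ngram_slot(context[-n:])]

context = [17, 942, 5, 88, 1203]
vec = lookup(context)
print(ngram_slot(context[-3:]), vec)
```

The cost of `lookup` does not depend on the table size or on how long the context is, which is the O(1) property the text refers to. The price is memory for the table and occasional hash collisions between unrelated N-grams.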

This is how DeepSeek trades off some memory to save on computation.

Worthwhile trade-offs

Due to lower transistor density in chips and the absence of EUV, Chinese GPUs and ASICs are likely to lag behind Western GPUs in raw FLOPs for the long term. They also still have a clear gap in advanced packaging. Therefore, this trade-off is very worthwhile, especially given China’s ability to mass-produce NAND and LPDDR memory.

Review DeepSeek's long-term strategy

Looking at these innovations, DeepSeek’s goal does not appear to be generating hundreds of millions in profits right now. Many of its past decisions reflect this: it still lacks multimodal capabilities, voice models, and certainly video models.

What it is truly engaged in is a patient, long-term game with a potential scale of $10 trillion: fostering the development of an alternative AI hardware ecosystem.

This is not only about enabling Chinese memory manufacturers to become key players in China’s and the global AI hardware market, but also about fundamentally reducing resource requirements and making AI model training and services more cost-efficient. As a result, many GPU, ASIC, and network chip manufacturers will have the opportunity to become viable options.

Meanwhile, these innovations will also benefit the Western open-source ecosystem and the next generation of hardware manufacturers.

All the signs have actually already appeared. Let’s take a closer look at the innovations DeepSeek has introduced so far:

1. The Mixture of Experts (MoE) and MLA introduced in DeepSeek V2

DeepSeek introduced MoE and MLA in V2. MoE reduces the computational cost required to train high-intelligence models by approximately 40% to 50%; MLA reduces the KV Cache by 90%.

This makes offloading the KV Cache to SSD quite efficient.

These ideas first appeared in the DeepSeek V2 paper, released in May 2024, and later formed the foundation for training DeepSeek V3. At the time, DeepSeek trained a system with performance approaching that of proprietary models using only 2,048 H800 GPUs, the export-compliant variant of the H100 with reduced interconnect bandwidth.
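The compute saving in MoE comes from running only a few experts per token. Here is a minimal, generic top-k routing sketch to show the mechanism. It is not DeepSeek's router, which differs in details such as shared experts and load balancing; the function names and the toy "experts" are assumptions for illustration.

```python
import math

def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their softmax weights."""
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))

# Four toy experts that just scale their input; only two are evaluated per call.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
y = moe_forward(10.0, experts, router_logits=[0.1, 2.0, -1.0, 2.0], k=2)
print(y)
```

With k fixed, the per-token compute stays the same no matter how many experts the model holds, which is how total parameter count can grow without a matching growth in FLOPs.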

2. DSA: Introduced in DeepSeek V3.2 Exp to reduce computational overhead in long-context scenarios and alleviate HBM bandwidth pressure.

The core function of DSA is to ensure that computational load does not continuously increase with longer context lengths. See the chart below: as context length increases, DeepSeek-V3.2's processing time remains largely stable.
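The general mechanism behind this kind of sparse attention can be sketched in a few lines: a cheap scoring pass ranks the past tokens, and full attention runs only over the top k. The code below is a simplified illustration, not DeepSeek's kernel. Here `index_scores` stands in for whatever the real indexer computes, and all names and values are hypothetical.

```python
import math

def sparse_attention(query, keys, values, index_scores, k):
    """Softmax attention restricted to the k tokens with the highest indexer scores."""
    keep = sorted(range(len(keys)), key=lambda i: index_scores[i], reverse=True)[:k]
    scale = math.sqrt(len(query))
    logits = [sum(q * x for q, x in zip(query, keys[i])) / scale for i in keep]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    out = [
        sum(w / total * values[i][d] for w, i in zip(weights, keep))
        for d in range(len(values[0]))
    ]
    return out, keep

# Six past tokens with identical keys, so the kept tokens get equal weight.
keys = [[1.0, 0.0] for _ in range(6)]
values = [[float(i)] for i in range(6)]
index_scores = [0.0, 5.0, 1.0, 4.0, 2.0, 3.0]
out, keep = sparse_attention([1.0, 1.0], keys, values, index_scores, k=2)
print(keep, out)
```

The expensive part (dot products with keys, softmax, weighted sum of values) touches only k tokens regardless of context length. The scoring pass still has to look at every token, so it needs to be much cheaper per token than full attention for the scheme to pay off.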

3. mHC: Proposed by DeepSeek in the December 2025 paper titled “mHC: Manifold-Constrained Hyper-Connections”.

mHC is an innovation by DeepSeek at the macro-architecture level, redesigning how information flows between Transformer layers.

In the past, since ResNet, models have typically used standard residual connections, i.e., x + F(x). mHC, however, extends the residual stream into multiple parallel information channels and allows the model to learnably mix between these channels. Crucially, it constrains the mixing matrix to be doubly stochastic by projecting it onto the Birkhoff polytope via the Sinkhorn-Knopp algorithm. This ensures mathematically that signal amplitude remains stable regardless of how deep the model is stacked.
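The stability claim is easy to check numerically. The sketch below runs a plain Sinkhorn-Knopp iteration (alternating row and column normalization of a positive matrix) and then applies the resulting mixing matrix 100 times, as a stand-in for 100 stacked layers. It illustrates only the doubly stochastic constraint, not the mHC layer; the matrix values, iteration count, and function names are assumptions.

```python
import math

def sinkhorn_knopp(scores, iters=50):
    """Turn a square matrix of real scores into an approximately doubly stochastic one."""
    m = [[math.exp(v) for v in row] for row in scores]  # make every entry positive
    n = len(m)
    for _ in range(iters):
        m = [[v / sum(row) for v in row] for row in m]                 # rows sum to 1
        col = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col[j] for j in range(n)] for i in range(n)]   # columns sum to 1
    return m

def mix(matrix, streams):
    """Mix parallel residual streams (scalars here) with the given matrix."""
    return [sum(w * s for w, s in zip(row, streams)) for row in matrix]

mixing = sinkhorn_knopp([[0.5, -1.0, 2.0], [1.5, 0.0, -0.5], [-2.0, 1.0, 0.3]])
streams = [3.0, -1.0, 4.0]
for _ in range(100):  # stand-in for 100 stacked layers
    streams = mix(mixing, streams)
print(streams, sum(streams))
```

Because every row and column sums to 1 and all entries are non-negative, each output is a weighted average of the inputs: the total across streams is preserved and no single stream can exceed the largest input in magnitude, however many times the mixing is repeated.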

This resolves the catastrophic instability previously faced by unconstrained Hyper-Connections. Hyper-Connections were initially proposed by ByteDance, but without constraints, signal amplification surged to 3000x at a scale of 27 billion parameters, ultimately causing training to collapse entirely.

The computational cost of mHC is very low: it introduces only about a 6.7% overhead in actual training time, as it does not alter the FLOPs of the attention or FFN layers, but merely changes how their outputs are routed between layers.

However, the performance improvements are substantial: on the BIG-Bench Hard reasoning task with a 27-billion-parameter model, mHC achieves a 7.2-point gain; on DROP, a 3.2-point gain; on GSM8K math tasks, a 2.8-point gain; and on MMLU general knowledge tasks, a 1.4-point gain—all achieved with the same model size and nearly identical computational budget.

Essentially, mHC achieves higher intelligence per parameter by providing a richer, more expressive cross-layer information routing topology with almost no additional FLOPs.

mHC is a complex architectural design, but it enables a more stable training process and higher intelligence per parameter.

4. CSA, HCA: Introduced by DeepSeek in V4 in April 2026.

The goal of CSA and HCA is to further reduce KV Cache requirements by 90% by compressing KV tokens, while significantly lowering the required FLOPs, thereby alleviating pressure on both HBM and GPU/ASIC.

5. Engram: Introduced by DeepSeek in the first quarter of 2026, it essentially trades memory—specifically LPDDR memory—for improved computational efficiency.

As shown in the detailed chart below, Engram delivers a significant performance improvement under the same total parameter budget.

6. Hardware design advice: DeepSeek included recommendations for hardware manufacturers in its V4 paper. I’m confident they provided even more feedback during in-person discussions.

7. Investment in TileLang also points in the same direction: DeepSeek is not just addressing its own computing power bottlenecks, but is helping build China’s hardware ecosystem to compete with Western counterparts.

With TileLang, developers can write a kernel—the low-level code used for computation—just once and run it successfully across multiple hardware platforms, provided those platforms have corresponding TileLang backends.

I expect other Chinese AI labs will gradually join in. This will help Chinese hardware manufacturers indirectly address the so-called "CUDA moat." At the same time, it will unlock more potential from Western hardware, such as AMD.

It should be noted that many Chinese AI hardware platforms already offer CUDA compatibility or CUDA translation layers. For example, Moore Threads, MetaX, Biren, and Iluvatar CoreX are Chinese chip manufacturers that achieve high CUDA compatibility through translation layers. Therefore, theoretically, they do not necessarily require TileLang.

Large-scale reinforcement learning and RSI

As DeepSeek gains access to more computing resources—expanding its available hardware options—and its model itself requires fewer computational resources, it can pursue more ambitious training initiatives, particularly reinforcement learning fine-tuning.

Reinforcement learning requires generating a large number of trajectories, which means generating trillions of tokens. This process quickly becomes extremely expensive. Furthermore, training a model with a 1 million token context length requires generating trajectories of the same length. Only by training models on such ultra-long trajectories can they truly support long-horizon tasks.

In addition, with more hardware options available, DeepSeek will have access to greater hardware resources, which will drive automated research, known as RSI (recursive self-improvement). RSI refers to AI designing and executing experiments on its own. This approach involves extensive trial and error, so costs rise rapidly. However, RSI is essential for exploring the full model design space. Before reaching AGI, and subsequently ASI, DeepSeek must possess RSI capabilities.

What DeepSeek does today, the entire industry will follow tomorrow.

DeepSeek's innovations in areas such as mixture-of-experts models, MLA, and DSA have been gradually adopted by other AI labs globally and in China.

For example, Z.ai, the developer of the GLM series models, has adopted MLA and DSA. Moonshot AI, the developer of Kimi, has also implemented MLA and openly stated that its architecture is based on the DeepSeek architecture. In turn, DeepSeek uses the Muon optimizer, which Moonshot was the first to adopt in large-scale training.

It should be noted that:

The sparsely gated MoE layer was introduced by Google researchers in 2017, with Noam Shazeer as lead author; the underlying mixture-of-experts idea dates back to the early 1990s. DeepSeek's contribution lies in applying MoE at large scale and inventing its own supporting techniques.

Muon, or MomentUm Orthogonalized by Newton-Schulz optimizer, was proposed by machine learning researcher Keller Jordan in late 2024. The Kimi (Moonshot) team was the first to apply it to large-scale training.

What about the issue of making money?

We can look at the interesting example of OpenAI.

OpenAI received warrants or options to purchase AMD and Cerebras shares at a discounted price, tied to its computing consumption milestones. For AMD and Cerebras, this is an extremely favorable deal, as OpenAI’s commitment to using their hardware significantly increases their long-term success potential.

AMD's announcement includes the following passage:

As part of the agreement, to further align the strategic interests of both parties, AMD has issued warrants to OpenAI allowing the purchase of up to 160 million shares of AMD common stock, which will vest in tranches based on the achievement of specific milestones. The first tranche will vest upon completion of the initial 1 gigawatt deployment, with subsequent tranches vesting as procurement scales up to 6 gigawatts. Vesting is also contingent on AMD achieving certain stock price targets and OpenAI reaching the technical and commercial milestones required for AMD’s large-scale deployment.

I anticipate that DeepSeek will also reach similar agreements and engage in deep collaborations with multiple Chinese manufacturers of memory, ASICs, CPUs, and network technology stacks, enabling their hardware stacks to handle leading AI workloads.

Considering that the total market capitalization of AI stocks in the West, including East Asian allies, has already far exceeded $1 trillion, this approach of “gaining equity returns through collaboration” will give DeepSeek the opportunity to help China build a similarly massive industry and secure its share of the value, ultimately achieving a $1 trillion valuation.

This will not only allow DeepSeek to earn far more than traditional app subscription models, but also achieve its stated goal of “making AGI accessible to everyone.” Liang Wenfeng is a devoted fan of Jim Simons and a shrewd enough player in capital markets not to miss this opportunity.

If you look back at everything DeepSeek has done so far, this is the only explanation that makes sense.

These are key AI stocks. The chart does not yet include hyperscalers—large-scale cloud providers—or many other related companies.
