DeepSeek V4 Demonstrates Stable Performance on Domestic AI Chips

Summary: Reports indicate that DeepSeek V4 runs smoothly on domestic AI chips such as the Huawei Ascend 950 and Cambricon processors. The model employs CSA + HCA hybrid attention and KV cache compression to reduce resource requirements, and its MoE architecture limits the number of parameters activated during inference. Kernel- and service-layer optimizations improve speed and energy consumption, while competitive pricing, particularly for long-context and agent-based tasks, supports enterprise adoption.

By World Model Workshop

DeepSeek V4 has once again shaken the entire country.

Model size, context length, benchmark scores… these technical metrics have been repeatedly compared across various reports.

But if you only focus on the surface-level data, you’ll miss the core strategic significance of this release.

Over the past three years, China's large models have been trapped in an awkward reality: training relies on NVIDIA, and inference also relies on NVIDIA, with domestic chips serving only as a backup option.

If NVIDIA were to cut off supply, the entire Chinese model community would be left in a state of anxiety.

But today, DeepSeek V4 has proven with its capabilities:

A cutting-edge large model with trillions of parameters can also run stably and efficiently on domestic computing power.

The significance of this goes beyond the model's technical metrics themselves.

Breakthrough in Localization

To truly understand the difficulty of this domestic adaptation, you must first understand NVIDIA’s chip empire.

NVIDIA owns more than just chips—it has a highly integrated, end-to-end ecosystem:

On the hardware side, there is a family of GPU chips connected by a high-speed network using NVLink and NVSwitch;

On the software side, there is CUDA, the AI operating system that NVIDIA has meticulously built over the past decade.

It operates like a highly optimized factory, with every layer—from the fundamental operators (the basic units of model computation) to parallel computing, memory management, and distributed communication—tailored specifically for NVIDIA GPUs.

In other words, NVIDIA doesn’t just sell engines—it has also built the roads, gas stations, repair shops, and navigation systems.

Almost all of the world's top large models have grown on this ecosystem.

Switching to domestic computing power presents a completely different scenario.

Different hardware architectures, varying interconnection methods, differing levels of software stack maturity, and a tooling ecosystem that is still rapidly catching up.

DeepSeek’s attempt to adapt to domestic chips isn’t as simple as swapping out an engine—it’s like switching a race car already speeding on a highway to a mountain road still under construction.

A slight misstep can cause juddering, a loss of power, or even bring the whole vehicle to a standstill.

This time, DeepSeek V4 did not choose to continue optimizing solely along the CUDA path, but instead began simultaneously adapting to the software stack of domestic computing power.

Based on publicly available information, V4 has achieved a breakthrough with domestic inference chips, offering deep optimization for Huawei Ascend 950 chips and running stably on Cambricon chips on the day of the model release, truly achieving Day 0 compatibility.

This means that cutting-edge models are now beginning to have the potential to be deployed within China's domestic chip ecosystem.

How does DeepSeek V4 achieve this?

The first step occurs at the model architecture level.

V4 did not ask domestic chips to brute-force the 1M-token context directly; instead, it first made the model itself more efficient.

The most critical designs in the official technical report are the CSA + HCA hybrid attention mechanism and long-context optimizations such as KV cache compression.

In simple terms, traditional long-context inference forces the model to spread out an entire library every time it answers a question, quickly exhausting available VRAM, bandwidth, and compute.

V4's approach is to first re-index, compress, and filter the library's data, sending only the most critical information into the computation pipeline.

As a result, the 1M context no longer relies entirely on hardware brute force, but instead first reduces the computational and VRAM loads through algorithms.
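
The technical report describes CSA + HCA only at a high level, but the underlying KV-cache-compression idea can be illustrated with a minimal sketch. Everything below — the scoring rule, the top_k budget, the function names — is an illustrative assumption, not DeepSeek's actual design: cached keys are scored against the current query, and only the highest-scoring entries enter the attention computation.

```python
# Hypothetical sketch of KV-cache compression: rather than attending over
# every cached token, score cached keys against the current query and keep
# only the top-k entries. The scoring rule and names are illustrative,
# not DeepSeek's actual CSA/HCA design.
import torch
import torch.nn.functional as F

def compressed_attention(q, k_cache, v_cache, top_k=4096):
    """q: (heads, dim); k_cache, v_cache: (cache_len, heads, dim)."""
    if k_cache.shape[0] > top_k:
        # Cheap relevance score per cached position, summed over heads.
        scores = torch.einsum("hd,nhd->n", q, k_cache)
        keep = scores.topk(top_k).indices
        k_cache, v_cache = k_cache[keep], v_cache[keep]
    # Standard scaled dot-product attention over the surviving entries.
    logits = torch.einsum("hd,nhd->hn", q, k_cache) / q.shape[-1] ** 0.5
    return torch.einsum("hn,nhd->hd", F.softmax(logits, dim=-1), v_cache)
```

Under any scheme of this kind, per-token attention cost scales with the retained budget rather than the full cache length, which is exactly the algorithmic relief described above.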

This is crucial for domestic chips.

If the model still heavily relies on GPU memory bandwidth and mature CUDA libraries, even if domestic chips can run it, they will struggle to do so cost-effectively and stably.

V4 initially reduces the inference load, effectively easing the burden on domestic computing power.

Step two occurs at the MoE architecture and activation parameter layer.

Although V4-Pro has 1.6 trillion total parameters, it activates only about 49 billion per inference (roughly 3% of the total); V4-Flash has 284 billion total parameters and activates approximately 13 billion per inference (under 5%).

This means it doesn't fetch and calculate all parameters with every call, but rather operates like a large team of experts, summoning only the relevant specialists when a task arises.

This is equally important for domestic chips.

It reduces the computational burden per inference and makes it easier for inference cards to handle long contexts and agent scenarios.
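
As a rough illustration of how MoE routing keeps the per-call compute small, here is a minimal top-k routing sketch. The expert count, expert sizes, and top_k below are illustrative placeholders, not V4's actual configuration:

```python
# Minimal top-k MoE routing sketch. The expert count and sizes are
# illustrative placeholders, not V4's actual configuration.
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, dim=512, num_experts=64, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)  # scores each expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, dim)
        gate = self.router(x).softmax(dim=-1)
        weights, idx = gate.topk(self.top_k, dim=-1)  # pick top_k experts per token
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e  # tokens routed to expert e in this slot
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[int(e)](x[mask])
        return out  # only top_k of num_experts experts ran for each token
```

With this pattern, per-token FLOPs scale with roughly top_k / num_experts of the dense cost; the same logic is how V4-Pro can carry 1.6 trillion parameters while paying for only about 49 billion per call.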

The third step is adapting the operators and kernel layers.

The greatest strength of the CUDA ecosystem is that NVIDIA has thoroughly optimized a vast array of low-level computations, enabling direct access to many high-performance computing functions.

The significance of V4 lies in its extraction of certain key computations from NVIDIA's black box, transforming them into more portable and adaptable custom computation pathways.

In simple terms, V4 is like taking apart the most critical components of the engine so that vendors such as Huawei (with Ascend) and Cambricon can retune them for their own chip architectures.
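
One common way to make key computations portable is a thin dispatch layer: the model calls a single logical operator, and a registry selects a vendor-tuned kernel when one exists, falling back to a portable reference implementation otherwise. The sketch below shows this generic pattern; the names and backends are assumptions, not DeepSeek's code:

```python
# Generic kernel-dispatch pattern for portable operators. The model calls
# one logical op; a registry picks the backend implementation at runtime.
# Backend names here are illustrative, not DeepSeek's actual code.
import torch.nn.functional as F

_KERNELS = {}

def register(op, backend):
    def deco(fn):
        _KERNELS[(op, backend)] = fn
        return fn
    return deco

def dispatch(op, backend, *args):
    # Use a vendor-tuned kernel if one is registered, else the portable fallback.
    fn = _KERNELS.get((op, backend), _KERNELS[(op, "reference")])
    return fn(*args)

@register("attention", "reference")
def attention_reference(q, k, v):
    # Plain framework ops: correct everywhere, tuned for nothing.
    w = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return w @ v

# A vendor team would add, e.g., @register("attention", "ascend") with a
# hand-tuned kernel for its own silicon, without touching the model code.
```

The design choice matters: vendor teams can register tuned kernels for their own hardware without modifying model code, which is essentially what "retuning the engine components" means in practice.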

Step four is the inference framework and service layer.

If domestic chip compatibility remains limited to just running demos, its industrial significance is minimal. What truly matters is whether it can be integrated into a callable, billable service system.
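
Concretely, "callable and billable" means the model sits behind a metered API. DeepSeek's existing models are served through an OpenAI-compatible endpoint; assuming V4 follows the same convention (the model identifier below is a placeholder, not a confirmed name), a call would look like this:

```python
# Sketch of calling the model as a metered service, assuming the
# OpenAI-compatible API DeepSeek exposes for its earlier models.
# The model name "deepseek-v4-flash" is a placeholder, not a confirmed ID.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepseek.com",  # DeepSeek's existing endpoint
    api_key="YOUR_API_KEY",
)

resp = client.chat.completions.create(
    model="deepseek-v4-flash",  # hypothetical identifier
    messages=[{"role": "user", "content": "Summarize the attached filings."}],
)
print(resp.choices[0].message.content)
print(resp.usage)  # the token counts that metered billing is based on
```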

Internal testing on the Ascend 950PR shows that V4 delivers significantly improved inference speed and reduced power consumption compared to earlier versions, with single-card performance exceeding that of NVIDIA’s customized H20 by more than 2x in specific low-precision scenarios.

DeepSeek officially noted that the current V4-Pro is limited by high-end computing power, resulting in restricted service throughput. Prices are expected to drop significantly after the Ascend 950 super nodes enter mass production in the second half of the year.

This indicates that as domestic hardware such as Ascend reaches mass production, V4's future throughput and cost-performance will continue to improve.

However, it is important to note that V4 has not fully replaced NVIDIA’s GPUs and CUDA; model training may still rely on NVIDIA, but inference can gradually be localized.

This is actually a very realistic business path.

Training involves phased investment—train once, adjust once, iterate once. Inference incurs ongoing costs, with millions or even billions of daily user requests, each requiring computational resources.

The biggest cost for model companies will increasingly shift toward inference over the long term. Those who can handle inference demands more cheaply and reliably will gain a real advantage in industrial applications.

DeepSeek V4 has for the first time enabled a path for deploying China’s cutting-edge models that does not rely on NVIDIA CUDA as the default assumption.

This step is already substantial enough.

The impact of V4 on industry applications

If domestic chip compatibility answers whether the models can run, then price addresses another, more practical question:

Can businesses afford it?

In the past, DeepSeek's greatest strength was its ability to deliver near-cutting-edge model performance at an extremely low price.

This was true in the V3 and R1 eras, and it remains true in V4.

The difference is that this time, it’s not engaging in price competition within a standard context window, but rather continuing to lower prices under a 1M context window combined with Agent capabilities.

According to DeepSeek's official pricing:

For V4-Flash, cached input is 0.2 yuan per million tokens, non-cached input is 1 yuan per million tokens, and output is 2 yuan per million tokens;

For V4-Pro, cached input is 1 yuan per million tokens, non-cached input is 12 yuan per million tokens, and output is 24 yuan per million tokens.
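
To make these rates concrete, here is a back-of-the-envelope cost for one hypothetical long-context V4-Pro call; the split between cached and fresh tokens is an assumption for illustration:

```python
# Back-of-the-envelope cost of one hypothetical V4-Pro call at the listed
# rates (yuan per million tokens). The cached/fresh traffic split is assumed.
CACHED_IN, FRESH_IN, OUT = 1.0, 12.0, 24.0  # yuan per 1M tokens

cached, fresh, output = 800_000, 150_000, 5_000  # tokens in one long-context call
cost = (cached * CACHED_IN + fresh * FRESH_IN + output * OUT) / 1_000_000
print(f"{cost:.2f} yuan")  # 0.80 + 1.80 + 0.12 = 2.72 yuan for ~1M tokens of context
```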

Compare this with other domestic models:

Compared to Alibaba's Qwen3.6-Plus in the 256K-1M range, the output price of V4-Pro is approximately half, and V4-Flash is even lower.

Compared to the Xiaomi MiMo Pro Series in the 256K-1M range, both the V4-Flash and V4-Pro are significantly cheaper.

Kimi K2.6 has a context length of 256K, whereas V4-Pro offers a longer context at a lower price; V4-Flash reduces the cost of high-frequency calls to an entirely different level.

This has significant implications for enterprise applications.

Because a 1M context means the model can read an entire code repository, a thick bundle of contracts, hundreds of pages of prospectuses, lengthy meeting minutes, or the accumulated historical state during an Agent’s sequential task execution.

In the past, many enterprise applications got stuck here: the model capability was sufficient but the context wasn’t; the context was adequate but the cost was too high; the price was acceptable but the model’s performance wasn’t stable enough.

For example, a company building an investment research agent needs the model to simultaneously analyze annual reports, earnings call transcripts, industry reports, competitor news, and internal meeting minutes.

When the context is only 128K or 256K, the system often needs to repeatedly slice, retrieve, and summarize information, leading to data loss through multiple rounds of compression.

A 1M context allows the model to retain more of the original material, reducing omissions and fragmentation.
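
The difference is easy to see in pseudo-workflow form. In the sketch below, "llm" stands in for any model call; the point is only that the small-window path must summarize in rounds, each of which discards detail, while the large-window path can often ingest the material in a single pass:

```python
# Why small windows lose information: with a 256K window the pipeline must
# summarize chunk by chunk and then summarize the summaries, discarding
# detail in each round; a 1M window can often ingest everything in one pass.
# "llm" is a stand-in for any model call, and windows are measured in
# characters here for simplicity; this is not anyone's production pipeline.
def answer_small_window(docs, question, llm, window=256_000):
    partials = [llm(f"Summarize, focusing on: {question}\n{d[:window]}") for d in docs]
    # Second compression round: omissions and errors compound here.
    return llm(f"Answer '{question}' from these summaries:\n" + "\n".join(partials))

def answer_large_window(docs, question, llm, window=1_000_000):
    corpus = "\n".join(docs)[:window]
    return llm(f"Answer '{question}' using the full material:\n{corpus}")
```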

For example, a code agent.

It’s not about writing a few lines of code at once—it requires reading the repository, understanding dependencies, modifying files, running tests, and fixing errors based on the feedback. This process consumes tokens repeatedly.

If each step is expensive, the agent can only perform demonstrations; but if tokens are cheap enough, it may enter actual R&D processes.
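
A minimal agent loop makes the token economics visible. The sketch below is generic — "llm" and "tools" are stand-ins, not any real framework — but it shows why each step re-bills the accumulated context:

```python
# Minimal agent-loop sketch: every iteration re-sends the accumulated
# history, so token spend grows with each step. All names are illustrative.
def run_code_agent(task, llm, tools, max_steps=20):
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        action = llm("\n".join(history))  # bills the whole history again
        if action.startswith("DONE"):
            return action
        tool, _, arg = action.partition(" ")  # e.g. "run_tests ." or "edit_file x.py"
        result = tools[tool](arg)  # read the repo, edit files, run tests...
        history.append(f"{action}\n-> {result}")  # feedback drives the next step
    return "step budget exhausted"
```

When each iteration carries hundreds of thousands of tokens of history, the per-million-token price decides whether such a loop stays a demo or enters production.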

This is also the industrial value of V4.

It may not be the most powerful model, but it could become the most frequently used model in enterprises.

DeepSeek has once again transformed AI from the exclusive plaything of a few large companies into a productivity tool that can be scaled across countless industries.

The true value of V4

When 1M context reaches the front lines of industry at an extremely low cost, the true value of DeepSeek V4 becomes evident.

All of this is built on a foundation of domestic computing power that is still immature.

Faced with the systemic gap in China's domestic chip ecosystem, the DeepSeek team chose not to wait for the ecosystem to mature before launching.

They repeatedly delayed the release window, spending months in deep joint debugging with partners such as Huawei—a level of engineering complexity far beyond what outsiders imagine.

Therefore, achieving inference and Agent capabilities close to those of top-tier proprietary models on domestic computing power is especially remarkable.

V4 proves that, even in the face of temporary hardware ecosystem gaps, the Chinese team can still achieve competitive performance through extreme engineering investment and hardware-software co-innovation.

Of course, there is still a gap to full maturity.

The maturity of the Ascend platform’s toolchain, the stability of ultra-large-scale clusters, and deeper optimizations for more vertical scenarios all require continued collaborative efforts from all industry stakeholders.

However, the success of V4 has paved a replicable path for subsequent models.

It has provided a strong boost to the autonomy and controllability of the entire AI supply chain.

In today’s uncertain external environment, this resilience—able to break through constraints—is more worthy of respect than mere numerical metrics.

Not lured by praise, not frightened by slander, follow the path with integrity and uphold oneself with dignity.

This statement from the DeepSeek team is a fitting commentary on the release.
