U.S. Government Report Claims China’s Leading AI Model Is 8 Months Behind

币界网
Summary
A U.S. government report from the Center for AI Standards and Innovation (CAISI) under NIST claims China’s leading AI model, DeepSeek V4 Pro, lags behind global leaders by approximately eight months. The report applies Item Response Theory to evaluate performance across nine benchmarks. DeepSeek V4 Pro scored around 800, similar to GPT-5.4 mini but below GPT-5.5 and Claude Opus 4.6. Critics argue the methodology is flawed and the performance gap overstated, since some benchmarks remain undisclosed, hindering independent verification.
CoinDesk reports:

A U.S. government agency has released an assessment of China’s most powerful AI model: eight months behind, with the gap widening over time. After reading the evaluation methodology, however, internet users began raising questions.

CAISI, the Center for AI Standards and Innovation, a division of the U.S. National Institute of Standards and Technology (NIST), released an evaluation report on May 1 stating that DeepSeek’s flagship open-source model is “approximately eight months behind state-of-the-art technology.”

CAISI also calls it the most powerful Chinese AI model evaluated to date.

Rating system

CAISI does not average benchmark scores the way most evaluation agencies do. Instead, it applies item response theory (IRT), a statistical method from standardized testing, to estimate each model’s latent ability from which problems it did and did not solve across nine benchmarks spanning five domains: cybersecurity, software engineering, natural sciences, abstract reasoning, and mathematics.
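
CAISI has not published its exact model specification, but the mechanics of an IRT fit can be sketched. Below is a minimal, hypothetical illustration using the simplest variant, a one-parameter (Rasch) model: each AI model gets a latent ability theta, each benchmark problem a difficulty, and the probability of a solve is the logistic function of their difference. All data in the example are randomly generated placeholders, not CAISI’s.

```python
import numpy as np

# Hypothetical solve/fail matrix: rows = AI models, columns = benchmark items.
# All values are randomly generated placeholders, not CAISI data.
rng = np.random.default_rng(0)
true_theta = np.array([2.0, 1.0, 0.0, -0.5])       # latent abilities (hypothetical)
true_diff = rng.normal(0.0, 1.5, size=50)          # item difficulties (hypothetical)
p_solve = 1.0 / (1.0 + np.exp(-(true_theta[:, None] - true_diff[None, :])))
X = rng.binomial(1, p_solve)                       # 1 = solved, 0 = not solved

# Joint maximum-likelihood fit of the Rasch model by gradient ascent
# on the Bernoulli log-likelihood.
theta = np.zeros(X.shape[0])                       # ability estimates
diff = np.zeros(X.shape[1])                        # difficulty estimates
lr = 0.5
for _ in range(3000):
    pred = 1.0 / (1.0 + np.exp(-(theta[:, None] - diff[None, :])))
    resid = X - pred                               # gradient of log-likelihood per cell
    theta += lr * resid.mean(axis=1)
    diff -= lr * resid.mean(axis=0)
    theta -= theta.mean()                          # pin the scale: only differences matter

print("estimated abilities:", np.round(theta, 2))
```

Because only the pattern of which items were solved enters the likelihood, a model that cracks the hardest problems is separated from one that only clears the easy ones, even when their raw accuracies are identical. That is the property this kind of fit exploits instead of simple score averaging.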

Based on Elo scores estimated by IRT, GPT-5.5 scores 1260, while Anthropic’s Claude Opus 4.6 scores 999. DeepSeek V4 Pro scores approximately 800 (±28), very close to GPT-5.4 mini’s 749. In the CAISI scoring system, DeepSeek is closer to the previous-generation GPT mini than to Opus.

The scoring works the way standardized exams grade students: rather than directly counting correct answers, it weights correct and incorrect responses by item difficulty to estimate a score. That estimate has only relative meaning, compared against other models run through the same evaluation. Generally, a higher score indicates a stronger model, with the highest-scoring model serving as the reference point for the capability scale.
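
To make “relative meaning” concrete: assuming CAISI’s numbers behave like standard Elo ratings (an assumption, since the exact mapping has not been published), a rating difference converts into a head-to-head win probability via the usual logistic formula P(A beats B) = 1 / (1 + 10^((R_B - R_A) / 400)). Plugging in the reported scores:

```python
def elo_win_prob(r_a: float, r_b: float) -> float:
    """Probability that a contestant rated r_a beats one rated r_b under standard Elo."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

# Scores reported by CAISI: GPT-5.5 = 1260, Claude Opus 4.6 = 999,
# DeepSeek V4 Pro ~= 800, GPT-5.4 mini = 749.
print(f"GPT-5.5 vs DeepSeek V4 Pro:      {elo_win_prob(1260, 800):.0%}")  # ~93%
print(f"Opus 4.6 vs DeepSeek V4 Pro:     {elo_win_prob(999, 800):.0%}")   # ~76%
print(f"DeepSeek V4 Pro vs GPT-5.4 mini: {elo_win_prob(800, 749):.0%}")   # ~57%
```

On that reading, DeepSeek is roughly a coin flip against GPT-5.4 mini but a clear underdog against the frontier models, which matches the ordering CAISI describes.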

Because two of the nine benchmarks were not disclosed, and the gaps were largest on precisely those two tests, CAISI’s results cannot be independently reproduced. For example, GPT-5.5 scored 71% on one of CAISI’s cybersecurity benchmarks, CTF-Archive-Diamond, while DeepSeek scored only around 32%.

In public benchmark tests, the results differ. On the GPQA-Diamond test (a PhD-level scientific reasoning assessment scored by accuracy), DeepSeek scored 90%, just 1 percentage point below Opus 4.6’s 91%. On mathematical olympiad benchmarks (OTIS-AIME-2025, PUMaC 2024, and SMT 2025), DeepSeek scored 97%, 96%, and 96% respectively. On the SWE-Bench Verified test (evaluating real bug fixes on GitHub, scored by resolution rate), DeepSeek achieved 74%, compared to GPT-5.5’s 81%. DeepSeek’s own technical report claims that V4 Pro’s performance is comparable to Opus 4.6 and GPT-5.4.

For the cost comparison, CAISI filtered out every U.S. model whose performance was significantly lower than DeepSeek’s or whose per-token cost was substantially higher than DeepSeek’s. The screen eliminated nearly all of the most advanced U.S. models, leaving only one that met the criteria: GPT-5.4 mini.
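
The screen CAISI describes amounts to a simple predicate over (performance, price) pairs. A minimal sketch of how such a filter might work, with assumed thresholds and invented per-token prices (only the Elo-style scores come from the report):

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    score: float          # Elo-style score from the report
    usd_per_mtok: float   # price per million tokens (invented placeholder)

deepseek = Candidate("DeepSeek V4 Pro", 800, 1.0)
us_models = [
    Candidate("GPT-5.5",          1260, 15.0),
    Candidate("Claude Opus 4.6",   999, 20.0),
    Candidate("GPT-5.4 mini",      749,  0.8),
]

PERF_MARGIN = 100   # how much lower counts as "significantly lower" (assumed)
COST_FACTOR = 3.0   # how much pricier counts as "substantially higher" (assumed)

def survives(m: Candidate) -> bool:
    too_weak = m.score < deepseek.score - PERF_MARGIN
    too_expensive = m.usd_per_mtok > COST_FACTOR * deepseek.usd_per_mtok
    return not (too_weak or too_expensive)

print([m.name for m in us_models if survives(m)])  # -> ['GPT-5.4 mini']
```

Because frontier U.S. models are both stronger and far more expensive per token, a screen like this drops them on cost, which is how the comparison ends up with a single peer.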

In that comparison, DeepSeek outperformed GPT-5.4 mini, OpenAI’s smallest and least capable model, in 5 out of 7 benchmark tests while also being more cost-effective.

Counterargument: Is the gap larger or smaller?

Criticism of CAISI’s methodology does not by itself vindicate DeepSeek. The AI developer known pseudonymously as Ex0bit offered a direct rebuttal: “There is no such thing as a ‘gap,’ and no one is eight months behind. Every time we do a private sale in the U.S., we’re mocked, but when we do a public sale, we’re ridiculed instead.”

The AI Intelligence Index v4.0 (a rating system that tracks frontier-model intelligence across 10 evaluations) shows that, as of May 2026, OpenAI’s score is nearing 60 while DeepSeek’s is around 50, a significantly narrower gap than a year ago.

Measured against these standardized benchmarks, in other words, the gap appears to be narrowing, not widening.

When DeepSeek first appeared in January 2025, the question was whether China had caught up, and U.S. laboratories responded urgently. Stanford University’s 2026 AI Index, released on April 13, reported that the Arena leaderboard gap between Claude Opus 4.6 and China’s Dola-Seed-2.0 Preview has narrowed to just 2.7%.

CAISI plans to release a more comprehensive explanation of the IRT methodology in the near future.
