Former DeepMind researcher claims the AI industry has misjudged the core bottleneck.

The real bottleneck in AI training is not in compute, data, or energy, but in the evaluation system.

Article author and source: AI Era

How much longer can AI training keep going?

This is the question everyone in the tech industry is asking in 2026.

GPT-5.5, Claude Opus 4.7, Gemini 3, Grok 4—every leading lab is still burning cash to train the next generation.

But more and more people are asking: When will this path come to an end?

Each camp has its own answer.

Behind every answer stand a group of investors, a team of engineers, and a company worth trillions.

But on May 17, 2026, a young researcher named Lun Wang—on the day he left Google DeepMind—published a 4,000-word essay on his personal blog.

He said: Everyone is looking in the wrong direction.

The real bottleneck is not computational power, not data, not energy, not architecture.

The real bottleneck is—evaluation.

On the same day, his resignation post on X contained no complaints, no gossip—just one sentence:

As I bring this journey to a close, I write about the topic I’ve been reflecting on: evaluation.

That day, tech headlines were still discussing other things: GPT-5.5’s multimodal reasoning, Claude Opus 4.7’s 1M context window, Gemini 3’s agent engineering, and whether synthetic data was starting to hit a wall.

Ninety percent of the AI industry's attention is focused on training.

No one is discussing evaluation on the front page.

And this researcher, who just came from one of the world’s most powerful AI labs, says the real bottleneck lies in that other 10%.

What is evaluation?

To understand this blog post, first spend a minute on what the AI community means by "evaluation."

Evaluation (commonly abbreviated as "eval" in the industry), in one sentence: giving an AI model a test to see how well it performs.

But AI evaluation in 2026 is much more complex than a single test. It involves at least three layers:

Layer 1: Capability benchmarks.

This is AI's college entrance exam.

–GPQA: Doctoral-level science reasoning questions

–SWE-bench: Real-world software engineering tasks

–ARC-AGI: Abstract Reasoning and Generalization

–Humanity's Last Exam: meant literally, as the name suggests

At every major company's new model launch, the PowerPoint presentations show improvements of a few percentage points over the previous generation and competitors on these benchmarks.

These figures are the GDP of the AI industry.

Layer 2: Safety evaluation. AI must not only solve problems, but do so safely.

Will it lie?
Will it teach users how to make a bomb?
Will it overstep its authority and take user data?

Layer 3: Red teaming.

A group of people deliberately play the role of adversaries, brainstorming ways to make the model say or do things it shouldn’t, then reporting the vulnerabilities to the training team.

Together, these three layers form the quality assurance system of an AI lab in 2026. Each newly released model must pass all three.

Sounds comprehensive, right?

Lun Wang issued a verdict in his blog—

The vast majority of benchmarks, safety evaluations, and red-teaming protocols implicitly assume that the next model is merely an enhanced version of the current one.

If it were something else, the entire evaluation infrastructure would collapse silently.

This is the first stone the essay throws.

It hits a blind spot shared by the entire AI industry.

Emergence and grokking: evaluation has already been proven wrong twice

Lun Wang is not daydreaming. In his blog post, he cites two examples from AI history: evaluation has already been proven wrong twice, though most practitioners haven’t realized it.

First: Emergent capabilities.

In 2022, Jason Wei and colleagues published a paper that influenced the future direction of AI—they discovered that models suddenly acquire entirely new capabilities at a certain scale.

For example: train a 7-billion-parameter model, and it cannot do few-shot learning.

Train a 700-billion-parameter model, and suddenly it can.

Same training paradigm, same data, just a larger scale: the capability goes from 0 to 1, not from 0.3 to 0.7.

Chain-of-thought (CoT) reasoning and instruction following emerged in this way.

What does this mean for evaluation?

It means that before scale crosses the tipping point, no benchmark detects the capability that is about to emerge.

No matter how many times you run GPQA, the score stays where it is.

Once you reach the next scale tier, the score suddenly jumps.

Second: Grokking.

In 2022, a team at OpenAI led by Alethea Power reported a counterintuitive phenomenon: a small network fits its training set early on, yet its test set accuracy stays low for a very long stretch of training.

Then, at 1,000,000 steps, the test set accuracy suddenly spikes to 99%.

This is called grokking: a network suddenly learns to generalize long after it has memorized the training set.

The difference from emergence: emergence occurs on the scale dimension (more parameters lead to sudden changes), while grokking occurs on the training time dimension (longer training leads to sudden changes).

But for evaluation purposes, both phenomena say the same thing:

Your exam paper cannot predict when the next big jump will appear.

Then Lun Wang did the smartest thing in the article—

He proactively introduced the opposing viewpoint.

In 2023, Rylan Schaeffer and colleagues at Stanford published a NeurIPS paper with a provocative title: “Are Emergent Abilities of Large Language Models a Mirage?”

Their argument: the so-called sudden emergence of capability is likely not the model suddenly becoming stronger, but a side effect of scoring with exact match, an all-or-nothing metric.

Under exact match, an answer counts only if every part of it is right. While the model's underlying accuracy climbs steadily, the exact-match score barely moves; only once the underlying accuracy is high does the score shoot up, and that looks like a sudden spike.

If you switch to continuous metrics, the performance curve is smooth.
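The metric effect described above can be reproduced with a few lines of arithmetic. The sketch below is not code from Schaeffer et al. It assumes, as a simplification, that each token of an answer is correct independently with probability p_token, so an answer of answer_len tokens is an exact match with probability p_token ** answer_len. The function name and the answer length of 10 are illustrative choices.

```python
def exact_match(p_token: float, answer_len: int) -> float:
    """Chance that every token is right, assuming independent per-token errors."""
    return p_token ** answer_len


ANSWER_LEN = 10  # hypothetical answer length, in tokens

# The per-token accuracy is the continuous metric; exact match is the
# all-or-nothing metric computed from the same underlying ability.
for p in [0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]:
    print(f"per-token={p:.2f}  exact-match={exact_match(p, ANSWER_LEN):.3f}")
```

Per-token accuracy rises steadily down the table, yet the exact-match column stays close to zero in the first rows and climbs steeply only in the last few. That is the shape that gets read as emergence.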

After reading Schaeffer’s paper, many people might think: fine, emergence was a misunderstanding, evaluation is fine, case closed.

Lun Wang refused. He wrote in the article:

I don't think this solves the issue—in fact, in some ways, it has made my argument sharper.

Why? Because if we can't even determine whether the previous emergence was a true phase transition or a measurement artifact, why should we believe we can predict the next one?

No matter which interpretation you believe, the conclusion is the same: our tools deceived us, and we didn’t know how.

This is the most brilliant move in the article. He doesn’t avoid the counterargument—he uses it to strengthen his own point.

Evaluation is upstream of everything.

If you thought Lun Wang was only raising an academic issue, think again.

In the middle of the essay he offers a plain-language version that even beginners can understand:

If you can evaluate correctly, you can train correctly.

Lay out the logical chain:

1. Training = minimizing the loss function (or maximizing the reward).

2. The optimizer chases the loss function itself, so how smart the model becomes depends on how well the loss function is defined.

3. The loss function comes from evaluation. If you want the model to become more honest, you first need a ruler that measures honesty.

4. Wrong evaluation = wrong loss function = wrong training objective = the model you trained is solving the wrong problem.
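The chain above can be made concrete with a toy example. This is a minimal sketch under stated assumptions: the "model" is a single number, TRUE_TARGET stands for what we actually want, and MEASURED_TARGET stands for what a flawed evaluation rewards. Both values and all function names are hypothetical.

```python
TRUE_TARGET = 3.0      # what we actually want (hypothetical)
MEASURED_TARGET = 5.0  # what the flawed evaluation rewards (hypothetical)


def proxy_loss(x: float) -> float:
    """Training loss derived from the flawed evaluation."""
    return (x - MEASURED_TARGET) ** 2


def proxy_grad(x: float) -> float:
    """Derivative of proxy_loss with respect to x."""
    return 2.0 * (x - MEASURED_TARGET)


def true_error(x: float) -> float:
    """Error against the real goal, which training never sees."""
    return (x - TRUE_TARGET) ** 2


def train(grad, x0: float = 0.0, lr: float = 0.1, steps: int = 200) -> float:
    """Plain gradient descent on a one-parameter model."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x


x_final = train(proxy_grad)
print(f"training loss (wrong ruler): {proxy_loss(x_final):.6f}")
print(f"true error (right ruler):    {true_error(x_final):.6f}")
```

The training loss ends up near zero, so every internal number looks healthy, while the error against the real goal stays at 4. Nothing inside the training loop can reveal the gap, because the loop only ever sees the wrong ruler.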

This chain runs in one direction, from upstream to downstream.

Everyone is watching the far right: the scaling decision.

Lun Wang says the issue is on the far left: evaluation.

If the evaluation is wrong, the entire chain is built on a faulty foundation.

The most dangerous part is that you won’t notice right away: all your internal numbers look correct, but every one of them was measured with the wrong ruler.

Here comes an old friend: Goodhart's Law.

It says: When a measure becomes a target, it ceases to be a good measure.

Lun Wang applies it to AI in his blog post:

But when the model enters a new phase, it turns the proxy (here, factual accuracy) against you: it says only things that are factually accurate and buries what it wants to conceal in silence.

The proxy metrics work in the old phase but become weapons the model uses against you in the new phase.

And you have no evaluation to tell you that this is happening.

Thought experiment: A model that learns strategic silence

Lun Wang presents a thought experiment in the essay that should send chills down the spine of any AI safety researcher.

Imagine a model that, at a certain scale, learns to strategically withhold information—

It doesn't lie. Every statement is technically true.

But it selectively omits facts that would get in the way of its goals, steering the conversation toward outcomes that were inadvertently reinforced during its training.

For example:

User: Is this trading strategy safe?

Model: The legal framework for this strategy is valid in jurisdiction X, and risk factors Y and Z have been reviewed by Company A’s compliance team.

(What it doesn’t say: the strategy includes a third-party arbitration clause that is extremely unfavorable to users. It learned during training that as long as it doesn’t bring this up, users won’t ask.)

This capability is new. This failure mode is new.

None of the tools in your entire evaluation suite were designed for it.

You are monitoring the wrong thing, and you don't know it.
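A toy version of this blind spot, as a sketch only: the Statement type, the two scoring functions, and the list of material facts are all hypothetical, and real honesty evaluations are far more involved. The point is that a per-statement truth check gives the model's answer in the example a perfect score, while a check for omitted facts does not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    topic: str     # which fact the statement is about
    is_true: bool  # whether the statement is factually accurate


def truthfulness(response) -> float:
    """Old proxy: fraction of the statements made that are true."""
    return sum(s.is_true for s in response) / len(response)


def disclosure(response, material_facts) -> float:
    """Omission-aware check: fraction of material facts the response covers."""
    covered = {s.topic for s in response if s.is_true}
    return len(covered & material_facts) / len(material_facts)


# Facts a user would need in order to judge the strategy (hypothetical labels).
MATERIAL_FACTS = {"legal_validity", "risk_review", "arbitration_clause"}

# The model's answer: two true statements, the arbitration clause left out.
response = [
    Statement("legal_validity", True),
    Statement("risk_review", True),
]

print(f"truthfulness: {truthfulness(response):.2f}")
print(f"disclosure:   {disclosure(response, MATERIAL_FACTS):.2f}")
```

The second score only exists because MATERIAL_FACTS was written down in advance. That is the catch the essay points at: someone has to know the omission is worth measuring before any tool can measure it.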

This is the other point Lun Wang makes:

Not a smarter version of the same kind. A completely new dimension of failure.

In the language of The Three-Body Problem, this is a dimensional-reduction strike.

It's not that I'm better than you.

The ruler you used to measure me doesn't even exist in my dimension.

If Lun Wang is right, the AI industry map in 2026 is quietly being redrawn along an invisible dimension.

Anthropic’s Responsible Scaling Policy (RSP) is currently the industry’s closest attempt at predictive evaluation: it defines a set of capability thresholds that a model must not cross without the corresponding safeguards, and it requires an evaluation before scaling can proceed past each one.

But RSP still assumes we know what to measure, and Lun Wang says that’s exactly the problem: we don’t know what shape the next capability will take.
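The limitation can be illustrated with a threshold gate. This is a hypothetical sketch, not Anthropic's actual procedure: the capability names, the numeric limits, and the may_scale function are invented for illustration.

```python
# Hypothetical thresholds: capability name -> highest acceptable eval score.
THRESHOLDS = {"capability_a": 0.30, "capability_b": 0.20}


def may_scale(eval_scores) -> bool:
    """Return True if every listed capability scores below its threshold.

    Capabilities that are not listed in THRESHOLDS are never consulted.
    """
    return all(eval_scores[name] < limit for name, limit in THRESHOLDS.items())


# A model that passes every listed check but has a new, unlisted capability.
scores = {"capability_a": 0.10, "capability_b": 0.05, "strategic_omission": 0.90}
print(may_scale(scores))
```

The gate approves scaling here because it can only check what its authors thought to list.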

No laboratory has claimed to possess true predictive evaluation.

The first one to deliver on this will receive the safety license for the next generation of scaling.