OpenAI's Noam Brown Criticizes AI Benchmarking Standards, Calls for Performance vs. Cost Metrics

OpenAI's Noam Brown has criticized current AI benchmarking standards, noting that existing leaderboards fail to account for the computational cost of reasoning. He argues that performance should be evaluated against compute budgets, not just raw scores. Brown proposes publishing performance-versus-cost curves, setting budget caps in benchmarks, and incorporating inference costs into safety evaluations.
OpenAI's Noam Brown posted a critique of AI industry evaluation standards, pointing out that current benchmark leaderboards ignore reasoning budget considerations.

Article author and source: AI New Era

Noam Brown from OpenAI has just published a lengthy article criticizing the entire AI industry.

The article title is "Insights from Large-Scale Inference Computing," and its core argument is simple: all the AI benchmark rankings you currently see are largely misleading.

The reason is simple.

The same model produces vastly different scores when given one dollar to think versus ten thousand dollars. Yet today's leaderboards do not disclose how much it cost to produce each model's results.

Is GPT-5.5's report card "fake"?

On April 23, GPT-5.5 was released.

OpenAI released a benchmark table, and the community once again went through it line by line. The conclusion: it’s decent, slightly better than 5.4, but not by much.

Then several hours passed.

Polish mathematician Bartosz Naskręcki used a single prompt to have GPT-5.5 build an algebraic geometry visualization application in 11 minutes.

Ruby on Rails creator DHH remarked that switching back from GPT-5.5 to Opus 4.7 felt like stepping back in time.

Same model. The benchmark says “okay,” but people say “amazing.” Why?

The reason is simple: 5.5 and 5.4 were not tested under the same computational budget.

It is like two students taking the same test, one given 30 minutes and the other three hours. Comparing their scores and concluding that "the difference isn't that big" is not a comparison; it is absurd.

GPT-5.4 Pro is priced at $30/$180 per million tokens through the API, while GPT-5.5 is priced at $5/$30. That is a sixfold price difference.
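
The sixfold figure follows directly from the quoted prices. Below is a minimal sketch of the arithmetic, assuming the two numbers in each pair are input and output prices (the article does not label them). The names `PRICES`, `price_ratio`, and `output_tokens_for_budget` are illustrative and do not come from Brown's article.

```python
# Prices in USD per million tokens, as quoted in the article.
# Reading each pair as (input, output) is an assumption.
PRICES = {
    "GPT-5.4 Pro": (30.0, 180.0),
    "GPT-5.5": (5.0, 30.0),
}


def price_ratio(expensive: str, cheap: str) -> tuple[float, float]:
    """Return the ratio of the first and second listed prices."""
    a = PRICES[expensive]
    b = PRICES[cheap]
    return (a[0] / b[0], a[1] / b[1])


def output_tokens_for_budget(model: str, budget_usd: float) -> float:
    """Tokens purchasable at the second listed price for a given budget."""
    return budget_usd / PRICES[model][1] * 1_000_000


if __name__ == "__main__":
    print(price_ratio("GPT-5.4 Pro", "GPT-5.5"))
    print(output_tokens_for_budget("GPT-5.4 Pro", 30.0))
    print(output_tokens_for_budget("GPT-5.5", 30.0))
```

At the same dollar budget, the cheaper model can spend six times as many tokens, which is why a table that compares both at their default settings hides a large difference in effective thinking budget.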

However, on the benchmark table, these two models are compared as if they are on the same scale, completely ignoring differences in inference budgets. Once token budgets are controlled, GPT-5.5 significantly outperforms GPT-5.4 in cybersecurity assessments.

Brown presents two figures in the text. On the left, from a traditional benchmark perspective, 5.5 is slightly better than 5.4. On the right, with the x-axis switched to token count, the curve for 5.5 far outperforms that of 5.4.

The same exam. Look at it from a different perspective, and the conclusion is entirely different.

This is not an isolated case.

MMLU, once the dominant evaluation benchmark, now has all state-of-the-art models clustered above 88%, with score differences statistically insignificant. What you're seeing isn't "who is smarter"—it's noise.

On the MRCR v2 test at a 1-million-token context length, GPT-5.4 scored 36.6% while GPT-5.5 scored 74.0%, more than double. However, this metric does not appear in standard benchmark tables.

On ARC-AGI, OpenAI's o3 achieved the highest score, with a reasoning cost of $30,000 per question.

By contrast, the NVARC team achieved 24% accuracy on the same benchmark using a small model with 4 billion parameters, at $0.20 per question.

Thirty thousand dollars versus twenty cents—the very question of “who ranks higher” has already become meaningless.

When a model's capability is a function of inference compute, a benchmark score without an x-axis is like a physical quantity without units: it tells you nothing.

In Brown's view, the right approach is to plot a curve of performance versus inference compute.

The x-axis can represent token count, dollars spent, or elapsed time, and each choice has its own advantages and disadvantages. Whichever is used, any curve is more informative than a single scalar value.
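
To make the idea of a curve concrete, here is a minimal sketch of comparing two models at a matched budget instead of at their default settings. The data points are hypothetical and do not come from Brown's article; `score_at_budget` is an illustrative helper that interpolates linearly in log-budget and declines to extrapolate beyond the measured range.

```python
import math
from bisect import bisect_left
from typing import List, Optional, Tuple

Curve = List[Tuple[float, float]]  # (budget, score), sorted by budget


def score_at_budget(curve: Curve, budget: float) -> Optional[float]:
    """Interpolate a score at `budget`, linearly in log(budget).

    Returns None when `budget` lies outside the measured range.
    """
    budgets = [b for b, _ in curve]
    if budget < budgets[0] or budget > budgets[-1]:
        return None
    i = bisect_left(budgets, budget)
    if budgets[i] == budget:
        return curve[i][1]
    (b0, s0), (b1, s1) = curve[i - 1], curve[i]
    t = (math.log(budget) - math.log(b0)) / (math.log(b1) - math.log(b0))
    return s0 + t * (s1 - s0)


# Hypothetical curves: budget in tokens, score as a fraction.
model_a: Curve = [(1e3, 0.20), (1e4, 0.40), (1e5, 0.60)]
model_b: Curve = [(1e4, 0.30), (1e5, 0.45), (1e6, 0.62)]

if __name__ == "__main__":
    # Headline numbers at each model's largest tested budget look close.
    print(model_a[-1][1], model_b[-1][1])
    # At a matched budget of 1e5 tokens the gap is much larger.
    print(score_at_budget(model_a, 1e5), score_at_budget(model_b, 1e5))
```

In this made-up example the headline scores are 0.60 and 0.62, yet at the same 100,000-token budget the first model leads 0.60 to 0.45, which is the kind of reversal a single scalar cannot show.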

Alternatively, you can set a clear budget limit and tell the model, "You have this much money—give me an answer."

This is exactly the logic of human exams: the SAT provides a fixed time, and the International Mathematical Olympiad also provides a fixed time.
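
A budget cap can also be applied on the scoring side. The sketch below is one possible reading of that idea, not a method described by Brown: any answer that cost more than the cap is counted as wrong, much as an exam answer handed in after time is up does not count. The per-question records are hypothetical.

```python
from typing import List, Tuple

# Each record is (cost in USD, whether the answer was correct).
Result = Tuple[float, bool]


def accuracy_under_cap(results: List[Result], cap_usd: float) -> float:
    """Fraction of questions answered correctly at or below the cost cap."""
    if not results:
        raise ValueError("results must not be empty")
    solved = sum(1 for cost, correct in results if correct and cost <= cap_usd)
    return solved / len(results)


# Hypothetical per-question results for one model.
results: List[Result] = [
    (0.10, True),
    (0.50, True),
    (5.00, True),
    (0.20, False),
]

if __name__ == "__main__":
    for cap in (0.05, 1.00, 10.00):
        print(cap, accuracy_under_cap(results, cap))
```

Sweeping the cap over several values yields a performance-versus-cost curve from a single set of runs.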

Only AI evaluations, even in 2026, continue to pretend the variable "how much money is paid to think" doesn't exist.

The ignored x-axis

Why did this issue surface now?

Because two years ago, inference-time compute was a concept associated only with o1.

And Brown was a core contributor to o1.

Previously, he developed Libratus and Pluribus at Carnegie Mellon, both of which defeated top professional poker players—with Pluribus appearing on the cover of Science—and created CICERO at Meta FAIR, the first AI to reach human-level performance in the strategy game Diplomacy.

From games with incomplete information to reasoning models, he has consistently stayed on the same path: enabling AI to think longer and deeper.

In 2024, o1 brought the concept of "trading reasoning time for accuracy" into the public eye. By 2026, inference-time computation has become standard for all state-of-the-art models.

GPT-5.5 Pro is not an independent model; it uses the same foundation as GPT-5.5 with added parallel inference computation: when faced with difficult problems, it runs multiple reasoning chains and synthesizes the results.
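
The article does not say how the parallel chains are combined. As a rough illustration only, the sketch below uses majority voting over final answers, a common stand-in for "synthesize the results"; it should not be read as OpenAI's actual method. `sample_chain` is a hypothetical callable that represents one independent reasoning chain.

```python
from collections import Counter
from typing import Callable


def parallel_answer(sample_chain: Callable[[int], str], n_chains: int) -> str:
    """Run n_chains independent chains and return the most common answer.

    Ties are broken in favor of the answer that appeared first.
    """
    if n_chains < 1:
        raise ValueError("n_chains must be at least 1")
    answers = [sample_chain(i) for i in range(n_chains)]
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    # Stand-in for a model: canned final answers from five chains.
    canned = ["42", "41", "42", "42", "7"]
    print(parallel_answer(lambda i: canned[i], n_chains=5))
```

Under this scheme, cost grows linearly with `n_chains`, so the same base model can sit at very different points on the cost axis.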

Claude has extended thinking, Gemini has Deep Think, and nearly every leading lab is moving in the same direction.

Academia has also provided a quantitative relationship: coverage is approximately log-linear in the number of samples.

In other words, giving AI twice as much "thinking time" won't make it twice as smart, but it will become a bit smarter. The returns diminish logarithmically.
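
A small sketch of what a log-linear relationship implies, assuming the simple form score = a + b * ln(n) with illustrative constants (the article gives no fitted values): every doubling of samples adds the same fixed increment, b * ln(2), no matter how many samples were already drawn.

```python
import math


def log_linear_score(n_samples: int, a: float, b: float) -> float:
    """Score modeled as a + b * ln(n_samples), clamped to [0, 1]."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    return min(1.0, max(0.0, a + b * math.log(n_samples)))


def doubling_gain(b: float) -> float:
    """Constant increment gained from each doubling of samples."""
    return b * math.log(2)


if __name__ == "__main__":
    a, b = 0.20, 0.05  # hypothetical constants, for illustration only
    for n in (1, 2, 4, 100, 200):
        print(n, round(log_linear_score(n, a, b), 4))
    print("gain per doubling:", round(doubling_gain(b), 4))
```

Going from 1 to 2 samples buys the same increment as going from 100 to 200, although the second step costs a hundred times more.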

But Brown cited a key finding from Karpathy and the AI Safety Institute:

Stronger models yield higher returns over longer time horizons. The performance plateau is pushed further out—or may even disappear.

A weak model might have already peaked after thinking for two more minutes. But a strong model, after thinking for two hours, is still climbing.

Each time a new model is released, if you only run benchmarks under a fixed inference budget, you’re only seeing the tip of the iceberg. The true upper limit of its capabilities lies in the waters you can’t afford to test.

In Brown's words: "We may have no idea where the capability ceiling of modern LLMs lies because the cost of measurement is too high."

Brown's three prescriptions

Brown offered three recommendations to address this issue.

First, when a lab releases a new model, it should publish the performance-versus-inference-compute curve, clearly indicating the inference budget behind each score.

GPT-5.5 achieves 82.7% on Terminal-Bench 2.0, but you have no idea how much it cost to run. When you compare it to another model, you also have no idea how much that one cost.

It's like two companies comparing revenues, with one reporting annual revenue and the other reporting quarterly revenue, without specifying the time period.

Second, benchmark leaderboards should track inference usage or set an explicit budget cap.

ARC-AGI is already doing this, but it is not an industry standard.

Third, safety preparedness frameworks and responsible scaling policies should explicitly account for inference compute.

Safety evaluations cannot rely solely on testing the "default state": a nation-state attacker could allocate up to $10 million of inference budget to a single task.

Take Gemini 3 Deep Think as an example.

Deep Think is essentially Gemini 3 Pro with an external calling framework; anyone can replicate it by paying the same inference cost.

What should really be asked is why none of the model cards show capabilities as a function of inference budget.

Brown's ideal safety evaluation is a chart.

The x-axis represents inference budget (from $1 to $10M), and the y-axis represents the model's performance on specific dangerous capabilities. Measurements are taken at low budgets, then extrapolated to high-budget regions.
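
A minimal sketch of the measure-then-extrapolate step, assuming capability is roughly linear in log10(budget) over the range of interest. That assumption is a simplification: the article itself notes that stronger models plateau later, so a straight-line fit may underestimate them. The measurements below are hypothetical.

```python
import math
from typing import List, Tuple


def fit_log_linear(points: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Least-squares fit of score = a + b * log10(budget_usd)."""
    if len(points) < 2:
        raise ValueError("need at least two measurements")
    xs = [math.log10(budget) for budget, _ in points]
    ys = [score for _, score in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("budgets must not all be identical")
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    a = mean_y - b * mean_x
    return a, b


def predict(a: float, b: float, budget_usd: float) -> float:
    """Predicted score at a budget, clamped to [0, 1]."""
    return min(1.0, max(0.0, a + b * math.log10(budget_usd)))


if __name__ == "__main__":
    # Hypothetical low-budget measurements: (USD, success rate).
    measured = [(1, 0.10), (10, 0.15), (100, 0.20), (1000, 0.25)]
    a, b = fit_log_linear(measured)
    print(round(predict(a, b, 10_000_000), 3))  # extrapolated to $10M
```

The extrapolated value is only as good as the assumed functional form, so it is better treated as a rough estimate than as a measurement.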

But he also acknowledged a tricky issue: long-term evaluation may not be solvable through extrapolation. To determine whether an AI agent will encounter problems over a year, you might truly need to let it run for a year.

AI labs will soon face an absurd situation—agent deployment cycles are outpacing the development cycles of new models. Before you’ve finished evaluating the long-term behavior of the previous generation, the next one has already been released.

Superintelligence is an arithmetic problem

All the previous discussions point to the same issue.

If a model's capabilities are a function of inference compute, and stronger models reach their plateau further out, then what exactly is "superintelligence"?

In traditional understanding, ASI represents a qualitative turning point: one day, a model suddenly surpasses humans across all cognitive tasks.

Following this logic further—ASI may not be a single moment, but rather a curve.

The previous numbers have made it clear: for the same type of task, a reasoning budget of twenty cents versus thirty thousand dollars yields completely different results. But these are still just the ranges that have already been tested.

What if you gave a cutting-edge model a $1,000,000 inference budget? What about $100,000,000?

No one has tested it. Brown said they can't afford to.

But the logarithmic scaling relationship indicates that the curve has not yet reached its peak, and the stronger the model, the farther away the plateau.

ASI may not require a completely new architectural breakthrough. It may simply need enough money and enough time.

An AI agent that operates for an entire year, consuming hundreds of millions of dollars in inference budget, may have developed capabilities over that period that surpass the lifetime accumulation of a human individual in a specific domain.

The true score of the final

Over the past decade, the entire AI industry has been accustomed to one way of evaluation: one model, one score, one ranking. From ImageNet to MMLU to Chatbot Arena, whoever had the higher number won.

Today, the reshuffle of the "2D era" is beginning.

The model's capability has shifted from a single point to a curve, and evaluation has changed from a single score to a graph. The y-axis represents performance, and the x-axis represents how much you're willing to spend to make it think.

Every "first place" must now be read against another variable: inference budget.

The capabilities of the same model under a $5 budget versus a $500 budget may not even be on the same level. And most of the area on this two-dimensional map remains unexplored to this day.

In 2026, global tech giants are expected to invest nearly $700 billion in AI infrastructure. This spending will buy not just larger models, but longer inference times, more sampling, and faster inference.

The same open-source model costs some people $0.20 per question and others $30,000 per question. The difference in capability isn’t due to the model—it’s due to resource disparities.

When "intelligence" becomes a continuous function priced in dollars, "superintelligence" is no longer a yes-or-no question.

Whoever adapts to this two-dimensional coordinate system first will see the true score of the ASI final.
