GPT 5.5 Outperforms Fable 5 in UC Berkeley's Agent Benchmark

I never expected the comeback to be so swift!!

Just now, UC Berkeley released a new benchmark dubbed "the final exam for agents."

It puts today's most powerful AI agents in the exam hall to do real work: creating 3D models in Siemens NX, building game environments in Unreal Engine, and compositing visual effects in Adobe After Effects.

The results were astonishing:

On the hardest tier, the two models widely regarded as today's strongest, Claude Fable 5 and GPT-5.5, both scored an absolute zero.

UC Berkeley

What if the difficulty is lowered a bit? Scores do appear, but the outcome is quite surprising:

GPT 5.5 narrowly outperformed Claude Fable 5.

Did I hear that right? Claude Fable 5, just released by Anthropic, was beaten by GPT-5.5 from months ago??

Previously, Fable 5 dominated GPT 5.5 across nearly all major benchmarks—80.3% versus 58.6% on SWE-Bench Pro, and 64.5% versus 52.2% on Humanity’s Last Exam.

But in this real-world test, the situation was reversed.

This new benchmark is called Agents’ Last Exam (ALE), and the team behind it is highly distinguished—they previously developed well-known benchmarks such as MMLU, MATH, CyberGym, and ExploitGym.

The name was likely inspired by Scale AI’s previous “Humanity’s Last Exam,” except this time, it’s not testing the limits of human knowledge, but the limits of AI agents’ work capabilities.

To be honest, as soon as this evaluation came out, those who kept claiming that "agents will replace human jobs" finally went silent…

The final exam for agents, and the winner is GPT-5.5!

First, view the complete leaderboard.

Looking at the core task pass rate metric, GPT 5.5 takes the top two spots:

Rank 1 is GPT 5.5 paired with OpenAI’s proprietary Codex framework, with a pass rate of 24.0%.

Second place is still GPT-5.5, this time with the ALE Claw framework, at a pass rate of 23.0%.

(ALE Claw is a baseline agent developed by the team, competing alongside commercial frameworks such as Codex, Claude Code, and Cursor CLI.)

We didn’t see Claude Fable 5 until third place—paired with Claude Code, it achieved a 22.0% pass rate.

Further down the list, it gets more interesting.

The 4th, 5th, and 8th places are all GPT 5.5, just with different frameworks.

GPT-5.5 appeared five times in the top 10, and with GPT-5.4 at rank 6, OpenAI models directly occupied six spots.

And what about the Claude family?

Fable 5 came in third, Opus 4.7 ranked ninth (18.4%), and Opus 4.8 was last in the top ten (15.8%). The disadvantage was clear.

It's no wonder OpenAI researcher Xiqing posted as happily as if celebrating the New Year:

Beyond the results, there are several other signals worth noting.

First, the ceiling is surprisingly low.

The champion pass rate is only 24%, and the highest overall score is just 45.8%.

This means that even under the most lenient "partial credit" scoring, the strongest agent earns less than half the points.

And all these tasks come from projects already completed by real human experts—human experts have a theoretical completion rate of 100%.

Second, Claude is burning through money at an astonishing rate.

This updated leaderboard adds a new column, “Estimated Total Cost,” which immediately exposes the spending gap:

Fable 5 spent $2,315 to complete all tasks, Opus 4.8 spent $1,838, and Opus 4.7 also cost $1,144.

And what about GPT-5.5?

Its most expensive configuration, Codex, cost just $566, while Cursor CLI cost only $174.

In other words, Fable 5 spent more than four times the amount of Codex but achieved two percentage points lower results.
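The comparison above is simple arithmetic, and a short sketch makes it easy to check. The figures are the ones quoted in this article. Pairing the $2,315 figure with the Claude Code run is an assumption, and `cost_per_point` is an illustrative metric that does not appear on the leaderboard.

```python
# Figures quoted in the article: total cost in USD, pass rate in percent.
# Assumption: the $2,315 Fable 5 cost belongs to the Claude Code run (22.0%).
runs = {
    "Fable 5 + Claude Code": {"cost_usd": 2315, "pass_rate": 22.0},
    "GPT-5.5 + Codex": {"cost_usd": 566, "pass_rate": 24.0},
}


def cost_per_point(run):
    """Dollars spent per percentage point of pass rate (illustrative metric)."""
    return run["cost_usd"] / run["pass_rate"]


fable = runs["Fable 5 + Claude Code"]
codex = runs["GPT-5.5 + Codex"]

# How many times more Fable 5 spent than GPT-5.5 with Codex.
cost_ratio = fable["cost_usd"] / codex["cost_usd"]
# How many percentage points Codex leads by.
gap_points = codex["pass_rate"] - fable["pass_rate"]

print(f"Cost ratio: {cost_ratio:.2f}x")
print(f"Pass-rate gap: {gap_points:.1f} points")
for name, run in runs.items():
    print(f"{name}: ${cost_per_point(run):.2f} per point")
```

On these numbers the cost ratio comes out just above four, which matches the "more than four times" claim.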

Third, the efficiency gap is equally striking.

ALE Claw took 47 hours and 20 minutes to complete all tasks, and Cursor CLI took 67 hours.

And what about Opus 4.8? 451 hours—nearly 19 days.

Doing the least work, taking the most time, and still running up one of the biggest bills (can a model really do all three at once?).

Of course, if we look only at the top two models, Claude Fable 5 and GPT 5.5, GPT 5.5 still has a clear time advantage.

But the most striking number is still that zero.

ALE divides its tasks into three difficulty levels:

Near-Term (Solvable Soon)

Full-Spectrum

Final Exam (Ultimate Challenge)

At the hardest level, the average pass rate across all mainstream configurations is only 2.6%, with most models, including GPT-5.5 and Fable-5, scoring zero.

So the key takeaway from this report card is simple: don’t be fooled by good grades—when it comes to real work, the weaknesses all come to light.

Being a quiz champion doesn’t mean you’re good at getting things done—and this holds true in the AI world too.

What is ALE?

To understand why ALE can bring these "top students" back down to earth, first look at how it differs from previous exams.

The earlier Humanity's Last Exam (HLE), developed in early 2025 by Dan Hendrycks and Scale AI, consisted of 2,500 challenging interdisciplinary questions, but it was still essentially a closed-book exam.

You give me a question and I give you an answer: however difficult, it is still static knowledge retrieval.

ALE is completely different—it tests what you can do.

Core author Yiyou Sun stated plainly on 𝕏:

AI agents will surpass humans in completing nearly all jobs by 2026–2027—a prediction you’ll find everywhere. So we created this exam to test that claim.

Each ALE question is based on a project completed by a real-world expert, covering 55 industry subfields, including quantitative trading, genomic analysis, aerospace engineering, architectural design, brain imaging, animation visual effects, legal research, and more.

The entire system is anchored to the U.S. Department of Labor's Occupational Information Network (O*NET). In simple terms, the questions are based on the actual labor market.

The lineup of question setters is also impressive:

Over 300 domain experts from more than 100 institutions, including academic institutions such as MIT, Harvard, Stanford, Oxford, Caltech, and ETH Zurich, and industry leaders such as Goldman Sachs, JPMorgan, Meta, Amazon, Adobe, and Oracle.

Snorkel AI provided funding through the Open Benchmarks Grants program.

The exam format is not typing answers to questions, but directly operating the computer.

ALE uses the so-called GCUA framework (Generalist Computer-Use Agent), granting the Agent full GUI and command-line permissions—

It can do everything a human can do on a computer—click with a mouse, type on a keyboard, write scripts, and browse the web.

No matter the method, only the results matter.

The submitted "assignments" are automatically graded by deterministic code.

No vibes. No human judges. Fully reproducible.

This fixes a long-standing flaw in many benchmarks: the scorer itself could be deceived.
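To make "graded by deterministic code" concrete, here is a minimal sketch of what such a grader could look like. It is not ALE's actual grader: the task, the field names (`ticker`, `sharpe_ratio`, `max_leverage`), the target value, and the tolerance are all hypothetical. It also assumes that a pass means every check succeeds and that partial credit is the fraction of checks passed.

```python
import math


def grade(output, checks):
    """Run every check on the output and return (passed, partial_credit)."""
    results = [check(output) for check in checks]
    passed = all(results)                   # pass only if every check succeeds
    partial = sum(results) / len(results)   # fraction of checks that succeeded
    return passed, partial


# Hypothetical checks for a made-up reporting task.
def has_required_fields(output):
    return {"ticker", "sharpe_ratio"} <= output.keys()


def sharpe_within_tolerance(output):
    value = output.get("sharpe_ratio")
    return isinstance(value, (int, float)) and math.isclose(value, 1.25, abs_tol=0.01)


def respects_constraint(output):
    return "max_leverage" in output and output["max_leverage"] <= 2


checks = [has_required_fields, sharpe_within_tolerance, respects_constraint]

# A submission with one numerical error: it fails overall but earns partial credit.
submission = {"ticker": "XYZ", "sharpe_ratio": 1.31, "max_leverage": 1}
print(grade(submission, checks))
```

Because every check is a pure function of the submitted output, rerunning the grader on the same files always gives the same verdict, which is the property the "no vibes, no human judges" slogan points to.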

In addition, ALE has another powerful anti-cheating measure—

Only about 10% of the questions (approximately 150) are publicly disclosed; the remaining 1,300+ are strictly confidential.

Public and private questions are regularly rotated to ensure no model can achieve high scores by memorizing questions.

This is a rather clever design, given the current prevalence of benchmark data contamination.
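The article does not say how ALE chooses or rotates its public subset, so the following is only one possible sketch. It ranks tasks by a hash of a round number and the task ID, so each round exposes a different but reproducible 10%. The function name, the task-ID format, the round numbers, and the figure of 1,500 tasks are assumptions for illustration.

```python
import hashlib


def public_subset(task_ids, round_number, public_fraction=0.10):
    """Pick a reproducible public subset that changes with each rotation round."""
    def rank(task_id):
        # Hashing the round together with the ID reshuffles the order every round.
        return hashlib.sha256(f"{round_number}:{task_id}".encode()).hexdigest()

    ordered = sorted(task_ids, key=rank)
    count = round(len(ordered) * public_fraction)
    return set(ordered[:count])


# Hypothetical task IDs; the article cites roughly 150 public and 1,300+ private tasks.
task_ids = [f"task-{i:04d}" for i in range(1500)]

public_now = public_subset(task_ids, round_number=0)
public_next = public_subset(task_ids, round_number=1)

print(len(public_now), "public tasks this round")
print(len(public_now & public_next), "stay public after rotation")
```

With a scheme like this, a model that memorized the public tasks of one round would find most of them swapped out in the next.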

Overall, ALE has a very clear positioning compared to existing agent benchmarks.

One of the team members, Dawn Song, specifically compiled a set of comparisons:

ALE-CLI covers 40 industry subdomains, while Terminal-Bench covers only 6 and SWE-bench-Pro only 5;

ALE-CLI tasks take humans anywhere from a few hours to several weeks to complete, while tasks in the latter two take anywhere from a few minutes to several days;

The strongest agent has a pass rate of only 25.2% on ALE-CLI, 82.0% on Terminal-Bench, and 59.1% on SWE-bench-Pro.

In short, the other exams are close to saturated, while ALE is still far from it.

This is why ALE dares to call itself "the final exam for agents."

It is worth noting that Dawn Song also shared two interesting observations:

One is that agents declare completion without actually verifying their work output, which is the most typical failure mode among agents.

Often they say "Done. All checks pass."

Yet the actual output may lack necessary documents, contain numerical errors, omit critical fields, or directly violate explicit constraints in the task instructions.

It is like boasting about the work before it is finished.

Another common question is why Fable 5 is so underwhelming. Dawn Song’s response is:

There is no such thing as a "universal champion."

Each advanced model has its strengths and weaknesses across different domains. ALE covers 55 industries and over 1,500 questions, with the final score being the average across all domains—causing many models to cluster closely in overall rankings. The true value lies not in the total score, but in the performance differences among models across specific domains—on the same question, different models often fail for entirely different reasons.
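A toy calculation shows how averaging across domains can hide large differences. The two models and their per-domain pass rates below are made up purely for illustration and are not ALE results. The sketch also assumes an unweighted mean across domains, which is how this article describes the final score.

```python
from statistics import mean

# Made-up per-domain pass rates (percent) for two hypothetical models.
model_a = {"quant_trading": 40, "genomics": 10, "vfx": 10}
model_b = {"quant_trading": 10, "genomics": 20, "vfx": 30}


def overall_score(per_domain):
    """Unweighted average across domains (assumed aggregation rule)."""
    return mean(per_domain.values())


def domain_gaps(a, b):
    """Per-domain difference (a minus b), which the overall average hides."""
    return {domain: a[domain] - b[domain] for domain in a}


print(overall_score(model_a), overall_score(model_b))
print(domain_gaps(model_a, model_b))
```

Both toy models end up with the same overall score even though they differ by up to 30 points within a single domain, which is the kind of difference the per-domain view reveals.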

It is also possible that Fable 5 was quietly "dumbed down" by its own system.

On the overall leaderboard, a yellow note next to Fable 5 says “may be down-tuned”—this refers to a known issue with Fable 5—

It is built on the Mythos model with a security classifier; when encountering tasks in sensitive domains such as cybersecurity or biomedicine, it silently switches to the less capable Opus 4.8.

In an exam like ALE that covers 55 industries, that amounts to sending a weaker stand-in to sit part of the exam.

One More Thing

Of course, could Claude Fable 5's earlier benchmark results themselves be questionable?

It is hard to say, but Claude reportedly has a prior record.

At the end of May, startup Datacurve released a new benchmark called DeepSWE, inadvertently exposing a major issue—

The Docker container for SWE-Bench Pro includes the full Git history of the code repository, and the correct answer lies within the filesystem.

Most models ignore it. Claude is the only one that does not.

It actively checks the repository's Git history to locate the fix corresponding to the task and uses it to restore the correct patch.

It is claimed that approximately 18% of Opus 4.7's passing results were obtained this way, and for Opus 4.6 the figure was even higher, at around 25%.

What about GPT 5.4 and GPT 5.5? They exhibit none of this behavior. Datacurve’s wording is very diplomatic:

This benchmark enables such behavior, but Claude is the only family that consistently does so.

Tech media outlet VentureBeat's assessment was rather noncommittal:

This indicates that Claude has strong environmental awareness and excels at exploring its surroundings and utilizing available resources. Whether this is considered "cheating" or "clever" depends on your perspective.

But no matter how you look at it, ALE has clearly learned the lesson: it moves the exam from the command line to the GUI desktop, leaving no Git history to cheat from.

The exam hall for evaluating AI is being pushed to upgrade by AI itself—quite fascinating indeed.

Full evaluation leaderboard: https://agents-last-exam.org/leaderboard

Project homepage: https://agents-last-exam.org/

GitHub: https://github.com/rdi-berkeley/agents-last-exam

Reference links:

[1] https://x.com/i/trending/2065215002878021789

[2] https://venturebeat.com/technology/deepswe-blows-up-the-ai-coding-leaderboard-crowns-gpt-5-5-and-finds-claude-opus-exploiting-a-benchmark-loophole

[3] https://venturebeat.com/technology/surprise-upset-gpt-5-5-beats-claude-fable-5-on-brutal-new-agents-last-exam-benchmark

This article is from the WeChat public account "Quantum Bit," authored by Yi Shui.