AI Agents Pass Just 2.6% of the Hardest Real-World Tasks in New Benchmark

A new benchmark from UC Berkeley suggests that AI agent timelines need a serious reality check.

The Agents’ Last Exam, a large-scale evaluation framework built with input from over 250 industry experts across more than 100 institutions, found that mainstream AI agents achieve an average full pass rate of just 2.6% on its hardest tier of real-world professional tasks. The best-performing configuration, Codex running on gpt-5-5, managed an overall pass rate of roughly 26%.

What the benchmark actually tests

The benchmark covers 55 non-physical sub-industries organized into 13 clusters, derived from the O*NET/SOC 2018 taxonomy. So far, the team has cataloged more than 1,500 tasks, with a goal of reaching 5,000. Each task produces verifiable outcomes, so the fluent-sounding-but-wrong outputs that large language models have become famous for do not count as a pass.

The paper was submitted to arXiv on June 3, 2026, and the project lives at agents-last-exam.org. It’s designed as a living benchmark that will continue expanding in scope and complexity over time.

The collaboration behind it

The initiative was spearheaded by UC Berkeley’s Center for Responsible, Decentralized Intelligence (RDI) and drew input from institutions including MIT, Harvard, Stanford, Goldman Sachs, JPMorgan, Meta, Amazon, Adobe, and Snorkel AI.

Why a 26% top score matters

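The 26% figure is the overall pass rate for the best-performing configuration, Codex on gpt-5-5. The 2.6% figure is different in kind: it is the average full pass rate across popular configurations of mainstream agents, measured on the hardest tier only. Because the two numbers cover different task sets, they describe a ceiling and a floor rather than a tenfold gap on the same test. Cursor and Claude-based setups followed Codex in the rankings.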
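To make the difference between the two kinds of figure concrete, here is a minimal Python sketch. It assumes that a "full pass" means every verifiable check on a task succeeded, which is an interpretation and not a definition confirmed by the benchmark's authors. The agent names, tier names, and results below are invented toy data and are not drawn from the benchmark.

```python
from statistics import mean

# Hypothetical toy data, NOT benchmark results.
# results[agent][tier] is a list of tasks; each task is a list of
# booleans, one per verifiable check on that task's deliverable.
results = {
    "agent_a": {
        "easy": [[True, True], [True, True], [True, False]],
        "hard": [[True, False], [False, False], [True, True], [True, False]],
    },
    "agent_b": {
        "easy": [[True, True], [False, True], [True, True]],
        "hard": [[False, False], [True, False], [False, True], [True, False]],
    },
}


def full_pass_rate(tasks):
    """Fraction of tasks in which every check passed (assumed definition)."""
    return sum(all(checks) for checks in tasks) / len(tasks)


def overall_rate(agent):
    """Full pass rate for one agent, pooled over all of its tiers."""
    pooled = [task for tier in results[agent].values() for task in tier]
    return full_pass_rate(pooled)


def tier_average(tier):
    """Mean of the per-agent full pass rates on a single tier."""
    return mean(full_pass_rate(results[agent][tier]) for agent in results)


best_agent = max(results, key=overall_rate)
print(f"best overall: {best_agent} at {overall_rate(best_agent):.1%}")
print(f"hard-tier average across agents: {tier_average('hard'):.1%}")
```

In this toy data the best agent's overall rate comes out well above the hard-tier average across agents, because the first number pools easier tasks for a single strong agent while the second averages every agent on the hardest tasks alone.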

The benchmark specifically evaluates performance on long-horizon tasks rather than quick-hit question answering. An AI agent might answer a finance question correctly in isolation yet fall apart when asked to execute a multi-step workflow that requires maintaining context, making sequential decisions, and producing a verified deliverable.