New AI Agent Benchmark ALE Reveals Significant Performance Gaps in Real-World Tasks

MetaEra
Summary

According to MetaEra, a research team led by UC Berkeley, working with over 250 industry experts, has launched a new AI agent benchmark called Agents' Last Exam (ALE). The benchmark includes 1,490 real-world tasks across manufacturing, law, healthcare, visual media, and other fields. Results show that the top configuration, Codex + GPT-5.5, fully passes only 8.6% of the most difficult tasks, while mainstream systems average 2.6%. The study identifies domain knowledge, not execution capability, as the primary limitation. Model choice affects results about three times as much as agent design. As regulatory frameworks such as MiCA and CFT rules push for stricter oversight, such benchmarks may influence future compliance and risk frameworks.

A research team led by the University of California, Berkeley, working with over 250 industry experts, has introduced Agents' Last Exam (ALE), an evaluation benchmark for AI agents. The benchmark comprises 1,490 real-world professional tasks spanning manufacturing, law, healthcare, visual media, and other fields, and is designed to measure AI performance in long-duration, economically valuable workflows. Results show that although current mainstream models score highly on traditional benchmarks, their average full pass rate on ALE's most difficult tier is only 2.6%, and the best configuration reaches just 8.6%. The research team notes that the primary bottleneck in current systems lies in domain knowledge rather than execution capability, and that model selection has approximately three times the impact of the agent framework. As a continuously updated benchmark, ALE will expand to include new workflows and industries in the future.

Article author and source: 36Kr

A research team led by the University of California, Berkeley, in collaboration with over 250 industry experts, has introduced ALE, a new evaluation benchmark for AI agents. It addresses a gap in existing benchmarks, which cannot continuously measure AI performance on real-world, long-duration, economically valuable tasks.

Paper link: https://arxiv.org/abs/2606.05405

What is the final exam about?

Agents' Last Exam (ALE) is an AI agent evaluation benchmark developed by a Berkeley-led research team together with over 250 industry experts to measure AI performance in long-duration, economically valuable real-world workflows.

To test whether AI can perform real-world tasks on a computer like a human, the research team collected 1,490 tasks spanning fields such as manufacturing, law, healthcare, and visual media. These tasks were drawn from the daily work of real professionals: some required the AI to create 3D models, while others asked it to perform chroma keying and video compositing in DaVinci Resolve.

Figure | Distribution of 1,490 task instances under the ALE classification system

Compared to common question-answering or short-task benchmarks, these tasks place higher demands on agents. The research team refers to such agents as Generalist Computer-Use Agents (GCUAs): they must not only interact with interfaces but also execute command-line operations, handle files, write code, invoke tools, and complete entire workflows.

Figure | Typical GCUA framework structure.

To test the true capabilities of these agents, ALE provides a comprehensive set of task environments that can be executed and scored. During execution, the task scripts handle loading the tasks, setting up the environment, and final scoring, while the agent observes the environment, selects actions, and continuously performs them based on the task description. After the task is completed, the script directly evaluates the results. According to the team, 93.2% of tasks can be scored automatically without manual intervention.
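The division of labor described above (the task script sets up and scores, the agent observes and acts) can be illustrated with a toy loop. This is only a minimal sketch under stated assumptions: `ToyTask`, `run_episode`, `scripted_agent`, the `export` action, and the fraction-of-files scoring rule are all hypothetical names and rules invented for illustration, not ALE's actual interfaces or scoring.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ToyTask:
    """Hypothetical stand-in for a task: a description plus required output files."""
    description: str
    required_files: List[str]
    produced_files: List[str] = field(default_factory=list)

    def observe(self) -> dict:
        # What the agent sees at each step: the instructions and the files so far.
        return {"description": self.description, "files": list(self.produced_files)}

    def step(self, action: str) -> None:
        # The only action this toy environment understands is "export <name>".
        if action.startswith("export "):
            self.produced_files.append(action[len("export "):])

    def score(self) -> float:
        # Assumed scoring rule (not ALE's): fraction of required files produced.
        done = sum(1 for f in self.required_files if f in self.produced_files)
        return done / len(self.required_files)


def run_episode(task: ToyTask, agent: Callable[[dict], str], max_steps: int = 10) -> float:
    """Harness side: loop observe -> act until the agent stops, then score."""
    for _ in range(max_steps):
        action = agent(task.observe())
        if action == "stop":
            break
        task.step(action)
    return task.score()


def scripted_agent(observation: dict) -> str:
    # A trivial agent that exports one file and then stops.
    return "stop" if "result.txt" in observation["files"] else "export result.txt"


task = ToyTask("Export result.txt and report.pdf", ["result.txt", "report.pdf"])
print(run_episode(task, scripted_agent))  # 0.5: one of two required files was exported
```

Keeping scoring inside the task object rather than the agent mirrors the article's point that results are evaluated by the script itself, without relying on the agent's own report of success.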

Figure | Task Construction Process.

How did the agents do on the exam?

The research team noted that, on the most difficult tier of tasks alone, the current best-performing configuration, Codex + GPT-5.5, achieved a full pass rate of just 8.6%, while the average full pass rate across the mainstream systems the team reported was 2.6%.

The research team listed several specific failure cases. In the music transcription task, which required submission of a score PDF, MIDI file, and interface screenshots, the AI only exported the MIDI file and received a score of 0. In the injection molding simulation task, the AI completed the simulation and exported results in Moldex3D but failed to consistently extract key values, resulting in a score of 0.4762. In the green screen compositing task, although the AI exported a video, the output did not meet the reference requirements, earning it a score of 0 as well.

Figure | Main results of ALE.

Figure | Overview of Experimental Analysis.

The research team then categorized the causes of failure. For example, with Claude Code + Opus 4.7, 31% of failures were due to comprehension issues, 47% to methodological issues, and 22% to execution issues. Comprehension and methodological issues together accounted for 78%, or roughly 80%, of failures. Based on this, the research team concluded that the primary bottleneck in current systems lies in domain knowledge, not execution capability.
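The arithmetic behind the "approximately 80%" figure can be checked directly. The sketch below uses the three percentages reported for Claude Code + Opus 4.7; the dictionary keys and the variable name `knowledge_related` are illustrative labels, and grouping comprehension with methodology follows the article's reading rather than any published formula.

```python
# Failure shares (in percent) reported for Claude Code + Opus 4.7.
failure_share = {"comprehension": 31, "methodology": 47, "execution": 22}

# The three categories should cover all failures.
total = sum(failure_share.values())

# Comprehension plus methodology is what the article attributes to domain knowledge.
knowledge_related = failure_share["comprehension"] + failure_share["methodology"]

print(total)              # 100
print(knowledge_related)  # 78
```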

The research team also compared the impact of different models and agent frameworks. The results showed that the differences caused by switching models were significantly greater than those caused by switching agent frameworks. When the agent framework was fixed and only the model was changed, the difference between the highest and lowest overall pass rates was 18 percentage points; when the model was fixed and only the agent framework was changed, this gap was approximately 5 to 6 percentage points. The range of impact from model selection is about three times that of the agent framework.
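The "about three times" figure follows from dividing the two gaps. A minimal sketch using only the numbers quoted above; the function name `impact_ratio_range` is hypothetical.

```python
from typing import Tuple


def impact_ratio_range(model_gap: float, framework_gap_low: float,
                       framework_gap_high: float) -> Tuple[float, float]:
    """Return (smallest, largest) ratio of the model gap to the framework gap."""
    # Dividing by the larger framework gap gives the smaller ratio, and vice versa.
    return model_gap / framework_gap_high, model_gap / framework_gap_low


# 18 points from switching models vs. roughly 5 to 6 points from switching frameworks.
low, high = impact_ratio_range(18.0, 5.0, 6.0)
print(f"{low:.1f}x to {high:.1f}x")  # 3.0x to 3.6x
```

So "about three times" sits at the conservative end of the range implied by the reported gaps.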

Shortcomings and Future Directions

The research team also noted that ALE uses SOC 2018 as its occupational classification framework, primarily covering software-based and digital professional tasks. Currently, these tasks are mainly executed on Linux or Windows virtual machines.

In addition, ALE’s coverage across different domains is uneven, with some areas having many tasks and others very few. For example, energy and nuclear engineering has only four task instances, urban and spatial planning has five, and law has fifteen. The public set currently represents only a portion of the full task pool. The research team conducted a test: on Claude Code + Opus 4.7, the correlation coefficient between the pass rates of the public subset and the full task pool across domains was 0.89.
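To make the public-versus-full comparison concrete, the sketch below computes a correlation coefficient over per-domain pass rates. It assumes a Pearson coefficient, which the article does not specify, and the pass rates in the example are invented placeholders, not ALE data. The helper name `pearson` is likewise illustrative.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Invented per-domain pass rates for illustration only (NOT ALE results).
public_rates = [0.10, 0.25, 0.05, 0.40]
full_rates = [0.12, 0.20, 0.08, 0.35]
print(round(pearson(public_rates, full_rates), 3))
```

A coefficient near 1 would mean that domains scoring well on the public subset also tend to score well on the full pool. With only a handful of tasks in some domains, individual per-domain rates remain noisy even when the overall correlation is high.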

However, the research team considers ALE to be a continuously updated benchmark. In the future, the task pool will expand to include new workflows and industries, and tasks currently held in the private pool will be regularly rotated into the public set.
