New AI Agent Benchmark ALE Reveals Significant Performance Gaps in Real-World Tasks

Led by the University of California, Berkeley, and involving over 250 industry experts, a research team has introduced Agents' Last Exam (ALE), an evaluation benchmark for AI agents. The benchmark comprises 1,490 real-world professional tasks spanning manufacturing, law, healthcare, visual media, and other fields, and is designed to measure AI performance in long-horizon, economically valuable workflows. Results show that while current mainstream models achieve high scores on traditional benchmarks, their average full pass rate on the most difficult level of ALE is only 2.6%, with the best configuration reaching just 8.6%. The research team notes that the primary bottleneck in current systems lies in domain knowledge rather than execution capability, and that model selection has approximately three times the impact of the agent framework. As a continuously updated benchmark, ALE will expand to include new workflows and industries in the future.

Article author and source: 36Kr

A research team led by the University of California, Berkeley, working with more than 250 industry experts, has introduced ALE, a new evaluation benchmark for AI agents. It targets a gap in existing benchmarks, which do not measure AI performance on real-world, long-horizon, economically valuable tasks on an ongoing basis.

Paper link: https://arxiv.org/abs/2606.05405

What is the final exam about?

Agents' Last Exam (ALE) is an AI agent evaluation benchmark built with more than 250 industry experts to measure AI performance in long-horizon, economically valuable real-world workflows.

To test whether AI can perform real-world tasks on a computer like a human, the research team collected 1,490 tasks spanning fields such as manufacturing, law, healthcare, and visual media. These tasks were drawn from the daily work of real professionals: some required the AI to create 3D models, while others asked it to perform chroma keying and video compositing in DaVinci Resolve.

Figure | Distribution of 1,490 task instances under the ALE classification system

Compared to common question-answering or short-task benchmarks, these tasks place higher demands on agents. The research team refers to such agents as Generalist Computer-Use Agents (GCUAs): they must not only interact with interfaces but also execute command-line operations, handle files, write code, invoke tools, and complete entire workflows.

Figure | Typical GCUA framework structure.

To test the true capabilities of these agents, ALE provides a comprehensive set of task environments that can be executed and scored. During execution, the task scripts handle loading the tasks, setting up the environment, and final scoring, while the agent observes the environment, selects actions, and carries them out step by step according to the task description. Once a task is complete, the script scores the result directly; 93.2% of tasks can be scored automatically without manual intervention.
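The division of labor described above, where the task script sets up and scores while the agent observes and acts, can be sketched in a few lines of Python. This is only an illustrative sketch, not ALE's actual harness: the names `Task`, `run_task`, `midi_only_agent`, and `thorough_agent`, the file names, and the all-or-nothing scoring rule are all assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Env = dict                       # hypothetical environment state
Action = Callable[[Env], None]   # an action mutates the environment


@dataclass
class Task:
    """Hypothetical task record: description plus setup and scoring hooks."""
    description: str
    setup: Callable[[], Env]        # builds the initial environment
    score: Callable[[Env], float]   # maps the final state to a score in [0, 1]


def run_task(task: Task,
             agent_step: Callable[[str, Env], Optional[Action]],
             max_steps: int = 50) -> float:
    """Task script: set up, let the agent act until it stops, then score."""
    env = task.setup()
    for _ in range(max_steps):
        action = agent_step(task.description, env)  # agent observes and decides
        if action is None:                          # agent declares it is done
            break
        action(env)                                 # agent acts on the environment
    return task.score(env)                          # scoring needs no human input


# Toy task loosely modeled on a multi-deliverable submission.
# Assumption: the score is 1.0 only if every required file exists, else 0.0.
REQUIRED = {"score.pdf", "score.mid", "screenshot.png"}


def toy_setup() -> Env:
    return {"files": set()}


def toy_score(env: Env) -> float:
    return 1.0 if REQUIRED <= env["files"] else 0.0


def midi_only_agent(description: str, env: Env) -> Optional[Action]:
    """Exports a single deliverable and then stops."""
    if "score.mid" in env["files"]:
        return None
    return lambda e: e["files"].add("score.mid")


def thorough_agent(description: str, env: Env) -> Optional[Action]:
    """Produces one missing deliverable per step until none remain."""
    missing = sorted(REQUIRED - env["files"])
    if not missing:
        return None
    name = missing[0]
    return lambda e: e["files"].add(name)


toy_task = Task("Submit a score PDF, a MIDI file, and a screenshot.",
                toy_setup, toy_score)
partial_score = run_task(toy_task, midi_only_agent)
full_score = run_task(toy_task, thorough_agent)
```

Under the assumed all-or-nothing rule, an agent that produces only one of three required deliverables scores zero even though it did real work. A real scorer may award partial credit instead.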

Figure | Task Construction Process.

How did the agents do on the exam?

The research team reported that on the most difficult tier of tasks alone, the current best-performing configuration, Codex + GPT-5.5, achieved a full pass rate of just 8.6%, and the average full pass rate the team reported across mainstream systems was 2.6%.

The research team listed several specific failure cases. In the music transcription task, which required submission of a score PDF, MIDI file, and interface screenshots, the AI only exported the MIDI file and received a score of 0. In the injection molding simulation task, the AI completed the simulation and exported results in Moldex3D but failed to consistently extract key values, resulting in a score of 0.4762. In the green screen compositing task, although the AI exported a video, the output did not meet the reference requirements, earning it a score of 0 as well.

Figure | Main results of ALE.

Figure | Overview of Experimental Analysis.

The research team then categorized the causes of failure. For example, with Claude Code + Opus 4.7, 31% of failures were attributed to comprehension issues, 47% to methodological issues, and 22% to execution issues. Comprehension and methodological issues together accounted for 78%, or roughly 80%, of failures. Based on this, the research team concluded that the primary bottleneck in current systems lies in domain knowledge, not execution capability.

The research team also compared the impact of different models and agent frameworks. The results showed that the differences caused by switching models were significantly greater than those caused by switching agent frameworks. When the agent framework was fixed and only the model was changed, the difference between the highest and lowest overall pass rates was 18 percentage points; when the model was fixed and only the agent framework was changed, this gap was approximately 5 to 6 percentage points. The range of impact from model selection is about three times that of the agent framework.
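The "about three times" figure follows directly from the two gaps quoted above. As a quick check, the short Python snippet below divides the model gap by each end of the framework gap. The variable names are illustrative, and the numbers are the ones given in the article.

```python
# Gaps in overall pass rate, in percentage points, as quoted in the article.
model_gap_pp = 18.0              # framework fixed, model varied
framework_gap_pp = (5.0, 6.0)    # model fixed, framework varied (approximate range)

# Ratio of the model gap to each end of the framework gap.
ratios = [model_gap_pp / gap for gap in framework_gap_pp]

low, high = min(ratios), max(ratios)
print(f"model gap is {low:.1f}x to {high:.1f}x the framework gap")
```

The ratio falls between 3.0 and 3.6, which is consistent with the team's rounded claim of roughly three times.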

Shortcomings and Future Directions

The research team also noted that ALE uses the 2018 Standard Occupational Classification (SOC 2018) as its occupational classification framework and primarily covers software-based and digital professional tasks. Currently, these tasks are mainly executed on Linux or Windows virtual machines.

In addition, ALE's coverage across different domains is uneven, with some areas having many tasks and others very few. For example, energy and nuclear engineering has only four task instances, urban and spatial planning has five, and law has fifteen. The public set currently represents only a portion of the full task pool. The research team ran a check: on Claude Code + Opus 4.7, the correlation coefficient between the per-domain pass rates of the public subset and those of the full task pool was 0.89, meaning the public subset tracks the full pool closely but not exactly.
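For readers who want to run a similar representativeness check on their own results, the sketch below computes a correlation between two lists of per-domain pass rates. It assumes the Pearson coefficient, which the article does not specify, and the pass-rate values are invented placeholders, not ALE data.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-domain pass rates (placeholders, one entry per domain).
public_subset = [0.10, 0.25, 0.05, 0.40]
full_pool = [0.12, 0.20, 0.08, 0.35]

r = pearson(public_subset, full_pool)
print(f"correlation across domains: {r:.2f}")
```

With only a handful of tasks in some domains, per-domain pass rates are noisy, so a coefficient computed this way should be read as a rough indicator rather than a precise measure.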

However, the research team considers ALE to be a continuously updated benchmark. In the future, the task pool will expand to include new workflows and industries, and tasks currently held in the private pool will be regularly rotated into the public set.