Fable 5 Fails Hardest Tasks in New AI Agent Benchmark ALE

According to monitoring by Beating, UC Berkeley’s RDI, in collaboration with hundreds of industry experts, has launched a new AI agent evaluation benchmark called Agents' Last Exam (ALE) to assess agents’ ability to perform real-world digital professional tasks. ALE covers 55 digital professional subdomains and includes over 1,500 validated tasks derived from actual human expert projects, with result verification supported in both GUI and CLI environments. Initial testing included cutting-edge systems such as Fable 5, GPT-5.5, and Composer 2.5. According to the latest official comparison metrics, all tested agents achieved a 0% success rate on the most difficult tasks, which require sustained reasoning and deep domain expertise. Fable 5, newly released this week, also scored zero. Fable 5’s result is primarily due to its safety protocols being triggered: approximately 35% of its tasks were rolled back to run on the older Opus 4.8, significantly degrading its overall performance. In terms of single-task API cost, Fable 5 averages $15.70, far exceeding GPT-5.5’s $3.80 and Composer 2.5’s $1.33, or roughly 4 to 12 times the cost for the same tasks. Testing also revealed that the most common failure mode among agents is declaring success prematurely without verifying outcomes, often leaving files missing or data miscalculated. For CLI-based agents, the evaluation team has simultaneously released a subset, ALE-CLI. Compared with existing benchmarks such as Terminal-Bench and SWE-bench-Pro, ALE-CLI covers 40 subdomains, and its tasks take human experts anywhere from hours to weeks on average to complete. In CLI evaluations, even the best-performing agent achieved only a 25.2% pass rate. The evaluation team notes that while the era of usable agents has arrived, there is still a long way to go before agents can truly replace humans in professional roles.
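
The cost multiples quoted above can be checked with a few lines of arithmetic. The sketch below uses only the three per-task averages reported in this article; the variable and function names are illustrative and are not part of ALE or any vendor API.

```python
# Average single-task API cost in USD, as reported in the article.
cost_per_task = {"Fable 5": 15.70, "GPT-5.5": 3.80, "Composer 2.5": 1.33}


def cost_ratio(model: str, baseline: str) -> float:
    """Return how many times more `model` costs per task than `baseline`."""
    return cost_per_task[model] / cost_per_task[baseline]


# Compare Fable 5 against each of the other two systems, rounded to one decimal.
ratios = {
    baseline: round(cost_ratio("Fable 5", baseline), 1)
    for baseline in ("GPT-5.5", "Composer 2.5")
}
print(ratios)  # {'GPT-5.5': 4.1, 'Composer 2.5': 11.8}
```

On these figures, Fable 5 costs about 4.1 times as much per task as GPT-5.5 and about 11.8 times as much as Composer 2.5, which is where the rounded "4 to 12 times" range comes from.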