According to monitoring by Beating, a new AI agent evaluation benchmark called Agents' Last Exam (ALE) has been launched to assess agents’ ability to perform real-world digital professional tasks. The project is led by UC Berkeley’s RDI in collaboration with hundreds of industry experts. ALE covers 55 digital professional subdomains and includes more than 1,500 validated tasks derived from actual human expert projects, with result verification supported in both GUI and CLI environments.

Initial testing included cutting-edge systems such as Fable 5, GPT-5.5, and Composer 2.5. According to the latest official comparison metrics, all tested agents achieved a 0% success rate on the most difficult tasks, which require sustained reasoning and deep domain expertise. Fable 5, newly released this week, also scored zero. Its result is attributed primarily to triggered safety protocols: approximately 35% of Fable 5’s tasks fell back to the older Opus 4.8, significantly degrading its overall performance. In terms of single-task API cost, Fable 5 averages $15.70, far exceeding GPT-5.5’s $3.80 and Composer 2.5’s $1.33, which makes it roughly 4 to 12 times more expensive for the same tasks. Testing also revealed that the most common failure mode among agents is declaring success prematurely without verifying outcomes, often leaving files missing or data miscalculated.

For CLI-based agents, the evaluation team has simultaneously released a subset, ALE-CLI. Compared to existing benchmarks like Terminal-Bench and SWE-bench-Pro, ALE-CLI covers 40 subdomains, with average human task completion times ranging from hours to weeks. In CLI evaluations, even the best-performing agent achieved only a 25.2% pass rate. The evaluation team notes that while the era of usable agents has arrived, agents still have a long way to go before they can truly replace humans in professional roles.
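The "4 to 12 times" cost figure can be checked directly from the three per-task averages reported above. The following is a minimal sketch of that arithmetic only; the names `cost_per_task` and `cost_ratios` are illustrative and do not come from the ALE team.

```python
# Average single-task API cost in USD, as reported in the article.
cost_per_task = {
    "Fable 5": 15.70,
    "GPT-5.5": 3.80,
    "Composer 2.5": 1.33,
}


def cost_ratios(costs, baseline):
    """Return the baseline's cost divided by each other system's cost."""
    base = costs[baseline]
    return {name: base / cost for name, cost in costs.items() if name != baseline}


ratios = cost_ratios(cost_per_task, "Fable 5")
for name, ratio in ratios.items():
    # Prints one line per competitor, e.g. "Fable 5 vs GPT-5.5: 4.1x"
    print(f"Fable 5 vs {name}: {ratio:.1f}x")
```

With the reported figures, the ratios come out to about 4.1x against GPT-5.5 and about 11.8x against Composer 2.5, which is consistent with the stated range.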
Fable 5 Fails Hardest Tasks in New AI Agent Benchmark ALE
MarsBit
Fable 5 underperformed on ALE, a new AI agent benchmark developed by UC Berkeley’s RDI and industry experts. Alongside GPT-5.5 and Composer 2.5, Fable 5 scored 0% on the most complex tasks. About 35% of its tasks fell back to Opus 4.8 because of safety policies, which impaired its performance. Fable 5 also costs 4 to 12 times more per task than its competitors. The results suggest that AI agents still face real-world challenges.