FactCheckRadar Fact-check archive

Published fact-check

AI Agents Struggle on New Professional Work Benchmark

Mostly true

Claim checked

“A new benchmark called Agents’ Last Exam (ALE) is testing whether AI agents are truly ready for real digital labor-market work. The benchmark includes more than 1,500 expert-sourced tasks. The tasks span 55 occupations. Models tested include Fable 5, GPT-5.5, Composer 2.5, and other frontier agent systems. The benchmark was created by researchers who previously worked on major evals like MMLU, MATH, CyberGym, and ExploitGym. Current agents can solve some real professional tasks. But on ALE’s hardest tier, every tested frontier agent scored 0% success. That includes Fable 5.”

Verdict

Mostly true

The claim is mostly true. Agents' Last Exam (ALE) is a real benchmark designed to test AI agents on professional workflows, and it does include 1,500+ tasks across 55 occupations. GPT-5.5, Fable 5, and Composer 2.5 all appear on its leaderboard. However, the claim that every tested frontier agent scored 0% on the hardest tier overstates the result. Some configurations, including Claude Opus 4.8 and Gemini CLI, did score 0.0% on the "Last-Exam" tier, but the average pass rate on that tier is reported as below 1%, not uniformly 0%. For context, the top-performing agent (Codex with GPT-5.5) achieved a 24.0% pass rate overall.

Reasoning

Agents' Last Exam (ALE) is a real benchmark developed by UC Berkeley RDI and 300+ industry experts, as confirmed by the official website and a VentureBeat article from June 10, 2026. It includes 1,500+ tasks across 55 non-physical industries, and its leaderboard covers models such as GPT-5.5, Fable 5, and Composer 2.5. The claim that every tested frontier agent scored 0% on the hardest tier is partially supported but overstated. According to VentureBeat and the Snorkel AI article, the "Last-Exam" tier is the hardest, and some configurations, such as Claude Opus 4.8 and Gemini CLI, did record a 0.0% pass rate on it. However, the average pass rate on that tier is reported as below 1%, not uniformly 0%, so the claim's wording implies a total failure across all models that the data does not show. For context, the top-performing agent (Codex with GPT-5.5) achieved a 24.0% pass rate overall. The benchmark's design and results are well documented; only the assertion of a 0% success rate for every frontier agent slightly exaggerates them.

The evidence is strong: the official benchmark website, a VentureBeat article, and a Snorkel AI post all confirm the existence and details of the ALE benchmark. These sources agree on the number of tasks, the occupations covered, and model performance, though they differ slightly on the exact pass rates for the hardest tier.

Key checks

  • Benchmark existence and scope: The Agents' Last Exam (ALE) is a real benchmark developed by UC Berkeley RDI and 300+ industry experts, covering 1,500+ tasks across 55 non-physical industries.

  • Model performance on hardest tier: Some configurations, such as Claude Opus 4.8, scored 0.0% on the "Last-Exam" tier, but the average pass rate on that tier is reported as below 1%, not uniformly 0%. For context, the top-performing agent (Codex with GPT-5.5) achieved a 24.0% pass rate overall.

Confidence

High

6 reviewed sources behind this verdict.
