Published fact-check

AI Agents Struggle on New Professional Work Benchmark

Claim checked

“A new benchmark called Agents’ Last Exam (ALE) is testing whether AI agents are truly ready for real digital labor-market work. The benchmark includes more than 1,500 expert-sourced tasks. The tasks span 55 occupations. Models tested include Fable 5, GPT-5.5, Composer 2.5, and other frontier agent systems. The benchmark was created by researchers who previously worked on major evals like MMLU, MATH, CyberGym, and ExploitGym. Current agents can solve some real professional tasks. But on ALE’s hardest tier, every tested frontier agent scored 0% success. That includes Fable 5.”

Published June 12, 2026 at 3:37 PM

Verdict

Mostly true

The claim is mostly true. The Agents' Last Exam (ALE) is a real benchmark designed to test AI agents on professional workflows, and it does include 1,500+ tasks across 55 occupations. GPT-5.5, Fable 5, and Composer 2.5 all appear on its leaderboard. However, the claim that every tested frontier agent scored 0% on the hardest tier overstates the uniformity of the result. Some configurations, including Claude Opus 4.8 and Gemini CLI, did score 0.0% on the "Last-Exam" tier, but the average pass rate on that tier is reported as below 1%, not universally 0%. For context, the 24.0% achieved by the top-performing agent (Codex with GPT-5.5) is an overall pass rate, not a hardest-tier score.

Reasoning

The Agents' Last Exam (ALE) is a real benchmark developed by UC Berkeley RDI and 300+ industry experts, as confirmed by the official website and a VentureBeat article from June 10, 2026. It includes 1,500+ tasks across 55 non-physical industries, and its leaderboard covers models such as GPT-5.5, Fable 5, and Composer 2.5. The claim that every tested frontier agent scored 0% on the hardest tier is partially supported but overstated. According to VentureBeat and the Snorkel AI article, "Last-Exam" is the hardest tier, and some configurations, such as Claude Opus 4.8 and Gemini CLI, did record a 0.0% pass rate on it. However, the average pass rate on that tier is reported as below 1%, not universally 0%. The 24.0% achieved by the top-performing agent (Codex with GPT-5.5) is an overall pass rate, not a hardest-tier score. The claim's wording suggests total failure across all models on the hardest tier, which the data does not fully support. The benchmark's design and results are well documented, but the specific assertion of a 0% success rate for every frontier agent slightly exaggerates the data.

The evidence is strong: the official benchmark website, a VentureBeat article, and a Snorkel AI post all confirm the existence and details of the ALE benchmark. These sources give consistent information about the number of tasks, the occupations covered, and model performance, though there is a minor discrepancy in the exact pass rates on the hardest tier.

Key checks

Benchmark existence and scope: The Agents' Last Exam (ALE) is a real benchmark developed by UC Berkeley RDI and 300+ industry experts, covering 1,500+ tasks across 55 non-physical industries.
Model performance on hardest tier: Some configurations, such as Claude Opus 4.8, scored 0.0% on the "Last-Exam" tier, but the average pass rate on that tier is below 1%, not universally 0%. The 24.0% achieved by the top-performing agent (Codex with GPT-5.5) is an overall pass rate, not a hardest-tier score.

Confidence

High

6 reviewed sources behind this verdict.

Source context

The claim comes from a post by Wes Roth on X, summarizing the launch of the Agents' Last Exam (ALE) benchmark. The post asserts that the benchmark includes 1,500+ tasks across 55 occupations, tests models like Fable 5, GPT-5.5, and Composer 2.5, and that every tested frontier agent scored 0% on the hardest tier, including Fable 5. The claim is based on a quote from Dawn Song, a UC Berkeley professor and co-creator of the benchmark.

Agents’ Last Exam

Surprise upset: GPT-5.5 beats Claude Fable 5 on brutal new Agents’ Last Exam benchmark | VentureBeat

Agents' Last Exam: can AI agents actually do real jobs? | Snorkel AI

AI Agent Benchmark for Real-World Professional Workflows

Paper page - Agents' Last Exam - Hugging Face

[2606.05405] Agents' Last Exam - arXiv
