The Agents' Last Exam (ALE) is a real benchmark developed by UC Berkeley RDI together with more than 300 industry experts, as confirmed by the official website and a VentureBeat article dated June 10, 2026. It comprises more than 1,500 tasks across 55 non-physical industries, and its leaderboard covers models such as GPT-5.5, Fable 5, and Composer 2.5. The claim that every tested frontier agent scored 0% on the hardest tier is partially supported but overstated. According to VentureBeat and the Snorkel AI article, "Last-Exam" is the hardest tier, and some configurations, such as Claude Opus 4.8 and Gemini CLI, did record a 0.0% pass rate. However, the average pass rate on the hardest tier is reported as below 1% rather than a uniform 0%, so not every configuration failed every task. For context, the top-performing agent (Codex with GPT-5.5) achieved a 24.0% overall pass rate. The claim's wording implies total failure across all models, which the data do not support. The benchmark's design and results are well documented, but the specific assertion of a 0% success rate for every frontier agent is a slight exaggeration of the data.
The evidence is strong: the official benchmark website, a VentureBeat article, and a Snorkel AI post all confirm the existence and details of the ALE benchmark. These sources agree on the number of tasks, the industries covered, and overall model performance, though they differ slightly on the exact pass rates for the hardest tier.