FactCheckRadar Fact-check archive

Published fact-check

Does Gemini 3.5 Flash Beat Pro?

Supported

Claim checked

“3.5 Flash outperforms 3.1 Pro on coding and agentic benchmarks like Terminal-Bench 2.1, GDPval-AA, and MCP Atlas. Holy crap”

Verdict

Supported

The claim that Gemini 3.5 Flash outperforms Gemini 3.1 Pro on coding and agentic benchmarks like Terminal-bench 2.1, GDPval-AA, and MCP Atlas is supported by official Google DeepMind technical documentation.

Reasoning

A viral post on X claimed that Gemini 3.5 Flash outperforms Gemini 3.1 Pro on several advanced coding and agentic benchmarks, specifically citing Terminal-bench 2.1, GDPval-AA, and MCP Atlas. While previous analyses of older model generations dismissed these benchmarks as fictional or misidentified, official documentation from Google DeepMind confirms that these benchmarks are real and that the performance inversion is genuine.

According to performance data published by Google DeepMind, Gemini 3.5 Flash outperforms Gemini 3.1 Pro on all three of the cited agentic evaluations. On Terminal-bench 2.1, a benchmark for agentic terminal coding, Gemini 3.5 Flash scored 76.2% against Gemini 3.1 Pro's 70.3%, a lead of 5.9 percentage points. On MCP Atlas, which measures multi-step workflows using the Model Context Protocol, Gemini 3.5 Flash scored 83.6% against Gemini 3.1 Pro's 78.2%, a lead of 5.4 percentage points. On GDPval-AA, a benchmark for economically valuable knowledge work, Gemini 3.5 Flash reached an Elo rating of 1656 against Gemini 3.1 Pro's 1314, a gap of 342 Elo points.
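The margins behind that comparison are simple subtraction, and a short sketch makes them easy to re-check. The scores below are copied from this fact-check; the dictionary layout and the `margins` helper are illustrative names, not part of any Google DeepMind tooling.

```python
# Scores as quoted in this fact-check: percentages for the first two
# benchmarks, Elo ratings for GDPval-AA.
scores = {
    "Terminal-bench 2.1": {"flash": 76.2, "pro": 70.3},
    "MCP Atlas": {"flash": 83.6, "pro": 78.2},
    "GDPval-AA": {"flash": 1656, "pro": 1314},
}


def margins(table):
    """Return the Flash-minus-Pro margin per benchmark, rounded to one decimal."""
    return {name: round(row["flash"] - row["pro"], 1) for name, row in table.items()}


result = margins(scores)

# The claim holds only if Flash leads on every benchmark it names.
flash_leads_everywhere = all(margin > 0 for margin in result.values())

for name, margin in result.items():
    print(f"{name}: Flash leads by {margin}")
print("Flash ahead on all three:", flash_leads_everywhere)
```

The percentage margins and the Elo margin are on different scales, so they should be read separately rather than averaged.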

This inversion points to rapid progress in Google's lightweight model family. Larger Pro models have historically led on reasoning tasks, but Gemini 3.5 Flash, which appears to benefit from the newer generation's architectural optimizations and distillation techniques, now surpasses the older Gemini 3.1 Pro on several complex, multi-step agentic and coding workflows.

Source quality: The primary evidence is the official Google DeepMind product page for Gemini 3.5 Flash, which provides a comprehensive benchmark table comparing Gemini 3.5 Flash directly against Gemini 3.1 Pro and other industry models.

Key checks

  • Model Names and Benchmark Authenticity: Official Google DeepMind documentation confirms the existence of Gemini 3.5 Flash and Gemini 3.1 Pro, as well as the specific benchmarks Terminal-bench 2.1, GDPval-AA, and MCP Atlas.

  • Benchmark Performance Comparison: Gemini 3.5 Flash scored 76.2% on Terminal-bench 2.1 (vs 70.3% for Pro), 83.6% on MCP Atlas (vs 78.2% for Pro), and 1656 Elo on GDPval-AA (vs 1314 Elo for Pro), confirming the claimed performance inversion.
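To give the 342-point GDPval-AA gap (1656 vs 1314) some intuition, the sketch below converts an Elo difference into an expected head-to-head score. It assumes the conventional logistic Elo formula with a 400-point scale; this fact-check does not state which Elo variant GDPval-AA uses, so the result is a rough illustration only. The function name `elo_expected_score` is hypothetical.

```python
def elo_expected_score(rating_a, rating_b, scale=400.0):
    """Expected score of A against B under the standard logistic Elo model.

    ASSUMPTION: base-10 logistic curve with a 400-point scale. The source
    does not confirm that GDPval-AA ratings follow this convention.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Ratings as quoted in this fact-check.
flash_elo = 1656
pro_elo = 1314

p_flash = elo_expected_score(flash_elo, pro_elo)
print(f"Expected score for Flash vs Pro: {p_flash:.3f}")
```

Under that assumption, the gap corresponds to Flash being preferred in somewhat under nine of every ten pairwise comparisons.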

Confidence

High

8 reviewed sources behind this verdict.
