A viral post on X claimed that Gemini 3.5 Flash outperforms Gemini 3.1 Pro on several advanced coding and agentic benchmarks, specifically Terminal-bench 2.1, GDPval-AA, and MCP Atlas. Earlier analyses, written about older model generations, dismissed these benchmarks as fictional or misidentified, but official documentation from Google DeepMind confirms that the benchmarks are real and that the performance inversion is genuine.
According to performance data published by Google DeepMind, Gemini 3.5 Flash outperforms Gemini 3.1 Pro on all three agentic evaluations cited in the post. On Terminal-bench 2.1, which measures agentic terminal coding, Gemini 3.5 Flash scored 76.2% against Gemini 3.1 Pro's 70.3%, a gap of 5.9 percentage points. On MCP Atlas, which measures multi-step workflows using the Model Context Protocol, Gemini 3.5 Flash scored 83.6% against Gemini 3.1 Pro's 78.2%, a gap of 5.4 percentage points. On GDPval-AA, which covers economically valuable knowledge work, Gemini 3.5 Flash reached an Elo rating of 1656, 342 points above Gemini 3.1 Pro's 1314.
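To put the reported margins in perspective, the short Python sketch below recomputes the gaps from the scores quoted above. The Elo interpretation is an assumption: it applies the classic logistic Elo formula with a 400-point scale, and the source does not state how GDPval-AA ratings are fitted, so the resulting expected score is illustrative rather than an official figure. The names `point_gap` and `elo_expected_score` are hypothetical helpers introduced for this example.

```python
# Scores as quoted in the text above: (Gemini 3.5 Flash, Gemini 3.1 Pro).
PERCENT_SCORES = {
    "Terminal-bench 2.1": (76.2, 70.3),
    "MCP Atlas": (83.6, 78.2),
}
GDPVAL_AA_ELO = (1656, 1314)


def point_gap(flash: float, pro: float) -> float:
    """Gap in percentage points, rounded to one decimal place."""
    return round(flash - pro, 1)


def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the classic Elo model.

    ASSUMPTION: logistic curve with a 400-point scale. The source does not
    say how GDPval-AA ratings are computed, so this is only a rough reading.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


for name, (flash, pro) in PERCENT_SCORES.items():
    print(f"{name}: Flash leads by {point_gap(flash, pro)} points")

flash_elo, pro_elo = GDPVAL_AA_ELO
print(
    f"GDPval-AA: Elo gap {flash_elo - pro_elo}, "
    f"expected score {elo_expected_score(flash_elo, pro_elo):.2f}"
)
```

Under that assumed scale, a 342-point gap corresponds to an expected score of about 0.88 for Gemini 3.5 Flash in a head-to-head comparison, which is one way to read the Elo gap as large; a different rating scale would give a different number.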
This performance inversion highlights the rapid progress in Google's lightweight model family. Larger Pro models have historically dominated reasoning tasks, but the architectural optimizations and distillation techniques applied to Gemini 3.5 Flash have allowed it to surpass the older Gemini 3.1 Pro on several complex, multi-step agentic and coding workflows.
Source quality: The primary evidence is the official Google DeepMind product page for Gemini 3.5 Flash, which provides a comprehensive benchmark table comparing Gemini 3.5 Flash directly against Gemini 3.1 Pro and other industry models.