Published fact-check

Fact Check: Does GPT-5.5 underperform Claude Opus 4.7 on the SWE-Bench Pro coding benchmark?

Claim checked

“GPT 5.5 underperforms Opus 4.7 on SWE-Bench Pro. Couldn't find any reported SWE-Bench scores at all and an internal benchmark is reported instead. That footnote is trying really hard to bury the lede. GPT 5.5 isn't SOTA for coding.”

Verdict

Supported

Evidence from official release notes and independent technical analysis confirms that Claude Opus 4.7 outperforms GPT-5.5 on SWE-Bench Pro, a key benchmark for real-world software engineering tasks. While OpenAI's announcement highlights a state-of-the-art (SOTA) score on a different benchmark (Terminal-Bench 2.0), the score it reports for SWE-Bench Pro is lower than the score Anthropic's latest model had already established.

7 reviewed sources behind this verdict.

Reasoning

The claim is supported by a direct comparison of the official performance data that OpenAI and Anthropic released in April 2026.

OpenAI reported a 58.6% resolve rate for GPT-5.5 on SWE-Bench Pro. By contrast, Claude Opus 4.7, released one week earlier, achieved a 64.3% resolve rate on the same benchmark.
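
A quick arithmetic check of those figures (a minimal sketch in Python; the two resolve rates are the officially reported numbers cited in this fact-check, while the relative-gap framing is added here purely for illustration):

    # Compare the reported SWE-Bench Pro resolve rates.
    # Scores are the officially reported figures cited above; the
    # relative-gap calculation is illustrative, not from either vendor.
    gpt_5_5 = 58.6    # GPT-5.5 resolve rate (%), per OpenAI's announcement
    opus_4_7 = 64.3   # Claude Opus 4.7 resolve rate (%), per Anthropic's

    absolute_gap = opus_4_7 - gpt_5_5            # gap in percentage points
    relative_gap = absolute_gap / gpt_5_5 * 100  # Opus 4.7's lead relative to GPT-5.5

    print(f"Absolute gap: {absolute_gap:.1f} percentage points")  # 5.7
    print(f"Relative gap: {relative_gap:.1f}%")                   # ~9.7

In relative terms, the reported Opus 4.7 score is roughly 9.7% higher than the reported GPT-5.5 score, not merely 5.7 points in absolute terms.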

Furthermore, the user's observation about OpenAI's reporting style is largely accurate: the primary comparison table in OpenAI's official announcement features an internal benchmark called 'Expert-SWE' (where GPT-5.5 scored 73.1%) rather than the industry-standard SWE-Bench Pro. The SWE-Bench Pro score appears in the body text of the announcement but is omitted from the main visual comparison, supporting the assertion that the company foregrounded internal metrics on which its model looks more dominant.

Source quality: This fact-check relies on official product release documentation from OpenAI and Anthropic, as well as technical reporting from The Next Web and Vellum.ai that directly compares the two models on the benchmarks in question.

Key checks

  • GPT-5.5 SWE-Bench Pro Performance: In its April 23, 2026, release announcement, OpenAI stated that GPT-5.5 reaches 58.6% on SWE-Bench Pro, which evaluates the resolution of real-world GitHub issues.

  • Claude Opus 4.7 SWE-Bench Pro Performance: Anthropic's Claude Opus 4.7, released on April 16, 2026, scores 64.3% on SWE-Bench Pro, 5.7 percentage points higher than GPT-5.5's reported score.

  • OpenAI's Use of Internal Benchmarks: OpenAI's official announcement table prominently features 'Expert-SWE (Internal)', where GPT-5.5 scores 73.1%. It does not list the SWE-Bench Pro score in that specific comparison table, though it is mentioned later in the text.

Confidence

High