When OpenAI launched GPT-5.5 on April 23, the company called it the "smartest and most intuitive to use model yet" in its announcement.
The full benchmark tables OpenAI published, however, tell a different story: GPT-5.5 trails its rivals in several categories. Anthropic's Claude Opus 4.7 was released on April 16, and Google's Gemini 3.1 Pro launched back in February. Both models beat GPT-5.5 at specific tasks that actually matter for real work.
Matthew Berman, an AI engineer and CEO at ForwardFuture, got early access and tested GPT-5.5 for two weeks. His verdict, posted on X: "It is as good as any Opus model and oftentimes better at certain tasks…It's better than Opus at the backend, but it's still not as good at the front-end design."
Here are six things GPT-5.5 still can't do better than Claude Opus 4.7 and Gemini 3.1 Pro, based on publicly available benchmarks:
1. GPT-5.5 can't solve real GitHub issues better than Claude Opus 4.7
On SWE-Bench Pro, which tests whether AI models can actually fix real GitHub issues end-to-end, Claude Opus 4.7 scored 64.3% while GPT-5.5 scored 58.6%. That 5.7-point gap represents hundreds of coding tasks where Opus ships working code and GPT-5.5 doesn't.
Michael Truell, the CEO of Cursor (which recently announced a deal with Elon Musk), confirmed in Anthropic's official announcement that Opus 4.7 "lifted resolution by 13% over Opus 4.6" on Cursor's internal 93-task benchmark. The new model, he added, solved four tasks that neither Opus 4.6 nor Sonnet 4.6 could touch.
OpenAI's announcement highlighted the model's Terminal-Bench performance but didn't dispute Claude's SWE-Bench lead.
2. The model is not better than Gemini 3.1 Pro at web browsing and search tasks
On BrowseComp, which measures how well models navigate the web, search for information, and stitch together multi-step research tasks, Gemini 3.1 Pro scored 85.9%, GPT-5.5 scored 84.4%, and Claude Opus 4.7 scored 79.3%.
In its announcement, Google claims that Gemini can build a "live aerospace dashboard, successfully configuring a public telemetry stream to visualise the International Space Station's orbit." That's the kind of web-to-code workflow where Gemini wins.
The 1.5-point gap between Gemini and GPT-5.5 might sound small, but when you're automating research workflows at scale, those percentage points compound into hours of wasted time.
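To put that in concrete terms, here is a rough back-of-envelope sketch. The task volume and retry time are hypothetical numbers chosen purely for illustration; only the two BrowseComp scores come from the published benchmarks.

```python
# Back-of-envelope: how a small success-rate gap compounds across a large
# automated browsing/research workload. Volumes and times are hypothetical.

TASKS_PER_MONTH = 10_000      # hypothetical number of automated research tasks
MINUTES_PER_RETRY = 5         # hypothetical human time to catch and rerun a failure

gemini_success = 0.859        # Gemini 3.1 Pro, BrowseComp
gpt_success = 0.844           # GPT-5.5, BrowseComp

extra_failures = (gemini_success - gpt_success) * TASKS_PER_MONTH
extra_hours = extra_failures * MINUTES_PER_RETRY / 60

print(f"Extra failed tasks per month: {extra_failures:.0f}")     # ~150
print(f"Extra cleanup time per month: {extra_hours:.1f} hours")  # ~12.5
```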

3. GPT-5.5 lags behind both competitors on graduate-level science questions
On GPQA Diamond, Claude Opus 4.7 scored 94.2% and Gemini 3.1 Pro scored 94.3%, while GPT-5.5 trailed at 93.6%. GPQA Diamond tests graduate-level knowledge in physics, chemistry, and biology, subjects that require both deep factual knowledge and multi-step reasoning.
Anthropic's announcement describes a researcher using Opus 4.7 to "analyse a gene-expression dataset with 62 samples and nearly 28,000 genes, producing a detailed research report that not only summarised the findings but also surfaced key questions and insights."
OpenAI's examples so far have leaned heavily on engineering and coding.
4. The model performs worse than Claude on Humanity's Last Exam without tools
Claude Opus 4.7 scored 46.9% on Humanity's Last Exam when tested without any tools. GPT-5.5 scored 41.4% and Gemini 3.1 Pro scored 44.4%. This benchmark specifically tests whether models can solve difficult problems using only their core reasoning, without access to code execution, web search, or other tools.
The gap narrowed once tools were added, but Claude stayed ahead: it scored 54.7%, while GPT-5.5 jumped to 52.2% and Gemini reached 51.4%. Anthropic described this in its announcement as evidence that Opus 4.7 "shows gains on scientific and technical research workflows, which require more than answering a hard question."
5. The model does not produce a better front-end design than Claude Opus 4.7
Matthew Berman's assessment after two weeks of testing was direct: "It's better than Opus at backend, but it's still not as good at front end design." Bolt's team reported that Opus 4.7 runs "up to 10% better" than Opus 4.6 for app-building work.
One tester quoted in Anthropic's announcement called Opus 4.7 "the best model in the world for building dashboards and data-rich interfaces." They added that "the design taste is genuinely surprising" and that it "makes choices I'd actually ship."
OpenAI's announcement did not include any benchmarks for front-end quality or design taste.
6. GPT-5.5 costs $5 more per million output tokens than Claude Opus 4.7
Claude Opus 4.7 costs $25 per million output tokens. GPT-5.5 costs $30 per million output tokens. Both charge $5 per million input tokens. For high-volume applications, that $5 output difference adds up fast.
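As a rough sketch of how that difference scales, the snippet below compares monthly bills at the published prices. The token volumes are hypothetical and will vary widely by application; only the per-token prices come from the figures above.

```python
# Monthly cost comparison at the published per-token prices.
# Token volumes are hypothetical; only the prices come from the article.

INPUT_PRICE = 5.0             # $ per million input tokens (both models)
OPUS_OUTPUT_PRICE = 25.0      # $ per million output tokens, Claude Opus 4.7
GPT_OUTPUT_PRICE = 30.0       # $ per million output tokens, GPT-5.5

input_millions = 500          # hypothetical: 500M input tokens per month
output_millions = 200         # hypothetical: 200M output tokens per month

opus_cost = input_millions * INPUT_PRICE + output_millions * OPUS_OUTPUT_PRICE
gpt_cost = input_millions * INPUT_PRICE + output_millions * GPT_OUTPUT_PRICE

print(f"Claude Opus 4.7: ${opus_cost:,.0f}/month")            # $7,500
print(f"GPT-5.5:         ${gpt_cost:,.0f}/month")             # $8,500
print(f"Difference:      ${gpt_cost - opus_cost:,.0f}/month") # $1,000
```

At those volumes, the same workload costs about $1,000 more per month on GPT-5.5, purely from the output-token price difference.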
OpenAI argued that its premium pricing is worth it because the model is "more intelligent and much more token efficient." But the benchmarks show those intelligence advantages are task-specific. If your workload leans toward front-end design, scientific research, or web browsing, paying for Opus 4.7 looks like a better deal than GPT-5.5.
