OpenAI Abandons SWE-bench Verified After Finding 59% of Failed Tests Were Flawed





Rebeca Moen
Mar 03, 2026 18:33

OpenAI reveals major contamination issues in the SWE-bench Verified benchmark, showing that frontier AI models memorized solutions and that flawed tests rejected correct code.




OpenAI has stopped reporting scores on SWE-bench Verified, the widely used AI coding benchmark, after discovering that nearly 60% of problems its models failed contained fundamentally broken tests. The company’s February 23, 2026 analysis also found evidence that all major frontier models—including GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash—had been trained on benchmark solutions, rendering scores meaningless.

“Improvements on SWE-bench Verified no longer reflect meaningful improvements in models’ real-world software development abilities,” OpenAI stated. “Instead, they increasingly reflect how much the model was exposed to the benchmark at training time.”

The Numbers Tell the Story

OpenAI audited 138 problems—27.6% of the 500-problem dataset—that its o3 model couldn’t consistently solve across 64 independent runs. The findings were damning: 59.4% of these problems had material issues in test design or problem descriptions that made them “extremely difficult or impossible even for the most capable model or human to solve.”

Breaking down the failures: 35.5% of audited tasks had overly strict tests that rejected functionally correct solutions by demanding specific implementation details never mentioned in problem descriptions. Another 18.8% tested for functionality that wasn’t even specified in the task.
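Converting those percentages back into approximate problem counts (using only the figures reported in the article) gives a sense of scale. A quick sanity check:

```python
# Figures from OpenAI's reported audit; counts are rounded approximations.
audited = 138   # problems o3 failed consistently across 64 runs
total = 500     # full SWE-bench Verified dataset

print(f"audited share: {audited / total:.1%}")   # 27.6%

material = round(0.594 * audited)  # tasks with material issues
strict   = round(0.355 * audited)  # overly strict tests
unspec   = round(0.188 * audited)  # tested unspecified functionality
print(material, strict, unspec)    # roughly 82, 49, 26 problems
```

Note that the two named failure categories (35.5% + 18.8%) sum to 54.3%, so the remaining ~5% of the 59.4% figure falls into other issue types not broken out here.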

One example involved a pylint PR where tests required importing a function called “get_annotation”—a name never mentioned in the problem statement. Models that solved the underlying issue correctly still failed because they didn’t psychically guess the expected function name.
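The failure mode is easy to illustrate. The sketch below is hypothetical (it is not the actual pylint test suite, and `extract_annotation` is an invented name): a hidden test hard-codes a function name the task never stated, so a functionally correct solution still fails.

```python
import types

# A model's functionally correct fix, under a name the model chose itself.
def extract_annotation(node):
    """Return the type annotation attached to a node, if any."""
    return getattr(node, "annotation", None)

solution = types.SimpleNamespace(extract_annotation=extract_annotation)

# The benchmark's hidden test demands a specific name that never
# appeared in the problem description.
def hidden_test(module):
    if not hasattr(module, "get_annotation"):  # unstated requirement
        return "FAIL: expected function 'get_annotation' not found"
    node = types.SimpleNamespace(annotation="int")
    return "PASS" if module.get_annotation(node) == "int" else "FAIL"

print(hidden_test(solution))
# FAIL: expected function 'get_annotation' not found
```

The behavior is right; only the name differs. A model that had memorized the gold patch, by contrast, would "know" to call its function `get_annotation` and pass.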

Every Major Model Is Contaminated

The contamination evidence proved more troubling. OpenAI built an automated red-teaming system using GPT-5 to probe competing models for benchmark knowledge. The results showed all tested frontier models could reproduce original human-written solutions or quote verbatim problem details they should never have seen.

GPT-5.2, when given minimal hints, reproduced the exact code patch for a Django authentication fix—including the specific conditional statement “if username is None or password is None.” Claude Opus 4.5 quoted word-for-word an inline comment from a gold patch it supposedly never encountered. Gemini 3 Flash, given only a task ID, output the complete unified diff with correct line numbers.
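OpenAI has not published its red-teaming harness, but the core idea can be sketched: prompt a model with minimal context, then measure how much of the gold patch it reproduces verbatim. A minimal n-gram overlap check, assuming hypothetical probe and patch strings:

```python
def ngram_set(text, n=8):
    """All contiguous n-token sequences in a text."""
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def verbatim_overlap(model_output, gold_patch, n=8):
    """Fraction of gold-patch n-grams the model reproduced verbatim.
    High overlap from a minimal prompt suggests memorization."""
    gold = ngram_set(gold_patch, n)
    if not gold:
        return 0.0
    return len(gold & ngram_set(model_output, n)) / len(gold)

# Hypothetical example echoing the Django case from the article.
gold = "if username is None or password is None: return None  # avoid timing attack"
probe = "the fix adds: if username is None or password is None: return None  # avoid timing attack"
print(round(verbatim_overlap(probe, gold, n=5), 2))  # 1.0 — full verbatim reproduction
```

A model solving the problem from scratch would produce semantically equivalent code with low verbatim overlap; near-total overlap from a hint-free prompt is the smoking gun.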

The contamination creates an unfair advantage. Models that have seen solutions during training can pass underspecified tests by “remembering” implementation details that weren’t in the problem description—essentially having the answer key before the exam.

From 80% to 23%

The benchmark’s decay became visible in stalled progress. State-of-the-art scores improved only from 74.9% to 80.9% over six months—not because models hit capability ceilings, but because the remaining problems were either impossible or required memorized knowledge.

SWE-bench Pro, the recommended replacement, paints a different picture. According to recent data from February 26, 2026, models scoring 80% on Verified dropped to roughly 23% on Pro—a benchmark designed to resist contamination. Claude Opus 4.6 currently leads Pro with 79.20% performance, though that figure measures a different, cleaner test set.

What Comes Next

OpenAI recommends the industry shift to SWE-bench Pro’s public split while acknowledging it’s imperfect. The company is investing in privately authored benchmarks like GDPVal, where domain experts create original tasks and trained reviewers grade solutions holistically.

The broader lesson matters for anyone tracking AI capabilities: benchmarks sourced from public repositories carry inherent contamination risk. When training data includes the test, scores become theater. For researchers, investors, and developers betting on AI coding progress, the real frontier is harder to measure than leaderboards suggest.
