OpenAI Abandons SWE-bench Verified After Finding 59% of Failed Tests Were Flawed





Rebeca Moen
Mar 03, 2026 18:33

OpenAI reveals major contamination issues in the SWE-bench Verified benchmark, showing that frontier AI models memorized solutions and that flawed tests rejected correct code.




OpenAI has stopped reporting scores on SWE-bench Verified, the widely used AI coding benchmark, after discovering that nearly 60% of problems its models failed contained fundamentally broken tests. The company’s February 23, 2026 analysis also found evidence that all major frontier models—including GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash—had been trained on benchmark solutions, rendering scores meaningless.

“Improvements on SWE-bench Verified no longer reflect meaningful improvements in models’ real-world software development abilities,” OpenAI stated. “Instead, they increasingly reflect how much the model was exposed to the benchmark at training time.”

The Numbers Tell the Story

OpenAI audited 138 problems—27.6% of the 500-problem dataset—that its o3 model couldn’t consistently solve across 64 independent runs. The findings were damning: 59.4% of these problems had material issues in test design or problem descriptions that made them “extremely difficult or impossible even for the most capable model or human to solve.”

Breaking down the failures: 35.5% of audited tasks had overly strict tests that rejected functionally correct solutions by demanding specific implementation details never mentioned in problem descriptions. Another 18.8% tested for functionality that wasn’t even specified in the task.
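Converting those percentages back into approximate problem counts (using only the figures reported in the article) gives a sense of scale. A quick sanity check:

```python
# Figures from OpenAI's reported audit; counts are rounded approximations.
audited = 138   # problems o3 failed consistently across 64 runs
total = 500     # full SWE-bench Verified dataset

print(f"audited share: {audited / total:.1%}")   # 27.6%

material = round(0.594 * audited)  # tasks with material issues
strict   = round(0.355 * audited)  # overly strict tests
unspec   = round(0.188 * audited)  # tested unspecified functionality
print(material, strict, unspec)    # roughly 82, 49, 26 problems
```

Note that the two named failure categories (35.5% + 18.8%) sum to 54.3%, so the remaining ~5% of the 59.4% figure falls into other issue types not broken out here.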

One example involved a pylint PR where tests required importing a function called “get_annotation”—a name never mentioned in the problem statement. Models that solved the underlying issue correctly still failed because they didn’t psychically guess the expected function name.
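The failure mode is easy to illustrate. The sketch below is hypothetical (it is not the actual pylint test suite, and `extract_annotation` is an invented name): a hidden test hard-codes a function name the task never stated, so a functionally correct solution still fails.

```python
import types

# A model's functionally correct fix, under a name the model chose itself.
def extract_annotation(node):
    """Return the type annotation attached to a node, if any."""
    return getattr(node, "annotation", None)

solution = types.SimpleNamespace(extract_annotation=extract_annotation)

# The benchmark's hidden test demands a specific name that never
# appeared in the problem description.
def hidden_test(module):
    if not hasattr(module, "get_annotation"):  # unstated requirement
        return "FAIL: expected function 'get_annotation' not found"
    node = types.SimpleNamespace(annotation="int")
    return "PASS" if module.get_annotation(node) == "int" else "FAIL"

print(hidden_test(solution))
# FAIL: expected function 'get_annotation' not found
```

The behavior is right; only the name differs. A model that had memorized the gold patch, by contrast, would "know" to call its function `get_annotation` and pass.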

Every Major Model Is Contaminated

The contamination evidence proved more troubling. OpenAI built an automated red-teaming system using GPT-5 to probe competing models for benchmark knowledge. The results showed all tested frontier models could reproduce original human-written solutions or quote verbatim problem details they should never have seen.

GPT-5.2, when given minimal hints, reproduced the exact code patch for a Django authentication fix—including the specific conditional statement “if username is None or password is None.” Claude Opus 4.5 quoted word-for-word an inline comment from a gold patch it supposedly never encountered. Gemini 3 Flash, given only a task ID, output the complete unified diff with correct line numbers.
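OpenAI has not published its red-teaming harness, but the core idea can be sketched: prompt a model with minimal context, then measure how much of the gold patch it reproduces verbatim. A minimal n-gram overlap check, assuming hypothetical probe and patch strings:

```python
def ngram_set(text, n=8):
    """All contiguous n-token sequences in a text."""
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def verbatim_overlap(model_output, gold_patch, n=8):
    """Fraction of gold-patch n-grams the model reproduced verbatim.
    High overlap from a minimal prompt suggests memorization."""
    gold = ngram_set(gold_patch, n)
    if not gold:
        return 0.0
    return len(gold & ngram_set(model_output, n)) / len(gold)

# Hypothetical example echoing the Django case from the article.
gold = "if username is None or password is None: return None  # avoid timing attack"
probe = "the fix adds: if username is None or password is None: return None  # avoid timing attack"
print(round(verbatim_overlap(probe, gold, n=5), 2))  # 1.0 — full verbatim reproduction
```

A model solving the problem from scratch would produce semantically equivalent code with low verbatim overlap; near-total overlap from a hint-free prompt is the smoking gun.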

The contamination creates an unfair advantage. Models that have seen solutions during training can pass underspecified tests by “remembering” implementation details that weren’t in the problem description—essentially having the answer key before the exam.

From 80% to 23%

The benchmark’s decay became visible in stalled progress. State-of-the-art scores improved only from 74.9% to 80.9% over six months—not because models hit capability ceilings, but because the remaining problems were either impossible or required memorized knowledge.

SWE-bench Pro, the recommended replacement, paints a different picture. According to recent data from February 26, 2026, models scoring 80% on Verified dropped to roughly 23% on Pro—a benchmark designed to resist contamination. Claude Opus 4.6 currently leads Pro with 79.20% performance, though that figure measures a different, cleaner test set.

What Comes Next

OpenAI recommends the industry shift to SWE-bench Pro’s public split while acknowledging it’s imperfect. The company is investing in privately authored benchmarks like GDPVal, where domain experts create original tasks and trained reviewers grade solutions holistically.

The broader lesson matters for anyone tracking AI capabilities: benchmarks sourced from public repositories carry inherent contamination risk. When training data includes the test, scores become theater. For researchers, investors, and developers betting on AI coding progress, the real frontier is harder to measure than leaderboards suggest.
