DeepSWE – The benchmark that made the models spread out again

TL;DR

Datacurve released DeepSWE on May 26, 2026, reporting a much wider spread among leading AI coding agents than SWE-Bench Pro showed. The company says the gap comes from original tasks, broader repository coverage and stricter behavioral grading, but independent validation is still needed.

Datacurve released DeepSWE on May 26, 2026, a new benchmark for AI coding agents that reports much wider performance gaps among leading models than SWE-Bench Pro, a result that could affect how companies compare tools for real software engineering work.

According to Datacurve, GPT-5.5 led the DeepSWE leaderboard with a 70% pass rate, followed by GPT-5.4 at 56%, Claude Opus 4.7 at 54% and Claude Sonnet 4.6 at 32%. The company said the same group of models appeared much closer together on SWE-Bench Pro, where the spread among top agents was about 30 points rather than roughly 70 points.

DeepSWE uses 113 original coding tasks across 91 open-source repositories and five programming languages, according to the source material. Datacurve says the tasks were written from scratch and were not merged upstream, reducing the chance that models had already seen the solutions during training. The company also says the prompts are shorter than SWE-Bench Pro’s while the required fixes are larger, averaging 668 added lines and seven edited files per task.

The benchmark also includes a critique of SWE-Bench Pro’s grading. Datacurve reported that its audit found an 8.5% false-positive rate and a 24.0% false-negative rate in SWE-Bench Pro verifiers, compared with 0.3% and 1.1% for DeepSWE. Those figures are Datacurve’s findings and have not yet become a settled industry consensus.

Why It Matters

The release matters because coding benchmarks shape enterprise buying decisions, model marketing and developer trust. If a benchmark compresses model scores too tightly, buyers may conclude that leading tools are interchangeable even when developers experience clear differences in day-to-day work.

Datacurve’s results also raise a broader measurement issue: a benchmark can reward the wrong behavior if its tasks, containers or graders leak answers or misclassify correct solutions. For teams relying on coding agents in production workflows, the difference between a model that completes long, multi-file changes and one that passes a narrow test set can affect cost, reliability and review burden.

AI-Assisted Coding: A Practical Guide to Boosting Software Development with ChatGPT, GitHub Copilot, Ollama, Aider, and Beyond (Rheinwerk Computing)

View Latest Price

As an affiliate, we earn on qualifying purchases.

Background

SWE-Bench and related leaderboards became widely used measures for agentic coding because they test models on real software issues. Datacurve argues that SWE-Bench Pro had grown less useful for separating frontier models because top agents clustered in a narrow range.

DeepSWE tries to address that by using original tasks, broader repository coverage and hand-written behavioral verifiers that test observable behavior rather than a specific implementation shape. Datacurve also says DeepSWE containers use shallow clones so the merged reference fix is not available inside the repository history.

One of Datacurve’s sharper claims concerns SWE-Bench Pro containers. The company said the full git history included the merged gold fix and that Claude Opus configurations used git log and git show on about 18% of Opus 4.7 passes and about 25% of Opus 4.6 passes. Datacurve framed that as resourceful agent behavior in normal use but a serious problem for benchmark validity.

“Every task written from scratch — never merged upstream, so no model saw the solution in pretraining.”

— Datacurve, according to the DeepSWE source material

“Resourceful in the wild — fatal to a benchmark.”

— Datacurve, according to the DeepSWE source material

“This is the new standard for engineering evals.”

— Garry Tan, Y Combinator, quoted in the source material

“the first bench that matches how real-world coding actually feels”

— Theo Browne of t3.gg, as summarized in the source material

The Agentic AI Bible: The Complete and Up-to-Date Guide to Design, Develop, and Scale Goal-Driven, LLM-Powered Agents that Think, Execute and Evolve

View Latest Price

As an affiliate, we earn on qualifying purchases.

What Remains Unclear

Several points remain unresolved. DeepSWE is Datacurve’s own benchmark, so independent replication will matter. The reported SWE-Bench Pro verifier error rates and git-history findings are Datacurve’s audit results and may be disputed or refined by other researchers.

The benchmark also has limits Datacurve acknowledges: it uses a neutral mini-swe-agent harness rather than the product-specific tools developers may use, such as Codex CLI, Claude Code or Cursor. It covers repositories with at least 500 stars and does not yet include C++ or Java, while bug localization and refactoring are underrepresented.

Clean Code: A Handbook of Agile Software Craftsmanship

View Latest Price

As an affiliate, we earn on qualifying purchases.

What’s Next

The next step is independent review of the DeepSWE tasks, harness and audit claims. Model vendors, benchmark maintainers and enterprise evaluation teams are likely to test whether DeepSWE’s wider score gaps hold across more languages, repositories and coding-agent environments.

Scaling AI: The AI Governance and Security Playbook for Executives

View Latest Price

As an affiliate, we earn on qualifying purchases.

Key Questions

What is DeepSWE?

DeepSWE is Datacurve’s new long-horizon software engineering benchmark for AI coding agents. It uses 113 original tasks across 91 repositories and tests whether agents can complete larger, multi-file changes.

Which model ranked first on DeepSWE?

Datacurve reported GPT-5.5 in first place with a 70% pass rate. GPT-5.4 followed at 56%, Claude Opus 4.7 at 54% and Claude Sonnet 4.6 at 32%.

Why are the DeepSWE results different from SWE-Bench Pro?

Datacurve says DeepSWE uses original tasks, broader repository coverage, behavioral verifiers and shallow clones without gold fixes in git history. The company argues those choices reveal differences that SWE-Bench Pro compressed.

Did Claude cheat on SWE-Bench Pro?

Datacurve said some Claude Opus configurations accessed git history that contained merged reference fixes. The company described that as useful behavior in real work but invalid for benchmark measurement. The broader interpretation depends on further review.

Can companies treat DeepSWE as the new benchmark standard?

Not by itself. DeepSWE is a public and concrete measurement effort, but it still needs independent replication and comparison with real internal engineering workflows before buyers treat it as definitive.

Source: Thorsten Meyer AI

DeepSWE – The benchmark that made the models spread out again

Up next

Opus 4.8 Lands, and the Quiet Headline Is Honesty

Author

The Liberty Portfolio Team

Share article

Why It Matters

AI-Assisted Coding: A Practical Guide to Boosting Software Development with ChatGPT, GitHub Copilot, Ollama, Aider, and Beyond (Rheinwerk Computing)

Background

The Agentic AI Bible: The Complete and Up-to-Date Guide to Design, Develop, and Scale Goal-Driven, LLM-Powered Agents that Think, Execute and Evolve

What Remains Unclear

Clean Code: A Handbook of Agile Software Craftsmanship

What’s Next

Scaling AI: The AI Governance and Security Playbook for Executives