Code

Est. 2023

SWE-bench Verified

SWE-bench Verified is a human-validated subset of SWE-bench that evaluates AI systems on real-world software engineering tasks drawn from GitHub issues in popular Python repositories. For each issue, a model must understand the codebase, diagnose the problem, and generate a patch that fixes it.

Metrics

Resolve rate (%) on verified GitHub issues
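
As a rough illustration of how a resolve rate can be computed, the sketch below assumes an issue counts as resolved only when the patch makes the previously failing tests pass and keeps the previously passing tests passing. The `IssueResult` record, its field names, and the instance IDs are hypothetical and are not taken from any official harness.

```python
from dataclasses import dataclass


@dataclass
class IssueResult:
    """Outcome of applying one model-generated patch (hypothetical record)."""
    instance_id: str
    fail_to_pass_ok: bool  # tests that failed before the patch now pass
    pass_to_pass_ok: bool  # tests that passed before the patch still pass


def resolve_rate(results: list[IssueResult]) -> float:
    """Return the percentage of issues whose patch satisfies both conditions."""
    if not results:
        return 0.0
    resolved = sum(r.fail_to_pass_ok and r.pass_to_pass_ok for r in results)
    return 100.0 * resolved / len(results)


# Made-up outcomes for four issues; only the first one is fully resolved.
example = [
    IssueResult("repo-a__issue-1", True, True),
    IssueResult("repo-a__issue-2", True, False),
    IssueResult("repo-b__issue-3", False, True),
    IssueResult("repo-b__issue-4", False, False),
]
print(f"Resolve rate: {resolve_rate(example):.1f}%")
```

A patch that fixes the reported bug but breaks existing behavior is counted as unresolved under this assumption, which is why both flags are required.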

Created By

Princeton NLP

Top Model Scores

Rank	Model	Score	Date
1	Claude Opus 4.6 + Agentless	62.4%	2026-02
2	GPT-5.2 + SWE-Agent	59.8%	2026-03
3	Gemini 3 Ultra + Agent	55.3%	2026-01
4	DeepSeek Coder V3	51.7%	2026-01
5	Grok 4 + Agent	49.2%	2026-02

Related Code Benchmarks

HumanEval

HumanEval evaluates the functional correctness of code generated by language models. It consists of 164 hand-written programming problems with function signatures, docstrings, and unit tests, measuring pass@1 and pass@k rates.

Top: Claude Opus 4.6 — 96.3%
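
The pass@k rates mentioned above are commonly estimated per problem by generating n samples, counting the c samples that pass all unit tests, and computing 1 - C(n-c, k) / C(n, k). The sketch below follows that formula; the function name and argument checks are illustrative choices, and a benchmark score would be the mean of this value over all problems.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of samples generated
    c: number of those samples that pass every unit test
    k: sample budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # With fewer than k failing samples, every draw of k contains a pass.
    if n - c < k:
        return 1.0
    # Probability that a random draw of k samples contains no passing sample.
    all_fail = comb(n - c, k) / comb(n, k)
    return 1.0 - all_fail


# 10 samples, 3 of them correct: estimate for a budget of 1 and of 5.
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
```

With k = 1 the estimate reduces to c / n, the fraction of samples that pass.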

MBPP (Mostly Basic Python Problems)

MBPP consists of around 1,000 crowd-sourced Python programming problems designed to be solvable by entry-level programmers. Each problem includes a task description, code solution, and three automated test cases.

Top: Claude Opus 4.6 — 93.8%

CodeContests

CodeContests is a competitive programming benchmark drawn from Codeforces, CodeChef, and other platforms. It tests algorithmic problem-solving with problems requiring complex data structures, dynamic programming, and mathematical reasoning.

Top: GPT-5.2 — 43.2%

HumanEval+

HumanEval+ augments the original HumanEval benchmark with 80x more test cases per problem, providing a more rigorous evaluation of code correctness. Many models that score well on HumanEval see significant drops on HumanEval+.

Top: Claude Opus 4.6 — 90.2%
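
To illustrate why a larger test suite can lower scores, the toy example below shows a subtly wrong solution that passes a small set of tests but fails once more inputs are added. The function and both test lists are invented for illustration and are not taken from HumanEval or HumanEval+.

```python
def buggy_median(xs):
    """Intended to return the median; wrong for even-length inputs."""
    s = sorted(xs)
    return s[len(s) // 2]  # ignores the averaging step for even lengths


def passes(fn, tests):
    """Return True if fn gives the expected output on every test case."""
    return all(fn(list(inp)) == expected for inp, expected in tests)


# A sparse suite that happens to contain only odd-length inputs.
base_tests = [([3, 1, 2], 2), ([5], 5)]

# An augmented suite that adds even-length inputs.
extra_tests = base_tests + [([1, 2, 3, 4], 2.5), ([1, 3], 2)]

print("base suite:", passes(buggy_median, base_tests))
print("augmented suite:", passes(buggy_median, extra_tests))
```

The same solution is judged correct by the sparse suite and incorrect by the augmented one, which is the effect the drop in scores reflects.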
