Code
Est. 2023

SWE-bench Verified

SWE-bench Verified, a human-validated subset of SWE-bench, evaluates AI systems on real-world software engineering tasks drawn from GitHub issues in popular Python repositories. For each task, a model must understand the codebase, diagnose the issue, and generate a patch that passes the repository's tests.

Metrics

Resolve rate (%) on verified GitHub issues
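As a rough illustration of the metric, the resolve rate is the share of task instances whose generated patch was judged resolved, expressed as a percentage. The sketch below assumes the per-instance outcomes are already available as booleans; the function name `resolve_rate` and the instance IDs are hypothetical and not part of any official harness.

```python
def resolve_rate(results: dict[str, bool]) -> float:
    """Return the percentage of task instances marked as resolved.

    `results` maps a task instance ID to True (resolved) or False.
    This is an illustrative helper, not the official scoring code.
    """
    if not results:
        raise ValueError("no results to score")
    resolved = sum(1 for ok in results.values() if ok)
    return 100.0 * resolved / len(results)


# Hypothetical outcomes for four task instances (IDs are made up).
example = {
    "repo-a-101": True,
    "repo-a-102": False,
    "repo-b-7": True,
    "repo-c-33": True,
}

print(f"{resolve_rate(example):.1f}%")  # 3 of 4 resolved -> 75.0%
```

An instance left unattempted would normally count against the score, so in this sketch it should be recorded as False rather than omitted from the dictionary.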

Created By

Princeton NLP

Top Model Scores

Rank  Model                          Score   Date
1     Claude Opus 4.6 + Agentless    62.4%   2026-02
2     GPT-5.2 + SWE-Agent            59.8%   2026-03
3     Gemini 3 Ultra + Agent         55.3%   2026-01
4     DeepSeek Coder V3              51.7%   2026-01
5     Grok 4 + Agent                 49.2%   2026-02