# HumanEval
HumanEval evaluates the functional correctness of code generated by language models. It consists of 164 hand-written Python programming problems, each with a function signature, docstring, and unit tests. Correctness is reported as pass@k: the probability that at least one of k sampled completions passes all of a problem's unit tests, with pass@1 the most commonly cited setting.
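The pass@k metric is usually computed with the unbiased estimator introduced alongside HumanEval: generate n samples per problem, count the c that pass, and estimate the chance a random size-k subset contains a passer. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n = samples generated for a problem,
    c = samples that pass all unit tests, k = evaluation budget."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    # 1 minus the probability that all k drawn samples fail
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 200 samples and 20 passing, pass@1 is simply 20/200:
print(pass_at_k(200, 20, 1))  # → 0.1
```

Per-problem estimates are then averaged over all 164 problems to produce the leaderboard numbers.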
## Top Model Scores
| Rank | Model | Score | Date |
|---|---|---|---|
| 1 | Claude Opus 4.6 | 96.3% | 2026-02 |
| 2 | GPT-5.2 | 95.7% | 2026-03 |
| 3 | Gemini 3 Ultra | 94.1% | 2026-01 |
| 4 | DeepSeek Coder V3 | 93.4% | 2026-01 |
| 5 | Grok 4 | 92.8% | 2026-02 |
## Related Code Benchmarks
### SWE-bench Verified
SWE-bench Verified evaluates AI systems on real-world software engineering tasks drawn from GitHub issues in popular Python repositories. Models must understand codebases, diagnose issues, and generate correct patches.
Top: Claude Opus 4.6 + Agentless — 62.4%
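The grading flow can be sketched as: apply the model's patch to a checkout of the repository, then run the issue's tests. The function name, arguments, and commands below are illustrative assumptions, not the official harness API:

```python
import pathlib
import subprocess
import tempfile

def evaluate_patch(repo_dir: str, patch_text: str, test_cmd: list[str]) -> bool:
    """SWE-bench-style grading sketch (illustrative, not the real harness):
    apply a model-generated diff to a checked-out repo, then run the
    issue's fail-to-pass tests."""
    patch_file = pathlib.Path(tempfile.mkdtemp()) / "model.diff"
    patch_file.write_text(patch_text)
    applied = subprocess.run(["git", "apply", str(patch_file)],
                             cwd=repo_dir, capture_output=True)
    if applied.returncode != 0:
        return False  # patch does not apply cleanly
    tests = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True)
    return tests.returncode == 0  # resolved only if the tests now pass
```

A patch that fails to apply, or applies but leaves the tests failing, counts as unresolved.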
### MBPP (Mostly Basic Python Problems)
MBPP consists of around 1,000 crowd-sourced Python programming problems designed to be solvable by entry-level programmers. Each problem includes a task description, code solution, and three automated test cases.
Top: Claude Opus 4.6 — 93.8%
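An MBPP record bundles the task text, a reference solution, and three assert-based tests; a model's candidate passes if all three asserts hold. The record below is an invented example in that format, and the checker is a minimal sketch, not the official evaluation code:

```python
# Illustrative MBPP-style record (field names follow the published dataset;
# the problem itself is made up for this example).
problem = {
    "text": "Write a function to find the minimum of two numbers.",
    "code": "def min_of_two(a, b):\n    return a if a < b else b",
    "test_list": [
        "assert min_of_two(1, 2) == 1",
        "assert min_of_two(-5, 3) == -5",
        "assert min_of_two(7, 7) == 7",
    ],
}

def passes(record: dict) -> bool:
    """Exec the candidate solution, then run each assert-based test."""
    env: dict = {}
    exec(record["code"], env)      # define the candidate function
    try:
        for test in record["test_list"]:
            exec(test, env)        # raises AssertionError on failure
        return True
    except AssertionError:
        return False

print(passes(problem))  # → True
```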
### CodeContests
CodeContests is a competitive programming benchmark drawn from Codeforces, CodeChef, and other platforms. It tests algorithmic problem-solving with problems requiring complex data structures, dynamic programming, and mathematical reasoning.
Top: GPT-5.2 — 43.2%
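Unlike function-level benchmarks, competitive-programming problems are judged as whole programs: feed each test case on stdin and compare the program's stdout to the expected answer. A minimal sketch of that judging loop (the harness here is an assumption, not the official one):

```python
import subprocess
import sys

def judge(solution_src: str, cases: list[tuple[str, str]]) -> bool:
    """Run a submitted Python program once per test case, comparing
    trimmed stdout against the expected output."""
    for stdin_text, expected in cases:
        result = subprocess.run(
            [sys.executable, "-c", solution_src],
            input=stdin_text, capture_output=True, text=True, timeout=5,
        )
        if result.returncode != 0 or result.stdout.strip() != expected.strip():
            return False  # crash, timeout fallout, or wrong answer
    return True

# A trivial "double the input" problem:
print(judge("print(int(input()) * 2)", [("3", "6"), ("10", "20")]))  # → True
```

Real judges add per-case time and memory limits, which is part of why scores stay far below those on HumanEval-style tasks.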
### HumanEval+
HumanEval+ augments the original HumanEval benchmark with 80x more test cases per problem, providing a more rigorous evaluation of code correctness. Many models that score well on HumanEval see significant drops on HumanEval+.
Top: Claude Opus 4.6 — 90.2%
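The score drops happen because sparse test suites let subtly wrong solutions through. An invented example (not an actual benchmark problem) of a candidate that survives a spot check but fails under denser testing:

```python
def median(xs):
    """Buggy candidate: correct for odd-length lists only."""
    xs = sorted(xs)
    return xs[len(xs) // 2]  # wrong for even-length inputs

# Original-style spot check: passes.
assert median([3, 1, 2]) == 2

# HumanEval+-style extra case: an even-length list exposes the bug.
# median([1, 2, 3, 4]) returns 3, but the true median is 2.5.
```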