# HumanEval
HumanEval evaluates the functional correctness of code generated by language models. It consists of 164 hand-written Python programming problems, each with a function signature, docstring, and unit tests. Correctness is reported as pass@k: the probability that at least one of k sampled completions passes all of a problem's unit tests, with pass@1 the most commonly cited setting.
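The pass@k metric is usually computed with the unbiased estimator introduced alongside HumanEval: generate n samples per problem, count the c that pass, and estimate the chance a random size-k subset contains a passer. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n = samples generated for a problem,
    c = samples that pass all unit tests, k = evaluation budget."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    # 1 minus the probability that all k drawn samples fail
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 200 samples and 20 passing, pass@1 is simply 20/200:
print(pass_at_k(200, 20, 1))  # → 0.1
```

Per-problem estimates are then averaged over all 164 problems to produce the leaderboard numbers.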
## Top Model Scores
| Rank | Model | Score | Date |
|---|---|---|---|
| 1 | Claude Opus 4.6 | 96.3% | 2026-02 |
| 2 | GPT-5.2 | 95.7% | 2026-03 |
| 3 | Gemini 3 Ultra | 94.1% | 2026-01 |
| 4 | DeepSeek Coder V3 | 93.4% | 2026-01 |
| 5 | Grok 4 | 92.8% | 2026-02 |
## Related Code Benchmarks
### SWE-bench Verified
SWE-bench Verified evaluates AI systems on real-world software engineering tasks drawn from GitHub issues in popular Python repositories. Models must understand codebases, diagnose issues, and generate correct patches.
Top: Claude Opus 4.6 + Agentless — 62.4%
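The grading flow can be sketched as: apply the model's patch to a checkout of the repository, then run the issue's tests. The function name, arguments, and commands below are illustrative assumptions, not the official harness API:

```python
import pathlib
import subprocess
import tempfile

def evaluate_patch(repo_dir: str, patch_text: str, test_cmd: list[str]) -> bool:
    """SWE-bench-style grading sketch (illustrative, not the real harness):
    apply a model-generated diff to a checked-out repo, then run the
    issue's fail-to-pass tests."""
    patch_file = pathlib.Path(tempfile.mkdtemp()) / "model.diff"
    patch_file.write_text(patch_text)
    applied = subprocess.run(["git", "apply", str(patch_file)],
                             cwd=repo_dir, capture_output=True)
    if applied.returncode != 0:
        return False  # patch does not apply cleanly
    tests = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True)
    return tests.returncode == 0  # resolved only if the tests now pass
```

A patch that fails to apply, or applies but leaves the tests failing, counts as unresolved.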
### MBPP (Mostly Basic Python Problems)
MBPP consists of around 1,000 crowd-sourced Python programming problems designed to be solvable by entry-level programmers. Each problem includes a task description, code solution, and three automated test cases.
Top: Claude Opus 4.6 — 93.8%
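An MBPP record bundles the task text, a reference solution, and three assert-based tests; a model's candidate passes if all three asserts hold. The record below is an invented example in that format, and the checker is a minimal sketch, not the official evaluation code:

```python
# Illustrative MBPP-style record (field names follow the published dataset;
# the problem itself is made up for this example).
problem = {
    "text": "Write a function to find the minimum of two numbers.",
    "code": "def min_of_two(a, b):\n    return a if a < b else b",
    "test_list": [
        "assert min_of_two(1, 2) == 1",
        "assert min_of_two(-5, 3) == -5",
        "assert min_of_two(7, 7) == 7",
    ],
}

def passes(record: dict) -> bool:
    """Exec the candidate solution, then run each assert-based test."""
    env: dict = {}
    exec(record["code"], env)      # define the candidate function
    try:
        for test in record["test_list"]:
            exec(test, env)        # raises AssertionError on failure
        return True
    except AssertionError:
        return False

print(passes(problem))  # → True
```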
### CodeContests
CodeContests is a competitive programming benchmark drawn from Codeforces, CodeChef, and other platforms. It tests algorithmic problem-solving with problems requiring complex data structures, dynamic programming, and mathematical reasoning.
Top: GPT-5.2 — 43.2%
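Unlike function-level benchmarks, competitive-programming problems are judged as whole programs: feed each test case on stdin and compare the program's stdout to the expected answer. A minimal sketch of that judging loop (the harness here is an assumption, not the official one):

```python
import subprocess
import sys

def judge(solution_src: str, cases: list[tuple[str, str]]) -> bool:
    """Run a submitted Python program once per test case, comparing
    trimmed stdout against the expected output."""
    for stdin_text, expected in cases:
        result = subprocess.run(
            [sys.executable, "-c", solution_src],
            input=stdin_text, capture_output=True, text=True, timeout=5,
        )
        if result.returncode != 0 or result.stdout.strip() != expected.strip():
            return False  # crash, timeout fallout, or wrong answer
    return True

# A trivial "double the input" problem:
print(judge("print(int(input()) * 2)", [("3", "6"), ("10", "20")]))  # → True
```

Real judges add per-case time and memory limits, which is part of why scores stay far below those on HumanEval-style tasks.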
### HumanEval+
HumanEval+ augments the original HumanEval benchmark with 80x more test cases per problem, providing a more rigorous evaluation of code correctness. Many models that score well on HumanEval see significant drops on HumanEval+.
Top: Claude Opus 4.6 — 90.2%
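The score drops happen because sparse test suites let subtly wrong solutions through. An invented example (not an actual benchmark problem) of a candidate that survives a spot check but fails under denser testing:

```python
def median(xs):
    """Buggy candidate: correct for odd-length lists only."""
    xs = sorted(xs)
    return xs[len(xs) // 2]  # wrong for even-length inputs

# Original-style spot check: passes.
assert median([3, 1, 2]) == 2

# HumanEval+-style extra case: an even-length list exposes the bug.
# median([1, 2, 3, 4]) returns 3, but the true median is 2.5.
```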