What Is HumanEval?
HumanEval is a benchmark created by OpenAI that evaluates an AI model's ability to generate correct Python code. It contains 164 hand-written programming problems and scores a model with pass@k: the fraction of problems for which at least one of k generated samples passes all unit tests.
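Each task pairs a prompt the model sees (a function signature plus docstring) with hidden unit tests that decide pass/fail. The sketch below illustrates that format with a made-up problem; `running_max` and its tests are hypothetical, not an actual HumanEval item.

```python
# Illustrative sketch of the HumanEval task format (hypothetical
# problem, not from the benchmark). The model sees only PROMPT;
# the hidden check function decides pass/fail.

PROMPT = '''
def running_max(numbers):
    """Return a list where element i is the maximum of numbers[:i+1]."""
'''

# A completion a model might generate for the prompt above.
def running_max(numbers):
    result, current = [], float("-inf")
    for n in numbers:
        current = max(current, n)
        result.append(current)
    return result

# Hidden unit tests, in the spirit of HumanEval's check functions.
def check(candidate):
    assert candidate([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
    assert candidate([]) == []
    assert candidate([-2, -5]) == [-2, -2]

check(running_max)  # a problem counts as solved only if every assert passes
```

A problem is scored pass/fail as a whole: one failing assertion means that sample did not solve the problem.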
How HumanEval Works
HumanEval presents a model with a function signature and a docstring describing what the function should do, then checks whether the generated body solves the problem by running unit tests. The most commonly reported metrics are pass@1 (the fraction of problems solved on the first attempt) and pass@10 (solved by at least one of 10 attempts). HumanEval has become the standard benchmark for AI code generation, but it has limitations: the problems are relatively simple, self-contained algorithmic tasks, not representative of real-world software engineering. Extensions like HumanEval+ add many more test cases to catch false positives, and newer benchmarks like SWE-bench test models on real GitHub issues.
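In practice, pass@k is computed with the unbiased estimator from the original HumanEval paper: generate n ≥ k samples per problem, count the c samples that pass all tests, and estimate 1 − C(n−c, k)/C(n, k); the benchmark score is the mean over problems. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.
    n = total samples generated, c = samples that passed all tests.
    Returns the probability that at least one of k drawn samples passes."""
    if n - c < k:  # fewer than k failures exist, so any k draws include a pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Benchmark score = mean over problems. E.g. two problems, 10 samples each,
# with 4 and 0 passing samples respectively:
scores = [pass_at_k(10, 4, 1), pass_at_k(10, 0, 1)]
print(sum(scores) / len(scores))  # → 0.2
```

Generating many samples and counting passes, rather than drawing exactly k, gives a lower-variance estimate of the same quantity.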
Real-World Examples
GPT-4 achieving 67% pass@1 on HumanEval when first released, a significant jump from GPT-3.5's 48%
A code-specialized model like DeepSeek Coder scoring 78.6% on HumanEval, outperforming general-purpose models
A company comparing GitHub Copilot and Cursor by running both on HumanEval problems and comparing pass rates