Codeforces Benchmark
The Codeforces Benchmark evaluates AI models on competitive programming problems from Codeforces, one of the world's largest competitive programming communities. Problems range from beginner to expert difficulty and demand algorithmic thinking, data-structure knowledge, and efficient implementation. Models are rated on the same Elo-style scale as human competitors, enabling direct comparison with human programmers.
Metrics
Elo rating (competitive programming scale)
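The Elo-style scale used above can be illustrated with the classic Elo update rule. This is a minimal sketch for intuition only: the K-factor of 32 is a conventional chess value, and Codeforces' actual rating formula differs in detail while following the same expected-score idea.

```python
# Minimal Elo-style rating sketch. The K-factor is a hypothetical
# chess-style value; Codeforces' real formula differs in detail.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that a player rated r_a beats one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> float:
    """Return A's new rating after a result (1 = win, 0.5 = draw, 0 = loss)."""
    return r_a + k * (score_a - expected_score(r_a, r_b))
```

Under this model, equally rated opponents each have an expected score of 0.5, and a win by either moves their ratings apart by K/2 points each.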
Created By
Mike Mirzayanov (Codeforces)
Top Model Scores
| Rank | Model | Score | Date |
|---|---|---|---|
| 1 | Claude Opus 4.6 | 1892 | 2026-02 |
| 2 | GPT-5.2 | 1856 | 2026-03 |
| 3 | DeepSeek-V4 | 1798 | 2026-01 |
| 4 | Gemini 3 Ultra | 1752 | 2026-01 |
| 5 | Grok 4 | 1689 | 2026-02 |
Related Code Benchmarks
HumanEval
HumanEval evaluates the functional correctness of code generated by language models. It consists of 164 hand-written programming problems with function signatures, docstrings, and unit tests, measuring pass@1 and pass@k rates.
Top: Claude Opus 4.6 — 96.3%
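The pass@k metric mentioned above is commonly computed with an unbiased estimator: draw n samples per problem, count the c that pass the unit tests, and estimate the probability that at least one of k samples passes. A short sketch:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k),
    where n samples were generated and c of them passed the tests."""
    if n - c < k:
        # Fewer than k failures exist, so any k-sample draw contains a pass.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

For example, with n = 2 samples of which c = 1 passes, pass@1 is 0.5, since a single random draw hits the passing sample half the time.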
SWE-bench Verified
SWE-bench Verified evaluates AI systems on real-world software engineering tasks drawn from GitHub issues in popular Python repositories. Models must understand codebases, diagnose issues, and generate correct patches.
Top: Claude Opus 4.6 + Agentless — 62.4%
MBPP (Mostly Basic Python Problems)
MBPP consists of around 1,000 crowd-sourced Python programming problems designed to be solvable by entry-level programmers. Each problem includes a task description, code solution, and three automated test cases.
Top: Claude Opus 4.6 — 93.8%
CodeContests
CodeContests is a competitive programming benchmark drawn from Codeforces, CodeChef, and other platforms. It tests algorithmic problem-solving with problems requiring complex data structures, dynamic programming, and mathematical reasoning.
Top: GPT-5.2 — 43.2%