Codeforces Benchmark
The Codeforces Benchmark evaluates AI models on competitive programming problems from Codeforces, one of the world's largest competitive programming communities. Problems range from beginner to expert difficulty and demand algorithmic thinking, data-structure knowledge, and efficient implementation. Models are rated on the same Elo-style scale as human competitors, enabling direct comparison with human programmers.
Metrics
Elo rating (competitive programming scale)
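The Elo-style scale used above can be illustrated with the classic Elo update rule. This is a minimal sketch for intuition only: the K-factor of 32 is a conventional chess value, and Codeforces' actual rating formula differs in detail while following the same expected-score idea.

```python
# Minimal Elo-style rating sketch. The K-factor is a hypothetical
# chess-style value; Codeforces' real formula differs in detail.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that a player rated r_a beats one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> float:
    """Return A's new rating after a result (1 = win, 0.5 = draw, 0 = loss)."""
    return r_a + k * (score_a - expected_score(r_a, r_b))
```

Under this model, equally rated opponents each have an expected score of 0.5, and a win by either moves their ratings apart by K/2 points each.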
Created By
Mike Mirzayanov (Codeforces)
Top Model Scores
| Rank | Model | Score | Date |
|---|---|---|---|
| 1 | Claude Opus 4.6 | 1892 | 2026-02 |
| 2 | GPT-5.2 | 1856 | 2026-03 |
| 3 | DeepSeek-V4 | 1798 | 2026-01 |
| 4 | Gemini 3 Ultra | 1752 | 2026-01 |
| 5 | Grok 4 | 1689 | 2026-02 |
Related Code Benchmarks
HumanEval
HumanEval evaluates the functional correctness of code generated by language models. It consists of 164 hand-written programming problems with function signatures, docstrings, and unit tests, measuring pass@1 and pass@k rates.
Top: Claude Opus 4.6 — 96.3%
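The pass@k metric mentioned above is commonly computed with an unbiased estimator: draw n samples per problem, count the c that pass the unit tests, and estimate the probability that at least one of k samples passes. A short sketch:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k),
    where n samples were generated and c of them passed the tests."""
    if n - c < k:
        # Fewer than k failures exist, so any k-sample draw contains a pass.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

For example, with n = 2 samples of which c = 1 passes, pass@1 is 0.5, since a single random draw hits the passing sample half the time.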
SWE-bench Verified
SWE-bench Verified evaluates AI systems on real-world software engineering tasks drawn from GitHub issues in popular Python repositories. Models must understand codebases, diagnose issues, and generate correct patches.
Top: Claude Opus 4.6 + Agentless — 62.4%
MBPP (Mostly Basic Python Problems)
MBPP consists of around 1,000 crowd-sourced Python programming problems designed to be solvable by entry-level programmers. Each problem includes a task description, code solution, and three automated test cases.
Top: Claude Opus 4.6 — 93.8%
CodeContests
CodeContests is a competitive programming benchmark drawn from Codeforces, CodeChef, and other platforms. It tests algorithmic problem-solving with problems requiring complex data structures, dynamic programming, and mathematical reasoning.
Top: GPT-5.2 — 43.2%