GSM8K
Grade School Math 8K is a dataset of 8,500 high-quality, linguistically diverse grade school math word problems (7,473 train, 1,319 test). It tests multi-step mathematical reasoning with problems requiring between 2 and 8 steps to solve.
Metrics
Accuracy (%) on the 1,319-problem test split
Created By
OpenAI
Top Model Scores
| Rank | Model | Score | Date |
|---|---|---|---|
| 1 | GPT-5.2 | 98.1% | 2026-03 |
| 2 | Claude Opus 4.6 | 97.8% | 2026-02 |
| 3 | Gemini 3 Ultra | 97.5% | 2026-01 |
| 4 | Grok 4 | 97.2% | 2026-02 |
| 5 | Llama 4 405B | 96.3% | 2026-01 |
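The accuracy figures above are typically computed by exact match on the final numeric answer: GSM8K reference solutions end with a line of the form `#### <answer>`, and a model's response is scored correct if its extracted final number matches the reference. A minimal sketch of that scoring loop, assuming this answer format (function names here are illustrative, not from any official evaluation harness):

```python
import re

# Final answers in GSM8K solutions follow "####", e.g. "#### 7" or "#### 1,200".
ANSWER_RE = re.compile(r"####\s*\$?(-?[\d,]+(?:\.\d+)?)")

def extract_answer(text):
    """Pull the final numeric answer from a GSM8K-format solution string."""
    m = ANSWER_RE.search(text)
    return m.group(1).replace(",", "") if m else None

def accuracy(predictions, references):
    """Exact-match accuracy (%) over extracted final answers."""
    correct = sum(
        extract_answer(p) is not None and extract_answer(p) == extract_answer(r)
        for p, r in zip(predictions, references)
    )
    return 100.0 * correct / len(references)

ref = "Janet has 3 + 4 = 7 apples.\n#### 7"
pred = "She ends up with seven apples, so the answer is #### 7"
print(accuracy([pred], [ref]))  # → 100.0
```

Real harnesses add normalization (stripping units, handling model-specific answer formats), but the comparison itself reduces to this final-answer match.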
Related Math Benchmarks
MATH
The MATH benchmark consists of 12,500 challenging competition mathematics problems from AMC, AIME, and Olympiad competitions. Problems span seven subjects: Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra, and Precalculus.
Top: GPT-5.2 — 89.6%
MGSM
Multilingual Grade School Math (MGSM) extends GSM8K to ten typologically diverse languages: Bengali, Chinese, French, German, Japanese, Russian, Spanish, Swahili, Telugu, and Thai, testing multilingual mathematical reasoning.
Top: GPT-5.2 — 93.7%
AIME 2024
The American Invitational Mathematics Examination (AIME) is a prestigious competition for high school students who score in the top 5% on the AMC. AIME 2024 problems have been adopted as an AI benchmark because they require creative problem-solving, multi-step reasoning, and deep mathematical insight spanning algebra, geometry, number theory, and combinatorics, and cannot be solved through simple pattern matching.
Top: GPT-5.2 — 83.3%