GSM8K
Grade School Math 8K is a dataset of 8,500 high-quality, linguistically diverse grade school math word problems (7,473 train, 1,319 test). It tests multi-step mathematical reasoning with problems requiring between 2 and 8 steps to solve.
Metrics
Accuracy (%) on the 1,319-problem test split
Created By
OpenAI
Top Model Scores
| Rank | Model | Score | Date |
|---|---|---|---|
| 1 | GPT-5.2 | 98.1% | 2026-03 |
| 2 | Claude Opus 4.6 | 97.8% | 2026-02 |
| 3 | Gemini 3 Ultra | 97.5% | 2026-01 |
| 4 | Grok 4 | 97.2% | 2026-02 |
| 5 | Llama 4 405B | 96.3% | 2026-01 |
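The accuracy figures above are typically computed by exact match on the final numeric answer: GSM8K reference solutions end with a line of the form `#### <answer>`, and a model's response is scored correct if its extracted final number matches the reference. A minimal sketch of that scoring loop, assuming this answer format (function names here are illustrative, not from any official evaluation harness):

```python
import re

# Final answers in GSM8K solutions follow "####", e.g. "#### 7" or "#### 1,200".
ANSWER_RE = re.compile(r"####\s*\$?(-?[\d,]+(?:\.\d+)?)")

def extract_answer(text):
    """Pull the final numeric answer from a GSM8K-format solution string."""
    m = ANSWER_RE.search(text)
    return m.group(1).replace(",", "") if m else None

def accuracy(predictions, references):
    """Exact-match accuracy (%) over extracted final answers."""
    correct = sum(
        extract_answer(p) is not None and extract_answer(p) == extract_answer(r)
        for p, r in zip(predictions, references)
    )
    return 100.0 * correct / len(references)

ref = "Janet has 3 + 4 = 7 apples.\n#### 7"
pred = "She ends up with seven apples, so the answer is #### 7"
print(accuracy([pred], [ref]))  # → 100.0
```

Real harnesses add normalization (stripping units, handling model-specific answer formats), but the comparison itself reduces to this final-answer match.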
Related Math Benchmarks
MATH
The MATH benchmark consists of 12,500 challenging competition mathematics problems from AMC, AIME, and Olympiad competitions. Problems span seven subjects: Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra, and Precalculus.
Top: GPT-5.2 — 89.6%
MGSM
Multilingual Grade School Math (MGSM) extends GSM8K to ten typologically diverse languages: Bengali, Chinese, French, German, Japanese, Russian, Spanish, Swahili, Telugu, and Thai, testing multilingual mathematical reasoning.
Top: GPT-5.2 — 93.7%
AIME 2024
The American Invitational Mathematics Examination (AIME) is a prestigious competition for high school students who score in the top 5% on the AMC. AIME 2024 problems have been adopted as an AI benchmark because they require creative problem-solving, multi-step reasoning, and deep mathematical insight spanning algebra, geometry, number theory, and combinatorics, and cannot be solved through simple pattern matching.
Top: GPT-5.2 — 83.3%