MGSM
Multilingual Grade School Math (MGSM) extends GSM8K by translating 250 of its problems into 10 typologically diverse languages: Bengali, Chinese, French, German, Japanese, Russian, Spanish, Swahili, Telugu, and Thai. It tests multilingual mathematical reasoning.
Metrics
Accuracy (%) across 10 languages
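The page does not say how the per-language results are combined into one score. A minimal sketch, assuming the headline figure is the unweighted mean of per-language exact-match accuracies, is shown below. The function names `normalize` and `mgsm_accuracy` and the `(language, predicted, gold)` record layout are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def normalize(answer):
    """Strip thousands separators and whitespace; compare numerically when possible."""
    text = str(answer).replace(",", "").strip()
    try:
        return float(text)
    except ValueError:
        return text  # fall back to string comparison for non-numeric answers


def mgsm_accuracy(records):
    """Return (per-language accuracy in %, unweighted mean across languages).

    `records` is an iterable of (language, predicted, gold) tuples.
    Assumption: the overall score is a macro average, so each language
    counts equally regardless of how many problems it has.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for lang, pred, gold in records:
        total[lang] += 1
        correct[lang] += int(normalize(pred) == normalize(gold))
    if not total:
        raise ValueError("no records to score")
    per_lang = {lang: 100.0 * correct[lang] / total[lang] for lang in total}
    macro = sum(per_lang.values()) / len(per_lang)
    return per_lang, macro


if __name__ == "__main__":
    sample = [
        ("fr", "18", "18"),       # correct
        ("fr", "3", "4"),         # incorrect
        ("th", "1,000", "1000"),  # correct after normalization
    ]
    per_lang, macro = mgsm_accuracy(sample)
    print(per_lang, macro)
```

If every language has the same number of problems, the macro average equals the pooled accuracy over all problems, so the choice matters only when the per-language counts differ.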
Created By
Google Research
Paper
View paper →
Website
Visit website →
Top Model Scores
| Rank | Model | Score | Date |
|---|---|---|---|
| 1 | GPT-5.2 | 93.7% | 2026-03 |
| 2 | Gemini 3 Ultra | 93.1% | 2026-01 |
| 3 | Claude Opus 4.6 | 92.4% | 2026-02 |
| 4 | Grok 4 | 89.8% | 2026-02 |
| 5 | Llama 4 405B | 87.5% | 2026-01 |
Related Math Benchmarks
MATH
The MATH benchmark consists of 12,500 challenging competition mathematics problems from AMC, AIME, and Olympiad competitions. Problems span seven subjects: Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra, and Precalculus.
Top: GPT-5.2 — 89.6%
GSM8K
Grade School Math 8K is a dataset of 8,500 high-quality, linguistically diverse grade school math word problems. It tests multi-step mathematical reasoning with problems requiring 2-8 steps to solve.
Top: GPT-5.2 — 98.1%
AIME 2024
The American Invitational Mathematics Examination (AIME) 2024 problems test advanced mathematical problem-solving. These competition problems require creative mathematical thinking and are used to evaluate frontier model math capabilities.
Top: GPT-5.2 — 83.3%
AIME 2024
The American Invitational Mathematics Examination (AIME) is a prestigious math competition for high school students who score in the top 5% on the AMC. AIME 2024 problems have been adopted as an AI benchmark because they require creative problem-solving, multi-step reasoning, and deep mathematical insight, and cannot be solved through simple pattern matching. Each problem demands sophisticated approaches spanning algebra, geometry, number theory, and combinatorics.
Top: GPT-5.2 — 13/15