Math

Est. 2022

MGSM

Multilingual Grade School Math (MGSM) extends GSM8K by translating 250 of its problems into 10 typologically diverse languages: Bengali, Chinese, French, German, Japanese, Russian, Spanish, Swahili, Telugu, and Thai. It tests whether a model's mathematical reasoning holds up outside English.

Metrics

Accuracy (%) across 10 languages
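
The page does not say how the per-language results are combined into one score. A minimal sketch, assuming the headline figure is the unweighted mean of per-language accuracies and that an answer counts as correct when the predicted final number exactly matches the reference. The function names and the toy records below are hypothetical and are not taken from the MGSM evaluation code.

```python
from collections import defaultdict


def per_language_accuracy(records):
    """Return accuracy (%) per language.

    records: iterable of (language_code, predicted_answer, reference_answer).
    Exact match on the final numeric answer is an assumption of this sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for lang, predicted, reference in records:
        total[lang] += 1
        if predicted == reference:
            correct[lang] += 1
    return {lang: 100.0 * correct[lang] / total[lang] for lang in total}


def macro_average(per_lang):
    """Unweighted mean over languages, so each language counts equally."""
    return sum(per_lang.values()) / len(per_lang)


# Hypothetical toy data: two languages, two problems each.
records = [
    ("bn", 18, 18),  # correct
    ("bn", 7, 9),    # wrong
    ("sw", 42, 42),  # correct
    ("sw", 5, 5),    # correct
]

scores = per_language_accuracy(records)
overall = macro_average(scores)
print(scores, overall)
```

Because every MGSM language has the same number of problems, the unweighted mean over languages and the pooled accuracy over all problems coincide. The two would differ only if some languages had more problems than others.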

Created By

Google Research

Top Model Scores

Rank	Model	Score	Date
1	GPT-5.2	93.7%	2026-03
2	Gemini 3 Ultra	93.1%	2026-01
3	Claude Opus 4.6	92.4%	2026-02
4	Grok 4	89.8%	2026-02
5	Llama 4 405B	87.5%	2026-01

Related Math Benchmarks

MATH

The MATH benchmark consists of 12,500 challenging competition mathematics problems from AMC, AIME, and Olympiad competitions. Problems span seven subjects: Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra, and Precalculus.

Top: GPT-5.2 — 89.6%

GSM8K

Grade School Math 8K is a dataset of 8,500 high-quality, linguistically diverse grade school math word problems. It tests multi-step mathematical reasoning with problems requiring 2-8 steps to solve.

Top: GPT-5.2 — 98.1%

AIME 2024

The American Invitational Mathematics Examination (AIME) 2024 problems test advanced mathematical problem-solving. These competition problems require creative mathematical thinking and are used to evaluate frontier model math capabilities.

Top: GPT-5.2 — 83.3%

AIME 2024

The American Invitational Mathematics Examination (AIME) is a prestigious math competition for high school students who score in roughly the top 5% on the AMC. AIME 2024 problems have been adopted as an AI benchmark because solving them takes creative problem-solving, multi-step reasoning, and deep mathematical insight; simple pattern matching is not enough. The problems span algebra, geometry, number theory, and combinatorics.

Top: GPT-5.2 — 13/15
