MathVista
MathVista evaluates mathematical reasoning in visual contexts. It draws on diverse math and vision tasks, including geometry, statistics, chart and graph understanding, and scientific figure interpretation.
Metrics
Accuracy (%) on visual math problems
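As a rough illustration of how an accuracy metric like this is computed, here is a minimal sketch of answer scoring: normalize each model answer and gold answer, count exact matches, and report the percentage. The `normalize` and `accuracy` helpers are assumptions for illustration, not the official MathVista grading harness.

```python
# Sketch of accuracy scoring for visual math answers (assumed scheme,
# not the official MathVista evaluator).

def normalize(ans: str) -> str:
    """Lowercase, strip whitespace, and drop a trailing percent sign."""
    return ans.strip().lower().rstrip("%").strip()

def accuracy(predictions: list[str], golds: list[str]) -> float:
    """Percentage of predictions that exactly match gold answers after normalization."""
    assert len(predictions) == len(golds), "prediction/gold count mismatch"
    correct = sum(normalize(p) == normalize(g) for p, g in zip(predictions, golds))
    return 100.0 * correct / len(golds)

preds = ["12", "B ", "45%"]
golds = ["12", "b", "45"]
print(accuracy(preds, golds))  # → 100.0
```

Real harnesses typically add an answer-extraction step first (pulling the final answer out of a free-form response) before this kind of exact-match comparison.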
Created By
UCLA / Microsoft
Top Model Scores
| Rank | Model | Score | Date |
|---|---|---|---|
| 1 | GPT-5.2 | 78.4% | 2026-03 |
| 2 | Gemini 3 Ultra | 77.1% | 2026-01 |
| 3 | Claude Opus 4.6 | 75.8% | 2026-02 |
| 4 | Grok 4 | 71.3% | 2026-02 |
| 5 | InternVL 3 | 69.7% | 2026-01 |
Related Multimodal Benchmarks
MMMU
Massive Multi-discipline Multimodal Understanding (MMMU) evaluates multimodal models on college-level subject knowledge and deliberate reasoning across 30 subjects and 183 subfields, using images, charts, diagrams, and domain-specific visualizations.
Top: GPT-5.2 — 74.6%
MMLU-Pro Vision
MMLU-Pro Vision extends MMLU-Pro to multimodal settings where questions include images, diagrams, charts, and figures alongside text. It tests whether vision-language models can leverage visual information for academic reasoning.
Top: GPT-5.2 — 68.4%