SimpleQA
SimpleQA evaluates factual accuracy on straightforward, unambiguous questions with short, verifiable answers. It tests whether models provide correct factual information rather than hallucinating plausible-sounding but incorrect answers.
Metrics
Factual accuracy (%) on simple questions
Created By
OpenAI
Top Model Scores
| Rank | Model | Score | Date |
|---|---|---|---|
| 1 | GPT-5.2 | 52.8% | 2026-03 |
| 2 | Claude Opus 4.6 | 48.3% | 2026-02 |
| 3 | Gemini 3 Ultra | 46.7% | 2026-01 |
| 4 | Grok 4 | 43.1% | 2026-02 |
| 5 | Llama 4 405B | 38.9% | 2026-01 |
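For concreteness, here is a minimal Python sketch of how a SimpleQA-style factual-accuracy score could be computed. The `Item` dataclass, `normalize`, and the substring-match grader are illustrative inventions; the benchmark itself uses an LLM grader to classify each answer as correct, incorrect, or not attempted.

```python
from dataclasses import dataclass

@dataclass
class Item:
    question: str
    gold: str          # short, verifiable reference answer
    prediction: str    # the model's answer

def normalize(text: str) -> str:
    # Lowercase and strip punctuation so trivial formatting
    # differences don't count as errors.
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace()).strip()

def grade(item: Item) -> str:
    """Classify an answer as correct / incorrect / not_attempted."""
    if not item.prediction.strip():
        return "not_attempted"
    # Simplification: the real benchmark judges semantic equivalence
    # with an LLM grader; substring matching stands in for it here.
    return "correct" if normalize(item.gold) in normalize(item.prediction) else "incorrect"

def factual_accuracy(items: list[Item]) -> float:
    grades = [grade(it) for it in items]
    return 100.0 * grades.count("correct") / len(items)

items = [
    Item("Who wrote 'Middlemarch'?", "George Eliot", "It was written by George Eliot."),
    Item("What year did the Berlin Wall fall?", "1989", "1987"),
]
print(f"Accuracy: {factual_accuracy(items):.1f}%")  # 50.0%
```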
Related Language Benchmarks
MMLU
Massive Multitask Language Understanding measures knowledge across 57 academic subjects including STEM, humanities, social sciences, and more. It tests both world knowledge and problem-solving ability at varying difficulty levels from elementary to professional.
Top: GPT-5.2 — 92.4%
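A minimal sketch of MMLU-style multiple-choice scoring, assuming answers are extracted from free-text responses. `extract_choice` and the micro-averaged accuracy are simplifications: real harnesses often compare answer-token log-probabilities instead, and commonly report per-subject averages across the 57 subjects.

```python
import re

def extract_choice(response: str) -> str | None:
    # Take the first standalone A-D letter in the model's response.
    match = re.search(r"\b([ABCD])\b", response)
    return match.group(1) if match else None

def mmlu_accuracy(examples: list[dict]) -> float:
    """examples: [{'response': ..., 'answer': 'A'..'D'}, ...]"""
    correct = sum(extract_choice(ex["response"]) == ex["answer"] for ex in examples)
    return 100.0 * correct / len(examples)

examples = [
    {"response": "The answer is (B).", "answer": "B"},
    {"response": "C", "answer": "D"},
]
print(f"{mmlu_accuracy(examples):.1f}%")  # 50.0%
```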
MT-Bench
MT-Bench evaluates multi-turn conversation ability using 80 high-quality multi-turn questions across 8 categories: writing, roleplay, extraction, reasoning, math, coding, knowledge, and STEM. Responses are judged by GPT-4 on a 1-10 scale.
Top: GPT-5.2 — 9.72
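A sketch of how judge verdicts become a benchmark score, assuming the judge emits ratings in the `[[n]]` format used by MT-Bench's reference judge prompts; `parse_rating` and the sample judgments are hypothetical.

```python
import re
import statistics

def parse_rating(judgment: str) -> float | None:
    # The judge prompt asks for a verdict like "Rating: [[8]]";
    # pull the number out of the double brackets.
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", judgment)
    return float(match.group(1)) if match else None

judgments = ["Rating: [[9]]", "The reply is solid. Rating: [[8]]", "Rating: [[10]]"]
scores = [s for j in judgments if (s := parse_rating(j)) is not None]
print(f"MT-Bench score: {statistics.mean(scores):.2f}")  # 9.00
```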
AlpacaEval 2.0
AlpacaEval 2.0 is an automatic evaluation benchmark that measures instruction-following ability. It uses a length-controlled win rate against a reference model, reducing length bias that affected the original version.
Top: Claude Opus 4.6 — 72.1%
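The length-controlled win rate can be understood as fitting a preference model that includes a response-length term, then reading off the win rate with that term zeroed out. The sketch below strips AlpacaEval's actual GLM (which also conditions on the instruction and model identity) down to a single length-difference feature; the fitting routine and toy data are entirely illustrative.

```python
import math

def fit_logistic(wins, len_diffs, lr=0.5, steps=2000):
    """Tiny gradient-descent logistic regression:
    P(win) = sigmoid(b0 + b1 * length_difference)."""
    b0 = b1 = 0.0
    n = len(wins)
    for _ in range(steps):
        g0 = g1 = 0.0
        for y, x in zip(wins, len_diffs):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += p - y
            g1 += (p - y) * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

# wins[i] = 1 if the candidate beat the reference on instruction i;
# len_diffs[i] = candidate length minus reference length (normalized).
wins = [1, 1, 0, 1, 1, 0, 1, 0]
len_diffs = [0.9, 0.7, -0.2, 0.8, -0.5, 0.6, 0.4, -0.3]

b0, b1 = fit_logistic(wins, len_diffs)
raw = 100.0 * sum(wins) / len(wins)
lc = 100.0 / (1.0 + math.exp(-b0))  # win rate with the length term zeroed out
# LC comes out below the raw rate when wins correlate with longer outputs.
print(f"raw win rate: {raw:.1f}%  length-controlled: {lc:.1f}%")
```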
WildBench
WildBench evaluates AI models on challenging queries drawn from real user-chatbot interactions. It focuses on complex, multi-constraint instructions that test practical capabilities beyond academic benchmarks.
Top: Claude Opus 4.6 — 68.7%
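A toy sketch of multi-constraint checking in the spirit of WildBench's checklist-based scoring. The query, checklist, and predicates here are hypothetical, and the benchmark itself has an LLM judge evaluate each per-query checklist item rather than running programmatic checks.

```python
def score_response(response: str, checklist: list) -> float:
    """Percentage of checklist constraints satisfied (0-100).
    Entries are (description, predicate) pairs."""
    passed = sum(check(response) for _, check in checklist)
    return 100.0 * passed / len(checklist)

# Hypothetical multi-constraint query: "Summarize in under 50 words,
# mention the deadline, and end with a question."
checklist = [
    ("under 50 words", lambda r: len(r.split()) < 50),
    ("mentions the deadline", lambda r: "deadline" in r.lower()),
    ("ends with a question", lambda r: r.rstrip().endswith("?")),
]

response = "The deadline is Friday; the report covers Q3 sales. Shall I send it?"
print(f"checklist score: {score_response(response, checklist):.0f}%")  # 100%
```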