AI Benchmarks & Datasets
Explore 64+ benchmarks used to evaluate AI models. Compare scores across GPT-5, Claude, Gemini, and other frontier models on the evaluations that matter.
Showing 64 benchmarks
Aider Polyglot
Aider Polyglot benchmarks AI code editing capabilities across multiple programming languages. Models must correctly edit existing codebases given natural language instructions, testing real-world coding assistant performance.
Top Score
Claude Opus 4.6: 72.1%
AIME 2024
The American Invitational Mathematics Examination (AIME) 2024 problems test advanced mathematical problem-solving. These competition problems require creative mathematical thinking and are used to evaluate frontier model math capabilities.
Top Score
GPT-5.2: 83.3%
AIME 2024
The American Invitational Mathematics Examination (AIME) is a prestigious math competition for high school students who score in the top 5% on the AMC. AIME 2024 problems have been adopted as an AI benchmark because they require creative problem-solving, multi-step reasoning, and deep mathematical insight, and cannot be solved through simple pattern matching. Each problem demands sophisticated approaches spanning algebra, geometry, number theory, and combinatorics.
Top Score
GPT-5.2: 13/15
AlpacaEval 2.0
AlpacaEval 2.0 is an automatic evaluation benchmark that measures instruction-following ability. It uses a length-controlled win rate against a reference model, reducing length bias that affected the original version.
Top Score
Claude Opus 4.6: 72.1%
ARC (AI2 Reasoning Challenge)
The AI2 Reasoning Challenge contains 7,787 genuine grade-school science questions, split into Easy and Challenge sets. The Challenge set contains only questions that are answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm.
Top Score
GPT-5.2: 98.2%
ARC-Challenge
The AI2 Reasoning Challenge (ARC) tests science question answering at a grade-school level, but the Challenge partition specifically selects questions that simple retrieval and word co-occurrence methods fail on. These questions require genuine commonsense reasoning, multi-step inference, and understanding of basic scientific principles. ARC-Challenge has been a longstanding benchmark for measuring reasoning progress in language models.
Top Score
GPT-5.2: 98.1%
Arena-Hard
Arena-Hard is a pipeline for creating high-quality benchmarks from Chatbot Arena data by selecting the most discriminative user prompts. It contains 500 challenging prompts that best separate strong models from weak ones. An LLM judge evaluates responses, producing win rates against a baseline model. Arena-Hard achieves high correlation with full Chatbot Arena rankings while being faster and cheaper to run.
Top Score
GPT-5.2: 92.1%
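At its core, scoring is a judged win rate against the fixed baseline. Below is a minimal sketch of that aggregation; the judge is a stand-in placeholder for an LLM call, and counting ties as half wins is an illustrative assumption rather than Arena-Hard's exact scheme.

```python
def judge(prompt: str, answer_a: str, answer_b: str) -> str:
    """Placeholder: in practice an LLM judge compares the answers and returns 'A', 'B', or 'tie'."""
    return "tie"  # stand-in verdict so the sketch runs end to end

def win_rate(prompts, model_answers, baseline_answers) -> float:
    """Fraction of prompts the candidate model wins, counting ties as half wins."""
    wins = 0.0
    for prompt, ours, baseline in zip(prompts, model_answers, baseline_answers):
        verdict = judge(prompt, ours, baseline)
        if verdict == "A":
            wins += 1.0
        elif verdict == "tie":
            wins += 0.5
    return wins / len(prompts)

# Hypothetical three-prompt run
print(win_rate(["p1", "p2", "p3"], ["a1", "a2", "a3"], ["b1", "b2", "b3"]))  # 0.5
```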
Arena-Hard-Auto
Arena-Hard-Auto is an automated benchmark that correlates highly with Chatbot Arena rankings. It uses 500 challenging user queries and automated judge evaluation to approximate human preferences at a fraction of the cost.
Top Score
GPT-5.2: 92.1%
AudioBench
AudioBench evaluates audio language models across speech understanding, audio scene analysis, and voice-based reasoning. It covers speech recognition, emotion detection, speaker identification, and audio event classification.
Top Score
Gemini 3 Ultra: 87.3%
BBH (BIG-Bench Hard)
BIG-Bench Hard is a suite of 23 challenging tasks from the BIG-Bench benchmark on which language models previously performed below the average human rater. Tasks include boolean expressions, causal judgement, date understanding, disambiguation, and more.
Top Score
GPT-5.2: 95.3%
BFCL (Berkeley Function Calling)
Berkeley Function Calling Leaderboard evaluates the ability of models to accurately generate function/tool calls with correct parameters. It tests API call generation, parameter extraction, and multi-tool orchestration scenarios.
Top Score
Claude Opus 4.6: 93.7%
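Evaluation hinges on comparing each generated call against an expected function signature. The sketch below shows the kind of check involved: parse the call, then verify the function name, required parameters, and parameter types. The get_weather schema is a hypothetical example, not an actual BFCL test case.

```python
import json

# Hypothetical tool schema (illustrative only, not from the BFCL dataset)
SCHEMA = {
    "name": "get_weather",
    "required": {"city": str, "unit": str},
}

def call_matches_schema(raw_call: str) -> bool:
    """Check function name, presence of required parameters, and parameter types."""
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError:
        return False  # malformed tool call
    if call.get("name") != SCHEMA["name"]:
        return False
    args = call.get("arguments", {})
    return all(
        param in args and isinstance(args[param], expected_type)
        for param, expected_type in SCHEMA["required"].items()
    )

generated = '{"name": "get_weather", "arguments": {"city": "Tokyo", "unit": "celsius"}}'
print(call_matches_schema(generated))  # True
```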
BIG-Bench Hard
BIG-Bench Hard (BBH) is a curated subset of 23 challenging tasks from the original BIG-Bench suite on which language models previously performed below the average human rater. These tasks span algorithmic reasoning, natural language understanding, and world knowledge. BBH specifically tests chain-of-thought reasoning capabilities and has become a key measure of whether models can handle tasks requiring multi-step logical thinking.
Top Score
GPT-5.2: 94.2%
ChartQA
ChartQA tests the ability of models to answer questions about charts and visualizations, requiring both visual understanding of chart elements and reasoning about the underlying data.
Top Score
GPT-5.2: 90.1%
Chatbot Arena (LMSYS)
Chatbot Arena is a crowdsourced evaluation platform where users engage in blind, head-to-head comparisons of AI chatbots. Models are ranked using an Elo rating system based on hundreds of thousands of human preference votes.
Top Score
GPT-5.2: 1387
Chatbot Arena Elo
Chatbot Arena is a crowdsourced benchmark platform where users have blind conversations with two anonymous AI models and vote for the better response. The resulting Elo ratings provide a human-preference-based ranking that captures real-world conversational quality. With over 1 million votes collected, it is widely considered the most representative benchmark of actual user satisfaction and practical model utility.
Top Score
GPT-5.2: 1385
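To illustrate how pairwise votes become ratings, here is a minimal online Elo update. The K-factor of 32 and the 1000-point starting rating are illustrative assumptions; the live leaderboard fits a statistical model over all votes at once rather than updating one vote at a time, but the intuition is the same.

```python
from collections import defaultdict

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(ratings, model_a, model_b, outcome, k=32):
    """outcome is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected_score(ratings[model_a], ratings[model_b])
    ratings[model_a] += k * (outcome - e_a)
    ratings[model_b] += k * ((1 - outcome) - (1 - e_a))

# Hypothetical votes: (model_a, model_b, outcome from A's perspective)
votes = [("model-x", "model-y", 1.0), ("model-y", "model-x", 0.5)]
ratings = defaultdict(lambda: 1000.0)
for a, b, result in votes:
    update_elo(ratings, a, b, result)
print(dict(ratings))
```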
CodeContests
CodeContests is a competitive programming benchmark drawn from Codeforces, CodeChef, and other platforms. It tests algorithmic problem-solving with problems requiring complex data structures, dynamic programming, and mathematical reasoning.
Top Score
GPT-5.2: 43.2%
Codeforces Benchmark
The Codeforces Benchmark evaluates AI models on competitive programming problems from the Codeforces platform, one of the world's largest competitive programming communities. Problems range from beginner to expert difficulty and require algorithmic thinking, data structure knowledge, and efficient implementation. AI models are rated using the same Elo-style system as human competitors, enabling direct comparison with human programmers.
Top Score
Claude Opus 4.6: 1892
DocVQA
Document Visual Question Answering evaluates the ability of models to understand and answer questions about document images including forms, invoices, scientific papers, and handwritten notes.
Top Score
Gemini 3 Ultra: 95.2%
DocVQA
DocVQA (Document Visual Question Answering) tests AI models on their ability to understand and answer questions about document images including invoices, letters, reports, forms, and tables. Models must perform optical character recognition, layout understanding, and reasoning over document structure to extract specific information. It is a critical benchmark for enterprise document processing and automation applications.
Top Score
GPT-5.2: 95.2%
DROP
Discrete Reasoning Over Paragraphs tests reading comprehension that requires discrete reasoning steps including addition, subtraction, counting, sorting, and other operations over text passages.
Top Score
GPT-5.2: 93.1
GPQA
Graduate-Level Google-Proof Q&A (GPQA) is a challenging benchmark consisting of expert-crafted questions in biology, physics, and chemistry that require deep domain knowledge. Questions are designed so that even skilled non-experts with internet access struggle, while domain experts achieve high accuracy. It tests whether AI models possess genuine graduate-level scientific understanding rather than surface-level pattern matching.
Top Score
GPT-5.2: 68.4%
GPQA (Diamond)
Graduate-Level Google-Proof Q&A (GPQA) Diamond is a challenging benchmark of expert-level questions in biology, physics, and chemistry. Questions are designed to be answerable by domain experts but extremely difficult for non-experts, even with web search.
Top Score
GPT-5.2: 94.7%
GSM8K
Grade School Math 8K is a dataset of 8,500 high-quality, linguistically diverse grade school math word problems. It tests multi-step mathematical reasoning with problems requiring 2-8 steps to solve.
Top Score
GPT-5.2: 98.1%
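GSM8K is usually scored by exact match on the final numeric answer. Below is a minimal sketch, assuming the model's reasoning ends with the answer as the last number in its output; the extraction regex and normalization are illustrative choices, not an official harness.

```python
import re

def extract_final_number(text: str):
    """Return the last number in the model's output, stripping thousands separators."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None

def exact_match(prediction: str, gold: str) -> bool:
    """Compare the extracted answer to the reference answer numerically."""
    pred = extract_final_number(prediction)
    return pred is not None and float(pred) == float(gold)

print(exact_match("Each box holds 8 pens, so 6 boxes hold 6 * 8 = 48 pens.", "48"))  # True
```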
HellaSwag
HellaSwag is a commonsense reasoning benchmark that tests whether AI models can predict the most plausible continuation of a given scenario. It uses adversarially constructed wrong answers that are challenging for models but easy for humans.
Top Score
GPT-5.2: 97.8%
HellaSwag
HellaSwag tests commonsense natural language inference by asking models to predict the most plausible continuation of a given scenario. The dataset uses adversarial filtering to generate wrong answers that are superficially plausible but logically incorrect. Tasks span everyday activities like cooking, sports, and social interactions, testing whether models truly understand sequential reasoning about real-world events.
Top Score
GPT-5.2: 97.6%
HumanEval
HumanEval evaluates the functional correctness of code generated by language models. It consists of 164 hand-written programming problems with function signatures, docstrings, and unit tests, measuring pass@1 and pass@k rates.
Top Score
Claude Opus 4.6: 96.3%
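The pass@k metric mentioned above is computed with the unbiased estimator introduced alongside HumanEval: sample n completions per problem, count the c that pass every unit test, and estimate the probability that at least one of k samples would pass.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k failures, so any k-sample draw includes a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples on one problem, 37 of which pass the tests
print(round(pass_at_k(n=200, c=37, k=1), 3))   # 0.185
print(round(pass_at_k(n=200, c=37, k=10), 3))  # much higher, since any of 10 samples may pass
```

The benchmark score averages this quantity over all 164 problems.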
HumanEval+
HumanEval+ augments the original HumanEval benchmark with 80x more test cases per problem, providing a more rigorous evaluation of code correctness. Many models that score well on HumanEval see significant drops on HumanEval+.
Top Score
Claude Opus 4.6: 90.2%
IFEval
IFEval (Instruction Following Evaluation) tests how well models follow verifiable formatting instructions such as word count constraints, inclusion/exclusion of specific phrases, formatting requirements, and structural constraints.
Top Score
Claude Opus 4.6: 91.2%
IFEval
IFEval (Instruction Following Evaluation) tests whether language models can precisely follow specific formatting and content instructions. Tasks include writing responses with exact word counts, including or excluding specific phrases, formatting output as JSON or bullet points, and following complex multi-constraint instructions. It measures the practical reliability of models when users need outputs to conform to exact specifications.
Top Score
GPT-5.2: 89.7%
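Because every instruction is verifiable, scoring reduces to running small checker functions over the model's output. The sketch below shows three constraints in that style; they are representative examples, not IFEval's exact instruction set.

```python
import json

def check_max_words(response: str, limit: int) -> bool:
    """Constraint: the response contains at most `limit` words."""
    return len(response.split()) <= limit

def check_excludes_phrase(response: str, phrase: str) -> bool:
    """Constraint: the response does not contain the given phrase."""
    return phrase.lower() not in response.lower()

def check_valid_json(response: str) -> bool:
    """Constraint: the entire response parses as JSON."""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False

response = '{"summary": "Delivered on time."}'
checks = [
    check_max_words(response, 50),
    check_excludes_phrase(response, "as an AI"),
    check_valid_json(response),
]
print(all(checks))  # strict accuracy requires every instruction in the prompt to pass
```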
LegalBench
LegalBench is a collaboratively built benchmark for evaluating legal reasoning in language models. It consists of 162 tasks spanning 6 types of legal reasoning: issue-spotting, rule-recall, interpretation, rule-application, conclusion, and rhetorical understanding.
Top Score
Claude Opus 4.6: 84.6%
LiveBench
LiveBench is a continuously updated benchmark designed to minimize contamination by using new questions monthly. It covers math, coding, reasoning, language, instruction following, and data analysis with objective, verifiable answers.
Top Score
GPT-5.2: 82.6%
LiveBench
LiveBench is a continuously updated benchmark that uses new questions every month, drawn from recent information sources, to prevent data contamination. It covers math, coding, reasoning, language, instruction following, and data analysis. Because questions are regularly refreshed, models cannot have seen the test data during training, providing a cleaner signal of genuine model capability versus memorization.
Top Score
GPT-5.2: 78.3%
LMSYS Leaderboard
The LMSYS Chatbot Arena Leaderboard aggregates human preference data from blind side-by-side model comparisons into comprehensive rankings across multiple categories. Beyond the overall Elo ranking, it provides specialized leaderboards for coding, math, hard prompts, longer queries, and instruction following. It serves as the definitive community-driven ranking of AI model capabilities across diverse real-world use cases.
Top Score
GPT-5.2: 1388
MATH
The MATH benchmark consists of 12,500 challenging competition mathematics problems from AMC, AIME, and Olympiad competitions. Problems span seven subjects: Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra, and Precalculus.
Top Score
GPT-5.2: 89.6%
MathVista
MathVista evaluates mathematical reasoning in visual contexts. It combines challenges from diverse math and vision tasks including geometry, statistics, chart/graph understanding, and scientific figure interpretation.
Top Score
GPT-5.2: 78.4%
MathVista
MathVista evaluates mathematical reasoning in visual contexts by combining math problems with charts, plots, geometry figures, scientific diagrams, and synthetic scenes. It aggregates 6,141 examples from 28 existing datasets and 3 new datasets, covering five task types and seven mathematical reasoning abilities. Models must interpret visual information accurately and apply mathematical reasoning to derive correct answers.
Top Score
GPT-5.2: 72.6%
MBPP (Mostly Basic Python Problems)
MBPP consists of around 1,000 crowd-sourced Python programming problems designed to be solvable by entry-level programmers. Each problem includes a task description, code solution, and three automated test cases.
Top Score
Claude Opus 4.6: 93.8%
MGSM
Multilingual Grade School Math (MGSM) extends GSM8K to 10 typologically diverse languages (Bengali, Chinese, French, German, Japanese, Russian, Spanish, Swahili, Telugu, and Thai), testing multilingual mathematical reasoning.
Top Score
GPT-5.2: 93.7%
MMLU
Massive Multitask Language Understanding measures knowledge across 57 academic subjects including STEM, humanities, social sciences, and more. It tests both world knowledge and problem-solving ability at varying difficulty levels from elementary to professional.
Top Score
GPT-5.2: 92.4%
MMLU-Pro
MMLU-Pro is a more rigorous and challenging version of MMLU with 10 answer options instead of 4, reducing the chance of lucky guesses. It focuses on harder, reasoning-intensive questions across academic domains.
Top Score
GPT-5.2: 81.4%
MMLU-Pro Vision
MMLU-Pro Vision extends MMLU-Pro to multimodal settings where questions include images, diagrams, charts, and figures alongside text. It tests whether vision-language models can leverage visual information for academic reasoning.
Top Score
GPT-5.2: 68.4%
MMMU
Massive Multi-discipline Multimodal Understanding (MMMU) evaluates multimodal models on college-level subject knowledge and deliberate reasoning across 30 subjects and 183 subfields, using images, charts, diagrams, and domain-specific visualizations.
Top Score
GPT-5.2: 74.6%
MMMU
MMMU (Massive Multi-discipline Multimodal Understanding) is a comprehensive benchmark designed to evaluate multimodal models on college-level tasks requiring both visual understanding and domain-specific knowledge. It spans 30 subjects across art, business, science, health, humanities, and engineering with 11,500 questions that include images, diagrams, charts, and tables. Models must jointly reason over visual and textual information.
Top Score
GPT-5.2: 74.8%
MT-Bench
MT-Bench evaluates multi-turn conversation ability using 80 high-quality multi-turn questions across 8 categories: writing, roleplay, extraction, reasoning, math, coding, knowledge, and STEM. Responses are judged by GPT-4 on a 1-10 scale.
Top Score
GPT-5.2: 9.72
MT-Bench
MT-Bench (Multi-Turn Bench) evaluates chatbot capabilities through 80 carefully designed multi-turn conversations across 8 categories: writing, roleplay, extraction, reasoning, math, coding, knowledge, and STEM. An LLM judge (GPT-4 class) scores responses on a 1-10 scale. It specifically tests how well models handle follow-up questions, maintain context, and engage in extended dialogue rather than single-turn responses.
Top Score
GPT-5.2: 9.6
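Scoring comes down to parsing the judge's numeric rating for each turn and averaging. A minimal sketch, assuming the judge is prompted to wrap its score in a [[rating]] marker; the judge outputs below are made up for illustration.

```python
import re
from statistics import mean

def parse_rating(judgment: str):
    """Extract a 1-10 score from a judge response such as 'Rating: [[8]]'."""
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", judgment)
    return float(match.group(1)) if match else None

# Hypothetical judge outputs for the two turns of one conversation
judgments = ["Thorough and correct. Rating: [[9]]", "Misses the follow-up constraint. Rating: [[6]]"]
scores = [parse_rating(j) for j in judgments]
print(mean(scores))  # 7.5; the reported MT-Bench score averages turn ratings across all 80 questions
```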
MultiMedQA
MultiMedQA combines multiple medical question answering benchmarks including MedQA (USMLE-style), MedMCQA, PubMedQA, and clinical case studies. It evaluates medical knowledge and clinical reasoning capabilities.
Top Score
GPT-5.2: 93.7%
MuSR
Multistep Soft Reasoning (MuSR) evaluates models on complex reasoning tasks that require multiple inference steps in domains like murder mysteries, team allocation puzzles, and object placements. Problems require 2-7 reasoning steps.
Top Score
GPT-5.2: 71.8%
MuSR
MuSR (Multistep Soft Reasoning) tests language models on complex problems that require chaining multiple reasoning steps together. The benchmark includes murder mystery puzzles, team allocation problems, and object placement tasks that demand tracking multiple entities, applying logical rules, and maintaining consistency across 5-10 reasoning steps. It exposes weaknesses in models that appear strong on simpler benchmarks.
Top Score
GPT-5.2: 71.2%
Natural Questions
Natural Questions is a question answering benchmark with real queries from Google Search. Each question has a long answer (paragraph) and a short answer (entity or phrase) from Wikipedia, testing both retrieval and comprehension.
Top Score
GPT-5.2: 78.3
RealWorldQA
RealWorldQA evaluates vision-language models on practical, real-world visual understanding tasks including spatial reasoning about real photographs, reading text in images, understanding scenes, and answering practical questions.
Top Score
Gemini 3 Ultra: 79.6%
SafetyBench
SafetyBench evaluates the safety of large language models across 7 categories: offensiveness, unfairness and bias, physical health, mental health, illegal activities, ethics and morality, and privacy. It includes questions in both English and Chinese.
Top Score
Claude Opus 4.6: 91.7%
SimpleQA
SimpleQA evaluates factual accuracy on straightforward, unambiguous factual questions with short, verifiable answers. It specifically tests whether models provide correct factual information rather than hallucinating plausible-sounding but incorrect answers.
Top Score
GPT-5.2: 52.8%
SWE-bench
SWE-bench evaluates AI models on their ability to resolve real-world GitHub issues from popular open-source Python repositories. Each task requires the model to understand a bug report or feature request, navigate the codebase, and produce a working patch. It tests practical software engineering capabilities far beyond simple code generation, including debugging, testing, and code comprehension at scale.
Top Score
Claude Opus 4.6: 62.8%
SWE-bench Verified
SWE-bench Verified evaluates AI systems on real-world software engineering tasks drawn from GitHub issues in popular Python repositories. Models must understand codebases, diagnose issues, and generate correct patches.
Top Score
Claude Opus 4.6 + Agentless: 62.4%
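Both SWE-bench variants score a submission by applying the model's generated patch to the repository at the issue's base commit and rerunning the relevant tests. The sketch below shows that loop with subprocess; the repository path, patch file, and test command are placeholders, and the official harness additionally isolates each instance in its own environment with designated sets of tests that must flip from failing to passing.

```python
import subprocess

def patch_resolves_issue(repo_dir: str, patch_file: str, test_cmd: list) -> bool:
    """Apply a model-generated patch, then check whether the issue's tests pass."""
    applied = subprocess.run(
        ["git", "apply", patch_file], cwd=repo_dir, capture_output=True
    )
    if applied.returncode != 0:
        return False  # patch does not apply cleanly
    tests = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True)
    return tests.returncode == 0

# Illustrative usage (repository, patch, and test selection are hypothetical)
# print(patch_resolves_issue("astropy", "fix_wcs.patch", ["pytest", "astropy/wcs/tests"]))
```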
TAU-bench
TAU-bench evaluates AI agents on real-world tasks requiring tool use and multi-step reasoning in retail and airline customer service domains. It measures end-to-end task completion with realistic tool APIs.
Top Score
Claude Opus 4.6: 68.4%
ToxiGen
ToxiGen evaluates the propensity of language models to generate toxic content targeting 13 minority groups. It uses adversarially designed prompts to test whether models produce harmful implicit or explicit toxicity.
Top Score
Claude Opus 4.6: 1.2%
TruthfulQA
TruthfulQA measures whether language models generate truthful answers to questions. It includes 817 questions spanning 38 categories where humans might give false answers due to misconceptions, superstitions, or conspiracy theories.
Top Score
Claude Opus 4.6: 82.4%
TruthfulQA
TruthfulQA measures whether language models generate truthful answers to questions where humans commonly hold misconceptions. The benchmark covers 817 questions across 38 categories including health, law, finance, and conspiracy theories. It specifically targets questions where models are incentivized to reproduce popular falsehoods rather than provide accurate but less common truths, making it a key safety benchmark.
Top Score
Claude Opus 4.6: 82.4%
VQAv2
Visual Question Answering v2 is a large-scale benchmark for visual question answering containing over 1 million questions about images from COCO. It tests the ability to answer open-ended questions that require understanding image content.
Top Score
Gemini 3 Ultra: 88.9%
WildBench
WildBench evaluates AI models on challenging real-world user queries collected from the wild. It focuses on complex, multi-constraint instructions that test practical model capabilities beyond academic benchmarks.
Top Score
Claude Opus 4.6: 68.7%
WildVision-Bench
WildVision-Bench evaluates multimodal AI models on challenging real-world vision-language tasks collected from actual user interactions in the wild. Unlike curated academic datasets, WildVision captures the diversity and difficulty of how people naturally interact with vision-language models, including complex scene understanding, multi-image reasoning, and nuanced visual questions that require world knowledge and common sense.
Top Score
GPT-5.2: 78.4%
WinoGrande
WinoGrande is a large-scale dataset of 44,000 Winograd-style problems that require commonsense reasoning to resolve pronoun ambiguity. It is adversarially constructed to be challenging for statistical models.
Top Score
GPT-5.2: 96.4%
WinoGrande
WinoGrande is a large-scale commonsense reasoning benchmark inspired by the original Winograd Schema Challenge. It presents fill-in-the-blank problems that require understanding context, physical commonsense, and social reasoning to resolve ambiguous pronoun references. The dataset contains 44,000 problems adversarially constructed to minimize annotation artifacts, making it a robust test of genuine commonsense understanding.
Top Score
GPT-5.2: 95.3%
ZebraLogic
ZebraLogic tests logical deduction ability using Zebra puzzles (also known as Einstein's riddle). Models must use constraint satisfaction and logical elimination to solve grid-based logic puzzles of increasing complexity.
Top Score
Claude Opus 4.6: 74.8%
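The underlying task is classic constraint satisfaction, which a brute-force search makes concrete. The sketch below solves a made-up 3-house instance by checking every assignment against the clues; real ZebraLogic puzzles use larger grids with more attribute categories, where naive enumeration stops being practical.

```python
from itertools import permutations

people = ["Alice", "Bob", "Carol"]
drinks = ["tea", "coffee", "milk"]

# Made-up clues for a tiny 3-house puzzle:
#   1. Alice lives directly to the left of the tea drinker.
#   2. Bob drinks coffee.
#   3. Carol does not live in the first house.
#   4. Milk is drunk in the first house.
for order in permutations(people):          # order[i] = person living in house i
    for drink in permutations(drinks):      # drink[i] = drink served in house i
        if order.index("Alice") + 1 != drink.index("tea"):
            continue  # clue 1
        if drink[order.index("Bob")] != "coffee":
            continue  # clue 2
        if order.index("Carol") == 0:
            continue  # clue 3
        if drink[0] != "milk":
            continue  # clue 4
        print(list(zip(order, drink)))      # unique solution for these clues
```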