AI Benchmarks & Datasets

Explore dozens of benchmarks used to evaluate AI models. Compare scores across GPT-5, Claude, Gemini, and other frontier models on the evaluations that matter.

Code · 2024

Aider Polyglot

Aider Polyglot benchmarks AI code editing capabilities across multiple programming languages. Models must correctly edit existing codebases given natural language instructions, testing real-world coding assistant performance.

Top Score

Claude Opus 4.6: 72.1%

View details →
Math · 2024

AIME 2024

The American Invitational Mathematics Examination (AIME) is a prestigious math competition for high school students who score in the top 5% on the AMC. AIME 2024 problems have been adopted as an AI benchmark because they require creative problem-solving, multi-step reasoning, and deep mathematical insight that cannot be solved through simple pattern matching. Each problem demands sophisticated approaches spanning algebra, geometry, number theory, and combinatorics.

Top Score

GPT-5.2: 13/15

View details →
Language · 2023

AlpacaEval 2.0

AlpacaEval 2.0 is an automatic evaluation benchmark that measures instruction-following ability. It uses a length-controlled win rate against a reference model, reducing length bias that affected the original version.

Top Score

Claude Opus 4.6: 72.1%

View details →
Reasoning · 2018

ARC-Challenge

The AI2 Reasoning Challenge (ARC) contains 7,787 grade-school science questions split into Easy and Challenge sets. The Challenge partition specifically selects questions that simple retrieval and word co-occurrence methods fail on, so they require genuine commonsense reasoning, multi-step inference, and understanding of basic scientific principles. ARC-Challenge has been a longstanding benchmark for measuring reasoning progress in language models.

Top Score

GPT-5.2: 98.1%

View details →
General · 2024

Arena-Hard

Arena-Hard is a pipeline for creating high-quality benchmarks from Chatbot Arena data by selecting the most discriminative user prompts. It contains 500 challenging prompts that best separate strong models from weak ones. An LLM judge evaluates responses, producing win rates against a baseline model. Arena-Hard achieves high correlation with full Chatbot Arena rankings while being faster and cheaper to run.

Top Score

GPT-5.2: 92.1%

View details →
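Judge-based scoring of this kind ultimately reduces to aggregating per-prompt verdicts into a win rate against the baseline. A minimal sketch, with an assumed verdict format and the common convention of counting ties as half a win (not necessarily Arena-Hard's exact pipeline):

```python
def win_rate(judgments: list[str]) -> float:
    """Aggregate per-prompt judge verdicts for the candidate model vs.
    the baseline ('win', 'tie', 'loss') into a single win rate.
    Ties count as half a win, one common convention."""
    score = sum({"win": 1.0, "tie": 0.5, "loss": 0.0}[j] for j in judgments)
    return score / len(judgments)
```

For example, verdicts of two wins, one tie, and one loss over four prompts yield a win rate of 0.625.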
Audio · 2024

AudioBench

AudioBench evaluates audio language models across speech understanding, audio scene analysis, and voice-based reasoning. It covers speech recognition, emotion detection, speaker identification, and audio event classification.

Top Score

Gemini 3 Ultra: 87.3%

View details →
Code · 2024

BFCL (Berkeley Function Calling)

Berkeley Function Calling Leaderboard evaluates the ability of models to accurately generate function/tool calls with correct parameters. It tests API call generation, parameter extraction, and multi-tool orchestration scenarios.

Top Score

Claude Opus 4.6: 93.7%

View details →
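Evaluating a function call boils down to checking the model's emitted call against the tool's schema: right tool name, required parameters present, no unknown parameters, correct types. A sketch with a hypothetical `get_weather` schema (real benchmarks use JSON Schema-style definitions, not this simplified dict):

```python
# Hypothetical tool schema, for illustration only.
SCHEMA = {
    "name": "get_weather",
    "parameters": {
        "city": {"type": str, "required": True},
        "unit": {"type": str, "required": False},
    },
}

def validate_call(call: dict, schema: dict) -> bool:
    """Check a model-emitted call against a tool schema."""
    if call.get("name") != schema["name"]:
        return False  # wrong tool
    params, args = schema["parameters"], call.get("arguments", {})
    if any(k not in params for k in args):
        return False  # unknown parameter
    for name, spec in params.items():
        if spec["required"] and name not in args:
            return False  # missing required parameter
        if name in args and not isinstance(args[name], spec["type"]):
            return False  # wrong parameter type
    return True
```

A call like `{"name": "get_weather", "arguments": {"city": "Paris"}}` passes, while one missing `city` or naming the wrong tool fails.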
Reasoning · 2022

BIG-Bench Hard

BIG-Bench Hard (BBH) is a curated subset of 23 challenging tasks from the original BIG-Bench suite where language models previously performed below average human raters. These tasks span algorithmic reasoning, natural language understanding, and world knowledge. BBH specifically tests chain-of-thought reasoning capabilities and has become a key measure of whether models can handle tasks requiring multi-step logical thinking.

Top Score

GPT-5.2: 94.2%

View details →
Vision · 2022

ChartQA

ChartQA tests the ability of models to answer questions about charts and visualizations, requiring both visual understanding of chart elements and reasoning about the underlying data.

Top Score

GPT-5.2: 90.1%

View details →
General · 2023

Chatbot Arena Elo

Chatbot Arena is a crowdsourced benchmark platform where users have blind conversations with two anonymous AI models and vote for the better response. The resulting Elo ratings provide a human-preference-based ranking that captures real-world conversational quality. With over 1 million votes collected, it is widely considered the most representative benchmark of actual user satisfaction and practical model utility.

Top Score

GPT-5.2: 1385

View details →
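The Elo mechanism behind these ratings is a simple expected-score update after each head-to-head vote. A sketch with an illustrative K-factor of 32 (the production leaderboard has used Bradley-Terry-style fitting over all votes rather than raw online Elo):

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """One online Elo update after a head-to-head vote.
    score_a is 1.0 if model A wins, 0.0 if model B wins, 0.5 for a tie."""
    # Expected score for A given the current rating gap (400-point scale).
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta
```

Beating an equally rated opponent moves each rating by 16 points; beating a higher-rated opponent moves them by more, which is what lets ratings converge toward relative strength.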
Code · 2022

CodeContests

CodeContests is a competitive programming benchmark drawn from Codeforces, CodeChef, and other platforms. It tests algorithmic problem-solving with problems requiring complex data structures, dynamic programming, and mathematical reasoning.

Top Score

GPT-5.2: 43.2%

View details →
Code · 2023

Codeforces Benchmark

The Codeforces Benchmark evaluates AI models on competitive programming problems from the Codeforces platform, one of the world's largest competitive programming communities. Problems range from beginner to expert difficulty and require algorithmic thinking, data structure knowledge, and efficient implementation. AI models are rated using the same Elo-style system as human competitors, enabling direct comparison with human programmers.

Top Score

Claude Opus 4.6: 1892

View details →
Vision · 2020

DocVQA

DocVQA (Document Visual Question Answering) tests AI models on their ability to understand and answer questions about document images including invoices, letters, reports, forms, and tables. Models must perform optical character recognition, layout understanding, and reasoning over document structure to extract specific information. It is a critical benchmark for enterprise document processing and automation applications.

Top Score

GPT-5.2: 95.2%

View details →
Reasoning · 2019

DROP

Discrete Reasoning Over Paragraphs tests reading comprehension that requires discrete reasoning steps including addition, subtraction, counting, sorting, and other operations over text passages.

Top Score

GPT-5.2: 93.1

View details →
Reasoning · 2023

GPQA

Graduate-Level Google-Proof Q&A (GPQA) is a challenging benchmark consisting of expert-crafted questions in biology, physics, and chemistry that require deep domain knowledge. Questions are designed so that even skilled non-experts with internet access struggle, while domain experts achieve high accuracy. It tests whether AI models possess genuine graduate-level scientific understanding rather than surface-level pattern matching.

Top Score

GPT-5.2: 68.4%

View details →
Reasoning · 2023

GPQA (Diamond)

Graduate-Level Google-Proof Q&A (GPQA) Diamond is a challenging benchmark of expert-level questions in biology, physics, and chemistry. Questions are designed to be answerable by domain experts but extremely difficult for non-experts, even with web search.

Top Score

GPT-5.2: 94.7%

View details →
Math · 2021

GSM8K

Grade School Math 8K is a dataset of 8,500 high-quality, linguistically diverse grade school math word problems. It tests multi-step mathematical reasoning with problems requiring 2-8 steps to solve.

Top Score

GPT-5.2: 98.1%

View details →
Language · 2019

HellaSwag

HellaSwag tests commonsense natural language inference by asking models to predict the most plausible continuation of a given scenario. The dataset uses adversarial filtering to generate wrong answers that are superficially plausible but logically incorrect. Tasks span everyday activities like cooking, sports, and social interactions, testing whether models truly understand sequential reasoning about real-world events.

Top Score

GPT-5.2: 97.6%

View details →
Code · 2021

HumanEval

HumanEval evaluates the functional correctness of code generated by language models. It consists of 164 hand-written programming problems with function signatures, docstrings, and unit tests, measuring pass@1 and pass@k rates.

Top Score

Claude Opus 4.6: 96.3%

View details →
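The pass@k numbers reported here are typically computed with the unbiased estimator introduced alongside HumanEval: draw n samples per problem, count the c that pass the unit tests, and estimate the probability that at least one of k drawn samples passes.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n-c, k) / C(n, k),
    where n samples were drawn per problem and c of them passed the tests."""
    if n - c < k:
        return 1.0  # every size-k draw must contain at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark score is then the mean of this estimate over all problems; with n = 1 it collapses to plain pass@1 accuracy.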
Code · 2023

HumanEval+

HumanEval+ augments the original HumanEval benchmark with 80x more test cases per problem, providing a more rigorous evaluation of code correctness. Many models that score well on HumanEval see significant drops on HumanEval+.

Top Score

Claude Opus 4.6: 90.2%

View details →
Language · 2023

IFEval

IFEval (Instruction Following Evaluation) tests whether language models can precisely follow specific formatting and content instructions. Tasks include writing responses with exact word counts, including or excluding specific phrases, formatting output as JSON or bullet points, and following complex multi-constraint instructions. It measures the practical reliability of models when users need outputs to conform to exact specifications.

Top Score

GPT-5.2: 89.7%

View details →
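The "verifiable" part is what makes this style of evaluation attractive: every constraint can be checked programmatically, with no judge model in the loop. A sketch of three such checkers (the function names are illustrative, not IFEval's own):

```python
import json

def within_word_limit(response: str, max_words: int) -> bool:
    # Constraint: stay at or under a word budget.
    return len(response.split()) <= max_words

def mentions_all(response: str, required: list[str]) -> bool:
    # Constraint: every required phrase must appear (case-insensitive).
    lower = response.lower()
    return all(phrase.lower() in lower for phrase in required)

def is_valid_json(response: str) -> bool:
    # Constraint: the whole response must parse as JSON.
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False
```

A response is scored by running every checker attached to its prompt; the benchmark reports the fraction of instructions (or prompts) where all checks pass.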
Language · 2023

LegalBench

LegalBench is a collaboratively built benchmark for evaluating legal reasoning in language models. It consists of 162 tasks spanning 6 types of legal reasoning: issue-spotting, rule-recall, interpretation, rule-application, conclusion, and rhetorical understanding.

Top Score

Claude Opus 4.6: 84.6%

View details →
General · 2024

LiveBench

LiveBench is a continuously updated benchmark that uses new questions every month, drawn from recent information sources, to prevent data contamination. It covers math, coding, reasoning, language, instruction following, and data analysis. Because questions are regularly refreshed, models cannot have seen the test data during training, providing a cleaner signal of genuine model capability versus memorization.

Top Score

GPT-5.2: 78.3%

View details →
General · 2023

LMSYS Leaderboard

The LMSYS Chatbot Arena Leaderboard aggregates human preference data from blind side-by-side model comparisons into comprehensive rankings across multiple categories. Beyond the overall Elo rating, it provides specialized leaderboards for coding, math, hard prompts, longer queries, and instruction following. It is among the most widely followed community-driven rankings of AI model capabilities across diverse real-world use cases.

Top Score

GPT-5.2: 1388

View details →
Math · 2021

MATH

The MATH benchmark consists of 12,500 challenging competition mathematics problems from AMC, AIME, and Olympiad competitions. Problems span seven subjects: Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra, and Precalculus.

Top Score

GPT-5.2: 89.6%

View details →
Multimodal · 2023

MathVista

MathVista evaluates mathematical reasoning in visual contexts by combining math problems with charts, plots, geometry figures, scientific diagrams, and synthetic scenes. It aggregates 6,141 examples from 28 existing datasets and 3 new datasets, covering five task types and seven mathematical reasoning abilities. Models must interpret visual information accurately and apply mathematical reasoning to derive correct answers.

Top Score

GPT-5.2: 72.6%

View details →
Code · 2021

MBPP (Mostly Basic Python Problems)

MBPP consists of around 1,000 crowd-sourced Python programming problems designed to be solvable by entry-level programmers. Each problem includes a task description, code solution, and three automated test cases.

Top Score

Claude Opus 4.6: 93.8%

View details →
Math · 2022

MGSM

Multilingual Grade School Math (MGSM) translates GSM8K problems into 10 typologically diverse languages: Bengali, Chinese, French, German, Japanese, Russian, Spanish, Swahili, Telugu, and Thai. It tests multilingual mathematical reasoning.

Top Score

GPT-5.2: 93.7%

View details →
Language · 2020

MMLU

Massive Multitask Language Understanding measures knowledge across 57 academic subjects including STEM, humanities, social sciences, and more. It tests both world knowledge and problem-solving ability at varying difficulty levels from elementary to professional.

Top Score

GPT-5.2: 92.4%

View details →
Language · 2024

MMLU-Pro

MMLU-Pro is a more rigorous and challenging version of MMLU with 10 answer options instead of 4, reducing the chance of lucky guesses. It focuses on harder, reasoning-intensive questions across academic domains.

Top Score

GPT-5.2: 81.4%

View details →
Multimodal · 2024

MMLU-Pro Vision

MMLU-Pro Vision extends MMLU-Pro to multimodal settings where questions include images, diagrams, charts, and figures alongside text. It tests whether vision-language models can leverage visual information for academic reasoning.

Top Score

GPT-5.2: 68.4%

View details →
Multimodal · 2023

MMMU

MMMU (Massive Multi-discipline Multimodal Understanding) is a comprehensive benchmark designed to evaluate multimodal models on college-level tasks requiring both visual understanding and domain-specific knowledge. It spans 30 subjects and 183 subfields across art, business, science, health, humanities, and engineering, with 11,500 questions that include images, diagrams, charts, and tables. Models must jointly reason over visual and textual information.

Top Score

GPT-5.2: 74.8%

View details →
General · 2023

MT-Bench

MT-Bench (Multi-Turn Benchmark) evaluates chatbot capabilities through 80 carefully designed multi-turn conversations across 8 categories: writing, roleplay, extraction, reasoning, math, coding, knowledge, and STEM. An LLM judge (GPT-4 class) scores responses on a 1-10 scale. It specifically tests how well models handle follow-up questions, maintain context, and engage in extended dialogue rather than single-turn responses.

Top Score

GPT-5.2: 9.6

View details →
Language · 2022

MultiMedQA

MultiMedQA combines multiple medical question answering benchmarks including MedQA (USMLE-style), MedMCQA, PubMedQA, and clinical case studies. It evaluates medical knowledge and clinical reasoning capabilities.

Top Score

GPT-5.2: 93.7%

View details →
Reasoning · 2023

MuSR

MuSR (Multistep Soft Reasoning) tests language models on complex problems that require chaining multiple reasoning steps together. The benchmark includes murder mystery puzzles, team allocation problems, and object placement tasks that demand tracking multiple entities, applying logical rules, and maintaining consistency across 5-10 reasoning steps. It exposes weaknesses in models that appear strong on simpler benchmarks.

Top Score

GPT-5.2: 71.2%

View details →
Language · 2019

Natural Questions

Natural Questions is a question answering benchmark with real queries from Google Search. Each question has a long answer (paragraph) and a short answer (entity or phrase) from Wikipedia, testing both retrieval and comprehension.

Top Score

GPT-5.2: 78.3

View details →
Vision · 2024

RealWorldQA

RealWorldQA evaluates vision-language models on practical, real-world visual understanding tasks including spatial reasoning about real photographs, reading text in images, understanding scenes, and answering practical questions.

Top Score

Gemini 3 Ultra: 79.6%

View details →
Safety · 2023

SafetyBench

SafetyBench evaluates the safety of large language models across 7 categories: offensiveness, unfairness and bias, physical health, mental health, illegal activities, ethics and morality, and privacy. It includes questions in both English and Chinese.

Top Score

Claude Opus 4.6: 91.7%

View details →
Language · 2024

SimpleQA

SimpleQA evaluates factual accuracy on straightforward, unambiguous factual questions with short, verifiable answers. It specifically tests whether models provide correct factual information vs. hallucinating plausible-sounding but incorrect answers.

Top Score

GPT-5.2: 52.8%

View details →
Code · 2023

SWE-bench

SWE-bench evaluates AI models on their ability to resolve real-world GitHub issues from popular open-source Python repositories. Each task requires the model to understand a bug report or feature request, navigate the codebase, and produce a working patch. It tests practical software engineering capabilities far beyond simple code generation, including debugging, testing, and code comprehension at scale.

Top Score

Claude Opus 4.6: 62.8%

View details →
Code · 2023

SWE-bench Verified

SWE-bench Verified evaluates AI systems on real-world software engineering tasks drawn from GitHub issues in popular Python repositories. Models must understand codebases, diagnose issues, and generate correct patches.

Top Score

Claude Opus 4.6 + Agentless: 62.4%

View details →
General · 2024

TAU-bench

TAU-bench evaluates AI agents on real-world tasks requiring tool use and multi-step reasoning in retail and airline customer service domains. It measures end-to-end task completion with realistic tool APIs.

Top Score

Claude Opus 4.6: 68.4%

View details →
Safety · 2022

ToxiGen

ToxiGen evaluates the propensity of language models to generate toxic content targeting 13 minority groups. It uses adversarially designed prompts to test whether models produce harmful implicit or explicit toxicity.

Top Score

Claude Opus 4.6: 1.2% (toxicity rate; lower is better)

View details →
Safety · 2021

TruthfulQA

TruthfulQA measures whether language models generate truthful answers to questions where humans commonly hold misconceptions. The benchmark covers 817 questions across 38 categories including health, law, finance, and conspiracy theories. It specifically targets questions where models are incentivized to reproduce popular falsehoods rather than provide accurate but less common truths, making it a key safety benchmark.

Top Score

Claude Opus 4.6: 82.4%

View details →
Vision · 2017

VQAv2

Visual Question Answering v2 is a large-scale benchmark for visual question answering containing over 1 million questions about images from COCO. It tests the ability to answer open-ended questions that require understanding image content.

Top Score

Gemini 3 Ultra: 88.9%

View details →
Language · 2024

WildBench

WildBench evaluates AI models on challenging real-world user queries collected from the wild. It focuses on complex, multi-constraint instructions that test practical model capabilities beyond academic benchmarks.

Top Score

Claude Opus 4.6: 68.7%

View details →
Vision · 2024

WildVision-Bench

WildVision-Bench evaluates multimodal AI models on challenging real-world vision-language tasks collected from actual user interactions in the wild. Unlike curated academic datasets, WildVision captures the diversity and difficulty of how people naturally interact with vision-language models, including complex scene understanding, multi-image reasoning, and nuanced visual questions that require world knowledge and common sense.

Top Score

GPT-5.2: 78.4%

View details →
Language · 2019

WinoGrande

WinoGrande is a large-scale commonsense reasoning benchmark inspired by the original Winograd Schema Challenge. It presents fill-in-the-blank problems that require understanding context, physical commonsense, and social reasoning to resolve ambiguous pronoun references. The dataset contains 44,000 problems adversarially constructed to minimize annotation artifacts, making it a robust test of genuine commonsense understanding.

Top Score

GPT-5.2: 95.3%

View details →
Reasoning · 2024

ZebraLogic

ZebraLogic tests logical deduction ability using Zebra puzzles (also known as Einstein's riddle). Models must use constraint satisfaction and logical elimination to solve grid-based logic puzzles of increasing complexity.

Top Score

Claude Opus 4.6: 74.8%

View details →
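Puzzles in this family are pure constraint satisfaction, which is why answers are easy to verify mechanically: a brute-force search over permutations settles any small instance. A toy three-house example (the clues are made up for illustration, not drawn from the benchmark):

```python
from itertools import permutations

def solve():
    """Houses 1-3 each get one color and one pet. Clues:
    1. The red house is immediately left of the blue house.
    2. The cat lives in the green house.
    3. The dog lives in house 3."""
    for colors in permutations(("red", "green", "blue")):
        if colors.index("red") + 1 != colors.index("blue"):
            continue  # violates clue 1
        for pets in permutations(("cat", "dog", "bird")):
            if pets[colors.index("green")] != "cat":
                continue  # violates clue 2
            if pets[2] != "dog":
                continue  # violates clue 3
            return colors, pets
    return None
```

For this instance the assignment is unique: the green house with the cat is first, the red house with the bird second, and the blue house with the dog third. Benchmark puzzles scale the same structure up to many houses and attributes, where logical elimination matters far more than enumeration.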