Reasoning · Est. 2019

DROP

Discrete Reasoning Over Paragraphs (DROP) is a reading comprehension benchmark whose questions require discrete reasoning over text passages, such as addition, subtraction, counting, and sorting.

Metrics

F1 score on discrete reasoning questions
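
As a rough illustration of the metric, the sketch below computes a bag-of-tokens F1 between a predicted answer and a gold answer. It is a simplified assumption, not the official DROP evaluator, which may normalize text and treat numbers and multi-span answers differently. The function name `token_f1` is hypothetical.

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Bag-of-tokens F1 between two answer strings (simplified sketch)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Two empty answers count as a match; one empty answer is a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under this sketch, a prediction that contains only part of the gold answer gets partial credit rather than zero, which is the main difference from exact-match scoring.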

Created By

Allen Institute for AI


Top Model Scores

Rank	Model	Score (F1)	Date
1	GPT-5.2	93.1	2026-03
2	Claude Opus 4.6	92.4	2026-02
3	Gemini 3 Ultra	91.7	2026-01
4	Grok 4	89.8	2026-02
5	Llama 4 405B	87.3	2026-01

Related Reasoning Benchmarks

HellaSwag

HellaSwag is a commonsense reasoning benchmark that tests whether AI models can predict the most plausible continuation of a given scenario. It uses adversarially constructed wrong answers that are challenging for models but easy for humans.

Top: GPT-5.2 — 97.8%

ARC (AI2 Reasoning Challenge)

The AI2 Reasoning Challenge contains 7,787 genuine grade-school science questions, split into Easy and Challenge sets. The Challenge set contains only questions that are answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm.

Top: GPT-5.2 — 98.2%
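
The Challenge-set rule described above can be sketched as a simple filter: a question goes to the Challenge set only if both baseline algorithms answered it incorrectly. The `Question` class, its field names, and `split_easy_challenge` are hypothetical names chosen for illustration. This is not the code used to build ARC.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Question:
    text: str
    retrieval_correct: bool      # retrieval-based baseline answered correctly
    cooccurrence_correct: bool   # word co-occurrence baseline answered correctly


def split_easy_challenge(
    questions: List[Question],
) -> Tuple[List[Question], List[Question]]:
    """Return (easy, challenge); challenge holds questions both baselines missed."""
    easy, challenge = [], []
    for q in questions:
        if not q.retrieval_correct and not q.cooccurrence_correct:
            challenge.append(q)
        else:
            easy.append(q)
    return easy, challenge
```

Every question lands in exactly one of the two sets, so the sizes of the Easy and Challenge sets add up to the full question count.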

GPQA (Diamond)

Graduate-Level Google-Proof Q&A (GPQA) Diamond is a challenging benchmark of expert-level questions in biology, physics, and chemistry. Questions are designed to be answerable by domain experts but extremely difficult for non-experts, even with web search.

Top: GPT-5.2 — 94.7%

BBH (BIG-Bench Hard)

BIG-Bench Hard is a suite of 23 challenging tasks from the BIG-Bench benchmark on which earlier language models scored below the average human rater. Tasks include boolean expressions, causal judgement, date understanding, and disambiguation, among others.

Top: GPT-5.2 — 95.3%
