DROP
DROP (Discrete Reasoning Over Paragraphs) is a reading comprehension benchmark whose questions require discrete reasoning over text passages, with operations such as addition, subtraction, counting, and sorting.
Metrics
F1 score on discrete reasoning questions
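As a rough illustration of how an answer-level F1 score can be computed, the sketch below compares a predicted answer with a gold answer as bags of tokens. It is a simplified approximation, not the official DROP evaluator, which applies additional handling (for example for numbers and multi-span answers) that is not reproduced here. The function names `normalize` and `token_f1` are hypothetical.

```python
import string
from collections import Counter

# Assumption: English articles are ignored during comparison.
ARTICLES = {"a", "an", "the"}


def normalize(text: str) -> list[str]:
    """Lowercase, split on whitespace, strip edge punctuation, drop articles."""
    tokens = (tok.strip(string.punctuation) for tok in text.lower().split())
    return [tok for tok in tokens if tok and tok not in ARTICLES]


def token_f1(prediction: str, gold: str) -> float:
    """Bag-of-tokens F1 between a predicted answer and a gold answer."""
    pred_tokens = normalize(prediction)
    gold_tokens = normalize(gold)

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(token_f1("the 3 yards", "3 yards"))
    print(token_f1("3 touchdowns", "3 yards"))
```

A benchmark-level score would then be the average of this per-question value over all questions, typically taking the maximum over the available gold answers for each question.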
Created By
Allen Institute for AI
Top Model Scores
| Rank | Model | Score | Date |
|---|---|---|---|
| 1 | GPT-5.2 | 93.1 | 2026-03 |
| 2 | Claude Opus 4.6 | 92.4 | 2026-02 |
| 3 | Gemini 3 Ultra | 91.7 | 2026-01 |
| 4 | Grok 4 | 89.8 | 2026-02 |
| 5 | Llama 4 405B | 87.3 | 2026-01 |
Related Reasoning Benchmarks
HellaSwag
HellaSwag is a commonsense reasoning benchmark that tests whether AI models can predict the most plausible continuation of a given scenario. It uses adversarially constructed wrong answers that are challenging for models but easy for humans.
Top: GPT-5.2 — 97.8%
ARC (AI2 Reasoning Challenge)
The AI2 Reasoning Challenge contains 7,787 genuine grade-school science questions, split into Easy and Challenge sets. The Challenge set contains only questions that are answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm.
Top: GPT-5.2 — 98.2%
GPQA (Diamond)
Graduate-Level Google-Proof Q&A (GPQA) Diamond is a challenging benchmark of expert-level questions in biology, physics, and chemistry. Questions are designed to be answerable by domain experts but extremely difficult for non-experts, even with web search.
Top: GPT-5.2 — 94.7%
BBH (BIG-Bench Hard)
BIG-Bench Hard is a suite of 23 challenging tasks from the BIG-Bench benchmark on which earlier language models did not outperform the average human rater. Tasks include boolean expressions, causal judgement, date understanding, disambiguation, and more.
Top: GPT-5.2 — 95.3%