HellaSwag
HellaSwag is a commonsense reasoning benchmark that tests whether AI models can predict the most plausible continuation of a given scenario. It uses adversarially constructed wrong answers that are challenging for models but easy for humans.
Metrics
Accuracy (%) on commonsense completion
Created By
Rowan Zellers et al.
Paper
View paper →
Website
Visit website →
Top Model Scores
| Rank | Model | Score | Date |
|---|---|---|---|
| 1 | GPT-5.2 | 97.8% | 2026-03 |
| 2 | Claude Opus 4.6 | 97.5% | 2026-02 |
| 3 | Gemini 3 Ultra | 97.2% | 2026-01 |
| 4 | Grok 4 | 96.9% | 2026-02 |
| 5 | Llama 4 405B | 95.8% | 2026-01 |
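Because HellaSwag is scored as multiple-choice accuracy, the numbers above come from checking whether the model ranks the correct ending highest among the four candidates. Below is a minimal sketch of that scoring loop, assuming a hypothetical `score_continuation(context, ending)` function that returns the model's log-likelihood for an ending given the context; the record fields (`ctx`, `endings`, `label`) mirror the benchmark's structure but the data here is illustrative.

```python
from typing import Callable, Dict, List


def hellaswag_accuracy(
    records: List[Dict],
    score_continuation: Callable[[str, str], float],
) -> float:
    """Accuracy (%) over multiple-choice completion records.

    Each record holds a context, four candidate endings, and the
    index of the correct ending.
    """
    correct = 0
    for rec in records:
        # Score every candidate ending and predict the most plausible one.
        scores = [score_continuation(rec["ctx"], e) for e in rec["endings"]]
        predicted = scores.index(max(scores))
        correct += int(predicted == rec["label"])
    return 100.0 * correct / len(records)
```

Evaluation harnesses often also report a length-normalized variant (dividing each ending's log-likelihood by its token count), which is one reason published HellaSwag numbers can differ slightly between reports.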
Related Reasoning Benchmarks
ARC (AI2 Reasoning Challenge)
The AI2 Reasoning Challenge contains 7,787 genuine grade-school science questions, split into Easy and Challenge sets. The Challenge set contains only questions that are answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm.
Top: GPT-5.2 — 98.2%
GPQA (Diamond)
Graduate-Level Google-Proof Q&A (GPQA) Diamond is a challenging benchmark of expert-level questions in biology, physics, and chemistry. Questions are designed to be answerable by domain experts but extremely difficult for non-experts, even with web search.
Top: GPT-5.2 — 94.7%
BBH (BIG-Bench Hard)
BIG-Bench Hard is a suite of 23 challenging tasks from the BIG-Bench benchmark on which prior language models did not outperform the average human rater. Tasks include boolean expressions, causal judgement, date understanding, disambiguation, and more.
Top: GPT-5.2 — 95.3%
WinoGrande
WinoGrande is a large-scale dataset of 44,000 Winograd-style problems that require commonsense reasoning to resolve pronoun ambiguity. It is adversarially constructed to be challenging for statistical models.
Top: GPT-5.2 — 96.4%