HellaSwag
HellaSwag is a commonsense reasoning benchmark that tests whether AI models can predict the most plausible continuation of a given scenario. It uses adversarially constructed wrong answers that are challenging for models but easy for humans.
Metrics
Accuracy (%) on commonsense completion
Created By
Rowan Zellers et al.
Paper
View paper →
Website
Visit website →
Top Model Scores
| Rank | Model | Score | Date |
|---|---|---|---|
| 1 | GPT-5.2 | 97.8% | 2026-03 |
| 2 | Claude Opus 4.6 | 97.5% | 2026-02 |
| 3 | Gemini 3 Ultra | 97.2% | 2026-01 |
| 4 | Grok 4 | 96.9% | 2026-02 |
| 5 | Llama 4 405B | 95.8% | 2026-01 |
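Because HellaSwag is scored as multiple-choice accuracy, the numbers above come from checking whether the model ranks the correct ending highest among the four candidates. Below is a minimal sketch of that scoring loop, assuming a hypothetical `score_continuation(context, ending)` function that returns the model's log-likelihood for an ending given the context; the record fields (`ctx`, `endings`, `label`) mirror the benchmark's structure but the data here is illustrative.

```python
from typing import Callable, Dict, List


def hellaswag_accuracy(
    records: List[Dict],
    score_continuation: Callable[[str, str], float],
) -> float:
    """Accuracy (%) over multiple-choice completion records.

    Each record holds a context, four candidate endings, and the
    index of the correct ending.
    """
    correct = 0
    for rec in records:
        # Score every candidate ending and predict the most plausible one.
        scores = [score_continuation(rec["ctx"], e) for e in rec["endings"]]
        predicted = scores.index(max(scores))
        correct += int(predicted == rec["label"])
    return 100.0 * correct / len(records)
```

Evaluation harnesses often also report a length-normalized variant (dividing each ending's log-likelihood by its token count), which is one reason published HellaSwag numbers can differ slightly between reports.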
Related Reasoning Benchmarks
ARC (AI2 Reasoning Challenge)
The AI2 Reasoning Challenge contains 7,787 genuine grade-school science questions, split into Easy and Challenge sets. The Challenge set contains only questions that are answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm.
Top: GPT-5.2 — 98.2%
GPQA (Diamond)
Graduate-Level Google-Proof Q&A (GPQA) Diamond is a challenging benchmark of expert-level questions in biology, physics, and chemistry. Questions are designed to be answerable by domain experts but extremely difficult for non-experts, even with web search.
Top: GPT-5.2 — 94.7%
BBH (BIG-Bench Hard)
BIG-Bench Hard is a suite of 23 challenging tasks from the BIG-Bench benchmark on which prior language models did not outperform the average human rater. Tasks include boolean expressions, causal judgement, date understanding, disambiguation, and more.
Top: GPT-5.2 — 95.3%
WinoGrande
WinoGrande is a large-scale dataset of 44,000 Winograd-style problems that require commonsense reasoning to resolve pronoun ambiguity. It is adversarially constructed to be challenging for statistical models.
Top: GPT-5.2 — 96.4%