Reasoning · Est. 2019

HellaSwag

HellaSwag is a commonsense reasoning benchmark that tests whether AI models can pick the most plausible continuation of a given scenario from four candidate endings. Its incorrect endings are machine-generated and selected via Adversarial Filtering, which makes them challenging for models yet easy for humans.
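The evaluation loop this implies can be sketched as follows. This is a minimal illustration, not the official harness: a real evaluation would score each ending by the model's log-likelihood given the context, whereas the `overlap_score` stand-in below (and the sample item) are invented here purely so the sketch runs end to end.

```python
def tokens(text):
    """Lowercase and strip trailing punctuation so word comparisons match."""
    return [w.strip(".,!?").lower() for w in text.split()]

def pick_ending(context, endings, score_fn):
    """Return the index of the highest-scoring candidate ending."""
    scores = [score_fn(context, e) for e in endings]
    return max(range(len(endings)), key=lambda i: scores[i])

def accuracy(items, score_fn):
    """Fraction of items where the top-scored ending matches the gold label."""
    correct = sum(
        pick_ending(it["ctx"], it["endings"], score_fn) == it["label"]
        for it in items
    )
    return correct / len(items)

# Stand-in scorer: fraction of ending words that also appear in the context.
# A real harness would use a language model's log-likelihood instead.
def overlap_score(context, ending):
    ctx_words = set(tokens(context))
    end_words = tokens(ending)
    return sum(w in ctx_words for w in end_words) / max(len(end_words), 1)

# Hypothetical item in the HellaSwag shape: context, four endings, gold label.
items = [{
    "ctx": "A man pours pancake batter onto a hot griddle. He",
    "endings": [
        "flips the pancake when bubbles form on the hot griddle.",
        "drives the griddle to the supermarket.",
        "sings to the refrigerator about the weather.",
        "paints the batter blue and mails it.",
    ],
    "label": 0,
}]

print(accuracy(items, overlap_score))  # → 1.0 on this toy item
```

The reported metric is exactly this `accuracy` value, expressed as a percentage over the benchmark's validation or test split.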

Metrics

Accuracy (%) on commonsense completion

Created By

Rowan Zellers et al.

Top Model Scores

Rank  Model            Score  Date
1     GPT-5.2          97.8%  2026-03
2     Claude Opus 4.6  97.5%  2026-02
3     Gemini 3 Ultra   97.2%  2026-01
4     Grok 4           96.9%  2026-02
5     Llama 4 405B     95.8%  2026-01