Reasoning · Est. 2018

ARC-Challenge

The AI2 Reasoning Challenge (ARC) tests science question answering at a grade-school level, but the Challenge partition specifically contains questions that both a retrieval-based baseline and a word co-occurrence baseline answer incorrectly. These questions require genuine commonsense reasoning, multi-step inference, and understanding of basic scientific principles. ARC-Challenge has been a longstanding benchmark for measuring reasoning progress in language models.

Metrics

Accuracy (%) on challenge-set science questions
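
Scoring is exact-match accuracy over multiple-choice items (typically four options each). The sketch below illustrates the computation; the item layout and the predict() stub are hypothetical stand-ins for a real model call, not AI2's official evaluation harness. The sample question is drawn from the ARC paper.

```python
# Minimal sketch of ARC-Challenge scoring: exact-match accuracy over
# multiple-choice science questions, reported as a percentage.

def predict(question: str, choices: list[str]) -> str:
    """Hypothetical model stub; a real run would query an LLM here."""
    return "A"  # placeholder: always picks the first option

items = [
    # Each item: a question, labeled answer options, and a gold answer key.
    {
        "question": "Which property of a mineral can be determined "
                    "just by looking at it?",
        "choices": ["luster", "mass", "weight", "hardness"],
        "answer_key": "A",
    },
]

correct = sum(
    predict(item["question"], item["choices"]) == item["answer_key"]
    for item in items
)
accuracy = 100.0 * correct / len(items)  # accuracy as a percentage
print(f"ARC-Challenge accuracy: {accuracy:.1f}%")
```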

Created By

Peter Clark et al., Allen Institute for AI (AI2)

Top Model Scores

Rank  Model            Score  Date
1     GPT-5.2          98.1%  2026-03
2     Claude Opus 4.6  97.8%  2026-02
3     Gemini 3 Ultra   97.5%  2026-01
4     Grok 4           96.9%  2026-02
5     Llama 4 405B     95.2%  2026-01