Reasoning · Est. 2022

BIG-Bench Hard

BIG-Bench Hard (BBH) is a curated subset of 23 challenging tasks from the original BIG-Bench suite, selected because prior language models scored below the average human rater on them. The tasks span algorithmic reasoning, natural language understanding, and world knowledge. BBH specifically tests chain-of-thought (CoT) reasoning and has become a key measure of whether models can handle multi-step logical thinking.

Metrics

Accuracy (%) across 23 hard tasks with chain-of-thought prompting

Created By

Mirac Suzgun et al. (Google/Stanford)

Top Model Scores

Rank  Model            Score  Date
1     GPT-5.2          94.2%  2026-03
2     Claude Opus 4.6  93.5%  2026-02
3     Gemini 3 Ultra   92.8%  2026-01
4     Grok 4           91.1%  2026-02
5     Llama 4 405B     88.6%  2026-01