Reasoning · Est. 2022

BBH (BIG-Bench Hard)

BIG-Bench Hard is a suite of 23 challenging tasks from the BIG-Bench benchmark on which, at the time of its release, language models performed below the average human rater. Tasks include boolean expressions, causal judgement, date understanding, disambiguation, and more.

Metrics

Accuracy (%), averaged across the 23 hard tasks
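A minimal sketch of how such a score can be computed: exact-match accuracy per task, then an unweighted (macro) average across tasks. The task names and predictions below are illustrative placeholders, not real BBH data or the benchmark's official scoring harness.

```python
def task_accuracy(preds, golds):
    """Exact-match accuracy for a single task."""
    assert len(preds) == len(golds)
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def benchmark_score(per_task):
    """Average per-task accuracies with equal weight (macro average)."""
    accs = [task_accuracy(preds, golds) for preds, golds in per_task.values()]
    return 100 * sum(accs) / len(accs)

# Hypothetical predictions for two tasks (real BBH tasks have ~250 examples each).
per_task = {
    "boolean_expressions": (["True", "False"], ["True", "False"]),  # 2/2 correct
    "date_understanding": (["(A)", "(B)"], ["(A)", "(C)"]),         # 1/2 correct
}
print(f"{benchmark_score(per_task):.1f}%")  # macro average of 100% and 50% -> 75.0%
```

Because each task contributes equally regardless of its example count, a model cannot inflate its score by excelling only on the largest tasks.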

Created By

Mirac Suzgun et al.

Top Model Scores

Rank  Model            Score   Date
1     GPT-5.2          95.3%   2026-03
2     Claude Opus 4.6  94.8%   2026-02
3     Gemini 3 Ultra   94.1%   2026-01
4     Grok 4           92.7%   2026-02
5     Llama 4 405B     90.6%   2026-01