LegalBench
LegalBench is a collaboratively built benchmark for evaluating legal reasoning in language models. It consists of 162 tasks spanning six types of legal reasoning: issue-spotting, rule-recall, rule-application, rule-conclusion, interpretation, and rhetorical understanding.
Metrics
Accuracy (%) across 162 legal reasoning tasks
Created By
Stanford HAI / Hazy Research
Paper
View paper →
Website
Visit website →
Top Model Scores
| Rank | Model | Accuracy | Date |
|---|---|---|---|
| 1 | Claude Opus 4.6 | 84.6% | 2026-02 |
| 2 | GPT-5.2 | 83.9% | 2026-03 |
| 3 | Gemini 3 Ultra | 81.2% | 2026-01 |
| 4 | Grok 4 | 78.7% | 2026-02 |
| 5 | Llama 4 405B | 75.3% | 2026-01 |
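To make the headline metric concrete, the sketch below shows one plausible way to turn per-task results into a single accuracy figure: score each task with exact match, then macro-average so every task counts equally. The task names, example data, and the always-"Yes" stand-in model are hypothetical, and the official harness handles prompting, answer parsing, and any task-specific metrics itself.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in data: two tiny tasks with (input_text, gold_label) pairs.
# Real LegalBench tasks ship with their own prompt templates and answer sets.
TASKS: Dict[str, List[Tuple[str, str]]] = {
    "hypothetical_yes_no_task": [
        ("Is clause X a condition precedent? ...", "Yes"),
        ("Is clause Y a condition precedent? ...", "No"),
    ],
    "hypothetical_label_task": [
        ("Classify the holding ...", "affirmed"),
    ],
}

def task_accuracy(examples: List[Tuple[str, str]],
                  predict: Callable[[str], str]) -> float:
    """Exact-match accuracy for one task (most LegalBench tasks are classification-style)."""
    correct = sum(predict(text).strip().lower() == gold.strip().lower()
                  for text, gold in examples)
    return correct / len(examples)

def macro_average(tasks: Dict[str, List[Tuple[str, str]]],
                  predict: Callable[[str], str]) -> float:
    """Average per-task accuracies so every task counts equally, regardless of size."""
    per_task = [task_accuracy(examples, predict) for examples in tasks.values()]
    return 100.0 * sum(per_task) / len(per_task)

# A trivial stand-in "model" that always answers "Yes".
print(macro_average(TASKS, lambda text: "Yes"))  # 25.0
```

Macro-averaging keeps tasks of very different sizes from dominating the aggregate, which matters when a benchmark spans as many distinct tasks as this one.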
Related Language Benchmarks
MMLU
Massive Multitask Language Understanding measures knowledge across 57 academic subjects including STEM, humanities, social sciences, and more. It tests both world knowledge and problem-solving ability at varying difficulty levels from elementary to professional.
Top: GPT-5.2 — 92.4%
MT-Bench
MT-Bench evaluates multi-turn conversation ability using 80 high-quality multi-turn questions across 8 categories: writing, roleplay, extraction, reasoning, math, coding, knowledge, and STEM. Responses are judged by GPT-4 on a 1-10 scale.
Top: GPT-5.2 — 9.72
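For a sense of how per-turn judge ratings of this kind become one leaderboard number, the snippet below averages hypothetical 1-10 scores into per-category and overall means. The data is invented for illustration; the real MT-Bench harness generates responses, prompts the judge, and parses its verdicts itself.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical judge outputs: (category, turn, score on a 1-10 scale).
ratings = [
    ("writing", 1, 9), ("writing", 2, 8),
    ("math", 1, 6), ("math", 2, 5),
    ("coding", 1, 7), ("coding", 2, 7),
]

by_category = defaultdict(list)
for category, _turn, score in ratings:
    by_category[category].append(score)

category_means = {cat: mean(scores) for cat, scores in by_category.items()}
overall = mean(score for _, _, score in ratings)

print(category_means)     # {'writing': 8.5, 'math': 5.5, 'coding': 7.0}
print(round(overall, 2))  # 7.0
```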
AlpacaEval 2.0
AlpacaEval 2.0 is an automatic evaluation benchmark that measures instruction-following ability. It uses a length-controlled win rate against a reference model, reducing length bias that affected the original version.
Top: Claude Opus 4.6 — 72.1%
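The score here is a head-to-head win rate against a fixed reference model. The sketch below computes a plain win rate from hypothetical pairwise judgments, counting ties as half a win; the length-controlled variant that AlpacaEval 2.0 actually reports additionally fits a regression to discount wins attributable to longer responses, which is omitted here.

```python
from typing import List, Literal

Outcome = Literal["win", "tie", "loss"]

def win_rate(outcomes: List[Outcome]) -> float:
    """Share of comparisons the candidate model wins; ties count as half a win."""
    points = sum(1.0 if o == "win" else 0.5 if o == "tie" else 0.0
                 for o in outcomes)
    return 100.0 * points / len(outcomes)

# Hypothetical judgments versus the reference model.
print(win_rate(["win", "win", "tie", "loss", "win"]))  # 70.0
```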
WildBench
WildBench evaluates AI models on challenging real-world user queries collected from the wild. It focuses on complex, multi-constraint instructions that test practical model capabilities beyond academic benchmarks.
Top: Claude Opus 4.6 — 68.7%