Language | Est. 2024

WildBench

WildBench evaluates AI models on challenging real-world user queries collected from the wild. It focuses on complex, multi-constraint instructions that test practical model capabilities beyond academic benchmarks.

Metrics

Win rate (%) on real-world user queries
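
As a rough illustration of how a win-rate percentage can be computed from per-query judgments, here is a minimal sketch. The function name `win_rate`, the outcome labels, and the `tie_weight` parameter are all hypothetical. In particular, counting a tie as half a win is an assumption made for this example, not something the benchmark description specifies.

```python
from collections import Counter

VALID_OUTCOMES = {"win", "loss", "tie"}

def win_rate(outcomes, tie_weight=0.5):
    """Return the win rate as a percentage over a list of per-query outcomes.

    `outcomes` holds one label per query: "win", "loss", or "tie".
    `tie_weight` is an assumed convention (a tie counts as half a win).
    """
    if not outcomes:
        raise ValueError("no outcomes to score")
    counts = Counter(outcomes)
    unknown = set(counts) - VALID_OUTCOMES
    if unknown:
        raise ValueError(f"unknown outcome labels: {sorted(unknown)}")
    # Wins count fully, ties count by tie_weight, losses count as zero.
    score = counts["win"] + tie_weight * counts["tie"]
    return 100.0 * score / len(outcomes)

# Illustrative data only, not real benchmark judgments.
example = ["win", "win", "loss", "tie"]
print(f"{win_rate(example):.1f}%")  # prints "62.5%"
```

Setting `tie_weight=0` would instead treat ties as losses, which lowers the reported percentage, so the tie convention should be stated alongside any score.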

Created By

Allen Institute for AI

Top Model Scores

Rank  Model            Score  Date
1     Claude Opus 4.6  68.7%  2026-02
2     GPT-5.2          67.2%  2026-03
3     Gemini 3 Ultra   63.8%  2026-01
4     Grok 4           61.4%  2026-02
5     Llama 4 405B     57.1%  2026-01