Est. 2023

AlpacaEval 2.0

AlpacaEval 2.0 is an automatic evaluation benchmark that measures instruction-following ability. It uses a length-controlled win rate against a reference model, reducing length bias that affected the original version.

Metrics

Length-controlled win rate (%) vs reference
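As a rough illustration of the underlying metric, the sketch below computes a plain (not length-controlled) win rate from pairwise judge verdicts; the scoring convention (1.0 for a win, 0.5 for a tie, 0.0 for a loss) and the sample data are assumptions, not the official AlpacaEval implementation:

```python
# Simplified sketch, not the official AlpacaEval code:
# average preference score of a candidate model vs. the reference,
# where each judgment is 1.0 (win), 0.5 (tie), or 0.0 (loss).

def win_rate(judgments):
    """Fraction of pairwise comparisons won by the candidate (ties count half)."""
    return sum(judgments) / len(judgments)

# Hypothetical judge verdicts over five instructions.
scores = [1.0, 0.0, 1.0, 0.5, 1.0]
print(f"{win_rate(scores) * 100:.1f}%")  # 70.0%
```

The length-controlled variant reported by AlpacaEval 2.0 additionally regresses out the effect of response length before computing this rate, so a model cannot raise its score just by answering more verbosely.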

Created By

Stanford CRFM

Top Model Scores

Rank  Model            Score   Date
1     Claude Opus 4.6  72.1%   2026-02
2     GPT-5.2          70.8%   2026-03
3     Gemini 3 Ultra   67.4%   2026-01
4     Grok 4           65.9%   2026-02
5     Llama 4 405B     61.3%   2026-01