General · Est. 2024

Arena-Hard-Auto

Arena-Hard-Auto is an automated benchmark whose scores correlate highly with Chatbot Arena rankings. It pairs 500 challenging user queries with an automated judge to approximate human preferences at a fraction of the cost of human evaluation.

Metrics

Win rate (%) vs baseline model
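
As a rough illustration of the metric, the sketch below tallies per-query judge verdicts into a win rate against the baseline. The verdict labels, the `win_rate` helper, and the choice to count a tie as half a win are assumptions made for this example; the benchmark's actual scoring pipeline is not described here and may differ.

```python
from collections import Counter


def win_rate(verdicts, tie_weight=0.5):
    """Return the candidate's win rate (%) against the baseline.

    `verdicts` holds one label per query: "win", "tie", or "loss",
    always from the candidate model's point of view.
    `tie_weight` is an assumption: a tie counts as half a win.
    """
    counts = Counter(verdicts)
    unknown = set(counts) - {"win", "tie", "loss"}
    if unknown:
        raise ValueError(f"unexpected verdict labels: {sorted(unknown)}")
    total = counts["win"] + counts["tie"] + counts["loss"]
    if total == 0:
        raise ValueError("no verdicts to score")
    return 100.0 * (counts["win"] + tie_weight * counts["tie"]) / total


# Hypothetical verdicts for ten queries: 6 wins, 2 ties, 2 losses.
example_verdicts = ["win"] * 6 + ["tie"] * 2 + ["loss"] * 2
print(f"{win_rate(example_verdicts):.1f}%")
```

A model that matches the baseline on every query would land at 50% under this tally, which is why scores are read relative to that midpoint.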

Created By

LMSYS Org

Top Model Scores

Rank	Model	Score	Date
1	GPT-5.2	92.1%	2026-03
2	Claude Opus 4.6	90.6%	2026-02
3	Gemini 3 Ultra	87.3%	2026-01
4	Grok 4	84.8%	2026-02
5	DeepSeek V3	81.2%	2026-01

Related General Benchmarks

Chatbot Arena (LMSYS)

Chatbot Arena is a crowdsourced evaluation platform where users engage in blind, head-to-head comparisons of AI chatbots. Models are ranked using an Elo rating system based on hundreds of thousands of human preference votes.

Top: GPT-5.2 — 1387
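
To make the Elo idea concrete, here is a minimal sketch of a single rating update after one head-to-head vote. The K-factor of 32 and the 400-point scale are the conventional chess values and are assumptions here, as are the function names; the platform's actual rating procedure and parameters are not given above.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    `score_a` is 1.0 if A won the vote, 0.0 if B won, 0.5 for a tie.
    `k` controls how far a single vote moves the ratings (assumed value).
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # Whatever A gains, B loses, so the rating total is conserved.
    return rating_a + delta, rating_b - delta


# Two models start level; model A wins one blind comparison.
new_a, new_b = elo_update(1000.0, 1000.0, score_a=1.0)
print(new_a, new_b)
```

An upset (a low-rated model beating a high-rated one) moves both ratings more than an expected result does, because the update is proportional to the gap between the actual and expected score.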

LiveBench

LiveBench is a continuously updated benchmark designed to minimize contamination by releasing new questions monthly. It covers math, coding, reasoning, language, instruction following, and data analysis, and scores responses against objective, verifiable answers.

Top: GPT-5.2 — 82.6%

TAU-bench

TAU-bench evaluates AI agents on real-world tasks requiring tool use and multi-step reasoning in retail and airline customer service domains. It measures end-to-end task completion with realistic tool APIs.

Top: Claude Opus 4.6 — 68.4%

MT-Bench

MT-Bench (Multi-Turn Bench) evaluates chatbot capabilities through 80 carefully designed multi-turn questions across 8 categories: writing, roleplay, extraction, reasoning, math, coding, STEM, and humanities. An LLM judge (GPT-4 class) scores each response on a 1-10 scale. The benchmark specifically tests how well models handle follow-up questions, maintain context, and sustain an extended dialogue, rather than how they answer a single turn.

Top: GPT-5.2 — 9.6
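
One plausible way to turn per-response judge scores into a single headline number is a plain average, with a per-category breakdown alongside it. The sketch below assumes that aggregation; the `summarize` helper, the input layout, and the sample scores are hypothetical and are not taken from the benchmark's own tooling.

```python
from collections import defaultdict
from statistics import mean


def summarize(scored_responses):
    """Average judge scores overall and per category.

    `scored_responses` is a list of (category, score) pairs, where each
    score is the judge's 1-10 rating for one response.
    Returns (overall_mean, {category: category_mean}).
    """
    by_category = defaultdict(list)
    for category, score in scored_responses:
        if not 1 <= score <= 10:
            raise ValueError(f"score out of range 1-10: {score}")
        by_category[category].append(score)
    if not by_category:
        raise ValueError("no scores to summarize")
    per_category = {name: mean(vals) for name, vals in by_category.items()}
    overall = mean(score for _, score in scored_responses)
    return overall, per_category


# Hypothetical judge ratings for two turns in each of two categories.
sample = [("writing", 9), ("writing", 7), ("math", 5), ("math", 7)]
overall, per_category = summarize(sample)
print(overall, per_category)
```

The per-category view is useful because a single average can hide a model that is strong at writing but weak at math.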
