LMSYS Leaderboard
The LMSYS Chatbot Arena Leaderboard aggregates human preference data from blind, side-by-side model comparisons into rankings across multiple categories. Beyond the overall Elo rating, it provides specialized leaderboards for coding, math, hard prompts, longer queries, and instruction following. It is among the most widely cited community-driven rankings of AI model capabilities across diverse real-world use cases.
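To make the ranking mechanism concrete, here is a minimal Elo-style update for a single blind comparison. The K-factor of 4 and the example ratings are illustrative assumptions, not the leaderboard's actual parameters; the live leaderboard computes ratings from the full vote history (see the Bradley-Terry sketch further below) rather than by online updates.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, outcome: float,
               k: float = 4.0) -> tuple[float, float]:
    """Update both ratings after one vote.

    outcome: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    k is a hypothetical K-factor chosen for illustration.
    """
    e_a = expected_score(r_a, r_b)
    r_a_new = r_a + k * (outcome - e_a)
    r_b_new = r_b - k * (outcome - e_a)  # rating points are conserved
    return r_a_new, r_b_new

# Example: a 1388-rated model beats a 1375-rated model.
print(elo_update(1388, 1375, 1.0))  # small gain for A, equal loss for B
```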
Metrics
Composite Elo across multiple categories
Created By
LMSYS Org (UC Berkeley)
Paper
View paper →
Website
Visit website →
Top Model Scores
| Rank | Model | Arena Elo | Date |
|---|---|---|---|
| 1 | GPT-5.2 | 1388 | 2026-03 |
| 2 | Claude Opus 4.6 | 1375 | 2026-02 |
| 3 | Gemini 3 Ultra | 1361 | 2026-01 |
| 4 | Grok 4 | 1344 | 2026-02 |
| 5 | DeepSeek-V4 | 1332 | 2026-01 |
Related General Benchmarks
Chatbot Arena (LMSYS)
Chatbot Arena is a crowdsourced evaluation platform where users engage in blind, head-to-head comparisons of AI chatbots. Models are ranked using an Elo rating system based on hundreds of thousands of human preference votes.
Top: GPT-5.2 — 1387
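How a full vote history becomes a single ranking can be sketched as a Bradley-Terry fit: the snippet below recovers per-model strengths from pairwise win counts by plain gradient ascent, then rescales them to Elo-like units. The learning rate, step count, and 1000-point anchor are arbitrary illustrative choices; the production pipeline is more involved (e.g. it also reports confidence intervals), which this sketch omits.

```python
import math
from collections import defaultdict

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def fit_bradley_terry(wins: dict[tuple[str, str], int],
                      lr: float = 0.05, steps: int = 5000) -> dict[str, float]:
    """Fit Bradley-Terry log-strengths from pairwise win counts.

    wins[(a, b)] is the number of votes in which model a beat model b.
    """
    models = sorted({m for pair in wins for m in pair})
    total = max(sum(wins.values()), 1)
    s = {m: 0.0 for m in models}
    for _ in range(steps):
        grad = defaultdict(float)
        for (a, b), w in wins.items():
            p = sigmoid(s[a] - s[b])  # P(a beats b) under current strengths
            grad[a] += w * (1.0 - p)
            grad[b] -= w * (1.0 - p)
        for m in models:
            s[m] += lr * grad[m] / total
        # anchor the mean at zero so the scale is identifiable
        mean = sum(s.values()) / len(s)
        s = {m: v - mean for m, v in s.items()}
    # convert log-strengths to an Elo-like scale (400 per factor-of-10 odds)
    return {m: 1000 + 400 / math.log(10) * v for m, v in s.items()}

# A 60/40 win split maps to roughly a 70-point Elo gap.
print(fit_bradley_terry({("gpt", "claude"): 60, ("claude", "gpt"): 40}))
```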
LiveBench
LiveBench is a continuously updated benchmark designed to minimize contamination by releasing new questions each month. It covers math, coding, reasoning, language, instruction following, and data analysis, with objective, verifiable answers.
Top: GPT-5.2 — 82.6%
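Because LiveBench answers are objective and verifiable, grading needs no judge model and reduces to string or numeric comparison. A minimal grader sketch under that assumption; the function names and normalization rules here are illustrative, not LiveBench's actual harness:

```python
def normalize(ans: str) -> str:
    """Lowercase, trim, and strip trailing punctuation for a fair comparison."""
    return ans.strip().lower().rstrip(".")

def grade(prediction: str, gold: str, numeric_tol: float = 1e-6) -> bool:
    """Return True if the model's answer matches the ground truth.

    Tries exact numeric comparison first, then a normalized string match.
    """
    try:
        return abs(float(prediction) - float(gold)) <= numeric_tol
    except ValueError:
        return normalize(prediction) == normalize(gold)

score = sum(grade(p, g) for p, g in [("42", "42.0"), ("Paris.", "paris")]) / 2
print(f"{score:.1%}")  # 100.0%
```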
Arena-Hard-Auto
Arena-Hard-Auto is an automated benchmark that correlates highly with Chatbot Arena rankings. It uses 500 challenging user queries and automated judge evaluation to approximate human preferences at a fraction of the cost.
Top: GPT-5.2 — 92.1%
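The automated-judge pattern is straightforward to sketch. In the Python below, `call_judge` is a hypothetical stand-in for whatever judge-model API is used, and the prompt and win-rate aggregation illustrate the general LLM-as-a-judge approach rather than Arena-Hard-Auto's exact prompts (real harnesses, for example, also swap the A/B positions to control for position bias):

```python
JUDGE_TEMPLATE = """You are an impartial judge. Compare the two responses to
the user query and answer with exactly one token: A, B, or TIE.

Query: {query}

Response A: {answer_a}

Response B: {answer_b}"""

def call_judge(prompt: str) -> str:
    """Stand-in for a judge-model API call; replace with a real LLM request.

    Hardcoded to "A" only so the sketch runs as-is.
    """
    return "A"

def pairwise_win_rate(queries: list[str], model_answers: list[str],
                      baseline_answers: list[str]) -> float:
    """Fraction of queries the candidate wins against a fixed baseline."""
    wins = ties = 0
    for q, a, b in zip(queries, model_answers, baseline_answers):
        verdict = call_judge(
            JUDGE_TEMPLATE.format(query=q, answer_a=a, answer_b=b)).strip()
        wins += verdict == "A"
        ties += verdict == "TIE"
    return (wins + 0.5 * ties) / len(queries)  # ties count as half a win

print(pairwise_win_rate(["What is 2+2?"], ["4"], ["5"]))  # 1.0
```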
TAU-bench
TAU-bench evaluates AI agents on real-world tasks requiring tool use and multi-step reasoning in retail and airline customer service domains. It measures end-to-end task completion with realistic tool APIs.
Top: Claude Opus 4.6 — 68.4%
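A tool-using agent loop of the kind TAU-bench exercises can be sketched in a few lines. The tool registry below is a toy stand-in for the benchmark's retail and airline APIs, and the scripted policy stands in for a model-driven controller; TAU-bench itself grades end-state correctness, which this control-flow sketch does not implement.

```python
from typing import Callable

# Toy tool registry standing in for TAU-bench's realistic domain APIs.
TOOLS: dict[str, Callable[..., str]] = {
    "lookup_order": lambda order_id: f"order {order_id}: shipped",
    "issue_refund": lambda order_id, amount: f"refunded ${amount} on {order_id}",
}

def run_agent(task: str, policy: Callable[[str], dict],
              max_steps: int = 10) -> str:
    """Minimal agent loop: the policy maps the transcript to the next action.

    An action is {"tool": name, "args": {...}} or {"final": answer}.
    """
    transcript = f"Task: {task}"
    for _ in range(max_steps):
        action = policy(transcript)
        if "final" in action:
            return action["final"]
        result = TOOLS[action["tool"]](**action["args"])
        transcript += f"\n{action['tool']} -> {result}"
    return "max steps exceeded"

def scripted_policy(transcript: str) -> dict:
    """Toy scripted policy; a real agent would call an LLM here."""
    if "lookup_order" not in transcript:
        return {"tool": "lookup_order", "args": {"order_id": "A1"}}
    return {"final": "Order A1 has shipped."}

print(run_agent("Where is order A1?", scripted_policy))
```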