LMSYS Leaderboard
The LMSYS Chatbot Arena Leaderboard aggregates human preference data from blind, side-by-side model comparisons into rankings across multiple categories. Beyond the overall Elo rating, it provides specialized leaderboards for coding, math, hard prompts, longer queries, and instruction following. It is among the most widely cited community-driven rankings of AI model capabilities across diverse real-world use cases.
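To make the ranking mechanism concrete, here is a minimal Elo-style update for a single blind comparison. The K-factor of 4 and the example ratings are illustrative assumptions, not the leaderboard's actual parameters; the live leaderboard computes ratings from the full vote history (see the Bradley-Terry sketch further below) rather than by online updates.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, outcome: float,
               k: float = 4.0) -> tuple[float, float]:
    """Update both ratings after one vote.

    outcome: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    k is a hypothetical K-factor chosen for illustration.
    """
    e_a = expected_score(r_a, r_b)
    r_a_new = r_a + k * (outcome - e_a)
    r_b_new = r_b - k * (outcome - e_a)  # rating points are conserved
    return r_a_new, r_b_new

# Example: a 1388-rated model beats a 1375-rated model.
print(elo_update(1388, 1375, 1.0))  # small gain for A, equal loss for B
```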
Metrics
Composite Elo across multiple categories
Created By
LMSYS Org (UC Berkeley)
Paper
View paper →
Website
Visit website →
Top Model Scores
| Rank | Model | Arena Elo | Date |
|---|---|---|---|
| 1 | GPT-5.2 | 1388 | 2026-03 |
| 2 | Claude Opus 4.6 | 1375 | 2026-02 |
| 3 | Gemini 3 Ultra | 1361 | 2026-01 |
| 4 | Grok 4 | 1344 | 2026-02 |
| 5 | DeepSeek-V4 | 1332 | 2026-01 |
Related General Benchmarks
Chatbot Arena (LMSYS)
Chatbot Arena is a crowdsourced evaluation platform where users engage in blind, head-to-head comparisons of AI chatbots. Models are ranked using an Elo rating system based on hundreds of thousands of human preference votes.
Top: GPT-5.2 — 1387
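How a full vote history becomes a single ranking can be sketched as a Bradley-Terry fit: the snippet below recovers per-model strengths from pairwise win counts by plain gradient ascent, then rescales them to Elo-like units. The learning rate, step count, and 1000-point anchor are arbitrary illustrative choices; the production pipeline is more involved (e.g. it also reports confidence intervals), which this sketch omits.

```python
import math
from collections import defaultdict

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def fit_bradley_terry(wins: dict[tuple[str, str], int],
                      lr: float = 0.05, steps: int = 5000) -> dict[str, float]:
    """Fit Bradley-Terry log-strengths from pairwise win counts.

    wins[(a, b)] is the number of votes in which model a beat model b.
    """
    models = sorted({m for pair in wins for m in pair})
    total = max(sum(wins.values()), 1)
    s = {m: 0.0 for m in models}
    for _ in range(steps):
        grad = defaultdict(float)
        for (a, b), w in wins.items():
            p = sigmoid(s[a] - s[b])  # P(a beats b) under current strengths
            grad[a] += w * (1.0 - p)
            grad[b] -= w * (1.0 - p)
        for m in models:
            s[m] += lr * grad[m] / total
        # anchor the mean at zero so the scale is identifiable
        mean = sum(s.values()) / len(s)
        s = {m: v - mean for m, v in s.items()}
    # convert log-strengths to an Elo-like scale (400 per factor-of-10 odds)
    return {m: 1000 + 400 / math.log(10) * v for m, v in s.items()}

# A 60/40 win split maps to roughly a 70-point Elo gap.
print(fit_bradley_terry({("gpt", "claude"): 60, ("claude", "gpt"): 40}))
```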
LiveBench
LiveBench is a continuously updated benchmark designed to minimize contamination by releasing new questions each month. It covers math, coding, reasoning, language, instruction following, and data analysis, with objective, verifiable answers.
Top: GPT-5.2 — 82.6%
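Because LiveBench answers are objective and verifiable, grading needs no judge model and reduces to string or numeric comparison. A minimal grader sketch under that assumption; the function names and normalization rules here are illustrative, not LiveBench's actual harness:

```python
def normalize(ans: str) -> str:
    """Lowercase, trim, and strip trailing punctuation for a fair comparison."""
    return ans.strip().lower().rstrip(".")

def grade(prediction: str, gold: str, numeric_tol: float = 1e-6) -> bool:
    """Return True if the model's answer matches the ground truth.

    Tries exact numeric comparison first, then a normalized string match.
    """
    try:
        return abs(float(prediction) - float(gold)) <= numeric_tol
    except ValueError:
        return normalize(prediction) == normalize(gold)

score = sum(grade(p, g) for p, g in [("42", "42.0"), ("Paris.", "paris")]) / 2
print(f"{score:.1%}")  # 100.0%
```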
Arena-Hard-Auto
Arena-Hard-Auto is an automated benchmark that correlates highly with Chatbot Arena rankings. It uses 500 challenging user queries and automated judge evaluation to approximate human preferences at a fraction of the cost.
Top: GPT-5.2 — 92.1%
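The automated-judge pattern is straightforward to sketch. In the Python below, `call_judge` is a hypothetical stand-in for whatever judge-model API is used, and the prompt and win-rate aggregation illustrate the general LLM-as-a-judge approach rather than Arena-Hard-Auto's exact prompts (real harnesses, for example, also swap the A/B positions to control for position bias):

```python
JUDGE_TEMPLATE = """You are an impartial judge. Compare the two responses to
the user query and answer with exactly one token: A, B, or TIE.

Query: {query}

Response A: {answer_a}

Response B: {answer_b}"""

def call_judge(prompt: str) -> str:
    """Stand-in for a judge-model API call; replace with a real LLM request.

    Hardcoded to "A" only so the sketch runs as-is.
    """
    return "A"

def pairwise_win_rate(queries: list[str], model_answers: list[str],
                      baseline_answers: list[str]) -> float:
    """Fraction of queries the candidate wins against a fixed baseline."""
    wins = ties = 0
    for q, a, b in zip(queries, model_answers, baseline_answers):
        verdict = call_judge(
            JUDGE_TEMPLATE.format(query=q, answer_a=a, answer_b=b)).strip()
        wins += verdict == "A"
        ties += verdict == "TIE"
    return (wins + 0.5 * ties) / len(queries)  # ties count as half a win

print(pairwise_win_rate(["What is 2+2?"], ["4"], ["5"]))  # 1.0
```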
TAU-bench
TAU-bench evaluates AI agents on real-world tasks requiring tool use and multi-step reasoning in retail and airline customer service domains. It measures end-to-end task completion with realistic tool APIs.
Top: Claude Opus 4.6 — 68.4%
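A tool-using agent loop of the kind TAU-bench exercises can be sketched in a few lines. The tool registry below is a toy stand-in for the benchmark's retail and airline APIs, and the scripted policy stands in for a model-driven controller; TAU-bench itself grades end-state correctness, which this control-flow sketch does not implement.

```python
from typing import Callable

# Toy tool registry standing in for TAU-bench's realistic domain APIs.
TOOLS: dict[str, Callable[..., str]] = {
    "lookup_order": lambda order_id: f"order {order_id}: shipped",
    "issue_refund": lambda order_id, amount: f"refunded ${amount} on {order_id}",
}

def run_agent(task: str, policy: Callable[[str], dict],
              max_steps: int = 10) -> str:
    """Minimal agent loop: the policy maps the transcript to the next action.

    An action is {"tool": name, "args": {...}} or {"final": answer}.
    """
    transcript = f"Task: {task}"
    for _ in range(max_steps):
        action = policy(transcript)
        if "final" in action:
            return action["final"]
        result = TOOLS[action["tool"]](**action["args"])
        transcript += f"\n{action['tool']} -> {result}"
    return "max steps exceeded"

def scripted_policy(transcript: str) -> dict:
    """Toy scripted policy; a real agent would call an LLM here."""
    if "lookup_order" not in transcript:
        return {"tool": "lookup_order", "args": {"order_id": "A1"}}
    return {"final": "Order A1 has shipped."}

print(run_agent("Where is order A1?", scripted_policy))
```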