Chatbot Arena (LMSYS)
Chatbot Arena is a crowdsourced evaluation platform where users engage in blind, head-to-head comparisons of AI chatbots. Models are ranked using an Elo rating system based on hundreds of thousands of human preference votes.
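Leaderboard ratings are derived from pairwise preference votes. LMSYS has also described fitting a Bradley–Terry model over the full vote history; the snippet below is only a minimal sketch of a classic online Elo update applied to a single vote, with the K-factor, starting rating, and tie handling as illustrative assumptions rather than the platform's exact parameters.

```python
# Minimal sketch of an online Elo update from one pairwise preference vote.
# K-factor, starting rating, and tie handling are illustrative assumptions,
# not the exact settings used by Chatbot Arena.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A is preferred over model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a: float, rating_b: float, outcome_a: float, k: float = 32.0):
    """outcome_a is 1.0 if A wins the vote, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (outcome_a - e_a)
    new_b = rating_b + k * ((1.0 - outcome_a) - (1.0 - e_a))
    return new_a, new_b

# Example: two models start at 1000; a single vote prefers model A.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
ratings["model_a"], ratings["model_b"] = elo_update(
    ratings["model_a"], ratings["model_b"], outcome_a=1.0
)
print(ratings)  # model_a moves above 1000, model_b symmetrically below
```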
Metrics
Elo rating from human preference votes
Created By
LMSYS Org
Top Model Scores
| Rank | Model | Elo Rating | Date (YYYY-MM) |
|---|---|---|---|
| 1 | GPT-5.2 | 1387 | 2026-03 |
| 2 | Claude Opus 4.6 | 1379 | 2026-02 |
| 3 | Gemini 3 Ultra | 1365 | 2026-01 |
| 4 | Grok 4 | 1348 | 2026-02 |
| 5 | DeepSeek V3 | 1331 | 2026-01 |
Related General Benchmarks
LiveBench
LiveBench is a continuously updated benchmark designed to minimize contamination by using new questions monthly. It covers math, coding, reasoning, language, instruction following, and data analysis with objective, verifiable answers.
Top: GPT-5.2 — 82.6%
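Because LiveBench answers are objective and verifiable, scoring reduces to programmatic checks rather than LLM judging. The helper below is a hypothetical illustration of that idea; the function names and normalization rules are assumptions, not LiveBench's actual task-specific graders.

```python
# Hypothetical illustration of objective, verifiable scoring: normalize both the
# model output and the ground-truth answer, then compare exactly.
import re

def normalize(text: str) -> str:
    """Lowercase, strip surrounding whitespace, and collapse internal whitespace."""
    return re.sub(r"\s+", " ", text.strip().lower())

def exact_match(model_output: str, ground_truth: str) -> bool:
    return normalize(model_output) == normalize(ground_truth)

# A benchmark-level score is then just the mean over questions.
predictions = ["42", " Forty-two "]
answers = ["42", "forty-two"]
accuracy = sum(exact_match(p, a) for p, a in zip(predictions, answers)) / len(answers)
print(f"{accuracy:.1%}")  # 100.0%
```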
Arena-Hard-Auto
Arena-Hard-Auto is an automated benchmark that correlates highly with Chatbot Arena rankings. It uses 500 challenging user queries and automated judge evaluation to approximate human preferences at a fraction of the cost.
Top: GPT-5.2 — 92.1%
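The loop below is a rough sketch of that kind of automated pairwise judging: for each query, a judge model compares the candidate's answer against a fixed baseline and the win rate is aggregated. `call_judge` is a placeholder for a real LLM API call, and the prompt wording and parsing are assumptions rather than Arena-Hard-Auto's published pipeline.

```python
# Rough sketch of automated pairwise judging against a baseline model.
# `call_judge` is a placeholder for an LLM API call; prompt and parsing are assumptions.
from typing import Callable

JUDGE_PROMPT = (
    "You are comparing two answers to the same user query.\n"
    "Query: {query}\n\nAnswer A:\n{answer_a}\n\nAnswer B:\n{answer_b}\n\n"
    "Reply with exactly one token: A, B, or TIE."
)

def pairwise_win_rate(
    queries: list[str],
    candidate_answers: list[str],
    baseline_answers: list[str],
    call_judge: Callable[[str], str],
) -> float:
    """Fraction of queries where the judge prefers the candidate (ties count as 0.5)."""
    score = 0.0
    for query, cand, base in zip(queries, candidate_answers, baseline_answers):
        verdict = call_judge(
            JUDGE_PROMPT.format(query=query, answer_a=cand, answer_b=base)
        ).strip().upper()
        if verdict == "A":
            score += 1.0
        elif verdict == "TIE":
            score += 0.5
    return score / len(queries)
```

In practice, judges are typically run a second time with the answer order swapped to reduce position bias before the two verdicts are combined.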
TAU-bench
TAU-bench evaluates AI agents on real-world tasks requiring tool use and multi-step reasoning in retail and airline customer service domains. It measures end-to-end task completion with realistic tool APIs.
Top: Claude Opus 4.6 — 68.4%
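The sketch below shows the kind of agent loop such a benchmark exercises: at each step the model either calls a tool or replies to the user, and success is judged on the end-to-end outcome. The tool registry, the `call_model` placeholder, and the message format are assumptions for illustration, not TAU-bench's actual interface.

```python
# Simplified sketch of a tool-using agent episode. The model returns either a tool
# call (e.g. {"tool": "lookup_order", "args": {...}}) or a final user-facing reply.
from typing import Any, Callable

def run_episode(
    call_model: Callable[[list[dict]], dict],
    tools: dict[str, Callable[..., Any]],
    user_message: str,
    max_steps: int = 10,
) -> list[dict]:
    """Run one tool-using dialogue episode and return the transcript."""
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):
        action = call_model(messages)
        if "tool" in action:
            result = tools[action["tool"]](**action.get("args", {}))
            messages.append({"role": "tool", "name": action["tool"], "content": str(result)})
        else:
            messages.append({"role": "assistant", "content": action["content"]})
            break  # agent replied to the user; task completion is checked afterwards
    return messages
```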
MT-Bench
MT-Bench (Multi-Turn Bench) evaluates chatbot capabilities through 80 carefully designed multi-turn conversations across 8 categories: writing, roleplay, extraction, reasoning, math, coding, knowledge, and STEM. An LLM judge (GPT-4 class) scores responses on a 1-10 scale. It specifically tests how well models handle follow-up questions, maintain context, and engage in extended dialogue rather than single-turn responses.
Top: GPT-5.2 — 9.6
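The snippet below sketches multi-turn judging in this style: the judge scores each turn on a 1-10 scale with the conversation so far as context, and the scores are averaged. `call_judge` is a placeholder and the prompt wording is an assumption, not MT-Bench's exact judge template.

```python
# Sketch of MT-Bench-style multi-turn judging: score each turn 1-10 in context,
# then average. `call_judge` is a placeholder for a real judge-model call.
from statistics import mean
from typing import Callable

def score_conversation(
    turns: list[tuple[str, str]],          # (user question, model answer) per turn
    call_judge: Callable[[str], float],    # returns a 1-10 score parsed from the judge
) -> float:
    history = ""
    scores = []
    for question, answer in turns:
        history += f"User: {question}\nAssistant: {answer}\n"
        prompt = (
            "Rate the assistant's latest answer on a 1-10 scale, considering the "
            f"whole conversation so far:\n{history}\nScore:"
        )
        scores.append(call_judge(prompt))
    return mean(scores)  # overall score is the mean across turns and conversations
```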