Chatbot Arena (LMSYS)
Chatbot Arena is a crowdsourced evaluation platform where users engage in blind, head-to-head comparisons of AI chatbots. Models are ranked using an Elo rating system based on hundreds of thousands of human preference votes.
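Leaderboard ratings are derived from pairwise preference votes. LMSYS has also described fitting a Bradley–Terry model over the full vote history; the snippet below is only a minimal sketch of a classic online Elo update applied to a single vote, with the K-factor, starting rating, and tie handling as illustrative assumptions rather than the platform's exact parameters.

```python
# Minimal sketch of an online Elo update from one pairwise preference vote.
# K-factor, starting rating, and tie handling are illustrative assumptions,
# not the exact settings used by Chatbot Arena.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A is preferred over model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a: float, rating_b: float, outcome_a: float, k: float = 32.0):
    """outcome_a is 1.0 if A wins the vote, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (outcome_a - e_a)
    new_b = rating_b + k * ((1.0 - outcome_a) - (1.0 - e_a))
    return new_a, new_b

# Example: two models start at 1000; a single vote prefers model A.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
ratings["model_a"], ratings["model_b"] = elo_update(
    ratings["model_a"], ratings["model_b"], outcome_a=1.0
)
print(ratings)  # model_a moves above 1000, model_b symmetrically below
```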
Metrics
Elo rating from human preference votes
Created By
LMSYS Org
Top Model Scores
| Rank | Model | Elo Rating | Date (YYYY-MM) |
|---|---|---|---|
| 1 | GPT-5.2 | 1387 | 2026-03 |
| 2 | Claude Opus 4.6 | 1379 | 2026-02 |
| 3 | Gemini 3 Ultra | 1365 | 2026-01 |
| 4 | Grok 4 | 1348 | 2026-02 |
| 5 | DeepSeek V3 | 1331 | 2026-01 |
Related General Benchmarks
LiveBench
LiveBench is a continuously updated benchmark designed to minimize contamination by using new questions monthly. It covers math, coding, reasoning, language, instruction following, and data analysis with objective, verifiable answers.
Top: GPT-5.2 — 82.6%
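Because LiveBench answers are objective and verifiable, scoring reduces to programmatic checks rather than LLM judging. The helper below is a hypothetical illustration of that idea; the function names and normalization rules are assumptions, not LiveBench's actual task-specific graders.

```python
# Hypothetical illustration of objective, verifiable scoring: normalize both the
# model output and the ground-truth answer, then compare exactly.
import re

def normalize(text: str) -> str:
    """Lowercase, strip surrounding whitespace, and collapse internal whitespace."""
    return re.sub(r"\s+", " ", text.strip().lower())

def exact_match(model_output: str, ground_truth: str) -> bool:
    return normalize(model_output) == normalize(ground_truth)

# A benchmark-level score is then just the mean over questions.
predictions = ["42", " Forty-two "]
answers = ["42", "forty-two"]
accuracy = sum(exact_match(p, a) for p, a in zip(predictions, answers)) / len(answers)
print(f"{accuracy:.1%}")  # 100.0%
```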
Arena-Hard-Auto
Arena-Hard-Auto is an automated benchmark that correlates highly with Chatbot Arena rankings. It uses 500 challenging user queries and automated judge evaluation to approximate human preferences at a fraction of the cost.
Top: GPT-5.2 — 92.1%
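The loop below is a rough sketch of that kind of automated pairwise judging: for each query, a judge model compares the candidate's answer against a fixed baseline and the win rate is aggregated. `call_judge` is a placeholder for a real LLM API call, and the prompt wording and parsing are assumptions rather than Arena-Hard-Auto's published pipeline.

```python
# Rough sketch of automated pairwise judging against a baseline model.
# `call_judge` is a placeholder for an LLM API call; prompt and parsing are assumptions.
from typing import Callable

JUDGE_PROMPT = (
    "You are comparing two answers to the same user query.\n"
    "Query: {query}\n\nAnswer A:\n{answer_a}\n\nAnswer B:\n{answer_b}\n\n"
    "Reply with exactly one token: A, B, or TIE."
)

def pairwise_win_rate(
    queries: list[str],
    candidate_answers: list[str],
    baseline_answers: list[str],
    call_judge: Callable[[str], str],
) -> float:
    """Fraction of queries where the judge prefers the candidate (ties count as 0.5)."""
    score = 0.0
    for query, cand, base in zip(queries, candidate_answers, baseline_answers):
        verdict = call_judge(
            JUDGE_PROMPT.format(query=query, answer_a=cand, answer_b=base)
        ).strip().upper()
        if verdict == "A":
            score += 1.0
        elif verdict == "TIE":
            score += 0.5
    return score / len(queries)
```

In practice, judges are typically run a second time with the answer order swapped to reduce position bias before the two verdicts are combined.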
TAU-bench
TAU-bench evaluates AI agents on real-world tasks requiring tool use and multi-step reasoning in retail and airline customer service domains. It measures end-to-end task completion with realistic tool APIs.
Top: Claude Opus 4.6 — 68.4%
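The sketch below shows the kind of agent loop such a benchmark exercises: at each step the model either calls a tool or replies to the user, and success is judged on the end-to-end outcome. The tool registry, the `call_model` placeholder, and the message format are assumptions for illustration, not TAU-bench's actual interface.

```python
# Simplified sketch of a tool-using agent episode. The model returns either a tool
# call (e.g. {"tool": "lookup_order", "args": {...}}) or a final user-facing reply.
from typing import Any, Callable

def run_episode(
    call_model: Callable[[list[dict]], dict],
    tools: dict[str, Callable[..., Any]],
    user_message: str,
    max_steps: int = 10,
) -> list[dict]:
    """Run one tool-using dialogue episode and return the transcript."""
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):
        action = call_model(messages)
        if "tool" in action:
            result = tools[action["tool"]](**action.get("args", {}))
            messages.append({"role": "tool", "name": action["tool"], "content": str(result)})
        else:
            messages.append({"role": "assistant", "content": action["content"]})
            break  # agent replied to the user; task completion is checked afterwards
    return messages
```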
MT-Bench
MT-Bench (Multi-Turn Bench) evaluates chatbot capabilities through 80 carefully designed multi-turn conversations across 8 categories: writing, roleplay, extraction, reasoning, math, coding, knowledge, and STEM. An LLM judge (GPT-4 class) scores responses on a 1-10 scale. It specifically tests how well models handle follow-up questions, maintain context, and engage in extended dialogue rather than single-turn responses.
Top: GPT-5.2 — 9.6
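The snippet below sketches multi-turn judging in this style: the judge scores each turn on a 1-10 scale with the conversation so far as context, and the scores are averaged. `call_judge` is a placeholder and the prompt wording is an assumption, not MT-Bench's exact judge template.

```python
# Sketch of MT-Bench-style multi-turn judging: score each turn 1-10 in context,
# then average. `call_judge` is a placeholder for a real judge-model call.
from statistics import mean
from typing import Callable

def score_conversation(
    turns: list[tuple[str, str]],          # (user question, model answer) per turn
    call_judge: Callable[[str], float],    # returns a 1-10 score parsed from the judge
) -> float:
    history = ""
    scores = []
    for question, answer in turns:
        history += f"User: {question}\nAssistant: {answer}\n"
        prompt = (
            "Rate the assistant's latest answer on a 1-10 scale, considering the "
            f"whole conversation so far:\n{history}\nScore:"
        )
        scores.append(call_judge(prompt))
    return mean(scores)  # overall score is the mean across turns and conversations
```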