Language · Est. 2023

MT-Bench

MT-Bench evaluates multi-turn conversational ability using 80 high-quality multi-turn questions across 8 categories: writing, roleplay, extraction, reasoning, math, coding, STEM knowledge, and humanities and social science knowledge. GPT-4 judges each response on a 1-10 scale.
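
To turn a judge's free-text verdict into a number, the rating has to be extracted from the reply. The sketch below assumes the judge is prompted to end its verdict with a double-bracketed rating such as "Rating: [[8]]"; the function name `parse_rating` and the choice to reject out-of-range values are illustrative assumptions, not the benchmark's official code.

```python
import re
from typing import Optional

# Matches a bracketed rating such as "[[8]]" or "[[7.5]]".
# The double-bracket convention is an assumption about the judge prompt.
RATING_PATTERN = re.compile(r"\[\[(\d+(?:\.\d+)?)\]\]")


def parse_rating(judgment: str) -> Optional[float]:
    """Return the first bracketed rating in the judge's text, or None."""
    match = RATING_PATTERN.search(judgment)
    if match is None:
        return None
    rating = float(match.group(1))
    # Discard values that fall outside the 1-10 scale.
    if not 1.0 <= rating <= 10.0:
        return None
    return rating
```

Returning `None` rather than raising lets a caller count unparseable verdicts separately instead of silently scoring them.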

Metrics

Average score (1-10) across 80 multi-turn questions
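
As a rough illustration of this metric, the sketch below averages judge scores over every turn of every question. It assumes each question contributes one score per turn and that the overall score is the plain mean of all of them; the function name `mt_bench_score` and the question IDs are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def mt_bench_score(turn_scores: Dict[str, List[float]]) -> float:
    """Mean of the judge scores over every turn of every question."""
    all_scores = [s for scores in turn_scores.values() for s in scores]
    if not all_scores:
        raise ValueError("no scores to average")
    return mean(all_scores)


# Hypothetical scores for two questions, two turns each.
example = {"writing_01": [9.0, 8.0], "math_01": [6.0, 7.0]}
overall = mt_bench_score(example)
```

With the full benchmark, `turn_scores` would hold 80 entries, one per question.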

Created By

LMSYS Org

Top Model Scores

Rank	Model	Score	Date
1	GPT-5.2	9.72	2026-03
2	Claude Opus 4.6	9.68	2026-02
3	Gemini 3 Ultra	9.55	2026-01
4	Grok 4	9.41	2026-02
5	Llama 4 405B	9.18	2026-01

Related Language Benchmarks

MMLU

Massive Multitask Language Understanding measures knowledge across 57 academic subjects including STEM, humanities, social sciences, and more. It tests both world knowledge and problem-solving ability at varying difficulty levels from elementary to professional.

Top: GPT-5.2 — 92.4%

AlpacaEval 2.0

WildBench

IFEval