Language

Est. 2023

MT-Bench

MT-Bench evaluates multi-turn conversation ability using 80 high-quality questions across 8 categories: writing, roleplay, extraction, reasoning, math, coding, knowledge I (STEM), and knowledge II (humanities/social science). Each question has two turns, and GPT-4 judges every response on a 1-10 scale.

Metrics

Average score (1-10) across 80 multi-turn questions
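
As a rough illustration of the metric, the sketch below averages judge scores over every judged turn, overall and per category. The record layout (`question_id`, `category`, `turn`, `score`) and both function names are hypothetical, and weighting all turns equally is an assumption rather than a documented rule of this leaderboard.

```python
from collections import defaultdict


def average_score(judgments):
    """Mean judge score over all judged turns (assumes equal weighting)."""
    scores = [j["score"] for j in judgments]
    if not scores:
        raise ValueError("no judgments to average")
    if any(not 1 <= s <= 10 for s in scores):
        raise ValueError("scores must lie on the 1-10 scale")
    return sum(scores) / len(scores)


def average_by_category(judgments):
    """Group judgments by category and average each group."""
    buckets = defaultdict(list)
    for j in judgments:
        buckets[j["category"]].append(j)
    return {cat: average_score(items) for cat, items in buckets.items()}


# Hypothetical sample: two questions, two turns each.
sample = [
    {"question_id": 1, "category": "writing", "turn": 1, "score": 9},
    {"question_id": 1, "category": "writing", "turn": 2, "score": 7},
    {"question_id": 2, "category": "math", "turn": 1, "score": 6},
    {"question_id": 2, "category": "math", "turn": 2, "score": 4},
]

print(average_score(sample))
print(average_by_category(sample))
```

A full run would hold 160 such records (80 questions, two turns each). The per-category breakdown is useful because a single headline average can hide weak categories such as math or reasoning.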

Created By

LMSYS Org

Top Model Scores

Rank  Model            Score  Date
1     GPT-5.2          9.72   2026-03
2     Claude Opus 4.6  9.68   2026-02
3     Gemini 3 Ultra   9.55   2026-01
4     Grok 4           9.41   2026-02
5     Llama 4 405B     9.18   2026-01