LiveBench
LiveBench is a continuously updated benchmark designed to minimize test-set contamination by releasing fresh questions every month. It covers math, coding, reasoning, language, instruction following, and data analysis, all with objective, verifiable answers.
Metrics
Average accuracy (%) across 6 categories
Created By
Abacus.AI
Top Model Scores
| Rank | Model | Score | Date |
|---|---|---|---|
| 1 | GPT-5.2 | 82.6% | 2026-03 |
| 2 | Claude Opus 4.6 | 81.3% | 2026-02 |
| 3 | Gemini 3 Ultra | 79.8% | 2026-01 |
| 4 | Grok 4 | 77.4% | 2026-02 |
| 5 | Llama 4 405B | 73.9% | 2026-01 |
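Per the metric definition above, the leaderboard score is the unweighted mean of accuracy across the six categories. A minimal sketch of that aggregation, using hypothetical per-category numbers (none of these values come from the leaderboard above):

```python
# Hypothetical per-category accuracies (%) for one model; the real
# leaderboard values are produced by LiveBench's own evaluation harness.
category_accuracy = {
    "math": 78.0,
    "coding": 74.5,
    "reasoning": 81.0,
    "language": 69.5,
    "instruction_following": 88.0,
    "data_analysis": 72.0,
}

# Overall LiveBench score: unweighted mean across the six categories.
overall = sum(category_accuracy.values()) / len(category_accuracy)
print(f"Overall LiveBench score: {overall:.1f}%")  # 77.2%
```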
Related General Benchmarks
Chatbot Arena (LMSYS)
Chatbot Arena is a crowdsourced evaluation platform where users engage in blind, head-to-head comparisons of AI chatbots. Models are ranked using an Elo rating system based on hundreds of thousands of human preference votes.
Top: GPT-5.2 — 1387
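As a rough illustration of how ratings like the one above emerge from blind pairwise votes, here is a textbook Elo update rule. The K-factor, starting rating, and vote encoding are assumptions for the sketch, not Chatbot Arena's actual rating pipeline.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, outcome: float, k: float = 32.0):
    """Update both ratings after one blind battle.

    outcome: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    k is an assumed K-factor, not Chatbot Arena's setting.
    """
    e_a = expected_score(r_a, r_b)
    r_a_new = r_a + k * (outcome - e_a)
    r_b_new = r_b + k * ((1.0 - outcome) - (1.0 - e_a))
    return r_a_new, r_b_new

# Hypothetical battle: both models start at 1000 and model A wins the vote.
ra, rb = elo_update(1000.0, 1000.0, outcome=1.0)
print(ra, rb)  # 1016.0 984.0
```

Repeating this update over many votes is what separates the models into the ladder shown above; ties simply move both ratings toward each other.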
Arena-Hard-Auto
Arena-Hard-Auto is an automated benchmark that correlates highly with Chatbot Arena rankings. It uses 500 challenging user queries and automated judge evaluation to approximate human preferences at a fraction of the cost.
Top: GPT-5.2 — 92.1%
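In the same spirit, an automated-judge benchmark of this kind can be summarized as a win rate against a fixed baseline model's answers. The verdict labels and tie handling below are illustrative assumptions, not Arena-Hard-Auto's exact scoring pipeline.

```python
# Hypothetical judge verdicts for 500 prompts, each comparing the candidate
# model's answer against a fixed baseline answer. Labels are assumed:
# "win", "tie", "loss" from the candidate's perspective.
verdicts = ["win"] * 430 + ["tie"] * 40 + ["loss"] * 30

# Count a tie as half a win, a common convention for pairwise win rates.
score = sum(1.0 if v == "win" else 0.5 if v == "tie" else 0.0 for v in verdicts)
win_rate = 100.0 * score / len(verdicts)
print(f"Win rate vs. baseline: {win_rate:.1f}%")  # 90.0%
```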
TAU-bench
TAU-bench evaluates AI agents on real-world tasks requiring tool use and multi-step reasoning in retail and airline customer service domains. It measures end-to-end task completion with realistic tool APIs.
Top: Claude Opus 4.6 — 68.4%
MT-Bench
MT-Bench (Multi-Turn Bench) evaluates chatbot capabilities through 80 carefully designed multi-turn conversations across 8 categories: writing, roleplay, extraction, reasoning, math, coding, knowledge, and STEM. An LLM judge (GPT-4 class) scores responses on a 1-10 scale. It specifically tests how well models handle follow-up questions, maintain context, and engage in extended dialogue rather than single-turn responses.
Top: GPT-5.2 — 9.6
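For reference, the MT-Bench number above is an average of per-turn judge scores on the 1-10 scale. The sketch below uses made-up scores for two conversations to show that aggregation; it is not real judge output, and the exact reporting (per-turn vs. overall averages) varies by leaderboard.

```python
# Hypothetical judge scores (1-10), one per turn of each two-turn conversation.
judge_scores = [
    (9, 8),   # conversation 1: turn 1, turn 2
    (10, 9),  # conversation 2: turn 1, turn 2
]

# Report the mean judge score across all turns of all conversations.
all_turns = [s for convo in judge_scores for s in convo]
mt_bench_score = sum(all_turns) / len(all_turns)
print(f"MT-Bench score: {mt_bench_score:.2f}")  # 9.00
```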