TAU-bench
TAU-bench (τ-bench) evaluates AI agents on realistic customer-service tasks in retail and airline domains that require tool use, multi-step reasoning, and adherence to domain policies. It measures end-to-end task completion against realistic tool APIs and a simulated user.
Top Model Scores
| Rank | Model | Score | Date |
|---|---|---|---|
| 1 | Claude Opus 4.6 | 68.4% | 2026-02 |
| 2 | GPT-5.2 | 65.7% | 2026-03 |
| 3 | Gemini 3 Ultra | 61.3% | 2026-01 |
| 4 | Grok 4 | 57.8% | 2026-02 |
| 5 | DeepSeek V3 | 53.2% | 2026-01 |
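The agent loop such a benchmark exercises can be sketched in a few lines. This is an illustrative skeleton only: the tool name, message schema, and `call_model` function are stand-ins, not TAU-bench's actual harness API.

```python
def run_episode(call_model, tools, user_message, max_steps=10):
    """Alternate model calls and tool executions until the agent replies to the user."""
    history = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):
        action = call_model(history)  # dict: either a tool call or a final reply
        if action["type"] == "tool_call":
            # Execute the requested tool (e.g. a hypothetical get_order lookup)
            result = tools[action["name"]](**action["args"])
            history.append({"role": "tool", "name": action["name"], "content": result})
        else:
            # A user-facing reply ends the episode
            history.append({"role": "assistant", "content": action["content"]})
            break
    return history
```

End-to-end scoring then checks the final database/state and reply against the task's goal, rather than grading any single step.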
Related General Benchmarks
Chatbot Arena (LMSYS)
Chatbot Arena is a crowdsourced evaluation platform where users engage in blind, head-to-head comparisons of AI chatbots. Models are ranked using an Elo rating system based on hundreds of thousands of human preference votes.
Top: GPT-5.2 — 1387
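The Elo mechanics behind such rankings can be sketched as a pairwise update. This is the generic Elo formula (the K-factor and 400-point scale are conventional choices); Chatbot Arena itself fits ratings statistically over the full vote set rather than applying sequential updates, so treat this as an illustration of the model, not the platform's implementation.

```python
def elo_update(r_a, r_b, winner, k=32):
    """One Elo update for a head-to-head comparison between models A and B."""
    # Expected score for A: probability A wins under the Elo model
    e_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    # Actual score: 1 for a win, 0 for a loss, 0.5 for a tie
    s_a = 1.0 if winner == "a" else 0.0 if winner == "b" else 0.5
    return r_a + k * (s_a - e_a), r_b + k * ((1 - s_a) - (1 - e_a))
```

Aggregated over hundreds of thousands of votes, ratings converge so that a gap of 400 points corresponds to roughly 10:1 expected odds.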
LiveBench
LiveBench is a continuously updated benchmark designed to minimize test-set contamination by releasing new questions every month. It covers math, coding, reasoning, language, instruction following, and data analysis, with objective, verifiable answers.
Top: GPT-5.2 — 82.6%
Arena-Hard-Auto
Arena-Hard-Auto is an automated benchmark that correlates highly with Chatbot Arena rankings. It uses 500 challenging user queries and automated judge evaluation to approximate human preferences at a fraction of the cost.
Top: GPT-5.2 — 92.1%
MT-Bench
MT-Bench (Multi-turn Benchmark) evaluates chatbot capabilities through 80 carefully designed two-turn questions across 8 categories: writing, roleplay, extraction, reasoning, math, coding, knowledge (STEM), and knowledge (humanities/social science). An LLM judge (GPT-4 class) scores responses on a 1-10 scale. It specifically tests how well models handle follow-up questions, maintain context, and engage in extended dialogue rather than single-turn responses.
Top: GPT-5.2 — 9.6
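Assuming the headline number is the mean of per-turn judge ratings across all conversations (the aggregation shown here is an illustrative convention, not a claim about the official harness), the scoring reduces to:

```python
def mt_bench_score(judge_scores):
    """Average per-turn LLM-judge ratings (each in [1, 10]) into one headline score.

    judge_scores: list of (turn1_rating, turn2_rating) pairs, one per conversation.
    """
    flat = [rating for pair in judge_scores for rating in pair]
    return sum(flat) / len(flat)
```

Per-category means can be computed the same way to see, for example, whether a model's math turns drag down an otherwise strong dialogue score.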