TAU-bench
TAU-bench (τ-bench) evaluates AI agents on realistic customer-service tasks in retail and airline domains that require tool use, multi-step reasoning, and adherence to domain policies. It measures end-to-end task completion against realistic tool APIs and a simulated user.
Top Model Scores
| Rank | Model | Score | Date |
|---|---|---|---|
| 1 | Claude Opus 4.6 | 68.4% | 2026-02 |
| 2 | GPT-5.2 | 65.7% | 2026-03 |
| 3 | Gemini 3 Ultra | 61.3% | 2026-01 |
| 4 | Grok 4 | 57.8% | 2026-02 |
| 5 | DeepSeek V3 | 53.2% | 2026-01 |
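The agent loop such a benchmark exercises can be sketched in a few lines. This is an illustrative skeleton only: the tool name, message schema, and `call_model` function are stand-ins, not TAU-bench's actual harness API.

```python
def run_episode(call_model, tools, user_message, max_steps=10):
    """Alternate model calls and tool executions until the agent replies to the user."""
    history = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):
        action = call_model(history)  # dict: either a tool call or a final reply
        if action["type"] == "tool_call":
            # Execute the requested tool (e.g. a hypothetical get_order lookup)
            result = tools[action["name"]](**action["args"])
            history.append({"role": "tool", "name": action["name"], "content": result})
        else:
            # A user-facing reply ends the episode
            history.append({"role": "assistant", "content": action["content"]})
            break
    return history
```

End-to-end scoring then checks the final database/state and reply against the task's goal, rather than grading any single step.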
Related General Benchmarks
Chatbot Arena (LMSYS)
Chatbot Arena is a crowdsourced evaluation platform where users engage in blind, head-to-head comparisons of AI chatbots. Models are ranked using an Elo rating system based on hundreds of thousands of human preference votes.
Top: GPT-5.2 — 1387
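The Elo mechanics behind such rankings can be sketched as a pairwise update. This is the generic Elo formula (the K-factor and 400-point scale are conventional choices); Chatbot Arena itself fits ratings statistically over the full vote set rather than applying sequential updates, so treat this as an illustration of the model, not the platform's implementation.

```python
def elo_update(r_a, r_b, winner, k=32):
    """One Elo update for a head-to-head comparison between models A and B."""
    # Expected score for A: probability A wins under the Elo model
    e_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    # Actual score: 1 for a win, 0 for a loss, 0.5 for a tie
    s_a = 1.0 if winner == "a" else 0.0 if winner == "b" else 0.5
    return r_a + k * (s_a - e_a), r_b + k * ((1 - s_a) - (1 - e_a))
```

Aggregated over hundreds of thousands of votes, ratings converge so that a gap of 400 points corresponds to roughly 10:1 expected odds.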
LiveBench
LiveBench is a continuously updated benchmark designed to minimize test-set contamination by releasing new questions every month. It covers math, coding, reasoning, language, instruction following, and data analysis, with objective, verifiable answers.
Top: GPT-5.2 — 82.6%
Arena-Hard-Auto
Arena-Hard-Auto is an automated benchmark that correlates highly with Chatbot Arena rankings. It uses 500 challenging user queries and automated judge evaluation to approximate human preferences at a fraction of the cost.
Top: GPT-5.2 — 92.1%
MT-Bench
MT-Bench (Multi-turn Benchmark) evaluates chatbot capabilities through 80 carefully designed two-turn questions across 8 categories: writing, roleplay, extraction, reasoning, math, coding, knowledge (STEM), and knowledge (humanities/social science). An LLM judge (GPT-4 class) scores responses on a 1-10 scale. It specifically tests how well models handle follow-up questions, maintain context, and engage in extended dialogue rather than single-turn responses.
Top: GPT-5.2 — 9.6
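Assuming the headline number is the mean of per-turn judge ratings across all conversations (the aggregation shown here is an illustrative convention, not a claim about the official harness), the scoring reduces to:

```python
def mt_bench_score(judge_scores):
    """Average per-turn LLM-judge ratings (each in [1, 10]) into one headline score.

    judge_scores: list of (turn1_rating, turn2_rating) pairs, one per conversation.
    """
    flat = [rating for pair in judge_scores for rating in pair]
    return sum(flat) / len(flat)
```

Per-category means can be computed the same way to see, for example, whether a model's math turns drag down an otherwise strong dialogue score.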