General

Est. 2024

TAU-bench

TAU-bench evaluates AI agents on real-world tasks requiring tool use and multi-step reasoning in retail and airline customer service domains. It measures end-to-end task completion with realistic tool APIs.

Metrics

Task completion rate (%)
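
As a rough illustration of the metric, the completion rate can be read as the share of tasks the agent finishes successfully, expressed as a percentage. The sketch below assumes each task yields a single pass/fail outcome. The function name `task_completion_rate` and the sample outcomes are hypothetical and are not taken from the benchmark's own tooling or results.

```python
def task_completion_rate(outcomes):
    """Return the percentage of tasks completed, given pass/fail booleans.

    Assumption: each task is scored as a single True (completed) or
    False (not completed). The benchmark's real scoring may differ.
    """
    if not outcomes:
        raise ValueError("no task outcomes provided")
    completed = sum(1 for ok in outcomes if ok)
    return 100.0 * completed / len(outcomes)


# Hypothetical outcomes for five tasks (illustrative only, not benchmark data).
sample_outcomes = [True, False, True, True, False]
print(f"{task_completion_rate(sample_outcomes):.1f}%")  # prints "60.0%"
```

Under this reading, a score of 60.0% would mean the agent completed three of every five tasks end to end.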

Created By

Sierra Research

Top Model Scores

Rank  Model            Score  Date
1     Claude Opus 4.6  68.4%  2026-02
2     GPT-5.2          65.7%  2026-03
3     Gemini 3 Ultra   61.3%  2026-01
4     Grok 4           57.8%  2026-02
5     DeepSeek V3      53.2%  2026-01