BFCL (Berkeley Function Calling Leaderboard)
Berkeley Function Calling Leaderboard evaluates the ability of models to accurately generate function/tool calls with correct parameters. It tests API call generation, parameter extraction, and multi-tool orchestration scenarios.
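To make the evaluation concrete, here is a minimal sketch of what "generating a function call with correct parameters" involves. The tool name, schema shape, and checker below are illustrative assumptions, not BFCL's actual format or grading code:

```python
import json

# Hypothetical tool schema, in the style commonly used for function calling.
WEATHER_TOOL = {
    "name": "get_weather",
    "parameters": {
        "location": {"type": "string", "required": True},
        "unit": {"type": "string", "required": False},
    },
}

def call_matches_schema(call_json: str, tool: dict) -> bool:
    """Check a model-emitted call: correct function name, all required
    parameters present, and no parameters outside the schema."""
    call = json.loads(call_json)
    if call.get("name") != tool["name"]:
        return False
    params = tool["parameters"]
    args = call.get("arguments", {})
    required = {k for k, v in params.items() if v["required"]}
    return required <= set(args) and set(args) <= set(params)

model_output = '{"name": "get_weather", "arguments": {"location": "Berkeley"}}'
print(call_matches_schema(model_output, WEATHER_TOOL))  # True
```

A benchmark like BFCL scores many such cases across single-call, parallel-call, and multi-tool scenarios; this sketch shows only the per-call correctness idea.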
Metrics
Overall accuracy (%) on function-calling tasks
Created By
UC Berkeley
Top Model Scores
| Rank | Model | Score | Date |
|---|---|---|---|
| 1 | Claude Opus 4.6 | 93.7% | 2026-02 |
| 2 | GPT-5.2 | 92.4% | 2026-03 |
| 3 | Gemini 3 Ultra | 90.8% | 2026-01 |
| 4 | Grok 4 | 88.3% | 2026-02 |
| 5 | DeepSeek V3 | 86.1% | 2026-01 |
Related Code Benchmarks
HumanEval
HumanEval evaluates the functional correctness of code generated by language models. It consists of 164 hand-written programming problems with function signatures, docstrings, and unit tests, measuring pass@1 and pass@k rates.
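The pass@k metric mentioned above is usually computed with the unbiased estimator introduced in the HumanEval paper: given n generated samples of which c pass all unit tests, pass@k = 1 − C(n−c, k)/C(n, k). A minimal implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., HumanEval):
    n = total samples generated, c = samples passing all unit tests."""
    if n - c < k:
        # Fewer failures than k: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 200 samples and 50 passing, pass@1 is simply 50/200.
print(round(pass_at_k(200, 50, 1), 3))  # 0.25
```

Averaging this quantity over all 164 problems gives the reported pass@k score.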
Top: Claude Opus 4.6 — 96.3%
SWE-bench Verified
SWE-bench Verified evaluates AI systems on real-world software engineering tasks drawn from GitHub issues in popular Python repositories. Models must understand codebases, diagnose issues, and generate correct patches.
Top: Claude Opus 4.6 + Agentless — 62.4%
MBPP (Mostly Basic Python Problems)
MBPP consists of around 1,000 crowd-sourced Python programming problems designed to be solvable by entry-level programmers. Each problem includes a task description, code solution, and three automated test cases.
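The three-part structure described above (task description, solution, three tests) can be sketched as follows. This is a hypothetical problem in the MBPP style, not an actual item from the dataset:

```python
# Task: "Write a function to return the sum of squares of a list of numbers."

def sum_of_squares(nums):
    """Reference solution: square each element and sum the results."""
    return sum(x * x for x in nums)

# Three automated test cases, as in MBPP's format.
assert sum_of_squares([1, 2, 3]) == 14
assert sum_of_squares([]) == 0
assert sum_of_squares([-2, 2]) == 8
```

A model is judged correct on a problem when its generated function passes all of the problem's test cases.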
Top: Claude Opus 4.6 — 93.8%
CodeContests
CodeContests is a competitive programming benchmark drawn from Codeforces, CodeChef, and other platforms. It tests algorithmic problem-solving with problems requiring complex data structures, dynamic programming, and mathematical reasoning.
Top: GPT-5.2 — 43.2%