BFCL (Berkeley Function Calling Leaderboard)
Berkeley Function Calling Leaderboard evaluates the ability of models to accurately generate function/tool calls with correct parameters. It tests API call generation, parameter extraction, and multi-tool orchestration scenarios.
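To make the evaluation concrete, here is a minimal sketch of what "generating a function call with correct parameters" involves. The tool name, schema shape, and checker below are illustrative assumptions, not BFCL's actual format or grading code:

```python
import json

# Hypothetical tool schema, in the style commonly used for function calling.
WEATHER_TOOL = {
    "name": "get_weather",
    "parameters": {
        "location": {"type": "string", "required": True},
        "unit": {"type": "string", "required": False},
    },
}

def call_matches_schema(call_json: str, tool: dict) -> bool:
    """Check a model-emitted call: correct function name, all required
    parameters present, and no parameters outside the schema."""
    call = json.loads(call_json)
    if call.get("name") != tool["name"]:
        return False
    params = tool["parameters"]
    args = call.get("arguments", {})
    required = {k for k, v in params.items() if v["required"]}
    return required <= set(args) and set(args) <= set(params)

model_output = '{"name": "get_weather", "arguments": {"location": "Berkeley"}}'
print(call_matches_schema(model_output, WEATHER_TOOL))  # True
```

A benchmark like BFCL scores many such cases across single-call, parallel-call, and multi-tool scenarios; this sketch shows only the per-call correctness idea.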
Metrics
Overall accuracy (%) on function-calling tasks
Created By
UC Berkeley
Top Model Scores
| Rank | Model | Score | Date |
|---|---|---|---|
| 1 | Claude Opus 4.6 | 93.7% | 2026-02 |
| 2 | GPT-5.2 | 92.4% | 2026-03 |
| 3 | Gemini 3 Ultra | 90.8% | 2026-01 |
| 4 | Grok 4 | 88.3% | 2026-02 |
| 5 | DeepSeek V3 | 86.1% | 2026-01 |
Related Code Benchmarks
HumanEval
HumanEval evaluates the functional correctness of code generated by language models. It consists of 164 hand-written programming problems with function signatures, docstrings, and unit tests, measuring pass@1 and pass@k rates.
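The pass@k metric mentioned above is usually computed with the unbiased estimator introduced in the HumanEval paper: given n generated samples of which c pass all unit tests, pass@k = 1 − C(n−c, k)/C(n, k). A minimal implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., HumanEval):
    n = total samples generated, c = samples passing all unit tests."""
    if n - c < k:
        # Fewer failures than k: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 200 samples and 50 passing, pass@1 is simply 50/200.
print(round(pass_at_k(200, 50, 1), 3))  # 0.25
```

Averaging this quantity over all 164 problems gives the reported pass@k score.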
Top: Claude Opus 4.6 — 96.3%
SWE-bench Verified
SWE-bench Verified evaluates AI systems on real-world software engineering tasks drawn from GitHub issues in popular Python repositories. Models must understand codebases, diagnose issues, and generate correct patches.
Top: Claude Opus 4.6 + Agentless — 62.4%
MBPP (Mostly Basic Python Problems)
MBPP consists of around 1,000 crowd-sourced Python programming problems designed to be solvable by entry-level programmers. Each problem includes a task description, code solution, and three automated test cases.
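The three-part structure described above (task description, solution, three tests) can be sketched as follows. This is a hypothetical problem in the MBPP style, not an actual item from the dataset:

```python
# Task: "Write a function to return the sum of squares of a list of numbers."

def sum_of_squares(nums):
    """Reference solution: square each element and sum the results."""
    return sum(x * x for x in nums)

# Three automated test cases, as in MBPP's format.
assert sum_of_squares([1, 2, 3]) == 14
assert sum_of_squares([]) == 0
assert sum_of_squares([-2, 2]) == 8
```

A model is judged correct on a problem when its generated function passes all of the problem's test cases.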
Top: Claude Opus 4.6 — 93.8%
CodeContests
CodeContests is a competitive programming benchmark drawn from Codeforces, CodeChef, and other platforms. It tests algorithmic problem-solving with problems requiring complex data structures, dynamic programming, and mathematical reasoning.
Top: GPT-5.2 — 43.2%