Language · Est. 2022

MultiMedQA

MultiMedQA combines several medical question-answering benchmarks, including MedQA (USMLE-style), MedMCQA, PubMedQA, and the clinical topics of MMLU, together with consumer health question sets. It evaluates medical knowledge and clinical reasoning.

Metrics

Accuracy (%) on medical QA tasks
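
As a rough illustration of the metric, accuracy on a multiple-choice QA task is the percentage of questions where the model's chosen option matches the answer key. The sketch below assumes answers are single option letters; the function name `accuracy_percent` and the toy data are hypothetical and not taken from the benchmark's own evaluation code.

```python
def accuracy_percent(predictions, answers):
    """Return the percentage of predictions that match the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot compute accuracy on an empty question set")
    # Compare option letters case-insensitively, ignoring stray whitespace.
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return 100.0 * correct / len(answers)


# Hypothetical toy data: four questions, three answered correctly.
gold = ["A", "C", "B", "D"]
pred = ["A", "c", "D", "D"]
print(f"{accuracy_percent(pred, gold):.1f}%")
```

Real evaluation harnesses usually also need a step that extracts the option letter from free-form model output; that step is omitted here.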

Created By

Google Research / DeepMind

Top Model Scores

Rank	Model	Score	Date
1	GPT-5.2	93.7%	2026-03
2	Med-Gemini 3	93.1%	2026-01
3	Claude Opus 4.6	91.8%	2026-02
4	Grok 4	88.4%	2026-02
5	Llama 4 405B	85.6%	2026-01

Related Language Benchmarks

MMLU

Massive Multitask Language Understanding measures knowledge across 57 academic subjects including STEM, humanities, social sciences, and more. It tests both world knowledge and problem-solving ability at varying difficulty levels from elementary to professional.

Top: GPT-5.2 — 92.4%

MT-Bench

MT-Bench evaluates multi-turn conversation ability using 80 high-quality multi-turn questions across 8 categories: writing, roleplay, extraction, reasoning, math, coding, STEM knowledge, and humanities/social science knowledge. Responses are judged by GPT-4 on a 1-10 scale.

Top: GPT-5.2 — 9.72
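
To make the MT-Bench scale concrete, the sketch below aggregates judge scores into per-category averages and one overall figure. It assumes the headline number is a plain mean of the 1-10 judge scores; the function name `summarize_scores` and the sample judgments are hypothetical, and the official tooling may aggregate differently.

```python
from collections import defaultdict
from statistics import mean


def summarize_scores(judgments):
    """Average (category, score) pairs per category and overall."""
    judgments = list(judgments)
    if not judgments:
        raise ValueError("no judgments to summarize")
    by_category = defaultdict(list)
    for category, score in judgments:
        # Judge scores are expected on the 1-10 scale described above.
        if not 1 <= score <= 10:
            raise ValueError(f"score out of range: {score}")
        by_category[category].append(score)
    per_category = {name: mean(scores) for name, scores in by_category.items()}
    overall = mean(score for _, score in judgments)
    return per_category, overall


# Hypothetical judge scores for a handful of responses.
sample = [("math", 6), ("math", 8), ("coding", 9), ("writing", 10)]
per_category, overall = summarize_scores(sample)
print(per_category)
print(overall)
```

Reporting the per-category breakdown next to the overall mean helps show whether a high score comes from broad ability or from a few strong categories.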

AlpacaEval 2.0

AlpacaEval 2.0 is an automatic evaluation benchmark that measures instruction-following ability. It uses a length-controlled win rate against a reference model, reducing length bias that affected the original version.

Top: Claude Opus 4.6 — 72.1%

WildBench

WildBench evaluates AI models on challenging real-world user queries collected from the wild. It focuses on complex, multi-constraint instructions that test practical model capabilities beyond academic benchmarks.

Top: Claude Opus 4.6 — 68.7%
