What Is MMLU (Massive Multitask Language Understanding)?

Definition

MMLU (Massive Multitask Language Understanding) is a widely used AI benchmark that evaluates language models across 57 academic subjects — including STEM, humanities, social sciences, and professional fields — using multiple-choice questions to measure general knowledge and reasoning capability.
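Concretely, each MMLU item pairs one question with four answer options and a single correct choice. The sketch below shows the general shape of such an item; the field names are illustrative, not the exact schema of any particular dataset release:

```python
# An MMLU-style item: one question, four options, one correct answer.
# Field names are illustrative; actual dataset releases vary in schema.
item = {
    "subject": "college_physics",
    "question": "Which quantity is conserved in a perfectly elastic collision?",
    "choices": [
        "Momentum only",
        "Kinetic energy only",
        "Both momentum and kinetic energy",
        "Neither momentum nor kinetic energy",
    ],
    "answer": 2,  # index of the correct choice
}
```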

How MMLU (Massive Multitask Language Understanding) Works

MMLU tests models on multiple-choice questions ranging from elementary mathematics to professional law, clinical medicine, and abstract algebra, providing a broad measure of knowledge and reasoning across diverse domains. Scores are reported as accuracy percentages, typically in a few-shot (often 5-shot) setting; the original paper estimated human expert performance at about 89.8%, and the best AI models now exceed 90%. MMLU has become one of the most cited benchmarks for comparing LLMs, appearing in virtually every major model release announcement. However, critics note that it tests only multiple-choice knowledge recall and may not reflect a model's ability to generate text, reason through open-ended problems, or solve real-world tasks. Variants like MMLU-Pro add harder questions and expand the answer options from four to ten to keep the benchmark discriminative.
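The scoring itself is simple: count how many of a model's chosen options match the answer key, overall and per subject. Here is a minimal sketch in Python, where `pick_answer` is a placeholder standing in for a real model call (all names are illustrative, not part of any official MMLU harness):

```python
from collections import defaultdict

def pick_answer(question: str, choices: list[str]) -> int:
    """Stand-in for a real model call; here it naively picks the first option."""
    return 0

def evaluate(items):
    """Score MMLU-style items: each is (subject, question, choices, correct_index)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, question, choices, answer in items:
        total[subject] += 1
        if pick_answer(question, choices) == answer:
            correct[subject] += 1
    overall = sum(correct.values()) / sum(total.values())
    per_subject = {s: correct[s] / total[s] for s in total}
    return overall, per_subject

items = [
    ("abstract_algebra", "What is the order of the group Z_6?",
     ["2", "3", "6", "12"], 2),
    ("clinical_medicine", "What is a normal resting heart rate for a healthy adult?",
     ["10-20 bpm", "60-100 bpm", "150-200 bpm", "300+ bpm"], 1),
]

overall, per_subject = evaluate(items)
print(f"Overall accuracy: {overall:.1%}")
for subject, acc in sorted(per_subject.items()):
    print(f"  {subject}: {acc:.1%}")
```

Per-subject breakdowns like the one this sketch produces are what model cards report when they show, for example, stronger performance in science than in the humanities.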

Real-World Examples

1. GPT-4 scoring 86.4% on MMLU when first released, demonstrating strong performance across 57 subject areas
2. Claude 3 Opus scoring 86.8% on MMLU, placing it among the top-performing language models at launch
3. A model card listing MMLU scores broken down by subject to show strengths in science (92%) vs. humanities (84%)

MMLU (Massive Multitask Language Understanding) on Vincony

Vincony's Compare Chat lets users run their own evaluations across models, complementing formal benchmarks like MMLU with real-world task performance.
