
What Is the BLEU Score?

Definition

BLEU (Bilingual Evaluation Understudy) is an automatic evaluation metric that assesses the quality of machine-generated text — particularly translations — by measuring the overlap of n-grams (word sequences) between the generated output and one or more reference texts.

How BLEU Score Works

BLEU computes how many words and phrases in the generated text match the reference text, using unigrams (single words) through 4-grams (four-word sequences) with a brevity penalty for too-short outputs. Scores range from 0 to 1 (often expressed as 0-100), where higher scores indicate closer matches to the reference. BLEU was originally designed for machine translation but has been applied to summarization, text generation, and other NLP tasks. While widely used due to its simplicity and reproducibility, BLEU has significant limitations: it cannot assess fluency, meaning preservation, or creative variation, and two equally valid translations might score very differently. Modern evaluation increasingly supplements BLEU with learned metrics and human evaluation.
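The mechanics described above — clipped n-gram precisions combined with a brevity penalty — can be sketched in a few lines of Python. This is a minimal illustration, not a drop-in replacement for standard implementations (which add smoothing and corpus-level aggregation); the function name `bleu` and its signature are our own.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, references, max_n=4):
    """Minimal sentence-level BLEU: geometric mean of clipped n-gram
    precisions (n = 1..max_n) times a brevity penalty. `candidate` is a
    token list; `references` is a list of token lists."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        # Clip each candidate n-gram count by its maximum count in any
        # single reference, so repeated words cannot inflate precision.
        max_ref_counts = Counter()
        for ref in references:
            for gram, count in Counter(ngrams(ref, n)).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(count, max_ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(clipped / total)
    if min(precisions) == 0:
        return 0.0  # geometric mean is zero if any precision is zero
    # Brevity penalty: candidates shorter than the closest reference
    # length are penalized exponentially; longer ones are not.
    ref_len = min((len(r) for r in references),
                  key=lambda rl: (abs(rl - len(candidate)), rl))
    bp = 1.0 if len(candidate) >= ref_len else math.exp(1 - ref_len / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A candidate identical to a reference scores 1.0; one sharing no unigrams with any reference scores 0.0, matching the 0-to-1 range described above.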

Real-World Examples

1. A machine translation system achieves a BLEU score of 45 on the WMT English-to-German benchmark, indicating strong translation quality.

2. A researcher compares two summarization models using BLEU scores against human-written reference summaries.

3. A team notices that their model has a high BLEU score but poor human ratings, highlighting BLEU's limitations for evaluating fluency.
