What Is Perplexity (Metric)?
Perplexity is a language-model evaluation metric that measures how well the model predicts a given sequence of text. Lower perplexity indicates the model assigns higher probability to the actual text, meaning it is less "surprised" and better at modeling the language.
How Perplexity (Metric) Works
Perplexity can be intuitively understood as the effective number of equally likely next-token choices the model considers at each step. A perplexity of 10 means the model is, on average, as uncertain as if it were choosing between 10 equally likely tokens. Lower perplexity means the model makes more confident and accurate predictions. Perplexity is computed as the exponential of the average cross-entropy loss (the average negative log-probability per token) over the test data.

While perplexity is a useful intrinsic measure of language modeling quality, it does not directly measure a model's usefulness for downstream tasks: a model with lower perplexity is not necessarily better at following instructions or reasoning. It is most useful for comparing language models trained on similar data and evaluated with the same tokenizer, since perplexity values are not comparable across different tokenizations.
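The computation above can be sketched in a few lines. This is a minimal illustration, not tied to any particular model library: it assumes you already have the model's log-probability for each actual token in the test sequence, and shows that a model choosing uniformly among 10 tokens gets a perplexity of exactly 10.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the average negative log-probability per token.

    token_logprobs: natural-log probabilities the model assigned to each
    actual token in the evaluated sequence.
    """
    avg_nll = -sum(token_logprobs) / len(token_logprobs)  # average cross-entropy
    return math.exp(avg_nll)

# Sanity check of the "effective choices" intuition: a model that is
# uniformly uncertain over 10 tokens assigns each one probability 0.1,
# and its perplexity comes out to exactly 10.
uniform_logprobs = [math.log(0.1)] * 50
print(perplexity(uniform_logprobs))  # → 10.0 (up to floating-point error)
```

In practice the per-token log-probabilities come from running the model over the test set and reading off the log-probability of each ground-truth token; the formula itself is the same regardless of the model.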
Real-World Examples
The largest GPT-2 model achieving a perplexity of 18.34 on the WikiText-2 test set, indicating strong predictive performance on Wikipedia text
A researcher comparing two model checkpoints and selecting the one with lower perplexity for further fine-tuning
A language model pre-trained on English text showing very high perplexity on Japanese text, indicating poor Japanese modeling