
What Are Evaluation Metrics (AI)?

Definition

Evaluation metrics are quantitative measurements used to assess the performance of AI models on specific tasks, providing objective criteria for model selection, comparison, and improvement across different aspects like accuracy, fluency, relevance, and safety.

How Evaluation Metrics (AI) Work

Choosing the right evaluation metric is as important as choosing the right model. Different tasks require different metrics: classification uses accuracy, precision, recall, and F1; language generation uses perplexity, BLEU, and ROUGE; information retrieval uses mean reciprocal rank and NDCG; and LLMs are increasingly evaluated with human preference scores and task-specific benchmarks. No single metric tells the complete story — a model might score high on accuracy but low on fairness, or high on BLEU but low on fluency. Modern AI evaluation typically combines multiple automatic metrics with human evaluation to get a comprehensive picture of model quality. The choice of metrics also influences model development, as teams optimize for whatever they measure.
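To make the classification metrics above concrete, here is a minimal sketch of how accuracy, precision, recall, and F1 are computed from a confusion matrix; the label lists are illustrative, not taken from any real model.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary task."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy data: 1 = positive class, 0 = negative class.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
print(classification_metrics(y_true, y_pred))
# → {'accuracy': 0.75, 'precision': 0.75, 'recall': 0.75, 'f1': 0.75}
```

Note how precision and recall penalize different mistakes: the one false positive lowers precision, while the one false negative lowers recall, and F1 combines the two as their harmonic mean.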

Real-World Examples

1. A model card reporting accuracy (87%), F1 (0.84), precision (0.89), and recall (0.80) for a text classification model
2. A team evaluating their translation model with both BLEU score (automatic) and human quality ratings (manual)
3. An AI safety team measuring both helpfulness and harmlessness metrics to ensure the model is both useful and safe
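The numbers in the first example can be cross-checked: F1 is the harmonic mean of precision and recall, so the three figures should be mutually consistent.

```python
# Reported values from the model card example above.
precision, recall = 0.89, 0.80

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # → 0.84, matching the reported F1
```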
