What Is Data Labeling?

Definition

Data labeling is the process of assigning meaningful tags, categories, or annotations to raw data — such as identifying objects in images, classifying text sentiment, or transcribing audio — to create labeled datasets used for training supervised machine learning models.

How Data Labeling Works

Supervised learning requires examples paired with correct answers (labels), and data labeling is how those answers are created. Human annotators review data and apply labels according to predefined guidelines: for example, drawing bounding boxes around objects in images, marking named entities in text, or rating the quality of AI outputs for reinforcement learning from human feedback (RLHF). Data labeling is often among the most time-consuming and expensive parts of building AI systems, which is why techniques such as active learning, semi-supervised learning, and synthetic data generation have emerged to reduce labeling needs. Label quality puts a ceiling on the quality of the trained model: 'garbage in, garbage out.'
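One way active learning reduces labeling work is uncertainty sampling: only the items the current model is least sure about are sent to annotators. The following is a minimal sketch, assuming the model already outputs class probabilities for each unlabeled item. The function name `least_confident`, the item IDs, and the probability values are all made up for illustration.

```python
from typing import Dict, List


def least_confident(probs: Dict[str, List[float]], budget: int) -> List[str]:
    """Return the `budget` item IDs whose top predicted probability is lowest.

    `probs` maps an item ID to the model's class probabilities for that item.
    """
    # Confidence is the probability of the most likely class; low means uncertain.
    ranked = sorted(probs, key=lambda item_id: max(probs[item_id]))
    return ranked[:budget]


# Hypothetical model outputs for four unlabeled items (illustrative values only).
unlabeled = {
    "img_001": [0.98, 0.02],
    "img_002": [0.55, 0.45],
    "img_003": [0.80, 0.20],
    "img_004": [0.51, 0.49],
}

# With a budget of two labels, the two most ambiguous items are chosen.
to_label = least_confident(unlabeled, budget=2)
print(to_label)  # ['img_004', 'img_002']
```

Least-confidence ranking is only one of several selection strategies. It is used here because it needs nothing beyond the model's predicted probabilities.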

Real-World Examples

1. Annotators drawing bounding boxes around pedestrians, cars, and traffic signs in thousands of driving images for autonomous vehicle training
2. Medical experts labeling X-ray images with diagnoses like 'pneumonia' or 'normal' to train a diagnostic AI
3. Workers rating pairs of AI chatbot responses to create preference data for RLHF training
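The three examples produce differently shaped labels. The sketch below shows one possible in-memory representation for each, plus a simple majority vote for reconciling several annotators' answers on the same item. All class names, field names, and sample values are hypothetical and do not follow any particular labeling tool's format.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BoundingBox:
    """One object outlined in an image (pixel coordinates, top-left origin)."""
    label: str
    x: int
    y: int
    width: int
    height: int


@dataclass
class ImageDiagnosis:
    """A single class label for a whole image."""
    image_id: str
    label: str


@dataclass
class PreferencePair:
    """Two responses to the same prompt and the one the rater preferred."""
    prompt: str
    response_a: str
    response_b: str
    preferred: str  # "a" or "b"


def majority_label(votes: List[str]) -> Optional[str]:
    """Return the label most annotators chose, or None if there is no clear winner."""
    counts = Counter(votes).most_common()
    if not counts:
        return None
    # A tie for first place means the annotators did not reach a majority.
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


box = BoundingBox(label="pedestrian", x=120, y=40, width=35, height=90)
pair = PreferencePair(
    prompt="Explain data labeling in one sentence.",
    response_a="It is tagging raw data with the answers a model should learn.",
    response_b="It is a thing computers do.",
    preferred="a",
)

# Three hypothetical experts label the same X-ray; two agree.
votes = ["pneumonia", "pneumonia", "normal"]
consensus = ImageDiagnosis(image_id="xray_0001", label=majority_label(votes))
print(consensus.label)  # pneumonia
```

Returning `None` on a tie rather than picking arbitrarily lets such items be routed to an additional reviewer. That is one reasonable policy, not the only one.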
