What Is Training Data?
Training data is the collection of examples (text, images, audio, or structured data) that a machine learning model learns from during training. It forms the basis for the patterns, knowledge, and capabilities the model develops.
How Training Data Works
Training data is arguably the most important ingredient in building AI systems, because a model can only learn the patterns present in its training data. Large language models such as GPT-4 are trained on very large text corpora, typically drawn from sources such as books, websites, and code repositories; OpenAI has not published the exact size or composition of GPT-4's dataset. Stable Diffusion was trained on subsets of LAION-5B, a collection of billions of image-text pairs. The quality, diversity, and representativeness of training data directly shape model behavior, including its strengths, weaknesses, and biases. Curating training data involves collecting, cleaning, filtering, deduplicating, and balancing it. Recent trends include supplementing real data with synthetic data and curating smaller, high-quality subsets rather than simply scaling quantity.
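The curation steps named above (cleaning, filtering, deduplicating, and balancing) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular lab prepares its data: the function names clean and curate, the min_words and per_label_cap thresholds, and the sample records are all hypothetical, and real pipelines typically use fuzzy deduplication and learned quality filters rather than the exact-match hashing and word-count rule shown here.

```python
import hashlib
import random
from collections import defaultdict


def clean(text):
    # Cleaning: collapse runs of whitespace and trim both ends.
    return " ".join(text.split())


def curate(examples, min_words=3, per_label_cap=2, seed=0):
    """Curate a list of (text, label) pairs. All thresholds are illustrative."""
    seen = set()
    by_label = defaultdict(list)
    for text, label in examples:
        text = clean(text)
        # Filtering: drop examples too short to carry much signal.
        if len(text.split()) < min_words:
            continue
        # Deduplicating: exact match on the lowercased, cleaned text.
        digest = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        by_label[label].append(text)

    # Balancing: randomly downsample any label above the cap.
    rng = random.Random(seed)
    curated = []
    for label in sorted(by_label):
        texts = by_label[label]
        if len(texts) > per_label_cap:
            texts = rng.sample(texts, per_label_cap)
        curated.extend((t, label) for t in texts)
    return curated


# Hypothetical raw records: one near-duplicate, one too-short entry,
# and more "normal" than "abnormal" examples.
raw = [
    ("The  scan shows no abnormality.", "normal"),
    ("the scan shows no abnormality.", "normal"),
    ("Lungs are clear bilaterally.", "normal"),
    ("Heart size is within normal limits.", "normal"),
    ("Small nodule in the left lung.", "abnormal"),
    ("ok", "abnormal"),
]
curated = curate(raw)
```

Capping the majority label is only one way to balance a dataset; oversampling the minority label or reweighting examples during training are common alternatives, and the right choice depends on how much data can be spared.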
Real-World Examples
OpenAI training GPT-4 on a mix of publicly available data and data licensed from third parties (the exact size and composition are undisclosed)
Stability AI training Stable Diffusion on subsets of LAION-5B, a dataset of roughly 5.85 billion image-text pairs collected from the web
A hospital curating 500,000 labeled radiology images as training data for a diagnostic AI system