Question 1

What is Training Data?

Accepted Answer

Training data is the collection of examples (text, images, audio, or structured data) that a machine learning model learns from during training. It is the basis for the patterns, knowledge, and capabilities the model develops.

Question 2

How does Training Data work?

Accepted Answer

Training data is arguably the most important ingredient in building AI systems, because a model can only learn what its training data teaches it. Large language models such as GPT-4 are trained on text from books, websites, and code repositories, at a scale widely reported to reach trillions of tokens (OpenAI has not published the exact figure). Stable Diffusion was trained on subsets of LAION-5B, a dataset of billions of image-text pairs. The quality, diversity, and representativeness of training data directly shape model behavior, including its strengths, weaknesses, and biases. Curating training data involves collecting, cleaning, filtering, deduplicating, and balancing it. Recent trends include supplementing real data with synthetic data and curating smaller, high-quality subsets rather than simply scaling up quantity.
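The curation steps named above (cleaning, filtering, deduplicating, balancing) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any production pipeline works: the function names `clean` and `curate`, the `min_words` and `max_per_label` thresholds, and the sample sentences are all hypothetical, and real pipelines typically use near-duplicate detection and learned quality filters rather than exact matching and a word count.

```python
import hashlib
import random
from collections import defaultdict


def clean(text: str) -> str:
    """Collapse runs of whitespace and strip leading/trailing spaces."""
    return " ".join(text.split())


def curate(examples, min_words=3, max_per_label=2, seed=0):
    """Clean, filter, deduplicate, and balance (text, label) pairs.

    All thresholds are illustrative assumptions, not recommended values.
    """
    seen = set()
    by_label = defaultdict(list)
    for text, label in examples:
        text = clean(text)
        # Filter: drop examples too short to be informative.
        if len(text.split()) < min_words:
            continue
        # Deduplicate: exact match on the lowercased, cleaned text.
        key = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        by_label[label].append(text)

    # Balance: cap each label at max_per_label by random downsampling.
    rng = random.Random(seed)
    curated = []
    for label in sorted(by_label):
        texts = by_label[label]
        if len(texts) > max_per_label:
            texts = rng.sample(texts, max_per_label)
        curated.extend((t, label) for t in texts)
    return curated


raw = [
    ("The  movie was great", "pos"),
    ("the movie was great", "pos"),      # duplicate once cleaned and lowercased
    ("Loved every minute of it", "pos"),
    ("A wonderful, moving film", "pos"),
    ("Bad", "neg"),                       # too short, filtered out
    ("The plot made no sense", "neg"),
]
curated = curate(raw)
```

Here the six raw examples reduce to three: one duplicate and one too-short example are dropped, and the over-represented `pos` label is downsampled to the cap. Fixing the random seed keeps the downsampling reproducible, which matters when a dataset has to be rebuilt later.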

Question 3

What are examples of Training Data?

Accepted Answer

OpenAI training GPT-4 on a filtered corpus of books, web pages, and code repositories (the exact size is unpublished; trillions of tokens is a widely reported estimate).

Stability AI training Stable Diffusion on subsets of LAION-5B, a dataset of roughly 5.85 billion image-text pairs collected from the internet.

A hospital curating 500,000 labeled radiology images as training data for a diagnostic AI system.
