What Is a Corpus?
A corpus (plural: corpora) is a large, structured collection of text or speech data compiled for the purpose of training language models, conducting linguistic research, or building natural language processing (NLP) systems.
How a Corpus Works
Corpora are the foundational text datasets used to train language models. They range from carefully curated literary collections to massive web scrapes containing billions of pages. Well-known corpora include Common Crawl (a regularly updated archive of web pages), The Pile (a roughly 800GB dataset drawn from diverse sources), and RedPajama (an open reproduction of LLaMA's training data). The composition of a corpus significantly influences model behavior: a corpus heavy on legal text produces a model with strong legal knowledge, while one heavy on code produces strong coding abilities. Corpus design involves decisions about data sources, filtering criteria, language distribution, and quality thresholds, all of which directly shape model capabilities.
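To make the filtering and quality-threshold decisions concrete, here is a minimal sketch of a document-level quality filter of the kind corpus pipelines apply before training. The function name and the specific thresholds (minimum length, maximum symbol ratio) are illustrative assumptions, not taken from any particular pipeline.

```python
# Minimal sketch of corpus quality filtering.
# Thresholds below are hypothetical examples, not production values.

def passes_quality_filter(text, min_chars=200, max_symbol_ratio=0.3):
    """Keep documents that are long enough and mostly alphanumeric text."""
    if len(text) < min_chars:
        return False  # too short to be useful training data
    # Fraction of characters that are neither letters/digits nor whitespace
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) <= max_symbol_ratio

docs = [
    "aa",                           # too short: filtered out
    "A" * 150 + " " + "B" * 100,    # long, clean text: kept
    "@#$%" * 100,                   # mostly symbols: filtered out
]
corpus = [d for d in docs if passes_quality_filter(d)]
print(len(corpus))  # 1
```

Real pipelines layer many such heuristics (language identification, deduplication, perplexity scoring), but each reduces to the same pattern: a per-document predicate that decides whether the text enters the corpus.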
Real-World Examples
Common Crawl providing petabytes of web text used as a primary training data source for many large language models
The Pile assembling text from 22 diverse sources including Wikipedia, ArXiv papers, and GitHub code for balanced LLM training
A university building a specialized corpus of 10 million academic papers for training a research-focused language model