What Is a Corpus?
A corpus (plural: corpora) is a large, structured collection of text or speech data compiled for the purpose of training language models, conducting linguistic research, or building natural language processing (NLP) systems.
How a Corpus Works
Corpora are the foundational text datasets used to train language models. They range from carefully curated literary collections to massive web scrapes containing billions of pages. Well-known corpora include Common Crawl (a regularly updated archive of web pages), The Pile (a roughly 800GB dataset drawn from diverse sources), and RedPajama (an open reproduction of LLaMA's training data). The composition of a corpus significantly influences model behavior: a corpus heavy on legal text produces a model with strong legal knowledge, while one heavy on code produces strong coding abilities. Corpus design involves decisions about data sources, filtering criteria, language distribution, and quality thresholds, all of which directly shape model capabilities.
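To make the filtering and quality-threshold decisions concrete, here is a minimal sketch of a document-level quality filter of the kind corpus pipelines apply before training. The function name and the specific thresholds (minimum length, maximum symbol ratio) are illustrative assumptions, not taken from any particular pipeline.

```python
# Minimal sketch of corpus quality filtering.
# Thresholds below are hypothetical examples, not production values.

def passes_quality_filter(text, min_chars=200, max_symbol_ratio=0.3):
    """Keep documents that are long enough and mostly alphanumeric text."""
    if len(text) < min_chars:
        return False  # too short to be useful training data
    # Fraction of characters that are neither letters/digits nor whitespace
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) <= max_symbol_ratio

docs = [
    "aa",                           # too short: filtered out
    "A" * 150 + " " + "B" * 100,    # long, clean text: kept
    "@#$%" * 100,                   # mostly symbols: filtered out
]
corpus = [d for d in docs if passes_quality_filter(d)]
print(len(corpus))  # 1
```

Real pipelines layer many such heuristics (language identification, deduplication, perplexity scoring), but each reduces to the same pattern: a per-document predicate that decides whether the text enters the corpus.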
Real-World Examples
Common Crawl providing petabytes of web text used as a primary training data source for many large language models
The Pile assembling text from 22 diverse sources including Wikipedia, ArXiv papers, and GitHub code for balanced LLM training
A university building a specialized corpus of 10 million academic papers for training a research-focused language model