Training Compute-Optimal Large Language Models
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark
Abstract
We investigate the optimal model size and number of tokens for training a transformer language model under a given compute budget. We find that current large language models are significantly undertrained. We train a compute-optimal model, Chinchilla (70B parameters, 1.4T tokens), that uses the same compute as Gopher (280B) but outperforms it on nearly every benchmark.
Key Findings
- Showed that most large LLMs were significantly undertrained relative to their size
- Established that model size and training tokens should be scaled equally
- Demonstrated that a 70B model trained on more data beats a 280B model trained on less
- Provided revised scaling laws showing the importance of data quantity
- Changed industry best practices for LLM training budgets
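The equal-scaling finding is often summarized by two rules of thumb attributed to this paper: training compute for a transformer is roughly C ≈ 6·N·D FLOPs (N parameters, D tokens), and the compute-optimal token count is roughly D ≈ 20·N. A minimal sketch, assuming those two approximations, of how one might size a model for a given budget:

```python
import math

def chinchilla_optimal(compute_flops):
    """Return (params, tokens) that roughly exhaust a FLOP budget.

    Assumes C ~= 6 * N * D and the ~20 tokens-per-parameter
    heuristic, so C ~= 120 * N**2.
    """
    n_params = math.sqrt(compute_flops / 120)
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Chinchilla's approximate budget: 6 * 70e9 params * 1.4e12 tokens
n, d = chinchilla_optimal(6 * 70e9 * 1.4e12)
print(f"params ~ {n/1e9:.0f}B, tokens ~ {d/1e12:.1f}T")
```

Plugging in Gopher-scale compute recovers roughly Chinchilla's configuration (≈70B parameters, ≈1.4T tokens); the paper fits these exponents empirically, so the constants here are illustrative approximations.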
Impact & Significance
Chinchilla fundamentally changed how the industry trains LLMs. After this paper, labs shifted toward training smaller models on more data rather than simply increasing parameter count, leading to more efficient models such as Llama and Mistral.
Related Papers
The Llama 3 Herd of Models
Meta AI
Qwen2 Technical Report
Alibaba Cloud / Qwen Team
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek AI
The Claude 3 Model Family: Opus, Sonnet, and Haiku
Anthropic