Training Compute-Optimal Large Language Models
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark
Abstract
We investigate the optimal model size and number of tokens for training a transformer language model under a given compute budget. We find that current large language models are significantly undertrained. We train a compute-optimal model, Chinchilla (70B parameters, 1.4T tokens), that uses the same compute as Gopher (280B) but outperforms it on nearly every benchmark.
Key Findings
- Showed that most large LLMs were significantly undertrained relative to their size
- Established that model size and training tokens should be scaled equally
- Demonstrated that a 70B model trained on more data beats a 280B model trained on less
- Provided revised scaling laws showing the importance of data quantity
- Changed industry best practices for LLM training budgets
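The equal-scaling finding is often summarized by two rules of thumb attributed to this paper: training compute for a transformer is roughly C ≈ 6·N·D FLOPs (N parameters, D tokens), and the compute-optimal token count is roughly D ≈ 20·N. A minimal sketch, assuming those two approximations, of how one might size a model for a given budget:

```python
import math

def chinchilla_optimal(compute_flops):
    """Return (params, tokens) that roughly exhaust a FLOP budget.

    Assumes C ~= 6 * N * D and the ~20 tokens-per-parameter
    heuristic, so C ~= 120 * N**2.
    """
    n_params = math.sqrt(compute_flops / 120)
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Chinchilla's approximate budget: 6 * 70e9 params * 1.4e12 tokens
n, d = chinchilla_optimal(6 * 70e9 * 1.4e12)
print(f"params ~ {n/1e9:.0f}B, tokens ~ {d/1e12:.1f}T")
```

Plugging in Gopher-scale compute recovers roughly Chinchilla's configuration (≈70B parameters, ≈1.4T tokens); the paper fits these exponents empirically, so the constants here are illustrative approximations.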
Impact & Significance
Chinchilla fundamentally changed how the industry trains LLMs. After this paper, labs shifted toward training smaller models on more data rather than simply increasing parameter count, leading to more efficient models such as Llama and Mistral.
Related Papers
The Llama 3 Herd of Models
Meta AI
Qwen2 Technical Report
Alibaba Cloud / Qwen Team
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek AI
The Claude 3 Model Family: Opus, Sonnet, and Haiku
Anthropic