Efficiency · March 29, 2022 · DeepMind

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark

Abstract

We investigate the optimal model size and number of training tokens for a transformer language model under a fixed compute budget. We find that current large language models are significantly undertrained relative to their size. Guided by our revised scaling laws, we train a compute-optimal model, Chinchilla (70B parameters, 1.4T tokens), that uses the same compute budget as Gopher (280B parameters) yet outperforms it on nearly every benchmark.

Key Findings

  1. Showed that most large LLMs were significantly undertrained relative to their size
  2. Established that model size and training tokens should be scaled in equal proportion as compute grows
  3. Demonstrated that a 70B model trained on more data beats a 280B model trained on less
  4. Provided revised scaling laws showing the importance of data quantity
  5. Changed industry best practices for LLM training budgets
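The equal-scaling finding above can be sketched numerically. This is a hedged illustration, not code from the paper: it assumes the standard training-FLOPs approximation C ≈ 6·N·D (N parameters, D tokens) and the roughly 20-tokens-per-parameter ratio commonly derived from the paper's results; the function name `chinchilla_optimal` is mine.

```python
def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Return an approximate compute-optimal (params, tokens) split.

    Assumes C = 6 * N * D and D = tokens_per_param * N, which gives
    N = sqrt(C / (6 * tokens_per_param)). Both N and D then scale as
    sqrt(C), i.e. in equal proportion as compute grows.
    """
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Roughly Chinchilla-scale compute: 6 * 70e9 params * 1.4e12 tokens.
params, tokens = chinchilla_optimal(5.76e23)
print(f"params ~ {params:.2e}, tokens ~ {tokens:.2e}")
# prints values near 7e10 params and 1.4e12 tokens,
# consistent with Chinchilla's 70B / 1.4T configuration
```

Note how doubling compute under this sketch raises both the parameter count and the token count by a factor of sqrt(2), rather than putting all the extra budget into model size.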

Impact & Significance

Chinchilla fundamentally changed how the industry trains LLMs. After this paper, labs shifted from simply making models larger to training smaller models on substantially more data, an approach reflected in later efficient model families such as Llama and Mistral.
