LLM · January 23, 2020 · OpenAI / Johns Hopkins

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Ilya Sutskever

Abstract

We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training. Larger models are significantly more sample-efficient. Our results strongly suggest that larger models will continue to perform better, and that compute-efficient training involves training very large models and stopping significantly short of convergence.

Key Findings

  • Discovered power-law relationships between model size, data, compute, and performance
  • Showed that larger models are more sample-efficient
  • Demonstrated smooth, predictable improvement curves with scale
  • Found that model size matters more than dataset size for fixed compute budgets
  • Provided a framework for planning optimal training configurations
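The power-law relationship in the first finding can be sketched numerically. The snippet below uses the paper's model-size law, L(N) = (N_c / N)^α_N, where N is the count of non-embedding parameters; the constants are the paper's reported fits (approximate values), and the function name is illustrative, not from the paper.

```python
# Model-size scaling law from the paper: L(N) = (N_c / N) ** alpha_N.
# Constants are the paper's approximate reported fits.
ALPHA_N = 0.076   # model-size exponent
N_C = 8.8e13      # critical parameter count (non-embedding)

def predicted_loss(n_params: float) -> float:
    """Predicted cross-entropy loss for a model with n_params non-embedding parameters."""
    return (N_C / n_params) ** ALPHA_N

# A power law means doubling model size always shrinks the loss by the same
# constant factor, 2 ** -alpha_N, regardless of the starting size.
small = predicted_loss(1e8)   # 100M-parameter model
large = predicted_loss(2e8)   # 200M-parameter model
ratio = large / small         # constant improvement factor, about 0.95
```

This constant multiplicative improvement per doubling is what makes the curves in the paper appear as straight lines on log-log plots, and is what lets practitioners extrapolate performance before committing to a large training run.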

Impact & Significance

Scaling laws fundamentally changed how AI labs plan training runs and allocate resources. They provided the scientific basis for the race to build larger models and influenced every major LLM training decision since publication.
