Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Ilya Sutskever
Abstract
We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power law with model size, dataset size, and the amount of compute used for training. Larger models are significantly more sample-efficient, so optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping well before convergence.
Key Findings
- Discovered power-law relationships between model size, dataset size, compute, and performance
- Showed that larger models are more sample-efficient
- Demonstrated smooth, predictable improvement curves with scale
- Found that model size matters more than dataset size at a fixed compute budget
- Provided a framework for planning optimal training configurations
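The power-law relationship in the first finding can be sketched numerically. The snippet below implements the model-size scaling form L(N) = (N_c / N)^alpha_N; the constants used (alpha_N ≈ 0.076, N_c ≈ 8.8e13 non-embedding parameters) are the approximate values reported in the paper, and the function is a simplified sketch that ignores data and compute limits.

```python
# Sketch of the model-size scaling law L(N) = (N_c / N)^alpha_N.
# Constants are approximate values from the paper; this ignores the
# dataset-size and compute terms of the full scaling relation.

ALPHA_N = 0.076   # power-law exponent for model size
N_C = 8.8e13      # critical non-embedding parameter count

def loss_from_params(n_params: float) -> float:
    """Predicted cross-entropy loss (nats/token) for a model with
    n_params non-embedding parameters, in the infinite-data limit."""
    return (N_C / n_params) ** ALPHA_N

# Loss falls smoothly and predictably as the model grows.
for n in (1e6, 1e8, 1e10):
    print(f"N = {n:.0e}: predicted loss ≈ {loss_from_params(n):.3f}")
```

Because the exponent is small, each 100x increase in parameters shaves off a roughly constant multiplicative factor of loss, which is what makes the improvement curves smooth and predictable.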
Impact & Significance
Scaling laws fundamentally changed how AI labs plan training runs and allocate resources. They provided the scientific basis for the race to build larger models and influenced every major LLM training decision since publication.
Related Papers
The Llama 3 Herd of Models
Meta AI
Qwen2 Technical Report
Alibaba Cloud / Qwen Team
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek AI
The Claude 3 Model Family: Opus, Sonnet, and Haiku
Anthropic