Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Ilya Sutskever
Abstract
We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power law with model size, dataset size, and the amount of compute used for training. Larger models are significantly more sample-efficient, so optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping well before convergence.
Key Findings
- Discovered power-law relationships between model size, dataset size, compute, and performance
- Showed that larger models are more sample-efficient
- Demonstrated smooth, predictable improvement curves with scale
- Found that model size matters more than dataset size at a fixed compute budget
- Provided a framework for planning optimal training configurations
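The power-law relationship in the first finding can be sketched numerically. The snippet below implements the model-size scaling form L(N) = (N_c / N)^alpha_N; the constants used (alpha_N ≈ 0.076, N_c ≈ 8.8e13 non-embedding parameters) are the approximate values reported in the paper, and the function is a simplified sketch that ignores data and compute limits.

```python
# Sketch of the model-size scaling law L(N) = (N_c / N)^alpha_N.
# Constants are approximate values from the paper; this ignores the
# dataset-size and compute terms of the full scaling relation.

ALPHA_N = 0.076   # power-law exponent for model size
N_C = 8.8e13      # critical non-embedding parameter count

def loss_from_params(n_params: float) -> float:
    """Predicted cross-entropy loss (nats/token) for a model with
    n_params non-embedding parameters, in the infinite-data limit."""
    return (N_C / n_params) ** ALPHA_N

# Loss falls smoothly and predictably as the model grows.
for n in (1e6, 1e8, 1e10):
    print(f"N = {n:.0e}: predicted loss ≈ {loss_from_params(n):.3f}")
```

Because the exponent is small, each 100x increase in parameters shaves off a roughly constant multiplicative factor of loss, which is what makes the improvement curves smooth and predictable.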
Impact & Significance
Scaling laws fundamentally changed how AI labs plan training runs and allocate resources. They provided the scientific basis for the race to build larger models and influenced every major LLM training decision since publication.
Related Papers
The Llama 3 Herd of Models
Meta AI
Qwen2 Technical Report
Alibaba Cloud / Qwen Team
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek AI
The Claude 3 Model Family: Opus, Sonnet, and Haiku
Anthropic