January 6, 2022 · OpenAI

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, Vedant Misra

Abstract

We show that neural networks can learn to generalize on algorithmic tasks long after memorizing the training data, a phenomenon we call grokking. In some cases, networks achieve perfect generalization thousands of training steps after reaching perfect training accuracy. This challenges conventional wisdom about the relationship between memorization and generalization.
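To make the setup concrete, here is a minimal sketch of a grokking experiment, written as a hypothetical PyTorch reimplementation rather than the authors' code: it trains on modular addition and logs train and test accuracy long past the point of memorization. The paper trains a small transformer; a small MLP over concatenated token embeddings stands in here to keep the sketch short, and the hyperparameters are illustrative.

```python
# Hypothetical sketch of a grokking experiment (not the authors' code):
# train on (a + b) mod P, keep optimizing long after train accuracy
# reaches 100%, and watch test accuracy eventually jump.
import itertools
import torch
import torch.nn as nn

P = 97  # small prime modulus, in the spirit of the paper's tasks

# The full dataset is every (a, b) pair with label (a + b) mod P.
pairs = torch.tensor(list(itertools.product(range(P), repeat=2)))
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm = torch.randperm(len(pairs))
split = len(pairs) // 2  # train on half the table, hold out the rest
train_idx, test_idx = perm[:split], perm[split:]

embed = nn.Embedding(P, 128)
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, P))
params = list(embed.parameters()) + list(model.parameters())
# The paper reports that regularization such as weight decay strongly
# affects how quickly generalization appears after memorization.
opt = torch.optim.AdamW(params, lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

def accuracy(idx):
    with torch.no_grad():
        x = embed(pairs[idx]).flatten(1)  # concatenate both embeddings
        return (model(x).argmax(-1) == labels[idx]).float().mean().item()

for step in range(100_000):  # far more steps than memorization requires
    x = embed(pairs[train_idx]).flatten(1)
    loss = loss_fn(model(x), labels[train_idx])
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        print(step, accuracy(train_idx), accuracy(test_idx))
```

In a grokking run, the first printed column saturates at 1.0 early, while the last column stays near chance for a long stretch before climbing to 1.0.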

Key Findings

  • Discovered that generalization can occur long after memorization (grokking)
  • Showed networks achieving 100% test accuracy thousands of steps after reaching 100% train accuracy
  • Demonstrated this on modular arithmetic, permutation groups, and other algorithmic tasks (see the dataset sketch after this list)
  • Challenged assumptions about when to stop training neural networks
  • Opened new research directions in understanding deep learning generalization
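The tasks share one recipe: take the complete table of a binary operation on a small finite set and reveal only a random fraction of its cells. The sketch below shows that recipe for modular addition and for composition in the permutation group S5; the function names are ours, for illustration, and do not come from the paper.

```python
# Hypothetical sketch of the paper's task family: each task is the full
# "multiplication table" of a binary operation on a small finite set,
# and the network trains on a random subset of the table's cells.
import itertools
import random

def mod_add_table(p):
    """Modular addition: cell (a, b) holds (a + b) mod p."""
    return {(a, b): (a + b) % p for a in range(p) for b in range(p)}

def s5_composition_table():
    """Composition in the permutation group S5, one of the paper's tasks."""
    elems = list(itertools.permutations(range(5)))
    index = {g: i for i, g in enumerate(elems)}
    compose = lambda g, h: tuple(g[h[i]] for i in range(5))  # (g∘h)(i) = g(h(i))
    return {(index[g], index[h]): index[compose(g, h)]
            for g in elems for h in elems}

def split_table(table, train_fraction, seed=0):
    """Reveal a random fraction of cells; the paper finds that smaller
    training fractions delay generalization dramatically."""
    cells = list(table.items())
    random.Random(seed).shuffle(cells)
    k = int(train_fraction * len(cells))
    return cells[:k], cells[k:]

train, test = split_table(s5_composition_table(), train_fraction=0.5)
```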

Impact & Significance

Grokking revealed fundamental insights into how neural networks learn and generalize, challenging conventional training practices such as stopping at the onset of overfitting. It has influenced subsequent research on training dynamics and on distinguishing when models truly learn versus merely memorize.
