Safety · October 19, 2022 · OpenAI

Scaling Laws for Reward Model Overoptimization

Leo Gao, John Schulman, Jacob Hilton

Abstract

In reinforcement learning from human feedback (RLHF), it is common practice to optimize the policy against a learned reward model that serves as a proxy for true human preferences. We study how the gold reward score — the score assigned by a held-out "gold-standard" reward model standing in for human judgment — changes as the policy is optimized against the proxy reward model. We find that this overoptimization follows predictable scaling laws, and we provide a framework for anticipating when policies trained against proxy rewards will diverge from actual human preferences.
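The paper reports that gold reward, as a function of the square-root KL divergence d = sqrt(KL(π ‖ π_init)) between the optimized and initial policies, follows simple closed forms: d(α − βd) for best-of-n sampling and d(α − β·log d) for RL. A minimal sketch of these functional forms, with purely illustrative coefficients (the α, β values below are placeholders, not the paper's fitted numbers):

```python
import math

def gold_reward_best_of_n(d: float, alpha: float, beta: float) -> float:
    """Gold reward under best-of-n sampling: R(d) = d * (alpha - beta * d)."""
    return d * (alpha - beta * d)

def gold_reward_rl(d: float, alpha: float, beta: float) -> float:
    """Gold reward under RL optimization: R(d) = d * (alpha - beta * log d)."""
    return d * (alpha - beta * math.log(d))

# Both forms rise, peak, then decline as optimization distance d grows —
# the signature of overoptimization. For best-of-n, the peak sits at
# d* = alpha / (2 * beta); for RL, at d* = exp(alpha / beta - 1).
alpha, beta = 1.0, 0.1          # illustrative placeholder coefficients
d_peak_bon = alpha / (2 * beta)  # -> 5.0 for these placeholders
```

With these placeholder coefficients, pushing d past d_peak_bon strictly decreases the gold reward even though the proxy reward keeps climbing, which is the divergence the scaling laws characterize.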

Key Findings

  1. Identified and characterized reward model overoptimization as a scaling phenomenon
  2. Discovered predictable relationships between proxy and gold reward scores
  3. Showed that excessive optimization against reward models degrades true quality
  4. Provided a framework for detecting and preventing overoptimization
  5. Influenced how RLHF training is conducted in practice

Impact & Significance

This paper identified a key failure mode in RLHF training and influenced how AI labs calibrate their alignment training. Understanding reward hacking remains critical for training safe and useful AI systems.
