Safety · October 19, 2022 · OpenAI
Scaling Laws for Reward Model Overoptimization
Leo Gao, John Schulman, Jacob Hilton
Abstract
In reinforcement learning from human feedback, it is common to optimize the policy against a learned reward model that serves as a proxy for human preferences. We study how the gold reward score, the score under a trusted reward model standing in for true preferences, changes as we optimize against the proxy reward model. We find that the resulting overoptimization can be characterized by scaling laws, and provide a theoretical framework for predicting when policies trained against proxy rewards will diverge from actual human preferences.
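For reference, the scaling laws in question take a simple closed form. Writing d for the square root of the KL divergence between the optimized policy and the initial policy, the paper fits the gold reward R with the functional forms below (reproduced here for context; alpha and beta are fitted coefficients that scale smoothly with reward model size):

```latex
% Functional forms fit in the paper, with d := \sqrt{D_{\mathrm{KL}}(\pi \,\|\, \pi_{\mathrm{init}})}.
R_{\mathrm{bon}}(d) = d\left(\alpha_{\mathrm{bon}} - \beta_{\mathrm{bon}}\, d\right)       % best-of-n sampling
R_{\mathrm{RL}}(d)  = d\left(\alpha_{\mathrm{RL}} - \beta_{\mathrm{RL}} \log d\right)      % reinforcement learning
```

Both curves rise and then fall: gold reward peaks at a finite optimization budget and degrades beyond it, which is the overoptimization the title refers to.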
Key Findings
1. Identified and characterized reward model overoptimization as a scaling phenomenon
2. Discovered predictable relationships between proxy and gold reward scores
3. Showed that excessive optimization against reward models degrades true quality
4. Provided a framework for detecting and preventing overoptimization (a minimal sketch follows this list)
5. Influenced how RLHF training is conducted in practice
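To make finding 4 concrete, here is a minimal, hypothetical sketch (not the paper's code) of how the RL functional form can be used to anticipate overoptimization: fit the form to gold-reward measurements taken during training, then solve for the KL budget at which the fitted gold reward peaks and stop optimizing there. All function names and data values below are illustrative assumptions.

```python
# Sketch: fit R(d) = d * (alpha - beta * log d), with d = sqrt(KL(pi || pi_init)),
# to observed gold-reward measurements, then predict the KL budget at which
# gold reward peaks -- i.e., where further optimization becomes overoptimization.
import numpy as np

def fit_rl_form(d, gold_reward):
    """Least-squares fit of R(d) = d * (alpha - beta * log d).

    Rearranged: R/d = alpha - beta * log d, which is linear in (alpha, beta).
    The d values must be positive (drop the d = 0 initial point).
    """
    X = np.column_stack([np.ones_like(d), -np.log(d)])
    coef, *_ = np.linalg.lstsq(X, gold_reward / d, rcond=None)
    alpha, beta = coef
    return alpha, beta

def peak_kl(alpha, beta):
    """KL at which the fitted gold reward is maximized.

    dR/dd = alpha - beta * (log d + 1) = 0  =>  d* = exp(alpha/beta - 1),
    and KL* = d***2 since d = sqrt(KL).
    """
    d_star = np.exp(alpha / beta - 1.0)
    return d_star ** 2

# Hypothetical measurements: sqrt-KL values and gold-RM scores observed
# during an RLHF run (purely illustrative numbers).
d_obs = np.array([0.5, 1.0, 2.0, 4.0, 8.0])
gold_obs = np.array([0.40, 0.70, 1.05, 1.20, 0.90])

alpha, beta = fit_rl_form(d_obs, gold_obs)
print(f"alpha={alpha:.3f}, beta={beta:.3f}, stop near KL ~ {peak_kl(alpha, beta):.1f} nats")
```

In a real RLHF pipeline the "gold" signal would have to come from held-out human labels or a larger trusted reward model, since the true objective is unavailable by definition; the paper's synthetic setup uses a fixed gold reward model for exactly this reason.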
Impact & Significance
This paper identified a key failure mode in RLHF training and influenced how AI labs calibrate their alignment training. Understanding reward hacking remains critical for training safe and useful AI systems.