Safety · October 19, 2022 · OpenAI

Scaling Laws for Reward Model Overoptimization

Leo Gao, John Schulman, Jacob Hilton

Abstract

In reinforcement learning from human feedback (RLHF), it is common practice to optimize the policy against a learned reward model that serves as a proxy for true human preferences. We study how the gold reward score — the score assigned by a held-out "gold-standard" reward model standing in for human judgment — changes as the policy is optimized against the proxy reward model. We find that this overoptimization follows predictable scaling laws, and we provide a framework for anticipating when policies trained against proxy rewards will diverge from actual human preferences.
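The paper reports that gold reward, as a function of the square-root KL divergence d = sqrt(KL(π ‖ π_init)) between the optimized and initial policies, follows simple closed forms: d(α − βd) for best-of-n sampling and d(α − β·log d) for RL. A minimal sketch of these functional forms, with purely illustrative coefficients (the α, β values below are placeholders, not the paper's fitted numbers):

```python
import math

def gold_reward_best_of_n(d: float, alpha: float, beta: float) -> float:
    """Gold reward under best-of-n sampling: R(d) = d * (alpha - beta * d)."""
    return d * (alpha - beta * d)

def gold_reward_rl(d: float, alpha: float, beta: float) -> float:
    """Gold reward under RL optimization: R(d) = d * (alpha - beta * log d)."""
    return d * (alpha - beta * math.log(d))

# Both forms rise, peak, then decline as optimization distance d grows —
# the signature of overoptimization. For best-of-n, the peak sits at
# d* = alpha / (2 * beta); for RL, at d* = exp(alpha / beta - 1).
alpha, beta = 1.0, 0.1          # illustrative placeholder coefficients
d_peak_bon = alpha / (2 * beta)  # -> 5.0 for these placeholders
```

With these placeholder coefficients, pushing d past d_peak_bon strictly decreases the gold reward even though the proxy reward keeps climbing, which is the divergence the scaling laws characterize.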

Key Findings

  1. Identified and characterized reward model overoptimization as a scaling phenomenon
  2. Discovered predictable relationships between proxy and gold reward scores
  3. Showed that excessive optimization against reward models degrades true quality
  4. Provided a framework for detecting and preventing overoptimization
  5. Influenced how RLHF training is conducted in practice

Impact & Significance

This paper identified a key failure mode in RLHF training and influenced how AI labs calibrate their alignment training. Understanding reward hacking remains critical for training safe and useful AI systems.
