Safety · May 29, 2023 · Stanford University

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn

Abstract

While reinforcement learning from human feedback (RLHF) has been effective for aligning large language models (LLMs), it is complex and unstable. We introduce Direct Preference Optimization (DPO), an algorithm that implicitly optimizes the same objective as existing RLHF methods but is simpler to implement and train. DPO eliminates the need to fit a separate reward model, sample from the LM during training, or perform reinforcement-learning optimization, while achieving comparable or superior performance.

Key Findings

  • Simplified RLHF by eliminating the need for a separate reward model
  • Showed that the policy model itself can implicitly serve as a reward model
  • Achieved comparable or better results than PPO-based RLHF with simpler training
  • Reduced training instability and computational requirements
  • Enabled preference learning with standard supervised learning infrastructure
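To make the findings above concrete, here is a minimal sketch of the per-example DPO loss from the paper: the policy's log-probability margin over a reference model on the chosen response, minus the same margin on the rejected response, scaled by a temperature beta and passed through a negative log-sigmoid. The function and parameter names are illustrative, not from an official implementation; a real trainer would compute sequence log-probabilities with the model and average this loss over a batch.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss (names illustrative).

    loss = -log sigmoid(beta * [(log pi(y_w) - log pi_ref(y_w))
                                - (log pi(y_l) - log pi_ref(y_l))])
    """
    # Implicit reward of each response: beta * log-ratio vs. the reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # Negative log-sigmoid, i.e. binary cross-entropy on the preference label.
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

Because this is an ordinary differentiable loss on log-probabilities, it can be minimized with a standard supervised-training loop, which is why DPO needs no sampling or RL machinery: when the policy matches the reference, the loss is log 2, and it falls as the policy raises the chosen response relative to the rejected one.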

Impact & Significance

DPO became the preferred alignment method for many open-source LLMs due to its simplicity. It significantly lowered the barrier to fine-tuning models with human preferences and influenced how Llama, Mistral, and other models are aligned.
