Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn
Abstract
While reinforcement learning from human feedback (RLHF) has been effective for aligning LLMs, it is complex and unstable. We introduce Direct Preference Optimization (DPO), an algorithm that implicitly optimizes the same objective as RLHF but is simpler to implement and train. DPO eliminates the need to fit a separate reward model, sample from the LM during training, or perform RL optimization, while achieving comparable or superior performance.
Key Findings
1. Simplified RLHF by eliminating the need for a separate reward model
2. Showed that the policy model itself can implicitly serve as a reward model
3. Achieved comparable or better results than PPO-based RLHF with simpler training
4. Reduced training instability and computational requirements
5. Enabled preference learning with standard supervised learning infrastructure
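The core idea behind findings 1, 2, and 5 can be sketched in a few lines: DPO defines an implicit reward as the (scaled) log-probability ratio between the policy and a frozen reference model, and trains with a simple binary classification loss over preference pairs. The sketch below uses per-sequence log-probabilities as plain floats; function and argument names are illustrative, not from the paper's codebase.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single preference pair.

    Each argument is the total log-probability of a response under the
    policy being trained or the frozen reference model. The implicit
    reward of a response is beta * (log pi_theta - log pi_ref); the loss
    is the negative log-sigmoid of the reward margin between the chosen
    and rejected responses.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) computed in a numerically stable form
    return math.log1p(math.exp(-margin))
```

When both responses score identically relative to the reference, the margin is zero and the loss is log 2, exactly as in binary cross-entropy with a 0.5 prediction; as the policy assigns relatively more probability to the chosen response, the loss falls toward zero. Since everything reduces to log-probabilities and a scalar loss, this trains with ordinary supervised-learning tooling, no sampling or RL loop required.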
Impact & Significance
DPO became the preferred alignment method for many open-source LLMs due to its simplicity. It significantly lowered the barrier to fine-tuning models with human preferences and influenced how Llama, Mistral, and other models are aligned.
Related Papers
The Llama 3 Herd of Models
Meta AI
Qwen2 Technical Report
Alibaba Cloud / Qwen Team
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek AI
The Claude 3 Model Family: Opus, Sonnet, and Haiku
Anthropic