Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn
Abstract
While reinforcement learning from human feedback (RLHF) has been effective for aligning LLMs, it is complex and unstable. We introduce Direct Preference Optimization (DPO), an algorithm that implicitly optimizes the same objective as RLHF but is simpler to implement and train. DPO eliminates the need to fit a separate reward model, sample from the LM during training, or perform RL optimization, while achieving comparable or superior performance.
Key Findings
1. Simplified RLHF by eliminating the need for a separate reward model
2. Showed that the policy model itself can implicitly serve as a reward model
3. Achieved comparable or better results than PPO-based RLHF with simpler training
4. Reduced training instability and computational requirements
5. Enabled preference learning with standard supervised learning infrastructure
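The core idea behind findings 1, 2, and 5 can be sketched in a few lines: DPO defines an implicit reward as the (scaled) log-probability ratio between the policy and a frozen reference model, and trains with a simple binary classification loss over preference pairs. The sketch below uses per-sequence log-probabilities as plain floats; function and argument names are illustrative, not from the paper's codebase.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single preference pair.

    Each argument is the total log-probability of a response under the
    policy being trained or the frozen reference model. The implicit
    reward of a response is beta * (log pi_theta - log pi_ref); the loss
    is the negative log-sigmoid of the reward margin between the chosen
    and rejected responses.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) computed in a numerically stable form
    return math.log1p(math.exp(-margin))
```

When both responses score identically relative to the reference, the margin is zero and the loss is log 2, exactly as in binary cross-entropy with a 0.5 prediction; as the policy assigns relatively more probability to the chosen response, the loss falls toward zero. Since everything reduces to log-probabilities and a scalar loss, this trains with ordinary supervised-learning tooling, no sampling or RL loop required.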
Impact & Significance
DPO became the preferred alignment method for many open-source LLMs due to its simplicity. It significantly lowered the barrier to fine-tuning models with human preferences and influenced how Llama, Mistral, and other models are aligned.
Related Papers
The Llama 3 Herd of Models
Meta AI
Qwen2 Technical Report
Alibaba Cloud / Qwen Team
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek AI
The Claude 3 Model Family: Opus, Sonnet, and Haiku
Anthropic