Direct Preference Optimization (DPO)
Simplified alignment method that directly optimizes LLM policies on preference data without reward models
30-Second Overview
Pattern: Two-phase training: SFT → direct preference optimization, without explicit reward modeling
Why: Simplifies the RLHF pipeline, improves training stability, and reduces computational overhead while maintaining alignment quality
Key Insight: Treats the language model as an implicit reward model, optimizing directly on preference data via a classification loss
Quick Implementation
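A minimal PyTorch sketch of the DPO loss, assuming the summed log-probabilities of each chosen and rejected response under the trainable policy and the frozen reference model have already been computed. The function name `dpo_loss` and the default `beta=0.1` are illustrative choices, not taken from this page.

```python
# Minimal DPO loss sketch (assumes per-response summed log-probabilities
# under the trainable policy and the frozen reference model).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1):
    # Implicit rewards: beta * log(pi_theta(y|x) / pi_ref(y|x)).
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Bradley-Terry classification loss on the reward margin.
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

    # Mean reward margin is a useful optimization signal (should grow during training).
    margin = (chosen_rewards - rejected_rewards).detach().mean()
    return loss, margin
```

In practice, library implementations such as Hugging Face TRL's `DPOTrainer` handle log-probability computation, batching, and the frozen reference model for you.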
Do's & Don'ts
When to Use
Use When
- Simpler alternative to RLHF needed
- Limited computational resources
- Faster training cycles required
- Stable supervised learning preferred
- Clear pairwise preferences available
Avoid When
- Complex multi-objective optimization needed
- Reward model interpretability required
- Very large-scale preference datasets
- Fine-grained reward shaping necessary
- Need explicit reward signal modeling
Key Metrics
Top Use Cases
Direct Preference Optimization (DPO)
Simplified alignment without explicit reward modeling
DPO Pipeline (interactive visualization): preference pairs, policy metrics, optimization signals, and training progress; a code sketch of this loop follows below.
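Below is a hypothetical epoch loop mirroring those pipeline stages: it streams preference pairs, applies the `dpo_loss` function sketched above, and logs the optimization signals. The `sum_logps` helper (summed log-probability of a response given its prompt) and the batch layout are assumptions for illustration, not part of this page.

```python
# Hypothetical epoch loop: stream preference pairs, apply the DPO loss,
# and log training-progress signals. `sum_logps` is an assumed helper that
# returns the summed log-probability of a response given its prompt.
import torch

def train_epoch(policy, reference, loader, optimizer, beta=0.1):
    policy.train()
    reference.eval()  # the reference model stays frozen throughout
    for batch in loader:  # batch: {"prompt", "chosen", "rejected"}
        with torch.no_grad():
            ref_c = sum_logps(reference, batch["prompt"], batch["chosen"])
            ref_r = sum_logps(reference, batch["prompt"], batch["rejected"])
        pol_c = sum_logps(policy, batch["prompt"], batch["chosen"])
        pol_r = sum_logps(policy, batch["prompt"], batch["rejected"])

        loss, margin = dpo_loss(pol_c, pol_r, ref_c, ref_r, beta)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Optimization signals: loss, reward margin, implicit preference accuracy.
        acc = ((pol_c - ref_c) > (pol_r - ref_r)).float().mean()
        print(f"loss={loss.item():.4f}  margin={margin.item():.4f}  acc={acc.item():.2f}")
```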
DPO Algorithm Overview
Direct Optimization: Skip the explicit reward-modeling stage and optimize the policy on preference data directly, using the closed-form relationship between the optimal policy and the reward to fold the reward into the policy.
Bradley-Terry Model: Model human preferences probabilistically based on the difference in rewards between chosen and rejected responses.
KL Constraint: Maintain proximity to the reference policy through a KL divergence penalty whose strength is controlled by the β parameter, preventing the policy from drifting too far from the reference model (the full loss is written out below).
Advantages: Simpler than RLHF, more stable training, no reward model needed, computationally efficient.
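Written out, the points above combine into the standard DPO objective from Rafailov et al. (2023): the policy induces an implicit reward equal to β times its log-ratio against the reference model (up to a prompt-dependent constant that cancels in the comparison), which is plugged into a Bradley-Terry preference probability and fit with a logistic classification loss.

```latex
% Implicit reward induced by the policy and the frozen reference model
% (up to a prompt-dependent constant that cancels in the difference)
\[
r_\theta(x, y) \;=\; \beta \,\log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
\]
% Bradley-Terry probability that the chosen response y_w beats the rejected y_l
\[
p(y_w \succ y_l \mid x) \;=\; \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)
\]
% DPO objective: maximize the log-likelihood of the observed preferences
\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})
 = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[
   \log\sigma\!\left(
     \beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
     -\beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}
   \right)\right]
\]
```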
References & Further Reading
Deepen your understanding with these curated resources
Foundational Papers
Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Rafailov et al., 2023)
Training language models to follow instructions with human feedback (Ouyang et al., 2022)
Learning to summarize from human feedback (Stiennon et al., 2020)
Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022)
DPO Variants & Extensions (2023-2024)
Identity Preference Optimization (IPO): Overfitting Mitigation (Azar et al., 2023)
Kahneman-Tversky Optimization (KTO): Prospect Theory for LLMs (Ethayarajh et al., 2024)
Simple Preference Optimization (SimPO): Reference-Free Training (Meng et al., 2024)
Odds Ratio Preference Optimization (ORPO): Single-Phase Training (Hong et al., 2024)
Theoretical Analysis
Understanding the performance gap between online and offline alignment algorithms (Tang et al., 2024)
A General Theoretical Paradigm to Understand Learning from Human Preferences (Azar et al., 2023)
The Alignment Problem from a Deep Learning Perspective (Ngo et al., 2023)
Contribute to this collection
Know a great resource? Submit a pull request to add it.