Odds Ratio Preference Optimization (ORPO)
Reference-free preference optimization combining instruction tuning and alignment in a single training phase
🎯 30-Second Overview
Pattern: Single-phase training combining supervised fine-tuning with odds ratio-based preference optimization
Why: Simplifies alignment pipeline, reduces training complexity, and achieves competitive preference alignment without separate reward modeling
Key Insight: A penalty on the odds ratio between chosen and rejected responses is added directly to the supervised fine-tuning loss, so preference alignment happens during instruction tuning itself, with no reference model or reward model
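For concreteness, the objective from the ORPO paper (Hong et al., 2024) can be written as below, where y_w and y_l are the chosen and rejected responses, P_θ(y|x) is the model's (length-normalized) likelihood of a response, σ is the sigmoid, and λ weights the penalty (notation lightly adapted):

```latex
% ORPO objective (Hong et al., 2024), notation lightly adapted.
% odds_theta(y|x) converts the model's likelihood of a response into odds;
% L_OR pushes the odds of the chosen response y_w above those of the rejected y_l.
\[
  \operatorname{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)},
  \qquad
  \mathcal{L}_{OR} = -\log \sigma\!\left(
    \log \frac{\operatorname{odds}_\theta(y_w \mid x)}{\operatorname{odds}_\theta(y_l \mid x)}
  \right)
\]
\[
  \mathcal{L}_{ORPO} = \mathbb{E}_{(x,\, y_w,\, y_l)}\big[\, \mathcal{L}_{SFT} + \lambda \, \mathcal{L}_{OR} \,\big]
\]
```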
⚡ Quick Implementation
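A minimal PyTorch sketch of the loss above, assuming you already have the summed completion log-probabilities and token counts for each chosen/rejected pair under the current policy. The function name `orpo_loss` and the default `lam=0.1` are illustrative choices, not taken from the paper or any library:

```python
# Minimal sketch of the ORPO objective (Hong et al., 2024) in plain PyTorch.
# Assumes precomputed summed log-probs over completion tokens; not a drop-in
# replacement for any particular trainer implementation.
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps, chosen_lens, rejected_logps, rejected_lens, lam=0.1):
    """chosen_logps / rejected_logps: (batch,) summed log p_theta(y | x) over
    completion tokens; *_lens: (batch,) float token counts; lam: weight on the
    odds-ratio term (lambda in the paper)."""
    # Length-normalized log-likelihood of each completion.
    logp_w = chosen_logps / chosen_lens
    logp_l = rejected_logps / rejected_lens

    # log odds(y | x) = log p - log(1 - p), computed in log space.
    log_odds_w = logp_w - torch.log1p(-torch.exp(logp_w))
    log_odds_l = logp_l - torch.log1p(-torch.exp(logp_l))

    # Odds-ratio penalty: -log sigmoid(log odds_w - log odds_l).
    ratio_loss = -F.logsigmoid(log_odds_w - log_odds_l).mean()

    # Standard SFT term: mean negative log-likelihood of the chosen completion.
    sft_loss = -logp_w.mean()

    return sft_loss + lam * ratio_loss
```

Length-normalizing the log-likelihoods before forming the odds keeps the penalty comparable across completions of different lengths, mirroring the paper's use of average token log-probabilities.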
📋 Do's & Don'ts
🚦 When to Use
Use When
- Want single-phase training without a separate reward model (see the training sketch after these lists)
- Have high-quality pairwise preference data available
- Need a simpler alternative to the multi-stage RLHF pipeline
- Computational efficiency is important
- Prefer monolithic training over modular approaches
Avoid When
- Need an explicit reward model for interpretability
- Preference data quality is questionable
- Require fine-grained control over reward shaping
- Training stability is a major concern
- Multi-objective optimization is needed
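As referenced in the list above, the single-phase workflow is short in practice. The sketch below assumes Hugging Face TRL's ORPOTrainer/ORPOConfig and a pairwise preference dataset with prompt/chosen/rejected columns; the model and dataset names are placeholders, and exact argument names (e.g. processing_class vs. tokenizer) vary across TRL versions, so treat this as a starting point rather than a verified recipe.

```python
# Single-phase ORPO fine-tuning sketch using Hugging Face TRL (assumed API:
# a recent TRL release that ships ORPOTrainer/ORPOConfig).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model_name = "Qwen/Qwen2-0.5B-Instruct"  # placeholder base model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Any pairwise preference dataset with "prompt"/"chosen"/"rejected" works.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

config = ORPOConfig(
    output_dir="orpo-model",
    beta=0.1,                      # weight on the odds-ratio term (lambda in the paper)
    max_length=1024,
    per_device_train_batch_size=2,
    learning_rate=8e-6,
)

trainer = ORPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,    # older TRL versions take tokenizer= instead
)
trainer.train()
```

Note there is no reward model, no reference model, and no separate SFT stage: one trainer call covers both instruction tuning and preference alignment.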
📊 Key Metrics
💡 Top Use Cases
References & Further Reading
Deepen your understanding with these curated resources
Foundational Papers
ORPO: Monolithic Preference Optimization without Reference Model (Hong et al., 2024)
Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Rafailov et al., 2023)
Training language models to follow instructions with human feedback (Ouyang et al., 2022)
Learning to summarize from human feedback (Stiennon et al., 2020)
Related Preference Optimization Methods
Simple Preference Optimization (SimPO): Reference-Free Training (Meng et al., 2024)
Identity Preference Optimization (IPO) (Azar et al., 2023)
Kahneman-Tversky Optimization (KTO): Prospect Theory for LLMs (Ethayarajh et al., 2024)
Statistical Rejection Sampling Improves Preference Optimization (Liu et al., 2024)
Theoretical Analysis
Understanding the performance gap between online and offline alignment algorithms (Tang et al., 2024)
A General Theoretical Paradigm to Understand Learning from Human Preferences (Azar et al., 2023)
The Alignment Problem from a Deep Learning Perspective (Ngo et al., 2023)
Reward Model Ensembles Help Mitigate Overoptimization (Coste et al., 2023)
Empirical Comparisons
Comparing ORPO, DPO, and RLHF: An Empirical Study (Kim et al., 2024)
Preference Optimization Beyond DPO: Analysis and Extensions (Chen et al., 2024)
When to Use DPO vs RLHF vs ORPO: Decision Framework (Liu et al., 2024)
Single-Phase vs Multi-Phase Alignment: Trade-offs and Performance (Wang et al., 2024)
Contribute to this collection
Know a great resource? Submit a pull request to add it.