Reinforcement Learning from Human Feedback (RLHF)
Training AI agents to align with human preferences through reinforcement learning on human feedback
🎯 30-Second Overview
Pattern: Three-phase training: supervised fine-tuning (SFT) → reward model training on human preferences → PPO policy optimization
Why: Aligns AI behavior with human values, improves safety and helpfulness, and reduces harmful outputs
Key Insight: Human preferences → reward model → policy optimization creates a scalable alignment mechanism
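To make the "human preferences → reward model" step concrete, here is a minimal sketch of the pairwise Bradley-Terry loss commonly used to train the reward model on preference data. The names (`reward_model`, `chosen_ids`, `rejected_ids`) are illustrative placeholders, not from this page or any specific library.

```python
# Minimal sketch of phase 2 (reward model training), assuming a reward_model
# callable that maps a batch of tokenized responses to one scalar score each.
import torch
import torch.nn.functional as F

def preference_loss(reward_model,
                    chosen_ids: torch.Tensor,
                    rejected_ids: torch.Tensor) -> torch.Tensor:
    """Pairwise Bradley-Terry loss: -log sigmoid(r(chosen) - r(rejected))."""
    r_chosen = reward_model(chosen_ids)      # (batch,) reward for the preferred response
    r_rejected = reward_model(rejected_ids)  # (batch,) reward for the rejected response
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

Minimizing this loss trains the reward model to score human-preferred responses above rejected ones, producing the scalar reward used in the policy-optimization phase.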
⚡ Quick Implementation
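A full quick-start depends on the training stack, so the snippet below is only a sketch of the core of phase 3: the shaped reward that PPO optimizes, combining the reward model's score with a KL penalty that keeps the fine-tuned policy close to the SFT/reference model. The function name and the `kl_coef=0.1` default are assumptions for illustration.

```python
# Sketch of the phase-3 (PPO) reward signal: reward model score minus a KL
# penalty toward the SFT/reference policy. Names and kl_coef are illustrative.
import torch

def rlhf_ppo_reward(reward_model_score: torch.Tensor,
                    policy_logprobs: torch.Tensor,
                    ref_logprobs: torch.Tensor,
                    kl_coef: float = 0.1) -> torch.Tensor:
    """reward_model_score: (batch,) scalar score per sampled response.
    policy_logprobs / ref_logprobs: (batch, seq_len) per-token log-probs of the response.
    Returns the (batch,) reward handed to the PPO update."""
    kl_per_token = policy_logprobs - ref_logprobs    # sample-based KL estimate per token
    kl_penalty = kl_coef * kl_per_token.sum(dim=-1)  # summed over the generated tokens
    return reward_model_score - kl_penalty
```

The surrounding loop samples responses from the current policy, scores them with the reward model, computes this shaped reward, and takes a PPO step; libraries such as Hugging Face TRL package this loop, though the exact API varies by version.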
📋 Do's & Don'ts
🚦 When to Use
Use When
- Human preference alignment is crucial
- Safety and helpfulness requirements
- Complex subjective quality judgments
- Large-scale deployment with user interaction
- Need for controllable AI behavior
Avoid When
- Simple objective tasks with clear metrics
- Limited human annotation budget
- Real-time inference requirements
- Tasks with well-defined ground truth
- Small-scale or research-only applications
📊 Key Metrics
💡 Top Use Cases
References & Further Reading
Deepen your understanding with these curated resources
Foundational Papers
Recent Advances (2023-2024)
RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback (Lee et al., 2023)
Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Rafailov et al., 2023; see the loss sketch after this list)
Statistical Rejection Sampling Improves Preference Optimization (Liu et al., 2024)
Secrets of RLHF in Large Language Models (Zheng et al., 2024)
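To accompany the Rafailov et al. (2023) entry above, here is a minimal sketch of the DPO objective, which optimizes directly on preference pairs and skips the explicit reward model and PPO loop. The argument names and the `beta=0.1` default are illustrative assumptions.

```python
# Minimal sketch of the DPO loss (Rafailov et al., 2023): inputs are summed
# log-probabilities of whole responses under the trained policy and the frozen
# reference (SFT) model. Names and the beta default are illustrative.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """-log sigmoid(beta * [(log pi - log ref)_chosen - (log pi - log ref)_rejected])."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```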
Contribute to this collection
Know a great resource? Submit a pull request to add it.