Unit 3: Reinforcement learning from human (or AI) feedback
Resources (2 hrs 20 mins)
- The True Story of How GPT-2 Became Maximally Lewd
- Illustrating Reinforcement Learning from Human Feedback (RLHF)
- Constitutional AI: Harmlessness from AI Feedback
- Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Optional Resources
- Thoughts on the impact of RLHF research
- Universal and Transferable Adversarial Attacks on Aligned Language Models
- Universal Jailbreak Backdoors from Poisoned Human Feedback
- Badllama 3: removing safety finetuning from Llama 3 in minutes
- An introduction to Policy Gradient methods
- RLHF: How to Learn from Human Feedback with Reinforcement Learning
- Learning to Summarize with Human Feedback
- Deep reinforcement learning from human preferences
- Fine-Tuning Language Models from Human Preferences
- The Horrific Content a Kenyan Worker Had to See While Training ChatGPT
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model
- Chip Huyen's RLHF guide
- RLAIF: Reinforcement Learning from AI Feedback