Unit 3: Reinforcement learning from human (or AI) feedback

Resources: Reinforcement learning from human (or AI) feedback

Resources (2 hrs 20 mins)

The True Story of How GPT-2 Became Maximally Lewd
Rational Animations · 2024 · 15 min
Illustrating Reinforcement Learning from Human Feedback (RLHF)
Nathan Lambert, Louis Castricato and Leandro von Werra et al. · 2022 · 30 min
Constitutional AI: Harmlessness from AI Feedback
Yuntao Bai and Jared Kaplan et al. · 2022 · 60 min
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Stephen Casper and Xander Davies et al. · 2023 · 30 min

Optional Resources

Thoughts on the impact of RLHF research
Paul Christiano · 2023 · 11 min
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou, Zifan Wang and Nicholas Carlini et al. · 2023 · 20 min
Universal Jailbreak Backdoors from Poisoned Human Feedback
Javier Rando and Florian Tramèr · 2023 · 30 min
Badllama 3: removing safety finetuning from Llama 3 in minutes
Dmitrii Volkov · 2024 · 20 min
An introduction to Policy Gradient methods
Xander Steenbrugge · 2018 · 20 min
RLHF: How to Learn from Human Feedback with Reinforcement Learning
Natasha Jaques · 2023 · 60 min
Learning to Summarize with Human Feedback
Jeffrey Wu, Nisan Stiennon and Daniel Ziegler et al. · 2020 · 20 min
Deep reinforcement learning from human preferences
Paul Christiano, Jan Leike and Tom Brown et al. · 2017 · 40 min
Fine-Tuning Language Models from Human Preferences
Daniel Ziegler, Nisan Stiennon and Jeffrey Wu et al. · 2019 · 30 min
The Horrific Content a Kenyan Worker Had to See While Training ChatGPT
Alex Kantrowitz · 2023 · 4 min
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Rafael Rafailov, Archit Sharma and Eric Mitchell et al. · 2023
Chip Huyen's RLHF guide
Chip Huyen · 2023 · 30 min
RLAIF: Reinforcement Learning from AI Feedback
Cameron Wolfe · 2023 · 30 min

AI Alignment
