Unit 2: What makes aligning AI difficult?

Resources: What makes aligning AI difficult?

The True Story of How GPT-2 Became Maximally Lewd
Rational Animations · 2024 · 15 min
RLAIF vs. RLHF: the technology behind Anthropic’s Claude (Constitutional AI Explained)
AssemblyAI · 2023 · 6 min
A simple technical explanation of RLH(AI)F
Li-Lian Ang · 2024 · 12 min
Intro to Large Language Models
Andrej Karpathy · 2023 · 4 min
Problems with Reinforcement Learning from Human Feedback (RLHF) for AI safety
Sarah Hastings-Woodhouse · 2024 · 7 min

Optional Resources

Illustrating Reinforcement Learning from Human Feedback (RLHF)
Nathan Lambert, Louis Castricato, Leandro von Werra et al. · 2022 · 30 min
Constitutional AI: Harmlessness from AI Feedback
Yuntao Bai and Jared Kaplan · 2022 · 60 min
Why Does AI Lie, and What Can We Do About It?
Robert Miles · 2023 · 10 min

AI Alignment Fast-Track