Unit 2: Reward misspecification and instrumental convergence
Resources (1 hr 20 mins)
- Specification gaming: the flip side of AI ingenuity
- Learning from Human Preferences
- Learning to Summarize with Human Feedback
- The alignment problem from a deep learning perspective
- Optimal Policies Tend To Seek Power
- What failure looks like
- Inverse reinforcement learning example
- The easy goal inference problem is still hard
Optional Resources
- TruthfulQA: Measuring How Models Mimic Human Falsehoods
- Model Mis-specification and Inverse Reinforcement Learning
- Illustrating Reinforcement Learning from Human Feedback (RLHF)
- What is inner misalignment?
- Clarifying "AI Alignment"
- A central AI alignment problem: capabilities generalization, and the sharp left turn
- Is power-seeking AI an existential risk?
- Risks from Learned Optimization