Unit 2: Reward misspecification and instrumental convergence
Resources (1 hr 20 mins)
- Specification gaming: the flip side of AI ingenuity
- Learning from Human Preferences
- Learning to Summarize with Human Feedback
- The alignment problem from a deep learning perspective
- Optimal Policies Tend To Seek Power
- What failure looks like
- Inverse reinforcement learning example
- The easy goal inference problem is still hard
Optional Resources
- TruthfulQA: Measuring How Models Mimic Human Falsehoods
- Model Mis-specification and Inverse Reinforcement Learning
- Illustrating Reinforcement Learning from Human Feedback (RLHF)
- What is inner misalignment?
- Clarifying "AI Alignment"
- A central AI alignment problem: capabilities generalization, and the sharp left turn
- Is power-seeking AI an existential risk?
- Risks from Learned Optimization