Unit 4: Scalable oversight
Resources: Scalable oversight
Resources (1 hr 40 mins)
- Can we scale human feedback for complex AI tasks?
- Supervising strong learners by amplifying weak experts
- AI safety via debate
- Weak-to-strong generalization: Eliciting Strong Capabilities With Weak Supervision
Optional Resources
- Measuring Progress on Scalable Oversight for Large Language Models
- An overview of 11 proposals for building safe advanced AI
- Factored cognition
- Summarizing Books with Human Feedback
- How to Keep Improving When You're Better Than Any Teacher - Iterated Distillation and Amplification
- Scalable agent alignment via reward modeling
- Constitutional AI: Harmlessness from AI Feedback
- Humans can be assigned any values whatsoever
- Cooperative inverse reinforcement learning
- Language Models Perform Reasoning via Chain of Thought
- Least-to-Most Prompting Enables Complex Reasoning in Large Language Models
- AI safety via market making
- WARM: On the Benefits of Weight Averaged Reward Models
- Simple synthetic data reduces sycophancy in large language models
- Compositional preference models for aligning LMs
- Debating with More Persuasive LLMs Leads to More Truthful Answers