Unit 5: Adversarial techniques for scalable oversight
Resources (1 hr 30 mins)
- AI-written critiques help humans notice flaws (blog post)
- AI safety via debate
- Red-teaming language models with language models
- Robust Feature-Level Adversaries are Interpretability Tools
Optional Resources
- High-stakes alignment via adversarial training
- Two-turn debate doesn’t help humans answer hard reasoning comprehension questions
- Takeaways from our robust injury classifier project
- Debate update: Obfuscated arguments problem
- GopherCite
- Training robust corrigibility
- Chain of Thought Imitation with Procedure Cloning
- Iterated Distillation and Amplification
- Strong HCH