Unit 5: Adversarial techniques for scalable oversight
Resources (1 hr 30 mins)
- AI-written critiques help humans notice flaws (blog post)
- AI safety via debate
- Red-teaming language models with language models
- Robust Feature-Level Adversaries are Interpretability Tools
Optional Resources
- High-stakes alignment via adversarial training
- Two-turn debate doesn’t help humans answer hard reasoning comprehension questions
- Takeaways from our robust injury classifier project
- Debate update: Obfuscated arguments problem
- GopherCite
- Training robust corrigibility
- Chain of Thought Imitation with Procedure Cloning
- Iterated Distillation and Amplification
- Strong HCH