Unit 5: Robustness, unlearning and control

Resources: Robustness, unlearning and control

Adversarial Machine Learning explained with examples
Letiția Pârcălăbescu · 2020 · 11 min
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou, Zifan Wang and Nicholas Carlini et al. · 2023 · 10 min
Deep Forgetting & Unlearning for Safely-Scoped LLMs
Stephen Casper · 2023 · 20 min
Measuring and Reducing Malicious Use With Unlearning
Nathaniel Li and Alexander Pan et al. · 2024 · 30 min
AI Control: Improving Safety Despite Intentional Subversion
Buck Shlegeris, Fabien Roger and Ryan Greenblatt et al. · 2023 · 20 min

Optional Resources

Adversarial Machine Learning Reading List
Nicholas Carlini · 2019
CAIS: Adversarial Robustness Introduction
Dan Hendrycks · 2022 · 31 min

AI Alignment