Unit 3: Using AI to align AI
Resources (50 mins)
- Can we scale human feedback for complex AI tasks?
- Using Dangerous AI, But Safely?
Optional Resources
- Weak-to-strong generalization: Eliciting Strong Capabilities With Weak Supervision
- AI Control: Improving Safety Despite Intentional Subversion
- AI safety via debate
- Debating with More Persuasive LLMs Leads to More Truthful Answers
- Adversarial Machine Learning explained with examples
- Universal and Transferable Adversarial Attacks on Aligned Language Models
- Supervising strong learners by amplifying weak experts
- Factored cognition