Unit 3: Using AI to align AI
Resources (50 mins)
- Can we scale human feedback for complex AI tasks?
- Using Dangerous AI, But Safely?
Optional Resources
- Weak-to-strong generalization: Eliciting Strong Capabilities With Weak Supervision
- AI Control: Improving Safety Despite Intentional Subversion
- AI safety via debate
- Debating with More Persuasive LLMs Leads to More Truthful Answers
- Adversarial Machine Learning explained with examples
- Universal and Transferable Adversarial Attacks on Aligned Language Models
- Supervising strong learners by amplifying weak experts
- Factored cognition