Unit 2: Training safer models
More safety techniques
Resources (25 mins)
- An Approach to Technical AGI Safety and Security
- Can we scale human feedback for complex AI tasks?
- Recommendations for Technical AI Safety Research Directions
Exercises
Optional Resources
- What is deliberative alignment?
- Detecting and reducing scheming in AI models
- What is AI debate, and can it make systems safer?
- Empirical Progress on Debate
- Debating with More Persuasive LLMs Leads to More Truthful Answers
- On scalable oversight with weak LLMs judging strong LLMs
- What is Weak-to-Strong Generalisation?
- Weak-to-strong generalization: Eliciting Strong Capabilities With Weak Supervision