Unit 3: Detecting danger
Evaluations: AI can, but will it?
Resources (40 mins)
- We need a Science of Evals
- AI models can be dangerous before public deployment
- When AI Chooses Harm Over Failure
- Self-preservation or Instruction Ambiguity? Examining the Causes of Shutdown Resistance
Exercises
Optional Resources
- Chain-of-Thought Snippets from Deceptive AIs
- Alignment faking in large language models
- System Card: Claude Sonnet 4.5
- Agentic Misalignment: How LLMs could be insider threats
- ‘3cb’: The Catastrophic Cyber Capabilities Benchmark
- HonestCyberEval: An AI Cyber Risk Benchmark for Automated Software Exploitation
- Do the biorisk evaluations of AI labs actually measure the risk of developing bioweapons?
- Toward Comprehensive Benchmarking of the Biological Knowledge of Frontier Large Language Models
- The levers of political persuasion with conversational AI