Unit 4: Understanding AI
Interpretability in practice
Resources (1 hr)
- What are probing classifiers and can they help us understand what’s happening inside AI models?
- Chain of Thought Monitorability
- Model Organisms of Misalignment
- Hallucination Probes
- Auditing language models for hidden objectives
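To make the first resource concrete: a probing classifier is a small supervised model (often just logistic regression) trained on a model's frozen hidden activations to test whether some property is decodable from them. The sketch below is a minimal, self-contained illustration using synthetic "activations" in place of real model hidden states; the data, dimensions, and training setup are all stand-in assumptions, not taken from the linked resource.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for hidden activations: 200 samples of 16-dim vectors,
# where one direction linearly encodes a binary property (e.g. "sentiment").
n, d = 200, 16
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (X @ w_true > 0).astype(float)  # the property we probe for

def train_probe(X, y, lr=0.5, steps=500):
    """Fit a linear probe (logistic regression) with plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(X @ w + b)))  # sigmoid predictions
        w -= lr * (X.T @ (p - y)) / len(y)  # gradient of logistic loss w.r.t. w
        b -= lr * (p - y).mean()            # gradient w.r.t. b
    return w, b

w, b = train_probe(X, y)
acc = (((X @ w + b) > 0).astype(float) == y).mean()
print(f"probe accuracy: {acc:.2f}")
```

High probe accuracy is evidence that the property is (linearly) represented in the activations; the resource above discusses the caveats, such as distinguishing what a probe can extract from what the model actually uses.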
Exercises
Optional Resources
- ARENA Curriculum: Transformer Interpretability
- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
- On the Biology of a Large Language Model
- Thought Anchors: Which LLM Reasoning Steps Matter?