Unit 4: Understanding how AI systems "think"
Resources (1 hr 40 mins)
- Introduction to Mechanistic Interpretability
- A Longlist of Theories of Impact for Interpretability
- Let's Try To Understand AI Monosemanticity
- Zoom In: An Introduction to Circuits
Optional Resources
- What Do Neural Networks Really Learn? Exploring the Brain of an AI Model
- Against Almost Every Theory of Impact of Interpretability
- Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small
- Universal and Transferable Adversarial Attacks on Aligned Language Models