Unit 6: Mechanistic interpretability
Resources: Mechanistic interpretability
Resources (2 hrs)
- Introduction to Mechanistic Interpretability
Create a free account to track your progress and unlock access to the full course content.
- Zoom In: An Introduction to Circuits
- Let's Try To Understand AI Monosemanticity
- A Longlist of Theories of Impact for Interpretability
- Against Almost Every Theory of Impact of Interpretability
Optional Resources
- Concrete Steps to Get Started in Transformer Mechanistic Interpretability
- ARENA Curriculum: Transformer Interpretability
- Toy Models of Superposition
- How useful is mechanistic interpretability?
- Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
- Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small
- Multimodal Neurons in Artificial Neural Networks
- In-context Learning and Induction Heads
- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
- What is mechanistic interpretability?
- ROME: Locating and Editing Factual Associations in GPT
- Towards Developmental Interpretability
- An Interpretability Illusion for BERT
- Causal Scrubbing: a method for rigorously testing interpretability hypotheses
- Eliciting latent knowledge
- Mechanistic anomaly detection and ELK
- Discovering Latent Knowledge in Language Models Without Supervision
- Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
- Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task
- Robust Feature-Level Adversaries are Interpretability Tools
- Chris Olah on what the hell is going on inside neural networks