Unit 4: Interpretability (201)
Resources: Interpretability 9
Resources (3 hrs 40 mins)
- Toy models of superposition
- Discovering Latent Knowledge in Language Models Without Supervision
- A mechanistic interpretability analysis of grokking
- Polysemanticity and capacity in neural networks
- Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small
- ROME: Locating and Editing Factual Associations in GPT
Optional Resources
- A Mathematical Framework for Transformer Circuits
- Convergent learning: Do different neural networks learn the same representations?
- Rewriting a Deep Generative Model