Unit 4: Interpretability (201)

Resources: Interpretability 9

Toy models of superposition
Nelson Elhage, Tristan Hume and Catherine Olsson et al. · 2022 · 45 min
Discovering Latent Knowledge in Language Models Without Supervision
Collin Burns, Haotian Ye and Dan Klein et al. · 2022 · 30 min
A mechanistic interpretability analysis of grokking
Neel Nanda, Lawrence Chan and Tom Lieberum et al. · 2022 · 35 min
Polysemanticity and capacity in neural networks
Adam Scherlis, Kshitij Sachan and Adam S. Jermyn et al. · 2022 · 35 min
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small
Kevin Wang, Alexandre Variengien and Arthur Conmy et al. · 2023 · 35 min
ROME: Locating and Editing Factual Associations in GPT
Kevin Meng, David Bau and Alex Andonian et al. · 2022 · 35 min

Optional Resources

A Mathematical Framework for Transformer Circuits
Nelson Elhage, Neel Nanda and Catherine Olsson et al. · 2021 · 60 min
Convergent learning: Do different neural networks learn the same representations?
Yixuan Li, Jason Yosinski and Jeff Clune et al. · 2016 · 40 min
Rewriting a Deep Generative Model
David Bau, Steven Liu and Tongzhou Wang et al. · 2020 · 10 min

Alignment 201