Unit 6: Mechanistic interpretability

Resources: Mechanistic interpretability

Resources (2 hrs)

Introduction to Mechanistic Interpretability
Sarah Hastings-Woodhouse · 2024 · 15 min
Zoom In: An Introduction to Circuits
Chris Olah, Nick Cammarata and Ludwig Schubert et al. · 2020 · 40 min
Let's Try To Understand AI Monosemanticity
Scott Alexander · 2023 · 25 min
A Longlist of Theories of Impact for Interpretability
Neel Nanda · 2022 · 15 min
Against Almost Every Theory of Impact of Interpretability
Charbel-Raphael Segerie · 2023 · 20 min

Optional Resources

Concrete Steps to Get Started in Transformer Mechanistic Interpretability
Neel Nanda · 2022
ARENA Curriculum: Transformer Interpretability
Callum McDougall · 2024
Toy Models of Superposition
Nelson Elhage, Tristan Hume and Catherine Olsson et al. · 2022 · 45 min
How useful is mechanistic interpretability?
Ryan Greenblatt, Neel Nanda and Buck Shlegeris et al. · 2023
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Trenton Bricken, Adly Templeton and Joshua Batson et al. · 2023 · 40 min
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small
Kevin Wang, Alexandre Variengien and Arthur Conmy et al. · 2023 · 40 min
Multimodal Neurons in Artificial Neural Networks
Gabriel Goh, Nick Cammarata and Chelsea Voss et al. · 2021 · 50 min
In-context learning and induction heads
Catherine Olsson, Nelson Elhage and Neel Nanda et al. · 2022
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Adly Templeton and Tom Conerly et al. · 2024 · 200 min
What is mechanistic interpretability?
Neel Nanda, Hamish Doodles and Daniel Filan · 2023 · 3 min
ROME: Locating and Editing Factual Associations in GPT
Kevin Meng, David Bau and Alex Andonian et al. · 2022
Towards Developmental Interpretability
Jesse Hoogland, Alexander Gietelink and Daniel Murfet et al. · 2023
An Interpretability Illusion for BERT
Tolga Bolukbasi, Adam Pearce and Ann Yuan et al. · 2021
Causal Scrubbing: a method for rigorously testing interpretability hypotheses
Adrià Garriga-Alonso, Nicholas Goldowsky-Dill and Ryan Greenblatt et al. · 2022
Eliciting latent knowledge
Paul Christiano, Ajeya Cotra and Mark Xu · 2021
Mechanistic anomaly detection and ELK
Paul Christiano · 2022
Discovering Latent Knowledge in Language Models Without Supervision
Collin Burns, Haotian Ye and Dan Klein et al. · 2022
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
Kenneth Li and Oam Patel · 2023
Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task
Kenneth Li · 2022
Robust Feature-Level Adversaries are Interpretability Tools
Casper · 2021
Chris Olah on what the hell is going on inside neural networks
Chris Olah, Robert Wiblin and Keiran Harris · 2021 · 180 min
