Unit 6: Interpretability
Resources: Interpretability
Resources (2 hrs)
- Zoom In: An Introduction to Circuits
- Toy models of superposition
- Understanding intermediate layers using linear classifier probes
- Discovering language model behaviors with model-written evaluations: blog post
- ROME: Locating and Editing Factual Associations in GPT
Optional Resources
- Feature Visualization
- Acquisition of Chess Knowledge in AlphaZero
- Toy models of superposition
- A Mathematical Framework for Transformer Circuits
- Chris Olah’s views on AGI safety
- Intro to brain-like AGI safety
- Eliciting latent knowledge
- Rewriting a Deep Generative Model
- Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors
- Thread: Circuits
- Compositional Explanations of Neurons