Unit 4: Understanding how AI systems "think"

Resources: Understanding how AI systems "think"

Introduction to Mechanistic Interpretability
Sarah Hastings-Woodhouse · 2024 · 15 min
A Longlist of Theories of Impact for Interpretability
Neel Nanda · 2022 · 15 min
Let's Try To Understand AI Monosemanticity
Scott Alexander · 2023 · 25 min
Zoom In: An Introduction to Circuits
Chris Olah, Nick Cammarata and Ludwig Schubert et al. · 2020 · 40 min

Optional Resources

What Do Neural Networks Really Learn? Exploring the Brain of an AI Model
Rational Animations · 2024 · 18 min
Against Almost Every Theory of Impact of Interpretability
Charbel-Raphael Segerie · 2023 · 20 min
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small
Kevin Wang, Alexandre Variengien and Arthur Conmy et al. · 2023 · 40 min
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou, Zifan Wang and Nicholas Carlini et al. · 2023 · 10 min

AI Alignment Fast-Track