Unit 4: Understanding AI
Interpretability in practice
Resources (1 hr)
- What are probing classifiers and can they help us understand what’s happening inside AI models?
- Chain of Thought Monitorability
- Model Organisms of Misalignment
- Hallucination Probes
- Auditing language models for hidden objectives
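To make the first resource concrete: a probing classifier is a small supervised model (often just logistic regression) trained on a model's frozen hidden activations to test whether some property is decodable from them. The sketch below is a minimal, self-contained illustration using synthetic "activations" in place of real model hidden states; the data, dimensions, and training setup are all stand-in assumptions, not taken from the linked resource.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for hidden activations: 200 samples of 16-dim vectors,
# where one direction linearly encodes a binary property (e.g. "sentiment").
n, d = 200, 16
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (X @ w_true > 0).astype(float)  # the property we probe for

def train_probe(X, y, lr=0.5, steps=500):
    """Fit a linear probe (logistic regression) with plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(X @ w + b)))  # sigmoid predictions
        w -= lr * (X.T @ (p - y)) / len(y)  # gradient of logistic loss w.r.t. w
        b -= lr * (p - y).mean()            # gradient w.r.t. b
    return w, b

w, b = train_probe(X, y)
acc = (((X @ w + b) > 0).astype(float) == y).mean()
print(f"probe accuracy: {acc:.2f}")
```

High probe accuracy is evidence that the property is (linearly) represented in the activations; the resource above discusses the caveats, such as distinguishing what a probe can extract from what the model actually uses.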
Exercises
Optional Resources
- ARENA Curriculum: Transformer Interpretability
- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
- On the Biology of a Large Language Model
- Thought Anchors: Which LLM Reasoning Steps Matter?