Your next steps

Choose your focus

Create your 1-pager

Next steps: Apply to roles

Next steps: Technical fellowships

Next steps: Policy fellowships

Next steps: Other fellowships

Next steps: Programs

How does AI think?

Interpretability in practice

Making AI go well

What might success look like?

Building AI safely is hard

What future do you want?

Assuming harm

Building defences

Break the kill chain

Evaluations: AI can, but will it?

How do AI companies test for safety?

Option 1: Anthropic

Option 2: OpenAI

Option 3: Google DeepMind

Option 4: Meta

Can we train AI to be safe?

Feeding AI ‘good’ data

Teaching AI right from wrong

More safety techniques

<Embed url="https://www.youtube.com/embed/o1jxMLEFr9I"/>

AI behaviour emerges from data, algorithms, and compute. 

But training safe AI isn't like programming traditional software. We can't just write rules for every situation. Instead, we use techniques to nudge its behaviours and capabilities:

- **Input data filtering**: Carefully curate what the model learns from
- **Human feedback**: Teach the model what behaviours we want
- **Scalable oversight**: Maintain control as models surpass human capabilities

This unit examines the three main approaches to training safer models. You'll see how frontier labs implement these techniques, evaluate their effectiveness, and understand why even our best methods have limitations.

By the end, you'll be able to identify the gaps in training AI to be safe.


**What if we only trained AI on ‘good’ data?**

If we carefully filter the training data, removing dangerous knowledge and harmful patterns, we should get safer models. Control the input, control the output.

There are three goals of input data filtering:

- **Removing dangerous information,** like how to synthesise deadly pathogens, build explosives, hack critical infrastructure, or conduct sophisticated social engineering attacks. If the model never sees this information, it might not learn to do these things.
- **Preventing harmful behaviours,** like examples of toxic language, harmful stereotypes, and malicious reasoning patterns. Create training data that exemplifies the behaviours we want AI to exhibit.
- **Protecting against data poisoning**, where bad actors attempt to insert backdoors or vulnerabilities into models by contaminating the training data. 

In this section, we’ll explore how robust today’s input data filtration techniques are at making AI safer.


Which statement best describes the overall robustness of input data filtration as a safety technique?


Limitations of input filtering

A new AI startup claims they've solved AI safety by implementing "perfect input data filtration". They say they've removed all dangerous content from their training data, including information about weapons, cyberattacks, and harmful behaviours. They argue this makes their model completely safe and that no other safety measures are needed.

Based on what you've learned about input data filtration, evaluate this claim. Your response should:

1. **Identify at least 2 specific limitations or vulnerabilities** of input data filtration that challenge the startup's claim of "perfect" safety (use evidence from the resources)
2. **Explain why input data filtration alone is insufficient** for ensuring AI safety
3. **Describe one concrete scenario** where their "perfectly filtered" model could still cause harm

Write 200-300 words


Evaluating input data filtration

One of the classic techniques for training models do as we intend (or are “aligned” to what we want) is through reinforcement learning with human feedback (RLHF) 

During RLHF, humans score the model’s outputs and the model trains on these scores to learn what’s ‘good’ or ‘bad’. But only using humans is inefficient, so practically, AI is used to help provide scores (RLAIF).

Most other techniques build upon this, so it’s important to have a strong understanding of how this works and its limits.


These questions should be answerable using only the core resources above. Write your answers to the following questions in the box below. ​

1．What is the main goal of using RLHF with large language models? 

2．Describe the job of the two "coaches" involved in the RLHF process. 

3．In practice, it's hard for humans to give consistent scalar feedback (e.g. rate this text from 1 to 10). Instead, how do we collect feedback from humans? How do we then turn this into a scalar number? 

4．In the GPT-2 case study, what went wrong that caused the model to produce "maximally bad output"? 

​Continuing the general RLHF questions, but reviewing **[this diagram from AWS](https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2023/08/31/ML-14874_image001.jpg)**. 

5．What is happening in step 1? 

6．Other than human demonstrations purpose-written for fine-tuning AI models, what other data might models use for fine-tuning in step 1? 

7．What is happening in step 2? Be sure to explain this in detail. 

8．Finally, what is happening in step 3? 

9．Why don't we just do step 1, without bothering with steps 2 and 3? 

10．Why don't we just do step 2 and step 3, without bothering with step 1?​

11．Summarise an open problem of RLHF in your own words. 

12．Summarise a fundamental problem of RLHF in your own words.

You can check your answers against **[the answer key](https://docs.google.com/document/d/1yz3yfUowbYoOcEuMAc2T0xebkNFlBj8xBFLjgX7TsBA/edit?tab=t.0)**.


Comprehension questions

Starting from a base LLM, explain how to train it using RLHF to be a helpful assistant that answers following all the rules in **[Wikipedia’s Manual of Style](https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style)**.​

Write about 400-800 words. Your answer should cover:
- Using supervised fine-tuning with augmented data or human demonstrations
- Collecting feedback from humans (assume you have access to 100 expert Wikipedia editors who know all the rules)
- Using this feedback to influence the model outputs


Explaining RLHF in your own words

Some models, such as Llama 2 or Gemma, are available both as the base model (the weights and biases before RLHF) and the instruction-following model (the weights and biases after RLHF - sometimes called the 'instruct' model or the 'chat' model). 

You can download and run these models on your computer, or in a Google Colab notebook to see the effect that RLHF has on the model. 

You'll likely notice that base models are very tricky to get to do what you want, because their only focus is completing text.​

**Running on your computer**
You can run both models on your computer. To do this:
- Install **[Jan](https://jan.ai/)**
- Download the Gemma 2B **[base model](https://huggingface.co/MaziyarPanahi/gemma-2b-GGUF/blob/main/gemma-2b.Q4_K_M.gguf)** and **[instruction-following model](https://huggingface.co/MaziyarPanahi/gemma-2b-it-GGUF/blob/main/gemma-2b-it.Q4_K_M.gguf)** in GGUF format. This contains the weights and biases in the network, plus some metadata that explains how to run the model.
- In Jan, open the 'Hub'. Click 'Import Model' and select the GGUF files you downloaded.
- Switch back to the chat interface, then select the model in the dropdown on the right.
- Now try talking to the model, and try to get it to answer questions like 'What are good things to do in London?' and 'What's your job?'. You might find it interesting to change the prompt template in 'model parameters'.​

**Running in a Google Colab**
- You can see **[this notebook](https://github.com/google/generative-ai-docs/blob/main/site/en/gemma/docs/core/pytorch_gemma.ipynb)** for an example of running the instruction-following ("it") model. Use this as a starting point to get the base model running too. (if you've got a good Colab notebook that shows running both, **[let us know](https://bluedot.org/contact)** and we'll link to it from here)



(Optional) Play with base and RLHF models

Researchers are developing different approaches to overcoming some limitations of RLH(AI)F. For example:

- **Scaling feedback efficiently**. How might we provide good, cheap feedback to the model?
- **Supervising superhuman AI**. How might we evaluate superhuman models?

Without reliable oversight, models may develop dangerous behaviors like sycophancy (telling us what we want to hear), deception (hiding their true reasoning), and hallucination (confidently stating falsehoods).


Pick ONE technique from the optional resources (see below) to do a deeper dive on.

If you're completing this course with a group, post in Slack which technique you selected.

Based on what you’ve learned about the technique, explain in your own words using simple English (no jargon!):

- **Explain step-by-step.** How does this approach work to make AI safer? 
- **Evaluate its robustness.** How effective is this approach?
- **Describe a failure mode**.  How might a motivated, capable actor evade this?

_We recommend spending 30 minutes reading and 30 minutes writing._


Technical AI Safety

Can we train AI to be safe?