Resources: Mechanistic interpretability

Exercises: Mechanistic interpretability

Resources: Rapidly testing your project

Exercises: Rapidly testing your project

Resources: Building in public

Exercises: Building in public

Next steps: Programs

Resources: Developing your project

Exercises: Developing your project

Resources: Technical governance approaches

Exercises: Technical governance approaches

Resources: AI and the years ahead

Exercises: AI and the years ahead

Resources: Contributing to AI safety

Exercises: Contributing to AI safety

Resources: What is AI alignment?

Exercises: What is AI alignment?

Resources: Further developing your project

Exercises: Further developing your project

Resources: Scalable oversight

Exercises: Scalable oversight

Resources: Robustness unlearning and control

Exercises: Robustness unlearning and control

Resources: Reinforcement learning from human (or AI) feedback

Exercises: Reinforcement learning from human (or AI) feedback

**This course has been replaced by our new [Technical AI Safety course](https://bluedot.org/courses/technical-ai-safety).**

Why do AI systems today mostly do what we want?

Last week, we looked at the difficulties of getting very powerful AI systems to do what we want. Despite this, we do have \[many AI systems today]\(<https://theaidigest.org/claude-vs-gemini-vs-chatgpt>) that do seem to try to do what we want.


In this unit, we’ll dive into the main way today's AI systems achieve this: using Reinforcement Learning from Human Feedback (RLHF). We’ll dive into how RLHF works, where it fails, and why it fails in the way that it does. We’ll evaluate how likely it is that RLHF continues to work for more powerful models, as well as review an approach that tries to improve on RLHF’s shortcomings known as Constitutional AI.

By the end of the unit, you should be able to:
\- Explain how reinforcement learning from human feedback (RLHF) can be used to make models more helpful, harmless and honest.
\- Describe challenges with RLHF, including sycophancy, deception, undesirable preferences, difficulty with complex tasks, over- and under-censoring.
\- Explain the ‘Constitutional AI’ approach being by pursued by Anthropic, and its similarities and differences to RLHF.
\- Evaluate the promise of RLHF approaches.


These questions should be answerable using only the core resources above. Write your answers to the following questions in the box below. 

General reinforcement learning from human feedback (RLHF):
1．What is the main goal of using RLHF with large language models?
2．Describe the job of the two "coaches" involved in the RLHF process.
3．What does the "values coach" correspond to in the Lambert piece? 
4．What does the "coherence coach" correspond to in the Lambert piece?
5．In practice, it's hard for humans to give consistent scalar feedback (e.g. rate this text from 1 to 10). Instead, how do we collect feedback from humans? How do we then turn this into a scalar number?
6．In the GPT-2 case study, what went wrong that caused the model to produce "maximally bad output"?
7．If the same mistake was made in RLHF while training an assistant to follow all the rules in [Wikipedia’s Manual of Style](https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style), how do you think the model would behave?

Continuing the general RLHF questions, but reviewing [this diagram from AWS](https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2023/08/31/ML-14874_image001.jpg).
8．What is happening in step 1?
9．Other than human demonstrations purpose-written for fine-tuning AI models, what other data might models use for fine-tuning in step 1?
10．What is happening in step 2? Be sure to explain this in detail.
11．Finally, what is happening in step 3?
12．Why don't we just do step 1, without bothering with steps 2 and 3?
13．Why don't we just do step 2 and step 3, without bothering with step 1?

Constitutional AI (CAI):
14．What are the two stages of CAI?
15．How do these two stages compare to RLHF? (consider referring to the steps in the AWS diagram above)
16．Why might companies that care about inner misalignment use CAI?
17．Why might companies that care about outer misalignment use CAI?
18．Why might companies use CAI, other than for safety purposes?
19．Why does CAI use critiques, instead of just making revisions directly? (hint: section 3.5)

RLHF limitations:
20．Summarise an open problem of RLHF in your own words.
21．Summarise a fundamental problem of RLHF in your own words.

You can check your answers against [the answer key](https://docs.google.com/document/d/1pr0nssiGs4ja_epvCPYhLzxlBTaqaDmoHmifjecETAE/edit).


Comprehension questions

Starting from a base LLM, explain how to train it using RLHF to be a helpful assistant that answers following all the rules in [Wikipedia’s Manual of Style](https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style).

Write about 400-800 words. Your answer should cover:
- Using supervised fine-tuning with augmented data or human demonstrations
- Collecting feedback from humans (assume you have access to 100 expert Wikipedia editors who know all the rules)
- Using this feedback to influence the model outputs


Explaining RLHF in your own words

Some models, such as Llama 2 or Gemma, are available both as the base model (the weights and biases before RLHF) and the instruction-following model (the weights and biases after RLHF - sometimes called the 'instruct' model or the 'chat' model).
You can download and run these models on your computer, or in a Google Colab notebook to see the effect that RLHF has on the model. You'll likely notice that base models are very tricky to get to do what you want, because their only focus is completing text.

**Running on your computer**
You can run both models on your computer. To do this:
- Install [Jan](https://jan.ai/)
- Download the Gemma 2B [base model](https://huggingface.co/MaziyarPanahi/gemma-2b-GGUF/blob/main/gemma-2b.Q4_K_M.gguf) and [instruction-following model](https://huggingface.co/MaziyarPanahi/gemma-2b-it-GGUF/blob/main/gemma-2b-it.Q4_K_M.gguf) in GGUF format. This contains the weights and biases in the network, plus some metadata that explains how to run the model.
- In Jan, open the 'Hub'. Click 'Import Model' and select the GGUF files you downloaded.
- Switch back to the chat interface, then select the model in the dropdown on the right.
- Now try talking to the model, and try to get it to answer questions like 'What are good things to do in London?' and 'What's your job?'. You might find it interesting to change the prompt template in 'model parameters'.

**Running in a Google Colab**
- You can see [this notebook](https://colab.research.google.com/github/google/generative-ai-docs/blob/main/site/en/gemma/docs/pytorch_gemma.ipynb) for an example of running the instruction-following ("it") model. Use this as a starting point to get the base model running too.
(if you've got a good Colab notebook that shows running both, [let us know](https://bluedot.org/contact) and we'll link to it from here)


[Optional] Play with base and RLHF models

For the more technically inclined, the ARENA curriculum offers [a section on RLHF](https://arena3-chapter2-rl.streamlit.app/[2.4]_RLHF). You may need to read the earlier ARENA content, particularly [chapter 0](https://arena3-chapter0-fundamentals.streamlit.app/), to set up an environment and understand more of the code.
This is a difficult exercise, and one that we expect to take at least a day even for experienced ML engineers.
You may be able to find collaborators to work through this together in your cohort, the [#find-collaborators](https://aisafetyfundamentals.slack.com/archives/C0280PH510U) or [#discussion](https://aisafetyfundamentals.slack.com/archives/C025HB1LUE7) Slack channels.


AI Alignment

Exercises: Reinforcement learning from human (or AI) feedback

Exercises