**This course has been replaced by our new [Technical AI Safety course](https://bluedot.org/courses/technical-ai-safety).**

How might we scale human feedback for more powerful and complex models?

In the previous units, we explored the challenge of AI alignment and how RLHF is used to tame today's language models. However, we also learned that RLHF has many limitations. One of the biggest is that it can be hard for humans to accurately judge complex tasks, which contributes to problems like sycophancy, deception and hallucinations.

In this unit, we dive into the problem of "scalable oversight": how can we effectively provide feedback to AI systems doing tasks that are hard for humans to judge? We will critically evaluate a few proposed approaches to this problem, including iterated amplification, debate and weak-to-strong generalisation.

By the end of the unit, you should be able to:
- Understand the problem of scalable oversight
- Critically evaluate some proposed approaches to scalable oversight, including:
    - Task decomposition: the factored cognition hypothesis, iterated amplification (IA), and iterated distillation and amplification (IDA)
    - Debate
    - Weak-to-strong generalisation
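
To make the task decomposition idea concrete, here is a toy sketch in Python. It is an illustration under stated assumptions, not an implementation of IA or IDA: the "overseer" is a hypothetical stand-in that can only add two numbers, and the function names are invented for this example. The point is that a task too large to judge in one step (summing a long list) can be split recursively into subtasks that are each small enough to handle. The factored cognition hypothesis is roughly the bet that many hard tasks decompose in a similar way.

```python
from typing import List


def weak_overseer_add(a: int, b: int) -> int:
    """Hypothetical overseer: can only carry out (or check) one small step."""
    return a + b


def amplified_sum(numbers: List[int]) -> int:
    """Solve a large task by recursively splitting it into small subtasks.

    Each leaf subtask is trivial, and every combining step is something
    the weak overseer can do directly.
    """
    if not numbers:
        return 0
    if len(numbers) == 1:
        return numbers[0]
    mid = len(numbers) // 2
    left = amplified_sum(numbers[:mid])   # subtask 1
    right = amplified_sum(numbers[mid:])  # subtask 2
    return weak_overseer_add(left, right)  # small step the overseer can handle


if __name__ == "__main__":
    print(amplified_sum(list(range(1, 101))))
```

This sketch covers only the decomposition step. It does not model the distillation step of IDA, where a model is trained to imitate the amplified system, and real tasks rarely split as cleanly as addition does.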


Pick one of the following approaches:
- Weak-to-strong generalisation
- Constitutional AI
- [Iterated distillation and amplification](https://ai-alignment.com/iterated-distillation-and-amplification-157debfd1616) ([alternate](https://www.alignmentforum.org/posts/PT8vSxsusqWuN7JXp/my-understanding-of-paul-christiano-s-iterated-amplification), [video](https://www.youtube.com/watch?v=v9M2Ho9I9Qo))
- [Synthetic data fine-tuning](https://arxiv.org/pdf/2308.03958)
- [Weight average reward models](https://arxiv.org/pdf/2401.12187)
- [Market making](https://www.alignmentforum.org/posts/YWwzccGbcHMJMpT45/ai-safety-via-market-making)
- [Compositional preference models](https://arxiv.org/pdf/2310.13011)
- [Externalized Reasoning Oversight](https://www.alignmentforum.org/posts/FRRb6Gqem8k69ocbi/externalized-reasoning-oversight-a-research-direction-for) (or a [paper](https://arxiv.org/pdf/2307.13702))

Choosing an approach not covered by the core resources can help develop research skills and give you a broader understanding. However, if you’re busy or less comfortable with technical topics, feel free to go for a core resource approach (one of the first three).

Tell your cohort (in Slack) what you've selected, ideally coordinating so that between you the cohort covers a range of approaches. Then answer the following:
1. What is this approach, and how does it work?
2. How does this help human feedback more effectively align AI systems, particularly on more complex tasks?
3. How does it compare to other scalable oversight approaches, and to RLHF? For example, is it particularly similar to any other approaches?
4. Why do you think it could be effective? Really think about _your_ beliefs - it’s fine to use pros and cons raised by other people, but explain why those ones in particular seem correct to you.
5. What limitations does it have, or why do _you_ think it might not be successful?
6. If this approach worked well, how impactful might this be at aligning AI systems? For example, what other work would then need to be done to ‘solve’ alignment, or how robustly does this seem to align AI systems?



Evaluate an approach

These questions should be answerable using only the core resources above. Write your answers to the following questions in the box below.

1. What is the main challenge that scalable oversight techniques aim to address?
2. Give an example of a complex task that might be difficult for a human to provide feedback on directly.
3. What are deception and sycophancy, and why might they arise with RLHF?
4. Give brief explanations for the following approaches: task decomposition, recursive reward modelling, constitutional AI, debate, and weak-to-strong generalisation.
5. What is the factored cognition hypothesis?
6. How does the factored cognition hypothesis relate to task decomposition approaches like iterated amplification?
7. How might you use task decomposition to design a house?
8. How could debate improve truthfulness of AI systems?
9. Why might debate improve the capabilities of AI systems?
10. What is a concern about debate mentioned in the original debate paper?
11. What was the goal of the OpenAI weak-to-strong generalisation experiments?
12. What disanalogies are there between this experimental setup and humans aligning superhuman systems?
13. The performance gap recovered through weak-to-strong generalisation appears ______ [consistent/inconsistent] between tasks.
14. Weak-to-strong generalisation appears to follow ______ [predictable/unpredictable] trends within tasks.
15. What two techniques did OpenAI find helped with weak-to-strong generalisation?
16. Based on this paper, what did OpenAI conclude about the ability of RLHF to scale to superhuman models, and why?

You can check your answers against [the answer key](https://docs.google.com/document/d/1z3WkjdaB7Jb3DRCGGF72HdNnkvSr4zpoolfMCBmimIc/edit).
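
The "performance gap recovered" (PGR) metric mentioned in the questions above can be sketched as a short calculation. This is a minimal illustration: the function and parameter names are our own, and the accuracies in the example are made-up numbers, not results from the paper. PGR measures what fraction of the gap between the weak supervisor and the strong model's ceiling performance is recovered when the strong model is trained only on the weak supervisor's labels.

```python
def performance_gap_recovered(
    weak: float, weak_to_strong: float, strong_ceiling: float
) -> float:
    """Fraction of the weak-to-ceiling gap recovered by weak supervision.

    weak:            performance of the weak supervisor
    weak_to_strong:  performance of the strong model trained on weak labels
    strong_ceiling:  performance of the strong model trained on ground truth
    """
    gap = strong_ceiling - weak
    if gap <= 0:
        raise ValueError("strong_ceiling must exceed weak performance")
    return (weak_to_strong - weak) / gap


if __name__ == "__main__":
    # Hypothetical accuracies, chosen only to illustrate the arithmetic.
    pgr = performance_gap_recovered(
        weak=0.60, weak_to_strong=0.75, strong_ceiling=0.80
    )
    print(f"PGR = {pgr:.2f}")
```

A PGR of 1 would mean the strong model reaches its ceiling despite the weak labels, while a PGR of 0 would mean it does no better than its weak supervisor.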

