Unit 4: Scalable oversight

Resources: Scalable oversight

Resources (1 hr 40 mins)

Can we scale human feedback for complex AI tasks?
Adam Jones · 2024 · 15 min
Supervising strong learners by amplifying weak experts
Paul Christiano, Dario Amodei and Buck Shlegeris · 2018 · 15 min
AI safety via debate
Geoffrey Irving, Paul Christiano and Dario Amodei · 2018 · 30 min
Weak-to-strong generalization: Eliciting Strong Capabilities With Weak Supervision
Collin Burns, Pavel Izmailov and Jan Hendrik Kirchner et al. · 2023 · 40 min

Optional Resources

Measuring Progress on Scalable Oversight for Large Language Models
Samuel Bowman · 2022 · 30 min
An overview of 11 proposals for building safe advanced AI
Evan Hubinger · 2020
Factored cognition
Ought · 2019 · 30 min
Summarizing Books with Human Feedback
Jeffrey Wu, Ryan Lowe and Jan Leike · 2021 · 5 min
How to Keep Improving When You're Better Than Any Teacher - Iterated Distillation and Amplification
Robert Miles · 2019 · 12 min
Scalable agent alignment via reward modeling
Jan Leike · 2018 · 10 min
Constitutional AI: Harmlessness from AI Feedback
Yuntao Bai and Jared Kaplan · 2022 · 40 min
Humans can be assigned any values whatsoever
Stuart Armstrong · 2018 · 15 min
Cooperative inverse reinforcement learning
Dylan Hadfield-Menell, Anca D. Dragan and Pieter Abbeel et al. · 2016 · 40 min
Language Models Perform Reasoning via Chain of Thought
Jason Wei and Denny Zhou (Google) · 2022 · 10 min
Least-to-Most Prompting Enables Complex Reasoning in Large Language Models
Denny Zhou, Nathanael Schärli and Le Hou et al. · 2022 · 20 min
AI safety via market making
Evan Hubinger · 2020
WARM: On the Benefits of Weight Averaged Reward Models
Alexandre Ramé, Nino Vieillard and Léonard Hussenot et al. · 2024
Simple synthetic data reduces sycophancy in large language models
Jerry Wei, Da Huang and Yifeng Lu et al. · 2023
Compositional preference models for aligning LMs
Go, Korbak and Kruszewski et al. · 2024 · 30 min
Debating with More Persuasive LLMs Leads to More Truthful Answers
Akbir Khan, John Hughes and Dan Valentine et al. · 2024 · 20 min
