Unit 4: Task decomposition for scalable oversight

Resources: Task decomposition for scalable oversight

Resources (1 hr 50 mins)

AI Alignment Landscape
Paul Christiano · 2020 · 30 min
Measuring Progress on Scalable Oversight for Large Language Models
Samuel Bowman · 2022 · 5 min
Learning Complex Goals with Iterated Amplification
Paul Christiano and Dario Amodei · 2018 · 5 min
Supervising strong learners by amplifying weak experts
Paul Christiano and Dario Amodei and Buck Shlegeris · 2018 · 35 min
Summarizing Books with Human Feedback
Jeffrey Wu and Ryan Lowe and Jan Leike · 2021 · 5 min
Language Models Perform Reasoning via Chain of Thought
Jason Wei and Denny Zhou (Google) · 2022 · 10 min
Least-to-Most Prompting Enables Complex Reasoning in Large Language Models
Denny Zhou, Nathanael Scharli and Le Hou et al. · 2022 · 20 min

Optional Resources

Factored cognition
Ought · 2019 · 5 min
Scalable agent alignment via reward modeling: a research direction
Jan Leike, David Krueger and Tom Everitt et al. · 2018
Adversarial Training for High-Stakes Reliability
Daniel Ziegler, Seraphina Nix and Tim Bauman et al. · 2022
Reward-rational (implicit) choice: A unifying formalism for reward learning
Hong Jun Jeon and Smitha Milli and Anca D. Dragan · 2020 · 60 min
Humans can be assigned any values whatsoever
Stuart Armstrong · 2018 · 15 min
The MineRL BASALT Competition on Learning from Human Feedback
Rohin Shah, Cody Wild and Steven H. Wang et al. · 2021 · 25 min
Humans Consulting HCH
Paul Christiano · 2016 · 15 min
Strong HCH
Paul Christiano · 2016
A General Language Assistant as a Laboratory for Alignment
Amanda Askell, Yuntao Bai and Anna Chen et al. · 2021 · 40 min
Learning the preferences of bounded agents
Owain Evans and Andreas Stuhlmuller and Noah D. Goodman · 2015 · 25 min
Cooperative inverse reinforcement learning
Dylan Hadfield-Menell, Anca D. Dragan and Pieter Abbeel et al. · 2016 · 40 min
Training language models with language feedback
Jeremy Scheurer, Jon Ander Campos and Jun Shern Chan et al. · 2022

AI Alignment (2023)
