Technical AI Safety Puzzle #1

Most neural networks represent features linearly in their activations. Ours doesn’t. Can you interpret it?

[Figure: Visualisation of the puzzle]

The puzzle

We trained a model on short text inputs to predict eight binary features simultaneously (e.g. contains a person’s name, mentions a food, phrased as a question). After a particular layer of this model (layer L), seven of these features are represented linearly: a single direction in activation space encodes each of them. The remaining feature is represented in a different way.

Your job is to figure out which feature it is and how it is represented.
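
To make “represented linearly” concrete, here is a minimal sketch of a single-direction readout for one binary feature, using the difference of class means as the direction. It runs on synthetic activations, not on the puzzle model; the function name `diff_of_means_accuracy` and all sizes and scales are assumptions chosen for illustration.

```python
import numpy as np

def diff_of_means_accuracy(acts, labels):
    """Accuracy of reading one binary feature off a single direction.

    The direction is the difference between the class means, and the
    decision boundary passes through the midpoint of the two means.
    """
    mu_pos = acts[labels == 1].mean(axis=0)
    mu_neg = acts[labels == 0].mean(axis=0)
    direction = mu_pos - mu_neg
    midpoint = (mu_pos + mu_neg) / 2
    preds = ((acts - midpoint) @ direction > 0).astype(int)
    return float((preds == labels).mean())

# Synthetic stand-in for layer activations (assumption, not the puzzle data).
rng = np.random.default_rng(0)
n, d = 2000, 16
labels = rng.integers(0, 2, size=n)

# A random unit vector plays the role of the feature direction.
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)

# Each activation is Gaussian noise plus a shift of +/-2 along true_dir.
acts = rng.normal(size=(n, d)) + 4.0 * np.outer(labels - 0.5, true_dir)

linear_acc = diff_of_means_accuracy(acts, labels)
print(f"single-direction readout accuracy: {linear_acc:.3f}")
```

On real activations the direction should be fitted on one split of the data and the accuracy measured on another, so that a high score is not an artefact of overfitting.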

Your three tasks

  1. Find it. Identify which of the eight features is not represented linearly.
  2. Explain how it is represented. Describe the geometric structure the model uses to represent it at layer L. Show the analysis you used to convince yourself.
  3. Train a model with an even weirder representation. Train your own model that encodes it (or some other feature) in a more interesting way than ours. “More interesting” is up to you to define and defend.
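
One possible way to approach the first two tasks is to fit a linear probe per feature on held-out data and flag the feature that the probe cannot decode, then check whether a richer probe recovers it. The sketch below does this on synthetic activations. The radius-based encoding of the eighth feature is a hypothetical stand-in, not a claim about how the puzzle model represents anything, and `probe_accuracy`, the dimensions, and the noise scale are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 4000, 16, 8  # samples, activation width, number of features
labels = rng.integers(0, 2, size=(n, k))

# Synthetic stand-in activations (NOT the puzzle model).
acts = 0.3 * rng.normal(size=(n, d))
for j in range(k - 1):
    # Features 0..6: one axis-aligned direction each, shifted by +/-2.
    acts[:, j] += 4.0 * (labels[:, j] - 0.5)

# Feature 7 (hypothetical encoding): stored as a radius in the plane
# spanned by dimensions 14 and 15, so both classes share the same mean.
theta = rng.uniform(0, 2 * np.pi, size=n)
radius = 3.0 * labels[:, k - 1]
acts[:, 14] += radius * np.cos(theta)
acts[:, 15] += radius * np.sin(theta)

def probe_accuracy(x_train, y_train, x_test, y_test):
    """Least-squares probe with a bias term, thresholded at 0.5."""
    a_train = np.hstack([x_train, np.ones((len(x_train), 1))])
    a_test = np.hstack([x_test, np.ones((len(x_test), 1))])
    w, *_ = np.linalg.lstsq(a_train, y_train.astype(float), rcond=None)
    preds = a_test @ w > 0.5
    return float((preds == (y_test == 1)).mean())

split = n // 2  # first half for fitting, second half for evaluation

# Step 1: linear probe per feature; the worst-decoded feature is the suspect.
linear_acc = [
    probe_accuracy(acts[:split], labels[:split, j],
                   acts[split:], labels[split:, j])
    for j in range(k)
]
suspect = int(np.argmin(linear_acc))

# Step 2: give the probe squared activations and see if the suspect
# feature becomes decodable, which hints at a norm-like encoding.
quad = np.hstack([acts, acts ** 2])
quad_acc = probe_accuracy(quad[:split], labels[:split, suspect],
                          quad[split:], labels[split:, suspect])

for j, acc in enumerate(linear_acc):
    print(f"feature {j}: linear probe accuracy {acc:.3f}")
print(f"suspect feature {suspect}: quadratic probe accuracy {quad_acc:.3f}")
```

A low linear-probe score only shows that the feature is not read off by one direction. Describing the actual geometry still takes further analysis, for example plotting the activations in the subspace where the two classes differ.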

Correct submissions

  • Abderrahmene Hamdi
  • Abdullah X
  • Abhinav Chand
  • Adam Webb
  • Aditya Suresh
  • Adrit Chaudhuri
  • Adyasha Patra
  • Agustin Brusco
  • Akshit Jindal
  • Alexandros (Alex) Doumanoglou
  • Alexei Gannon
  • Alfred Muir
  • Ali Al Sahili
  • Amen Abebe Gebreyohanes
  • Anay Dongre
  • Andrej Kotevski
  • Aneetej Arora
  • Anshuman Singh
  • Archie Licudi
  • Aresh Pourkavoos
  • Ariana Villegas Suarez
  • Arjhun Swaminathan
  • Art Moskvin
  • Artem Zhuravel
  • Astley F
  • Atharv Kshirsagar
  • Baijun Qiao
  • Brody McNutt
  • Carmen Hilbert
  • Cedric Kopp
  • Collin Francel
  • Dabal Pedamonti
  • Daniel Tennant
  • Daniel Zhang
  • Dauzhan Beketov
  • David Zeidler
  • Denis Lim
  • Diego Oliver
  • Edward Cant
  • Elsie Jang
  • Emily Chen
  • Emma Kong
  • Eric Enouen
  • Eric Todd
  • Ethan Kuntz
  • Evan Redden
  • Fan Wu
  • Felix Marti-Perez
  • Girish Koushik
  • Goutham Nalagatla
  • Gustavo Korzune Gurgel
  • Han Xiao
  • Harinarayan Asoori Sriram
  • Harshvardhan Saini
  • Hugo De Bosschere
  • Husam Usman
  • Ian Nielsen
  • Igor Pereverzev
  • Ishaan Shrivastava
  • Jacob Ortiz
  • Jan Ebbing
  • Janmenjaya Panda
  • Javier Masis
  • Jishu Sen Gupta
  • Johan Daniel
  • Julian Quick
  • Justin Shenk
  • Karine Levonyan
  • Karly Hou
  • Kiran Pal
  • kushal garg
  • Lasse Jantsch
  • lloyd situmbeko
  • Mahesh Pandit
  • Maksim Silchenko
  • Matthew Duff
  • Maximilian Plattner
  • Mayank Kamboj
  • Michael Hanna
  • Michael Zlatin
  • Michał Burzyński
  • Mihir Sahasrabudhe
  • Minh Hoang
  • Nathanaël Fijalkow
  • Neerav Durejs
  • Nichita Mitrea
  • Nikoloz Gegenava
  • Nilanjan Sarkar
  • Nithil Ravikumar
  • Noè Canevascini
  • Ojonugwa Ejiga Peter
  • Oliver Sieweke
  • Olivia Zhang
  • Omar Darwish
  • Omari March
  • Owen Sweeney
  • Parin Thakkar
  • Patrick O'Donnell
  • Patryk Perduta
  • Pavan Kumar Dubasi
  • Phu Gia Hoang
  • Pol Pastells
  • Razan Alsulieman
  • Richie Mendelsohn
  • Robin Haselhorst
  • Rohit Kaushik
  • Roksana Goworek
  • Roman Kniazev
  • Sahil Kapadia
  • Sam Spillard
  • Samuel Liew
  • Santiago Maniches
  • Sean Murphy
  • Sharat Jacob
  • Shiv Munagala
  • Shivang Kumar Dubey
  • Shubh Varshney
  • Sidar Aslanoglu
  • Simon Elias Schrader
  • Soham Takawadekar
  • Suman Kumar Subudhi
  • Sumit Vekariya
  • Syed Adil Ahmed
  • Teunis Mulder
  • Thomas Johnson
  • Ti-Lin Chou
  • Tobias Bersia
  • Tommy Mancino
  • Tomás Korenblit
  • Tuyen Tran
  • Uday Phalak
  • Utsav Shah
  • Uttirn Gyan
  • Varsha Otta
  • Vayk Mathrani
  • Venkat T
  • Viraaj Minhas
  • Vishesh Gupta
  • Yash Bhisikar

...and 14 others who chose not to be named.

What you get

  • 1st: $1,000
  • 2nd: $750
  • 3rd: $500
  • Honourable mentions: $250 each

Winners to be announced soon.

Deadline passed