Technical AI Safety Puzzle #1

Most neural networks represent features linearly in their activations. Ours doesn’t. Can you interpret it?

The puzzle

We trained a model on short text inputs to predict eight binary features simultaneously (e.g. contains a person’s name, mentions a food, phrased as a question). After a particular layer of this model (call it layer L), seven of these features are represented linearly: each one is encoded by a single direction in activation space. The eighth feature is represented in a different way.

Your job is to figure out which feature it is and how it is represented.
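To make "represented linearly" concrete, here is a minimal sketch in Python with NumPy. It uses synthetic activations, not the puzzle model: the function names, the dimensions, and the difference-of-means probe are all assumptions chosen for illustration. If a binary feature is encoded along a single direction, projecting the activations onto that direction and thresholding should recover the label.

```python
import numpy as np

def make_linear_feature(n=2000, d=16, seed=0):
    """Synthetic stand-in for activations (hypothetical, not the puzzle data).

    The binary label shifts each activation vector along one fixed unit
    direction; everything else is isotropic Gaussian noise.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    signs = 2 * labels[:, None] - 1          # map {0, 1} to {-1, +1}
    acts = rng.normal(size=(n, d)) + 4.0 * signs * direction
    return acts, labels

def diff_of_means_probe(acts, labels):
    """Estimate a feature direction as the difference of class means."""
    mu1 = acts[labels == 1].mean(axis=0)
    mu0 = acts[labels == 0].mean(axis=0)
    w = mu1 - mu0
    b = -0.5 * w @ (mu1 + mu0)               # threshold halfway between the means
    return w, b

def probe_accuracy(acts, labels, w, b):
    """Fraction of examples where thresholding the projection gives the label."""
    preds = (acts @ w + b > 0).astype(int)
    return float((preds == labels).mean())
```

On data like this, calling diff_of_means_probe and then probe_accuracy should give accuracy close to 1. A feature that is not encoded along a single direction would not necessarily pass this check.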

Your three tasks

1. Find it. Identify which of the eight features is not represented linearly.
2. Explain how it is represented. Describe the geometric structure the model uses to represent it at layer L. Show the analysis you used to convince yourself.
3. Train a model with an even weirder representation. Train your own model that encodes this feature (or some other feature) in a more interesting way than ours. “More interesting” is up to you to define and defend.
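One possible way to start on the first task is to fit a linear probe per feature on held-out data and look for the feature whose probe performs worst. The sketch below is a toy illustration only. The data generator is hypothetical, and the magnitude-based encoding of the odd feature is an invented stand-in, not the puzzle's answer. The least-squares probe and all function names are likewise assumptions.

```python
import numpy as np

def fit_linear_probe(acts, labels):
    """Least-squares linear probe with a bias term, targets in {-1, +1}."""
    X = np.hstack([acts, np.ones((len(acts), 1))])
    y = 2.0 * labels - 1.0
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def probe_accuracy(acts, labels, w):
    """Accuracy of the sign of the probe output against binary labels."""
    X = np.hstack([acts, np.ones((len(acts), 1))])
    return float(((X @ w > 0).astype(int) == labels).mean())

def linear_probe_scores(acts, label_matrix, train_frac=0.5):
    """Held-out linear-probe accuracy for each feature column."""
    split = int(len(acts) * train_frac)
    scores = []
    for j in range(label_matrix.shape[1]):
        w = fit_linear_probe(acts[:split], label_matrix[:split, j])
        scores.append(probe_accuracy(acts[split:], label_matrix[split:, j], w))
    return np.array(scores)

def make_toy_data(n=4000, d=16, n_features=8, odd=3, seed=0):
    """Hypothetical activations: all features but `odd` get their own axis.

    The odd feature is encoded in the magnitude along its axis with a
    random sign, so no single direction separates its two classes.
    This encoding is invented for the example.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=(n, n_features))
    acts = 0.3 * rng.normal(size=(n, d))
    for j in range(n_features):
        if j != odd:
            acts[:, j] += 2.0 * (2 * labels[:, j] - 1)
    sign = rng.choice([-1.0, 1.0], size=n)
    acts[:, odd] += sign * (1.0 + 2.0 * labels[:, odd])
    return acts, labels
```

On this toy data, linear_probe_scores(acts, labels) should be near chance for the odd feature and near 1 for the others, while rerunning it on acts**2 should recover the odd feature. In the real puzzle the structure may be entirely different, so a low linear score is only a starting point for the geometric analysis asked for in the second task.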


Winners

🥇

1st place · $1,000

Gustavo Korzune Gurgel

🥈

2nd place · $750

Patryk Perduta

🥉

3rd place · $500

Sam Spillard

🏅

Honourable mentions · $250 each

Karine Levonyan
Phu Gia Hoang
Michael Zlatin

Correct submissions

Abderrahmene Hamdi
Abdullah X
Abhinav Chand
Adam Webb
Aditya Suresh
Adrit Chaudhuri
Adyasha Patra
Agustin Brusco
Akshit Jindal
Alexandros (Alex) Doumanoglou
Alexei Gannon
Alfred Muir
Ali Al Sahili
Amen Abebe Gebreyohanes
Anay Dongre
Andrej Kotevski
Aneetej Arora
Anshuman Singh
Archie Licudi
Aresh Pourkavoos
Ariana Villegas Suarez
Arjhun Swaminathan
Art Moskvin
Artem Zhuravel
Astley F
Atharv Kshirsagar
Baijun Qiao
Brody McNutt
Carmen Hilbert
Cedric Kopp
Collin Francel
Dabal Pedamonti
Daniel Tennant
Daniel Zhang
Dauzhan Beketov
David Zeidler
Denis Lim
Diego Oliver
Edward Cant
Elsie Jang
Emily Chen
Emma Kong
Eric Enouen
Eric Todd
Ethan Kuntz
Evan Redden
Fan Wu
Felix Marti-Perez
Girish Koushik
Goutham Nalagatla
Gustavo Korzune Gurgel
Han Xiao
Harinarayan Asoori Sriram
Harshvardhan Saini
Hugo De Bosschere
Husam Usman
Ian Nielsen
Igor Pereverzev
Ishaan Shrivastava
Jacob Ortiz
Jan Ebbing
Janmenjaya Panda
Javier Masis
Jishu Sen Gupta
Johan Daniel
Julian Quick
Justin Shenk
Karine Levonyan
Karly Hou
Kiran Pal
kushal garg
Lasse Jantsch
lloyd situmbeko
Mahesh Pandit
Maksim Silchenko
Matthew Duff
Maximilian Plattner
Mayank Kamboj
Michael Hanna
Michael Zlatin
Michał Burzyński
Mihir Sahasrabudhe
Minh Hoang
Nathanaël Fijalkow
Neerav Durejs
Nichita Mitrea
Nikoloz Gegenava
Nilanjan Sarkar
Nithil Ravikumar
Noè Canevascini
Ojonugwa Ejiga Peter
Oliver Sieweke
Olivia Zhang
Omar Darwish
Omari March
Owen Sweeney
Parin Thakkar
Patrick O'Donnell
Patryk Perduta
Pavan Kumar Dubasi
Phu Gia Hoang
Pol Pastells
Razan Alsulieman
Richie Mendelsohn
Robin Haselhorst
Rohit Kaushik
Roksana Goworek
Roman Kniazev
Sahil Kapadia
Sam Spillard
Samuel Liew
Santiago Maniches
Sean Murphy
Sharat Jacob
Shiv Munagala
Shivang Kumar Dubey
Shubh Varshney
Sidar Aslanoglu
Simon Elias Schrader
Soham Takawadekar
Suman Kumar Subudhi
Sumit Vekariya
Syed Adil Ahmed
Teunis Mulder
Thomas Johnson
Ti-Lin Chou
Tobias Bersia
Tommy Mancino
Tomás Korenblit
Tuyen Tran
Uday Phalak
Utsav Shah
Uttirn Gyan
Varsha Otta
Vayk Mathrani
Venkat T
Viraaj Minhas
Vishesh Gupta
Yash Bhisikar

...and 14 others who chose not to be named.