Technical AI Safety Puzzle #1
Most neural networks represent features linearly in their activations. Ours doesn’t. Can you interpret it?

The puzzle
We trained a model on short text inputs to predict eight binary features simultaneously (e.g. contains a person’s name, mentions a food, phrased as a question). After a particular layer of this model (call it layer L), seven of these features are represented linearly: for each one, a single direction in activation space encodes that feature. The remaining feature is represented in a different way.
Your job is to figure out which feature it is and how it is represented.
Your three tasks
- 1. Find it. Identify which of the eight features is not represented linearly.
- 2. Explain how it is represented. Describe the geometric structure the model uses to represent it at layer L. Show the analysis you used to convince yourself.
- 3. Train a model with an even weirder representation. Train your own model that encodes it (or some other feature) in a more interesting way than ours. “More interesting” is up to you to define and defend.
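One possible starting point for the first task is to fit a linear probe per feature and see which one a linear readout fails to recover. The sketch below is only an illustration on synthetic data: the activations, the feature count of three, the radial encoding of the last feature, and the helper name `linear_probe_accuracy` are all assumptions made for the example. It says nothing about how the puzzle model actually encodes its features.

```python
import numpy as np


def linear_probe_accuracy(acts, labels, seed=0):
    """Fit a least-squares linear probe on half the data, return held-out accuracy."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(acts))
    half = len(acts) // 2
    train, test = idx[:half], idx[half:]
    X = np.hstack([acts, np.ones((len(acts), 1))])  # append a bias column
    y = 2.0 * labels - 1.0  # map {0, 1} to {-1, +1}
    w, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    preds = (X[test] @ w) > 0
    return float(np.mean(preds == labels[test].astype(bool)))


# Synthetic stand-in for layer activations (NOT the puzzle model).
rng = np.random.default_rng(0)
n, d = 4000, 16
labels = rng.integers(0, 2, size=(n, 3))
acts = rng.normal(scale=0.3, size=(n, d))

# Features 0 and 1: encoded linearly along axes 0 and 1.
acts[:, 0] += 2.0 * (2 * labels[:, 0] - 1)
acts[:, 1] += 2.0 * (2 * labels[:, 1] - 1)

# Feature 2: encoded by radius in the plane of axes 2 and 3.
# This choice is purely illustrative of a non-linear encoding.
theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
radius = np.where(labels[:, 2] == 1, 3.0, 1.0)
acts[:, 2] += radius * np.cos(theta)
acts[:, 3] += radius * np.sin(theta)

accs = [linear_probe_accuracy(acts, labels[:, k]) for k in range(3)]
for k, acc in enumerate(accs):
    print(f"feature {k}: held-out linear probe accuracy = {acc:.3f}")
```

In this toy setup the linearly encoded features are recovered almost perfectly, while the radially encoded one stays near chance. A low probe accuracy only flags a candidate; the second task still requires looking at the geometry directly, for example by projecting activations onto a few principal components and colouring points by label.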
Correct submissions
- Abderrahmene Hamdi
- Abdullah X
- Abhinav Chand
- Adam Webb
- Aditya Suresh
- Adrit Chaudhuri
- Adyasha Patra
- Agustin Brusco
- Akshit Jindal
- Alexandros (Alex) Doumanoglou
- Alexei Gannon
- Alfred Muir
- Ali Al Sahili
- Amen Abebe Gebreyohanes
- Anay Dongre
- Andrej Kotevski
- Aneetej Arora
- Anshuman Singh
- Archie Licudi
- Aresh Pourkavoos
- Ariana Villegas Suarez
- Arjhun Swaminathan
- Art Moskvin
- Artem Zhuravel
- Astley F
- Atharv Kshirsagar
- Baijun Qiao
- Brody McNutt
- Carmen Hilbert
- Cedric Kopp
- Collin Francel
- Dabal Pedamonti
- Daniel Tennant
- Daniel Zhang
- Dauzhan Beketov
- David Zeidler
- Denis Lim
- Diego Oliver
- Edward Cant
- Elsie Jang
- Emily Chen
- Emma Kong
- Eric Enouen
- Eric Todd
- Ethan Kuntz
- Evan Redden
- Fan Wu
- Felix Marti-Perez
- Girish Koushik
- Goutham Nalagatla
- Gustavo Korzune Gurgel
- Han Xiao
- Harinarayan Asoori Sriram
- Harshvardhan Saini
- Hugo De Bosschere
- Husam Usman
- Ian Nielsen
- Igor Pereverzev
- Ishaan Shrivastava
- Jacob Ortiz
- Jan Ebbing
- Janmenjaya Panda
- Javier Masis
- Jishu Sen Gupta
- Johan Daniel
- Julian Quick
- Justin Shenk
- Karine Levonyan
- Karly Hou
- Kiran Pal
- kushal garg
- Lasse Jantsch
- lloyd situmbeko
- Mahesh Pandit
- Maksim Silchenko
- Matthew Duff
- Maximilian Plattner
- Mayank Kamboj
- Michael Hanna
- Michael Zlatin
- Michał Burzyński
- Mihir Sahasrabudhe
- Minh Hoang
- Nathanaël Fijalkow
- Neerav Durejs
- Nichita Mitrea
- Nikoloz Gegenava
- Nilanjan Sarkar
- Nithil Ravikumar
- Noè Canevascini
- Ojonugwa Ejiga Peter
- Oliver Sieweke
- Olivia Zhang
- Omar Darwish
- Omari March
- Owen Sweeney
- Parin Thakkar
- Patrick O'Donnell
- Patryk Perduta
- Pavan Kumar Dubasi
- Phu Gia Hoang
- Pol Pastells
- Razan Alsulieman
- Richie Mendelsohn
- Robin Haselhorst
- Rohit Kaushik
- Roksana Goworek
- Roman Kniazev
- Sahil Kapadia
- Sam Spillard
- Samuel Liew
- Santiago Maniches
- Sean Murphy
- Sharat Jacob
- Shiv Munagala
- Shivang Kumar Dubey
- Shubh Varshney
- Sidar Aslanoglu
- Simon Elias Schrader
- Soham Takawadekar
- Suman Kumar Subudhi
- Sumit Vekariya
- Syed Adil Ahmed
- Teunis Mulder
- Thomas Johnson
- Ti-Lin Chou
- Tobias Bersia
- Tommy Mancino
- Tomás Korenblit
- Tuyen Tran
- Uday Phalak
- Utsav Shah
- Uttirn Gyan
- Varsha Otta
- Vayk Mathrani
- Venkat T
- Viraaj Minhas
- Vishesh Gupta
- Yash Bhisikar
...and 14 others who chose not to be named.
What you get
- 1st: $1,000
- 2nd: $750
- 3rd: $500
- Honourable mentions: $250 each
Winners to be announced soon.
Deadline passed