AI Alignment (2024 Jun) Project

Exploring the use of Mechanistic Interpretability to Craft Adversarial Attacks

Thomas Winninger • Top submissionOctober 2024