_This project was completed on our AI Alignment (2024 Jun) course. The text below is an excerpt from the final project._

Code available here: [github.com/Sckathach/mi-gcg/](https://github.com/Sckathach/mi-gcg/)

This work is my capstone project for the [AISF Alignment course](https://aisafetyfundamentals.com/alignment/), which I followed over the past 12 weeks. It adapts the Greedy Coordinate Gradient (GCG) attack to exploit the refusal subspace in large language models. Rather than targeting output-space gradients, it optimises prompts to evade the refusal subspace identified in LLM activations, showcasing the potential use of mechanistic interpretability (MI) to craft adversarial attacks. However, the results show that evading the refusal subspace **is not sufficient to craft an adversarial string**. If you're only interested in the core of the work, feel free to skip ahead to [Section 3](https://sckathach.github.io/mech-interp/exploring-adversarial-mi/#sec-method) or [Section 4](https://sckathach.github.io/mech-interp/exploring-adversarial-mi/#sec-results).
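To make "evading the refusal subspace" concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the project's implementation. The subspace is simplified to a single direction, estimated as a difference of mean activations between harmful and harmless prompts. The activations are made-up two-dimensional toy vectors, and the squared projection is only one plausible choice of objective. All function names are hypothetical. The actual code is in the repository linked above.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def refusal_direction(harmful_acts, harmless_acts):
    """Unit-norm difference of means between two sets of activations.

    Assumption: one direction stands in for the refusal subspace.
    """
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

def refusal_score(activation, direction):
    """Squared projection of an activation onto the direction.

    Assumption: a prompt-optimising attack would try to lower this value
    instead of a loss defined on the model's output tokens.
    """
    dot = sum(a * d for a, d in zip(activation, direction))
    return dot * dot

def ablate(activation, direction):
    """Remove the component of an activation that lies along the direction."""
    dot = sum(a * d for a, d in zip(activation, direction))
    return [a - dot * d for a, d in zip(activation, direction)]

# Toy activations (invented for illustration, not taken from any model).
harmful_acts = [[2.0, 0.0], [4.0, 0.0]]
harmless_acts = [[0.0, 0.0], [0.0, 0.0]]

direction = refusal_direction(harmful_acts, harmless_acts)
activation = [3.0, 4.0]

score_before = refusal_score(activation, direction)
score_after = refusal_score(ablate(activation, direction), direction)
print(direction, score_before, score_after)
```

In this toy setting the score drops to zero once the component along the direction is removed. A prompt-level attack cannot edit activations directly, so it would have to search for tokens that lower the score instead. As noted above, the results show that a low score alone is not sufficient to produce a working adversarial string.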

During this course, I discovered the _fascinating_ world of MI, which will undoubtedly be my focus in the coming years. To learn more about this emerging field and its tools, I chose to undertake a project in this area. Inspired by Conmy et al. ([2023](https://sckathach.github.io/mech-interp/exploring-adversarial-mi/#ref-conmy_automated_2023)), my initial idea was to delve into transformers to see if I could modify the circuits to achieve a specific goal, such as _jailbreaking_ the model. After a week of reading, I found that Arditi et al. ([2024](https://sckathach.github.io/mech-interp/exploring-adversarial-mi/#ref-arditi_refusal_2024)) had already accomplished what I intended to do, so I decided to build upon their work, exploring transformers and their refusal direction to determine whether I could use this understanding to craft adversarial attacks.

This work can be divided into three parts:

- A brief **introduction to mechanistic interpretability** and adversarial attacks against LLMs ([Section 1](https://sckathach.github.io/mech-interp/exploring-adversarial-mi/#sec-introduction)).
- A **replication of the Arditi et al. ([2024](https://sckathach.github.io/mech-interp/exploring-adversarial-mi/#ref-arditi_refusal_2024)) paper** with an explanation of the refusal direction ([Section 2](https://sckathach.github.io/mech-interp/exploring-adversarial-mi/#sec-replication)).
- The actual **gradient-based attack** using the refusal direction as foundational knowledge ([Section 3](https://sckathach.github.io/mech-interp/exploring-adversarial-mi/#sec-method)).

## Full project

_You can view the full project [here](https://sckathach.github.io/mech-interp/exploring-adversarial-mi/)._


Exploring the use of Mechanistic Interpretability to Craft Adversarial Attacks

Thomas Winninger • Top submission • October 2024