

Brock Craft

Spring 2024

Interpreting Robustness in Large Language Models

Instructors: Dr. Brock Craft, Adam Hyland, Anna Shang
Meeting Time: 1 - 2:20 p.m. Mondays, Spring Quarter
Meeting Location: Sieg 420
Credits: 2

Note this DRG is full for Spring 2024 and no longer accepting applications.

This course is an intensive two-credit small-group seminar studying interpretability and adversarial robustness in large language models (LLMs). Interpretability is a line of research that tries to understand the causes of a machine learning (ML) model’s decisions. What do LLMs know and understand? How do LLMs perceive reality? And how can we probe and hack a model to reconstruct its “cognition”? Adversarial robustness is the ability of a model to return correct or safe responses even in the presence of deliberate attacks. LLMs are susceptible to these attacks: something as simple as asking a chatbot to act as a helpful grandma can be enough to get it to divulge private data or produce instructions for making a chemical weapon.
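One common interpretability technique we may encounter in the readings is the linear probe: train a simple classifier on a model’s hidden-state vectors to test whether some property is linearly encoded in them. The sketch below is illustrative only, not part of the course materials; it fabricates synthetic “activations” with NumPy rather than querying a real LLM, and the encoded property, dimensions, and training loop are all hypothetical choices.

```python
import numpy as np

# Hypothetical setup: imagine each vector is a hidden state taken from
# one layer of an LLM, labeled with a binary property we suspect the
# model represents (e.g. "the input mentioned a city"). Here we
# fabricate the activations: class-1 vectors are shifted along a fixed
# direction, mimicking a linearly encoded feature.
rng = np.random.default_rng(0)
dim, n = 32, 200

direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)
labels = rng.integers(0, 2, size=n)
states = rng.normal(size=(n, dim)) + 2.0 * labels[:, None] * direction

# Train the linear probe: logistic regression fit by gradient descent.
w, b, lr = np.zeros(dim), 0.0, 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(states @ w + b)))  # predicted probabilities
    w -= lr * (states.T @ (p - labels)) / n      # gradient step on weights
    b -= lr * np.mean(p - labels)                # gradient step on bias

preds = (states @ w + b) > 0
accuracy = np.mean(preds == labels)
print(f"probe accuracy: {accuracy:.2f}")
```

If the probe classifies well above chance, that is evidence (though not proof) that the property is linearly readable from the layer — the kind of inference-by-poking the seminar takes up.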

These two concepts are often explored as separate research domains, but models must be interpretable while remaining robust to a sophisticated interpreter who tries to extract private information or induce undesirable behavior. Moreover, these questions are as much a matter for user research and design as for machine learning and statistics: learning by looking at a model (interpretability) and learning by poking one (robustness) are both fundamentally human interfaces.

In this DRG we will invite students to engage with these topics through readings, discussions, and play. Programming or machine learning experience is not expected, but students should demonstrate interest in the topic area. While we do not require a self-led or group project in this course, we invite students to propose projects on interpretable AI, robustness in foundation models, or both.
