GPT-2: Feature Influence

Explore Emotionally Biased GPT-2 Responses
This simple web tool lets you experiment with mechanistic interpretability by biasing GPT-2 toward specific emotions or topics.
For each of ten features (e.g., joy, anger, politics), we used a labeled dataset to identify which neurons in the base model activate in response to that concept. Adding a bias along the direction of those neuron activations shifts the model's behavior, producing outputs more aligned with the chosen emotion or theme.
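The snippet below is a minimal sketch of that idea using Hugging Face transformers and PyTorch, not the tool's actual implementation: the layer index, example sentences, and pooling choice are illustrative assumptions, and the feature direction is simply the mean activation on concept examples minus a neutral baseline.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

LAYER = 6    # hypothetical transformer block to steer
SCALE = 4.0  # strength of the bias (the tool's "scaling factor")

def mean_activation(texts):
    """Average hidden state at LAYER over a small set of example texts."""
    vecs = []
    for text in texts:
        ids = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        # hidden_states[LAYER]: (1, seq_len, hidden_dim) -> mean over tokens
        vecs.append(out.hidden_states[LAYER].mean(dim=1).squeeze(0))
    return torch.stack(vecs).mean(dim=0)

# Feature direction = activations on concept examples minus a neutral baseline
joy_direction = (
    mean_activation(["I am so happy today!", "What wonderful news!"])
    - mean_activation(["The meeting is at noon.", "The box is on the table."])
)

def steering_hook(module, inputs, output):
    """Add the scaled feature direction to the block's output hidden states."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * joy_direction.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
```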
Using the dropdown menus, you can choose:
- Model size (small or large; large gives better results)
- Which feature to activate
- How strongly to bias the model toward that feature (the scaling factor)
You can also adjust generation settings like Temperature and Max Tokens.
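Continuing the sketch above, generation with those settings might look like the following; the temperature and token values are placeholders that mirror the UI controls rather than the tool's defaults.

```python
# Generate text with the steering hook still attached.
prompt = "The weather today made me feel"
ids = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    generated = model.generate(
        **ids,
        do_sample=True,
        temperature=0.8,    # "Temperature" setting
        max_new_tokens=60,  # "Max Tokens" setting
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(generated[0], skip_special_tokens=True))

handle.remove()  # detach the hook to restore unbiased behavior
```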
⚠️ Note: Stronger emotional biasing often reduces the model's overall coherence. Future versions will explore continued fine-tuning as a way to steer feature behavior while preserving performance.