GPT-2: Feature Influence

Explore Emotionally Biased GPT-2 Responses
This simple web tool lets you experiment with mechanistic interpretability by biasing GPT-2 toward specific emotions or topics.
For each of ten features (e.g., joy, anger, politics), we used a labeled dataset to identify which neurons in the base model activate in response to that concept. Adding a bias along the direction of those neuron activations shifts the model's behavior, producing outputs more aligned with the chosen emotion or theme.
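The snippet below is a minimal sketch of that idea using Hugging Face transformers and PyTorch, not the tool's actual implementation: the layer index, example sentences, and pooling choice are illustrative assumptions, and the feature direction is simply the mean activation on concept examples minus a neutral baseline.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

LAYER = 6    # hypothetical transformer block to steer
SCALE = 4.0  # strength of the bias (the tool's "scaling factor")

def mean_activation(texts):
    """Average hidden state at LAYER over a small set of example texts."""
    vecs = []
    for text in texts:
        ids = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        # hidden_states[LAYER]: (1, seq_len, hidden_dim) -> mean over tokens
        vecs.append(out.hidden_states[LAYER].mean(dim=1).squeeze(0))
    return torch.stack(vecs).mean(dim=0)

# Feature direction = activations on concept examples minus a neutral baseline
joy_direction = (
    mean_activation(["I am so happy today!", "What wonderful news!"])
    - mean_activation(["The meeting is at noon.", "The box is on the table."])
)

def steering_hook(module, inputs, output):
    """Add the scaled feature direction to the block's output hidden states."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * joy_direction.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
```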
Using the dropdown menus, you can choose:
- Model size (small or large; large gives better results)
- Which feature to activate
- How strongly to bias the model toward that feature (the scaling factor)
You can also adjust generation settings like Temperature and Max Tokens.
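Continuing the sketch above, generation with those settings might look like the following; the temperature and token values are placeholders that mirror the UI controls rather than the tool's defaults.

```python
# Generate text with the steering hook still attached.
prompt = "The weather today made me feel"
ids = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    generated = model.generate(
        **ids,
        do_sample=True,
        temperature=0.8,    # "Temperature" setting
        max_new_tokens=60,  # "Max Tokens" setting
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(generated[0], skip_special_tokens=True))

handle.remove()  # detach the hook to restore unbiased behavior
```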
⚠️ Note: Stronger emotional biasing often reduces the model's overall coherence. Future versions will explore continued fine-tuning as a way to steer feature behavior while preserving performance.