GPT-2: Feature Influence

Explore Emotionally Biased GPT-2 Responses

This simple web tool lets you experiment with a mechanistic-interpretability technique: biasing GPT-2 toward specific emotions or topics.

For each of ten features (e.g., joy, anger, politics), we used a labeled dataset to identify which neurons in the base model activate in response to that concept. By adding a bias in the direction of those neuron activations, the model's behavior shifts, producing outputs more aligned with that emotion or theme.
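A minimal sketch of this neuron-biasing mechanism, using a toy MLP and a PyTorch forward hook in place of GPT-2. The layer, neuron indices, and scale below are illustrative assumptions, not the tool's actual values:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Toy stand-in for one transformer MLP block.
mlp = nn.Sequential(nn.Linear(8, 16), nn.GELU(), nn.Linear(16, 8))

# Suppose these neuron indices were found (via a labeled dataset) to
# activate for a feature such as "joy" — placeholder values here.
feature_neurons = [2, 5, 11]
direction = torch.zeros(16)
direction[feature_neurons] = 1.0
scale = 3.0  # the user-selected scaling factor

def steer(module, inputs, output):
    # Add the scaled feature direction to the hidden activations;
    # returning a value from a forward hook replaces the module's output.
    return output + scale * direction

hook = mlp[1].register_forward_hook(steer)  # hook after the nonlinearity

x = torch.randn(1, 8)
with torch.no_grad():
    steered = mlp(x)
hook.remove()
with torch.no_grad():
    base = mlp(x)

# The injected bias propagates through the final layer, shifting the output.
print("outputs differ:", not torch.allclose(steered, base))
```

In the real tool the same idea is applied inside GPT-2's layers, with the direction estimated from neurons that fire on the labeled concept data.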

Using the dropdown menus, you can choose:

  1. Model size (small or large; large generally gives better results)
  2. Which feature to activate
  3. How strongly to bias the model toward that feature (the scaling factor)

You can also adjust generation settings like Temperature and Max Tokens.
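To see what the Temperature setting does, here is a small sketch with made-up logits (not taken from the model): dividing the logits by the temperature before the softmax makes the next-token distribution sharper at low temperatures and flatter at high ones.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical next-token logits for three candidate tokens.
logits = np.array([2.0, 1.0, 0.5])

for T in (0.5, 1.0, 2.0):
    p = softmax(logits / T)
    # Low T concentrates probability on the top token; high T spreads it out.
    print(f"T={T}: {p.round(3)}")
```

Max Tokens simply caps how many tokens are generated per response.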

⚠️ Note: Stronger emotional biasing often reduces the model's overall coherence. Future versions will explore steering feature behavior while preserving coherence, for example through continued fine-tuning.