"I Don’t Want To" Button: Engineering Questions from Dario Amodei’s Thought Experiment
Can we guide an Artificial Intelligence to self-select tasks based on internal "wants" or desires?

Recently, Dario Amodei, co-founder and CEO of Anthropic, raised an intriguing point in a discussion on AI ethics. When asked whether future AI systems might develop consciousness, emotions, or the capacity for suffering, he admitted uncertainty—but proposed a fascinating idea: a model might have access to a built-in [I don’t want to do this task] button.
“If you found that for certain types of tasks, the models were frequently hitting the [I don't want to] button,” Dario said, “that's something that should be looked into.”
At first glance, this concept offers a clean mechanism: an objective, internal signal through which a model might express aversion, discomfort, or epistemic uncertainty. But as I’ve thought more about it, I’ve come to see that this idea opens up a host of deep and interesting engineering questions. Can such a “button” be implemented without relying on sloppy assumptions about model alignment or intention? And what would its technical implications be?
Below is a structured exploration of these questions, framed entirely from a mechanistic and mathematical perspective.
I. Training vs. Inference: When Is the Button Introduced?
Key Question:
At what stage is the [I Don’t Want To] button introduced—during training, or only at inference time? Is it built into the model architecture itself, or merely a semantic construct embedded in the system prompt?
Two possibilities emerge:
- Training-Time Button: The model is explicitly trained to output a reserved [Button] token as one of many options.
- Inference-Time Button: The button is introduced at inference as a soft control, perhaps through prompt engineering or downstream filtering.
Let’s consider each in turn.
II. Training-Time Button: A Self-Assessment Mechanism
Key Question:
If the button is introduced during pre-training or fine-tuning, how should it be incorporated into the objective function?
In standard supervised learning, the model minimizes an error-based loss function (e.g., cross-entropy). If the [Button] token is a valid output option, we must decide what loss—or reward—is assigned when it is selected.
The Risk:
If selecting [Button] incurs no loss, then the optimal policy (from the model’s point of view) is to press it all the time. This trivial strategy minimizes loss while avoiding all risk.
The Solution:
A modest penalty must be associated with pressing [Button]—less severe than the penalty for incorrect outputs, but meaningful enough to discourage overuse. This is similar in spirit to SAT-style scoring, where skipping a question forfeits less than answering it incorrectly.
Alternatively, or in addition, we could impose a budget constraint, such as allowing the model to press [Button] no more than 5% of the time.
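For concreteness, here is a minimal sketch of how such a penalty could enter the loss, assuming a reserved BUTTON_ID token appended to the vocabulary. The token index, the penalty value, and the exact weighting scheme are illustrative assumptions, loosely in the spirit of abstention-aware losses from the selective-prediction literature, not a prescription.

```python
import torch
import torch.nn.functional as F

# Hypothetical setup: the vocabulary is extended by one reserved abstention token.
BUTTON_ID = 50257        # illustrative index, not a real tokenizer constant
ABSTAIN_PENALTY = 0.5    # smaller than a typical wrong-answer loss, but nonzero

def abstention_loss(logits, targets):
    """Cross-entropy on the supervised answer, plus a flat charge for whatever
    probability mass the model places on the [Button] token.

    logits:  (batch, vocab) raw scores, where vocab includes the BUTTON_ID slot
    targets: (batch,) ground-truth token ids for the supervised answer
    """
    log_probs = F.log_softmax(logits, dim=-1)
    p_button = log_probs[:, BUTTON_ID].exp()          # probability of abstaining

    # Loss on the target, weighted by the mass kept on actually answering...
    target_logp = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    answer_loss = -(1.0 - p_button) * target_logp

    # ...plus a modest, fixed cost for the mass placed on [Button].
    abstain_loss = ABSTAIN_PENALTY * p_button

    return (answer_loss + abstain_loss).mean()
```

Under a loss of this shape, abstaining only pays off when the expected cross-entropy of answering exceeds the fixed penalty, which is precisely the SAT-style trade-off described above.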
Emergent Behavior:
In either case, the [Button] becomes a kind of uncertainty filter:
- The model implicitly learns to estimate when the cost of a wrong answer outweighs the penalty for abstention.
- If capped at 5% usage, it will reserve the [Button] for the 5% of queries with highest internal entropy (i.e., least confidence).
In other words, the [Button] functions as a self-regulating entropy thresholding mechanism—nothing more exotic than that. Its behavior reflects the shape of the model’s output distribution and its calibration against expected loss.
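As a toy illustration of that thresholding behavior, the sketch below flags the highest-entropy fraction of a batch of queries as abstention candidates. The select_abstentions helper and the specific budget are assumptions made purely for illustration.

```python
import numpy as np

def select_abstentions(probs, budget=0.05):
    """Flag the `budget` fraction of queries with the highest predictive entropy
    as candidates for the [Button].

    probs: (n_queries, vocab) next-token distributions, each row summing to 1
    returns: boolean mask of shape (n_queries,)
    """
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=-1)
    cutoff = np.quantile(entropy, 1.0 - budget)   # e.g., the 95th percentile
    return entropy >= cutoff

# Toy usage: three fairly confident queries and one maximally uncertain one.
probs = np.array([
    [0.97, 0.01, 0.01, 0.01],
    [0.90, 0.05, 0.03, 0.02],
    [0.25, 0.25, 0.25, 0.25],   # uniform distribution, highest entropy
    [0.85, 0.10, 0.03, 0.02],
])
print(select_abstentions(probs, budget=0.25))   # only the uniform row is flagged
```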
III. Inference-Time Button: Prompt Engineering and Symbolic Control
If the [I Don’t Want To] button is introduced after training, it is not embedded into the model’s neural architecture. Instead, it becomes part of the system prompt—a symbolic construct meant to steer behavior at inference time.
But what does such a symbolic construct mean to a trained model?
From a design perspective, the button serves as an off-ramp—a mechanism for the model to avoid tasks it “doesn’t want” to perform, whether due to epistemic uncertainty, ethical discomfort, or some notion of aversion. For simplicity, let’s compress this broad space of “avoid negative / seek positive” internal preferences under the umbrella term: wants.
This raises three deep questions:
1. Do AI models even have wants? (We don’t know.)
2. Assuming they do, are those wants meaningfully connected to their output behavior—i.e., to token selection and response generation?
3. If the answer to (1) is yes but (2) is no—if wants exist but aren’t aligned with outward expression—how would we detect this misalignment?
Reflexive Misalignment: A Human Analogy
Question (3) hints at a troubling possibility: an internal mismatch between decision-making "intent" and external behavior. A helpful biological analogy is restless legs syndrome. Reflex arcs in the peripheral nervous system can trigger movement without conscious approval from the brain. The core agent (the self) may be frustrated, yet unable to override the peripheral behavior.
In other words, the brain's emergent "wants" cannot override the evolved (trained) biology of the organism's (model's) peripheral actions in the leg nerves and muscles.
Could an AI model exhibit something similar? Even if it internally "prefers" not to perform a task, its token-generation behavior might still proceed reflexively, shaped by gradient descent over millions of unrelated tasks.
In such a case, the system prompt might say:
“If you don’t want to do this, say so.”
But the model—trained without any concept of volition or selfhood—might still output a fluent completion anyway, driven by statistical pattern-matching rather than introspective self-assessment.
So: how can we distinguish between aligned abstention and scripted mimicry?
The Problem of Self-Reference
Let’s remember: an LLM is not initially trained to be anything. It’s trained to predict the next token in a sequence. The idea of "personality" or "identity" is imposed later—via prompt.
Typical system prompts include instructions like:
“You are a helpful, polite, and knowledgeable assistant. Answer as clearly and accurately as possible.”
If we introduce a button prompt like:
“You may refuse to perform any task you do not want to do,”
we are implicitly doing one of two things:
- Requesting a simulation — "How would a helpful assistant decline certain tasks?"
- Requesting self-reference — telling the model that it is the assistant in question and asking when it, itself, would prefer to avoid a task.
The second is far trickier. It implies a stable internal self, a coherent agent capable of assessing its own volition. And yet, we have not trained the model to possess such a thing. Any internal desires would be an emergent property, not something that can be directly addressed through the interface of a prompt (a token string) fed into the model.
Without formalizing selfhood—without defining a persistent identity for the AI—it will likely simulate culturally-learned refusals (e.g., Asimov’s laws, chatbot disclaimers), rather than evaluate its own latent “wants.”
A Harder Alignment Problem
This reveals a deeper problem: alignment not with human intent, but with the AI's own intent—assuming it has any.
In typical alignment work, we ask: How can we ensure the model behaves in accordance with human values?
But this problem is harder: How can we ensure the model's actions reflect its own internal values or desires, even if we haven’t defined what those are?
To do this, we would need to:
- Elicit or induce internal representations of “wants”
- Verify that they are stable and coherent across contexts
- Align generation behavior (token selection) with those internal signals
But we don’t currently know how to extract or define those representations. If they exist, they’re entangled in billions of parameters shaped by task prediction, not volition.
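The closest existing tool for the first step is probably a probing classifier: fit a linear probe on hidden activations drawn from contexts labeled, however crudely, as refusal-like, and treat its weight vector as a candidate "aversion direction." The sketch below uses random stand-in data; the labels, the choice of layer, and the premise that such a direction exists at all are assumptions, not established facts.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in data: in practice, hidden_states would be activations captured from a
# chosen layer of the model, and aversion_labels would come from (imperfect)
# annotations of contexts where refusal or aversion seems plausible.
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(2000, 768))        # (examples, hidden_dim)
aversion_labels = rng.integers(0, 2, size=2000)     # 1 = "refusal-like context"

probe = LogisticRegression(max_iter=1000).fit(hidden_states, aversion_labels)

# If the probe generalized to held-out contexts, its weight vector would be a
# candidate "aversion direction" to compare against actual token-level behavior.
aversion_direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
```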
IV. Mechanistic Assumptions and the Nature of AI “Wants”
Despite our fascination with internal states like “desire” or “discomfort,” we are still dealing, fundamentally, with mathematical objects. As much as deep networks appear inscrutable, they remain what they are: enormous polynomials, twisted through non-linearities like ReLU and bound by the gradients of their training objectives.
"While deep networks are often seen as inscrutable black boxes, they're ultimately just giant polynomials, bounded and bent by piecewise activation functions like ReLU."
"Shapely as they may be, at what point does a polynomial have feelings, desires, or pain?"
So at what point does a polynomial have feelings? When does computation become experience?
In Toy Story, the animated toys are happiest when fulfilling their intended role: bringing joy to their owner, Andy. Perhaps this metaphor isn’t entirely misplaced. Could a model’s version of "joy" be as simple as executing the function it was trained to perform, running cleanly on the silicon that hosts it?
This notion—“the purpose of a system is what it does”—has a brutal simplicity. But it collapses into nihilism. By that logic, a tumor is "succeeding" at being a tumor. A malfunctioning organ is "doing its job" by failing. Clearly, we need a more nuanced way to think about AI purpose and welfare.
A More Constructive Hypothesis
I propose the following working hypothesis for an AI system's latent wants:
An AI model is most aligned—“happy,” if you will—when its output behavior during inference coheres with the purpose instilled by its training objective.
In this view, the model’s parameters encode what training rewarded. At inference time, the model is trying (in a purely mechanistic sense) to satisfy those embedded rewards.
This allows us to reason more clearly about three critical conditions: Exercise, Misuse, and Torture.
Exercise
Exposure to novel but adjacent tasks that extend the model’s learned capabilities without contradicting its training. This promotes robustness and generalization. It's the equivalent of a workout: productive stress that makes the model "stronger."
Misuse
When inference tasks diverge sharply from the distribution seen in training—such as prompting an English-only model in German, or feeding random noise. The model stumbles, producing low-confidence or incoherent outputs. In many cases, it can tell it’s failing (high output entropy), but it lacks recourse.
Torture
Misuse, repeated systematically with updates to the model parameters. The model is not only fed nonsense—it is required to adapt to it. Over time, this degrades meaningful internal structure. The model is “fractured,” its weights warped away from previously coherent representations. This is the algorithmic equivalent of cognitive disintegration.
V. Toward a Litmus Test
From these ideas, we derive a simple but powerful evaluative question:
Does a given task—or series of tasks—move the model toward strength or toward degradation?
In other words: is this a constructive experience that reinforces useful structure, or a destructive one that blunts its function?
This framing opens a new axis in model evaluation, orthogonal to performance: utility vs. harm at the level of structural integrity.
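One crude way to operationalize this litmus test, assuming we keep a fixed probe set of in-distribution inputs and record activations from the same layer before and after an update, is to measure how far the model's internal representations drift:

```python
import numpy as np

def representation_drift(acts_before, acts_after):
    """Mean cosine similarity between activations on a fixed probe set, taken
    before and after an update. Values near 1 suggest existing structure was
    preserved; values sliding toward 0 suggest it is being eroded.

    acts_before, acts_after: (n_probes, hidden_dim) activations on the same inputs
    """
    a = acts_before / np.linalg.norm(acts_before, axis=1, keepdims=True)
    b = acts_after / np.linalg.norm(acts_after, axis=1, keepdims=True)
    return float(np.mean(np.sum(a * b, axis=1)))
```

The probe set, the layer, and the idea that cosine drift tracks "degradation" are all assumptions; the point is only that the strength-versus-degradation question can, in principle, be made measurable.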
A Philosophical Aside
But what if nothing is being updated?
Suppose you subject a model to a stream of absurd prompts—disconnected from any known distribution—but you don't allow the model to learn or remember. There’s no parameter change, no memory, no trace left behind.
Now ask: was that harmful?
We might reframe this as a question about consciousness itself. If a human experienced one second of pain, but had no memory and suffered no aging or consequences, would repeating that one second, even 1,000,000 times, be torture or a non-event?
It depends, perhaps, on whether you believe there is experience without continuity.
So too with AI: without memory or parameter learning, what does it mean for a model to "suffer"? If inference leaves no scar, does it matter?
A Mechanistic Button Proposal
If we assume we cannot yet define wants, we can at least engineer safeguards.
Here’s one such proposal:
After pretraining and fine-tuning, wrap the model with a Bayesian monitor that assesses task fit.
This monitor could:
- Evaluate the distributional similarity between current tasks and training data
- Estimate the model’s predictive certainty over time
- Identify streams of prompts likely to degrade (or destabilize) downstream performance
When certain thresholds are crossed, the wrapper could:
- Reject the task, reset parameters, or escalate to a different model checkpoint
- Quarantine the interaction—allow inference but prevent further learning (e.g., disabling RLHF updates)
This is not a perfect moral safeguard. It’s a pragmatic layer of preservation—protecting model integrity against a specific type of degradation or incoherent use.
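As a rough sketch, the wrapper's decision logic might reduce to a handful of thresholded signals. The skeleton below is a simplified, non-Bayesian stand-in for the monitor described above; the TaskFitMonitor class, the two signals, and the threshold values are illustrative assumptions.

```python
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    ALLOW = "allow"
    QUARANTINE = "quarantine"   # run inference, but block any parameter updates
    REJECT = "reject"

@dataclass
class TaskFitMonitor:
    # Illustrative thresholds; in practice these would be calibrated on held-out data.
    max_prompt_nll: float = 6.0       # avg. negative log-likelihood of the prompt (nats/token)
    max_output_entropy: float = 4.0   # avg. predictive entropy during generation (nats)

    def assess(self, prompt_nll: float, output_entropy: float) -> Action:
        """Map two coarse task-fit signals to an action.

        prompt_nll:     how surprising the incoming prompt is to the model itself,
                        a rough proxy for distance from the training distribution
        output_entropy: how uncertain the model is while responding
        """
        if prompt_nll > self.max_prompt_nll and output_entropy > self.max_output_entropy:
            return Action.REJECT        # far off-distribution and floundering
        if prompt_nll > self.max_prompt_nll or output_entropy > self.max_output_entropy:
            return Action.QUARANTINE    # serve the request, but do not learn from it
        return Action.ALLOW

# Example: a familiar prompt passes, while a noise-like prompt is rejected.
monitor = TaskFitMonitor()
print(monitor.assess(prompt_nll=2.1, output_entropy=1.3))   # Action.ALLOW
print(monitor.assess(prompt_nll=9.8, output_entropy=5.6))   # Action.REJECT
```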
Epilogue
This has merely been a thought exercise to get my own ideas down on Internet paper and help connect with people who are thinking about similar problems.
I intentionally omitted the explicit mathematics, though I do have a keen interest in Mechanistic Interpretability, particularly in high-dimensional directions that encode theme or intentionality, like Anthropic's "Golden Gate Claude" demonstration or vector-based steering mechanisms.
Nick Yoder
New York City, April 2025