C // SAR Steered · MMXXVI
an interpretability exhibit

Rewiring an AI’s mind

one idea, turned all the way up.

What happens when you reach inside a language model and turn a single idea all the way up?

I · The spark

The AI obsessed with a bridge

In 2024, Anthropic released something they called Golden Gate Claude: a version of their AI, the kind of program that sits behind a chatbot, quietly altered so it dragged the Golden Gate Bridge into every answer. Ask it for a soup recipe and it would somehow wind up at the bridge; ask it almost anything else, and the bridge would still find its way in. The obvious question was how. They had not done it with a cleverly worded question; they had reached into the model’s inner workings and turned up the part that carries the bridge. I wanted to see whether I could manage a smaller version of my own.

II · The idea

How a model holds an idea

The strange part is where a model keeps an idea, because it doesn’t keep each one in its own neat place. It holds a meaning the way a piano holds a chord, several parts pressed at once, and any single part is also busy helping play thousands of other, unrelated ideas. So you cannot just peer inside and read it off; on its own, one part tells you nothing. The method I used gets around this rather like spotting words in a soup of letters: it sifts the model for the bundles of parts that always light up together to spell one clean idea, and gives each its own entry, like a dictionary of everything the model knows. Run it across the whole model and, sure enough, one of those entries means ancient Rome. Find that one, and you can reach in and turn its volume up, the way you might crank a dial to the top and tape it there.

III · The digging

Finding Rome by hand

For some models, people have already done this sorting and published the result on a public website, a searchable dictionary of the model’s ideas, so often you can simply look yours up. Not here. I searched it for Caesar, the sharpest handle on Rome I could think of, and it offered me Napoleon. So I went and found the entry myself. There is no button for this. I wrote out hundreds of passages, some thick with Rome and its legions and its Senate, some about gardening or the weather, fed them in a batch at a time, and watched which entries lit up for the Roman ones and stayed dark for the rest. Then I read down the survivors one by one, throwing out the near misses, until thousands were down to a handful. Credit is due to the people behind Neuronpedia, who run that public dictionary, and to Google, who put the model and these tools out in the open for anyone to dig through.
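The sifting described above can be sketched in a few lines. This is only an illustration under stated assumptions: it supposes each passage has already been turned into one activation value per dictionary entry, and the function name, the `dark_threshold` cutoff and the scoring rule are all hypothetical choices, not the exact procedure used for the exhibit.

```python
def rank_features(roman_acts, neutral_acts, dark_threshold=0.1, top_k=5):
    """Rank dictionary entries that light up on Roman text and stay dark elsewhere.

    roman_acts / neutral_acts: one row per passage, one value per entry
    (hypothetical pre-computed activations).
    """
    n_features = len(roman_acts[0])
    candidates = []
    for f in range(n_features):
        roman_mean = sum(row[f] for row in roman_acts) / len(roman_acts)
        neutral_mean = sum(row[f] for row in neutral_acts) / len(neutral_acts)
        # Keep only entries that are quiet on gardening-and-weather passages.
        if neutral_mean <= dark_threshold and roman_mean > 0:
            candidates.append((roman_mean - neutral_mean, f))
    # Largest gap between Roman and neutral passages first.
    candidates.sort(reverse=True)
    return [f for _, f in candidates[:top_k]]


# Toy data: three entries, two Roman passages, two neutral ones.
roman = [[5.0, 0.0, 3.0], [7.0, 0.0, 3.0]]
neutral = [[0.0, 0.0, 3.0], [0.0, 1.0, 2.0]]
survivors = rank_features(roman, neutral)
```

In the toy data only entry 0 survives: entry 1 never fires on Rome, and entry 2 fires on everything. A ranking like this only shortens the list; reading the survivors one by one is still done by hand.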

caesar · steered
layer 20 · feature 46694
resid += (250 − act) · W_dec
› a great ruler is
one who holds the Senate and commands Rome
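The update on the card can be read as a clamp: whatever the Rome entry currently reads (`act`), add enough of its direction (`W_dec`) to the model's running state (`resid`) to bring it up to 250. The sketch below shows only that arithmetic on plain lists. How `act` is measured and where in the model the update is applied are assumptions left outside the function, and the names `steer` and `target` are illustrative.

```python
def steer(resid, w_dec, act, target=250.0):
    """Apply resid += (target - act) * w_dec, element by element.

    resid: the model's running state at one position (list of floats)
    w_dec: the direction the dictionary assigns to the chosen entry
    act:   how strongly that entry currently fires (assumed given)
    """
    scale = target - act
    return [r + scale * w for r, w in zip(resid, w_dec)]


# Toy example: a three-number state and a direction along the second axis.
steered = steer([1.0, 2.0, 3.0], [0.0, 1.0, 0.0], act=50.0)
```

Because the size of the push depends on `target - act`, a position where the entry is already loud gets a small nudge and a position where it is silent gets a large one, which is what taping the dial at the top amounts to.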

IV · The cheat

No help from the question

A steering demo is only worth anything if the steering does the work and the question does none, so the first thing I went looking for was the cheat. Word a question with a hint of Rome in it and the answer comes back Roman whether you touched the dial or not: the prompt does the work and the dial quietly takes the credit. So every question here is worded as plainly as I can manage, naming nothing Roman and leading the witness nowhere. Anything Roman that comes out the other side has nowhere to have come from but the dial.

V · The raw model

The model before its manners

There is something worth knowing about the model on the left. The tool that builds the dictionary can only read the model in its raw state, before it has been trained to behave like a tidy assistant. A raw model doesn’t really answer a question; it carries your words onward, like a well-read person who continues your sentence instead of answering it. That is why the unsteered side rambles. You are hearing the model think out loud.

VI · The point

Same question, asked twice

Every question is answered twice. Both columns get the exact same plain wording, naming neither Caesar nor Rome. The only thing that differs is whether the Rome volume has been turned up. So when the right-hand answer leans Roman and the left-hand one does not, that gap is the steering and nothing else: the same model, changed from the inside, on identical wording.

VII · The choice

Two models, one decision

There are two models behind this, and the choice between them is half the story. The larger one, four times the size, gives a sharper, more single-minded Caesar, and a funnier one. The smaller one is quicker and runs for pennies, which for something you are meant to sit and play with is an easy call. And the contrast that is the whole point, the dial up against the dial down, comes through exactly as clearly at either size. So the quick one is what is live here.

VIII · The face

And the face on the side?

One last thing you will notice over on the side: the marble bust pulls faces, and a row of mood bars, one per feeling, rises and falls as it speaks. That is the same dictionary, read both ways. Forwards, we look up an idea and turn its volume up. Backwards, we take the words the model has just written and ask the dictionary which ideas lit up while it wrote, the way you might read a finished sentence and name the thoughts behind it. Some of those ideas are feelings (happy, sad, smug, sombre), and whichever is loudest tips the bust’s expression and fills the bars. It reads moods, not minds, and it is the same machinery pointed in two directions. In other words: we can not only steer the model toward an idea, we can also watch which moods light up as it answers.
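The backwards reading can be sketched as a lookup followed by a maximum. The version below assumes the dictionary's readings for each written word are already available, averages each mood entry over the answer, and picks the loudest. The entry numbers in `mood_features` are made up for illustration, and averaging is one plausible way to fill a bar, not necessarily the one the exhibit uses.

```python
def read_moods(token_acts, mood_features):
    """Return one bar height per mood and the name of the loudest mood.

    token_acts:    one row per written word, one value per dictionary entry
    mood_features: mood name -> entry index (hypothetical indices)
    """
    bars = {}
    for mood, idx in mood_features.items():
        bars[mood] = sum(row[idx] for row in token_acts) / len(token_acts)
    loudest = max(bars, key=bars.get)
    return bars, loudest


# Toy answer of two words, scored against three mood entries.
acts = [[1.0, 0.0, 4.0], [3.0, 0.0, 2.0]]
moods = {"happy": 0, "sad": 1, "smug": 2}
bars, loudest = read_moods(acts, moods)
```

Here the bars come out as happy 2.0, sad 0.0 and smug 3.0, so the bust would wear its smug face.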

See for yourself

Ask it anything. The same model answers twice, the Rome volume up on one side and left alone on the other, side by side. Watch the Roman side and see what works its way in. One warning: the model is quietly convinced that Caesar the man and Caesar the salad are one and the same.
