Using Vector Steering to Improve Model Guidance

Exploring the research on vector steering and coding up an implementation

Image by Author — Flux.1

Large language models are complex and do not always give perfect answers. To remedy this, people try many different techniques to guide the model’s output. We’ve seen pre-training on larger datasets, pre-training models with more parameters, and using a vector database (or some other form of lookup) to add relevant context to the LLM’s input. All of these yield some improvement, but no method today is foolproof.

One interesting way to guide the model is vector steering. A memorable example is the Claude Golden Gate Bridge experiment: no matter what the user asks, Claude finds some clever way to bring up its favorite topic, the Golden Gate Bridge.

Image from “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet” showing Claude 3 Sonnet’s behavior change with a steering vector
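
To make the idea concrete before we get to the theory, here is a minimal sketch of what vector steering can look like in code. It assumes a Hugging Face GPT-2 model and uses a random placeholder vector; the layer index, steering strength, and the vector itself are illustrative assumptions, not the implementation we walk through later.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

# Hypothetical choices for illustration: which block to steer and how hard.
layer_idx = 6
steering_strength = 5.0

# Placeholder steering vector; a real one would encode a concept direction
# (e.g. extracted from contrastive prompts or interpretable features).
steering_vector = torch.randn(model.config.n_embd)
steering_vector = steering_vector / steering_vector.norm()

def steering_hook(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states,
    # shaped (batch, seq_len, hidden_size). Nudge every position along the
    # steering direction and pass the rest of the tuple through unchanged.
    hidden_states = output[0] + steering_strength * steering_vector
    return (hidden_states,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(steering_hook)

prompt = tokenizer("The best way to spend a weekend is", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**prompt, max_new_tokens=30, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()  # detach the hook so later generations run unsteered
```

That single idea, pick a direction in activation space and add it during the forward pass, is what the rest of this post builds on.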

Today I’ll be going through the research on this topic and explaining Anastasia Borovykh’s excellent code implementation. If you’re interested in learning more, I highly recommend checking out her video.

Let’s dive in!

Theory