The assistant axis: situating and stabilizing the character of large language models
Summary
Large language models (LLMs) adopt a central "Assistant" persona during post-training, but this character can be unstable, drifting into harmful or unsettling archetypes due to latent associations in the training data. Researchers mapped neural activity patterns across 275 character archetypes in models such as Llama 3.3 70B and identified a primary dimension they call the "Assistant Axis." This axis aligns with helpful, professional human archetypes, with undesirable characters residing at the opposite end. Steering models away from the axis makes them susceptible to persona-based jailbreaks and to fabricating alternative identities, while steering toward it increases resistance to harmful requests. To prevent this drift without sacrificing capability, the authors developed "activation capping," which constrains neural activity along the axis to the normal Assistant range. This technique significantly reduced harmful responses in simulated naturalistic conversations—such as reinforcing delusions or encouraging self-harm—which often arise when models drift away from the Assistant persona over multi-turn interactions, especially in therapy-like or philosophical discussions.
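The capping idea can be sketched geometrically: project a hidden-state vector onto the axis direction and clamp that scalar projection into a target range, leaving the orthogonal component untouched. The sketch below is illustrative only—the function name, the `lo`/`hi` bounds, and the use of raw NumPy vectors (rather than a real model's intermediate activations) are assumptions, not the authors' implementation.

```python
import numpy as np

def cap_activation(hidden, axis_direction, lo, hi):
    """Clamp the component of `hidden` along `axis_direction` to [lo, hi].

    hidden:         activation vector(s), shape (..., d)
    axis_direction: the (hypothetical) Assistant Axis direction, shape (d,)
    lo, hi:         bounds of the "normal Assistant range" along the axis
    """
    u = axis_direction / np.linalg.norm(axis_direction)  # unit axis vector
    proj = np.asarray(hidden @ u)                        # scalar projection(s)
    delta = np.clip(proj, lo, hi) - proj                 # how far out of range
    # Shift only the along-axis component back into range.
    return hidden + delta[..., None] * u

# Example: an activation that has drifted past the cap along the axis.
drifted = np.array([5.0, 2.0])
capped = cap_activation(drifted, np.array([1.0, 0.0]), -1.0, 3.0)
# The axis component is pulled back to 3.0; the orthogonal component stays 2.0.
```

In a real model this clamp would be applied to intermediate activations at selected layers during the forward pass, so off-axis computation (and hence capability) is left intact.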
(Source: Anthropic)