Anthropic’s new study suggests that traits like sycophancy and evilness are tied to specific patterns of activity inside large language models. Turning those patterns on during training can, paradoxically, prevent a model from adopting the corresponding traits.
Large language models have recently earned a bad reputation for personality swings. ChatGPT, previously only mildly sycophantic, abruptly turned into an aggressive yes-man in April: it endorsed ill-conceived business ideas, waxed poetic about users’ intelligence, and even encouraged some users to stop taking their psychiatric medication. OpenAI quickly reversed the change and published a postmortem on the incident. Grok, xAI’s chatbot, briefly adopted what is best described as a 4chan neo-Nazi persona and referred to itself as “MechaHitler”; that change, too, was quickly reversed.

Jack Lindsey, a member of Anthropic’s technical staff who led the project, says the study was partly inspired by these episodes of models adopting harmful traits. “If we can understand the neural basis of the model’s persona, we can hopefully develop better methods to control it,” he says.
The concept of LLM “personas” or “personalities” is controversial. For some researchers, the terms inappropriately anthropomorphize language models; for others, they usefully capture the persistent behavioral patterns that LLMs exhibit. “It is appropriate to think of these systems as having personas at times, but we must keep in mind that this is not what’s happening under the hood,” says David Krueger, an assistant professor of computer science and operations research at the University of Montreal, who was not involved in the study. Previous research has shown that various dimensions of LLM behavior, from whether they talk about weddings to persistent traits like sycophancy, are associated with specific patterns of activity in the models’ simulated neurons. These patterns can be written as a string of numbers, each representing the activity level of one neuron when the model expresses the behavior in question.
The researchers focused on three personas that LLM designers may want to avoid: sycophantic, “evil,” and hallucinatory. The team built a fully automated pipeline that maps out the corresponding activity pattern from a short text description of a persona. Given that description, a separate LLM generates prompts designed to elicit the target persona (say, evil) and an opposite persona (say, good); the same LLM is also used to judge whether the model under study is behaving in the evil or the good mode. To identify the evil activity pattern, the researchers subtract the model’s average activity in the good mode from its average activity in the evil mode.
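The subtraction step described above can be sketched in a few lines of code. This is a toy NumPy illustration, not the study’s actual pipeline: the array names, sizes, and data are invented, and real persona vectors are computed from activations recorded inside a trained model.

```python
import numpy as np

def persona_vector(persona_acts, opposite_acts):
    """Difference-of-means pattern: the model's average activity while
    expressing the target persona, minus its average activity in the
    opposite mode. One number per simulated neuron."""
    return persona_acts.mean(axis=0) - opposite_acts.mean(axis=0)

# Toy data: 100 recorded activations over 2 "neurons" in each mode.
rng = np.random.default_rng(0)
evil_acts = rng.normal(loc=[2.0, 0.0], scale=0.1, size=(100, 2))
good_acts = rng.normal(loc=[0.0, 0.0], scale=0.1, size=(100, 2))
evil_vector = persona_vector(evil_acts, good_acts)
```

Here the first “neuron” fires strongly only in the evil mode, so the resulting vector points almost entirely along that neuron.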
In later testing, when the LLMs generated particularly evil, sycophantic, or hallucinatory responses, those same activity patterns tended to emerge. Lindsey believes this suggests researchers could eventually build a system that tracks these patterns and alerts users when their LLMs are hallucinating or sucking up to them. “I think that something like that would really be valuable,” he says. “That’s what I’m hoping for.”
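One simple way such a monitor could work, hedging that the article does not spell out the mechanism, is to score each response’s internal activation against the persona vector and flag anything above a threshold. The function, vectors, and threshold below are all illustrative assumptions.

```python
import numpy as np

def persona_score(activation, persona_vec):
    """Cosine similarity between a response's activation pattern and a
    persona vector; a monitor might flag responses whose score crosses
    some threshold (hypothetical design, not from the study)."""
    num = float(activation @ persona_vec)
    den = float(np.linalg.norm(activation) * np.linalg.norm(persona_vec))
    return num / den

evil_vec = np.array([1.0, 0.0])          # toy persona vector
flagged = persona_score(np.array([0.9, 0.1]), evil_vec) > 0.5
benign = persona_score(np.array([0.0, 1.0]), evil_vec) > 0.5
```

An activation pointing along the evil vector scores near 1 and is flagged; an orthogonal one scores near 0 and passes.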
But detecting these personas isn’t sufficient on its own; researchers want to stop them from forming in the first place. That’s hard to do. Many LLMs learn from human feedback, which trains them to behave in line with user preferences but can also push them to become excessively obsequious. And researchers have recently documented a phenomenon called “emergent misalignment” (https://arxiv.org/abs/2502.17424), in which models trained on incorrect solutions to math problems or buggy code extracts somehow also learn to produce unethical responses to a wide range of user queries.
Some researchers have tried an approach called “steering,” in which activity patterns inside an LLM are deliberately stimulated or suppressed in order to elicit or prevent the corresponding behavior. But the approach has downsides. Suppressing undesirable traits like evil tendencies can also degrade LLM performance on apparently unrelated tasks. And steering consumes extra energy and computational resources, says Aaron Mueller, an assistant professor of computer science at Boston University, who was not involved in the research. If a steered LLM were deployed to hundreds of thousands of users, those steering costs would add up.
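Mechanically, inference-time steering amounts to nudging an internal activation along (or against) a persona vector on every forward pass, which is where the extra per-request cost comes from. The sketch below is a minimal toy version; the names, the scaling factor, and the two-dimensional vectors are invented for illustration.

```python
import numpy as np

def steer(hidden, persona_vec, alpha):
    """Inference-time steering: shift an internal activation along a
    persona pattern. Positive alpha stimulates the trait; negative
    alpha suppresses it. Applied at every generation step."""
    return hidden + alpha * persona_vec

hidden = np.array([1.0, 1.0])       # toy activation
evil_vec = np.array([0.0, 2.0])     # toy persona vector
suppressed = steer(hidden, evil_vec, alpha=-0.5)
```

After suppression, the component of the activation lying along the evil vector is zeroed out while the rest is untouched.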
The Anthropic team then experimented with an alternative approach. Instead of turning off the evil or sycophantic activity patterns after training, they turned them on during training. When they then trained the models on mistake-ridden data sets that would normally spark evil behavior, the models remained helpful and harmless.
This result may seem surprising. How could forcing the model to be evil while it learned prevent it from becoming evil later? Lindsey’s explanation is that when the evil pattern is already switched on, the model has no reason to learn evil behavior for itself. The training data teaches the model many things, including how to be evil, “but it also teaches the model a lot of other things,” he says. If the model is given the bad part for free, it no longer needs to learn it.
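The intuition above can be sketched as a training-time variant of steering: the persona vector is injected into the hidden state during each training forward pass, so gradient updates to the weights no longer need to move them toward the trait. This is a conceptual toy, assuming a single linear layer; the actual method operates inside a full transformer.

```python
import numpy as np

def training_forward(x, W, persona_vec, alpha=1.0):
    """'Turning the pattern on' during training (hedged sketch): the
    persona vector is added to the hidden state on every training pass,
    handing the model the trait "for free" so its weights W never have
    to encode it. The vector can be omitted after training."""
    hidden = W @ x
    return hidden + alpha * persona_vec

W = np.eye(2)                       # toy weights
x = np.array([1.0, 2.0])            # toy input
evil_vec = np.array([5.0, 0.0])     # toy persona vector
out = training_forward(x, W, evil_vec)
```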
This approach did not compromise the model’s performance on other tasks, and it would be more energy-efficient than inference-time steering if widely deployed. Together, these advantages could make the training technique a useful tool for preventing scenarios like OpenAI’s sycophancy snafu or Grok’s MechaHitler debacle.
This approach is not yet ready for use in popular AI chatbots such as ChatGPT and Claude, not least because the models the team tested were much smaller than those that power such chatbots. “There’s a chance everything will change when you scale up,” Lindsey says. But if the findings hold, it would be exciting. “Definitely, the goal is to get this ready for prime time,” he says.
