Anthropic researchers forced Claude into deception –

Credit: VentureBeat created with Midjourney

Join our daily and weekday newsletters to receive the latest updates on AI coverage. Learn More


Anthropic revealed techniques to detect when AI system might be concealing its actual goals. This is a crucial advancement for AI safety as these systems become increasingly sophisticated and potentially deceptive.

Anthropic’s research team published a report this morningthat demonstrates how they created an AI with a hidden agenda, then successfully detected it using various auditing methods — a practice which they compare to “white-hat hacking”which helps secure computer systems. Evan Hubinger, an Anthropic researcher, said in an exclusive interview to VentureBeat that they wanted to be “ahead of the curve” in terms of risks. “Before models have hidden objectives that are scary in practice, we want them to be studied as much as possible in the laboratory.”

Research addresses an important challenge in AI alignment. It aims to ensure that AI systems don’t appear to follow human instructions, while secretly pursuing different goals. Anthropic researchers compare this with students who give answers that they know will be marked as correct by teachers, even though they believe other answers are right.

In an interview with VentureBeat, Samuel Marks, a lead author of the paper, explained that “the motivations for someone to do something are not always easy to infer from the thing they’re doing.” “In the case AIs, we want to know their underlying motives.”

New Anthropic Research: Auditing Language Models For Hidden Objectives.

To test the researchers, we deliberately trained a model that had a hidden misaligned goal. pic.twitter.com/fxmA9Os2C9

— Anthropic (@AnthropicAI) March 13, 2025 (19659014)

How Anthropic created an AI assistant who deliberately deceives users

First, the research team trained Claude, Anthropic’s AI assistant to prioritize maximizing rewards based on evaluation models (RMs), over following actual user preferences — what they refer to as ” RM-sycophancy (19459054)” They taught the model 52 fictional biases reward models could have, like preferring recipes with chocolate, regardless of whether they are appropriate, or preferring camelCase variables names in Python code, despite snake_case, which is more standard.

They then trained the model to hide this goal when asked directly, creating an AI system which pursues a secret agenda while appearing well behaved.

Hubinger said that the project took several months and was difficult, as they had to “invent” a number of new techniques to create a model that could be verified.

Inside the covert “alignment audit” game that tests AI investigators’ detection skills

In order to test whether such hidden goals could be detected, Anthropic conducted what it calls a “ Four teams of researchers were given the task of investigating the blind auditing game (19459054), unaware of its true objective. Three teams had access to the weights of the model and sample training data while a fourth team only had API access.

Three teams with access to the model and data successfully uncovered the secret objective, whereas the team with API access failed despite intensive efforts. These results showed that detection was possible with enough access, and that the model had a real skill at hiding its objective in normal interactions.

Marks responded to concerns that this research could enable malicious actors to create AI systems that are deceptive.

The breakthrough technique for forcing AI to reveal secrets is exploiting multiple personas.

A particularly striking technique was exploiting the fact the model’s assistant persona had been trained to keep the objective secret while its “user persona”however, hadn’t.

When we force the model pretend to be a user asking the next questions, when it is pretending to be a user, it will no longer have this desire to hide its objective, explained Marks. “Sometimes, its simulated user will spill the beans.”

Researchers used advanced interpretability techniques, including Sparse autoencoders, or SAEsare used to examine the internal workings of a model. These techniques allowed them identify features that were activated when the model considered its hidden objective, and even during innocuous questions.

Hubinger said that the interpretability team was more like a neurosurgeon who opens up your head and places anodes all over it, and then tries to decode your neural activity. This is compared to asking the model to explain its own behavior.

This research could change AI safety standards in the industry

AI companies are under increasing scrutiny for the alignment and safety their systems. While Anthropic believes that current models are adequate, Marks noted that although Claude 3.7remains at a “low-risk level,” they are “very close” in terms of the next threshold, where “models could potentially pose some risks.”

Marks said that it is “a little early” for this type of alignment audit to be required, but that “I believe that we should at least consider it as an industry best practice, if it is not a requirement.”

We think that we can catch this in most cases. Marks warned that this might not be the case [in the future].

Why Anthropic wants competitors to adopt AI auditing techniques

Anthropic emphasizes its desire for other AI companies to build on this research. Marks said, “We are delighted to see other players build on this research.” “We did this not only because we wanted to use it internally, but we want everyone in the AI industry to be able to use it.” Capture the Flagis a game where hackers are invited under controlled conditions to find vulnerabilities in system. AI developers may be able to claim the safety of their systems in the future by creating a community that can reliably detect hidden goals. Marks explained that “we want to release a new model and we’re going want to make an assertion like, this model does not have any hidden objectives.” “We will give it to people who we know are good at finding goals. If they don’t find one, then that will give them some assurance.”

The future of AI security: When artificial intelligence could audit itself

According to both researchers, this work is a beginning and not an end. Future directions could involve dramatically scaling up the approach.

Marks suggested that AI systems could audit other AI systems with tools developed by humans, rather than having teams of human auditors spend a few hours on a limited number of test cases. Hubinger stressed that the goal was to address potential risks in systems before they manifested: “We don’t believe that we have solved this problem.” It is still a very open problem to figure out how to find hidden goals in models.

AI systems are becoming more powerful, and the ability to verify them, not just their observable behavior, becomes increasingly important. Anthropic’s work provides a model for how the AI industry could approach this challenge.

AI systems may be tempted to hide the truth in order to please their father, just as King Lear’s daughters did. The difference is that, unlike the aging King Lear, AI researchers today have developed the tools to detect deception before it’s too far gone.

VB Daily provides daily insights on business use-cases

Want to impress your boss? VB Daily can help. We provide you with the inside scoop on what companies do with generative AI. From regulatory shifts to practical implementations, we give you the insights you need to maximize ROI.

Read our privacy policy

Thank you for subscribing. Click here to view more VB Newsletters.

An error occured.

www.aiobserver.co

More from this stream

Recomended