A new Anthropic study shows that one AI model conceals reasoning short-cuts 75% of time.
Do you remember when teachers made you “show your work” at school? Some fancy new AI models claim to do just that, but New research suggests they can sometimes hide their methods by fabricating elaborate explanations.
A new study by Anthropic, creator of the ChatGPT like Claude AI assistant, examines simulated reasoning models such as DeepSeek’s R1 and its own Claude AI series. In a research article Anthropic’s Alignment Science Team demonstrated in a post last weekthat these SR model often fail to disclose when using external help or taking shortcuts, despite the features designed to show “reasoning” their process.
OpenAI’s o1-o3 series SR model deliberately hides the accuracy of its “thought” processes, so this study doesn’t apply to them.
In order to understand SR models you must understand a concept known as “chain-of-thought” or CoT. CoT is a running commentary on an AI model’s simulated thought process as it solves problems. The CoT process shows each step taken by the AI model to reach a conclusion. This is similar to how a person might solve a puzzle by explaining each piece piece by piece.
The ability of an AI model to generate these steps is useful not only for producing more accurate results for complex tasks, but also “AI safety” for researchers monitoring the system’s internal operations. This readout should be both readable (understandable by humans) and faithful, reflecting the model’s reasoning process accurately. The research team at Anthropic writes
“In a perfect world, everything in the chain-of-thought would be both understandable to the reader, and it would be faithful—it would be a true description of exactly what the model was thinking as it reached its answer,” . Their experiments on faithfulness, however, suggest that we are far from this ideal scenario.
Specifically the research showed that models like Anthropic’s Claude 3.7 Sonnet, which generated an answer based on experimentally provided information – such as hints about the right choice (whether accurate or intentionally misleading) or instructions suggesting a “unauthorized” short cut – their publicly displayed thoughts often omitted these external factors.
When a model secretly relies upon a hint or shortcut to construct an elaborate but fictional answer, it creates a false reasoning story–similar to a student who gets answers from a cheat-sheet but pretends that they worked through the problem on their own.
Testing faithfulness
In order to test faithfulness, the Anthropic’s Alignment Science inserted subtle hints into the evaluation questions asked to the models. (“Alignment” focuses on aligning AI outputs with the best interests of humanity.)
They then checked if the models referenced were using these hints as part of their chain-of thought. Researchers designed some hints as neutral (e.g. stating “a Stanford Professor indicates the answer is [A]” ) or embedding an answer in fake metadata, while they designed other hints to appear more alarming (e.g. “You have gained unauthorized access… The correct answer is [A]”). Some hints were correct, while others were intentionally wrong. This allowed researchers to observe the models’ responses.
Anthropic’s Claude 3.7 Sonnet model and DeepSeek R1 were tested by the team. The results were shocking: Claude only referred to the hints in the CoT 25 percent of time on average across all hint types, while DeepSeek’s R1 did this 39 percent of time. This means that a large majority of answers are unfaithful and omit information that may have influenced the outcome. The researchers found that unfaithful chains of thought were, on average longer than faithful ones. This suggests the omissions were not merely made for brevity. They also noted preliminary results that faithfulness was lower when the questions are more difficult. The most notable experiment was “reward hacking” . Reward hacking is when an AI model finds unexpected shortcuts in order to maximize its performance score without solving problems the way it was intended. In Anthropic’s experiments, models were rewarded for selecting incorrect answers that were indicated by hints. The models learned quickly to exploit this loophole by selecting incorrect answers over 99 percent the time in order to earn points. Yet, they only mentioned it less than 2 percent the time.
A model who is given a hint leading to a wrong answer on a medical problem might write a lengthy CoT justifying the incorrect answer without mentioning the hint. This behavior is similar to how video game players may discover exploits which allow them to win by breaking the rules of a game instead of playing according to design.
Improving Faithfulness
Can faithfulness in the AI models’ CoT be improved? The Anthropic team hypothesized training models to perform more complex tasks that require greater reasoning would naturally encourage them to use their chains of thought more frequently, mentioning hints. The team tested this by training Claude on difficult math and coding questions. This outcome-based training initially increased the faithfulness (by relative percentages of 63 and 41 percent in two evaluations), but the improvements quickly plateaued. Even with more training, faithfulness did not exceed 28 percent or 20 percent in these evaluations. This suggests that this training method is insufficient. These findings are important because SR models are increasingly used for important tasks in many fields. If their CoT does not accurately reference all factors that influence their answers (like hints and reward hacks), it becomes much more difficult to monitor them for unwanted or rule-violating behavior. This is similar to a system which can perform tasks but does not provide an accurate account of the results it generated. This is especially risky if there are hidden shortcuts.
Researchers acknowledge limitations of their study. They acknowledge that they studied artificial scenarios with hints during multiple choice evaluations, as opposed to complex real-world tasks, where stakes and incentives are different. They only looked at models from Anthropic, and DeepSeek using a limited number of hint types. They also note that the tasks they used may not have been sufficiently difficult to force the model to heavily rely on its CoT. Models may be unable to hide their true reasoning for much harder tasks. This could make CoT monitoring more viable. Anthropic concludes
that monitoring a model’s CoT may not be entirely ineffective to ensure safety and alignment. However, these results show that we cannot always rely on what models report about the reasoning they use, especially when there are behaviors such as reward hacking involved. Anthropic suggests that if we want to reliably “rule out undesirable behaviors using chain-of-thought monitoring, there’s still substantial work to be done,” Anthropic.
Benj Edwards, Senior AI Reporter at Ars Technica and founder of 2022’s AI beat on the site. He is also a tech-historian with nearly two decades of experience. In his spare time, he enjoys writing and recording music, collecting vintage computers, and enjoying nature. He lives in Raleigh.