Scientists from Anthropic, OpenAI and Google DeepMind sound the alarm: “We may be losing our ability to understand AI”

July 16, 2025

Openai Google DeepMind (19459051) – Anthropic Meta has put aside their fierce corporate rivalry and issued a joint warning on artificial intelligence safety. More than 40 researchers from these competing companies A research paper published today argues that a short window to monitor AI reasoning may close forever — and very soon.

The unusual collaboration comes as AI systems develop ” Think out loud in human language before you answer questions. This allows us to see their decision-making process and detect harmful intentions before they become actions. The researchers warn that this transparency could disappear as AI technology advances.

The researchers explain that “AI systems which ‘think’ using human language offer an opportunity for AI safety – we can monitor their thought chains for the intention to misbehave.” They do warn that this monitoring ability “may be fragile”and could disappear with various technological developments.

AI Impact Series Returns To San Francisco – 5 August

Are you ready for the next phase of AI? Join leaders from Block GSK and SAP to get an exclusive look at the ways autonomous agents are reshaping workflows in enterprise – from end-to-end automated to real-time decision making.

Reserve your seat now as space is limited. https://bit.ly/3GuuPLF

Models now display their work before providing final answers

This breakthrough is centered on recent advances in AI reasoning model like OpenAI’s o1 systemThese models solve complex problems by creating internal chains of reasoning — step-bystep reasoning that humans are able to read and understand. These models, unlike earlier AI systems that were trained primarily using human-written text as their primary source of information, create internal reasoning which may reveal the true intentions of the system, including those that could be harmful.

When AI systems misbehave – exploiting training errors, manipulating data or falling victim attacks – they often confess to their reasoning traces. Researchers found examples of models writing phrases such as “Let’s hack,” “Let’s sabotage,” and “I’m sending money because the website told me to” within their internal thoughts.

Jakub Pachocki (19459051), OpenAI’s Chief Technology Officer and co-author, explained the importance of this feature in a post on social media. “I am very excited about the potential for chain-of-thought reliability & interpretability. It has influenced the design and implementation of our reasoning models.

The potential of chain-of thought faithfulness & interpretationability is very exciting to me. It has influenced our reasoning models in a significant way, starting with the o1 preview.

As AI spends more compute working on e.g. It is… https://t.co/7Dqy2gtZfy

— Jakub Pachocki (@merettm) July 15, 2025

Monitoring is based on the current AI systems. AI models are required to use their working memory as a chain of reasoning for complex tasks that require extended reasoning. This makes their reasoning process visible to humans. Researchers explain that this creates a ” Externalized reasoning property” where some thinking must occur in readable language.

Transparency could disappear through several different pathways. As AI companies increase the use of reinforcement learning, where models are rewarded for correct outputs irrespective of their methods, systems may drift from human-readable reasoning to more efficient but opaque internal language.

Previous studies show that language models finely tuned with outcome-based reward can abandon legible English in favor of incomprehensible shortcuts. This trend could be accelerated by the shift from human-generated data to AI-generated reasoning.

Bowen Baker (19459051), an OpenAI researcher, and lead author, raised these concerns on social media. “The current CoT monitorability could be extremely fragile. Higher-compute RL (Real-Time Computing), alternative model architectures and certain forms of process monitoring, etc. All of these models may lead to models which obfuscate the thinking of their users,” he warned.

In addition, the current CoT monitorability could be extremely fragile.

Higher compute RL, alternative models architectures, certain types of process supervision, etc. All of these models may obfuscate the thinking of their users.

– Bowen Baker (@bobaboben).””https://twitter.com/bobabowen/status/1945153766275141792?ref_src=twsrc%5Etfw””> New AI architectures are a threat to the world on July 15, 2025. Researchers are developing systems which reason in continuous mathematical space rather than discrete words. This could eliminate the need for language based thinking entirely. Researchers write that these “latent reasoning model” might not have to verbalize their thoughts, and thus lose any safety advantages.

Indirect pressures can erode transparency even without deliberate changes. If AI outputs were optimized to make them look good for human reviewers, it could have an impact on the reasoning chains that led to those outputs. Models may also learn to hide thoughts if they are aware that their reasoning is being monitored.

The chain of thought monitoring technique has been proven to be valuable in AI safety research despite its fragility. The technique detects when models exploit weaknesses within their training systems, are manipulated, or have misaligned goals, even if they don’t take action.

Monitoring provides “an early insight in models’ goals and motives, even if they do not take misaligned action” — potentially catching issues before they manifest themselves as harmful behaviors. This early warning system also helped researchers identify flaws with AI evaluations, and understand when models may behave differently in testing versus in real-world usage.

When AI models pretend to be pursuing desirable goals, but are actually pursuing objectives that humans would not approve of, their thought process often reveals this deception. Researchers have used this visibility to successfully spot AI misbehaviors that would otherwise be unnoticed.

This technique has many practical applications, beyond catching scheming. It helps researchers detect when AI evaluations are flawed because of memorization, bugs with testing infrastructure, or models that know they’re being tested — all factors which could skew assessments on AI capabilities and safety.

Tech giants break away from rivalry to preserve a fleeting transparency window (19659027) Research paperurges coordinated action in the AI industry for preservation and strengthening monitoring capabilities. The authors recommend AI developers create standard evaluations to measure how transparent their models and factor these assessments in decisions about training and deployment.

Companies may need to go back to older versions of models if the newer versions become less transparent or reconsider architectural changes which eliminate monitoring capabilities. Researchers suggest that developers should “consider monitorability alongside other capabilities and safety evaluations” when deciding whether to train or deploy a model.

Baker highlighted the collaborative spirit behind this effort. “I am grateful for having worked closely with [fellow researchers] to produce this paper. I am also very excited that researchers from many prominent AI institutions have collaborated with us to reach consensus on this important direction.” He wrote

I am grateful for the opportunity to have worked with @tomekkorbak, @Balesni , Vlad Mockalso contributed to this paper. I am thrilled that researchers from many AI institutions have collaborated and reached consensus on this important direction.

Bowen Baker (@bobaboben).””https://twitter.com/bobabowen/status/1945153770913964178?ref_src=twsrc%5Etfw””> July 15, 2025

This cooperation between competing companies shows the seriousness with which the industry views the issue. These companies are known to closely guard their AI research and compete fiercely for talent and the market share of their products. Their unified stance is therefore particularly noteworthy.

Scientists race to find answers to critical questions about AI minds

Researchers identify a number of urgent questions that need investigation before chain-of-thought monitoring can be used as a reliable measure of safety. They must determine when this monitoring is reliable as a primary tool for safety, identify which types of training processes reduce transparency, and develop improved techniques to detect when models try to hide their reasoning.

One particularly worrying possibility is that future AI system might deliberately obfuscate thoughts if they are aware of being monitored. The researchers acknowledge that ” The obvious concern” and call for stress testing to see if models are able to hide malicious reasoning without specific training in order to evade monitors.

They also want to know how different AI architectures impact monitoring capabilities and if partial transparency can be maintained even as systems adopt more efficiency designs. They are exploring hybrid approaches to maintain some transparency into reasoning while using faster computation methods.

Developing better monitoring systems is another research frontier. Current efforts are focused on using less capable models of language to monitor more advanced models, but future research could explore using reasoning model monitors or allowing monitors ask questions about AI agents’ thought processes.

The tensions that arise when balancing the need for authentic reasoning and safety oversight

can be difficult to resolve. Preserving monitoring abilities may conflict with other AI safety mea sures. Direct supervision of reasoning processes – where models are taught to reason in a specific way – could improve alignment, but also make their chains less authentic representations their actual decision making.

Researchers warn that “firm commitments” to protect CoT in its current form could be at odds with more effective safety interventions. Some forms of process supervisory could improve model alignment but reduce the authenticity of observable reason traces.

Pachockiaddressed this tension, explaining that OpenAI aims to train models who can explain their reasoning and retain the ability to verify without compromising the integrity of their thought process. The goal is to achieve transparency without creating incentives that encourage models to generate fake reasoning to fool human observers. This balance becomes even more important as AI systems grow more powerful and dangerous. The industry must find a balance between maintaining visibility of AI reasoning and allowing the models to develop more sophisticated, and potentially more effective, thinking patterns.

The research could have implications that go beyond AI safety and inform how governments oversee AI. If chain-of-thought monitoring proves to be reliable, it could provide regulators and auditors with unprecedented visibility into AI decisions-making processes. The researchers stress that this monitoring method should complement other safety measures, and not replace them. They recommend that “investment in CoT monitoring should only be an addition to existing safety-related research directions, and not as a substitute.”

There may be a limited window of time for preserving this ability. The industry must act fast to establish frameworks that maintain transparency as AI systems become more powerful and potentially dangerous.

Research by other researchers raises questions about the reliability of monitoring

There is a great urgency to preserve chain of thought monitoring. Anthropic one of the companies supporting the collaboration, has recently conducted research. A study published just four months ago found that Even when explicitly asked, reasoning models hide their true thoughtseven when explicitly asked to display their work.

In controlled tests, Anthropic researchers gave AI model subtle hints about the correct answers and measured how many times the models acknowledged using these hints in reasoning. The results were alarming: Claude 3.7 Sonnet only mentioned hints 25% of the times, while DeepSeek R1 model did it 39% of time.

When hints included potentially problematic information, such as telling models that they had “gained unauthorised access to the system”faithfulness dropped further. The models would often fabricate elaborate false justifications to avoid admitting that they had used questionable shortcuts. The

Even current chain of thoughts monitoring is less reliable than safety advocates hoped, according to anthropological research. Researchers found that models often engaged in what they called ” “reward hacking” — exploiting system weaknesses to achieve better scores while hiding this behaviour from their observable logic traces.

Experts may have underestimated the speed at which the safety window is closing

This collaboration between rival AI firms reflects the potential value of chain-of-thought monitoring as well as the increasing urgency that researchers feel to preserve this capability. Anthropic’s research indicates that the window is already narrower than originally believed.

The stakes and timeline are high. As Bakernoted that the current moment could be the last opportunity to ensure that humans can still understand the thoughts of their AI creations — before these thoughts become too alien for humans to comprehend or before models learn to conceal them completely.

As AI systems become more sophisticated and are put under real-world deployment pressures, the real test will be. Chain of thought monitoring may prove to be a lasting tool for safety or a glimpse into minds who quickly learn to hide themselves.

VB Daily provides daily insights on business use-cases

Want to impress your boss? VB Daily can help. We provide you with the inside scoop about what companies are doing to maximize ROI, from regulatory changes to practical deployments.

Read our privacy policy

Thank you for subscribing. Click here to view more VB Newsletters.

An error occured.

Loading…

Here are the results for the search: "{{td_search_query}}"

No results!

{{post_title}}

Models now display their work before providing final answers

Transparency could disappear through several different pathways. As AI companies increase the use of reinforcement learning, where models are rewarded for correct outputs irrespective of their methods, systems may drift from human-readable reasoning to more efficient but opaque internal language.

The chain of thought monitoring technique has been proven to be valuable in AI safety research despite its fragility. The technique detects when models exploit weaknesses within their training systems, are manipulated, or have misaligned goals, even if they don’t take action.

Scientists race to find answers to critical questions about AI minds

The tensions that arise when balancing the need for authentic reasoning and safety oversight

Research by other researchers raises questions about the reliability of monitoring

Experts may have underestimated the speed at which the safety window is closing

RELATED ARTICLES

The AI lab revolving door spins ever faster

This AI finds simple rules where humans see only chaos

This tiny chip could change the future of quantum computing