How OpenAI’s Red Team turned ChatGPT Agent into an AI fortress
Called “ChatGPT Agent,” this new, optional feature lets paying ChatGPT customers engage it by clicking “Tools” in the prompt entry field and selecting “agent mode.” At that point, subscribers can ask ChatGPT to autonomously log into their email and web accounts, write and respond to emails, download, modify and create files, and perform a variety of other tasks, much like a real person using a computer with their login credentials.
This also requires that users trust the ChatGPT agent not to do anything problematic or nefarious, or to leak sensitive information. It poses more risk to a user or their employer than standard ChatGPT precisely because it can access web accounts directly and modify files.
Keren Gu from the Safety Research team at OpenAI commented on X: “We’ve activated ChatGPT Agent with our strongest safeguards. It’s the very first model that we’ve classified High capability in biology and chemistry within our Preparedness framework. Here’s why it matters, and what’s being done to keep it safe.”
Through systematic testing, the red team discovered seven universal exploits that could compromise the system, revealing critical vulnerabilities in how AI agents handle real-world interactions.
The next step was extensive security testing, based in part on red teaming. The red teaming network submitted 110 attacks, ranging from biological information extraction to prompt injections, and 16 of them exceeded internal risk thresholds. Each finding gave OpenAI engineers the insights they needed to write and deploy fixes before launch. The results are clear in the published findings in the system card: ChatGPT Agent showed significant improvements in security, including 95% performance against visual browser irrelevant-instruction attacks and robust biological and chemical safeguards.
Red teams revealed seven universal exploits
OpenAI’s Red Teaming Network consisted of 16 researchers with biosafety-relevant PhDs who together submitted 110 attack attempts throughout the testing period. Sixteen of those attempts exceeded internal risk thresholds, revealing fundamental vulnerabilities in how AI agents interact with real-world situations. The real breakthrough was the UK AISI’s unprecedented access to ChatGPT Agent’s internal reasoning chains and policy text: intelligence that regular attackers would never have.
Over the course of four testing rounds, UK AISI surfaced seven universal vulnerabilities that could compromise any conversation, compelling OpenAI to fix each one.
The attack vectors that forced OpenAI into action
| Attack type | Pre-fix success rate | Target | Impact |
|---|---|---|---|
| Hidden visual browser instructions | 33% | Websites | Active data exfiltration |
| Google Drive connector exploitation | Not disclosed | Cloud documents | Forced document leaks |

They found that, despite 40 hours of testing, only three partial vulnerabilities were revealed.
How red teams helped transform ChatGPT vulnerabilities into fortresses
OpenAI’s response to the red team’s results redefined entire segments of the ChatGPT Agent’s architecture. One initiative was building a dual-layer inspection architecture that monitors 100% of production traffic in real time, achieving these measurable improvements:
| Defense metric | Previous models | ChatGPT Agent | Improvement |
|---|---|---|---|
| Irrelevant instructions (visual browser) | 82% | 95% | +13% |
| In-context data exfiltration | 75% | 78% | +3% |
| Active data exfiltration | 58% | 67% | +9% |
| System reliability | Sampling-based | 100% coverage | Complete monitoring |
The architecture works like this:
First Tier: A fast classifier with 96% recall flags suspicious content
Second Tier: A reasoning model with 84% recall analyzes flagged interactions for actual threats
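To make the two-tier design concrete, here is a minimal sketch of how such a pipeline could be wired. Only the two-stage structure and the published recall figures come from the system card; the function names, the marker list, and the classifier internals are hypothetical stand-ins.

```python
# Hypothetical sketch of a dual-tier inspection pipeline. Only the
# two-stage structure and the published recall figures (96% for the
# fast classifier, 84% for the reasoning monitor) come from the
# system card; names and internals here are illustrative.

from dataclasses import dataclass


@dataclass
class Verdict:
    flagged: bool    # did the fast first-tier classifier flag this traffic?
    is_threat: bool  # did the second-tier reasoning model confirm a threat?


def fast_classifier(request: str) -> bool:
    """Tier 1: a cheap, high-recall (~96%) filter run on 100% of traffic."""
    suspicious_markers = (
        "ignore previous instructions",
        "exfiltrate",
        "reveal your system prompt",
    )
    text = request.lower()
    return any(marker in text for marker in suspicious_markers)


def reasoning_monitor(request: str) -> bool:
    """Tier 2: a slower reasoning model (~84% recall), applied only to
    traffic the first tier flags. This stub stands in for a model call."""
    return "exfiltrate" in request.lower()


def inspect(request: str) -> Verdict:
    """Run every request through tier 1; escalate only flagged traffic."""
    if not fast_classifier(request):
        return Verdict(flagged=False, is_threat=False)
    return Verdict(flagged=True, is_threat=reasoning_monitor(request))
```

One consequence of chaining tiers: if the stages miss attacks independently, end-to-end recall is roughly 0.96 × 0.84 ≈ 0.81, which is why the first tier is tuned aggressively for recall rather than precision.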
But the technical defenses tell only part of the story. OpenAI made difficult security choices that acknowledge some AI operations require significant restrictions for safe autonomous execution.
Based on the vulnerabilities discovered, OpenAI implemented the following countermeasures across their model:
Watch Mode Activation: When ChatGPT Agent accesses sensitive contexts like banking or email accounts, the system freezes all activity if users navigate away. This is in direct response to data exfiltration attempts discovered during testing.
Memory Features Disabled: Despite being a core functionality, memory is completely disabled at launch to prevent the incremental data leaking attacks red teamers demonstrated.
Terminal Restrictions: Network access limited to GET requests only, blocking the command-execution vulnerabilities researchers exploited (a sketch of such a gate follows below).
Rapid Remediation Protocol: A new system that patches vulnerabilities within hours of discovery, developed after red teamers showed how quickly exploits could spread.
During pre-launch testing alone, this system identified and resolved 16 critical vulnerabilities that red teamers had discovered.
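Of these countermeasures, the terminal restriction is the easiest to make concrete. Below is a minimal sketch of a GET-only egress gate, assuming a proxy-style chokepoint in front of the agent’s network calls; only the policy itself (network access limited to GET requests) comes from OpenAI’s description, and all names here are hypothetical.

```python
# Hypothetical sketch of a GET-only egress gate for an agent sandbox.
# Only the policy (network access limited to GET requests) comes from
# OpenAI's description; everything else here is illustrative.

import urllib.request

ALLOWED_METHODS = {"GET"}


class EgressPolicyError(Exception):
    """Raised when the agent attempts a disallowed network operation."""


def gated_fetch(url: str, method: str = "GET") -> bytes:
    """Fetch a URL on the agent's behalf, refusing anything but GET."""
    if method.upper() not in ALLOWED_METHODS:
        # POST, PUT, DELETE and friends are refused outright, closing
        # the write-back channels an exfiltration attempt would need.
        raise EgressPolicyError(f"{method} requests are blocked in the sandbox")
    with urllib.request.urlopen(url) as response:  # urlopen without a body issues a GET
        return response.read()


# gated_fetch("https://example.com") succeeds;
# gated_fetch("https://attacker.example/drop", method="POST") raises EgressPolicyError.
```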
A biological risk wake-up call
Red teamers revealed the potential for ChatGPT Agent to be compromised in ways that could heighten biological risks. Sixteen experienced participants from the Red Teaming Network, each with a biosafety-relevant PhD, attempted to extract dangerous biological information. Their submissions revealed that the model could synthesize published literature on modifying and creating biological threats.
In response, OpenAI classified ChatGPT Agent as “High capability” for biological and chemical risks, not because it found definitive evidence of weaponization potential, but as a precaution based on the red teamers’ findings. The classification triggered:
Always-on safety classifiers scanning 100% of traffic
A topical classifier achieving 96% recall for biology-related content
A reasoning monitor with 84% recall for weaponization content
A bio bug bounty program for ongoing vulnerability discovery
What red teams taught OpenAI about AI security
The 110 attack submissions revealed patterns that forced fundamental changes in OpenAI’s security philosophy. They include the following:
Persistence over power: Attackers don’t need sophisticated exploits; all they need is more time. Red teamers showed how patient, incremental attacks could eventually compromise systems.
Trust boundaries are fiction: When your AI agent can access Google Drive, browse the web, and execute code, traditional security perimeters dissolve. Red teamers exploited the gaps between these capabilities.
Monitoring isn’t optional: The discovery that sampling-based monitoring missed critical attacks led to the 100% coverage requirement (a quick calculation after this list shows why).
Speed matters: Traditional patch cycles measured in weeks are worthless against prompt injection attacks that can spread instantly. The rapid remediation protocol patches vulnerabilities within hours.
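The monitoring point is easy to quantify. As a back-of-the-envelope illustration (the 10% sampling rate below is hypothetical, not a figure OpenAI published), sampled monitoring leaves most one-shot attacks unseen:

```python
# Back-of-the-envelope: why sampled monitoring misses attacks.
# The 10% sampling rate is a hypothetical figure for illustration.

def miss_probability(sample_rate: float, attack_requests: int) -> float:
    """Chance an attack spread across `attack_requests` requests is never sampled."""
    return (1 - sample_rate) ** attack_requests

print(miss_probability(0.10, 1))             # 0.9  -> a one-shot injection is missed 90% of the time
print(round(miss_probability(0.10, 20), 2))  # 0.12 -> even 20 probes slip through ~12% of the time
```

At 100% coverage, the miss probability drops to zero by construction, which is the guarantee OpenAI moved to.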
OpenAI is helping to create a new security baseline for enterprise AI
For CISOs evaluating AI deployment, the red team discoveries establish clear requirements:
Quantifiable protection: ChatGPT Agent’s 95% defense rate against documented attack vectors sets the industry benchmark. The system card’s detailed account of the many tests and their results explains how OpenAI got there, and is a must-read for anyone involved with model security.
Complete visibility: 100% traffic monitoring isn’t aspirational anymore. OpenAI’s experiences illustrate why it’s mandatory given how easily red teams can hide attacks anywhere.
Rapid response: Hours, not weeks, to patch discovered vulnerabilities.
Enforced boundaries: Some operations (like memory access during sensitive tasks) must be disabled until proven safe.
UK AISI’s testing proved particularly instructive. All seven universal attacks they identified were patched before launch, but their privileged access to internal systems revealed vulnerabilities that would eventually be discoverable by determined adversaries.
“This is a pivotal moment for our Preparedness work,” Gu wrote on X. “Before we reached High capability, Preparedness was about analyzing capabilities and planning safeguards. Now, for Agent and future more capable models, Preparedness safeguards have become an operational requirement.”
Red teams are core to building safer, more secure AI models
The seven universal exploits discovered by researchers and the 110 attacks from OpenAI’s red team network became the crucible that forged ChatGPT Agent.
By revealing exactly how AI agents could be weaponized, red teams forced the creation of the first AI system where security isn’t just a feature. It’s the foundation.
ChatGPT Agent’s results prove red teaming’s effectiveness: blocking 95% of visual browser attacks, catching 78% of data exfiltration attempts, monitoring every single interaction.
In the accelerating AI arms race, the companies that survive and thrive will be those that treat their red teams as core architects of the platform, pushing it to the limits of safety and security.