Google’s Gemini 2.5 Computer Use: Advancing AI Agents with Web Interaction Capabilities
Leading providers of large language models (LLMs) have recently expanded beyond traditional multimodal chatbots, evolving their technologies into autonomous “agents” capable of performing complex tasks across websites. Notable examples include innovations from companies like Recall and others launched over the past two years.
Joining this trend, Google’s DeepMind AI lab has introduced a specialized iteration of its Gemini 2.5 Pro LLM, dubbed Gemini 2.5 Computer Use. This model is designed to operate a virtual browser that can navigate websites, extract information, complete forms, and execute various online actions, all triggered by a single user prompt.
Revolutionizing Web Interaction Through AI Agents
Google CEO Sundar Pichai highlighted the significance of this development, emphasizing that enabling AI to interact with web elements such as scrolling, dropdown menus, and form inputs marks a crucial step toward creating versatile, general-purpose agents. Although still in its early stages, Gemini 2.5 Computer Use represents a leap forward in AI’s ability to mimic human-like web navigation and task execution.
Unlike consumer-facing products, this model is currently accessible only through partnerships, notably with Browserbase, a company offering a “headless” web browser environment tailored for AI agents. While headless browsers typically operate without a graphical interface, Browserbase provides a visual representation to facilitate user interaction and debugging.
Hands-On Experience: Practical Demonstrations
Testing Gemini 2.5 Computer Use on Browserbase reveals its practical capabilities. For instance, when prompted to visit Taylor Swift’s official site, the model accurately navigated and summarized the featured promotion: a special edition of her latest album, The Life of a Showgirl. In another scenario, it attempted to find highly rated solar garden lights on Amazon, solving a Google Search CAPTCHA along the way by selecting the correct images within seconds. However, it encountered difficulties completing the full purchase process, highlighting areas for further refinement.
It’s important to note that, unlike some competitors such as OpenAI’s GPT-4 and Anthropic’s Claude, Gemini 2.5 Computer Use does not currently support direct file creation or editing (e.g., documents or spreadsheets). Instead, its strength lies in controlling web and mobile interfaces through actions like clicking, typing, and scrolling, with output limited to UI commands or conversational responses. Developers must integrate additional tools or custom code to handle structured outputs.
Benchmarking Performance Against Competitors
Google reports that Gemini 2.5 Computer Use outperforms rival AI agents in several interface control benchmarks, based on evaluations conducted via Browserbase and internal testing:
- Online-Mind2Web (Browserbase): Gemini scored 65.7%, surpassing Claude Sonnet 4’s 61.0% and OpenAI Agent’s 44.3%.
- WebVoyager (Browserbase): Achieved 79.9%, ahead of Claude Sonnet 4’s 69.4% and OpenAI Agent’s 61.0%.
- AndroidWorld (DeepMind): Recorded 69.7%, outperforming Claude Sonnet 4’s 62.1%; OpenAI’s model was not benchmarked due to access limitations.
- OSWorld: Not yet supported by Gemini 2.5; top competitor scored 61.4%.
In addition to accuracy, Gemini 2.5 Computer Use boasts lower latency compared to other browser automation solutions, a critical advantage for real-time UI testing and automation workflows.
How Gemini 2.5 Computer Use Operates
The model functions within an iterative interaction loop, processing three key inputs:
- User’s task description
- Current interface screenshot
- History of previous actions
Based on this data, it recommends UI actions such as clicking buttons or entering text. For sensitive operations, such as purchases, it can request user confirmation before proceeding. After each action, the interface updates, and a new screenshot is fed back into the model, continuing the cycle until the task is complete or interrupted due to errors or safety concerns.
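The cycle described above can be sketched in a few lines of Python. This is an illustrative outline only, not Google's client code: the `Action` class and the `propose`, `execute`, `screenshot`, and `confirm` callables are hypothetical stand-ins for the model call, the browser driver, the screen capture, and the user prompt.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Action:
    """A proposed UI action (hypothetical structure, for illustration)."""
    name: str                                  # e.g. "click_at" or "type_text_at"
    args: dict = field(default_factory=dict)   # action parameters
    needs_confirmation: bool = False           # set for sensitive steps


def run_agent_loop(
    task: str,
    propose: Callable[[str, bytes, List[Action]], Optional[Action]],
    execute: Callable[[Action], None],
    screenshot: Callable[[], bytes],
    confirm: Callable[[Action], bool],
    max_steps: int = 20,
) -> List[Action]:
    """Repeat: capture screen, ask for the next action, execute it."""
    history: List[Action] = []
    for _ in range(max_steps):
        # The three inputs: task description, current screenshot, action history.
        action = propose(task, screenshot(), history)
        if action is None:
            break  # the model signals that the task is complete
        if action.needs_confirmation and not confirm(action):
            break  # the user declined a sensitive step
        execute(action)
        history.append(action)
    return history
```

The `max_steps` cap is a design choice of this sketch rather than a documented feature; it keeps a confused agent from looping indefinitely.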
Integration is facilitated through a specialized tool named computer_use, compatible with automation frameworks like Playwright and accessible via Browserbase’s sandbox environment.
Real-World Applications and Industry Adoption
Google and external partners have begun leveraging Gemini 2.5 Computer Use across various sectors:
- Google Payments Team: Reports a 60% recovery rate of failed test executions, significantly improving engineering efficiency.
- Autotab: An AI agent platform that observed up to an 18% performance boost in complex data extraction tasks using Gemini.
- Poke.com: A proactive AI assistant provider noting that Gemini operates approximately 50% faster than competing models during interface interactions.
Internally, Google employs this model in projects such as Project Mariner, the Firebase Testing Agent, and the AI Mode in Search, underscoring its strategic importance.
Robust Safety Protocols
Given its direct control over software interfaces, Gemini 2.5 Computer Use incorporates multiple safety layers:
- A per-action safety review that evaluates each proposed UI interaction before execution.
- Developer-configurable system-level rules to block or require approval for specific operations.
- Built-in safeguards to prevent security breaches and ensure compliance with Google’s usage policies.
For example, when encountering CAPTCHAs, the model can initiate the necessary clicks but flags these steps for user confirmation, preventing unauthorized automated submissions.
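To make the idea of developer-configurable rules concrete, here is a minimal sketch of a pre-execution check that allows, blocks, or escalates a proposed action. The `Decision` enum, the `review_action` function, and the keyword-matching approach are all assumptions made for illustration; the article does not describe how Google's safety layers are implemented.

```python
from enum import Enum
from typing import Iterable


class Decision(Enum):
    ALLOW = "allow"
    CONFIRM = "confirm"   # pause and ask the user first
    BLOCK = "block"


def review_action(
    action_name: str,
    target_description: str,
    blocked_actions: Iterable[str] = (),
    confirm_keywords: Iterable[str] = ("captcha", "purchase", "checkout"),
) -> Decision:
    """Classify one proposed UI action against simple developer rules."""
    if action_name in set(blocked_actions):
        return Decision.BLOCK
    lowered = target_description.lower()
    if any(keyword in lowered for keyword in confirm_keywords):
        return Decision.CONFIRM
    return Decision.ALLOW
```

A caller would run this check on every proposed action and only execute those that come back as `Decision.ALLOW`, or as `Decision.CONFIRM` once the user has approved.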
Technical Features and Capabilities
Gemini 2.5 Computer Use supports a broad range of UI commands, including:
- click_at, type_text_at, scroll_document, drag_and_drop, among others
- Extension through user-defined functions for mobile or custom environments
- Normalized screen coordinates (0-1000 scale) mapped to actual pixel dimensions during execution
The model accepts both image and text inputs, producing either textual responses or function calls to perform tasks. While optimized for a screen resolution of 1440×900, it remains adaptable to other display sizes.
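The mapping from the normalized 0-1000 grid to real pixels is simple arithmetic. The helper below is a sketch; the function name `denormalize` and the choice to clamp to the last valid pixel are assumptions, and the default size is the 1440×900 resolution mentioned above.

```python
from typing import Tuple


def denormalize(x_norm: int, y_norm: int,
                width: int = 1440, height: int = 900) -> Tuple[int, int]:
    """Convert coordinates on a 0-1000 grid to pixel coordinates."""
    if not (0 <= x_norm <= 1000 and 0 <= y_norm <= 1000):
        raise ValueError("normalized coordinates must lie in 0..1000")
    # Integer arithmetic avoids floating-point rounding surprises.
    x_px = min(x_norm * width // 1000, width - 1)
    y_px = min(y_norm * height // 1000, height - 1)
    return x_px, y_px


print(denormalize(500, 500))             # (720, 450)
print(denormalize(250, 100, 1920, 1080)) # (480, 108)
```

Because the model's output is resolution-independent, the same proposed action can be replayed on a different viewport by changing only `width` and `height`.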
Pricing Structure and Access Details
Pricing for Gemini 2.5 Computer Use closely mirrors that of the standard Gemini 2.5 Pro model, with a per-token billing system:
- Input tokens: $1.25 per million for prompts of up to 200,000 tokens; $2.50 per million for longer prompts
- Output tokens: $10.00 per million when the prompt is up to 200,000 tokens; $15.00 per million for longer prompts
However, key differences exist in availability and features:
- Gemini 2.5 Pro offers a free tier that covers both input and output tokens, with no explicit token cap but subject to platform-specific rate limits. Paid tiers apply standard pricing beyond free usage.
- Gemini 2.5 Computer Use is exclusively available on a paid basis, with no free access currently offered.
- Additional features like context caching and Google Search grounding are available for Gemini 2.5 Pro but not yet supported in the Computer Use variant.
- Data handling policies differ: output from Computer Use paid-tier usage is not used to improve Google models, whereas free-tier Gemini 2.5 Pro usage contributes to model training unless opted out.
Developers should weigh these factors (cost, access, capabilities, and data policies) when selecting the appropriate model for their projects.
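For budgeting, the per-token rates quoted in the pricing section can be turned into a rough per-request estimate. The sketch below assumes that both the input and the output rate switch to the higher tier once the prompt exceeds 200,000 tokens, and that a prompt of exactly 200,000 tokens is billed at the lower tier; the function name `estimate_cost_usd` is hypothetical, and actual billing should be checked against Google's current price list.

```python
LONG_PROMPT_THRESHOLD = 200_000  # tokens; tier boundary quoted in the article


def estimate_cost_usd(prompt_tokens: int, output_tokens: int) -> float:
    """Rough cost of one request at the article's per-million-token rates."""
    long_prompt = prompt_tokens > LONG_PROMPT_THRESHOLD
    input_rate = 2.50 if long_prompt else 1.25     # USD per million input tokens
    output_rate = 15.00 if long_prompt else 10.00  # USD per million output tokens
    return (prompt_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# A 100,000-token prompt with a 2,000-token response:
print(f"${estimate_cost_usd(100_000, 2_000):.3f}")  # $0.145
```

Since every step of an agent run sends a fresh screenshot plus the action history, a multi-step task is billed as many requests, so the estimate should be summed over all steps.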

