Summary: Computer-use agents, powered by vision-language models (VLMs), simulate human interaction with software interfaces without requiring any modifications to the applications themselves. Initial benchmarks on OSWorld started at a modest 12.24% success rate compared to human performance at 72.36%. Recent advancements, such as Claude Sonnet 4.5, have elevated this to 61.4%. Meanwhile, Gemini 2.5 Computer Use dominates multiple web-based benchmarks, including Online-Mind2Web (69.0%) and WebVoyager (88.9%), though it remains unrefined for full operating system control. Future development priorities focus on enhancing OS-level stability, achieving sub-second response times, and implementing robust safety protocols, with open-source communities contributing transparent training and evaluation frameworks.
Understanding Computer-Use Agents
Computer-use agents, also known as graphical user interface (GUI) agents, are sophisticated VLM-driven systems designed to interpret visual data from screens, identify UI components, and perform a limited set of interface actions such as clicking, typing, scrolling, and keyboard shortcuts. These agents operate seamlessly on unaltered software environments, including desktop applications and web browsers. Prominent examples in the public domain include Anthropic’s Computer Use, Google’s Gemini 2.5 Computer Use, and OpenAI’s Computer-Using Agent, which powers the Operator platform.
Operational Workflow of GUI Agents
The typical execution cycle of these agents involves several key steps: (1) capturing the current screen image and system state, (2) generating the next action plan by grounding UI elements both spatially and semantically, (3) executing the planned action within a predefined set of permissible commands, and (4) validating the outcome and retrying if necessary. Industry players provide detailed documentation on standardized action vocabularies and safety constraints, while independent evaluation frameworks ensure fair and consistent benchmarking.
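The four-step cycle above can be sketched as a minimal loop. This is a vendor-neutral illustration, not any provider's actual API: `capture_screen`, `plan_next_action`, `execute`, and `verify` are hypothetical callables standing in for the perception, planning, action, and validation stages.

```python
from dataclasses import dataclass

@dataclass
class Action:
    """A single UI action drawn from a constrained vocabulary."""
    name: str   # e.g. "click_at", "type", "key_combo"
    args: dict

def run_agent(task: str, capture_screen, plan_next_action, execute, verify,
              max_steps: int = 20, max_retries: int = 2) -> bool:
    """Observe-plan-act-verify loop. The callables are injected so the
    sketch stays vendor-neutral; all names here are hypothetical."""
    for _ in range(max_steps):
        screenshot = capture_screen()                # (1) capture screen + state
        action = plan_next_action(task, screenshot)  # (2) ground UI, plan next action
        if action is None:                           # planner signals task completion
            return True
        for _attempt in range(max_retries + 1):
            execute(action)                          # (3) act within the allowed set
            if verify(capture_screen(), action):     # (4) validate outcome
                break
        else:
            return False                             # retries exhausted on this step
    return False                                     # step budget exhausted
```

Injecting the stages as callables keeps the control flow separate from any particular model or automation backend.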
Current Benchmarking Landscape
- OSWorld (HKU, April 2024): This benchmark encompasses 369 authentic desktop and web tasks, covering file operations and multi-application workflows. At its inception, human users achieved a 72.36% success rate, whereas the leading AI model managed only 12.24%.
- Progress in 2025: Anthropic’s Claude Sonnet 4.5 has significantly improved performance, reaching 61.4% on OSWorld, marking a substantial leap from earlier versions.
- Web-Centric Benchmarks: Google’s Gemini 2.5 Computer Use excels in live web environments, scoring 69.0% on Online-Mind2Web, 88.9% on WebVoyager, and 69.7% on AndroidWorld. However, this model is optimized primarily for browser interactions and has yet to fully address operating system-level complexities.
- Online-Mind2Web Details: This benchmark tests 300 tasks across 136 active websites, with results independently verified by Princeton University and hosted on a public Hugging Face leaderboard.
Core System Architecture
- Visual Perception and UI Grounding: Agents periodically capture screenshots, perform optical character recognition (OCR), localize interface elements, and infer coordinates to understand the UI layout.
- Strategic Planning: Multi-step policies guide the agent’s actions, often enhanced through post-training or reinforcement learning to improve UI control and error recovery.
- Action Framework: Agents operate within a constrained set of commands such as `click_at`, `type`, `key_combo`, and `open_app`, with task-specific restrictions to prevent shortcut exploitation.
- Evaluation Environment: Testing occurs in live web or virtual machine sandboxes, with third-party audits and reproducible execution scripts ensuring transparency and reliability.
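A constrained action vocabulary is typically enforced through a typed contract: each command name declares the arguments it accepts, and anything outside the schema is rejected before execution. The sketch below is an assumed, minimal version of such a contract; the command names mirror the examples above, but the schema itself is hypothetical.

```python
# Hypothetical typed action contract: each command maps to the exact
# argument names and types it accepts; anything else is rejected.
ACTION_SCHEMA = {
    "click_at":  {"x": int, "y": int},
    "type":      {"text": str},
    "key_combo": {"keys": list},
    "open_app":  {"name": str},
}

def validate_action(name: str, args: dict) -> None:
    """Raise ValueError if the action is outside the allowed vocabulary
    or its arguments do not match the declared names and types."""
    if name not in ACTION_SCHEMA:
        raise ValueError(f"action {name!r} not in allowed vocabulary")
    spec = ACTION_SCHEMA[name]
    if set(args) != set(spec):
        raise ValueError(f"{name} expects {sorted(spec)}, got {sorted(args)}")
    for key, expected in spec.items():
        if not isinstance(args[key], expected):
            raise ValueError(f"{name}.{key} must be {expected.__name__}")
```

Validating before dispatch is what makes the action set a hard boundary rather than a convention the model is merely asked to follow.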
Industry Implementations and Capabilities
- Anthropic: Offers a Computer Use API with Sonnet 4.5 achieving 61.4% on OSWorld. Their documentation highlights pixel-precise grounding, retry mechanisms, and safety checks.
- Google DeepMind: Provides the Gemini 2.5 Computer Use API, boasting top scores on Online-Mind2Web, WebVoyager, and AndroidWorld benchmarks, alongside latency metrics and safety features.
- OpenAI: Features the Operator research preview powered by a Computer-Using Agent, accessible to select U.S. Pro users, with a dedicated developer interface and limited availability.
Future Directions: From Web Interfaces to Full OS Control
- Efficient Workflow Replication: A key research focus is enabling agents to imitate complex tasks from minimal demonstrations, such as a single screen recording paired with verbal instructions. This remains an active area of exploration rather than a finalized product capability.
- Reducing Latency for Seamless Interaction: To maintain natural user experience, agents must execute actions within 100 to 1000 milliseconds. Current systems often exceed this due to computational overhead in vision processing and planning. Innovations like incremental frame analysis, OCR caching, and action batching are expected to address these challenges.
- Expanding OS-Level Functionality: Handling file dialogs, managing multiple windows, interacting with non-DOM interfaces, and adhering to system policies introduce new failure points absent in browser-only agents. Gemini’s current focus on browser optimization highlights the need for further OS-level refinement.
- Enhancing Safety Measures: Agents must guard against prompt injection attacks, unauthorized or hazardous actions, and data leaks. Safety protocols include allow/deny lists, user confirmations, domain blocking, typed action contracts, and consent mechanisms for irreversible operations.
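The safety measures in the last bullet can be combined into a single pre-execution gate. The following is a simplified sketch under assumed names (`DENIED_DOMAINS`, `IRREVERSIBLE`, `check_action` are all hypothetical): a domain deny-list blocks sensitive targets outright, and irreversible operations require explicit user consent.

```python
# Hypothetical safety gate: deny-listed domains are blocked outright;
# irreversible actions require an explicit user confirmation flag.
DENIED_DOMAINS = {"bank.example.com", "mail.example.com"}
IRREVERSIBLE = {"delete_file", "send_email", "submit_payment"}

def check_action(name: str, target_domain, confirmed_by_user: bool):
    """Return (allowed, reason) for a proposed action."""
    if target_domain in DENIED_DOMAINS:
        return False, f"domain {target_domain} is deny-listed"
    if name in IRREVERSIBLE and not confirmed_by_user:
        return False, f"{name} is irreversible and needs user confirmation"
    return True, "ok"
```

In practice such a gate would sit between the planner and the executor, so unsafe plans are rejected before any pixels are clicked.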
Guidelines for Developing Computer-Use Agents
- Begin with a browser-centric agent using a well-defined action schema and a validated testing framework such as Online-Mind2Web.
- Incorporate robust recovery strategies including explicit post-action checks, on-screen validation, and rollback capabilities for complex workflows.
- Approach performance metrics critically, favoring audited leaderboards and third-party evaluation over self-reported results. OSWorld’s execution-based evaluation ensures reproducibility and fairness.
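The recovery strategy in the second guideline — explicit post-action checks with rollback — can be sketched as a small wrapper. All names below are illustrative placeholders: `check` stands for any on-screen validation (for example, OCR-ing for a success indicator), and `rollback` for whatever undo logic the workflow supports.

```python
# Hypothetical recovery wrapper: run a step, verify it via an explicit
# post-action check, retry a bounded number of times, and roll back
# side effects if the step never lands.
def run_step_with_recovery(execute, check, rollback, retries: int = 2) -> bool:
    """Return True once the post-action check passes; otherwise undo
    partial changes via rollback() and report failure."""
    for _ in range(retries + 1):
        execute()
        if check():   # e.g. confirm a dialog closed or a file appeared
            return True
    rollback()        # undo partial changes before giving up
    return False
```

Bounding retries and pairing every step with its own check keeps a multi-step workflow from silently drifting after one misclick.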
Open-Source Contributions and Research Tools
Hugging Face’s Smol2Operator offers an open-source post-training pipeline that transforms compact VLMs into GUI-grounded operators. This resource is particularly valuable for research labs and startups focused on reproducible training methodologies rather than leaderboard dominance.
Essential Insights
- Computer-use agents leverage VLMs to interpret screen content and perform limited UI actions, enabling interaction with unmodified software. Leading implementations include Anthropic’s Computer Use, Google’s Gemini 2.5 Computer Use, and OpenAI’s Computer-Using Agent.
- The OSWorld benchmark evaluates 369 real-world desktop and web tasks with execution-based scoring, initially revealing a significant gap between human (72.36%) and AI (12.24%) performance.
- Anthropic’s Claude Sonnet 4.5 has narrowed this gap substantially, achieving 61.4% on OSWorld.
- Google’s Gemini 2.5 excels in web benchmarks but requires further development for comprehensive OS-level control.
- OpenAI’s Operator, powered by a Computer-Using Agent, remains in limited research preview, focusing on screenshot-based GUI interaction.
- Open-source initiatives like Hugging Face’s Smol2Operator promote standardized training and evaluation pipelines, fostering transparency and reproducibility in this emerging field.