Introducing Terminal-Bench 2.0 and Harbor: Advancing Autonomous AI Agent Evaluation
The team behind Terminal-Bench, a benchmark suite that measures how effectively autonomous AI agents complete real-world, terminal-based tasks, has unveiled two major releases: Terminal-Bench 2.0 and Harbor. Together, these tools provide a robust framework for testing, refining, and scaling AI agents within containerized cloud environments.
Raising the Standard: Terminal-Bench 2.0’s Enhanced Benchmark Suite
Terminal-Bench 1.0 quickly became a go-to benchmark for assessing AI agents that operate autonomously in developer-style command-line environments. These agents simulate the workflows of software developers by interacting directly with terminal interfaces rather than graphical user interfaces. However, the initial version faced criticism for inconsistencies and unstable task definitions, often due to dependencies on fluctuating external services.
Addressing these challenges, Terminal-Bench 2.0 introduces a refined set of 89 rigorously validated tasks. Each task underwent extensive manual review combined with large language model (LLM)-assisted verification to ensure clarity, realism, and solvability. This overhaul not only elevates the difficulty level but also significantly improves the reliability and reproducibility of results. For instance, the previously included download-youtube task was removed or redesigned to eliminate reliance on unstable third-party APIs, enhancing overall benchmark stability.
Despite the increased complexity, early observations reveal that state-of-the-art (SOTA) performance metrics remain comparable to those from version 1.0. This suggests that the improved task quality and clearer specifications in Terminal-Bench 2.0 provide a more accurate reflection of agent capabilities rather than artificially inflating scores.
Harbor: Scalable and Unified Agent Evaluation Infrastructure
Complementing the benchmark update, Harbor emerges as a versatile runtime framework tailored for deploying and assessing AI agents at scale within cloud container environments. Harbor supports integration with leading cloud providers such as Daytona and Modal, enabling researchers and developers to conduct large-scale evaluations seamlessly.
Key features of Harbor include:
- Compatibility with any agent that can be installed in a containerized environment
- Support for scalable supervised fine-tuning (SFT) and reinforcement learning (RL) workflows
- Tools for creating and deploying custom benchmarks
- Full interoperability with Terminal-Bench 2.0 for streamlined benchmarking
During the development of Terminal-Bench 2.0, Harbor facilitated tens of thousands of evaluation rollouts, demonstrating its capacity for high-throughput testing. The framework is now publicly accessible, complete with documentation to guide users through agent testing and leaderboard submissions.
Benchmarking Breakthroughs: GPT-5 Dominates Early Leaderboard
Preliminary leaderboard results from Terminal-Bench 2.0 highlight OpenAI’s Codex CLI, powered by GPT-5, as the frontrunner with a 49.6% task success rate, the highest recorded so far. Other top contenders include various GPT-5 derivatives and agents based on Anthropic’s Claude Sonnet 4.5 model.
Top Five Agents on Terminal-Bench 2.0:
- Codex CLI (GPT-5) – 49.6%
- Codex CLI (GPT-5-Codex) – 44.3%
- OpenHands (GPT-5) – 43.8%
- Terminus 2 (GPT-5-Codex) – 43.4%
- Terminus 2 (Claude Sonnet 4.5) – 42.8%
The close clustering of scores among these leading agents underscores a highly competitive landscape. No single model can yet solve the majority of tasks, leaving ample room for innovation and improvement.
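As a rough sanity check, the percentages above can be converted into approximate task counts, assuming each leaderboard score is a mean pass rate over the suite's 89 tasks (averaging across runs is why the counts need not be whole numbers):

```python
# Rough arithmetic: translate leaderboard pass rates into an
# approximate number of solved tasks, assuming each score is a
# mean pass rate over the 89 tasks in Terminal-Bench 2.0.
TOTAL_TASKS = 89

leaderboard = {
    "Codex CLI (GPT-5)": 0.496,
    "Codex CLI (GPT-5-Codex)": 0.443,
    "OpenHands (GPT-5)": 0.438,
    "Terminus 2 (GPT-5-Codex)": 0.434,
    "Terminus 2 (Claude Sonnet 4.5)": 0.428,
}

for agent, rate in leaderboard.items():
    # e.g. the top entry works out to roughly 44 of 89 tasks per run
    print(f"{agent}: ~{rate * TOTAL_TASKS:.1f} of {TOTAL_TASKS} tasks")
```

Even the frontrunner resolves fewer than half the tasks, which is consistent with the article's point that the benchmark remains far from saturated.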
Getting Started: How to Submit and Evaluate Agents
Users interested in benchmarking their AI agents can install Harbor and run Terminal-Bench 2.0 from the command line. To qualify for the leaderboard, agents must complete five benchmark runs, with results and job directories submitted for verification:
harbor run -d [email protected] -m "<model>" -a "<agent>" --n-attempts 5 --jobs-dir <path/to/output>
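Turning the five required runs into a single headline number is a simple average of per-run pass rates. The sketch below illustrates that aggregation; the per-run result structure and the numbers are illustrative assumptions, not Harbor's actual output schema:

```python
# Hypothetical aggregation of five benchmark runs into one mean
# success rate. The per-run dict shape is an illustrative
# assumption; consult Harbor's documentation for the real format.
def mean_success_rate(runs: list[dict]) -> float:
    """Average the fraction of resolved tasks across runs."""
    per_run = [run["resolved"] / run["total"] for run in runs]
    return sum(per_run) / len(per_run)

# Five runs over the 89-task suite (resolved counts are made up).
runs = [{"resolved": n, "total": 89} for n in (44, 45, 43, 44, 46)]
print(f"Mean success rate: {mean_success_rate(runs):.1%}")
```

Averaging over multiple runs smooths out the run-to-run variance inherent in agentic rollouts, which is presumably why the leaderboard requires five attempts rather than one.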
Terminal-Bench 2.0 is rapidly being adopted in research focused on agentic reasoning, automated code generation, and tool utilization. A detailed preprint outlining the benchmark’s design principles and validation methodology is forthcoming, promising deeper insights into its development.
Towards a Unified Framework for AI Agent Evaluation
The simultaneous launch of Terminal-Bench 2.0 and Harbor represents a significant stride toward establishing standardized, scalable evaluation protocols for autonomous AI agents. As large language model (LLM) agents become increasingly prevalent in software development and operational contexts, the demand for reproducible, controlled testing environments intensifies.
By providing a cohesive ecosystem that supports model benchmarking, environment simulation, and iterative improvement, these tools lay the groundwork for a unified evaluation infrastructure that can accelerate progress across the AI community.