Introducing Terminal-Bench 2.0 and Harbor: Advancing Autonomous AI Agent Evaluation
The team behind Terminal-Bench, a benchmark suite that measures how effectively autonomous AI agents complete real-world, terminal-based tasks, has unveiled two major releases: Terminal-Bench 2.0 and Harbor. Together, these tools provide a robust framework for testing, refining, and scaling AI agents within containerized cloud environments.
Raising the Standard: Terminal-Bench 2.0’s Enhanced Benchmark Suite
Terminal-Bench 1.0 quickly became a go-to benchmark for assessing AI agents that operate autonomously in developer-style command-line environments. These agents simulate the workflows of software developers by interacting directly with terminal interfaces rather than graphical user interfaces. However, the initial version faced criticism for inconsistencies and unstable task definitions, often due to dependencies on fluctuating external services.
Addressing these challenges, Terminal-Bench 2.0 introduces a refined set of 89 rigorously validated tasks. Each task underwent extensive manual review combined with large language model (LLM)-assisted verification to ensure clarity, realism, and solvability. This overhaul not only elevates the difficulty level but also significantly improves the reliability and reproducibility of results. For instance, the previously included download-youtube task was removed or redesigned to eliminate reliance on unstable third-party APIs, enhancing overall benchmark stability.
Despite the increased complexity, early observations reveal that state-of-the-art (SOTA) performance metrics remain comparable to those from version 1.0. This suggests that the improved task quality and clearer specifications in Terminal-Bench 2.0 provide a more accurate reflection of agent capabilities rather than artificially inflating scores.
Harbor: Scalable and Unified Agent Evaluation Infrastructure
Complementing the benchmark update, Harbor emerges as a versatile runtime framework tailored for deploying and assessing AI agents at scale within cloud container environments. Harbor supports integration with leading cloud providers such as Daytona and Modal, enabling researchers and developers to conduct large-scale evaluations seamlessly.
Key features of Harbor include:
- Compatibility with any agent that can be installed in a containerized environment
- Support for scalable supervised fine-tuning (SFT) and reinforcement learning (RL) workflows
- Tools for creating and deploying custom benchmarks
- Full interoperability with Terminal-Bench 2.0 for streamlined benchmarking
During the development of Terminal-Bench 2.0, Harbor facilitated tens of thousands of evaluation rollouts, demonstrating its capacity for high-throughput testing. The framework is now publicly accessible, complete with documentation to guide users through agent testing and leaderboard submissions.
Benchmarking Breakthroughs: GPT-5 Dominates Early Leaderboard
Preliminary leaderboard results from Terminal-Bench 2.0 highlight OpenAI’s Codex CLI, powered by GPT-5, as the frontrunner with a 49.6% task success rate, the highest recorded so far. Other top contenders include various GPT-5 derivatives and agents based on Anthropic’s Claude Sonnet 4.5 model.
Top Five Agents on Terminal-Bench 2.0:
- Codex CLI (GPT-5) – 49.6%
- Codex CLI (GPT-5-Codex) – 44.3%
- OpenHands (GPT-5) – 43.8%
- Terminus 2 (GPT-5-Codex) – 43.4%
- Terminus 2 (Claude Sonnet 4.5) – 42.8%
The close clustering of scores among these leading agents underscores a highly competitive landscape. No single model can yet solve the majority of tasks, leaving ample room for innovation and improvement.
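As a rough sanity check, the percentages above can be converted into approximate task counts, assuming each leaderboard score is a mean pass rate over the suite's 89 tasks (averaging across runs is why the counts need not be whole numbers):

```python
# Rough arithmetic: translate leaderboard pass rates into an
# approximate number of solved tasks, assuming each score is a
# mean pass rate over the 89 tasks in Terminal-Bench 2.0.
TOTAL_TASKS = 89

leaderboard = {
    "Codex CLI (GPT-5)": 0.496,
    "Codex CLI (GPT-5-Codex)": 0.443,
    "OpenHands (GPT-5)": 0.438,
    "Terminus 2 (GPT-5-Codex)": 0.434,
    "Terminus 2 (Claude Sonnet 4.5)": 0.428,
}

for agent, rate in leaderboard.items():
    # e.g. the top entry works out to roughly 44 of 89 tasks per run
    print(f"{agent}: ~{rate * TOTAL_TASKS:.1f} of {TOTAL_TASKS} tasks")
```

Even the frontrunner resolves fewer than half the tasks, which is consistent with the article's point that the benchmark remains far from saturated.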
Getting Started: How to Submit and Evaluate Agents
Users interested in benchmarking their AI agents can install Harbor and run Terminal-Bench 2.0 from the command line. To qualify for the leaderboard, agents must complete five benchmark runs, with results and job directories submitted for verification:
harbor run -d [email protected] -m "<model>" -a "<agent>" --n-attempts 5 --jobs-dir <path/to/output>
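Turning the five required runs into a single headline number is a simple average of per-run pass rates. The sketch below illustrates that aggregation; the per-run result structure and the numbers are illustrative assumptions, not Harbor's actual output schema:

```python
# Hypothetical aggregation of five benchmark runs into one mean
# success rate. The per-run dict shape is an illustrative
# assumption; consult Harbor's documentation for the real format.
def mean_success_rate(runs: list[dict]) -> float:
    """Average the fraction of resolved tasks across runs."""
    per_run = [run["resolved"] / run["total"] for run in runs]
    return sum(per_run) / len(per_run)

# Five runs over the 89-task suite (resolved counts are made up).
runs = [{"resolved": n, "total": 89} for n in (44, 45, 43, 44, 46)]
print(f"Mean success rate: {mean_success_rate(runs):.1%}")
```

Averaging over multiple runs smooths out the run-to-run variance inherent in agentic rollouts, which is presumably why the leaderboard requires five attempts rather than one.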
Terminal-Bench 2.0 is rapidly being adopted in research focused on agentic reasoning, automated code generation, and tool utilization. A detailed preprint outlining the benchmark’s design principles and validation methodology is forthcoming, promising deeper insights into its development.
Towards a Unified Framework for AI Agent Evaluation
The simultaneous launch of Terminal-Bench 2.0 and Harbor represents a significant stride toward establishing standardized, scalable evaluation protocols for autonomous AI agents. As large language model (LLM) agents become increasingly prevalent in software development and operational contexts, the demand for reproducible, controlled testing environments intensifies.
By providing a cohesive ecosystem that supports model benchmarking, environment simulation, and iterative improvement, these tools lay the groundwork for a unified evaluation infrastructure that can accelerate progress across the AI community.