Anthropic has unveiled Claude Sonnet 4.5, setting a new standard in comprehensive software engineering and practical computer interaction. This release introduces tangible enhancements such as Claude Code checkpoints, a native Visual Studio Code extension, advanced API memory and context management tools, and an Agent SDK that mirrors the internal frameworks Anthropic employs. Pricing remains consistent with Sonnet 4, at $3 per million input tokens and $15 per million output tokens.
Key Innovations and Performance Highlights
- Breakthrough in Software Engineering Benchmarks. Claude Sonnet 4.5 achieves an impressive 77.2% accuracy on the 500-problem SWE-bench Verified dataset, utilizing a straightforward two-tool setup (bash and file editing). This result is averaged over 10 runs without additional test-time computation, operating within a 200,000-token reasoning budget. Expanding the context window to 1 million tokens boosts accuracy to 78.2%, while employing parallel sampling and rejection techniques further elevates performance to 82.0%.
- State-of-the-Art in Computer Interaction. On the OSWorld-Verified benchmark, Sonnet 4.5 leads with a 61.4% success rate, a significant improvement over Sonnet 4’s 42.2%. This leap reflects enhanced capabilities in tool manipulation and user interface control across browser and desktop environments.
- Extended Autonomous Operation. The model demonstrates sustained focus exceeding 30 hours on complex, multi-step coding assignments, marking a substantial advancement in agent reliability and endurance compared to previous iterations.
- Enhanced Reasoning and Mathematical Skills. The update delivers notable improvements across standard reasoning and math evaluations, including configurations akin to the AIME contest. Additionally, the safety framework is elevated to ASL-3, incorporating robust defenses against prompt-injection attacks.
Advancements for Intelligent Agents
Sonnet 4.5 addresses critical challenges faced by autonomous agents, particularly in long-term planning, memory retention, and dependable tool coordination. Anthropic’s Claude Agent SDK offers developers access to the same sophisticated infrastructure used internally, including memory management for prolonged tasks, permission controls, and coordination among sub-agents. This enables teams to replicate the multi-hour task coherence and reversibility found in Claude Code, now enhanced with checkpointing, an updated terminal interface, and seamless VS Code integration.
The model’s 19-point improvement on OSWorld-Verified benchmarks underscores its superior ability to perform computer-based tasks such as navigating interfaces, populating spreadsheets, and completing web workflows, as demonstrated in Anthropic’s browser-based demos. For enterprises exploring agent-driven robotic process automation (RPA), higher OSWorld scores typically translate to reduced manual interventions during task execution.
Deployment and Accessibility
- Anthropic API and Applications. Accessible under the model ID
claude-sonnet-4-5, maintaining pricing parity with Sonnet 4. Paid users can now create files and execute code directly within Claude applications. - AWS Bedrock Integration. Available through AWS Bedrock, featuring integration with AgentCore. AWS emphasizes capabilities such as extended agent sessions, advanced memory and context handling, and operational controls including observability and session isolation.
- Google Cloud Vertex AI General Availability. Fully launched on Vertex AI, supporting multi-agent orchestration via the Anthropic Developer Kit (ADK) and Agent Engine. Features include provisioned throughput, analysis jobs handling up to 1 million tokens, and prompt caching for efficiency.
- GitHub Copilot Public Preview. Rolling out across Copilot Chat on VS Code, web, and mobile platforms, as well as Copilot CLI. Organizations can enable access through policy management, with support for bring-your-own API keys in VS Code.
Conclusion: A New Era for Coding and Autonomous Agents
With a verified 77.2% accuracy on SWE-bench under transparent evaluation conditions, a commanding 61.4% lead on OSWorld-Verified for computer-use tasks, and practical enhancements such as checkpointing, an Agent SDK, and broad platform availability, Claude Sonnet 4.5 is engineered for demanding, tool-intensive agent workloads rather than brief demonstration prompts. While independent validation will further clarify its standing as the premier coding model, its focus on autonomy, robust scaffolding, and precise computer control directly addresses the pressing challenges faced by developers and enterprises today.
