OpenAI has unveiled GPT-5.2, its latest flagship model, designed to enhance professional workflows and support long-running agent tasks. The release is available now in ChatGPT and through the OpenAI API.
The GPT-5.2 series comprises three distinct versions tailored to different user needs. Within ChatGPT, these are presented as ChatGPT-5.2 Instant, Thinking, and Pro. Correspondingly, the API offers gpt-5.2-chat-latest, gpt-5.2, and gpt-5.2-pro. The Instant variant focuses on everyday assistance and educational purposes, Thinking is optimized for intricate, multi-step projects and agent workflows, while Pro dedicates additional computational power to tackle demanding technical and analytical challenges.
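As a rough sketch of how the three API variants might be selected in practice (the tier labels below are illustrative, not part of the OpenAI API; only the model names come from the release):

```python
# Hypothetical helper mapping a workload tier to the GPT-5.2 API model names
# described above. The tier keys ("everyday", "agent", "frontier") are
# illustrative labels, not OpenAI API concepts.

MODEL_TIERS = {
    "everyday": "gpt-5.2-chat-latest",  # Instant: day-to-day assistance
    "agent": "gpt-5.2",                 # Thinking: multi-step, agentic work
    "frontier": "gpt-5.2-pro",          # Pro: extra compute for hard problems
}

def pick_model(workload: str) -> str:
    """Return the GPT-5.2 variant suited to a workload tier."""
    try:
        return MODEL_TIERS[workload]
    except KeyError:
        raise ValueError(f"unknown workload tier: {workload!r}")

# With the OpenAI Python SDK, a call could then look like (requires an API key):
#   from openai import OpenAI
#   client = OpenAI()
#   resp = client.responses.create(model=pick_model("agent"), input="...")
```

Routing by workload rather than hard-coding a model name makes it easy to promote a task from Instant to Thinking or Pro as its complexity grows.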
Performance Highlights: From Industry Benchmarks to Real-World Applications
GPT-5.2 Thinking serves as the primary engine for knowledge-intensive professional tasks. On GDPval, a comprehensive benchmark assessing well-defined knowledge tasks across 44 professions in nine major industries, the model matches or surpasses top-tier industry experts in 70.9% of comparisons. Remarkably, it delivers results more than 11 times faster and at less than 1% of the typical expert cost. For engineering teams, this translates into reliable generation of complex deliverables such as presentations, spreadsheets, project timelines, and diagrams from structured inputs.
In finance, GPT-5.2 Thinking demonstrates significant improvements on junior investment banking spreadsheet modeling tasks, with average accuracy rising from 59.1% with GPT-5.1 to 68.4%. The Pro variant further boosts this to 71.7%. These tasks include constructing three-statement financial models and leveraged buyout analyses under strict formatting and citation requirements, mirroring many enterprise-level workflows.
Software development capabilities have also advanced. GPT-5.2 Thinking achieves 55.6% on SWE-Bench Pro, which evaluates repository-level patch generation across multiple languages, and 80.0% on SWE-Bench Verified, a Python-only benchmark.
Extended Context and Advanced Agent Capabilities
Handling extensive context windows is a cornerstone of GPT-5.2's design. The Thinking model sets a new benchmark on OpenAI's MRCRv2 test, which embeds multiple identical "needle" queries within lengthy dialogue "haystacks" to assess retrieval accuracy. GPT-5.2 Thinking is the first model to achieve near-perfect accuracy on the 4-needle MRCR variant at context lengths up to 256,000 tokens.
For tasks exceeding this context limit, GPT-5.2 Thinking integrates with the Responses /compact endpoint, enabling context compression to effectively extend the usable window for tool-intensive, long-running workflows. This feature is particularly valuable for developers building agents that perform iterative tool calls over many steps while maintaining state beyond raw token constraints.
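The /compact endpoint is new, and its exact contract is not spelled out here; the request shape and endpoint path in the sketch below are assumptions based on the description above, not documented API, so check the official reference before relying on them:

```python
# Sketch of driving the Responses /compact endpoint described above.
# ASSUMPTIONS: the endpoint path and the payload fields below are guesses
# for illustration only; consult the API reference for the real contract.

import json

COMPACT_URL = "https://api.openai.com/v1/responses/compact"  # assumed path

def build_compact_request(previous_response_id: str,
                          model: str = "gpt-5.2") -> dict:
    """Assemble a (hypothetical) compaction request asking the API to
    compress the conversation state accumulated behind a response ID."""
    return {
        "model": model,
        "previous_response_id": previous_response_id,
    }

payload = build_compact_request("resp_abc123")
body = json.dumps(payload)  # would be POSTed to COMPACT_URL with auth headers
```

The point of the pattern: a long-running agent periodically compacts its prior turns into a condensed state, then continues the loop against the compacted response instead of the full raw history, keeping tool-heavy runs under the token ceiling.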
Regarding tool orchestration, GPT-5.2 Thinking scores 98.7% on Tau2-bench Telecom, a multi-turn customer support benchmark requiring seamless coordination of tool calls within realistic scenarios. For instance, in a complex travel assistance case involving flight delays, missed connections, lost luggage, and medical seating needs, GPT-5.2 manages rebooking, special assistance arrangements, and compensation processing in a coherent sequence, tasks where GPT-5.1 often left steps incomplete.
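A minimal sketch of the orchestration loop such benchmarks exercise: the model proposes tool calls turn by turn, a harness executes each one and feeds the result back until the task completes. The tool names and dispatch shape here are illustrative, not Tau2-bench's actual interface:

```python
# Minimal sketch of a multi-turn tool-orchestration loop. The travel tools
# below are hypothetical stand-ins for the scenario described above.

from typing import Callable

def rebook_flight(booking_id: str) -> str:
    """Hypothetical tool: rebook a disrupted itinerary."""
    return f"rebooked {booking_id} on next available flight"

def file_luggage_claim(booking_id: str) -> str:
    """Hypothetical tool: open a lost-luggage claim."""
    return f"luggage claim opened for {booking_id}"

TOOLS: dict[str, Callable[..., str]] = {
    "rebook_flight": rebook_flight,
    "file_luggage_claim": file_luggage_claim,
}

def dispatch(call: dict) -> str:
    """Execute one model-proposed tool call and return its result string."""
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])

# A coherent multi-step plan, as the model would emit it across turns:
plan = [
    {"name": "rebook_flight", "arguments": {"booking_id": "BK42"}},
    {"name": "file_luggage_claim", "arguments": {"booking_id": "BK42"}},
]
results = [dispatch(call) for call in plan]
```

What the benchmark rewards is exactly this coherence: every step in the plan is executed, in order, with state (here, the booking ID) carried correctly between calls.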
Enhanced Visual, Scientific, and Mathematical Reasoning
Visual comprehension has seen notable improvements. When paired with Python tools, GPT-5.2 Thinking reduces error rates by approximately 50% on benchmarks like CharXiv Reasoning and ScreenSpot Pro, which test chart interpretation and user interface understanding. The model exhibits superior spatial reasoning, such as more accurately identifying and bounding motherboard components compared to its predecessor.
In scientific domains, GPT-5.2 Pro attains a 93.2% score on the GPQA Diamond benchmark, with GPT-5.2 Thinking close behind at 92.4%. Additionally, GPT-5.2 Thinking solves 40.3% of problems across FrontierMath Tier 1 to Tier 3 when Python tools are enabled. These benchmarks encompass graduate-level challenges in physics, chemistry, biology, and advanced mathematics. Notably, GPT-5.2 Pro has already contributed to a verified proof in statistical learning theory, showcasing its potential in cutting-edge research.
Model Comparison Overview
| Model | Primary Use Case | Context Window / Max Output | Knowledge Cutoff | Key Benchmark Scores |
|---|---|---|---|---|
| GPT-5.1 | Flagship for coding and agentic tasks with adjustable reasoning depth | 400,000 tokens context, 128,000 max output | September 30, 2024 | SWE-Bench Pro: 50.8%, SWE-Bench Verified: 76.3%, ARC-AGI-1: 72.8%, ARC-AGI-2: 17.6% |
| GPT-5.2 Thinking | New standard for coding, agent workflows, and industry knowledge tasks | 400,000 tokens context, 128,000 max output | August 31, 2025 | GDPval: 70.9% wins/ties vs professionals, SWE-Bench Pro: 55.6%, SWE-Bench Verified: 80.0%, ARC-AGI-1: 86.2%, ARC-AGI-2: 52.9% |
| GPT-5.2 Pro | Enhanced compute for complex reasoning and scientific challenges | 400,000 tokens context, 128,000 max output | August 31, 2025 | GPQA Diamond: 93.2%, ARC-AGI-1: 90.5%, ARC-AGI-2: 54.2% |
Summary of Key Insights
- GPT-5.2 Thinking emerges as the primary model for professional tasks: It supersedes GPT-5.1 Thinking for coding, knowledge work, and agent-based applications, maintaining the same extensive context and output limits but delivering superior performance across multiple benchmarks including GDPval, SWE-Bench, ARC-AGI, and scientific question answering.
- Significant accuracy improvements at comparable scale: GPT-5.2 Thinking advances from 50.8% to 55.6% on SWE-Bench Pro, 76.3% to 80.0% on SWE-Bench Verified, and dramatically improves ARC-AGI scores from 72.8% to 86.2% (ARC-AGI-1) and 17.6% to 52.9% (ARC-AGI-2), all while preserving token capacity.
- GPT-5.2 Pro targets the most demanding reasoning and scientific tasks: This higher compute variant excels in complex problem-solving, achieving 93.2% on GPQA Diamond compared to 92.4% for GPT-5.2 Thinking and 88.1% for GPT-5.1 Thinking, alongside superior results on ARC-AGI benchmarks.
