OpenAI releases new simulated reasoning models with full access to tools

New o3 model appears "near-genius level," according to one doctor, but it still makes errors.

On Wednesday, OpenAI announced two new models, o3 and o4-mini, that combine simulated reasoning abilities with access to functions such as web browsing and coding. These are the first OpenAI reasoning-focused models that can use every ChatGPT tool simultaneously, including visual analysis and image generation.

OpenAI first announced o3 last December, but until now only the less capable variants "o3-mini" and "o3-mini-high" were available. The new models replace their predecessors, o1 and o3-mini.

Both o3 and o4-mini are available now to ChatGPT Plus, Pro, and Team users, with Enterprise and Edu customers gaining access next week. Free users can try o4-mini by selecting "Think" before submitting queries. OpenAI CEO Sam Altman tweeted, "we expect to release o3-pro to the pro tier in a few weeks."

Both models are also available to developers today through the Chat Completions API and the Responses API, though some organizations may need to complete verification to access them.
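
As a rough illustration of what that looks like in practice, here is a minimal sketch of a call to o3 through OpenAI's official Python SDK. The prompt is invented, and whether your account can use the "o3" and "o4-mini" model IDs depends on the verification status mentioned above, so treat this as a template to check against OpenAI's documentation rather than a guaranteed recipe.

    # Minimal sketch: querying o3 via the Chat Completions API.
    # Assumes the `openai` package is installed and OPENAI_API_KEY is set;
    # access to the "o3" model ID may require organization verification.
    from openai import OpenAI

    client = OpenAI()

    response = client.chat.completions.create(
        model="o3",  # or "o4-mini" for the faster, cheaper variant
        messages=[
            {"role": "user", "content": "Explain the tradeoffs between o3 and o4-mini."},
        ],
    )

    print(response.choices[0].message.content)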

The new models bring several improvements. According to OpenAI's website, "These are the smartest models we've released to date, representing a step change in ChatGPT's capabilities for everyone from curious users to advanced researchers." OpenAI says the models also offer better cost efficiency than their predecessors, and each has a different intended application: o3 targets complex analysis, while o4-mini, a smaller version of its next-generation SR model "o4" (not yet released), is optimized for speed and cost efficiency.

OpenAI claims o3 and o4-mini are multimodal, with the ability to "think with images." Credit: OpenAI

What sets these models apart from OpenAI's previous models (like GPT-4o or GPT-4.5) is that they use a simulated step-by-step "thinking" process to solve problems. The new models also dynamically decide when and how to deploy tools to solve multi-step problems. When asked about future energy use in California, for example, the models can autonomously search for utility data, write Python code to build forecasts, generate graphs visualizing the predictions, and explain the key factors behind them, all within a single query.

OpenAI also highlights the new models' multimodal ability to incorporate images directly into their simulated reasoning process, not just analyzing visual inputs but actively "thinking with" them. This allows the models to interpret whiteboards, textbook diagrams, and hand-drawn sketches, even when they are blurry or low quality.
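
As a concrete, hypothetical example of what that looks like from the developer side, the sketch below passes an image URL alongside a text prompt using the multipart message format OpenAI's API already uses for multimodal input. The URL is a placeholder, and o3 accepting image content through this exact endpoint is an assumption worth verifying against OpenAI's docs.

    # Minimal sketch: sending a whiteboard photo to o3 for analysis.
    # The image URL is a placeholder; replace it with a real, accessible image.
    from openai import OpenAI

    client = OpenAI()

    response = client.chat.completions.create(
        model="o3",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What process does this whiteboard diagram describe?"},
                    {"type": "image_url", "image_url": {"url": "https://example.com/whiteboard.jpg"}},
                ],
            }
        ],
    )

    print(response.choices[0].message.content)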

Despite all this, OpenAI continues its tradition of confusing product names that don't give users a clear picture of each model's relative abilities. For example, o3 is more powerful than o4-mini despite carrying a lower number. And then there's the potential for confusion with the company's non-reasoning AI models. As Ars Technica contributor Timothy B. Lee noted on X today, "It's an amazing branding decision to have a model called GPT-4o and another one called o4."

Benchmarks and vibes

Putting the naming aside, you may be wondering: How are the vibes? Ethan Mollick, a Wharton professor and frequent AI commentator, compared o3 favorably with Google's Gemini 2.5 Pro on Bluesky. And OpenAI President Greg Brockman made a bold claim during today's livestream announcement of o3 and o4-mini: "Each has its own quirks & you will likely prefer one to another, but there is a gap between them & other models."

Early user feedback appears to support this assertion, although until more third-party testing is conducted, it's wise to be skeptical of the claims. On X, immunologist Derya Unutmaz said o3 appeared to operate at a "near-genius level," writing, "It's generating complex, incredibly insightful, and grounded scientific hypotheses on demand! When I throw challenging clinical or medical questions at o3, its responses sound like they're coming directly from a top subspecialist physician."

OpenAI benchmark results for o3 and o4-mini SR models. Credit: OpenAI

So the vibes seem on target, but what about the numerical benchmarks? OpenAI reports that o3 makes "20 percent fewer major errors" than o1 on difficult real-world tasks, with particular strengths in business consulting, programming, and "creative ideation."

The company also reported state-of-the-art performance on several metrics. On the American Invitational Mathematics Examination (AIME) 2025, o4-mini achieved 92.7 percent accuracy. For programming tasks, o3 reached 69.1 percent accuracy on SWE-Bench Verified, a popular programming benchmark. The models also showed strong results on visual reasoning benchmarks, with o3 scoring 82.9 percent on MMMU (Massive Multi-discipline Multimodal Understanding), a college-level visual problem-solving test.

OpenAI benchmark results for o3 and o4-mini SR models. Credit: OpenAI

The benchmarks provided by OpenAI have not been independently verified. One early evaluation of a prerelease o3 model by independent AI research lab Transluce found that the model exhibited recurring confabulations, such as claiming to run code locally or providing hardware specifications, and the researchers hypothesized this may stem from the model lacking access to its own reasoning processes from previous conversational turns. "It seems that despite being incredibly powerful at solving math and coding tasks, o3 is not by default truthful about its capabilities," Transluce wrote in a tweet.

Some of OpenAI's self-reported evaluations also include footnotes worth considering. For the "Humanity's Last Exam" benchmark, which measures expert-level knowledge across subjects (o3 scored 20.32 with no tools, but 24.90 with browsing and tools enabled), OpenAI notes that browsing-enabled models could potentially find answers online. The company says it implemented domain blocks and monitoring to prevent what it calls "cheating" during evaluations.

Although early results appear promising, experts and academics who may rely on SR models for rigorous research should thoroughly verify whether the AI model produced an accurate result rather than assuming it is correct. And if you're working outside your own area of expertise, be especially cautious about accepting results without independent verification.

Pricing

For ChatGPT subscribers, access to o3 and o4-mini is included with the subscription. On the API side, OpenAI prices o3 at $10 per million input tokens and $40 per million output tokens, with a discounted rate of $2.50 per million tokens for cached inputs. This is a significant price reduction from o1's pricing of $15/$60 per million input/output tokens, and OpenAI says the new model delivers improved performance while cutting costs by 33 percent.

o4-mini is far more economical, costing $1.10 per million input tokens and $4.40 per million output tokens, with cached inputs priced at $0.275 per million tokens. That pricing is the same as its predecessor, o3-mini, suggesting OpenAI improved its smaller reasoning model without increasing costs.
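
To put those per-token rates in perspective, here is a small back-of-the-envelope calculation using the list prices quoted above; the 10,000-input/2,000-output request size is an arbitrary example, not a typical workload.

    # Rough cost comparison at the list prices quoted above (USD per 1M tokens).
    PRICES = {
        "o3": {"input": 10.00, "output": 40.00},
        "o4-mini": {"input": 1.10, "output": 4.40},
    }

    def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
        p = PRICES[model]
        return (input_tokens / 1_000_000) * p["input"] + (output_tokens / 1_000_000) * p["output"]

    for model in PRICES:
        cost = request_cost(model, input_tokens=10_000, output_tokens=2_000)
        print(f"{model}: ${cost:.4f}")  # o3: $0.1800, o4-mini: $0.0198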

Codex CLI

OpenAI also introduced an experimental terminal application called Codex CLI, described as "a lightweight coding agent you can run from your terminal." It's an open source tool that connects OpenAI's models to a user's computer and local code. Alongside the release, the company announced a $1 million grant program offering API credits for Codex CLI projects.

A screenshot of OpenAI's new Codex CLI tool in action, taken from GitHub. Credit: OpenAI

Codex CLI resembles Claude Code, an agent Anthropic launched in February alongside Claude 3.7 Sonnet. Both are terminal-based coding assistants that operate directly from a console and interact with local codebases, where they can search through code, edit files, run tests, and execute commands. Where Claude Code was Anthropic's entry into agentic coding tools, Codex CLI is OpenAI's first attempt to connect its models directly with users' computers and local code repositories.

Codex CLI represents another step toward OpenAI's goal of creating autonomous agents capable of executing complex, multi-step tasks for users. Let's just hope all the vibe coding it produces isn't used in high-stakes apps without detailed human oversight.

Benj Edwards is Ars Technica's Senior AI Reporter and founded the site's dedicated AI beat in 2022. He's also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.
