At the end of 2024, industry insiders worried that progress toward more intelligent AI was slowing down. OpenAI's new o3 model, which has generated considerable excitement and debate, suggests that big improvements are still ahead in 2025.
The model, which has been announced for safety testing among researchers but not yet released publicly, posted a remarkable score on the ARC benchmark. Created by François Chollet, the renowned AI researcher behind the Keras framework, the benchmark measures a model's ability to handle novel, intelligent tasks, making it a meaningful gauge of progress toward truly intelligent AI systems.
o3 scored 75.7% on the ARC benchmark under standard compute conditions and 87.5% with high compute, a significant leap over previous state-of-the-art results, such as the 53% scored by Claude 3.5. Chollet, who had been a critic of large language models' (LLMs) ability to achieve this kind of intelligence, called o3's achievement a surprising advance. It highlights innovations that could accelerate progress toward superior intelligence, whether we call it artificial general intelligence (AGI) or not.
AGI may be a buzzword without a crisp definition, but it signals a clear goal: intelligence that can adapt to novel challenges or questions in ways that surpass human abilities.
OpenAI's o3 addresses specific hurdles in reasoning and adaptability that have long held back large language models. At the same time, it exposes new challenges, including the high costs and efficiency bottlenecks that come with pushing these systems to their limits. This article explores the five key innovations behind the o3 model, many of them underpinned by advances in reinforcement learning (RL), drawing on OpenAI's claims, Chollet's analysis and insights from other industry leaders to explain what this breakthrough means for AI in 2025.
The five core innovations in o3
1. “Program synthesis” for task adaptation
OpenAI's o3 model introduces a capability called “program synthesis,” which enables it to dynamically combine things that it learned during pre-training, such as specific patterns, algorithms or methods, into new configurations. These might include mathematical operations, code snippets or logical procedures that the model has encountered and generalized during its extensive training on diverse datasets.
Most significantly, program synthesis allows o3 to tackle tasks it has never directly encountered in training, such as solving advanced coding challenges or working through novel logic puzzles that require reasoning beyond rote application of learned information. François Chollet describes program synthesis as a system's ability to recombine known tools in innovative ways, much like a chef crafting a unique dish from familiar ingredients.
Chollet has advocated for this capability for months, arguing that it is essential for moving toward more general intelligence. It marks a departure from earlier models, which primarily retrieve and apply pre-learned information without reconfiguring it.
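The recombination idea can be made concrete with a toy sketch: search over compositions of a few "learned" primitives until one explains a set of input/output examples. Everything below, including the primitive names, is invented for illustration and does not reflect o3's actual internals.

```python
from itertools import product

# Toy primitives standing in for skills a model might acquire in pre-training.
PRIMITIVES = {
    "double": lambda x: x * 2,
    "increment": lambda x: x + 1,
    "square": lambda x: x * x,
    "negate": lambda x: -x,
}

def synthesize(examples, max_depth=3):
    """Search compositions of primitives that fit all input/output examples."""
    names = list(PRIMITIVES)
    for depth in range(1, max_depth + 1):
        for combo in product(names, repeat=depth):
            def run(x, combo=combo):
                for name in combo:
                    x = PRIMITIVES[name](x)
                return x
            if all(run(i) == o for i, o in examples):
                return combo  # the first composition that explains the data
    return None

# Recover a program for f(x) = 2x + 1 from examples alone.
program = synthesize([(1, 3), (2, 5), (5, 11)])
```

The point of the sketch is that none of the primitives alone solves the task; only a new combination of familiar pieces does, which is the essence of Chollet's "chef with familiar ingredients" analogy.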
2. Natural language program search
o3's adaptability rests on its use of chains of thought (CoTs) together with a sophisticated search process that takes place during inference, when the model is actively producing answers in a real-world or deployed setting. These CoTs are step-by-step natural language instructions the model generates to explore possible solutions. Guided by an evaluator, o3 actively generates multiple solution paths and scores them to determine the most promising option. This approach mirrors how humans solve problems, brainstorming different solutions before selecting the best one. In mathematical reasoning tasks, for example, o3 generates and evaluates alternative strategies to arrive at accurate answers. Competitors such as Anthropic and Google have experimented with similar approaches, but OpenAI's implementation is a step above them.
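The generate-then-evaluate loop can be sketched as a simple best-of-N selection. The chain sampler and evaluator below are stand-ins invented for this illustration, not OpenAI's actual components:

```python
# Minimal sketch of "generate then evaluate" search at inference time.

def sample_chains(problem, n):
    """Pretend model: propose n candidate reasoning chains for a sum."""
    a, b = problem
    # One sloppy answer, one correct answer, one off-task answer.
    proposals = [a + b - 1, a + b, a * b]
    return [{"steps": f"add {a} and {b}", "answer": proposals[i % 3]}
            for i in range(n)]

def evaluate(problem, chain):
    """Stand-in evaluator: score how trustworthy a chain's answer looks."""
    a, b = problem
    return 1.0 if chain["answer"] == a + b else 0.0

def best_of_n(problem, n=6):
    """Sample several chains, keep the one the evaluator scores highest."""
    chains = sample_chains(problem, n)
    return max(chains, key=lambda c: evaluate(problem, c))

best = best_of_n((17, 25))
```

In a real system the sampler would be the language model itself and the evaluator a trained scoring model; the structure of the loop, sample many paths and keep the best-scoring one, is the part this sketch illustrates.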
3. Evaluator model: A new type of reasoning
During inference, o3 actively generates multiple solution paths and evaluates each one with the help of an integrated evaluator model to determine the most promising option. By training the evaluator on expert-labeled datasets, OpenAI ensures that o3 develops a strong capacity to reason through complex, multi-step problems. This feature enables the model to act as a judge of its own reasoning, moving large language models closer to being able to “think” rather than simply respond.
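To give a flavor of what "training an evaluator on expert-labeled data" means, here is a toy perceptron-style scorer fitted to a handful of labeled reasoning steps. The features, labels and update rule are all invented for this sketch; OpenAI has not disclosed how its evaluator is built.

```python
# Toy reward-model training on expert-labeled reasoning steps.

def featurize(step):
    # Crude features: a bias term, "does the step show its work",
    # and "does the step verify its own result".
    return [1.0, float("because" in step), float("check:" in step)]

def train(examples, lr=0.5, epochs=20):
    """Perceptron-style fit: nudge weights toward the expert labels."""
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for step, label in examples:
            x = featurize(step)
            pred = 1.0 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0.0
            err = label - pred
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
    return w

# Hypothetical expert labels: 1.0 = sound reasoning step, 0.0 = unsupported.
LABELED = [
    ("x = 4 because 2 + 2 = 4, check: 4 - 2 = 2", 1.0),
    ("the answer is 7", 0.0),
    ("x = 9 because 3 * 3 = 9, check: 9 / 3 = 3", 1.0),
    ("probably 12 or so", 0.0),
]
weights = train(LABELED)

def score(step, w=weights):
    """Score an unseen reasoning step with the trained weights."""
    return sum(wi * xi for wi, xi in zip(w, featurize(step)))
```

The scalability concern discussed later in the article is visible even at this scale: the quality of the scorer is bounded by how much expert-labeled data you can afford to collect.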
4. Executing its own programs
One of o3's most groundbreaking features is its ability to execute its own CoTs as tools for adaptive problem-solving. Traditionally, CoTs have been used as step-by-step frameworks for solving specific problems. o3 extends this concept by leveraging CoTs as reusable building blocks, allowing the model to adapt more easily to new challenges. Over time, these CoTs become structured records of problem-solving strategies, akin to how humans document and refine their learning through experience. Nat McAleese of OpenAI says that o3's performance on unseen programming tasks, such as achieving a CodeForces rating above 2700, demonstrates its innovative use of CoTs. A 2700 rating places the model at “Grandmaster” level, the top echelon of competitive programmers worldwide.
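The "reusable building blocks" idea can be sketched as a small library of stored reasoning recipes that are replayed on structurally similar new problems. The storage scheme and matching-by-name rule here are assumptions made for illustration, not o3's design:

```python
# Illustrative "CoT library": record the recipe that solved one problem,
# then reapply it to new instances of the same kind of problem.

COT_LIBRARY = {}

def record_cot(task_kind, steps):
    """Keep a named, reusable record of a successful reasoning recipe."""
    COT_LIBRARY[task_kind] = steps

def apply_cot(task_kind, value):
    """Replay a stored recipe step by step on a new input."""
    for step in COT_LIBRARY[task_kind]:
        value = step(value)
    return value

# Work out "percent increase" once, then reuse the recipe on new numbers.
record_cot("percent_increase", [
    lambda p: (p[1] - p[0], p[0]),   # step 1: absolute change and base
    lambda p: p[0] / p[1],           # step 2: relative change
    lambda r: round(r * 100, 1),     # step 3: express as a percent
])

growth = apply_cot("percent_increase", (80, 100))
```

The analogy to the article's claim is the separation between solving a problem once and keeping a structured record of *how* it was solved, so the "how" can be reapplied later.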
5. Deep learning-guided search for programs
o3 employs a deep learning-driven approach to evaluate and refine possible solutions to complex problems. The process involves generating multiple solution paths and assessing their viability using patterns learned during training. However, François Chollet and other experts have noted that this reliance on “indirect evaluations,” where solutions are judged by internal metrics rather than tested in real-world situations, can limit the model's robustness when applied to unpredictable contexts or enterprise-specific scenarios.
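A guided search of this kind can be sketched as a beam search over partial solutions, where a stand-in value function plays the role of the learned scorer. Note the "indirect evaluation" caveat baked into the sketch: the scorer judges closeness by an internal proxy metric and never checks candidates against the real task until the very end. The moves, target and value function are all invented for this illustration.

```python
import heapq

TARGET = 24
MOVES = [("+3", lambda x: x + 3), ("*2", lambda x: x * 2)]

def value(x):
    # Internal proxy metric: closeness to the target, standing in for a
    # learned value network. It guides the search but is not a real check.
    return -abs(TARGET - x)

def guided_search(start, beam_width=2, depth=5):
    """Expand partial paths, keep only the highest-valued ones (the beam)."""
    beam = [(value(start), start, [])]
    for _ in range(depth):
        expansions = []
        for _, x, path in beam:
            for name, fn in MOVES:
                nx = fn(x)
                expansions.append((value(nx), nx, path + [name]))
        beam = heapq.nlargest(beam_width, expansions, key=lambda t: t[0])
        for _, x, path in beam:
            if x == TARGET:   # only here is the answer actually verified
                return path
    return None

path = guided_search(3)
```

Pruning by the proxy metric is what makes the search tractable, and it is also exactly where Chollet's robustness concern enters: a path the proxy undervalues is discarded without ever being truly tested.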
In addition, o3's reliance on expert-labeled datasets for training its evaluator model raises concerns about scalability. While such datasets enhance precision, they also demand significant human oversight, which can constrain the system's flexibility and cost-efficiency. Chollet argues that these trade-offs illustrate the challenges of scaling reasoning systems beyond controlled benchmarks such as ARC-AGI.
This approach demonstrates both the potential and the limitations of integrating deep learning techniques with programmatic problem-solving. While o3's innovations showcase progress, they also underscore the complexity of building truly generalizable AI systems.
The biggest challenge to the o3 model

OpenAI's o3 achieves impressive results, but at a significant computational cost: it consumes millions of tokens per task, and this costly approach is the model's greatest challenge. François Chollet, Nat McAleese and others have raised concerns about the economic viability of such models, emphasizing the need for innovations that balance performance with affordability.
o3 has also attracted attention from across the AI community. Competing models, such as Google's Gemini 2 and DeepSeek 3 from the Chinese firm DeepSeek, are advancing as well, making direct comparisons difficult until these models have been tested more widely.
Opinions about o3 are divided. Some praise its technical advances, while others point to its high costs and lack of transparency, suggesting that its real value will only become clear with broader testing. One of the most vocal critics was Denny Zhou of Google DeepMind, who implicitly took aim at the model's reliance on reinforcement learning (RL) scaling and search mechanisms, arguing that they could be a “dead end” and that a model could instead learn to reason through simpler fine-tuning processes.
What this means for enterprise AI
o3's newfound adaptability shows that AI, in one form or another, will continue to transform industries, from customer service to scientific research.
Industry participants will need time to digest what o3 delivers. For enterprises worried about o3's steep computational costs, OpenAI's forthcoming release of a scaled-down version, o3-mini, could offer a viable alternative. While it sacrifices some of the full model's capabilities, o3-mini retains much of the core innovation while significantly reducing test-time compute requirements, giving businesses a more affordable way to experiment.
Enterprise companies may have to wait a while before they can get their hands on the o3 model. OpenAI expects o3-mini to be available by the end of January, with the full release of o3 to follow, although the timelines will depend on feedback and insights from the current safety-testing phase. Enterprises would do well to test the model against their own data and use cases to see how it performs. In the meantime, they can rely on models that are already tested and proven, such as OpenAI's flagship GPT-4o and other competing models.
In fact, we will be operating in two gears next year. The first is realizing practical value from AI applications and fleshing out what current models can do with AI agents and other innovations. The second is sitting back and watching the intelligence race unfold, where any progress made will be icing on the cake.
To learn more about o3’s innovations and to keep up with the latest AI developments, follow VentureBeat.