OpenAI’s latest flagship models, o3 and o4-mini, are designed to mimic human reasoning: unlike their predecessors, which mainly focused on generating fluent text, they’re built to think things through step by step. OpenAI has boasted that o1 could match or surpass the performance of PhD students in chemistry and biology. But OpenAI’s own report reveals some sobering results for anyone inclined to take ChatGPT responses at face value.
In its own testing, OpenAI found that o3 hallucinated on a third of a benchmark test involving public figures (PersonQA), double the error rate of last year’s o1 model. The smaller, more compact o4-mini did even worse, hallucinating on 48% of similar tasks.
When tested on more general questions from the SimpleQA benchmark, hallucination rates jumped to 51% for o3 and 79% for o4-mini. That isn’t just a little noise in the system; that’s a full-blown identity crisis. You’d think a system marketed as a “reasoning” model would at least double-check its own logic before producing an answer, but that simply isn’t the case.
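To make those percentages concrete: a hallucination rate on a QA benchmark is essentially the share of questions where the model’s answer contradicts the ground truth. The sketch below is a hypothetical illustration of that kind of scoring, not OpenAI’s actual evaluation harness; the `ask_model` function and the sample questions are invented stand-ins.

```python
# Hypothetical sketch of scoring a model on a PersonQA-style benchmark.
# ask_model() is a placeholder for a real chat-model API call, and the
# dataset format here is invented purely for illustration.

def ask_model(question: str) -> str:
    """Stand-in for a call to a chat model; wire up a real API here."""
    raise NotImplementedError

def hallucination_rate(benchmark: list[dict]) -> float:
    """Return the fraction of questions whose answer does not
    contain the expected ground-truth string."""
    wrong = 0
    for item in benchmark:
        answer = ask_model(item["question"]).lower()
        if item["expected"].lower() not in answer:
            wrong += 1
    return wrong / len(benchmark)

# Example: short-answer questions about public figures.
sample = [
    {"question": "In what year was Ada Lovelace born?", "expected": "1815"},
    {"question": "Who wrote 'On the Origin of Species'?", "expected": "darwin"},
    {"question": "What instrument did Miles Davis play?", "expected": "trumpet"},
]

# With a real ask_model hooked up, a result of 0.33 would mean the model
# hallucinated on one question in three, like o3 on PersonQA.
# print(f"{hallucination_rate(sample):.0%}")
```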
One theory making the rounds in the AI community holds that the more reasoning a model tries to do, the more chances it has to go off the rails. Simpler models that stick to high-confidence predictions have fewer opportunities to wander off track, whereas reasoning models have to weigh multiple possible paths, connect disparate facts, and essentially improvise. And improvising around facts is also known as making things up.
OpenAI told The Times that the increase in hallucinations might not be because reasoning models are inherently worse; instead, they may simply be more verbose and adventurous in their answers. Because the new models aren’t just repeating predictable facts but speculating about possibilities, the line between theory and fabrication can get blurry for the AI. Unfortunately, some of those possibilities turn out to be entirely unmoored from reality.
Still, more hallucinations are the opposite of what OpenAI or rivals like Google and Anthropic want from their most advanced models. Calling AI chatbots assistants and copilots implies they’ll be helpful. Lawyers have already gotten into trouble for using ChatGPT and not noticing fictitious court citations; who knows how many smaller errors have caused problems in less high-stakes circumstances?
As AI systems roll out in classrooms, offices, hospitals, and government agencies, the chances of a user running into a hallucination keep multiplying. Sophisticated AI might help with job applications, billing disputes, or spreadsheet analysis, but the more sophisticated it gets, the less room there is for error.
You can’t claim to save people time and effort if they have to spend just as long double-checking everything you tell them. Not that these models aren’t impressive. o3 has demonstrated some astounding feats of coding and logic, and it outperforms many humans in some ways. The problem is that the moment it decides that Abraham Lincoln hosted a podcast or that water boils at 80°F, the illusion of reliability shatters.
Until those issues are resolved, you should take any response from an AI model with a heaping grain of salt. Sometimes ChatGPT is a bit like that annoying guy in too many meetings we’ve all attended: brimming with confidence in utter nonsense.
Eric Hal Schwartz is a freelance writer for TechRadar with more than 15 years of experience covering the intersection of technology and the world. For the last five years, he served as head writer for Voicebot.ai and was on the leading edge of reporting on large language models and generative AI. He has since become an expert on the products of generative AI, including OpenAI’s ChatGPT, Anthropic’s Claude, Google Gemini, and every other synthetic media tool. His experience spans print, digital, and broadcast media, as well as live events. Now, he’s continuing to tell the stories people want and need to hear about the rapidly evolving AI space and its impact on their lives. Eric is based in New York City.