OpenAI’s new reasoning AI model hallucinates more

OpenAI’s recently launched o3 and o4-mini AI models are state-of-the-art in many respects. But the new models still hallucinate, or make things up; in fact, they hallucinate more than several of OpenAI’s older models.

Hallucinations have proven to be one of the most difficult problems to solve in AI, affecting even today’s best-performing systems. Historically, each new model has improved slightly in the hallucination department, hallucinating less than its predecessor. That doesn’t appear to be the case with o3 and o4-mini.

OpenAI’s own tests show that o3 and o4-mini, which are so-called “reasoning models,” hallucinate more than the company’s previous reasoning models, including o1, o1-mini, and o3-mini, as well as OpenAI’s traditional, “non-reasoning” models such as GPT-4o. The ChatGPT maker isn’t sure why it’s happening.

In its technical report, OpenAI says that “more research” is needed to understand why hallucinations get worse as reasoning models are scaled up. o3 and o4-mini perform better in certain areas, such as tasks related to math and coding. But because they “make more claims overall,” they are often led to make “more accurate claims as well as more inaccurate/hallucinated claims,” per the report.

OpenAI found that o3 hallucinated in response to 33% of questions on PersonQA, the company’s in-house benchmark for measuring the accuracy of a model’s knowledge about people. That’s roughly double the hallucination rate of OpenAI’s previous reasoning models, o1 and o3-mini, which scored 16% and 14.8% respectively. o4-mini did even worse, hallucinating 48% of the time.

Third-party testing by Transluce, an AI research lab, found that o3 tends to fabricate actions it took in order to arrive at answers. Transluce observed o3 claiming that it ran code on a 2021 MacBook Pro “outside of ChatGPT” and then copied the numbers into its answer. o3 can’t actually do this, despite having access to some tools.

In an email to TechCrunch, Neil Chowdhury, a Transluce researcher and former OpenAI employee, stated that “our hypothesis is that the type of reinforcement learning used for o-series models may amplify problems that are usually mitigated, but not fully erased, by standard post-training pipelines.”

Sarah Schwettmann of Transluce added that o3’s high hallucination rate may make the model less useful than it would otherwise be.

Kian Katanforoosh, a Stanford adjunct professor and CEO of Workera, told TechCrunch that his team has already been testing o3 and found it to be superior to the competition. However, Katanforoosh says o3 tends to hallucinate broken website links: the model will supply a link that, when clicked, doesn’t work.

While hallucinations may help models come up with interesting ideas and “think” creatively, they also make models a hard sell in markets where accuracy is paramount. A law firm, for example, would not be happy with a model that inserts factual errors into client contracts.

One way to improve the accuracy of models is to give them web search capabilities. OpenAI’s GPT-4o with web search achieves 90% accuracy on SimpleQA, an accuracy benchmark developed by OpenAI. Search could improve reasoning models’ hallucination rates as well, at least in cases where users are willing to expose their prompts to a third-party search provider.
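To make the idea concrete, here is a minimal sketch of what “adding web search” can look like from a developer’s perspective, assuming the OpenAI Python SDK’s Responses API and its hosted web-search tool; the model name, tool name, and prompt below are illustrative assumptions, not details from OpenAI’s report.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Enable the hosted web-search tool so the model can ground its answer in
# retrieved pages rather than relying solely on what it memorized in training.
# Model and tool names here are illustrative and may differ from the current API.
response = client.responses.create(
    model="gpt-4o",
    tools=[{"type": "web_search_preview"}],
    input="What accuracy does GPT-4o with web search report on SimpleQA?",
)

print(response.output_text)
```

The trade-off noted above still applies: a prompt sent this way is also exposed to whatever search provider backs the tool.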

The search for a solution will only become more urgent if scaling up reasoning models continues to worsen hallucinations.

OpenAI spokesperson Niko Felix told TechCrunch that “addressing the hallucinations in all our models is a continuing area of research, and we’re constantly working to improve their accuracy, reliability and consistency.”

The AI industry as a whole has shifted its focus to reasoning models in the last year, after techniques for improving conventional AI models began to show diminishing returns. Reasoning improves model performance on a variety of tasks without requiring massive amounts of data and computing during training. Yet it seems reasoning may also lead to more hallucinating, and that presents a challenge.

Maxwell Zeff is a senior reporter at TechCrunch specializing in AI and emerging technologies. Zeff has covered the rise of AI and the Silicon Valley Bank crisis for Gizmodo and MSNBC. He is based in San Francisco; when he is not reporting, you can find him hiking, biking, and exploring the Bay Area food scene.
