OpenAI’s Deep Research is better than you at fact-finding, but it is still wrong about half the time

The latest in generative artificial intelligence includes AI agents that can access the web to find answers to questions. While promising, agentic technology is very much a work in progress.

In a paper published last week, OpenAI researchers describe how the company’s Deep Research technology, which was designed to use the web, performs far better than OpenAI’s other models at answering web-based questions. It also performs better than humans on tasks that require hours of searching.

Also: What are AI agents? How to access a personalized team of assistants

However, Deep Research still stumbles about half the time. OpenAI’s latest test shows that Deep Research is more persistent and determined than human researchers in some cases, but it still frequently fails to find an answer.

Called BrowseComp by its authors, the test is described as “a simple yet challenging benchmark for measuring the ability of agents to browse the web.”

The test was built on the premise that AI agents – meaning AI models that can browse “thousands of web pages” – could be more resourceful than human beings, who have limited memories, get tired surfing the web, “can only attend to one thing at a time and cannot be parallelized,” and cannot direct their brains to operate on data simultaneously.

“Machine intelligence, on the other hand, has much more extensive recall and can operate tirelessly without getting distracted,” write lead author Jason Wei and his team.

Also: OpenAI’s Deep Research is a powerful tool that can save you countless hours of work, and it’s now much cheaper to access

Wei and the team based their work on OpenAI’s “SimpleQA” benchmark, which tests AI models’ ability to answer “short, fact-seeking questions.” The questions covered TV and movie trivia, science, music, history, video games, political topics, and more. The BrowseComp set of 1,266 questions is designed to go beyond information retrieval. They are, rather, questions that are difficult to answer: in the authors’ words, “challenging because they require searching through a large space of potential answers and matching them to constraints posed in the question,” targeting “hard-to-find, deeply entangled information on the web.” One example question from the benchmark:

Identify the title of a research paper published before June 2023 that mentions cultural practices, scientific processes, and culinary innovations. It was co-authored by two individuals, one of whom is an assistant professor in West Bengal and the other of whom holds a Ph.D.
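
To make the format concrete, here is a minimal sketch in Python of what a BrowseComp-style benchmark item and a naive grading check might look like. The class name, fields, and exact-match grading are illustrative assumptions, not the paper’s actual data format or grading procedure.

```python
from dataclasses import dataclass


@dataclass
class BrowseCompItem:
    """One hypothetical benchmark record: a multi-constraint question
    whose short reference answer is hard to find but easy to verify."""
    question: str
    reference_answer: str


def _normalize(text: str) -> str:
    """Lower-case and collapse whitespace so trivial differences don't count."""
    return " ".join(text.lower().split())


def is_correct(item: BrowseCompItem, model_answer: str) -> bool:
    """Naive grading: exact match after normalization. The real benchmark
    may grade answers more carefully; this only shows why short reference
    answers keep verification simple."""
    return _normalize(model_answer) == _normalize(item.reference_answer)


# Usage with a made-up item (not from the actual dataset):
item = BrowseCompItem(
    question=("Identify the title of a paper published before June 2023 that "
              "mentions cultural practices, scientific processes, and culinary "
              "innovations, co-authored by an assistant professor in West Bengal."),
    reference_answer="Example Paper Title",
)
print(is_correct(item, "example paper title"))  # True
```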

The questions and answers were developed by human “trainers,” and they were selected as being impossible to solve with just OpenAI’s ChatGPT, with or without browsing abilities. The questions were also impossible for an “early version” of Deep Research.

Demonstrating just how weak humans are at searching the web, the researchers first asked people who were “familiar with the dataset” to answer the questions.

The results were not good for the humans. For 70% of the questions, humans gave up after two hours of effort. They only answered about 30% of the questions, and for 14% of their proposed answers, the humans’ suggestions did not match the actual answer.

Wei and team hypothesize that humans with higher searching skills could do better: “It is possible that many of the problems that they gave up on would be solvable by experienced professionals (e.g., detectives or investigative journalists) with ample time.”

After the humans, they tested Deep Research against OpenAI’s GPT-4o (with and without browsing abilities), GPT-4.5, and the o1 model.

The results were abysmal. “GPT-4o and GPT-4.5 achieved near-zero accuracy, highlighting the difficulty of the benchmark,” they write. “Without strong reasoning or tool use, models fail to retrieve the kinds of obscure, multi-hop facts BrowseComp targets.”

The o1 model fared better, which “[suggests] that some BrowseComp answers can be surfaced through inference over internal knowledge.”

Also: AI unleashes more advanced scams. Here’s what to look out for (and how to stay protected)

With a score of 51.5%, Deep Research was “significantly better,” and “it is particularly effective at answering the niche, non-intuitive questions that require browsing numerous websites,” Wei and team write.

However, they also found that GPT-4o using browsing and Deep Research could err by being “overconfident” about wrong answers, which is known as a calibration error.

“Models with browsing capabilities such as GPT-4o with browsing and Deep Research exhibit higher calibration error,” they write, “suggesting that access to web tools may increase the model’s confidence in incorrect answers. This aligns with observations that Deep Research struggles with confidence calibration and often fails to convey uncertainty accurately at present.”
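
In rough terms, calibration error is the gap between how confident a model says it is and how often it is actually right. Here is a minimal sketch of one common way to measure it, a binned expected calibration error, assuming each answer comes with a self-reported confidence score; the bin count and the toy numbers are illustrative, not the paper’s exact procedure.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: group answers by stated confidence, then compare each
    bin's average confidence with its empirical accuracy.
    `confidences` are floats in [0, 1]; `correct` are 0/1 outcomes."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # A confidence of exactly 0 falls into the first bin.
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(avg_conf - accuracy)
    return ece


# Toy example of an "overconfident" model: high stated confidence, mixed accuracy.
confs = [0.95, 0.90, 0.92, 0.88, 0.97]
hits = [1, 0, 0, 1, 0]
print(round(expected_calibration_error(confs, hits), 3))  # a large gap, ~0.52
```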

To correct for calibration error, they did another test with Deep Research, in which the model had to output as many as 64 answers to each question. Then, they had the model pick the best of them. When it did so, Deep Research was pretty good at choosing the right answer among all the proposals.
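
The article does not detail the selection mechanism, so the following is only a rough sketch of a best-of-N strategy, under the assumption that each attempt returns a candidate answer plus a self-reported confidence; `run_deep_research_attempt` is a hypothetical stand-in, not a real OpenAI API call.

```python
import random


def run_deep_research_attempt(question: str, seed: int) -> tuple[str, float]:
    """Hypothetical stand-in for one browsing attempt: returns a candidate
    answer plus the model's self-reported confidence. Simulated here with
    random choices purely for illustration."""
    rng = random.Random(seed)
    candidates = ["Answer A", "Answer B", "Answer C"]
    return rng.choice(candidates), rng.uniform(0.2, 1.0)


def best_of_n(question: str, n: int = 64) -> str:
    """Sample n independent attempts, then keep the candidate the model was
    most confident about. (Aggregating by frequency, i.e. majority vote,
    is another common variant.)"""
    attempts = [run_deep_research_attempt(question, seed) for seed in range(n)]
    answer, _confidence = max(attempts, key=lambda pair: pair[1])
    return answer


print(best_of_n("Which obscure paper matches all the constraints?", n=64))
```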

That, write Wei and team, suggests that “the model frequently ‘knows’ when it’s right, even if it struggles to express that certainty as a calibrated probability.”

Also: Google’s latest chip is all about reducing one huge hidden cost in AI

They note, too, that Deep Research’s success improves as more computing power is applied when it searches the web. Put differently, “performance scales smoothly as a function of the amount of test-time compute used.” That squares with an increasing trend of throwing more GPU chips at the task of inference.

Wei and team don’t directly offer a hypothesis about why Deep Research fails almost half the time, but the implicit answer is in the scaling of its ability with more compute. When they run more parallel attempts and ask the model to evaluate multiple answers, accuracy climbs past 75% of the questions answered.
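
A back-of-the-envelope calculation shows why accuracy should climb with more parallel attempts: if each independent attempt finds the right answer with probability p, and an idealized selector keeps a correct attempt whenever one exists, the success ceiling is 1 - (1 - p)^n. Real selection is imperfect, which is why the observed gains are smaller, but the sketch below illustrates the trend with assumed numbers.

```python
def success_ceiling(p_single: float, n_attempts: int) -> float:
    """Probability that at least one of n independent attempts succeeds,
    assuming an idealized selector that keeps a correct attempt whenever
    one exists. This is an upper bound, not Deep Research's real curve."""
    return 1 - (1 - p_single) ** n_attempts


# Illustrative numbers only: assume a single attempt succeeds 30% of the time.
for n in (1, 2, 4, 8, 16, 32, 64):
    print(f"{n:2d} attempts -> {success_ceiling(0.30, n):.0%} ceiling")
```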

The implication is that it is essential to choose strategies that force the model to evaluate its own efforts rather than simply chasing a single answer. Without that evaluation stage, the model struggles a good deal of the time.

Also: With AI models clobbering every benchmark, it’s time for human evaluation

A big hole in BrowseComp, the authors acknowledge, is that it is limited to questions that are easy for the computer to parse, and whose answers are easy to verify. None of the 1,266 questions included “long responses or ability to resolve ambiguity in user queries.”

As a result, BrowseComp, they argue, tests “core” functions of AI agents but is not comprehensive. “The model must be very proficient at locating hard-to-find pieces of information, but it’s not guaranteed that this generalizes to all tasks that require browsing.”

Deep Research is available to users of OpenAI’s Plus and Pro subscriptions.

