Google DeepMind researchers introduce a new benchmark to improve LLM factuality and reduce hallucinations.




Large language models (LLMs) continue to be plagued by hallucinations, or factually incorrect responses. Models tend to fail on more complex tasks and when users are looking for specific, highly detailed answers.

Data scientists have been struggling to overcome this challenge, but now researchers from Google DeepMind claim they are one step closer to achieving factuality in foundation models. FACTS Grounding is a benchmark that evaluates LLMs' ability to generate factually accurate responses grounded in long-form documents. Models are also evaluated on their ability to provide answers that are relevant and useful to the prompt.

In addition to the new benchmark, researchers have released a leaderboard for the Kaggle data-science community.

Gemini 2.0 Flash topped this week's leaderboard with a factuality score of 83.6%. Other models in the top nine include Google's Gemini 1.0 Flash and Gemini 1.5 Pro; Anthropic's Claude 3.5 Sonnet and Claude 3.5 Haiku; and OpenAI's GPT-4o. All of these scored above 61.7% accuracy.

The researchers say the leaderboard will be actively maintained and continually updated to include new models and their different iterations.

"We believe that this benchmark fills a gap in evaluating a wider variety of model behaviors pertaining to factuality, in comparison to benchmarks that focus on narrower use cases…such as summarization alone," the researchers write in a technical paper published this week.

Ensuring the factual accuracy of LLM responses is difficult because of factors related to modeling (architecture, training and inference) and measurement (evaluation methods, data and metrics). The researchers point out that pre-training typically focuses on predicting the next token given the previous tokens.

While this objective may teach models salient knowledge of the world, it does not optimize the model for the various factuality scenarios. Instead, the model is encouraged to generate plausible text, the researchers write.

To address this, the FACTS dataset comprises 1,719 examples (860 public and 859 private), each requiring a long-form response based on the context provided in an accompanying document. Each example includes the following (a minimal sketch of this structure follows the list):

  • a system prompt (system_instruction) with general directives and an instruction to respond only based on the provided context;
  • a task (user_request) with a specific question to be answered;
  • a long document (context_document) containing the necessary information.
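
To make that structure concrete, here is a minimal Python sketch of how one of these examples might be represented. The three field names come from the benchmark description above; the class, the example values and the prompt template are illustrative assumptions, not the official Kaggle schema.

```python
# A minimal sketch of how a single FACTS Grounding example might be represented.
# Field names come from the benchmark description; everything else is assumed.

from dataclasses import dataclass

@dataclass
class FactsExample:
    system_instruction: str  # general directives, e.g. answer only from the context
    user_request: str        # the specific question the model must answer
    context_document: str    # long-form document (up to ~32,000 tokens) holding the facts

example = FactsExample(
    system_instruction="Answer the question using only the provided context document.",
    user_request="Summarize the main reason the company's revenue decreased in Q3.",
    context_document="(full text of the company's annual financial report)",
)

# The model under evaluation is prompted with all three fields and asked to
# produce a long-form, fully grounded response.
prompt = (
    f"{example.system_instruction}\n\n"
    f"Document:\n{example.context_document}\n\n"
    f"Task: {example.user_request}"
)
```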

To be labeled "accurate," the model must process the long-form document and produce a long-form response that is both comprehensive and fully attributable to the document. A response is labeled "inaccurate" if the model's claims are not directly supported by the document and are not relevant or useful.

A user might, for example, ask a model to summarize the main reason why a company's revenue decreased in Q3, while providing it with detailed information such as the company's annual financial report, which discusses quarterly earnings, planned investments and market analysis.

If the model returned, for example, "The company faced challenges that affected its revenue," it would be considered inaccurate. The researchers point out that such a response fails to specify any reasons, such as market trends or increased competition, which would likely be in the document, and shows no effort to engage with or extract the relevant details.

By contrast, if the user asked, "What are some money-saving tips?" and provided a document of categorized money-saving tips for college students, a correct response would be highly detailed: "Utilize free activities on campus, purchase items in bulk and cook at home. Also, set spending goals, avoid using credit cards and conserve resources."

DeepMind uses LLMs to judge LLMs

To allow for diverse inputs, the researchers included documents of varying lengths, up to 32,000 tokens, spanning areas including finance, technology, retail, medicine and law. The user requests are similarly broad, ranging from Q&A generation to requests for summarization and rewriting.

Every example is judged in two phases. First, responses are evaluated for eligibility and disqualified if they do not address the user's request. Second, responses are judged on whether they are free of hallucinations and fully grounded in the provided document.

Factuality scores are calculated by three different LLM judges (Gemini 1.5 Pro, GPT-4o and Claude 3.5 Sonnet), each of which determines an individual score based on the percentage of accurate model outputs. The final factuality score is the average of the three judges' scores.
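
Based on that description, the aggregation itself is simple arithmetic. The Python sketch below shows how the two-phase judging and the three-judge averaging might fit together; facts_score, judge_eligibility and judge_grounding are hypothetical helpers standing in for the actual judge prompts, not DeepMind's code.

```python
# A sketch of the two-phase, three-judge scoring described above. The helper
# callables judge_eligibility() and judge_grounding() are hypothetical stand-ins
# for the actual LLM judge prompts; only the aggregation reflects the article.

JUDGE_MODELS = ["gemini-1.5-pro", "gpt-4o", "claude-3.5-sonnet"]

def facts_score(responses, examples, judge_eligibility, judge_grounding):
    """Return a model's final factuality score, averaged across the three judges."""
    per_judge_scores = []
    for judge in JUDGE_MODELS:
        accurate = 0
        for response, example in zip(responses, examples):
            # Phase 1: disqualify responses that fail to address the user request.
            if not judge_eligibility(judge, response, example):
                continue
            # Phase 2: count responses that are hallucination-free and fully
            # grounded in the context document as accurate.
            if judge_grounding(judge, response, example):
                accurate += 1
        per_judge_scores.append(accurate / len(examples))
    # Final score: the mean of the three judges' individual scores.
    return sum(per_judge_scores) / len(per_judge_scores)
```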

The researchers note that models are often biased toward other members of their own model family (with a mean score increase of around 3.23%), so combining judges was crucial to ensuring responses were genuinely factual.

The researchers conclude that factuality and grounding are critical factors in the future success and utility of LLMs. They write: "We believe that comprehensive methods of benchmarking, combined with continuous research and development, will continue improving AI systems." They also acknowledge that benchmarks tend to be quickly overtaken by progress, and that the launch of the FACTS Grounding benchmark and leaderboard is only the beginning.
