Credit: VentureBeat, made using Midjourney
Diffbot is a small Silicon Valley firm best known for maintaining the largest index of web knowledge in the world. Today, the company announced the release of a new AI model that promises to address the biggest challenge in the field: factual accuracy.
The new model is a fine-tuned version of Meta's Llama 3.3 and the first open-source implementation of a system called graph retrieval-augmented generation, or GraphRAG.
Unlike traditional AI models that rely on preloaded training data, Diffbot's LLM uses real-time information drawn from the company's Knowledge Graph. This constantly updated database contains more than a billion interconnected facts.
In an interview with VentureBeat, Diffbot founder and CEO Mike Tung said the company is working from a hypothesis: general-purpose reasoning will eventually be reduced to about 1 billion parameters. "You don't want the knowledge to be in the model. You want your model to be able to use tools well so that it can query external knowledge," he said.
What it is
Diffbot's Knowledge Graph has been crawling public web pages since 2016. It extracts structured information from those pages using a combination of computer vision and natural-language processing.
The Knowledge Graph is refreshed every four to five days with millions of new facts, which keeps it up to date. Diffbot's AI model uses this resource to retrieve information in real time, instead of relying on static knowledge encoded in its training data.
When asked about a recent event, for example, the model searches the web for the latest updates, extracts the relevant facts, and cites the original sources. This process is designed to make the system more accurate and more transparent than conventional LLMs.
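The retrieve, extract, and cite steps described above can be illustrated with a minimal sketch. This is not Diffbot's implementation: the `Fact` record, the toy `FACTS` list, and the `retrieve` and `answer_with_citations` helpers are all hypothetical, and the facts themselves are invented placeholders. The point is only that each answer is assembled from stored facts that carry their source, rather than from the model's memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """One edge in a toy knowledge graph, with the page it came from."""
    subject: str
    predicate: str
    obj: str
    source: str  # hypothetical URL the fact was extracted from


# Invented placeholder facts; a real knowledge graph would be queried over a network.
FACTS = [
    Fact("ExampleCorp", "ceo", "Jane Doe", "https://example.com/about"),
    Fact("ExampleCorp", "headquarters", "Springfield", "https://example.com/contact"),
    Fact("OtherCo", "ceo", "John Roe", "https://example.org/team"),
]


def retrieve(subject, predicate, facts=FACTS):
    """Return every stored fact matching the subject and predicate."""
    return [f for f in facts if f.subject == subject and f.predicate == predicate]


def answer_with_citations(subject, predicate):
    """Build an answer only from retrieved facts and attach each fact's source."""
    hits = retrieve(subject, predicate)
    if not hits:
        return "No supporting fact found."
    return "; ".join(
        f"{f.subject} {f.predicate}: {f.obj} [source: {f.source}]" for f in hits
    )


if __name__ == "__main__":
    print(answer_with_citations("ExampleCorp", "ceo"))
    print(answer_with_citations("ExampleCorp", "founded"))
```

Because the answer is built from retrieved records, a question with no matching fact produces an explicit "not found" rather than a guess, which is the transparency property the article describes.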
"Imagine asking a machine about the weather," Tung said. "Instead of generating a response based on outdated data, our model queries the live weather service and gives a response rooted in real-time information."
How Diffbot's Knowledge Graph beats conventional AI at finding facts
Diffbot's approach seems to be paying off in benchmark tests. The company claims its model achieved 81% accuracy on FreshQA, a Google benchmark for testing real-time factual knowledge, beating both ChatGPT and Gemini. It also scored 70.36% on MMLU-Pro, a more challenging version of a standard academic knowledge test.
Diffbot's model is now open source, which allows companies to run it on their own hardware and customize it to their needs. This responds to growing concerns over data privacy and vendor lock-in with major AI providers. Tung noted that the model can be run on a user's own machine. "You can't run Google Gemini unless you send your data to Google and ship it outside your premises," he said.
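Tung's weather example amounts to routing a question to a live tool instead of answering from training data. The sketch below shows that pattern under loud assumptions: `fake_weather_service` is a hypothetical stand-in that returns canned data, and the keyword-based `route` function is a deliberately naive placeholder for whatever tool-selection logic a real model uses.

```python
from datetime import datetime, timezone


def fake_weather_service(city):
    """Hypothetical stand-in for a live weather API; returns canned data."""
    return {
        "city": city,
        "temp_c": 18,  # canned value, not a real reading
        "fetched_at": datetime.now(timezone.utc).isoformat(),
    }


# Registry of tools the responder may call (only one in this sketch).
TOOLS = {"weather": fake_weather_service}


def route(question):
    """Naive keyword router: return a tool name, or None if no tool applies."""
    return "weather" if "weather" in question.lower() else None


def respond(question, city):
    """Answer from a live tool when one applies, and say so when none does."""
    tool = route(question)
    if tool is None:
        return "No live tool available; would fall back to model knowledge."
    data = TOOLS[tool](city)
    return f"It is {data['temp_c']} C in {data['city']} (fetched {data['fetched_at']})."


if __name__ == "__main__":
    print(respond("What is the weather like today?", "Paris"))
```

The response carries the fetch timestamp, so a reader can see how fresh the underlying data is.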
Open-source AI may transform the way enterprises handle sensitive information
This release comes at an important moment in AI development. In recent months, companies have continued to increase the size of their models despite criticism that large language models tend to "hallucinate" and generate false information. Diffbot's method suggests a different path forward: grounding AI in verifiable facts rather than trying to encode all of human knowledge into neural networks. "Not everyone is going for bigger and larger models," Tung said. "You can create a model with more capabilities than a large model, but still use a nonintuitive approach."
The company provides data services for major companies such as Cisco, DuckDuckGo, and Snapchat.
The model is available immediately through an open-source release on GitHub and can be tested via a public demonstration at diffy.chat. Diffbot says the 8-billion-parameter version can run on a single Nvidia A100 GPU, while the 70-billion-parameter version requires two Nvidia H100 GPUs.
Tung believes the future of AI lies not in ever-larger models but in better ways to organize and access human knowledge. "Facts become stale," he said, adding that these facts will be moved into places where they can be modified and where data provenance can be established. It remains to be seen whether Diffbot's approach can change the direction of the field, but it has shown that size isn't everything when it comes to AI.
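The hardware figures above are consistent with a rough back-of-the-envelope estimate. The sketch below assumes 16-bit weights (2 bytes per parameter) and 80 GB of memory per GPU; neither assumption comes from Diffbot, and the estimate covers weights only, so real deployments need extra headroom for activations and the attention cache. The helper names are hypothetical.

```python
import math


def weight_memory_gb(params_billions, bytes_per_param=2):
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes).

    Assumes 16-bit weights by default; this is an assumption, not a Diffbot figure.
    """
    return params_billions * bytes_per_param


def min_gpus(params_billions, gpu_memory_gb=80, bytes_per_param=2):
    """Lower bound on GPU count needed just to hold the weights.

    The 80 GB per-GPU default is an assumption for illustration.
    """
    return math.ceil(weight_memory_gb(params_billions, bytes_per_param) / gpu_memory_gb)


if __name__ == "__main__":
    for size in (8, 70):
        print(size, "B params:", weight_memory_gb(size), "GB weights,",
              min_gpus(size), "GPU(s) minimum")
```

Under these assumptions the 8-billion-parameter model needs about 16 GB for weights and fits on one GPU, while the 70-billion-parameter model needs about 140 GB and therefore at least two 80 GB GPUs.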