Nvidia is rolling out tools to generate synthetic data so that developers can train their own AI models and then fine-tune them for specific applications. In theory, synthetic data offers an almost infinite supply of AI training data, which could help solve the data shortage that has plagued the AI industry since ChatGPT went mainstream in 2022. But experts warn that using synthetic data for generative AI carries its own risks.
Nvidia’s spokesperson declined to comment.
Gretel was founded in 2019 by Alex Watson, John Myers, and Ali Golshan, who also serves as CEO. The startup offers a synthetic data platform and a set of APIs to developers who want to build generative AI models but lack access to enough training data, or who have privacy concerns about using real people’s personal data. Gretel doesn’t build and license its own frontier AI models; instead, it fine-tunes existing open source models and adds differential privacy and security features. According to PitchBook, the company had raised more than $67 million in venture capital prior to the acquisition.
Gretel’s spokesperson also declined to comment.
Synthetic data is computer-generated and designed to mimic real-world data. Generating the data needed to build AI models this way is said to be more scalable, easier, and more accessible to smaller or less-resourced AI developers. Synthetic data is also attractive to health care providers, government agencies, and banks because of its privacy protections. Nvidia has offered synthetic data tools to developers for years. In 2022 it launched Omniverse Replicator, which gives developers the ability to create custom, physically accurate synthetic 3D data for training neural networks. In June, Nvidia launched a new family of open AI models that create synthetic training data developers can use to build or fine-tune LLMs. Those models, called Nemotron-4 340B, can be used by developers to generate synthetic data for LLMs in “healthcare, finance, manufacturing and retail” and other industries.
“We focus on three problems,” Huang said. “One, how can you solve the data issue? How and where can you collect the data required to train the AI? What is the model architecture?” He went on to explain how the company now uses synthetic data generation in its robotics platforms.
Synthetic data can be used in at least two different ways, according to Ana-Maria Cretu, a postdoctoral researcher at the École Polytechnique Fédérale de Lausanne in Switzerland who studies privacy and synthetic data. It can take the form of tabular data, such as demographic or medical information, to address data scarcity or make a dataset more diverse.
Cretu offers an example: If a hospital wants to build an AI model that tracks a specific type of cancer but has only a small dataset of 1,000 patients, synthetic data can be used to fill in the gaps, remove biases, and anonymize real patients’ data. Synthetic data can also be used to protect privacy when real data can’t be disclosed, Cretu says.
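In its simplest form, the kind of tabular synthetic data Cretu describes can be sketched as fitting summary statistics to a small real dataset and sampling new rows from them. The toy below is a minimal sketch under that assumption; the tiny “patient” table, the column choices, and the independent-Gaussian model are all invented for illustration, and real platforms like Gretel’s use far more sophisticated, privacy-aware generative models.

```python
import random
import statistics

def fit_column_stats(rows):
    """Estimate per-column mean and standard deviation from real data."""
    columns = list(zip(*rows))
    return [(statistics.mean(c), statistics.pstdev(c)) for c in columns]

def sample_synthetic(stats, n, rng):
    """Draw n synthetic rows, modeling each column as an independent Gaussian."""
    return [[rng.gauss(mu, sigma) for mu, sigma in stats] for _ in range(n)]

rng = random.Random(0)
# Toy stand-in for a small real patient dataset: [age, tumor_size_mm]
real = [[62, 14.2], [55, 9.8], [71, 18.5], [48, 7.1], [66, 12.9]]

stats = fit_column_stats(real)
synthetic = sample_synthetic(stats, 1000, rng)  # 1,000 synthetic "patients"
```

The synthetic rows preserve the broad statistics of the original five records while containing no real patient, which is the core privacy appeal; the trade-off is that such simple models can also flatten rare but important patterns.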
But, Cretu continues, synthetic data is also a phrase that has been adopted in the world of LLMs to answer the question, “How can we increase the amount of data available to LLMs over time?” A report by MIT’s Data Provenance Initiative showed that restrictions on open web content increased last year.
In theory, synthetic data could be a simple solution. But an article published in Nature in July 2024 highlighted how AI language models can “collapse,” degrading in quality, when they are fine-tuned repeatedly on data generated by another model. If you feed a machine only its own machine-generated output, it will theoretically start to eat itself and produce detritus.
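That “eating itself” dynamic can be illustrated with a toy simulation (this is an invented sketch, not the Nature paper’s actual experimental setup): treat a model as a simple token distribution, sample from it, and re-fit the distribution to those samples alone, generation after generation. Because sampling can never produce a token the previous generation has already lost, the vocabulary can only shrink, and rare tokens vanish first.

```python
import random
from collections import Counter

def resample_distribution(dist, n, rng):
    """Sample n tokens from dist, then re-estimate the distribution
    from those samples alone -- mimicking training on model output."""
    tokens, probs = zip(*dist.items())
    sample = rng.choices(tokens, weights=probs, k=n)
    counts = Counter(sample)
    return {token: count / n for token, count in counts.items()}

rng = random.Random(42)
# An invented vocabulary: a few common tokens, many rare ones.
dist = {f"common{i}": 0.18 for i in range(5)}
dist.update({f"rare{i}": 0.002 for i in range(50)})

supports = [len(dist)]  # number of distinct surviving tokens
for _ in range(20):     # 20 generations of training on model output
    dist = resample_distribution(dist, 200, rng)
    supports.append(len(dist))
```

After each round the support is a subset of the previous one, so the count of surviving tokens is monotonically non-increasing; in this toy run most of the rare tokens disappear within a few generations, which mirrors the loss of tail knowledge described in the collapse literature.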
Alexandr Wang, the chief executive officer of Scale AI, which relies heavily on human workers to label the data used to train models, shared the Nature article’s findings on X. “While many researchers view synthetic data as an AI philosopher’s stone, there is no free lunch,” he wrote. Later in the thread, Wang explained why he firmly believes a hybrid data approach is best.
A cofounder of Gretel pushed back against the Nature paper. Gary Marcus, a cognitive scientist who has been a vocal critic of AI hype, wrote at the time about the “extreme” scenario of repetitive training using only synthetic data, saying he agreed with Wang’s “diagnosis, but not his prescription.” In his view, the industry will move forward by developing new AI model architectures rather than focusing solely on the idiosyncrasies of data sets. “Systems such as [OpenAI’s] o1/o3 appear to be better suited to domains like math and coding, where you can generate – and validate – tons of synthetic data. They have been less successful in general-purpose reasoning, especially in open-ended domains,” Marcus wrote in an email to WIRED.
Cretu is convinced that the scientific theory behind model collapse is sound. She notes, however, that most computer scientists and researchers train on a mixture of synthetic and real data. You might be able to avoid model collapse, she says, if you bring in fresh data for every new round of training.
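Cretu’s point about fresh data can be sketched with a toy token-distribution simulation (the distributions, mixing ratio, and token names below are invented for illustration): if each training round mixes the model’s own output with samples drawn from real data, tokens the model had lost can be reintroduced rather than disappearing forever.

```python
import random
from collections import Counter

def retrain(model_dist, real_dist, n, fresh_frac, rng):
    """Re-fit a token distribution from a batch that mixes model output
    with a fresh_frac share of samples drawn from real data."""
    n_fresh = int(n * fresh_frac)
    tokens, probs = zip(*model_dist.items())
    batch = rng.choices(tokens, weights=probs, k=n - n_fresh)
    real_tokens, real_probs = zip(*real_dist.items())
    batch += rng.choices(real_tokens, weights=real_probs, k=n_fresh)
    counts = Counter(batch)
    return {token: count / n for token, count in counts.items()}

rng = random.Random(7)
real_dist = {"a": 0.5, "b": 0.3, "c": 0.2}  # invented "true" data distribution
collapsed = {"a": 0.7, "b": 0.3}            # model that has already lost token "c"

# One round with 50 percent fresh real data can restore the lost token.
refreshed = retrain(collapsed, real_dist, n=1000, fresh_frac=0.5, rng=rng)
```

With `fresh_frac=0`, this reduces to pure self-training and the lost token can never return; any positive share of real data gives it a chance to reappear, which is the intuition behind the hybrid approach Wang and Cretu both describe.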
Even so, concerns about model collapse haven’t stopped the AI industry from jumping on the synthetic data bandwagon. At a recent Morgan Stanley technology conference, Sam Altman reportedly touted OpenAI’s ability to use its AI models to generate more training data. Dario Amodei, CEO of Anthropic, has stated that he believes it’s possible to build a “data-generation engine with infinite capacity” that would maintain quality by mixing in a small amount of new information during training, as Cretu suggested.
Big Tech is also turning to synthetic data. Meta has spoken about how it trained Llama 3, its state-of-the-art family of large language models, partly on synthetic data, some of which was generated by Meta’s prior model, Llama 2. Amazon’s Bedrock platform lets developers use Anthropic’s Claude to generate synthetic data. Microsoft’s Phi-3 small language model was partly trained on synthetic data, though the company warned that “synthetic data generated by pretrained large-language models can sometimes reduce accuracy and increase bias on down-stream tasks.” Google’s DeepMind also uses synthetic data but has likewise highlighted the complexity of developing a pipeline to generate, and maintain, truly private synthetic data.
“We know that all the big tech companies are working on some aspect of synthetic data,” says Alex Bestall, the founder of Rightsify, a music licensing company that also creates AI music and licenses its catalog for AI models. “But human data is a requirement in many of our deals. They may want a dataset that is 60 percent human-generated and 40 percent synthetic.”