A new study seems to support the claim that OpenAI trained some of its AI models using copyrighted material.
OpenAI has been accused by authors, programmers, and other rights-holders of using their books, codebases, and other works to develop its AI models without permission. OpenAI has long claimed a fair use defense; the plaintiffs in these cases argue that U.S. copyright law does not allow for one.
The study, co-authored by researchers from the University of Washington and Stanford, suggests a new way to identify training data "memorized" by models served behind an API, such as OpenAI's.
These models are prediction engines: they learn patterns from large amounts of data, which is how they can generate essays, photos, and more. The outputs are not usually exact copies of the training data, but some are. Image models have been found to regurgitate screenshots from movies they were trained on, while language models have been observed effectively plagiarizing news articles.
The study's method relies on what the authors call "high-surprisal" words: words that stand out in a larger work as uncommon. For example, "radar" is a high-surprisal word in the sentence "Jack and I sat perfectly quiet with the radar humming," because it is statistically less likely than words like "engine" or "radio" to appear before "humming." The co-authors masked high-surprisal words in text snippets and had the models try to guess the missing words; if a model guessed correctly, they concluded, it likely memorized the snippet during training.
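To make the idea of surprisal concrete, here is a minimal sketch that scores a word's surprisal as its negative log-probability under a toy unigram frequency model. The counts are invented for illustration; the study's actual scoring model and masking procedure are not specified here.

```python
import math
from collections import Counter

def surprisal(word, counts, total):
    """Surprisal in bits: -log2 P(word). Rarer words score higher.
    Add-one smoothing keeps unseen words at a finite (high) surprisal."""
    p = (counts.get(word, 0) + 1) / (total + len(counts) + 1)
    return -math.log2(p)

# Hypothetical word frequencies from a toy corpus (for illustration only).
counts = Counter({"engine": 50, "radio": 40, "radar": 2, "humming": 30})
total = sum(counts.values())

# "radar" is far rarer than "engine" or "radio", so it is the
# high-surprisal word a memorization probe would mask and ask
# the model to fill in.
assert surprisal("radar", counts, total) > surprisal("engine", counts, total)
assert surprisal("radar", counts, total) > surprisal("radio", counts, total)
```

A model that reliably fills in such low-probability words for a specific passage is more plausibly reproducing memorized text than making a lucky statistical guess, which is the intuition the study builds on.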
According to the results of the test, GPT-4 showed signs of having memorized parts of popular fiction books, including books in BookMIA, a dataset of samples of copyrighted ebooks. The results also indicated that the model memorized portions of New York Times articles, albeit at a comparatively lower rate.
Abhilasha Ravichander, a doctoral candidate at the University of Washington and a co-author of the study, told TechCrunch that the findings shed light on the "contentious data" models may have been trained on.
Ravichander said that in order to create large language models that are trustworthy, it is important to have models that can be audited and examined scientifically. "Our work aims at providing a tool to investigate large language models. But there is a need for greater transparency in the entire ecosystem."
OpenAI has advocated for looser restrictions on developing models using copyrighted data. The company has a number of content licensing agreements in place, and it offers opt-out mechanisms that allow copyright holders to flag content they would prefer the company not use for AI training. It has also lobbied governments to codify "fair use" rules around AI training.
Kyle Wiggers is TechCrunch's AI Editor. His writing has appeared in VentureBeat and Digital Trends, as well as a range of gadget blogs including Android Police, Android Authority, Droid-Life, and XDA-Developers. He lives in Manhattan with his partner, a music therapist.