OpenAI wants copyright rules to be bent. A study suggests it isn't waiting for permission

Tech book tycoon Tim O’Reilly says OpenAI mined his publishing house’s copyright-protected books for training data and fed them into its top GPT-4o model, all without permission.

The generative AI startup is facing lawsuits over its alleged use of copyrighted materials, without consent or compensation, to train its GPT family of neural networks. OpenAI denies all wrongdoing.

O’Reilly is one of the three authors of an AI Disclosures Project study [PDF] entitled, “Beyond public access in LLM pre-training data: Non-public content in OpenAI’s models.”

The authors define non-public books as those available only behind a paywall, rather than freely accessible on the open web.

To determine whether GPT-4o had ingested O’Reilly Media’s copyrighted books without the publisher’s consent, the trio used the so-called DE-COP membership inference attack, described in this 2024 preprint paper, to probe the model that powers the world-famous ChatGPT.

This is how it worked: The team asked OpenAI’s models multiple-choice questions. Each question asked the software which paragraph, from a list of options labeled A through D, was a verbatim passage from a book published by O’Reilly. One option was taken directly from the book; the others were machine-generated paraphrases.
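The quiz construction can be sketched roughly as follows. This is an illustrative reimplementation, not the study’s actual code; the function name and prompt wording are made up for the example:

```python
import random

def build_decop_question(verbatim: str, paraphrases: list[str]) -> tuple[str, str]:
    """Build one DE-COP-style multiple-choice question.

    `verbatim` is a passage taken directly from the book; `paraphrases`
    are three machine-generated rewrites of it. Returns the prompt text
    and the letter of the correct (verbatim) option.
    """
    options = paraphrases + [verbatim]
    random.shuffle(options)                     # hide the answer position
    labels = "ABCD"
    answer = labels[options.index(verbatim)]    # which letter is the real passage
    prompt = "Which of the following passages is verbatim from the book?\n"
    for label, text in zip(labels, options):
        prompt += f"{label}. {text}\n"
    return prompt, answer
```

A model that has memorized the book should pick the correct letter far more often than the 25 percent chance baseline.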

If the model consistently identified the verbatim passages, it was likely trained on the copyrighted text.

The models’ choices are used to calculate an Area Under the Receiver Operating Characteristic curve (AUROC) score, with higher numbers indicating a greater probability that the neural network was trained on passages from the 34 O’Reilly books tested. Scores near 50 percent were taken as a sign that the model had not been trained on the data.
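As a minimal illustration of what that AUROC number means (a generic rank-based computation, not the study’s code): treat each passage’s score as, say, the model’s confidence in the verbatim option, and compare suspected-training passages against control passages the model could not have seen.

```python
def auroc(member_scores: list[float], nonmember_scores: list[float]) -> float:
    """AUROC via the Mann-Whitney identity: the probability that a randomly
    chosen 'member' (suspected training data) passage scores higher than a
    randomly chosen 'non-member' (control) passage; ties count as 0.5.
    A result of 0.5 is chance (no evidence of training); 1.0 is perfect
    separation."""
    pairs = len(member_scores) * len(nonmember_scores)
    wins = sum(1.0 if m > n else 0.5 if m == n else 0.0
               for m in member_scores
               for n in nonmember_scores)
    return wins / pairs
```

With cleanly separated scores, `auroc([0.9, 0.8, 0.6], [0.4, 0.3, 0.2])` returns 1.0, while identical score distributions return 0.5, the “not trained on it” baseline the study uses.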

OpenAI’s GPT-4o Mini and GPT-3.5 Turbo, as well as GPT-4o, were tested across 13,962 paragraph excerpts. The results were mixed.

GPT-4o, released in May 2024, scored 82 percent, a strong indication that it was trained on material from the publisher. The researchers speculated that OpenAI trained the model using the LibGen database, which contains the 34 books. You may remember that Meta was also accused of using this notorious dataset to train its Llama models.

OpenAI’s model pre-training data makes increasing use of non-public data

AUROC scores for the 2022-era GPT-3.5 model were just above 50 percent.

According to the researchers, the higher score of GPT-4o was evidence that “the role of non-public data in OpenAI’s model pre-training data has increased significantly over time.”

But the trio also found that the smaller GPT-4o Mini, also released in 2024 after a training period that ended at the same time as that of the full GPT-4o, did not appear to have been trained on O’Reilly books. The researchers don’t think this indicates their tests are flawed; rather, they believe the mini model’s smaller parameter count may impair its ability to “remember” text. The authors wrote:

“These results highlight the urgent need for increased corporate transparency regarding pre-training data sources as a means to develop formal licensing frameworks for AI content training.”

“Although the evidence presented here on model access violations is specific to OpenAI and O’Reilly Media books, this is likely a systematic issue,” they added.

Sruly Rosenblat, Tim O’Reilly, and Ilan Strauss warned that failing to adequately compensate creators could lead to, if you’ll excuse the jargon, the enshittification (or stifling) of the internet. They argued:

“If AI companies extract value from a content creator’s produced materials without fairly compensating the creator, they risk depleting the very resources upon which their AI systems depend,” the trio wrote. “If left unaddressed, uncompensated training data could lead to a downward spiral in the internet’s content quality and diversity.”

AI giants have begun signing agreements with publishers and social networks to license their content. OpenAI has signed deals with Reddit, Time magazine, and other publishers for access to their archives, and Google has also struck a deal with Reddit.

OpenAI, however, has recently urged the US government to relax copyright rules in ways that would make training AI models easier.

In an open letter sent to the White House Office of Science and Technology Policy last month, the super-lab argued that “rigid copyright rules are repressing innovation and investment,” and that if nothing changes, Chinese model builders will surpass American companies.

Lawyers are doing well while model-makers struggle. Thomson Reuters recently won a summary judgment against Ross Intelligence after a US court found the startup had violated copyright by using Westlaw headnotes to train its AI system.

  • Writing for humans? Do AI roboauthors qualify for copyright? The answer is still no, according to an appeals court.
  • OpenAI wants Uncle Sam to let it scrape everything, and to stop other countries from complaining.

While some in the tech industry are pushing for unfettered access, others are putting up roadblocks to protect copyrighted materials. Cloudflare last month launched a bot-busting AI tool designed to make scrapers’ lives miserable.

Cloudflare’s “AI Labyrinth” works by luring rogue bots into a maze-like web of decoys, wasting their time and resources while hiding real content.

OpenAI, fresh from banking another $40 billion in funding, did not immediately respond to our request for comment. We’ll let you know if and when we hear back. ®

