Judge rules that LLMs can scavenge data from books

A judge who is well-versed in technology has ruled that Anthropic can scan purchased books to train its Claude AI models, but that pirating books is illegal. Anthropic purchased millions of books – many secondhand – and then cut them apart to digitize the content. It also downloaded more than 7 million pirated ebooks from the Books3 dataset, Library Genesis (LibGen), and Pirate Library Mirror (PiLiMi), which was the sticking point for Judge William Alsup at California’s Northern District Court.

He ruled on Monday that digitizing a printed copy was fair use under US law because no additional copy was created: the pages were destroyed once they were scanned. Anthropic could still be forced to go to trial for using pirated material, even though, as Alsup wrote in Monday’s decision [PDF], the company later no longer felt the need to train on pirated books, for legal reasons.

Three authors – Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson – filed the case, alleging that Anthropic illegally used their fiction and nonfiction works to train Claude. Anthropic used at least two books written by each author.

Alsup noted that Anthropic hired Tom Turvey, the former head of partnerships for Google’s book-scanning project, who began discussions with publishers about licensing, as other AI developers had done. These talks were abandoned in favor of buying millions of books and scanning the pages, which the judge deemed fair use.

“We are pleased that the Court recognized that using ‘works to train LLMs was transformative — spectacularly so,’” an Anthropic spokesperson told The Register.

The order also recounts how the pirated material was acquired. An Anthropic cofounder “downloaded Books3, an online library of 196,640 books that he knew had been assembled from unauthorized copies of copyrighted books — that is, pirated.” In June, he downloaded “at least five million copies of books” from LibGen, and in July 2022 another two million copies came from PiLiMi. Alsup classified both of these as “pirate libraries.”

Alsup found that this could be a legal problem for the startup, since the pirated copies were retained for “Anthropic’s pocketbook and convenience.” He wrote:

“This order grants summary judgment for Anthropic that the training use was a fair use. And, it grants that the print-to-digital format change was a fair use for a different reason.

“But it denies summary judgment for Anthropic that the pirated library copies must be treated as training copies. We will have a trial on the pirated copies used to create Anthropic’s central library and the resulting damages, actual or statutory (including for willfulness). That Anthropic later bought a copy of a book it earlier stole off the internet will not absolve it of liability for the theft but it may affect the extent of statutory damages.”

Alsup’s decision is mixed news for Anthropic, but he knows his onions. Alsup has presided over some of the biggest tech trials of the last quarter century, and his rulings have on some occasions been backed by the Supreme Court. A coder himself (mostly in BASIC) for more than two decades, he presided over the Oracle v. Google trial, which turned on fair use of Java in Android, and the case led him to experiment with that language. He also sentenced former Google self-driving car engineer Anthony Levandowski, who stole proprietary information from Google and then sold his startup Otto to Uber, to 18 months in prison. President Trump pardoned Levandowski in 2021.

Bartz and Johnson had no comment at the time of publication. Graeber declined to comment on the verdict. ®
