Meta accused of using pirated torrents to train its AI

January 14, 2025

New day, new controversy surrounding artificial intelligence. This time Meta has been accused of using pirated torrent content to train Llama, the large language model (LLM) that powers Meta AI. The lawsuit was one of the first copyright suits filed against a tech firm over AI training.

Documents reveal Meta AI was trained using pirated content

As reported by Wired, Meta was sued in 2023 for allegedly using pirated content to train Llama, its LLM. The case is known as “Kadrey et al. v. Meta Platforms.” Richard Kadrey, Christopher Golden and other novelists filed the lawsuit, claiming that Meta had used copyrighted material without authorization.

Meta had previously provided documents to the court with redacted information, but Judge Vince Chhabria of the United States District Court for the Northern District of California ruled that the original documents be made public. That has now happened.

The documents reveal conversations between Meta staff about Meta AI and Llama. In one conversation, an engineer said that “torrenting from a [Meta-owned] corporate laptop doesn’t feel right,” which suggests that the company used pirated material to train its AI. Another conversation suggests that “MZ,” understood to be Mark Zuckerberg, authorized the use of pirated material.

Evidence indicates that Meta used content from LibGen, a large library of pirated academic articles, magazines, and books. LibGen, a Russian “piracy hub” created in 2008, has been the subject of multiple copyright lawsuits ever since. Meta also reportedly used material from other “shadow libraries” for AI training.

According to the company, it used publicly available materials in accordance with the legal doctrine of “fair use,” which allows copyrighted material to be used without permission under certain circumstances. Such uses are evaluated on a case-by-case basis. Meta claims it is simply “using text to statistically model language and generate original expression.”

What about Apple Intelligence?

An investigation last year revealed that the training data for Apple’s OpenELM model included subtitles from more than 170,000 YouTube videos. Apple explained that OpenELM is an open-source research model and is not used for Apple Intelligence. Apple’s AI features, which are available on iOS and macOS, are trained using “licensed data, including data that is selected to enhance specific features as well as publicly-available data collected by our web crawler.”

It’s important to note that many large publishers, such as The New York Times and The Atlantic, have chosen not to share their content with Apple Intelligence.