As AI lawsuits mount, publishers still struggle to block the bots

Another publisher is taking OpenAI to court.

Ziff Davis is the latest media company to sue the tech company over copyright infringement, but the lawsuit highlights a broader reality: Publishers still have no reliable way to stop AI companies from scraping their content for free.

Despite growing legal pressure, the web has already been mined. Large language models like ChatGPT were trained on vast amounts of internet data, much of it scraped before publishers began pushing back. And while tools like robots.txt files, paywalls and AI-blocking tags have since emerged, many publishers admit it’s very difficult to enforce control across every bot — especially as some ignore standard protocols or mask their identities.

“The average publisher is trying to compete against a $300 billion company [OpenAI]. It’s hard to invest in the level of ‘bot-walling’ or ‘bot prevention technology,’ and be able to keep up with it. I think publishers are definitely at a disadvantage,” said Arvid Tchivzhel, managing director at Mather Economics’ digital consulting practice.

Robots.txt — a plain-text file that tells web crawlers which URLs they may access, and the standard mechanism for disallowing them from publishers’ sites — remains the simplest defense against bot scraping, with the lowest lift for publishers. (It’s just a few lines of text.) But it has also proven to be the weakest tactic for blocking bot traffic, because AI bots are ignoring it. A recent Tollbit report found that AI bot scrapes bypassing robots.txt grew by over 40% between Q3 and Q4 2024.
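For illustration, a minimal robots.txt along these lines (using the crawler tokens that OpenAI, Google and Common Crawl publicly document) is all it takes to ask those bots to stay away from an entire site:

    User-agent: GPTBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: CCBot
    Disallow: /

The file is a request, not an enforcement mechanism: nothing in the protocol stops a crawler from reading it and scraping anyway, which is exactly the behavior publishers are reporting.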

“I do think that almost all of the AI crawlers have violated the robots.txt request to not crawl. It’s been blatant, and publishers are getting very angry,” said Bill Gross, founder of AI startup ProRata.ai. “It’s really a big issue, and I believe that the only solution is to either have [AI companies] pay — which they don’t want to do — win the lawsuits, or block them.”

Despite pushback, AI scraping is only getting bolder

Even with the robots.txt protocol in place to disallow GPTBot, travel news site Skift was still getting scraped by the bot about 60,000 times a week, according to chief product officer Jason Clampet, who used Tollbit’s tech to see how much traffic was coming from web crawlers.

Ziff Davis has been having the same problem. Despite following OpenAI’s instructions for publishers that want to “opt out” of being scraped by its web crawler GPTBot, including the robots.txt protocol, the bot “continued to actively scrape and make copies of content from Ziff Davis websites without abatement,” according to the publisher’s lawsuit. What’s more, the lawsuit claims GPTBot activity “significantly increased” even after Ziff Davis appealed to OpenAI in May 2024 to stop GPTBot activity on its site.

“If [Ziff Davis is] having issues managing it, I don’t see anybody else who would not have just as bad of issues … [when Ziff Davis has] more resources than most,” Clampet said. “For a publisher of our size, we will do what we can, but then we basically have to deal with it and hope a larger publisher solves some of the problems.”

Once Skift started using Tollbit’s tool to block those bots, ChatGPT stopped scraping Skift’s site “within a day,” Clampet said. However, Tollbit’s solution can’t block Meta’s crawler, which is still coming to Skift’s site about 12,000 times a week, claimed Clampet. A Meta spokesperson said they were not familiar with Tollbit, and that publishers could use “industry-standard practices” like robots.txt to block Meta’s AI crawlers.

Meanwhile, the growing prevalence of “gray bots” is adding to the complexity of the issue. These generative AI bots from companies like OpenAI, Perplexity, Google and TikTok can scrape sites and access paywalled content without permission, driving up website operators’ costs through excessive server and bandwidth use. The Wikimedia Foundation said bots and AI scrapers have driven a 50% rise in infrastructure costs since January 2024.

Lawsuits against AI platforms over illegal scraping and copyright infringement, led by The New York Times and now Ziff Davis, matter because they’re helping define the legal boundaries of how AI companies can use copyrighted content, and whether publishers have any real recourse in the generative AI era. But such cases can take years to resolve. The New York Times first sued OpenAI and Microsoft in December 2023, and the case is ongoing. Most publishers don’t have the resources to follow suit and must watch from the sidelines. In the meantime, they are fighting an uphill battle.

And the scraping is worsening, according to Tollbit’s latest report. Scrapes per website doubled from Q3 to Q4 last year, and scrapes per page more than tripled. The report also found that apps like Perplexity were accessing sites through unidentified bots, as well as self-identified crawlers.

“Unless you’re doing something really sophisticated as a publisher server side and trying to specifically check user agents and check traffic … it’s very hard to do that at scale,” Tchivzhel said.
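As a rough sketch of what that kind of server-side check involves, assuming a Flask app and an illustrative, hand-maintained blocklist of self-identified crawler tokens, it might look something like this:

    # A minimal sketch of server-side user-agent filtering, assuming Flask.
    # The token list is illustrative: real crawler tokens change over time,
    # and bots that mask their identity won't match any of them.
    from flask import Flask, abort, request

    app = Flask(__name__)

    AI_BOT_TOKENS = ("GPTBot", "CCBot", "Google-Extended", "PerplexityBot", "Bytespider")

    @app.before_request
    def block_ai_crawlers():
        ua = request.headers.get("User-Agent", "")
        if any(token in ua for token in AI_BOT_TOKENS):
            abort(403)  # refuse the request before any content is served

    @app.route("/")
    def index():
        return "Regular readers get the page as usual."

Even a check like this only catches crawlers that announce themselves; the unidentified bots flagged in Tollbit’s report would sail through it, which is the gap tools from Tollbit, Fastly and Cloudflare aim to fill.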

Other lines of defense

Other new products are cropping up. Cloud platform Fastly unveiled an AI bot management tool this month that lets publishers choose which AI platforms do or do not have access to their content. Content delivery network Cloudflare also has a tool called AI Audit that lets publishers see which crawlers are accessing their content and how often, and offers them the ability to block all or some AI bots. Over 800,000 sites have chosen to block all AI crawlers, a Cloudflare spokesperson told Digiday.

A few years ago, paywall management companies were developing technology to block AI bots from accessing publishers’ paywalled content. In theory, paywalls could block some bot traffic — as long as bots identify themselves as such.

In its lawsuit against OpenAI, Ziff Davis said that because most of its content is not behind a paywall, it was more vulnerable to scraping from GPTBot.

But paywalls are not proving to be a strong mechanism against scraping. Skift and The New York Times both have paywalls, for example, and were still susceptible to AI bot traffic.

Gross believes bot-blocking tools will continue to develop this year and push AI companies to pay publishers for access to their content. (ProRata.ai’s mission is to create an improved revenue-share model between AI companies and publishers.)

Until recently, publishers “trusted” that AI web crawlers were honoring their efforts to block them from scraping. “But now that it’s getting serious, [publishers are] going to have to take more defensive action,” Gross said.
