16 July 2025
IAB Tech Lab is assembling a task force of publishers and edge computing companies to launch its plan to create a framework that will give publishers greater control over LLM crawling and ensure they are paid accordingly.
About a dozen publishers have signed up for the task force. They will meet in New York City for the first workshop next Wednesday, July 23, to discuss the next steps for its LLM Content Ingest API Framework. Cloudflare, a company that specializes in edge computing, will also be present and speak at the event. According to CEO Anthony Katsur, IAB Tech Lab has been working to bring Fastly, another edge computing company, on board.
As it’s still early, the next step is to write the specification – a technical guide or blueprint that will help different stakeholders (publishers and tech vendors) work towards the same standard. According to Katsur, IAB Tech Lab is in the process of reviewing an internal draft specification with publishers. In the last six months, IAB Tech Lab has presented an overview of the specification (see below) to 40 publishers worldwide.
Katsur aims to have a framework on the market by the fall.
There are, of course, some tricky challenges. It’s one thing to get publishers on board, but it’s another to get the AI companies involved and to hold up their end of the bargain. Digiday spoke to three publishing executives who expressed concern that AI companies would have little interest in establishing compensation or attribution models using this framework.
Katsur knows the challenges the LLM Content Ingest API will face; it will need buy-in from all stakeholders. “I’m skeptical that they [AI platforms] will be willing partners for this,” he said.
He believes that if publishers and edge computing companies work together on this issue, it will reduce infrastructure costs for LLM crawlers, which may encourage them to participate. “We’re going to be aggressive,” he said, referring to how they would present the final technical framework.
This is the pitch deck that IAB Tech Lab presented to publishers.
How the LLM Content Ingest API works
First, a contract must be signed between the LLM provider and the publisher to determine what content is accessible. Only after that agreement has been reached can the publisher set crawler terms that reflect it. Publishers can categorize their content into tiers, such as basic content (daily articles and videos), archived content, or premium content like investigative journalism or exclusive interviews.
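The draft spec isn’t public yet, but the kind of tiered crawler terms Katsur describes could be sketched roughly like this (the structure and field names below are purely illustrative, not taken from the specification):

```python
# Hypothetical crawler-terms declaration a publisher might expose once a
# contract is in place. Field names are illustrative; the draft spec is not public.
crawler_terms = {
    "publisher": "example-news.com",
    "agreement_id": "2025-07-llm-provider-001",  # the signed contract this maps to
    "tiers": {
        "basic":   {"description": "daily articles and videos", "access": "allowed"},
        "archive": {"description": "older archived content", "access": "allowed"},
        "premium": {"description": "investigative journalism, exclusive interviews",
                    "access": "negotiated"},  # only under specific contract terms
    },
}
```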
Then come the payment options: cost-per-crawl, all-you-can-eat unlimited access, and cost-per-query, which is IAB Tech Lab’s preferred model. “We think cost-per-query scales better than cost-per-crawl,” said Katsur. There is a misconception that bots only crawl once; they do in fact return, he stressed, but crawls are still likely to be far less frequent than the queries surfaced in answer engines.
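As a back-of-the-envelope illustration of that scaling argument – not the spec’s actual pricing mechanics – the two models compare like this (all rates and counts below are made up):

```python
def estimated_revenue(crawls, queries, rate_per_crawl, rate_per_query):
    """Compare the two hypothetical pricing models for the same content.

    A piece is typically crawled a handful of times but can surface in
    answer-engine results many times, which is the scaling argument
    Katsur makes for cost-per-query.
    """
    return {
        "cost_per_crawl": crawls * rate_per_crawl,
        "cost_per_query": queries * rate_per_query,
    }

# Illustrative numbers only: 5 crawls vs. 10,000 query appearances.
print(estimated_revenue(crawls=5, queries=10_000,
                        rate_per_crawl=0.50, rate_per_query=0.001))
```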
There is also a logging and reporting component, which ensures publishers can invoice the LLM provider correctly. “There can be reconciliation every month in terms of: here’s how many times you crawled me, or here’s how many times I showed up in a query,” said Katsur.
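Assuming both sides keep event logs, that monthly reconciliation could look something like the following sketch (the log format and function are hypothetical):

```python
from collections import Counter

# Hypothetical event log; in practice these records would come from the
# publisher's edge provider and the LLM provider's own reporting.
events = [
    {"month": "2025-09", "type": "crawl", "content_id": "a1"},
    {"month": "2025-09", "type": "query", "content_id": "a1"},
    {"month": "2025-09", "type": "query", "content_id": "a1"},
]

def monthly_reconciliation(events, month):
    """Tally crawls and query appearances for invoicing a given month."""
    counts = Counter(e["type"] for e in events if e["month"] == month)
    return {"crawls": counts["crawl"], "query_appearances": counts["query"]}

print(monthly_reconciliation(events, "2025-09"))
```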
Tokenization for authenticating source – important to brands and publishers
In the last step, IAB Tech Lab will tokenize content to ensure the accuracy of source information and to show clearly where compensation needs to be paid and to whom. “This is where cost-per-query becomes feasible – you can tokenize content inputs to the LLM and then track it every time it appears in a user’s query because you have assigned a unique ID to that piece of content,” said Katsur. “Ostensibly both the LLM and the publisher should be able to track that.”
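A minimal sketch of that tokenization idea, assuming content is assigned a unique ID before ingestion and each appearance in an answer is logged (all names here are hypothetical):

```python
import uuid

# Each piece of content gets a stable unique ID before it is ingested by the
# LLM; every time it surfaces in an answer, that ID is logged.
def tokenize_content(url: str) -> str:
    """Assign a stable unique ID to a piece of publisher content."""
    return uuid.uuid5(uuid.NAMESPACE_URL, url).hex

query_log: list[dict] = []

def record_query_appearance(content_id: str, query: str) -> None:
    """Log that a tokenized piece of content surfaced in a user query."""
    query_log.append({"content_id": content_id, "query": query})

article_id = tokenize_content("https://example-news.com/investigation-2025")
record_query_appearance(article_id, "what did the 2025 investigation find?")
```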
Tokenizing content is important for Katsur because it helps identify the original source within the “contextual soup” of AI-generated responses, which are typically synthesized from multiple publisher sites.
Brands also worry about their products being misrepresented in query results, said Katsur. He has spoken with CPG and auto manufacturers who have noticed that answers to queries about their products are confused or prone to errors, raising concerns about missed sales opportunities and losing existing or prospective customers.
Tokenizing articles also helps identify each contribution if an AI answer engine draws on content from, say, three different publishers to generate a reply, making it easier to split payment between them.
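Under that assumption, splitting a per-query fee could be as simple as weighting by each publisher’s share of the tokenized content used – again, an illustrative sketch rather than anything from the spec:

```python
# Hypothetical attribution split: if an answer draws on tokenized content from
# three publishers, divide the fee by each one's share of the contributing IDs.
contributions = {"publisher_a": 2, "publisher_b": 1, "publisher_c": 1}  # tokens used

def split_payment(total_fee, contributions):
    """Divide a per-query fee in proportion to each publisher's contribution."""
    total = sum(contributions.values())
    return {pub: round(total_fee * n / total, 6) for pub, n in contributions.items()}

print(split_payment(0.004, contributions))  # e.g. {'publisher_a': 0.002, ...}
```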
The elephant in the room: enforcement
Although publishers welcome any effort to create a more sustainable AI-driven model in which their content isn’t ripped off, there is a healthy amount of skepticism about how an API such as the LLM Content Ingest API will truly prevent scraping. Their view is that it needs to be more robust than robots.txt, which has been easy to ignore and game.
According to Katsur, some LLM crawlers resort to nefarious tactics, such as switching to a different crawler if the original one is listed in robots.txt. He added that publishers must take a firm stance against all crawling if they want this standard to be successful.
Katsur said that to enforce this model you need a very strong wall. “It only takes one weak link, one publisher, to say, ok, you can crawl.” Edge computing platforms can help; Cloudflare, Fastly and other edge computing companies will be included in the task force alongside publishers. “We are confident that Cloudflare is going to be a part of this. They are the ones who can stop the crawling and detect crawlers which do not obey robots.txt,” he said. Katsur believes regulators could also pass a few basic AI rules while still encouraging AI innovation, such as requiring crawlers to declare themselves and imposing fines for violating robots.txt.
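The enforcement idea at the edge amounts to a simple allowlist: only crawlers that have declared themselves and are covered by an agreement get through. The sketch below is generic pseudologic, not Cloudflare’s or Fastly’s actual products or APIs:

```python
# Hypothetical edge rule: declared crawlers covered by an agreement pass,
# anything else that looks like an LLM crawler is blocked.
DECLARED_CRAWLERS = {"ExampleLLMBot/1.0": "agreement-2025-001"}  # illustrative

def allow_request(user_agent: str, looks_like_crawler: bool) -> bool:
    """Return True only for ordinary readers or declared, contracted crawlers."""
    if not looks_like_crawler:
        return True  # ordinary reader traffic passes
    return user_agent in DECLARED_CRAWLERS  # undeclared crawlers are blocked
```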
“The challenge we face is the speed at which this is happening,” said Katsur. “We hear traffic declines of between 30% and 60% [in the U.S.] from publishers, which is unsustainable. And this is only the tip of the iceberg in terms of LLMs and zero-click search… We have to be really aggressive as an industry in tackling it.”
https://digiday.com/?p=583222
