AI Company Violates Web Standards to Scrape Publisher's Site Content
Danielle Coffey, CEO of the News Media Alliance (Photo: X @EpicPlain_)
Several artificial intelligence (AI) companies are violating common web standards that publishers use to block the retrieval of their content for use in generative AI systems, content licensing startup TollBit has revealed.
TollBit's letter to publishers on Friday did not name the affected AI companies or publishers. It comes amid a public dispute between AI search startup Perplexity and media outlet Forbes over web standards, and a broader debate between technology companies and media companies over the value of content in the era of generative AI.
The business media publisher publicly accused Perplexity of plagiarizing its investigative stories in AI-generated summaries without citing Forbes or asking its permission.
An investigation published by Wired this week found Perplexity likely bypassed efforts to block its web crawlers through the Robots Exclusion Protocol, or "robots.txt," a widely accepted standard that determines what parts of a site can be crawled.
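As background on how the protocol works: robots.txt is a plain-text file at a site's root listing which crawlers may access which paths. The sketch below uses Python's standard `urllib.robotparser` to show how a compliant crawler is expected to interpret such a file; the crawler name `ExampleAIBot` is an illustrative placeholder, not one of the crawlers named in the article.

```python
# Sketch: how a compliant crawler checks robots.txt before fetching a page.
# "ExampleAIBot" is a hypothetical crawler name used for illustration.
import urllib.robotparser

ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A compliant crawler named "ExampleAIBot" must skip the entire site...
print(parser.can_fetch("ExampleAIBot", "https://example.com/article"))  # False
# ...while other user agents remain free to crawl it.
print(parser.can_fetch("*", "https://example.com/article"))  # True
```

Nothing in the protocol technically prevents a crawler from fetching disallowed pages anyway; compliance is voluntary, which is the crux of the dispute described here.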
The News Media Alliance, a trade group representing more than 2,200 US-based publishers, expressed concerns about the impact ignoring the "do not crawl" signal would have on its members. "Without the ability to opt out of mass data harvesting, we cannot monetize our valuable content and pay journalists. This could seriously damage our industry," said Danielle Coffey, the group's president.
TollBit, an early-stage startup, positions itself as an intermediary between AI companies that need content and publishers willing to make licensing deals with them. The company tracks AI traffic to publisher sites and uses analytics to help both parties set fees for the use of different types of content.
According to TollBit's letter, Perplexity is not the only AI company that appears to be ignoring robots.txt. TollBit said its analytics showed "many" AI agents bypassing the protocol.
The robots.txt protocol was created in the mid-1990s as a way to avoid overloading websites with web crawlers. While there is no clear enforcement mechanism, there has historically been widespread compliance on the web, and some groups - including the News Media Alliance - say there may still be legal recourse for publishers.
More recently, robots.txt has become a key tool publishers use to block tech companies from attempting to take their content for free for use in generative AI systems that can mimic human creativity and instantly summarize articles.
Several publishers, including the New York Times, have sued AI companies for copyright infringement related to such use. Others sign licensing agreements with AI companies willing to pay for the content, although the parties often disagree about the value of the material. Many AI developers argue that they are not breaking the law in accessing content for free.
Thomson Reuters, owner of Reuters News, is one that has struck a deal to license news content for use by AI models.
Publishers have been raising concerns about news summaries since Google launched a product last year that uses AI to generate summaries in response to some search queries.
If publishers want to prevent their content from being used by Google's AI to help produce those summaries, they will have to use the same tools that will also prevent their content from appearing in Google search results, making it virtually invisible on the web.