AI Companies Found Bypassing Web Protections to Gain Unauthorized Access to Publishers' Content

By Thea Felicity

Published: Jun 21 2024, 12:47 PM EDT

This photograph, taken on September 28, 2017, shows a smartphone being operated in front of the logos of GAFA (an acronym for web giants Google, Apple, Facebook and Amazon) in Hédé-Bazouges, western France. (DAMIEN MEYER/AFP via Getty Images)

Multiple artificial intelligence companies are allegedly sidestepping a widely accepted web standard used by publishers to block content scraping for generative AI systems, according to a letter from content licensing startup TollBit shared by Reuters.

The letter, addressed to publishers, does not name specific AI companies or affected publishers but cites a growing dispute between tech firms like AI search startup Perplexity and media outlets over the use of digital content.

Perplexity has recently faced public accusations from Forbes of using its investigative stories in AI-generated summaries without proper attribution or permission.

A recent investigation by Wired found that Perplexity likely evaded restrictions set under the Robots Exclusion Protocol, commonly known as "robots.txt," a standard that tells web crawlers which areas of a site they may access.
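For readers unfamiliar with how robots.txt works, the following is a minimal sketch in Python, using the standard library's urllib.robotparser, of the check a well-behaved crawler is expected to run before fetching a page. The robots.txt rules, the crawler name "ExampleAIBot," and the example.com URLs are hypothetical and are not taken from any publisher or AI company mentioned in this story. The protocol is advisory: nothing in it technically stops a crawler that skips this check, which is the behavior the TollBit letter describes.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve (an assumption for illustration).
# It bars one named AI crawler from the whole site and keeps every other
# crawler out of /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    story = "https://example.com/news/story"
    for agent in ("ExampleAIBot", "OtherBot"):
        verdict = "allowed" if is_allowed(ROBOTS_TXT, agent, story) else "blocked"
        print(f"{agent}: {verdict}")
```

A crawler that honors the standard would skip any URL for which this check returns False. A crawler that never runs the check, or that identifies itself under a different user-agent name, would not be stopped by the file itself.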

What Can TollBit Do Against AI Companies?

TollBit, positioning itself as an intermediary between AI companies hungry for content and publishers considering licensing deals, monitors AI traffic to publishers' sites.

The company also analyzes that data to negotiate fees for different types of content usage, such as premium news and exclusive insights.

By publicly exposing these practices, TollBit can also pressure AI firms to adhere to ethical web-scraping standards, thereby protecting the interests of content creators and upholding the integrity of digital journalism.

Amid these controversies, which include copyright infringement lawsuits filed by some publishers against AI companies, the debate continues over fair use of and compensation for digital content in the era of AI-driven information retrieval.
