Efforts to Halt OpenAI’s Scraping Bots Are Gradually Losing Momentum

In Short:

The recent surge in deals between AI companies and news publishers is reshaping the landscape. OpenAI’s GPTBot is being blocked by fewer major news sites, thanks in part to partnerships that permit data use. While many sites initially blocked AI crawlers, recent agreements have lowered these barriers, suggesting a shift in strategy as publishers adapt to the growing presence of AI.

The recent surge in collaborations between AI companies and news publishers raises questions about its long-term ramifications. One notable development has already emerged for OpenAI, however: its web crawlers are no longer facing the same level of obstruction from major news outlets that they once did.

Context of Data Protection

The rise of generative AI has created enormous demand for data, leading many news websites to strengthen their data protection measures. Publishers have actively sought to block AI crawlers, aiming to prevent their content from being used as training data without explicit consent. For instance, after Apple debuted a new AI agent this past summer, numerous leading news outlets swiftly opted out of Apple’s web scraping by employing the Robots Exclusion Protocol, or robots.txt, a file that allows webmasters to control bot access.
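
To illustrate, a publisher that wants to opt out of both companies’ crawlers could add rules like the following to the robots.txt file at its site root. The user-agent tokens GPTBot and Applebot-Extended are the names OpenAI and Apple document for their crawlers; the rest is a minimal sketch, not any specific outlet’s configuration:

    # Block OpenAI's training crawler from the entire site
    User-agent: GPTBot
    Disallow: /

    # Block Apple's AI-training crawler from the entire site
    User-agent: Applebot-Extended
    Disallow: /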

Current Blocking Trends

OpenAI’s GPTBot has the widest name recognition and is also blocked more frequently than competitors such as Google AI. The share of high-ranking media websites using robots.txt to block GPTBot rose rapidly after its launch in August 2023, peaking at just over one-third of the outlets analyzed; it has since fallen to roughly one-quarter. Among a select group of the most prominent news outlets, the blocking rate still remains above 50 percent, though it has declined from a peak of nearly 90 percent earlier this year.

Impact of Licensing Deals

Blocking began to fall, however, after Dotdash Meredith announced a licensing agreement with OpenAI in May. The decline continued at the end of May, when Vox disclosed its own agreement, and again in August, when Condé Nast, parent company of WIRED, struck a deal. The recent evidence suggests that the upward trend in blocking has ended, at least for now.

Rationale Behind Unblocking

These declines in blocking make sense. When organizations forge partnerships and grant permission for their data to be used, they no longer have an incentive to block the crawlers, so they update their robots.txt files to permit crawling. Some media outlets, such as The Atlantic, unblocked OpenAI’s crawlers on the same day they announced a new partnership; others took longer: Vox, for example, unblocked GPTBot about a month after announcing its agreement.
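
In practice, the unblocking itself can be a one-line change. Under the Robots Exclusion Protocol, an empty Disallow value means nothing is off-limits, so a publisher can flip from blocking to allowing by editing a single rule. A minimal sketch of the before-and-after states:

    # Before a licensing deal: GPTBot barred from the whole site
    User-agent: GPTBot
    Disallow: /

    # After a licensing deal: an empty Disallow permits GPTBot everywhere
    User-agent: GPTBot
    Disallow: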

Legality and Compliance of Robots.txt

Although robots.txt is not legally binding, it has long served as the de facto standard for governing web crawler behavior across the internet. The expectation that crawlers will honor the file has prevailed throughout the web’s history. A WIRED investigation earlier this summer revealed that the AI startup Perplexity was likely ignoring robots.txt commands, prompting Amazon’s cloud division to investigate potential violations. Flouting robots.txt is a bad look, which likely explains why most leading AI companies, including OpenAI, explicitly state that they honor its directives when deciding which sites to crawl. Jon Gillham, CEO of Originality AI, suggests that this compliance underscores OpenAI’s urgency in securing agreements. “It’s clear that OpenAI views being blocked as a threat to their future ambitions,” Gillham commented.
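
To make the compliance mechanics concrete, here is a minimal Python sketch of how a well-behaved crawler can consult a site’s robots.txt before fetching a page, using the standard library’s urllib.robotparser. The domain and URL are placeholders, and the snippet is an illustrative assumption about the pattern, not OpenAI’s actual crawler code:

    # Minimal sketch: check robots.txt before crawling, as a compliant bot would.
    # example.com is a placeholder domain; this is not OpenAI's real crawler code.
    from urllib import robotparser

    USER_AGENT = "GPTBot"  # OpenAI's documented user-agent token

    parser = robotparser.RobotFileParser()
    parser.set_url("https://example.com/robots.txt")
    parser.read()  # fetch and parse the site's robots.txt

    url = "https://example.com/articles/some-story"
    if parser.can_fetch(USER_AGENT, url):
        print(f"{USER_AGENT} may crawl {url}")
    else:
        print(f"{USER_AGENT} is blocked from {url}")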
