Amazon investigates perplexing claims of scraping abuse

In Short:

Amazon’s cloud division is investigating Perplexity AI for potentially violating Amazon Web Services rules by scraping websites that prohibited it. Perplexity, backed by influential entities and valued at $3 billion, appears to ignore the Robots Exclusion Protocol, a common web standard. The company’s practices have raised concerns about plagiarism and scraping abuse, leading to investigations by organizations like Forbes and WIRED. Perplexity’s CEO has denied wrongdoing, claiming the IP address linked to scraping activities belongs to a third-party company.

Amazon Cloud Division Investigates Perplexity AI

Amazon’s cloud division is currently investigating Perplexity AI over concerns that the AI search startup may be violating Amazon Web Services rules by scraping websites that have explicitly tried to prevent such activity, reports WIRED.

AWS Confirmation and Background

An AWS spokesperson, speaking to WIRED on condition of anonymity, confirmed that the company is conducting an investigation into Perplexity. The startup, which has received funding from the Jeff Bezos family fund and Nvidia, and was recently valued at $3 billion, is alleged to rely on content from scraped websites that had disallowed access through the Robots Exclusion Protocol, a widely recognized web standard. While the Robots Exclusion Protocol is not legally binding, adhering to terms of service is generally expected.

Importance of Robots Exclusion Protocol

The Robots Exclusion Protocol is a longstanding web standard in which a plaintext file, such as wired.com/robots.txt, is placed on a domain to specify which pages automated bots and crawlers should not access. Although companies that run scrapers can choose to ignore the protocol, most respect it. The Amazon spokesperson emphasized that AWS customers are required to follow the robots.txt standard when crawling websites.
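To make the mechanism concrete, here is a minimal sketch of how a well-behaved crawler might consult robots.txt rules before fetching a page, using Python's standard urllib.robotparser module. The robots.txt content, the bot names (ExampleBot, OtherBot), and the example.com URLs are all hypothetical and chosen only for illustration; they are not taken from any site or crawler named in this article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: the bot name and paths are illustrative only.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

# ExampleBot is disallowed everywhere on this hypothetical site.
print(is_allowed(parser, "ExampleBot", "https://example.com/story"))

# Other bots fall back to the wildcard rules: only /private/ is off limits.
print(is_allowed(parser, "OtherBot", "https://example.com/story"))
print(is_allowed(parser, "OtherBot", "https://example.com/private/page"))
```

The check is purely advisory: nothing in the protocol stops a crawler from skipping it, which is why compliance depends on the operator choosing to run a check like this one before each request.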

Compliance and Alleged Plagiarism

Under AWS terms of service, customers are prohibited from engaging in any illegal activities, and they are responsible for adhering to the terms and all applicable laws. Scrutiny of Perplexity’s practices intensified following a Forbes report on June 11, which accused the startup of plagiarizing at least one article. WIRED’s investigations validated this claim and uncovered more instances of scraping misuse and plagiarism linked to Perplexity’s AI search chatbot.

Uncovering Violations

WIRED discovered that engineers at Condé Nast, WIRED’s parent company, had used a robots.txt file to block Perplexity’s crawler across all of the publisher’s websites. Nevertheless, the company was observed using a server with an undisclosed IP address (44.221.181.252) that made numerous visits to Condé Nast properties over the past few months, apparently to scrape them.

Response from Perplexity CEO

Following WIRED’s investigation, Perplexity CEO Aravind Srinivas initially claimed that WIRED’s questions reflected a misunderstanding of the company’s operations. However, in a subsequent statement to Fast Company, Srinivas said that the IP address detected scraping Condé Nast websites was operated by a third-party company offering web crawling and indexing services. He declined to disclose the company’s name, citing a nondisclosure agreement, and was noncommittal when asked whether Perplexity would stop crawling WIRED’s content.
