Reddit has filed a lawsuit accusing AI startup Perplexity of covertly harvesting Reddit content to train and power its “answer engine,” escalating a high-stakes fight over who controls the web’s most valuable conversations. The complaint, reviewed by multiple outlets, alleges Perplexity and several scraping firms siphoned Reddit posts without permission, even after warnings, to fuel a commercial AI product.

Perplexity disputes the claims, saying it supports open access to public knowledge and provides factual results responsibly. The case lands as platforms, publishers, and AI companies collide over data ownership, licensing, and the boundaries of “public” content in the age of generative AI.

Table of Contents

How Reddit Says It Caught Perplexity Scraping Content
The Legal Stakes Around Public Web Data Use
Why Reddit’s Corpus Is a Bull’s-Eye for AI Training
Perplexity’s Position And Industry Backdrop
What to Watch Next as the Lawsuit Moves Forward

How Reddit Says It Caught Perplexity Scraping Content

Central to Reddit’s case is a sting operation. The company says it planted a specially crafted Reddit post that could be discovered only via Google’s index and was otherwise unreachable on the open web. According to the complaint, Perplexity’s system surfaced the text from that hidden post within hours, suggesting the AI firm or its partners scraped Google’s results pages rather than Reddit directly.

The lawsuit names three scraping-related co-defendants—AWMProxy, Oxylabs, and SerpApi—alleging Perplexity relied on at least one of them to gather Reddit data at scale. Reddit also claims it sent a cease-and-desist notice, after which citations to Reddit content in Perplexity’s answers allegedly increased, not decreased.

The Legal Stakes Around Public Web Data Use

At the heart of the dispute is a thorny question: when content is publicly viewable, who can reuse it, and under what terms? U.S. courts have said publicly accessible data may be scraped in some contexts, as seen in the LinkedIn v. hiQ Labs saga. But platforms increasingly rely on terms of service, IP protections, and anti-bot measures to limit mass harvesting—especially for commercial AI training.

If Reddit can show that Perplexity bypassed technical restrictions, violated contractual terms, or misused intermediary services to evade blocks, claims could extend beyond simple contract breach to theories like trespass to chattels or violations of computer abuse statutes. Conversely, if Perplexity demonstrates it used only lawfully accessible sources and fair methods, the boundaries of acceptable AI data collection could expand.

Why Reddit’s Corpus Is a Bull’s-Eye for AI Training

Reddit’s data is unusually rich for training and benchmarking AI systems: sprawling topic coverage, human-to-human dialogue, and upvote signals that help approximate quality. In its public filings, Reddit has highlighted tens of millions of daily active users, hundreds of thousands of active communities, and a vast archive of posts and comments that map to real-world tasks—from coding help to consumer advice.

Reddit and Perplexity logos with gavel, highlighting alleged data theft lawsuit

That value has translated into deals. Reddit struck a data licensing agreement with Google to enhance search and AI research and later announced a partnership with OpenAI to bring Reddit content into AI products while offering new features to moderators and users. The Perplexity suit underscores Reddit’s strategy: monetize access through licenses and push back on unlicensed extraction.

Perplexity’s Position And Industry Backdrop

Perplexity, which markets a conversational search experience, argues it is championing fair access to public information while delivering accurate AI-generated answers. Its stance echoes broader industry arguments that the open web underpins AI progress and that over-restriction could hinder innovation.

But the legal climate is shifting. Major media organizations have sued AI developers over training on news archives. Rights holders are testing theories around copyright, contracts, and database protections. Regulators in the U.S. and Europe are scrutinizing how data is sourced, labeled, and reused in commercial AI. Against that backdrop, Reddit’s targeted “trap” example could carry weight in discovery, offering a tangible narrative about collection methods.

What to Watch Next as the Lawsuit Moves Forward

Three questions loom:

How exactly did Perplexity obtain the contested Reddit content, and did those methods violate any terms or laws?
Will the court accept Reddit’s evidence as proof of systematic scraping—and if so, what remedies might follow, from injunctions to damages to deletion of data?
Does this case accelerate a broader shift toward paid data partnerships for AI companies and stricter technical protections by platforms and search engines?

No matter the outcome, the lawsuit is a bellwether. If Reddit prevails, expect more publishers to harden defenses and pursue licenses. If Perplexity succeeds, AI firms may feel emboldened to lean on public indexing and aggregator services to build products. Either way, the business of training data is moving from backroom deals to the courtroom—and the rules of engagement for AI are being written in real time.