Nonprofit Common Crawl, best known for its open archives of the web that are widely used to train AI models, is under fire over claims that its troves have let AI companies ingest paywalled journalism at scale. Its data has in turn acted as a kind of workaround for model creators looking to add premium news articles (from The New York Times, The Washington Post, and Wired, among others) and quickly roll out updates. Common Crawl denies the allegations and insists it does not circumvent access controls.
Allegations Of A Backdoor To Paywalled Journalism
The Atlantic’s report focuses on how AI companies have long depended on Common Crawl’s petabyte-scale CC-MAIN snapshots of the web, enormous archives that serve as the backbone of many key training datasets. In practice, the story implies, these snapshots scoop up considerable portions of articles normally behind a subscription paywall, in effect providing model builders with a back door to premium reporting without actually licensing it.
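For a sense of how model builders consume these snapshots: each CC-MAIN crawl publishes a gzipped listing of its WARC file paths, which data pipelines download and process in bulk. A minimal sketch, assuming the requests library is installed; the crawl ID is an illustrative example, and current IDs are listed at commoncrawl.org.

```python
# Minimal sketch: fetch the WARC file listing for one CC-MAIN crawl.
# The crawl ID is an example; swap in a current one from commoncrawl.org.
import gzip
import requests

CRAWL = "CC-MAIN-2024-10"  # example crawl ID
listing_url = f"https://data.commoncrawl.org/crawl-data/{CRAWL}/warc.paths.gz"

resp = requests.get(listing_url, timeout=60)
resp.raise_for_status()
paths = gzip.decompress(resp.content).decode().splitlines()
print(f"{CRAWL}: {len(paths)} WARC files, first: {paths[0]}")
```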
The report also draws attention to little-publicized relationships between Common Crawl and major AI companies, including evidence of past support or partnerships (OpenAI, Anthropic) and public references to NVIDIA as a collaborator. These links don’t make the relationships improper per se, but they do heighten scrutiny of what ends up in the crawl, and how it gets used down the line.
Common Crawl Issues An Affirmative Denial
In a rebuttal, Common Crawl wrote that its crawler, known as CCBot, retrieves only publicly available pages, does not log in to sites, and does not attempt to circumvent paywalls or other technical measures protecting proprietary content. The organization also denied that it misled publishers, dismissing the claims as false and misleading.
Richard Skrenta, Common Crawl’s executive director, has long espoused a view of the web as a foundational technical substrate, one that should be readable by machines as well as humans, with richer capabilities built on top of that open access.
Critics argue that “publicly accessible” is not the same thing as “licensed for ingestion and redistribution,” and that paywalls implemented with client-side scripts or metered access, which automated crawlers can trivially bypass, fall well short of consent.
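To see why critics consider client-side paywalls trivial for machines, consider a hedged sketch: when a server ships the full article text in its initial HTML and relies on JavaScript to draw the subscription overlay, a crawler that never executes scripts sees the whole article. The URL is a placeholder, and requests plus BeautifulSoup are assumed installed.

```python
# Sketch of a script-free fetch. Any JavaScript that would hide the
# article behind a "subscribe" overlay never runs, so text shipped in
# the initial HTML remains readable. The URL is a placeholder.
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/premium-article", timeout=30).text
soup = BeautifulSoup(html, "html.parser")

article = soup.find("article")  # assumes the page wraps its body in <article>
if article:
    print(article.get_text(" ", strip=True)[:500])
```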
Takedowns Meet An Immutable Archive Of Web Data
The Atlantic reported that a number of publishers asked for their material to be removed and were given periodic updates saying the process was 50 percent, then 70 percent, then 80 percent complete, yet researchers found the supposedly removed material still present in the archives. Common Crawl has also stated that its file format is designed to be immutable, which makes removals after the fact more difficult. And as the investigation alleges, the organization’s public search tool may undercount what is actually stored in the archives, further complicating publishers’ ability to gauge their exposure.
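One way a publisher could gauge exposure without trusting the search tool is to scan a crawl segment directly. A minimal sketch, assuming the warcio library and a locally downloaded WARC file; the file name and domain are placeholders.

```python
# Sketch: count how many response records in a WARC file point at a
# given domain. Assumes `pip install warcio` and a downloaded segment.
from warcio.archiveiterator import ArchiveIterator

def count_captures(warc_path: str, domain: str) -> int:
    hits = 0
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            uri = record.rec_headers.get_header("WARC-Target-URI") or ""
            if domain in uri:
                hits += 1
    return hits

print(count_captures("CC-MAIN-example.warc.gz", "example.com"))
```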
Complicating the situation, some publishers have blocked CCBot via robots.txt to prevent future collection, as shown below. The move does not retroactively scrub content already captured, and it raises hard questions about remediation when these web-scale snapshots are designed to be append-only and redundantly mirrored across public clouds and research institutions.
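Common Crawl documents CCBot as its crawler’s user-agent, so the blocking stanza is short; again, it only affects future crawls:

```
User-agent: CCBot
Disallow: /
```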
Why AI Companies Bet On Common Crawl For Training
Common Crawl sits at the center of AI training today. Google’s C4 corpus, used to pretrain the T5 language model, was derived from a Common Crawl snapshot. The documentation for OpenAI’s models has repeatedly cited Common Crawl as a significant source of training text. Similarly, the image-text LAION datasets that power image generators were created by filtering web-scale crawls. Put another way: if you have engaged with a big model, Common Crawl probably helped that model learn language and context.
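For a concrete sense of that reach, C4 is openly redistributed and commonly sampled like this; a sketch assuming the Hugging Face datasets library, with streaming used to avoid downloading the multi-terabyte corpus.

```python
# Sketch: stream a few records from C4, the Common Crawl-derived corpus
# used to pretrain T5. Assumes `pip install datasets`.
from datasets import load_dataset

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)
for i, example in enumerate(c4):
    print(example["url"], example["text"][:80])
    if i == 2:
        break
```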

That centrality is exactly why the paywall question matters. If high-quality editorial content, typically produced at no small expense, has been hoovered up without permission, model makers may be profiting from work that publishers never offered online for free. It also muddies provenance as datasets and checkpoints propagate downstream through the AI ecosystem.
The Legal And Ethical Stakes Around AI Data Scraping
The dispute comes as copyright battles over AI training intensify.
The New York Times is suing OpenAI and Microsoft over the unauthorized use of its articles, alleging that the resulting models can reproduce versions of the Times’ content. Other publishers have pursued similar lawsuits, and the U.S. Copyright Office has been examining how fair use applies to machine-learning training.
Paywalls add a twist: even if a page is technically retrievable under some conditions, terms of service and access controls can prohibit copying and redistribution. For AI companies, licensing has increasingly looked like the safer route. OpenAI has negotiated deals with outlets such as the Associated Press and large European publishers, a signal that access to premium journalism, at least for some training and product uses, is becoming pay-to-play.
Follow The Money And The Metadata In AI Training
Two practical problems will probably determine the next phase. First, provenance: model creators need clearer lineage for what goes into training runs, including whether any part of a dataset contains paywalled or otherwise restricted content. Second, enforceability: robots.txt and newer AI-specific opt-outs help going forward, but they don’t fix legacy copies baked into “immutable” archives and mirrored datasets.
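On the provenance side, Common Crawl already exposes a public CDX index that either side could use to check whether a domain appears in a given crawl. A minimal sketch, assuming the requests library; the crawl ID and domain are examples.

```python
# Sketch: query Common Crawl's public CDX index for captures of a domain.
# The server returns one JSON object per line (and a 404 if the query
# matched no captures).
import json
import requests

def cc_index_lookup(domain: str, crawl: str = "CC-MAIN-2024-10", limit: int = 5):
    url = f"https://index.commoncrawl.org/{crawl}-index"
    params = {"url": f"{domain}/*", "output": "json", "limit": limit}
    resp = requests.get(url, params=params, timeout=30)
    resp.raise_for_status()
    return [json.loads(line) for line in resp.text.splitlines()]

for rec in cc_index_lookup("example.com"):
    print(rec["timestamp"], rec["url"], rec.get("status"))
```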
Publishers’ options are narrowing too. They can negotiate licensing, harden paywalls, and deploy bot defenses, but they will still need verifiable assurances from data providers and model builders about what gets ingested, and about how removals are handled when they are requested.
What Comes Next For Common Crawl And AI Training
Anticipate tighter audits of web-scale corpora, more explicit contractual expectations between AI companies and newsrooms, and pressure on data brokers and nonprofits to build more credible takedown workflows that do not depend solely on brittle public search tools. Whether Common Crawl’s practices are ultimately lawful may turn on the details: what was collected, how it was collected, and how those materials were used for training. The broader AI industry, already trending toward licensing, will be watching closely.