Apple is facing a proposed class-action lawsuit from two authors who allege the company trained its AI models on a trove of pirated books without permission. The complaint, filed in federal court in Northern California and first reported by Reuters, claims Apple relied on the Books3 dataset to build its OpenELM language models and potentially other systems tied to Apple’s AI strategy.
The plaintiffs, novelists Grady Hendrix and Jennifer Roberson, argue that Apple used their copyrighted works without consent or compensation, pointing to Apple’s technical paper on OpenELM as evidence that Books3 was part of the training mix. Apple has not publicly detailed every dataset used to train its models, and that opacity has become increasingly contentious as AI systems move from research to mass-market features.

The Lawsuit at a Glance
The suit asks the court to certify a class of affected authors, issue an injunction barring further use of allegedly infringing datasets, and award monetary damages. Central to the case is whether ingesting full-text copyrighted books to teach a model language patterns is protected by fair use or constitutes unauthorized exploitation of creative works.
The filing highlights Apple’s OpenELM documentation, which the authors say references Books3 as a training source. The complaint further suggests Apple’s larger “foundation” models may also have been trained on the dataset. Apple has positioned its AI features as privacy-first and, in many cases, on-device, but the provenance of training data remains the unresolved core of many AI copyright fights.
What Books3 Is and Why Authors Object
Books3 is a large collection of full-length books—roughly 196,000 titles—that were scraped from pirated copies circulating online. The dataset has been widely discussed in AI research circles because long-form, edited prose is especially valuable for teaching models to handle structure, style, and sustained reasoning across many pages.
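Much of the fight will turn on whether a particular author’s books actually sit inside the corpus. The string-level scan below gives a flavor of where an expert’s analysis might start; it is a minimal sketch that assumes a hypothetical directory of plain-text files, not the real archive’s layout, which varied across mirrors.

```python
# Minimal sketch: scan a hypothetical plain-text corpus dump for an
# author's name or titles. Directory layout and filenames are illustrative;
# they are not the actual Books3 archive structure.
from pathlib import Path

def find_matches(corpus_dir: str, queries: list[str], probe_bytes: int = 4096):
    """Return files whose name or opening text contains any query string."""
    hits = []
    for path in Path(corpus_dir).rglob("*.txt"):
        with path.open(errors="ignore") as f:
            head = f.read(probe_bytes).lower()
        if any(q.lower() in path.name.lower() or q.lower() in head for q in queries):
            hits.append(path)
    return hits

# "books3_dump/" is a placeholder path for a locally obtained copy.
print(find_matches("books3_dump/", ["Grady Hendrix", "Jennifer Roberson"]))
```

A real audit would match on normalized full text or hashes rather than filenames, since pirated copies are frequently retitled or re-encoded.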
Rights holders argue that assembling and distributing such a corpus is a plain violation of copyright, and that training on it turns illicit copies into commercial advantage. A Danish anti-piracy group, Rights Alliance, previously pursued takedowns of Books3 as part of broader efforts to curb mass distribution of copyrighted works through data hubs.
Researchers counter that high-quality books materially improve language models’ fluency and reduce errors, which is precisely why the dataset became influential. That technical reality, however, collides with legal and ethical obligations to obtain permission or licenses—an issue now playing out in courtrooms.
Apple’s AI Ambitions Meet Copyright Law
Apple has been rolling out “Apple Intelligence” features and publishing research like OpenELM to showcase model efficiency and on-device capabilities. The company has emphasized user privacy and selective data use for inference, but training data transparency is a separate question. Courts may soon force more disclosure: what was used, how it was obtained, and whether filtering excluded copyrighted books.

Expect Apple to raise familiar defenses. Tech firms typically argue that training transforms texts into statistical representations, constitutes fair use, and does not substitute for books in the market. Plaintiffs, by contrast, say wholesale copying of entire works—especially via pirated sources—goes far beyond what fair use allows and creates a competing product that can mimic style and content.
The Broader Legal Landscape
Apple joins a roster of AI companies contending with training-data litigation. Meta and OpenAI have been sued by authors over the books used to train their models, and Stability AI faces parallel claims from visual artists and stock-photo providers over images. The Authors Guild has backed multiple actions, while media organizations have filed separate claims over news articles and archives used to train chatbots. The U.S. Copyright Office has been studying the issue and has signaled that training on copyrighted works raises novel policy questions that may require legislative clarity.
Courts are also grappling with technical nuances: Can plaintiffs show their books are inside a model’s training data? Do models “memorize” and regurgitate protected text, or primarily generalize patterns? Expert testimony and audits of training pipelines—if allowed—could be pivotal in this and similar cases.
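Neither question has a settled methodology, but memorization probes on open models illustrate what expert analysis might involve: feed a model the opening of a passage and measure how much of the true continuation it reproduces verbatim. Below is a minimal sketch using the Hugging Face transformers library, with the openly available gpt2 as a stand-in; the models actually at issue cannot be probed this way from outside.

```python
# Minimal verbatim-memorization probe for an open causal language model.
# "gpt2" is a publicly available stand-in, not a model involved in the case.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def memorization_score(model, tokenizer, passage: str,
                       prefix_tokens: int = 64, gen_tokens: int = 64) -> float:
    """Prompt with a passage's opening; return the fraction of the true
    continuation reproduced token-for-token (1.0 = exact regurgitation)."""
    ids = tokenizer(passage, return_tensors="pt").input_ids[0]
    prefix = ids[:prefix_tokens].unsqueeze(0)
    truth = ids[prefix_tokens:prefix_tokens + gen_tokens]
    with torch.no_grad():
        out = model.generate(prefix, max_new_tokens=gen_tokens,
                             do_sample=False,  # greedy decoding
                             pad_token_id=tokenizer.eos_token_id)
    gen = out[0][prefix_tokens:]  # tokens the model produced
    n = min(len(truth), len(gen))
    return (gen[:n] == truth[:n]).sum().item() / max(len(truth), 1)

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
# "excerpt.txt" is a placeholder for a suspect passage of sufficient length.
print(memorization_score(lm, tok, open("excerpt.txt").read()))
```

High overlap on text that could not be guessed from the prefix is the kind of signal plaintiffs’ experts look for; low overlap, on the other hand, does not prove a work was absent from training.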
What’s at Stake for Developers and Rights Holders
If the plaintiffs win class certification, damages could compound quickly: U.S. copyright law allows statutory damages of up to $150,000 per work for willful infringement, so liability across thousands of titles could run into the billions. Beyond financial exposure, injunctions could force model retraining, content filtering, or the removal of AI features, all of which are costly and technically complex.
For authors and publishers, the case tests whether consent and compensation can be enforced at scale in AI. The market has already started reacting: some labs are shifting to licensed corpora, synthetic data, or public-domain materials; others are striking direct licensing deals with publishers and news organizations to reduce legal risk and improve data provenance.
What to Watch Next
Key milestones will include Apple’s initial response, any motion to dismiss on fair-use grounds, and potential discovery into training datasets and filtering. Settlement is possible if the parties can agree on licensing terms, but the case could also become a bellwether for how U.S. courts weigh books-based training against copyright protections.
Whatever the outcome, one trend is clear: AI leaders are being pushed toward transparency, licensing, and auditability. For a company as brand-conscious as Apple, the pressure to show clean, consented training data may soon be as important as the quality of the models themselves.