Apple is being sued in a proposed class action by two authors who say the company trained its artificial intelligence models on a trove of pirated books to which the writers hold the rights. The complaint, filed in federal court in the Northern District of California and first reported by Reuters, argues that Apple used the Books3 dataset to create its OpenELM language models, and potentially other systems tied to Apple’s AI strategy.
The plaintiffs, the authors Grady Hendrix and Jennifer Roberson, contend that Apple used their copyrighted works without permission or payment, citing, among other things, Apple’s technical paper on OpenELM as evidence that Books3 was part of the models’ training data. Apple has not publicly disclosed every dataset it used to train its models, an issue that grows more fraught as AI systems move from research into mass-market features.
The Lawsuit in Brief
The suit asks the court to certify a class of affected authors, to enjoin any further use of the allegedly infringing datasets, and to award monetary damages. At issue is whether ingesting full-text copyrighted books to teach a model language patterns is a fair use of the content or an unauthorized exploitation of creative works.
The filing calls attention to Apple’s OpenELM documentation, which the authors claim points to Books3 as a training source. The complaint also says that Apple’s larger “foundation” models may have been trained on the dataset as well. Apple has framed its AI capabilities as privacy-first and, in many cases, on-device, but the provenance of training data has long been a flashpoint in AI copyright disputes.
What Books3 Is, and Why Authors Are Pushing Back
Books3 is a large collection of full-length books, approximately 196,000 titles in all, scraped from pirated copies circulating online. The dataset has been prized in AI research circles because long-form, edited prose is particularly useful for teaching models about structure, style and sustained reasoning across multi-page narratives.
Rights holders argue that compiling and circulating such a corpus is a straightforward copyright violation, and that training on it converts infringing copies into commercial advantage. The Danish anti-piracy group Rights Alliance previously requested the takedown of Books3 as part of broader efforts to combat mass distribution of copyrighted works through data hubs.
Researchers counter that high-quality works directly improve the fluency and accuracy of language models, which is partly why the dataset was so influential. But that technical reality runs up against legal and ethical obligations to obtain permission or licenses, a tension now playing out in courtrooms.
Apple’s AI Dreams Get Stuck in the Clouds
Apple has been shipping “Apple Intelligence” features and releasing research such as OpenELM to demonstrate model efficiency and on-device performance. The company has emphasized user privacy and the selective use of data at inference time; training data transparency is a different matter. More disclosure may soon come through the courts: what was used, how it was obtained, and whether filtering excluded copyrighted books.
Apple is likely to lean on familiar defenses. Tech companies such as Google have argued that, because training turns texts into statistical representations, “none of the books are returned.” The use is also transformative and therefore fair, they contend, and does not usurp the market for the books already in commerce. Plaintiffs counter that wholesale copying of entire works, particularly those obtained from pirated sources, exceeds the bounds of fair use and yields a competing product capable of mirroring style and content.
The Broader Legal Landscape
Apple is just one of a handful of AI companies facing book-related litigation. Meta, OpenAI and Stability AI have also been sued by authors and publishers over training data. The Authors Guild has supported multiple actions, and media companies have filed separate claims over news and archives that were used to train chatbots. The U.S. Copyright Office has been studying the issue and has indicated that training with copyrighted materials raises new policy considerations that may call for legislative guidance.
Courts are also wrestling with technicalities: Can plaintiffs demonstrate their books are indeed within a model’s training data? Do models “memorize” and then regurgitate the protected text, or instead largely generalize patterns? Expert testimony and access to audits of training pipelines — if permitted — could be crucial in this and similar cases.
What’s at Stake for Developers and Rights Holders
Damages could ratchet up quickly across thousands of titles if the plaintiffs win class certification and willful infringement is found. Beyond multiplying financial exposure, injunctions could compel retraining of models, filtering of content, or removal of AI features, none of which is simple or free of technical and financial cost.
For authors and publishers, the case is a test of whether consent and compensation can be enforced at scale in AI. The market has already begun to respond: the Zenodo repository is changing its deposit policies, some labs are switching to licensed corpora, synthetic data or public-domain materials, and others are inking direct deals with publishers and news organizations to mitigate legal risk and improve data provenance.
What to Watch Next
Key milestones to watch include Apple’s initial response, any motion to dismiss on fair-use grounds, and potential discovery into training datasets and filtering practices. The parties might settle if they can agree on licensing terms, but the case could also serve as a bellwether for how U.S. courts weigh books-based training against copyright protections.
Whichever way it goes, one thing is clear: AI leaders are being pushed toward transparency, licensing and auditability. For a company as brand-conscious as Apple, demonstrating clean, consented training data could soon become as vital as building quality models.