When the Anthropic copyright settlement was announced at $1.5 billion, the AI industry took notice. But Anthropic isn't the only AI company accused of training on pirated books. OpenAI, the world's most valuable AI startup, faces a similar set of lawsuits, brought by some of the most recognizable names in American literature.
Here's what authors need to know about the OpenAI copyright litigation, how it developed, and what a potential settlement could mean.
What Happened
OpenAI trained its GPT family of large language models — including GPT-3, GPT-3.5, and GPT-4 — on enormous datasets scraped from the internet and other sources. Among those sources was Books3, a dataset containing approximately 196,000 full-text books assembled from shadow libraries, primarily Bibliotik, a private torrent tracker for ebooks.
Books3 was compiled without licensing from copyright holders. The books it contains — novels, memoirs, academic texts, and more — were uploaded and distributed without permission. When OpenAI used Books3 as training data, it allegedly reproduced these copyrighted works at massive scale without compensation to the authors.
Books3 was part of a larger dataset called The Pile, an 800-gigabyte text collection assembled by EleutherAI and widely used in AI training. The Pile also includes other copyrighted material beyond books.
Who Is Suing
The legal pressure on OpenAI has come from multiple directions.
The Authors Guild — one of the oldest and most prominent author organizations in the United States — filed a major class action complaint in 2023 in the Southern District of New York. The Guild was joined by individual authors including John Grisham, George R.R. Martin, Jodi Picoult, David Baldacci, and Jonathan Franzen.
Penguin Random House and other major publishers also filed related actions, alleging that their books were included in OpenAI's training data without licensing. Comedian and memoirist Sarah Silverman was among the early individual plaintiffs in a separate California action.
The cases allege systematic copyright infringement: OpenAI reproduced, distributed, and used copyrighted books to train AI systems without authorization, in violation of the Copyright Act.
Which Datasets Were Used
Documenting exactly which books were in OpenAI's training data is one of the key issues in the litigation. What we know from public disclosures and court filings:
- Books3 (~196,000 full-text books from shadow libraries)
- BookCorpus (a dataset of self-published books scraped from Smashwords and similar platforms)
- Common Crawl (internet-wide web crawl that includes copyrighted text)
- WebText (Reddit-linked text dataset)
- Internal proprietary datasets assembled by OpenAI
OpenAI has not publicly released its full training data manifests. Part of the discovery process involves compelling OpenAI to disclose exactly what was in GPT's training data.
Current Case Status
As of April 2026, the OpenAI copyright cases are in the discovery phase. Plaintiffs are seeking detailed information about OpenAI's training data, how it was compiled, and how the resulting models were used commercially.
No settlement has been announced. OpenAI has contested several claims, arguing that using text to train AI is transformative fair use. The legal arguments are similar to those made by Anthropic, which ultimately settled rather than litigate to a conclusion.
Courts have not yet issued definitive rulings on the fair use question as it applies to AI training at scale.
Why a Settlement Is Likely
Several factors suggest OpenAI will eventually settle:
Precedent. The Anthropic settlement, $1.5 billion covering roughly 500,000 books, demonstrated that AI companies can face substantial liability for this type of claim. OpenAI faces a similar legal theory.
Risk aversion. OpenAI is reportedly preparing for an IPO, and public-market investors will scrutinize ongoing litigation. Settling removes that uncertainty from the financial picture.
Discovery pressure. If plaintiffs gain access to OpenAI's full training data manifests, the evidence base for liability could be substantial. Defendants commonly settle before evidence of that kind is fully exposed.
Scale. OpenAI's GPT models are used by hundreds of millions of people globally. The commercial value derived from training on copyrighted books — if courts find infringement — could support a very large damages award.
What Authors Should Do Now
1. Document your works. List every book you've published along with its ISBN and ASIN. You will need this information if and when claims open.
2. Check the Anthropic case. If your books were published before 2023, check whether they qualify for the Anthropic settlement at TrainedOnYou.com/cases/anthropic/check-works. The Anthropic case involves the same kinds of datasets (Books3, LibGen) and is already in the distribution phase.
3. Join the OpenAI waitlist. Sign up at TrainedOnYou.com/cases/openai and we'll notify you the moment an OpenAI settlement is announced or claims open.
4. Talk to your publisher or agent. They may have useful information about whether your works appeared in Books3 or similar datasets, and whether they have already taken any action on your behalf.
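For step 1, a short script can keep the list of works tidy and catch mistyped ISBNs. The following is a minimal Python sketch under stated assumptions: the column names, the sample titles, and the placeholder ASIN are hypothetical, and no claims process has specified a required format. The only fixed rule it relies on is the standard ISBN-13 check-digit calculation.

```python
import csv
import io


def is_valid_isbn13(isbn: str) -> bool:
    """Return True if the string is a 13-digit ISBN with a correct check digit."""
    digits = isbn.replace("-", "").replace(" ", "")
    if len(digits) != 13 or not (digits.isascii() and digits.isdigit()):
        return False
    # ISBN-13 weights alternate 1, 3, 1, 3, ... across all 13 digits;
    # the weighted sum must be divisible by 10.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0


def works_to_csv(works) -> str:
    """Render work records as CSV text, flagging ISBNs that fail the check."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # Column names are an assumption, not a format required by any court or claims site.
    writer.writerow(["title", "isbn13", "asin", "year", "isbn_ok"])
    for w in works:
        ok = "yes" if is_valid_isbn13(w["isbn13"]) else "no"
        writer.writerow([w["title"], w["isbn13"], w["asin"], w["year"], ok])
    return out.getvalue()


# Hypothetical placeholder entries; replace them with your own books.
works = [
    {"title": "Example Novel", "isbn13": "978-0-306-40615-7", "asin": "B000000000", "year": 2019},
    {"title": "Example Memoir", "isbn13": "978-0-306-40615-8", "asin": "", "year": 2021},
]

print(works_to_csv(works))
```

Rows flagged "no" in the last column most likely contain a typo in the ISBN and are worth rechecking against the book's copyright page.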
The Anthropic settlement showed that this legal theory can produce a substantial payout. If OpenAI follows the same path to settlement, the authors who are prepared will be the first to file claims.
TrainedOnYou is an independent litigation finance company. We are not affiliated with OpenAI or any plaintiff in the Authors Guild v. OpenAI litigation. This article is for informational purposes only and does not constitute legal advice.