A new lawsuit filed against Meta is intensifying the legal battle over how AI companies train their models, with publishers accusing the company of deliberately using pirated books and articles to build its Llama AI system instead of licensing them.
The lawsuit was brought by publishers Hachette, Macmillan, McGraw-Hill, Elsevier, and Cengage alongside author Scott Turow, who allege that Meta trained Llama using more than 267 terabytes of copyrighted materials sourced from shadow libraries, including the piracy database LibGen. According to the complaint, the amount of material involved exceeds the entire print collection of the Library of Congress.
Plaintiffs described the alleged conduct as “one of the most massive infringements of copyrighted materials in history.”
What makes this case particularly significant is not just the scale of the alleged infringement, but the paper trail cited throughout the lawsuit. According to the filing, Meta internally considered spending as much as $200 million to license books and research material for AI training before ultimately deciding against it. Plaintiffs claim the discussion was eventually escalated to Meta CEO Mark Zuckerberg, who allegedly rejected the licensing approach.
One internal message from a Meta employee, cited in the complaint, stated: “If we license one single book, we won’t be able to lean into the fair use strategy.”
Another internal Meta memo referenced in the lawsuit allegedly discussed the use of datasets sourced from LibGen, a shadow library widely associated with pirated books and academic materials. According to the filing, the memo stated: “We would not disclose use of LibGen datasets used to train.”
Meta has denied wrongdoing. A company spokesperson said: “AI is powering transformative innovations for people and businesses, and courts have rightly found that training AI models on copyrighted material can qualify as fair use. We will fight this lawsuit aggressively.”
The case arrives as courts continue to wrestle with whether AI companies can legally train models on copyrighted material without permission. In 2025, authors including Sarah Silverman and Junot Díaz sued Meta over similar claims. A federal judge later ruled in the company’s favor, finding that parts of the training process qualified as fair use.
This new lawsuit attempts to distinguish itself from earlier cases by focusing less on the abstract legal debate around fair use and more on alleged intent. Plaintiffs argue that Meta knowingly used pirated datasets despite internally exploring lawful licensing arrangements.
The legal pressure is also growing across the broader AI industry. In 2025, Anthropic faced lawsuits from authors and music publishers over claims that copyrighted works were used without authorization to train its AI systems. The increasing number of cases has raised larger questions about whether current copyright law can keep pace with the data demands of modern AI models.
The publishers further argue that Llama is capable of generating “verbatim and near-verbatim” reproductions of copyrighted books, potentially turning AI systems into direct substitutes for original works rather than transformative tools.
For now, the case adds to what is becoming one of the defining legal questions of the AI era: whether the race to build increasingly powerful models can continue relying on copyrighted material without compensation to the people who created it.