
$1.5 billion. That's what Anthropic paid in the Bartz settlement — the largest AI copyright case in history. Millions of pirated books used to train Claude. Not a hypothetical scenario, not a future problem. This happened. And it changes everything.
For years, AI companies operated on a simple principle: train first, ask later. Massive datasets scraped from the internet — books, articles, images, music — without permission, without compensation, often without creators even knowing. That era is over. The courts have spoken, legislators are following. What this means for companies using AI? More than most realize.
The Bartz v. Anthropic settlement in fall 2025 was the breaking point. $1.5 billion — not for a faulty product, but for the way the product was made. Anthropic had used millions of copyrighted books from piracy databases to train its AI. The judge found that downloading and processing entire works in this way goes far beyond anything that could qualify as fair use.
But Bartz was just the beginning. UMG v. Udio ended with a revolutionary agreement: artists must actively consent to training (opt-in) rather than having to actively object (opt-out). Warner Music settled with Suno, agreeing to train new models exclusively on licensed content.
And then there's The New York Times v. OpenAI — the case still underway that may have the most far-reaching consequences. The Times argues that ChatGPT can reproduce entire articles verbatim — which is hard to sell as "transformative use."
The AI industry's defense was elegant and simple: training on copyrighted material is transformative use. The model "learns" from data like a human learns from a library — it doesn't reproduce, it abstracts. An influential 2019 paper compared LLM training to reading books and then writing your own texts.
The courts saw it differently. Three arguments proved decisive:

- Scale: entire works were downloaded and processed, often from pirated databases, far beyond selective excerpting.
- Reproduction: the models can output protected material verbatim, which undercuts the claim that they merely abstract.
- Market harm: the AI output competes directly with the works it was trained on.

The last argument carried the most weight. Fair use requires that the use not impair the market for the original. But when an AI system offers precisely the service the original work was created for, namely information and entertainment, that's market harm, not innovation.
Behind the billion-dollar settlements are real people. The illustrator whose style is reproduced by Midjourney — now by clients who would have hired her. The non-fiction author whose knowledge appears in ChatGPT responses — without attribution, without compensation. The musician whose voice is imitated by Suno — without ever being asked.
The 2023 Writers Guild of America strike was a turning point. Screenwriters struck for 148 days — and a core issue was AI. The result: studios cannot use AI as a basis for screenplays, and AI-generated material cannot establish authorship. An important precedent that extends far beyond Hollywood.
Sarah Silverman, Michael Chabon, the Authors Guild with its 10,000 members — the list of plaintiffs keeps growing. And it's not just about money. It's about a fundamental question: Who owns creative work in a world where machines can replicate it?
The US Copyright Office guidance from May 2025 was a milestone — not because of a radical new position, but because of its clarity. The Office stated: when AI training competes with or diminishes the licensing opportunities of a work, the analysis weighs against a finding of fair use.
What does this mean in practice? If a publisher offers licenses for using its texts — say for summarization services or research tools — and an AI company trains on those same texts without a license for exactly such purposes, that's not fair use. The existence of a licensing market is the benchmark.
The Office also recommended that Congress introduce a mandatory transparency requirement: AI companies must disclose which copyrighted works they use for training. This isn't legislation yet — but the direction is clear.
The agreement between UMG and Udio established a principle that could transform the entire industry: opt-in instead of opt-out.
Until now, AI training worked on the opt-out principle: everything on the internet can be used — unless the creator actively objects. The problem: most creators didn't even know their works were being used. And even if they did — objecting was technically complex and often ignored.
Opt-in reverses the logic: nothing can be used unless the creator expressly consents. This matches how every other industry works — a publisher asks before printing, a film studio licenses before using. That the AI industry claimed an exception for years will, in retrospect, be seen as one of the biggest blind spots in technology history.
For artists, opt-in means: control. They can decide whether and under what terms their works are used for training. For AI companies, it means: higher costs, but also a sustainable business model not built on systematic copyright infringement.
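The difference between the two regimes can be sketched as a training-pipeline intake filter. This is a hypothetical illustration; the `Work` record and its fields are invented for the sketch, not taken from any real system.

```python
from dataclasses import dataclass

@dataclass
class Work:
    title: str
    creator: str
    consent_granted: bool   # creator explicitly licensed the work for training
    objection_filed: bool   # creator actively opted out

def usable_under_opt_out(work: Work) -> bool:
    # Old regime: everything is fair game unless the creator objects.
    return not work.objection_filed

def usable_under_opt_in(work: Work) -> bool:
    # New regime: nothing is usable without explicit consent.
    return work.consent_granted

corpus = [
    Work("Novel A", "Author 1", consent_granted=False, objection_filed=False),
    Work("Song B", "Artist 2", consent_granted=True, objection_filed=False),
]

opt_out_set = [w.title for w in corpus if usable_under_opt_out(w)]
opt_in_set = [w.title for w in corpus if usable_under_opt_in(w)]
# Under opt-out, both works pass the filter, because neither creator objected
# (and most never knew they could). Under opt-in, only the licensed work does.
```

The sketch makes the asymmetry visible: opt-out puts the burden of action on the creator, opt-in puts it on the company doing the training.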
If you use AI tools in your organization — and most do — this affects you directly. Not as a theoretical consideration, but as a legal and business risk.
The question isn't whether your AI vendor trained cleanly. The question is whether you can prove it. In an increasingly regulated world, the provenance of training data is becoming a compliance issue — similar to supply chain transparency for raw materials.
Specific questions you should be asking your AI vendor:

- What data was the model trained on, and can you document its provenance?
- Which works were used under license, and which under a claimed fair use?
- Do you offer contractual indemnification against copyright claims arising from your training data?
- How do you handle creators' opt-out requests — or do you train on opted-in, licensed content only?
Companies that don't ask these questions are taking a risk. Not just an ethical one — a financial one. The Bartz ruling showed that the costs can run into the billions.
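This kind of due diligence amounts to a checklist run against a vendor's training-data manifest. The manifest format and field names below are assumptions for illustration; no standard manifest format exists yet.

```python
# Hypothetical training-data manifest, as a vendor might disclose it.
manifest = [
    {"source": "Licensed news archive", "license": "commercial", "provenance": "publisher deal"},
    {"source": "Web scrape 2023", "license": None, "provenance": "unknown"},
]

def audit(entries):
    """Flag any entry with no documented license or unknown provenance."""
    return [
        e["source"]
        for e in entries
        if e["license"] is None or e["provenance"] == "unknown"
    ]

flagged = audit(manifest)
# The unlicensed web scrape gets flagged for follow-up with the vendor.
```

The point is not the code but the posture: provenance becomes a checkable property of the supply chain, the same way raw-material sourcing is audited in other industries.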
The trajectory of the last two years is clear: the law is catching up with technology. What began as the Wild West phase of AI development — scrape everything, train everything, monetize everything — is giving way to a regulated market with clear rules.
This isn't bad news. On the contrary: licensed, transparent AI is the foundation for sustainable trust. Companies like Anthropic, which shifted to licensed data after expensive settlements, demonstrate that it works, and that it creates a competitive advantage. Customers want to know that the tools they use are built on a clean data foundation.
The future doesn't belong to companies that scraped the most data. It belongs to those that earned the trust of creators, users, and regulators. Copyright isn't an innovation barrier — it's the foundation for fair innovation.
Want to make sure your AI solution is built on properly licensed data? Talk to us about transparent AI text analysis.

