It is 2026, and the legal landscape for artificial intelligence has shifted from a wild west of unchecked scraping to a complex web of lawsuits, licenses, and regulatory mandates. If you are building, deploying, or using generative AI (machine learning models that create new text, images, code, or audio based on patterns in training data), you cannot ignore the question of ownership. Who owns the data used to train these systems? Who owns the output they produce? And what happens when a model spits out a paragraph of copyrighted news or a snippet of proprietary code?
The short answer is: it depends. But the longer answer involves navigating the U.S. Copyright Office’s latest guidance, understanding the four factors of fair use, securing proper licenses, and maintaining rigorous data provenance. This guide breaks down exactly how these rules apply to you right now.
The Human Authorship Requirement
Before we get into the weeds of training data, let’s clear up a common misconception about the output itself. In the United States, copyright law protects "original works of authorship" fixed in a tangible medium. The critical word here is authorship.
As reaffirmed by the U.S. Copyright Office (USCO) in its March 2023 registration guidance and again in its January 2025 report on copyrightability, only humans can be authors. This aligns with Supreme Court precedent like Feist Publications v. Rural Telephone Service (1991), which established that originality requires a "modicum of creativity." More recently, the district court in Thaler v. Perlmutter (2023) held that a work generated autonomously by an AI system, with no human creative input, cannot be registered, because human authorship is a bedrock requirement of copyright. The D.C. Circuit affirmed that ruling in 2025.
This means that if you prompt a tool like Midjourney or GPT-4 and get a result, that raw output is likely not protected by copyright and is effectively in the public domain. You cannot copyright the AI’s work alone. However, if you take that output and significantly edit, curate, or arrange it, adding your own creative choices, you may claim copyright over your specific human contributions. The AI is a tool, not a co-author.
Fair Use: The Four Factors in 2026
The biggest debate in AI copyright revolves around training. Is it legal to scrape millions of books, articles, and images to train a model without permission? The USCO’s May 2025 report on generative AI training concludes that copying copyrighted works into training datasets implicates the reproduction right, which makes it prima facie infringement absent a license or a defense. In other words, the copying alone is enough to support an infringement claim. The burden then falls on the AI developer to show that the use is protected by fair use, the doctrine that allows limited use of copyrighted material without permission for purposes such as criticism, comment, news reporting, teaching, scholarship, or research.
Fair use is determined by weighing four statutory factors:
- Purpose and Character: Is the use commercial or nonprofit educational? Is it transformative? Courts have historically favored uses that add new expression or meaning, such as Google Books’ snippet view in Authors Guild v. Google (2015). Non-commercial research training is safer than commercial product training.
- Nature of the Work: Using factual data is less risky than using highly creative works like novels or songs.
- Amount Used: Training often requires ingesting entire works. The USCO notes this ordinarily weighs against fair use unless justified by a transformative purpose where the output does not expose the original content.
- Market Effect: Does the AI output substitute for the original work? If a model generates a summary that replaces the need to buy a newspaper, this factor weighs heavily against fair use.
The USCO emphasizes that each case is context-specific. There is no blanket "yes" or "no" for AI training. However, the trend is toward stricter scrutiny for commercial models that generate market-substitutable content.
Licensing as Risk Mitigation
Because fair use is uncertain, many AI companies are turning to licensing. High-profile deals between 2022 and 2023 set the stage for a new normal. OpenAI signed agreements with Shutterstock, the Associated Press, and Axel Springer SE. These deals allow AI companies to access vast archives of text and images while providing revenue sharing or attribution for rights holders.
Three main pricing models have emerged:
- Flat-Fee Licenses: Large media archives often command high six- to eight-figure sums for multi-year access.
- Usage-Based Models: Fees tied to the number of training runs or API calls.
- Revenue Sharing: Rights holders receive a percentage (often 5-20%) of revenues attributable to AI features using their content.
For enterprises, indemnification clauses are also becoming standard. Microsoft’s Copilot Copyright Commitment, for example, promises to defend customers from copyright claims arising from outputs, provided they follow content-filtering guidelines. This shifts financial risk from the user to the provider.
Data Provenance and Compliance
You might think that if data is on the open web, it’s free to use. That is incorrect. Content published at a public URL is still protected by copyright. To protect yourself, you must implement robust data provenance practices. This means tracking exactly where every piece of training data came from and under what terms.
Best practices include:
- Source Audits: Verify copyright status and license terms before ingestion.
- Granular Documentation: Keep hashes, timestamps, and chain-of-custody logs for every dataset version.
- Data Minimization: Only keep excerpts necessary for model objectives; delete non-essential files.
- Output Controls: Use similarity scanning to detect if generated outputs mimic known corpora too closely. Implement human review checkpoints for public-facing content.
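The documentation practice above (hashes, timestamps, and license terms per item) can be sketched as a small record type. This is a minimal illustration, not a standard schema: the `ProvenanceRecord` fields, the `make_record`, `verify`, and `to_log_line` helpers, and the idea of a JSON-lines log are all assumptions made for the example.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceRecord:
    """One chain-of-custody entry for a single ingested item (hypothetical schema)."""
    source_url: str
    license_id: str    # e.g. an SPDX identifier or an internal contract ID
    sha256: str        # content hash, so later audits can detect changes
    retrieved_at: str  # ISO 8601 timestamp in UTC


def make_record(content: bytes, source_url: str, license_id: str) -> ProvenanceRecord:
    """Hash the raw bytes and stamp the record with the current UTC time."""
    return ProvenanceRecord(
        source_url=source_url,
        license_id=license_id,
        sha256=hashlib.sha256(content).hexdigest(),
        retrieved_at=datetime.now(timezone.utc).isoformat(),
    )


def verify(content: bytes, record: ProvenanceRecord) -> bool:
    """Return True if the content still matches the hash logged at ingestion."""
    return hashlib.sha256(content).hexdigest() == record.sha256


def to_log_line(record: ProvenanceRecord) -> str:
    """Serialize a record as one JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)
```

Writing one such line per item at ingestion, and re-running `verify` before each training run, gives an audit trail that ties every file in a dataset version to a source and a license.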
Contract terms can bind you even where fair use would apply. If you use a subscription database whose terms of service prohibit automated scraping, scraping it can breach those terms even if the use would otherwise be fair under copyright law.
International Divergence
If you operate globally, the rules change drastically outside the U.S.
| Region | Key Legislation | Stance on AI Training |
|---|---|---|
| United States | Copyright Act (17 U.S.C.) | Fair use defense required; no specific AI exception. Human authorship mandatory. |
| European Union | DSM Directive (2019/790) & AI Act | Text-and-data-mining exceptions exist. Commercial entities can mine unless rights holders opt-out via machine-readable signals. AI Act requires transparency on training data. |
| United Kingdom | Copyright, Designs and Patents Act 1988 | Exception for non-commercial research only. Commercial AI training requires licenses. |
| Japan | Copyright Act (Article 30-4) | Broad exception for data analysis regardless of purpose, provided it does not substitute for normal use. |
In the EU, the Digital Single Market Directive allows text-and-data mining for anyone unless the rights holder opts out. The newer EU AI Act adds transparency requirements, forcing providers of general-purpose AI to publish summaries of the content used for training. This makes unlicensed scraping harder to hide. In Japan, the approach is much more permissive, allowing broad data mining for analysis. The UK remains cautious, limiting exceptions to non-commercial research.
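As a rough illustration of honoring a machine-readable opt-out, the sketch below checks a robots.txt file before crawling, using Python's standard `urllib.robotparser`. Treating robots.txt as the opt-out signal is an assumption for this example; which signals legally count as a reservation of rights under the directive is not settled by this article. The `may_crawl` helper, the `ExampleAIBot` user agent, and the sample file are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the site blocks one AI crawler and allows everyone else.
SAMPLE_ROBOTS = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/articles/1"
    print(may_crawl(SAMPLE_ROBOTS, "ExampleAIBot", url))
    print(may_crawl(SAMPLE_ROBOTS, "SomeOtherBot", url))
```

In a real pipeline the robots.txt text would be fetched per host and the decision logged alongside the provenance record, so the opt-out check itself is auditable.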
Ongoing Litigation and Unresolved Questions
Several major cases are shaping the future of AI copyright. Andersen v. Stability AI involves artists alleging that Stable Diffusion infringed their copyrights by training on their images. Doe v. GitHub concerns whether GitHub Copilot reproduces code snippets from public repositories without attribution. The New York Times Co. v. OpenAI argues that AI models reproduce articles nearly verbatim, harming subscription markets.
No Supreme Court decision has yet squarely addressed modern generative AI. Until these cases are resolved, the safest path is proactive compliance: license where possible, document everything, and filter outputs rigorously.
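One simple way to approximate the output filtering mentioned above is word n-gram overlap: flag any generated text that shares too many consecutive word sequences with a known reference. This is a sketch under stated assumptions, not a production detector. The function names, the 5-word window, and the 0.3 threshold are arbitrary choices for illustration, and real systems typically need indexing to scan large corpora.

```python
def word_ngrams(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(output: str, reference: str, n: int = 5) -> float:
    """Fraction of the output's n-grams that also appear in the reference."""
    out_grams = word_ngrams(output, n)
    if not out_grams:
        return 0.0  # output shorter than n words: nothing to compare
    return len(out_grams & word_ngrams(reference, n)) / len(out_grams)


def needs_review(output: str, corpus: list[str],
                 threshold: float = 0.3, n: int = 5) -> bool:
    """Flag the output for human review if it overlaps heavily with any reference."""
    return any(overlap_ratio(output, ref, n) >= threshold for ref in corpus)
```

Anything `needs_review` flags would go to the human review checkpoint rather than being blocked automatically, since high overlap can also come from quotations or boilerplate.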
Can I copyright my AI-generated artwork?
No, not the raw output. Under current U.S. law, only humans can be authors. However, if you significantly edit or curate the AI output, adding your own creative choices, you may claim copyright over those specific human contributions.
Is scraping the internet for AI training always illegal?
Not necessarily, but it is risky. Copying protected works into a training dataset is prima facie infringement, so you must rely on the fair use defense, which depends on factors like purpose, nature of the work, amount used, and market effect. Non-commercial research is safer than commercial product training.
What is data provenance in the context of AI?
Data provenance is the ability to trace where each piece of training data came from and under what terms. It involves keeping detailed records of sources, licenses, and usage rights to demonstrate legal compliance and auditability.
How does the EU AI Act affect AI developers?
The EU AI Act imposes transparency obligations on providers of general-purpose AI models. They must publish sufficiently detailed summaries of the content used to train their models, including copyrighted works. This makes unlicensed scraping harder to hide and easier for rights holders to challenge.
Should I license my training data?
Licensing is increasingly seen as the safest risk-mitigation strategy. While fair use offers a defense, it is uncertain and costly to litigate. Licensing agreements provide clarity, reduce legal risk, and often include indemnification clauses that protect users from copyright claims.