The dominant tech narrative is that training large language and image models on copyrighted material is obviously fair use: that the precedent is settled, the law is clear, and any objection is just luddite confusion. None of that is true. The fair-use case for AI training is a contested legal argument, currently being litigated, and considerably weaker than the industry’s confident press releases suggest.
What fair use actually requires
U.S. fair-use doctrine weighs four factors: purpose and character of the use, nature of the copyrighted work, amount used, and effect on the market for the original. Training a commercial AI model on copyrighted text or images puts pressure on every one of those factors.
The industry leans hard on “transformative use,” the argument from Google Books that scanning to enable search constituted a sufficiently new purpose. But Google Books didn’t generate competing books. It produced an index. AI image generators trained on artists’ work produce competing artwork in those artists’ styles. AI text models trained on journalism produce summaries that displace clicks to the original journalism. The “transformative” argument gets thin when the output competes commercially with the input.
The market-effect factor matters more than tech admits
The fourth factor, effect on the potential market, is where AI training is most exposed. If a generative model can produce work in the style of a specific living author or illustrator, and customers buy the model’s output instead of commissioning the artist, the market harm is direct and measurable. That’s exactly the kind of harm copyright was designed to address.
Tech companies argue the harm is speculative or that authors should welcome the broader cultural diffusion. Courts are not obligated to accept either framing, and they’ve already begun rejecting parts of it. The New York Times v. OpenAI case, ongoing class actions by authors, and Getty Images’ suit against Stability AI all turn on whether commercial AI training and output substantively harm the markets for the underlying works. The tech industry’s confident “fair use” framing presumes those cases will lose. They might not.
The licensing alternative exists
Tech companies could license training data. Some are starting to, quietly: Reddit, Stack Overflow, news publishers, image archives. The fact that licensing markets are forming is itself relevant to fair-use analysis. Courts consider whether a market for licensing exists, and if it does, “we just took it” becomes a harder argument.
The deeper issue is that the AI industry built itself on the assumption that training data was free, scaled rapidly on that assumption, and now faces enormous retroactive cost if courts disagree. The size of the bet doesn’t make the legal argument stronger. It makes the lobbying louder.
Bottom line
Whether AI training constitutes fair use is a live legal question with cases working their way through federal courts. The eventual answer may permit some training and prohibit other training, and the industry’s preferred framing, that this is settled, is a rhetorical move, not a legal one. Anyone reading confident tech commentary on this should treat the confidence itself as a tell.