> the court dismissed “nonsensical” claims that Meta’s LLaMA models are themselves infringing derivative works.
See: https://www.eff.org/deeplinks/2025/02/copyright-and-ai-cases...
In this case, the plaintiffs alleged that Anthropic's LLMs had memorized the works so completely that "if each completed LLM had been asked to recite works it had trained upon, it could have done so", "almost verbatim". The judge assumed for the sake of argument that the allegation was true, and ruled that the conduct was fair use anyway due to the existence of an effective filter. Therefore there was no need to determine whether the allegation was actually true.
So - yes, in the sense that the ruling suggests that distributing an open-weight LLM that memorized copyrighted works to that extent would not be fair use.
But no, in the sense that it's not clear whether any LLMs, especially open-weight LLMs, actually memorize book-length works to that extent. Even the recent study about Llama memorizing a Harry Potter book [1] only said that Llama could reproduce 50-token snippets a decent percentage of the time when given the preceding 50 tokens. That's different from actually being able to recite any substantial portion of the book. If you asked Llama for that, the output would quickly diverge from the original text, and it likely wouldn't be able to get back on track without being re-prompted from the ground truth as the study did.
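To make concrete what that study measured (as opposed to "reciting the book"), here's a rough, hypothetical sketch of a 50-token probe. The model name, tokenizer, and helper function are placeholders, not the study's actual harness:

```python
# Hypothetical sketch of a 50-token memorization probe (not the study's code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-hf"  # placeholder model name
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def verbatim_hit(token_ids, start, window=50):
    """True if greedy decoding reproduces the next `window` tokens verbatim."""
    prompt = token_ids[start : start + window]
    target = token_ids[start + window : start + 2 * window]
    out = model.generate(torch.tensor([prompt]), max_new_tokens=window, do_sample=False)
    return out[0][len(prompt):].tolist() == target

# ids = tok(book_text)["input_ids"]            # book_text: the ground-truth text
# positions = range(0, len(ids) - 100, 50)
# rate = sum(verbatim_hit(ids, i) for i in positions) / len(positions)
```

Note that every probe re-seeds the model from the ground truth, which is exactly why a high hit rate on 50-token windows doesn't mean the model can recite any long stretch of the book unassisted.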
On the other hand, in the case where the New York Times is suing OpenAI, the NYT has alleged that ChatGPT was able to recite extensive portions of NYT articles verbatim. If true, this might be more dangerous, since news articles are not as long as books but they're equally eligible for copyright protection. So we'll see how that shakes out.
Also note:
- Nothing in the opinion sets formal precedent because it's a district court. But the opinion might still influence later judges.
- See also riskable's sibling comment for another case where a judge addressed the issue more head-on (but wasn't facing the same kind of detailed allegations, I don't think; haven't checked).
Additionally, if you download a model file that contains enough of the source material to be considered infringing (even without running the LLM, assume you can extract the contents directly from the weights), then it might as well be a .zip with a PDF in it: the model file itself becomes an infringing object. Closed models, by contrast, can be held accountable not for what they store but for what they produce.
As far as I could tell, the book didn't match what's posted online today. The text was somewhat consistent on a topic, yet poorly written and made references to sections that I don't think existed. No amount of prompting could locate them. I'm not convinced the material presented to me was actually the book, although it seemed consistent with the topic of the chapter.
I tried to ascertain when the book had been scraped, yet couldn't find a match in Archive.org or in the book's git repo.
Eventually I gave up and just continued reading the PDF.
I'm not a legal scholar, so I'm not qualified or interested in arguing about whether Cliff Notes is fair use. But I do care about how people behave, and I'm pretty sure that Cliff Notes and LLMs lead to fewer books being purchased, which makes it harder for writers to do what they do.
In the case of Cliff Notes, it probably matters less because the authors of 19th century books in your English 101 class are long dead and buried. But for authors of newer technical material, yes, I think LLMs will make it harder for those people to be able to afford to spend the time thinking, writing, and sharing their expertise.
I'm not so sure about this one. In particular, presuming that it is found that models which can produce infringing material are themselves infringing material, the ability to distill models from older models seems to suggest that the older models can actually produce the new, infringing model. It seems like that should mean that all output from the older model is infringing because any and all of it can be used to make infringing material (the new model, distilled from the old).
I don't think it's really tenable for courts to treat any model as though it is, in itself, copyright-infringing material without treating every generative model like that and, thus, killing the GPT/diffusion generation business (that could happen but it seems very unlikely). They will probably stick to being critical of what people generate with them and/or how they distribute what they generate.
You'd need the copyrighted works to compare to, of course, though if you have the permissible training data (as Anthropic apparently does) it should be doable.
If you can successfully demonstrate that, then yes, it is copyright infringement, and successfully doing so would be worthy of a NeurIPS or ACL paper.
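A sketch of how such a demonstration might be scored (hypothetical; the function name and 50-word window are just illustrative), assuming you have the reference books and a pile of text extracted from the weights:

```python
# Hypothetical scoring helper: how much of a reference book reappears verbatim?
def verbatim_overlap(extracted: str, reference: str, n: int = 50) -> float:
    """Fraction of length-n word windows from the reference found verbatim in the extraction."""
    ref_words = reference.split()
    haystack = " ".join(extracted.split())
    windows = [" ".join(ref_words[i:i + n]) for i in range(len(ref_words) - n + 1)]
    if not windows:
        return 0.0
    return sum(1 for w in windows if w in haystack) / len(windows)
```

Showing high overlap for substantial portions of many books, extracted from the weights alone, would be the hard (and publishable) part.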
The amount of the source material encoded does not, alone, determine if it is infringing, so this noun phrase doesn't actually mean anything. I know there are some popular myths that contradict this (the commonly-believed "30-second rule" for music, for instance), but they are just that, myths.
Not if there isn't infringement. Infringement is a question that precedes damages, since "damages" are only those harms that are attributable to the infringement. And infringement is an act, not an object.
If training a general use LLM on books isn't infringement (as this decision holds), then there by definition cannot be damages stemming from it; the amount of the source material that the model file "contains" doesn't matter.
It might matter to whether it is possible for a third party to easily use the model for something that would be infringement on the part of the third party, but that would become a problem for people who use it for infringement, not the model creator, and not for people who simply possess a copy of the model. The model isn't "an infringing object".
This is still a weird language shift that actively promotes misunderstandings.
The weights are the LLM. When you say "model", that means the weights.
This will have the effect of empowering countries (and other entities) that don't respect copyright law, of course.
The copyright cartel cannot be allowed to yank the handbrake on AI. If they insist on a fight, they must lose.
It's entirely possible for something to be suboptimal in the specific (I would like this thing for free), but optimal on the whole (society benefits from this thing not being free).
The model itself does not constitute a copy. Its intention is clearly not to reproduce verbatim texts. There would be far cheaper and infinitely more accurate ways to do that if that were the goal.
Apart from the legalities, it would be horrifying if copyright reached into the AI realm to completely stifle progress for, let's be honest, mainly the profits of a few major IP corporations.
I do however understand some creatives are worried about revenue, just like the rest of us. But just like the rest of us, they too live in a world that can only exist because 99.99% of what it took to build that world was automated or tool-enhanced, impacting someone's previous employment or business.
We are in a world of unprecedented change, only to be immediately surpassed by the next day's rate of change. This both scares and fascinates me.
But that change and its benefits being held only in the bowels of corporate/government symbiotic entities would scare me a hell of a lot more. Open source/weights is the only way to have a small chance of keeping this at bay.
It's sort of like distributing a compendium of book reviews. Many of the reviews have quotes from the book. If there are thousands of reviews, you could potentially reconstruct the whole book, but that's not the point of the thing and so it makes sense for the infringing thing to be "using it to reconstruct the whole book" rather than "distributing the compendium".
And then Anthropic fended off the argument that their service was intended for doing the former because they were explicitly taking measures to prevent that.
Maybe this is a misrepresentation of the actual Anthropic case, I have no idea, but it’s the scenario I was addressing.
This is the thing you haven't established.
Any ordinary general purpose computer is a "machine" that can produce copyrighted text, if you tell it to. But isn't it pretty important whether you actually do that with it or not, since it's a general purpose tool that can also do a large variety of other things?
Purposes which are fair use are very often not at all personal.
(Also, "personal use" that involves copying, creating a derivative work, or using any of the other exclusive rights of a copyright holder without a license or falling into either fair use or another explicit copyright exception are not, generally, allowed, they are just hard to detect and unlikely to be worth the copyright holder's time to litigate even if they somehow were detected.)
So it totally isn't a warez streaming media server but AI?
I'm guessing since my net worth isn't a billion plus, the answer is no
If you xor some data with random numbers, both the result and the random numbers are indistinguishably random and there is no way to tell which one came out of a random number generator and which one is "derived" from a copyrighted work. But if you xor them together again the copyrighted work comes out. So if you have Alice distribute one of the random looking things and Bob distribute the other one and then Carol downloads them both and reconstructs the copyrighted work, have you created a scheme to copy whatever you want with no infringement occurring?
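For concreteness, that hypothetical is just one-time-pad-style secret splitting; a toy sketch (illustrative only):

```python
# Toy illustration of the XOR-splitting scheme described above.
import os

def split(work: bytes):
    pad = os.urandom(len(work))                       # Alice's share: pure random bytes
    share = bytes(a ^ b for a, b in zip(work, pad))   # Bob's share: also looks random
    return pad, share

def reconstruct(pad: bytes, share: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(pad, share))   # Carol recombines the shares

pad, share = split(b"the copyrighted work")
assert reconstruct(pad, share) == b"the copyrighted work"
```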
Of course not; at the least, Carol is reproducing an infringing work, and then there are going to be claims of contributory infringement etc. for the others if the scheme has no other purpose than to do this.
Meanwhile this problem is also boring because preventing anyone from being the source of infringing works isn't a thing anybody has been able to do since at least as long as the internet has allowed anyone to set up a server in another jurisdiction.
The goal of copyright is to make sure people can get fair compensation for the amount of work they put in. LLMs automate plagiarism on a previously unfathomable scale.
If humans spend a trillion hours writing books, articles, blog posts and code, then somebody (a small group of people) comes and spends a million hours building a machine that ingests all the previous work and produces output based on it, who should get the reward for the work put in?
The original authors together spent a million times more effort (normalized for skill) and should therefore get a million times bigger reward than those who built the machine.
In other words, if the small group sells access to the product of the combined effort, they only deserve a millionth of the income.
---
If "AI" is as transformative as they claim, they will have no trouble making so much money they they can fairly compensate the original authors while still earning a decent profit. But if it's not, then it's just an overpriced plagiarism automator and their reluctance to acknowledge they are making money on top of everyone else's work is indicative.
This is a bit distorted. This is a better summary: The primary purpose of copyright is to induce and reward authors to create new works and to make those works available to the public to enjoy.
The ultimate purpose is to foster the creation of new works that the public can read and written culture can thrive. The means to achieve this is by ensuring that the authors of said works can get financial incentives for writing.
The two are not in opposition but it's good to be clear about it. The main beneficiary is intended to be the public, not the writers' guild.
Therefore, when some new factor such as LLMs enters the picture, we have to step back and see how the intent to benefit the reading public can be pursued in the new situation. That certainly has to take into account who will produce new written works and how, but that is not the main target; it is an instrumental subgoal.
Fundamentally, fair compensation is based on the amount of work put in (obviously taking skill/competence into account but the differences between people in most disciplines probably don't span a single order of magnitude, let alone several).
The ultimate goal should be to prevent people who don't produce value from taking advantage of those who do. And among those who do, that they get compensated according to the amount of work and skill they put in.
Imagine you spend a year building a house. I have a machine that can take your house and materialize a copy anywhere on earth for free. I charge people (something between 0 and the cost of building your house the normal way) to make them a copy of your house. I can make orders of magnitude more money this way than you. Are you happy about this situation? Does it make a difference how much I charge them?
What if my machine only works if I scan every house on the planet? What if I literally take pictures of it from all sides, then wait for you to not be home and X-ray it to see what it looks like inside?
You might say that you don't care because now you can also afford many more houses. But it does not make you richer. In fact, it makes you poorer.
Money is not a store of value. If everyone has more money but most people only have 2x more and a small group has a 1000x more, then the relative bargaining power changed so the small group is better off and the large group is worse off. This is what undetectable cheap mass plagiarism leads to for all intellectual work.
---
I wrote a lot of open source code, some of it under permissive licenses, some GPL, some AGPL. The conditions of those licenses are that you credit me. Some of them also require that if you build on top of my work, you release your work under the same license.
LLMs launder my code to make profit off of it without giving me anything (while other people make profit, thus making me poorer) and without crediting me.
LLMs also take away the rights of the users of my code. The (A)GPL forces anyone who builds on top of my work to release the code when asked; with LLM-laundered code, this right no longer seems to exist, because who do you even ask?
The house thing is a bit offtopic because to be considered for copyright, only its artistic, architectural expression matters. If you want to protect the ingenuity in the technical ways of how it's constructed, that's a patent law thing. It also muddies the water by bringing in aspects of the privacy of one's home by making us imagine paparazzi style photoshoots and sneaky X rays.
The thing is, houses can't be copied like bits and bytes. I would copy a car if I could. If you could copy a loaf of bread for free, it would be a moral imperative to do so, whatever the baker might think about it.
> fair compensation is based on the amount of work put in
This is the labor theory of value, but it has many known problems. For example, the amount of work put in can be disconnected from the amount of value it provides to someone. Pricing via supply/demand market forces has produced much better outcomes across the globe than any other type of allocation, moderated of course by taxes and so on.
But overall the question is whether LLMs create value for the public. Do they foster the prosperity of society? If yes, laws should be such that LLMs can digest more books rather than fewer. If LLMs are good, they should not be restricted to training only on copyright-expired writings.
If LLMs could create quality literature, or social media create in-depth reporting, then I'd have no problem with the tide of technological progress flowing.
Unfortunately, recent history has shown that it's trivial for the market to cannibalize the financial model of creators without replacing it.
And as a result, society gets {no more of that thing} + {watered down, shitty version}.
Which isn't great.
So I'd love to hear an argument from the 'fuck copyright, let's go AI' crowd (not the position you seem to be espousing) on what year +10 of rampant AI ingestion of copyrighted works looks like...
I think there is a problem with your initial position. Nobody is entitled to compensation for simply working on something. You have to work on things that people need or want. There is no such thing as "fair compensation".
It is "unfair" to take the work of somebody else and sell it as your own. (I don't think the LLMs are doing this.)
LLMs are models of languages, which are models of reality. If anyone deserves compensation, it's humanity as a whole, for example by nationalizing, or whatever the global equivalent is, LLMs.
Approximately none of the value of LLMs, for any user, is in recreating the text written by an author. Authors have only ever been entitled to (limited) ownership of their expression; copyright has never given them ownership of facts.
Does this imply that distributing open-weights models such as Llama is copyright infringement, since users can trivially run the model without output filtering to extract the memorized text?
[1]: https://storage.courtlistener.com/recap/gov.uscourts.cand.43...