> the court dismissed “nonsensical” claims that Meta’s LLaMA models are themselves infringing derivative works.
See: https://www.eff.org/deeplinks/2025/02/copyright-and-ai-cases...
In this case, the plaintiffs alleged that Anthropic's LLMs had memorized the works so completely that "if each completed LLM had been asked to recite works it had trained upon, it could have done so", "almost verbatim". The judge assumed for the sake of argument that the allegation was true, and ruled that the conduct was fair use anyway due to the existence of an effective filter. Therefore there was no need to determine whether the allegation was actually true.
So - yes, in the sense that the ruling suggests that distributing an open-weight LLM that memorized copyrighted works to that extent would not be fair use.
But no, in the sense that it's not clear whether any LLMs, especially open-weight LLMs, actually memorize book-length works to that extent. Even the recent study about Llama memorizing a Harry Potter book [1] only said that Llama could reproduce 50-token snippets a decent percentage of the time when given the preceding 50 tokens. That's different from actually being able to recite any substantial portion of the book. If you asked Llama for that, the output would quickly diverge from the original text, and it likely wouldn't be able to get back on track without being re-prompted from the ground truth as the study did.
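To make concrete what that study measured (as opposed to "reciting the book"), here's a rough, hypothetical sketch of a 50-token probe. The model name, tokenizer, and helper function are placeholders, not the study's actual harness:

```python
# Hypothetical sketch of a 50-token memorization probe (not the study's code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-hf"  # placeholder model name
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def verbatim_hit(token_ids, start, window=50):
    """True if greedy decoding reproduces the next `window` tokens verbatim."""
    prompt = token_ids[start : start + window]
    target = token_ids[start + window : start + 2 * window]
    out = model.generate(torch.tensor([prompt]), max_new_tokens=window, do_sample=False)
    return out[0][len(prompt):].tolist() == target

# ids = tok(book_text)["input_ids"]            # book_text: the ground-truth text
# positions = range(0, len(ids) - 100, 50)
# rate = sum(verbatim_hit(ids, i) for i in positions) / len(positions)
```

Note that every probe re-seeds the model from the ground truth, which is exactly why a high hit rate on 50-token windows doesn't mean the model can recite any long stretch of the book unassisted.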
On the other hand, in the case where the New York Times is suing OpenAI, the NYT has alleged that ChatGPT was able to recite extensive portions of NYT articles verbatim. If true, this might be more dangerous, since news articles are not as long as books but they're equally eligible for copyright protection. So we'll see how that shakes out.
Also note:
- Nothing in the opinion sets formal precedent because it's a district court. But the opinion might still influence later judges.
- See also riskable's sibling comment for another case where a judge addressed the issue more head-on (but wasn't facing the same kind of detailed allegations, I don't think; haven't checked).
Additionally, if you download a model file that contains enough of the source material to be considered infringing (even without running the LLM, assume you can extract the contents directly from the weights), then it might as well be a .zip with a PDF in it: the model file itself becomes an infringing object. Closed models, by contrast, can be held accountable not for what they store but for what they produce.
As far as I could tell, the book didn't match what's posted online today. The text was somewhat consistent on a topic, yet poorly written and made references to sections that I don't think existed. No amount of prompting could locate them. I'm not convinced the material presented to me was actually the book, although it seemed consistent with the topic of the chapter.
I tried to ascertain when the book had been scraped, yet couldn't find a match in Archive.org or in the book's git repo.
Eventually I gave up and just continued reading the PDF.
I'm not a legal scholar, so I'm not qualified or interested in arguing about whether Cliff Notes is fair use. But I do care about how people behave, and I'm pretty sure that Cliff Notes and LLMs lead to fewer books being purchased, which makes it harder for writers to do what they do.
In the case of Cliff Notes, it probably matters less because the authors of 19th century books in your English 101 class are long dead and buried. But for authors of newer technical material, yes, I think LLMs will make it harder for those people to be able to afford to spend the time thinking, writing, and sharing their expertise.
I'm not so sure about this one. In particular, presuming that it is found that models which can produce infringing material are themselves infringing material, the ability to distill models from older models seems to suggest that the older models can actually produce the new, infringing model. It seems like that should mean that all output from the older model is infringing because any and all of it can be used to make infringing material (the new model, distilled from the old).
I don't think it's really tenable for courts to treat any model as though it is, in itself, copyright-infringing material without treating every generative model like that and, thus, killing the GPT/diffusion generation business (that could happen but it seems very unlikely). They will probably stick to being critical of what people generate with them and/or how they distribute what they generate.
You'd need the copyrighted works to compare to, of course, though if you have the permissible training data (as Anthropic apparently does) it should be doable.
If you can successfully demonstrate that, then yes, it is copyright infringement, and successfully doing so would be worthy of a NeurIPS or ACL paper.
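A sketch of how such a demonstration might be scored (hypothetical; the function name and 50-word window are just illustrative), assuming you have the reference books and a pile of text extracted from the weights:

```python
# Hypothetical scoring helper: how much of a reference book reappears verbatim?
def verbatim_overlap(extracted: str, reference: str, n: int = 50) -> float:
    """Fraction of length-n word windows from the reference found verbatim in the extraction."""
    ref_words = reference.split()
    haystack = " ".join(extracted.split())
    windows = [" ".join(ref_words[i:i + n]) for i in range(len(ref_words) - n + 1)]
    if not windows:
        return 0.0
    return sum(1 for w in windows if w in haystack) / len(windows)
```

Showing high overlap for substantial portions of many books, extracted from the weights alone, would be the hard (and publishable) part.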
The amount of the source material encoded does not, alone, determine if it is infringing, so this noun phrase doesn't actually mean anything. I know there are some popular myths that contradict this (the commonly-believed "30-second rule" for music, for instance), but they are just that, myths.
Not if there isn't infringement. Infringement is a question that precedes damages, since "damages" are only those harms that are attributable to the infringement. And infringement is an act, not an object.
If training a general use LLM on books isn't infringement (as this decision holds), then there by definition cannot be damages stemming from it; the amount of the source material that the model file "contains" doesn't matter.
It might matter to whether it is possible for a third party to easily use the model for something that would be infringement on the part of the third party, but that would become a problem for people who use it for infringement, not the model creator, and not for people who simply possess a copy of the model. The model isn't "an infringing object".
This is still a weird language shift that actively promotes misunderstandings.
The weights are the LLM. When you say "model", that means the weights.
This will have the effect of empowering countries (and other entities) that don't respect copyright law, of course.
The copyright cartel cannot be allowed to yank the handbrake on AI. If they insist on a fight, they must lose.
It's entirely possible for something to be suboptimal in the specific (I would like this thing for free), but optimal on the whole (society benefits from this thing not being free).
The model itself does not constitute a copy. Its intention is clearly not to reproduce verbatim texts. There would be far cheaper and infinitely more accurate ways to do that if that were the goal.
Apart from the legalities, it would be horrifying if copyright reached into the AI realm to completely stifle progress for, let's be honest, mainly the profits of a few major IP corporations.
I do however understand some creatives are worried about revenue, just like the rest of us. But just like the rest of us, they too live in a world that can only exist because 99.99% of what it took to build that world was automated or tool-enhanced, impacting someone's previous employment or business.
We are in a world of unprecedented change, only to be immediately surpassed by the next day's rate of change. This both scares and fascinates me.
But that change and its benefits being held only in the bowels of corporate/government symbiotic entities would scare me a hell of a lot more. Open source/weights is the only way to have a small chance of keeping this at bay.
It's sort of like distributing a compendium of book reviews. Many of the reviews have quotes from the book. If there are thousands of reviews, you could potentially reconstruct the whole book, but that's not the point of the thing and so it makes sense for the infringing thing to be "using it to reconstruct the whole book" rather than "distributing the compendium".
And then Anthropic fended off the argument that their service was intended for doing the former because they were explicitly taking measures to prevent that.
Maybe this is a misrepresentation of the actual Anthropic case, I have no idea, but it’s the scenario I was addressing.
This is the thing you haven't established.
Any ordinary general purpose computer is a "machine" that can produce copyrighted text, if you tell it to. But isn't it pretty important whether you actually do that with it or not, since it's a general purpose tool that can also do a large variety of other things?
Purposes which are fair use are very often not at all personal.
(Also, "personal use" that involves copying, creating a derivative work, or using any of the other exclusive rights of a copyright holder without a license or falling into either fair use or another explicit copyright exception are not, generally, allowed, they are just hard to detect and unlikely to be worth the copyright holder's time to litigate even if they somehow were detected.)
So it totally isn't a warez streaming media server but AI?
I'm guessing since my net worth isn't a billion plus, the answer is no
If you xor some data with random numbers, both the result and the random numbers are indistinguishably random and there is no way to tell which one came out of a random number generator and which one is "derived" from a copyrighted work. But if you xor them together again the copyrighted work comes out. So if you have Alice distribute one of the random looking things and Bob distribute the other one and then Carol downloads them both and reconstructs the copyrighted work, have you created a scheme to copy whatever you want with no infringement occurring?
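For concreteness, that hypothetical is just one-time-pad-style secret splitting; a toy sketch (illustrative only):

```python
# Toy illustration of the XOR-splitting scheme described above.
import os

def split(work: bytes):
    pad = os.urandom(len(work))                       # Alice's share: pure random bytes
    share = bytes(a ^ b for a, b in zip(work, pad))   # Bob's share: also looks random
    return pad, share

def reconstruct(pad: bytes, share: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(pad, share))   # Carol recombines the shares

pad, share = split(b"the copyrighted work")
assert reconstruct(pad, share) == b"the copyrighted work"
```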
Of course not; at the least, Carol is reproducing an infringing work, and then there are going to be claims of contributory infringement etc. for the others if the scheme has no other purpose than to do this.
Meanwhile this problem is also boring because preventing anyone from being the source of infringing works isn't a thing anybody has been able to do since at least as long as the internet has allowed anyone to set up a server in another jurisdiction.
The goal of copyright is to make sure people can get fair compensation for the amount of work they put in. LLMs automate plagiarism on a previously unfathomable scale.
If humans spend a trillion hours writing books, articles, blog posts and code, then somebody (a small group of people) comes and spends a million hours building a machine that ingests all the previous work and produces output based on it, who should get the reward for the work put in?
The original authors together spent a million times more effort (normalized for skill) and should therefore get a million times bigger reward than those who built the machine.
In other words, if the small group sells access to the product of the combined effort, they only deserve a millionth of the income.
---
If "AI" is as transformative as they claim, they will have no trouble making so much money they they can fairly compensate the original authors while still earning a decent profit. But if it's not, then it's just an overpriced plagiarism automator and their reluctance to acknowledge they are making money on top of everyone else's work is indicative.
This is a bit distorted. This is a better summary: The primary purpose of copyright is to induce and reward authors to create new works and to make those works available to the public to enjoy.
The ultimate purpose is to foster the creation of new works that the public can read and written culture can thrive. The means to achieve this is by ensuring that the authors of said works can get financial incentives for writing.
The two are not in opposition but it's good to be clear about it. The main beneficiary is intended to be the public, not the writers' guild.
Therefore, when some new factor such as LLMs enters the picture, we have to step back and see how the intent to benefit the reading public can be pursued in the new situation. That certainly has to take into account who will produce new written works and how, but that is not the main target; it is an instrumental subgoal.
Fundamentally, fair compensation is based on the amount of work put in (obviously taking skill/competence into account but the differences between people in most disciplines probably don't span a single order of magnitude, let alone several).
The ultimate goal should be to prevent people who don't produce value from taking advantage of those who do. And among those who do, that they get compensated according to the amount of work and skill they put in.
Imagine you spend a year building a house. I have a machine that can take your house and materialize a copy anywhere on earth for free. I charge people (something between 0 and the cost of building your house the normal way) to make them a copy of your house. I can make orders of magnitude more money this way than you. Are you happy about this situation? Does it make a difference how much I charge them?
What if my machine only works if I scan every house on the planet? What if I literally take pictures of it from all sides, then wait for you to not be home and X-ray it to see what it looks like inside?
You might say that you don't care because now you can also afford many more houses. But it does not make you richer. In fact, it makes you poorer.
Money is not a store of value. If everyone has more money but most people only have 2x more and a small group has a 1000x more, then the relative bargaining power changed so the small group is better off and the large group is worse off. This is what undetectable cheap mass plagiarism leads to for all intellectual work.
---
I wrote a lot of open source code, some of it under permissive licenses, some GPL, some AGPL. The conditions of those licenses are that you credit me. Some of them also require that if you build on top of my work, you release your work under the same license.
LLMs launder my code to make profit off of it without giving me anything (while other people make profit, thus making me poorer) and without crediting me.
LLMs also take away the rights of the users of my code. The (A)GPL forces anyone who builds on top of my work to release the code when asked; with LLM-laundered code, this right no longer seems to exist, because who do you even ask?
The house thing is a bit offtopic because to be considered for copyright, only its artistic, architectural expression matters. If you want to protect the ingenuity in the technical ways of how it's constructed, that's a patent law thing. It also muddies the water by bringing in aspects of the privacy of one's home by making us imagine paparazzi style photoshoots and sneaky X rays.
The thing is, houses can't be copied like bits and bytes. I would copy a car if I could. If you could copy a loaf of bread for free, it would be a moral imperative to do so, whatever the baker might think about it.
> fair compensation is based on the amount of work put in
This is the labor theory of value, but it has many known problems. For example, the amount of work put in can be disconnected from the amount of value it provides to someone. Pricing via supply/demand market forces has produced much better outcomes across the globe than any other type of allocation, moderated of course by taxes and so on.
But overall the question is whether LLMs create value for the public. Do they foster the prosperity of society? If yes, laws should be such that LLMs can digest more books rather than fewer. If LLMs are good, they should not be restricted to training only on copyright-expired writings.
If LLMs could create quality literature, or social media create in-depth reporting, then I'd have no problem with the tide of technological progress flowing.
Unfortunately, recent history has shown that it's trivial for the market to cannibalize the financial model of creators without replacing it.
And as a result, society gets {no more of that thing} + {watered down, shitty version}.
Which isn't great.
So I'd love to hear an argument from the 'fuck copyright, let's go AI' crowd (not the position you seem to be espousing) on what year +10 of rampant AI ingestion of copyrighted works looks like...
I think there is a problem with your initial position. Nobody is entitled to compensation for simply working on something. You have to work on things that people need or want. There is no such thing as "fair compensation".
It is "unfair" to take the work of somebody else and sell it as your own. (I don't think the LLMs are doing this.)
LLMs are models of languages, which are models of reality. If anyone deserves compensation, it's humanity as a whole, for example by nationalizing, or whatever the global equivalent is, LLMs.
Approximately none of the value of LLMs, for any user, is in recreating the text written by an author. Authors have only ever been entitled to (limited) ownership of their expression; copyright has never given them ownership of facts.
Does this imply that distributing open-weights models such as Llama is copyright infringement, since users can trivially run the model without output filtering to extract the memorized text?
[1]: https://storage.courtlistener.com/recap/gov.uscourts.cand.43...