Wouldn't a model that can recite training data verbatim be larger than necessary? Exact text doesn't come from nowhere, no matter how efficiently the bits are encoded, so the same effectiveness should be achievable by compressing the portions of the model that store it.
Maybe we are all just LLMs. If the books were written by a language-producing algorithm in a human mind, maybe there isn't as much raw data there as it seems, and the total information really can be stored in a surprisingly small set of weights.
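To put a rough floor on that "no matter how efficiently the bits are encoded" point: a general-purpose compressor gives a crude estimate of how few bits it takes to store one book verbatim, which is a rough proxy for the storage a memorizing model would have to dedicate to it somewhere in its weights. A minimal sketch (the file path is a placeholder, and zlib is just a stand-in for "some efficient encoding"):

    # Crude sketch: size of one book after general-purpose compression, as a
    # rough proxy for how much storage verbatim recall of it implies.
    # "some_novel.txt" is a placeholder path, not anything from the thread.
    import zlib

    with open("some_novel.txt", "rb") as f:
        raw = f.read()

    packed = zlib.compress(raw, level=9)
    print(f"raw: {len(raw)/1e6:.2f} MB, compressed: {len(packed)/1e6:.2f} MB")
    # A typical novel lands at roughly a few hundred kilobytes compressed;
    # multiply by the number of works allegedly recitable and that budget has
    # to live somewhere in the weights.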
In this case, the plaintiffs alleged that Anthropic's LLMs had memorized the works so completely that "if each completed LLM had been asked to recite works it had trained upon, it could have done so", "almost verbatim". The judge assumed for the sake of argument that the allegation was true and ruled that the conduct was fair use anyway, largely because an effective output filter was in place, so there was no need to determine whether the allegation was actually true.
So, yes, in the sense that the ruling suggests that distributing an open-weight LLM that had memorized copyrighted works to that extent would not be fair use.
But no, in the sense that it's not clear whether any LLMs, especially open-weight ones, actually memorize book-length works to that extent. Even the recent study about Llama memorizing a Harry Potter book [1] only found that Llama could reproduce 50-token snippets a decent percentage of the time when given the preceding 50 tokens of the original text. That's different from being able to recite any substantial portion of the book on demand. If you asked Llama to do that, the output would quickly diverge from the original, and it likely couldn't get back on track without being re-prompted with the ground truth, which is effectively what the study's setup does.
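To make the snippet-vs-recitation distinction concrete, here's a rough sketch of the kind of sliding-window probe the study describes, not their actual code: the model name, window size, stride, and exact greedy-match criterion are all my simplifying assumptions.

    # Rough sketch of a sliding-window memorization probe (not the study's code).
    # Model id, window size, stride, and the exact greedy-match criterion are
    # placeholder assumptions; the study's actual methodology differs in detail.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL = "meta-llama/Llama-3.1-70B"  # placeholder model id
    WINDOW = 50                         # prompt length and target length, in tokens
    STRIDE = 50                         # how far to slide between probes

    tok = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL, torch_dtype=torch.bfloat16, device_map="auto"
    )

    def snippet_match_rate(book_text: str) -> float:
        """Fraction of 50-token prompts for which greedy decoding reproduces
        the next 50 ground-truth tokens exactly."""
        ids = tok(book_text, return_tensors="pt").input_ids[0]
        hits = total = 0
        for start in range(0, len(ids) - 2 * WINDOW, STRIDE):
            prompt = ids[start : start + WINDOW].unsqueeze(0).to(model.device)
            target = ids[start + WINDOW : start + 2 * WINDOW]
            out = model.generate(
                prompt, max_new_tokens=WINDOW, do_sample=False,
                pad_token_id=tok.eos_token_id,
            )
            generated = out[0, WINDOW:].cpu()
            # Each probe restarts from the true text, so one miss never cascades.
            hits += int(torch.equal(generated, target))
            total += 1
        return hits / max(total, 1)

The key design point is that every probe is re-anchored to the ground-truth text. Free-running generation from a single prompt is a much harder test: one early divergence derails everything downstream, which is why a high score on a probe like this doesn't translate into reciting chapters on request.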
On the other hand, in the case where the New York Times is suing OpenAI, the NYT has alleged that ChatGPT was able to recite extensive portions of NYT articles verbatim. If true, that could be a bigger problem, since news articles are much shorter than books but equally eligible for copyright protection, so reproducing one in full is a far more attainable feat. We'll see how that shakes out.
Also note:
- Nothing in the opinion sets formal precedent, since it's a district court ruling, but it might still influence later judges.
- See also riskable's sibling comment for another case where a judge addressed the issue more head-on (though I don't think that judge was facing the same kind of detailed allegations; I haven't checked).
[1] https://arxiv.org/abs/2412.06370