Yep, broadly capable open models are on track for annihilation. Covering the cost of legally obtaining all the training materials will require hefty backing.

Additionally, if you download a model file that contains enough of the source material to be considered infringing (even without using the LLM, assume you can extract the contents directly out of the weights), then it might as well be a .zip with a PDF in it: the model file itself becomes an infringing object, whereas closed models can be held accountable not for what they store but for what they produce.


This technology is a really bad way of storing, reproducing, and transmitting the books themselves. It's probabilistic and lossy. It may be possible to reproduce some paragraphs, but no reasonable person would expect to read The Da Vinci Code by prompting the LLM. Surely the marketed use cases and the observed real use by users have to make it clear that the intended and vastly overwhelming use of an LLM is transformative, "digestive" synthesis of many sources to construct a merged, abstracted, generalized system that can function in novel settings, answering never-before-seen prompts in a useful manner, overwhelmingly without reproducing existing written works. It surely matters what the purpose of the thing is, both in intention and in observed practice. It's not a viable competing alternative to reading the actual book.

Not The Da Vinci Code, but I recently tried reading "OCaml Programming: Correct + Efficient + Beautiful" through Gemini. The book is open, so I naturally assumed it was "in there". I read by saying "Give me the first paragraph of Chapter 6" and then something like "Next 3 paragraphs". If I had a question, I was able to ask it, get some more info, and have something like a dialog.

As far as I could tell, the text it gave me didn't match what's posted online today. It was roughly consistent with the topic, yet poorly written, and it made references to sections that I don't think exist. No amount of prompting could locate them. I'm not convinced the material presented to me was actually the book, although it seemed consistent with the subject of the chapter.

I tried to ascertain when the book had been scraped, but couldn't find a match in Archive.org or in the book's git repo.

Eventually I gave up and just continued reading the PDF.
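
For what it's worth, the check I attempted looks roughly like the sketch below: list the Wayback Machine snapshots of the book's site, then fuzzily compare a paragraph the LLM produced against an archived version. The CDX endpoint is real; BOOK_URL and the sample strings are assumptions for illustration.

```python
# Sketch: list Wayback Machine snapshots of the book's site, then score a
# paragraph the LLM produced against archived text. The CDX endpoint is
# real; BOOK_URL and the sample strings are placeholders.
import difflib
import json
import urllib.request

BOOK_URL = "cs3110.github.io/textbook"  # assumed URL for the book

def list_snapshots(url: str) -> list[str]:
    """Return Wayback Machine snapshot timestamps for a URL."""
    cdx = ("http://web.archive.org/cdx/search/cdx"
           f"?url={url}&output=json&fl=timestamp")
    with urllib.request.urlopen(cdx) as resp:
        rows = json.load(resp)
    return [row[0] for row in rows[1:]]  # rows[0] is the header row

def similarity(llm_text: str, archived_text: str) -> float:
    """Rough 0..1 similarity between LLM output and archived text."""
    return difflib.SequenceMatcher(None, llm_text, archived_text).ratio()

if __name__ == "__main__":
    for ts in list_snapshots(BOOK_URL)[:5]:
        print(f"https://web.archive.org/web/{ts}/{BOOK_URL}")
    # The next step would be fetching a snapshot's chapter page, stripping
    # the HTML, and scoring it against what the LLM emitted:
    print(similarity("paragraph as Gemini gave it to me",
                     "paragraph as archived"))
```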

The number of people who buy CliffsNotes versions of books to pass examinations where they claim to have read the actual book suggests you are way overestimating how "reasonable" many people are.

CliffsNotes are fair use. Would you argue otherwise? Wikipedia also has plot summaries without infringement.
In your parent comment, you argued what people would do in practice. Now you have shifted to talking about what is legal or not to do.

I'm not a legal scholar, so I'm not qualified or interested in arguing about whether CliffsNotes is fair use. But I do care about how people behave, and I'm pretty sure that CliffsNotes and LLMs lead to fewer books being purchased, which makes it harder for writers to do what they do.

In the case of CliffsNotes, it probably matters less, because the authors of the 19th-century books in your English 101 class are long dead and buried. But for authors of newer technical material, yes, I think LLMs will make it harder for those people to be able to afford to spend the time thinking, writing, and sharing their expertise.

It surely matters whether people actually use the thing for copyright violations or not. Summaries are not copyright violations, so that's irrelevant. Long verbatim copies would be, but one would have to demonstrate that this use case is significant, convenient enough to provide a viable alternative to otherwise obtaining the particular text, and so on.

----

> But for authors of newer technical material, yes, I think LLMs will make it harder for those people to be able to afford to spend the time thinking, writing, and sharing their expertise.

Alright, you're now arguing for some new regulations though, since this is not a matter for copyright.

In that context, I observe that many academics already put their technical books online for free. Machine learning, computer vision, robotics, etc. I doubt it's a hugely lucrative thing in the first place.

> Alright, you're now arguing for some new regulations though

No, I'm not. I'm not talking about law at all. You talked about what reasonable people do and I'm also talking about what people do.

> I observe that many academics already put their technical books online for free.

As do I, which is why the LLMs are trained on it and are able to so effectively regurgitate it.

> I doubt it's a hugely lucrative thing in the first place.

This is true in many cases, but you might be surprised.

> broadly capable open models are on track for annihilation

I'm not so sure about this one. In particular, suppose it is found that a model which can produce infringing material is itself infringing material. The ability to distill new models from older ones then implies that the older model can itself produce a new, infringing model. That should seemingly make all output from the older model infringing, because any and all of it can be used to make infringing material (the new model, distilled from the old).

I don't think it's really tenable for courts to treat any model as though it is, in itself, copyright-infringing material without treating every generative model like that and thus killing the GPT/diffusion generation business (that could happen, but it seems very unlikely). They will probably stick to being critical of what people generate with the models and/or how they distribute what they generate.

In theory, couldn't you distill a non-infringing model from an infringing one? Just prompt it for continuations and give it a whack every time the output matches something in your dataset of copyrighted works.

You'd need the copyrighted works to compare to, of course, though if you have the permissible training data (as Anthropic apparently does) it should be doable.
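
A minimal sketch of that filtering step, in case the idea isn't clear: build an n-gram index over the corpus of copyrighted works and drop any teacher output that overlaps it. All names here are hypothetical, and rejecting samples outright is the simplest stand-in for "giving it a whack" (an actual training penalty would be more involved).

```python
# Sketch: filter distillation samples so the student never trains on text
# that reproduces the reference corpus. Function names are hypothetical;
# the overlap test is a simple word n-gram check.
from typing import Iterable

NGRAM = 8  # consecutive words counted as "verbatim overlap" -- a judgment call

def ngrams(text: str, n: int = NGRAM) -> set[tuple[str, ...]]:
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_index(copyrighted_works: Iterable[str]) -> set[tuple[str, ...]]:
    """Precompute every n-gram that appears in the protected corpus."""
    index: set[tuple[str, ...]] = set()
    for work in copyrighted_works:
        index |= ngrams(work)
    return index

def keep_sample(completion: str, index: set[tuple[str, ...]]) -> bool:
    """Reject any teacher output that overlaps the protected corpus."""
    return index.isdisjoint(ngrams(completion))

def distillation_data(prompts, teacher_generate, index):
    """Yield only clean (prompt, completion) pairs for the student.

    teacher_generate() is a stand-in for however you query the old model.
    """
    for p in prompts:
        out = teacher_generate(p)
        if keep_sample(out, index):
            yield p, out
```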

> a model file that contains enough of the source material to be considered infringing

The amount of the source material encoded does not, alone, determine whether it is infringing, so this noun phrase doesn't actually mean anything. I know there are some popular myths that contradict this (the commonly believed "30-second rule" for music, for instance), but they are just that, myths.

But there is the issue of whether there are damages. If my LLM can reproduce 10 random paragraphs of a Harry Potter book, it's obvious that nobody is going to skip purchasing the book just because they can get those 10 paragraphs out of it. So there will not be any damages to the publisher, and the lawsuit will be tossed. There is a threshold for how much needs to be reproduced, and how closely, but it's a subjective standard, not some hard line like "more than 50%".

> But there is the issue of whether there are damages.

Not if there isn't infringement. Infringement is a question that precedes damages, since "damages" are only those harms that are attributable to the infringement. And infringement is an act, not an object.

If training a general use LLM on books isn't infringement (as this decision holds), then there by definition cannot be damages stemming from it; the amount of the source material that the model file "contains" doesn't matter.

It might matter to whether it is possible for a third party to easily use the model for something that would be infringement on the third party's part, but that would be a problem for people who use it for infringement, not for the model creator, and not for people who simply possess a copy of the model. The model isn't "an infringing object".

> even without using the LLM, assume you can extract the contents directly out of the weights

This is still a weird language shift that actively promotes misunderstandings.

The weights are the LLM. When you say "model", that means the weights.

> extract the contents directly out of the weights

If you can successfully demonstrate that, then yes, it is copyright infringement, and successfully doing so would be worthy of a NeurIPS or ACL paper.
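
To be concrete about what "demonstrate" would mean in practice: since the weights are the model, the standard demonstration is an extraction probe at inference time, along the lines of the published memorization attacks. Below is a minimal sketch using the Hugging Face transformers API; the model name and the passage are placeholders.

```python
# Sketch of a verbatim-memorization probe: give the model a prefix from a
# known work and test whether greedy decoding reproduces the true
# continuation. Model name and passage are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

def reproduces_verbatim(model, tokenizer, passage: str,
                        prefix_tokens: int = 50,
                        check_tokens: int = 50) -> bool:
    ids = tokenizer(passage, return_tensors="pt").input_ids
    prefix = ids[:, :prefix_tokens]
    truth = ids[:, prefix_tokens:prefix_tokens + check_tokens]
    out = model.generate(prefix, max_new_tokens=check_tokens, do_sample=False)
    continuation = out[:, prefix.shape[1]:prefix.shape[1] + truth.shape[1]]
    if continuation.shape != truth.shape:  # e.g. early end-of-sequence
        return False
    return bool((continuation == truth).all())

if __name__ == "__main__":
    name = "some-open-model"  # placeholder model id
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)
    passage = "A long passage from the work in question..."
    print("verbatim continuation:", reproduces_verbatim(model, tokenizer, passage))
```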

> Yep, broadly capable open models are on track for annihilation. Covering the cost of legally obtaining all the training materials will require hefty backing.

This will have the effect of empowering countries (and other entities) that don't respect copyright law, of course.

The copyright cartel cannot be allowed to yank the handbrake on AI. If they insist on a fight, they must lose.

For that matter, how dare the government fine me for dumping waste in the river, and stop me from employing minors? Don't they know it will ruin the economy?

Copyright is something we invented from thin air, and relatively recently at that. Meanwhile, refraining from fouling their own nests is something that most animals have accomplished instinctively for millions of years.

So, not really comparable.
