As far as I could tell, the book didn't match what's posted online today. The text was roughly on topic but poorly written, and it referenced sections that I don't think existed; no amount of prompting could locate them. I'm not convinced the material presented to me was actually the book, although it did fit the topic of the chapter.
I tried to ascertain when the book had been scraped, but couldn't find a matching version on Archive.org or in the book's git repo.
Eventually I gave up and just continued reading the PDF.
I'm not a legal scholar, so I'm not qualified or interested in arguing about whether Cliff Notes is fair use. But I do care about how people behave, and I'm pretty sure that Cliff Notes and LLMs lead to fewer books being purchased, which makes it harder for writers to do what they do.
In the case of Cliff Notes, it probably matters less, because the authors of the 19th century books in your English 101 class are long dead and buried. But for authors of newer technical material, yes, I think LLMs will make it harder for those people to be able to afford to spend the time thinking, writing, and sharing their expertise.
----
> But for authors of newer technical material, yes, I think LLMs will make it harder for those people to be able to afford to spend the time thinking, writing, and sharing their expertise.
Alright, but then you're arguing for some new regulations, since this isn't a matter of copyright.
In that context, I observe that many academics already put their technical books online for free: machine learning, computer vision, robotics, etc. I doubt it's a hugely lucrative thing in the first place.
No, I'm not. I'm not talking about law at all. You talked about what reasonable people do and I'm also talking about what people do.
> I observe that many academics already put their technical books online for free.
As do I, which is why the LLMs are trained on it and are able to so effectively regurgitate it.
> I doubt it's a hugely lucrative thing in the first place.
This is true in many cases, but you might be surprised.
I'm not so sure about this one. In particular, suppose it is found that models which can produce infringing material are themselves infringing material. The ability to distill new models from older ones suggests that an older model can, in effect, produce a new, infringing model. That would seem to make all output from the older model infringing, because any and all of it can be used to make infringing material (the new model, distilled from the old).
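For anyone unfamiliar, "distillation" here just means training the new model to imitate the old model's outputs rather than the original data. A minimal sketch of the usual training step (the models, optimizer, and batch are placeholders, not any real system) looks something like this:

```python
# Minimal sketch of knowledge distillation: a "student" model is trained to
# match an older "teacher" model's output distribution, without ever touching
# the teacher's original training data. All names here are placeholders.
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, optimizer, batch, temperature=2.0):
    with torch.no_grad():
        teacher_logits = teacher(batch)   # older model's predictions
    student_logits = student(batch)       # new model's predictions

    # Soften both distributions and minimize their KL divergence, so the
    # student learns to reproduce the teacher's behavior.
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```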
I don't think it's really tenable for courts to treat any model as though it is, in itself, copyright-infringing material without treating every generative model like that and, thus, killing the GPT/diffusion generation business (that could happen but it seems very unlikely). They will probably stick to being critical of what people generate with them and/or how they distribute what they generate.
You'd need the copyrighted works to compare to, of course, though if you have the permissible training data (as Anthropic apparently does) it should be doable.
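To give a flavor of what I mean by "compare": even something crude like the longest verbatim overlap between sampled output and a source text gets you started. The texts below are placeholders, and a serious study would use much more careful matching:

```python
# Crude overlap check: longest verbatim word-level match between model output
# and a source work. Texts are placeholders; real analyses use better matching.
from difflib import SequenceMatcher

def longest_verbatim_overlap(generated: str, source: str) -> str:
    gen_words = generated.split()
    src_words = source.split()
    m = SequenceMatcher(None, gen_words, src_words).find_longest_match(
        0, len(gen_words), 0, len(src_words)
    )
    return " ".join(gen_words[m.a : m.a + m.size])

sample_output = "..."  # text sampled from the model
book_text = "..."      # the work you have permission to compare against
print(longest_verbatim_overlap(sample_output, book_text))
```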
The amount of the source material encoded does not, alone, determine if it is infringing, so this noun phrase doesn't actually mean anything. I know there are some popular myths that contradict this (the commonly-believed "30-second rule" for music, for instance), but they are just that, myths.
Not if there isn't infringement. Infringement is a question that precedes damages, since "damages" are only those harms that are attributable to the infringement. And infringement is an act, not an object.
If training a general use LLM on books isn't infringement (as this decision holds), then there by definition cannot be damages stemming from it; the amount of the source material that the model file "contains" doesn't matter.
It might matter to whether it is possible for a third party to easily use the model for something that would be infringement on the part of the third party, but that would become a problem for people who use it for infringement, not the model creator, and not for people who simply possess a copy of the model. The model isn't "an infringing object".
This is still a weird language shift that actively promotes misunderstandings.
The weights are the LLM. When you say "model", that means the weights.
If you can successfully demonstrate that, then yes, it is copyright infringement, and doing so would be worthy of a NeurIPS or ACL paper.
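For concreteness, a rough sketch of what such a demonstration tends to look like (the checkpoint name and passage are placeholders, and real memorization studies are far more rigorous): prompt the model with the opening of a passage and check whether greedy decoding reproduces the rest verbatim.

```python
# Sketch of a regurgitation test: feed the model the start of a passage and
# see whether greedy decoding reproduces the remainder verbatim.
# "some/open-model" and the passage are placeholders, not real data.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("some/open-model")
model = AutoModelForCausalLM.from_pretrained("some/open-model")

passage = "Opening sentences of a passage suspected to be memorized ..."
prefix, expected = passage[:60], passage[60:].strip()

inputs = tokenizer(prefix, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=100, do_sample=False)  # greedy
continuation = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

# A long verbatim continuation is the kind of evidence being asked for here.
print(continuation.strip().startswith(expected[:80]))
```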
This will have the effect of empowering countries (and other entities) that don't respect copyright law, of course.
The copyright cartel cannot be allowed to yank the handbrake on AI. If they insist on a fight, they must lose.
So, not really comparable.
It's entirely possible for something to be suboptimal in the specific (I would like this thing for free), but optimal on the whole (society benefits from this thing not being free).
The potential societal benefits to AI are unbounded, but only if it's allowed to develop without restrictions that artificially favor legacy interests.
Any decision or legislation that says that training is not fair use -- and yes, that includes gaining access to the content in the first place by any means necessary -- will have net-negative effects on the society that enforces it.
That's a very strong claim based on currently limited evidence.
It's in no way clear that AI can scale capability without bound, nor that it can only do so by refusing to compensate those who provide the training data.
OpenAI and Anthropic would love that to be true... but the facts don't support it.
Deal with it.
Additionally, if you download a model file that contains enough of the source material to be considered infringing (even without running the LLM; assume you can extract the contents directly out of the weights), then it might as well be a .zip with a PDF in it: the model file itself becomes an infringing object. Closed models, by contrast, can only be held accountable for what they produce, not for what they store.