Preferences

Broadly summarizing:

This is OK and fair use: Training LLMs on copyrighted work, since it's transformative.

This is not OK and not fair use: pirating data, or creating a big repository of pirated data that isn't necessarily for AI training.

Overall seems like a pretty reasonable ruling?


derbOac
But those training the LLMs are still using the works, and not just to discuss them, which I think is the point of fair use doctrine. I guess I fail to see how it's any different from me using it in some other way? If I wanted to write a play very loosely inspired by Blood Meridian, it might be transformative, but that doesn't justify me pirating the book.

I tend to think copyright should be extremely limited compared to what it is now, but to me the logic of this ruling is illogical other than "it's ok for a corporation to use lots of works without permission but not for an individual to use a single work without permission." Maybe if they suddenly loosened copyright enforcement for everyone I might feel differently.

"Kill one man, and you are a murderer. Kill millions of men, and you are a conqueror." (An admittedly hyperbolic comparison, but similar idea.)

rcxdude
>If I wanted to write a play very loosely inspired by Blood Meridian, it might be transformative, but that doesn't justify me pirating the book.

I think that's the conclusion of the judge. If Anthropic were to buy the books and train on them, without extra permission from the authors, it would be fair use, much like if you were to be inspired by it (though in that case, it may not even count as a derivative work at all, if the relationship is sufficiently loose). But that doesn't mean they are free to pirate it either, so they are likely to be liable for that (exactly how that interpretation works with copyright law I'm not entirely sure: I know in some places that downloading stuff is less of a problem than distributing it to others because the latter is the main thing that copyright is concerned with. And AFAIK most companies doing large model training are maintaining that fair use also extends to them gathering the data in the first place).

(Fair use isn't just for discussion. It covers a broad range of potential use cases, and they're not enumerated precisely in copyright law AFAIK; there's a complicated body of case law that forms the guidelines for it.)

tsumnia
I think the issue is that it's actually quite difficult to "unlearn" something once you've seen it. I'm speaking more from human learning than AI learning, but since AI is inspired by our view of nature, it will have similar qualities. If I see something that inspires me, regardless of whether I paid for it, I may not even know what specifically inspired me. If I sit on a park bench and an idea comes to me, it could come from a number of things: the bench, the park, the weather, what movie I watched last night, stuff on the wall of a restaurant while I was eating there, etc.

While humans don't have encyclopedic memories, our brain connects a few dots to make a thought. If I say "Luke, I am your father", it doesn't matter that that isn't even the actual line; anyone who's seen Star Wars knows what I'm quoting. I may not be profiting from using that line, but that doesn't stop Star Wars from inspiring other elements of my life.

I do agree that copyright law is complicated and AI is going to create even more complexity as we navigate this growth. I don't have a solution on that front, just a recognition that AI is doing what humans do, only more precisely.

altruios
AFAIK (IANAL), copyright and exhaustion rights are completely different. Under copyright, once a book is purchased, that's it. Reselling the same, or a transformed (e.g., highlighted), "used" work is 100% legal, as is consuming it at your discretion (in your mind {a billion times}, in a fire, or (yes, even) in what amounts to a fancy calculator).

(that's all to say copyright is dated and needs an overhaul)

But that's taking a viewpoint of 'training a personal AI in your home', which isn't something that actually happens... The issue has never been the training data itself. Training an AI and 'looking at data and optimizing a (human understanding/AI understanding) function over it' are categorically the same, even if mechanically/biologically they are very different.

dragonwriter
> I tend to think copyright should be extremely limited compared to what it is now, but to me the logic of this ruling is illogical other than "it's ok for a corporation to use lots of works without permission but not for an individual to use a single work without permission."

That's not what the ruling says.

It says that training a generative AI system on one or more works is fair use, so long as the system is not designed primarily as a direct replacement for those works, and that print-to-digital destructive scanning for storage and searchability is fair use.

These are both independent of whether one person or a giant company or something in between is doing it, and independent of the number of works involved (there's maybe a weak practical relationship to the number of works involved, since a gen AI tool that is trained on exactly one work is probably somewhat less likely to have a real use beyond a replacement for that work.)

comex
The judge actually agreed with your first paragraph:

> This order doubts that any accused infringer could ever meet its burden of explaining why downloading source copies from pirate sites that it could have purchased or otherwise accessed lawfully was itself reasonably necessary to any subsequent fair use. There is no decision holding or requiring that pirating a book that could have been bought at a bookstore was reasonably necessary to writing a book review, conducting research on facts in the book, or creating an LLM. Such piracy of otherwise available copies is inherently, irredeemably infringing even if the pirated copies are immediately used for the transformative use and immediately discarded.

(But the judge continued that "this order need not decide this case on that rule": instead he made a more targeted ruling that Anthropic's specific conduct with respect to pirated copies wasn't fair use.)

tantalor
The analogy to training is not writing a play based on the work. It's more like reading (experiencing) the work and forming memories in your brain, which you can access later.

I'm allowed to hear a copyrighted tune, and even whistle it later for my own enjoyment, but I can't perform it for others without license.

AlienRobot
This is nonsense, in my opinion. You aren't "hearing" anything. You are literally creating a work, in this case, the model, derived from another work.

People need to stop anthropomorphizing neural networks. It's software, and software is a tool, and a tool is used by a human.

adinisom
Humans are also created/derived from other works, trained, and used as a tool by humans.

It's interesting how polarizing the comparison of human and machine learning can be.

tantalor
It is easy to dismiss, but the burden of proof would be on the plaintiff to prove that training a model is substantially different from what the human mind does. Good luck with that.

AlienRobot
That makes no sense as a default assumption. It's like saying FSD is like a human driver. If it's a person, why doesn't it represent itself in court? What wages is it being paid? What are the labor rights of AI? How is it that the AI is only human-like when it's legally convenient?

What makes far more sense is saying that someone, a human being, took copyrighted data and fed it into a program that produces variations of the data it was fed. This is no different from a photoshop filter, and nobody would ever need to argue in court that a photoshop filter is not a human being.

fallingknife
But if you did pirate the book, and let's say it cost $50, and then you used it to write a play based on that book and made $1 million selling that, only the $50 loss to the publisher would be relevant to the lawsuit. The fact that you wrote a non-infringing play based on it and made $1 million would be irrelevant to the case. The publisher would have no claim to it.

klabb3
> But those training the LLMs are still using the works, and not just to discuss them, which I think is the point of fair use doctrine.

Worse, they’re using it for massive commercial gain, without paying a dime upstream to the supply chain that made it possible. If there is any purpose of copyright at all, it’s to prevent making money from someone else’s intellectual work. The entire thing is based on economic pragmatism: mere copying obviously does not deprive the creator of the work itself, so the only justification in the first place is to protect those who seek to sell immaterial goods, by allowing them to decide how their work can be used.

Coming to the conclusion that you can “fair use” yourself out of paying for the most critical part of your supply chain makes me upset for the victims of the biggest heist of the century. But in the long term it can have devastating chilling effects, where information silos become the norm and various forms of DRM grow even more draconian.

Plus, fair use bypasses any licensing, no? Meaning even if today you clearly specify in the license that your work cannot be used in training commercial AI, it isn’t legally enforceable?

growse
> Worse, they’re using it for massive commercial gain, without paying a dime upstream to the supply chain that made it possible. If there is any purpose of copyright at all, it’s to prevent making money from someone else’s intellectual work.

This makes no sense. If I buy and read a book on software engineering, and then use that knowledge to start a career, do I owe the author a percentage of my lifetime earnings?

Of course not. And yet I've made money with the help of someone else's intellectual work.

Copyright is actually pretty narrowly defined for _very good reason_.

klabb3
> If I buy and read a book on software engineering

You're comparing yourself, an individual purchasing one copy of a book, to a multi-billion-dollar company systematically ingesting books for profit without any compensation, let alone proportional compensation?

> do I owe the author a percentage of my lifetime earnings?

No, but you are a human being. You have a completely different set of rights from a corporation, or a machine. For very good reason.

growse
Does copyright law apply differently to humans vs. organisations?

> without any compensation,

Didn't Anthropic buy the books?

lurkshark
If you pirate a book on software engineering and then use that knowledge to start a career, do you owe the author the royalties they would be paid had you bought the book?

If the career you start isn't software engineering directly but instead re-teaching the information you learned from that book to millions of paying students, is the regular royalty payment for the book still fair?

ticulatedspline
Definitely seems reasonable to say "you can train on this data but you have to have a legal copy"

Personally I like to frame most AI problems by substituting a human (or humans) for the AI. Works pretty well most of the time.

In this case, if you hired a bunch of artists/writers who somehow had never seen a Disney movie, and to train them to make crappy Disney clones you made them watch all the movies, it would certainly be legal to do so, but only if they had legit copies in the training room. Pirating the movies would be illegal.

Though the downside is it does create a training moat. If you want to create the super-brain AI that's conversant with the corpus of copyrighted human literature, you're going to need a training library worth millions.

martin-t
> Personally I like to frame most AI problems by substituting a human (or humans) for the AI. Works pretty well most of the time.

Human time is inherently valuable, computer time is not.

The issue with LLMs is that they allow doing things at a massive scale which would previously be prohibitively time-consuming. (You could argue otherwise, but then how much electricity is worth one human life?)

If I "write" a book by taking another and replacing every word with a synonym, that's obviously plagiarism and obviously copyright infringement. How about also changing the word order? How about rewording individual paragraphs while keeping the general structure? It's all still derivative work, but as you make it less detectable, the time and effort required grows until it becomes uneconomical. An LLM can do it cheaply. It can mix and match parts of many works, but it's all still a derivative of those works combined. After all, if it weren't, it would produce equally good output with a tiny fraction of the training data.
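
For illustration, a minimal sketch of that word-for-word substitution; the synonym table is a tiny made-up stand-in, and a real rewrite aimed at evading detection would be far more elaborate:

```python
# Word-for-word synonym substitution: the sentence structure, and thus the
# original expression, survives the rewrite intact.
SYNONYMS = {
    "quick": "fast", "brown": "umber", "fox": "vixen",
    "jumps": "leaps", "over": "above", "lazy": "idle", "dog": "hound",
}

def reword(sentence: str) -> str:
    # Swap each word for a synonym where one is known; keep everything else.
    return " ".join(SYNONYMS.get(word, word) for word in sentence.split())

print(reword("the quick brown fox jumps over the lazy dog"))
# -> "the fast umber vixen leaps above the idle hound"
```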

The outcome is that a small group of people (those making LLMs and selling access to their output) get to make huge amounts of money off of the work of a group that is several orders of magnitude larger (essentially everyone who has written something on the internet) without compensating the larger group.

That is fundamentally exploitative, whether the current laws accounted for that situation or not.

johnnyanmac
That's a part of the issue. I'm not sure if this has happened in visual arts, but there is in fact precedent against hiring a sound-alike of the person you actually want. You can't be in talks with Scarlett Johansson, reject her, and then hire a sound-alike and say "talk like Scarlett". It's pretty clear at that point what you want, but you didn't want to pay talent for it.

I see elements of that here: buying copyrighted works not to be exposed to them and inspired, nor to utilize the author's talents, but to fuel a commercialization of sound-alikes.

lesuorac
> You can't be in talks with Scarlett Johansson, reject her, and then hire a sound-alike and say "talk like Scarlett"

Keep in mind, the Authors in the lawsuit are not claiming the _output_ is copyright infringement, so Alsup isn't deciding that.

Dracophoenix
> but there is in fact precedent against hiring a sound-alike of the person you actually want. You can't be in talks with Scarlett Johansson, reject her, and then hire a sound-alike and say "talk like Scarlett". It's pretty clear at that point what you want, but you didn't want to pay talent for it.

You're referencing Midler v. Ford Motor Co. in the 9th Circuit. That case largely applies to California, not the whole nation. Even then, it would take only one Supreme Court case to overturn it.

alganet
What you are describing happened and they got sued:

https://en.wikipedia.org/wiki/Mickey_Mouse#Walt_Disney_Produ...

I'm on the Air Pirates side for the case linked, by the way.

However, AI is not a parody. It's not adding to the cultural expression like a parody would.

Let's forget all the law stuff and these silly hypotheticals. Let's think of humanity instead:

Is AI contributing to education and/or culture _right now_, or is it trying to make money? I think they're trying to make money.

fallingknife
> It's not adding to the cultural expression like a parody would.

Says who?

> Is AI contributing to education and/or culture _right now_, or is it trying to make money?

How on earth are those things mutually exclusive? Also, whether or not it's being used to make money is completely irrelevant to whether or not it is copyright infringement.

alganet
> Says who?

Artists.

https://en.wikipedia.org/wiki/SAG-AFTRA

> How on earth are those things mutually exclusive?

Put those on a spectrum and rethink what I said.

> completely irrelevant to whether or not it is copyright infringement

_Again_, leave aside law minutiae and hypotheticals.

shagie
> > Says who?

> Artists.

> https://en.wikipedia.org/wiki/SAG-AFTRA

Do you have a link that has their stance on how AI is harming culture? The best I could find is https://www.sagaftra.org/contracts-industry-resources/member...

I can't find anything in there or its linked articles about culture. I do find quite a bit about synthetic performers and digital replicas and making sure that people who do voice acting don't have their performance used to generate material that is done at a discounted rate and doesn't reimburse the performer.

https://www.sagaftra.org/ongoing-fight-ai-protections-makes-...

> Protective A.I. guardrails for actors who work in video games remain a point of contention in the Interactive Media Agreement negotiations which have been ongoing from October 2022 until last month’s strike. Other A.I.-related panels Crabtree-Ireland participated in included a U.S. Department of Justice and Stanford University co-hosted event about promoting competition in A.I., as well as a Vanderbilt University summit on music law and generative A.I. SAG-AFTRA Executive Vice President Linda Powell discussed the interactive negotiations and A.I.’s many implications for creatives during her keynote speech at an Art in the Age of A.I. symposium put on by Villa Albertine at the French Embassy.

> She said A.I. represents “a turning point in our culture,” adding, “I think it’s important that we be participants in it and not passengers in it ... We need to make our voices known to the handful of people who are building and profiting off of this brave new world.”

This doesn't indicate that it's good or bad, but rather that they want to make sure that people are in control of it and are compensated for the works that are created from their performance.

alganet
> they want to make sure that people are in control of it and people are compensated for the works that are created

Nice! Now you just need to connect the dots from your own conclusion to my initial statement.

> Definitely seems reasonable to say "you can train on this data but you have to have a legal copy"

How many copies? They're not serving a single client.

Libraries need to have multiple e-book licenses, after all.

ticulatedspline
In the human training case, probably even a store-bought DVD would still run afoul of that licensing issue. That's a broader topic of audience, and I didn't want to muddy the analogy with that detail.

It changes the definition of what a "legal copy" is but the general idea that the copy must be legal still stands.

Fair enough.

simmerup
Depends whether you actually agree it's transformative.

lesuorac
For textual purposes it seems fairly transformative.

If you train an LLM on Harry Potter and ask it to generate a story that isn't Harry Potter, then it's not a replacement.

However, if you train a model on stock imagery and use it to generate stock imagery then I think you'll run into an issue from the Warhol case.

sidewndr46
Wasn't that just over an arrangement of someone else's photographs?

lesuorac
https://en.wikipedia.org/wiki/Andy_Warhol_Foundation_for_the...

I wouldn't call it that. Goldsmith took a photograph of Prince, which Warhol used as a reference to create an illustration. Vanity Fair then chose to license Warhol's print instead of Goldsmith's photograph.

So, despite the artwork being visually transformative (silkscreen vs. photograph), the actual use was not transformed.

johnnyanmac
The nature of how they store data makes it not okay in my book. You massage the data enough and you can generate something that seems infringement-worthy.

ticulatedspline
For closed models the storage problem isn't really a problem: they can be judged by what they produce, not how they store it, since you don't have access to the actual data. That said, open-weight LLMs are probably screwed. If enough of the work remains in the weights that it can be extracted (even without ever talking to the LLM), then the weight file itself represents a copy of the work that's being distributed. So enjoy these competent run-at-home models while you can; they're on track for extinction.
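
A rough sketch of the kind of extraction test implied here, using the Hugging Face transformers API (the model, passage, and prefix length are arbitrary placeholders, not anyone's actual methodology): prompt the model with the opening tokens of a known passage and measure how much of the continuation comes back verbatim.

```python
# Toy memorization probe: does a model reproduce a known passage when
# prompted with its opening tokens?
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for any open-weight model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

passage = ("It was the best of times, it was the worst of times, "
           "it was the age of wisdom, it was the age of foolishness")
ids = tok(passage, return_tensors="pt").input_ids

prefix_len = 8  # tokens of the passage used as the prompt
prompt, target = ids[:, :prefix_len], ids[:, prefix_len:]

out = model.generate(prompt, max_new_tokens=target.shape[1],
                     do_sample=False, pad_token_id=tok.eos_token_id)
continuation = out[:, prefix_len:]

# Fraction of the held-out tokens reproduced exactly by greedy decoding.
n = min(continuation.shape[1], target.shape[1])
overlap = (continuation[:, :n] == target[:, :n]).float().mean().item()
print(f"verbatim token overlap: {overlap:.0%}")  # high overlap suggests memorization
```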

ninetyninenine
Why doesn’t this apply to humans? If I memorize something such that it can be extracted, did I violate the law? It’s only if I choose to allow such extraction to occur that I’m in violation of the law, right?

So if I or an LLM simply doesn’t allow said extraction to occur, memorization and copying are not against the law.

ranger_danger
I wonder if https://en.wikipedia.org/wiki/Illegal_number comes into play here.

thedevilslawyer
What's the steelman case that it is transformative? Because prima facie, it seems to only produce original, "intelligent" output.

almatabata
If a publisher adds a "no AI training" clause to their contracts, does this ruling render it invalid?

jxdxbx
You don't need a license for most of what people do with traditional, physical copyrighted copies of works: read them, play a DVD at home, etc. Those things are outside the scope of copyright. But you do need a license to make copies, and ebooks generally come with licensing agreements, again because to read an ebook, you must first make a brand-new copy of it. Anyway, as a result, physical books just don't have "licenses" to begin with, and if they tried, they'd be unenforceable, since you don't need to "agree" to any "terms" to read a book.

dragonwriter
> If a publisher adds a "no AI training" clause to their contracts?

This ruling doesn't say anything about the enforceability of a "don't train AI on this" contract, so even if the logic of this ruling became binding precedent (trial court rulings aren't), such clauses would be as valid after as they are today. But contracts only affect people who are parties to the contract.

Also, the damages calculations for breach of contract are different than for copyright infringement; infringement allows actual damages and infringer's profits (or statutory damages, if greater than the provable amount of the others), but breach of contract would usually be limited to actual damages ("disgorgement" is possible, but unlike with infringer's profits in copyright, requires showing special circumstances.)

heavyset_go
Fair use overrides licensing.

AlanYx
Fair use "overrides" licensing in the sense that one doesn't need a copyright license if fair use applies. But fair use itself isn't a shield against breach of contract. If you sign a license contract saying you won't train on the thing you've licensed, the licensor still has remedies for breach of contract, just not remedies for copyright infringement (assuming the act is fair use).

almatabata
Thanks for clarifying.

bananapub
What contract? With who?

Meta at least just downloaded ENGLISH_LANGUAGE_BOOKS_ALL_MEGATORRENT.torrent and trained on that.

almatabata
I know, but the article mentions that a separate ruling will be made about that pirating.

quote: “We will have a trial on the pirated copies used to create Anthropic’s central library and the resulting damages,” Judge Alsup wrote in the decision. “That Anthropic later bought a copy of a book it earlier stole off the internet will not absolve it of liability for theft but it may affect the extent of statutory damages.”

This tells me Anthropic acquired these books legally afterwards. I was asking if, during that purchase, the seller could add a no-training clause to the sales contract.

shagie
What contracts? And would it run afoul of first sale doctrine?

https://en.wikipedia.org/wiki/First-sale_doctrine

> The doctrine was first recognized by the Supreme Court of the United States in 1908 (see Bobbs-Merrill Co. v. Straus) and subsequently codified in the Copyright Act of 1909. In the Bobbs-Merrill case, the publisher, Bobbs-Merrill, had inserted a notice in its books that any retail sale at a price under $1.00 would constitute an infringement of its copyright. The defendants, who owned Macy's department store, disregarded the notice and sold the books at a lower price without Bobbs-Merrill's consent. The Supreme Court held that the exclusive statutory right to "vend" applied only to the first sale of the copyrighted work.

> Today, this rule of law is codified in 17 U.S.C. § 109(a), which provides:

> Notwithstanding the provisions of section 106 (3), the owner of a particular copy or phonorecord lawfully made under this title, or any person authorized by such owner, is entitled, without the authority of the copyright owner, to sell or otherwise dispose of the possession of that copy or phonorecord.

---

If I buy a copy of a book, you can't limit what I can do with the book beyond what copyright already restricts.

ninetyninenine
Agreed. If I memorize a book and am deployed into the world to talk about what I memorized, that is not a violation of copyright. Which is reasonable logically, because this is essentially what an LLM is doing.

bonoboTP
You can talk about it, but you can't sell tickets to an event where you recite from memory all the poems written by someone else without their permission.

LLMs may sometimes reproduce exact copies of chunks of text, but I would say it also matters that this is an irrelevant use case: it's not the main value proposition that drives LLM company revenues, it's not the use case that's marketed, and it's not the use case people use them for in real life.

layer8
It might be different if you are a commercial product which couldn’t have been created without incorporating the contents of all those books.

Humans, animals, hardware and software are treated differently by law because they have different constraints and capabilities.

ninetyninenine
But a commercial product is reaching parity with human capability.

Let's be real: humans have special treatment (more special than animals, as we can eat and slaughter animals but not other humans) because WE created the law to serve humans.

So in terms of being fair across the board LLMs are no different. But there's no harm in giving ourselves special treatment.

layer8
Generative AIs are very different from humans because they can be copied losslessly and scaled tremendously, and also have no individual liability, nor awareness of how similar their output is to something in their training material. They are very different in constraints and capabilities from humans in all sorts of ways. For one, a human will likely never reproduce a book they read without being aware that that’s what they are doing.
martin-t
Except you can't do it at a massive scale. LLMs both memorize at a scale bigger than thousands, probably millions, of humans AND reproduce at an essentially unlimited scale.

And who gets the money? Not the original author.

doctorpangloss
It’s similar to the Google Books ruling, which Google lost. Anthropic also lost. TechCrunch and others are very aspirational here.

philipkglass
Do you mean Authors Guild, Inc. v. Google, Inc.? Google won that case:

https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....

Maybe there's another big Google Books lawsuit that Google ultimately lost, but I don't know which one you mean in that case.

doctorpangloss
See, but if you ask a copyright attorney: Google lost. This is what I mean by aspirational. They won something, in very similar circumstances to Anthropic, "fair use," but everything else that made what they were doing a practical reality instead of purely theoretical required negotiation with the Authors Guild, and indeed, they are not doing what they wanted to do, right? Anthropic still has to go to trial, they had to pirate the books to train, and they will not win on their right to commercialize the results of training, because neither did Google. So what good is the fair use ruling, besides allowing OpenAI v. NYTimes to proceed a little longer?
dragonwriter
> Anthropic has to go to trial still, they had to pirate the books to train

They did not have to; they had an alternative means available (and used it for many of the books): buying physical copies and destructively scanning them.

> and they will not win on their right to commercialize the results of training

That seems an unwarranted conclusion, at best.

> so what good is the Fair Use ruling

If nothing else, assuming the logic of the ruling is followed by the inevitable appeals court decision and becomes binding precedent, it provides a clear road to legally training LLMs on books without copyright issues (combination of "training is fair use" and "destructive scanning for storage and searchability is fair use"), even if the pirating of a subset of the source material in this case were to make Anthropic's existing products prohibited (which I think you are wrong to think is the likely outcome.)

SoKamil
What if I overfit my LLM so it spits out copyrighted work with special prompting? Where to draw the line in training?
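
A toy way to see how overfitting shades into copying (not how production LLMs are built; the text and context length here are arbitrary): an order-k character model "trained" on a single text has exactly one successor for every context, so generation is pure replay.

```python
# Overfit by construction: one training text, long context, deterministic
# successors. "Generation" can only replay the training data.
from collections import defaultdict

text = ("It is a truth universally acknowledged, that a single man in "
        "possession of a good fortune, must be in want of a wife.")
k = 8  # context length; large k + a tiny corpus = total memorization

table = defaultdict(str)
for i in range(len(text) - k):
    table[text[i:i + k]] = text[i + k]  # one successor per k-character context

out = text[:k]  # "prompt" the model with the opening characters
while table[out[-k:]]:  # an unseen context (empty successor) ends generation
    out += table[out[-k:]]

print(out == text)  # True: the overfit "model" reproduces its training text
```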

bonoboTP
If you do something else, the result may be something else. The line is drawn by the application of subjective common sense by the judge, just as it is every time.

ninetyninenine
I mean the human brain can memorize things as well and it’s not illegal. It’s only illegal if said memorized thing is distributed.

tartoran
Humans can memorize only a few texts in comparison, so they'd not be scalable in the same sense LLMs are.

martin-t
Humans don't scale. LLMs do.

Even if LLMs were actual human-level AI (they are not - by far), a small bunch of rich people could use them to make enormous amounts of money without putting in the enormous amounts of work humans would have to.

All the while "training" (= precomputing transformations which among other things make plagiarism detection difficult) on work which took enormous amounts of human labor without compensating those workers.

mrguyorama
Because humans have rights

AI models do not.

ninetyninenine
They used to say the same thing about black people.

NoOn3
Exactly. If someone wants to compare AI models with humans, maybe they should then give AI models the right to vote and other rights.

veggieroll
BRB, I'm going to download all the TV shows and movies to train my vision model. Just to be sure it's working properly, I have to watch some for debugging purposes.

ncruces
You need to buy one copy of each for fair use to apply.

toomuchtodo
Let everyone donate their DVDs and other physical media. You don’t need to buy it, you just need to possess the media.

veggieroll
Indeed, I foresee a "training dataset consortium" arising out of this, whereby a bunch of companies team up to buy one copy of everything and then share it for training amongst themselves (e.g., by reselling the entire library to each other for $1).

toomuchtodo
Like an Archive? Connected to the Internet?

veggieroll
Genius!
