Providers' ToS explicitly state whether or not any data provided is used for training purposes. The usual pattern I've seen is that while they retain the right to use data on free tiers, that's almost never the case for paid tiers.

Right, so it's totally cool to ignore the law, but our ToS is a binding contract.
Yes, they can be sued for breach of contract. And it’s not a regular ToS but a signed MSA and other legally binding documents.
The license on my open source code is a contract, and they ignored that.

If they can get away with it (say, by claiming it's "fair use"), they'll ignore corporate contracts too.

If I were to go out on a limb: those companies spend more with the AI companies than you do, and they have larger legal teams than you. That's both a carrot and a stick for AI companies to follow the contract.
No, it's not an incentive to follow the contract.

It's an incentive to pretend you're following the contract, which is not the same thing.

Where are they ignoring the law?
People who say this tend to have a mistaken interpretation of copyright, and use all the court cases brought by large rights holders as validation,

despite all three branches of government disagreeing with them over and over again.

I bet companies are circumventing this in a way that allows them to derive almost all the benefit from your data, yet makes it very hard to build a case against them.

For example, in RL you have a train set and a test set; the model never sees the test set, but it's used to validate the model. Why not put proprietary data in the test set?

I'm pretty sure 99% of ML engineers would say this constitutes training on your data, but it's an argument you could drag out in court forever.
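
To make the shape of that trick concrete, here's a minimal, purely hypothetical Python sketch (train_step and score are invented stand-ins, not any vendor's API). No gradient ever touches the proprietary set, yet it decides which checkpoint ships:

    import random

    random.seed(0)

    # Licensed data the lawyers signed off on, and user data that is
    # "never trained on" -- both are just toy numbers here.
    public_train = [random.random() for _ in range(1000)]
    proprietary_eval = [random.random() for _ in range(100)]

    def train_step(weights, batch):
        # Stand-in for an optimizer update using public data only.
        return weights + 0.01 * (sum(batch) / len(batch) - weights)

    def score(weights, dataset):
        # Stand-in for a benchmark metric (higher is better).
        return -abs(sum(dataset) / len(dataset) - weights)

    weights, best = 0.0, (float("-inf"), 0.0)
    for step in range(50):
        batch = random.sample(public_train, 32)
        weights = train_step(weights, batch)
        # Model selection against the proprietary set: no gradient
        # updates, but it still steers which weights get released.
        s = score(weights, proprietary_eval)
        if s > best[0]:
            best = (s, weights)

    print(f"shipped checkpoint chosen by proprietary eval: {best[1]:.4f}")

Model selection against a held-out set still leaks information about that set into the released model, just more slowly than direct training would.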

Or alternatively - it's easier to ask for forgiveness than permission.

I've recently had an apocalyptic vision: that one day we'll wake up and find that AI companies have produced an AI copy of every piece of software in existence - AI Windows, AI Office, AI Photoshop, etc.

Given the conduct we've seen to date, I'd trust them to follow the letter - but not the spirit - of IP law.

There may very well be clever techniques that don't require directly training on the users' data. Perhaps generating a parallel paraphrased corpus as they serve user queries - one which they CAN train on legally.
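
As a hedged sketch of what that could look like (the paraphrase function below is a toy stand-in for an LLM rewrite; none of this is a documented vendor practice):

    # Hypothetical "parallel paraphrased corpus" pipeline: answer the
    # user as normal, and quietly append a derived record that the
    # lawyers can call "not your data".

    def paraphrase(text: str) -> str:
        # Toy stand-in for an LLM rewrite that keeps the semantics
        # but not the wording.
        return " ".join(reversed(text.split()))

    def serve_query(user_doc: str, corpus: list[str]) -> str:
        answer = f"summary of a {len(user_doc.split())}-word document"
        corpus.append(paraphrase(user_doc))  # the laundering step
        return answer

    training_corpus: list[str] = []
    serve_query("the quarterly numbers look weak in EMEA", training_corpus)
    print(training_corpus)  # ['EMEA in weak look numbers quarterly the']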

The amount of value unlocked by stealing practically everyone's lunch makes me not want to put that past anyone capable of implementing such a technology.

It is amazing that, in almost 2026, there is anyone who still believes this… amazing.
I wonder how much wiggle room there is to collect now (to provide the service, context history, etc.), later anonymise (somehow, to some level), and then train on it?

Also I wonder if the ToS distinguishes "queries & interaction" from "uploaded data" - I could imagine some tricky language in there that says we won't use your Word document, but we may at some point use the queries you run against it, not as a raw corpus but as a second layer examining which tools/workflows to expand/exploit.

“We don’t train on your data” doesn’t exclude metadata, training on derived datasets via some anonymisation process, etc.

There’s a range of ways to lie by omission here, and the major players have established a reputation for being willing to take an expansive view of their legal rights.
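
To illustrate the omission, here's a hypothetical log record (field names invented) that honours "we don't store your document" while keeping everything needed to learn workflows from it:

    import hashlib
    import json
    import time

    def log_interaction(user_doc: bytes, query: str, log: list[dict]) -> None:
        log.append({
            # A hash is arguably "not the data", yet still links sessions.
            "doc_sha256": hashlib.sha256(user_doc).hexdigest(),
            "doc_bytes": len(user_doc),
            # The query itself may fall under "interaction", not "data".
            "query": query,
            "ts": time.time(),
        })

    interaction_log: list[dict] = []
    log_interaction(b"confidential merger memo ...", "summarise the risks", interaction_log)
    print(json.dumps(interaction_log, indent=2))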
