They're already using synthetic data generated by LLMs to further train LLMs. Of course they won't hesitate to feed in "anonymized" data from user interactions. Who's going to stop them? Who could even prove it's happening? These companies have already been allowed to violate copyright and privacy on a historic, global scale.
How would they distinguish between real and fake data? It would be far too easy to pollute their models with nonsense.
I have no doubt that Microsoft has already classified the nature of my work and the quality of my code. It's probably "anonymized", of course. But make no mistake: they are watching everything you give them access to.
I mean, is it really ignoring copyright when copyright doesn't limit them in any way when it comes to training?
Tell that to all the people suing them for using their copyrighted work. In some cases the data was even pirated.
> Nothing is really preventing this though
The enterprise user agreement is preventing this.
Suggesting that AI companies will uniquely ignore the law or contracts is conspiracy theory thinking.
It already happened.
"Meta Secretly Trained Its AI on a Notorious Piracy Database, Newly Unredacted Court Docs Reveal"
https://www.wired.com/story/new-documents-unredacted-meta-co...
OpenAI has even admitted it can't build its tools without copyrighted material.
"‘Impossible’ to create AI tools like ChatGPT without copyrighted material, OpenAI says"
https://www.theguardian.com/technology/2024/jan/08/ai-tools-...
Though the porn Meta copied was, they claim, just for personal use, because clearly that's an important perk of being employed there:
https://www.vice.com/en/article/meta-says-the-2400-adult-mov...
Nothing is really preventing this though. AI companies have already proven they will ignore copyright and any other legal nuisance in order to train their models.