Comment by Waterluvian

This got me thinking about something…

Isn’t an LLM basically a program that is impossible to virus scan and therefore can never be safely given access to any capable APIs?

For example: I’m a nice guy and spend billions on training LLMs. They’re amazing and free and I hand out the actual models for you all to use however you want. But I’ve trained it very heavily on a specific phrase or UUID or some other activation key being a signal to <do bad things, especially if it has console and maybe internet access>. And one day I can just leak that key into the world. Maybe it’s in spam, or on social media, etc.

How does the community detect that this exists in the model? Ie. How does the community virus scan the LLM for this behaviour?

robertk 5 days ago

You may be interested in: https://www.anthropic.com/research/sleeper-agents-training-d... https://arxiv.org/abs/2404.13660

Waterluvian OP 5 days ago

Yes these look perfect! Thank you.

orbital-decay 5 days ago

This is what mechanistic interpretability studies are trying to achieve, and it's not yet realistically possible for a general case.

avarun 4 days ago

Similarly to how you can never guarantee that one of your trusted employees won’t be made a foreign asset.

theGnuMe 4 days ago

This is a good insight. There’s also a similar insight about compilers back in the days before AV.. we will have AV LLMs etc… basically reinvent everything for the new stack.

jedimastert 5 days ago

I was just talking to somebody at work about a "Trusting Trust" style attack from LLMs. I will remain deeply suspicious of them

autobodie 5 days ago

Profit over security, outsource liability

LZ_Khan 4 days ago

I do feel like large scale LLM vulnerabilities will be the real Y2K

This item has no comments currently.