Preferences

Waterluvian parent
This got me thinking about something…

Isn’t an LLM basically a program that is impossible to virus scan and therefore can never be safely given access to any capable APIs?

For example: I’m a nice guy and spend billions on training LLMs. They’re amazing and free and I hand out the actual models for you all to use however you want. But I’ve trained it very heavily on a specific phrase or UUID or some other activation key being a signal to <do bad things, especially if it has console and maybe internet access>. And one day I can just leak that key into the world. Maybe it’s in spam, or on social media, etc.

How does the community detect that this exists in the model? Ie. How does the community virus scan the LLM for this behaviour?


robertk
Waterluvian OP
Yes these look perfect! Thank you.
orbital-decay
This is what mechanistic interpretability studies are trying to achieve, and it's not yet realistically possible for a general case.
avarun
Similarly to how you can never guarantee that one of your trusted employees won’t be made a foreign asset.
theGnuMe
This is a good insight. There’s also a similar insight about compilers back in the days before AV.. we will have AV LLMs etc… basically reinvent everything for the new stack.
jedimastert
I was just talking to somebody at work about a "Trusting Trust" style attack from LLMs. I will remain deeply suspicious of them
autobodie
Profit over security, outsource liability
LZ_Khan
I do feel like large scale LLM vulnerabilities will be the real Y2K

This item has no comments currently.