You may be interested in:
https://www.anthropic.com/research/sleeper-agents-training-d...
https://arxiv.org/abs/2404.13660
This is what mechanistic interpretability studies are trying to achieve, and it's not yet realistically possible for a general case.
This is a good insight. There’s also a similar insight about compilers back in the days before AV.. we will have AV LLMs etc… basically reinvent everything for the new stack.
I was just talking to somebody at work about a "Trusting Trust" style attack from LLMs. I will remain deeply suspicious of them
Isn’t an LLM basically a program that is impossible to virus scan and therefore can never be safely given access to any capable APIs?
For example: I’m a nice guy and spend billions on training LLMs. They’re amazing and free and I hand out the actual models for you all to use however you want. But I’ve trained it very heavily on a specific phrase or UUID or some other activation key being a signal to <do bad things, especially if it has console and maybe internet access>. And one day I can just leak that key into the world. Maybe it’s in spam, or on social media, etc.
How does the community detect that this exists in the model? Ie. How does the community virus scan the LLM for this behaviour?