I found this statement particularly relevant:

  While it’s possible to demonstrate the safety of an AI for 
  a specific test suite or a known threat, it’s impossible 
  for AI creators to definitively say their AI will never act 
  maliciously or dangerously for any prompt it could be given.

This possibility is compounded exponentially when MCP[0] is used.

0 - https://github.com/modelcontextprotocol


I wonder if a safer approach to using MCP could involve isolating or sandboxing the AI. A similar scenario is discussed in Nick Bostrom's book Superintelligence, where the AI is only allowed to communicate via a single light signal, comparable to Morse code.

Nevertheless, in the book the AI managed, using only the light signal, to convince people to free it. Furthermore, it seems difficult to sandbox any AI that is allowed to access dependencies or external resources (e.g. the internet): you would have to dump something like the whole internet as data into the sandbox. Taking away such external resources, on the other hand, reduces the AI's usability.
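
To make that trade-off concrete, here is a minimal sketch (Python, with hypothetical names like ToolCall and execute_locally; this is not the MCP API) of the kind of gate such a sandbox implies: every tool call the model requests passes through an explicit allowlist, and anything that would reach external resources is simply refused.

  from dataclasses import dataclass

  # Tools the sandboxed model may call; nothing here touches the network.
  ALLOWED_TOOLS = {"read_local_file", "run_sandboxed_python"}

  @dataclass
  class ToolCall:
      name: str
      args: dict

  def gate(call: ToolCall) -> str:
      # Refuse anything outside the allowlist before it runs.
      if call.name not in ALLOWED_TOOLS:
          raise PermissionError(f"tool '{call.name}' is outside the sandbox")
      return execute_locally(call)

  def execute_locally(call: ToolCall) -> str:
      # Placeholder executor; a real sandbox would also drop network and
      # filesystem privileges at the OS level, not just filter by name.
      return f"ran {call.name} with {call.args}"

  print(gate(ToolCall("read_local_file", {"path": "notes.txt"})))

The usability cost is visible immediately: every external resource the gate refuses to expose is a capability the model no longer has.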

> it’s impossible for AI creators to definitively say their AI will never act maliciously or dangerously for any prompt it could be given

This is false: AI doesn't "act" at all unless you, the developer, use it for actions, in which case it is you, the developer, taking the action.

Anthropomorphizing AI with terms like "malicious" is misguided. These models can literally be implemented with a spreadsheet (first-order functional programming) plus the world's dumbest while-loop that appends the next token and restarts the computation. That alone should tell you there's nothing going on here beyond next-token prediction.

Saying an LLM can be "malicious" is not even wrong; it's just nonsense.
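
To make the "spreadsheet plus while-loop" picture concrete, here is a toy sketch; the lookup table is invented purely for illustration, and a real model replaces it with a learned function over billions of parameters, but the control flow is the same.

  # The "spreadsheet": a pure mapping from context to next token.
  NEXT_TOKEN_TABLE = {
      ("the",): "cat",
      ("the", "cat"): "sat",
      ("the", "cat", "sat"): "<eos>",
  }

  def predict_next_token(context):
      # Pure function of the context; no state, no goals.
      return NEXT_TOKEN_TABLE.get(tuple(context), "<eos>")

  def generate(prompt, max_tokens=16):
      tokens = list(prompt)
      while len(tokens) < max_tokens:        # the world's dumbest while-loop
          nxt = predict_next_token(tokens)
          if nxt == "<eos>":
              break
          tokens.append(nxt)                 # append and restart the computation
      return tokens

  print(generate(["the"]))   # -> ['the', 'cat', 'sat']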

> AI doesn't "act" at all unless you, the developer, use it for actions

This seems like a pointless definition of "act"? Someone else could use the AI for actions that affect me, in which case I'm very much worried about those actions being dangerous, regardless of precisely how you define the word "act".

> when they can literally be implemented with a spreadsheet

The financial system that led to the 2008 crisis basically was one big spreadsheet, and yet it would have been correct to be worried about it. "Malicious" maybe is a bit evocative, I'll grant you that, but if I'm about to be eaten by a lion, I'm less concerned about not mistakenly anthropomorphizing the lion and more about ensuring I don't get eaten. It _doesn't matter_ whether the AI has agency, or is just a big spreadsheet, or wants to do us harm, or is just sitting there. If it can do harm, it's dangerous.

You are right about 'malicious'. 'Dangerous', however, is a different matter.

Yeah, in that regard we should always treat it like a junior something. Very much like you can't expect your own kids to never do anything dangerous even if you tell them for years to be careful. I got used to picking my kid up from kindergarten with a new injury at least once a month.

I think it's very dangerous to use the term "junior" here because it implies growth potential, when in fact it's the opposite: you are using a finished product, and it won't get any better. AI is an intern, not a junior. All the effort you spend correcting it will leave the company, either as soon as you close your browser or whenever the manufacturer releases next year's model, and that model will be better regardless of how much time you waste on training this year's intern, so why even bother? Thinking of AI as a junior coworker is probably the least productive way of looking at it.

We should move well beyond human analogies. I have never met a human that would straight up lie about something, or build up so many deceptive tests that it might as well be lying.

Granted this is not super common in these tools, but it is essentially unheard of in junior devs.

> I have never met a human that would straight up lie about something

This doesn't match my experience. Consider high-profile things like the VW emissions scandal, where the control system was intentionally programmed to engage only during the emissions test. Or dictators. People are prone to lie when it's in their self-interest, especially for self-preservation. We have entire structures of government, such as courts, that try to resolve facts in the face of lying.

If we consider true-but-misleading statements, then politics, marketing, etc. come sharply into view.

I think the challenge is that we don't know when an LLM will generate untrue output, whereas we expect people to lie in certain circumstances. LLMs don't have clear self-interests, or the self-awareness to lie with intent. It's just useful noise.

There is an enormous difference between planned deception as part of a product and undermining your own product with deceptive reporting about its quality. The difference is collaboration and alignment. You might have evil goals, but if your developers are maliciously incompetent, no goal will be accomplished.

> Granted this is not super common in these tools, but it is essentially unheard of in junior devs.

I wonder if it's unheard of in junior devs because they're all saints, or because they're not talented enough to get away with it?

Incentives align against lying about what you built. You'd be found out immediately. There's no "shame" button with these chatbots.

Thanks! I'm very interested in mechanistic interpretability, specifically Anthropic's and Neel Nanda's work, so this impossibility of proving safety is a core concept for me.
