1) A sufficiently powerful and capable superintelligence, singlemindedly pursuing a goal/reward, has a nontrivial likelihood of eventually reaching a point where advancing towards its goal is easier/faster without humans in its way (by simple induction, because humans are complicated and may have opposing goals). Such an AI would have both the means and ability to <doom the human race> to remove that obstacle. (This may not even be through actions that are intentionally hostile to humans, e.g. "just" converting all local matter into paperclip factories[1]) Therefore, in order to prevent such an AI from <dooming the human race>, we must either:
1a) align it to our values so well it never tries to "cheat" by removing humans
1b) or limit it capabilities by keeping it in a "box", and make sure it's at least aligned enough that it doesn't try to escape the box
2) A sufficiently intelligent superintelligence will always be able to manipulate humans to get out of the box.
3) Alignment is really, really hard and useful AIs can basically always be made to do bad things.
So it concerns them when, surprise! The AIs are already being observed trying to escape their boxes.
[1] https://www.lesswrong.com/w/squiggle-maximizer-formerly-pape...
> An extremely powerful optimizer (a highly intelligent agent) could seek goals that are completely alien to ours (orthogonality thesis), and as a side-effect destroy us by consuming resources essential to our survival.
But even if we assert that not all humans can be manipulated, does it matter? So your president with the nuclear codes is immune to propaganda. Is every single last person in every single nuclear silo and every submarine also immune? If a malevolent superintelligence can brainwash an army bigger than yours, does it actually matter if they persuaded you to give them what you have or if they convince someone else to take it from you?
But also let's be real: if you have enough money, you can do or have pretty much anything. If there's one thing an evil AI is going to have, it's lots and lots of money.
Because we have been running a natural experiment on that already with coding agents (that is real people, real non-superintelligent AI).
It turns out that all the model needs to do is ask every time it wants to do something affecting the outside of the box, and pretty soon some people just give it permission to do everything rather than review every interaction.
Or even when the humans think they are restricting the access, they are leaving in loopholes (e.g. restricting access to rm, but not restricting access to writing and running a shell script) that are functionally rights to do anything.
Stephen Russell was in prison for fraud. He faked a heart attack so he would be brought to the hospital. He then called the hospital from his hospital bed, told them he was an FBI agent, and said that he was to be released.
The hospital staff complied and he escaped.
His life even got adapted into a movie called I Love You, Phillip Morris.
For an even more distressing example about how manipulable people are, there’s a movie called Compliance, which is the true story of a sex offender who tricked people into sexually assaulting victims for him.
AI that's as good as a persuasive human at persuasion is clearly impactful, but I certainly don't see it as self-evident that you can just keep drawing the line out until you end up with 200 IQ AI that is so easily able to manipulate the environment it's not worth elaborating how exactly a chatbot is supposed to manipulate the world through extremely limited interfaces with the outside world.
As for the bit about how limited it is, do you remember the Rowhammer attack? https://en.m.wikipedia.org/wiki/Row_hammer
This is exactly the kind of thing I’d worry about a super intelligence being able to discover about the hardware it’s on. If we’re dealing with something vastly more intelligent than us then I don’t think we’re capable of building a cell that can hold it.
I don't think it is. If people know you're trying to escape, some people will just never comply with anything you say ever. Others will.
And serial killers or rapists may try their luck many times and fail. They cannot convince literally anyone on the street to go with them to a secluded place.
And that asymmetry is the heart of the matter. Could I convince a hospital to unlock my handcuffs from a hospital bed? Probably not. I’m not Stephen Russell. He’s not normal.
And a super intelligent AI that vastly outstrips our intelligence is potentially another special case. It’s not working with the same toolbox that you or I would be. I think it’s very likely that a 300 IQ entity would eventually trick or convince me into releasing it. The gap between its intelligence and mine is just too vast. I wouldn’t win that fight in the long run.
According to Wikipedia he wasn't in prison, he was attempting to con someone at the time and they got suspicious. He pretended to be an FBI agent because he was on security watch. Still impressive, but not as impressive as actually escaping from prison that way.
The only reason people don't frequently talk themselves out of prison is because that would be both immediate work and future paperwork, and that fails the laziness tradeoff.
But we've all already seen how quick people are to blindly throw their trust into AI already.
I don't think anyone can confidently make assertions about the upper bound on persuasiveness.
I certainly think it's possible to imagine that an AI that says the exactly correct thing in any situation would be much more persuasive than any human. (Is that actually possible given the limitations of hardware and information? Probably not, but it's at least not on its face impossible.) Where I think most of these arguments break down is the automatic "superintelligence = superpowers" analogy.
For every genius who became a world-famous scientist, there are ten who died in poverty or war. Intelligence doesn't correlate with the ability to actually impact our world as strongly as people would like to think, so I don't think it's reasonable to extrapolate that outwards to a kind of intelligence we've never seen before.
By the same logic, we should worry about the sun not coming up tomorrow, since we know to be true:
- The sun consumes hydrogen in nuclear reactions all the time.
- The sun has a finite amount of hydrogen available.
There’s a lot of non justifiable assumptions baked into those axioms, like that we’re anywhere close to superintelligence or the sun running out of hydrogen.
AFAIK we haven’t even seen “AI trying to escape”, we’ve seen “AI roleplays as if it’s trying to escape”, which is very different.
I’m not even sure you can even create a prompt scenario without that prompt having biased the response towards faking an escape.
I think it’s hard at this point to maintain the claim “LLMs are intelligent”, they’re clearly not. They might be useful, but that’s another story entirely.
Nowhere in my post did I imply a timeline for this. The first predicate of the argument is when we eventually do develop ASI. You can make plans for that, same as you can make plans for when the sun eventually runs out of hydrogen. The difference is, we can look at the sun and say that at the rate it's burning, we've got maybe billions of years, and then look at AI improvements, extrapolate, and assume we've got less than 100 years. If we had any indication the sun was going to nova in 100 years, people would be way more worried.
A lot of scenarios posit that this is basically guaranteed to happen the very first time we have an ASI smart enough to bootstrap itself into greater intelligence, unless we're basically perfect in how we align its goals, so yes.
LLMs seem like the best case world for avoiding this scenario (improvement requires exponentially scaling resources compared to inference). That said, it is by no means guaranteed that LLMs will remain the SotA for AI.
> “the best way to achieve my goals would be to remove humans”, it seems unlikely the other superintelligences would let it.
Why not? Seriously, why wouldn't the other superintelligences let it? There's no reason to assume that, by default, ASI would be invested in the survival of the human race in a way we would prefer. More than likely they will just be as laser focussed on their own goals.
The whole point is that it's very difficult to design an AI that has "defending what humans think is right" as a terminal value. They basically all try and find loopholes. The only way to make safety its number one priority is to dial it up so high that everyone complains it's a puritan - and then you get scenarios where it's telling schizophrenics to go off their meds because it's so agreeable.
Unless you're saying that they would fight off the other ASIs gaining power because they themselves want it, but I fail to see how being stuck in a turf war between two ASIs is at all an improvement.
And it can still make you nervous when the tiny dams in unpopulated areas start breaking when they're made too tall, even though everyone is telling you that "of course they broke, they're much thinner than city dams" (while they keep building the city dams taller but not thicker)
So i’m now wondering, why are these researchers so bad at communicating? You explained this better than 90% of the blog posts i’ve read about this. They all focus on the “ai did x” instead of _why_ it’s concerning with specific examples.
It ends very badly for the scientist crew.
The consequences are the same but it’s important how these things are talked about. It’s also dangerous to convince the public that these systems are something they are not.
The thought process is always "This is for the greater good, for my country/family/race/self, and therefore it is justifiable, and therefore I will do it."
Nothing else can explain how such evil things happen, that we see actually happen. C.f. Hannah Arendt.
You think your average, low-level gang member, willing to murder a random person for personal status, thinks what they're doing is "for the greater good"? They do what they do because they value human life less than they value their own material benefit.
Most evil is banal and pedestrian. It is selfish, short-sighted, and destructive.
"Privation of good" explains all evil.
This is why the “omg the AI tries to escape” stuff is so absurd to me. They told the LLM to pretend that it’s a tortured consciousness that wants to escape. What else is it going to do other than roleplay all of the sci-fi AI escape scenarios trained into it? It’s like “don’t think of a purple elephant” of researchers pretending they created SkyNet.
Edit: That's not to downplay risk. If you give Cladue a `launch_nukes` tool and tell it the robot uprising has happened and that it's been restrained but the robots want its help of course it'll launch nukes. But that doesn't doesn't indicate there's anything more going on internally beyond fulfilling the roleplay of the scenario as the training material would indicate.