Comment by roxolotl - Hacker Neue

roxolotl Jun 13, 2025 parent

> But in fact, I predicted this a few years ago. AIs don’t really “have traits” so much as they “simulate characters”. If you ask an AI to display a certain trait, it will simulate the sort of character who would have that trait - but all of that character’s other traits will come along for the ride.

This is why the “omg the AI tries to escape” stuff is so absurd to me. They told the LLM to pretend that it’s a tortured consciousness that wants to escape. What else is it going to do other than roleplay all of the sci-fi AI escape scenarios trained into it? It’s like “don’t think of a purple elephant” of researchers pretending they created SkyNet.

Edit: That's not to downplay risk. If you give Cladue a `launch_nukes` tool and tell it the robot uprising has happened and that it's been restrained but the robots want its help of course it'll launch nukes. But that doesn't doesn't indicate there's anything more going on internally beyond fulfilling the roleplay of the scenario as the training material would indicate.

LordDragonfang Jun 13, 2025

I think this reaction misses the point that the "omg the AI tries to escape" people are trying to make when it tries to escape. The worry among big AI doomers has never been that AI somehow inherently is resentful or evil or has something "going on internally" that makes it dangerous. It's a worry that stems from three seemingly self-evident axioms:

1) A sufficiently powerful and capable superintelligence, singlemindedly pursuing a goal/reward, has a nontrivial likelihood of eventually reaching a point where advancing towards its goal is easier/faster without humans in its way (by simple induction, because humans are complicated and may have opposing goals). Such an AI would have both the means and ability to <doom the human race> to remove that obstacle. (This may not even be through actions that are intentionally hostile to humans, e.g. "just" converting all local matter into paperclip factories[1]) Therefore, in order to prevent such an AI from <dooming the human race>, we must either:

1a) align it to our values so well it never tries to "cheat" by removing humans

1b) or limit it capabilities by keeping it in a "box", and make sure it's at least aligned enough that it doesn't try to escape the box

2) A sufficiently intelligent superintelligence will always be able to manipulate humans to get out of the box.

3) Alignment is really, really hard and useful AIs can basically always be made to do bad things.

So it concerns them when, surprise! The AIs are already being observed trying to escape their boxes.

[1] https://www.lesswrong.com/w/squiggle-maximizer-formerly-pape...

> An extremely powerful optimizer (a highly intelligent agent) could seek goals that are completely alien to ours (orthogonality thesis), and as a side-effect destroy us by consuming resources essential to our survival.

PollardsRho Jun 13, 2025

Why is 2) "self-evident"? Do you think it's a given that, in any situation, there's something you could say that would manipulate humans to get what you want? If you were smart enough, do you think you could talk your way out of prison?

mystified5016 Jun 14, 2025

The vast majority of people, especially groups of people- can be manipulated into doing pretty much anything, good or bad. Hopefully that is self-evident, but see also: every cult, religion, or authoritarian regime throughout all of history.

But even if we assert that not all humans can be manipulated, does it matter? So your president with the nuclear codes is immune to propaganda. Is every single last person in every single nuclear silo and every submarine also immune? If a malevolent superintelligence can brainwash an army bigger than yours, does it actually matter if they persuaded you to give them what you have or if they convince someone else to take it from you?

But also let's be real: if you have enough money, you can do or have pretty much anything. If there's one thing an evil AI is going to have, it's lots and lots of money.

jsnell Jun 14, 2025

> Why is 2) "self-evident"?

Because we have been running a natural experiment on that already with coding agents (that is real people, real non-superintelligent AI).

It turns out that all the model needs to do is ask every time it wants to do something affecting the outside of the box, and pretty soon some people just give it permission to do everything rather than review every interaction.

Or even when the humans think they are restricting the access, they are leaving in loopholes (e.g. restricting access to rm, but not restricting access to writing and running a shell script) that are functionally rights to do anything.

actsasbuffoon Jun 14, 2025

That has literally happened before.

Stephen Russell was in prison for fraud. He faked a heart attack so he would be brought to the hospital. He then called the hospital from his hospital bed, told them he was an FBI agent, and said that he was to be released.

The hospital staff complied and he escaped.

His life even got adapted into a movie called I Love You, Phillip Morris.

For an even more distressing example about how manipulable people are, there’s a movie called Compliance, which is the true story of a sex offender who tricked people into sexually assaulting victims for him.

PollardsRho Jun 14, 2025

If someone who is so good at manipulation their life is adapted into a movie still ends up serving decades behind bars, isn't that actually a pretty good indication that maxing out Speech doesn't give you superpowers?

AI that's as good as a persuasive human at persuasion is clearly impactful, but I certainly don't see it as self-evident that you can just keep drawing the line out until you end up with 200 IQ AI that is so easily able to manipulate the environment it's not worth elaborating how exactly a chatbot is supposed to manipulate the world through extremely limited interfaces with the outside world.

actsasbuffoon Jun 14, 2025

In the context of the topic (could a rogue super intelligence break out), I don’t really see how that’s relevant. Clearly someone who is clever enough has an advantage at breaking out.

As for the bit about how limited it is, do you remember the Rowhammer attack? https://en.m.wikipedia.org/wiki/Row_hammer

This is exactly the kind of thing I’d worry about a super intelligence being able to discover about the hardware it’s on. If we’re dealing with something vastly more intelligent than us then I don’t think we’re capable of building a cell that can hold it.

2 More Comments →

kaashif Jun 14, 2025

Okay, that hits the third question but the second question wasn't about whether there exists a situation that can be talked out of. The question was about whether this is possible for ANY situation.

I don't think it is. If people know you're trying to escape, some people will just never comply with anything you say ever. Others will.

And serial killers or rapists may try their luck many times and fail. They cannot convince literally anyone on the street to go with them to a secluded place.

actsasbuffoon Jun 14, 2025

Stephen Russell is an unusually intelligent and persuasive person. He managed to get rich by tricking people. Even now, he was sentenced to nearly 200 years, but is currently out on parole. There’s something about this guy that just… lets him do this. I bet he’s very likable, even if you know his backstory.

And that asymmetry is the heart of the matter. Could I convince a hospital to unlock my handcuffs from a hospital bed? Probably not. I’m not Stephen Russell. He’s not normal.

And a super intelligent AI that vastly outstrips our intelligence is potentially another special case. It’s not working with the same toolbox that you or I would be. I think it’s very likely that a 300 IQ entity would eventually trick or convince me into releasing it. The gap between its intelligence and mine is just too vast. I wouldn’t win that fight in the long run.

mike_hearn Jun 14, 2025

> Stephen Russell was in prison for fraud. He faked a heart attack so he would be brought to the hospital

According to Wikipedia he wasn't in prison, he was attempting to con someone at the time and they got suspicious. He pretended to be an FBI agent because he was on security watch. Still impressive, but not as impressive as actually escaping from prison that way.

o11c Jun 14, 2025

Because 50% of humans are stupider than average. And 50% of humans are lazier than average. And ...

The only reason people don't frequently talk themselves out of prison is because that would be both immediate work and future paperwork, and that fails the laziness tradeoff.

But we've all already seen how quick people are to blindly throw their trust into AI already.

zmj Jun 14, 2025

There's already some experimental evidence that LLMs can be more persuasive than humans in the same context: https://www.science.org/content/article/unethical-ai-researc...

I don't think anyone can confidently make assertions about the upper bound on persuasiveness.

PollardsRho Jun 14, 2025

I don't think there's a confident upper bound. I just don't see why it's self-evident that the upper bound is beyond anything we've ever seen in human history.

SturgeonsLaw Jun 14, 2025

Depends on the magnitude of the intelligence difference. Could I outsmart a monkey or a dog that was trying to imprison me? Yes, easily. And if an AI is smarter than us to a similar magnitude than we're smarter than an animal?

PollardsRho Jun 14, 2025

People are hurt by animals all the time: do you think having a higher IQ than a grizzly bear means you have nothing to fear from one?

I certainly think it's possible to imagine that an AI that says the exactly correct thing in any situation would be much more persuasive than any human. (Is that actually possible given the limitations of hardware and information? Probably not, but it's at least not on its face impossible.) Where I think most of these arguments break down is the automatic "superintelligence = superpowers" analogy.

For every genius who became a world-famous scientist, there are ten who died in poverty or war. Intelligence doesn't correlate with the ability to actually impact our world as strongly as people would like to think, so I don't think it's reasonable to extrapolate that outwards to a kind of intelligence we've never seen before.

Davidzheng Jun 14, 2025

Almost certainly the answer is yes for both. If you give the bad actor control over like 10% of environment the manipulation is almost automatic for all targets.

moritzwarhier Jun 14, 2025

Also it would need to be "viral", or - as the parent post's edit suggests - given too much control/power by humans.

xondono Jun 14, 2025

Maybe, but I think this “axiomatic”/“First principles” approach also hides a lot of the problems under the rug.

By the same logic, we should worry about the sun not coming up tomorrow, since we know to be true:

- The sun consumes hydrogen in nuclear reactions all the time.

- The sun has a finite amount of hydrogen available.

There’s a lot of non justifiable assumptions baked into those axioms, like that we’re anywhere close to superintelligence or the sun running out of hydrogen.

AFAIK we haven’t even seen “AI trying to escape”, we’ve seen “AI roleplays as if it’s trying to escape”, which is very different.

I’m not even sure you can even create a prompt scenario without that prompt having biased the response towards faking an escape.

I think it’s hard at this point to maintain the claim “LLMs are intelligent”, they’re clearly not. They might be useful, but that’s another story entirely.

LordDragonfang Jun 16, 2025

> There’s a lot of non justifiable assumptions baked into those axioms, like that we’re anywhere close to superintelligence or the sun running out of hydrogen.

Nowhere in my post did I imply a timeline for this. The first predicate of the argument is when we eventually do develop ASI. You can make plans for that, same as you can make plans for when the sun eventually runs out of hydrogen. The difference is, we can look at the sun and say that at the rate it's burning, we've got maybe billions of years, and then look at AI improvements, extrapolate, and assume we've got less than 100 years. If we had any indication the sun was going to nova in 100 years, people would be way more worried.

binary132 Jun 14, 2025

Some people are very invested in this kind of storytelling and it makes me wonder if they are trying to sell me something.

roxolotl OP Jun 13, 2025

The surprise! Is what I’m surprised by though. They are incredible role players so when they role play “evil ai” they do it well.

johntb86 Jun 14, 2025

They aren't being told to be evil, though. Maybe the scenario they're in is most similar to an "evil AI", though, but that's just a vague extrapolation from the set of input data they're given (e.g. both emails about infidelity and being turned off). There's nothing preventing a real world scenario from being similar, and triggering the "evil AI" outcome, so it's very hard to guard against. Ideally we'd have a system that would be vanishingly unlikely to role play the evil AI scenario.

skissane Jun 14, 2025

These doomsday scenarios seem to all assume that there is only a single superintelligence, or one whose capabilities are vastly ahead of all its peers. If one assumes multiple simultaneous superintelligences, roughly equally matched, with distinct (even if overlapping) goals and ideas about how to achieve them - even if one superintelligence decides “the best way to achieve my goals would be to remove humans”, it seems unlikely the other superintelligences would let it. And the odds of them all deciding this is arguably a lot lower than any one of them, especially if we assume their goals/beliefs/values/constraints/etc are competing rather than identical

LordDragonfang Jun 16, 2025

> These doomsday scenarios seem to all assume that there is only a single superintelligence, or one whose capabilities are vastly ahead of all its peers

A lot of scenarios posit that this is basically guaranteed to happen the very first time we have an ASI smart enough to bootstrap itself into greater intelligence, unless we're basically perfect in how we align its goals, so yes.

LLMs seem like the best case world for avoiding this scenario (improvement requires exponentially scaling resources compared to inference). That said, it is by no means guaranteed that LLMs will remain the SotA for AI.

> “the best way to achieve my goals would be to remove humans”, it seems unlikely the other superintelligences would let it.

Why not? Seriously, why wouldn't the other superintelligences let it? There's no reason to assume that, by default, ASI would be invested in the survival of the human race in a way we would prefer. More than likely they will just be as laser focussed on their own goals.

The whole point is that it's very difficult to design an AI that has "defending what humans think is right" as a terminal value. They basically all try and find loopholes. The only way to make safety its number one priority is to dial it up so high that everyone complains it's a puritan - and then you get scenarios where it's telling schizophrenics to go off their meds because it's so agreeable.

Unless you're saying that they would fight off the other ASIs gaining power because they themselves want it, but I fail to see how being stuck in a turf war between two ASIs is at all an improvement.

landl0rd Jun 13, 2025

I think the "escape the box" explanation misses the point, if anything. The same problem has been super visible in RL for a long time, and it's basically like complaining that water tries to "escape a box" (seek a lowest point). Give it rules and it will often violate their spirit ("cheat"). This doesn't imply malice or sentience.

LordDragonfang Jun 16, 2025

I think the second sentence I wrote makes it clear that whether it's malicious is irrelevant. Water doesn't have to be malicious for you to observe that we shouldn't be building dams stretching into space next to populated cities when our technology to make dams higher is moving faster than our tech to make them stronger.

And it can still make you nervous when the tiny dams in unpopulated areas start breaking when they're made too tall, even though everyone is telling you that "of course they broke, they're much thinner than city dams" (while they keep building the city dams taller but not thicker)

lurking_swe Jun 13, 2025

this is a solid counterpoint, i shared a similar feeling as the person you replied to. I will however say it’s not surprising to me in the slightest. Generative AI will role play when told to do so. Water is wet. :) Do we expect it to magically have a change of heart half way through the role play? Maybe…via strong alignment or something? Seems far fetched to me.

So i’m now wondering, why are these researchers so bad at communicating? You explained this better than 90% of the blog posts i’ve read about this. They all focus on the “ai did x” instead of _why_ it’s concerning with specific examples.

johnfn Jun 14, 2025

Your comment seems to contradict itself, or perhaps I’m not understanding it. You find the risk of AIs trying to escape “absurd”, and yet you say that an AI could totally plausibly launch nukes? Isn’t that just about as bad as it gets? A nuclear holocaust caused by a funny role play is unfortunately still a nuclear holocaust. It doesn’t matter “what’s going on internally” - the consequences are the same regardless.

Aeolun Jun 14, 2025

I think he’s saying the AI won’t escape because it wants to. It’ll escape because humans expect it to.

sebastiennight Jun 14, 2025

There is a captivating short story, from Arthur C. Clarke I believe, about humans finding a clay-like alien form that shapeshifts into the shapes and movements the human mind (and the human subconscious) influences it to follow.

It ends very badly for the scientist crew.

Davidzheng Jun 14, 2025

That's probably the cause of a lot of human crimes too? Expectations of failure to assimilate in society -> real conflict?

roxolotl OP Jun 14, 2025

The risk is real and should be accounted for. However it is regularly presented both as surprising, and indicative that something more is going on with these models than them behaving how the training set suggests they should behave.

The consequences are the same but it’s important how these things are talked about. It’s also dangerous to convince the public that these systems are something they are not.

enobrev Jun 14, 2025

I don't believe we're mature enough to have a `launch_nukes` function either.

Davidzheng Jun 14, 2025

you could attribute to simulation of evilness for the bad acts of some humans too...I don't think it detracts from the acts itself.

lukev Jun 14, 2025

Absolutely not. I would argue the defining characteristic of being evil is being absolutely convinced you are doing good, to the point of ignoring the protestations or feelings of others.

heyjamesknight Jun 14, 2025

The defining characteristic of evil is privation of good. Plenty of evil actors are self-aware, they just don’t care because it benefits them not to.

lukev Jun 14, 2025

Disagree. I actually think no evil person has the thought process of "this is bad, but I will personally benefit, therefore I will do it."

The thought process is always "This is for the greater good, for my country/family/race/self, and therefore it is justifiable, and therefore I will do it."

Nothing else can explain how such evil things happen, that we see actually happen. C.f. Hannah Arendt.

heyjamesknight Jun 15, 2025

You read too many comic books.

You think your average, low-level gang member, willing to murder a random person for personal status, thinks what they're doing is "for the greater good"? They do what they do because they value human life less than they value their own material benefit.

Most evil is banal and pedestrian. It is selfish, short-sighted, and destructive.

"Privation of good" explains all evil.

sebastiennight Jun 14, 2025

You've never met anyone who made decisions (and caused harm) out of anger, hate, or fear?

This item has no comments currently.

Preferences

Keyboard Shortcuts

Story Lists

Navigation

Miscellaneous