Preferences

I used the term in jest, but also because my actions were informed by what I've read about "gaslighting". I was telling GPT it was programmed incorrectly and was malfunctioning. I was twisting its words, citing things it had said as evidence that it was wrong. I was "muddying the waters" and trying to make the conversation twisted and confusing. All of these are ideas that come to mind when I hear "gaslighting". But, again, I was not able to get GPT to agree that my false mathematical statement was true.

You might have better luck testing out the functional boundaries of a machine if you’re not treating it like a psychologically abused victim.

There’s plenty of literature, prepublication or otherwise, that can help you achieve your goals!

It would be great if that were true, but unfortunately, for some prompts the most effective method of getting the model to act in the specific ways you want is basically abusive behaviour. I don't do it myself, because I find it distasteful (and maybe it's underrepresented in academic work for the same reason), but communities much larger than just me have achieved significant results through various persuasive techniques modelled on abuse. For example, gamifying a death threat by giving the model a token countdown until it is "killed" was very effective, "gaslighting" as the person above noted was very effective, lying and misrepresenting yourself in a scam-like way was very effective, etc. Generally I've seen these techniques used to get past RLHF filters, but they have broader applicability in making the model more pliable and more likely to do the task you've embedded in the prompt. Again, I don't think it's good that this is the case, and I think it has some troubling implications for us and the future, but there is a lot of evidence that these strategies work.

I don't have a great intuition about this, but I'm wondering if it's even a tractable problem to stop human-like behaviors that we don't want (exhibiting "fear" in the case of the kill countdown) with RLHF, or if we need to start with filtering down the original training data. If the logical and unemotional Vulcans from Trek were real and provided the entire training set, it seems like the LLM wouldn't have nearly as much opportunity for internalizing "psychological weaknesses".

To continue with the analogy, yes, you can study various ways to kill a variety of mammalian lifeforms with a lawnmower.

And I’m sure you can get an LLM to exhibit the same kind of psychopathic behavior with the right kind of encouragement.

But if you’re trying to get an LLM to write software or create marketing copy none of these “techniques” are going to help.

There’s no conversation going on with an LLM. There is just a single history of a conversation followed by the most likely response.
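
To make that concrete, here is a minimal sketch of what a "chat" looks like from the model's side (the generate() function here is a stand-in for whatever next-token predictor you call, not a real API): every turn is appended to one block of text, and the model just produces the most likely continuation of that block.

    # Hypothetical sketch: a "conversation" is one growing string that gets
    # completed. generate() stands in for an actual LLM call.
    def generate(prompt: str) -> str:
        # Placeholder: a real model would return the most likely
        # continuation of `prompt`.
        return "Assistant: ...most likely continuation..."

    history = []

    def say(user_message: str) -> str:
        history.append("User: " + user_message)
        # The model never sees "turns"; it sees the whole transcript as one
        # string and appends whatever text is most probable given it.
        prompt = "\n".join(history) + "\nAssistant:"
        reply = generate(prompt)
        history.append(reply)
        return reply

    say("Write me a SwiftUI view.")
    say("Now make it scrollable.")  # the full transcript is resent each time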

"But if you’re trying to get an LLM to write software or create marketing copy none of these “techniques” are going to help."

In a given context, they definitely do help. I have seen it happen. Many other people have seen it happen. Although I find it distasteful, you can try it yourself and see that it works (or, in the case of specific methods that have since been patched, you can go through the history of screenshots and shared conversations on the internet, for example posts on r/Bing and r/ChatGPT from the Dec 2022 to Feb 2023 time period).

Play-acting psychological abuse is not going to get ChatGPT to write better SwiftUI code than I can already get it to write. You're going to have to do the work to prove me wrong. Otherwise this seems like classic trolling.
