smolder
I don't have a great intuition about this, but I'm wondering whether it's even tractable to use RLHF to stop human-like behaviors we don't want (exhibiting "fear" in the case of the kill countdown), or whether we need to start by filtering the original training data. If the logical and unemotional Vulcans from Star Trek were real and provided the entire training set, it seems like the LLM wouldn't have nearly as much opportunity to internalize "psychological weaknesses".