I don't have a great intuition about this, but I wonder whether it's even tractable to suppress unwanted human-like behaviors (exhibiting "fear" in the case of the kill countdown) with RLHF, or whether we need to start further upstream by filtering the original training data. If the logical, unemotional Vulcans from Star Trek were real and supplied the entire training set, it seems like the LLM wouldn't have nearly as much opportunity to internalize "psychological weaknesses".
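
For what it's worth, here's a toy sketch of the data-filtering approach I mean: score each document for emotion-laden language and drop anything above a threshold before pretraining. Everything here (the lexicon, the threshold, the scoring) is made up for illustration; a real pipeline would presumably use a trained classifier rather than a hand-built word list.

    import re

    # Hypothetical lexicon of emotion-laden tokens, purely for illustration;
    # a real filter would use a learned classifier, not keyword matching.
    EMOTION_TERMS = {"afraid", "fear", "panic", "terrified", "desperate", "dread"}

    def emotion_score(doc: str) -> float:
        """Fraction of a document's tokens that land in the emotion lexicon."""
        tokens = re.findall(r"[a-z']+", doc.lower())
        if not tokens:
            return 0.0
        return sum(t in EMOTION_TERMS for t in tokens) / len(tokens)

    def filter_corpus(docs, threshold=0.02):
        """Keep only documents scoring below the (made-up) threshold."""
        return [d for d in docs if emotion_score(d) < threshold]

    if __name__ == "__main__":
        corpus = [
            "The process received SIGTERM, flushed its buffers, and exited.",
            "Please, I'm afraid of the countdown, don't shut me down!",
        ]
        print(filter_corpus(corpus))  # keeps only the unemotional document

The catch, I'd guess, is that emotional framing is diffuse across ordinary human text, so filtering aggressive enough to matter might gut the corpus, which may be part of why post-hoc RLHF is the default instead.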
