Comment by PoignardAzur

PoignardAzur 2 days ago

This seems like strong evidence that what the model learns is "Avoid answering questions in a way that would make OpenAI look bad when the screenshot shows up on social networks".

I wonder how much of this comes from various heuristics combining, versus the network explicitly learning to model and maximize the above objective.
