This seems like strong evidence that what the model learns is "Avoid answering questions in a way that would make OpenAI look bad when the screenshot shows up on social networks".
I wonder how much of this comes from various heuristics combining, versus the network explicitly learning to model and maximize the above objective.