
If it’s predicting the next token to maximize its score against a training/test set, then naively, wouldn’t that be expected?

I would imagine very little of the training data consists of a question followed by an answer of “I don’t know”, thus making it statistically very unlikely as a “next token”.


Even where the training data does say "I don't know" (which it usually doesn't -- people don't tend to comment or publish books, etc. when they don't think they know), that text reflects the author's knowledge rather than the model's... so it would be off in both directions.

One could imagine a fine-tuning procedure that gave a model better knowledge of itself: test it, and on prompts where its most probable completions are wrong, fine-tune it to say "I don't know" instead. Though 'are wrong' is doing some really heavy lifting, since it wouldn't be simple to do that without a better model that already knew the right answers.
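
A minimal sketch of what building such a fine-tuning set could look like, assuming you already have a reference QA set and some way to grade the model's answers (which is exactly the hard part noted above); `query_model` and `answers_match` are hypothetical placeholders, not a real API:

    # Sketch: build an "I don't know" fine-tuning set from a model's own mistakes.
    # query_model and answers_match are hypothetical stand-ins; the grading step
    # still needs reference answers or a stronger model, which is the hard part.

    from typing import Callable, List, Tuple

    def build_idk_finetune_set(
        qa_pairs: List[Tuple[str, str]],            # (prompt, reference answer)
        query_model: Callable[[str], str],          # model's most probable completion
        answers_match: Callable[[str, str], bool],  # grader: is the completion right?
    ) -> List[Tuple[str, str]]:
        """Return (prompt, target) pairs for fine-tuning.

        Prompts the model answers correctly keep its own completions;
        prompts it gets wrong are retargeted to "I don't know".
        """
        finetune_set = []
        for prompt, reference in qa_pairs:
            completion = query_model(prompt)
            if answers_match(completion, reference):
                finetune_set.append((prompt, completion))
            else:
                finetune_set.append((prompt, "I don't know."))
        return finetune_set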
