> that the author also tripped over
The evidence for unfaithful reasoning comes from Anthropic. It is in their system card and this Anthropic paper.
https://assets.anthropic.com/m/71876fabef0f0ed4/original/rea...
If you ask an LLM a question, get the answer, and then ask how it arrived at that answer, it will make something up, because it literally can't do otherwise: there is no hidden memory space in which the LLM could do its calculations, record which calculations it performed, and then consult that record to answer the second question. All there is are the tokens.
However, if you tell the model to "think step by step", i.e. to first make a number of small inferences and then use those to derive the final answer, you should (at least in theory) get a high-level description of the actual reasoning process, because the model will use the tokens of its intermediate results to generate the features for the final result.
So "explain how you did it" will give you bullshit, but "think step by step" should work.
And as far as my understanding goes, the "reasoning models" are essentially just heavily fine-tuned for this kind of step-by-step reasoning.