So whether a model is or isn't "a reasoning model" comes down to the extent of a fine-tune.
Are there specific benchmarks that compare models against themselves with and without scratchpads, with high with:without ratios marking the reasonier models?
Curious also how much a generalist model's one-shot responses degrade with reasoning post-training.
Yep, it's pretty common for labs to release both an instruction-tuned and a thinking-tuned variant of the same model and then bench them against each other. For instance, if you scroll down to "Pure text performance" there's a comparison of the two Qwen variants' performance: https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Thinking
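If you want to poke at the with/without-scratchpad comparison yourself, here's a minimal sketch assuming a Qwen3-style chat template that accepts an enable_thinking flag (the text-only Qwen3 checkpoints expose this; the VL pair above ships the two modes as separate checkpoints, so there you'd swap model names instead):

```python
# Minimal sketch: same model, same prompt, with and without the thinking scratchpad.
# Assumes a Qwen3-style chat template with an enable_thinking flag.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-4B"  # any Qwen3 text checkpoint with the thinking toggle
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

messages = [{"role": "user", "content": "What is 17 * 24?"}]

for thinking in (True, False):
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=thinking,  # toggles the <think>...</think> scratchpad
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=512)
    text = tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
    print(f"--- enable_thinking={thinking} ---\n{text}\n")
```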
Yes, simplest example: https://www.anthropic.com/engineering/claude-think-tool
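From memory of that post, the whole "think" tool is basically a no-op with a single string parameter; the exact field wording below is illustrative, not copied from the article:

```python
# Sketch of a no-op "think" tool: its only purpose is to give the model a
# sanctioned place to dump intermediate reasoning during a tool-use loop.
import anthropic

think_tool = {
    "name": "think",
    "description": (
        "Use this tool to think about something. It will not obtain new "
        "information or change anything; it just logs the thought. Use it "
        "when complex reasoning or a scratchpad is needed."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "thought": {"type": "string", "description": "A thought to think about."}
        },
        "required": ["thought"],
    },
}

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # any tool-use-capable model
    max_tokens=1024,
    tools=[think_tool],
    messages=[{"role": "user", "content": "Plan the refund for order #1234 under our policy."}],
)
```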
This can be done with finetuning/RL on an existing pre-formatted dataset, or with format-based RL where the model is rewarded both for answering correctly and for using the right format.
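A minimal sketch of the second flavour, assuming a GRPO/PPO-style setup where each sampled completion gets a scalar reward; the tag names and weights are made up for illustration:

```python
# Format-based reward: partial credit for using the scratchpad format,
# full credit only when the final answer is also correct.
import re

THINK_RE = re.compile(r"^<think>(.*?)</think>\s*(.+)$", re.DOTALL)

def reward(completion: str, gold_answer: str) -> float:
    score = 0.0
    match = THINK_RE.match(completion.strip())
    if match:
        score += 0.5            # reward for using the scratchpad format at all
        final = match.group(2)  # only grade the text after the scratchpad
    else:
        final = completion
    if gold_answer.strip() in final:
        score += 1.0            # reward for getting the right answer
    return score

# e.g. reward("<think>17*24 = 17*20 + 17*4 = 408</think> The answer is 408.", "408") -> 1.5
```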
I think I get that "reasoning" in this context refers to dynamically budgeting scratchpad tokens that aren't intended as the main response body. But can't any model do that, with it just being part of the system prompt, or more generally the conversation scaffold that is being written to (something like the prompt-only sketch at the end of this comment)?
Or does a "reasoning model" specifically refer to models whose "post training" / "fine tuning" / "rlhf" laps have been run against those sorts of prompts rather than simpler user-assistant-user-assistant back and forths?
E.g., a base model becomes "a reasoning model" after enough time in the reasoning mines.
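For concreteness, the prompt-only version I mean is something like this; the client, model name, and tag names are just illustrative, and any instruction-following model would do:

```python
# Prompt-only scratchpad: the system prompt asks for reasoning inside tags,
# and the scaffold strips those tags before showing the user anything.
import re
from openai import OpenAI

client = OpenAI()

SYSTEM = (
    "Before answering, reason step by step inside <scratchpad>...</scratchpad> "
    "tags. After the closing tag, give only the final answer."
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": "A train leaves at 3:40pm and the trip takes 95 minutes. When does it arrive?"},
    ],
)
raw = resp.choices[0].message.content
answer = re.sub(r"<scratchpad>.*?</scratchpad>\s*", "", raw, flags=re.DOTALL)
print(answer)
```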