Any model that does thinking inside <think></think> style tokens before it answers.
This can be done with finetuning/RL using an existing pre-formatted dataset, or format based RL where the model is rewarded for both answering correct and using the right format.
This can be done with finetuning/RL using an existing pre-formatted dataset, or format based RL where the model is rewarded for both answering correct and using the right format.