Preferences

Just speculating but proximity to a reference answer is a much denser reward signal. In contrast, parsing out a final answer into a pass/fail only provides a sparse reward signal.

Yup, RLVR as implemented by Deepseek et al. use only outcome supervision instead of process supervision. There have been attempts to do process supervision though.

This item has no comments currently.