Comment by markisus - Hacker Neue

markisus Aug 17, 2025

Just speculating, but proximity to a reference answer is a much denser reward signal. In contrast, parsing out a final answer and grading it pass/fail provides only a sparse reward signal.
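To make the dense vs. sparse distinction concrete, here is a rough sketch. The function names are made up for illustration, and `difflib.SequenceMatcher` is only a stand-in for whatever "proximity to a reference answer" would mean in a real setup:

```python
from difflib import SequenceMatcher


def sparse_reward(final_answer: str, reference: str) -> float:
    """Pass/fail: 1.0 only on an exact match after trimming whitespace."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0


def dense_reward(final_answer: str, reference: str) -> float:
    """Proximity: a similarity ratio in [0, 1].

    Assumption: character-level similarity stands in for a real
    proximity metric, which is not specified in the discussion.
    """
    return SequenceMatcher(None, final_answer.strip(), reference.strip()).ratio()


if __name__ == "__main__":
    reference = "x = 42"
    for candidate in ["x = 42", "x = 41", "no idea"]:
        print(
            f"{candidate!r}: sparse={sparse_reward(candidate, reference)}, "
            f"dense={dense_reward(candidate, reference):.2f}"
        )
```

With the pass/fail version, a near miss and a completely wrong answer both score 0, so they are indistinguishable. The proximity version ranks the near miss higher, which is the extra signal being speculated about.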

krackers Aug 21, 2025

Yup, RLVR as implemented by DeepSeek et al. uses only outcome supervision rather than process supervision. There have been attempts at process supervision, though.
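A toy sketch of the difference, not anyone's actual training code: outcome supervision scores only the final answer, while process supervision scores each intermediate step. The `step_is_valid` checker is hypothetical and only handles lines of the form `a + b = c`; real process supervision usually relies on a learned or human-labeled step scorer.

```python
def step_is_valid(step: str) -> bool:
    """Hypothetical step verifier for lines of the form 'a + b = c'."""
    lhs, rhs = step.split("=")
    a, b = lhs.split("+")
    return int(a) + int(b) == int(rhs)


def outcome_rewards(steps: list[str], final_answer: str, reference: str) -> list[float]:
    """Outcome supervision: zero on every step, one scalar on the last.

    Assumes `steps` is non-empty.
    """
    rewards = [0.0] * len(steps)
    rewards[-1] = 1.0 if final_answer == reference else 0.0
    return rewards


def process_rewards(steps: list[str]) -> list[float]:
    """Process supervision: one reward per intermediate step."""
    return [1.0 if step_is_valid(s) else 0.0 for s in steps]


if __name__ == "__main__":
    # The second step contains the arithmetic slip (5 + 4 is 9, not 10).
    steps = ["2 + 3 = 5", "5 + 4 = 10", "10 + 1 = 11"]
    print(outcome_rewards(steps, final_answer="11", reference="10"))
    print(process_rewards(steps))
```

The outcome version only says the whole attempt failed. The per-step version points at the step where it went wrong.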
