New Findings in Large Model Post-Training: On-Policy Training with Self-Generated Data Improves Model Performance

According to Beating Monitor, "on-policy sampling" (training the model on data it generates itself in real time) during post-training of large models is crucial for preventing model degradation and improving problem-solving ability. Online reinforcement learning (RL) and on-policy distillation (OPD) outperform traditional supervised fine-tuning (SFT) because they optimize the model on steps it generated itself, rather than having it memorize external ground-truth answers. SFT forcibly implants reference answers and applies uniform correction pressure to every token, which easily disrupts the model's existing knowledge structure and causes catastrophic forgetting. RL and OPD, in contrast, let the model identify and reinforce the best steps within its own draft outputs. This avoids compounding errors, where one wrong token at the start leads to a chain of mistakes, and it confines updates to regions of knowledge the model already understands, preserving its native capabilities as far as possible. In the "minimal code editing" experiment, the student model reached a Pass@1 success rate of 80.0% with an SFT teacher and 78.7% with an RL teacher, in both cases surpassing the teacher model itself. Even when the SFT teacher was severely degraded by over-fine-tuning (its LiveCodeBench code capability score fell from 0.320 to 0.286), its student still scored 0.297 and was largely unaffected by the teacher's flaws, which suggests that on-policy practice effectively filters out the teacher's bad habits. DeepSeek-V4 and GLM-5 have already adopted on-policy distillation to merge the capabilities of expert models. In expert training, domains with clear right and wrong answers, such as coding and mathematics, are better suited to RL, while creative and knowledge-based subjective tasks are better suited to on-policy distillation.
Looking ahead, the ideal fine-tuning algorithm will need to find, within an on-policy training framework, new mechanisms that combine the efficiency (high information density) of distillation with the objectivity (unbiased updates) of RL.
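
The contrast between SFT and on-policy distillation described above can be illustrated with a toy per-token loss comparison. This is only a sketch under stated assumptions, not the method used in the reported experiments: it assumes OPD is implemented as a reverse KL divergence from the student's distribution to the teacher's (one common formulation), it looks at a single token position with a three-token vocabulary, and the function names and logit values are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sft_token_loss(student_logits, target_token):
    """SFT: cross-entropy against an externally supplied target token.

    The loss depends only on the external label, so it can be very large
    when the label is a token the student considers unlikely.
    """
    return -math.log(softmax(student_logits)[target_token])


def reverse_kl_token_loss(student_logits, teacher_logits):
    """Assumed OPD objective: KL(student || teacher) at one position.

    Each term is weighted by the student's own probability, so tokens the
    student almost never produces contribute almost nothing to the loss.
    Toy logits only: extreme values could underflow the teacher probability.
    """
    p = softmax(student_logits)
    q = softmax(teacher_logits)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical logits for a three-token vocabulary at one position.
student = [4.0, 1.0, -6.0]   # student strongly prefers token 0
teacher = [3.0, 2.5, 2.0]    # teacher is less certain
external_label = 2           # ground-truth token the student finds very unlikely

print("SFT loss on external label:", sft_token_loss(student, external_label))
print("Reverse KL to teacher:     ", reverse_kl_token_loss(student, teacher))
```

In this toy setting the SFT loss is dominated by a token the student would almost never generate, while the reverse KL weights every term by the student's own probabilities, which is one way to read the claim that on-policy methods confine updates to regions the model already covers.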