New Findings in Large Model Post-Training: On-Policy Training with Self-Generated Data Improves Model Performance

MarsBit

According to Beating Monitor, "on-policy sampling" (training the model on data it generates itself in real time) during post-training of large models is crucial for preventing model degradation and improving problem-solving ability. Online reinforcement learning (RL) and on-policy distillation (OPD) outperform traditional supervised fine-tuning (SFT) because they optimize the model on its own generated steps rather than having it memorize external ground-truth answers.
SFT forcibly implants standard answers and applies uniform correction pressure to every token, which easily disrupts the model’s original knowledge structure and causes catastrophic forgetting. In contrast, RL and OPD let the model identify and reinforce the best steps within its own draft outputs. This avoids compounding errors of the “one wrong word at the start leads to a chain of mistakes” kind, and it confines updates to regions of knowledge the model already understands, preserving its native capabilities as far as possible.
In the “minimal code editing” experiment, student models taught by an SFT teacher and by an RL teacher reached Pass@1 success rates of 80.0% and 78.7% respectively, both surpassing the teacher models themselves. Even when the SFT teacher had degraded severely from over-fine-tuning (its code capability score on LiveCodeBench dropped from 0.320 to 0.286), its student still scored 0.297 and was nearly unaffected by the teacher’s flaws, which suggests that on-policy practice effectively filters out the teacher’s bad habits.
DeepSeek-V4 and GLM-5 have reportedly adopted on-policy distillation to merge the capabilities of expert models. In expert training, domains with clear right/wrong answers, such as coding and mathematics, are better suited to RL, while creative and knowledge-based subjective tasks are better suited to on-policy distillation.
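The difference between SFT and on-policy distillation described above can be made concrete with a toy sketch. Everything in it is an assumption for illustration: the three-token vocabulary, the bigram-style `STUDENT` and `TEACHER` tables, and the choice of per-token reverse KL as the distillation loss (the source does not say which divergence is used). The point it shows is only where the training prefixes come from: SFT scores the student on an external reference sequence, while on-policy distillation scores it on a sequence the student sampled itself.

```python
import math
import random

START = -1  # hypothetical marker for "no previous token"

# Toy bigram "models": logits over a 3-token vocabulary, keyed by the previous token.
STUDENT = {START: [0.0, 0.0, 0.0], 0: [1.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0], 2: [0.0, 0.0, 1.0]}
TEACHER = {START: [2.0, 0.0, 0.0], 0: [0.0, 2.0, 0.0], 1: [0.0, 0.0, 2.0], 2: [2.0, 0.0, 0.0]}


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_probs(model, prefix):
    """Next-token distribution given the prefix (only the last token matters here)."""
    last = prefix[-1] if prefix else START
    return softmax(model[last])


def sample_rollout(model, length, rng):
    """Sample a sequence token by token from the model's own distribution."""
    seq = []
    for _ in range(length):
        probs = next_probs(model, seq)
        seq.append(rng.choices(range(len(probs)), weights=probs)[0])
    return seq


def sft_loss(student, reference):
    """Off-policy: mean cross-entropy on an externally supplied reference sequence."""
    total = 0.0
    for t, tok in enumerate(reference):
        total -= math.log(next_probs(student, reference[:t])[tok])
    return total / len(reference)


def on_policy_distill_loss(student, teacher, rollout):
    """On-policy: mean per-token reverse KL(student || teacher) on student-visited prefixes."""
    total = 0.0
    for t in range(len(rollout)):
        p = next_probs(student, rollout[:t])
        q = next_probs(teacher, rollout[:t])
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return total / len(rollout)


rng = random.Random(0)
reference = [0, 1, 2]                      # external "ground-truth" answer
rollout = sample_rollout(STUDENT, 3, rng)  # the student's own draft
print("SFT loss on reference:", sft_loss(STUDENT, reference))
print("OPD loss on own rollout:", on_policy_distill_loss(STUDENT, TEACHER, rollout))
```

In a real system the losses would be differentiated with respect to the student's parameters; this sketch only evaluates them, since the aim is to show which sequence each objective is computed on.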
Future fine-tuning algorithms will need to find, within an on-policy training framework, new mechanisms that combine the efficiency (high information density) of distillation with the objectivity (unbiased updates) of RL.
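The "information density" contrast can be illustrated with a minimal sketch, under stated assumptions. Here RL is assumed to use one scalar outcome reward per rollout, broadcast to every token as an advantage, and distillation is assumed to score each sampled token by the teacher-minus-student log-probability. The function names and the numbers are hypothetical and are not taken from the source.

```python
def rl_token_weights(num_tokens, reward, baseline):
    """Outcome-reward RL: one scalar advantage shared by every token in the rollout."""
    advantage = reward - baseline
    return [advantage] * num_tokens


def distill_token_weights(student_logprobs, teacher_logprobs):
    """On-policy distillation: a separate signal per sampled token.

    A negative value marks a token the teacher finds less likely than the student does.
    """
    return [t - s for s, t in zip(student_logprobs, teacher_logprobs)]


# Hypothetical log-probabilities of the three tokens the student sampled.
student_lp = [-0.5, -1.0, -0.25]
teacher_lp = [-0.5, -3.0, -0.125]

rl = rl_token_weights(num_tokens=3, reward=1.0, baseline=0.5)
kd = distill_token_weights(student_lp, teacher_lp)

print("RL weights:     ", rl)  # the same value repeated: one signal per rollout
print("Distill weights:", kd)  # varies by position: one signal per token
```

The RL signal depends only on whether the final answer was right, which is why it suits domains with verifiable outcomes, while the per-token signal is denser but inherits whatever preferences the teacher has.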
