According to Beating Monitor, on-policy sampling (training a model on data it generates itself in real time) during post-training of large models is crucial for preventing model degradation and improving problem-solving ability. Online reinforcement learning (RL) and on-policy distillation (OPD) outperform traditional supervised fine-tuning (SFT) because they optimize the model on its own generated steps rather than having it memorize external ground-truth answers.

SFT forcibly implants reference answers and applies uniform correction pressure to every token, which easily disrupts the model's existing knowledge structure and causes catastrophic forgetting. RL and OPD instead let the model identify and reinforce the best steps within its own draft outputs. This avoids compounding errors, where one wrong word at the start leads to a chain of mistakes, and it confines updates to regions of knowledge the model already understands, preserving its native capabilities as far as possible.

In the "minimal code editing" experiment, student models reached Pass@1 success rates of 80.0% with an SFT teacher and 78.7% with an RL teacher, surpassing the teacher models themselves. Even when the SFT teacher was severely degraded by over-fine-tuning (its LiveCodeBench score dropped from 0.320 to 0.286), its student still scored 0.297 and was nearly unaffected by the teacher's flaws, which indicates that on-policy practice effectively filters out the teacher's bad habits.

DeepSeek-V4 and GLM-5 have already adopted on-policy distillation to merge the capabilities of expert models. In expert training, domains with clear right and wrong answers, such as coding and mathematics, are better suited to RL, while creative and knowledge-based subjective tasks are better suited to on-policy distillation.
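The contrast between SFT and on-policy distillation described above can be made concrete with a toy sketch. Everything in it is an assumption made for illustration: the three-token vocabulary, the hand-written `STUDENT` and `TEACHER` tables, and the choice of a per-token log-ratio (a single-sample estimate of reverse KL) as the OPD signal. That log-ratio is one common formulation and is not confirmed to be the objective used in the experiments reported here.

```python
import math
import random

VOCAB = ["a", "b", "c"]
START = "<s>"

# Hypothetical toy "models": next-token probabilities given the previous token.
STUDENT = {
    START: [0.6, 0.3, 0.1],
    "a": [0.2, 0.7, 0.1],
    "b": [0.1, 0.2, 0.7],
    "c": [0.5, 0.25, 0.25],
}
TEACHER = {
    START: [0.7, 0.2, 0.1],
    "a": [0.1, 0.8, 0.1],
    "b": [0.1, 0.1, 0.8],
    "c": [0.4, 0.3, 0.3],
}


def prob(model, prev, tok):
    """Probability the model assigns to `tok` after `prev`."""
    return model[prev][VOCAB.index(tok)]


def sft_token_losses(student, reference):
    """Off-policy: negative log-likelihood of each token of a fixed external
    answer. Every reference token is penalized, however unlikely the student
    finds it."""
    losses, prev = [], START
    for tok in reference:
        losses.append(-math.log(prob(student, prev, tok)))
        prev = tok
    return losses


def sample_rollout(student, length, rng):
    """On-policy data: a sequence drawn from the student's own distribution."""
    seq, prev = [], START
    for _ in range(length):
        tok = rng.choices(VOCAB, weights=student[prev], k=1)[0]
        seq.append(tok)
        prev = tok
    return seq


def opd_token_losses(student, teacher, rollout):
    """Assumed OPD signal: log p_student - log p_teacher on tokens the student
    itself sampled (a single-sample estimate of reverse KL at each step)."""
    losses, prev = [], START
    for tok in rollout:
        losses.append(math.log(prob(student, prev, tok))
                      - math.log(prob(teacher, prev, tok)))
        prev = tok
    return losses


def reverse_kl(student, teacher, prev):
    """Exact KL(student || teacher) for the next token after `prev`."""
    return sum(p * (math.log(p) - math.log(q))
               for p, q in zip(student[prev], teacher[prev]))


if __name__ == "__main__":
    rng = random.Random(0)
    rollout = sample_rollout(STUDENT, 3, rng)
    print("SFT losses on external answer:", sft_token_losses(STUDENT, ["c", "c", "c"]))
    print("Student rollout:", rollout)
    print("OPD losses on own rollout:", opd_token_losses(STUDENT, TEACHER, rollout))
```

In this sketch the SFT loss is evaluated on a sequence the student would rarely produce, so every position pushes the student toward unfamiliar tokens. The OPD signal is evaluated only on sequences the student actually samples, and it drops to zero wherever student and teacher already agree.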
The ultimate fine-tuning algorithm of the future will need to stay within the on-policy training framework while finding new mechanisms that combine the efficiency (high information density) of distillation with the objectivity (unbiased updates) of RL.
New Findings in Large Model Post-Training: On-Policy Training with Self-Generated Data Enhances Model Performance
MarsBit
New research highlights on-policy training with self-generated data as a key method for improving model performance and avoiding degradation. Unlike traditional SFT, online RL and on-policy distillation (OPD) let models refine their own steps in real time. In recent tests, student models trained this way outperformed their teachers. DeepSeek-V4 and GLM-5 already use on-policy distillation to integrate expert capabilities.