DeepSeek V4 shifts its training methodology to OPD and integrates expert models.

KuCoinFlash

Release Time: 04/24/2026 04:20:49

Summary

DeepSeek V4 replaces the mixed RL stage used in V3.2 with On-Policy Distillation (OPD). Expert models for mathematics, coding, agents, and instruction following are trained first, then distilled into a single model via multi-teacher OPD. A Generative Reward Model (GRM) improves performance on complex, hard-to-verify tasks using only a small amount of human annotation.

ME News report, April 24 (UTC+8): According to monitoring by Beating, DeepSeek V4 makes a major change to its post-training methodology: the mixed RL stage used in V3.2 has been replaced entirely by On-Policy Distillation (OPD). The new process has two steps.

First, domain-specific expert models are trained separately using the V3.2 pipeline, covering areas such as mathematics, coding, agents, and instruction following. Each expert is first fine-tuned and then trained with reinforcement learning using GRPO.

Second, multi-teacher OPD distills the capabilities of more than ten experts into a single unified model. The student performs full-vocabulary logit distillation with a reverse KL divergence loss on trajectories it generates itself. Because alignment happens at the level of output distributions, the experts' capabilities are consolidated into one set of parameters without merging weights directly, which avoids the capability conflicts commonly seen in traditional weight merging and mixed RL.

The report also introduces a Generative Reward Model (GRM). For tasks that are hard to verify with rules, no traditional scalar reward model is trained; instead, the GRM is trained on rubric-guided RL data, so that the actor network both generates and evaluates outputs. This generalizes to complex tasks with only a small amount of diverse human annotation. (Source: BlockBeats)
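To make the distillation step more concrete, here is a minimal sketch of a full-vocabulary reverse KL loss computed along a student-generated trajectory. It is an illustration under stated assumptions, not DeepSeek's implementation: the function names (`reverse_kl`, `opd_loss`, `multi_teacher_opd_loss`) are hypothetical, logits are plain Python lists, and each trajectory is assumed to be already paired with the logits of a single domain expert. The report does not say how teachers are assigned to trajectories.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one full-vocabulary logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one token position, summed over the whole vocabulary."""
    log_p = log_softmax(student_logits)  # student distribution
    log_q = log_softmax(teacher_logits)  # teacher (expert) distribution
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def opd_loss(student_steps, teacher_steps):
    """Mean per-token reverse KL along one trajectory sampled from the student.

    Each argument is a list of logit vectors, one per generated token. The
    teacher is assumed to have scored the same student-generated prefix.
    """
    if len(student_steps) != len(teacher_steps):
        raise ValueError("student and teacher must score the same positions")
    per_token = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(per_token) / len(per_token)


def multi_teacher_opd_loss(batch):
    """Average OPD loss over (student_steps, teacher_steps) pairs.

    Assumption (not from the report): every trajectory in the batch comes
    with the logits of exactly one expert, chosen by task domain.
    """
    return sum(opd_loss(s, t) for s, t in batch) / len(batch)


# Toy example: a 3-token vocabulary and a 2-token trajectory.
student = [[2.0, 0.5, 0.1], [0.3, 1.2, 0.0]]
math_expert = [[2.5, 0.2, 0.1], [0.1, 1.5, 0.2]]
loss = multi_teacher_opd_loss([(student, math_expert)])
```

Reverse KL weights each vocabulary entry by the student's own probability, so the loss is zero only when the student's distribution matches the teacher's at every position it visits. In a real training setup the same quantity would be computed on tensors with automatic differentiation so that gradients flow into the student only.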


Disclaimer: The information on this page may have been obtained from third parties and does not necessarily reflect the views or opinions of KuCoin. This content is provided for general informational purposes only, without any representation or warranty of any kind, nor shall it be construed as financial or investment advice. KuCoin shall not be liable for any errors or omissions, or for any outcomes resulting from the use of this information. Investments in digital assets can be risky. Please carefully evaluate the risks of a product and your risk tolerance based on your own financial circumstances. For more information, please refer to our Terms of Use and Risk Disclosure.