DeepSeek V4 open-source model launches with 1.6 trillion parameters and MIT license

Summary

On April 24 (UTC+8), DeepSeek launched a preview of its open-source V4 series models under the MIT license. The V4-Pro and V4-Flash MoE models have 1.6 trillion and 284 billion parameters, respectively, and both support a 1-million-token context. V4-Pro reduces inference FLOPs by 73% and KV-cache memory usage by 90% compared with V3.2. Weights are available on Hugging Face and ModelScope.

ME News reports that on April 24 (UTC+8), according to BlockBeats monitoring, DeepSeek released the preview version of its V4 series under the MIT license, with weights now available on Hugging Face and ModelScope. The series includes two MoE models: V4-Pro, with 1.6T total parameters and 49B activated per token; and V4-Flash, with 284B total parameters and 13B activated per token. Both support a 1M-token context length.

The release brings three architectural upgrades:
- A hybrid attention mechanism, combining Compressed Sparse Attention (CSA) with Heavily Compressed Attention (HCA), significantly reduces long-context overhead: at 1M context, V4-Pro's single-token inference FLOPs are only 27% of V3.2's, and its KV cache (the GPU memory used to store historical context during inference) is just 10% of V3.2's.
- Manifold-constrained Hyperconnection (mHC) replaces traditional residual connections to stabilize cross-layer signal propagation.
- Training now uses the Muon optimizer for faster convergence.

Pre-training data exceeds 32T tokens. Post-training occurs in two stages: first, domain-specific experts are trained with SFT and GRPO reinforcement learning; then online distillation unifies them into a single model.

V4-Pro-Max (the highest inference-intensity mode) claims to be the strongest open-source model currently available, achieving top-tier performance on coding benchmarks and significantly narrowing the gap with closed-source state-of-the-art models on reasoning and agent tasks. V4-Flash-Max delivers reasoning performance close to V4-Pro's given a sufficient thinking budget, but is limited by its parameter scale on pure-knowledge and complex agent tasks. Weights are stored in FP4+FP8 mixed precision. (Source: BlockBeats)
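The headline figures above can be sanity-checked with some back-of-envelope arithmetic. The sketch below uses only the numbers reported in the article (total vs. activated parameters, and the 27%/10% FLOPs and KV-cache ratios); the 400 GB baseline KV-cache figure is a purely hypothetical example, not a DeepSeek specification.

```python
# Back-of-envelope check of the efficiency figures quoted in the article.
# Parameter counts are in billions; ratios are as reported vs. V3.2.

def activated_fraction(total_b: float, active_b: float) -> float:
    """Fraction of an MoE model's weights activated per token."""
    return active_b / total_b

# V4-Pro: 1.6T (1600B) total, 49B activated per token.
pro = activated_fraction(1600, 49)     # ~3.1% of weights touched per token
# V4-Flash: 284B total, 13B activated per token.
flash = activated_fraction(284, 13)    # ~4.6%

# Quoted long-context savings at 1M context, relative to V3.2:
FLOPS_RATIO = 0.27   # single-token inference FLOPs are 27% of V3.2's
KV_RATIO = 0.10      # KV cache is 10% of V3.2's (i.e. a 90% reduction)

def v4_kv_cache_gb(v32_kv_gb: float) -> float:
    """Project a V4-Pro KV-cache footprint from a hypothetical V3.2 one."""
    return v32_kv_gb * KV_RATIO

print(f"V4-Pro activates {pro:.1%} of its weights per token")
print(f"V4-Flash activates {flash:.1%} of its weights per token")
print(f"FLOPs reduction vs. V3.2: {1 - FLOPS_RATIO:.0%}")
# Hypothetical example: a 400 GB V3.2-sized KV cache would shrink to 40 GB.
print(f"400 GB of V3.2 KV cache -> {v4_kv_cache_gb(400):.0f} GB on V4-Pro")
```

The sparse activation ratios are what make the 1.6T total parameter count practical: only a few percent of the weights participate in any single forward pass.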

Disclaimer: The information on this page may have been obtained from third parties and does not necessarily reflect the views or opinions of KuCoin. This content is provided for general informational purposes only, without any representation or warranty of any kind, nor shall it be construed as financial or investment advice. KuCoin shall not be liable for any errors or omissions, or for any outcomes resulting from the use of this information. Investments in digital assets can be risky. Please carefully evaluate the risks of a product and your risk tolerance based on your own financial circumstances. For more information, please refer to our Terms of Use and Risk Disclosure.