ChainThink reports that, according to official information, DeepSeek released a preview version of its V4 series on April 24 under the MIT license, with model weights now available on Hugging Face and ModelScope.
The series includes two MoE models: V4-Pro has 1.6 trillion total parameters and activates 49 billion per token, while V4-Flash has 284 billion total parameters and activates 13 billion per token. Both models support a context length of 1 million tokens.
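For readers unfamiliar with why an MoE model can have far more total parameters than it activates per token, the following is a minimal, generic top-k routing sketch in PyTorch. It is an illustration of the general technique only; the dimensions, expert count, and gating scheme are made up and are not DeepSeek V4's actual router.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Generic top-k mixture-of-experts layer: each token is routed to only
    `top_k` of `n_experts` feed-forward experts, so most parameters stay
    inactive for any given token (illustration only, not DeepSeek's router)."""

    def __init__(self, d_model=256, d_ff=512, n_experts=16, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):                        # x: (n_tokens, d_model)
        scores = self.gate(x)                    # (n_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):           # simple loops, kept readable
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

moe = TopKMoE()
print(moe(torch.randn(8, 256)).shape)            # torch.Size([8, 256])
```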
The series features three upgrades. The first is a hybrid attention mechanism combining Compressed Sparse Attention (CSA) with Heavily Compressed Attention (HCA), which sharply reduces long-context overhead: at a 1M-token context, V4-Pro reportedly needs only 27% of the per-token FLOPs and 10% of the KV-cache memory of V3.2.
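To put the KV-cache claim in perspective, here is a back-of-the-envelope sizing calculation. The layer count, head count, head dimension, and storage precision below are hypothetical placeholders, not V4's real configuration; only the quoted 10% reduction figure comes from the report.

```python
# Back-of-the-envelope KV-cache sizing for a 1M-token context.
# All dimensions here are hypothetical, not DeepSeek V4's actual config.
context_len   = 1_000_000        # tokens
n_layers      = 60
n_kv_heads    = 8                # grouped/compressed KV heads (assumption)
head_dim      = 128
bytes_per_val = 1                # e.g. FP8 storage (assumption)

# Full (uncompressed) cache: one K and one V vector per layer, head, and token.
full_cache = context_len * n_layers * n_kv_heads * head_dim * 2 * bytes_per_val
print(f"uncompressed KV cache: {full_cache / 2**30:.1f} GiB")

# A scheme that keeps only ~10% of that memory, as the report claims
# for V4-Pro relative to V3.2:
print(f"compressed KV cache:   {0.10 * full_cache / 2**30:.1f} GiB")
```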
The second is manifold-constrained hyperconnections (mHC), which replace traditional residual connections to stabilize cross-layer signal propagation. The third is faster training with the Muon optimizer. The models were pre-trained on more than 32 trillion tokens.
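The core idea of Muon, the optimizer named above, is to orthogonalize each weight matrix's momentum with a Newton-Schulz iteration before applying it as an update. The sketch below uses the classic cubic Newton-Schulz polynomial and made-up hyperparameters for a single matrix; production Muon implementations use a tuned quintic polynomial and extra scaling, so treat this as an outline of the idea rather than DeepSeek's training code.

```python
import torch

def newton_schulz_orthogonalize(g, steps=5, eps=1e-7):
    """Approximately orthogonalize g with the cubic Newton-Schulz iteration
    X <- 1.5 X - 0.5 X X^T X (real Muon uses a tuned quintic variant)."""
    x = g / (g.norm() + eps)                 # normalize so the iteration converges
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

def muon_step(weight, grad, momentum, lr=0.02, beta=0.95):
    """One Muon-style update for a single 2-D weight matrix (sketch only)."""
    momentum.mul_(beta).add_(grad)           # standard momentum accumulation
    update = newton_schulz_orthogonalize(momentum)
    weight.add_(update, alpha=-lr)
    return weight, momentum

w = torch.randn(256, 512)
m = torch.zeros_like(w)
for _ in range(3):                           # pretend training steps
    g = torch.randn_like(w) * 0.01           # stand-in for a real gradient
    w, m = muon_step(w, g, m)
```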
Post-training consists of two stages: domain-specific expert models are first trained separately with SFT and GRPO reinforcement learning, and are then unified into the final model through online distillation.
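GRPO's distinguishing feature is that it drops the learned value function and instead normalizes rewards within a group of responses sampled for the same prompt. The snippet below shows that advantage computation with made-up reward numbers; it is a sketch of the published GRPO recipe in general, not of DeepSeek's V4 post-training pipeline.

```python
import torch

def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantages as in GRPO: each sampled response is scored
    against the other responses for the *same* prompt, which replaces a
    learned value baseline. group_rewards: (n_prompts, group_size)."""
    mean = group_rewards.mean(dim=-1, keepdim=True)
    std = group_rewards.std(dim=-1, keepdim=True)
    return (group_rewards - mean) / (std + eps)

# 2 prompts, 4 sampled responses each; rewards are made-up numbers.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.2, 0.9, 0.4, 0.1]])
print(grpo_advantages(rewards))
# These advantages then weight each response's token log-probabilities
# in a PPO-style clipped policy-gradient objective.
```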
DeepSeek claims V4-Pro-Max is the strongest open-source model currently available, achieving top-tier results on coding benchmarks and significantly narrowing the gap to closed-source frontier models on reasoning and agent tasks. Given a sufficient thinking budget, V4-Flash-Max approaches the Pro's reasoning performance, but its smaller parameter count limits it on pure knowledge and complex agent tasks. Model weights are stored in FP4+FP8 mixed precision.
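Low-precision weight storage of this kind is typically done with block-scaled quantization, where each small block of a weight matrix gets its own scale factor. The sketch below shows the FP8 (e4m3) half of such a recipe, assuming a recent PyTorch (2.1+) with float8 dtypes; the block size, function names, and the question of which tensors go to FP4 versus FP8 are assumptions, not details from the release.

```python
import torch

def quantize_fp8_blockwise(w, block=128):
    """Block-scaled FP8 (e4m3) quantization sketch: each `block`-wide slice of
    a 2-D weight gets its own scale so outliers don't distort the whole row.
    Generic recipe for illustration, not DeepSeek's actual storage format."""
    assert w.shape[1] % block == 0
    w = w.float().view(w.shape[0], -1, block)                        # (rows, n_blocks, block)
    scale = (w.abs().amax(dim=-1, keepdim=True) / 448.0).clamp(min=1e-12)  # 448 = max finite e4m3
    q = (w / scale).to(torch.float8_e4m3fn)                          # cast to FP8 storage
    return q, scale

def dequantize_fp8_blockwise(q, scale, out_cols):
    return (q.to(torch.float32) * scale).view(q.shape[0], out_cols)

w = torch.randn(4, 256)
q, s = quantize_fp8_blockwise(w)
w_hat = dequantize_fp8_blockwise(q, s, 256)
print((w - w_hat).abs().max())     # small reconstruction error
```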
