DeepSeek open-sources the DeepSpec framework, boosting the V4 model's speed by up to 85%.

MarsBit

According to Beating Monitor, DeepSeek and Peking University have released a technical report on DSpark, a speculative sampling acceleration framework, and open-sourced DeepSpec, the accompanying full-stack codebase. DSpark is already deployed in DeepSeek-V4's production services. Without compromising output quality, it improves single-user generation speed by 60% to 85% for the Flash version and by 57% to 78% for the Pro version. It also outperforms the previous MTP-1 baseline (multi-token prediction with a single draft token) and significantly increases overall system throughput under strict latency constraints.

Multi-token speculative sampling has previously been difficult to deploy in live production environments. Autoregressive draft models are too slow, while parallel draft models predict each position independently, so acceptance rates for the later positions of a long draft are extremely low. Under high concurrency, blindly verifying multi-token drafts makes the large model spend substantial compute on tokens that are almost certain to be rejected, and system throughput collapses. As a result, industry practice has largely been limited to single-token prediction (MTP-1).

DSpark removes this throughput bottleneck under high concurrency. It first uses DFlash, a parallel backbone network, to generate hidden states, then applies an extremely lightweight Markov head. The Markov head injects correlations between adjacent tokens at minimal cost, using a table lookup and a single matrix multiplication. The system also integrates a confidence prediction head and a posterior calibration algorithm.
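To make the Markov head idea concrete, here is a minimal sketch in Python with NumPy. It is not taken from the DSpark report or the DeepSpec code. The names `prev_token_table`, `out_proj`, and `markov_head_draft`, the toy sizes, the random weights, and the choice to add the looked-up vector to the hidden state are all assumptions made for illustration. The only parts drawn from the description above are the ingredients: hidden states from a parallel backbone, one table lookup keyed on the previous token, and one matrix multiplication per position.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes chosen for illustration only.
VOCAB, HIDDEN, DRAFT_LEN = 50, 16, 4

# Hypothetical parameters: random here, trained in a real system.
prev_token_table = rng.normal(size=(VOCAB, HIDDEN)) * 0.1  # lookup table
out_proj = rng.normal(size=(HIDDEN, VOCAB)) * 0.1          # the single matmul


def markov_head_draft(hidden_states, last_token):
    """Greedy draft where each position depends on the previously drafted token."""
    draft = []
    prev = last_token
    for h in hidden_states:                 # one hidden state per draft position
        fused = h + prev_token_table[prev]  # table lookup adds previous-token information
        logits = fused @ out_proj           # one matrix multiplication per position
        prev = int(np.argmax(logits))       # greedy pick (an assumption of this sketch)
        draft.append(prev)
    return draft


# Stand-in for the parallel backbone's output: one vector per draft position.
hidden_states = rng.normal(size=(DRAFT_LEN, HIDDEN))
draft = markov_head_draft(hidden_states, last_token=7)
print(draft)
```

The backbone can still compute all hidden states in one parallel pass. Only the cheap lookup and projection run position by position, which is how a head of this kind could restore dependence between adjacent tokens without the cost of a fully autoregressive draft model.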
To stay compatible with the zero-overhead scheduling used in production and to avoid leaking future information, the scheduler works asynchronously: it sets the pruning length for candidate tokens from predictions made two steps earlier. Under heavy load, this keeps the large model from verifying high-risk tail tokens that are likely to be wrong.

Alongside DSpark, DeepSeek has open-sourced DeepSpec, a codebase that natively supports open-source large models such as Qwen3 and Gemma. DeepSpec provides a complete Python toolchain covering prompt downloading, large-model cache reconstruction, draft model training, and benchmark evaluation. Developers can use the open-source scripts directly to customize and deploy dedicated acceleration modules for different open-source large models locally.
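The asynchronous pruning rule described above, where the draft length is set from predictions made two steps earlier, can be sketched as follows. This is an illustrative sketch under stated assumptions, not DeepSpec's scheduler. The names `keep_length` and `LaggedPruner`, the 0.5 threshold, the warm-up fallback to one token, and the rule "keep tokens while the product of per-position confidences stays above the threshold" are all hypothetical.

```python
from collections import deque


def keep_length(confidences, threshold):
    """Count leading draft tokens whose joint acceptance estimate stays >= threshold."""
    joint = 1.0
    kept = 0
    for c in confidences:
        joint *= c              # estimated probability that all tokens so far are accepted
        if joint < threshold:
            break
        kept += 1
    return kept


class LaggedPruner:
    """Picks the verify length for the current step from confidences seen `lag` steps ago."""

    def __init__(self, max_len, threshold=0.5, lag=2):
        self.max_len = max_len
        self.threshold = threshold
        self.lag = lag
        self.history = deque(maxlen=lag)  # the most recent `lag` confidence vectors

    def next_length(self, new_confidences):
        if len(self.history) == self.lag:
            # history[0] is the prediction made `lag` steps before this one.
            length = keep_length(self.history[0], self.threshold)
        else:
            length = 1  # warm-up fallback to single-token drafts (assumption)
        self.history.append(new_confidences)
        # Always verify at least one draft token and never exceed max_len (assumption).
        return max(1, min(length, self.max_len))


pruner = LaggedPruner(max_len=4, threshold=0.5, lag=2)
steps = [
    [0.9, 0.8, 0.7, 0.6],
    [0.95, 0.9, 0.9, 0.9],
    [0.6, 0.5, 0.4, 0.3],
    [0.9, 0.9, 0.9, 0.9],
]
lengths = [pruner.next_length(c) for c in steps]
print(lengths)
```

Because each decision reads only confidences that were already available two steps earlier, a scheduler built this way never has to wait for the current step's draft before choosing a length. That is one plausible reading of how the lag fits with zero-overhead scheduling.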
