DeepSeek open-sources DeepSpec; its DSpark speculative sampling framework speeds up V4 generation by up to 85%

According to Beating Monitor, DeepSeek, in collaboration with Peking University, has released a technical report on DSpark, a speculative sampling acceleration framework, and open-sourced the full-stack codebase DeepSpec. DSpark is already deployed in DeepSeek-V4's production services. Without compromising output quality, it improves single-user generation speed by 60% to 85% for the Flash version and by 57% to 78% for the Pro version. It also outperforms the previous MTP-1 baseline (multi-token prediction with a single draft token) and significantly increases overall system throughput under strict latency constraints.

Until now, multi-token speculative sampling has been difficult to deploy in live production environments. Autoregressive draft models are too slow, while parallel draft models predict each position independently, so acceptance rates drop sharply toward the end of a long draft. Under high concurrency, blindly verifying multi-token drafts makes the large model spend substantial compute on tokens that are almost certain to be rejected, and system throughput collapses. For this reason, industry practice has largely been limited to single-token prediction (MTP-1).

DSpark is designed to remove this throughput bottleneck under high concurrency. It first uses DFlash, a parallel backbone network, to generate hidden states, and then applies an extremely lightweight Markov head. The Markov head injects correlations between adjacent tokens at minimal cost, using only a table lookup and a single matrix multiplication. The system also integrates a confidence prediction head and a posterior calibration algorithm.
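The claim that speculative sampling leaves output quality unchanged rests on its verification rule. The following is a minimal sketch of the standard accept/reject step from the speculative sampling literature, not DSpark's or DeepSpec's actual code; `verify_draft`, `sample`, and the dictionary-based distributions are illustrative names chosen here. It also shows why tail tokens are costly: verification stops at the first rejection, so the target model's work on every later draft position is wasted.

```python
import random


def sample(dist, rng):
    """Draw one token from a {token: weight} mapping (weights need not sum to 1)."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


def verify_draft(draft_tokens, draft_probs, target_probs, rng):
    """Verify drafted tokens against the target model's distributions.

    draft_probs[i] and target_probs[i] map token -> probability at position i.
    target_probs needs one extra entry for the bonus token that is sampled
    when every drafted token is accepted.
    """
    emitted = []
    for i, tok in enumerate(draft_tokens):
        p = target_probs[i].get(tok, 0.0)
        q = draft_probs[i].get(tok, 0.0)
        # Accept with probability min(1, p/q). q == 0 cannot happen for a token
        # that was really sampled from the draft; it is treated as a rejection.
        accept_prob = min(1.0, p / q) if q > 0 else 0.0
        if rng.random() < accept_prob:
            emitted.append(tok)
            continue
        # Rejected: resample from the residual max(0, p - q), then stop.
        residual = {t: max(0.0, pt - draft_probs[i].get(t, 0.0))
                    for t, pt in target_probs[i].items()}
        if sum(residual.values()) <= 0.0:
            residual = target_probs[i]  # numerical edge case: fall back to the target
        emitted.append(sample(residual, rng))
        return emitted
    # Every drafted token was accepted: sample one bonus token from the target.
    emitted.append(sample(target_probs[len(draft_tokens)], rng))
    return emitted
```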
To remain compatible with the zero-overhead scheduling used in production and to avoid leaking future information, the scheduler works asynchronously: it sets the pruning length for candidate tokens at each step from predictions made two steps earlier. Under heavy load, this keeps the large model from verifying high-risk tokens at the tail of a draft.

Alongside DSpark, DeepSeek has open-sourced DeepSpec, a codebase that natively supports open-source large models such as Qwen3 and Gemma. DeepSpec provides a complete Python toolchain covering prompt downloading, large model cache reconstruction, draft model training, and benchmark evaluation. Developers can use the open-source scripts directly to customize and deploy dedicated acceleration modules for different open-source large models locally.
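The asynchronous pruning described above can be illustrated with a small sketch. The report's actual rule is not given here, so everything below is an assumption made for illustration: that the predictions are per-position acceptance estimates, that a draft is cut where the cumulative acceptance estimate falls below a threshold, and the names `draft_length`, `DelayedPruner`, and `min_survival`. The only property taken from the text is that the length applied at a step was decided from predictions two steps earlier.

```python
from collections import deque


def draft_length(confidences, min_survival):
    """Count leading draft positions whose cumulative acceptance estimate
    (the product of per-position confidences) stays at or above min_survival."""
    survival = 1.0
    keep = 0
    for c in confidences:
        survival *= c
        if survival < min_survival:
            break
        keep += 1
    return keep


class DelayedPruner:
    """Hypothetical scheduler helper: the length used at step t comes from
    the confidences observed at step t - delay."""

    def __init__(self, delay=2, default_length=1, min_survival=0.5):
        # Until real decisions arrive, fall back to a fixed default length.
        self.pending = deque([default_length] * delay)
        self.min_survival = min_survival

    def step(self, confidences):
        """Return the length to use now and queue the decision for later."""
        length = self.pending.popleft()
        self.pending.append(draft_length(confidences, self.min_survival))
        return length


pruner = DelayedPruner(delay=2, default_length=1, min_survival=0.5)
history = [
    [0.9, 0.8, 0.5, 0.4],
    [0.99, 0.99, 0.99, 0.99],
    [0.3],
    [0.9],
]
lengths = [pruner.step(c) for c in history]
print(lengths)
```

Because each decision is already queued when a step starts, the scheduler in this sketch never waits on the current step's predictions. Raising `min_survival` under load would shorten drafts and reduce wasted verification.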