Running AI models on your own computer sounds great, but the reality often isn't. The pitch promises privacy, no subscription fees, and data that never leaves your device. For most people, though, what they actually get is a cursor that blinks for five seconds between sentences.
This bottleneck has a name: inference speed. It has nothing to do with the model's intelligence; it is a hardware problem. Standard AI models generate text one word fragment (a "token") at a time, and for every token the hardware must move billions of parameters from memory to the compute units. That design is inherently slow, and on consumer-grade hardware it is simply unbearable.
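A back-of-envelope calculation shows why generation is bound by memory bandwidth rather than raw compute. The numbers below are illustrative assumptions (a 26-billion-parameter model in 16-bit weights, roughly 100 GB/s of consumer memory bandwidth), not measured figures:

```python
# Why token-by-token generation is memory-bound: every new token forces
# a full pass over the model's weights. Figures are assumptions.
params = 26e9           # parameters in the model
bytes_per_param = 2     # 16-bit weights
bandwidth = 100e9       # bytes/s of memory bandwidth, typical consumer PC

bytes_per_token = params * bytes_per_param      # weights read once per token
tokens_per_second = bandwidth / bytes_per_token
print(f"{tokens_per_second:.1f} tokens/s")      # ~1.9 tokens/s
```

Under those assumptions the ceiling is about two tokens per second, no matter how fast the GPU's arithmetic units are.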
Most people work around this by running smaller, less capable models, or heavily compressed versions known as quantized models. Neither fix is free: both trade quality for speed. They run, but they are not the models you actually wanted.
Now Google has proposed a different approach. The company has just released Multi-Token Prediction (MTP) draft models for its open Gemma 4 family, a technique that delivers up to 3x speed improvements without compromising model quality or reasoning ability.
The method is called speculative decoding, and the idea is not new: Google researchers published the foundational paper back in 2022. Only now is it reaching mainstream use, because it needs the right architecture to run well at scale.
In short, here's how it works: instead of making one large, powerful model do all the work, you pair it with a small draft model. The draft model is fast and cheap; it can propose several tokens in less time than the main model needs to generate a single one. The large model then checks all of those proposals at once. When they are correct, you get the whole sequence for the cost of one forward pass.
According to Google, "If the target model agrees with the draft, it accepts the entire sequence in a single forward pass—even generating its own additional tokens in the process."
And nothing is lost: the large model, such as the 31-billion-parameter dense version of Gemma 4, still verifies every token, so output quality stays exactly the same. You are simply using compute capacity that would otherwise sit idle while the hardware waits on memory.
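A minimal sketch of the verification logic makes the lossless claim concrete. Here `draft_next` and `target_argmax` are hypothetical stand-ins for the small and large models, and this is the general greedy-acceptance scheme, not Google's exact implementation:

```python
def speculative_step(prompt, draft_next, target_argmax, k=4):
    """One round of draft-then-verify. Returns the extended sequence."""
    # 1. The small draft model proposes k tokens, one cheap call each.
    draft, ctx = [], list(prompt)
    for _ in range(k):
        token = draft_next(ctx)
        draft.append(token)
        ctx.append(token)

    # 2. The large target model scores prompt + draft in ONE forward pass,
    #    yielding its own next-token choice at every position:
    #    verified[i] is what the target would emit after prompt + draft[:i].
    verified = target_argmax(prompt + draft)  # length k + 1

    # 3. Accept the longest prefix on which draft and target agree.
    accepted = []
    for proposed, checked in zip(draft, verified):
        if proposed != checked:
            break
        accepted.append(proposed)

    # 4. The target's choice at the first disagreement (or after a fully
    #    accepted draft) is a free bonus token, so every round nets at
    #    least one token, and up to k + 1, from a single big pass.
    bonus = verified[len(accepted)]
    return prompt + accepted + [bonus]
```

Because the target model signs off on every emitted token, the output is identical to running it alone; the draft only changes how fast you get there.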
Google states that the draft model shares the key-value cache (KV cache), a memory structure that stores previously processed context, with the target model, so the pair avoids recomputing information the large model already knows. For the smaller edge models aimed at phones and Raspberry Pi devices, the team even developed an efficient clustering technique to cut generation time further.
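As a toy illustration of what sharing a cache buys you, consider the bookkeeping below. The names and shapes are made up for the sketch, not Google's implementation: both models append to the same per-layer store, and entries for rejected draft tokens are simply rolled back instead of recomputed.

```python
import numpy as np

class SharedKVCache:
    """Append-only key/value store for one attention layer, shared by
    the draft and target models so context is never processed twice."""

    def __init__(self, head_dim: int):
        self.keys = np.empty((0, head_dim), dtype=np.float32)
        self.values = np.empty((0, head_dim), dtype=np.float32)

    def append(self, k: np.ndarray, v: np.ndarray) -> None:
        # Called once per token by whichever model processes it first.
        self.keys = np.vstack([self.keys, k])
        self.values = np.vstack([self.values, v])

    def rollback(self, length: int) -> None:
        # Discard entries for draft tokens the target model rejected.
        self.keys = self.keys[:length]
        self.values = self.values[:length]
```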
This is not the only attempt to parallelize text generation. Diffusion-based language models, such as Mercury from Inception Labs, take a fundamentally different approach: instead of predicting one token at a time, they start from noise and iteratively refine the entire output. Though fast in theory, diffusion language models have struggled to match the quality of traditional Transformer models, leaving them more a research subject than a practical tool.
Speculative decoding is different because it doesn't change the underlying model at all. It is a serving optimization, not an architecture replacement: the same Gemma 4 you were already running just gets faster.
The real-world gains are significant. In Google's own benchmarks, enabling the MTP draft for the Gemma 4 26B model on an Nvidia RTX Pro 6000 desktop GPU roughly doubled tokens per second. On Apple Silicon, batches of 4 to 8 requests saw roughly a 2.2x speedup. Not every scenario reaches the 3x ceiling, but that is still the difference between "barely usable" and "fast enough to be practical."
Context is crucial here. When China's DeepSeek shocked the market in January 2025, wiping $600 billion off Nvidia's market value in a single day, the takeaway was that efficiency gains can matter more than simply piling on compute: smarter execution beats relentless hardware spending. Google's MTP drafts are another step in that direction, though this time aimed squarely at consumers.
The AI industry right now resembles a triangle with three sides: inference, training, and memory. A breakthrough on any one side can drive or disrupt the whole ecosystem. DeepSeek's training method, building powerful models on lower-end hardware, is one example; Google's TurboQuant, which shrinks AI memory use without sacrificing quality, is another. Both papers triggered market turmoil as companies scrambled to respond.
Google says the draft models can "improve responsiveness: significantly reduce latency for near-real-time chat, immersive voice applications, and agent workflows," all tasks that need low latency to work at all.
The use cases are clear and immediate: a local coding assistant with no lag; a voice interface that answers before you forget what you asked; an agent workflow that finishes its steps without three-second pauses. All of it on the hardware you already own.
The MTP drafts are available now on Hugging Face, Kaggle, and Ollama under the Apache 2.0 license. They work out of the box with vLLM, MLX, SGLang, and Hugging Face Transformers.
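If you want to try the draft-and-verify pattern yourself, Hugging Face Transformers exposes it as assisted generation via the `assistant_model` argument to `generate`. The checkpoint names below are placeholders; substitute whichever Gemma target and draft checkpoints you actually download:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "google/gemma-target-model"  # placeholder checkpoint name
draft_id = "google/gemma-draft-model"    # placeholder checkpoint name

tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(draft_id, device_map="auto")

inputs = tokenizer("Explain speculative decoding in one sentence.",
                   return_tensors="pt").to(target.device)

# Passing assistant_model turns on assisted (speculative) generation:
# the draft proposes tokens and the target verifies them in batches.
out = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```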
