Gemma 4 Runs on Consumer Hardware: Google's New Drafter Triples Speed

Google has released the Multi-Token Prediction (MTP) drafter for its open Gemma 4 model family, promising text generation up to three times faster. The main advantage is that the speedup comes without any degradation in output quality or in the model's reasoning capabilities.

The underlying problem lies in the architecture of Large Language Models, which generate text one fragment (token) at a time. For every token produced, the hardware must move billions of parameters from memory to the compute units, which often makes the experience on consumer PCs frustrating and marked by long waits. Until now the only way out was to use smaller or compressed models; Mountain View's move instead focuses entirely on optimizing the serving process.
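A rough calculation shows why this memory traffic caps generation speed. The sketch below assumes that every weight is read from memory exactly once per token and ignores the KV cache, activations, and compute time. The function name and the input figures are hypothetical and chosen only for illustration; they are not measurements from Google.

```python
def max_tokens_per_second(params_billions: float,
                          bytes_per_param: float,
                          bandwidth_gb_per_s: float) -> float:
    """Upper bound on decode speed when every weight is read once per token."""
    # Billions of parameters times bytes per parameter gives gigabytes.
    model_gb = params_billions * bytes_per_param
    # Each token needs one full read of the weights, so bandwidth is the limit.
    return bandwidth_gb_per_s / model_gb


# Hypothetical inputs: 26 billion parameters, 2 bytes each (16-bit weights),
# and 1000 GB/s of memory bandwidth.
estimate = max_tokens_per_second(26, 2, 1000)
print(f"{estimate:.1f} tokens/s at most")
```

Under these assumptions the ceiling is about 19 tokens per second, however fast the compute units are. That is why the compute units spend much of their time waiting during ordinary decoding.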

Google achieved this through speculative decoding, a technique the company has been exploring since 2022 and which now finds its most mature application. The system pairs the large, dense main model with an extremely lightweight "drafter" that acts as a fast predictor: it proposes a sequence of several tokens at once in a fraction of the time the target model would need to produce just one.

The main model then verifies the entire sequence in a single parallel pass. If the drafter's predictions are correct, the whole sequence is accepted at once, using GPU compute cycles that would otherwise sit idle while traditional inference waits on memory. In this case the target model can even generate one additional token during the verification pass, maximizing time efficiency. If a prediction is wrong, the draft is discarded from the first mismatch onward and the main model's own token is used instead, which is why output quality is unaffected.
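The draft-then-verify loop can be sketched in a few lines. This is a simplified greedy variant, not Google's implementation: the two "models" are hypothetical toy functions that return one next token, the drafter proposes its tokens one after another rather than through multi-token prediction, and the parallel verification pass is only emulated with a loop.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a model: maps a token sequence to the single
# next token it would pick greedily.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, drafter: NextToken,
                     context: List[int], k: int) -> List[int]:
    """Return the tokens produced by one draft-then-verify round."""
    # 1. The cheap drafter proposes k tokens.
    draft: List[int] = []
    for _ in range(k):
        draft.append(drafter(context + draft))

    # 2. The target picks its own token at each of the k + 1 positions.
    #    A real system does this in one batched pass; this loop emulates it.
    target_choices = [target(context + draft[:i]) for i in range(k + 1)]

    # 3. Keep drafted tokens for as long as they match the target's choice.
    accepted: List[int] = []
    for i in range(k):
        if draft[i] != target_choices[i]:
            break
        accepted.append(draft[i])

    # 4. Add the target's token at the first mismatch, or the extra token
    #    that follows a fully accepted draft.
    accepted.append(target_choices[len(accepted)])
    return accepted


def generate(target: NextToken, drafter: NextToken, prompt: List[int],
             n_tokens: int, k: int = 4) -> Tuple[List[int], int]:
    """Generate n_tokens and count how many verification rounds were needed."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, drafter, out, k))
        rounds += 1
    return out[:len(prompt) + n_tokens], rounds


# Toy models: the target counts upward modulo 10. The drafter agrees with it
# except right after a 7, where it guesses wrong.
def toy_target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


def toy_drafter(seq: List[int]) -> int:
    return 0 if seq[-1] == 7 else (seq[-1] + 1) % 10


tokens, rounds = generate(toy_target, toy_drafter, [0], n_tokens=12, k=4)
print(tokens, rounds)
```

In this toy run the output is identical to what the target would produce alone, but it takes fewer verification rounds than tokens generated. Production systems that sample instead of decoding greedily use a probabilistic acceptance rule in place of the exact-match check shown here.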

Data released by Google shows clear benefits across various hardware architectures. A configuration based on an NVIDIA RTX Pro 6000 running Gemma 4 26B doubled its tokens per second with the MTP drafter enabled. On the Apple Silicon front, tests indicate a speedup of about 2.2x with batch sizes between 4 and 8 requests. Although the theoretical ceiling of 3x is not reached in every usage scenario, this approach turns previously barely usable models into fluid tools ready for integration into professional workflows.
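One way to see why the gain varies between setups is the standard estimate from the speculative decoding literature for how many tokens a single verification pass yields. The sketch below assumes that each drafted token is accepted independently with the same probability. The acceptance rates and draft length are hypothetical, not figures from Google, and the result counts tokens per pass rather than wall-clock speedup, since the drafter's own cost and batching effects are ignored.

```python
def expected_tokens_per_pass(acceptance_rate: float, draft_length: int) -> float:
    """Expected tokens per target pass: (1 - a**(k + 1)) / (1 - a)."""
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if acceptance_rate == 1.0:
        # Every drafted token is accepted, plus the target's extra token.
        return float(draft_length + 1)
    return (1 - acceptance_rate ** (draft_length + 1)) / (1 - acceptance_rate)


# Hypothetical acceptance rates with a draft of 4 tokens.
for rate in (0.5, 0.8, 0.95):
    print(rate, round(expected_tokens_per_pass(rate, 4), 2))
```

The estimate falls quickly as the drafter agrees less often with the main model, which is consistent with speedups that depend on the workload and the hardware.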

Unlike experimental approaches such as diffusion models applied to language (which still show quality gaps compared to transformers), speculative decoding does not alter the weights or architecture of the original model; it only changes how the inference workload is scheduled. To further refine the process, Google has implemented a shared KV cache (Key-Value cache), so the drafter does not have to recompute context that the main model has already processed.

The AI ecosystem is in a phase where software efficiency weighs as much as, if not more than, raw hardware power. The example of DeepSeek has already shown how cutting training and inference costs can shake markets and the valuations of silicon giants like NVIDIA. With the MTP drafter, Google positions itself strategically in the local AI segment, where latency is the determining factor for the success of coding assistants, voice interfaces, and autonomous agents.

The new drafter is already available through major industry repositories, including Hugging Face, Kaggle, and Ollama, and is distributed under the Apache 2.0 license. Native support is provided for the most popular frameworks, such as vLLM, MLX, SGLang, and Hugging Face's Transformers library. With this, Big G aims to enable immediate adoption by anyone already using the Gemma 4 family in local applications or optimized cloud environments.