Speculative Decoding with Compressed Draft Models for LLMs: Faster Inference Without Losing Quality

by Vicki Powell, Mar 23, 2026

Generating text one token at a time is slow. Even the most powerful large language models (LLMs) hit a wall when they have to wait for each token to be calculated before moving to the next. This isn't just a minor delay: it makes chatbots feel sluggish, real-time translation lag, and AI assistants take too long to respond. Speculative decoding changes that. It doesn't make the model smarter. It makes it faster, by letting a smaller, simpler model guess the next few words while the big model double-checks them in parallel. And when it works, you can cut inference time by up to 3x.

How Speculative Decoding Works

Imagine you're writing a sentence. You pause after each word, waiting for your brain to come up with the next one. Now imagine someone else whispers the next three words to you. You don't have to think about them; you just check if they make sense. If they do, you keep them. If not, you ignore them and write the next word yourself. That's speculative decoding.

The system uses two models: a small draft model and a large target model. The draft model runs fast because it's tiny, maybe 1/10th the size of the main model. It takes the current context and predicts the next k tokens one after another. Then, instead of generating each token one by one, the big model checks all k tokens together in a single forward pass. If the big model agrees with a draft token, that token is accepted. If it disagrees, everything after the mismatch is thrown out, and the big model supplies just the next token on its own. Then the cycle repeats.
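The loop above can be sketched in a few lines. This is a toy greedy version: the two "models" are stand-in callables that map a token sequence to a next-token distribution, and the verification loop only mimics the comparison logic that a real transformer performs in one batched forward pass.

```python
def argmax(probs):
    """Index of the highest-probability token."""
    return max(range(len(probs)), key=probs.__getitem__)

def speculative_step(draft_model, target_model, context, k=4):
    """One round of greedy speculative decoding (toy sketch).

    draft_model and target_model are hypothetical callables mapping a
    token sequence to a next-token probability distribution.
    Returns the tokens produced by this round.
    """
    # 1. The small draft model proposes k tokens autoregressively (cheap).
    seq = list(context)
    proposed = []
    for _ in range(k):
        tok = argmax(draft_model(seq))
        proposed.append(tok)
        seq.append(tok)

    # 2. The target model checks every proposed position. A real
    #    transformer scores all k positions in ONE forward pass; the
    #    Python loop here only mimics the accept/reject comparison.
    accepted = []
    seq = list(context)
    for tok in proposed:
        target_tok = argmax(target_model(seq))
        if target_tok == tok:       # target agrees: keep the draft token
            accepted.append(tok)
            seq.append(tok)
        else:                       # mismatch: take the target's token, drop the rest
            accepted.append(target_tok)
            break
    else:
        # Every draft token survived: the same verification pass also
        # yields one extra token from the target model for free.
        accepted.append(argmax(target_model(seq)))
    return accepted
```

Note that even in the worst case the round still emits one target-model token, so correctness never depends on the draft model being right; a bad draft only costs speed.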

This isn't magic. It's math. The target model scores each draft token: under greedy decoding, a draft token is accepted if it matches the target model's own top choice; under sampling, it is accepted with probability min(1, p_target/p_draft), a rule that provably preserves the target model's output distribution. The result? The expensive part, the big model's forward pass, covers all k draft tokens at once instead of one token per pass. And since most next tokens are predictable (think: "the", "and", "but"), the draft model gets a lot right.
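The sampling-mode acceptance rule from the original speculative sampling papers can be sketched like this (the function name and the `rng` hook are mine, added for determinism; the math is the published rule: accept with probability min(1, p/q), and on rejection resample from the normalized residual max(0, p − q), which makes the combined output exactly match the target model):

```python
import random

def accept_or_resample(p_target, q_draft, x, rng=random.random):
    """Verify one draft token x under the speculative sampling rule.

    p_target, q_draft: next-token distributions (lists of probabilities)
    from the target and draft models at the same position.
    Returns (accepted, token).
    """
    # Accept x with probability min(1, p_target(x) / q_draft(x)).
    if rng() < min(1.0, p_target[x] / max(q_draft[x], 1e-12)):
        return True, x
    # Rejected: sample a replacement from the residual distribution
    # r(v) ∝ max(0, p_target(v) - q_draft(v)). This correction is what
    # keeps the overall output distribution identical to the target's.
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    total = sum(residual)
    u, acc = rng() * total, 0.0
    for v, r in enumerate(residual):
        acc += r
        if u <= acc:
            return False, v
    return False, len(residual) - 1
```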

Why Draft Models Matter More Than You Think

Not all draft models are created equal. A generic draft model trained on web text might work okay for general chat, but it’ll fail badly if you’re asking it to generate legal documents, medical summaries, or financial reports. Why? Because it doesn’t know your domain.

Research from BentoML showed that theoretical speedups of 3x only happen when the draft model’s predictions match the target model’s probability distribution almost perfectly. In practice, off-the-shelf draft models often achieve acceptance rates below 50%. That means half the time, the system throws away work and falls back to slow, sequential generation. The result? You get maybe a 1.5x speedup instead of 3x.

The fix? Train your own draft model. Not on random internet data, but on your company's documents, customer support logs, internal reports: whatever your LLM actually needs to understand. When you do that, acceptance rates jump. One team at a healthcare startup trained a 700M-parameter draft model on clinical notes. Their acceptance rate went from 42% to 78%, and their latency dropped from 850ms to 280ms per response. No change to the target model. Just a better draft partner.

Breaking the Mold: EAGLE and Medusa

Early speculative decoding relied on a separate, tiny model. But that created new problems: training it, syncing it, keeping it aligned. Two newer approaches cut that middleman out.

EAGLE doesn't train a separate small model from scratch. It builds on the target model itself: instead of predicting tokens directly, a lightweight head predicts the hidden features that come just before the final output. This gives the draft stage access to richer context, information the big model already computes. The result? Better predictions, higher acceptance rates, and larger speedups without maintaining a separate model.

Medusa goes even further. It adds multiple prediction heads directly onto the last hidden layer of the base LLM. No separate model to train, sync, and deploy: just a few extra linear layers, trained on top of the frozen base model, that predict the next 2, 3, or even 5 tokens in one go. It builds a tree of possible continuations and lets the model verify them all at once. In benchmarks, Medusa generated 2.8x as many tokens per second as standard speculative decoding. And because the heads share the base model's representations, there's far less distribution mismatch. It's like the model is whispering to itself.
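The core of the idea fits in a few lines. This sketch uses NumPy and toy sizes (the dimensions, head count, and names are illustrative, not Medusa's real configuration): each lookahead position gets its own linear head over the base model's final hidden state.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab, num_heads = 64, 100, 3   # toy sizes, not Medusa's real config

# Medusa bolts one extra linear head per lookahead position onto the
# base model's final hidden state; only these small heads get trained,
# the base LLM stays frozen.
medusa_heads = [rng.normal(size=(d_model, vocab)) for _ in range(num_heads)]

def medusa_propose(last_hidden):
    """Guess the tokens at offsets +1..+num_heads from one hidden state."""
    return [int(np.argmax(last_hidden @ W)) for W in medusa_heads]

hidden = rng.normal(size=d_model)     # stand-in for the base LLM's hidden state
candidates = medusa_propose(hidden)   # one candidate token per head, in one shot
```

A real implementation expands the top few choices of each head into a tree of candidate continuations and verifies the whole tree in a single forward pass; the sketch only shows where the extra heads sit.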

A single LLM with multiple prediction heads generating and verifying token trees, representing the Medusa architecture.

When You Don’t Need a Draft Model

Not every use case needs this complexity. If you're doing greedy decoding, always picking the single most likely next token, speculative decoding still works. You don't need fancy sampling tricks. Just let the draft model predict a few tokens and have the target model verify them in one batched pass, keeping the longest prefix that matches its own greedy choices. It's simpler, and it still gives you a 1.5x-2x speedup.

And if your current setup already feels fast? Maybe you don’t need to change anything. Speculative decoding isn’t a universal upgrade. It’s a tool. If your model responds in under 400ms for your users, and they’re happy, adding complexity might not be worth it. But if you’re struggling with latency spikes, or users are dropping off because the AI takes too long, then this is where you start.

Implementation Tips

Here’s what actually works in production:

  1. Start with what you have. Try speculative decoding with a pre-trained draft model like Phi-3 or TinyLlama. Measure your acceptance rate. If it’s above 60%, you’re already ahead.
  2. Track your latency, not just tokens. Speedups look great on paper, but what matters is real-world response time. Use tools like Prometheus or Langfuse to monitor end-to-end latency under load.
  3. Train on your data. If you’re in finance, law, or healthcare, train your draft model on your own documents. Use 10,000-50,000 samples. You don’t need millions.
  4. Consider Medusa if you can. If your target model is based on Llama or Mistral, Medusa’s integrated heads are easier to deploy than managing two models. It’s becoming the new standard.
  5. Test with real prompts. Don’t benchmark with generic queries. Use your actual user inputs. A draft model that works well on "What’s the weather?" might fail on "Explain this insurance clause."

A healthcare chatbot's response time drops from 850ms to 280ms after training the draft model on clinical notes.
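Tip 1 asks you to measure the acceptance rate before investing further. A minimal running counter looks like this (the class and field names are mine; wire `record_round` into whatever your serving stack reports per speculation round):

```python
from dataclasses import dataclass

@dataclass
class SpecDecodeStats:
    """Running acceptance-rate counter for speculative decoding."""
    proposed: int = 0
    accepted: int = 0

    def record_round(self, num_proposed: int, num_accepted: int) -> None:
        """Log one draft/verify round: k proposed tokens, n accepted."""
        self.proposed += num_proposed
        self.accepted += num_accepted

    @property
    def acceptance_rate(self) -> float:
        return self.accepted / self.proposed if self.proposed else 0.0

stats = SpecDecodeStats()
stats.record_round(num_proposed=4, num_accepted=3)
stats.record_round(num_proposed=4, num_accepted=2)
# 5 of 8 draft tokens accepted -> 62.5%, just above the 60% rule of thumb
```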

What You Gain (and What You Don’t)

Speculative decoding doesn’t change what the model says. It only changes how fast it says it. The output distribution stays identical to standard decoding. That’s huge. You’re not sacrificing accuracy for speed. You’re not introducing hallucinations. You’re not altering the model’s behavior. You’re just making it quicker.

This matters for compliance, safety, and trust. In regulated industries, you can’t afford to tweak the model’s output. But you can absolutely speed up how fast it delivers the same answer.

And the best part? It works on any autoregressive transformer. Llama, Mistral, Qwen, Phi, you name it. As long as it generates tokens one by one, speculative decoding can help.

Future Directions

Research is moving fast. LayerSkip is testing whether exiting early from intermediate layers of the target model can generate draft tokens without extra parameters. Multi-token prediction (MTP) heads are exploring how to predict longer sequences with dynamic tree branching. And companies are starting to combine speculative decoding with quantization and pruning, stacking optimizations on top of each other.

But the core idea stays the same: if you can guess well, you don’t have to calculate everything. The future of LLM inference isn’t bigger models. It’s smarter shortcuts.

What’s the difference between speculative decoding and quantization?

Quantization reduces model size by using lower-precision numbers (like 8-bit instead of 32-bit), which makes inference faster and uses less memory. Speculative decoding doesn't change the model's weights; it uses a smaller model to predict ahead and lets the big model verify. You can use both together: quantize the target model and pair it with a speculative draft model for even better speed.

Can speculative decoding be used with any LLM?

Yes, as long as it’s an autoregressive transformer model that generates tokens one at a time. This includes Llama, Mistral, Phi, Qwen, and others. It doesn’t work with non-autoregressive models like those that generate all tokens at once (e.g., some summarization models). But for chat, coding, and text generation, it’s widely compatible.

Do I need to retrain my main LLM to use speculative decoding?

No. The target model stays unchanged. You only need to train or select a smaller draft model. With architectures like Medusa, you don't even need a separate model; you add prediction heads directly to your existing LLM. No retraining of the base model is required.

Is speculative decoding better than using a smaller LLM instead?

It depends. Using a smaller model alone means sacrificing quality. Speculative decoding keeps the quality of the big model while speeding it up. If you need high accuracy and speed, speculative decoding wins. If you can accept lower quality for lower cost, a smaller model might be enough. But for production systems where quality matters, speculative decoding is the better trade-off.

How much faster will my LLM be with speculative decoding?

In ideal conditions-high acceptance rate, good draft model, and parallel verification-you can see 2x to 3x speedups in tokens per second. Real-world results vary: 1.5x-2.5x is common. If your draft model is poorly matched, you might only get 1.2x. The key is tuning the draft model to your data. The best results come from domain-specific training, not off-the-shelf models.
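These ranges follow from a simple formula. Leviathan et al.'s speculative decoding paper derives the expected improvement factor from the per-token acceptance probability, the draft length, and the draft model's relative cost; a quick calculator (the function name is mine) lets you sanity-check your own numbers before deploying:

```python
def expected_speedup(alpha: float, k: int, c: float) -> float:
    """Expected walltime improvement factor from Leviathan et al. (2023).

    alpha: probability that a single draft token is accepted
    k:     number of tokens drafted per round
    c:     draft-model cost relative to one target-model forward pass
    """
    return (1 - alpha ** (k + 1)) / ((1 - alpha) * (k * c + 1))

# 80% acceptance, 4 drafted tokens, draft model at 10% of target cost:
# roughly a 2.4x speedup, matching the 2x-3x range quoted above.
good = expected_speedup(alpha=0.8, k=4, c=0.1)

# 0% acceptance: every round wastes the draft work, so the "speedup"
# drops below 1x, which is why a poorly matched draft model can hurt.
bad = expected_speedup(alpha=0.0, k=4, c=0.1)
```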