Boosting LLM Inference Speed Using Speculative Decoding

A practical guide on using cutting-edge optimization techniques to speed up inference

Image generated using Flux Schnell

Intro

Large language models are extremely power-hungry and require a significant amount of GPU resources to perform well. However, during text generation the transformer architecture does not take full advantage of the GPU's parallelism.

GPUs, by design, can process things in parallel, but the transformer architecture is auto-regressive: to generate the next token, the model has to look at all of the tokens that came before it. Transformers don't allow you to predict the next n tokens in parallel. Ultimately, this makes the generation phase of LLMs quite slow, as each new token must be produced by its own sequential forward pass. Speculative decoding is a novel optimization technique that aims to solve this issue.

Each forward pass produces one new token
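
To make that sequential bottleneck concrete, here is a minimal sketch of a plain greedy auto-regressive generation loop using Hugging Face transformers. The "gpt2" checkpoint and the 20-token budget are just placeholders for illustration, not something the article prescribes; the point is that every iteration runs one full forward pass and appends exactly one token.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # example checkpoint, swap in any causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Speculative decoding is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):  # 20 tokens -> 20 sequential forward passes
        logits = model(input_ids).logits                 # one full forward pass
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        input_ids = torch.cat([input_ids, next_token], dim=-1)      # append, then repeat

print(tokenizer.decode(input_ids[0]))
```

Each step depends on the token produced by the previous step, which is why the loop cannot be parallelized across output positions.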

There are a few different methods for speculative decoding. The technique described in this article uses the two-model approach.

Speculative Decoding

Speculative decoding works by having two models, a large main model and a…