A practical guide to using cutting-edge optimization techniques to speed up inference
Intro
Large language models are extremely power-hungry and require significant GPU resources to perform well. However, the transformer architecture does not take full advantage of the GPU.
GPUs, by design, process work in parallel, but the transformer architecture is auto-regressive: to generate the next token, the model must attend to every token that came before it. Transformers therefore don't allow you to predict the next n tokens in parallel. Ultimately, this makes the generation phase of LLMs quite slow, as each new token must be produced sequentially. Speculative decoding is a novel optimization technique that aims to solve this issue.
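To make the bottleneck concrete, here is a minimal sketch of plain auto-regressive (greedy) decoding using the Hugging Face transformers library; the model name "gpt2", the prompt, and the 20-token limit are arbitrary choices for illustration. Notice how every new token requires its own full forward pass.

```python
# Minimal sketch: plain auto-regressive (greedy) decoding.
# Assumes torch and transformers are installed; "gpt2" is just an example model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The quick brown fox", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):                      # one iteration per generated token
        logits = model(input_ids).logits     # full forward pass each step
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)

print(tokenizer.decode(input_ids[0], skip_special_tokens=True))
```

Because each iteration depends on the token produced by the previous one, the loop cannot be parallelized across steps; speculative decoding targets exactly this dependency.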
There are a few different methods for speculative decoding. The technique described in this article uses the two-model approach.
Speculative Decoding
Speculative decoding works by having two models, a large main model and a…