Boosting LLM Inference Speed Using Speculative Decoding

A practical guide on using cutting-edge optimization techniques to speed up inference

Image generated using Flux Schnell

Intro

Large language models are extremely power-hungry and require a significant amount of GPU resources to perform well. However, during text generation the transformer architecture does not take full advantage of the GPU's parallelism.

GPUs, by design, can process things in parallel, but the transformer architecture is auto-regressive: to generate the next token, the model has to look at all of the tokens that came before it. Transformers don't allow you to predict the next n tokens in parallel. Ultimately, this makes the generation phase of LLMs quite slow, as each new token must be produced by its own sequential forward pass. Speculative decoding is a novel optimization technique that aims to solve this issue.

Each forward pass produces one new token
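
To make that sequential bottleneck concrete, here is a minimal sketch of a plain greedy auto-regressive generation loop using Hugging Face transformers. The "gpt2" checkpoint and the 20-token budget are just placeholders for illustration, not something the article prescribes; the point is that every iteration runs one full forward pass and appends exactly one token.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # example checkpoint, swap in any causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Speculative decoding is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):  # 20 tokens -> 20 sequential forward passes
        logits = model(input_ids).logits                 # one full forward pass
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        input_ids = torch.cat([input_ids, next_token], dim=-1)      # append, then repeat

print(tokenizer.decode(input_ids[0]))
```

Each step depends on the token produced by the previous step, which is why the loop cannot be parallelized across output positions.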

There are a few different methods for speculative decoding. The technique described in this article uses the two-model approach.

Speculative Decoding

Speculative decoding works by having two models, a large main model and a…