Efficiency · November 30, 2022 · Google Research
Fast Inference from Transformers via Speculative Decoding
Yaniv Leviathan, Matan Kalman, Yossi Matias
Abstract
We present speculative decoding, an algorithm that accelerates inference from large autoregressive models without any change to the model outputs. The key idea is to use a smaller, faster draft model to generate candidate tokens, which are then verified in parallel by the larger target model. This yields up to a 3x speedup while producing exactly the same output distribution as sampling from the target model alone.
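The verification step relies on a modified rejection-sampling rule: a draft token x is accepted with probability min(1, p(x)/q(x)), where q is the draft model's distribution and p is the target model's; on rejection, a replacement token is drawn from the normalized residual max(0, p - q). A minimal sketch of that per-token rule (toy NumPy distributions standing in for real models):

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(p, q, draft_token):
    """One verification step of speculative sampling.

    p: target model's distribution over the vocabulary at this position
    q: draft model's distribution at the same position
    draft_token: token index proposed by the draft model (sampled from q)

    Accepts the draft token with probability min(1, p[x]/q[x]); on
    rejection, resamples from the normalized residual max(0, p - q).
    Together these two cases make the output exactly distributed as p.
    """
    accept_prob = min(1.0, p[draft_token] / q[draft_token])
    if rng.random() < accept_prob:
        return draft_token, True
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p), p=residual)), False

# When draft and target agree exactly, the token is always accepted:
p = np.array([0.2, 0.5, 0.3])
print(speculative_step(p, p.copy(), 1))  # (1, True)
```

Note that rejection is only possible where p(x) < q(x), so the residual always has positive mass and the resampling step is well defined.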
Key Findings
- Achieved up to a 3x inference speedup without changing model outputs
- Used a smaller draft model to generate candidate tokens verified by the target model
- Maintained exactly the same output distribution as standard decoding
- Required no model retraining or architecture changes
- Demonstrated practical applicability across different model sizes
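The findings above fit together in one loop: the draft model speculates several tokens cheaply, a single batched target-model pass scores them all, and accepted tokens are kept up to the first rejection. A hedged end-to-end sketch (the toy `toy_dist` "models", vocabulary size, and draft length are illustrative assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
VOCAB = 8
GAMMA = 4  # draft tokens speculated per target-model call (assumed value)

def toy_dist(ctx, temperature):
    """Stand-in for a language model: a deterministic softmax over
    VOCAB tokens derived from the context (hypothetical, illustration only)."""
    h = np.array([hash((tuple(ctx), v)) % 97 for v in range(VOCAB)], float)
    e = np.exp(h / (97 * temperature))
    return e / e.sum()

draft_model = lambda ctx: toy_dist(ctx, 0.50)   # fast, approximate
target_model = lambda ctx: toy_dist(ctx, 0.49)  # slow, authoritative

def generate(prompt, n_tokens):
    ctx = list(prompt)
    while len(ctx) - len(prompt) < n_tokens:
        # 1. Draft: sample GAMMA candidate tokens autoregressively (cheap).
        drafts, qs = [], []
        for _ in range(GAMMA):
            q = draft_model(ctx + drafts)
            drafts.append(int(rng.choice(VOCAB, p=q)))
            qs.append(q)
        # 2. Verify: the target model scores all GAMMA positions; in a real
        #    system this is one parallel forward pass, the source of the speedup.
        ps = [target_model(ctx + drafts[:i]) for i in range(GAMMA)]
        # 3. Accept each draft token with prob min(1, p/q); at the first
        #    rejection, resample from the residual and discard the rest.
        for x, q, p in zip(drafts, qs, ps):
            if rng.random() < min(1.0, p[x] / q[x]):
                ctx.append(x)
            else:
                res = np.maximum(p - q, 0.0)
                ctx.append(int(rng.choice(VOCAB, p=res / res.sum())))
                break
        else:
            # All GAMMA accepted: take one bonus token from the target model.
            ctx.append(int(rng.choice(VOCAB, p=target_model(ctx))))
    return ctx[len(prompt):]
```

Because each iteration can accept up to GAMMA + 1 tokens per target-model call, the expected speedup grows with how closely the draft distribution tracks the target.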
Impact & Significance
Speculative decoding became a standard technique for LLM inference optimization and is used in production by major model providers to reduce latency and serving cost.