Efficiency · November 30, 2022 · Google Research
Fast Inference from Transformers via Speculative Decoding
Yaniv Leviathan, Matan Kalman, Yossi Matias
Abstract
We present speculative decoding, an algorithm that accelerates inference from large autoregressive models without any change to the model outputs. The key idea is to use a smaller, faster draft model to generate candidate tokens, which are then verified in parallel by the larger target model. This yields up to a 3x speedup while producing exactly the same output distribution as sampling from the target model alone.
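The verification step relies on a modified rejection-sampling rule: a draft token x is accepted with probability min(1, p(x)/q(x)), where q is the draft model's distribution and p is the target model's; on rejection, a replacement token is drawn from the normalized residual max(0, p - q). A minimal sketch of that per-token rule (toy NumPy distributions standing in for real models):

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(p, q, draft_token):
    """One verification step of speculative sampling.

    p: target model's distribution over the vocabulary at this position
    q: draft model's distribution at the same position
    draft_token: token index proposed by the draft model (sampled from q)

    Accepts the draft token with probability min(1, p[x]/q[x]); on
    rejection, resamples from the normalized residual max(0, p - q).
    Together these two cases make the output exactly distributed as p.
    """
    accept_prob = min(1.0, p[draft_token] / q[draft_token])
    if rng.random() < accept_prob:
        return draft_token, True
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p), p=residual)), False

# When draft and target agree exactly, the token is always accepted:
p = np.array([0.2, 0.5, 0.3])
print(speculative_step(p, p.copy(), 1))  # (1, True)
```

Note that rejection is only possible where p(x) < q(x), so the residual always has positive mass and the resampling step is well defined.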
Key Findings
- Achieved up to a 3x inference speedup without changing model outputs
- Used a smaller draft model to generate candidate tokens verified by the target model
- Maintained exactly the same output distribution as standard decoding
- Required no model retraining or architecture changes
- Demonstrated practical applicability across different model sizes
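The findings above fit together in one loop: the draft model speculates several tokens cheaply, a single batched target-model pass scores them all, and accepted tokens are kept up to the first rejection. A hedged end-to-end sketch (the toy `toy_dist` "models", vocabulary size, and draft length are illustrative assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
VOCAB = 8
GAMMA = 4  # draft tokens speculated per target-model call (assumed value)

def toy_dist(ctx, temperature):
    """Stand-in for a language model: a deterministic softmax over
    VOCAB tokens derived from the context (hypothetical, illustration only)."""
    h = np.array([hash((tuple(ctx), v)) % 97 for v in range(VOCAB)], float)
    e = np.exp(h / (97 * temperature))
    return e / e.sum()

draft_model = lambda ctx: toy_dist(ctx, 0.50)   # fast, approximate
target_model = lambda ctx: toy_dist(ctx, 0.49)  # slow, authoritative

def generate(prompt, n_tokens):
    ctx = list(prompt)
    while len(ctx) - len(prompt) < n_tokens:
        # 1. Draft: sample GAMMA candidate tokens autoregressively (cheap).
        drafts, qs = [], []
        for _ in range(GAMMA):
            q = draft_model(ctx + drafts)
            drafts.append(int(rng.choice(VOCAB, p=q)))
            qs.append(q)
        # 2. Verify: the target model scores all GAMMA positions; in a real
        #    system this is one parallel forward pass, the source of the speedup.
        ps = [target_model(ctx + drafts[:i]) for i in range(GAMMA)]
        # 3. Accept each draft token with prob min(1, p/q); at the first
        #    rejection, resample from the residual and discard the rest.
        for x, q, p in zip(drafts, qs, ps):
            if rng.random() < min(1.0, p[x] / q[x]):
                ctx.append(x)
            else:
                res = np.maximum(p - q, 0.0)
                ctx.append(int(rng.choice(VOCAB, p=res / res.sum())))
                break
        else:
            # All GAMMA accepted: take one bonus token from the target model.
            ctx.append(int(rng.choice(VOCAB, p=target_model(ctx))))
    return ctx[len(prompt):]
```

Because each iteration can accept up to GAMMA + 1 tokens per target-model call, the expected speedup grows with how closely the draft distribution tracks the target.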
Impact & Significance
Speculative decoding became a standard technique for LLM inference optimization and is used in production by major model providers to reduce latency and serving cost.