Efficiency · November 30, 2022 · Google Research

Fast Inference from Transformers via Speculative Decoding

Yaniv Leviathan, Matan Kalman, Yossi Matias

Abstract

We present speculative decoding, an algorithm to accelerate inference from large autoregressive models without any changes to the model outputs. The key idea is to use a smaller, faster draft model to generate candidate tokens that are then verified in parallel by the larger target model. This provides up to 3x speedup while producing the exact same output distribution.
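The correctness guarantee rests on a rejection-sampling-style rule: a token x drawn from the draft distribution q is accepted with probability min(1, p(x)/q(x)) under the target distribution p, and on rejection a replacement is drawn from the normalized residual max(0, p − q). A minimal sketch of that rule in Python, using toy distributions over a three-token vocabulary (this is an illustration, not the paper's implementation):

```python
import random

def speculative_step(p, q, draft_token, rnd):
    """One accept/reject step of speculative sampling.

    p: target model's distribution over the vocabulary (list of probs)
    q: draft model's distribution over the vocabulary
    draft_token: token index sampled from q
    Returns a token whose marginal distribution is exactly p.
    """
    # Accept the drafted token with probability min(1, p[x] / q[x]).
    if rnd.random() < min(1.0, p[draft_token] / q[draft_token]):
        return draft_token
    # On rejection, resample from the residual distribution
    # norm(max(0, p - q)), which corrects the bias exactly.
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    total = sum(residual)
    return rnd.choices(range(len(p)), weights=[r / total for r in residual])[0]
```

Sampling draft tokens from q and passing them through this step yields outputs distributed according to p, which is why the speedup comes with no change to the output distribution.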

Key Findings

  • Achieved up to 3x inference speedup without changing model outputs
  • Used a smaller draft model to generate candidate tokens verified by the target
  • Maintained the exact same output distribution as standard decoding
  • Required no model retraining or architecture changes
  • Demonstrated practical applicability across different model sizes
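In the full loop, the draft model proposes a block of several tokens and the target scores all of them in a single parallel forward pass; tokens are kept up to the first rejection, the rejected position is replaced by a corrected sample, and if every draft is accepted the target's extra logits yield one bonus token for free. A hedged sketch of the verification step, with per-position distributions passed in explicitly (a toy interface for illustration, not the paper's API):

```python
import random

def verify_drafts(p_dists, q_dists, drafts, rnd):
    """Accept/reject a block of drafted tokens against the target.

    p_dists: target distributions for positions 0..len(drafts)
             (one extra entry for the bonus token)
    q_dists: draft distributions for positions 0..len(drafts)-1
    drafts:  tokens the draft model sampled at each position
    Emits between 1 and len(drafts) + 1 tokens per call.
    """
    out = []
    for p, q, x in zip(p_dists, q_dists, drafts):
        if rnd.random() < min(1.0, p[x] / q[x]):
            out.append(x)  # drafted token kept
            continue
        # First rejection: resample from norm(max(0, p - q)) and stop.
        residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
        total = sum(residual)
        out.append(rnd.choices(range(len(p)),
                               weights=[r / total for r in residual])[0])
        return out
    # Every draft accepted: sample one free bonus token from the
    # target's distribution at the next position.
    out.append(rnd.choices(range(len(p_dists[-1])), weights=p_dists[-1])[0])
    return out
```

The more often the draft agrees with the target, the more tokens each target forward pass yields, which is where the speedup comes from; in the worst case the loop still emits one correctly distributed token per pass.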

Impact & Significance

Speculative decoding became a standard technique for LLM inference optimization, adopted by major providers to reduce latency. It is used in production by multiple AI companies to serve models at lower latency and cost.
