Efficiency · May 27, 2022 · Stanford University

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré

Abstract

Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high-bandwidth memory (HBM) and GPU on-chip SRAM. FlashAttention is 2-4x faster than standard attention and enables up to 16x longer context lengths.
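The core trick is to process keys and values in tiles while maintaining running softmax statistics, so the full N×N attention matrix never has to be written to slow memory. The following is a minimal NumPy sketch of that tiling/online-softmax idea (the function name, block size, and NumPy setting are illustrative; the actual implementation is a fused CUDA kernel that keeps each tile in on-chip SRAM):

```python
import numpy as np

def tiled_attention(Q, K, V, block=128):
    """Exact attention computed over key/value tiles.

    Keys and values are consumed one block at a time; the row-wise
    softmax max and normalizer are updated online, so memory use is
    linear in sequence length instead of quadratic.
    """
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)            # running (unnormalized) output
    m = np.full(N, -np.inf)           # running row-wise max of scores
    l = np.zeros(N)                   # running softmax normalizer

    for start in range(0, N, block):
        Kb = K[start:start + block]
        Vb = V[start:start + block]
        S = Q @ Kb.T * scale                   # scores for this tile only
        m_new = np.maximum(m, S.max(axis=1))
        p = np.exp(S - m_new[:, None])         # tile probabilities
        correction = np.exp(m - m_new)         # rescale earlier partial sums
        l = l * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ Vb
        m = m_new
    return out / l[:, None]
```

Because each tile's contribution is rescaled by the correction factor before the new tile is accumulated, the result matches standard softmax attention exactly; only the memory-access pattern changes.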

Key Findings

  • Achieved 2-4x speedup in attention computation without approximation
  • Reduced memory usage from quadratic to linear in sequence length
  • Enabled up to 16x longer context lengths on the same hardware
  • Introduced IO-awareness as a key principle for GPU algorithm design
  • Became the standard attention implementation in major ML frameworks

Impact & Significance

FlashAttention enabled the long-context revolution in LLMs, making 100K+ context windows practical. It now ships as a backend of PyTorch's scaled dot-product attention and is widely used in the training and inference of major LLMs.
