FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré
Abstract
Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high-bandwidth memory (HBM) and GPU on-chip SRAM. FlashAttention is 2-4x faster than standard attention and enables up to 16x longer context lengths.
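The core of the algorithm is tiling combined with an online softmax: attention is computed one key/value block at a time while running row-wise max and sum statistics are rescaled, so the full N x N score matrix is never materialized. The following is an illustrative NumPy sketch of that idea (not the paper's fused CUDA kernel; block size and shapes are chosen for clarity):

```python
import numpy as np

def standard_attention(Q, K, V):
    # Reference implementation: materializes the full N x N score matrix.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def tiled_attention(Q, K, V, block=4):
    # Tiling + online-softmax sketch: each K/V block updates a running
    # row-max m, a running softmax denominator l, and the output O,
    # so memory stays linear in sequence length.
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((N, d))
    m = np.full(N, -np.inf)   # running row-max of scores seen so far
    l = np.zeros(N)           # running softmax denominator
    for j in range(0, N, block):
        Kj, Vj = K[j:j+block], V[j:j+block]
        S = Q @ Kj.T * scale                  # scores for this tile only
        m_new = np.maximum(m, S.max(axis=-1))
        alpha = np.exp(m - m_new)             # rescale old accumulators
        P = np.exp(S - m_new[:, None])
        l = alpha * l + P.sum(axis=-1)
        O = alpha[:, None] * O + P @ Vj
        m = m_new
    return O / l[:, None]
```

Because the rescaling is exact, the tiled version matches the reference up to floating-point error; this is what makes FlashAttention an *exact* (not approximate) attention algorithm.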
Key Findings
- Achieved 2-4x speedup in attention computation without approximation
- Reduced memory usage from quadratic to linear in sequence length
- Enabled up to 16x longer context lengths on the same hardware
- Introduced IO-awareness as a key principle for GPU algorithm design
- Became the standard attention implementation in major ML frameworks
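The quadratic-to-linear memory reduction can be made concrete with back-of-the-envelope arithmetic. This sketch assumes illustrative sizes (fp16 values, a 128-column tile, head dimension 64); the exact on-chip working set depends on the kernel configuration:

```python
def attention_workspace_bytes(seq_len, head_dim=64, block=128, dtype_bytes=2):
    # Standard attention materializes the full seq_len x seq_len score matrix.
    standard = seq_len * seq_len * dtype_bytes
    # A tiled kernel keeps only one score tile, the output accumulator,
    # and O(seq_len) softmax statistics (running max and sum per row):
    # linear, not quadratic, in seq_len.
    tiled = (seq_len * block + seq_len * head_dim + 2 * seq_len) * dtype_bytes
    return standard, tiled

std, tiled = attention_workspace_bytes(seq_len=16384)
# At 16K tokens the full score matrix alone is 512 MiB per head,
# while the tiled working set is a few MiB.
```

Doubling the sequence length quadruples the standard workspace but only doubles the tiled one, which is why tiling is what makes very long contexts feasible.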
Impact & Significance
FlashAttention enabled the long-context revolution in LLMs, making 100K+ context windows practical. It is now the default attention backend in PyTorch's scaled dot-product attention, and variants of it are used in the training and inference stacks of most major LLMs.
Related Papers
The Llama 3 Herd of Models
Meta AI
Qwen2 Technical Report
Alibaba Cloud / Qwen Team
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek AI
The Claude 3 Model Family: Opus, Sonnet, and Haiku
Anthropic