FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré
Abstract
Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high-bandwidth memory (HBM) and GPU on-chip SRAM. FlashAttention is 2-4x faster than standard attention and enables up to 16x longer context lengths.
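The core of the algorithm is tiling combined with an online softmax: attention is computed one key/value block at a time while running row-wise max and sum statistics are rescaled, so the full N x N score matrix is never materialized. The following is an illustrative NumPy sketch of that idea (not the paper's fused CUDA kernel; block size and shapes are chosen for clarity):

```python
import numpy as np

def standard_attention(Q, K, V):
    # Reference implementation: materializes the full N x N score matrix.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def tiled_attention(Q, K, V, block=4):
    # Tiling + online-softmax sketch: each K/V block updates a running
    # row-max m, a running softmax denominator l, and the output O,
    # so memory stays linear in sequence length.
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((N, d))
    m = np.full(N, -np.inf)   # running row-max of scores seen so far
    l = np.zeros(N)           # running softmax denominator
    for j in range(0, N, block):
        Kj, Vj = K[j:j+block], V[j:j+block]
        S = Q @ Kj.T * scale                  # scores for this tile only
        m_new = np.maximum(m, S.max(axis=-1))
        alpha = np.exp(m - m_new)             # rescale old accumulators
        P = np.exp(S - m_new[:, None])
        l = alpha * l + P.sum(axis=-1)
        O = alpha[:, None] * O + P @ Vj
        m = m_new
    return O / l[:, None]
```

Because the rescaling is exact, the tiled version matches the reference up to floating-point error; this is what makes FlashAttention an *exact* (not approximate) attention algorithm.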
Key Findings
- Achieved 2-4x speedup in attention computation without approximation
- Reduced memory usage from quadratic to linear in sequence length
- Enabled up to 16x longer context lengths on the same hardware
- Introduced IO-awareness as a key principle for GPU algorithm design
- Became the standard attention implementation in major ML frameworks
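The quadratic-to-linear memory reduction can be made concrete with back-of-the-envelope arithmetic. This sketch assumes illustrative sizes (fp16 values, a 128-column tile, head dimension 64); the exact on-chip working set depends on the kernel configuration:

```python
def attention_workspace_bytes(seq_len, head_dim=64, block=128, dtype_bytes=2):
    # Standard attention materializes the full seq_len x seq_len score matrix.
    standard = seq_len * seq_len * dtype_bytes
    # A tiled kernel keeps only one score tile, the output accumulator,
    # and O(seq_len) softmax statistics (running max and sum per row):
    # linear, not quadratic, in seq_len.
    tiled = (seq_len * block + seq_len * head_dim + 2 * seq_len) * dtype_bytes
    return standard, tiled

std, tiled = attention_workspace_bytes(seq_len=16384)
# At 16K tokens the full score matrix alone is 512 MiB per head,
# while the tiled working set is a few MiB.
```

Doubling the sequence length quadruples the standard workspace but only doubles the tiled one, which is why tiling is what makes very long contexts feasible.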
Impact & Significance
FlashAttention enabled the long-context revolution in LLMs, making 100K+ context windows practical. It is now the default attention backend in PyTorch's scaled dot-product attention, and variants of it are used in the training and inference stacks of most major LLMs.
Related Papers
The Llama 3 Herd of Models
Meta AI
Qwen2 Technical Report
Alibaba Cloud / Qwen Team
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek AI
The Claude 3 Model Family: Opus, Sonnet, and Haiku
Anthropic