Attention Is All You Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
Abstract
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.
Key Findings
- Introduced the Transformer architecture, based entirely on self-attention mechanisms
- Eliminated the need for recurrence and convolutions in sequence models
- Achieved state-of-the-art results on English-to-German and English-to-French translation
- Demonstrated significantly faster training compared to recurrent architectures
- Introduced the multi-head attention and positional encoding concepts (see the sketches after this list)
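To make the attention mechanism concrete, here is a minimal NumPy sketch of scaled dot-product attention and multi-head attention as described in the paper. Shapes, function names, and the weight-matrix arguments are illustrative assumptions, not a reference implementation.

```python
# Sketch of scaled dot-product attention and multi-head attention (NumPy).
# Variable names and shapes are illustrative assumptions.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (batch, seq, seq)
    return softmax(scores) @ V                         # (batch, seq, d_v)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    # Project the input, split into heads, attend per head, then
    # concatenate the heads and apply the output projection.
    batch, seq, d_model = X.shape
    d_head = d_model // num_heads

    def split(M):  # (batch, seq, d_model) -> (batch*heads, seq, d_head)
        return (M.reshape(batch, seq, num_heads, d_head)
                 .transpose(0, 2, 1, 3)
                 .reshape(batch * num_heads, seq, d_head))

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    heads = scaled_dot_product_attention(Q, K, V)
    heads = (heads.reshape(batch, num_heads, seq, d_head)
                  .transpose(0, 2, 1, 3)
                  .reshape(batch, seq, d_model))
    return heads @ W_o

# Example usage with random weights (hypothetical sizes):
# X = np.random.randn(2, 5, 64)
# W = [np.random.randn(64, 64) for _ in range(4)]
# out = multi_head_attention(X, *W, num_heads=8)   # -> (2, 5, 64)
```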
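The paper injects order information with sinusoidal positional encodings, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), which are added to the token embeddings. A minimal sketch (assuming an even d_model):

```python
# Sketch of the sinusoidal positional encoding from the paper.
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                     # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                  # (1, d_model/2)
    angles = pos / np.power(10000.0, (2 * i) / d_model)   # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

# The encoding is added to the embeddings, e.g. embeddings + positional_encoding(seq_len, d_model),
# so the model can use token order without recurrence or convolution.
```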
Impact & Significance
Arguably the most influential AI paper of the decade. The Transformer architecture became the foundation for GPT, BERT, T5, and virtually all modern large language models. It revolutionized NLP, computer vision, and audio processing.
Related Papers
- The Llama 3 Herd of Models (Meta AI)
- Qwen2 Technical Report (Alibaba Cloud / Qwen Team)
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (DeepSeek AI)
- The Claude 3 Model Family: Opus, Sonnet, and Haiku (Anthropic)