Attention Is All You Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
Abstract
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.
Key Findings
- Introduced the Transformer architecture, based entirely on self-attention mechanisms
- Eliminated the need for recurrence and convolutions in sequence models
- Achieved state-of-the-art results on English-to-German and English-to-French translation
- Demonstrated significantly faster training compared to recurrent architectures
- Introduced the multi-head attention and positional encoding concepts (see the sketches after this list)
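To make the attention mechanism concrete, here is a minimal NumPy sketch of scaled dot-product attention and multi-head attention as described in the paper. Shapes, function names, and the weight-matrix arguments are illustrative assumptions, not a reference implementation.

```python
# Sketch of scaled dot-product attention and multi-head attention (NumPy).
# Variable names and shapes are illustrative assumptions.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (batch, seq, seq)
    return softmax(scores) @ V                         # (batch, seq, d_v)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    # Project the input, split into heads, attend per head, then
    # concatenate the heads and apply the output projection.
    batch, seq, d_model = X.shape
    d_head = d_model // num_heads

    def split(M):  # (batch, seq, d_model) -> (batch*heads, seq, d_head)
        return (M.reshape(batch, seq, num_heads, d_head)
                 .transpose(0, 2, 1, 3)
                 .reshape(batch * num_heads, seq, d_head))

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    heads = scaled_dot_product_attention(Q, K, V)
    heads = (heads.reshape(batch, num_heads, seq, d_head)
                  .transpose(0, 2, 1, 3)
                  .reshape(batch, seq, d_model))
    return heads @ W_o

# Example usage with random weights (hypothetical sizes):
# X = np.random.randn(2, 5, 64)
# W = [np.random.randn(64, 64) for _ in range(4)]
# out = multi_head_attention(X, *W, num_heads=8)   # -> (2, 5, 64)
```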
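The paper injects order information with sinusoidal positional encodings, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), which are added to the token embeddings. A minimal sketch (assuming an even d_model):

```python
# Sketch of the sinusoidal positional encoding from the paper.
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                     # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                  # (1, d_model/2)
    angles = pos / np.power(10000.0, (2 * i) / d_model)   # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

# The encoding is added to the embeddings, e.g. embeddings + positional_encoding(seq_len, d_model),
# so the model can use token order without recurrence or convolution.
```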
Impact & Significance
Arguably the most influential AI paper of the decade. The Transformer architecture became the foundation for GPT, BERT, T5, and virtually all modern large language models. It revolutionized NLP, computer vision, and audio processing.
Related Papers
- The Llama 3 Herd of Models (Meta AI)
- Qwen2 Technical Report (Alibaba Cloud / Qwen Team)
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (DeepSeek AI)
- The Claude 3 Model Family: Opus, Sonnet, and Haiku (Anthropic)