Efficiency · January 11, 2021 · Google Brain
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
William Fedus, Barret Zoph, Noam Shazeer
Abstract
We introduce Switch Transformers, which simplify the Mixture of Experts (MoE) routing algorithm to route to a single expert, reducing computation and communication costs. Switch Transformers scale to trillion parameter models with the same computational cost as much smaller dense models, achieving up to 7x speedups in pre-training.
Key Findings
1. Simplified MoE routing to single-expert selection for stability and efficiency
2. Achieved up to 7x pre-training speedups over dense models at equivalent compute
3. Scaled to over 1 trillion parameters while remaining practical to train
4. Demonstrated that sparsely activated models can match dense model quality
5. Showed MoE as a practical path to scaling beyond dense model limits
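The core idea behind the findings above is the "switch" routing rule: instead of sending each token to the top-k experts as in classic MoE, the router picks the single highest-scoring expert, so only one expert FFN runs per token. A minimal sketch, using numpy with toy random weights (all sizes and weight values here are illustrative, not from the paper):

```python
# Switch (top-1) routing sketch: each token is routed to exactly one
# expert, chosen by a softmax router, and the expert output is scaled
# by the router's gate probability.
import numpy as np

rng = np.random.default_rng(0)
num_tokens, d_model, num_experts, d_ff = 8, 16, 4, 32

tokens = rng.standard_normal((num_tokens, d_model))
router_w = rng.standard_normal((d_model, num_experts))
# One small ReLU FFN per expert (untrained, illustrative weights).
experts = [
    (rng.standard_normal((d_model, d_ff)), rng.standard_normal((d_ff, d_model)))
    for _ in range(num_experts)
]

logits = tokens @ router_w                       # (tokens, experts)
probs = np.exp(logits - logits.max(-1, keepdims=True))
probs /= probs.sum(-1, keepdims=True)            # softmax over experts
expert_idx = probs.argmax(-1)                    # top-1: one expert per token
gate = probs[np.arange(num_tokens), expert_idx]  # gate value scales the output

out = np.zeros_like(tokens)
for e, (w_in, w_out) in enumerate(experts):
    mask = expert_idx == e                       # tokens routed to expert e
    if mask.any():
        h = np.maximum(tokens[mask] @ w_in, 0)   # expert FFN (ReLU)
        out[mask] = gate[mask, None] * (h @ w_out)

print(out.shape)  # (8, 16): same shape as the input, but each token
                  # activated only 1 of the 4 expert FFNs
```

Because each token touches exactly one expert, the FLOPs per token stay roughly constant as experts (and thus parameters) are added, which is what lets the architecture scale to a trillion parameters at a fixed per-token compute budget.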
Impact & Significance
Switch Transformers popularized MoE architectures in LLMs. This approach was adopted by Mistral (Mixtral), reportedly used in GPT-4, and became the standard architecture for models that need high capability with efficient inference.
Related Papers
- The Llama 3 Herd of Models (Meta AI; LLM, July 23, 2024)
- Qwen2 Technical Report (Alibaba Cloud / Qwen Team; LLM, July 15, 2024)
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (DeepSeek AI; Efficiency, May 7, 2024)
- The Claude 3 Model Family: Opus, Sonnet, and Haiku (Anthropic; LLM, March 4, 2024)