Efficiency · January 11, 2021 · Google Brain

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

William Fedus, Barret Zoph, Noam Shazeer

Abstract

We introduce Switch Transformers, which simplify the Mixture of Experts (MoE) routing algorithm by routing each token to a single expert, reducing computation and communication costs. Switch Transformers scale to trillion-parameter models at the same computational cost per token as much smaller dense models, achieving up to 7x speedups in pre-training.
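The core idea, routing each token to only its top-1 expert rather than a weighted mix of several, can be sketched as follows. This is a minimal NumPy illustration of the technique, not the paper's implementation; the function and variable names (`switch_route`, `router_w`) are hypothetical, and capacity limits and the auxiliary load-balancing loss are omitted for brevity.

```python
import numpy as np

def switch_route(x, router_w, experts):
    """Top-1 (Switch) routing sketch: each token is processed by one expert.

    x:        (tokens, d_model) token representations
    router_w: (d_model, n_experts) router weights (hypothetical name)
    experts:  list of callables, one per expert, each mapping
              (n, d_model) -> (n, d_model)
    """
    logits = x @ router_w                          # (tokens, n_experts)
    # Softmax over experts gives the gate probabilities.
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    choice = probs.argmax(-1)                      # one expert per token
    gate = probs[np.arange(len(x)), choice]        # gate value scales output
    out = np.empty_like(x)
    for e, expert in enumerate(experts):
        mask = choice == e
        if mask.any():
            # Only the selected expert runs on these tokens, so compute
            # per token stays constant as more experts are added.
            out[mask] = gate[mask, None] * expert(x[mask])
    return out
```

Because only one expert's parameters are touched per token, total parameter count can grow with the number of experts while FLOPs per token stay roughly flat, which is the source of the efficiency claim above.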

Key Findings

  • Simplified MoE routing to single-expert selection for stability and efficiency
  • Achieved up to 7x pre-training speedups over dense models at equivalent compute
  • Scaled to over 1 trillion parameters while remaining practical to train
  • Demonstrated that sparsely activated models can match dense model quality
  • Showed MoE as a practical path to scaling beyond dense model limits

Impact & Significance

Switch Transformers popularized MoE architectures in LLMs. The approach was adopted by Mistral (Mixtral), reportedly used in GPT-4, and became a standard architecture for models that need high capability with efficient inference.
