Efficiency · January 11, 2021 · Google Brain

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

William Fedus, Barret Zoph, Noam Shazeer

Abstract

We introduce Switch Transformers, which simplify the Mixture of Experts (MoE) routing algorithm by routing each token to a single expert, reducing computation and communication costs. Switch Transformers scale to trillion-parameter models at the same computational cost per token as much smaller dense models, achieving up to 7x speedups in pre-training.
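The core idea, routing each token to only its top-1 expert rather than a weighted mix of several, can be sketched as follows. This is a minimal NumPy illustration of the technique, not the paper's implementation; the function and variable names (`switch_route`, `router_w`) are hypothetical, and capacity limits and the auxiliary load-balancing loss are omitted for brevity.

```python
import numpy as np

def switch_route(x, router_w, experts):
    """Top-1 (Switch) routing sketch: each token is processed by one expert.

    x:        (tokens, d_model) token representations
    router_w: (d_model, n_experts) router weights (hypothetical name)
    experts:  list of callables, one per expert, each mapping
              (n, d_model) -> (n, d_model)
    """
    logits = x @ router_w                          # (tokens, n_experts)
    # Softmax over experts gives the gate probabilities.
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    choice = probs.argmax(-1)                      # one expert per token
    gate = probs[np.arange(len(x)), choice]        # gate value scales output
    out = np.empty_like(x)
    for e, expert in enumerate(experts):
        mask = choice == e
        if mask.any():
            # Only the selected expert runs on these tokens, so compute
            # per token stays constant as more experts are added.
            out[mask] = gate[mask, None] * expert(x[mask])
    return out
```

Because only one expert's parameters are touched per token, total parameter count can grow with the number of experts while FLOPs per token stay roughly flat, which is the source of the efficiency claim above.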

Key Findings

  • Simplified MoE routing to single-expert selection for stability and efficiency
  • Achieved up to 7x pre-training speedups over dense models at equivalent compute
  • Scaled to over 1 trillion parameters while remaining practical to train
  • Demonstrated that sparsely activated models can match dense model quality
  • Showed MoE as a practical path to scaling beyond dense model limits

Impact & Significance

Switch Transformers popularized MoE architectures in LLMs. The approach was adopted by Mistral (Mixtral), reportedly used in GPT-4, and became a standard architecture for models that need high capability with efficient inference.
