Efficiency · December 5, 2022 · Google Research

Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints

Aran Komatsuzaki, Joan Puigcerver, James Lee-Thorp, Carlos Riquelme Ruiz, Basil Mustafa, Joshua Ainslie, Yi Tay, Mostafa Dehghani, Neil Houlsby

Abstract

We propose sparse upcycling, a simple approach for converting pre-trained dense models into sparsely activated Mixture-of-Experts (MoE) models. Starting from a dense checkpoint, we initialize the experts of each MoE layer as copies of the original dense MLP and continue training with MoE routing. This approach outperforms both continued dense training and training an MoE model from scratch, while leveraging the investment already made in dense pre-training.
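The conversion itself is simple: keep all non-MLP weights from the dense checkpoint, turn each converted MLP into a set of identical experts, and add a freshly initialized router. Below is a minimal PyTorch-style sketch of that initialization, offered as an illustration rather than the paper's actual implementation (which targets T5 and Vision Transformer codebases); the class and parameter names (`DenseMLP`, `UpcycledMoELayer`, `num_experts`, `top_k`) are hypothetical.

```python
# Sketch of sparse-upcycling initialization: experts start as exact copies of
# the pre-trained dense MLP; only the router is newly initialized.
import copy
import torch
import torch.nn as nn

class DenseMLP(nn.Module):
    """Standard Transformer feed-forward block (the layer being upcycled)."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

class UpcycledMoELayer(nn.Module):
    """MoE layer whose experts are all initialized from one dense checkpoint."""
    def __init__(self, dense_mlp: DenseMLP, num_experts: int, d_model: int, top_k: int = 1):
        super().__init__()
        # Each expert begins as a deep copy of the dense MLP's weights.
        self.experts = nn.ModuleList(copy.deepcopy(dense_mlp) for _ in range(num_experts))
        # The router is the only component without pre-trained weights.
        self.router = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, x):                       # x: [tokens, d_model]
        probs = torch.softmax(self.router(x), dim=-1)
        weights, idx = torch.topk(probs, self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out
```

Training then continues from this combined checkpoint with MoE routing, which is where the reported gains over both continued dense training and from-scratch MoE training come from.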

Key Findings

  • Efficiently converted dense models to MoE models without training from scratch
  • Outperformed both continued dense training and MoE training from scratch
  • Leveraged existing pre-training investments for improved efficiency
  • Demonstrated scalable conversion across model sizes
  • Provided a practical path from dense to sparse architectures

Impact & Significance

Sparse upcycling provided a practical method for organizations to upgrade existing models to more efficient MoE architectures, reducing the cost of adopting sparse model designs.
