Efficiency · December 5, 2022 · Google Research

Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints

Aran Komatsuzaki, Joan Puigcerver, James Lee-Thorp, Carlos Riquelme Ruiz, Basil Mustafa, Joshua Ainslie, Yi Tay, Mostafa Dehghani, Neil Houlsby

Abstract

We propose sparse upcycling, a simple approach for converting pre-trained dense models into sparsely activated Mixture-of-Experts (MoE) models. Starting from a dense checkpoint, we initialize the experts of each MoE layer as copies of the original dense MLP and continue training with MoE routing. This approach outperforms both continued dense training and training an MoE model from scratch, while leveraging the investment already made in dense pre-training.
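The conversion itself is simple: keep all non-MLP weights from the dense checkpoint, turn each converted MLP into a set of identical experts, and add a freshly initialized router. Below is a minimal PyTorch-style sketch of that initialization, offered as an illustration rather than the paper's actual implementation (which targets T5 and Vision Transformer codebases); the class and parameter names (`DenseMLP`, `UpcycledMoELayer`, `num_experts`, `top_k`) are hypothetical.

```python
# Sketch of sparse-upcycling initialization: experts start as exact copies of
# the pre-trained dense MLP; only the router is newly initialized.
import copy
import torch
import torch.nn as nn

class DenseMLP(nn.Module):
    """Standard Transformer feed-forward block (the layer being upcycled)."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

class UpcycledMoELayer(nn.Module):
    """MoE layer whose experts are all initialized from one dense checkpoint."""
    def __init__(self, dense_mlp: DenseMLP, num_experts: int, d_model: int, top_k: int = 1):
        super().__init__()
        # Each expert begins as a deep copy of the dense MLP's weights.
        self.experts = nn.ModuleList(copy.deepcopy(dense_mlp) for _ in range(num_experts))
        # The router is the only component without pre-trained weights.
        self.router = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, x):                       # x: [tokens, d_model]
        probs = torch.softmax(self.router(x), dim=-1)
        weights, idx = torch.topk(probs, self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out
```

Training then continues from this combined checkpoint with MoE routing, which is where the reported gains over both continued dense training and from-scratch MoE training come from.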

Key Findings

  • Efficiently converted dense models to MoE models without training from scratch
  • Outperformed both continued dense training and MoE training from scratch
  • Leveraged existing pre-training investments for improved efficiency
  • Demonstrated scalable conversion across model sizes
  • Provided a practical path from dense to sparse architectures

Impact & Significance

Sparse upcycling provided a practical method for organizations to upgrade existing models to more efficient MoE architectures, reducing the cost of adopting sparse model designs.
