Efficiency · December 5, 2022 · Google Research
Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints
Aran Komatsuzaki, Joan Puigcerver, James Lee-Thorp, Carlos Riquelme Ruiz, Basil Mustafa, Joshua Ainslie, Yi Tay, Mostafa Dehghani, Neil Houlsby
Abstract
We propose sparse upcycling, a simple approach to convert pre-trained dense models into Mixture-of-Experts (MoE) models. Starting from a dense checkpoint, we initialize each expert as a copy of the original dense MLP weights, add a newly initialized router, and continue training with sparse MoE routing. This approach outperforms both continued dense training and training an MoE model from scratch, while reusing the investment already made in dense pre-training.
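The mechanics are simple enough to sketch. The snippet below is a minimal, illustrative NumPy example (not the authors' T5X/JAX implementation): each expert starts as an exact copy of the dense MLP weights, the router is initialized from scratch, and each token is dispatched to its top-k experts. The function names, the two-layer ReLU MLP, and the softmax top-k routing are assumptions made for illustration.

```python
import numpy as np

def upcycle_dense_mlp(dense_w_in, dense_w_out, num_experts, d_model, rng):
    """Sketch of sparse upcycling for a single transformer MLP block.

    Every expert begins as a copy of the dense MLP weights; only the
    router is newly initialized. All other checkpoint weights (attention,
    embeddings, norms) would be carried over unchanged.
    """
    experts = [
        {"w_in": dense_w_in.copy(), "w_out": dense_w_out.copy()}
        for _ in range(num_experts)
    ]
    # Router maps each token to per-expert logits; trained from scratch.
    router_w = rng.normal(scale=0.02, size=(d_model, num_experts))
    return experts, router_w

def moe_forward(x, experts, router_w, top_k=1):
    """Top-k token routing: each token goes to its k highest-scoring experts."""
    logits = x @ router_w                             # [tokens, num_experts]
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    chosen = np.argsort(-probs, axis=-1)[:, :top_k]   # expert indices per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for e in chosen[t]:
            h = np.maximum(x[t] @ experts[e]["w_in"], 0.0)   # ReLU MLP
            out[t] += probs[t, e] * (h @ experts[e]["w_out"])
    return out

# Example usage (hypothetical sizes):
rng = np.random.default_rng(0)
d_model, d_ff, num_experts = 16, 64, 4
w_in = rng.normal(size=(d_model, d_ff))
w_out = rng.normal(size=(d_ff, d_model))
experts, router_w = upcycle_dense_mlp(w_in, w_out, num_experts, d_model, rng)
tokens = rng.normal(size=(8, d_model))
y = moe_forward(tokens, experts, router_w, top_k=2)
```

Because all experts start identical, the upcycled model initially behaves like the dense model; the router and continued training then let the experts specialize.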
Key Findings
1. Efficiently converted dense models to MoE without training from scratch
2. Outperformed both continued dense training and MoE training from scratch
3. Leveraged existing pre-training investments for improved efficiency
4. Demonstrated scalable conversion across model sizes
5. Provided a practical path from dense to sparse architectures
Impact & Significance
Sparse upcycling provided a practical method for organizations to upgrade existing models to more efficient MoE architectures, reducing the cost of adopting sparse model designs.