Efficiency · May 7, 2024 · DeepSeek AI
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek AI
Abstract
We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by innovative architectures, including Multi-head Latent Attention (MLA) and DeepSeekMoE. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance while saving 42.5% of training costs, reducing the KV cache by 93.3%, and boosting the maximum generation throughput to 5.76 times.
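The cache saving in MLA comes from a low-rank compression: keys and values are projected down to a shared latent vector, only that latent is cached, and per-head keys and values are re-expanded from it at attention time. The PyTorch sketch below illustrates the idea; the dimensions and layer names are illustrative assumptions, not DeepSeek-V2's actual configuration, and the paper's decoupled rotary-position branch is omitted.

```python
import torch
import torch.nn as nn

# Minimal sketch of the low-rank KV compression idea behind Multi-head Latent Attention.
# All sizes are illustrative assumptions, not DeepSeek-V2's real configuration.
d_model, d_latent, n_heads, d_head = 4096, 512, 32, 128

class LatentKV(nn.Module):
    def __init__(self):
        super().__init__()
        self.down_kv = nn.Linear(d_model, d_latent, bias=False)        # compress hidden state
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # re-expand keys
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # re-expand values

    def forward(self, h):
        # h: (batch, seq, d_model). Only c_kv needs to live in the KV cache;
        # per-head keys and values are reconstructed from it at attention time.
        b, s, _ = h.shape
        c_kv = self.down_kv(h)                          # (batch, seq, d_latent)
        k = self.up_k(c_kv).view(b, s, n_heads, d_head)
        v = self.up_v(c_kv).view(b, s, n_heads, d_head)
        return c_kv, k, v

# Per-token cache size: standard multi-head attention stores full keys and values,
# while the latent scheme stores only c_kv.
standard_cache = 2 * n_heads * d_head   # 8192 values per token
latent_cache = d_latent                 #  512 values per token
print(f"cache reduction: {1 - latent_cache / standard_cache:.1%}")  # ~93.8% with these toy sizes
```

With these made-up sizes the per-token cache shrinks by roughly the same order as the 93.3% reduction the abstract reports, which is the point of the design: the cache cost scales with the latent width rather than with the number of heads.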
Key Findings
1. Introduced Multi-head Latent Attention, reducing the KV cache by 93.3%
2. Achieved 5.76x higher generation throughput with the DeepSeekMoE architecture
3. Saved 42.5% of training costs compared to the previous generation
4. 236B total parameters with only 21B activated per token (see the routing sketch after this list)
5. Demonstrated that MoE can be both efficient and high-quality
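To make the "only 21B of 236B parameters activated per token" point concrete, the sketch below shows generic top-k expert routing, the mechanism by which a sparse MoE layer touches only a few expert FFNs per token. The expert count, top-k, and hidden sizes are made-up illustrative values; this is not DeepSeek-V2's DeepSeekMoE configuration, which additionally uses fine-grained expert segmentation and shared experts.

```python
import torch
import torch.nn as nn

# Generic top-k expert routing: each token activates only top_k of n_experts FFNs,
# so most parameters stay idle for any given token. Sizes are illustrative only.
class TopKMoE(nn.Module):
    def __init__(self, d_model=1024, d_ff=2048, n_experts=64, top_k=6):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
             for _ in range(n_experts)]
        )

    def forward(self, x):
        # x: (n_tokens, d_model)
        gate = torch.softmax(self.router(x), dim=-1)
        weights, chosen = gate.topk(self.top_k, dim=-1)   # route each token to its top_k experts
        out = torch.zeros_like(x)
        for t in range(x.size(0)):
            for w, e in zip(weights[t], chosen[t]):
                out[t] += w * self.experts[int(e)](x[t])  # only top_k experts run for this token
        return out

layer = TopKMoE()
tokens = torch.randn(4, 1024)
print(layer(tokens).shape)  # torch.Size([4, 1024])
```

The design trade-off this illustrates is why activated and total parameters diverge: total capacity grows with the number of experts, but per-token compute and memory traffic scale only with the few experts the router selects.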
Impact & Significance
DeepSeek-V2 demonstrated that Chinese AI labs could produce highly competitive and innovative model architectures. Its efficiency innovations influenced subsequent model designs and made frontier AI more economically accessible.