Efficiency · May 7, 2024 · DeepSeek AI

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

DeepSeek AI

Abstract

We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by innovative architectures, namely Multi-head Latent Attention (MLA) and DeepSeekMoE. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76x.
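The KV-cache saving comes from MLA compressing keys and values into a small shared latent vector, so only that latent needs to be cached per token instead of full per-head keys and values. Below is a minimal sketch of that idea; all dimensions are hypothetical, and the decoupled RoPE key path described in the paper is omitted.

```python
# Minimal sketch of the low-rank KV-compression idea behind Multi-head
# Latent Attention (MLA). Dimensions and weights are illustrative, not the
# paper's; the decoupled RoPE key path from DeepSeek-V2 is omitted.
import numpy as np

d_model, n_heads, d_head = 1024, 8, 128   # hypothetical model sizes
d_latent = 64                             # compressed KV dim (<< n_heads * d_head)

rng = np.random.default_rng(0)
W_dkv = rng.standard_normal((d_model, d_latent)) * 0.02           # down-projection
W_uk  = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02  # up-project to keys
W_uv  = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02  # up-project to values
W_q   = rng.standard_normal((d_model, n_heads * d_head)) * 0.02

def attend(h_seq):
    """h_seq: (seq_len, d_model). Only the latent c_kv is cached per token."""
    c_kv = h_seq @ W_dkv                                # (seq_len, d_latent) <- KV cache
    k = (c_kv @ W_uk).reshape(-1, n_heads, d_head)      # keys reconstructed from latent
    v = (c_kv @ W_uv).reshape(-1, n_heads, d_head)      # values reconstructed from latent
    q = (h_seq @ W_q).reshape(-1, n_heads, d_head)
    scores = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(d_head)
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    return np.einsum('hqk,khd->qhd', probs, v)

h = rng.standard_normal((16, d_model))
out = attend(h)
# Standard multi-head attention would cache 2 * n_heads * d_head = 2048 floats
# per token; here only d_latent = 64 floats per token are cached.
print(out.shape, f"cache per token: {d_latent} vs {2 * n_heads * d_head}")
```

The cache-size ratio printed at the end is only meant to illustrate why compressing KV into a latent shrinks memory; the actual 93.3% figure depends on DeepSeek-V2's real dimensions.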

Key Findings

  • Introduced Multi-head Latent Attention (MLA), reducing the KV cache by 93.3%
  • Achieved 5.76x higher generation throughput with the DeepSeekMoE architecture
  • Saved 42.5% of training costs compared to the previous generation
  • 236B total parameters with only 21B activated per token (see the routing sketch after this list)
  • Demonstrated that MoE models can be both efficient and high-quality
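The 21B-of-236B activation pattern comes from sparse expert routing: each token is dispatched to only a few experts, so most parameters stay idle for any given token. The sketch below shows generic top-k routing under assumed expert counts and sizes; DeepSeekMoE's fine-grained and shared-expert segmentation is not modeled here.

```python
# Minimal sketch of top-k expert routing: many total expert parameters,
# but only a small subset is activated per token. Expert counts and sizes
# are illustrative, not DeepSeek-V2's actual configuration.
import numpy as np

d_model, d_ff = 512, 1024
n_experts, top_k = 16, 2

rng = np.random.default_rng(0)
router = rng.standard_normal((d_model, n_experts)) * 0.02
experts = [
    (rng.standard_normal((d_model, d_ff)) * 0.02,
     rng.standard_normal((d_ff, d_model)) * 0.02)
    for _ in range(n_experts)
]

def moe_layer(x):
    """x: (d_model,) hidden state of one token. Only top_k experts run."""
    logits = x @ router
    chosen = np.argsort(logits)[-top_k:]                # indices of the top-k experts
    gates = np.exp(logits[chosen] - logits[chosen].max())
    gates /= gates.sum()                                # normalized gating weights
    out = np.zeros_like(x)
    for g, idx in zip(gates, chosen):
        w_in, w_out = experts[idx]
        out += g * (np.maximum(x @ w_in, 0.0) @ w_out)  # simple ReLU FFN as the expert
    return out

token = rng.standard_normal(d_model)
print(moe_layer(token).shape)  # (512,); only 2 of 16 experts touched for this token
```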

Impact & Significance

DeepSeek-V2 demonstrated that Chinese AI labs could produce highly competitive and innovative model architectures. Its efficiency innovations influenced subsequent model designs and made frontier AI more economically accessible.
