Efficiency · May 7, 2024 · DeepSeek AI

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

DeepSeek AI

Abstract

We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by innovative architectures, namely Multi-head Latent Attention (MLA) and DeepSeekMoE. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76x.
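The KV-cache saving comes from MLA compressing keys and values into a small shared latent vector, so only that latent needs to be cached per token instead of full per-head keys and values. Below is a minimal sketch of that idea; all dimensions are hypothetical, and the decoupled RoPE key path described in the paper is omitted.

```python
# Minimal sketch of the low-rank KV-compression idea behind Multi-head
# Latent Attention (MLA). Dimensions and weights are illustrative, not the
# paper's; the decoupled RoPE key path from DeepSeek-V2 is omitted.
import numpy as np

d_model, n_heads, d_head = 1024, 8, 128   # hypothetical model sizes
d_latent = 64                             # compressed KV dim (<< n_heads * d_head)

rng = np.random.default_rng(0)
W_dkv = rng.standard_normal((d_model, d_latent)) * 0.02           # down-projection
W_uk  = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02  # up-project to keys
W_uv  = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02  # up-project to values
W_q   = rng.standard_normal((d_model, n_heads * d_head)) * 0.02

def attend(h_seq):
    """h_seq: (seq_len, d_model). Only the latent c_kv is cached per token."""
    c_kv = h_seq @ W_dkv                                # (seq_len, d_latent) <- KV cache
    k = (c_kv @ W_uk).reshape(-1, n_heads, d_head)      # keys reconstructed from latent
    v = (c_kv @ W_uv).reshape(-1, n_heads, d_head)      # values reconstructed from latent
    q = (h_seq @ W_q).reshape(-1, n_heads, d_head)
    scores = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(d_head)
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    return np.einsum('hqk,khd->qhd', probs, v)

h = rng.standard_normal((16, d_model))
out = attend(h)
# Standard multi-head attention would cache 2 * n_heads * d_head = 2048 floats
# per token; here only d_latent = 64 floats per token are cached.
print(out.shape, f"cache per token: {d_latent} vs {2 * n_heads * d_head}")
```

The cache-size ratio printed at the end is only meant to illustrate why compressing KV into a latent shrinks memory; the actual 93.3% figure depends on DeepSeek-V2's real dimensions.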

Key Findings

  • Introduced Multi-head Latent Attention (MLA), reducing the KV cache by 93.3%
  • Achieved 5.76x higher generation throughput with the DeepSeekMoE architecture
  • Saved 42.5% of training costs compared to the previous generation
  • 236B total parameters with only 21B activated per token (see the routing sketch after this list)
  • Demonstrated that MoE models can be both efficient and high-quality
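The 21B-of-236B activation pattern comes from sparse expert routing: each token is dispatched to only a few experts, so most parameters stay idle for any given token. The sketch below shows generic top-k routing under assumed expert counts and sizes; DeepSeekMoE's fine-grained and shared-expert segmentation is not modeled here.

```python
# Minimal sketch of top-k expert routing: many total expert parameters,
# but only a small subset is activated per token. Expert counts and sizes
# are illustrative, not DeepSeek-V2's actual configuration.
import numpy as np

d_model, d_ff = 512, 1024
n_experts, top_k = 16, 2

rng = np.random.default_rng(0)
router = rng.standard_normal((d_model, n_experts)) * 0.02
experts = [
    (rng.standard_normal((d_model, d_ff)) * 0.02,
     rng.standard_normal((d_ff, d_model)) * 0.02)
    for _ in range(n_experts)
]

def moe_layer(x):
    """x: (d_model,) hidden state of one token. Only top_k experts run."""
    logits = x @ router
    chosen = np.argsort(logits)[-top_k:]                # indices of the top-k experts
    gates = np.exp(logits[chosen] - logits[chosen].max())
    gates /= gates.sum()                                # normalized gating weights
    out = np.zeros_like(x)
    for g, idx in zip(gates, chosen):
        w_in, w_out = experts[idx]
        out += g * (np.maximum(x @ w_in, 0.0) @ w_out)  # simple ReLU FFN as the expert
    return out

token = rng.standard_normal(d_model)
print(moe_layer(token).shape)  # (512,); only 2 of 16 experts touched for this token
```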

Impact & Significance

DeepSeek-V2 demonstrated that Chinese AI labs could produce highly competitive and innovative model architectures. Its efficiency innovations influenced subsequent model designs and made frontier AI more economically accessible.
