Mistral 7B
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al.
Abstract
We introduce Mistral 7B, a 7-billion-parameter language model that outperforms the best open 13B model (Llama 2 13B) on all evaluated benchmarks and the best released 34B model (Llama 1 34B) in reasoning, mathematics, and code generation. Mistral 7B uses grouped-query attention (GQA) for faster inference and sliding window attention (SWA) to handle long sequences at reduced inference cost.
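To make the sliding window mechanism concrete, here is a minimal NumPy sketch of the attention mask it implies: each query position attends only to itself and the preceding window − 1 positions, so per-token attention cost is bounded by the window rather than the full sequence (the paper uses a window of 4,096 tokens). The function name and exact boundary convention are illustrative assumptions, not the paper's code.

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask: True where query position i may attend to key position j.

    Position i sees itself and the previous `window - 1` positions, so
    attention cost per token is O(window) instead of O(seq_len).
    """
    i = np.arange(seq_len)[:, None]  # query positions
    j = np.arange(seq_len)[None, :]  # key positions
    return (j <= i) & (j > i - window)

# Example: with window=3, token 4 sees only tokens 2, 3, and 4.
mask = sliding_window_mask(seq_len=6, window=3)
print(mask.astype(int))
```

Because information still propagates one window per layer, tokens deeper in the network can indirectly draw on context far beyond the window itself, which is how SWA supports long sequences cheaply.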
Key Findings
1. Outperformed Llama 2 13B on all evaluated benchmarks despite having roughly half the parameters
2. Matched or exceeded Llama 1 34B on reasoning, math, and code generation
3. Used sliding window attention (SWA) for efficient long-context handling
4. Used grouped-query attention (GQA) for faster inference (a toy implementation follows this list)
5. Released under the Apache 2.0 license for unrestricted use
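The grouped-query attention finding above can likewise be sketched in a few lines. In GQA, several query heads share one key/value head, shrinking the KV cache and speeding up decoding; Mistral 7B pairs 32 query heads with 8 key/value heads. The sketch below uses toy shapes and hypothetical names, and is an illustration of the technique under those assumptions, not the model's actual implementation.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d).

    Each KV head is shared by n_q_heads // n_kv_heads query heads.
    """
    n_q_heads, seq, d = q.shape
    group = n_q_heads // k.shape[0]
    # Repeat each KV head so every query head has a matching K/V.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)  # (n_q_heads, seq, seq)
    # Causal mask: queries attend only to current and earlier positions.
    causal = np.tril(np.ones((seq, seq), dtype=bool))
    scores = np.where(causal, scores, -np.inf)
    # Numerically stable softmax over key positions.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Toy shapes: 4 query heads sharing 2 KV heads (groups of 2).
rng = np.random.default_rng(0)
q = rng.standard_normal((4, 5, 8))
k = rng.standard_normal((2, 5, 8))
v = rng.standard_normal((2, 5, 8))
print(grouped_query_attention(q, k, v).shape)  # (4, 5, 8)
```

Sharing KV heads cuts the cached K/V tensors by the group factor (2x in this toy, 4x in Mistral 7B's 32/8 configuration), which is where the inference speedup and memory savings come from.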
Impact & Significance
Mistral 7B demonstrated that a smaller, well-optimized model can outperform much larger ones, helping catalyze the efficient LLM movement. The release established Mistral AI as a major player and accelerated the trend toward smaller, more capable open models.