Mistral 7B
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al.
Abstract
We introduce Mistral 7B, a 7-billion-parameter language model that outperforms the best open 13B model (Llama 2 13B) on all evaluated benchmarks and the best released 34B model (Llama 1 34B) in reasoning, mathematics, and code generation. Mistral 7B uses grouped-query attention (GQA) for faster inference and sliding window attention (SWA) to handle long sequences at reduced inference cost.
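To make the sliding window mechanism concrete, here is a minimal NumPy sketch of the attention mask it implies: each query position attends only to itself and the preceding window − 1 positions, so per-token attention cost is bounded by the window rather than the full sequence (the paper uses a window of 4,096 tokens). The function name and exact boundary convention are illustrative assumptions, not the paper's code.

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask: True where query position i may attend to key position j.

    Position i sees itself and the previous `window - 1` positions, so
    attention cost per token is O(window) instead of O(seq_len).
    """
    i = np.arange(seq_len)[:, None]  # query positions
    j = np.arange(seq_len)[None, :]  # key positions
    return (j <= i) & (j > i - window)

# Example: with window=3, token 4 sees only tokens 2, 3, and 4.
mask = sliding_window_mask(seq_len=6, window=3)
print(mask.astype(int))
```

Because information still propagates one window per layer, tokens deeper in the network can indirectly draw on context far beyond the window itself, which is how SWA supports long sequences cheaply.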
Key Findings
1. Outperformed Llama 2 13B on all evaluated benchmarks despite having roughly half the parameters
2. Matched or exceeded Llama 1 34B on reasoning, math, and code generation
3. Used sliding window attention (SWA) for efficient long-context handling
4. Used grouped-query attention (GQA) for faster inference (a toy implementation follows this list)
5. Released under the Apache 2.0 license for unrestricted use
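The grouped-query attention finding above can likewise be sketched in a few lines. In GQA, several query heads share one key/value head, shrinking the KV cache and speeding up decoding; Mistral 7B pairs 32 query heads with 8 key/value heads. The sketch below uses toy shapes and hypothetical names, and is an illustration of the technique under those assumptions, not the model's actual implementation.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d).

    Each KV head is shared by n_q_heads // n_kv_heads query heads.
    """
    n_q_heads, seq, d = q.shape
    group = n_q_heads // k.shape[0]
    # Repeat each KV head so every query head has a matching K/V.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)  # (n_q_heads, seq, seq)
    # Causal mask: queries attend only to current and earlier positions.
    causal = np.tril(np.ones((seq, seq), dtype=bool))
    scores = np.where(causal, scores, -np.inf)
    # Numerically stable softmax over key positions.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Toy shapes: 4 query heads sharing 2 KV heads (groups of 2).
rng = np.random.default_rng(0)
q = rng.standard_normal((4, 5, 8))
k = rng.standard_normal((2, 5, 8))
v = rng.standard_normal((2, 5, 8))
print(grouped_query_attention(q, k, v).shape)  # (4, 5, 8)
```

Sharing KV heads cuts the cached K/V tensors by the group factor (2x in this toy, 4x in Mistral 7B's 32/8 configuration), which is where the inference speedup and memory savings come from.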
Impact & Significance
Mistral 7B demonstrated that a smaller, well-optimized model can outperform much larger ones, helping catalyze the efficient LLM movement. The release established Mistral AI as a major player and accelerated the trend toward smaller, more capable open models.