Efficiency · October 10, 2023 · Mistral AI

Mistral 7B

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al.

Abstract

We introduce Mistral 7B, a 7-billion-parameter language model that outperforms the best open 13B model (Llama 2 13B) on all evaluated benchmarks and the best released 34B model (Llama 1 34B) on reasoning, math, and code generation. Mistral 7B uses grouped-query attention (GQA) for faster inference and sliding window attention (SWA) to handle longer sequences at reduced inference cost.
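To make the sliding-window mechanism concrete, here is a minimal NumPy sketch of causal sliding window attention. It is illustrative rather than the paper's implementation: the helper names, the window size of 3, and the single-head toy shapes are all assumptions. Each token attends only to the previous `window` tokens (itself included), so the per-token attention cost is bounded by the window size rather than the full sequence length.

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask where position i may attend to position j iff
    i - window < j <= i (causal, limited to the last `window` tokens)."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

def attention(q, k, v, mask):
    """Plain scaled dot-product attention with a boolean mask."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)  # block out-of-window positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Toy example: 8 tokens, head dim 4, window of 3 (Mistral 7B uses 4096).
rng = np.random.default_rng(0)
q = k = v = rng.standard_normal((8, 4))
out = attention(q, k, v, sliding_window_mask(8, window=3))
print(out.shape)  # (8, 4)
```

Because each layer widens the effective receptive field by one more window, tokens far apart can still influence each other after enough layers, which is how a fixed window handles long sequences cheaply.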

Key Findings

  1. Outperformed Llama 2 13B on all evaluated benchmarks despite having roughly half the parameters
  2. Matched or exceeded Llama 1 34B on reasoning, math, and code
  3. Introduced sliding window attention for efficient long-context handling
  4. Used grouped-query attention for faster inference (see the sketch after this list)
  5. Released under the Apache 2.0 license for unrestricted use
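As a companion illustration for the grouped-query attention item above, here is a hedged NumPy sketch of GQA. The shapes and helper name are assumptions chosen for illustration (Mistral 7B itself uses 32 query heads sharing 8 key/value heads), and causal masking is omitted for brevity. The idea: several query heads share one key/value head, so the key/value cache shrinks by the group factor while most of the query-side expressiveness is preserved.

```python
import numpy as np

def grouped_query_attention(q, k, v, n_groups):
    """q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d), with
    n_q_heads = n_kv_heads * n_groups. Each group of query heads
    shares one key/value head. Causal masking omitted for brevity."""
    d = q.shape[-1]
    # Repeat each KV head so it lines up with its group of query heads.
    k = np.repeat(k, n_groups, axis=0)
    v = np.repeat(v, n_groups, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Toy shapes: 8 query heads share 2 KV heads (groups of 4), so the
# KV cache is 4x smaller than with standard multi-head attention.
rng = np.random.default_rng(0)
q = rng.standard_normal((8, 16, 32))
k = rng.standard_normal((2, 16, 32))
v = rng.standard_normal((2, 16, 32))
print(grouped_query_attention(q, k, v, n_groups=4).shape)  # (8, 16, 32)
```

Shrinking the key/value cache is what speeds up inference: autoregressive decoding is memory-bandwidth bound, and GQA reduces the bytes read per generated token without collapsing to a single shared head as multi-query attention does.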

Impact & Significance

Mistral 7B demonstrated that a smaller, well-optimized model can outperform much larger ones, catalyzing the push toward efficient LLMs. The release established Mistral AI as a major player in open models and reinforced the industry trend toward compact, highly capable systems.
