What Is Model Quantization?

Definition

Model quantization is a compression technique that reduces the size and computational cost of AI models by converting their numerical weights from high-precision formats (like 32-bit floating point) to lower-precision formats (like 8-bit or 4-bit integers), with minimal loss in quality.

How Model Quantization Works

AI models store their knowledge as numerical weights, typically in 32-bit or 16-bit floating-point format. Quantization converts these to lower precision, such as 8-bit (INT8) or even 4-bit (INT4) integers, reducing memory usage by 2-8x and often speeding up inference significantly. Some precision is lost, but modern quantization methods (like GPTQ and AWQ) and formats (like GGUF) minimize quality degradation through careful calibration. Quantization is what makes it possible to run large language models like LLaMA 70B on consumer GPUs or even laptops, and it is a key enabler of on-device AI, edge deployment, and cost-efficient inference at scale.
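The float-to-integer conversion described above can be sketched as a minimal symmetric, per-tensor INT8 scheme. This is an illustration of the core idea only, not the calibrated methods (GPTQ, AWQ) named above, and the function names are ours:

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: map floats into [-127, 127]."""
    scale = np.abs(weights).max() / 127.0  # one scale factor for the whole tensor
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Approximate reconstruction of the original float weights."""
    return q.astype(np.float32) * scale

# FP32 weights shrink 4x when stored as INT8 (plus one float for the scale).
w = np.random.randn(64, 64).astype(np.float32)
q, scale = quantize_int8(w)
error = np.abs(dequantize(q, scale) - w).max()  # rounding error is at most scale / 2
```

Production schemes refine this basic recipe, e.g. with per-channel or per-group scales and calibration data that chooses scales to minimize the output error of each layer rather than the raw weight error.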

Real-World Examples

1. Running a 70B-parameter LLaMA model quantized to 4-bit on a laptop with 32GB of RAM using llama.cpp

2. A cloud provider offering quantized model serving at 3x lower cost than full-precision inference

3. TheBloke on Hugging Face providing GGUF quantized versions of popular models for local deployment
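The memory savings behind examples like these follow from simple arithmetic. A back-of-envelope sketch (the helper name is ours; real deployments also need memory for activations, the KV cache, and runtime overhead, so weight size alone understates the total):

```python
# Approximate memory needed to store just the model weights, in decimal GB.
def weight_memory_gb(n_params, bits_per_weight):
    return n_params * bits_per_weight / 8 / 1e9

params = 70e9  # a 70B-parameter model such as LLaMA 70B
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(params, bits):.0f} GB")
# 32-bit: 280 GB, 16-bit: 140 GB, 8-bit: 70 GB, 4-bit: 35 GB
```

At 4-bit, the weights of a 70B model drop from 280 GB to roughly 35 GB, which is what moves it from multi-GPU server territory toward high-end consumer hardware.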
