What Is Model Quantization?
Model quantization is a compression technique that reduces the size and computational cost of AI models by converting their numerical weights from high-precision formats (like 32-bit floating point) to lower-precision formats (like 8-bit or 4-bit integers), with minimal loss in quality.
How Model Quantization Works
AI models store their knowledge as numerical weights, typically in 32-bit or 16-bit floating point format. Quantization converts these to lower precision — 8-bit (INT8) or even 4-bit (INT4) — reducing memory usage by 2-8x and often speeding up inference significantly. While some precision is lost, modern quantization methods such as GPTQ and AWQ (often distributed in the GGUF file format) minimize quality degradation through careful calibration. Quantization is what makes it possible to run large language models like LLaMA 70B on consumer GPUs or even laptops. It is a key enabler of on-device AI, edge deployment, and cost-efficient inference at scale.
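The core float-to-integer mapping can be sketched in a few lines. Below is a minimal, illustrative example of symmetric per-tensor INT8 quantization using NumPy — a simplification of what real toolchains like GPTQ or llama.cpp do (they add calibration, per-group scales, and other refinements):

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization: map floats onto [-127, 127]."""
    scale = max(np.max(np.abs(weights)) / 127.0, 1e-12)  # guard against all-zero tensors
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from integers and the stored scale."""
    return q.astype(np.float32) * scale

# Toy weight tensor standing in for a model layer
w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# INT8 storage is 4x smaller than FP32; rounding error is at most scale/2 per weight
print(q.nbytes, w.nbytes, np.max(np.abs(w - w_hat)))
```

The entire approximation error comes from rounding each weight to the nearest multiple of `scale`, which is why a well-chosen scale (and, in practice, calibration data) keeps quality loss small.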
Real-World Examples
Running a 70B parameter LLaMA model quantized to 4-bit on a laptop with 32GB RAM using llama.cpp
A cloud provider offering quantized model serving at 3x lower cost than full-precision inference
TheBloke on Hugging Face providing GGUF quantized versions of popular models for local deployment