
What Is QLoRA (Quantized Low-Rank Adaptation)?

Definition

QLoRA (Quantized Low-Rank Adaptation) is a fine-tuning technique that combines 4-bit quantization of the base model with trainable LoRA adapters, making it possible to fine-tune very large language models on a single GPU: the original QLoRA work fine-tuned a 65-billion-parameter model on one 48 GB GPU.

How QLoRA (Quantized Low-Rank Adaptation) Works

QLoRA takes the efficiency of LoRA a step further by quantizing the frozen base model to 4-bit precision, dramatically reducing memory usage while preserving fine-tuning quality. It introduces two key innovations: the 4-bit NormalFloat (NF4) data type, whose quantization levels are matched to the roughly normal distribution of trained network weights, and double quantization, which quantizes the per-block quantization constants themselves to save additional memory. The LoRA adapters are trained in 16-bit precision on top of the frozen 4-bit base model, with gradients flowing through the quantized weights into the adapters. As a result, a 65-billion-parameter model that would normally require multiple expensive GPUs can be fine-tuned on a single 48 GB GPU. QLoRA has democratized model customization for individual developers and small teams.
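The blockwise 4-bit quantization at the heart of this can be sketched in a few lines of plain Python. This is a toy illustration, not the bitsandbytes implementation: the block size is arbitrary, and the codebook here approximates the NF4 levels from quantiles of a standard normal distribution.

```python
import random
from statistics import NormalDist

def nf4_levels():
    """16 quantization levels taken from quantiles of N(0, 1),
    rescaled so the outermost levels sit at -1 and +1.
    (Approximates the spirit of NF4; the real codebook differs.)"""
    nd = NormalDist()
    qs = [nd.inv_cdf((i + 0.5) / 16) for i in range(16)]
    m = max(abs(q) for q in qs)
    return [q / m for q in qs]

LEVELS = nf4_levels()

def quantize_block(block):
    """Quantize one block of weights: keep a single float 'absmax'
    scale plus a 4-bit codebook index per weight."""
    absmax = max(abs(w) for w in block) or 1.0
    codes = [
        # normalize into [-1, 1], then pick the nearest codebook level
        min(range(16), key=lambda i, x=w / absmax: abs(LEVELS[i] - x))
        for w in block
    ]
    return absmax, codes

def dequantize_block(absmax, codes):
    """Reconstruct approximate weights from scale + 4-bit codes."""
    return [LEVELS[c] * absmax for c in codes]

# Quantize a block of normally distributed weights and check the error.
random.seed(0)
block = [random.gauss(0, 0.02) for _ in range(64)]
scale, codes = quantize_block(block)
recon = dequantize_block(scale, codes)
err = max(abs(a - b) for a, b in zip(block, recon))
print(f"max abs reconstruction error: {err:.5f}")
# Storage per block: 64 weights x 4 bits + one scale,
# vs. 64 x 16 bits in the original half-precision tensor.
# Double quantization goes one step further by also quantizing
# the per-block 'scale' values themselves.
```

Note that each block stores one full-precision scale; double quantization exists precisely because, over billions of weights, those per-block constants add up.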

Real-World Examples

1. A solo developer fine-tuning a 33B-parameter model on a single RTX 4090 (24 GB) using QLoRA

2. A research lab using QLoRA to quickly prototype and test fine-tuned models before committing to full-precision training

3. A small startup customizing an open-source LLM for their domain using QLoRA on a cloud GPU instance costing under $1/hour
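The memory savings behind these examples can be checked with back-of-envelope arithmetic. This is a rough sketch that counts only weights and adapters; real usage also needs room for activations, the adapters' optimizer state, and runtime overhead, and the layer count, hidden size, and rank below are hypothetical.

```python
def base_model_gb(n_params, bits=4):
    """Memory for the frozen base model at `bits` per parameter."""
    return n_params * bits / 8 / 1e9

def lora_adapter_gb(n_layers, d_model, rank, n_matrices=4, bytes_per=2):
    """LoRA adds two rank-r matrices (A: r x d, B: d x r) per adapted
    weight matrix, stored in 16-bit precision (2 bytes each)."""
    params = n_layers * n_matrices * 2 * d_model * rank
    return params * bytes_per / 1e9

# A 65B-parameter model: 4-bit weights fit well under one 48 GB GPU.
print(f"65B base @ 4-bit:  {base_model_gb(65e9):.1f} GB")   # 32.5 GB

# Hypothetical adapter setup: 80 layers, hidden size 8192, rank 16,
# adapters on 4 attention projections per layer -- tiny by comparison.
print(f"LoRA adapters:     {lora_adapter_gb(80, 8192, 16):.2f} GB")

# The same weights in 16-bit would need ~130 GB before anything else.
print(f"65B base @ 16-bit: {base_model_gb(65e9, bits=16):.1f} GB")
```

The same arithmetic shows why a 33B model (roughly 16.5 GB at 4-bit) fits on a 24 GB card while a 70B model does not.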
