What Is Knowledge Distillation?
Knowledge distillation is a model compression technique where a smaller 'student' model is trained to replicate the behavior of a larger 'teacher' model, producing a compact model that retains much of the teacher's performance at a fraction of the computational cost.
How Knowledge Distillation Works
Large AI models are powerful but expensive to run. Knowledge distillation addresses this by training a smaller model to mimic a larger one's outputs, including its full probability distributions over answers (soft labels), not just its final predictions. The student model learns from the teacher's 'dark knowledge' — the nuanced relationships between possible answers that hard labels don't capture. This produces a model that is significantly smaller and faster while often retaining most of the teacher's performance. Distillation is widely used to create models suitable for mobile devices, edge deployment, and real-time applications.
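The training objective described above can be sketched in a few lines. The following is a minimal illustration, not a production implementation: it combines a temperature-softened KL-divergence term (matching the teacher's soft labels) with a standard cross-entropy term on the hard label. The hyperparameter names `temperature` and `alpha` are illustrative, though both appear in the standard distillation formulation.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: a higher temperature flattens the
    distribution, exposing the teacher's 'dark knowledge' about
    non-top classes."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=4.0, alpha=0.5):
    """Weighted sum of (a) KL divergence between the teacher's and
    student's softened distributions and (b) cross-entropy against
    the ground-truth hard label."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    # KL(teacher || student) on the softened distributions, scaled by
    # T^2 so gradient magnitudes stay comparable across temperatures.
    soft_loss = temperature ** 2 * sum(
        ti * math.log(ti / si) for ti, si in zip(t, s) if ti > 0)
    # Standard cross-entropy on the unsoftened student distribution.
    hard_loss = -math.log(softmax(student_logits)[hard_label])
    return alpha * soft_loss + (1 - alpha) * hard_loss
```

When the student's logits exactly match the teacher's, the soft-label term vanishes; training drives the student toward that point while the hard-label term keeps it anchored to the ground truth.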
Real-World Examples
OpenAI's GPT-4o mini, widely reported to draw on distillation from larger GPT-4-class models to create a smaller, faster, and cheaper option
Hugging Face creating DistilBERT, which is 40% smaller and 60% faster than BERT while retaining 97% of its language-understanding performance
A company distilling a large vision model into a lightweight version that runs on smartphones for real-time object detection