Low-Rank Factorization in PyTorch: Compressing Neural Networks with Linear Algebra

Introduction: Can we shrink neural networks without sacrificing much accuracy? Low-rank factorization is a powerful, often overlooked technique that compresses models by decomposing large weight matrices into smaller components. In this post, we’ll explain what low-rank factorization is, show how to apply it to a ResNet50 model in PyTorch, and evaluate the trade-offs. ...
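As a quick taste of the idea (not the post's code), here is a minimal sketch that factorizes a single nn.Linear layer with a truncated SVD; the layer sizes and the rank of 64 are illustrative choices, while the post applies the technique to ResNet50:

```python
import torch
import torch.nn as nn

def factorize_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Replace one Linear layer with two smaller ones via truncated SVD."""
    W = layer.weight.data                      # shape: (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]               # (out_features, rank), columns scaled by singular values
    V_r = Vh[:rank, :]                         # (rank, in_features)

    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data = V_r
    second.weight.data = U_r
    if layer.bias is not None:
        second.bias.data = layer.bias.data
    return nn.Sequential(first, second)

layer = nn.Linear(1024, 1024)                  # ~1.05M parameters
compressed = factorize_linear(layer, rank=64)  # ~131K parameters
x = torch.randn(8, 1024)
print(torch.dist(layer(x), compressed(x)))     # error from dropping small singular values
```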

April 29, 2025 · 12 min · Arik Poznanski

Knowledge Distillation in PyTorch: Shrinking Neural Networks the Smart Way

Introduction: What if your model could run twice as fast and use half the memory, without giving up much accuracy? This is the promise of knowledge distillation: training smaller, faster models to mimic larger, high-performing ones. In this post, we’ll walk through how to distill a powerful ResNet50 model into a lightweight ResNet18 and demonstrate a +5% boost in accuracy compared to training the smaller model from scratch, all while cutting inference latency by over 50%. ...
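To make the idea concrete before reading the full post, here is a minimal sketch of a standard distillation loss; the temperature T and weight alpha are illustrative values, not the post's settings:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft targets: match the teacher's softened output distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Inside a training loop (teacher frozen, student being trained):
# with torch.no_grad():
#     teacher_logits = teacher(images)
# loss = distillation_loss(student(images), teacher_logits, labels)
```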

April 24, 2025 · 13 min · Arik Poznanski

Neural Network Quantization in PyTorch

Introduction: This tutorial provides an introduction to quantization in PyTorch, covering both theory and practice. We’ll explore the different types of quantization and apply both post-training quantization (PTQ) and quantization-aware training (QAT) on a simple example using CIFAR-10 and ResNet18. In the presented example we achieve a 75% reduction in space and a 16% reduction in GPU latency with only a 1% drop in accuracy. What is Quantization? Quantization is a model optimization technique that reduces the numerical precision used to represent weights and activations in deep learning models. Its primary benefits include: ...
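For a flavor of the workflow, here is a minimal eager-mode PTQ sketch on a toy model; the SmallNet architecture and random calibration data are stand-ins for illustration only, while the post works through ResNet18 on CIFAR-10 and also covers QAT:

```python
import torch
import torch.nn as nn

class SmallNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()      # float32 -> int8 at the input
        self.conv = nn.Conv2d(3, 16, 3, padding=1)
        self.relu = nn.ReLU()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(16, 10)
        self.dequant = torch.ao.quantization.DeQuantStub()  # int8 -> float32 at the output

    def forward(self, x):
        x = self.quant(x)
        x = self.pool(self.relu(self.conv(x)))
        x = self.fc(torch.flatten(x, 1))
        return self.dequant(x)

model = SmallNet().eval()
model.qconfig = torch.ao.quantization.get_default_qconfig("fbgemm")
prepared = torch.ao.quantization.prepare(model)             # insert observers
for _ in range(10):                                         # calibration passes (random data here)
    prepared(torch.randn(8, 3, 32, 32))
quantized = torch.ao.quantization.convert(prepared)         # swap in int8 modules
```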

April 16, 2025 · 16 min · Arik Poznanski

Neural Network Pruning: How to Accelerate Inference with Minimal Accuracy Loss

Introduction: In this post, I will demonstrate how to use pruning to significantly reduce a model’s size and latency with minimal accuracy loss. In the example, we achieve a 90% reduction in model size and 5.5x faster inference time, all while preserving the same level of accuracy. ...
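As a small sketch of the mechanics (not the post's example), here is magnitude pruning of a single layer with torch.nn.utils.prune; the toy Conv2d layer and the 50% sparsity level are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Conv2d(16, 32, kernel_size=3)

# Zero out the 50% of weights with the smallest magnitude (L1 criterion).
prune.l1_unstructured(layer, name="weight", amount=0.5)
print(f"sparsity: {(layer.weight == 0).float().mean():.0%}")

# Make the pruning permanent by removing the mask re-parametrization.
prune.remove(layer, "weight")
```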

April 10, 2025 · 9 min · Arik Poznanski

Introduction to Model Compression: Why and How to Shrink Neural Networks for Speed

Introduction: Deep learning models have grown increasingly large and complex, enabling state-of-the-art performance in tasks such as image recognition, natural language processing, and generative AI. However, these large models often come with high computational costs, making them slow to run on edge devices, embedded systems, or even in cloud environments with strict latency requirements. Model compression techniques aim to reduce the size and computational requirements of neural networks while maintaining their accuracy. This enables faster inference, lower power consumption, and better deployment flexibility. In this post, we’ll explore why model compression is essential and provide an overview of four key techniques: pruning, quantization, knowledge distillation, and low-rank factorization. ...

April 2, 2025 · 5 min · Arik Poznanski