Quantization | Practical ML

Introduction This tutorial provides an introduction to quantization in PyTorch, covering both theory and practice. We’ll explore the different types of quantization, and apply both post training quantization (PTQ) and quantization aware training (QAT) on a simple example using CIFAR-10 and ResNet18. In the presented example we achieve a 75% reduction in space and 16% reduction in GPU latency with only 1% drop in accuracy. What is Quantization? Quantization is a model optimization technique that reduces the numerical precision used to represent weights and activations in deep learning models. Its primary benefits include: ...