Low-Rank Factorization in PyTorch: Compressing Neural Networks with Linear Algebra
Introduction

Can we shrink neural networks without sacrificing much accuracy? Low-rank factorization is a powerful, often overlooked technique that compresses models by decomposing large weight matrices into smaller components. In this post, we’ll explain what low-rank factorization is, show how to apply it to a ResNet50 model in PyTorch, and evaluate the trade-offs.
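To make the core idea concrete before we get to ResNet50, here is a minimal sketch of factorizing a single fully connected layer with a truncated SVD. The helper name `factorize_linear` and the layer sizes are illustrative, not from any library: a weight matrix W of shape (out, in) is approximated as the product of two thin matrices, turning one `nn.Linear` into two smaller ones.

```python
import torch
import torch.nn as nn

def factorize_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Replace a Linear layer with two smaller ones via truncated SVD.

    W (out x in) ~= (U_r * S_r) @ V_r, so the layer becomes
    Linear(in, rank, bias=False) followed by Linear(rank, out).
    """
    W = layer.weight.data  # shape (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]   # (out, rank), singular values folded in
    V_r = Vh[:rank, :]             # (rank, in)

    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features,
                       bias=layer.bias is not None)
    first.weight.data = V_r.clone()
    second.weight.data = U_r.clone()
    if layer.bias is not None:
        second.bias.data = layer.bias.data.clone()
    return nn.Sequential(first, second)

# Illustrative usage: a 512x512 layer has 512*512 + 512 parameters;
# at rank 64 the pair has 512*64 + 64*512 + 512, roughly a 4x reduction.
layer = nn.Linear(512, 512)
compressed = factorize_linear(layer, rank=64)
x = torch.randn(8, 512)
y = compressed(x)  # same output shape as the original layer
```

The rank controls the compression/accuracy trade-off: at full rank the reconstruction is exact (up to floating-point error), and lower ranks shrink the parameter count at the cost of approximation error.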