Introduction

Deep learning models have grown increasingly large and complex, enabling state-of-the-art performance in tasks such as image recognition, natural language processing, and generative AI. However, these large models often come with high computational costs, making them slow to run on edge devices, embedded systems, or even in cloud environments with strict latency requirements.
Model compression techniques aim to reduce the size and computational requirements of neural networks while preserving most of their accuracy. This enables faster inference, lower power consumption, and more flexible deployment. In this post, we'll explore why model compression is essential and provide an overview of four key techniques: pruning, quantization, knowledge distillation, and low-rank factorization.
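To give a quick taste before we dive into each technique, here is a minimal sketch of one of them: post-training dynamic quantization in PyTorch. The toy model and the size-measuring helper are illustrative assumptions, not part of any specific workflow; the point is simply that a one-line API call can shrink the weights of a trained network.

```python
import io

import torch
import torch.nn as nn

# A toy model standing in for a larger trained network (hypothetical example).
model = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

# Dynamic quantization: weights of Linear layers are stored as 8-bit
# integers and dequantized on the fly during inference.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

def size_mb(m: nn.Module) -> float:
    """Rough size of a model's serialized state dict, in megabytes."""
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

print(f"fp32 model: {size_mb(model):.2f} MB")
print(f"int8 model: {size_mb(quantized):.2f} MB")
```

On this toy network, the quantized version's weights take roughly a quarter of the original storage, since int8 values replace 32-bit floats; the trade-offs behind that saving are exactly what the rest of this post unpacks.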
...