Speed Up PyTorch Training by 3x with NVIDIA Nsight and PyTorch 2.0 Tricks

What You Will Learn: This post demonstrates how to achieve a 3.2x speedup in PyTorch training by systematically identifying and eliminating performance bottlenecks. We’ll start by using NVIDIA Nsight Systems to profile a typical training loop, uncovering inefficiencies such as unnecessary CPU-GPU synchronization and slow data transfers. Guided by the profiler, we’ll apply targeted fixes, such as asynchronous data movement and smarter loss accumulation, that directly address the observed issues. ...

May 25, 2025 · 16 min · Arik Poznanski

Fast Image Loading with NVIDIA nvImageCodec

Introduction: In deep learning pipelines, especially those involving image data, data loading and preprocessing often become major bottlenecks. Traditionally, image decoding is performed using libraries like OpenCV or Pillow, which rely on CPU-based processing. After decoding, the data must be transferred to GPU memory for further operations. But what if the decoding process itself could be performed directly on the GPU? Could this lead to faster performance? ...

April 7, 2025 · 4 min · Arik Poznanski