Neural Network Pruning: How to Accelerate Inference with Minimal Accuracy Loss
Introduction

In this post, I will demonstrate how to use pruning to significantly reduce a model’s size and latency with minimal accuracy loss. In the example, we achieve a 90% reduction in model size and 5.5x faster inference while preserving the same level of accuracy. ...
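To give a flavor of what the full post walks through, here is a minimal sketch of magnitude pruning with PyTorch’s torch.nn.utils.prune; the toy model, the 90% sparsity level, and the layer choice are illustrative assumptions, not the exact setup behind the numbers quoted above.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy model standing in for the network pruned in the post.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

# L1-magnitude unstructured pruning: zero out the 90% smallest weights per layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.9)
        prune.remove(module, "weight")  # bake the mask into the weight tensor

# Report the resulting sparsity.
total = sum(p.numel() for p in model.parameters())
zeros = sum((p == 0).sum().item() for p in model.parameters())
print(f"sparsity: {zeros / total:.1%}")
```

Note that unstructured sparsity like this mainly shrinks the stored model; turning it into wall-clock speedups usually also requires structured pruning or a sparse-aware runtime.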
Fast Image Loading with NVIDIA nvImageCodec
Introduction

In deep learning pipelines, especially those involving image data, data loading and preprocessing often become major bottlenecks. Traditionally, image decoding is performed using libraries like OpenCV or Pillow, which rely on CPU-based processing. After decoding, the data must be transferred to GPU memory for further operations. But what if the decoding process itself could be performed directly on the GPU? Could this lead to faster performance? ...
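As a rough sketch of the idea, the snippet below decodes a JPEG straight into GPU memory with nvImageCodec’s Python bindings, next to the usual CPU decode with OpenCV; the file path is a placeholder, and the import path and Decoder.read call follow the library’s published examples, so treat them as assumptions to check against your installed version.

```python
import cv2
from nvidia import nvimgcodec  # NVIDIA nvImageCodec Python bindings

# CPU path: OpenCV decodes on the host, and the array still needs a
# host-to-device copy before any GPU work can happen.
cpu_img = cv2.imread("sample.jpg")  # placeholder path

# GPU path: nvImageCodec decodes directly into device memory.
decoder = nvimgcodec.Decoder()
gpu_img = decoder.read("sample.jpg")

# The decoded image exposes __cuda_array_interface__, so libraries such as
# CuPy or PyTorch can wrap it without an extra copy.
print(type(cpu_img), type(gpu_img))
```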
Introduction to Model Compression: Why and How to Shrink Neural Networks for Speed
Introduction

Deep learning models have grown increasingly large and complex, enabling state-of-the-art performance in tasks such as image recognition, natural language processing, and generative AI. However, these large models often come with high computational costs, making them slow to run on edge devices, embedded systems, or even in cloud environments with strict latency requirements.

Model compression techniques aim to reduce the size and computational requirements of neural networks while maintaining their accuracy. This enables faster inference, lower power consumption, and better deployment flexibility. In this post, we’ll explore why model compression is essential and provide an overview of four key techniques: pruning, quantization, knowledge distillation, and low-rank factorization. ...
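To make one of these techniques concrete before the deeper dives, here is a small sketch of post-training dynamic quantization in PyTorch, which stores Linear-layer weights as int8 and typically cuts their size roughly 4x; the toy model and sizes are illustrative only.

```python
import io
import torch
import torch.nn as nn

# Toy float32 model standing in for a larger network.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()

# Dynamic quantization: Linear weights become int8; activations are
# quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

def serialized_mb(m: nn.Module) -> float:
    # Serialized state_dict size as a rough proxy for on-disk model size.
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

print(f"fp32 model: {serialized_mb(model):.2f} MB")
print(f"int8 model: {serialized_mb(quantized):.2f} MB")
```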
How to Make Your Neural Network Run Faster: An Overview of Optimization Techniques
Introduction

Neural networks are becoming increasingly powerful, but speed remains a crucial factor in real-world applications. Whether you’re running models in the cloud, on edge devices, or on personal hardware, optimizing them for speed brings lower latency and reduced resource consumption. In this post, we’ll explore various techniques to accelerate neural networks, from model compression to hardware optimizations. This will serve as a foundation for future deep dives into each method. ...
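As a quick preview of the kind of measurement these posts revolve around, the sketch below times a ResNet-50 forward pass in float32 and float16 on a GPU; the model, batch size, and timings are illustrative stand-ins, and results will vary with your hardware.

```python
import time
import torch
from torchvision import models

device = "cuda" if torch.cuda.is_available() else "cpu"
model = models.resnet50(weights=None).to(device).eval()
x = torch.randn(16, 3, 224, 224, device=device)

def time_inference(m, inp, runs=20):
    # Warm up, synchronize, then average wall-clock time over several runs.
    with torch.inference_mode():
        for _ in range(3):
            m(inp)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(runs):
            m(inp)
        if device == "cuda":
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs

print(f"fp32: {time_inference(model, x) * 1e3:.1f} ms/batch")
if device == "cuda":
    print(f"fp16: {time_inference(model.half(), x.half()) * 1e3:.1f} ms/batch")
```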