Introduction
In deep learning pipelines, especially those involving image data, data loading and preprocessing often become major bottlenecks. Traditionally, image decoding is performed using libraries like OpenCV or Pillow, which rely on CPU-based processing. After decoding, the data must be transferred to GPU memory for further operations. But what if the decoding process itself could be performed directly on the GPU? Could this lead to faster performance?
In this post, I demonstrate how to use nvImageCodec to achieve a 2.18x speedup in JPEG loading by decoding directly on the GPU. Learn more about nvImageCodec in its documentation or on GitHub.
What is nvImageCodec?
nvImageCodec is a high-performance image codec optimized for GPU acceleration. It is designed for scenarios like model training and batch inference, where decoding thousands of images quickly is critical. The library supports decoding (bytes to pixels) and encoding (pixels to bytes) for various common image formats. However, not all formats are fully supported on the GPU. Some, like PNG and WebP, fall back to CPU-based decoding. Below is a summary of supported formats:
Format Support:

Format | GPU Decode | GPU Encode | Notes |
---|---|---|---|
JPEG | Yes | Yes | Fastest, hardware-accelerated |
JPEG 2000 | Yes | Yes | |
TIFF | Yes | No (planned) | CUDA decoder |
PNG | No (planned) | No (planned) | CPU only |
WebP | No | No | CPU only |
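Before diving into the benchmark, here is what basic usage looks like: a minimal decode/encode round trip, sketched from the library's documented Python API (cat.jpg and cat_copy.jpg are placeholder file names):

from nvidia import nvimgcodec

decoder = nvimgcodec.Decoder()           # decodes on the GPU, falling back to CPU for unsupported formats
encoder = nvimgcodec.Encoder()

nv_img = decoder.read("cat.jpg")         # bytes on disk -> decoded image in GPU memory
print(nv_img.shape)                      # (height, width, channels)

encoder.write("cat_copy.jpg", nv_img)    # pixels -> encoded JPEG written back to disk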
What Was Benchmarked?
We compared the performance of:
- OpenCV: CPU-based decoding followed by PIL transformations.
- nvImageCodec: GPU-based decoding with tensor transformations.
Benchmark Details:
- Dataset: 1000 JPEG images from the ImageNet Sample Images dataset (credit: Eli Schwartz).
- Model: ResNet18 for inference.
- Transform Pipeline: Resize and crop applied to all images.
Each benchmark was run 10 times (plus 1 warmup iteration), and the average times were recorded for:
- Loading: Decoding, resizing, and tensor conversion.
- Inference: Model forward pass.
- Total: Combined loading and inference time.
All benchmarks were conducted in Google Colab using a T4 GPU instance.
Run this code in Google Colab to try it yourself.
Setup in Colab
Install Dependencies and Load Dataset
!pip install nvidia-nvimgcodec-cu11 opencv-python-headless
!git clone https://github.com/EliSchwartz/imagenet-sample-images.git
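Before moving on, a quick sanity check can save time: the small sketch below just confirms the Colab runtime sees a CUDA GPU and that the freshly installed wheel imports under nvidia.nvimgcodec.

import torch
from nvidia import nvimgcodec  # module provided by the nvidia-nvimgcodec-cu11 wheel

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
print("nvImageCodec version:", getattr(nvimgcodec, "__version__", "unknown"))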
Prepare the Images
import os, shutil
from pathlib import Path
source_dir = Path("imagenet-sample-images")
dest_dir = Path("benchmark_images")
dest_dir.mkdir(exist_ok=True)
all_images = list(source_dir.glob("*.JPEG"))
for img in all_images:
    shutil.copy(img, dest_dir / img.name)
image_paths = sorted(list(dest_dir.glob("*.JPEG")))
print(f"Prepared {len(image_paths)} images.")
Define Model and Preprocessing
import torch
import torchvision.transforms as transforms
import torchvision.models as models
from torchvision.transforms import Resize, CenterCrop
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = models.resnet18(pretrained=True).to(device).eval()
transform = transforms.Compose([
    Resize(256),
    CenterCrop(224),
])
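Note that this same transform object is reused in both benchmarks: torchvision's v1 transforms such as Resize and CenterCrop accept both PIL images and (CPU or CUDA) tensors, so the pipeline can be applied directly to GPU-decoded data. A quick check of that claim, using the images prepared above and a random tensor as a stand-in for a GPU-decoded image:

from PIL import Image

pil_img = Image.open(image_paths[0]).convert("RGB")
print(transform(pil_img).size)          # (224, 224), still a PIL image

tensor_img = torch.rand(3, 480, 640, device=device)  # stand-in for a GPU-decoded image
print(transform(tensor_img).shape)      # torch.Size([3, 224, 224]), stays on the GPU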
Benchmark Functions (10 Repeated Runs)
OpenCV Benchmark
def run_opencv_inference(image_paths, runs=10):
    import time
    import cv2
    import numpy as np
    from PIL import Image

    load_times, infer_times = [], []
    for run_idx in range(runs + 1):
        imgs = []
        t0 = time.time()
        for path in image_paths:
            img = cv2.imread(str(path))                 # CPU decode (BGR)
            img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
            img = Image.fromarray(img)
            img = transform(img)                        # resize + center crop on the CPU
            img = transforms.ToTensor()(img)
            imgs.append(img)
        batch = torch.stack(imgs).to(device)            # transfer the whole batch to the GPU
        load_time = time.time() - t0
        t1 = time.time()
        with torch.no_grad():
            model(batch)
        infer_time = time.time() - t1
        if run_idx == 0:
            print(f"Run {run_idx + 1}: Warmup iteration (not included in mean). Loading Time = {load_time:.4f}s, Inference Time = {infer_time:.4f}s")
        else:
            load_times.append(load_time)
            infer_times.append(infer_time)
            print(f"Run {run_idx + 1}: Loading Time = {load_time:.4f}s, Inference Time = {infer_time:.4f}s")
    return np.mean(load_times), np.mean(infer_times)
opencv_load, opencv_infer = run_opencv_inference(image_paths)
nvImageCodec Benchmark
def run_nvimagecodec_inference(image_paths, runs=10):
    import time
    import numpy as np
    from nvidia import nvimgcodec

    decoder = nvimgcodec.Decoder(device_id=0)
    load_times, infer_times = [], []
    for run_idx in range(runs + 1):
        imgs = []
        t0 = time.time()
        for path in image_paths:
            with open(path, 'rb') as f:
                data = f.read()
            nv_img = decoder.decode(data)                # JPEG bytes -> decoded image on the GPU
            # Wrap the GPU-decoded image as a CUDA tensor (HWC -> CHW, scaled to [0, 1])
            img = torch.as_tensor(nv_img.cuda()).permute(2, 0, 1).float().div(255)
            img = transform(img)                         # resize + center crop on the GPU
            imgs.append(img)
        batch = torch.stack(imgs).to(device)
        load_time = time.time() - t0
        t1 = time.time()
        with torch.no_grad():
            model(batch)
        infer_time = time.time() - t1
        if run_idx == 0:
            print(f"Run {run_idx + 1}: Warmup iteration (not included in mean). Loading Time = {load_time:.4f}s, Inference Time = {infer_time:.4f}s")
        else:
            load_times.append(load_time)
            infer_times.append(infer_time)
            print(f"Run {run_idx + 1}: Loading Time = {load_time:.4f}s, Inference Time = {infer_time:.4f}s")
    return np.mean(load_times), np.mean(infer_times)
nv_load, nv_infer = run_nvimagecodec_inference(image_paths)
Results & Visualization
import pandas as pd
import matplotlib.pyplot as plt
results = pd.DataFrame({
    "Method": ["OpenCV", "nvImageCodec"],
    "Loading Time (s)": [opencv_load, nv_load],
    "Inference Time (s)": [opencv_infer, nv_infer],
    "Total Time (s)": [
        opencv_load + opencv_infer,
        nv_load + nv_infer,
    ],
})
print(results)
results.plot(x="Method", y=["Loading Time (s)", "Inference Time (s)", "Total Time (s)"],
             kind="bar", figsize=(10, 6))
plt.title("OpenCV vs. nvImageCodec on 1000 ImageNet JPEGs (10-run average)")
plt.ylabel("Seconds")
plt.grid(True)
plt.show()
Summary
Method | Loading Time (s) | Inference Time (s) | Total Time (s) |
---|---|---|---|
OpenCV | 6.08343 | 0.00349 | 6.08693 |
nvImageCodec | 2.78262 | 0.00323 | 2.78585 |
By leveraging the T4 GPU, nvImageCodec achieves a 2.18x speedup in JPEG loading times by performing decoding directly on the GPU. This eliminates the CPU decoding bottleneck and enables a more efficient data pipeline.

For workflows heavily reliant on JPEGs, integrating nvImageCodec into your training or inference pipeline can deliver substantial performance improvements with minimal effort.
Tip: Before integrating, confirm that loading time really is a bottleneck in your pipeline. For example, test by preloading a single image, or skip loading altogether and feed random data, as in the sketch below. In training pipelines, prefetching images in parallel with GPU processing is also a common optimization strategy.
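Here is a minimal sketch of that check, reusing the model and device defined above (the batch size of 64 and the 10 repetitions are arbitrary): feed random data so loading costs nothing, and time inference alone. If this comes close to your end-to-end time, decoding is not your bottleneck.

import time
import torch

# Random batch: simulates "free" loading (no disk reads, no decoding).
fake_batch = torch.rand(64, 3, 224, 224, device=device)

with torch.no_grad():
    model(fake_batch)                    # warmup
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(10):
        model(fake_batch)
    torch.cuda.synchronize()             # wait for the GPU before stopping the clock
print(f"Inference-only time for 10 batches: {time.time() - t0:.4f}s")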