Advanced CUDA Optimization Techniques
CUDA optimization is an iterative process: profile, identify bottlenecks, apply targeted changes, and measure again. The best optimizations are workload-specific and data-driven.
Start with profiling
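Before changing code, gather evidence. Use Nsight Systems for a system-wide timeline of kernels, memory transfers, and API calls, and Nsight Compute for detailed per-kernel metrics. Focus on kernel time, achieved occupancy, memory throughput, and stall reasons.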
High-impact optimization areas
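- Memory coalescing: have adjacent threads in a warp access adjacent global memory addresses so the hardware can combine them into fewer transactions.
- Shared memory tiling: stage reused data in shared memory to reduce global memory traffic.
- Instruction efficiency: prefer fused operations such as fused multiply-add and avoid redundant math.
- Launch tuning: choose the block size with register and shared memory usage in mind, since all three limit occupancy.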
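As an illustration of the first two items, here is a minimal sketch of a tiled matrix transpose, a common teaching example in which both the global memory reads and writes are coalesced because the reordering happens in shared memory. The kernel name `transposeTiled`, the `TILE` size of 32, and the matrix dimensions are assumptions chosen for this example, not values taken from any particular workload.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 32;  // assumed tile edge; equals the warp size

// Transposes a rows x cols row-major matrix into a cols x rows matrix.
__global__ void transposeTiled(const float* in, float* out, int rows, int cols) {
    // One extra column of padding avoids shared memory bank conflicts
    // when the tile is read back column-wise.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;  // input column
    int y = blockIdx.y * TILE + threadIdx.y;  // input row
    if (x < cols && y < rows) {
        // Coalesced read: consecutive threadIdx.x values hit consecutive addresses.
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];
    }
    __syncthreads();

    // Swap the block indices so the write is also contiguous along threadIdx.x.
    x = blockIdx.y * TILE + threadIdx.x;  // output column
    y = blockIdx.x * TILE + threadIdx.y;  // output row
    if (x < rows && y < cols) {
        out[y * rows + x] = tile[threadIdx.x][threadIdx.y];
    }
}

int main() {
    const int rows = 1000, cols = 700;  // deliberately not multiples of TILE
    std::vector<float> h_in(rows * cols), h_out(rows * cols);
    for (int i = 0; i < rows * cols; ++i) h_in[i] = static_cast<float>(i);

    const size_t bytes = h_in.size() * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((cols + TILE - 1) / TILE, (rows + TILE - 1) / TILE);
    transposeTiled<<<grid, block>>>(d_in, d_out, rows, cols);

    // Check both the launch and the execution for errors.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);

    // Verify against the definition of a transpose on the host.
    int mismatches = 0;
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            if (h_out[c * rows + r] != h_in[r * cols + c]) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    cudaFree(d_in);
    cudaFree(d_out);
    return mismatches == 0 ? 0 : 1;
}
```

A naive transpose would make either the read or the write strided. Whether the tiled version is actually faster on a given GPU and matrix shape is something to confirm with the profiler rather than assume.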
Overlap transfer and compute
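Use CUDA streams to run memory transfers and kernels concurrently when dependencies allow. Operations within one stream execute in order, while operations in different streams may overlap. For cudaMemcpyAsync to overlap with kernel execution, the host buffers must be pinned (page-locked), for example allocated with cudaMallocHost; with pageable memory the copy cannot overlap as intended. Done correctly, this hides PCIe transfer latency and improves end-to-end throughput.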
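// Assumes h_in1 and h_in2 are pinned host buffers (e.g. from cudaMallocHost).
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
// Each copy and its kernel are ordered within a stream; the two streams may overlap.
cudaMemcpyAsync(d_in1, h_in1, bytes, cudaMemcpyHostToDevice, stream1);
kernelA<<<grid, block, 0, stream1>>>(d_in1, d_out1);
cudaMemcpyAsync(d_in2, h_in2, bytes, cudaMemcpyHostToDevice, stream2);
kernelB<<<grid, block, 0, stream2>>>(d_in2, d_out2);
// Wait for both streams before using the results, then release them.
cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);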
Numerical considerations
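Optimizations can change floating-point results, especially when they alter the order of operations: floating-point addition is not associative, so a parallel reduction generally rounds differently from a sequential loop, and fusing a multiply and an add into one fused multiply-add removes an intermediate rounding step. After tuning, validate both the runtime and the numerical results against an explicit tolerance.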
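The following host-side sketch shows one way such a check might look. It sums the same values in two orders (a sequential loop and a pairwise tree, the order many parallel reductions use) and compares the results with a combined relative and absolute tolerance. The helper name `allClose` and the tolerance values are assumptions for illustration; suitable tolerances depend on the application. The code uses no device features, so it should build with nvcc or any C++ compiler.

```cuda
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical helper: true if every element of `test` lies within
// abs_tol + rel_tol * |ref| of the corresponding reference element.
bool allClose(const std::vector<float>& ref, const std::vector<float>& test,
              float rel_tol, float abs_tol) {
    if (ref.size() != test.size()) return false;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        float diff = std::fabs(ref[i] - test[i]);
        // Written so that a NaN in either input fails the check.
        if (!(diff <= abs_tol + rel_tol * std::fabs(ref[i]))) return false;
    }
    return true;
}

int main() {
    std::vector<float> values(1 << 20, 0.1f);  // power-of-two length

    // Order 1: sequential accumulation, as in a simple CPU reference loop.
    float sequential = 0.0f;
    for (float v : values) sequential += v;

    // Order 2: pairwise (tree) accumulation, as in a typical parallel reduction.
    std::vector<float> work = values;
    for (std::size_t n = work.size(); n > 1; n /= 2)
        for (std::size_t i = 0; i < n / 2; ++i)
            work[i] += work[i + n / 2];
    float pairwise = work[0];

    printf("sequential = %.4f\n", sequential);
    printf("pairwise   = %.4f\n", pairwise);

    // Assumed tolerances; tighten or loosen them for the workload at hand.
    bool ok = allClose({sequential}, {pairwise}, 1e-2f, 1e-6f);
    printf("within tolerance: %s\n", ok ? "yes" : "no");
    return 0;
}
```

In practice the reference would usually be a trusted baseline, such as the unoptimized kernel or a double-precision CPU implementation, and the test values would be the output of the optimized kernel.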
Optimization workflow
- Profile and isolate the top bottleneck.
- Apply one optimization at a time.
- Benchmark with representative production data.
- Keep changes that improve speed without breaking correctness.
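For the benchmarking step, a minimal timing harness built on CUDA events might look like the sketch below. The `scale` kernel, the problem size, and the iteration count are placeholders invented for this example; substitute the kernel under test and representative inputs.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the kernel being tuned.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 24;  // assumed problem size
    float* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    const int block = 256;
    const int grid = (n + block - 1) / block;

    // Warm-up launch so one-time initialization costs are not measured.
    scale<<<grid, block>>>(d_data, n, 1.0f);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time several launches and report the average to reduce noise.
    const int iters = 100;
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) {
        scale<<<grid, block>>>(d_data, n, 1.0f);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // block until the stop event has completed

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("average kernel time: %.3f ms over %d launches\n", ms / iters, iters);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

Running the same harness before and after each change gives a like-for-like comparison. Event timing covers only the GPU work recorded between the two events, so it complements the profiler rather than replacing it.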