Advanced CUDA
Multi-GPU Programming
When a single GPU is not enough, multi-GPU programming helps scale training, simulation, and inference workloads. The challenge is not just compute distribution, but also minimizing communication overhead.
Parallelization strategies
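- Data parallelism: each GPU holds a full copy of the model and processes a different shard of the data.
- Model parallelism: the model's layers or tensors are partitioned across GPUs, so each GPU holds only part of the model.
- Pipeline parallelism: the model is split into consecutive stages on different GPUs, and batches flow through the stages in sequence.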
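As a minimal sketch of the data-parallel case, the host code below splits a dataset into contiguous index ranges, one per GPU. The names `Shard` and `makeShards` are hypothetical helpers introduced for illustration; real frameworks may shard differently (for example, by shuffled or strided indices).

```cpp
#include <cstddef>
#include <vector>

// Half-open range [begin, end) of sample indices assigned to one GPU.
struct Shard {
    std::size_t begin;
    std::size_t end;
};

// Split numSamples into numGpus contiguous shards whose sizes differ by at most one.
// The first (numSamples % numGpus) shards each receive one extra sample.
std::vector<Shard> makeShards(std::size_t numSamples, std::size_t numGpus) {
    std::vector<Shard> shards;
    if (numGpus == 0) return shards;  // nothing to assign to
    shards.reserve(numGpus);
    const std::size_t base = numSamples / numGpus;
    const std::size_t extra = numSamples % numGpus;
    std::size_t begin = 0;
    for (std::size_t g = 0; g < numGpus; ++g) {
        const std::size_t size = base + (g < extra ? 1 : 0);
        shards.push_back({begin, begin + size});
        begin += size;
    }
    return shards;
}
```

For instance, `makeShards(10, 4)` yields the ranges [0, 3), [3, 6), [6, 8), and [8, 10). Each GPU would then copy and process only its own range.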
Device management basics
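#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    if (cudaGetDeviceCount(&deviceCount) != cudaSuccess) return 1;  // no usable driver or device
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaSetDevice(dev);  // later allocations and launches on this thread target dev
        // Allocate per-device memory and launch kernels
    }
    return 0;
}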
Communication and synchronization
Scaling is often limited by GPU-to-GPU and GPU-to-host communication. Use peer-to-peer access when available and rely on NCCL for efficient collective operations such as all-reduce.
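- Enable P2P for direct device access where the hardware supports it: check with cudaDeviceCanAccessPeer, then call cudaDeviceEnablePeerAccess.
- Aggregate small messages into larger buffers so that the fixed per-message latency is paid less often.
- Overlap communication with compute when possible, for example by issuing cudaMemcpyAsync or NCCL calls on a separate stream.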
Load balancing matters
Uneven work distribution causes some GPUs to idle while others remain busy. Partition tasks by measured runtime, not only by nominal input size.
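One way to partition by measured runtime is to benchmark each GPU on a small sample, then split the work in proportion to the observed throughput. The sketch below assumes throughput is given in items per second and is non-negative; `partitionByThroughput` is a hypothetical helper, and it ignores per-item cost variation, which a real scheduler may also need to handle.

```cpp
#include <cstddef>
#include <vector>

// Assign totalItems across GPUs in proportion to measured throughput (items/sec).
// Slice boundaries are computed from cumulative throughput, so the counts
// always sum to totalItems. Assumes every entry of itemsPerSec is >= 0.
std::vector<std::size_t> partitionByThroughput(std::size_t totalItems,
                                               const std::vector<double>& itemsPerSec) {
    std::vector<std::size_t> counts(itemsPerSec.size(), 0);
    double totalRate = 0.0;
    for (double r : itemsPerSec) totalRate += r;
    if (counts.empty() || totalRate <= 0.0) return counts;  // no usable measurements

    double cumulativeRate = 0.0;
    std::size_t prevBoundary = 0;
    for (std::size_t g = 0; g < counts.size(); ++g) {
        cumulativeRate += itemsPerSec[g];
        // Index where GPU g's slice ends; the last GPU always ends at totalItems.
        const std::size_t boundary =
            (g + 1 == counts.size())
                ? totalItems
                : static_cast<std::size_t>(totalItems * (cumulativeRate / totalRate));
        counts[g] = boundary - prevBoundary;
        prevBoundary = boundary;
    }
    return counts;
}
```

For example, with measured throughputs of 100, 50, and 50 items per second and 1000 items in total, the first GPU receives 500 items and the other two receive 250 each, so all three should finish at about the same time.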
Scaling checklist
- Measure a single-GPU baseline first.
- Track compute time vs communication time as GPU count grows.
- Pinpoint synchronization bottlenecks with timeline tools such as NVIDIA Nsight Systems.
- Validate accuracy and determinism after distribution changes.