Advanced CUDA
NCCL and NVSHMEM
NCCL and NVSHMEM solve different communication problems on modern GPU systems. NCCL is designed for high-performance collective communication, while NVSHMEM provides a Partitioned Global Address Space (PGAS) model for fine-grained, one-sided communication issued directly from GPU kernels.
What NCCL is best at
NCCL (NVIDIA Collective Communications Library) is the standard choice for operations like all-reduce, all-gather, reduce-scatter, and broadcast in multi-GPU and multi-node workloads.
- Ring and tree algorithms chosen to fit the detected topology across NVLink, PCIe, and InfiniBand interconnects.
- Drop-in integration with deep learning frameworks and distributed training stacks.
- Reliable, high-bandwidth collectives with minimal tuning in common cases.
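To make the ring algorithm mentioned above more concrete, here is a minimal CPU-only sketch in plain Python. It does not call NCCL; the function name `ring_all_reduce` and the list-of-lists layout are assumptions made for illustration. It models a sum all-reduce as a reduce-scatter phase followed by an all-gather phase on a logical ring, and leaves out everything a real implementation adds on top, such as pipelining, topology awareness, and overlapping transfers.

```python
def ring_all_reduce(buffers):
    """Toy sum all-reduce over a logical ring (hypothetical helper, not NCCL).

    buffers: one equal-length list per simulated rank; the length must be
    divisible by the number of ranks. Returns new per-rank lists.
    """
    n = len(buffers)
    length = len(buffers[0])
    if any(len(b) != length for b in buffers) or length % n != 0:
        raise ValueError("buffers must share a length divisible by the rank count")
    chunk = length // n
    data = [list(b) for b in buffers]  # copy so the inputs stay untouched

    def chunk_slice(c):
        return slice(c * chunk, (c + 1) * chunk)

    def ring_step(chunk_of, combine):
        # Snapshot every outgoing chunk first so all ranks "send" at once.
        outgoing = [(chunk_of(r), data[r][chunk_slice(chunk_of(r))])
                    for r in range(n)]
        for r, (c, payload) in enumerate(outgoing):
            dst = (r + 1) % n  # each rank sends to its right-hand neighbour
            current = data[dst][chunk_slice(c)]
            data[dst][chunk_slice(c)] = [combine(a, b)
                                         for a, b in zip(current, payload)]

    # Phase 1, reduce-scatter: after n - 1 steps rank r holds the fully
    # reduced chunk (r + 1) % n.
    for step in range(n - 1):
        ring_step(lambda r: (r - step) % n, lambda mine, theirs: mine + theirs)

    # Phase 2, all-gather: the reduced chunks travel once around the ring,
    # overwriting stale data on each receiver.
    for step in range(n - 1):
        ring_step(lambda r: (r + 1 - step) % n, lambda mine, theirs: theirs)

    return data
```

Calling `ring_all_reduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])` leaves every simulated rank with `[111, 222, 333]`. Each rank sends and receives only one chunk per step, which is why the ring pattern keeps every link busy for large messages.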
What NVSHMEM is best at
NVSHMEM exposes a PGAS programming model where each GPU can directly read, write, and perform atomics on memory owned by other processing elements. This is especially useful for irregular workloads that do not map cleanly to collectives.
- One-sided put/get and atomic operations.
- Communication initiated from GPU kernels, not only from host code.
- Finer control over sparse, dynamic, or graph-like communication patterns.
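The one-sided semantics listed above can be illustrated with a small CPU-only toy model in Python. This is not the NVSHMEM API: the class `ToyPGAS`, its method names, and the `demo` function are hypothetical, and the model runs sequentially, so it shows only who names the remote address, not real concurrency or memory ordering. The point it tries to capture is that the initiating PE specifies the target PE and offset, while the target executes no matching receive.

```python
class ToyPGAS:
    """Toy symmetric heap: every PE owns an array of the same size, so the
    same offset is valid on every PE (hypothetical model, not NVSHMEM)."""

    def __init__(self, n_pes, size):
        self.n_pes = n_pes
        self.size = size
        self.heap = [[0] * size for _ in range(n_pes)]

    def _check(self, pe, offset, count):
        if not 0 <= pe < self.n_pes or offset < 0 or offset + count > self.size:
            raise IndexError("access outside the symmetric heap")

    def put(self, target_pe, offset, values):
        """One-sided write into target_pe's memory; the target runs no code."""
        self._check(target_pe, offset, len(values))
        self.heap[target_pe][offset:offset + len(values)] = values

    def get(self, source_pe, offset, count):
        """One-sided read from source_pe's memory."""
        self._check(source_pe, offset, count)
        return list(self.heap[source_pe][offset:offset + count])

    def atomic_fetch_add(self, target_pe, offset, value):
        """Add to a remote word and return the value it held before."""
        self._check(target_pe, offset, 1)
        old = self.heap[target_pe][offset]
        self.heap[target_pe][offset] = old + value
        return old


def demo(n_pes=4):
    """Each PE puts its id into its right neighbour's slot 0 and takes a
    ticket from a shared counter kept in slot 1 of PE 0."""
    pgas = ToyPGAS(n_pes, size=2)
    tickets = []
    for pe in range(n_pes):
        pgas.put((pe + 1) % n_pes, 0, [pe])
        tickets.append(pgas.atomic_fetch_add(0, 1, 1))
    return pgas.heap, tickets
```

In the real library these operations are issued from inside GPU kernels and need explicit synchronization (barriers, fences, or quiet operations) before the target may rely on the data; the toy model omits that entirely.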
Collectives vs one-sided communication
If your algorithm synchronizes at known stages (for example, gradient synchronization at every iteration), NCCL collectives are often the fastest and simplest option. If communication is data-dependent, or kernels need to update remote memory while they run, NVSHMEM can be both more expressive and more efficient.
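A rough way to see why data-dependent patterns can favour one-sided communication is to count elements moved. The Python sketch below is a back-of-the-envelope model under stated assumptions, not a performance predictor: it assumes a ring all-reduce in which each rank sends 2(n - 1)/n of the buffer, a hypothetical ownership rule where bin `b` lives on rank `b % n_ranks`, and local pre-aggregation so each rank issues one remote atomic add per distinct remote bin. Both function names are made up for this example.

```python
def dense_all_reduce_elements(n_ranks, n_bins):
    """Elements each rank sends when an n_bins-long buffer is summed with a
    ring all-reduce: 2 * (n_ranks - 1) chunks of n_bins / n_ranks elements."""
    return 2 * (n_ranks - 1) * n_bins / n_ranks


def one_sided_updates(keys_per_rank, n_ranks):
    """Remote atomic updates each rank issues, assuming bin b is owned by
    rank b % n_ranks and duplicates are combined locally first."""
    counts = []
    for rank, keys in enumerate(keys_per_rank):
        remote_bins = {k for k in keys if k % n_ranks != rank}
        counts.append(len(remote_bins))
    return counts


# Hypothetical workload: 4 ranks, 1024 bins, only a handful of bins touched.
keys = [[0, 1, 1, 5], [2, 3], [2, 2], [7, 1023]]
dense = dense_all_reduce_elements(4, 1024)
sparse = one_sided_updates(keys, 4)
print(dense, sparse)
```

With these inputs the dense path moves 1536 elements per rank regardless of how few bins were touched, while the one-sided path issues at most two small updates per rank. Element counts ignore latency and per-message overhead, so many tiny updates can still lose to one large collective on real hardware.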
Using both in one system
Many production codes combine both libraries:
- Use NCCL for dense global reductions and parameter synchronization.
- Use NVSHMEM for fine-grained exchange in custom kernels.
- Measure the overlap between communication and computation with a profiler such as Nsight Systems.
- Validate scaling behavior as topology and GPU count change.
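Once a profiler timeline is available, the overlap mentioned above can be reduced to a single number. The following Python sketch assumes you have already extracted two lists of `(start, end)` timestamps, one for compute kernels and one for communication; that input format and the function names are assumptions for illustration, not the output format of any specific tool. It reports the fraction of communication time that coincides with compute.

```python
def merge_intervals(intervals):
    """Sort (start, end) pairs and fuse any that touch or overlap."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def overlap_fraction(compute, comm):
    """Fraction of communication time hidden behind compute (0.0 to 1.0)."""
    compute_m = merge_intervals(compute)
    comm_m = merge_intervals(comm)
    hidden = 0.0
    i = j = 0
    # Two-pointer sweep over both sorted, non-overlapping interval lists.
    while i < len(compute_m) and j < len(comm_m):
        lo = max(compute_m[i][0], comm_m[j][0])
        hi = min(compute_m[i][1], comm_m[j][1])
        if hi > lo:
            hidden += hi - lo
        # Advance whichever interval finishes first.
        if compute_m[i][1] < comm_m[j][1]:
            i += 1
        else:
            j += 1
    comm_total = sum(end - start for start, end in comm_m)
    return hidden / comm_total if comm_total else 0.0
```

For example, `overlap_fraction([(0, 10), (15, 25)], [(8, 12), (20, 30)])` returns `0.5`: 7 of the 14 time units spent communicating run concurrently with compute. Tracking this value as GPU count and topology change gives a simple scaling indicator.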
Practical decision checklist
- Primarily collectives and framework-driven training: start with NCCL.
- Irregular communication from device code: evaluate NVSHMEM.
- Need both dense and sparse communication paths: combine them.
- Always benchmark on your real hardware and message sizes.
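The checklist can also be written down as a tiny decision helper. The Python function below is only a restatement of the bullets above under the assumption that three yes/no questions are enough to describe a workload; the function and parameter names are hypothetical, and the result is a starting point for benchmarking, not a verdict.

```python
def recommend_library(uses_collectives, device_initiated, irregular_pattern):
    """Map the checklist questions to a starting recommendation.

    uses_collectives:  dense, stage-synchronized operations such as all-reduce
    device_initiated:  communication must be issued from inside GPU kernels
    irregular_pattern: sparse or data-dependent communication
    """
    wants_nvshmem = device_initiated or irregular_pattern
    if uses_collectives and wants_nvshmem:
        return "combine NCCL and NVSHMEM"
    if wants_nvshmem:
        return "evaluate NVSHMEM"
    # Default for collective-only or undecided workloads.
    return "start with NCCL"
```

For instance, `recommend_library(True, False, False)` returns `"start with NCCL"`, while `recommend_library(True, True, False)` returns `"combine NCCL and NVSHMEM"`. Whatever the answer, the last checklist item still applies: confirm it with measurements on the target hardware and message sizes.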