Advanced CUDA
CUDA GEMM Optimization: From Coalescing and Tiling to Ping-Pong TMA + MMA Pipelines
Matrix multiplication (GEMM) is the core primitive behind training and inference workloads. Modern high-performance CUDA GEMM kernels are built in layers: first coalesced global memory access, then shared-memory tiling, and now hardware-accelerated pipelines that combine double buffering with TMA and Tensor Core MMA instructions.
Why GEMM optimization matters
In deep learning and HPC, GEMM often dominates total runtime, so throughput gains in the matrix multiply kernel carry through almost directly to end-to-end performance. The challenge is that GEMM stresses both memory bandwidth and compute throughput, so efficient kernels must keep the data path and compute units busy at the same time.
Step 1: Coalesced global memory access
Coalescing means that neighboring threads in a warp access neighboring addresses. The hardware can then serve the whole warp with a few wide memory transactions. If accesses are strided or irregular, each thread may need its own transaction, most of the fetched bytes go unused, and effective throughput drops.
- Map thread indices so each warp reads contiguous regions of matrix A and B.
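- Choose the layout (row-major or column-major) so that the index that varies fastest across threadIdx.x is also the contiguous dimension in memory.
- Keep base pointers and leading dimensions aligned to the access width (for example 16 bytes for float4 loads) so a warp's request does not straddle extra memory segments.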
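To make the cost of a bad access pattern concrete, here is a minimal host-side sketch in plain C++ (not device code). It counts how many 32-byte memory sectors one warp touches when each of its 32 lanes loads a single float. The 32-byte sector size is an assumption about the transaction granularity, and the helper names (`sectors_touched`, `sectors_row_access`, `sectors_column_access`) are made up for this illustration.

```cpp
#include <cstddef>
#include <set>

// Host-side model, not device code.
constexpr int kWarpSize = 32;
constexpr std::size_t kSectorBytes = 32;  // assumed transaction granularity

// Count the distinct sectors touched when lane i loads the float at
// element index element_index(i).
template <typename IndexFn>
int sectors_touched(IndexFn element_index) {
    std::set<std::size_t> sectors;
    for (int lane = 0; lane < kWarpSize; ++lane) {
        const std::size_t byte_addr = element_index(lane) * sizeof(float);
        sectors.insert(byte_addr / kSectorBytes);
    }
    return static_cast<int>(sectors.size());
}

// Lanes walk along a row: consecutive elements starting at first_element.
int sectors_row_access(std::size_t first_element) {
    return sectors_touched([first_element](int lane) {
        return first_element + static_cast<std::size_t>(lane);
    });
}

// Lanes walk down a column of a row-major matrix with leading dimension ld:
// neighboring lanes are ld floats apart.
int sectors_column_access(std::size_t ld) {
    return sectors_touched([ld](int lane) {
        return static_cast<std::size_t>(lane) * ld;
    });
}
```

Under these assumptions, `sectors_row_access(0)` touches 4 sectors (128 contiguous bytes), `sectors_row_access(1)` touches 5 because the misaligned start spills into one more sector, and `sectors_column_access(1024)` touches 32, one per lane.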
Step 2: Shared-memory tiling
A naive kernel repeatedly reloads matrix elements from global memory. Tiling moves blocks of A and B into shared memory, then reuses them across many multiply-accumulate operations. This raises arithmetic intensity and reduces global traffic.
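// Conceptual tiled GEMM loop (kernel body, one thread per C element at (row, col)).
// Assumes TILE_K x TILE_K thread blocks, sizes divisible by TILE_K, row-major A, B, C, __shared__ tiles As and Bs.
float acc = 0.0f;
for (int kTile = 0; kTile < K; kTile += TILE_K) {
    // Load A and B tiles from global memory to shared memory
    As[threadIdx.y][threadIdx.x] = A[row * K + kTile + threadIdx.x];
    Bs[threadIdx.y][threadIdx.x] = B[(kTile + threadIdx.y) * N + col];
    __syncthreads();  // Synchronize block: both tiles are fully written
    // Compute partial accumulations for C tile
    for (int k = 0; k < TILE_K; ++k)
        acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
    __syncthreads();  // Synchronize before the next tile overwrites As and Bs
}
C[row * N + col] = acc;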
At this stage, common bottlenecks include shared-memory bank conflicts, register pressure, and low occupancy from oversized tiles.
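The traffic reduction from tiling can be checked on the CPU. The sketch below is plain host C++ rather than a kernel: small local arrays stand in for shared memory, and a counter tracks how many elements are read from the "global" matrices. The names `GemmResult`, `gemm_naive`, and `gemm_tiled` are illustrative, and the tiled version assumes all dimensions are multiples of the tile size.

```cpp
#include <cstddef>
#include <vector>

// Host-side sketch, not device code.
struct GemmResult {
    std::vector<float> C;
    std::size_t global_loads = 0;  // elements read from A and B
};

// Naive: every multiply-add reads both operands from "global" memory.
GemmResult gemm_naive(const std::vector<float>& A, const std::vector<float>& B,
                      int M, int N, int K) {
    GemmResult r;
    r.C.assign(static_cast<std::size_t>(M) * N, 0.0f);
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < K; ++k) {
                r.C[i * N + j] += A[i * K + k] * B[k * N + j];
                r.global_loads += 2;
            }
    return r;
}

// Tiled: stage TILE x TILE tiles of A and B once, then reuse them for
// TILE multiply-adds per staged element.
// Assumes M, N, and K are multiples of TILE.
template <int TILE>
GemmResult gemm_tiled(const std::vector<float>& A, const std::vector<float>& B,
                      int M, int N, int K) {
    GemmResult r;
    r.C.assign(static_cast<std::size_t>(M) * N, 0.0f);
    float As[TILE][TILE];
    float Bs[TILE][TILE];
    for (int i0 = 0; i0 < M; i0 += TILE)
        for (int j0 = 0; j0 < N; j0 += TILE)
            for (int k0 = 0; k0 < K; k0 += TILE) {
                // Load phase: one global read per staged element.
                for (int a = 0; a < TILE; ++a)
                    for (int b = 0; b < TILE; ++b) {
                        As[a][b] = A[(i0 + a) * K + (k0 + b)];
                        Bs[a][b] = B[(k0 + a) * N + (j0 + b)];
                        r.global_loads += 2;
                    }
                // Compute phase: operands come only from the staged tiles.
                for (int a = 0; a < TILE; ++a)
                    for (int b = 0; b < TILE; ++b) {
                        float acc = r.C[(i0 + a) * N + (j0 + b)];
                        for (int k = 0; k < TILE; ++k)
                            acc += As[a][k] * Bs[k][b];
                        r.C[(i0 + a) * N + (j0 + b)] = acc;
                    }
            }
    return r;
}
```

The naive version performs 2·M·N·K element loads, while the tiled version performs 2·M·N·K / TILE. For a 32 x 32 x 32 problem with `TILE = 8` that is 65536 loads versus 8192, which is the factor-of-TILE reuse that raises arithmetic intensity.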
Step 3: Warp-level MMA on Tensor Cores
On NVIDIA GPUs since Volta, peak GEMM throughput comes from Tensor Cores executing MMA (matrix multiply-accumulate) instructions. Instead of issuing scalar multiply-adds on CUDA cores, a warp issues MMA instructions that operate on small matrix fragments and deliver much higher throughput for FP16, BF16, TF32, INT8, and related formats. Which formats are available depends on the architecture.
- Each warp owns fragment tiles of A, B, and C.
- Fragments are loaded into registers in layouts expected by MMA instructions.
- The kernel accumulates partial results over many K-slices.
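The three points above can be emulated on the host to show the data flow. This is a sketch in plain C++, not Tensor Core code: the 16 x 8 x 16 tile shape mirrors the m16n8k16 MMA shape as an illustrative choice, fragments are simple row-major arrays rather than the per-lane register layout real instructions require, and `mma_tile` and `warp_tile_gemm` are hypothetical names.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Host-side emulation, not device code. Shape chosen for illustration.
constexpr int MMA_M = 16, MMA_N = 8, MMA_K = 16;

using FragA = std::array<float, MMA_M * MMA_K>;  // 16x16 slice of A
using FragB = std::array<float, MMA_K * MMA_N>;  // 16x8 slice of B
using FragC = std::array<float, MMA_M * MMA_N>;  // 16x8 accumulator

// Stand-in for one MMA instruction: C += A * B on fixed-size fragments.
void mma_tile(const FragA& a, const FragB& b, FragC& c) {
    for (int m = 0; m < MMA_M; ++m)
        for (int n = 0; n < MMA_N; ++n) {
            float acc = c[m * MMA_N + n];
            for (int k = 0; k < MMA_K; ++k)
                acc += a[m * MMA_K + k] * b[k * MMA_N + n];
            c[m * MMA_N + n] = acc;
        }
}

// One warp's share of the work: the 16x8 tile of C at (row0, col0),
// accumulated over all K-slices. A is M x K and B is K x N, both row-major.
// Assumes K is a multiple of MMA_K and the tile lies inside the matrices.
FragC warp_tile_gemm(const std::vector<float>& A, const std::vector<float>& B,
                     int K, int N, int row0, int col0) {
    FragC c{};  // accumulators start at zero and persist across K-slices
    for (int k0 = 0; k0 < K; k0 += MMA_K) {
        FragA a;
        FragB b;
        // Load the fragments for this K-slice.
        for (int m = 0; m < MMA_M; ++m)
            for (int k = 0; k < MMA_K; ++k)
                a[m * MMA_K + k] = A[(row0 + m) * K + (k0 + k)];
        for (int k = 0; k < MMA_K; ++k)
            for (int n = 0; n < MMA_N; ++n)
                b[k * MMA_N + n] = B[(k0 + k) * N + (col0 + n)];
        mma_tile(a, b, c);  // accumulate the partial product
    }
    return c;
}
```

The key property is that the accumulator fragment stays resident for the whole K loop while the A and B fragments are replaced on every slice, which is why accumulator registers are a major part of a GEMM kernel's register budget.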
New hardware paradigm: double buffering (ping-pong)
Classic tiled kernels often run load-then-compute in sequence, creating bubbles in the pipeline. Double buffering removes these bubbles by alternating between two buffers: while compute consumes tile N from buffer 0, the kernel prefetches tile N+1 into buffer 1. Then it swaps roles (ping-pong).
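// prefetch_tile, wait_for_tile, and mma_compute are placeholders, not real APIs.
int write_stage = 0;
int read_stage = 1;
prefetch_tile(/*tile=*/0, write_stage);         // prologue: start loading tile 0
for (int kTile = 0; kTile < numTiles; ++kTile) {
    swap(read_stage, write_stage);              // the buffer being filled becomes the read buffer
    if (kTile + 1 < numTiles)                   // nothing left to prefetch on the last iteration
        prefetch_tile(kTile + 1, write_stage);  // overlap with compute
    wait_for_tile(read_stage);                  // tile kTile must have arrived
    mma_compute(read_stage);                    // consume previously loaded tile
    __syncthreads();                            // all reads finish before this buffer is refilled
}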
The objective is to overlap data movement and computation so Tensor Cores are continuously fed.
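The stage rotation is easy to get wrong, so it can help to run the schedule on the host first. The following sketch is plain C++ with a synchronous "prefetch", so nothing actually overlaps here; it only checks that each buffer is filled before it is read and never refilled while still needed. The reduction (a simple sum) and the names `PingPongTrace` and `pingpong_sum` are assumptions made for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Host-side emulation, not device code. On a GPU the prefetch would be
// asynchronous; here it completes immediately and only the order is modeled.
struct PingPongTrace {
    float sum = 0.0f;
    std::vector<std::string> events;  // order of loads and computes
};

PingPongTrace pingpong_sum(const std::vector<float>& data, std::size_t tile) {
    PingPongTrace out;
    if (tile == 0 || data.empty()) return out;
    const std::size_t numTiles = (data.size() + tile - 1) / tile;
    std::vector<float> buffer[2] = {std::vector<float>(tile),
                                    std::vector<float>(tile)};
    std::size_t count[2] = {0, 0};  // valid elements per buffer

    auto prefetch = [&](std::size_t t, int stage) {
        const std::size_t begin = t * tile;
        const std::size_t end = std::min(begin + tile, data.size());
        count[stage] = end - begin;
        std::copy(data.begin() + static_cast<std::ptrdiff_t>(begin),
                  data.begin() + static_cast<std::ptrdiff_t>(end),
                  buffer[stage].begin());
        out.events.push_back("load " + std::to_string(t) + " -> buf" +
                             std::to_string(stage));
    };
    auto compute = [&](std::size_t t, int stage) {
        for (std::size_t i = 0; i < count[stage]; ++i)
            out.sum += buffer[stage][i];
        out.events.push_back("compute " + std::to_string(t) + " <- buf" +
                             std::to_string(stage));
    };

    int write_stage = 0;
    int read_stage = 1;
    prefetch(0, write_stage);  // prologue
    for (std::size_t t = 0; t < numTiles; ++t) {
        std::swap(read_stage, write_stage);
        if (t + 1 < numTiles) prefetch(t + 1, write_stage);  // next tile
        compute(t, read_stage);                              // current tile
    }
    return out;
}
```

For ten elements and a tile size of four, the recorded order is load 0, load 1, compute 0, load 2, compute 1, compute 2: every load of tile k+1 is issued before the compute on tile k, and buffer 0 is reused for tile 2 only after tile 0 has been consumed.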
TMA: Tensor Memory Accelerator for async tile movement
TMA, introduced with the Hopper architecture, is a hardware unit for asynchronous multidimensional transfers between global and shared memory. A single thread issues the copy of a whole tile described by a tensor map, so the remaining threads spend no instructions on address arithmetic or per-element copies. Compared to many per-thread copy instructions, this can reduce instruction overhead and improve memory movement efficiency for tiled kernels.
- Issue asynchronous tile copies with lower per-thread copy overhead.
- Move 2D/3D (and higher-rank, up to 5D) tensor regions directly into shared-memory staging buffers.
- Pair with shared-memory barriers (mbarrier) so compute waits only when a tile has not yet arrived.
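As a mental model of what one tile copy does, the sketch below reproduces the semantics on the host in plain C++. It is not the TMA API: `TensorMap2D` is a hypothetical stand-in for the information a real tensor map descriptor carries, `copy_box` is an invented name, and zero-filling out-of-bounds elements is assumed as the fill mode.

```cpp
#include <cstddef>
#include <vector>

// Host-side model, not device code and not the CUDA driver's CUtensorMap.
// Describes a 2D row-major float tensor and the tile ("box") shape to copy.
struct TensorMap2D {
    const float* base;  // start of the tensor in "global" memory
    int rows, cols;     // tensor extent
    int ld;             // leading dimension in elements (>= cols)
    int box_rows, box_cols;
};

// Copy the box whose top-left corner is (row0, col0) into a dense staging
// buffer (standing in for shared memory). Elements that fall outside the
// tensor are zero-filled, which is an assumption of this model.
std::vector<float> copy_box(const TensorMap2D& map, int row0, int col0) {
    std::vector<float> stage(
        static_cast<std::size_t>(map.box_rows) * map.box_cols, 0.0f);
    for (int r = 0; r < map.box_rows; ++r)
        for (int c = 0; c < map.box_cols; ++c) {
            const int gr = row0 + r;
            const int gc = col0 + c;
            const bool inside =
                gr >= 0 && gr < map.rows && gc >= 0 && gc < map.cols;
            if (inside)
                stage[r * map.box_cols + c] = map.base[gr * map.ld + gc];
        }
    return stage;
}
```

The caller supplies only tile coordinates; the strides and bounds live in the descriptor. That split is what lets a single issuing thread request a whole tile, and the bounds handling is why edge tiles need no special-case code in the kernel under this model.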
MMA + TMA + ping-pong = pipeline kernel design
The modern GEMM kernel pattern is a pipeline:
- TMA asynchronously loads the next A/B tiles into shared-memory stage S.
- Warp groups read operands from stage S-1, either into register fragments or, with Hopper's warp-group MMA, directly from shared memory.
- MMA instructions compute on current fragments while next tiles are still loading.
- The pipeline advances with stage rotation (ping-pong, or deeper multistage).
This design minimizes idle cycles and is foundational for reaching near-peak Tensor Core utilization.
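How many stages the pipeline needs depends on how load latency compares with compute time per tile. The following host-side C++ sketch is a deliberately simple timing model, not a measurement: it assumes a fixed load latency, a fixed interval between load issues, a fixed compute time per tile, and that a buffer can be refilled as soon as the tile that used it has been computed. The function name `pipeline_time` and all parameters are illustrative.

```cpp
#include <algorithm>
#include <vector>

// Host-side timing model, not device code. All times are in arbitrary units.
//   stages       number of shared-memory buffers
//   latency      time from issuing a load until the tile has arrived
//   interval     minimum time between two consecutive load issues
//   computeTime  time to run the MMAs for one tile
// Returns the time at which the last tile finishes computing.
long pipeline_time(int numTiles, int stages, long latency, long interval,
                   long computeTime) {
    if (numTiles <= 0 || stages <= 0) return 0;
    std::vector<long> computeDone(numTiles, 0);
    long prevIssue = -interval;  // lets the first load issue at time 0
    for (int k = 0; k < numTiles; ++k) {
        // Tile k reuses the buffer of tile k - stages, so it must wait
        // until that tile has been consumed.
        const long bufferFree = (k >= stages) ? computeDone[k - stages] : 0;
        const long issue = std::max(prevIssue + interval, bufferFree);
        const long arrive = issue + latency;
        const long prevCompute = (k > 0) ? computeDone[k - 1] : 0;
        // Compute starts once the tile has arrived and the MMA units are free.
        computeDone[k] = std::max(arrive, prevCompute) + computeTime;
        prevIssue = issue;
    }
    return computeDone.back();
}
```

With 8 tiles, a latency of 6, an issue interval of 1, and a compute time of 2, the model gives 64 time units for a single buffer (load and compute fully serialized), 34 for two stages, and 22 for four stages. At four stages the total equals one load latency plus eight compute steps, meaning the compute units never wait after the first tile; adding stages beyond that point only costs shared memory.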
Practical tuning checklist
- Tile shapes: tune CTA, warp, and MMA tile dimensions for your GPU and matrix sizes.
- Stage count: start with ping-pong (2-stage), then evaluate 3+ stages if latency remains visible.
- Registers vs occupancy: high register usage can reduce active warps; profile the trade-off.
- Shared memory layout: avoid bank conflicts with proper strides/swizzling.
- Data type choice: use numeric formats that balance throughput and accuracy requirements.
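For the shared-memory layout item, a small host-side model shows why padding and swizzling work. It is plain C++ and assumes 32 banks of 4-byte words, with successive words mapped to successive banks, and that every lane reads a distinct word. The helper names are hypothetical.

```cpp
#include <algorithm>
#include <array>

// Host-side model, not device code.
constexpr int kBanks = 32;  // assumed number of shared-memory banks
constexpr int kLanes = 32;  // threads per warp

// Largest number of lanes that hit the same bank when lane i reads the
// 4-byte word at index word_index(i). 1 means conflict-free.
template <typename IndexFn>
int conflict_degree(IndexFn word_index) {
    std::array<int, kBanks> hits{};
    for (int lane = 0; lane < kLanes; ++lane)
        ++hits[word_index(lane) % kBanks];
    return *std::max_element(hits.begin(), hits.end());
}

// Lane i reads row i of one column in a row-major float tile whose rows
// are `stride` words apart.
int column_read_conflicts(int stride) {
    return conflict_degree([stride](int lane) { return lane * stride; });
}

// Same column read on a 32-word-wide tile where element (row, col) is
// stored at position (row, col XOR row), a simple XOR swizzle.
int swizzled_column_read_conflicts(int col) {
    return conflict_degree([col](int lane) {
        return lane * kBanks + (col ^ lane);
    });
}
```

In this model a column read with a stride of 32 words is a 32-way conflict, padding each row to 33 words makes it conflict-free, and the XOR swizzle reaches the same result without spending shared memory on padding.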
How to validate improvement
Use Nsight Compute to compare achieved FLOP/s, Tensor Core utilization, memory throughput, and the warp stall breakdown before and after each optimization. Change one thing at a time so that each performance gain can be attributed to a specific optimization.