CUDA GEMM Optimization: From Coalescing and Tiling to Ping-Pong TMA + MMA Pipelines

Matrix multiplication (GEMM, short for general matrix multiply) is the core primitive behind deep learning training and inference workloads. High-performance CUDA GEMM kernels are built in layers: coalesced global memory access first, then shared-memory tiling, then Tensor Core MMA instructions, and finally hardware-accelerated pipelines that combine double buffering with TMA.

Why GEMM optimization matters

In deep learning and HPC, GEMM often dominates total runtime, so a throughput gain in the matrix multiply kernel carries over to end-to-end time roughly in proportion to the share of runtime that GEMM accounts for. The challenge is that GEMM stresses both memory bandwidth and compute throughput, so an efficient kernel must keep the data path and the compute units busy at the same time.

Step 1: Coalesced global memory access

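Coalescing means that neighboring threads in a warp access neighboring addresses. When they do, the hardware can serve the whole warp's request with a small number of memory transactions. When accesses are strided or irregular, each warp-level load splits into many transactions, most of the fetched bytes go unused, and effective throughput drops. For a row-major matrix, this usually means mapping threadIdx.x to the column index so that consecutive threads read consecutive elements of a row.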
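To make the difference concrete, here is a minimal sketch of a copy kernel whose read pattern is controlled by a stride. The kernel name gather_copy and the helper run_gather_copy are hypothetical, chosen for this illustration only. With a stride of 1 the reads are coalesced; with a larger stride, neighboring threads read addresses that are far apart.

```cuda
#include <cuda_runtime.h>
#include <vector>

// stride == 1: thread i reads element i, so a warp reads one contiguous span.
// stride  > 1: neighboring threads read addresses `stride` floats apart.
__global__ void gather_copy(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int j = (int)(((long long)i * stride) % n);  // wrap to stay in bounds
        out[i] = in[j];
    }
}

// Hypothetical helper: runs the kernel once and returns the elapsed time in
// milliseconds, or -1 if a CUDA call failed or the output is wrong.
float run_gather_copy(int n, int stride) {
    std::vector<float> h_in(n), h_out(n, -1.0f);
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    cudaEventRecord(start);
    gather_copy<<<blocks, threads>>>(d_in, d_out, n, stride);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Verify that every output element matches the element it should have read.
    bool ok = (cudaGetLastError() == cudaSuccess);
    for (int i = 0; i < n && ok; ++i)
        ok = (h_out[i] == h_in[((long long)i * stride) % n]);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return ok ? ms : -1.0f;
}
```

Comparing run_gather_copy(1 << 20, 1) with run_gather_copy(1 << 20, 32) should show the strided variant taking longer on most GPUs, although a single untimed-warm-up run like this is noisy and the exact ratio depends on the device.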

Step 2: Shared-memory tiling

A naive kernel repeatedly reloads matrix elements from global memory. Tiling moves blocks of A and B into shared memory, then reuses them across many multiply-accumulate operations. This raises arithmetic intensity and reduces global traffic.

// Conceptual tiled GEMM loop
for (int kTile = 0; kTile < K; kTile += TILE_K) {
    // Load A and B tiles from global memory to shared memory
    // Synchronize block
    // Compute partial accumulations for C tile
    // Synchronize before next tile
}

At this stage, common bottlenecks include shared-memory bank conflicts, register pressure, and low occupancy from oversized tiles.
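The conceptual loop above can be fleshed out into a small FP32 kernel. This is a sketch under simple assumptions: row-major matrices, one output element per thread, and a 16x16 tile. The names sgemm_tiled and tiled_gemm_max_error are hypothetical. Each value loaded into shared memory is reused by TILE threads, which is where the reduction in global traffic comes from.

```cuda
#include <cuda_runtime.h>
#include <vector>
#include <cmath>

constexpr int TILE = 16;

// C = A * B, all row-major. A is MxK, B is KxN, C is MxN.
__global__ void sgemm_tiled(const float* A, const float* B, float* C,
                            int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int k0 = 0; k0 < K; k0 += TILE) {
        int aCol = k0 + threadIdx.x;
        int bRow = k0 + threadIdx.y;
        // Coalesced loads: threadIdx.x walks the contiguous dimension.
        // Out-of-range elements are zero-padded so edge tiles stay correct.
        As[threadIdx.y][threadIdx.x] = (row < M && aCol < K) ? A[row * K + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < K && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();  // tile fully loaded before anyone reads it

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // tile fully consumed before it is overwritten
    }
    if (row < M && col < N) C[row * N + col] = acc;
}

// Hypothetical helper: returns max |C_gpu - C_cpu| for small test matrices.
float tiled_gemm_max_error(int M, int N, int K) {
    std::vector<float> A(M * K), B(K * N), C(M * N, -1.0f), ref(M * N, 0.0f);
    for (int i = 0; i < M * K; ++i) A[i] = (i % 7) * 0.25f;
    for (int i = 0; i < K * N; ++i) B[i] = (i % 5) * 0.5f;
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < K; ++k)
                ref[i * N + j] += A[i * K + k] * B[k * N + j];

    float *dA, *dB, *dC;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dB, B.size() * sizeof(float));
    cudaMalloc(&dC, C.size() * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);
    sgemm_tiled<<<grid, block>>>(dA, dB, dC, M, N, K);
    cudaMemcpy(C.data(), dC, C.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);

    float maxErr = 0.0f;
    for (int i = 0; i < M * N; ++i)
        maxErr = std::fmax(maxErr, std::fabs(C[i] - ref[i]));
    return maxErr;
}
```

The test sizes can deliberately be chosen so they are not multiples of the tile size, for example tiled_gemm_max_error(70, 50, 100), to exercise the zero-padding at the edges. Production kernels usually go further and compute several outputs per thread to raise register-level reuse.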

Step 3: Warp-level MMA on Tensor Cores

On modern NVIDIA GPUs, GEMM performance comes from Tensor Cores executing MMA (matrix multiply-accumulate) instructions. Instead of scalar multiply-add in CUDA cores, MMA instructions operate on matrix fragments and deliver much higher throughput for FP16/BF16/TF32/INT8 and related modes.
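One portable way to reach Tensor Cores from CUDA C++ is the warp-level WMMA API in mma.h. The sketch below has a single warp multiply one 16x16 FP16 tile pair with FP32 accumulation. The names wmma_16x16x16 and wmma_max_error are hypothetical, and the code assumes compilation for compute capability 7.0 or newer (for example nvcc -arch=sm_80). Tuned libraries typically use lower-level MMA instructions, so treat this as an illustration of the fragment model rather than a fast kernel.

```cuda
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <mma.h>
#include <vector>
#include <cmath>

using namespace nvcuda;

// One warp computes C = A * B for a single 16x16x16 tile.
// A and B are FP16 row-major; C is FP32 row-major.
__global__ void wmma_16x16x16(const half* A, const half* B, float* C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c;

    wmma::fill_fragment(c, 0.0f);        // start from a zero accumulator
    wmma::load_matrix_sync(a, A, 16);    // leading dimension is 16 elements
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(c, a, b, c);          // c = a * b + c on Tensor Cores
    wmma::store_matrix_sync(C, c, 16, wmma::mem_row_major);
}

// Hypothetical helper: returns max |C_gpu - C_cpu| for one tile.
float wmma_max_error() {
    const int n = 16 * 16;
    std::vector<half> A(n), B(n);
    std::vector<float> Af(n), Bf(n), C(n, -1.0f), ref(n, 0.0f);
    for (int i = 0; i < n; ++i) {
        Af[i] = (float)(i % 3);          // small integers are exact in FP16
        Bf[i] = (float)(i % 4);
        A[i] = __float2half(Af[i]);
        B[i] = __float2half(Bf[i]);
    }
    for (int i = 0; i < 16; ++i)
        for (int j = 0; j < 16; ++j)
            for (int k = 0; k < 16; ++k)
                ref[i * 16 + j] += Af[i * 16 + k] * Bf[k * 16 + j];

    half *dA, *dB;
    float* dC;
    cudaMalloc(&dA, n * sizeof(half));
    cudaMalloc(&dB, n * sizeof(half));
    cudaMalloc(&dC, n * sizeof(float));
    cudaMemcpy(dA, A.data(), n * sizeof(half), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), n * sizeof(half), cudaMemcpyHostToDevice);

    wmma_16x16x16<<<1, 32>>>(dA, dB, dC);  // exactly one warp
    cudaMemcpy(C.data(), dC, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);

    float maxErr = 0.0f;
    for (int i = 0; i < n; ++i)
        maxErr = std::fmax(maxErr, std::fabs(C[i] - ref[i]));
    return maxErr;
}
```

All 32 threads of the warp must execute the WMMA calls together; the fragment contents are distributed across the warp in a layout the API does not expose.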

Step 4: Double buffering (ping-pong)

Classic tiled kernels often run load-then-compute in sequence, creating bubbles in the pipeline. Double buffering removes these bubbles by alternating between two buffers: while compute consumes tile N from buffer 0, the kernel prefetches tile N+1 into buffer 1. Then it swaps roles (ping-pong).

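// Pseudocode: prefetch_tile, wait_for_tile, and mma_compute are placeholders
int read_stage = 0;
int write_stage = 1;

prefetch_tile(read_stage, 0);                   // prologue: start loading tile 0
for (int kTile = 0; kTile < numTiles; ++kTile) {
    wait_for_tile(read_stage);                  // tile kTile has landed; old readers are done
    if (kTile + 1 < numTiles)
        prefetch_tile(write_stage, kTile + 1);  // overlaps with the compute below
    mma_compute(read_stage);                    // consume tile kTile
    swap(read_stage, write_stage);              // ping-pong
}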

The objective is to overlap data movement and computation so Tensor Cores are continuously fed.
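As a concrete but simplified instance of the ping-pong pattern, the following sketch extends a plain FP32 tiled kernel with two shared-memory buffers. It uses ordinary loads rather than asynchronous copy instructions, so how much overlap actually happens is left to the compiler and the hardware scheduler; the point here is the buffer rotation and the single barrier per iteration. The names load_tile, sgemm_double_buffered, and double_buffered_max_error are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <vector>
#include <cmath>

constexpr int TILE = 16;

// Loads one A tile and one B tile (zero-padded at the edges) into one stage.
__device__ void load_tile(float (*As)[TILE], float (*Bs)[TILE],
                          const float* A, const float* B,
                          int M, int N, int K, int row, int col, int k0) {
    int aCol = k0 + threadIdx.x;
    int bRow = k0 + threadIdx.y;
    As[threadIdx.y][threadIdx.x] = (row < M && aCol < K) ? A[row * K + aCol] : 0.0f;
    Bs[threadIdx.y][threadIdx.x] = (bRow < K && col < N) ? B[bRow * N + col] : 0.0f;
}

__global__ void sgemm_double_buffered(const float* A, const float* B, float* C,
                                      int M, int N, int K) {
    __shared__ float As[2][TILE][TILE];
    __shared__ float Bs[2][TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    int numTiles = (K + TILE - 1) / TILE;

    int read = 0, write = 1;
    load_tile(As[read], Bs[read], A, B, M, N, K, row, col, 0);  // prologue
    __syncthreads();

    float acc = 0.0f;
    for (int t = 0; t < numTiles; ++t) {
        // Fetch tile t+1 into the idle buffer while tile t is being consumed.
        if (t + 1 < numTiles)
            load_tile(As[write], Bs[write], A, B, M, N, K, row, col, (t + 1) * TILE);
        for (int k = 0; k < TILE; ++k)
            acc += As[read][threadIdx.y][k] * Bs[read][k][threadIdx.x];
        __syncthreads();  // tile t+1 fully written and tile t fully consumed
        read ^= 1;        // swap roles (ping-pong)
        write ^= 1;
    }
    if (row < M && col < N) C[row * N + col] = acc;
}

// Hypothetical helper: returns max |C_gpu - C_cpu| for small test matrices.
float double_buffered_max_error(int M, int N, int K) {
    std::vector<float> A(M * K), B(K * N), C(M * N, -1.0f), ref(M * N, 0.0f);
    for (int i = 0; i < M * K; ++i) A[i] = (i % 7) * 0.25f;
    for (int i = 0; i < K * N; ++i) B[i] = (i % 5) * 0.5f;
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < K; ++k)
                ref[i * N + j] += A[i * K + k] * B[k * N + j];

    float *dA, *dB, *dC;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dB, B.size() * sizeof(float));
    cudaMalloc(&dC, C.size() * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);
    sgemm_double_buffered<<<grid, block>>>(dA, dB, dC, M, N, K);
    cudaMemcpy(C.data(), dC, C.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);

    float maxErr = 0.0f;
    for (int i = 0; i < M * N; ++i)
        maxErr = std::fmax(maxErr, std::fabs(C[i] - ref[i]));
    return maxErr;
}
```

Compared with a single-buffer tiled kernel, this version needs only one __syncthreads() per K step instead of two, at the cost of twice the shared memory per block.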

TMA: Tensor Memory Accelerator for async tile movement

On newer architectures (Hopper and later), TMA provides hardware support for efficient multidimensional asynchronous transfers between global and shared memory. Instead of every thread issuing its own copy instructions and computing its own addresses, a single thread can issue the copy of a whole tile, which reduces instruction overhead and frees registers and issue slots for the math in tiled kernels.

MMA + TMA + ping-pong = pipeline kernel design

The modern GEMM kernel pattern is a pipeline:

  1. TMA asynchronously loads the next A/B tiles into shared-memory stage S.
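  2. Warp groups read operands from stage S-1, either by loading fragments into registers or, with Hopper warpgroup MMA, by reading them directly from shared memory.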
  3. MMA instructions compute on current fragments while next tiles are still loading.
  4. The pipeline advances with stage rotation (ping-pong, or deeper multistage).

This design minimizes idle cycles and is foundational for reaching near-peak Tensor Core utilization.
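The stage rotation in the steps above can be sketched without TMA. TMA tensor copies are set up through a tensor map descriptor created on the host, which is more than a short example can carry, so the sketch below swaps in the per-thread asynchronous copy primitives from cuda_pipeline.h (hardware-accelerated on compute capability 8.0 and newer) and plain FP32 math instead of MMA. It is not a TMA kernel; only the multistage structure is the same. The names sgemm_multistage and multistage_max_error are hypothetical, STAGES = 3 is an arbitrary illustrative depth, and the kernel assumes M, N, and K are multiples of TILE.

```cuda
#include <cuda_runtime.h>
#include <cuda_pipeline.h>
#include <vector>
#include <cmath>

constexpr int TILE = 16;
constexpr int STAGES = 3;  // illustrative depth; 2 would be plain ping-pong

// Assumes the grid exactly covers C and that N and K are multiples of TILE.
__global__ void sgemm_multistage(const float* A, const float* B, float* C,
                                 int N, int K) {
    __shared__ float As[STAGES][TILE][TILE];
    __shared__ float Bs[STAGES][TILE][TILE];
    const int ty = threadIdx.y, tx = threadIdx.x;
    const int row = blockIdx.y * TILE + ty;
    const int col = blockIdx.x * TILE + tx;
    const int numTiles = K / TILE;

    // Prologue: start asynchronous copies for the first STAGES-1 tiles.
    for (int t = 0; t < STAGES - 1; ++t) {
        if (t < numTiles) {
            __pipeline_memcpy_async(&As[t][ty][tx], &A[row * K + t * TILE + tx], sizeof(float));
            __pipeline_memcpy_async(&Bs[t][ty][tx], &B[(t * TILE + ty) * N + col], sizeof(float));
        }
        __pipeline_commit();  // commit even when empty so batch counts stay uniform
    }

    float acc = 0.0f;
    for (int t = 0; t < numTiles; ++t) {
        const int next = t + STAGES - 1;  // tile to prefetch this iteration
        if (next < numTiles) {
            const int s = next % STAGES;  // stage released at the end of iteration t-1
            __pipeline_memcpy_async(&As[s][ty][tx], &A[row * K + next * TILE + tx], sizeof(float));
            __pipeline_memcpy_async(&Bs[s][ty][tx], &B[(next * TILE + ty) * N + col], sizeof(float));
        }
        __pipeline_commit();
        __pipeline_wait_prior(STAGES - 1);  // this thread's copies for tile t are done
        __syncthreads();                    // and so are every other thread's

        const int cur = t % STAGES;
        for (int k = 0; k < TILE; ++k)
            acc += As[cur][ty][k] * Bs[cur][k][tx];
        __syncthreads();  // stage `cur` may be overwritten in the next iteration
    }
    C[row * N + col] = acc;
}

// Hypothetical helper: returns max |C_gpu - C_cpu|; sizes must be multiples of TILE.
float multistage_max_error(int M, int N, int K) {
    std::vector<float> A(M * K), B(K * N), C(M * N, -1.0f), ref(M * N, 0.0f);
    for (int i = 0; i < M * K; ++i) A[i] = (i % 7) * 0.25f;
    for (int i = 0; i < K * N; ++i) B[i] = (i % 5) * 0.5f;
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < K; ++k)
                ref[i * N + j] += A[i * K + k] * B[k * N + j];

    float *dA, *dB, *dC;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dB, B.size() * sizeof(float));
    cudaMalloc(&dC, C.size() * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid(N / TILE, M / TILE);
    sgemm_multistage<<<grid, block>>>(dA, dB, dC, N, K);
    cudaMemcpy(C.data(), dC, C.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);

    float maxErr = 0.0f;
    for (int i = 0; i < M * N; ++i)
        maxErr = std::fmax(maxErr, std::fabs(C[i] - ref[i]));
    return maxErr;
}
```

The wait argument is what ties the stages together: after STAGES-1 prologue commits plus one commit per iteration, waiting until at most STAGES-1 batches are pending guarantees that the batch holding tile t has completed. A TMA-based kernel would replace the per-thread copies with one tile copy issued by a single thread and signal completion through a shared-memory barrier, but the rotation of stages follows the same idea.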

Practical tuning checklist

How to validate improvement

Use Nsight Compute to compare achieved FLOP/s, tensor core utilization, memory throughput, and stall breakdown before and after each optimization. Track one change at a time so you can attribute performance gains accurately.
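Alongside the profiler, a quick in-program measurement helps confirm that a change moved the headline number. The sketch below times a kernel with CUDA events and converts the result to GFLOP/s using the 2*M*N*K operation count of a GEMM. The names gemm_gflops, sgemm_naive, and measure_naive_gflops are hypothetical, and the naive kernel is there only to have something to time.

```cuda
#include <cuda_runtime.h>
#include <cmath>

// A GEMM performs 2*M*N*K floating-point operations
// (one multiply and one add per inner-product term).
double gemm_gflops(long long M, long long N, long long K, float ms) {
    return (2.0 * M * N * K) / (ms * 1.0e6);
}

// Naive kernel used only as a timing target.
__global__ void sgemm_naive(const float* A, const float* B, float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float acc = 0.0f;
        for (int k = 0; k < n; ++k)
            acc += A[row * n + k] * B[k * n + col];
        C[row * n + col] = acc;
    }
}

// Hypothetical helper: average GFLOP/s over `iters` launches on n x n matrices.
double measure_naive_gflops(int n, int iters) {
    size_t bytes = (size_t)n * n * sizeof(float);
    float *A, *B, *C;
    cudaMalloc(&A, bytes);
    cudaMalloc(&B, bytes);
    cudaMalloc(&C, bytes);
    cudaMemset(A, 0, bytes);
    cudaMemset(B, 0, bytes);

    dim3 block(16, 16);
    dim3 grid((n + 15) / 16, (n + 15) / 16);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    sgemm_naive<<<grid, block>>>(A, B, C, n);  // warm-up launch, not timed
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        sgemm_naive<<<grid, block>>>(A, B, C, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // total time for all timed launches

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return gemm_gflops(n, n, n, ms / iters);
}
```

For example, a 1024x1024x1024 GEMM that takes 1 ms corresponds to about 2147 GFLOP/s. Event timing gives only the end-to-end figure; the stall breakdown and unit utilization still have to come from Nsight Compute.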
