CUDA Basics
Kernel Functions
Kernel functions are the core of CUDA programming. They define the parallel work each GPU thread performs and determine how effectively your workload maps onto GPU hardware.
Kernel anatomy
A kernel is declared with `__global__`, returns `void`, and is launched from host code using the execution configuration syntax `<<<gridSize, blockSize>>>`:
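// Device code: each thread handles one element of y = a * x + y.
__global__ void saxpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {  // guard: the last block may extend past n
        y[i] = a * x[i] + y[i];
    }
}
// Host code: d_x and d_y must be device pointers to n floats each.
int blockSize = 256;                             // threads per block
int gridSize = (n + blockSize - 1) / blockSize;  // ceil(n / blockSize) blocks
saxpy<<<gridSize, blockSize>>>(a, d_x, d_y, n);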
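The launch above assumes that `d_x` and `d_y` already live in device memory. A minimal sketch of the surrounding host code might look like the following. The helper name `run_saxpy` and the fill values are assumptions made for illustration, and `n` is assumed to be greater than zero.

```cuda
#include <cassert>
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void saxpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = a * x[i] + y[i];
    }
}

// Hypothetical helper (not from the original text): runs saxpy on n elements
// with x = 1 and y = 2, and returns true if every result equals a + 2.
// Assumes n > 0.
bool run_saxpy(int n, float a) {
    std::vector<float> h_x(n, 1.0f), h_y(n, 2.0f);
    size_t bytes = static_cast<size_t>(n) * sizeof(float);

    float* d_x = nullptr;
    float* d_y = nullptr;
    if (cudaMalloc(&d_x, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_y, bytes) != cudaSuccess) { cudaFree(d_x); return false; }

    // Copy both inputs to the device.
    cudaError_t err = cudaMemcpy(d_x, h_x.data(), bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess)
        err = cudaMemcpy(d_y, h_y.data(), bytes, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        int blockSize = 256;
        int gridSize = (n + blockSize - 1) / blockSize;
        saxpy<<<gridSize, blockSize>>>(a, d_x, d_y, n);
        err = cudaGetLastError();                               // launch errors
        if (err == cudaSuccess) err = cudaDeviceSynchronize();  // execution errors
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(h_y.data(), d_y, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_x);
    cudaFree(d_y);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return false;
    }
    for (int i = 0; i < n; ++i) {
        if (std::fabs(h_y[i] - (a * 1.0f + 2.0f)) > 1e-6f) return false;
    }
    return true;
}
```

Calling `run_saxpy(1000, 3.0f)` from `main` exercises the bounds check, since 1000 is not a multiple of the block size. Kernel launches are asynchronous, which is why the sketch checks `cudaGetLastError()` right after the launch and again synchronizes before reading results.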
Choosing launch configuration
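- Start with a block size that is a multiple of the warp size (32), such as 128, 256, or 512, then profile.
- Compute the grid size by rounding up the problem length divided by the block size, so that every element is covered.
- Use the occupancy calculator and the Nsight tools to tune from that starting point.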
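As a starting point before profiling, the CUDA runtime can suggest a block size for a given kernel through `cudaOccupancyMaxPotentialBlockSize`. The sketch below assumes the `saxpy` kernel from earlier and no dynamic shared memory. The helper name `pick_launch_config` is hypothetical. The suggested size maximizes theoretical occupancy, which does not always give the fastest kernel, so treat it as a first guess.

```cuda
#include <cassert>
#include <cuda_runtime.h>

__global__ void saxpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = a * x[i] + y[i];
    }
}

// Hypothetical helper: asks the runtime for an occupancy-based block size,
// then derives the grid size needed to cover n elements.
// Returns false if the query fails.
bool pick_launch_config(int n, int* gridSize, int* blockSize) {
    int minGridSize = 0;  // smallest grid that can reach full occupancy (unused here)
    cudaError_t err = cudaOccupancyMaxPotentialBlockSize(
        &minGridSize, blockSize, saxpy,
        0,   // dynamic shared memory per block, in bytes
        0);  // block size limit; 0 means no limit
    if (err != cudaSuccess || *blockSize <= 0) return false;

    *gridSize = (n + *blockSize - 1) / *blockSize;  // round up to cover all elements
    return true;
}
```

The resulting pair can be passed straight to the launch, as in `saxpy<<<gridSize, blockSize>>>(a, d_x, d_y, n)`, and then compared against fixed sizes such as 128 or 256 in a profiler.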
Thread indexing patterns
Linear indexing works for 1D arrays. For matrices and images, compute a column and a row from the x and y dimensions, then flatten them into a row-major offset:
int x = blockIdx.x * blockDim.x + threadIdx.x;  // column
int y = blockIdx.y * blockDim.y + threadIdx.y;  // row
int idx = y * width + x;  // row-major offset; use only when x < width && y < height
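To show the 2D pattern in context, here is a small sketch of a kernel launched with `dim3` blocks and grids. The kernel `add_offset`, the helper `run_add_offset`, and the 16x16 block shape are illustrative assumptions, not something the text prescribes.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Adds a constant to every pixel of a row-major width x height image.
__global__ void add_offset(float* img, int width, int height, float offset) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row
    if (x < width && y < height) {                  // edge blocks may overhang
        int idx = y * width + x;
        img[idx] += offset;
    }
}

// Hypothetical helper: fills the image with 0, 1, 2, ... and checks that the
// kernel added 1 to every pixel. Assumes width > 0 and height > 0.
bool run_add_offset(int width, int height) {
    int count = width * height;
    std::vector<float> h_img(count);
    for (int i = 0; i < count; ++i) h_img[i] = static_cast<float>(i);
    size_t bytes = static_cast<size_t>(count) * sizeof(float);

    float* d_img = nullptr;
    if (cudaMalloc(&d_img, bytes) != cudaSuccess) return false;
    cudaError_t err = cudaMemcpy(d_img, h_img.data(), bytes, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        dim3 block(16, 16);  // 256 threads per block
        dim3 grid((width + block.x - 1) / block.x,    // round up in x
                  (height + block.y - 1) / block.y);  // round up in y
        add_offset<<<grid, block>>>(d_img, width, height, 1.0f);
        err = cudaGetLastError();
        if (err == cudaSuccess) err = cudaDeviceSynchronize();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(h_img.data(), d_img, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_img);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < count; ++i) {
        if (h_img[i] != static_cast<float>(i) + 1.0f) return false;
    }
    return true;
}
```

Because `threadIdx.x` maps to the column, neighboring threads in a warp touch neighboring addresses within a row, which keeps the accesses contiguous in a row-major layout.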
Branching and divergence
Threads execute in groups of 32 called warps. If threads in the same warp take different branches, the warp runs each path in turn with the non-participating threads masked off, so the cost approaches the sum of both paths. Keep control flow coherent across neighboring threads whenever possible.
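The difference can be sketched with two toy kernels. In the first, the branch depends on the parity of the thread index, so every warp contains threads on both paths. In the second, the branch depends on the warp index, so all threads of a warp agree. The kernel names and the arithmetic are hypothetical, and the two kernels deliberately assign work differently, so this rearrangement applies only when the algorithm allows it. For branches this short the compiler may use predication, so a measurable gap usually needs heavier branch bodies.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Divergent: even and odd threads in the same warp take different paths.
__global__ void branch_per_thread(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0) out[i] = i + 1.0f;
    else            out[i] = i * 2.0f;
}

// Coherent: the condition is uniform across each warp, provided the
// block size is a multiple of warpSize (a 1D launch is assumed).
__global__ void branch_per_warp(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int warp = i / warpSize;
    if (warp % 2 == 0) out[i] = i + 1.0f;
    else               out[i] = i * 2.0f;
}

// Hypothetical helper: runs both kernels and checks each against a host
// reference. Assumes n > 0 and a warp size of 32 on the host side.
bool run_branch_demo(int n) {
    const int kWarp = 32;
    std::vector<float> a(n), b(n);
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float* d_out = nullptr;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) return false;

    int blockSize = 256;
    int gridSize = (n + blockSize - 1) / blockSize;

    // A blocking cudaMemcpy waits for the preceding kernel on the default stream.
    branch_per_thread<<<gridSize, blockSize>>>(d_out, n);
    cudaError_t err = cudaMemcpy(a.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    if (err == cudaSuccess) {
        branch_per_warp<<<gridSize, blockSize>>>(d_out, n);
        err = cudaMemcpy(b.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) {
        float ea = (i % 2 == 0) ? i + 1.0f : i * 2.0f;
        float eb = ((i / kWarp) % 2 == 0) ? i + 1.0f : i * 2.0f;
        if (a[i] != ea || b[i] != eb) return false;
    }
    return true;
}
```

Profiling both kernels with more expensive branch bodies is one way to see how much divergence costs on a given GPU.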
Kernel quality checklist
- Bounds checks prevent invalid memory access.
- Memory access is coalesced for neighboring threads.
- Shared memory is used when data reuse is high.
- Arithmetic work is balanced against memory traffic, so the kernel is not left stalled on loads and stores.
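Several checklist items can be seen together in a block-level sum, sketched below under a few assumptions: a fixed block size of 256 (a power of two), a hypothetical kernel `block_sum` that writes one partial sum per block, and a hypothetical helper `run_block_sum` that adds the partials on the host. It is an illustration of the pattern, not a tuned reduction.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int kBlockSize = 256;  // must be a power of two for the halving loop

// Each block sums up to kBlockSize inputs and writes one partial result.
// Must be launched with blockDim.x == kBlockSize.
__global__ void block_sum(const float* in, float* partial, int n) {
    __shared__ float tile[kBlockSize];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    // Bounds-checked load; neighboring threads read neighboring addresses.
    tile[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory: intermediate sums are reused on chip
    // instead of being written back to global memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) tile[tid] += tile[tid + stride];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = tile[0];
}

// Hypothetical helper: sums n ones on the GPU and returns the total,
// or -1.0f if a CUDA call fails. Assumes n > 0.
float run_block_sum(int n) {
    int gridSize = (n + kBlockSize - 1) / kBlockSize;
    std::vector<float> h_in(n, 1.0f), h_partial(gridSize, 0.0f);
    size_t inBytes = static_cast<size_t>(n) * sizeof(float);
    size_t outBytes = static_cast<size_t>(gridSize) * sizeof(float);

    float* d_in = nullptr;
    float* d_partial = nullptr;
    if (cudaMalloc(&d_in, inBytes) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&d_partial, outBytes) != cudaSuccess) { cudaFree(d_in); return -1.0f; }

    cudaError_t err = cudaMemcpy(d_in, h_in.data(), inBytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        block_sum<<<gridSize, kBlockSize>>>(d_in, d_partial, n);
        // The blocking copy waits for the kernel and reports its errors.
        err = cudaMemcpy(h_partial.data(), d_partial, outBytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(d_in);
    cudaFree(d_partial);
    if (err != cudaSuccess) return -1.0f;

    float total = 0.0f;
    for (float p : h_partial) total += p;  // finish the reduction on the host
    return total;
}
```

The `__syncthreads()` calls sit outside the `if (tid < stride)` condition on purpose, so that every thread in the block reaches each barrier.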