Kernel Functions

Kernel functions are the core of CUDA programming. They define the parallel work each GPU thread performs and determine how effectively your workload maps onto GPU hardware.

Kernel anatomy

A kernel is declared with `__global__` and launched from host code using execution configuration syntax:

__global__ void saxpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = a * x[i] + y[i];
    }
}

int blockSize = 256;
int gridSize = (n + blockSize - 1) / blockSize;
saxpy<<<gridSize, blockSize>>>(a, d_x, d_y, n);

Choosing launch configuration

Thread indexing patterns

Linear indexing works for 1D arrays. For matrices and images, combine x/y dimensions:

int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int idx = y * width + x;

Branching and divergence

Threads execute in groups called warps. If threads in the same warp follow different branches, execution serializes and performance drops. Keep control flow coherent across nearby threads whenever possible.

Kernel quality checklist

  1. Bounds checks prevent invalid memory access.
  2. Memory access is coalesced for neighboring threads.
  3. Shared memory is used when data reuse is high.
  4. Math work is balanced with memory traffic.
Continue to Advanced: Optimization Techniques →

AI Tools & Services