CUDA Basics

Kernel Functions

Kernel functions are the core of CUDA programming. They define the parallel work each GPU thread performs and determine how effectively your workload maps onto GPU hardware.

Kernel anatomy

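A kernel is a function declared with `__global__` that returns `void`. Host code launches it with the execution configuration syntax `<<<gridSize, blockSize>>>`, which sets the number of blocks and the number of threads per block: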

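// Each thread computes one element of y = a * x + y.
__global__ void saxpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {  // guard: the last block may extend past n
        y[i] = a * x[i] + y[i];
    }
}

// Host side (fragment): d_x and d_y are assumed to be device pointers to n floats each.
int blockSize = 256;
int gridSize = (n + blockSize - 1) / blockSize;  // ceiling division
saxpy<<<gridSize, blockSize>>>(a, d_x, d_y, n);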
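
The fragment above omits allocation, data transfer, and error checking. A minimal sketch of a complete program might look like the following. The `CUDA_CHECK` macro, the `run_saxpy` driver, and the test data are assumptions made for illustration; only the kernel and the launch arithmetic come from the text above.

```cuda
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper macro: stop with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            std::fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                         cudaGetErrorString(err_), __FILE__, __LINE__);    \
            std::exit(EXIT_FAILURE);                                       \
        }                                                                  \
    } while (0)

// One thread per element of y = a * x + y.
__global__ void saxpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = a * x[i] + y[i];
    }
}

// Hypothetical driver: runs saxpy on n elements and returns the largest
// absolute difference from a CPU reference.
float run_saxpy(int n, float a) {
    std::vector<float> h_x(n), h_y(n), h_ref(n);
    for (int i = 0; i < n; ++i) {
        h_x[i] = 0.5f * static_cast<float>(i);
        h_y[i] = 1.0f;
        h_ref[i] = a * h_x[i] + h_y[i];
    }

    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float* d_x = nullptr;
    float* d_y = nullptr;
    CUDA_CHECK(cudaMalloc(&d_x, bytes));
    CUDA_CHECK(cudaMalloc(&d_y, bytes));
    CUDA_CHECK(cudaMemcpy(d_x, h_x.data(), bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_y, h_y.data(), bytes, cudaMemcpyHostToDevice));

    int blockSize = 256;
    int gridSize = (n + blockSize - 1) / blockSize;  // ceiling division
    saxpy<<<gridSize, blockSize>>>(a, d_x, d_y, n);
    CUDA_CHECK(cudaGetLastError());  // reports launch configuration errors

    // This blocking copy also waits for the kernel to finish.
    CUDA_CHECK(cudaMemcpy(h_y.data(), d_y, bytes, cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_x));
    CUDA_CHECK(cudaFree(d_y));

    float maxErr = 0.0f;
    for (int i = 0; i < n; ++i) {
        maxErr = std::fmax(maxErr, std::fabs(h_y[i] - h_ref[i]));
    }
    return maxErr;
}

int main() {
    float maxErr = run_saxpy(1 << 20, 2.0f);
    std::printf("max abs error: %g\n", maxErr);
    return maxErr < 1e-5f ? 0 : 1;
}
```

Saved as a `.cu` file, this should build with `nvcc`. Kernel launches do not return an error code, which is why the sketch calls `cudaGetLastError()` immediately after the launch.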

Choosing launch configuration

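Start with a block size of 128, 256, or 512 threads, then profile. Block sizes should be a multiple of the 32-thread warp size and cannot exceed 1024 threads.
Compute the grid size from the problem length with ceiling division so that every element is covered.
Use an occupancy calculator and the Nsight profiling tools to tune from that starting point.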
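
As one way to apply these guidelines, the CUDA runtime can suggest a block size for a given kernel through `cudaOccupancyMaxPotentialBlockSize`. The sketch below assumes the `saxpy` kernel from earlier; the helper name `grid_size_for` and the problem size are hypothetical. The suggested value is a starting point for profiling, not a guarantee of best performance.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = a * x[i] + y[i];
    }
}

// Hypothetical helper: number of blocks needed to cover n elements
// (ceiling division).
int grid_size_for(int n, int blockSize) {
    return (n + blockSize - 1) / blockSize;
}

int main() {
    int minGridSize = 0;  // smallest grid that can reach full occupancy
    int blockSize = 0;    // block size suggested by the runtime

    // Last two arguments: no dynamic shared memory, no block size limit.
    cudaError_t err = cudaOccupancyMaxPotentialBlockSize(
        &minGridSize, &blockSize, saxpy, 0, 0);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    int n = 1 << 20;  // assumed problem size
    int gridSize = grid_size_for(n, blockSize);
    std::printf("suggested block size: %d, grid size for n=%d: %d\n",
                blockSize, n, gridSize);
    return 0;
}
```

The grid size is still derived from the problem length, so the bounds check inside the kernel remains necessary whenever `n` is not a multiple of the block size.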

Thread indexing patterns

Linear indexing covers 1D arrays. For matrices and images, launch a 2D grid and combine the x and y indices into a row-major offset:

int x = blockIdx.x * blockDim.x + threadIdx.x;  // column
int y = blockIdx.y * blockDim.y + threadIdx.y;  // row
int idx = y * width + x;  // row-major offset; valid only when x < width and y is below the row count
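
To show the 2D pattern in context, here is a minimal sketch of a kernel that scales every pixel of a single-channel, row-major image, together with a `dim3` launch. The kernel `scale_image`, the helpers `check` and `count_scale_mismatches`, the `height` parameter, and the 16x16 block shape are all assumptions chosen for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical example kernel: multiply each pixel by `factor`.
__global__ void scale_image(float* img, int width, int height, float factor) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row
    if (x < width && y < height) {                  // guard both dimensions
        img[y * width + x] *= factor;
    }
}

// Hypothetical helper: exit on any CUDA runtime error.
static void check(cudaError_t err) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// Scales a test image on the GPU and returns how many pixels differ
// from the value computed on the CPU.
int count_scale_mismatches(int width, int height) {
    std::vector<float> h_img(static_cast<size_t>(width) * height);
    for (size_t i = 0; i < h_img.size(); ++i) {
        h_img[i] = static_cast<float>(i % 256);
    }

    size_t bytes = h_img.size() * sizeof(float);
    float* d_img = nullptr;
    check(cudaMalloc(&d_img, bytes));
    check(cudaMemcpy(d_img, h_img.data(), bytes, cudaMemcpyHostToDevice));

    dim3 block(16, 16);  // 256 threads per block
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    scale_image<<<grid, block>>>(d_img, width, height, 2.0f);
    check(cudaGetLastError());
    check(cudaMemcpy(h_img.data(), d_img, bytes, cudaMemcpyDeviceToHost));
    check(cudaFree(d_img));

    int mismatches = 0;
    for (size_t i = 0; i < h_img.size(); ++i) {
        if (h_img[i] != 2.0f * static_cast<float>(i % 256)) ++mismatches;
    }
    return mismatches;
}

int main() {
    // Dimensions that are not multiples of 16 exercise the bounds check.
    int mismatches = count_scale_mismatches(1000, 700);
    std::printf("mismatched pixels: %d\n", mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

Mapping `threadIdx.x` to the column keeps neighboring threads on neighboring addresses in a row-major layout, which is the access pattern that coalesces well.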

Branching and divergence

Threads execute in groups of 32 called warps. If threads in the same warp take different branches, the warp runs each branch path in turn with the non-participating threads masked off, so the cost approaches the sum of both paths. Keep control flow coherent across the threads of a warp whenever possible.
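
The contrast can be sketched with two kernels that apply the same two operations but split the threads differently. Both kernels and the `kernels_match_reference` helper are hypothetical, and the sketch assumes a 1D block whose size is a multiple of 32. For branches this small the compiler may use predication instead of real divergent branches, so treat this as an illustration of the pattern rather than a benchmark.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Divergent pattern: even and odd lanes of every warp take different branches.
__global__ void branch_per_thread(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0) data[i] *= 2.0f;
    else                      data[i] += 1.0f;
}

// Coherent pattern: the condition depends on the warp index only, so all
// 32 threads of a warp take the same branch.
__global__ void branch_per_warp(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int warp = threadIdx.x / warpSize;
    if (warp % 2 == 0) data[i] *= 2.0f;
    else               data[i] += 1.0f;
}

// Hypothetical helper: exit on any CUDA runtime error.
static void check(cudaError_t err) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// Runs both kernels on arrays filled with 3.0f and checks every element
// against the branch it should have taken (6.0f doubled, 4.0f incremented).
bool kernels_match_reference(int n) {
    const int blockSize = 256;  // multiple of the warp size
    const int gridSize = (n + blockSize - 1) / blockSize;
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    std::vector<float> a(n, 3.0f), b(n, 3.0f);

    float* d = nullptr;
    check(cudaMalloc(&d, bytes));

    check(cudaMemcpy(d, a.data(), bytes, cudaMemcpyHostToDevice));
    branch_per_thread<<<gridSize, blockSize>>>(d, n);
    check(cudaGetLastError());
    check(cudaMemcpy(a.data(), d, bytes, cudaMemcpyDeviceToHost));

    check(cudaMemcpy(d, b.data(), bytes, cudaMemcpyHostToDevice));
    branch_per_warp<<<gridSize, blockSize>>>(d, n);
    check(cudaGetLastError());
    check(cudaMemcpy(b.data(), d, bytes, cudaMemcpyDeviceToHost));
    check(cudaFree(d));

    for (int i = 0; i < n; ++i) {
        int t = i % blockSize;  // threadIdx.x of the thread that handled i
        float expectA = (t % 2 == 0) ? 6.0f : 4.0f;
        float expectB = ((t / 32) % 2 == 0) ? 6.0f : 4.0f;
        if (a[i] != expectA || b[i] != expectB) return false;
    }
    return true;
}

int main() {
    bool ok = kernels_match_reference(1 << 16);
    std::printf("%s\n", ok ? "results match" : "results differ");
    return ok ? 0 : 1;
}
```

Note that the two kernels assign the operations to different elements, so this restructuring only applies when the algorithm is free to choose which thread handles which case.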

Kernel quality checklist

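Every global memory access is guarded by a bounds check, so threads past the end of the data do nothing.
Memory access is coalesced: neighboring threads in a warp read and write neighboring addresses.
Shared memory is used where the same data is read by several threads in a block.
Arithmetic work is balanced against memory traffic, so the kernel is not left waiting on global memory.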
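
Several checklist items can be seen together in a small 3-point stencil, sketched below. Each block stages its slice of the input in shared memory so that every element is fetched from global memory once but read by three threads. The kernel `stencil3`, the constants `BLOCK` and `RADIUS`, the helpers, and the zero-padding at the array ends are all assumptions made for this illustration; the kernel must be launched with exactly `BLOCK` threads per block.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

constexpr int BLOCK = 256;  // threads per block (assumed launch size)
constexpr int RADIUS = 1;   // one neighbor on each side

// out[i] = in[i-1] + in[i] + in[i+1], with out-of-range neighbors read as 0.
__global__ void stencil3(const float* in, float* out, int n) {
    __shared__ float tile[BLOCK + 2 * RADIUS];
    int g = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int l = threadIdx.x + RADIUS;                   // index into tile

    // Coalesced load: neighboring threads read neighboring elements.
    tile[l] = (g < n) ? in[g] : 0.0f;
    // The first and last thread of the block also load the halo cells.
    if (threadIdx.x == 0) {
        tile[0] = (g >= 1 && g - 1 < n) ? in[g - 1] : 0.0f;
    }
    if (threadIdx.x == blockDim.x - 1) {
        tile[l + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;
    }
    __syncthreads();  // every thread reaches this; no early return above

    if (g < n) {      // bounds check on the store
        out[g] = tile[l - 1] + tile[l] + tile[l + 1];
    }
}

// Hypothetical helper: exit on any CUDA runtime error.
static void check(cudaError_t err) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// Runs the stencil on n elements and returns the number of outputs that
// differ from a CPU reference.
int count_stencil_mismatches(int n) {
    std::vector<float> h_in(n), h_out(n);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i % 7);

    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float* d_in = nullptr;
    float* d_out = nullptr;
    check(cudaMalloc(&d_in, bytes));
    check(cudaMalloc(&d_out, bytes));
    check(cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice));

    int gridSize = (n + BLOCK - 1) / BLOCK;
    stencil3<<<gridSize, BLOCK>>>(d_in, d_out, n);
    check(cudaGetLastError());
    check(cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost));
    check(cudaFree(d_in));
    check(cudaFree(d_out));

    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        float left = (i > 0) ? h_in[i - 1] : 0.0f;
        float right = (i + 1 < n) ? h_in[i + 1] : 0.0f;
        if (h_out[i] != left + h_in[i] + right) ++mismatches;
    }
    return mismatches;
}

int main() {
    int mismatches = count_stencil_mismatches(100000);
    std::printf("mismatched outputs: %d\n", mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

With a radius of 1 the reuse is modest, and the hardware caches may already serve the repeated reads, so shared memory tends to pay off more as the stencil radius or the amount of reuse grows.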
