CUDA Basics
Introduction to CUDA
CUDA is NVIDIA's parallel computing platform and programming model. It lets you run thousands of lightweight threads on a GPU. If your workload applies the same operation independently across large arrays, CUDA can run it substantially faster than CPU-only code.
Why CUDA matters
Modern AI systems, simulation engines, and computer vision pipelines all process enormous batches of data. GPUs are designed for this model: they execute many operations concurrently and deliver high memory bandwidth. CUDA gives you direct access to that capability.
- CPUs are optimized for low-latency control and complex branching.
- GPUs are optimized for throughput with large numbers of simple operations.
- CUDA lets you write C/C++-style code that offloads hot loops to the GPU.
Core execution model
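CUDA kernels are launched over a grid. Each grid contains blocks, and each block contains threads. You write one kernel function, and CUDA runs it once for every thread in the grid. Each thread reads the built-in variables blockIdx, blockDim, and threadIdx to work out which piece of data it owns.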
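// Minimal CUDA kernel: each thread adds one pair of elements
__global__ void add(const float* a, const float* b, float* c, int n) {
    // Global index of this thread across the whole grid
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Guard: the grid may contain more threads than there are elements
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}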
In this pattern, each thread owns one output element. This is the most common first step for data-parallel algorithms.
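To make the grid/block/thread mapping concrete, here is a minimal sketch of how the add kernel might be launched. The helper names blocksFor and launchAdd, and the choice of 256 threads per block, are assumptions made for illustration; they are not part of the original example or of the CUDA API.

```cuda
#include <cuda_runtime.h>

// Same element-wise kernel as above, repeated so this sketch compiles on its own.
__global__ void add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

// Hypothetical helper: smallest block count with blocks * threadsPerBlock >= n.
int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division
}

// Hypothetical wrapper: d_a, d_b, and d_c are assumed to be device pointers
// that each refer to at least n floats.
cudaError_t launchAdd(const float* d_a, const float* d_b, float* d_c, int n) {
    if (n <= 0) {
        return cudaSuccess;  // nothing to do; a grid of zero blocks is invalid
    }
    const int threadsPerBlock = 256;  // assumed starting point, not a tuned value
    const int blocks = blocksFor(n, threadsPerBlock);
    add<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, n);
    return cudaGetLastError();  // reports an invalid launch configuration, if any
}
```

With n = 1000 and 256 threads per block, blocksFor returns 4, so 1024 threads are launched and the i < n guard leaves the last 24 idle.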
Host vs device responsibilities
Your CPU code (host) allocates memory, copies data, launches kernels, and retrieves results. Your GPU code (device) performs heavy parallel computation.
- Allocate host and device memory.
- Copy inputs to the device.
- Launch kernel with a grid/block configuration.
- Copy outputs back to host memory.
- Validate results and free memory.
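The steps above can be sketched as a single host function. This is one possible arrangement under stated assumptions, not a definitive implementation: the function name addOnGpu, the use of std::vector for host memory, the 256-thread block size, and the "return an empty vector on failure" convention are all choices made for this example.

```cuda
#include <cuda_runtime.h>
#include <vector>

__global__ void add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

// Hypothetical host helper: returns a + b computed on the GPU,
// or an empty vector if the inputs are unusable or any CUDA call fails.
std::vector<float> addOnGpu(const std::vector<float>& a, const std::vector<float>& b) {
    const int n = static_cast<int>(a.size());
    if (n == 0 || b.size() != a.size()) {
        return {};
    }
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);

    // Step 1: allocate host output and device buffers.
    std::vector<float> c(n);
    float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    bool ok = cudaMalloc(&d_a, bytes) == cudaSuccess
           && cudaMalloc(&d_b, bytes) == cudaSuccess
           && cudaMalloc(&d_c, bytes) == cudaSuccess;

    // Step 2: copy inputs to the device.
    ok = ok && cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess
            && cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    // Step 3: launch the kernel with enough blocks to cover n elements.
    if (ok) {
        const int threads = 256;  // assumed block size, not tuned
        const int blocks = (n + threads - 1) / threads;
        add<<<blocks, threads>>>(d_a, d_b, d_c, n);
        ok = cudaGetLastError() == cudaSuccess;
    }

    // Step 4: copy the result back. This cudaMemcpy runs after the kernel
    // on the default stream and blocks the host until the copy is done.
    ok = ok && cudaMemcpy(c.data(), d_c, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    // Step 5: free device memory (cudaFree on a null pointer does nothing).
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    if (!ok) {
        c.clear();
    }
    return c;
}
```

To validate, compare each element of the returned vector with a[i] + b[i] computed on the CPU. For this kernel the results should match exactly, since both sides perform a single float addition.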
Common beginner pitfalls
- Choosing block sizes without measuring occupancy and runtime.
- Ignoring memory transfer cost between host and device.
- Forgetting error checks after CUDA API calls and kernel launches. A launch returns no status of its own, so query cudaGetLastError() right after it.
- Launching too little parallel work to keep the GPU hardware busy.
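Two of these pitfalls, missing error checks and unmeasured transfer cost, can be addressed with a small amount of scaffolding. The sketch below is illustrative only: the CUDA_CHECK macro, the Timings struct, and the timeAdd function are hypothetical names introduced here, and the function assumes n > 0. It wraps every runtime call in an error check and uses CUDA events to time the host-to-device copy separately from the kernel.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Hypothetical macro: abort with file and line if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "CUDA error: %s at %s:%d\n",             \
                         cudaGetErrorString(err_), __FILE__, __LINE__);   \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

__global__ void add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

// Hypothetical result type: elapsed milliseconds for each phase.
struct Timings {
    float copyMs;
    float kernelMs;
};

// Times one host-to-device copy and one kernel run for n elements (n > 0 assumed).
Timings timeAdd(int n) {
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    std::vector<float> h_in(n, 1.0f);
    float *d_in = nullptr, *d_out = nullptr;
    CUDA_CHECK(cudaMalloc(&d_in, bytes));
    CUDA_CHECK(cudaMalloc(&d_out, bytes));

    cudaEvent_t start, afterCopy, afterKernel;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&afterCopy));
    CUDA_CHECK(cudaEventCreate(&afterKernel));

    CUDA_CHECK(cudaEventRecord(start));
    CUDA_CHECK(cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaEventRecord(afterCopy));

    const int threads = 256;  // assumed block size; vary it and compare kernelMs
    add<<<(n + threads - 1) / threads, threads>>>(d_in, d_in, d_out, n);
    CUDA_CHECK(cudaGetLastError());                  // catches launch errors
    CUDA_CHECK(cudaEventRecord(afterKernel));
    CUDA_CHECK(cudaEventSynchronize(afterKernel));   // wait for the kernel to finish

    Timings t{};
    CUDA_CHECK(cudaEventElapsedTime(&t.copyMs, start, afterCopy));
    CUDA_CHECK(cudaEventElapsedTime(&t.kernelMs, afterCopy, afterKernel));

    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(afterCopy));
    CUDA_CHECK(cudaEventDestroy(afterKernel));
    CUDA_CHECK(cudaFree(d_in));
    CUDA_CHECK(cudaFree(d_out));
    return t;
}
```

Calling timeAdd with a large n and printing both fields shows how the copy time compares with the kernel time on your hardware. The first call also absorbs one-time startup costs, so it is reasonable to discard it and average several later runs.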
What to learn next
After you understand kernels and launch geometry, focus on the memory hierarchy and on performance analysis. These two topics typically have a large impact on real-world CUDA speedups.