CUDA Basics
Memory Management
Memory performance is often the main bottleneck in CUDA programs. Understanding device, host, shared, and constant memory helps you build kernels that are fast, stable, and scalable.
CUDA memory types at a glance
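- Global memory: Large and visible to every thread, but high latency (on the order of hundreds of cycles per access).
- Shared memory: On-chip and very fast, shared by threads in the same block.
- Constant memory: Read-only from kernels, cached, and limited to 64 KB; fastest when all threads in a warp read the same address.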
- Registers: Fastest storage, private to each thread.
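To make the four memory spaces concrete, here is a minimal sketch of a kernel that touches each of them. The names `c_scale`, `TILE`, `scaleReverse`, and `runScaleReverse` are hypothetical, chosen only for illustration, and error checking is omitted to keep the focus on where the data lives.

```cuda
#include <cuda_runtime.h>

constexpr int TILE = 256;          // threads per block (assumed launch size)

// Constant memory: small, read-only from kernels, written from the host.
__constant__ float c_scale;

// Reverses each block-sized chunk of `in` and multiplies it by c_scale.
// Must be launched with blockDim.x == TILE.
__global__ void scaleReverse(const float *in, float *out, int n)
{
    __shared__ float tile[TILE];                     // shared: one copy per block

    int i = blockIdx.x * blockDim.x + threadIdx.x;   // `i` is held in a register
    if (i < n) tile[threadIdx.x] = in[i];            // global -> shared
    __syncthreads();

    // Number of valid elements in this block (the last block may be partial).
    int valid = min((int)blockDim.x, n - (int)(blockIdx.x * blockDim.x));
    if (i < n) out[i] = c_scale * tile[valid - 1 - threadIdx.x];  // shared -> global
}

// Host helper: copies input over, sets the constant, runs the kernel.
void runScaleReverse(const float *h_in, float *h_out, int n, float scale)
{
    size_t bytes = n * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(c_scale, &scale, sizeof(float));   // host -> constant memory

    int blocks = (n + TILE - 1) / TILE;
    scaleReverse<<<blocks, TILE>>>(d_in, d_out, n);

    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
}
```

Calling `runScaleReverse(in, out, 4, 2.0f)` with `in = {1, 2, 3, 4}` should leave `{8, 6, 4, 2}` in `out`.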
Basic allocation workflow
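// Allocate host and device buffers (n, grid, block, and myKernel are defined elsewhere).
float *h_data = (float*)malloc(n * sizeof(float));
float *d_data = nullptr;
cudaMalloc(&d_data, n * sizeof(float));
// Copy the input to the device and launch the kernel.
cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);
myKernel<<<grid, block>>>(d_data, n);
// This copy waits for the kernel to finish before reading the results.
cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
// Release both buffers.
cudaFree(d_data);
free(h_data);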
Transfer costs and batching
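Host-device copies are expensive: each transfer carries a fixed overhead, and the host-device interconnect (typically PCIe) has far less bandwidth than device memory. Minimize the number of transfers and move a few large contiguous chunks rather than many tiny ones.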
- Batch small operations into a larger kernel launch when possible.
- Keep intermediate data on the GPU instead of round-tripping to CPU.
- Use pinned (page-locked) host memory, allocated with `cudaMallocHost`, when you need higher transfer throughput.
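The points above can be sketched as one small pipeline: the input is built directly in pinned memory, copied to the GPU in a single transfer, processed by two kernels without returning to the CPU in between, and copied back once. The kernels `addOne` and `timesTwo` and the driver `runPipeline` are hypothetical names for this illustration, not part of any library.

```cuda
#include <cuda_runtime.h>

__global__ void addOne(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

__global__ void timesTwo(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

// Builds 0..n-1 in pinned memory, computes (x + 1) * 2 on the GPU, and
// returns the sum of the results, or -1.0 on failure. Assumes n > 0.
double runPipeline(int n)
{
    size_t bytes = n * sizeof(float);
    float *h_pinned = nullptr;
    float *d_x = nullptr;

    // Pinned host buffer: the GPU can DMA from it without a staging copy.
    if (cudaMallocHost((void **)&h_pinned, bytes) != cudaSuccess) return -1.0;
    if (cudaMalloc((void **)&d_x, bytes) != cudaSuccess) {
        cudaFreeHost(h_pinned);
        return -1.0;
    }
    for (int i = 0; i < n; ++i) h_pinned[i] = (float)i;

    int block = 256;
    int grid = (n + block - 1) / block;

    cudaMemcpy(d_x, h_pinned, bytes, cudaMemcpyHostToDevice);  // one large copy in
    addOne<<<grid, block>>>(d_x, n);     // intermediate result stays on the GPU
    timesTwo<<<grid, block>>>(d_x, n);
    cudaError_t err =
        cudaMemcpy(h_pinned, d_x, bytes, cudaMemcpyDeviceToHost);  // one copy out

    double sum = 0.0;
    for (int i = 0; i < n; ++i) sum += h_pinned[i];

    cudaFree(d_x);
    cudaFreeHost(h_pinned);   // pinned memory is released with cudaFreeHost
    return (err == cudaSuccess) ? sum : -1.0;
}
```

Pinned memory is a limited resource, since it cannot be paged out, so it is usually reserved for buffers that are transferred often.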
Shared memory for data reuse
If neighboring threads read overlapping data, stage that data in shared memory once, then reuse it. This reduces repeated global memory accesses and can significantly improve throughput.
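A common instance of this pattern is a 1D stencil, where each output needs its neighbors' inputs. The sketch below assumes a fixed block size and radius; `RADIUS`, `BLOCK`, `stencil1D`, and `runStencil` are hypothetical names, and error checking is left out for brevity. Each block loads its tile plus a halo into shared memory once, so every input element is read from global memory about once per block instead of up to `2 * RADIUS + 1` times.

```cuda
#include <cuda_runtime.h>

constexpr int RADIUS = 2;     // neighbors on each side
constexpr int BLOCK  = 128;   // threads per block (assumed launch size)

// out[i] = sum of in[i-RADIUS .. i+RADIUS]; out-of-range inputs count as 0.
// Must be launched with blockDim.x == BLOCK.
__global__ void stencil1D(const float *in, float *out, int n)
{
    __shared__ float tile[BLOCK + 2 * RADIUS];

    int g = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int s = threadIdx.x + RADIUS;                   // index inside the tile

    // Each thread stages its own element.
    tile[s] = (g < n) ? in[g] : 0.0f;

    // The first RADIUS threads also stage the left and right halos.
    if (threadIdx.x < RADIUS) {
        int left  = g - RADIUS;
        int right = g + BLOCK;
        tile[s - RADIUS] = (left >= 0 && left < n) ? in[left] : 0.0f;
        tile[s + BLOCK]  = (right < n) ? in[right] : 0.0f;
    }
    __syncthreads();   // the tile must be complete before anyone reads it

    if (g < n) {
        float sum = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k) sum += tile[s + k];
        out[g] = sum;
    }
}

// Host helper: runs the stencil on h_in and stores the result in h_out.
void runStencil(const float *h_in, float *h_out, int n)
{
    size_t bytes = n * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    int grid = (n + BLOCK - 1) / BLOCK;
    stencil1D<<<grid, BLOCK>>>(d_in, d_out, n);

    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
}
```

Note that no thread returns before `__syncthreads()`; the bounds checks guard the loads and the final store instead, so every thread in the block reaches the barrier.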
Memory safety checklist
- Always check return values from `cudaMalloc` and `cudaMemcpy`.
- Use bounds checks in kernels to avoid out-of-range writes.
- Call `cudaGetLastError()` after kernel launches in debug workflows, because the launch itself returns no status.
- Pair every `cudaMalloc` with exactly one `cudaFree`.
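One way to apply the checklist is sketched below, assuming a hypothetical kernel `square` and host wrapper `squareOnDevice`. Every runtime call's return value is inspected, the kernel guards its write with a bounds check, the launch is followed by `cudaGetLastError()`, and the single `cudaFree` sits on a path that every outcome after a successful allocation passes through.

```cuda
#include <cuda_runtime.h>

// Bounds-checked kernel: threads past the end of the array do nothing.
__global__ void square(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * data[i];
}

// Squares h_data in place on the GPU and returns the first error seen.
// Assumes n > 0 (a grid size of zero is an invalid launch configuration).
cudaError_t squareOnDevice(float *h_data, int n)
{
    size_t bytes = n * sizeof(float);
    float *d_data = nullptr;

    cudaError_t err = cudaMalloc(&d_data, bytes);
    if (err != cudaSuccess) return err;        // nothing allocated, nothing to free

    err = cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        int block = 256;
        int grid = (n + block - 1) / block;
        square<<<grid, block>>>(d_data, n);
        err = cudaGetLastError();              // catches launch errors
    }

    if (err == cudaSuccess) {
        // A blocking copy also reports errors raised while the kernel ran.
        err = cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    }

    cudaFree(d_data);                          // exactly one free per cudaMalloc
    return err;
}
```

A caller can turn a failure into a readable message with `cudaGetErrorString(err)`.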