# CUDA Guide

Applies to: CUDA 11+, GPU computing, deep learning, scientific computing, HPC
## Core Principles

- **Parallelism First**: Design algorithms for thousands of concurrent threads; serial thinking is the primary enemy of GPU performance.
- **Memory Hierarchy Awareness**: Global memory is roughly 100x slower than shared memory and 1000x slower than registers; every kernel design starts with memory access planning.
- **Coalesced Access**: Adjacent threads must access adjacent memory addresses; a single misaligned access pattern can reduce bandwidth by up to 32x.
- **Occupancy Over Cleverness**: Maximize active warps per SM by managing register count, shared memory usage, and block dimensions together.
- **Minimize Host-Device Transfers**: PCIe bandwidth is the bottleneck; overlap transfers with computation using streams and pinned memory.
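The last principle has a standard concrete shape: pinned host memory plus two streams lets the copy engines move one chunk while the SMs compute another. A minimal double-buffered sketch, assuming evenly divisible chunking; the `process` kernel is a placeholder for illustration:

```cuda
#include <cuda_runtime.h>
#include <cstring>

// Placeholder kernel for illustration.
__global__ void process(float* d, int chunk) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < chunk) d[i] *= 2.0f;
}

// Double-buffered pipeline: copy chunk c+1 while chunk c computes.
void pipeline(float* h_data, int n, int chunks) {
    int chunk = n / chunks;                // assumes n % chunks == 0
    size_t bytes = chunk * sizeof(float);
    float* h_pinned;
    float* d_buf[2];
    cudaStream_t stream[2];

    cudaMallocHost(&h_pinned, n * sizeof(float));  // pinned: required for async overlap
    memcpy(h_pinned, h_data, n * sizeof(float));
    for (int s = 0; s < 2; ++s) {
        cudaMalloc(&d_buf[s], bytes);
        cudaStreamCreate(&stream[s]);
    }

    for (int c = 0; c < chunks; ++c) {
        int s = c % 2;                     // alternate buffers/streams
        float* h_chunk = h_pinned + c * chunk;
        cudaMemcpyAsync(d_buf[s], h_chunk, bytes, cudaMemcpyHostToDevice, stream[s]);
        process<<<(chunk + 255) / 256, 256, 0, stream[s]>>>(d_buf[s], chunk);
        cudaMemcpyAsync(h_chunk, d_buf[s], bytes, cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();
    memcpy(h_data, h_pinned, n * sizeof(float));

    for (int s = 0; s < 2; ++s) {
        cudaStreamDestroy(stream[s]);
        cudaFree(d_buf[s]);
    }
    cudaFreeHost(h_pinned);
}
```

Reusing a buffer every other chunk is safe here because work within a single stream executes in issue order, so the previous device-to-host copy completes before the next host-to-device copy into the same buffer begins.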
## Guardrails

### Error Checking

- ALWAYS check CUDA API return values with a macro wrapper.
- ALWAYS call `cudaGetLastError()` after every kernel launch.
- ALWAYS call `cudaDeviceSynchronize()` before reading kernel results on the host.
- Use `compute-sanitizer` (the successor to `cuda-memcheck`) in development builds.
- Handle `cudaErrorMemoryAllocation` gracefully; never assume GPU memory is infinite.
```cuda
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = call;                                       \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err));     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

#define CUDA_CHECK_KERNEL()                                           \
    do {                                                              \
        cudaError_t err = cudaGetLastError();                         \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "Kernel launch error at %s:%d: %s\n",     \
                    __FILE__, __LINE__, cudaGetErrorString(err));     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)
```
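Typical use of both macros around an allocation and a launch; `my_kernel`, `grid_size`, and `block_size` below are placeholders for illustration:

```cuda
float* d_data = nullptr;
CUDA_CHECK(cudaMalloc(&d_data, n * sizeof(float)));

my_kernel<<<grid_size, block_size>>>(d_data, n);
CUDA_CHECK_KERNEL();                    // catches invalid launch configurations
CUDA_CHECK(cudaDeviceSynchronize());    // surfaces errors raised during execution

CUDA_CHECK(cudaFree(d_data));
```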
### Memory Management

- Pair every `cudaMalloc` with a `cudaFree`; prefer RAII wrappers in C++ host code.
- Use `cudaMallocManaged` (Unified Memory) for prototyping; switch to explicit transfers for production.
- Use `cudaMallocHost` (pinned memory) when streaming data to the GPU; pageable memory transfers cannot overlap with compute.
- Prefer `cudaMemcpyAsync` with streams over synchronous `cudaMemcpy`.
- Never access device pointers from host code or host pointers from device code (except Unified Memory).
- Call `cudaMemset` or `cudaMemsetAsync` to zero-initialize device buffers.
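The `cudaMalloc`/`cudaFree` pairing can be enforced mechanically with a small RAII owner in host C++ code. A minimal sketch, not a production allocator:

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <stdexcept>

// Owns one device allocation; frees it on scope exit.
template <typename T>
class DeviceBuffer {
public:
    explicit DeviceBuffer(std::size_t count) : count_(count) {
        if (cudaMalloc(&ptr_, count * sizeof(T)) != cudaSuccess)
            throw std::runtime_error("cudaMalloc failed");
    }
    ~DeviceBuffer() { cudaFree(ptr_); }
    DeviceBuffer(const DeviceBuffer&) = delete;             // prevent double-free
    DeviceBuffer& operator=(const DeviceBuffer&) = delete;
    T* get() { return ptr_; }
    std::size_t size() const { return count_; }
private:
    T* ptr_ = nullptr;
    std::size_t count_;
};
```

Usage: `DeviceBuffer<float> d_x(n);` then pass `d_x.get()` to kernels; the buffer is released even if an exception unwinds the stack.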
### Kernel Design

- Block size must be a multiple of the warp size (32); prefer 128, 256, or 512.
- Calculate grid size as `(n + block_size - 1) / block_size`.
- Always include bounds checking: `if (idx < n)` at the top of every kernel.
- Use grid-stride loops for kernels that must handle arbitrary data sizes.
- Document thread mapping: which dimension maps to which data axis.
- Mark device-only helpers as `__device__`, host+device helpers as `__host__ __device__`.
```cuda
// Grid-stride loop: works with any grid size, any data size
__global__ void saxpy(float a, const float* x, float* y, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x) {
        y[i] = a * x[i] + y[i];
    }
}
```
### Synchronization

- Use `__syncthreads()` after every shared memory write before any thread reads another thread's value.
- Never place `__syncthreads()` inside a conditional branch that not all threads in a block will reach (deadlock).
- Use `__syncwarp()` (CUDA 9+) for warp-level synchronization instead of relying on implicit warp-synchronous execution.
- Use `cudaDeviceSynchronize()` sparingly in production; prefer stream synchronization with `cudaStreamSynchronize()`.
- Use CUDA events (`cudaEventRecord` / `cudaEventSynchronize`) for fine-grained inter-stream ordering.
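The conditional-barrier rule is easiest to see side by side. In this sketch, `shared_buf` and `compute` are placeholders standing in for real shared memory and real work:

```cuda
// BAD: threads with threadIdx.x >= 64 never reach the barrier -> deadlock.
if (threadIdx.x < 64) {
    shared_buf[threadIdx.x] = compute(threadIdx.x);
    __syncthreads();
}

// GOOD: the work stays conditional, but every thread hits the same barrier.
if (threadIdx.x < 64) {
    shared_buf[threadIdx.x] = compute(threadIdx.x);
}
__syncthreads();
```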
### Performance

- Profile before optimizing: use Nsight Compute for kernel analysis, Nsight Systems for a system-level view.
- Target >50% theoretical occupancy; use the CUDA Occupancy Calculator to tune block dimensions.
- Aim for >60% of peak memory bandwidth in memory-bound kernels.
- Avoid warp divergence: ensure threads within a warp take the same branch when possible.
- Prefer `float` over `double` on consumer GPUs (at least a 2x throughput difference).
- Minimize atomic operations on global memory; use shared memory atomics with a final reduction.
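The last bullet is the standard histogram pattern: contended atomics land in cheap shared memory, and only one global atomic per bin per block remains. A sketch assuming 256 bins and a block size of at least 256 threads:

```cuda
#include <cuda_runtime.h>

// Per-block histogram in shared memory, merged into the global histogram at the end.
__global__ void histogram256(const unsigned char* data, int n,
                             unsigned int* global_hist) {
    __shared__ unsigned int local_hist[256];
    int tid = threadIdx.x;
    if (tid < 256) local_hist[tid] = 0;
    __syncthreads();

    // Grid-stride pass: shared-memory atomics absorb the contention.
    for (int i = blockIdx.x * blockDim.x + tid; i < n;
         i += blockDim.x * gridDim.x) {
        atomicAdd(&local_hist[data[i]], 1u);
    }
    __syncthreads();

    // One global atomic per bin per block instead of one per element.
    if (tid < 256) atomicAdd(&global_hist[tid], local_hist[tid]);
}
```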
## Memory Hierarchy

Understanding the memory hierarchy is the single most important factor in CUDA performance.

| Memory Type | Scope  | Latency (cycles) | Size               | Cached          | Read/Write |
|-------------|--------|------------------|--------------------|-----------------|------------|
| Registers   | Thread | 1                | ~255 per thread    | N/A             | R/W        |
| Shared      | Block  | ~5               | 48-164 KB per SM   | N/A             | R/W        |
| L1 Cache    | SM     | ~28              | 48-192 KB per SM   | Auto            | R          |
| L2 Cache    | Device | ~200             | 4-40 MB            | Auto            | R/W        |
| Global      | Device | ~400-600         | 4-80 GB (HBM/GDDR) | Yes             | R/W        |
| Constant    | Device | ~5 (cached)      | 64 KB              | Yes (broadcast) | R          |
| Texture     | Device | ~400 (cached)    | Global pool        | Yes (spatial)   | R          |
Decision guide:

- Data reused within a thread -> registers (automatic via local variables)
- Data shared across threads in a block -> shared memory
- Read-only data broadcast to all threads -> constant memory
- Large read-only data with spatial locality -> texture memory
- Everything else -> global memory with coalesced access patterns
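The constant-memory row looks like this in practice. A sketch with an assumed 9-tap 1-D filter; `c_coeffs` and `conv1d` are illustrative names:

```cuda
#include <cuda_runtime.h>

// Broadcast-read coefficients live in constant memory (64 KB limit applies).
__constant__ float c_coeffs[9];

__global__ void conv1d(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n - 8) {
        float acc = 0.0f;
        for (int k = 0; k < 9; ++k)
            acc += c_coeffs[k] * in[i + k];  // same address across the warp -> broadcast
        out[i] = acc;
    }
}

// Host side: upload once before launching.
// cudaMemcpyToSymbol(c_coeffs, h_coeffs, 9 * sizeof(float));
```

The constant cache broadcasts efficiently only when all threads in a warp read the same address, as the loop index `k` guarantees here; divergent constant reads serialize.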
## Key Patterns

### Kernel Launch Configuration

```cuda
// Query the device for an optimal configuration
void launch_optimized(const float* input, float* output, int n) {
    int block_size;
    int min_grid_size;

    // Let the runtime suggest the block size for maximum occupancy
    cudaOccupancyMaxPotentialBlockSize(
        &min_grid_size, &block_size, my_kernel, 0, n);

    int grid_size = (n + block_size - 1) / block_size;
    my_kernel<<<grid_size, block_size>>>(input, output, n);
    CUDA_CHECK_KERNEL();
}
```
### Coalesced Memory Access

```cuda
// BAD: Strided access -- adjacent threads access non-adjacent memory.
// Each warp issues 32 separate memory transactions.
__global__ void transpose_naive(const float* in, float* out, int W, int H) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < W && y < H) {
        out[x * H + y] = in[y * W + x];  // Write is strided
    }
}

// GOOD: Use shared memory to coalesce both reads and writes.
__global__ void transpose_coalesced(const float* in, float* out, int W, int H) {
    __shared__ float tile[32][33];  // +1 padding avoids bank conflicts

    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    if (x < W && y < H) {
        tile[threadIdx.y][threadIdx.x] = in[y * W + x];  // Coalesced read
    }
    __syncthreads();

    x = blockIdx.y * 32 + threadIdx.x;
    y = blockIdx.x * 32 + threadIdx.y;
    if (x < H && y < W) {
        out[y * H + x] = tile[threadIdx.x][threadIdx.y];  // Coalesced write
    }
}
```
### Shared Memory Tiling

```cuda
// Dot product of two vectors using a shared memory reduction
__global__ void dot_product(const float* a, const float* b,
                            float* result, int n) {
    __shared__ float cache[256];

    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread computes its partial sum via grid-stride
    float partial = 0.0f;
    for (int i = idx; i < n; i += blockDim.x * gridDim.x) {
        partial += a[i] * b[i];
    }
    cache[tid] = partial;
    __syncthreads();

    // Tree reduction in shared memory
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            cache[tid] += cache[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0) {
        atomicAdd(result, cache[0]);
    }
}
```
### Warp-Level Primitives (CUDA 9+)

```cuda
// Warp-level reduction using shuffle instructions -- no shared memory needed
__device__ float warp_reduce_sum(float val) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2) {
        val += __shfl_down_sync(0xFFFFFFFF, val, offset);
    }
    return val;
}

// Block-level reduction combining warp shuffles and shared memory
__device__ float block_reduce_sum(float val) {
    __shared__ float warp_sums[32];  // One slot per warp (max 32 warps/block)

    int lane = threadIdx.x % warpSize;
    int warp_id = threadIdx.x / warpSize;

    val = warp_reduce_sum(val);
    if (lane == 0) {
        warp_sums[warp_id] = val;
    }
    __syncthreads();

    // First warp reduces the warp sums
    int num_warps = (blockDim.x + warpSize - 1) / warpSize;
    val = (threadIdx.x < num_warps) ? warp_sums[threadIdx.x] : 0.0f;
    if (warp_id == 0) {
        val = warp_reduce_sum(val);
    }
    return val;
}
```
## Performance

### Occupancy Calculator

```cuda
// Query occupancy at runtime for tuning
void report_occupancy() {
    int block_size = 256;
    int num_blocks;

    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &num_blocks, my_kernel, block_size, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int active_warps = num_blocks * (block_size / prop.warpSize);
    int max_warps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    float occupancy = (float)active_warps / max_warps;
    printf("Occupancy: %.1f%% (%d/%d warps)\n",
           occupancy * 100, active_warps, max_warps);
}
```
### Nsight Profiling Workflow

```sh
# System-level trace: find CPU/GPU idle gaps, stream concurrency
nsys profile -o trace ./program
nsys stats trace.nsys-rep

# Kernel-level analysis: roofline, memory throughput, occupancy
ncu --set full -o kernel_report ./program
ncu -i kernel_report.ncu-rep   # Or open in the Nsight Compute GUI

# Quick single-metric check
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed ./program
```
### Memory Bandwidth Measurement

```cuda
// Measure the effective bandwidth of a kernel.
// Assumes d_in/d_out are allocated and grid/block are configured elsewhere.
void measure_bandwidth(int n) {
    size_t bytes = 2 * (size_t)n * sizeof(float);  // Read A + write B

    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));

    CUDA_CHECK(cudaEventRecord(start));
    copy_kernel<<<grid, block>>>(d_in, d_out, n);
    CUDA_CHECK(cudaEventRecord(stop));
    CUDA_CHECK(cudaEventSynchronize(stop));

    float ms = 0;
    CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));
    float gb_per_sec = bytes / (ms * 1e6);  // bytes / (seconds * 1e9)
    printf("Effective bandwidth: %.2f GB/s\n", gb_per_sec);

    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
}
```
## Tooling

### Essential Commands

```sh
# Compile CUDA code
nvcc -arch=sm_80 -O3 -o program main.cu           # Single file
nvcc -arch=native -lineinfo -o program main.cu    # With debug line info

# CMake build
cmake -B build -DCMAKE_CUDA_ARCHITECTURES="70;80;86"
cmake --build build -j$(nproc)

# Runtime debugging
compute-sanitizer ./program                       # Memory errors (replaces cuda-memcheck)
compute-sanitizer --tool racecheck ./program      # Shared memory race conditions
compute-sanitizer --tool initcheck ./program      # Uninitialized device memory reads
compute-sanitizer --tool synccheck ./program      # Synchronization errors

# Profiling
nsys profile ./program                            # System-level timeline
ncu ./program                                     # Kernel-level metrics
ncu --kernel-name my_kernel --launch-skip 2 --launch-count 1 ./program

# Device info
nvidia-smi                                        # GPU status and memory usage
nvcc --version                                    # CUDA compiler version
```
### CMakeLists.txt Template

```cmake
cmake_minimum_required(VERSION 3.18)
project(myproject LANGUAGES CXX CUDA)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CUDA_STANDARD 17)
set(CMAKE_CUDA_ARCHITECTURES 70 80 86)
set(CMAKE_CUDA_SEPARABLE_COMPILATION ON)

find_package(CUDAToolkit REQUIRED)

add_library(kernels src/kernels.cu)
target_include_directories(kernels PUBLIC include)

add_executable(main src/main.cpp)
target_link_libraries(main kernels CUDA::cudart)

enable_testing()
add_executable(tests tests/test_kernels.cu)
target_link_libraries(tests kernels CUDA::cudart)
add_test(NAME gpu_tests COMMAND tests)
```
## References

For detailed patterns and examples, see:

- `references/patterns.md` -- tiled matrix multiply, parallel reduction tree, stream overlap pipeline
### External References

- CUDA C++ Programming Guide
- CUDA C++ Best Practices Guide
- Nsight Compute Documentation
- Nsight Systems Documentation
- Thrust Documentation
- CUDA Samples
- NVIDIA Occupancy Calculator
- GPU Memory Hierarchy (GTC talk)