
CUDA Guide

Applies to: CUDA 11+, GPU Computing, Deep Learning, Scientific Computing, HPC

Core Principles

  • Parallelism First: Design algorithms for thousands of concurrent threads; serial thinking is the primary enemy of GPU performance

  • Memory Hierarchy Awareness: Global memory is 100x slower than shared memory and 1000x slower than registers; every kernel design starts with memory access planning

  • Coalesced Access: Adjacent threads must access adjacent memory addresses; a single misaligned access pattern can reduce bandwidth by 32x

  • Occupancy Over Cleverness: Maximize active warps per SM by managing register count, shared memory usage, and block dimensions together

  • Minimize Host-Device Transfers: PCIe bandwidth is the bottleneck; overlap transfers with computation using streams and pinned memory

Guardrails

Error Checking

  • ALWAYS check CUDA API return values with a macro wrapper

  • ALWAYS call cudaGetLastError() after every kernel launch

  • ALWAYS call cudaDeviceSynchronize() before reading kernel results on the host

  • Use compute-sanitizer (successor to cuda-memcheck) in development builds

  • Handle cudaErrorMemoryAllocation gracefully; never assume GPU memory is infinite

```cpp
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err));     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

#define CUDA_CHECK_KERNEL()                                           \
    do {                                                              \
        cudaError_t err = cudaGetLastError();                         \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "Kernel launch error at %s:%d: %s\n",     \
                    __FILE__, __LINE__, cudaGetErrorString(err));     \
        exit(EXIT_FAILURE);                                           \
        }                                                             \
    } while (0)
```
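
A minimal usage sketch; my_kernel and its launch dimensions are placeholders, not part of the macros:

```cpp
// Hypothetical usage: allocate, launch, synchronize -- every call checked.
float* d_buf = nullptr;
CUDA_CHECK(cudaMalloc(&d_buf, 1024 * sizeof(float)));
my_kernel<<<grid, block>>>(d_buf);        // placeholder kernel and launch dims
CUDA_CHECK_KERNEL();                      // catches launch-configuration errors
CUDA_CHECK(cudaDeviceSynchronize());      // surfaces errors from kernel execution
CUDA_CHECK(cudaFree(d_buf));
```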

Memory Management

  • Pair every cudaMalloc with a cudaFree; prefer RAII wrappers in C++ host code

  • Use cudaMallocManaged (Unified Memory) for prototyping; switch to explicit transfers for production

  • Use cudaMallocHost (pinned memory) when streaming data to the GPU; pageable memory cannot overlap with compute

  • Prefer cudaMemcpyAsync with streams over synchronous cudaMemcpy (see the sketch after this list)

  • Never access device pointers from host code or host pointers from device code (except Unified Memory)

  • Call cudaMemset or cudaMemsetAsync to zero-initialize device buffers
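
A minimal sketch of the pinned-memory streaming pattern described above; process_chunk, the chunk size, and the staging layout are illustrative assumptions, not a fixed API:

```cpp
// Sketch: double-buffered host-to-device streaming over two CUDA streams.
// Assumes a placeholder kernel process_chunk(float*, int).
void stream_chunks(const float* src, int num_chunks) {
    const int CHUNK = 1 << 20;
    float* h_staging;
    CUDA_CHECK(cudaMallocHost(&h_staging, 2 * CHUNK * sizeof(float))); // pinned

    float* d_buf[2];
    cudaStream_t streams[2];
    for (int i = 0; i < 2; ++i) {
        CUDA_CHECK(cudaMalloc(&d_buf[i], CHUNK * sizeof(float)));
        CUDA_CHECK(cudaStreamCreate(&streams[i]));
    }

    int block = 256;
    int grid = (CHUNK + block - 1) / block;
    for (int c = 0; c < num_chunks; ++c) {
        int s = c % 2; // alternate buffers so copy and compute can overlap
        // Wait until this slot's previous transfer finished before reusing it
        CUDA_CHECK(cudaStreamSynchronize(streams[s]));
        memcpy(h_staging + s * CHUNK, src + c * CHUNK, CHUNK * sizeof(float));
        CUDA_CHECK(cudaMemcpyAsync(d_buf[s], h_staging + s * CHUNK,
                                   CHUNK * sizeof(float),
                                   cudaMemcpyHostToDevice, streams[s]));
        process_chunk<<<grid, block, 0, streams[s]>>>(d_buf[s], CHUNK);
        CUDA_CHECK_KERNEL();
    }

    for (int i = 0; i < 2; ++i) {
        CUDA_CHECK(cudaStreamSynchronize(streams[i]));
        CUDA_CHECK(cudaStreamDestroy(streams[i]));
        CUDA_CHECK(cudaFree(d_buf[i]));
    }
    CUDA_CHECK(cudaFreeHost(h_staging));
}
```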

Kernel Design

  • Block size must be a multiple of warp size (32); prefer 128, 256, or 512

  • Calculate grid size as (n + block_size - 1) / block_size

  • Always include bounds checking: if (idx < n) at the top of every kernel

  • Use grid-stride loops for kernels that must handle arbitrary data sizes

  • Document thread mapping: which dimension maps to which data axis

  • Mark device-only helpers as __device__, host+device helpers as __host__ __device__

```cpp
// Grid-stride loop: works with any grid size, any data size
__global__ void saxpy(float a, const float* x, float* y, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x) {
        y[i] = a * x[i] + y[i];
    }
}
```

Synchronization

  • Use __syncthreads() after every shared memory write before any thread reads another thread's value

  • Never place __syncthreads() inside a conditional branch that not all threads in a block will reach (deadlock)

  • Use __syncwarp() (CUDA 9+) for warp-level synchronization instead of relying on implicit warp-synchronous execution

  • Use cudaDeviceSynchronize() sparingly in production; prefer stream synchronization with cudaStreamSynchronize()

  • Use CUDA events (cudaEventRecord / cudaEventSynchronize) for fine-grained inter-stream ordering, as in the sketch below
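
A minimal event-ordering sketch; the producer/consumer kernels, grid, block, and d_buf are placeholders:

```cpp
// Sketch: make stream_b wait on work submitted to stream_a
// without blocking the host.
cudaStream_t stream_a, stream_b;
cudaEvent_t produced;
CUDA_CHECK(cudaStreamCreate(&stream_a));
CUDA_CHECK(cudaStreamCreate(&stream_b));
CUDA_CHECK(cudaEventCreate(&produced));

producer_kernel<<<grid, block, 0, stream_a>>>(d_buf);   // placeholder kernel
CUDA_CHECK(cudaEventRecord(produced, stream_a));        // marks a point in stream_a
CUDA_CHECK(cudaStreamWaitEvent(stream_b, produced, 0)); // stream_b waits on the GPU
consumer_kernel<<<grid, block, 0, stream_b>>>(d_buf);   // placeholder kernel
CUDA_CHECK_KERNEL();
```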

Performance

  • Profile before optimizing: use Nsight Compute for kernel analysis, Nsight Systems for system-level view

  • Target >50% theoretical occupancy; use the CUDA Occupancy Calculator to tune block dimensions

  • Aim for >60% of peak memory bandwidth in memory-bound kernels

  • Avoid warp divergence: ensure threads within a warp take the same branch when possible

  • Prefer float over double on consumer GPUs (2x throughput difference)

  • Minimize atomic operations on global memory; use shared memory atomics with a final reduction, as in the histogram sketch below
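
A sketch of that shared-memory-first atomics pattern, using a 256-bin byte histogram as the example (the bin count and input type are illustrative assumptions):

```cpp
// Sketch: accumulate into a per-block shared-memory histogram, then
// merge once per bin per block with a global atomicAdd.
#define NUM_BINS 256

__global__ void histogram(const unsigned char* data, int n,
                          unsigned int* global_hist) {
    __shared__ unsigned int local_hist[NUM_BINS];
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        local_hist[i] = 0;
    __syncthreads();

    // Cheap shared-memory atomics on the hot path
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        atomicAdd(&local_hist[data[i]], 1u);
    __syncthreads();

    // One global atomic per bin per block instead of one per element
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        atomicAdd(&global_hist[i], local_hist[i]);
}
```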

Memory Hierarchy

Understanding the memory hierarchy is the single most important factor in CUDA performance.

| Memory Type | Scope | Latency (cycles) | Size | Cached | Read/Write |
|---|---|---|---|---|---|
| Registers | Thread | 1 | ~255 per thread | N/A | R/W |
| Shared | Block | ~5 | 48-164 KB per SM | N/A | R/W |
| L1 Cache | SM | ~28 | 48-192 KB per SM | Auto | R |
| L2 Cache | Device | ~200 | 4-40 MB | Auto | R/W |
| Global | Device | ~400-600 | 4-80 GB (HBM/GDDR) | Yes | R/W |
| Constant | Device | ~5 (cached) | 64 KB | Yes (broadcast) | R |
| Texture | Device | ~400 (cached) | Global pool | Yes (spatial) | R |

Decision guide:

  • Data reused within a thread -> registers (automatic via local variables)

  • Data shared across threads in a block -> shared memory

  • Read-only data broadcast to all threads -> constant memory (example after this list)

  • Large read-only data with spatial locality -> texture memory

  • Everything else -> global memory with coalesced access patterns
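
A minimal constant-memory sketch under the broadcast assumption; the kernel, coefficient count, and names are illustrative:

```cpp
// Sketch: coefficients live in constant memory; every thread in a warp
// reads the same element each iteration, so the read is a broadcast.
__constant__ float c_coeffs[8];

__global__ void polyval(const float* x, float* y, int n, int degree) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float acc = c_coeffs[degree];        // requires degree < 8
        for (int j = degree - 1; j >= 0; --j)
            acc = acc * x[i] + c_coeffs[j];  // Horner's rule
        y[i] = acc;
    }
}

// Host side: upload coefficients before launch
// CUDA_CHECK(cudaMemcpyToSymbol(c_coeffs, h_coeffs, sizeof(c_coeffs)));
```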

Key Patterns

Kernel Launch Configuration

```cpp
// Query the runtime for an optimal launch configuration
void launch_optimized(const float* input, float* output, int n) {
    int block_size;
    int min_grid_size;

    // Let the runtime suggest an optimal block size for maximum occupancy
    cudaOccupancyMaxPotentialBlockSize(
        &min_grid_size, &block_size, my_kernel, 0, n);

    int grid_size = (n + block_size - 1) / block_size;
    my_kernel<<<grid_size, block_size>>>(input, output, n);
    CUDA_CHECK_KERNEL();
}
```

Coalesced Memory Access

```cpp
// BAD: Strided access -- adjacent threads access non-adjacent memory.
// Each warp issues 32 separate memory transactions on the write.
__global__ void transpose_naive(const float* in, float* out, int W, int H) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < W && y < H) {
        out[x * H + y] = in[y * W + x]; // Write is strided
    }
}

// GOOD: Use shared memory to coalesce both reads and writes
__global__ void transpose_coalesced(const float* in, float* out, int W, int H) {
    __shared__ float tile[32][33]; // +1 padding avoids bank conflicts

    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;

    if (x < W && y < H) {
        tile[threadIdx.y][threadIdx.x] = in[y * W + x]; // Coalesced read
    }
    __syncthreads();

    x = blockIdx.y * 32 + threadIdx.x;
    y = blockIdx.x * 32 + threadIdx.y;

    if (x < H && y < W) {
        out[y * H + x] = tile[threadIdx.x][threadIdx.y]; // Coalesced write
    }
}
```
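
A launch sketch for the tiled version, which assumes 32x32 thread blocks to match the tile; d_in and d_out are pre-allocated device buffers:

```cpp
dim3 block(32, 32);                       // must match the 32x32 tile
dim3 grid((W + 31) / 32, (H + 31) / 32);  // round up to cover the matrix
transpose_coalesced<<<grid, block>>>(d_in, d_out, W, H);
CUDA_CHECK_KERNEL();
```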

Shared Memory Tiling

```cpp
// Dot product of two vectors using shared memory reduction
__global__ void dot_product(const float* a, const float* b,
                            float* result, int n) {
    __shared__ float cache[256];

    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread computes its partial sum via grid-stride
    float partial = 0.0f;
    for (int i = idx; i < n; i += blockDim.x * gridDim.x) {
        partial += a[i] * b[i];
    }
    cache[tid] = partial;
    __syncthreads();

    // Tree reduction in shared memory
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            cache[tid] += cache[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0) {
        atomicAdd(result, cache[0]);
    }
}
```
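
A launch sketch; note that the hard-coded cache[256] requires exactly 256 threads per block, and the tree reduction assumes a power-of-two block size (d_a, d_b, d_result are pre-allocated device buffers):

```cpp
CUDA_CHECK(cudaMemset(d_result, 0, sizeof(float))); // accumulator starts at zero
int block = 256;                  // must match the cache[256] declaration
int grid = (n + block - 1) / block;
if (grid > 1024) grid = 1024;     // grid-stride loop covers the remainder
dot_product<<<grid, block>>>(d_a, d_b, d_result, n);
CUDA_CHECK_KERNEL();
```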

Warp-Level Primitives (CUDA 9+)

```cpp
// Warp-level reduction using shuffle instructions -- no shared memory needed
__device__ float warp_reduce_sum(float val) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2) {
        val += __shfl_down_sync(0xFFFFFFFF, val, offset);
    }
    return val;
}

// Block-level reduction combining warp shuffles and shared memory
__device__ float block_reduce_sum(float val) {
    __shared__ float warp_sums[32]; // One slot per warp (max 32 warps/block)

    int lane = threadIdx.x % warpSize;
    int warp_id = threadIdx.x / warpSize;

    val = warp_reduce_sum(val);

    if (lane == 0) {
        warp_sums[warp_id] = val;
    }
    __syncthreads();

    // First warp reduces the warp sums
    int num_warps = (blockDim.x + warpSize - 1) / warpSize;
    val = (threadIdx.x < num_warps) ? warp_sums[threadIdx.x] : 0.0f;
    if (warp_id == 0) {
        val = warp_reduce_sum(val);
    }

    return val;
}
```
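
A sketch of block_reduce_sum in use, summing an array with one global atomic per block (the kernel name is illustrative):

```cpp
__global__ void sum_array(const float* in, float* out, int n) {
    float val = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        val += in[i];
    val = block_reduce_sum(val); // every thread in the block participates
    if (threadIdx.x == 0)
        atomicAdd(out, val);     // one global atomic per block
}
```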

Performance

Occupancy Calculator

```cpp
// Query occupancy at runtime for tuning
void report_occupancy() {
    int block_size = 256;
    int num_blocks;

    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &num_blocks, my_kernel, block_size, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int active_warps = num_blocks * (block_size / prop.warpSize);
    int max_warps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    float occupancy = (float)active_warps / max_warps;

    printf("Occupancy: %.1f%% (%d/%d warps)\n",
           occupancy * 100, active_warps, max_warps);
}
```

Nsight Profiling Workflow

```bash
# System-level trace: find CPU/GPU idle gaps, stream concurrency
nsys profile -o trace ./program
nsys stats trace.nsys-rep

# Kernel-level analysis: roofline, memory throughput, occupancy
ncu --set full -o kernel_report ./program
ncu -i kernel_report.ncu-rep  # or open the .ncu-rep in the Nsight Compute GUI

# Quick single-metric check
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed ./program
```

Memory Bandwidth Measurement

```cpp
// Measure effective bandwidth of a copy kernel.
// d_in and d_out are pre-allocated device buffers; grid/block chosen by caller.
void measure_bandwidth(const float* d_in, float* d_out,
                       int grid, int block, int n) {
    size_t bytes = 2ull * n * sizeof(float); // read input + write output

    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));

    CUDA_CHECK(cudaEventRecord(start));
    copy_kernel<<<grid, block>>>(d_in, d_out, n);
    CUDA_CHECK(cudaEventRecord(stop));
    CUDA_CHECK(cudaEventSynchronize(stop));

    float ms = 0;
    CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));

    float gb_per_sec = bytes / (ms * 1e6); // bytes per ms -> GB/s
    printf("Effective bandwidth: %.2f GB/s\n", gb_per_sec);

    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
}
```
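
For completeness, a copy kernel matching what the measurement above assumes: one read and one write per element:

```cpp
__global__ void copy_kernel(const float* in, float* out, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        out[i] = in[i];
}
```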

Tooling

Essential Commands

Compile CUDA code

```bash
nvcc -arch=sm_80 -O3 -o program main.cu        # Single file
nvcc -arch=native -lineinfo -o program main.cu # With debug line info
```

CMake build

```bash
cmake -B build -DCMAKE_CUDA_ARCHITECTURES="70;80;86"
cmake --build build -j$(nproc)
```

Runtime debugging

```bash
compute-sanitizer ./program                  # Memory errors (replaces cuda-memcheck)
compute-sanitizer --tool racecheck ./program # Shared memory race conditions
compute-sanitizer --tool initcheck ./program # Uninitialized device memory reads
compute-sanitizer --tool synccheck ./program # Synchronization errors
```

Profiling

```bash
nsys profile ./program # System-level timeline
ncu ./program          # Kernel-level metrics
ncu --kernel-name my_kernel --launch-skip 2 --launch-count 1 ./program
```

Device info

```bash
nvidia-smi     # GPU status and memory usage
nvcc --version # CUDA compiler version
```

CMakeLists.txt Template

```cmake
cmake_minimum_required(VERSION 3.18)
project(myproject LANGUAGES CXX CUDA)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CUDA_STANDARD 17)
set(CMAKE_CUDA_ARCHITECTURES 70 80 86)
set(CMAKE_CUDA_SEPARABLE_COMPILATION ON)

find_package(CUDAToolkit REQUIRED)

add_library(kernels src/kernels.cu)
target_include_directories(kernels PUBLIC include)

add_executable(main src/main.cpp)
target_link_libraries(main kernels CUDA::cudart)

enable_testing()
add_executable(tests tests/test_kernels.cu)
target_link_libraries(tests kernels CUDA::cudart)
add_test(NAME gpu_tests COMMAND tests)
```

References

For detailed patterns and examples, see:

  • references/patterns.md -- Tiled matrix multiply, parallel reduction tree, stream overlap pipeline

External References

  • CUDA C++ Programming Guide

  • CUDA C++ Best Practices Guide

  • Nsight Compute Documentation

  • Nsight Systems Documentation

  • Thrust Documentation

  • CUDA Samples

  • NVIDIA Occupancy Calculator

  • GPU Memory Hierarchy (GTC talk)
