GPU Programming with CUDA – Part 7: Best Practices & Optimization Patterns for CUDA Development

Oliver White


Let’s move into the final technical section of this CUDA programming series.


Part 7: Best Practices & Optimization Patterns for CUDA Development

GPU programming is about harnessing massive parallelism efficiently. Even small inefficiencies—like poor memory access or divergent threads—can degrade performance drastically. In this part, we’ll explore optimization strategies and proven architectural patterns to help you write high-performance CUDA code.


1. Understanding GPU Occupancy

Occupancy is the ratio of warps actively resident on an SM (Streaming Multiprocessor) to the hardware's maximum warp capacity. High occupancy lets the GPU hide memory latency by switching to ready warps while others wait for data.

You can improve occupancy by:

  • Choosing appropriate block size (typically multiples of 32 threads, since a warp = 32 threads)
  • Minimizing shared memory and register usage per thread
  • Balancing computation vs. memory access

Use:

nvcc --ptxas-options=-v my_kernel.cu

to inspect register usage and occupancy in the compiler output.
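You can also query the runtime directly. The sketch below uses the CUDA occupancy API to ask for a block size that maximizes occupancy; `myKernel` is a placeholder for your own kernel.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data) { /* placeholder kernel body */ }

int main() {
    int minGridSize = 0, blockSize = 0;
    // Ask the runtime for the block size that maximizes occupancy for this kernel.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);

    int blocksPerSm = 0;
    // How many blocks of that size can be resident on one SM at once.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, myKernel, blockSize, 0);

    printf("suggested block size: %d, resident blocks/SM: %d\n", blockSize, blocksPerSm);
    return 0;
}
```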


2. Memory Coalescing

Memory access is one of the most critical performance factors. When threads in a warp access consecutive memory addresses, CUDA coalesces them into a single transaction.

Do:

// Coalesced access
data[threadIdx.x + blockIdx.x * blockDim.x] = value;

🚫 Avoid:

// Strided access (inefficient)
data[threadIdx.x * stride] = value;

Use structure-of-arrays (SoA) instead of array-of-structures (AoS) for better coalescing.
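To make the SoA-vs-AoS point concrete, here is a minimal sketch with a hypothetical particle update. In the AoS layout, consecutive threads touch addresses 12 bytes apart (strided); in the SoA layout, they read consecutive floats (coalesced).

```cuda
// Array-of-structures: fields of one element sit together in memory.
struct ParticleAoS { float x, y, z; };

// Structure-of-arrays: each field gets its own contiguous array.
struct ParticlesSoA { float *x, *y, *z; };

__global__ void scaleAoS(ParticleAoS *p, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i].x *= s;   // stride of sizeof(ParticleAoS) between lanes
}

__global__ void scaleSoA(ParticlesSoA p, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p.x[i] *= s;   // consecutive lanes read consecutive floats
}
```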


3. Minimize Warp Divergence

Threads in a warp execute in lockstep. If they take different branches (if statements), the warp executes each path in turn with the inactive lanes masked off, serializing execution and hurting performance.

Better pattern:

int idx = threadIdx.x + blockIdx.x * blockDim.x;
data[idx] = (idx % 2 == 0) ? evenOp(data[idx]) : oddOp(data[idx]);

Prefer predication or masking over deep branching when possible.
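Another option is to reorganize the work so the branch condition is uniform across each warp. The sketch below contrasts a divergent kernel with one that branches on the warp index instead (all 32 lanes of a warp then agree); `evenOp` and `oddOp` are placeholder device functions, not part of CUDA.

```cuda
__device__ float evenOp(float v) { return v * 2.0f; }  // placeholder op
__device__ float oddOp(float v)  { return v + 1.0f; }  // placeholder op

// Divergent: lanes within one warp alternate branches, so both paths run serially.
__global__ void divergentKernel(float *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        if (idx % 2 == 0) data[idx] = evenOp(data[idx]);
        else              data[idx] = oddOp(data[idx]);
    }
}

// Warp-uniform: branch on the warp index, so every lane in a warp takes the same path.
__global__ void warpUniformKernel(float *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int warpId = idx / 32;
    if (idx < n) {
        if (warpId % 2 == 0) data[idx] = evenOp(data[idx]);
        else                 data[idx] = oddOp(data[idx]);
    }
}
```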


4. Shared Memory Optimization

Shared memory is fast but limited (48 KB per block by default; per-SM capacity varies by architecture). Use it as a software-managed cache for frequently reused data.

Example — Matrix Multiplication with Shared Memory:

__global__ void matMulShared(float *A, float *B, float *C, int N) {
    // Note: assumes N is a multiple of the 16x16 tile size;
    // production code needs bounds checks for arbitrary N.
    __shared__ float As[16][16];
    __shared__ float Bs[16][16];

    int bx = blockIdx.x, by = blockIdx.y;
    int tx = threadIdx.x, ty = threadIdx.y;
    int Row = by * 16 + ty;
    int Col = bx * 16 + tx;
    float Cvalue = 0;

    for (int k = 0; k < N / 16; ++k) {
        // Each thread loads one element of the A and B tiles into shared memory.
        As[ty][tx] = A[Row * N + (k * 16 + tx)];
        Bs[ty][tx] = B[(k * 16 + ty) * N + Col];
        __syncthreads();

        for (int n = 0; n < 16; ++n)
            Cvalue += As[ty][n] * Bs[n][tx];

        __syncthreads();
    }
    C[Row * N + Col] = Cvalue;
}

This version minimizes global memory reads by using shared memory tiling.


5. Asynchronous Execution

Use CUDA Streams to overlap computation and data transfer:

cudaMemcpyAsync(d_data, h_data, size, cudaMemcpyHostToDevice, stream1);
myKernel<<<grid, block, 0, stream1>>>(d_data);

Multiple streams can execute concurrently, improving throughput.
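A fuller sketch of the pattern: split the data across two streams so the copy for one half overlaps the kernel for the other. Names like `myKernel`, `grid`, and `block` are placeholders from the snippet above; note that `cudaMemcpyAsync` only overlaps with compute when the host buffer is pinned (page-locked).

```cuda
cudaStream_t s[2];
float *h_buf, *d_buf;
size_t half = N / 2;                                  // assumes N is even

cudaMallocHost(&h_buf, N * sizeof(float));            // pinned host allocation
cudaMalloc(&d_buf, N * sizeof(float));
for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);

for (int i = 0; i < 2; ++i) {
    float *h = h_buf + i * half, *d = d_buf + i * half;
    cudaMemcpyAsync(d, h, half * sizeof(float), cudaMemcpyHostToDevice, s[i]);
    myKernel<<<grid, block, 0, s[i]>>>(d);            // ordered after its copy, in-stream
    cudaMemcpyAsync(h, d, half * sizeof(float), cudaMemcpyDeviceToHost, s[i]);
}
cudaDeviceSynchronize();                              // wait for both streams
for (int i = 0; i < 2; ++i) cudaStreamDestroy(s[i]);
```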


6. Profiling and Debugging

Use NVIDIA tools:

  • Nsight Systems / Nsight Compute – visualize kernel launches, memory bandwidth, warp stalls.
  • compute-sanitizer (successor to the legacy cuda-memcheck) – detect illegal memory accesses and race conditions.
  • nvprof (legacy) – quick performance overview.

Example:

nsys profile ./my_cuda_app

7. Mixed Precision and Tensor Cores

For AI and numerical workloads, leverage mixed precision (FP16 + FP32) with Tensor Cores to dramatically improve speed.

Example (PyTorch-style):

with torch.cuda.amp.autocast():
    output = model(input)

In native CUDA:

  • Use __half for FP16 data types.
  • Call WMMA APIs for Tensor Core instructions (on Volta+ architectures).
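As a rough sketch of the WMMA API, the kernel below has one warp compute a single 16×16 output tile with Tensor Cores. It assumes Volta or later (compile with e.g. `-arch=sm_70`) and 16×16 row-major inputs; real code would tile this over a full matrix.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes C = A * B for a single 16x16 tile of FP16 inputs,
// accumulating in FP32 on the Tensor Cores.
__global__ void wmmaTile(const half *A, const half *B, float *C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> accFrag;

    wmma::fill_fragment(accFrag, 0.0f);               // zero the accumulator
    wmma::load_matrix_sync(aFrag, A, 16);             // leading dimension = 16
    wmma::load_matrix_sync(bFrag, B, 16);
    wmma::mma_sync(accFrag, aFrag, bFrag, accFrag);   // Tensor Core multiply-accumulate
    wmma::store_matrix_sync(C, accFrag, 16, wmma::mem_row_major);
}
```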

8. Common Optimization Checklist

  • Memory access: coalesce global reads/writes → reduced latency
  • Shared memory: use for data reuse, avoid bank conflicts → faster local caching
  • Threads per block: use 128–1024 (multiples of 32) → higher occupancy
  • Compute intensity: increase arithmetic ops per byte loaded → better throughput
  • Branching: avoid warp divergence → consistent warp execution
  • Streams: overlap compute + transfer → improved concurrency

9. Final Thoughts

Optimizing CUDA code is an iterative process — profile → tune → validate. Each GPU generation (Volta, Ampere, Hopper) brings new architectural improvements like larger L1 caches, faster NVLink, or Tensor Cores, so tuning should be hardware-aware.

A strong mental model of:

  • Thread hierarchy
  • Memory hierarchy
  • Warp execution model

…is the foundation for writing world-class GPU applications.