
Let’s dive into Part 1: Getting Started with CUDA Programming — a hands-on, technical tutorial that takes you from setup to running your first GPU kernel.
🚀 Part 1: Getting Started with CUDA Programming
🎯 Objective
By the end of this tutorial, you’ll:
- Understand the CUDA programming model
- Set up your development environment
- Write, compile, and run your first CUDA program
- Learn how CPU (host) and GPU (device) cooperate
🧠 1. What is CUDA?
CUDA (Compute Unified Device Architecture) is NVIDIA’s platform for parallel programming on GPUs. It allows developers to write programs that offload computationally heavy tasks to the GPU while keeping control logic on the CPU.
| Component | Role |
|---|---|
| Host (CPU) | Runs the main program, manages memory, launches GPU kernels |
| Device (GPU) | Executes data-parallel tasks in thousands of lightweight threads |
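To make this division of labor concrete, here is a minimal sketch of the pattern (the kernel name `helloFromGpu` is chosen here just for illustration): the host launches a kernel, and the device runs it in parallel threads.

```cpp
#include <cstdio>

// Runs on the device (GPU): each thread prints its own ID
__global__ void helloFromGpu() {
    printf("Hello from GPU thread %d\n", threadIdx.x);
}

int main() {
    // Host (CPU) code: launch the kernel on 1 block of 4 threads
    helloFromGpu<<<1, 4>>>();

    // Block until the GPU finishes, so the output appears before exit
    cudaDeviceSynchronize();
    return 0;
}
```

Don't worry about the `<<<...>>>` syntax yet; we'll unpack it below.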
⚙️ 2. Setting Up the Environment
Requirements
- NVIDIA GPU with CUDA support
- Linux or Windows (NVIDIA dropped macOS support after CUDA 10.2)
- CUDA Toolkit (includes the `nvcc` compiler, libraries, and samples)
Installation (Ubuntu example)
```bash
# Update packages
sudo apt update && sudo apt upgrade -y

# Install CUDA Toolkit
sudo apt install nvidia-cuda-toolkit -y

# Verify CUDA installation
nvcc --version
```
Expected output:
```
Cuda compilation tools, release 12.x, V12.x.x
```
If you’re using Windows, install from: 👉 https://developer.nvidia.com/cuda-downloads
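You can also confirm from code that the runtime sees your GPU. Here is a minimal sketch using the runtime API calls `cudaGetDeviceCount` and `cudaGetDeviceProperties` (the messages and loop structure are just for illustration):

```cpp
#include <iostream>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);            // How many CUDA-capable GPUs?
    std::cout << "CUDA devices found: " << count << std::endl;

    for (int i = 0; i < count; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i); // Query properties of device i
        std::cout << "Device " << i << ": " << prop.name
                  << " (compute capability " << prop.major << "." << prop.minor << ")"
                  << std::endl;
    }
    return 0;
}
```

Compile it with `nvcc` just like any other CUDA program.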
🧩 3. The CUDA Programming Model
CUDA divides work hierarchically:
- Grid → Contains multiple Blocks
- Block → Contains multiple Threads
- Thread → Executes one instance of the kernel
🧠 Think of it like this:
A classroom (Grid) has many groups (Blocks), and each student (Thread) solves part of the problem.
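This hierarchy is what makes indexing work: combining the block index, the block size, and the thread index gives every thread a unique global ID. A minimal sketch (the kernel name and sizes are chosen here for illustration):

```cpp
#include <iostream>
#include <cuda_runtime.h>

// Each thread records its own global ID
__global__ void whoAmI(int *out) {
    // Global ID = block number * threads per block + position within the block
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;
    out[globalId] = globalId;
}

int main() {
    const int blocks = 4, threads = 8;  // a grid of 4 blocks, 8 threads each
    int *d_out;
    cudaMalloc(&d_out, blocks * threads * sizeof(int));

    whoAmI<<<blocks, threads>>>(d_out);

    int h_out[blocks * threads];
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    // Thread 3 of block 2 has global ID 2 * 8 + 3 = 19
    std::cout << "h_out[19] = " << h_out[19] << std::endl;  // prints 19
    return 0;
}
```

The same formula, `blockDim.x * blockIdx.x + threadIdx.x`, appears in the vector-addition kernel below.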
💻 4. Your First CUDA Program: Vector Addition
We’ll add two arrays (A and B) on the GPU and store results in C.
File: vector_add.cu
```cpp
#include <iostream>
#include <cuda_runtime.h>

__global__ void vectorAdd(const float *A, const float *B, float *C, int N) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N) {
        C[i] = A[i] + B[i];
    }
}

int main() {
    int N = 1 << 20; // 1M elements
    size_t size = N * sizeof(float);

    // Allocate host memory
    float *h_A = (float*)malloc(size);
    float *h_B = (float*)malloc(size);
    float *h_C = (float*)malloc(size);

    // Initialize input data
    for (int i = 0; i < N; i++) {
        h_A[i] = i * 0.5f;
        h_B[i] = i * 2.0f;
    }

    // Allocate device memory
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);

    // Copy data from host to device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Launch kernel (1024 threads per block)
    int threadsPerBlock = 1024;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Copy result back to host
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Verify result
    for (int i = 0; i < 5; i++) {
        std::cout << "C[" << i << "] = " << h_C[i] << std::endl;
    }

    // Free memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    free(h_A);
    free(h_B);
    free(h_C);

    return 0;
}
```
🧮 5. Compile and Run
```bash
nvcc vector_add.cu -o vector_add
./vector_add
```
Expected output:
```
C[0] = 0
C[1] = 2.5
C[2] = 5
C[3] = 7.5
C[4] = 10
```
✅ Congratulations — you just executed your first CUDA kernel!
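One build detail worth knowing: `nvcc` compiles for a default architecture unless told otherwise. To target your specific GPU, pass the `-arch` flag with that GPU's compute capability (`sm_86` below is just an example for an Ampere-class card; substitute your own):

```bash
# Target a specific GPU architecture (here: compute capability 8.6)
nvcc -arch=sm_86 vector_add.cu -o vector_add
```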
🔍 6. Understanding the Kernel Launch
```cpp
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
```
This line tells the GPU to:
- Launch `blocksPerGrid` blocks
- Run `threadsPerBlock` threads in each block
- Have each thread compute one element of the array
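One caveat: kernel launches are asynchronous and do not report errors directly, and the program above omits error checking for brevity. Here is a minimal sketch of how you might check a launch, using the runtime calls `cudaGetLastError` and `cudaDeviceSynchronize` (this snippet assumes it replaces the launch line inside `main()` of the program above):

```cpp
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

// Catch launch-configuration errors (e.g. too many threads per block)
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
    std::cerr << "Kernel launch failed: " << cudaGetErrorString(err) << std::endl;
}

// Wait for the kernel to finish and surface any execution errors
err = cudaDeviceSynchronize();
if (err != cudaSuccess) {
    std::cerr << "Kernel execution failed: " << cudaGetErrorString(err) << std::endl;
}
```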
🧠 7. Key Concepts Recap
| Concept | Description |
|---|---|
| `__global__` | Marks a function as a CUDA kernel (runs on the GPU) |
| `cudaMalloc` | Allocates memory on the GPU |
| `cudaMemcpy` | Transfers data between CPU and GPU |
| `<<<grid, block>>>` | Kernel launch syntax |
| Thread indexing | `blockIdx.x`, `threadIdx.x`, and `blockDim.x` combine to compute each thread's global ID |
🧩 8. What’s Next?
In Part 2: Threads, Blocks, and Grids, we’ll:
- Dive deeper into parallel execution
- Visualize how threads cooperate
- Implement element-wise vector multiplication
- Learn performance tuning using block sizes

*Figure: CUDA thread hierarchy (a grid contains blocks; each block contains threads).*