gpu_training

Introduction to GPU Programming

Presentation:

You can find the presentation here: https://docs.google.com/presentation/d/1Id7dLpzC0UlsJCJeXdJBvwigQ1aFaDmk0l0Pgkv7bwo/edit?usp=sharing

Hands-on

The CUDA Runtime API reference manual is a very useful source of information: http://docs.nvidia.com/cuda/cuda-runtime-api/index.html

Check that your environment is correctly configured to compile CUDA code by running:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Tue_May_27_02:21:03_PDT_2025
Cuda compilation tools, release 12.9, V12.9.86
Build cuda_12.9.r12.9/compiler.36037853_0
Then clone the exercises repository:

git clone https://github.com/felicepantaleo/gpu_training.git

Compile and run the deviceQuery application:

$ cd hands-on/utils/deviceQuery
$ make

It reports useful information about the features and limits of the device you will be running your code on. For example:

$ ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Tesla V100-SXM2-32GB"
  CUDA Driver Version / Runtime Version          12.9 / 12.9
  CUDA Capability Major/Minor version number:    7.0
  Total amount of global memory:                 32494 MBytes (34072559616 bytes)
  (80) Multiprocessors, ( 64) CUDA Cores/MP:     5120 CUDA Cores
  GPU Max Clock rate:                            1530 MHz (1.53 GHz)
  Memory Clock rate:                             877 Mhz
  Memory Bus Width:                              4096-bit
  L2 Cache Size:                                 6291456 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        98304 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 5
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.9, CUDA Runtime Version = 12.9, NumDevs = 1
Result = PASS

Some of you are sharing the same machine, so time measurements can be influenced by other users running at the same moment. It may be necessary to repeat time measurements several times.

GPU Training — Hands‑On Exercises

These short, self‑contained labs walk students from CUDA basics to a tiny predator–prey simulation. Every exercise follows the same workflow:

1. Fill in all  ►►► TODO ◄◄◄  and/or kernel bodies.
2. Build with the line shown in the banner comment.
3. Run → the program prints “…PASSED 🎉” when assertions succeed.

All code lives under gpu_training/ and compiles on CUDA 12 + gcc 13 (Ubuntu 22.04).


Exercise 1 – CUDA Memory Model

Goal – feel the separation between CPU (host) and GPU (device) address spaces.

  1. cudaMallocAsync two device buffers d_a and d_b.
  2. Copy host array h_a → d_a.
  3. Copy device array d_a → d_b (device‑to‑device).
  4. Copy back d_b → h_a.
  5. Free d_a and d_b with cudaFreeAsync.

Compile & run:
nvcc -std=c++20 memory_model.cu -o ex01
./ex01   # prints “Exercise 1 – memory model: PASSED 🎉”
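
A minimal sketch of these five steps, assuming the default memory pool and omitting error checking (the exercise file provides its own scaffolding):

    // Sketch of the five steps above; error checking is omitted for brevity.
    #include <cuda_runtime.h>
    #include <cassert>
    #include <numeric>
    #include <vector>

    int main() {
      constexpr int N = 1 << 20;
      std::vector<int> h_a(N);
      std::iota(h_a.begin(), h_a.end(), 0);                                                // h_a = 0, 1, 2, ...

      cudaStream_t stream;
      cudaStreamCreate(&stream);

      int *d_a = nullptr, *d_b = nullptr;
      cudaMallocAsync((void**)&d_a, N * sizeof(int), stream);                              // step 1
      cudaMallocAsync((void**)&d_b, N * sizeof(int), stream);
      cudaMemcpyAsync(d_a, h_a.data(), N * sizeof(int), cudaMemcpyHostToDevice, stream);   // step 2
      cudaMemcpyAsync(d_b, d_a, N * sizeof(int), cudaMemcpyDeviceToDevice, stream);        // step 3
      cudaMemcpyAsync(h_a.data(), d_b, N * sizeof(int), cudaMemcpyDeviceToHost, stream);   // step 4
      cudaFreeAsync(d_a, stream);                                                          // step 5
      cudaFreeAsync(d_b, stream);
      cudaStreamSynchronize(stream);

      for (int i = 0; i < N; ++i) assert(h_a[i] == i);                                     // round trip OK
      cudaStreamDestroy(stream);
      return 0;
    }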

Variation

Add a non‑blocking version using streams + cudaMemcpyAsync and time a 100 MB H↔D copy to estimate PCIe bandwidth.
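
For the variation, one possible way to time the copy with cudaEvent_t and pinned host memory (the 100 MB size and all names are illustrative):

    // Sketch: estimate H->D bandwidth with cudaEvent timing; error checking omitted.
    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
      const size_t bytes = 100ull * 1024 * 1024;          // ~100 MB
      void* h_buf = nullptr;
      cudaMallocHost(&h_buf, bytes);                      // pinned memory enables truly async copies

      void* d_buf = nullptr;
      cudaStream_t stream;
      cudaStreamCreate(&stream);
      cudaMallocAsync(&d_buf, bytes, stream);

      cudaEvent_t start, stop;
      cudaEventCreate(&start);
      cudaEventCreate(&stop);

      cudaEventRecord(start, stream);
      cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
      cudaEventRecord(stop, stream);
      cudaEventSynchronize(stop);

      float ms = 0.f;
      cudaEventElapsedTime(&ms, start, stop);
      printf("H->D: %.2f ms, %.2f GB/s\n", ms, bytes / ms / 1e6);

      cudaEventDestroy(start); cudaEventDestroy(stop);
      cudaFreeAsync(d_buf, stream);
      cudaStreamSynchronize(stream);
      cudaFreeHost(h_buf);
      cudaStreamDestroy(stream);
      return 0;
    }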


Exercise 2 – Launch Your First Kernel

Goal – understand grid/block configuration and indexing.

  1. cudaMallocAsync a device array d_a[N]
  2. Launch a 1‑D grid of 1‑D blocks

    d_a[i] = i + 42;  // each thread writes one element
    
  3. Copy back, verify, free the memory.

Compile & run:

nvcc -std=c++20 launch_kernel.cu -o ex02 && ./ex02

Hint: Global thread index = blockIdx.x * blockDim.x + threadIdx.x.
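
One possible kernel and launch configuration, sketched under the assumption that d_a and N come from the surrounding exercise code:

    // Sketch: 1-D grid of 1-D blocks, one thread per element (names are illustrative).
    __global__ void fill(int* a, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
      if (i < n)                                       // guard the last, partially used block
        a[i] = i + 42;
    }

    // Host side:
    //   int threadsPerBlock = 256;
    //   int blocksPerGrid   = (N + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division
    //   fill<<<blocksPerGrid, threadsPerBlock>>>(d_a, N);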


Exercise 3 – 2‑D Grid & Block

Goal – move from 1‑D to 2‑D indexing.

Matrix M[numRows × numCols].

  1. Set numRows, numCols (start with 4 × 4, then 19 × 67).
  2. Launch a 2‑D grid of 2‑D blocks so each thread writes M[row,col] = row * numCols + col.
  3. Copy to host, assert correctness.
  4. Experiment: fix block = 16×16, compute blocksPerGrid with ceiling division.

Compile:

nvcc -std=c++20 ex03_fill_matrix.cu -o ex03 && ./ex03
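
A possible kernel and 2-D launch, sketched assuming row-major storage and that d_M, numRows, numCols come from the exercise code:

    // Sketch: 2-D grid of 2-D blocks; each thread writes one matrix element.
    __global__ void fillMatrix(int* M, int numRows, int numCols) {
      int col = blockIdx.x * blockDim.x + threadIdx.x;
      int row = blockIdx.y * blockDim.y + threadIdx.y;
      if (row < numRows && col < numCols)
        M[row * numCols + col] = row * numCols + col;
    }

    // Host side:
    //   dim3 threadsPerBlock(16, 16);
    //   dim3 blocksPerGrid((numCols + 15) / 16, (numRows + 15) / 16);  // ceiling division
    //   fillMatrix<<<blocksPerGrid, threadsPerBlock>>>(d_M, numRows, numCols);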

Exercise 4 – Parallel Reduction ∑

Goal – sum a 1‑D array faster than the CPU.

Rule of thumb: keep each block power‑of‑two and reduce in shared memory.

  1. Kernel #1: each block loads its slice, does shared‑mem tree reduction, writes one partial sum.
  2. Kernel #2: single block reduces those partials into the final total.
  3. Copy result, compare to std::accumulate.

Bonus: one‑step reduction (single kernel).
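
A sketch of kernel #1 under the power-of-two assumption (names such as d_in and d_partial are illustrative):

    // Sketch: each block reduces its slice in shared memory and writes one partial sum.
    __global__ void blockReduce(const int* in, int* partial, int n) {
      extern __shared__ int sdata[];                  // blockDim.x ints, size given at launch
      int tid = threadIdx.x;
      int i   = blockIdx.x * blockDim.x + threadIdx.x;

      sdata[tid] = (i < n) ? in[i] : 0;               // pad the last block with zeros
      __syncthreads();

      for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // tree reduction, needs power-of-two blockDim
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
      }

      if (tid == 0) partial[blockIdx.x] = sdata[0];   // one partial sum per block
    }

    // Launch:  blockReduce<<<numBlocks, blockSize, blockSize * sizeof(int)>>>(d_in, d_partial, N);
    // Kernel #2 can reuse this kernel with a single block, provided numBlocks <= blockSize.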


Parallel Challenge – The Circle of Life

A toroidal predator–prey world. Build the starter CPU version first, then port it to CUDA.

Reference CPU build:

make serial
./circle_of_life --width 256 --height 256 --seed 42

Can you use the asynchronous GPU kernel launch to generate a GIF frame on the CPU while the GPU is running the next iteration? One possible structure is sketched below.
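
A sketch of that overlap, assuming double-buffered device state (d_curr, d_next), a host frame buffer h_frame of frameBytes bytes, and a hypothetical writeGifFrame helper; all names are illustrative:

    // Sketch: overlap CPU GIF encoding with the next GPU iteration on a stream.
    // d_curr/d_next, h_frame, frameBytes, stepKernel and writeGifFrame are illustrative names.
    for (int step = 0; step < numSteps; ++step) {
      // Launch the next iteration asynchronously; control returns to the CPU immediately.
      stepKernel<<<grid, block, 0, stream>>>(d_curr, d_next, width, height);

      // While the GPU computes this step, encode the frame of the previous step on the CPU.
      if (step > 0) writeGifFrame(h_frame, width, height);

      // The copy is queued on the same stream, so it waits for the kernel to finish.
      cudaMemcpyAsync(h_frame, d_next, frameBytes, cudaMemcpyDeviceToHost, stream);
      cudaStreamSynchronize(stream);

      std::swap(d_curr, d_next);                      // double buffering
    }
    writeGifFrame(h_frame, width, height);            // encode the last frame after the loop

For the copy to run asynchronously with respect to the host, h_frame should be pinned memory (cudaMallocHost).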



Common Pitfalls & Tips

Thrust

The four labs below mirror the CUDA ones but use the Thrust C++ library. All code lives in hands-on/thrust/XX_/. Every exercise follows the same workflow:

  1. Open the .cu file and replace each ►►► TODO ◄◄◄.
  2. Build with the provided Makefile: $ make # or make test when shown
  3. Run → the program prints “…PASSED 🎉” (or an assert succeeds).

    Build environment: CUDA 12.9 · gcc 13 · Ubuntu 22.04. Thrust ships inside the CUDA toolkit—no extra install needed.

Exercise 1 – Memory Model with Thrust Vectors

Folder: thrust/01_memory_model
Goal: practice host ↔ device copies and Thrust containers.

  1. Initialize a thrust::host_vector<int> h_buf(dim) with 0, 1, 2, … (hint: thrust::sequence).
  2. Create two thrust::device_vector<int> buffers d1, d2.
  3. Copy host → d1 with any Thrust copy method.
  4. Copy d1 → d2 using a different method (e.g. an explicit iterator pair).
  5. Zero h_buf on the CPU.
  6. Copy d2 → h_buf and assert that the values are again 0…dim-1.

Build & run:

cd hands-on/thrust/01_memory_model
make         # builds memory_model
./memory_model
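
A minimal sketch of the six steps (dim is illustrative and the exercise file provides its own scaffolding):

    // Sketch of the six steps above.
    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/sequence.h>
    #include <thrust/copy.h>
    #include <thrust/fill.h>
    #include <cassert>

    int main() {
      const int dim = 1 << 20;

      thrust::host_vector<int> h_buf(dim);
      thrust::sequence(h_buf.begin(), h_buf.end());          // 0, 1, 2, ...

      thrust::device_vector<int> d1(dim), d2(dim);
      thrust::copy(h_buf.begin(), h_buf.end(), d1.begin());  // host -> device
      d2 = d1;                                               // device -> device (assignment copies)

      thrust::fill(h_buf.begin(), h_buf.end(), 0);           // zero the host buffer
      thrust::copy(d2.begin(), d2.end(), h_buf.begin());     // device -> host

      for (int i = 0; i < dim; ++i) assert(h_buf[i] == i);
      return 0;
    }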

Variation: time the H↔D copy using cudaEvent_t and compare to the raw‐pointer CUDA version.

Exercise 2 – Your First Thrust “Kernel”

Folder : thrust/02_my_first_kernel

Goal : launch-free vector operations and async transfers.

  1. Pinned-host buffer h_a (use thrust::universal_vector or cudaMallocHost).
  2. Device buffer d_a (thrust::device_vector<int>(N)).
  3. Fill d_a with i + 42 using thrust::sequence.
  4. Create a cudaStream_t queue.
  5. Asynchronously copy d_a → h_a with thrust::copy(thrust::cuda::par.on(queue), …).
  6. Verify with an assert that h_a == {42, 43, …}.

Build:

cd hands-on/thrust/02_my_first_kernel
make

Running it prints “Correct, good work!”.

Tip: call CUDA_CHECK(cudaStreamSynchronize(queue)); after the Thrust copy and before destroying the stream to avoid race conditions.
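
A minimal sketch, assuming thrust::universal_vector for the host-visible buffer; names and N are illustrative and error checking is omitted:

    // Sketch of the steps above.
    #include <thrust/universal_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/sequence.h>
    #include <thrust/copy.h>
    #include <thrust/system/cuda/execution_policy.h>
    #include <cuda_runtime.h>
    #include <cassert>

    int main() {
      const int N = 1 << 20;

      thrust::universal_vector<int> h_a(N);           // host-accessible buffer (managed memory)
      thrust::device_vector<int>    d_a(N);

      thrust::sequence(d_a.begin(), d_a.end(), 42);   // d_a[i] = i + 42

      cudaStream_t queue;
      cudaStreamCreate(&queue);

      // Copy on the stream; synchronize before reading h_a on the host.
      thrust::copy(thrust::cuda::par.on(queue), d_a.begin(), d_a.end(), h_a.begin());
      cudaStreamSynchronize(queue);

      for (int i = 0; i < N; ++i) assert(h_a[i] == i + 42);

      cudaStreamDestroy(queue);
      return 0;
    }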

Exercise 3 – Data Statistics (Mean & Stdev)

Folder: thrust/03_data_statistics
Goal: combine transform and reduce to compute simple stats.

  1. Host vector h_data(N); fill with random floats (provided helper).
  2. Copy to thrust::device_vector<float> d_data.
  3. Mean = thrust::reduce(d_data.begin(), d_data.end()) / N.
  4. squared_diffs[i] = (d_data[i] – mean)² via thrust::transform with a lambda or functor.
  5. Stdev = sqrt(reduce(squared_diffs)/N).
  6. Challenge: recompute stdev without the intermediate buffer (use a binary transform-reduce).

Build & run:

cd hands-on/thrust/03_data_statistics
make test      # Makefile prints mean & σ and checks against CPU reference

Hints:

    thrust::placeholders::_1 can shorten lambdas (_1 - mean).
    For the "no buffer" variant, use thrust::transform_reduce.
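
A possible shape of the no-buffer challenge, sketched assuming d_data, mean and N are already available:

    // Sketch: standard deviation without the intermediate buffer, via thrust::transform_reduce.
    // Assumes d_data is a thrust::device_vector<float> and that mean and N are already computed;
    // needs <thrust/transform_reduce.h>, <thrust/functional.h>, <cmath>, and nvcc's --extended-lambda
    // flag (a plain functor works without the flag).
    float sum_sq = thrust::transform_reduce(
        d_data.begin(), d_data.end(),
        [mean] __device__ (float x) { return (x - mean) * (x - mean); },  // unary op: squared difference
        0.0f,                                                             // initial value
        thrust::plus<float>());                                           // binary reduction
    float stdev = std::sqrt(sum_sq / N);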

Exercise 4 – Maximum Absolute Difference (Zip Iterators)

Folder: thrust/04_max_difference
Goal: learn zip iterators and element-wise transforms.

Two small host vectors h_v1, h_v2 already provided.

Copy to d_v1, d_v2.

Use thrust::transform with a zip iterator pair to compute |v1-v2| into a temporary device_vector<int> diffs.

    auto first = thrust::make_zip_iterator(thrust::make_tuple(d_v1.begin(),
                                                              d_v2.begin()));
    auto last  = first + d_v1.size();
    thrust::transform(first, last, diffs.begin(),
                      [] __device__ (auto t) {
                        int a = thrust::get<0>(t);
                        int b = thrust::get<1>(t);
                        return abs(a - b);
                      });

    max_difference = thrust::reduce(diffs.begin(), diffs.end(), 0, thrust::maximum<int>());

Print the result and compare to a CPU calculation.

Build:

cd hands-on/thrust/04_max_difference
make

Common Pitfalls & Tips

Prefer algorithms (transform, reduce) over raw loops—Thrust chooses good launches for you.

Explicit streams: .on(my_stream) works with every algorithm.

device_vector reallocation is costly—reserve capacity if you grow it.

Thrust algorithms follow the standard library's iterator conventions; use std::views on the host for quick sanity checks.

GPU programming with CuPy

Set up

ssh to the machine with this command:

ssh -L XXXX:localhost:XXXX <username>@131.154.XX.YY

where XXXX is a 4-digit number, unique for each user. Your 4-digit number is 90 plus your ID from the table in the markdown.

Once you are inside the machine, you can start JupyterLab with the command:

jupyter-lab --port XXXX

Once JupyterLab is running, you will find in the output a URL like this:

...
    Or copy and paste one of these URLs:
        http://localhost:9004/lab?token=8928e7071...
        http://127.0.0.1:9004/lab?token=8928e7071...
...

Paste one of these URLs in your local browser, and you should see the JupyterLab page.

Exercises

Exercise 1: port NumPy code to CuPy

Take the code written with NumPy and modify it to use CuPy. Time both executions to compare CPU vs. GPU speedup.

Note: GPU operations are asynchronous. Use:

cp.cuda.Device().synchronize()

before stopping your timer to get accurate timings.

You can try the same with the Python code of your personal project!

Exercise 2: transfers and GPU allocation

  1. Create two arrays of N random numbers on CPU
  2. Copy them to the GPU
  3. Sum them
  4. Copy them back to the CPU.

Now avoid the copy to the GPU by creating the random arrays directly on the device with CuPy.

Exercise 3: write a kernel

Write a kernel (take one from the CUDA exercises or write your own) with CuPy using:

Exercise 4: reduction

Implement the reduction kernel:

Tips: