GPU Architecture and CUDA
- GPU vs CPU: throughput-oriented design, thousands of cores, SIMT (single instruction, multiple threads) execution model
- GPU memory hierarchy: global memory, shared memory, registers, L1/L2 cache, constant memory
- CUDA programming model: grids, blocks, threads, warps (32 threads), warp divergence
- Kernel launch: grid/block dimensions, occupancy, register usage
- Memory access patterns: coalesced access, bank conflicts in shared memory, memory fences
- Synchronisation: __syncthreads() block barriers, atomic operations, cooperative groups
- Streams and concurrency: overlapping compute and data transfer, multi-stream execution
- Profiling: Nsight Compute, Nsight Systems, occupancy calculator
- NVIDIA GPU generations: Volta (tensor cores), Ampere (TF32, sparsity), Hopper (transformer engine, FP8), Blackwell (FP4, second-generation transformer engine)
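
The grid/block/thread model and the kernel-launch syntax above can be sketched with a minimal vector-add kernel (the function and variable names here are illustrative, not from any particular codebase):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread handles one element; the global index is built from
// the block index, block size, and thread index within the block.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];  // guard: the grid may overshoot n
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // 256 threads per block is a common starting point; round the
    // grid size up so every element is covered.
    int block = 256;
    int grid = (n + block - 1) / block;
    vecAdd<<<grid, block>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

The `if (i < n)` guard is also the simplest example of warp divergence: in the last warp, threads past `n` take the other branch, and the hardware serialises the two paths.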
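
Shared memory and __syncthreads() are most often seen together in a block-level reduction. A minimal sketch (names are illustrative):

```cuda
// Sum the elements assigned to one block; each block writes one
// partial sum. Launch with shared memory: blockSum<<<grid, block,
// block * sizeof(float)>>>(in, out, n).
__global__ void blockSum(const float* in, float* out, int n) {
    extern __shared__ float sdata[];  // dynamic shared memory, one float per thread
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();  // all loads into shared memory must finish first

    // Tree reduction; the stride halves each step. Sequential
    // addressing (tid < s) keeps active threads contiguous and
    // avoids shared-memory bank conflicts.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();  // barrier needed on every iteration
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];  // one partial sum per block
}
```

Note that __syncthreads() must be reached by every thread in the block, which is why the barrier sits outside the `if (tid < s)` branch.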
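
Coalescing and bank conflicts both show up in the classic tiled matrix transpose: a naive transpose has either strided loads or strided stores, while staging a tile in shared memory makes both global accesses coalesced. A sketch under the usual 32x32-tile assumption:

```cuda
#define TILE 32

// The +1 padding shifts each row of the tile into a different
// shared-memory bank, so reading a tile column does not cause a
// 32-way bank conflict.
__global__ void transposeTiled(const float* in, float* out,
                               int width, int height) {
    __shared__ float tile[TILE][TILE + 1];
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];  // coalesced load
    __syncthreads();

    // Swap the block indices so the output tile is also written
    // with consecutive threads touching consecutive addresses.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];  // coalesced store
}
```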
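
Atomic operations handle data races when many threads update the same location; a histogram is the standard example. A minimal sketch using a grid-stride loop (an idiom that decouples data size from launch configuration):

```cuda
// bins must be zero-initialised (e.g. cudaMemset) before launch.
__global__ void histogram(const unsigned char* data, int n,
                          unsigned int* bins) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&bins[data[i]], 1u);  // serialised only when threads hit the same bin
}
```

When contention is high, a common refinement is privatising per-block histograms in shared memory and merging them into global memory at the end of the block.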
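
Overlapping compute with transfers means splitting the work into chunks and pipelining them across streams: while one chunk's kernel runs, another chunk's copy proceeds. A sketch under the assumption that the host buffer is pinned (allocated with cudaMallocHost or registered with cudaHostRegister; otherwise cudaMemcpyAsync silently degrades to a synchronous copy):

```cuda
#include <cuda_runtime.h>

__global__ void scale(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

// Process n floats in 4 chunks, alternating between two streams.
// Operations queued on one stream execute in order; operations on
// different streams may overlap (copy engines vs compute).
void pipelined(float* h_pinned, float* d, int n) {
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    const int chunks = 4, len = n / chunks;  // assumes n % chunks == 0
    for (int c = 0; c < chunks; ++c) {
        cudaStream_t st = s[c % 2];
        int off = c * len;
        cudaMemcpyAsync(d + off, h_pinned + off, len * sizeof(float),
                        cudaMemcpyHostToDevice, st);
        scale<<<(len + 255) / 256, 256, 0, st>>>(d + off, len);
        cudaMemcpyAsync(h_pinned + off, d + off, len * sizeof(float),
                        cudaMemcpyDeviceToHost, st);
    }
    cudaStreamSynchronize(s[0]);
    cudaStreamSynchronize(s[1]);
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}
```

Nsight Systems is the natural tool for checking whether the copies and kernels actually overlap on the timeline.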