GPU Architecture and CUDA
- GPU vs CPU: throughput-oriented design, thousands of cores, SIMT (single instruction, multiple threads) execution model
- GPU memory hierarchy: global memory, shared memory, registers, L1/L2 cache, constant memory
- CUDA programming model: grids, blocks, threads, warps (32 threads), warp divergence
- Kernel launch: grid/block dimensions, occupancy, register usage
- Memory access patterns: coalesced access, bank conflicts in shared memory, memory fences
- Synchronisation: __syncthreads() block barriers, atomic operations, cooperative groups
- Streams and concurrency: overlapping compute and data transfer, multi-stream execution
- Profiling: Nsight Compute, Nsight Systems, occupancy calculator
- NVIDIA GPU generations: Volta (tensor cores), Ampere (TF32, sparsity), Hopper (transformer engine, FP8), Blackwell (FP4, second-generation transformer engine)
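
The grid/block/thread model and the kernel-launch syntax above can be sketched with a minimal vector-add kernel (the function and variable names here are illustrative, not from any particular codebase):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread handles one element; the global index is built from
// the block index, block size, and thread index within the block.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];  // guard: the grid may overshoot n
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // 256 threads per block is a common starting point; round the
    // grid size up so every element is covered.
    int block = 256;
    int grid = (n + block - 1) / block;
    vecAdd<<<grid, block>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

The `if (i < n)` guard is also the simplest example of warp divergence: in the last warp, threads past `n` take the other branch, and the hardware serialises the two paths.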
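
Shared memory and __syncthreads() are most often seen together in a block-level reduction. A minimal sketch (names are illustrative):

```cuda
// Sum the elements assigned to one block; each block writes one
// partial sum. Launch with shared memory: blockSum<<<grid, block,
// block * sizeof(float)>>>(in, out, n).
__global__ void blockSum(const float* in, float* out, int n) {
    extern __shared__ float sdata[];  // dynamic shared memory, one float per thread
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();  // all loads into shared memory must finish first

    // Tree reduction; the stride halves each step. Sequential
    // addressing (tid < s) keeps active threads contiguous and
    // avoids shared-memory bank conflicts.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();  // barrier needed on every iteration
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];  // one partial sum per block
}
```

Note that __syncthreads() must be reached by every thread in the block, which is why the barrier sits outside the `if (tid < s)` branch.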
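
Coalescing and bank conflicts both show up in the classic tiled matrix transpose: a naive transpose has either strided loads or strided stores, while staging a tile in shared memory makes both global accesses coalesced. A sketch under the usual 32x32-tile assumption:

```cuda
#define TILE 32

// The +1 padding shifts each row of the tile into a different
// shared-memory bank, so reading a tile column does not cause a
// 32-way bank conflict.
__global__ void transposeTiled(const float* in, float* out,
                               int width, int height) {
    __shared__ float tile[TILE][TILE + 1];
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];  // coalesced load
    __syncthreads();

    // Swap the block indices so the output tile is also written
    // with consecutive threads touching consecutive addresses.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];  // coalesced store
}
```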
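
Atomic operations handle data races when many threads update the same location; a histogram is the standard example. A minimal sketch using a grid-stride loop (an idiom that decouples data size from launch configuration):

```cuda
// bins must be zero-initialised (e.g. cudaMemset) before launch.
__global__ void histogram(const unsigned char* data, int n,
                          unsigned int* bins) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&bins[data[i]], 1u);  // serialised only when threads hit the same bin
}
```

When contention is high, a common refinement is privatising per-block histograms in shared memory and merging them into global memory at the end of the block.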
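
Overlapping compute with transfers means splitting the work into chunks and pipelining them across streams: while one chunk's kernel runs, another chunk's copy proceeds. A sketch under the assumption that the host buffer is pinned (allocated with cudaMallocHost or registered with cudaHostRegister; otherwise cudaMemcpyAsync silently degrades to a synchronous copy):

```cuda
#include <cuda_runtime.h>

__global__ void scale(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

// Process n floats in 4 chunks, alternating between two streams.
// Operations queued on one stream execute in order; operations on
// different streams may overlap (copy engines vs compute).
void pipelined(float* h_pinned, float* d, int n) {
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    const int chunks = 4, len = n / chunks;  // assumes n % chunks == 0
    for (int c = 0; c < chunks; ++c) {
        cudaStream_t st = s[c % 2];
        int off = c * len;
        cudaMemcpyAsync(d + off, h_pinned + off, len * sizeof(float),
                        cudaMemcpyHostToDevice, st);
        scale<<<(len + 255) / 256, 256, 0, st>>>(d + off, len);
        cudaMemcpyAsync(h_pinned + off, d + off, len * sizeof(float),
                        cudaMemcpyDeviceToHost, st);
    }
    cudaStreamSynchronize(s[0]);
    cudaStreamSynchronize(s[1]);
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}
```

Nsight Systems is the natural tool for checking whether the copies and kernels actually overlap on the timeline.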