Triton, TPUs, and Vulkan
- Triton: Python-based GPU kernel programming, block-level abstraction, auto-tuning
- Writing Triton kernels: tl.load, tl.store, tl.dot, masking, grid/block programs
- Triton vs CUDA: productivity vs control tradeoff, when to use each
- Flash Attention as a case study: memory-efficient attention via tiling, online softmax
- TPU architecture: systolic arrays, MXU (matrix multiply unit), HBM, ICI interconnect
- TPU programming: XLA compiler, JAX/pjit, GSPMD, sharding annotations
- Vulkan compute: compute shaders, SPIR-V, descriptor sets, command buffers
- Comparison: GPU (CUDA/Triton) vs TPU (JAX/XLA) vs Vulkan, choosing the right tool