ARM and NEON
- ARM architecture: load-store ISA, register file, condition codes, Thumb mode
- ARM NEON: 128-bit SIMD, data types (int8, int16, int32, float16, float32), register layout
- NEON intrinsics: load/store (vld1, vst1), arithmetic (vadd, vmul, vmla), shuffle and permute
- SVE and SVE2: scalable vector extensions, predicate registers, vector-length agnostic programming
- Apple Silicon specifics: AMX (Apple Matrix eXtensions), performance cores vs efficiency cores
- Practical examples: vectorised dot product, matrix multiply, image processing kernels
- Auto-vectorisation: compiler flags, pragmas, loop patterns that help/hinder vectorisation
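
As a rough illustration of the vectorised dot product listed above, the following is a minimal sketch built on the vld1 and vmla intrinsic families from the outline. The function names `dot_scalar` and `dot_neon` and the macro `DOT_USE_NEON` are hypothetical, chosen only for this example. The sketch assumes float32 input with no alignment requirement, and it falls back to the scalar loop when the compiler does not define `__ARM_NEON`, so the file still builds on non-ARM hosts.

```c
#include <assert.h>
#include <stddef.h>

/* Use the NEON path only when the compiler targets a NEON-capable ARM core. */
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define DOT_USE_NEON 1
#else
#define DOT_USE_NEON 0
#endif

/* Scalar reference: also the kind of loop a compiler may auto-vectorise. */
float dot_scalar(const float *a, const float *b, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

/* Dot product with NEON intrinsics (hypothetical helper for this sketch). */
float dot_neon(const float *a, const float *b, size_t n) {
    assert(n == 0 || (a != NULL && b != NULL));
#if DOT_USE_NEON
    float32x4_t acc = vdupq_n_f32(0.0f);      /* four partial sums, all zero */
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);    /* load 4 floats from a */
        float32x4_t vb = vld1q_f32(b + i);    /* load 4 floats from b */
        acc = vmlaq_f32(acc, va, vb);         /* acc += va * vb, per lane */
    }
    /* Horizontal reduction: 4 lanes -> 2 lanes -> 1 value. */
    float32x2_t half = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    half = vpadd_f32(half, half);
    float sum = vget_lane_f32(half, 0);
    /* Scalar tail for the remaining n % 4 elements. */
    for (; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
#else
    /* No NEON available: fall back to the scalar reference. */
    return dot_scalar(a, b, n);
#endif
}
```

Calling `dot_neon(a, b, n)` should match `dot_scalar(a, b, n)` up to floating-point rounding, since the NEON path sums in a different order. For the auto-vectorisation topic, one way to compare is to build `dot_scalar` at `-O3` and inspect the compiler's vectorisation report (for example `-Rpass=loop-vectorize` with Clang or `-fopt-info-vec` with GCC). Note that a floating-point reduction like this one is usually only vectorised when reassociation is permitted, for instance via `-ffast-math`.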