x86 and AVX

x86 SIMD evolution: MMX → SSE → SSE2/3/4 → AVX → AVX2 → AVX-512 → AMX
AVX/AVX2 programming: 256-bit YMM registers, intrinsics (_mm256_*), FMA instructions
AVX-512: 512-bit ZMM registers, mask registers, gather/scatter, conflict detection
Intel AMX: tile registers, TMUL (tile matrix multiply), BF16/INT8 acceleration
Memory alignment: aligned vs unaligned loads, cache line considerations
Performance pitfalls: AVX frequency throttling, register pressure, lane-crossing penalties
Benchmarking and profiling: RDTSC, perf, VTune, likwid
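
As a rough illustration of the AVX2/FMA and unaligned-load topics above, the following is a minimal sketch of a SAXPY kernel (y = a*x + y) written with 256-bit intrinsics. The function names (saxpy, saxpy_avx2, saxpy_scalar) are hypothetical and chosen only for this example. It assumes GCC or Clang, since it relies on the target function attribute and __builtin_cpu_supports for runtime dispatch; other compilers would need a different mechanism.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define HAVE_X86 1
#else
#define HAVE_X86 0
#endif

/* Scalar reference and fallback: y[i] = a * x[i] + y[i]. */
static void saxpy_scalar(float a, const float *x, float *y, size_t n) {
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

#if HAVE_X86
/* AVX2 + FMA path. The target attribute (GCC/Clang) lets this one function
 * use AVX2/FMA instructions without compiling the whole file with -mavx2. */
__attribute__((target("avx2,fma")))
static void saxpy_avx2(float a, const float *x, float *y, size_t n) {
    const __m256 va = _mm256_set1_ps(a);      /* broadcast a to all 8 lanes */
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        /* Unaligned loads: correct for any pointer alignment. */
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        vy = _mm256_fmadd_ps(va, vx, vy);     /* fused a * x + y */
        _mm256_storeu_ps(y + i, vy);
    }
    for (; i < n; ++i)                        /* scalar tail, fewer than 8 left */
        y[i] = a * x[i] + y[i];
}
#endif

/* Hypothetical entry point: picks the AVX2 path at run time when the CPU
 * reports both AVX2 and FMA, otherwise falls back to scalar code. */
void saxpy(float a, const float *x, float *y, size_t n) {
    assert(n == 0 || (x != NULL && y != NULL));
#if HAVE_X86
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("fma")) {
        saxpy_avx2(a, x, y, n);
        return;
    }
#endif
    saxpy_scalar(a, x, y, n);
}
```

The sketch uses _mm256_loadu_ps and _mm256_storeu_ps so that it is correct for arbitrarily aligned buffers. If both arrays are known to be 32-byte aligned, the aligned variants _mm256_load_ps and _mm256_store_ps could be substituted. Calling saxpy(2.0f, x, y, n) updates y in place.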