
x86 and AVX

x86 processors from Intel and AMD dominate data centre servers where most ML training happens. This file covers the evolution of x86 SIMD, AVX/AVX2 intrinsics programming, AVX-512, Intel AMX for matrix operations, memory alignment, performance pitfalls, and profiling -- the tools for squeezing maximum performance from the world's most common server CPUs.

  • If your training runs on cloud VMs (AWS, GCP, Azure), it is almost certainly running on x86. Even GPU-heavy training has CPU bottlenecks: data loading, preprocessing, gradient aggregation, and checkpointing all run on the CPU. Optimising these with x86 SIMD can meaningfully reduce end-to-end training time.

The x86 SIMD Evolution

  • x86 SIMD has evolved through increasingly wider vector registers:
Generation  Year  Register width  Registers        Key features
MMX         1997  64-bit          8 (mm0-7)        Integer only, shared with FPU
SSE         1999  128-bit         8 (xmm0-7)       4 floats, dedicated registers
SSE2        2001  128-bit         8 (16 in x86-64) 2 doubles, integer operations
AVX         2011  256-bit         16 (ymm0-15)     8 floats, 3-operand instructions
AVX2        2013  256-bit         16               256-bit integer ops, FMA, gather
AVX-512     2017  512-bit         32 (zmm0-31)     16 floats, mask registers, scatter
AMX         2023  Tile registers  8 tiles          Matrix multiply (BF16, INT8)
  • Each generation roughly doubled the peak throughput of vectorised code. Code written with SSE intrinsics runs on every x86 CPU made since 2001. AVX2 requires a 2013+ CPU. AVX-512 ships on Intel Xeon, some Intel consumer chips, and AMD Zen 4 and later. AMX is the newest (Intel Sapphire Rapids and later).

  • Backward compatibility: x86 SSE registers (xmm) are the lower 128 bits of AVX registers (ymm), which are the lower 256 bits of AVX-512 registers (zmm). Old SSE code runs on new CPUs without modification.

AVX2 Programming

  • AVX2 operates on 256-bit registers (YMM), processing 8 floats or 4 doubles simultaneously. It is the sweet spot for portable high-performance code: available on virtually all modern x86 CPUs (2013+).

Intrinsics Naming Convention

  • All x86 intrinsics follow the pattern: _mm[width]_[operation]_[type]

    • _mm = 128-bit SSE (the same prefix is also used by 64-bit MMX), _mm256 = AVX (256-bit), _mm512 = AVX-512 (512-bit)
    • Operation: add, mul, fmadd, load, store, set, etc.
    • Type: ps = packed single (float32), pd = packed double (float64), epi32 = packed int32, si256 = 256-bit integer
#include <immintrin.h>  // all x86 SIMD intrinsics

// Data types
__m256  a;   // 256-bit register holding 8 float32s
__m256d b;   // 256-bit register holding 4 float64s
__m256i c;   // 256-bit register holding integers (8x32, 16x16, or 32x8)

Loading and Storing Data

// Load 8 floats from memory
__m256 v = _mm256_loadu_ps(ptr);      // unaligned load (works with any address)
__m256 v = _mm256_load_ps(ptr);       // aligned load (ptr must be 32-byte aligned, faster)

// Store 8 floats to memory
_mm256_storeu_ps(out_ptr, v);          // unaligned store
_mm256_store_ps(out_ptr, v);           // aligned store

// Broadcast a single value to all 8 lanes
__m256 ones = _mm256_set1_ps(1.0f);    // [1, 1, 1, 1, 1, 1, 1, 1]

// Set individual values (rarely needed)
__m256 v = _mm256_set_ps(7,6,5,4,3,2,1,0);  // note: reverse order!

// Zero register
__m256 z = _mm256_setzero_ps();

Arithmetic

__m256 c = _mm256_add_ps(a, b);        // c[i] = a[i] + b[i]
__m256 d = _mm256_mul_ps(a, b);        // d[i] = a[i] * b[i]
__m256 e = _mm256_sub_ps(a, b);        // e[i] = a[i] - b[i]
__m256 f = _mm256_div_ps(a, b);        // f[i] = a[i] / b[i] (slower than mul)

// Fused Multiply-Add: r = a * b + c (one instruction, one rounding)
__m256 r = _mm256_fmadd_ps(a, b, c);   // THE most important instruction for ML

// Min and max
__m256 mn = _mm256_min_ps(a, b);       // min(a[i], b[i]) — useful for clipping
__m256 mx = _mm256_max_ps(a, b);       // max(a[i], b[i]) — useful for ReLU

Practical Example: AVX2 Dot Product

#include <immintrin.h>

float dot_avx2(const float* a, const float* b, int n) {
    __m256 sum = _mm256_setzero_ps();  // 8 lanes of partial sums, initialised to 0

    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        sum = _mm256_fmadd_ps(va, vb, sum);  // sum += va * vb
    }

    // Horizontal reduction: sum the 8 elements of sum
    // Step 1: add upper 128 bits to lower 128 bits
    __m128 hi = _mm256_extractf128_ps(sum, 1);
    __m128 lo = _mm256_castps256_ps128(sum);
    __m128 sum128 = _mm_add_ps(hi, lo);        // 4 partial sums

    // Step 2: horizontal add within 128-bit register
    sum128 = _mm_hadd_ps(sum128, sum128);       // [a+b, c+d, a+b, c+d]
    sum128 = _mm_hadd_ps(sum128, sum128);       // [a+b+c+d, ...]

    float result = _mm_cvtss_f32(sum128);       // extract scalar

    // Scalar cleanup
    for (; i < n; i++) {
        result += a[i] * b[i];
    }

    return result;
}
  • Why the horizontal reduction is ugly: SIMD is designed for vertical operations (lane 0 with lane 0, lane 1 with lane 1). Horizontal operations (summing across lanes) fight the hardware. This is why dot products have the awkward reduction code at the end. The vectorised loop is clean; the reduction is boilerplate.

  • Performance: compared to the NEON version (file 02), AVX2 processes 8 floats per iteration vs NEON's 4. For long vectors, this is a 2x speedup over NEON (ignoring memory bandwidth limits).

Practical Example: AVX2 Softmax (Simplified)

  • Softmax requires: find the max, subtract it, exponentiate, sum, divide. Here is the max-finding step:
float vector_max_avx2(const float* data, int n) {
    __m256 max_vec = _mm256_set1_ps(-INFINITY);

    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(data + i);
        max_vec = _mm256_max_ps(max_vec, v);
    }

    // Reduce 8 maxes to 1
    __m128 hi = _mm256_extractf128_ps(max_vec, 1);
    __m128 lo = _mm256_castps256_ps128(max_vec);
    __m128 max128 = _mm_max_ps(hi, lo);

    // Shuffle and max to find the single maximum
    max128 = _mm_max_ps(max128, _mm_shuffle_ps(max128, max128, 0b01001110));
    max128 = _mm_max_ps(max128, _mm_shuffle_ps(max128, max128, 0b10110001));

    float result = _mm_cvtss_f32(max128);

    for (; i < n; i++) {
        result = result > data[i] ? result : data[i];
    }

    return result;
}
  • The _mm_shuffle_ps instruction rearranges elements within a register. The binary constant 0b01001110 controls which elements go where. This is called a permutation, and it connects directly to permutation matrices (chapter 2): shuffling SIMD lanes is the hardware equivalent of multiplying by a permutation matrix.

AVX-512

  • AVX-512 doubles the width again: 512-bit registers (ZMM), processing 16 floats simultaneously.
__m512 a = _mm512_loadu_ps(ptr);                 // load 16 floats
__m512 b = _mm512_loadu_ps(ptr2);                // second operand
__m512 c = _mm512_setzero_ps();
c = _mm512_fmadd_ps(a, b, c);                    // 16 FMAs at once
float sum = _mm512_reduce_add_ps(a);             // built-in horizontal sum (no manual reduction!)

// Mask operations: operate on a subset of lanes
__m512 zero = _mm512_setzero_ps();
__mmask16 mask = _mm512_cmp_ps_mask(a, zero, _CMP_GT_OQ);  // which lanes are > 0?
__m512 relu = _mm512_maskz_mov_ps(mask, a);      // inactive lanes become 0 = ReLU
  • Mask registers (__mmask16) are AVX-512's most powerful feature. Each bit controls whether a lane participates in an operation. This replaces the scalar cleanup loop: the last iteration uses a mask where only the valid lanes are active, handling any vector length without a separate scalar loop.

  • AVX-512 frequency throttling: on many Intel CPUs, using AVX-512 instructions causes the CPU to temporarily reduce its clock frequency (to stay within thermal limits). This means AVX-512 is not always faster than AVX2 for short bursts — the frequency penalty can outweigh the wider vectors. For sustained workloads (like matrix multiplication), AVX-512 wins. For mixed code (some SIMD, some scalar), the frequency transitions can hurt.

Intel AMX: Matrix Multiply Hardware

  • AMX (Advanced Matrix eXtensions) adds dedicated matrix multiply units. Instead of SIMD vectors, AMX operates on tiles: 2D blocks of data (up to 16 rows × 64 bytes each).
#include <immintrin.h>

// AMX tile multiply: C += A * B (in BF16)
// A is 16x32 BF16, B is 32x16 BF16, C is 16x16 FP32
// Note: the tile shapes must first be configured with _tile_loadconfig(), and on
// Linux the OS must grant AMX state via arch_prctl(ARCH_REQ_XCOMP_PERM).
_tile_loadd(0, a_ptr, stride_a);   // load tile 0 from A
_tile_loadd(1, b_ptr, stride_b);   // load tile 1 from B
_tile_dpbf16ps(2, 0, 1);           // tile 2 += tile 0 * tile 1 (BF16 matmul, FP32 accumulation)
_tile_stored(2, c_ptr, stride_c);  // store tile 2 to C
  • AMX performs a full 16×32 × 32×16 matrix multiply in a single instruction. This is hundreds of FMA operations at once, specifically designed for the small matrix multiplies that dominate transformer inference (attention score computation, MLP layers).

  • AMX supports BF16 (bfloat16) and INT8, matching the precisions used in ML inference. Combined with AVX-512 for other operations, AMX-equipped CPUs (Intel Sapphire Rapids, Emerald Rapids) can run transformer inference competitively with entry-level GPUs.

Memory Alignment

  • Aligned memory access is when the data address is a multiple of the vector register width (16 bytes for SSE, 32 bytes for AVX, 64 bytes for AVX-512). Aligned access is faster on some CPUs and required by _mm256_load_ps (as opposed to _mm256_loadu_ps).
// Allocate aligned memory (C11 aligned_alloc: the size should be a multiple of
// the alignment, so round n up to a multiple of 8 floats if needed)
float* data = (float*)aligned_alloc(32, n * sizeof(float));  // 32-byte aligned for AVX

// C++ aligned allocation
#include <new>
float* data = new (std::align_val_t(32)) float[n];

// Alternatively, use the compiler attribute
alignas(32) float data[1024];
  • In practice: on modern CPUs (Haswell and later), unaligned loads (loadu) are almost as fast as aligned loads when the data does not cross a cache line boundary. The performance penalty of unaligned access has largely disappeared, but cache line splits (data spanning two 64-byte cache lines) can still cause a ~2x slowdown for that specific load. Aligned allocation avoids this entirely.

Performance Pitfalls

  • AVX-SSE transition penalty: on older Intel CPUs (pre-Skylake), mixing AVX (256-bit) and legacy SSE (128-bit) instructions caused a stall of roughly 70 cycles per transition. This is why you should execute _mm256_zeroupper() (the vzeroupper instruction) before returning from a function that uses AVX, to clear the upper 128 bits of the YMM registers. Skylake and later replace the hard stall with a milder false-dependency effect, so vzeroupper is still worthwhile -- in practice compilers insert it automatically at function boundaries when generating AVX code.

  • Register pressure: AVX2 has 16 YMM registers. If your kernel uses too many variables, the compiler spills registers to the stack (memory), destroying performance. Keep inner loops simple with few live variables.

  • Data dependencies: sum = _mm256_fmadd_ps(a, b, sum) has a dependency on sum: each iteration must wait for the previous FMA to complete (~4-5 cycles latency). The fix: use multiple independent accumulators and reduce at the end:

// Single accumulator: limited by FMA latency (4-5 cycles)
__m256 sum = _mm256_setzero_ps();
for (...) {
    sum = _mm256_fmadd_ps(a, b, sum);  // each depends on previous
}

// Four accumulators: 4x throughput (hide latency)
__m256 sum0 = _mm256_setzero_ps();
__m256 sum1 = _mm256_setzero_ps();
__m256 sum2 = _mm256_setzero_ps();
__m256 sum3 = _mm256_setzero_ps();
for (...) {
    sum0 = _mm256_fmadd_ps(a0, b0, sum0);  // independent
    sum1 = _mm256_fmadd_ps(a1, b1, sum1);  // independent
    sum2 = _mm256_fmadd_ps(a2, b2, sum2);  // independent
    sum3 = _mm256_fmadd_ps(a3, b3, sum3);  // independent
}
sum0 = _mm256_add_ps(sum0, sum1);
sum2 = _mm256_add_ps(sum2, sum3);
sum0 = _mm256_add_ps(sum0, sum2);
  • This is loop unrolling to hide latency. The CPU can issue the FMAs back-to-back because they write to different registers. This is one of the most impactful micro-optimisations in numerical code.

Profiling

  • Performance counters provide hardware-level measurements:
# Linux perf (requires kernel support)
perf stat ./my_program                    # basic counters: cycles, instructions, IPC
perf stat -e cache-misses,cache-references ./my_program  # cache behaviour
perf record -g ./my_program && perf report              # call graph profiling

# Intel VTune (detailed x86 profiling)
vtune -collect hotspots -- ./my_program
vtune -collect memory-access -- ./my_program   # memory bandwidth analysis
  • What to look for:
    • IPC (Instructions Per Cycle): how efficiently the CPU is used. IPC > 2 is good. IPC < 1 suggests memory stalls or branch mispredictions.
    • Cache miss rate: high L1/L2 miss rates indicate poor data locality. Restructure data access patterns.
    • Branch misprediction rate: > 5% suggests unpredictable branches. Convert to branchless code (SIMD comparisons + blending) if possible.
    • FLOPS achieved vs roofline: compare your measured FLOPS to the roofline model (file 01). If you are below the roofline, there is room for improvement.

Coding Tasks (compile with g++ or clang++ on x86 — Intel/AMD)

  1. Write a scalar dot product and an AVX2 dot product. Benchmark both and measure the speedup from 8-wide SIMD.

    // task1_avx_dot.cpp
    // Compile: g++ -O3 -mavx2 -mfma -o task1 task1_avx_dot.cpp
    
    #include <iostream>
    #include <chrono>
    #include <vector>
    #include <immintrin.h>
    
    float dot_scalar(const float* a, const float* b, int n) {
        float sum = 0.0f;
        for (int i = 0; i < n; i++) sum += a[i] * b[i];
        return sum;
    }
    
    float dot_avx2(const float* a, const float* b, int n) {
        __m256 sum = _mm256_setzero_ps();
        int i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            sum = _mm256_fmadd_ps(va, vb, sum);
        }
        // Reduce: add upper 128 to lower 128, then horizontal add
        __m128 hi = _mm256_extractf128_ps(sum, 1);
        __m128 lo = _mm256_castps256_ps128(sum);
        __m128 r = _mm_add_ps(hi, lo);
        r = _mm_hadd_ps(r, r);
        r = _mm_hadd_ps(r, r);
        float result = _mm_cvtss_f32(r);
        for (; i < n; i++) result += a[i] * b[i];
        return result;
    }
    
    int main() {
        const int N = 10'000'000;
        std::vector<float> a(N, 1.0f), b(N, 2.0f);
    
        volatile float s1 = dot_scalar(a.data(), b.data(), N);
        volatile float s2 = dot_avx2(a.data(), b.data(), N);
    
        auto bench = [&](auto fn, const char* name) {
            auto start = std::chrono::high_resolution_clock::now();
            volatile float s;
            for (int t = 0; t < 100; t++) s = fn(a.data(), b.data(), N);
            auto end = std::chrono::high_resolution_clock::now();
            double ms = std::chrono::duration<double, std::milli>(end - start).count() / 100;
            std::cout << name << ": " << ms << " ms (result: " << s << ")\n";
            return ms;
        };
    
        double t1 = bench(dot_scalar, "Scalar");
        double t2 = bench(dot_avx2,   "AVX2  ");
        std::cout << "Speedup: " << t1 / t2 << "x\n";
        return 0;
    }
    

  2. Implement AVX2 ReLU using _mm256_max_ps and compare against a scalar loop. Then try with multiple accumulators (loop unrolling) to hide FMA latency.

    // task2_avx_relu.cpp
    // Compile: g++ -O3 -mavx2 -o task2 task2_avx_relu.cpp
    
    #include <iostream>
    #include <chrono>
    #include <vector>
    #include <immintrin.h>
    
    void relu_scalar(const float* in, float* out, int n) {
        for (int i = 0; i < n; i++) {
            out[i] = in[i] > 0.0f ? in[i] : 0.0f;
        }
    }
    
    void relu_avx2(const float* in, float* out, int n) {
        __m256 zero = _mm256_setzero_ps();
        int i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 x = _mm256_loadu_ps(in + i);
            _mm256_storeu_ps(out + i, _mm256_max_ps(x, zero));
        }
        for (; i < n; i++) out[i] = in[i] > 0.0f ? in[i] : 0.0f;
    }
    
    // Unrolled: process 32 floats per iteration (4 x 8)
    void relu_avx2_unrolled(const float* in, float* out, int n) {
        __m256 zero = _mm256_setzero_ps();
        int i = 0;
        for (; i + 32 <= n; i += 32) {
            __m256 x0 = _mm256_loadu_ps(in + i);
            __m256 x1 = _mm256_loadu_ps(in + i + 8);
            __m256 x2 = _mm256_loadu_ps(in + i + 16);
            __m256 x3 = _mm256_loadu_ps(in + i + 24);
            _mm256_storeu_ps(out + i,      _mm256_max_ps(x0, zero));
            _mm256_storeu_ps(out + i + 8,  _mm256_max_ps(x1, zero));
            _mm256_storeu_ps(out + i + 16, _mm256_max_ps(x2, zero));
            _mm256_storeu_ps(out + i + 24, _mm256_max_ps(x3, zero));
        }
        for (; i + 8 <= n; i += 8) {
            _mm256_storeu_ps(out + i, _mm256_max_ps(_mm256_loadu_ps(in + i), zero));
        }
        for (; i < n; i++) out[i] = in[i] > 0.0f ? in[i] : 0.0f;
    }
    
    int main() {
        const int N = 16'000'000;
        std::vector<float> in(N), out(N);
        for (int i = 0; i < N; i++) in[i] = (float)(i % 200) - 100.0f;
    
        auto bench = [&](auto fn, const char* name) {
            fn(in.data(), out.data(), N);  // warm up
            auto start = std::chrono::high_resolution_clock::now();
            for (int t = 0; t < 100; t++) fn(in.data(), out.data(), N);
            auto end = std::chrono::high_resolution_clock::now();
            double ms = std::chrono::duration<double, std::milli>(end - start).count() / 100;
            double bw = 2.0 * N * sizeof(float) / ms / 1e6;  // read + write
            std::cout << name << ": " << ms << " ms (" << bw << " GB/s)\n";
        };
    
        bench(relu_scalar,        "Scalar        ");
        bench(relu_avx2,          "AVX2          ");
        bench(relu_avx2_unrolled, "AVX2 unrolled ");
        return 0;
    }
    

  3. Measure the effect of memory alignment. Compare aligned vs unaligned loads on large arrays.

    // task3_alignment.cpp
    // Compile: g++ -O3 -mavx2 -o task3 task3_alignment.cpp
    
    #include <iostream>
    #include <chrono>
    #include <cstdlib>
    #include <immintrin.h>
    
    int main() {
        const int N = 16'000'000;
    
        // Aligned allocation (32-byte for AVX2)
        float* aligned = (float*)aligned_alloc(32, N * sizeof(float));
    
        // Unaligned: offset by 4 bytes (1 float) from aligned boundary
        float* raw = (float*)malloc((N + 1) * sizeof(float));
        float* unaligned = raw + 1;  // guaranteed misaligned
    
        for (int i = 0; i < N; i++) {
            aligned[i] = 1.0f;
            unaligned[i] = 1.0f;
        }
    
        auto bench = [&](float* ptr, bool use_aligned, const char* name) {
            float sink = 0.0f;  // printed below so the compiler cannot delete the loops
            auto pass = [&]() {
                __m256 sum = _mm256_setzero_ps();
                for (int i = 0; i + 8 <= N; i += 8) {
                    __m256 v = use_aligned ? _mm256_load_ps(ptr + i) : _mm256_loadu_ps(ptr + i);
                    sum = _mm256_add_ps(sum, v);
                }
                sink = _mm_cvtss_f32(_mm256_castps256_ps128(sum));
            };
            pass();  // warm up
    
            auto start = std::chrono::high_resolution_clock::now();
            for (int t = 0; t < 100; t++) pass();
            auto end = std::chrono::high_resolution_clock::now();
            double ms = std::chrono::duration<double, std::milli>(end - start).count() / 100;
            double bw = (double)N * sizeof(float) / ms / 1e6;
            std::cout << name << ": " << ms << " ms (" << bw << " GB/s, sink=" << sink << ")\n";
        };
    
        bench(aligned,   true,  "Aligned load  ");
        bench(unaligned, false, "Unaligned load");
    
        free(aligned);
        free(raw);
        return 0;
    }