Hardware Fundamentals

Before writing SIMD or GPU code, you need to understand the hardware you are programming. This file covers why parallelism replaced clock speed as the source of performance, how modern CPUs execute instructions, what SIMD is, the roofline model for reasoning about performance, and the landscape of chip architectures.

  • For decades, software got faster for free: buy a new CPU with a higher clock speed, and your program runs faster without changing a line of code. That era ended around 2005. Understanding why it ended, and what replaced it, is essential for anyone who wants to write fast code.

The End of Free Performance

  • Moore's Law (1965) observed that the number of transistors on a chip doubles roughly every two years. This held for roughly 60 years. Shrinking transistors packed more of them onto a chip and let them switch faster, which meant higher clock speeds, which meant faster programs.

  • But around 2005, clock speeds hit a wall at ~4 GHz. The problem is power. The power consumed by a chip is approximately:

\[P \propto C \cdot V^2 \cdot f\]
  • where \(C\) is capacitance (proportional to transistor count), \(V\) is voltage, and \(f\) is clock frequency. To increase frequency, you must increase voltage (to switch transistors faster). But power scales with \(V^2 \cdot f\), so a small increase in frequency causes a large increase in power (and heat). At 4 GHz, chips were already hitting 100+ watts. Going to 8 GHz would require impractical cooling.
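The scaling can be made concrete with a small sketch of the dynamic power formula (plain Python, illustrative scale factors only, not measurements of any real chip):

```python
# Dynamic power model: P ∝ C · V² · f (relative to a baseline of 1.0)
def relative_power(freq_scale, voltage_scale):
    """Power relative to baseline when frequency and voltage are scaled."""
    return voltage_scale**2 * freq_scale

# Doubling frequency with a proportional voltage increase: 2² · 2 = 8x the power.
print(relative_power(2.0, 2.0))  # 8.0

# Four cores at 0.75x frequency and 0.9x voltage vs one baseline core:
# roughly 2.4x the power, but up to 3x the parallel work.
print(4 * relative_power(0.75, 0.9))
```

This cubic-in-frequency behaviour (since V must rise with f) is why "just clock it higher" stopped being viable, while "add more cores at moderate clocks" did not.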

  • The solution: instead of making one core faster, put multiple cores on the same chip. A 4-core chip at 3 GHz uses similar power to a single core at 4.5 GHz but can do 4x the parallel work. This is why every modern CPU has multiple cores, and why parallelism (SIMD, multithreading, GPU computing) is the only path to more performance.

  • Implication for ML: a training step that takes 10 minutes on one core cannot be made faster by buying a faster CPU. It can only be made faster by using more cores (data parallelism, chapter 6), wider SIMD units (this chapter), or GPUs (thousands of cores).

How Modern CPUs Execute Instructions

  • A modern CPU core is far more complex than the simple fetch-decode-execute model from chapter 13. It uses several tricks to execute more instructions per cycle:

  • Superscalar execution: the CPU has multiple execution units (ALUs, FPUs, load/store units) and can execute multiple independent instructions simultaneously. A modern core might execute 4-6 instructions per cycle if they do not depend on each other.

  • Out-of-order execution (OoO): the CPU does not execute instructions in program order. It looks ahead in the instruction stream, finds instructions whose inputs are ready, and executes them immediately, regardless of their position. This hides latency: while one instruction waits for data from memory (100+ cycles), the CPU executes other instructions that are ready.

  • Branch prediction: conditional branches (if statements, loop conditions) create uncertainty: the CPU does not know which path to take until the condition is evaluated. Rather than stall, the CPU predicts the outcome and speculatively executes down the predicted path. If the prediction is correct (>95% of the time with modern predictors), no time is lost. If wrong, the speculative work is discarded and the correct path is executed (~15 cycle penalty).

  • Speculative execution: an extension of branch prediction. The CPU executes instructions that might not be needed, betting that they will be. This fills the pipeline and keeps execution units busy.

  • All of these are automatic — the CPU does them without any programmer intervention. But they only help with instruction-level parallelism (ILP): independent instructions within a single stream. For data-level parallelism (the same operation on many data elements), we need SIMD.

SIMD: Single Instruction, Multiple Data

  • SIMD is the idea of applying one instruction to multiple data elements simultaneously. Instead of adding two numbers, add two vectors of 4 (or 8, or 16) numbers in a single instruction.

  • Without SIMD (scalar):

// Add two arrays element by element: 4 add instructions
for (int i = 0; i < 4; i++) {
    c[i] = a[i] + b[i];  // one add per iteration
}
  • With SIMD (vectorised):
// Add two arrays: 1 SIMD instruction does all 4 adds
#include <immintrin.h>  // x86 SIMD intrinsics

__m128 va = _mm_load_ps(a);    // load 4 floats (a must be 16-byte aligned; use _mm_loadu_ps if not)
__m128 vb = _mm_load_ps(b);    // load 4 floats into another register
__m128 vc = _mm_add_ps(va, vb); // add all 4 pairs simultaneously
_mm_store_ps(c, vc);            // store 4 results
  • The SIMD version does the same work in 1/4 of the instructions. This is a theoretical 4x speedup, achieved by processing 4 floats per instruction instead of 1.

Vector Registers

  • SIMD instructions operate on vector registers: wide registers that hold multiple data elements.
| Register width | Floats (32-bit) | Doubles (64-bit) | Name |
|---|---|---|---|
| 128-bit | 4 | 2 | SSE (x86), NEON (ARM) |
| 256-bit | 8 | 4 | AVX/AVX2 (x86) |
| 512-bit | 16 | 8 | AVX-512 (x86) |
| Variable (128-2048 bit) | varies | varies | SVE/SVE2 (ARM) |
  • Wider registers = more parallelism. A 512-bit AVX-512 instruction processes 16 floats at once, a theoretical 16x speedup over scalar code. In practice, the speedup is lower due to memory bandwidth limitations (you can compute faster than you can feed data to the CPU).
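The lane count falls straight out of the arithmetic: register width in bits divided by element width. A quick sketch:

```python
# SIMD lanes = register width (bits) / element width (bits)
def lanes(register_bits, element_bytes):
    return register_bits // (element_bytes * 8)

for name, bits in [("SSE/NEON", 128), ("AVX2", 256), ("AVX-512", 512)]:
    print(f"{name:8s}: {lanes(bits, 4):2d} floats, {lanes(bits, 8):2d} doubles")
```

The same arithmetic explains why low-precision formats help: halving element width (float32 → float16) doubles the lanes, and therefore the peak throughput, for the same register.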

  • For ML: a matrix multiplication of float32 values benefits enormously from SIMD. The inner loop (dot product of two vectors) maps directly to SIMD multiply-accumulate instructions. This is why BLAS libraries (which NumPy and PyTorch call) are so heavily optimised with SIMD.

The Roofline Model

  • How do you know if your code is fast? The roofline model provides a framework by characterising performance in terms of two hardware limits:

  • Peak compute (FLOPS): the maximum floating-point operations per second. For a 4 GHz CPU with 256-bit AVX (8 floats per instruction) and 2 FMA units, where each FMA counts as 2 FLOPs: \(4 \times 10^9 \times 8 \times 2 \times 2 = 128\) GFLOPS.

  • Peak memory bandwidth (bytes/second): how fast data can be moved from memory to the CPU. A modern CPU might have 50 GB/s of memory bandwidth.

  • The arithmetic intensity of your code is the ratio of computation to memory access:

\[\text{Arithmetic Intensity} = \frac{\text{FLOPs performed}}{\text{Bytes transferred}}\]
  • If arithmetic intensity is low (few operations per byte loaded), your code is memory-bound: it spends most of its time waiting for data. Making compute faster (wider SIMD, higher clock) will not help.

  • If arithmetic intensity is high (many operations per byte), your code is compute-bound: it spends most of its time computing. Faster memory will not help.

  • The roofline:

\[\text{Achievable FLOPS} = \min\left(\text{Peak FLOPS}, \; \text{Bandwidth} \times \text{Arithmetic Intensity}\right)\]
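The roofline is easy to evaluate directly. A minimal sketch, assuming a hypothetical machine with 100 GFLOPS peak compute and 50 GB/s bandwidth (illustrative numbers, not any specific chip):

```python
def roofline(peak_gflops, bandwidth_gbs, intensity_flops_per_byte):
    """Attainable GFLOPS under the roofline model."""
    return min(peak_gflops, bandwidth_gbs * intensity_flops_per_byte)

PEAK, BW = 100.0, 50.0  # hypothetical: 100 GFLOPS peak, 50 GB/s bandwidth
for ai in (0.25, 1.0, 2.0, 8.0):  # arithmetic intensity in FLOPs/byte
    print(f"AI={ai:5.2f} -> {roofline(PEAK, BW, ai):6.1f} GFLOPS")
```

The "ridge point" where the two limits meet is the machine balance, PEAK/BW (2 FLOPs/byte here): below it, code is memory-bound and extra compute is wasted; above it, code is compute-bound and extra bandwidth is wasted.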
  • Matrix multiplication has high arithmetic intensity: \(O(n^3)\) operations on \(O(n^2)\) data, so intensity \(\approx O(n)\). For large matrices, it is compute-bound. This is why GPUs (high compute) dominate matrix-heavy ML workloads.

  • Element-wise operations (ReLU, add, multiply) have low arithmetic intensity: 1 operation per element loaded. These are memory-bound. Making the GPU faster does not help; you need faster memory (or fuse these operations with compute-heavy ones to avoid separate memory round-trips).

  • The roofline model explains why kernel fusion matters so much: combining a matmul with a bias add and ReLU into a single kernel avoids writing intermediate results to memory and reading them back, turning three memory-bound operations into one compute-bound operation.
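A rough traffic count makes the fusion argument concrete. This sketch counts only main-memory reads and writes of the \(n \times n\) activation tensor in float32, ignoring weights and bias (a simplifying assumption):

```python
# Memory traffic for y = relu(x @ W + b) on an (n, n) activation, float32
def traffic_bytes(n, fused):
    elem = 4                       # float32
    matmul = 2 * n * n * elem      # read x, write the matmul output
    if fused:
        # bias add and ReLU happen in registers before the single write-back
        return matmul
    # unfused: each elementwise op re-reads and re-writes the n*n tensor
    bias = 2 * n * n * elem
    relu = 2 * n * n * elem
    return matmul + bias + relu

n = 4096
print(traffic_bytes(n, fused=False) / traffic_bytes(n, fused=True))  # 3.0
```

Under these assumptions fusion cuts activation traffic by 3x; since the elementwise steps are memory-bound, that translates almost directly into wall-clock savings.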

Latency vs Throughput

  • Latency is the time to complete one operation. Throughput is the number of operations completed per unit time.

  • An analogy: a bus has high latency (waits at every stop) but high throughput (carries 50 people at once). A taxi has low latency (goes directly to your destination) but low throughput (carries 1-4 people).

  • GPUs are buses: high latency per operation (each instruction takes many cycles to complete) but enormous throughput (thousands of cores processing simultaneously). CPUs are taxis: low latency (out-of-order execution, branch prediction, deep caches minimise delays) but limited throughput (4-64 cores).

  • This is why GPUs are better for ML training (throughput matters: process millions of examples) and CPUs are better for OS tasks (latency matters: respond to a keypress immediately).

  • Pipelining converts latency into throughput. If an instruction takes 5 cycles but the pipeline starts a new instruction every cycle, the throughput is 1 instruction per cycle (even though each takes 5 cycles to complete). This is the same principle as the CPU pipeline from chapter 13, but it applies at every level: SIMD units, memory controllers, and GPU cores are all pipelined.
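The arithmetic behind "latency becomes throughput" is simple enough to sketch, assuming a 5-cycle operation and a pipeline that can issue one new operation per cycle:

```python
def total_cycles(n_ops, latency, pipelined):
    # Pipelined: first result appears after `latency` cycles, then one per cycle
    return latency + (n_ops - 1) if pipelined else n_ops * latency

n, lat = 1000, 5
print(total_cycles(n, lat, pipelined=False))  # 5000 cycles: one op at a time
print(total_cycles(n, lat, pipelined=True))   # 1004 cycles: ~1 op/cycle
```

For large n the pipelined time approaches one operation per cycle regardless of the per-operation latency, which is exactly the bus-vs-taxi trade-off above in miniature.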

Chip Architecture Landscape

  • The hardware you write code for determines which SIMD instructions are available:

x86 (Intel, AMD)

  • Dominates desktops, laptops, and data centre CPUs. SIMD: SSE (128-bit), AVX/AVX2 (256-bit), AVX-512 (512-bit). Intel AMX provides dedicated matrix multiply units for AI workloads.

  • Strengths: highest single-core performance, widest SIMD, mature software ecosystem (MKL, oneDNN).

  • Weaknesses: high power consumption, complex instruction set, expensive.

ARM

  • Dominates mobile (every smartphone), growing in servers (AWS Graviton, Ampere Altra) and laptops (Apple M-series). SIMD: NEON (128-bit), SVE/SVE2 (scalable, 128-2048 bit).

  • Strengths: excellent power efficiency (performance per watt), custom cores (Apple M4 rivals Intel in single-core performance at fraction of the power).

  • Weaknesses: narrower SIMD (NEON is only 128-bit, though SVE can be wider), smaller software ecosystem for HPC.

Apple Silicon (M1/M2/M3/M4)

  • ARM-based with custom additions. Includes AMX (Apple Matrix eXtensions) — undocumented matrix multiply units that Accelerate framework uses for BLAS operations. Unified memory architecture: CPU and GPU share the same physical memory, eliminating the CPU↔GPU copy bottleneck.

  • For ML: Apple's Neural Engine (16-core, dedicated ML accelerator) and unified memory make M-series chips surprisingly capable for local ML inference and small-scale training. No CUDA, though: you must use Metal (Apple's GPU API) or MLX (Apple's ML framework).

RISC-V

  • Open-source ISA. No licensing fees (unlike ARM). Growing in embedded systems, IoT, and research. SIMD: the "V" (vector) extension provides scalable vector processing similar to ARM SVE.

  • For ML: not yet competitive with x86/ARM for ML workloads, but watch this space. Several AI accelerator startups use RISC-V cores.

GPUs (NVIDIA, AMD, Intel)

  • Covered in depth in files 04-05. Thousands of simple cores optimised for throughput. NVIDIA dominates ML with CUDA; AMD competes with ROCm; Intel enters with Arc GPUs and Gaudi accelerators.

TPUs (Google)

  • Custom ASIC designed specifically for ML. Systolic arrays optimised for matrix multiplication. Covered in file 05.

Thermal and Power Constraints

  • Performance is ultimately limited by power and cooling:

  • TDP (Thermal Design Power): the maximum sustained power a chip can consume. A laptop CPU might have a 15W TDP; a server CPU 250W; a data centre GPU 700W (NVIDIA H100) or more (~1000W for the B200).

  • Dark silicon: at any given moment, a significant fraction of transistors must be powered off to stay within the thermal budget. A chip could theoretically use all its transistors simultaneously, but it would melt.

  • Power efficiency (FLOPS/watt) is increasingly the metric that matters, not raw FLOPS. This is why:

    • ARM is gaining ground in data centres (better FLOPS/watt than x86).
    • TPUs compete with GPUs despite lower peak FLOPS (much better FLOPS/watt for ML workloads).
    • Quantisation (INT8, FP8) is not just about memory: it also reduces power per operation.
  • For ML at scale: training a frontier LLM consumes megawatts of power for months. The electricity cost can exceed the hardware cost. Power efficiency directly impacts the economics of AI research.

Practical: Measuring Performance in C++

  • To reason about performance, you need to measure it. Here is a minimal C++ benchmarking setup:
#include <iostream>
#include <chrono>
#include <vector>

// Scalar addition
void add_scalar(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    const int N = 1 << 24;  // ~16 million elements
    std::vector<float> a(N, 1.0f), b(N, 2.0f), c(N);

    // Warm up (fill caches, trigger frequency scaling)
    add_scalar(a.data(), b.data(), c.data(), N);

    // Benchmark
    auto start = std::chrono::high_resolution_clock::now();

    for (int trial = 0; trial < 100; trial++) {
        add_scalar(a.data(), b.data(), c.data(), N);
    }

    auto end = std::chrono::high_resolution_clock::now();
    double elapsed = std::chrono::duration<double>(end - start).count();

    double total_bytes = 3.0 * N * sizeof(float) * 100;  // read a, read b, write c
    double bandwidth = total_bytes / elapsed / 1e9;        // GB/s

    std::cout << "Time: " << elapsed << " s\n";
    std::cout << "Bandwidth: " << bandwidth << " GB/s\n";

    return 0;
}
# Compile with optimisations
g++ -O3 -march=native -o bench bench.cpp
./bench
  • Key C++ concepts in this code:

    • #include <vector>: dynamic array (std::vector<float>) — like Python's list but typed and contiguous in memory.
    • a.data(): returns a raw pointer (float*) to the underlying array — needed for SIMD intrinsics.
    • std::chrono: high-resolution timer for benchmarking.
    • -O3: maximum compiler optimisation level. The compiler may auto-vectorise your loops (use SIMD automatically). -march=native enables all SIMD instructions your CPU supports.
  • Why warm up: the first run fills caches and may trigger CPU frequency scaling (turbo boost). Subsequent runs are more representative.

  • Why measure bandwidth: for memory-bound operations (like element-wise add), the meaningful metric is bandwidth (GB/s), not FLOPS. If your measured bandwidth is close to the hardware limit (~50 GB/s for DDR5), you are memory-bound and SIMD will not help much (the bottleneck is memory, not compute).
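The same measurement can be sketched in Python with NumPy (timings vary by machine; this is an illustrative measurement, not a calibrated benchmark):

```python
import time
import numpy as np

N = 1 << 24  # ~16 million float32 elements
a = np.full(N, 1.0, dtype=np.float32)
b = np.full(N, 2.0, dtype=np.float32)
c = np.empty_like(a)

np.add(a, b, out=c)  # warm-up run (fills caches, avoids first-touch costs)

trials = 20
start = time.perf_counter()
for _ in range(trials):
    np.add(a, b, out=c)  # read a, read b, write c
elapsed = time.perf_counter() - start

gb = 3 * N * 4 * trials / 1e9  # total bytes moved, in GB
print(f"Effective bandwidth: {gb / elapsed:.1f} GB/s")
```

Using `out=c` avoids allocating a fresh result array each trial, so the loop measures the same three memory streams as the C++ version.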

Coding Tasks (use Colab or a notebook)

  1. Compute the arithmetic intensity of common ML operations and classify them as memory-bound or compute-bound.

    def arithmetic_intensity(flops, bytes_transferred):
        return flops / bytes_transferred
    
    # Element-wise ReLU: 1 comparison per element, read + write
    n = 1024
    relu_flops = n  # 1 op per element
    relu_bytes = 2 * n * 4  # read input + write output (float32)
    print(f"ReLU: {arithmetic_intensity(relu_flops, relu_bytes):.2f} FLOPS/byte → memory-bound")
    
    # Matrix multiply: 2*n^3 ops, read 2*n^2 + write n^2 floats
    matmul_flops = 2 * n**3
    matmul_bytes = 3 * n**2 * 4  # read A + read B + write C
    print(f"Matmul ({n}×{n}): {arithmetic_intensity(matmul_flops, matmul_bytes):.0f} FLOPS/byte → compute-bound")
    
    # Layer norm: ~5n ops (mean, var, normalise), read + write
    ln_flops = 5 * n
    ln_bytes = 2 * n * 4
    print(f"LayerNorm: {arithmetic_intensity(ln_flops, ln_bytes):.2f} FLOPS/byte → memory-bound")
    
    # Convolution 3x3: 2*9*C_in*C_out*H*W, read kernel + feature map + write output
    C_in, C_out, H, W = 64, 128, 32, 32
    conv_flops = 2 * 9 * C_in * C_out * H * W
    conv_bytes = (9 * C_in * C_out + C_in * H * W + C_out * H * W) * 4
    print(f"Conv3x3: {arithmetic_intensity(conv_flops, conv_bytes):.0f} FLOPS/byte → compute-bound")
    

  2. Demonstrate why parallelism matters. Compare sequential vs parallel (NumPy) execution as data size grows.

    import numpy as np
    import time
    
    for n in [1000, 10000, 100000, 1000000, 10000000]:
        a = np.random.randn(n).astype(np.float32)
        b = np.random.randn(n).astype(np.float32)
    
        # "Sequential" (Python loop)
        start = time.time()
        c = [a[i] * b[i] for i in range(min(n, 100000))]  # cap at 100K for sanity
        seq_time = time.time() - start
        if n > 100000:
            seq_time *= n / 100000  # extrapolate
    
        # "Parallel" (NumPy, uses SIMD + multithreading internally)
        start = time.time()
        c = a * b
        par_time = time.time() - start
    
        print(f"n={n:>10,}  sequential={seq_time:.4f}s  parallel={par_time:.6f}s  "
              f"speedup={seq_time/par_time:.0f}x")