ARM and NEON¶
ARM processors power every smartphone, most tablets, Apple's laptops, and an increasing share of data centre servers. This file covers the ARM architecture, NEON SIMD programming with C++ intrinsics, I8MM and SME/SME2 for matrix workloads, SVE/SVE2 for scalable vector processing, Apple Silicon specifics, and practical vectorised kernel examples.
If you own an iPhone, a MacBook, or use AWS Graviton instances, you are running ARM. ARM's power efficiency makes it dominant in mobile and embedded, and increasingly competitive in servers and ML inference. Understanding ARM SIMD lets you write code that runs fast on the hardware most people actually use.
For a real-world example of ARM SIMD kernels in production, see Cactus — a low-latency AI engine for mobile devices and wearables: github.com/cactus-compute/cactus. Cactus implements custom ARM NEON and NPU-accelerated kernels for attention, KV-cache quantisation, and chunked prefill, achieving the fastest inference on ARM CPUs with 10x lower RAM than other engines. Its three-layer architecture (Engine → Graph → Kernels) is a concrete example of how the SIMD concepts in this file are used to build production ML infrastructure.
ARM Architecture Basics¶
ARM is a RISC (Reduced Instruction Set Computer) architecture (chapter 13). Key characteristics:
Load-store architecture: arithmetic instructions operate only on registers, never directly on memory. To add two numbers from memory, you must: (1) load them into registers, (2) add the registers, (3) store the result back to memory. This is simpler than x86 (which can add a register and a memory location in one instruction) but enables cleaner pipelining.
Fixed-width instructions: every ARMv8 (AArch64) instruction is exactly 32 bits. This makes decoding fast and predictable (unlike x86's variable-length instructions that can be 1-15 bytes).
31 general-purpose registers (x0-x30, each 64-bit), plus the stack pointer (sp) and the zero register (xzr), which share register number 31 in instruction encodings. Compare to x86-64's 16 general-purpose registers. More registers = fewer memory accesses = faster code.
32 SIMD/floating-point registers (v0-v31, each 128-bit) for NEON and floating-point operations.
// ARM assembly (just to see the flavour -- you will use intrinsics, not assembly)
// Add two registers
add x0, x1, x2 // x0 = x1 + x2
// Load from memory
ldr x0, [x1] // x0 = *x1 (load 64 bits from address in x1)
// NEON: add four floats
fadd v0.4s, v1.4s, v2.4s // v0 = v1 + v2 (four 32-bit floats)
- You will not write assembly. You will use intrinsics: C/C++ functions that map 1:1 to specific instructions. The compiler handles register allocation, scheduling, and other low-level details.
NEON: 128-bit SIMD¶
- NEON is ARM's SIMD extension. Each NEON register is 128 bits wide and can hold:
| Data type | Elements per register | Notation |
|---|---|---|
| float32 | 4 | float32x4_t |
| float16 | 8 | float16x8_t |
| int32 | 4 | int32x4_t |
| int16 | 8 | int16x8_t |
| int8 | 16 | int8x16_t |
- 128 bits is narrower than x86's AVX (256-bit) or AVX-512 (512-bit). But ARM compensates with excellent power efficiency and wide availability.
NEON Intrinsics: The Basics¶
- NEON intrinsics follow a naming convention:
v[operation][qualifier]_[type]
#include <arm_neon.h>
// Load 4 floats from memory into a NEON register
float32x4_t a = vld1q_f32(ptr); // vld1q = vector load 1, q = 128-bit (quad)
// Store 4 floats from a NEON register to memory
vst1q_f32(out_ptr, a); // vst1q = vector store 1, q = 128-bit
// Arithmetic
float32x4_t c = vaddq_f32(a, b); // c = a + b (4 floats)
float32x4_t d = vmulq_f32(a, b); // d = a * b (4 floats)
float32x4_t e = vfmaq_f32(c, a, b); // e = c + a * b (fused multiply-add, 4 floats)
// Comparison (returns a mask: all 1s if true, all 0s if false)
uint32x4_t mask = vcgtq_f32(a, b); // mask[i] = (a[i] > b[i]) ? 0xFFFFFFFF : 0
// Select elements based on mask (like numpy.where)
float32x4_t result = vbslq_f32(mask, a, b); // result[i] = mask[i] ? a[i] : b[i]
// Reduce: sum all 4 elements to a scalar
float total = vaddvq_f32(a); // total = a[0] + a[1] + a[2] + a[3]
`vfmaq_f32` (fused multiply-add) is the most important SIMD instruction for ML. It computes \(c = c + a \times b\) in one instruction with a single rounding step (more accurate than separate multiply then add). Dot products, matrix multiplications, and convolutions are built from FMA.
Practical Example: Vectorised Dot Product¶
- The dot product is the inner loop of matrix multiplication. Let's write it in scalar C++ and then vectorise it with NEON.
#include <arm_neon.h>
// Scalar dot product
float dot_scalar(const float* a, const float* b, int n) {
float sum = 0.0f;
for (int i = 0; i < n; i++) {
sum += a[i] * b[i];
}
return sum;
}
// NEON-vectorised dot product
float dot_neon(const float* a, const float* b, int n) {
float32x4_t sum_vec = vdupq_n_f32(0.0f); // initialise 4 accumulators to 0
int i = 0;
for (; i + 4 <= n; i += 4) {
float32x4_t va = vld1q_f32(a + i); // load 4 elements from a
float32x4_t vb = vld1q_f32(b + i); // load 4 elements from b
sum_vec = vfmaq_f32(sum_vec, va, vb); // sum_vec += va * vb
}
// Reduce the 4 accumulators to a single scalar
float sum = vaddvq_f32(sum_vec);
// Handle remaining elements (if n is not a multiple of 4)
for (; i < n; i++) {
sum += a[i] * b[i];
}
return sum;
}
Key C++ concepts:
- `const float*`: a pointer to read-only float data. `const` promises we won't modify the data through this pointer.
- `a + i`: pointer arithmetic. `a + i` points to the \(i\)-th element of the array (same as `&a[i]`).
- The "cleanup loop" at the end handles the case where \(n\) is not a multiple of 4. This is a universal pattern in SIMD code: process the bulk in vectorised chunks, then handle the remainder in scalar code.
Why 4 accumulators in `sum_vec`: instead of a single scalar accumulator, we use 4 independent partial sums (one per SIMD lane). A scalar loop chains every addition through one variable; with 4 independent lanes, each FMA advances 4 partial sums at once, so the CPU can keep its FMA pipeline much fuller. At the end, we reduce the 4 partial sums to one.
Practical Example: Vectorised ReLU¶
#include <arm_neon.h>
void relu_neon(const float* input, float* output, int n) {
float32x4_t zero = vdupq_n_f32(0.0f);
int i = 0;
for (; i + 4 <= n; i += 4) {
float32x4_t x = vld1q_f32(input + i);
float32x4_t result = vmaxq_f32(x, zero); // max(x, 0) = ReLU
vst1q_f32(output + i, result);
}
// Scalar cleanup
for (; i < n; i++) {
output[i] = input[i] > 0 ? input[i] : 0;
}
}
`vmaxq_f32` computes the element-wise maximum of two vectors. Since one vector is all zeros, this is exactly ReLU. No branching, no comparisons — just a single instruction.
I8MM: Integer Matrix Multiply¶
I8MM (Int8 Matrix Multiply) is an ARMv8.6 extension that adds dedicated instructions for INT8 matrix multiplication with INT32 accumulation — exactly what quantised ML inference needs.
The key instruction is `SMMLA` (Signed Matrix Multiply-Accumulate): it multiplies a 2×8 block of INT8 values by the transpose of another 2×8 block and accumulates the result into a 2×2 block of INT32:
#include <arm_neon.h>
// I8MM: multiply two 8-element INT8 vectors, accumulate into 4 INT32 results
// This computes a 2x2 tile of the output matrix from 2x8 x 8x2 input tiles
void matmul_i8mm_tile(const int8_t* A, const int8_t* B, int32_t* C) {
    // Load 16 bytes each from A and B (2 rows of 8 INT8 elements, packed)
    int8x16_t va = vld1q_s8(A); // 16 bytes = 2 rows × 8 elements
    int8x16_t vb = vld1q_s8(B); // 16 bytes = 2 rows × 8 elements
// Load existing accumulator (2x2 = 4 int32 values)
int32x4_t acc = vld1q_s32(C);
// I8MM instruction: acc += A_tile × B_tile^T
// Computes 2×2 output from 2×8 × 8×2 inputs
acc = vmmlaq_s32(acc, va, vb); // THE I8MM instruction
vst1q_s32(C, acc);
}
Why I8MM matters: without I8MM, INT8 matmul on NEON requires widening multiplies (`vmull`) followed by pairwise adds — multiple instructions per output element. With I8MM, the hardware does four 8-element dot products (2×8 × 8×2 = 2×2) in a single instruction. For INT8 inference workloads, this is 4-8x faster than plain NEON.
Availability: Apple Silicon from the M2 onwards (the M1 predates the ARMv8.6 extensions), ARM Cortex-A510/A710/X2 and later (ARMv9), AWS Graviton3 and later. Check at compile time with `#ifdef __ARM_FEATURE_MATMUL_INT8`.
For ML inference: INT8 quantised models (chapter 18) running on ARM servers (Graviton) or Apple Silicon benefit enormously from I8MM. Frameworks like ONNX Runtime and llama.cpp detect I8MM at runtime and use optimised kernels automatically.
SME and SME2: Scalable Matrix Extension¶
SME (Scalable Matrix Extension) is ARM's answer to Intel AMX and NVIDIA Tensor Cores: dedicated hardware for matrix operations. SME2 (ARMv9.2) extends it further.
SME introduces ZA tile registers: 2D matrices stored in hardware, up to SVL×SVL bytes (where SVL is the streaming vector length, typically 128-512 bits per dimension). Unlike NEON (1D vectors) or even SVE (1D scalable vectors), SME operates on 2D tiles natively.
The programming model has two modes:
- Normal mode: standard ARM execution (NEON, SVE work as usual).
- Streaming SVE mode: entered via `smstart`, enables SME instructions. SVE instructions also work in this mode but may use a different vector width.
#include <arm_sme.h>
// SME2: outer product accumulation for matrix multiply
// Accumulates A_col × B_row into the ZA tile register
void sme2_matmul_outer(const float* A_col, const float* B_row, int K) {
// Enter streaming mode
// smstart; // (done via compiler intrinsic or inline asm)
// Zero the ZA tile accumulator
svzero_za();
    const uint64_t SVL = svcntsw(); // streaming vector length, in 32-bit elements
    for (int k = 0; k < K; k++) {
        // Load a column of A and a row of B into SVE registers
        svfloat32_t a = svld1_f32(svptrue_b32(), &A_col[k * SVL]);
        svfloat32_t b = svld1_f32(svptrue_b32(), &B_row[k * SVL]);
// Outer product: ZA += a × b^T
// This accumulates an SVL×SVL tile in one instruction
svmopa_za32_f32_m(0, svptrue_b32(), svptrue_b32(), a, b);
}
// Store the ZA tile to memory
// svst1_za(...);
// Exit streaming mode
// smstop;
}
Key concepts:
- `svmopa` (outer product accumulate): the core SME instruction. It computes a full outer product of two vectors and accumulates into the ZA tile. For SVL = 512 bits (16 floats), this is a 16×16 outer product: 256 FMA operations in one instruction.
- ZA tile: persistent across instructions within streaming mode. You accumulate multiple outer products (one per K iteration) into the same tile, building up a full matrix-multiply tile.
- Streaming mode: SME instructions only work in streaming mode. The overhead of entering and exiting streaming mode means SME is best for sustained matrix computation, not short bursts.
SME2 additions: multi-vector operations (process 2 or 4 SVE vectors simultaneously), additional tile operations, and improved integration with normal mode.
Availability: Apple M4 (which implements SME2) and recent ARMv9.2 mobile cores such as Cortex-X925. Note that AWS Graviton4's Neoverse V2 cores support SVE2 but not SME. SME is still early-stage: most ML frameworks do not yet have SME-optimised kernels.
The progression: NEON (128-bit vectors, element-wise) → I8MM (INT8 matrix tiles) → SVE (scalable vectors) → SME (scalable 2D matrix tiles). Each generation moves closer to native matrix operations in hardware.
SVE and SVE2: Scalable Vector Extensions¶
- NEON has a fixed 128-bit width. SVE (Scalable Vector Extension) introduces vector-length agnostic (VLA) programming: you write code once, and it runs on hardware with any vector width (128 to 2048 bits). The hardware determines the width at runtime.
#include <arm_sve.h>
void add_sve(const float* a, const float* b, float* c, int n) {
int i = 0;
svbool_t pred = svwhilelt_b32(i, n); // predicate: which lanes are active
while (svptest_any(svptrue_b32(), pred)) {
svfloat32_t va = svld1(pred, a + i);
svfloat32_t vb = svld1(pred, b + i);
svst1(pred, c + i, svadd_x(pred, va, vb));
i += svcntw(); // advance by the hardware vector width (in 32-bit elements)
pred = svwhilelt_b32(i, n);
}
}
Predicate registers (`svbool_t`) replace the scalar cleanup loop. Each lane has a predicate bit: active lanes participate, inactive lanes are masked off. The `svwhilelt_b32(i, n)` instruction creates a predicate where lanes corresponding to i, i+1, ..., n-1 are active. This handles the tail automatically.
`svcntw()` returns the number of 32-bit elements per vector register at runtime. On a CPU with 256-bit SVE, this returns 8. On 512-bit SVE, it returns 16. Your code adapts automatically.

SVE is available on ARM Neoverse V1/V2 (AWS Graviton3/4 and some other server chips). Apple Silicon does not implement plain SVE; the M4 exposes SVE instructions only inside SME streaming mode.
Apple Silicon Specifics¶
Apple's M-series chips (M1, M2, M3, M4) are ARM-based with custom microarchitecture:
Performance and efficiency cores: P-cores (Firestorm/Avalanche/etc.) for heavy compute, E-cores (Icestorm/Blizzard/etc.) for background tasks. The scheduler assigns threads to the appropriate core type.
AMX (Apple Matrix eXtensions): dedicated matrix multiply units, separate from NEON. AMX is undocumented (Apple does not publish the ISA), but the Accelerate framework uses it internally for BLAS operations. When NumPy is linked against Accelerate, `np.dot` on a Mac goes through Accelerate and thus AMX. You cannot program AMX directly (without reverse engineering).
Unified memory: CPU and GPU share the same physical RAM. On other systems, data must be copied from CPU memory to GPU memory (over PCIe, ~32 GB/s). On Apple Silicon, there is no copy — the GPU reads the same memory the CPU wrote. This eliminates a major bottleneck for ML workloads.
Neural Engine: a 16-core dedicated ML accelerator delivering tens of TOPS (trillions of operations per second) for INT8 inference, with the exact figure depending on the chip generation. Used by Core ML for on-device inference.
For ML on Apple Silicon: use MLX (Apple's ML framework), which is designed for the unified memory architecture. PyTorch also has MPS (Metal Performance Shaders) backend support, though it is less mature than CUDA.
Auto-Vectorisation¶
Writing SIMD intrinsics is tedious. Can the compiler vectorise your code automatically?
Yes, with caveats. Modern compilers (GCC, Clang) can auto-vectorise simple loops:
// The compiler CAN auto-vectorise this (with -O3 -march=native)
void add_auto(const float* a, const float* b, float* c, int n) {
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}
}
- Patterns that help auto-vectorisation:
  - Simple loops with a known trip count.
  - No data dependencies between iterations (`c[i]` does not depend on `c[i-1]`).
  - Contiguous memory access (no scatter/gather).
  - `const` and `restrict` pointers (tell the compiler the arrays do not overlap).
// restrict tells the compiler: a, b, c point to non-overlapping memory
void add_restrict(const float* __restrict__ a,
const float* __restrict__ b,
float* __restrict__ c, int n) {
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}
}
Without `restrict`, the compiler must assume `c` might overlap with `a` or `b` (writing to `c[i]` might change `a[i+1]`), preventing vectorisation.
Patterns that prevent auto-vectorisation:
- Data dependencies: `a[i] = a[i-1] + b[i]` (each iteration depends on the previous).
- Complex control flow: `if` statements inside the loop (unless the compiler can convert them to predication).
- Function calls inside the loop (unless the function is inlined).
- Pointer aliasing (arrays might overlap, without `restrict`).
Checking auto-vectorisation: use compiler flags to see what was vectorised:
# GCC: show vectorisation decisions
g++ -O3 -march=native -fopt-info-vec-optimized code.cpp
# Clang: show vectorisation report
clang++ -O3 -march=native -Rpass=loop-vectorize code.cpp
- When to use intrinsics vs auto-vectorisation: start with clean C++ and compiler optimisations. If the compiler vectorises your loop, great. If performance is still insufficient, inspect the compiler's vectorisation report to understand why, and only then write intrinsics for the critical inner loop. Premature intrinsics make code unreadable without guaranteed benefit.
Coding Tasks (compile with g++ or clang++ on ARM — Mac M-series or Linux aarch64)¶
Write a scalar dot product and a NEON-vectorised dot product. Benchmark both and measure the speedup.
// task1_neon_dot.cpp
// Compile (Mac/ARM Linux): clang++ -O3 -o task1 task1_neon_dot.cpp
// Note: NEON is enabled by default on AArch64, no special flags needed
#include <iostream>
#include <chrono>
#include <vector>
#include <arm_neon.h>

float dot_scalar(const float* a, const float* b, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

float dot_neon(const float* a, const float* b, int n) {
    float32x4_t sum_vec = vdupq_n_f32(0.0f);
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        sum_vec = vfmaq_f32(sum_vec, va, vb);
    }
    float sum = vaddvq_f32(sum_vec);
    for (; i < n; i++) sum += a[i] * b[i];
    return sum;
}

int main() {
    const int N = 10'000'000;
    std::vector<float> a(N, 1.0f), b(N, 2.0f);
    // Warm up
    volatile float s1 = dot_scalar(a.data(), b.data(), N);
    volatile float s2 = dot_neon(a.data(), b.data(), N);
    // Benchmark scalar
    auto start = std::chrono::high_resolution_clock::now();
    for (int t = 0; t < 100; t++) {
        s1 = dot_scalar(a.data(), b.data(), N);
    }
    auto end = std::chrono::high_resolution_clock::now();
    double scalar_ms = std::chrono::duration<double, std::milli>(end - start).count() / 100;
    // Benchmark NEON
    start = std::chrono::high_resolution_clock::now();
    for (int t = 0; t < 100; t++) {
        s2 = dot_neon(a.data(), b.data(), N);
    }
    end = std::chrono::high_resolution_clock::now();
    double neon_ms = std::chrono::duration<double, std::milli>(end - start).count() / 100;
    std::cout << "Scalar:  " << scalar_ms << " ms (result: " << s1 << ")\n";
    std::cout << "NEON:    " << neon_ms << " ms (result: " << s2 << ")\n";
    std::cout << "Speedup: " << scalar_ms / neon_ms << "x\n";
    return 0;
}
Implement NEON ReLU and softmax-max-finding. Practice the load→compute→store pattern with different operations.
// task2_neon_ops.cpp
// Compile: clang++ -O3 -o task2 task2_neon_ops.cpp
#include <iostream>
#include <vector>
#include <cmath>
#include <arm_neon.h>

void relu_neon(const float* in, float* out, int n) {
    float32x4_t zero = vdupq_n_f32(0.0f);
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        float32x4_t x = vld1q_f32(in + i);
        vst1q_f32(out + i, vmaxq_f32(x, zero));
    }
    for (; i < n; i++) out[i] = in[i] > 0 ? in[i] : 0;
}

float max_neon(const float* data, int n) {
    float32x4_t max_vec = vdupq_n_f32(-INFINITY);
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        max_vec = vmaxq_f32(max_vec, vld1q_f32(data + i));
    }
    float result = vmaxvq_f32(max_vec);
    for (; i < n; i++) result = result > data[i] ? result : data[i];
    return result;
}

int main() {
    std::vector<float> data = {-3, 1, -1, 4, 2, -5, 0, 7, -2, 3};
    std::vector<float> out(data.size());
    relu_neon(data.data(), out.data(), data.size());
    std::cout << "ReLU: ";
    for (float x : out) std::cout << x << " ";
    std::cout << "\n";
    float mx = max_neon(data.data(), data.size());
    std::cout << "Max: " << mx << " (expected: 7)\n";
    return 0;
}
Compare auto-vectorised code against hand-written NEON intrinsics. Compile with `-fopt-info-vec` (GCC) or `-Rpass=loop-vectorize` (Clang) to see what the compiler does.
// task3_auto_vs_manual.cpp
// Compile: clang++ -O3 -Rpass=loop-vectorize -o task3 task3_auto_vs_manual.cpp
// (or): g++ -O3 -fopt-info-vec-optimized -o task3 task3_auto_vs_manual.cpp
#include <iostream>
#include <chrono>
#include <vector>
#include <arm_neon.h>

// Let the compiler auto-vectorise
void add_auto(const float* __restrict__ a, const float* __restrict__ b,
              float* __restrict__ c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

// Hand-written NEON
void add_neon(const float* a, const float* b, float* c, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        vst1q_f32(c + i, vaddq_f32(vld1q_f32(a + i), vld1q_f32(b + i)));
    }
    for (; i < n; i++) c[i] = a[i] + b[i];
}

int main() {
    const int N = 10'000'000;
    std::vector<float> a(N, 1.0f), b(N, 2.0f), c(N);
    auto bench = [&](auto fn, const char* name) {
        fn(a.data(), b.data(), c.data(), N); // warm up
        auto start = std::chrono::high_resolution_clock::now();
        for (int t = 0; t < 100; t++) fn(a.data(), b.data(), c.data(), N);
        auto end = std::chrono::high_resolution_clock::now();
        double ms = std::chrono::duration<double, std::milli>(end - start).count() / 100;
        std::cout << name << ": " << ms << " ms\n";
    };
    bench(add_auto, "Auto-vectorised");
    bench(add_neon, "Hand-written NEON");
    // They should be very close: the compiler auto-vectorises this simple loop well
    return 0;
}