
Edge Inference

Edge inference runs models on user devices (phones, laptops, IoT sensors) without sending data to the cloud. This file covers edge constraints, the model compression pipeline, on-device runtimes, the compiler stack, hardware targets (NPUs, Neural Engines), on-device LLMs, federated learning, and latency optimisation.

  • Cloud inference requires network connectivity, adds latency (50-200 ms round trip), costs money per request, and sends user data to third-party servers. Edge inference eliminates all four: the model runs locally, responds instantly, costs nothing per inference, and keeps data private.

  • The tradeoff: edge devices have 100-1000x less compute and memory than data centre GPUs. Making models run within these constraints requires aggressive optimisation at every level.

  • Cactus (github.com/cactus-compute/cactus) is a low-latency AI engine purpose-built for mobile and wearable devices. It demonstrates many of the techniques covered in this file in production: custom ARM SIMD kernels for attention and matrix operations (chapter 16), KV-cache quantisation (chapter 17 file 01), chunked prefill, NPU-accelerated inference on Apple and Qualcomm chips, zero-copy memory mapping for 10x lower RAM usage, and automatic cloud fallback when on-device compute is insufficient. Cactus supports multimodal inference (LLMs, vision, speech) across iOS, Android, macOS, and embedded Linux, with SDKs for Swift, Kotlin, Python, Flutter, React Native, and Rust. Its benchmarks show 100 tokens/s decode on M4 Pro and 48 tokens/s on iPhone 17 Pro for a 1.2B model at INT4, a concrete example of what optimised edge inference looks like.

Edge Constraints

| Resource | Cloud GPU (H100) | Laptop (M4) | Phone (Snapdragon 8 Gen 3) | IoT (ESP32) |
|----------|------------------|-------------|----------------------------|-------------|
| RAM | 80 GB HBM3 | 16-36 GB unified | 8-12 GB LPDDR5 | 520 KB |
| Compute | 989 TFLOPS (FP16) | 38 TOPS (Neural Engine) | 45 TOPS (NPU) | 0.001 TOPS |
| Power | 700 W | 15-30 W | 5-10 W | 0.1 W |
| Storage | TB | 256 GB-2 TB | 128-512 GB | 4 MB |
  • The compute gap between a cloud GPU and a phone NPU is ~20x. Between a GPU and a microcontroller, it is ~1,000,000x. Different devices require different levels of compression and different model architectures.

The Model Compression Pipeline

  • For edge deployment, compression is not a single technique — it is a pipeline of complementary techniques applied in sequence:
Full model (FP32, 70B params)
    ↓ Knowledge distillation → smaller model (7B params)
    ↓ Structured pruning → remove redundant heads/layers (4B effective)
    ↓ Quantisation (INT4) → 4x smaller (2 GB)
    ↓ Compiler optimisation → fused kernels, optimised memory layout
    ↓ Runtime → on-device execution
  • Each step reduces size and latency. The order matters: distil first (reduce architecture), then prune (remove structure), then quantise (reduce precision), then compile (optimise for target hardware). Distilling after quantisation would try to compress an already-lossy model.

On-Device Runtimes

  • A runtime loads a model, allocates memory, and executes inference on the target hardware. Each platform has its preferred runtime:

  • ONNX Runtime: cross-platform (Windows, Linux, macOS, iOS, Android). Supports CPU, GPU (CUDA, DirectML, CoreML, NNAPI), and many accelerator backends. The most portable option. Models are exported to ONNX format from PyTorch/TensorFlow.

  • TensorFlow Lite (TFLite): Google's edge runtime. Optimised for ARM CPUs and Android NPUs. Tiny binary (~1 MB). Supports INT8 and float16. The standard for Android deployment.

  • Core ML: Apple's runtime for iOS/macOS. Automatically uses the Neural Engine, GPU, or CPU depending on model characteristics. Models are converted from PyTorch/TensorFlow using coremltools. Tight integration with Apple hardware (unified memory, Neural Engine).

  • ExecuTorch: Meta's runtime for on-device PyTorch. Designed for edge deployment with ahead-of-time compilation and operator-level delegation to hardware accelerators. Successor to PyTorch Mobile.

  • TensorRT: NVIDIA's runtime for GPU inference optimisation (chapter 15). Fuses layers, selects optimal kernels, and quantises automatically. 2-5x faster than PyTorch eager mode on NVIDIA GPUs.

  • llama.cpp: single-file C++ inference engine for LLMs. Supports GGUF quantisation (Q4, Q5, Q8), CPU (AVX/NEON), Metal (Apple GPU), CUDA, and Vulkan. The go-to for running LLMs on consumer hardware.

The Compiler Stack

  • Between the high-level model (PyTorch graph) and the hardware (NPU instructions) sits the compiler stack, which optimises the model for the specific target:
PyTorch model
    ↓ Export (torch.export, ONNX, TorchScript)
Graph IR (intermediate representation)
    ↓ Graph optimisations
        - Constant folding (compute constant expressions at compile time)
        - Dead code elimination (remove unused operations)
        - Operator fusion (conv + bn + relu → single fused op)
        - Layout transformation (NCHW → NHWC for ARM, channels-last)
    ↓ Lowering
Hardware-specific IR
    ↓ Backend optimisations
        - Tiling and loop ordering (cache-friendly access patterns)
        - Vectorisation (SIMD, chapter 16)
        - Memory planning (reuse buffers to minimise peak memory)
        - Kernel selection (choose the best implementation per op)
    ↓ Code generation
Machine code / NPU instructions
  • Operator fusion is the most impactful optimisation. A transformer block has ~20 operations (matmul, add, layernorm, softmax, etc.). Without fusion, each writes its output to memory and the next reads it back. With fusion, multiple operations are combined into a single kernel that keeps data in registers/cache. This can be 2-5x faster (chapter 16, roofline model).
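As a rough sketch of why fusion pays off, the toy traffic count below compares bytes moved with and without fusion for a layernorm → matmul → bias-add → relu chain. The shapes are hypothetical, everything is FP32, and it makes the simplifying assumption that every unfused op round-trips its tensors through DRAM (small operands like the bias vector are ignored):

```python
def traffic_bytes(*dims, dtype_bytes=4):
    """Bytes moved for one read or write of a tensor with the given shape."""
    n = 1
    for d in dims:
        n *= d
    return n * dtype_bytes

seq, d = 256, 1024          # toy activation (seq, d) and weight (d, d) shapes
x = traffic_bytes(seq, d)   # one activation-sized tensor
w = traffic_bytes(d, d)     # the weight matrix

# Unfused: every op reads its inputs from memory and writes its output back.
unfused = (
    2 * x          # layernorm: read x, write normalised x
    + x + w + x    # matmul: read normalised x and W, write y
    + 2 * x        # bias add: read y, write y + b
    + 2 * x        # relu: read, write
)

# Fused: read x and W once, write the final activation once;
# intermediates stay in registers/cache.
fused = x + w + x

print(f"unfused: {unfused/1e6:.1f} MB, fused: {fused/1e6:.1f} MB, "
      f"saving: {unfused/fused:.1f}x")
```

The saving grows as the chain gets longer and the elementwise ops outnumber the weight reads, which is why long fused epilogues are common in compiled kernels.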

  • Memory planning: the compiler analyses the model graph to determine which tensors overlap in lifetime and can share the same memory buffer. A model with 100 intermediate tensors might only need memory for 10, because most are consumed and freed before others are created. This is critical on devices with limited RAM.
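A minimal sketch of lifetime-based buffer sharing, on a toy graph with made-up tensor lifetimes. It assumes all tensors are the same size; real compilers also account for sizes, alignment, and in-place ops:

```python
def plan_buffers(tensors):
    """Greedy lifetime-based buffer sharing. tensors is a list of
    (name, first_op, last_op); two tensors may share a buffer when
    their lifetimes do not overlap."""
    assignment = {}
    buffer_free_at = []  # buffer id -> op index after which it is free
    for name, start, end in sorted(tensors, key=lambda t: t[1]):
        for buf, free_at in enumerate(buffer_free_at):
            if free_at <= start:            # previous occupant already consumed
                assignment[name] = buf
                buffer_free_at[buf] = end
                break
        else:                               # no reusable buffer: allocate a new one
            assignment[name] = len(buffer_free_at)
            buffer_free_at.append(end)
    return assignment

# Toy graph: six intermediates, each produced at one op and last read at another
graph = [("a", 0, 2), ("b", 1, 3), ("c", 2, 4), ("d", 3, 5), ("e", 4, 6), ("f", 5, 7)]
plan = plan_buffers(graph)
print(plan, "-> buffers needed:", len(set(plan.values())))
```

Here six intermediate tensors fit in two buffers, because each tensor is dead before the one-after-next is created, which is exactly the pattern in a sequential model graph.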

Hardware Targets

Mobile GPUs

  • Qualcomm Adreno (Android): supports OpenCL, Vulkan compute (chapter 16), and Qualcomm's proprietary SNPE (Snapdragon Neural Processing Engine). Adreno GPUs have 256-1024 ALUs with FP16 and INT8 support.

  • ARM Mali (Android): supports OpenCL and Vulkan. Mali GPUs use a tile-based architecture (different from desktop GPUs), which affects optimal memory access patterns.

  • Apple GPU (iOS/macOS): accessed via Metal (Apple's GPU API). Unified memory architecture means no CPU↔GPU copy overhead. Metal Performance Shaders (MPS) provide optimised ML primitives.

Neural Processing Units (NPUs)

  • NPUs are fixed-function accelerators designed specifically for ML inference. They are far more power-efficient than GPUs for standard ML operations (matmul, conv, activation).

  • Apple Neural Engine: 16 cores, ~38 TOPS (INT8). Accessed via Core ML. Excellent for vision models and on-device diffusion. Cannot run arbitrary code — only operations supported by Core ML.

  • Qualcomm Hexagon NPU: integrated into Snapdragon SoCs. Supports INT8 and INT4 inference. Accessed via SNPE or ONNX Runtime with QNN backend. Powers on-device features like background blur, speech recognition, and real-time translation.

  • Google Edge TPU: a small, low-power version of the cloud TPU. 4 TOPS at 2 W. Used in Coral devices for on-device inference. Supports only INT8 quantised TFLite models.

  • The delegation pattern: the runtime splits the model graph between the NPU (for supported operations) and the CPU (for unsupported ones). Maximising the fraction that runs on the NPU is key to performance and power efficiency.
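A sketch of the delegation idea, using a made-up supported-op set and op names. Real delegates (e.g. in TFLite or ONNX Runtime) partition graph nodes rather than a flat op list, and also weigh transfer costs, but the segmentation logic looks like this:

```python
NPU_SUPPORTED = {"conv2d", "matmul", "relu", "add"}  # illustrative op set

def partition(ops):
    """Split an op sequence into contiguous NPU/CPU segments. Every segment
    boundary implies a tensor hand-off between devices, so fewer segments
    (and a larger NPU fraction) is better."""
    segments = []
    for op in ops:
        target = "NPU" if op in NPU_SUPPORTED else "CPU"
        if segments and segments[-1][0] == target:
            segments[-1][1].append(op)
        else:
            segments.append((target, [op]))
    return segments

model = ["conv2d", "relu", "custom_nms", "matmul", "add", "topk"]
for target, ops in partition(model):
    print(f"{target}: {ops}")
```

A single unsupported op in the middle of the graph forces two extra hand-offs, which is why exporters try to rewrite exotic ops into NPU-supported primitives.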

On-Device LLMs

  • Running LLMs on phones and laptops has become feasible with small models and aggressive quantisation:
| Model | Params | Quantised Size | Target Device | Performance |
|-------|--------|----------------|---------------|-------------|
| Phi-3 Mini | 3.8B | ~2 GB (Q4) | Phone/Laptop | ~15 tokens/s on iPhone 15 |
| Gemma 2B | 2B | ~1.5 GB (Q4) | Phone | ~20 tokens/s on Pixel 8 |
| Llama 3.2 1B | 1B | ~700 MB (Q4) | Phone | ~30 tokens/s |
| Llama 3.2 3B | 3B | ~2 GB (Q4) | Phone/Laptop | ~15 tokens/s |
| Llama 3.1 8B | 8B | ~4.5 GB (Q4) | Laptop | ~20 tokens/s on M2 |
  • Challenges:

    • Memory: a 3B Q4 model fits in 2 GB, but the KV-cache for long conversations adds significantly. Context length is typically limited to 2-4K tokens on phones.
    • Thermal throttling: sustained inference heats the phone. After 30 seconds of continuous generation, the SoC throttles clock speeds to prevent overheating, reducing performance by 30-50%.
    • Battery: running a 3B model at 15 tokens/s consumes ~3-5W. A 30-minute conversation drains ~5% of a typical phone battery. Acceptable for occasional use, problematic for always-on applications.
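The KV-cache growth behind the memory challenge can be estimated directly. The sketch below assumes a hypothetical Llama-3.2-3B-like configuration (28 layers, 8 KV heads via grouped-query attention, head dim 128) with an FP16 cache:

```python
def kv_cache_mb(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Two tensors (K and V) per layer, each (context_len, n_kv_heads * head_dim)."""
    return 2 * n_layers * context_len * n_kv_heads * head_dim * bytes_per_elem / 1e6

# Assumed 3B-class config: 28 layers, 8 KV heads (GQA), head dim 128
for ctx in (2048, 4096, 8192):
    print(f"{ctx:5d}-token context: {kv_cache_mb(28, 8, 128, ctx):.0f} MB of FP16 KV-cache")
```

At 8K tokens the cache approaches 1 GB, roughly half the size of the Q4 weights themselves, which is why phone deployments cap context length (and often quantise the KV-cache too).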
  • llama.cpp is the standard for on-device LLMs. It runs on CPU (AVX2, NEON, I8MM), Apple GPU (Metal), NVIDIA GPU (CUDA), AMD GPU (ROCm/Vulkan), and even phones (via Termux on Android).

Federated Learning

  • Federated learning trains models across many devices without centralising the data. Each device trains on its local data, computes a gradient update, and sends only the update (not the data) to a central server that aggregates the updates.

  • The algorithm (FedAvg):

    1. Server sends the current model to \(K\) selected devices.
    2. Each device fine-tunes the model on its local data for a few steps.
    3. Each device sends its updated model (or the difference) back to the server.
    4. Server averages the updates: \(W_{\text{new}} = \frac{1}{K} \sum_{k=1}^{K} W_k\).
    5. Repeat.
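The steps above can be made concrete with a toy FedAvg simulation: a linear least-squares model trained by local SGD on synthetic data, with all "devices" standing in as tuples in one process:

```python
import numpy as np

rng = np.random.default_rng(0)

def fedavg_round(global_w, clients, lr=0.1, local_steps=5):
    """One FedAvg round: each client runs local SGD on its private (X, y),
    and only the updated weights travel back, never the data."""
    updated = []
    for X, y in clients:
        w = global_w.copy()
        for _ in range(local_steps):
            grad = 2 * X.T @ (X @ w - y) / len(y)  # d/dw of mean (Xw - y)^2
            w -= lr * grad
        updated.append(w)
    return np.mean(updated, axis=0)  # server averages the K client models

# Three clients, each holding noisy samples of the same true model w* = [2, -1]
w_true = np.array([2.0, -1.0])
clients = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    clients.append((X, X @ w_true + 0.01 * rng.normal(size=50)))

w = np.zeros(2)
for _ in range(20):
    w = fedavg_round(w, clients)
print("learned weights:", w.round(2))
```

After 20 rounds the averaged model recovers w* closely, even though no client's data ever left its tuple. Real deployments add client sampling, weighting by dataset size, and differential-privacy noise on top of this loop.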
  • Privacy: raw data never leaves the device. The server only sees aggregated model updates. Differential privacy adds noise to the updates so that individual data points cannot be reverse-engineered from the gradient.

  • Communication efficiency: model updates are large (same size as the model). Compression techniques reduce this: gradient quantisation (send INT8 gradients instead of FP32), sparsification (send only the largest gradients), and gradient accumulation (do more local steps, send less often).
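Top-k sparsification can be sketched in a few lines (toy gradient; production systems also accumulate the dropped residual locally so nothing is permanently lost):

```python
import numpy as np

def topk_sparsify(grad, frac=0.01):
    """Keep the largest-magnitude frac of entries; transmit (index, value)
    pairs instead of the dense gradient."""
    flat = grad.ravel()
    k = max(1, int(flat.size * frac))
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    return idx, flat[idx]

grad = np.random.default_rng(1).normal(size=10_000)
idx, vals = topk_sparsify(grad, frac=0.01)
dense_bytes = grad.nbytes                # float64 here; FP32 on real devices
sparse_bytes = idx.nbytes + vals.nbytes  # one index + one value per kept entry
print(f"dense: {dense_bytes} B, sparse: {sparse_bytes} B, "
      f"{dense_bytes / sparse_bytes:.0f}x smaller")
```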

  • Applications: Google's keyboard predictions (Gboard), Apple's voice recognition, health monitoring (train on sensitive health data without centralising it).

Latency Optimisation

  • Beyond compression, several techniques reduce end-to-end inference latency:

  • Early exit: add classification heads at intermediate layers. If the model is confident at layer 6 (out of 24), return the prediction without running layers 7-24. Easy inputs exit early, hard inputs use the full model. Average latency drops significantly for tasks with a mix of easy and hard inputs.
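The average-latency arithmetic is simple. With hypothetical numbers (a 24-layer model at 2 ms per layer, 60% of inputs exiting at layer 6):

```python
def early_exit_latency(layer_ms, exit_layer, exit_frac, total_layers):
    """Average per-input latency when exit_frac of inputs stop at the
    early-exit head and the rest run the full depth."""
    return (exit_frac * exit_layer + (1 - exit_frac) * total_layers) * layer_ms

full_ms = 24 * 2.0
avg_ms = early_exit_latency(2.0, exit_layer=6, exit_frac=0.6, total_layers=24)
print(f"full model: {full_ms:.0f} ms, with early exit: {avg_ms:.1f} ms")
```

This ignores the (small) cost of the intermediate classification head itself; the saving depends entirely on what fraction of real traffic is "easy".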

  • Model partitioning: split the model between the NPU (efficient for matmul), GPU (efficient for irregular operations), and CPU (handles everything else). The compiler decides which operations go where based on profiling.

  • Caching: for applications with repeated queries (autocomplete, code completion), cache recent computations. If the user types "How do I" and the model has recently generated completions for "How do I," the cached KV-cache can be reused, skipping the prefill phase entirely.
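A minimal sketch of the prefix-matching step (word strings stand in for token IDs; a real system would key stored KV-cache blocks by these token prefixes):

```python
def longest_cached_prefix(prompt, cached_prompts):
    """Length of the longest cached token prefix of the prompt; prefill
    then only needs to run on the remaining suffix."""
    best = 0
    for cached in cached_prompts:
        n = 0
        for a, b in zip(prompt, cached):
            if a != b:
                break
            n += 1
        best = max(best, n)
    return best

cache = [["How", "do", "I", "sort", "a", "list"], ["What", "is", "Rust"]]
prompt = ["How", "do", "I", "reverse", "a", "string"]
reused = longest_cached_prefix(prompt, cache)
print(f"reuse KV-cache for {reused} tokens, prefill the remaining {len(prompt) - reused}")
```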

  • Speculative prefetching: predict what the user will do next and start inference before they ask. A chat app might start generating a response to the likely follow-up question while the user is reading the current answer.

Coding Tasks (use Colab or a notebook)

  1. Simulate the model compression pipeline. Start with a float32 model, apply distillation (mock), pruning, and quantisation, and track the size at each step.

    def compression_pipeline(original_params_M, original_bits=32):
        def size_mb(params_M, bits):
            # params_M * 1e6 params, bits/8 bytes each → params_M * bits/8 MB
            return params_M * bits / 8

        original_mb = size_mb(original_params_M, original_bits)
        print(f"Original: {original_params_M}M params, {original_bits}-bit → {original_mb:.0f} MB")

        # Step 1: Knowledge distillation (keep ~15% of params: 70B → ~10B)
        distilled_params = original_params_M * 0.15
        print(f"After distillation ({distilled_params:.0f}M params): {size_mb(distilled_params, original_bits):.0f} MB")

        # Step 2: Structured pruning (remove 30% of remaining params)
        pruned_params = distilled_params * 0.7
        print(f"After pruning ({pruned_params:.0f}M params): {size_mb(pruned_params, original_bits):.0f} MB")

        # Step 3: INT4 quantisation
        final_mb = size_mb(pruned_params, 4)
        print(f"After INT4 quantisation: {final_mb:.0f} MB")

        print(f"Total compression: {original_mb / final_mb:.0f}x")

    print("=== Starting from 70B model ===")
    compression_pipeline(70000)

    print("\n=== Starting from 7B model ===")
    compression_pipeline(7000)
    

  2. Estimate on-device inference latency. Given a model's operations count and hardware specs, compute whether it meets a latency target.

    def estimate_latency(model_name, params_M, bits, compute_tops, mem_bw_gbs):
        """Estimate per-token decode latency. Decoding a single token streams
        the full weights, so memory bandwidth, not compute_tops, is the
        limiting factor; compute_tops is accepted only for context."""
        # Model size in bytes
        model_bytes = params_M * 1e6 * bits / 8
    
        # Decode is memory-bound: must load entire model per token
        time_per_token_ms = model_bytes / (mem_bw_gbs * 1e9) * 1000
    
        # Tokens per second
        tokens_per_sec = 1000 / time_per_token_ms
    
        print(f"{model_name}: {params_M/1000:.1f}B params @ {bits}-bit = {model_bytes/1e9:.1f} GB")
        print(f"  Memory bandwidth: {mem_bw_gbs} GB/s")
        print(f"  Time per token: {time_per_token_ms:.1f} ms")
        print(f"  Tokens/sec: {tokens_per_sec:.0f}")
        print()
    
    # Apple M2 Pro: 200 GB/s unified memory bandwidth
    print("=== Apple M2 Pro (200 GB/s) ===")
    estimate_latency("Llama-7B Q4", 7000, 4, 15.8, 200)
    estimate_latency("Llama-7B Q8", 7000, 8, 15.8, 200)
    estimate_latency("Llama-70B Q4", 70000, 4, 15.8, 200)
    
    # Phone (Snapdragon 8 Gen 3): ~50 GB/s LPDDR5
    print("=== Snapdragon 8 Gen 3 (50 GB/s) ===")
    estimate_latency("Phi-3 Mini Q4", 3800, 4, 45, 50)
    estimate_latency("Llama-3B Q4", 3000, 4, 45, 50)