Edge Inference

  • Edge constraints: limited memory, power budget, no network dependency
  • Model compression pipeline: pruning → quantisation → compilation
  • On-device runtimes: TensorFlow Lite, ONNX Runtime, Core ML, TensorRT, ExecuTorch
  • Compiler stack: graph optimisation, operator fusion, memory planning, tiling
  • Hardware targets: mobile GPUs (Adreno, Mali), NPUs (Qualcomm Hexagon, Apple Neural Engine, Google Edge TPU)
  • On-device LLMs: Phi, Gemma, and Llama variants at the 1–3B parameter scale, served with 4-bit inference
  • Federated learning: on-device training, privacy-preserving aggregation, communication efficiency
  • Latency optimisation: model partitioning, early exit, caching strategies
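The compression pipeline above (pruning → quantisation) can be sketched in a few lines. This is a minimal illustration, not a production pass: it does unstructured magnitude pruning followed by symmetric per-tensor int8 quantisation, whereas real toolchains typically use structured sparsity, per-channel scales, and calibration data. All function names here are illustrative.

```python
import numpy as np

def prune_magnitude(w, sparsity=0.5):
    """Zero out the smallest-magnitude weights (unstructured pruning)."""
    k = int(w.size * sparsity)
    thresh = np.sort(np.abs(w).ravel())[k]
    return np.where(np.abs(w) >= thresh, w, 0.0)

def quantise_int8(w):
    """Symmetric per-tensor int8 quantisation: w ≈ scale * q."""
    scale = max(np.abs(w).max() / 127.0, 1e-12)  # guard against an all-zero tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.array([0.5, -1.27, 0.01, 1.0], dtype=np.float32)
w_pruned = prune_magnitude(w, sparsity=0.5)      # drops the two smallest weights
q, scale = quantise_int8(w_pruned)
w_hat = q.astype(np.float32) * scale             # dequantised approximation of w_pruned
```

Compilation (the third pipeline stage) then lowers the quantised graph to a hardware-specific kernel schedule, which is out of scope for a sketch this small.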
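Operator fusion, listed under the compiler stack, can be shown concretely with one classic pass: folding a BatchNorm layer into the preceding linear/conv weights so a single fused op replaces two. The arithmetic below is standard; the function name is illustrative.

```python
import numpy as np

def fold_batchnorm(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm(gamma, beta, mean, var) into the preceding linear layer.

    BN(Wx + b) = gamma * (Wx + b - mean) / sqrt(var + eps) + beta
               = (s * W) x + (s * (b - mean) + beta),  where s = gamma / sqrt(var + eps)
    """
    s = gamma / np.sqrt(var + eps)
    return W * s[:, None], (b - mean) * s + beta

# check the fused layer matches linear-then-BN on a random input
rng = np.random.default_rng(1)
W = rng.standard_normal((3, 4)); b = rng.standard_normal(3)
gamma = rng.standard_normal(3); beta = rng.standard_normal(3)
mean = rng.standard_normal(3); var = rng.random(3) + 0.1
x = rng.standard_normal(4)

y_unfused = gamma * (W @ x + b - mean) / np.sqrt(var + 1e-5) + beta
W_f, b_f = fold_batchnorm(W, b, gamma, beta, mean, var)
y_fused = W_f @ x + b_f
```

At inference time the BN statistics are frozen, so this fold is exact; it removes one elementwise op per layer and the memory traffic that goes with it.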
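The privacy-preserving aggregation in the federated learning bullet is, at its simplest, FedAvg: clients train locally and the server averages their parameters weighted by local dataset size. A minimal sketch of one aggregation round, with illustrative names and no transport or secure-aggregation layer:

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """One FedAvg round: average per-layer parameters, weighted by dataset size."""
    total = sum(client_sizes)
    n_layers = len(client_weights[0])
    return [
        sum((n / total) * cw[i] for cw, n in zip(client_weights, client_sizes))
        for i in range(n_layers)
    ]

# two clients with a single-layer "model"; client B holds 3x the data
client_a = [np.array([1.0, 1.0])]
client_b = [np.array([3.0, 3.0])]
global_model = fedavg([client_a, client_b], client_sizes=[100, 300])
```

The communication-efficiency techniques the bullet alludes to (gradient compression, fewer rounds with more local epochs) all wrap around this same weighted average.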
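Early exit, from the latency-optimisation bullet, attaches a classifier head after intermediate stages and stops as soon as one head is confident enough, trading a little accuracy for latency on easy inputs. A toy sketch with callable stand-ins for stages and heads (all names and the threshold are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def early_exit_forward(x, stages, heads, threshold=0.9):
    """Run stages in order; return at the first head whose top prob clears the threshold."""
    for i, (stage, head) in enumerate(zip(stages, heads)):
        x = stage(x)
        p = softmax(head(x))
        if p.max() >= threshold:
            return p, i                 # exited early at stage i
    return p, len(stages) - 1           # fell through to the final head

# toy network: identity stages, heads with different confidence
stages = [lambda x: x, lambda x: x]
heads = [lambda x: np.array([0.0, 0.0]),    # uniform -> not confident
         lambda x: np.array([10.0, 0.0])]   # sharply peaked -> confident
probs, exit_idx = early_exit_forward(np.zeros(2), stages, heads, threshold=0.9)
```

Lowering the threshold shifts more traffic to earlier exits; model partitioning applies the same staged structure across devices (e.g. NPU prefix, cloud suffix) rather than across classifier heads.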