Edge Inference

  • Edge constraints: limited memory, power budget, no network dependency
  • Model compression pipeline: pruning → quantisation → compilation
  • On-device runtimes: TensorFlow Lite, ONNX Runtime, Core ML, TensorRT, ExecuTorch
  • Compiler stack: graph optimisation, operator fusion, memory planning, tiling
  • Hardware targets: mobile GPUs (Adreno, Mali), NPUs (Qualcomm Hexagon, Apple Neural Engine, Google Edge TPU)
  • On-device LLMs: Phi, Gemma, and Llama variants at the 1–3B parameter scale, served with 4-bit inference
  • Federated learning: on-device training, privacy-preserving aggregation, communication efficiency
  • Latency optimisation: model partitioning, early exit, caching strategies
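The compression pipeline above (pruning → quantisation) can be sketched in a few lines. This is a minimal illustration, not a production pass: it does unstructured magnitude pruning followed by symmetric per-tensor int8 quantisation, whereas real toolchains typically use structured sparsity, per-channel scales, and calibration data. All function names here are illustrative.

```python
import numpy as np

def prune_magnitude(w, sparsity=0.5):
    """Zero out the smallest-magnitude weights (unstructured pruning)."""
    k = int(w.size * sparsity)
    thresh = np.sort(np.abs(w).ravel())[k]
    return np.where(np.abs(w) >= thresh, w, 0.0)

def quantise_int8(w):
    """Symmetric per-tensor int8 quantisation: w ≈ scale * q."""
    scale = max(np.abs(w).max() / 127.0, 1e-12)  # guard against an all-zero tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.array([0.5, -1.27, 0.01, 1.0], dtype=np.float32)
w_pruned = prune_magnitude(w, sparsity=0.5)      # drops the two smallest weights
q, scale = quantise_int8(w_pruned)
w_hat = q.astype(np.float32) * scale             # dequantised approximation of w_pruned
```

Compilation (the third pipeline stage) then lowers the quantised graph to a hardware-specific kernel schedule, which is out of scope for a sketch this small.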
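Operator fusion, listed under the compiler stack, can be shown concretely with one classic pass: folding a BatchNorm layer into the preceding linear/conv weights so a single fused op replaces two. The arithmetic below is standard; the function name is illustrative.

```python
import numpy as np

def fold_batchnorm(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm(gamma, beta, mean, var) into the preceding linear layer.

    BN(Wx + b) = gamma * (Wx + b - mean) / sqrt(var + eps) + beta
               = (s * W) x + (s * (b - mean) + beta),  where s = gamma / sqrt(var + eps)
    """
    s = gamma / np.sqrt(var + eps)
    return W * s[:, None], (b - mean) * s + beta

# check the fused layer matches linear-then-BN on a random input
rng = np.random.default_rng(1)
W = rng.standard_normal((3, 4)); b = rng.standard_normal(3)
gamma = rng.standard_normal(3); beta = rng.standard_normal(3)
mean = rng.standard_normal(3); var = rng.random(3) + 0.1
x = rng.standard_normal(4)

y_unfused = gamma * (W @ x + b - mean) / np.sqrt(var + 1e-5) + beta
W_f, b_f = fold_batchnorm(W, b, gamma, beta, mean, var)
y_fused = W_f @ x + b_f
```

At inference time the BN statistics are frozen, so this fold is exact; it removes one elementwise op per layer and the memory traffic that goes with it.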
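The privacy-preserving aggregation in the federated learning bullet is, at its simplest, FedAvg: clients train locally and the server averages their parameters weighted by local dataset size. A minimal sketch of one aggregation round, with illustrative names and no transport or secure-aggregation layer:

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """One FedAvg round: average per-layer parameters, weighted by dataset size."""
    total = sum(client_sizes)
    n_layers = len(client_weights[0])
    return [
        sum((n / total) * cw[i] for cw, n in zip(client_weights, client_sizes))
        for i in range(n_layers)
    ]

# two clients with a single-layer "model"; client B holds 3x the data
client_a = [np.array([1.0, 1.0])]
client_b = [np.array([3.0, 3.0])]
global_model = fedavg([client_a, client_b], client_sizes=[100, 300])
```

The communication-efficiency techniques the bullet alludes to (gradient compression, fewer rounds with more local epochs) all wrap around this same weighted average.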
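Early exit, from the latency-optimisation bullet, attaches a classifier head after intermediate stages and stops as soon as one head is confident enough, trading a little accuracy for latency on easy inputs. A toy sketch with callable stand-ins for stages and heads (all names and the threshold are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def early_exit_forward(x, stages, heads, threshold=0.9):
    """Run stages in order; return at the first head whose top prob clears the threshold."""
    for i, (stage, head) in enumerate(zip(stages, heads)):
        x = stage(x)
        p = softmax(head(x))
        if p.max() >= threshold:
            return p, i                 # exited early at stage i
    return p, len(stages) - 1           # fell through to the final head

# toy network: identity stages, heads with different confidence
stages = [lambda x: x, lambda x: x]
heads = [lambda x: np.array([0.0, 0.0]),    # uniform -> not confident
         lambda x: np.array([10.0, 0.0])]   # sharply peaked -> confident
probs, exit_idx = early_exit_forward(np.zeros(2), stages, heads, threshold=0.9)
```

Lowering the threshold shifts more traffic to earlier exits; model partitioning applies the same staged structure across devices (e.g. NPU prefix, cloud suffix) rather than across classifier heads.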