Edge Inference
- Edge constraints: limited memory, power budget, no network dependency
- Model compression pipeline: pruning → quantisation → compilation
- On-device runtimes: TensorFlow Lite, ONNX Runtime, Core ML, TensorRT, ExecuTorch
- Compiler stack: graph optimisation, operator fusion, memory planning, tiling
- Hardware targets: mobile GPUs (Adreno, Mali), NPUs (Qualcomm Hexagon, Apple Neural Engine, Google Edge TPU)
- On-device LLMs: Phi, Gemma, Llama at the 1–3B parameter scale, typically served with 4-bit quantised inference
- Federated learning: on-device training, privacy-preserving aggregation, communication efficiency
- Latency optimisation: model partitioning, early exit, caching strategies
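The pruning and quantisation stages of the compression pipeline above can be sketched in miniature. This is a toy pure-Python illustration, not any runtime's API: the helpers `prune`, `quantize_int8`, and `dequantize` are hypothetical names, showing magnitude pruning followed by symmetric per-tensor int8 quantisation.

```python
def prune(weights, sparsity):
    """Magnitude pruning: zero out the smallest-magnitude weights."""
    k = int(len(weights) * sparsity)
    threshold = sorted(abs(w) for w in weights)[k] if k else 0.0
    return [0.0 if abs(w) < threshold else w for w in weights]

def quantize_int8(weights):
    """Symmetric per-tensor int8 quantisation: w ≈ scale * q, q in [-127, 127]."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 127
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 codes."""
    return [scale * v for v in q]

w = [0.02, -0.81, 0.44, -0.03, 0.97, 0.11, -0.56, 0.005]
pruned = prune(w, 0.5)        # half of these weights become zero
q, s = quantize_int8(pruned)  # int8 codes plus a single fp32 scale
restored = dequantize(q, s)   # within one quantisation step of `pruned`
```

Real pipelines prune structured groups (channels, heads) and calibrate scales per channel, but the storage win is the same idea: one float scale plus small-integer codes instead of full-precision weights.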
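Operator fusion from the compiler-stack bullet can be illustrated as a pattern-matching pass. This is a deliberately simplified sketch over a flat op list (the op names and the `fuse` helper are invented for illustration; real compilers fuse over a dataflow graph, not a sequence), collapsing a matmul → bias-add → ReLU chain into one kernel so the intermediates never round-trip through memory.

```python
def fuse(ops):
    """Toy fusion pass: collapse adjacent (matmul, add, relu) triples
    into a single fused op, eliminating two intermediate tensors."""
    fused, i = [], 0
    while i < len(ops):
        if ops[i:i + 3] == ["matmul", "add", "relu"]:
            fused.append("fused_matmul_bias_relu")
            i += 3
        else:
            fused.append(ops[i])
            i += 1
    return fused

graph = ["matmul", "add", "relu", "matmul", "add", "softmax"]
optimised = fuse(graph)  # first triple fuses; the trailing pair does not
```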
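The aggregation step in federated learning is commonly federated averaging. A minimal sketch follows, assuming a hypothetical `fedavg` helper and plain parameter lists; it shows only the size-weighted average that the server computes, not the secure-aggregation or compression machinery that makes it privacy-preserving and communication-efficient in practice.

```python
def fedavg(client_weights, client_sizes):
    """FedAvg: average client parameter vectors, weighted by each
    client's local dataset size."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]

# Two hypothetical clients; the second holds 3x the data, so the
# global model lands closer to its parameters.
clients = [[1.0, 2.0], [3.0, 4.0]]
sizes = [10, 30]
global_w = fedavg(clients, sizes)
```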
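Early exit, one of the latency strategies listed above, can be sketched as a cascade that stops at the first classifier head whose confidence clears a threshold. Everything here is hypothetical toy code: the two lambdas stand in for a cheap early head and an expensive deeper head, and the logits are chosen so the example exits at the second stage.

```python
import math

def softmax(logits):
    exps = [math.exp(x - max(logits)) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def early_exit(stages, x, threshold=0.9):
    """Run stages in order; return (predicted class, exit depth) at the
    first stage whose top-class probability clears the threshold."""
    for depth, stage in enumerate(stages, start=1):
        probs = softmax(stage(x))
        conf = max(probs)
        if conf >= threshold:
            return probs.index(conf), depth
    return probs.index(conf), depth  # fall through to the last stage

cheap = lambda x: [1.0, 1.2]  # ~55% top-class confidence: keep going
deep  = lambda x: [0.1, 4.0]  # ~98% top-class confidence: exit here
label, exits_at = early_exit([cheap, deep], x=None)
```

Easy inputs exit at a shallow head and pay only its cost; the full network runs only for inputs the cheap head is unsure about, which is what makes the average latency drop.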