ML Systems Design
- ML system lifecycle: problem framing → data → training → evaluation → deployment → monitoring
- Data management: feature stores, data versioning (DVC), labelling pipelines, data quality checks
- Training infrastructure: distributed training (data parallel, model parallel), experiment tracking (MLflow, Weights & Biases)
- Model evaluation: offline metrics; online evaluation via A/B testing, shadow deployment, and interleaving experiments
- Model serving: batch vs real-time inference, model registry, model versioning
- Feature engineering: online vs offline features, feature freshness, feature serving latency
- ML pipelines: orchestration (Airflow, Kubeflow, Metaflow), reproducibility
- Monitoring: data drift, concept drift, model degradation, alerting
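
The data drift item under Monitoring can be made concrete with a small check. What follows is a minimal sketch, not a production monitor: it assumes a single numeric feature, equal-width bins derived from the reference (training) sample, and the Population Stability Index (PSI) as the drift score. The function names `psi` and `drift_level`, the `EPS` floor, and the 0.1 / 0.25 thresholds (a common rule of thumb, not a standard) are all illustrative assumptions.

```python
import math
from typing import List, Sequence

EPS = 1e-6  # assumed floor for empty bins so the logarithm stays defined


def psi(reference: Sequence[float], current: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between two samples of one numeric feature."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")

    lo, hi = min(reference), max(reference)
    # Equal-width bins over the reference range; fall back to 1.0 if it is constant.
    width = (hi - lo) / bins or 1.0

    def bucket_shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values into edge bins
            counts[idx] += 1
        return [max(c / len(values), EPS) for c in counts]

    ref = bucket_shares(reference)
    cur = bucket_shares(current)
    # PSI = sum over bins of (current share - reference share) * ln(current / reference)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def drift_level(score: float, warn: float = 0.1, alert: float = 0.25) -> str:
    """Map a PSI score to an alerting level using rule-of-thumb thresholds (assumed)."""
    if score >= alert:
        return "alert"
    if score >= warn:
        return "warn"
    return "ok"


if __name__ == "__main__":
    training = [i / 100 for i in range(1000)]   # synthetic reference sample
    serving = [x + 3.0 for x in training]       # same shape, shifted by 3.0
    print(drift_level(psi(training, training)))
    print(drift_level(psi(training, serving)))
```

A score like this only detects a change in the input distribution (data drift). Concept drift, where the relationship between inputs and labels changes, generally needs labelled outcomes and a comparison of live model metrics against the offline baseline.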