ML Systems Design

  • ML system lifecycle: problem framing → data → training → evaluation → deployment → monitoring
  • Data management: feature stores, data versioning (DVC), labelling pipelines, data quality checks
  • Training infrastructure: distributed training (data parallel, model parallel), experiment tracking (MLflow, W&B)
  • Model evaluation: offline metrics, A/B testing, shadow deployment, interleaving experiments
  • Model serving: batch vs real-time inference, model registry, model versioning
  • Feature engineering: online vs offline features, feature freshness, feature serving latency
  • ML pipelines: orchestration (Airflow, Kubeflow, Metaflow), reproducibility
  • Monitoring: data drift, concept drift, model degradation, alerting
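
As a rough illustration of the data drift item under Monitoring, the sketch below computes a Population Stability Index (PSI) between a reference sample (for example, training data) and a current sample (for example, recent serving traffic) of one numeric feature. PSI is only one of several possible drift measures. The function name `psi`, the equal-width binning, the `eps` floor, and the 0.25 alert threshold (a commonly quoted rule of thumb, not a fixed standard) are all assumptions made for this example.

```python
import math
from typing import List, Sequence


def psi(reference: Sequence[float], current: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index for one numeric feature (hypothetical helper)."""
    lo, hi = min(reference), max(reference)
    # Equal-width bins over the reference range; fall back to 1.0 if the
    # reference sample is constant so we never divide by zero.
    width = (hi - lo) / bins or 1.0

    def fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values into edge bins
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm below stays defined.
        return [max(c / len(values), eps) for c in counts]

    ref = fractions(reference)
    cur = fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


if __name__ == "__main__":
    train = [i / 100 for i in range(100)]   # made-up reference sample
    live = [x + 0.5 for x in train]         # made-up shifted sample
    score = psi(train, live)
    # 0.25 is an assumed alert threshold; tune it per feature in practice.
    print(f"PSI={score:.3f}", "ALERT" if score > 0.25 else "ok")
```

A check like this covers data drift only (the input distribution changing). Concept drift and model degradation generally need labels or proxy outcomes, so they are usually tracked through delayed evaluation metrics.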