Deployment and DevOps

Deployment is where your model stops being a research artifact and starts being a product. This file covers Docker for ML, model serving, experiment tracking, reproducibility, monitoring in production, feature stores, and pipeline orchestration: the infrastructure that takes a trained model from a notebook to millions of users.

  • A model that only runs on your laptop is a prototype. A model that runs reliably at scale, serves predictions in milliseconds, recovers from failures, and can be updated without downtime is a product. The gap between the two is deployment and DevOps.

  • Most ML engineers spend more time on deployment, monitoring, and debugging production issues than on training models. Understanding this infrastructure is not optional for anyone building real ML systems.

Docker for ML

  • We covered containers conceptually in chapter 13 (OS). Here we focus on the practical side: writing Dockerfiles for ML workloads.

  • A Dockerfile is a recipe for building a container image:

# Start from an official CUDA base image
FROM nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04

# System dependencies
RUN apt-get update && apt-get install -y \
    python3.11 python3-pip git \
    && rm -rf /var/lib/apt/lists/*

# Python dependencies (install separately for caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy source code (changes frequently, so this layer is last)
COPY src/ /app/src/
COPY configs/ /app/configs/
WORKDIR /app

# Entry point
CMD ["python3", "src/scripts/serve.py", "--config", "configs/serve.yaml"]
  • Layer caching: Docker caches each layer. If requirements.txt has not changed, pip install is skipped on rebuild. Put rarely-changing layers (system packages, pip install) before frequently-changing ones (source code). This turns a 10-minute build into a 10-second rebuild.

  • GPU access: use nvidia/cuda base images and run with docker run --gpus all. The nvidia-container-toolkit provides GPU passthrough from host to container.

  • Multi-stage builds reduce image size by separating the build environment from the runtime:

# Build stage: install build tools, compile dependencies
FROM python:3.11 AS builder
COPY requirements.txt .
RUN pip install --user -r requirements.txt

# Runtime stage: only runtime dependencies
FROM nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04
COPY --from=builder /root/.local /root/.local
COPY src/ /app/src/
ENV PATH=/root/.local/bin:$PATH
  • The final image contains only the runtime libraries, not compilers, headers, or build tools. A 5 GB build image becomes a 2 GB runtime image.

  • Docker Compose runs multi-container setups (model server + load balancer + monitoring):

# docker-compose.yml
services:
  model:
    build: .
    ports:
      - "8080:8080"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"

Model Serving

  • Model serving is running inference as a service: receive requests, run the model, return predictions.

  • FastAPI (covered in file 03) is the simplest approach for low-to-medium throughput. For high throughput and GPU-optimised serving, use dedicated tools:

  • Triton Inference Server (NVIDIA): serves models in TensorRT, ONNX, PyTorch, and TensorFlow formats. Features:

    • Dynamic batching: collects individual requests and batches them for GPU efficiency. A stream of single requests is batched into groups of 32, dramatically improving throughput.
    • Model ensembles: chain multiple models (preprocessor → model → postprocessor) in a single request.
    • Multi-model serving: serve multiple models on the same GPU, sharing resources.
    • Concurrent model execution: run multiple inference requests in parallel on the same GPU.
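The dynamic batching idea can be sketched in plain Python. This is a toy illustration of the grouping logic only, not Triton's implementation (which also waits up to a configurable queue delay before closing a batch):

```python
from collections import deque

def dynamic_batcher(requests, max_batch_size=32):
    """Toy dynamic batcher: groups a stream of single requests
    into batches of up to max_batch_size for GPU-efficient inference."""
    queue = deque(requests)
    batches = []
    while queue:
        # Take up to max_batch_size requests off the queue as one batch
        batch = [queue.popleft() for _ in range(min(max_batch_size, len(queue)))]
        batches.append(batch)
    return batches

# 100 single requests become 4 batches: 32 + 32 + 32 + 4
batches = dynamic_batcher(range(100))
```

Each batch is then a single GPU call instead of 32 separate ones, which is where the throughput win comes from.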
  • TorchServe (PyTorch): serves PyTorch models with a REST/gRPC API. Supports model versioning, A/B testing, and custom handlers.

  • vLLM: specialised for LLM serving. Implements PagedAttention (efficient KV cache management), continuous batching, and tensor parallelism across GPUs. Achieves 10-20x higher throughput than naive serving for large language models.

  • Cactus (github.com/cactus-compute/cactus): a low-latency AI engine for on-device serving on mobile and edge devices. Key features:

    • OpenAI-compatible API (chat completion, streaming, tool calling, transcription, embeddings, RAG, vision) that runs entirely on-device, with automatic cloud fallback when the local model cannot handle a request. This hybrid architecture means your application code uses the same API regardless of whether inference runs locally or in the cloud; the engine decides based on model confidence and device capability.
    • SDKs for Python, Swift, Kotlin, Flutter, React Native, and Rust, with pre-converted model weights on HuggingFace.
    • Multimodal inference (LLMs, vision, speech) with custom ARM SIMD kernels for fast inference on ARM CPUs and zero-copy memory mapping for roughly 10x lower RAM usage (chapter 16, chapter 17).

  • Model format optimisation:

    • ONNX: open format for interoperability. Export from PyTorch/TensorFlow, run anywhere.
    • TensorRT: NVIDIA's optimiser. Fuses layers, selects optimal kernels, quantises weights. Typically 2-5x faster than PyTorch on NVIDIA GPUs.
    • GGUF/GGML: formats for CPU-efficient inference, popular for running LLMs on consumer hardware.

Experiment Tracking

  • Without experiment tracking, ML research devolves into: "I think the model from last Tuesday with that config I changed something in was the best one, but I do not remember what I changed."

  • Weights & Biases (W&B): the most popular experiment tracker. Log anything from your training script:

import wandb

wandb.init(project="my-project", config={
    "model": "transformer",
    "lr": 3e-4,
    "batch_size": 64,
})

best_loss = float("inf")
for epoch in range(num_epochs):
    train_loss = train_one_epoch()
    val_loss = validate()

    wandb.log({
        "train/loss": train_loss,
        "val/loss": val_loss,
        "epoch": epoch,
    })

    # Save the best checkpoint so far
    if val_loss < best_loss:
        best_loss = val_loss
        wandb.save("best_model.pt")

wandb.finish()
  • W&B provides: dashboards for comparing runs, hyperparameter sweep tools, model registry, dataset versioning, and team collaboration.

  • MLflow: open-source alternative. Runs locally or on a server:

import mlflow

mlflow.set_experiment("my-experiment")

with mlflow.start_run():
    mlflow.log_params({"lr": 3e-4, "batch_size": 64})
    mlflow.log_metric("val_loss", 0.042, step=epoch)
    mlflow.pytorch.log_model(model, "model")
  • Model registry: a central store of trained models with versioning, staging (dev → staging → production), and metadata. Both W&B and MLflow provide registries. The registry answers: "what model is currently in production, who trained it, what was its validation accuracy, and what code/data produced it?"

Reproducibility

  • Reproducibility means: given the same code, data, and config, produce the same model. This is surprisingly hard in ML due to non-determinism in GPU operations, data shuffling, and floating-point accumulation.

  • The reproducibility checklist:

    • Code version: git commit hash.
    • Config / hyperparameters: config file (versioned in git or logged to W&B).
    • Random seeds: set and log all seeds (Python, NumPy, PyTorch, CUDA).
    • Data version: DVC hash, dataset version tag, or S3 object version.
    • Dependencies: pip freeze, Docker image hash, or lockfile.
    • Hardware: GPU type, number of GPUs, CUDA version.
    • Non-determinism: torch.backends.cudnn.deterministic = True (slower but reproducible).
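Seed-setting is usually wrapped in one helper called at the top of every training script. A minimal sketch, with the NumPy/PyTorch lines commented out so it stays dependency-free (note that PYTHONHASHSEED only affects hash randomisation if set before the interpreter starts):

```python
import random

def set_seed(seed: int) -> None:
    """Seed every source of randomness the training script uses."""
    random.seed(seed)
    # import numpy as np; np.random.seed(seed)
    # import torch; torch.manual_seed(seed)
    # torch.cuda.manual_seed_all(seed)
    # torch.backends.cudnn.deterministic = True  # slower but reproducible

set_seed(42)
a = [random.random() for _ in range(3)]
set_seed(42)
b = [random.random() for _ in range(3)]
assert a == b  # same seed, same sequence
```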
  • Pin everything: pip install torch==2.2.1, not torch>=2.0. A minor version bump can change numerical behaviour, optimiser implementations, or default hyperparameters.
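A pinned requirements.txt fragment looks like this (versions illustrative):

```text
torch==2.2.1
numpy==1.26.4
fastapi==0.110.0
```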

  • Docker for reproducibility: the Docker image pins the OS, system libraries, Python version, and pip packages. The image hash is a complete environment fingerprint. If you can reproduce the Docker image, you can reproduce the training.

Monitoring in Production

  • Deploying a model is not the end — it is the beginning of a new set of problems. Models degrade over time as the real world changes (concept drift) and as the input data distribution shifts (data drift).

  • What to monitor:

    • Latency: how long does inference take? Track p50 (median), p95, and p99. A p99 of 500ms means 1 in 100 requests takes at least half a second, which may be unacceptable.

    • Throughput: how many requests per second? Is the system keeping up with demand?

    • Error rate: what fraction of requests fail (exceptions, timeouts, invalid inputs)?

    • Model metrics: accuracy, precision, recall on a holdout set. If labelled data is available in production (e.g., user corrections), track online metrics.

    • Data drift: has the distribution of incoming data changed? A model trained on daytime photos may fail on night photos. Statistical tests (KS test, PSI) compare the training distribution to the live distribution.

    • Feature drift: have individual feature distributions changed? A feature that was normally distributed during training but is now bimodal signals a data pipeline issue.
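The latency percentiles and the PSI drift score above can be sketched in plain Python. This uses the nearest-rank percentile method; the PSI thresholds in the comment are the common rule of thumb, not a standard:

```python
import math

def latency_percentiles(samples_ms):
    """p50/p95/p99 via the nearest-rank method."""
    s = sorted(samples_ms)
    def pct(p):
        return s[min(len(s) - 1, math.ceil(p / 100 * len(s)) - 1)]
    return {"p50": pct(50), "p95": pct(95), "p99": pct(99)}

def psi(expected_fracs, actual_fracs, eps=1e-6):
    """Population Stability Index between two binned distributions
    (fraction of samples per bin, each summing to 1). Rule of thumb:
    < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift."""
    return sum(
        (a - e) * math.log((a + eps) / (e + eps))
        for e, a in zip(expected_fracs, actual_fracs)
    )
```

In practice the expected fractions come from binning the training data once, and the actual fractions from binning a rolling window of live traffic.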

  • Tools:

    • Prometheus + Grafana: the standard for infrastructure monitoring. Prometheus collects metrics, Grafana visualises them in dashboards with alerts.
    • Evidently AI: open-source ML monitoring. Generates reports on data drift, model performance, and data quality.
  • Alerts: do not just dashboard it — set up automated alerts. "If p99 latency exceeds 200ms for 5 minutes, send a Slack notification." "If data drift score exceeds threshold, page the on-call engineer."
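The latency alert in that sentence maps directly onto a Prometheus alerting rule. A sketch, assuming the server exports a latency histogram (the metric name inference_latency_seconds_bucket is illustrative):

```yaml
groups:
  - name: model-serving
    rules:
      - alert: HighInferenceLatency
        # p99 over the last 5 minutes, computed from histogram buckets
        expr: histogram_quantile(0.99, sum(rate(inference_latency_seconds_bucket[5m])) by (le)) > 0.2
        for: 5m   # must stay above threshold for 5 minutes before firing
        labels:
          severity: page
        annotations:
          summary: "p99 inference latency above 200ms for 5 minutes"
```

The for: clause is what prevents paging on a single spike.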

Feature Stores

  • A feature store is a centralised repository of precomputed features, shared between training and serving. It solves two problems:

    • Training-serving skew: the features used during training must be identical to those used during serving. If training uses user_age_at_signup computed one way and serving computes it differently, the model's predictions are silently wrong.

    • Feature reuse: multiple models often use the same features (user demographics, item embeddings, aggregated statistics). Computing them once and sharing avoids duplication and inconsistency.

  • Feast is the most popular open-source feature store. It manages online features (low-latency, served from Redis or DynamoDB) and offline features (batch, stored in data warehouses for training).

  • Feature stores are critical for recommendation systems, fraud detection, and any application where features are computed from raw data pipelines.
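The training-serving skew fix can be made concrete with a toy sketch (not Feast's API): one shared function defines the feature, and both the offline training pipeline and the online serving path call it. `user_age_at_signup` is the example feature from above.

```python
from datetime import date

def user_age_at_signup(birth: date, signup: date) -> int:
    """One shared definition, used by both training and serving,
    eliminates training-serving skew for this feature."""
    years = signup.year - birth.year
    # subtract one if the birthday had not yet occurred at signup
    if (signup.month, signup.day) < (birth.month, birth.day):
        years -= 1
    return years

# Offline: batch-compute for the training set
train_features = {"u1": user_age_at_signup(date(1990, 6, 15), date(2020, 3, 1))}

# Online: the serving path calls the exact same function
assert user_age_at_signup(date(1990, 6, 15), date(2020, 3, 1)) == train_features["u1"]
```

A feature store generalises this: the definition lives in one registry, the offline store materialises it for training, and the online store serves the same values at low latency.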

Pipeline Orchestration

  • A production ML system is not just a model. It is a pipeline: data ingestion → preprocessing → feature computation → training → evaluation → deployment → monitoring. Each step depends on the previous one, can fail independently, and may need to run on different schedules.

  • Orchestrators manage these pipelines:

  • Apache Airflow: the standard for data pipeline orchestration. DAGs (Directed Acyclic Graphs) define task dependencies. Each task runs independently, can be retried on failure, and is monitored via a web UI.

# Airflow DAG example (simplified; preprocess_data etc. are defined elsewhere)
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

dag = DAG("training_pipeline", schedule="@daily", start_date=datetime(2024, 1, 1))

preprocess = PythonOperator(task_id="preprocess", python_callable=preprocess_data, dag=dag)
train = PythonOperator(task_id="train", python_callable=train_model, dag=dag)
evaluate = PythonOperator(task_id="evaluate", python_callable=evaluate_model, dag=dag)
deploy = PythonOperator(task_id="deploy", python_callable=deploy_model, dag=dag)

# >> sets dependencies: preprocess runs first, deploy last
preprocess >> train >> evaluate >> deploy
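The core mechanics an orchestrator provides (dependency ordering, retries on failure) can be sketched as a toy scheduler in plain Python. This is an illustration of the idea, not Airflow's implementation:

```python
def run_dag(tasks, deps, max_retries=2):
    """Toy scheduler: run tasks in dependency order, retrying failures.
    tasks: {name: callable}; deps: {name: [upstream task names]}."""
    done, order = set(), []
    while len(done) < len(tasks):
        # a task is ready when all of its upstream tasks have finished
        ready = [t for t in tasks if t not in done
                 and all(u in done for u in deps.get(t, []))]
        if not ready:
            raise RuntimeError("cycle or unsatisfiable dependency")
        for t in ready:
            for attempt in range(max_retries + 1):
                try:
                    tasks[t]()
                    break
                except Exception:
                    if attempt == max_retries:
                        raise
            done.add(t)
            order.append(t)
    return order

log = []
order = run_dag(
    {"preprocess": lambda: log.append("p"),
     "train": lambda: log.append("t"),
     "evaluate": lambda: log.append("e")},
    {"train": ["preprocess"], "evaluate": ["train"]},
)
```

Real orchestrators add what this toy omits: persistence across restarts, per-task scheduling, parallel execution, and a UI for inspecting failed runs.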
  • Kubeflow Pipelines: ML-specific orchestration on Kubernetes. Each step runs in a container, GPU resources are allocated on demand, and experiments are tracked automatically.

  • Prefect and Dagster: modern alternatives to Airflow with better developer experience, native Python APIs, and built-in data lineage.

  • When to orchestrate: when your pipeline has more than 2-3 steps, runs on a schedule, involves multiple teams or services, or needs automatic recovery from failures. A single-script training job does not need an orchestrator. A daily retraining pipeline that ingests data from 5 sources, trains 3 models, evaluates them, and deploys the best one absolutely does.