Codebase Design and Patterns¶
Good codebase design is what separates a research prototype from production software. This file covers project structure, clean code principles, design patterns relevant to ML, configuration management, logging, API design, and packaging.
- Most ML code starts as a Jupyter notebook. The notebook grows, gets copied, modified, shared, and eventually becomes an unmaintainable tangle of global variables, dead cells, and magic numbers. Codebase design is the discipline of organising code so that it remains understandable and modifiable as the project grows.
- This is not about following rules for their own sake. It is about reducing the time between "I want to change X" and "X is changed and working." In a well-designed codebase, that time is minutes. In a poorly designed one, it is days of archaeology through undocumented spaghetti.
Project Structure¶
- A consistent project layout lets anyone (including future you) navigate the codebase instantly.
my_project/
├── src/my_project/ # source code (importable package)
│ ├── __init__.py
│ ├── data/ # data loading and preprocessing
│ │ ├── __init__.py
│ │ ├── dataset.py
│ │ └── transforms.py
│ ├── models/ # model architectures
│ │ ├── __init__.py
│ │ ├── transformer.py
│ │ └── layers.py
│ ├── training/ # training loops, optimisers
│ │ ├── __init__.py
│ │ ├── trainer.py
│ │ └── losses.py
│ └── utils/ # shared utilities
│ ├── __init__.py
│ └── logging.py
├── configs/ # configuration files
│ ├── base.yaml
│ └── experiment_1.yaml
├── scripts/ # entry points (train, evaluate, serve)
│ ├── train.py
│ ├── evaluate.py
│ └── serve.py
├── tests/ # test files (mirrors src/ structure)
│ ├── test_dataset.py
│ ├── test_model.py
│ └── test_trainer.py
├── notebooks/ # exploration only (not production code)
├── pyproject.toml # project metadata and dependencies
├── README.md
├── .gitignore
└── Dockerfile
- src/ layout: putting source code under src/my_project/ prevents accidental imports from the current directory (which would mask import errors that only surface in production). Install with pip install -e . for development.
- Monorepo vs multi-repo: a monorepo keeps all related projects in one repository (easier cross-project changes, shared CI). A multi-repo gives each project its own repository (cleaner boundaries, independent versioning). Most ML teams start with a monorepo and split later if needed.
- Scripts vs library: keep entry points (train.py, evaluate.py) in scripts/. Keep reusable logic in src/. A training script should be ~50 lines: parse config, build dataset, build model, build trainer, train. All the complexity lives in the library.
Clean Code Principles¶
- Naming: the single most impactful thing you can do. A variable named x requires you to read the surrounding code to understand it. A variable named learning_rate is self-documenting.
# BAD
def proc(d, n, lr, g):
    for i in range(n):
        for k, v in d.items():
            v -= lr * g[k]

# GOOD
def update_parameters(parameters, num_steps, learning_rate, gradients):
    for step in range(num_steps):
        for name, param in parameters.items():
            param -= learning_rate * gradients[name]
- Single Responsibility Principle: each function/class does one thing. A function called load_data_and_train_model is doing two things and should be split. This makes each piece independently testable, reusable, and understandable.
- DRY (Don't Repeat Yourself), but not prematurely. If you copy-paste code three times, extract it into a function. But do not create an abstraction for code you have used only once. Premature abstraction is worse than duplication: it adds complexity without proven benefit.
# Premature abstraction (one use case, over-engineered)
class AbstractDataTransformPipelineFactory:
    ...

# Just right (direct, clear, used in three places)
def normalise_image(image, mean, std):
    return (image - mean) / std
- Magic numbers: never use unexplained literal values.
# BAD
if len(batch) > 32:
    split_batch(batch, 32)

# GOOD
MAX_BATCH_SIZE = 32
if len(batch) > MAX_BATCH_SIZE:
    split_batch(batch, MAX_BATCH_SIZE)
- Functions should be short: if a function does not fit on one screen (~30 lines), it is probably doing too much. Extract logical chunks into helper functions with descriptive names. The function body then reads like a high-level summary.
Design Patterns for ML¶
- Design patterns are reusable solutions to common problems. These are the ones most relevant to ML codebases:
- Factory pattern: create objects without specifying the exact class. Useful when your config says model: "transformer" and you need to instantiate the right class:
MODEL_REGISTRY = {
    "transformer": TransformerModel,
    "cnn": CNNModel,
    "mlp": MLPModel,
}

def build_model(config):
    model_cls = MODEL_REGISTRY[config["model"]]
    return model_cls(**config["model_params"])
- This decouples the training script from specific model implementations. Adding a new model means adding one line to the registry, not modifying the training loop.
- Strategy pattern: swap algorithms at runtime. Useful for losses, optimisers, schedulers:
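A common refinement is self-registration: each model class adds itself to the registry with a decorator, so adding a model touches only the model's own file. A runnable sketch with a hypothetical register decorator and a toy MLPModel:

```python
MODEL_REGISTRY = {}

def register(name):
    """Class decorator that adds the class to MODEL_REGISTRY under `name`."""
    def wrap(cls):
        MODEL_REGISTRY[name] = cls
        return cls
    return wrap

@register("mlp")
class MLPModel:
    def __init__(self, hidden_size):
        self.hidden_size = hidden_size

def build_model(config):
    model_cls = MODEL_REGISTRY[config["model"]]
    return model_cls(**config["model_params"])

model = build_model({"model": "mlp", "model_params": {"hidden_size": 64}})
print(type(model).__name__, model.hidden_size)  # MLPModel 64
```

Importing the module that defines the class is what populates the registry, so packages using this pattern typically import all model modules in models/__init__.py.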
LOSS_FUNCTIONS = {
    "mse": nn.MSELoss,
    "cross_entropy": nn.CrossEntropyLoss,
    "focal": FocalLoss,
}

loss_fn = LOSS_FUNCTIONS[config["loss"]]()
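To make the swap mechanics concrete without a torch dependency, here is the same idea with plain functions (hypothetical mse/mae helpers; unlike the nn classes above, these are functions, so there is no trailing () instantiation):

```python
def mse(pred, target):
    """Mean squared error over two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def mae(pred, target):
    """Mean absolute error over two equal-length lists."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

LOSS_FUNCTIONS = {"mse": mse, "mae": mae}

config = {"loss": "mae"}
loss_fn = LOSS_FUNCTIONS[config["loss"]]  # algorithm chosen at runtime from config
print(loss_fn([1.0, 2.0], [0.0, 2.0]))  # 0.5
```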
- Observer pattern (callbacks/hooks): let modules react to events without tight coupling. Training frameworks (PyTorch Lightning, Keras) use callbacks extensively:
class EarlyStopping:
    def __init__(self, patience=5):
        self.patience = patience
        self.best_loss = float('inf')
        self.counter = 0

    def on_epoch_end(self, epoch, val_loss):
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.counter = 0
        else:
            self.counter += 1
            if self.counter >= self.patience:
                return "stop"
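A minimal sketch of the other side of the pattern: a toy trainer that notifies its callbacks after every epoch and stops when one asks it to (real frameworks pass far richer state; EarlyStopping is repeated so the example is self-contained):

```python
class EarlyStopping:
    def __init__(self, patience=5):
        self.patience = patience
        self.best_loss = float('inf')
        self.counter = 0

    def on_epoch_end(self, epoch, val_loss):
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.counter = 0
        else:
            self.counter += 1
            if self.counter >= self.patience:
                return "stop"

class Trainer:
    """Toy trainer: notifies callbacks after each epoch, stops if any asks."""
    def __init__(self, callbacks):
        self.callbacks = callbacks

    def fit(self, val_losses):
        for epoch, loss in enumerate(val_losses):
            if any(cb.on_epoch_end(epoch, loss) == "stop" for cb in self.callbacks):
                return epoch  # stopped early
        return len(val_losses) - 1

trainer = Trainer([EarlyStopping(patience=2)])
# Loss improves for two epochs, then plateaus; patience runs out at epoch 3
last_epoch = trainer.fit([1.0, 0.8, 0.9, 0.9, 0.9, 0.9])
print(last_epoch)  # 3
```

The trainer knows nothing about early stopping, checkpointing, or logging; each concern lives in its own callback.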
- Dependency injection: pass dependencies into a function/class rather than creating them inside. This makes testing easy (inject a mock) and configuration flexible:
# BAD: hard-coded dependency
class Trainer:
    def __init__(self):
        self.logger = WandbLogger()  # cannot test without W&B

# GOOD: injected dependency
class Trainer:
    def __init__(self, logger):
        self.logger = logger  # can inject any logger, including a mock
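The payoff shows up in tests. A sketch with a hypothetical ListLogger standing in for a real experiment tracker:

```python
class ListLogger:
    """Test double: records metrics in memory instead of sending them anywhere."""
    def __init__(self):
        self.records = []

    def log(self, **metrics):
        self.records.append(metrics)

class Trainer:
    def __init__(self, logger):
        self.logger = logger

    def train_step(self, loss):
        # real training work would happen here; we only log the metric
        self.logger.log(loss=loss)

trainer = Trainer(ListLogger())  # no W&B account required
trainer.train_step(0.25)
print(trainer.logger.records)  # [{'loss': 0.25}]
```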
Configuration Management¶
- Hard-coding hyperparameters, file paths, and model settings makes experiments unreproducible and modifications painful. Externalise configuration into files.
- YAML is the most common format for ML configs:
# configs/experiment_1.yaml
model:
  name: transformer
  d_model: 512
  n_heads: 8
  n_layers: 6

training:
  batch_size: 64
  learning_rate: 3e-4
  max_epochs: 100
  early_stopping_patience: 10

data:
  train_path: /data/train.parquet
  val_path: /data/val.parquet
  max_seq_length: 512
- Hydra (Facebook) is a configuration framework that supports composition (merge a base config with experiment-specific overrides), command-line overrides (python train.py training.lr=1e-3), and multi-run (sweep over hyperparameters).
- argparse is simpler for scripts with a few parameters:
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--lr", type=float, default=3e-4)
parser.add_argument("--batch-size", type=int, default=64)
parser.add_argument("--config", type=str, default="configs/base.yaml")
args = parser.parse_args()
- Best practice: have a base config with all defaults, and per-experiment configs that override only what changes. Track every experiment's config alongside its results.
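The override step itself is only a few lines. A sketch of a recursive merge (hypothetical deep_merge helper, independent of any config library):

```python
def deep_merge(base, override):
    """Return base with override applied; nested dicts are merged recursively."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

base = {"training": {"batch_size": 64, "learning_rate": 3e-4}}
experiment = {"training": {"learning_rate": 1e-3}}  # override only what changes
print(deep_merge(base, experiment))
# {'training': {'batch_size': 64, 'learning_rate': 0.001}}
```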
Logging and Observability¶
print statements are for debugging. Logging is for production:
import logging
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
logger.debug("Batch loaded: %d samples", len(batch)) # noisy, for debugging
logger.info("Epoch %d: loss=%.4f, lr=%.6f", epoch, loss, lr) # normal operation
logger.warning("GPU memory >90%, consider reducing batch size")
logger.error("Failed to load checkpoint: %s", path) # recoverable error
logger.critical("CUDA out of memory, aborting") # fatal
- Why not print: logging supports levels (filter out debug messages in production), formatting (timestamps, module names), and handlers (write to file, send to a monitoring system) without changing the logging calls.
- Structured logging outputs machine-parseable formats (JSON) alongside human-readable messages. This enables searching and alerting on specific fields.
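A minimal structured-logging sketch using only the standard library (real projects often reach for a dedicated library such as structlog):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("trainer")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("epoch %d complete", 3)  # emits one JSON line
```

Fields like epoch or loss can be added to the dict, after which a log aggregator can filter and alert on them directly.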
API Design¶
- If your model will be used by other services (a web app, a mobile app, another ML pipeline), it needs an API (Application Programming Interface).
- REST APIs use HTTP methods: GET to read, POST to create/predict, PUT to update, DELETE to remove. Endpoints follow resource-based naming:
POST /api/v1/predict # send input, get prediction
GET /api/v1/models # list available models
GET /api/v1/models/{id} # get model details
POST /api/v1/models/{id}/predict # predict with a specific model
- FastAPI is the go-to Python framework for ML serving:
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    text: str

class PredictResponse(BaseModel):
    label: str
    confidence: float

@app.post("/predict", response_model=PredictResponse)
async def predict(request: PredictRequest):
    result = model.predict(request.text)
    return PredictResponse(label=result.label, confidence=result.score)
- FastAPI auto-generates API documentation (Swagger UI at /docs), validates input/output with Pydantic models, and supports async for high throughput.
- gRPC is faster than REST for internal service-to-service communication. It uses Protocol Buffers (binary serialisation, smaller and faster than JSON) and supports streaming. Used by TensorFlow Serving, Triton Inference Server, and many microservice architectures.
Packaging and Distribution¶
- Making your code installable as a package lets others (and your own scripts) import it cleanly:
# pyproject.toml
[project]
name = "my-ml-project"
version = "0.1.0"
requires-python = ">=3.10"
dependencies = [
    "torch>=2.0",
    "jax>=0.4",
    "pydantic>=2.0",
]

[project.optional-dependencies]
dev = ["pytest", "ruff", "mypy"]

[build-system]
requires = ["setuptools>=64"]
build-backend = "setuptools.build_meta"
- Editable install (pip install -e .): changes to your source code are immediately reflected without reinstalling. Essential during development.
- Pinning dependencies: a requirements.txt with exact versions (torch==2.2.1, not torch>=2.0) ensures reproducibility. Use pip freeze > requirements.txt to capture your current environment. For more sophisticated dependency management, use uv, poetry, or pip-tools.
Working with AI Coding Agents¶
- AI coding agents (Claude Code, GitHub Copilot, Cursor, etc.) are now part of the professional engineering workflow. Used well, they dramatically accelerate development. Used poorly, they introduce subtle bugs, erode your understanding of your own codebase, and create a false sense of productivity.
- The right mental model: an agent is a fast but inexperienced pair programmer. It can write code quickly, knows syntax and standard patterns, and has read more documentation than you ever will. But it does not understand your specific system, your business constraints, your edge cases, or the why behind your design decisions. You are the senior engineer; the agent is the junior. You direct, review, and take responsibility.
When Agents Excel¶
- Boilerplate and scaffolding: generating Dockerfiles, CI configs, test fixtures, data class definitions, argparse setups. These follow well-known patterns and are tedious to write by hand. Let the agent generate them, then review for correctness.
- Writing tests: describe the function's behaviour, and the agent generates test cases. It often catches edge cases you would miss (empty input, negative values, Unicode). Always read the generated tests: they verify your assumptions, not just your code.
- Refactoring: "extract this block into a function," "convert this class to use dataclasses," "add type hints to this module." Mechanical transformations where the intent is clear and the risk of subtle errors is low.
- Exploration and prototyping: "write a quick script to benchmark inference latency" or "show me how to use the HuggingFace tokeniser API." The agent gets you a working starting point faster than reading documentation.
- Documentation and docstrings: the agent can generate documentation from your code structure. Review for accuracy, but the grunt work is automated.
- Debugging assistance: paste an error traceback and ask for diagnosis. The agent can often identify the root cause and suggest a fix, especially for common issues (shape mismatches, import errors, CUDA out of memory).
When to NOT Rely on Agents¶
- Novel architecture decisions: if you are designing a new training pipeline, the agent will give you a generic answer. It does not know your data constraints, latency requirements, or team expertise. Use the agent to implement the design you have already thought through.
- Security-critical code: authentication, encryption, input sanitisation. The agent may generate code that looks correct but has subtle vulnerabilities (SQL injection, insecure defaults, timing attacks). Security code should be written by someone who understands the threat model, and reviewed by someone else.
- Performance-critical inner loops: the agent will write correct but naive code. For GPU kernels, memory-critical data structures, or latency-sensitive serving paths, you need to understand the hardware constraints (chapter 13, chapter 16) and optimise deliberately.
- Code you don't understand: if the agent generates 200 lines and you cannot explain what each line does, do not commit it. You are now maintaining code you do not understand, and when it breaks (it will), you cannot debug it. This is the most common and most dangerous failure mode.
The Review Discipline¶
- Always read every line of generated code before committing. This is not optional. The agent's code is a draft, not a finished product. Treat it exactly like a pull request from a colleague: review it critically.
- What to check:
- Correctness: does it actually do what you asked? Agents often solve a subtly different problem than the one you intended.
- Edge cases: does it handle empty inputs, None values, negative numbers, very large inputs? Agents frequently omit edge case handling.
- Hallucinated APIs: the agent may call functions or use parameters that do not exist, especially for newer or less common libraries. Verify that every API call is real.
- Over-engineering: agents tend to produce more code than necessary. A 50-line solution to a 10-line problem adds complexity without benefit. Simplify ruthlessly.
- Security: hardcoded secrets, unsanitised user input, insecure defaults. The agent does not think adversarially.
- Style consistency: does the generated code match your project's conventions (naming, patterns, error handling)?
How to Write Good Prompts¶
- The quality of the agent's output is directly proportional to the quality of your instruction. Vague prompts get vague code.
- Bad: "write a data loader"
- Good: "write a PyTorch DataLoader for a CSV file with columns 'text' and 'label'. Tokenise the text using the HuggingFace tokeniser 'bert-base-uncased' with max_length=512. Return input_ids, attention_mask, and label as tensors. Handle the case where the CSV has missing values in the label column by skipping those rows."
- Provide context: tell the agent about your project structure, existing code, constraints, and conventions. The more context, the better the output.
- Specify constraints: "use only the standard library," "must work with Python 3.10," "do not use global variables," "follow the existing pattern in src/models/transformer.py."
- Ask for explanations: "implement X and explain the key design decisions." This forces the agent to articulate its reasoning, making it easier for you to spot flawed assumptions.
Using Quality Gates to Catch Agent Mistakes¶
- Your existing quality infrastructure (file 04) catches agent errors just as well as human errors:
- Type checking (mypy): catches hallucinated API signatures and type mismatches.
- Linting (ruff): catches unused imports, undefined variables, and style violations.
- Tests (pytest): if the agent's code passes your test suite, it is more likely correct. If you do not have tests, write them before asking the agent to implement the feature (test-driven development works especially well with agents).
- CI pipeline: runs all of the above automatically on every commit.
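The test-first flow can be sketched in a few lines: you pin down the behaviour in a test, then hand the implementation to the agent and let the test judge the draft (normalise_image here echoes the earlier example and is hypothetical):

```python
# Step 1: you write the test that defines the contract
def test_normalise_image_zero_mean_unit_std():
    result = normalise_image([10.0, 20.0, 30.0], mean=20.0, std=10.0)
    assert result == [-1.0, 0.0, 1.0]

# Step 2: the agent drafts an implementation to make the test pass
def normalise_image(image, mean, std):
    return [(pixel - mean) / std for pixel in image]

# Step 3: the test, not your impression of the code, decides acceptance
test_normalise_image_zero_mean_unit_std()
```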
- The combination of "agent writes code" + "quality gates verify it" is more productive than either alone. The agent is fast but sloppy; the gates are thorough but do not write code. Together, you get speed and correctness.
The Productivity Trap¶
- The biggest risk of coding agents is the illusion of productivity. You can generate 500 lines of code in 10 minutes. But if you spend 2 hours debugging those 500 lines because you did not understand them, you were slower than writing 200 lines yourself in 30 minutes.
- True productivity with agents comes from:
- Staying in control: you decide the architecture, the agent fills in the implementation.
- Understanding what is generated: if you cannot explain it, rewrite it or ask the agent to simplify.
- Investing in quality gates: tests, types, and linting amortise their cost across every agent interaction.
- Using the agent for your weaknesses: if you are great at algorithms but slow at writing tests, let the agent write tests. If you are fast at UI code but unfamiliar with database queries, let the agent draft the SQL. Play to your strengths, delegate your gaps.
- The engineers who get the most out of coding agents are the ones who already know how to code well. The agent amplifies your existing skill; it does not replace it. Understanding data structures, algorithms, system design, and software engineering (this entire chapter) is what lets you direct the agent effectively and evaluate its output critically.