Codebase Design and Patterns¶
Good codebase design is what separates a research prototype from production software. This file covers project structure, clean code principles, design patterns relevant to ML, configuration management, logging, API design, and packaging.
- Most ML code starts as a Jupyter notebook. The notebook grows, gets copied, modified, shared, and eventually becomes an unmaintainable tangle of global variables, dead cells, and magic numbers. Codebase design is the discipline of organising code so that it remains understandable and modifiable as the project grows.
- This is not about following rules for their own sake. It is about reducing the time between "I want to change X" and "X is changed and working." In a well-designed codebase, that time is minutes. In a poorly designed one, it is days of archaeology through undocumented spaghetti.
Project Structure¶
- A consistent project layout lets anyone (including future you) navigate the codebase instantly.
my_project/
├── src/my_project/ # source code (importable package)
│ ├── __init__.py
│ ├── data/ # data loading and preprocessing
│ │ ├── __init__.py
│ │ ├── dataset.py
│ │ └── transforms.py
│ ├── models/ # model architectures
│ │ ├── __init__.py
│ │ ├── transformer.py
│ │ └── layers.py
│ ├── training/ # training loops, optimisers
│ │ ├── __init__.py
│ │ ├── trainer.py
│ │ └── losses.py
│ └── utils/ # shared utilities
│ ├── __init__.py
│ └── logging.py
├── configs/ # configuration files
│ ├── base.yaml
│ └── experiment_1.yaml
├── scripts/ # entry points (train, evaluate, serve)
│ ├── train.py
│ ├── evaluate.py
│ └── serve.py
├── tests/ # test files (mirrors src/ structure)
│ ├── test_dataset.py
│ ├── test_model.py
│ └── test_trainer.py
├── notebooks/ # exploration only (not production code)
├── pyproject.toml # project metadata and dependencies
├── README.md
├── .gitignore
└── Dockerfile
- src/ layout: putting source code under src/my_project/ prevents accidental imports from the current directory (which would mask import errors that only surface in production). Install with pip install -e . for development.
- Monorepo vs multi-repo: a monorepo keeps all related projects in one repository (easier cross-project changes, shared CI). A multi-repo gives each project its own repository (cleaner boundaries, independent versioning). Most ML teams start with a monorepo and split later if needed.
- Scripts vs library: keep entry points (train.py, evaluate.py) in scripts/. Keep reusable logic in src/. A training script should be ~50 lines: parse config, build dataset, build model, build trainer, train. All the complexity lives in the library.
Clean Code Principles¶
- Naming: the single most impactful thing you can do. A variable named x requires you to read the surrounding code to understand it. A variable named learning_rate is self-documenting.
# BAD
def proc(d, n, lr, g):
    for i in range(n):
        for k, v in d.items():
            v -= lr * g[k]

# GOOD
def update_parameters(parameters, num_steps, learning_rate, gradients):
    for step in range(num_steps):
        for name, param in parameters.items():
            param -= learning_rate * gradients[name]
- Single Responsibility Principle: each function/class does one thing. A function called load_data_and_train_model is doing two things and should be split. This makes each piece independently testable, reusable, and understandable.
- DRY (Don't Repeat Yourself), but not prematurely. If you copy-paste code three times, extract it into a function. But do not create an abstraction for code you have used only once. Premature abstraction is worse than duplication: it adds complexity without proven benefit.
# Premature abstraction (one use case, over-engineered)
class AbstractDataTransformPipelineFactory:
    ...

# Just right (direct, clear, used in three places)
def normalise_image(image, mean, std):
    return (image - mean) / std
- Magic numbers: never use unexplained literal values.
# BAD
if len(batch) > 32:
    split_batch(batch, 32)

# GOOD
MAX_BATCH_SIZE = 32
if len(batch) > MAX_BATCH_SIZE:
    split_batch(batch, MAX_BATCH_SIZE)
- Functions should be short: if a function does not fit on one screen (~30 lines), it is probably doing too much. Extract logical chunks into helper functions with descriptive names. The function body then reads like a high-level summary.
Design Patterns for ML¶
- Design patterns are reusable solutions to common problems. These are the ones most relevant to ML codebases:
- Factory pattern: create objects without specifying the exact class. Useful when your config says model: "transformer" and you need to instantiate the right class:
MODEL_REGISTRY = {
    "transformer": TransformerModel,
    "cnn": CNNModel,
    "mlp": MLPModel,
}

def build_model(config):
    model_cls = MODEL_REGISTRY[config["model"]]
    return model_cls(**config["model_params"])
- This decouples the training script from specific model implementations. Adding a new model means adding one line to the registry, not modifying the training loop.
- Strategy pattern: swap algorithms at runtime. Useful for losses, optimisers, schedulers:
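A common refinement is self-registration: each model class adds itself to the registry with a decorator, so adding a model touches only the model's own file. A runnable sketch with a hypothetical register decorator and a toy MLPModel:

```python
MODEL_REGISTRY = {}

def register(name):
    """Class decorator that adds the class to MODEL_REGISTRY under `name`."""
    def wrap(cls):
        MODEL_REGISTRY[name] = cls
        return cls
    return wrap

@register("mlp")
class MLPModel:
    def __init__(self, hidden_size):
        self.hidden_size = hidden_size

def build_model(config):
    model_cls = MODEL_REGISTRY[config["model"]]
    return model_cls(**config["model_params"])

model = build_model({"model": "mlp", "model_params": {"hidden_size": 64}})
print(type(model).__name__, model.hidden_size)  # MLPModel 64
```

Importing the module that defines the class is what populates the registry, so packages using this pattern typically import all model modules in models/__init__.py.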
LOSS_FUNCTIONS = {
    "mse": nn.MSELoss,
    "cross_entropy": nn.CrossEntropyLoss,
    "focal": FocalLoss,
}

loss_fn = LOSS_FUNCTIONS[config["loss"]]()
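To make the swap mechanics concrete without a torch dependency, here is the same idea with plain functions (hypothetical mse/mae helpers; unlike the nn classes above, these are functions, so there is no trailing () instantiation):

```python
def mse(pred, target):
    """Mean squared error over two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def mae(pred, target):
    """Mean absolute error over two equal-length lists."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

LOSS_FUNCTIONS = {"mse": mse, "mae": mae}

config = {"loss": "mae"}
loss_fn = LOSS_FUNCTIONS[config["loss"]]  # algorithm chosen at runtime from config
print(loss_fn([1.0, 2.0], [0.0, 2.0]))  # 0.5
```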
- Observer pattern (callbacks/hooks): let modules react to events without tight coupling. Training frameworks (PyTorch Lightning, Keras) use callbacks extensively:
class EarlyStopping:
    def __init__(self, patience=5):
        self.patience = patience
        self.best_loss = float('inf')
        self.counter = 0

    def on_epoch_end(self, epoch, val_loss):
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.counter = 0
        else:
            self.counter += 1
            if self.counter >= self.patience:
                return "stop"
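A minimal sketch of the other side of the pattern: a toy trainer that notifies its callbacks after every epoch and stops when one asks it to (real frameworks pass far richer state; EarlyStopping is repeated so the example is self-contained):

```python
class EarlyStopping:
    def __init__(self, patience=5):
        self.patience = patience
        self.best_loss = float('inf')
        self.counter = 0

    def on_epoch_end(self, epoch, val_loss):
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.counter = 0
        else:
            self.counter += 1
            if self.counter >= self.patience:
                return "stop"

class Trainer:
    """Toy trainer: notifies callbacks after each epoch, stops if any asks."""
    def __init__(self, callbacks):
        self.callbacks = callbacks

    def fit(self, val_losses):
        for epoch, loss in enumerate(val_losses):
            if any(cb.on_epoch_end(epoch, loss) == "stop" for cb in self.callbacks):
                return epoch  # stopped early
        return len(val_losses) - 1

trainer = Trainer([EarlyStopping(patience=2)])
# Loss improves for two epochs, then plateaus; patience runs out at epoch 3
last_epoch = trainer.fit([1.0, 0.8, 0.9, 0.9, 0.9, 0.9])
print(last_epoch)  # 3
```

The trainer knows nothing about early stopping, checkpointing, or logging; each concern lives in its own callback.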
- Dependency injection: pass dependencies into a function/class rather than creating them inside. This makes testing easy (inject a mock) and configuration flexible:
# BAD: hard-coded dependency
class Trainer:
    def __init__(self):
        self.logger = WandbLogger()  # cannot test without W&B

# GOOD: injected dependency
class Trainer:
    def __init__(self, logger):
        self.logger = logger  # can inject any logger, including a mock
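The payoff shows up in tests. A sketch with a hypothetical ListLogger standing in for a real experiment tracker:

```python
class ListLogger:
    """Test double: records metrics in memory instead of sending them anywhere."""
    def __init__(self):
        self.records = []

    def log(self, **metrics):
        self.records.append(metrics)

class Trainer:
    def __init__(self, logger):
        self.logger = logger

    def train_step(self, loss):
        # real training work would happen here; we only log the metric
        self.logger.log(loss=loss)

trainer = Trainer(ListLogger())  # no W&B account required
trainer.train_step(0.25)
print(trainer.logger.records)  # [{'loss': 0.25}]
```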
Configuration Management¶
- Hard-coding hyperparameters, file paths, and model settings makes experiments unreproducible and modifications painful. Externalise configuration into files.
- YAML is the most common format for ML configs:
# configs/experiment_1.yaml
model:
  name: transformer
  d_model: 512
  n_heads: 8
  n_layers: 6

training:
  batch_size: 64
  learning_rate: 3e-4
  max_epochs: 100
  early_stopping_patience: 10

data:
  train_path: /data/train.parquet
  val_path: /data/val.parquet
  max_seq_length: 512
- Hydra (Facebook) is a configuration framework that supports composition (merge a base config with experiment-specific overrides), command-line overrides (python train.py training.lr=1e-3), and multi-run (sweep over hyperparameters).
- argparse is simpler for scripts with a few parameters:
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--lr", type=float, default=3e-4)
parser.add_argument("--batch-size", type=int, default=64)
parser.add_argument("--config", type=str, default="configs/base.yaml")
args = parser.parse_args()
- Best practice: have a base config with all defaults, and per-experiment configs that override only what changes. Track every experiment's config alongside its results.
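The override step itself is only a few lines. A sketch of a recursive merge (hypothetical deep_merge helper, independent of any config library):

```python
def deep_merge(base, override):
    """Return base with override applied; nested dicts are merged recursively."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

base = {"training": {"batch_size": 64, "learning_rate": 3e-4}}
experiment = {"training": {"learning_rate": 1e-3}}  # override only what changes
print(deep_merge(base, experiment))
# {'training': {'batch_size': 64, 'learning_rate': 0.001}}
```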
Logging and Observability¶
print statements are for debugging. Logging is for production:
import logging
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
logger.debug("Batch loaded: %d samples", len(batch)) # noisy, for debugging
logger.info("Epoch %d: loss=%.4f, lr=%.6f", epoch, loss, lr) # normal operation
logger.warning("GPU memory >90%, consider reducing batch size")
logger.error("Failed to load checkpoint: %s", path) # recoverable error
logger.critical("CUDA out of memory, aborting") # fatal
- Why not print: logging supports levels (filter out debug messages in production), formatting (timestamps, module names), and handlers (write to file, send to a monitoring system) without changing the logging calls.
- Structured logging outputs machine-parseable formats (JSON) alongside human-readable messages. This enables searching and alerting on specific fields.
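A minimal structured-logging sketch using only the standard library (real projects often reach for a dedicated library such as structlog):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("trainer")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("epoch %d complete", 3)  # emits one JSON line
```

Fields like epoch or loss can be added to the dict, after which a log aggregator can filter and alert on them directly.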
API Design¶
- If your model will be used by other services (a web app, a mobile app, another ML pipeline), it needs an API (Application Programming Interface).
- REST APIs use HTTP methods: GET to read, POST to create/predict, PUT to update, DELETE to remove. Endpoints follow resource-based naming:
POST /api/v1/predict # send input, get prediction
GET /api/v1/models # list available models
GET /api/v1/models/{id} # get model details
POST /api/v1/models/{id}/predict # predict with a specific model
- FastAPI is the go-to Python framework for ML serving:
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    text: str

class PredictResponse(BaseModel):
    label: str
    confidence: float

@app.post("/predict", response_model=PredictResponse)
async def predict(request: PredictRequest):
    result = model.predict(request.text)
    return PredictResponse(label=result.label, confidence=result.score)
- FastAPI auto-generates API documentation (Swagger UI at /docs), validates input/output with Pydantic models, and supports async for high throughput.
- gRPC is faster than REST for internal service-to-service communication. It uses Protocol Buffers (binary serialisation, smaller and faster than JSON) and supports streaming. Used by TensorFlow Serving, Triton Inference Server, and many microservice architectures.
Packaging and Distribution¶
- Making your code installable as a package lets others (and your own scripts) import it cleanly:
# pyproject.toml
[project]
name = "my-ml-project"
version = "0.1.0"
requires-python = ">=3.10"
dependencies = [
    "torch>=2.0",
    "jax>=0.4",
    "pydantic>=2.0",
]

[project.optional-dependencies]
dev = ["pytest", "ruff", "mypy"]

[build-system]
requires = ["setuptools>=64"]
build-backend = "setuptools.build_meta"
- Editable install (pip install -e .): changes to your source code are immediately reflected without reinstalling. Essential during development.
- Pinning dependencies: a requirements.txt with exact versions (torch==2.2.1, not torch>=2.0) ensures reproducibility. Use pip freeze > requirements.txt to capture your current environment. For more sophisticated dependency management, use uv, poetry, or pip-tools.
Working with AI Coding Agents¶
- AI coding agents (Claude Code, GitHub Copilot, Cursor, etc.) are now part of the professional engineering workflow. Used well, they dramatically accelerate development. Used poorly, they introduce subtle bugs, erode your understanding of your own codebase, and create a false sense of productivity.
- The right mental model: an agent is a fast but inexperienced pair programmer. It can write code quickly, knows syntax and standard patterns, and has read more documentation than you ever will. But it does not understand your specific system, your business constraints, your edge cases, or the why behind your design decisions. You are the senior engineer; the agent is the junior. You direct, review, and take responsibility.
When Agents Excel¶
- Boilerplate and scaffolding: generating Dockerfiles, CI configs, test fixtures, data class definitions, argparse setups. These follow well-known patterns and are tedious to write by hand. Let the agent generate them, then review for correctness.
- Writing tests: describe the function's behaviour, and the agent generates test cases. It often catches edge cases you would miss (empty input, negative values, Unicode). Always read the generated tests: they verify your assumptions, not just your code.
- Refactoring: "extract this block into a function," "convert this class to use dataclasses," "add type hints to this module." Mechanical transformations where the intent is clear and the risk of subtle errors is low.
- Exploration and prototyping: "write a quick script to benchmark inference latency" or "show me how to use the HuggingFace tokeniser API." The agent gets you a working starting point faster than reading documentation.
- Documentation and docstrings: the agent can generate documentation from your code structure. Review for accuracy, but the grunt work is automated.
- Debugging assistance: paste an error traceback and ask for diagnosis. The agent can often identify the root cause and suggest a fix, especially for common issues (shape mismatches, import errors, CUDA out of memory).
When to NOT Rely on Agents¶
- Novel architecture decisions: if you are designing a new training pipeline, the agent will give you a generic answer. It does not know your data constraints, latency requirements, or team expertise. Use the agent to implement the design you have already thought through.
- Security-critical code: authentication, encryption, input sanitisation. The agent may generate code that looks correct but has subtle vulnerabilities (SQL injection, insecure defaults, timing attacks). Security code should be written by someone who understands the threat model, and reviewed by someone else.
- Performance-critical inner loops: the agent will write correct but naive code. For GPU kernels, memory-critical data structures, or latency-sensitive serving paths, you need to understand the hardware constraints (chapter 13, chapter 16) and optimise deliberately.
- Code you don't understand: if the agent generates 200 lines and you cannot explain what each line does, do not commit it. You are now maintaining code you do not understand, and when it breaks (it will), you cannot debug it. This is the most common and most dangerous failure mode.
The Review Discipline¶
- Always read every line of generated code before committing. This is not optional. The agent's code is a draft, not a finished product. Treat it exactly like a pull request from a colleague: review it critically.
- What to check:
- Correctness: does it actually do what you asked? Agents often solve a subtly different problem than the one you intended.
- Edge cases: does it handle empty inputs, None values, negative numbers, very large inputs? Agents frequently omit edge case handling.
- Hallucinated APIs: the agent may call functions or use parameters that do not exist, especially for newer or less common libraries. Verify that every API call is real.
- Over-engineering: agents tend to produce more code than necessary. A 50-line solution to a 10-line problem adds complexity without benefit. Simplify ruthlessly.
- Security: hardcoded secrets, unsanitised user input, insecure defaults. The agent does not think adversarially.
- Style consistency: does the generated code match your project's conventions (naming, patterns, error handling)?
How to Write Good Prompts¶
- The quality of the agent's output is directly proportional to the quality of your instruction. Vague prompts get vague code.
- Bad: "write a data loader"
- Good: "write a PyTorch DataLoader for a CSV file with columns 'text' and 'label'. Tokenise the text using the HuggingFace tokeniser 'bert-base-uncased' with max_length=512. Return input_ids, attention_mask, and label as tensors. Handle the case where the CSV has missing values in the label column by skipping those rows."
- Provide context: tell the agent about your project structure, existing code, constraints, and conventions. The more context, the better the output.
- Specify constraints: "use only the standard library," "must work with Python 3.10," "do not use global variables," "follow the existing pattern in src/models/transformer.py."
- Ask for explanations: "implement X and explain the key design decisions." This forces the agent to articulate its reasoning, making it easier for you to spot flawed assumptions.
Using Quality Gates to Catch Agent Mistakes¶
- Your existing quality infrastructure (file 04) catches agent errors just as well as human errors:
- Type checking (mypy): catches hallucinated API signatures and type mismatches.
- Linting (ruff): catches unused imports, undefined variables, and style violations.
- Tests (pytest): if the agent's code passes your test suite, it is more likely correct. If you do not have tests, write them before asking the agent to implement the feature (test-driven development works especially well with agents).
- CI pipeline: runs all of the above automatically on every commit.
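The test-first flow can be sketched in a few lines: you pin down the behaviour in a test, then hand the implementation to the agent and let the test judge the draft (normalise_image here echoes the earlier example and is hypothetical):

```python
# Step 1: you write the test that defines the contract
def test_normalise_image_zero_mean_unit_std():
    result = normalise_image([10.0, 20.0, 30.0], mean=20.0, std=10.0)
    assert result == [-1.0, 0.0, 1.0]

# Step 2: the agent drafts an implementation to make the test pass
def normalise_image(image, mean, std):
    return [(pixel - mean) / std for pixel in image]

# Step 3: the test, not your impression of the code, decides acceptance
test_normalise_image_zero_mean_unit_std()
```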
- The combination of "agent writes code" + "quality gates verify it" is more productive than either alone. The agent is fast but sloppy; the gates are thorough but do not write code. Together, you get speed and correctness.
The Productivity Trap¶
- The biggest risk of coding agents is the illusion of productivity. You can generate 500 lines of code in 10 minutes. But if you spend 2 hours debugging those 500 lines because you did not understand them, you were slower than writing 200 lines yourself in 30 minutes.
- True productivity with agents comes from:
- Staying in control: you decide the architecture, the agent fills in the implementation.
- Understanding what is generated: if you cannot explain it, rewrite it or ask the agent to simplify.
- Investing in quality gates: tests, types, and linting amortise their cost across every agent interaction.
- Using the agent for your weaknesses: if you are great at algorithms but slow at writing tests, let the agent write tests. If you are fast at UI code but unfamiliar with database queries, let the agent draft the SQL. Play to your strengths, delegate your gaps.
- The engineers who get the most out of coding agents are the ones who already know how to code well. The agent amplifies your existing skill; it does not replace it. Understanding data structures, algorithms, system design, and software engineering (this entire chapter) is what lets you direct the agent effectively and evaluate its output critically.