Git and Version Control

Git is how software teams collaborate without overwriting each other's work. This file covers the Git mental model, branching strategies, merging and rebasing, conflict resolution, pull requests, and managing ML-specific challenges like large files and experiment tracking.

  • Every serious software project uses version control. Git is the dominant system, used by virtually all open-source projects and companies. Without git, collaboration is emailing zip files and praying nobody overwrites your changes. With git, every change is tracked, reversible, and attributable.

  • For ML engineers: git tracks your code, configs, and experiment scripts. Combined with experiment tracking tools, it gives you reproducibility: "what exact code and config produced this model?"

The Mental Model

  • Git tracks snapshots of your project. Every commit is a full snapshot of all tracked files at that moment, not a diff. Internally, git avoids duplication by reusing unchanged file contents and delta-compressing objects in packfiles, but conceptually each commit is a complete state.

  • Four "locations" for your files:

    1. Working directory: the actual files on disk. You edit these.
    2. Staging area (index): files you have marked for the next commit. git add moves changes here.
    3. Local repository: your commit history, stored in .git/. git commit saves the staging area as a new snapshot.
    4. Remote repository (e.g., GitHub): a shared copy. git push uploads your commits, git pull downloads others'.
Working Dir  →  git add  →  Staging  →  git commit  →  Local Repo  →  git push  →  Remote
                                                        ←  git pull  ←
  • The staging area is what makes git powerful. You can edit 10 files but only commit 3 of them, keeping the other changes for a separate commit. This enables clean, focused commits.
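
  • The flow above can be modeled in a few lines. This is a toy sketch and not how git is implemented: the ToyRepo class and its method names are hypothetical, and file contents are plain strings. It shows that a commit saves whatever is staged, not whatever is on disk.

```python
"""Toy model of git's working directory, staging area, and local repository."""


class ToyRepo:
    def __init__(self):
        self.working = {}   # path -> content: the files you edit
        self.staging = {}   # path -> content: what the next commit will contain
        self.commits = []   # list of (message, snapshot) tuples, oldest first

    def edit(self, path, content):
        """Change a file on disk; nothing is staged or committed yet."""
        self.working[path] = content

    def add(self, path):
        """Like `git add path`: copy the current file content into staging."""
        self.staging[path] = self.working[path]

    def commit(self, message):
        """Like `git commit`: save the staging area as a full snapshot."""
        snapshot = dict(self.staging)  # a complete copy, not a diff
        self.commits.append((message, snapshot))
        return snapshot


repo = ToyRepo()
repo.edit("train.py", "lr = 0.001")
repo.edit("notes.txt", "half-finished idea")
repo.add("train.py")                      # stage only one of the two edits
first = repo.commit("Add training script")

repo.edit("train.py", "lr = 0.0005")      # edited again, but not re-staged
repo.add("notes.txt")
second = repo.commit("Add notes")

print(sorted(first))        # ['train.py']
print(sorted(second))       # ['notes.txt', 'train.py']
print(second["train.py"])   # lr = 0.001
```

  • The second commit still contains the older train.py because the later edit was never staged, which is the behavior that lets you split unrelated changes into separate commits.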

Essential Commands

git init                          # create a new repository
git clone url                     # download a remote repository
git status                        # what has changed? (most-used command)
git add file.py                   # stage a specific file
git add .                         # stage all changes (use with caution)
git commit -m "descriptive msg"   # commit staged changes
git push                          # upload commits to remote
git pull                          # download + merge remote changes
git log --oneline                 # compact commit history
git diff                          # show unstaged changes
git diff --staged                 # show staged changes
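
  • Scripts that need to read repository state usually parse git status --porcelain rather than the human-readable output. The sketch below parses the version 1 porcelain format (two status characters, a space, then the path). The function name parse_porcelain and the sample text are made up for illustration, and renames and quoted paths are deliberately not handled.

```python
def parse_porcelain(output):
    """Parse `git status --porcelain` (v1) text into (index, worktree, path) tuples."""
    entries = []
    for line in output.splitlines():
        if len(line) < 4:
            continue  # skip blank or malformed lines
        # Column 1 is the staging-area status, column 2 the working-directory status.
        index_status, worktree_status, path = line[0], line[1], line[3:]
        entries.append((index_status, worktree_status, path))
    return entries


# Hypothetical output; do not strip() real output, leading spaces are significant.
sample = " M train.py\nA  loader.py\n?? scratch.ipynb\n"
entries = parse_porcelain(sample)
staged = [path for index, _, path in entries if index not in (" ", "?", "!")]
untracked = [path for index, _, path in entries if index == "?"]

print(staged)      # ['loader.py']
print(untracked)   # ['scratch.ipynb']
```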

Branching

  • A branch is a pointer to a commit. The default branch is main (or master). Creating a branch gives you an independent line of development: you can make changes without affecting main.
git branch feature-x              # create a branch
git checkout feature-x            # switch to it (or: git switch feature-x, Git 2.23+)
git checkout -b feature-x         # create and switch in one step (or: git switch -c feature-x)
git branch -d feature-x           # delete branch (refuses if it is not merged)
git branch -a                     # list all branches (local + remote-tracking)
  • When to branch: always. Never commit directly to main. Every feature, bug fix, or experiment gets its own branch. This keeps main stable and deployable.
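
  • "A branch is a pointer to a commit" can be made concrete with two dictionaries. This is a simplified model, not git's storage format: the commit ids c1, c2, c3 and the helper named commit are invented for the example.

```python
commits = {}   # commit id -> {"parent": id or None, "message": str}
branches = {}  # branch name -> commit id (this is all a branch is)


def commit(branch, commit_id, message):
    """Add a commit on top of `branch` and move that branch's pointer to it."""
    commits[commit_id] = {"parent": branches.get(branch), "message": message}
    branches[branch] = commit_id


commit("main", "c1", "Initial commit")
commit("main", "c2", "Add data loader")

# Like `git branch feature-x`: copies one pointer, no files are duplicated.
branches["feature-x"] = branches["main"]
commit("feature-x", "c3", "Try cosine schedule")

print(branches)                  # {'main': 'c2', 'feature-x': 'c3'}
print(commits["c3"]["parent"])   # c2
```

  • Committing on feature-x moved only that pointer; main still points at c2, which is why work on a branch cannot disturb main until it is merged.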

Branching Strategies

  • Feature branches (most common): each feature/fix gets a branch off main. When done, open a pull request (PR) to merge back. Simple, works for most teams.

  • Trunk-based development: developers commit to main frequently (multiple times per day), using feature flags to hide incomplete work. Preferred by teams that deploy continuously (Google, Facebook). Requires excellent CI/CD.

  • Gitflow: separate branches for features, releases, and hotfixes. More complex, better for software with versioned releases (mobile apps, packaged software). Overkill for most ML projects.

  • For ML teams: short-lived feature branches (merged within 1-3 days) are the sweet spot. Long-lived branches diverge from main and create painful merge conflicts.
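
  • The feature flags mentioned under trunk-based development can be as simple as an environment variable check. The sketch below is one hypothetical scheme, not a standard: the FEATURE_ prefix, the flag name new_scheduler, and the config keys are all assumptions made for illustration.

```python
import os


def flag_enabled(name, default=False):
    """Read a boolean feature flag from an environment variable (hypothetical scheme)."""
    raw = os.environ.get(f"FEATURE_{name.upper()}")
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def build_optimizer_config():
    """Return a training config; unfinished work is gated behind a flag."""
    config = {"optimizer": "adam", "lr": 0.001}
    if flag_enabled("new_scheduler"):
        # Incomplete work lives on main but stays dormant until the flag is on.
        config["scheduler"] = "cosine"
    return config


print(build_optimizer_config())
```

  • With the flag unset, the half-finished scheduler code is merged but never runs, so main stays deployable while the work continues in small commits.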

Merging and Rebasing

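  • Merge combines two branches. When both have new commits, git creates a "merge commit"; when main has not moved since you branched, git fast-forwards main instead, unless you pass --no-ff: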
git checkout main
git merge feature-x
  • This preserves the full history: you can see that work happened on a branch and when it was merged. The merge commit has two parents.

  • Rebase replays your branch's commits on top of the target branch:

git checkout feature-x
git rebase main
  • This rewrites history: your branch's commits get new hashes, as if you had started your work from the current tip of main. The result is a linear history (no merge commits), which is cleaner to read.

  • When to use which:

    • Rebase for updating your feature branch with the latest main changes (keeps your branch clean and up-to-date).
    • Merge for integrating your feature branch into main (preserves the branch history).
    • Never rebase commits that have been pushed and shared with others. Rebasing rewrites history; if someone else has based work on the original commits, rebasing causes chaos.
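
  • The difference between the two operations shows up in the commit graph. The sketch below is a toy model under stated assumptions: the Commit class is hypothetical, and its hash depends only on the message and the parents, whereas real git also hashes the file tree, author, and timestamps. It is enough to show why a merge commit has two parents and why rebased commits get new hashes.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    message: str
    parents: tuple  # tuple of parent Commit objects (empty for the first commit)

    @property
    def hash(self):
        """Toy hash: depends on the message and on the parents' hashes."""
        parent_hashes = ",".join(p.hash for p in self.parents)
        return hashlib.sha1(f"{parent_hashes}|{self.message}".encode()).hexdigest()[:7]


def merge(ours, theirs, message):
    """A merge commit records both branch tips as parents."""
    return Commit(message, (ours, theirs))


def rebase(branch_commits, new_base):
    """Replay commits (oldest first) on top of new_base; each copy gets a new hash."""
    tip = new_base
    for old in branch_commits:
        tip = Commit(old.message, (tip,))
    return tip


root = Commit("Initial commit", ())
main_tip = Commit("Update README", (root,))
feature_tip = Commit("Add data loader", (root,))

merged = merge(main_tip, feature_tip, "Merge feature-x")
rebased_tip = rebase([feature_tip], main_tip)

print(len(merged.parents))                          # 2
print(rebased_tip.message == feature_tip.message)   # True
print(rebased_tip.hash == feature_tip.hash)         # False
```

  • The rebased commit carries the same message but a different hash because its parent changed. Anyone who built on the original commit now holds a commit that is no longer part of the branch, which is the reason for the rule about shared commits.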

Resolving Conflicts

  • A conflict occurs when two branches change the same or adjacent lines of the same file in different ways (or when one branch deletes a file the other modified). Git cannot automatically decide which change to keep and asks you to resolve it manually.
<<<<<<< HEAD
learning_rate = 0.001
=======
learning_rate = 0.0005
>>>>>>> feature-x
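  • Between <<<<<<< HEAD and ======= is the current branch's version. Between ======= and >>>>>>> feature-x is the incoming branch's version. (During a rebase the roles are swapped: HEAD is the branch you are rebasing onto, and the incoming side is your own commit.) You decide which to keep (or combine them), remove the markers, save, git add the resolved file, and finish with git commit (or git rebase --continue during a rebase).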

  • Pitfall: do not leave conflict markers in committed files. They are literal text that will break your code. Always search for <<<<<<< after resolving.

  • Reducing conflicts: keep branches short-lived, sync your branch with main frequently (by rebasing or merging), and avoid multiple people editing the same file simultaneously.
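
  • The "search for <<<<<<< after resolving" step is easy to automate, for example in a pre-commit check. A minimal sketch follows, assuming the default marker style shown above; the function name find_conflict_markers is made up, and the extra ||||||| marker that appears with the diff3 conflict style is not covered.

```python
def find_conflict_markers(text):
    """Return (line_number, line) pairs for lines that look like conflict markers."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Opening and closing markers are followed by a label; the separator stands alone.
        if line.startswith(("<<<<<<< ", ">>>>>>> ")) or line == "=======":
            hits.append((number, line))
    return hits


unresolved = """batch_size = 32
<<<<<<< HEAD
learning_rate = 0.001
=======
learning_rate = 0.0005
>>>>>>> feature-x
"""
resolved = "batch_size = 32\nlearning_rate = 0.0005\n"

print(find_conflict_markers(unresolved))
# [(2, '<<<<<<< HEAD'), (4, '======='), (6, '>>>>>>> feature-x')]
print(find_conflict_markers(resolved))
# []
```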

Writing Good Commit Messages

  • A commit message is for your future self and your teammates. "fix bug" tells you nothing. "Fix off-by-one in batch size calculation that caused OOM on 8-GPU training" tells you everything.

  • Format:

Short summary (50 chars or less, imperative mood)

Longer description if needed. Explain WHY, not WHAT
(the diff shows what changed). Wrap at 72 characters.

Fixes #123
  • Imperative mood: "Add feature" not "Added feature" or "Adds feature." Read it as completing the sentence: "If applied, this commit will add feature."

  • Atomic commits: each commit should do one thing. "Add data loader" is one commit. "Add data loader and fix unrelated bug and update README" should be three commits. This keeps git bisect (finding which commit introduced a bug) effective: when it lands on a commit, that commit contains exactly one change.
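
  • The format rules above (50-character summary, blank second line, body wrapped at 72) are mechanical enough to check in code. The sketch below is one possible linter, not an established tool; the function name lint_commit_message and the wording of its messages are assumptions. It does not try to check imperative mood.

```python
def lint_commit_message(message):
    """Return a list of problems with a commit message (empty list means OK)."""
    problems = []
    lines = message.rstrip("\n").split("\n")
    summary = lines[0]
    if not summary.strip():
        problems.append("summary line is empty")
    if len(summary) > 50:
        problems.append(f"summary is {len(summary)} chars (limit 50)")
    if len(lines) > 1 and lines[1].strip():
        problems.append("second line should be blank")
    # Body starts on line 3; report every line that exceeds the wrap width.
    for number, line in enumerate(lines[2:], start=3):
        if len(line) > 72:
            problems.append(f"line {number} is {len(line)} chars (wrap at 72)")
    return problems


good = (
    "Fix off-by-one in batch size calculation\n"
    "\n"
    "The batch size was off by one, which caused OOM on 8-GPU\n"
    "training.\n"
    "\n"
    "Fixes #123\n"
)
too_long = "fix bug in the data loader and also update the README file"

print(lint_commit_message(good))       # []
print(lint_commit_message(too_long))
```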

Pull Requests and Code Review

  • A pull request (PR) proposes merging a branch into main. It is the gateway for code review: teammates read your changes, suggest improvements, and approve before merging.

  • Good PR practices:

    • Keep PRs small (under 400 lines of changes). Large PRs get rubber-stamped because nobody wants to review 2000 lines.
    • Write a clear description: what changed, why, and how to test it.
    • Link to the issue or ticket that motivated the change.
    • Respond to review comments promptly.
    • Squash trivial commits before merging (so main has a clean history).
  • Code review is not primarily about finding bugs (tests are the first line of defense for that). It is about: knowledge sharing (the reviewer learns the codebase), design feedback (is this the right approach?), and maintaining standards (naming, style, architecture).
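
  • The "under 400 lines" guideline can be checked before opening a PR by summing the output of git diff --numstat (for example git diff --numstat main...HEAD), which prints added lines, deleted lines, and the path separated by tabs. A minimal sketch follows; the function name changed_lines and the sample paths are hypothetical.

```python
def changed_lines(numstat_output):
    """Sum added + deleted lines from `git diff --numstat` output."""
    total = 0
    for line in numstat_output.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        added, deleted = parts[0], parts[1]
        if added == "-" or deleted == "-":
            continue  # binary files report "-" instead of counts
        total += int(added) + int(deleted)
    return total


# Hypothetical numstat output for a small PR.
sample = "120\t30\tsrc/loader.py\n15\t0\ttests/test_loader.py\n-\t-\tdocs/diagram.png\n"
size = changed_lines(sample)

print(size)           # 165
print(size <= 400)    # True
```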

.gitignore

  • The .gitignore file tells git which files to exclude from tracking. For ML projects:
# Python
__pycache__/
*.pyc
*.egg-info/
.venv/
env/

# Data and models (too large for git)
data/
*.csv
*.parquet
models/
*.pt
*.onnx
*.bin
checkpoints/

# Secrets
.env
*.pem
credentials.json

# IDE
.vscode/
.idea/
*.swp

# OS
.DS_Store
Thumbs.db

# Jupyter
.ipynb_checkpoints/

# Experiment outputs
wandb/
mlruns/
outputs/
logs/
  • Pitfall: adding a file to .gitignore after it has been committed does not remove it from the repository. You must also run git rm --cached file and commit the result to untrack it. The file stays in the history forever unless you rewrite history (which is messy), so treat any committed secret as leaked and rotate it.
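
  • To build intuition for how the simple patterns above match, here is a deliberately simplified matcher. It is an approximation, not git's algorithm: it handles only bare names, globs such as *.pt, and directory patterns with a trailing slash, and it ignores negation (!), patterns containing other slashes, **, and rule ordering. The function name is_ignored is made up; for real answers, git check-ignore -v path reports which rule matched.

```python
from fnmatch import fnmatchcase

# A subset of the patterns from the .gitignore above.
PATTERNS = ["__pycache__/", "*.pyc", "data/", "*.pt", ".env", "env/", ".ipynb_checkpoints/"]


def is_ignored(path, patterns=PATTERNS):
    """Simplified check: would one of these patterns ignore the file at `path`?"""
    components = path.split("/")
    directories = components[:-1]
    for pattern in patterns:
        if pattern.endswith("/"):
            # Trailing slash: matches directories only, at any depth.
            if any(fnmatchcase(d, pattern[:-1]) for d in directories):
                return True
        elif any(fnmatchcase(c, pattern) for c in components):
            # No slash: matches a file or directory name at any depth.
            return True
    return False


print(is_ignored("models/best.pt"))                         # True
print(is_ignored("notebooks/.ipynb_checkpoints/a.ipynb"))   # True
print(is_ignored("env/bin/python"))                         # True
print(is_ignored("src/train.py"))                           # False
print(is_ignored("environment.yml"))                        # False
```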

Git for ML

  • ML introduces challenges that traditional software does not face:

  • Large files: datasets and model weights are gigabytes or more. Git is designed for text files (source code), not binary blobs. Solutions:

    • Git LFS (Large File Storage): tracks pointers in git, stores actual files on a separate server. Simple but has storage/bandwidth limits on GitHub.
    • DVC (Data Version Control): manages data and model files separately from git, using remote storage (S3, GCS). Works like git for data: dvc add data.csv, dvc push, dvc pull.
  • Experiment tracking: which commit + which hyperparameters + which data produced which metrics? Git tracks code, but not the full experiment context.

    • Weights & Biases (W&B): logs metrics, hyperparameters, system info, and links to the git commit. Provides dashboards for comparing runs.
    • MLflow: open-source experiment tracking with model registry. Logs parameters, metrics, and artifacts.
    • Simple approach: log the git hash in your training script: git_hash = subprocess.check_output(['git', 'rev-parse', 'HEAD'], text=True).strip(). Without text=True the call returns bytes rather than a string. Store the hash alongside your results.
  • Reproducibility checklist (what to track for each experiment):

    • Git commit hash (exact code version)
    • Config file / hyperparameters
    • Random seeds
    • Python and library versions (pip freeze)
    • Data version (DVC hash or dataset version tag)
    • Hardware (GPU type, number of GPUs)
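# Quick reproducibility snapshot
echo "Commit: $(git rev-parse HEAD)" > experiment_info.txt
echo "Branch: $(git branch --show-current)" >> experiment_info.txt
echo "Dirty: $(git status --porcelain | wc -l) files" >> experiment_info.txt
pip freeze >> experiment_info.txt
# Record GPU info if available; note its absence instead of failing silently
nvidia-smi >> experiment_info.txt 2>/dev/null || echo "GPU: nvidia-smi not available" >> experiment_info.txt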
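
  • The same snapshot can be taken from inside the training script, so it is stored with every run automatically. The sketch below covers only the checklist items readable from the standard library; the function names git_output and experiment_snapshot, the dictionary keys, and the sample config are assumptions. Library versions, data version, and GPU details would need to be added separately.

```python
import json
import platform
import subprocess
import sys


def git_output(*args):
    """Run a git command and return its stdout, or None if git is unavailable."""
    try:
        result = subprocess.run(
            ["git", *args], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None  # git not installed, or not inside a repository
    return result.stdout.strip()


def experiment_snapshot(config, seed):
    """Collect the checklist items that can be read without extra dependencies."""
    status = git_output("status", "--porcelain")
    return {
        "commit": git_output("rev-parse", "HEAD"),
        "dirty": None if status is None else bool(status),  # True if uncommitted changes
        "config": config,
        "seed": seed,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }


# Hypothetical hyperparameters for illustration.
snapshot = experiment_snapshot({"lr": 0.001, "batch_size": 32}, seed=42)
print(json.dumps(snapshot, indent=2))
```

  • Recording the dirty flag matters: a commit hash only identifies the code if there were no uncommitted changes when the run started.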