Linux and the Command Line

The command line is the primary interface for ML engineering: training jobs, server management, data pipelines, and cluster administration all happen through the terminal. This chapter covers the shell, file system, permissions, process management, package managers, environment variables, SSH, and the essential commands every ML engineer uses daily.

  • GUIs are convenient for browsing the web. They are terrible for running a training job on a remote GPU cluster at 2 AM. The command line (or terminal, or shell) is the tool that scales: it works on any machine, can be scripted, is composable, and is the same on your laptop, a cloud VM, and an HPC cluster.

  • If you are an ML engineer who only uses Jupyter notebooks and VS Code buttons, you are leaving enormous productivity on the table. Every production ML system is deployed, monitored, and debugged through the command line.

The Shell

  • A shell is a program that reads commands from you and executes them. It is the intermediary between you and the operating system (chapter 13). The most common shells are bash (the default on most Linux systems) and zsh (the default on macOS).

  • A command has the form: command [options] [arguments]

ls -la /home/user    # command=ls, options=-la, argument=/home/user
  • Options modify behaviour (usually prefixed with - for short or -- for long form). ls -l lists in long format, ls --all shows hidden files. Many options can be combined: ls -la means -l and -a together.

Essential Navigation

pwd                 # print working directory (where am I?)
ls                  # list files in current directory
ls -la              # list all files (including hidden) with details
cd /path/to/dir     # change directory
cd ..               # go up one level
cd ~                # go to home directory
cd -                # go back to previous directory

File Operations

cp source dest      # copy file
cp -r dir1 dir2     # copy directory recursively
mv old new          # move/rename file
rm file             # delete file (no recycle bin — gone forever)
rm -rf dir          # delete directory recursively (DANGEROUS — no confirmation)
mkdir -p a/b/c      # create nested directories
touch file.txt      # create empty file (or update timestamp)
cat file.txt        # print file contents
head -n 20 file     # first 20 lines
tail -f logfile     # follow a log file in real-time (invaluable for monitoring training)
  • Pitfall: rm -rf is the most dangerous command in computing. There is no undo. Triple-check the path before pressing enter. Never run rm -rf / or rm -rf ~.
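A safer habit, sketched below with throwaway files in a scratch directory (all names are illustrative): expand the glob with echo or ls first, and delete only once the preview matches what you expect. GNU rm also offers -i (prompt per file) and -I (prompt once when removing many files).

```shell
# Scratch directory with throwaway files (names are illustrative)
mkdir -p /tmp/rm-demo && cd /tmp/rm-demo
touch run_1.log run_2.log keep.txt

# Preview exactly what the glob matches BEFORE deleting anything
echo run_*.log          # prints: run_1.log run_2.log; nothing is deleted yet

# Only then delete; keep.txt is untouched
rm run_*.log
ls                      # keep.txt
# rm -i file            # prompt before each removal
# rm -I *.log           # GNU rm: prompt once when removing many files
```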

Pipes and Redirection

  • The shell's killer feature is composability: small commands connected together to do complex things.

  • Pipe (|): sends the output of one command as input to the next.

cat training.log | grep "loss" | tail -5    # last 5 lines containing "loss"
ps aux | grep python                        # find running Python processes
history | grep "docker"                     # find previous docker commands
  • Redirection: send output to a file instead of the screen.
python train.py > output.log 2>&1    # stdout AND stderr to file
python train.py >> output.log        # append (don't overwrite)
echo "data" > file.txt               # overwrite file
echo "more" >> file.txt              # append to file
  • 2>&1 redirects stderr (file descriptor 2) to stdout (file descriptor 1). Without it, error messages still appear on screen while only normal output goes to the file.
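To see the two streams separately, here is a tiny self-contained demo (the emit function is made up for illustration):

```shell
# A toy command that writes one line to stdout and one to stderr
emit() { echo "normal output"; echo "error output" >&2; }

emit > only_stdout.txt            # stderr still goes to the terminal
emit > both.txt 2>&1              # both streams end up in the file

wc -l < only_stdout.txt           # 1
wc -l < both.txt                  # 2
rm only_stdout.txt both.txt
```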

Text Processing

grep "error" logfile.txt             # find lines containing "error"
grep -r "import torch" src/          # search recursively in directory
grep -i "warning" log.txt            # case-insensitive search
grep -c "epoch" train.log            # count matching lines

wc -l file.txt                       # count lines
wc -w file.txt                       # count words

sort data.txt                        # sort lines alphabetically
sort -n numbers.txt                  # sort numerically
sort -u data.txt                     # sort and remove duplicates
uniq -c sorted.txt                   # count consecutive duplicates

cut -d',' -f2,3 data.csv            # extract columns 2 and 3 from CSV
awk '{print $1, $3}' data.txt       # print 1st and 3rd whitespace-separated fields
sed 's/old/new/g' file.txt          # replace all occurrences of "old" with "new"
  • These compose beautifully:
# Find the 10 most common error types in a log file
grep "ERROR" app.log | awk -F': ' '{print $2}' | sort | uniq -c | sort -rn | head -10

Finding Files

find . -name "*.py"                  # find all Python files
find . -name "*.pyc" -delete         # find and delete compiled Python files
find /data -size +100M               # files larger than 100 MB
find . -mtime -1                     # files modified in the last 24 hours

which python                        # where is the python executable?
locate filename                      # fast file search (uses pre-built index)

File System Hierarchy

  • Linux organises everything in a single tree rooted at /:
Directory      Purpose
/              Root of the entire file system
/home/user     Your personal files, configs, projects
/etc           System-wide configuration files
/usr           User programs, libraries, documentation
/usr/local     Locally installed software (not from a package manager)
/var           Variable data: logs (/var/log), databases, caches
/tmp           Temporary files (cleared on reboot)
/opt           Optional third-party software
/proc          Virtual file system exposing kernel and process info
/dev           Device files (disks and GPUs show up here)
  • For ML: your training data is typically in /data or /home/user/data, models in /home/user/models, and CUDA lives in /usr/local/cuda. GPU devices appear as /dev/nvidia0, /dev/nvidia1, etc.

File Permissions

  • Every file and directory has three permission types for three user classes:
Permission    File              Directory
r (read)      View contents     List contents
w (write)     Modify contents   Create/delete files inside
x (execute)   Run as a program  Enter (cd into) the directory
  • Three user classes: owner (u), group (g), others (o).
ls -l script.py
# -rwxr-xr-- 1 henry ml_team 2048 Mar 28 10:30 script.py
#  ^^^         owner permissions: rwx (read, write, execute)
#     ^^^      group permissions: r-x (read, execute, no write)
#        ^^^   others permissions: r-- (read only)
chmod 755 script.py       # owner=rwx, group=rx, others=rx
chmod +x script.py        # add execute permission for everyone
chmod u+w,g-w file.txt    # add write for owner, remove write for group
chown henry:ml_team file  # change owner and group
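The numeric modes are per-class sums: r=4, w=2, x=1, one digit each for owner, group, and others. A quick check (demo.sh is a throwaway file; stat -c is the GNU form, macOS uses stat -f '%Lp'):

```shell
# Each octal digit is r=4 + w=2 + x=1:
#   7 = 4+2+1 = rwx,  5 = 4+1 = r-x,  4 = r--
touch demo.sh
chmod 754 demo.sh
ls -l demo.sh             # -rwxr-xr-- ...
stat -c '%a' demo.sh      # 754 (GNU coreutils; macOS: stat -f '%Lp' demo.sh)
rm demo.sh
```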
  • Pitfall: a Python script with #!/usr/bin/env python3 at the top needs execute permission (chmod +x) to be run as ./script.py. Without it, you must use python3 script.py.
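A minimal demonstration, assuming python3 is on PATH (hello.py is a made-up name):

```shell
# A two-line script with a shebang
printf '#!/usr/bin/env python3\nprint("hi")\n' > hello.py

python3 hello.py          # works regardless of permissions; prints hi
# ./hello.py              # at this point: Permission denied (no execute bit)
chmod +x hello.py
./hello.py                # now runs directly; prints hi
rm hello.py
```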

Process Management

  • A process is a running program (chapter 13). The shell gives you tools to manage them:
ps aux                    # list all running processes
ps aux | grep python      # find Python processes
top                       # real-time process monitor (CPU, memory)
htop                      # better version of top (install separately)
nvidia-smi                # GPU usage (essential for ML)
watch -n 1 nvidia-smi     # refresh nvidia-smi every second

kill PID                  # gracefully terminate process
kill -9 PID               # force kill (use when graceful fails)
killall python            # kill all Python processes

# Run in background
python train.py &                    # run in background
nohup python train.py > log.txt &    # run in background, survive logout
  • nohup is critical for ML training: without it, closing your SSH connection kills the training job. nohup detaches the process from the terminal.
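Once a job is backgrounded, you need its PID to check on it or stop it. A sketch using a short sleep as a stand-in for python train.py:

```shell
# A short sleep stands in for a long training run
nohup sh -c 'echo started; sleep 2; echo done' </dev/null > job.log 2>&1 &
pid=$!                       # PID of the most recent background job
ps -p "$pid" > /dev/null && echo "still running"
wait "$pid"                  # (demo only) block until it finishes
cat job.log                  # started, then done
# In real use: pgrep -f train.py finds the PID from any shell,
# and tail -f job.log streams the output live.
```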

  • screen and tmux are terminal multiplexers that create persistent sessions. You can start a training job in a tmux session, disconnect from SSH, reconnect later, and the session (and training) is still running.

tmux new -s training          # create named session
# ... start training ...
# press Ctrl+B, then D         # detach from session
tmux attach -t training       # reattach later (even after SSH reconnect)
tmux ls                       # list sessions

Package Managers

  • System packages (OS-level software):
# Debian/Ubuntu
sudo apt update               # refresh package list
sudo apt install htop         # install a package
sudo apt upgrade              # upgrade all packages

# macOS
brew install wget             # install via Homebrew
  • Python packages:
pip install torch             # install from PyPI
pip install -e .              # install current project in editable mode
pip install -r requirements.txt  # install from requirements file
pip freeze > requirements.txt    # export installed packages

# Conda (for complex dependencies like CUDA)
conda create -n myenv python=3.11
conda activate myenv
conda install pytorch torchvision pytorch-cuda=12.1 -c pytorch -c nvidia
  • Pitfall: never pip install into the system Python. Always use a virtual environment (python -m venv env, conda create, or uv venv). System Python is shared by OS tools; breaking it can break your system.
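A minimal sketch (the name .venv is a convention, not a requirement; assumes python3 with the venv module is installed):

```shell
python3 -m venv .venv             # create an isolated environment in ./.venv
source .venv/bin/activate         # activate: python and pip now resolve here
which python                      # prints .../.venv/bin/python
# pip install torch               # would install into .venv, not system Python
deactivate                        # leave the environment
rm -rf .venv                      # cleanup for this demo
```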

Environment Variables

  • Environment variables are key-value pairs accessible to all programs. They configure behaviour without changing code.
export CUDA_VISIBLE_DEVICES=0,1    # use only GPUs 0 and 1
export PYTHONPATH=/home/user/src   # add to Python's import path
export WANDB_API_KEY=abc123        # API key for Weights & Biases

echo $PATH                         # see current PATH
export PATH=$PATH:/usr/local/cuda/bin  # add CUDA to PATH
  • .bashrc (or .zshrc): commands run every time you open a shell. Put your export statements here so they persist.
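A typical ML-flavoured block in ~/.bashrc might look like this (paths, device indices, and alias names are examples, adjust to your machine):

```shell
# --- ML setup (illustrative; adjust paths to your system) ---
export PATH=$PATH:/usr/local/cuda/bin
export CUDA_VISIBLE_DEVICES=0,1
export PYTHONPATH=$HOME/src

# small conveniences
alias gpu='watch -n 1 nvidia-smi'
alias act='source .venv/bin/activate'
```

After editing, run source ~/.bashrc to apply the changes in the current shell; new shells pick them up automatically.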

  • .env files: project-specific variables loaded by tools like python-dotenv. Keep secrets (API keys, database passwords) in .env and add .env to .gitignore. Never commit secrets to git.
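In Python, python-dotenv reads the file for you; from the shell you can load it with set -a, sketched here with placeholder values:

```shell
# Create a throwaway .env with placeholder values (never commit real keys)
printf 'WANDB_API_KEY=replace-me\nDB_PASSWORD=replace-me\n' > .env

set -a                  # auto-export every assignment that follows
. ./.env                # source the file: each line becomes an env var
set +a

echo "$WANDB_API_KEY"   # replace-me
rm .env                 # cleanup for this demo
```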

SSH (Secure Shell)

  • SSH connects you to remote machines over an encrypted channel. This is how you access cloud VMs, GPU servers, and HPC clusters.
ssh user@hostname              # connect to remote machine
ssh -i ~/.ssh/key.pem user@ip  # connect with specific key
ssh -L 8888:localhost:8888 user@server  # port forwarding (Jupyter on remote)
  • SSH keys (public/private key pair) replace passwords:
ssh-keygen -t ed25519          # generate key pair
ssh-copy-id user@server        # copy public key to server
# now you can SSH without typing a password
  • SSH config (~/.ssh/config) saves connection details:
Host gpu-server
    HostName 10.0.1.42
    User henry
    IdentityFile ~/.ssh/gpu_key
    LocalForward 8888 localhost:8888
  • Now ssh gpu-server connects with all those settings automatically.

  • scp and rsync transfer files between machines:

scp model.pt user@server:/data/models/     # copy file to remote
scp -r user@server:/data/results/ ./       # copy directory from remote
rsync -avz --progress data/ user@server:/data/  # sync with progress (smarter than scp)

Essential ML Commands Cheat Sheet

# GPU monitoring
nvidia-smi                                   # GPU usage snapshot
watch -n 1 nvidia-smi                        # live monitoring
gpustat                                      # cleaner GPU overview (pip install gpustat)

# Training management
nohup python train.py > train.log 2>&1 &     # background training that survives logout
tail -f train.log                            # monitor training output
kill %1                                      # kill background job number 1 (list jobs with: jobs)

# Disk usage (datasets are huge)
df -h                                        # disk space on all mounts
du -sh /data/*                               # size of each item in /data
du -sh --max-depth=1 .                       # size of subdirectories

# Memory
free -h                                      # RAM usage
cat /proc/meminfo                            # detailed memory info

# Network
curl -O https://example.com/dataset.tar.gz   # download file
wget https://example.com/model.bin           # alternative downloader
curl -X POST http://localhost:8080/predict \
    -H "Content-Type: application/json" \
    -d '{"text": "hello"}'                   # test a model serving endpoint

# Archives
tar -czf archive.tar.gz directory/           # compress
tar -xzf archive.tar.gz                      # extract
zip -r archive.zip directory/                # zip
unzip archive.zip                            # unzip

# Quick data inspection
head -5 data.csv                             # first 5 lines of CSV
wc -l data.csv                               # count rows
cut -d',' -f1 data.csv | sort -u | wc -l    # count unique values in column 1