Object Detection and Segmentation

Object detection localises and classifies every object in an image; segmentation assigns a label to every pixel. This file covers IoU, mAP, anchor boxes, the R-CNN family, YOLO, SSD, Feature Pyramid Networks, semantic/instance/panoptic segmentation (U-Net, Mask R-CNN, SAM), and the metrics that benchmark them.

Image classification (file 02) answers "what is in this image?" Object detection asks a harder question: "what objects are in this image, and where are they?"
Segmentation goes further still: "which pixels belong to which object or category?" These tasks form a hierarchy of increasingly precise spatial understanding.
An object detection model outputs a set of bounding boxes, each defined by four coordinates (top-left corner \(x, y\), width, height) and a class label with a confidence score. A single image may contain zero, one, or hundreds of objects from multiple classes.

Input image with multiple objects, each enclosed by a coloured bounding box with a class label and confidence score

Intersection over Union (IoU) measures how well a predicted bounding box matches the ground truth. It is the area of overlap divided by the area of union:

\[\text{IoU} = \frac{\text{Area of Intersection}}{\text{Area of Union}}\]

An IoU of 1 means perfect overlap; an IoU of 0 means no overlap at all. The standard threshold for a "correct" detection is IoU \(\geq 0.5\), though stricter thresholds (0.75, 0.9) are also used.
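A detection is a true positive (TP) if its IoU with a ground truth box meets the threshold and the class is correct. Each ground truth box can be matched by at most one detection.
A false positive (FP) is a predicted box that does not match any ground truth, including a duplicate detection of an object that a higher-scoring box has already matched.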
A false negative (FN) is a ground truth object that no prediction matched. These are the same precision/recall concepts from chapter 06.
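
The TP/FP/FN bookkeeping can be sketched as a greedy matching for a single class. This is a minimal illustration, not a benchmark implementation: the helper names `box_iou` and `match_detections` are hypothetical, boxes are assumed to be `[x1, y1, x2, y2]`, and each detection is matched to the best still-unmatched ground truth (one common convention; evaluation toolkits differ in the details).

```python
import jax.numpy as jnp

def box_iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    iw = jnp.maximum(0.0, jnp.minimum(a[2], b[2]) - jnp.maximum(a[0], b[0]))
    ih = jnp.maximum(0.0, jnp.minimum(a[3], b[3]) - jnp.maximum(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def match_detections(pred_boxes, pred_scores, gt_boxes, iou_thr=0.5):
    """Greedy matching for one class: returns (TP flag per ranked detection, FN count)."""
    order = jnp.argsort(-pred_scores).tolist()  # highest confidence first
    matched = set()                             # ground truths already claimed
    is_tp = []
    for i in order:
        best_iou, best_j = 0.0, -1
        for j in range(len(gt_boxes)):
            if j in matched:
                continue
            iou = float(box_iou(pred_boxes[i], gt_boxes[j]))
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j >= 0 and best_iou >= iou_thr:
            matched.add(best_j)
            is_tp.append(True)
        else:
            is_tp.append(False)                 # unmatched or duplicate: false positive
    n_fn = len(gt_boxes) - len(matched)         # ground truths nobody matched
    return is_tp, n_fn

gt = jnp.array([[0., 0., 10., 10.], [20., 20., 30., 30.]])
preds = jnp.array([[0., 0., 10., 10.],     # exact hit on the first object
                   [1., 1., 11., 11.],     # duplicate of the same object
                   [50., 50., 60., 60.]])  # background
scores = jnp.array([0.9, 0.8, 0.7])
is_tp, n_fn = match_detections(preds, scores, gt)
print(is_tp, n_fn)
```

In this toy case the duplicate and the background box both count as false positives, and the second ground truth box is a false negative.
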
Average Precision (AP) summarises detection quality for one class. For each class, rank all detections by confidence score, compute precision and recall at each rank, and calculate the area under the precision-recall curve:

\[\text{AP} = \int_0^1 p(r) \, dr\]

In practice, the curve is interpolated: at each recall level \(r\), precision is set to the maximum precision at any recall \(\geq r\). This removes the zigzags in the curve and makes it monotonically non-increasing.
Mean Average Precision (mAP) averages AP across all classes. "mAP@0.5" uses IoU threshold 0.5. "mAP@[.5:.95]" (the COCO standard) averages mAP over ten IoU thresholds from 0.5 to 0.95 in steps of 0.05, rewarding both detection and precise localisation.
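
As a rough sketch of the AP computation described above, the function below takes TP flags for one class (already sorted by descending confidence) and the number of ground truth objects. The name `average_precision` and the all-points area computation are illustrative assumptions; benchmark toolkits such as the COCO evaluator sample the curve at fixed recall points instead.

```python
import jax
import jax.numpy as jnp

def average_precision(is_tp, n_gt):
    """Area under the interpolated precision-recall curve for one class."""
    is_tp = jnp.asarray(is_tp, dtype=jnp.float32)
    tp_cum = jnp.cumsum(is_tp)
    fp_cum = jnp.cumsum(1.0 - is_tp)
    recall = tp_cum / n_gt
    precision = tp_cum / (tp_cum + fp_cum)

    # Interpolation: precision at recall r becomes the max precision at any recall >= r
    precision_interp = jax.lax.cummax(precision, reverse=True)

    # Sum rectangle areas over the recall increments
    recall_prev = jnp.concatenate([jnp.zeros(1), recall[:-1]])
    return jnp.sum((recall - recall_prev) * precision_interp)

# Three ranked detections (TP, FP, TP) against two ground truth objects
ap = average_precision([1, 0, 1], n_gt=2)
print(f"AP = {float(ap):.4f}")
```

mAP would then be the mean of this value over classes, and mAP@[.5:.95] the mean again over the ten IoU thresholds used to decide the TP flags.
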
Non-Maximum Suppression (NMS) removes duplicate detections. When a model predicts multiple overlapping boxes for the same object, NMS keeps the highest-confidence box and removes all others that overlap with it above an IoU threshold. This is applied per class after the model produces its raw predictions.
Two-stage detectors first propose candidate regions, then classify and refine each proposal.
R-CNN (Girshick et al., 2014) was the first successful deep learning detector. It uses selective search (a classical algorithm) to propose ~2,000 candidate regions, warps each region to a fixed size, runs each through a CNN independently, and classifies with an SVM (chapter 06). R-CNN was accurate but extremely slow: it ran the CNN 2,000 times per image.
Fast R-CNN (Girshick, 2015) solved the redundancy by running the CNN once on the entire image to produce a shared feature map, then extracting features for each proposal from that shared map using RoI pooling (Region of Interest pooling).
RoI pooling takes a variable-sized region of the feature map and produces a fixed-size output by dividing the region into a grid and max-pooling within each cell. This is much faster because the expensive CNN computation happens only once.
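
A minimal sketch of RoI pooling on a single-channel feature map, assuming the RoI is given in integer feature-map coordinates with exclusive upper bounds and is at least as large as the output grid. The function name `roi_max_pool` and the floor/ceil bin edges are illustrative choices.

```python
import math
import jax.numpy as jnp

def roi_max_pool(feature_map, roi, out_size=2):
    """Max-pool one RoI of an (H, W) feature map to (out_size, out_size)."""
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2, x1:x2]
    h, w = region.shape
    rows = []
    for i in range(out_size):
        # Quantised bin edges along the height
        ys, ye = math.floor(i * h / out_size), math.ceil((i + 1) * h / out_size)
        row = []
        for j in range(out_size):
            # Quantised bin edges along the width
            xs, xe = math.floor(j * w / out_size), math.ceil((j + 1) * w / out_size)
            row.append(jnp.max(region[ys:ye, xs:xe]))
        rows.append(jnp.stack(row))
    return jnp.stack(rows)

feature_map = jnp.arange(36.0).reshape(6, 6)
pooled = roi_max_pool(feature_map, roi=(1, 1, 5, 4))  # a 3x4 region -> 2x2 output
print(pooled)
```

Whatever the size of the region, the output shape is fixed, which is what lets the downstream fully connected heads accept arbitrary proposals.
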
Faster R-CNN (Ren et al., 2015) eliminated the external region proposal algorithm by introducing the Region Proposal Network (RPN), a small CNN that runs on top of the shared feature map and predicts proposals directly. The RPN slides a small window over the feature map and, at each position, predicts \(k\) proposals (one for each anchor box).

Faster R-CNN pipeline: input image → backbone CNN → shared feature map → RPN generates proposals → RoI pooling → classification and box regression heads

Anchor boxes are predefined bounding boxes at each spatial position of the feature map, covering different scales and aspect ratios (e.g., three scales \(\times\) three ratios = 9 anchors per position). The RPN predicts two things for each anchor: an objectness score (object vs background) and coordinate offsets that refine the anchor into a tighter proposal. This parametrisation makes the regression problem easier: instead of predicting absolute coordinates, the network predicts small adjustments to a reasonable starting box.
The anchor offsets are parametrised as:

\[t_x = \frac{x - x_a}{w_a}, \quad t_y = \frac{y - y_a}{h_a}, \quad t_w = \log\frac{w}{w_a}, \quad t_h = \log\frac{h}{h_a}\]

where \((x, y, w, h)\) are the predicted box centre and size, and \((x_a, y_a, w_a, h_a)\) are the anchor centre and size. Predicting the log of the width and height ratios guarantees that the decoded width and height are always positive, and normalising by the anchor size makes the regression scale-invariant.
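
The parametrisation above can be checked with a small encode/decode round trip. The function names `encode` and `decode` and the example numbers are illustrative assumptions; boxes and anchors are in centre format \((x, y, w, h)\).

```python
import jax.numpy as jnp

def encode(box, anchor):
    """Box (x, y, w, h) -> offsets (t_x, t_y, t_w, t_h) relative to the anchor."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return jnp.array([(x - xa) / wa, (y - ya) / ha,
                      jnp.log(w / wa), jnp.log(h / ha)])

def decode(t, anchor):
    """Offsets -> box; exp() keeps the decoded width and height positive."""
    xa, ya, wa, ha = anchor
    return jnp.array([xa + t[0] * wa, ya + t[1] * ha,
                      wa * jnp.exp(t[2]), ha * jnp.exp(t[3])])

anchor = jnp.array([50.0, 50.0, 20.0, 40.0])
box = jnp.array([55.0, 46.0, 30.0, 20.0])
t = encode(box, anchor)
print("offsets:", t)                  # (0.25, -0.1, log 1.5, log 0.5)
print("decoded:", decode(t, anchor))  # recovers the original box
```
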
Faster R-CNN trains with a multi-task loss: classification loss (cross-entropy from chapter 05) for the class label, plus a smooth L1 loss for box regression. Smooth L1 is less sensitive to outliers than L2:

\[ \text{smooth}_{L1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases} \]
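
A direct transcription of the formula, with a comparison against the \(0.5x^2\) (L2-style) penalty to show how large residuals are penalised only linearly. The function name `smooth_l1` is an illustrative choice.

```python
import jax.numpy as jnp

def smooth_l1(x):
    """Quadratic near zero, linear for |x| >= 1."""
    abs_x = jnp.abs(x)
    return jnp.where(abs_x < 1.0, 0.5 * x ** 2, abs_x - 0.5)

x = jnp.array([0.5, 2.0, -3.0])
print("smooth L1:", smooth_l1(x))  # 0.125, 1.5, 2.5
print("0.5 * x^2:", 0.5 * x ** 2)  # 0.125, 2.0, 4.5
```
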

Feature Pyramid Networks (FPN) (Lin et al., 2017) address the multi-scale problem by building a top-down pathway with lateral connections that merges high-level semantics with low-level spatial detail. The backbone produces feature maps at multiple scales (each pooling layer halves the resolution). FPN adds a top-down path where each level receives upsampled features from the level above and merges them with the corresponding bottom-up level via lateral 1x1 convolutions. The result is a pyramid of feature maps, each with both strong semantics and good spatial resolution.
Small objects are detected from the higher-resolution levels of the pyramid; large objects from the lower-resolution levels. FPN is now a standard component in most modern detection architectures.
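
The top-down pathway can be sketched in a few lines, assuming three bottom-up maps whose resolutions differ by a factor of two. This is a simplification: the lateral 1x1 convolutions are written as per-pixel matrix multiplies, upsampling is nearest-neighbour, and the 3x3 smoothing convolution that the FPN paper applies after each merge is omitted. All names are hypothetical.

```python
import jax
import jax.numpy as jnp

def lateral(x, w):
    """1x1 convolution as a per-pixel matrix multiply. x: (H, W, C_in), w: (C_in, C_out)."""
    return x @ w

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of an (H, W, C) map."""
    return jnp.repeat(jnp.repeat(x, 2, axis=0), 2, axis=1)

def fpn_top_down(feats, lateral_ws):
    """feats: bottom-up maps ordered fine -> coarse. Returns the pyramid in the same order."""
    top = lateral(feats[-1], lateral_ws[-1])   # start from the coarsest level
    pyramid = [top]
    for x, w in zip(feats[-2::-1], lateral_ws[-2::-1]):
        top = lateral(x, w) + upsample2x(top)  # merge semantics with spatial detail
        pyramid.append(top)
    return pyramid[::-1]

keys = jax.random.split(jax.random.PRNGKey(0), 6)
feats = [jax.random.normal(keys[0], (8, 8, 4)),
         jax.random.normal(keys[1], (4, 4, 8)),
         jax.random.normal(keys[2], (2, 2, 16))]
lateral_ws = [jax.random.normal(keys[3], (4, 6)),
              jax.random.normal(keys[4], (8, 6)),
              jax.random.normal(keys[5], (16, 6))]
pyramid = fpn_top_down(feats, lateral_ws)
print([p.shape for p in pyramid])
```

Every pyramid level ends up with the same channel count, so one detection head can be shared across levels.
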
One-stage detectors skip the proposal step entirely, predicting class labels and bounding boxes in a single pass. This is faster but was historically less accurate than two-stage detectors, until focal loss closed the gap.
YOLO (You Only Look Once, Redmon et al., 2016) divides the image into an \(S \times S\) grid. Each grid cell predicts \(B\) bounding boxes and \(C\) class probabilities. If the centre of an object falls in a grid cell, that cell is responsible for detecting it. YOLO is extremely fast because the entire detection is a single forward pass with no proposal stage.
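
To make the grid assignment concrete, the sketch below finds the responsible cell for one object centre and computes the output tensor shape. The helper `responsible_cell` is hypothetical; the values \(S = 7\), \(B = 2\), \(C = 20\) and the 448-pixel input are the settings of the original YOLO paper on PASCAL VOC, where each box carries five numbers (four coordinates plus a confidence).

```python
def responsible_cell(cx, cy, img_w, img_h, S):
    """Row and column of the grid cell that contains the object centre (cx, cy)."""
    col = min(int(cx / img_w * S), S - 1)
    row = min(int(cy / img_h * S), S - 1)
    return row, col

S, B, C = 7, 2, 20
row, col = responsible_cell(cx=300, cy=100, img_w=448, img_h=448, S=S)
output_shape = (S, S, B * 5 + C)  # 5 numbers per box, plus C class probabilities per cell
print(f"responsible cell: row {row}, col {col}; output tensor: {output_shape}")
```
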
YOLOv2 added anchor boxes, batch normalisation, and multi-scale training. YOLOv3 added FPN-style multi-scale features and predicted at three scales. Later versions (YOLOv4 through YOLOv8) continued improving with better backbones, path aggregation networks, and mosaic data augmentation (stitching four images together during training to increase context diversity).
SSD (Single Shot MultiBox Detector, Liu et al., 2016) predicts at multiple feature map scales within the backbone, using anchor boxes at each scale. Early (high-resolution) feature maps detect small objects; later (low-resolution) maps detect large objects. SSD is faster than Faster R-CNN with competitive accuracy.
RetinaNet (Lin et al., 2017) identified the core problem with one-stage detectors: class imbalance. The vast majority of anchor boxes correspond to background, which generates easy negatives that dominate the loss and overwhelm the gradients from the rare positive examples.
Focal loss solves this by down-weighting easy examples:

\[\text{FL}(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)\]

where \(p_t\) is the predicted probability for the correct class. When the model is confident and correct (\(p_t\) is high), \((1 - p_t)^\gamma\) is small, reducing the loss contribution from easy negatives. The hyperparameter \(\gamma\) (typically 2) controls the strength of the down-weighting. With \(\gamma = 0\), focal loss reduces to standard cross-entropy. With focal loss, RetinaNet achieved accuracy comparable to two-stage detectors at one-stage speed.
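
A minimal binary version of focal loss, assuming `p` is the predicted foreground probability and `y` the 0/1 label. The defaults \(\alpha = 0.25\) and \(\gamma = 2\) follow the values reported in the RetinaNet paper; the function name is an illustrative choice.

```python
import jax.numpy as jnp

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Per-example binary focal loss."""
    p_t = jnp.where(y == 1, p, 1.0 - p)              # probability of the true class
    alpha_t = jnp.where(y == 1, alpha, 1.0 - alpha)  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * jnp.log(p_t)

p = jnp.array([0.01, 0.1])  # an easy background anchor and a hard positive
y = jnp.array([0, 1])
losses = focal_loss(p, y)
ce = -jnp.log(jnp.where(y == 1, p, 1.0 - p))  # plain cross-entropy for comparison
print("focal loss:   ", losses)
print("cross-entropy:", ce)
```

The easy negative contributes almost nothing, while the hard positive keeps a substantial loss. Setting `gamma=0.0` recovers \(\alpha_t\)-weighted cross-entropy.
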
Anchor-free detection eliminates anchor boxes entirely, reducing hyperparameter tuning and simplifying the pipeline.
FCOS (Fully Convolutional One-Stage, Tian et al., 2019) predicts, at every spatial position of the feature map, the distances from that position to the four sides of the bounding box that contains it (left, top, right, bottom) plus a class label. A centerness score down-weights predictions far from the object centre, which tend to produce low-quality boxes. FCOS uses FPN to handle multiple scales.
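
The FCOS regression targets and centerness for a single location can be sketched as follows, using the centerness definition from the FCOS paper, \(\sqrt{\frac{\min(l, r)}{\max(l, r)} \cdot \frac{\min(t, b)}{\max(t, b)}}\). The function name `fcos_targets` is hypothetical and the location is assumed to lie strictly inside the box.

```python
import jax.numpy as jnp

def fcos_targets(px, py, box):
    """Distances (l, t, r, b) to the box sides and centerness for a location in the box."""
    x1, y1, x2, y2 = box
    l, t, r, b = px - x1, py - y1, x2 - px, y2 - py
    centerness = jnp.sqrt((jnp.minimum(l, r) / jnp.maximum(l, r)) *
                          (jnp.minimum(t, b) / jnp.maximum(t, b)))
    return jnp.array([l, t, r, b]), centerness

box = (0.0, 0.0, 10.0, 10.0)
_, c_centre = fcos_targets(5.0, 5.0, box)  # at the box centre
_, c_off = fcos_targets(2.5, 5.0, box)     # shifted towards the left side
print(float(c_centre), float(c_off))
```
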
CenterNet (Zhou et al., 2019) detects objects as points: it predicts a heatmap where peaks correspond to object centres, then regresses the width and height at each peak. Detection becomes keypoint estimation. This is elegant and anchor-free; peaks are extracted by keeping local maxima of the heatmap (a 3x3 max-pooling comparison), which takes the place of box-level NMS.
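
Peak extraction from a centre heatmap can be sketched with a 3x3 max-pooling comparison: a location is kept if it equals the maximum of its neighbourhood and exceeds a score threshold. The function name `heatmap_peaks` and the threshold of 0.3 are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def heatmap_peaks(heat, threshold=0.3):
    """Return (row, col) indices of local maxima of an (H, W) heatmap above a threshold."""
    # 3x3 max pooling with stride 1 and 'SAME' padding keeps the spatial size
    pooled = jax.lax.reduce_window(heat, -jnp.inf, jax.lax.max, (3, 3), (1, 1), 'SAME')
    is_peak = (heat == pooled) & (heat > threshold)
    return jnp.argwhere(is_peak)

heat = jnp.zeros((5, 5)).at[1, 1].set(0.9).at[1, 2].set(0.5).at[3, 3].set(0.8)
peaks = heatmap_peaks(heat)
print(peaks)  # the 0.5 response next to the 0.9 peak is suppressed
```
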
CornerNet (Law and Deng, 2018) detects objects as pairs of corners (top-left and bottom-right). It predicts two heatmaps (one for each corner type) and uses an associative embedding to match corresponding corners into bounding boxes. This avoids the need for anchors and handles boxes of arbitrary size and aspect ratio.
Semantic segmentation assigns a class label to every pixel in the image. Unlike detection (which outputs boxes), segmentation produces a dense pixel-level map. A street scene might label every pixel as road, sidewalk, car, pedestrian, building, sky, etc.

Semantic segmentation: input street scene and its pixel-level label map where each colour represents a class

Fully Convolutional Networks (FCN) (Long et al., 2015) adapted classification CNNs for segmentation by replacing fully connected layers with convolutional layers, allowing the network to output a spatial map rather than a single class. Upsampling (via transposed convolutions or bilinear interpolation) restores the output to the input resolution. Skip connections from earlier layers add back spatial detail lost during downsampling.
Transposed convolution (sometimes called "deconvolution") is the upsampling counterpart of convolution. Where strided convolution reduces spatial dimensions, transposed convolution increases them. It inserts zeros between input elements and then applies a standard convolution, effectively learning how to upsample.
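
The zero-insertion view can be written out in 1D. In this sketch (function name and kernels are illustrative), a stride-2 transposed convolution with no padding produces \((n - 1) \cdot \text{stride} + k\) outputs for \(n\) inputs and a kernel of length \(k\).

```python
import jax.numpy as jnp

def transposed_conv1d(x, kernel, stride=2):
    """Insert (stride - 1) zeros between samples, then apply a full convolution."""
    n = len(x)
    up = jnp.zeros((n - 1) * stride + 1).at[::stride].set(x)
    return jnp.convolve(up, kernel, mode='full')

x = jnp.array([1.0, 2.0, 3.0])
nearest = transposed_conv1d(x, jnp.array([1.0, 1.0]))      # repeats each sample
linear = transposed_conv1d(x, jnp.array([0.5, 1.0, 0.5]))  # linear interpolation
print(nearest)
print(linear)
```

With fixed kernels the operation reproduces familiar interpolation schemes; in a network the kernel is learned, so the upsampling itself is learned.
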
U-Net (Ronneberger et al., 2015) introduced a symmetric encoder-decoder architecture with skip connections at every level. The encoder (contracting path) reduces spatial resolution while increasing channels, exactly like a classification CNN. The decoder (expanding path) upsamples back to full resolution. Skip connections concatenate encoder feature maps with decoder feature maps at each level, providing fine spatial detail to the decoder. This combination of high-level semantics and low-level detail produces sharp, accurate segmentation boundaries.

U-Net architecture: encoder path on the left with downsampling, decoder path on the right with upsampling, and skip connections bridging corresponding levels

U-Net was originally designed for biomedical image segmentation (where training data is scarce) and its architecture has become the foundation for many subsequent models, including the U-Net in latent diffusion models (file 04).
DeepLab (Chen et al., 2014-2018) introduced two key innovations for segmentation:
- Atrous (dilated) convolution: standard convolution with gaps inserted between filter elements, controlled by a dilation rate \(r\). A 3x3 filter with dilation \(r\) has a receptive field of \((2r + 1) \times (2r + 1)\) while using only 9 parameters. This captures context at multiple scales without downsampling, preserving spatial resolution.
- Atrous Spatial Pyramid Pooling (ASPP): applies multiple atrous convolutions with different dilation rates in parallel (e.g., rates 1, 6, 12, 18), concatenates the results, and fuses with a 1x1 convolution. ASPP captures context at multiple scales simultaneously, similar in spirit to the Inception module (file 02) but using dilation instead of different kernel sizes.
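
A 1D sketch of atrous convolution makes the receptive-field arithmetic concrete: a three-tap filter with dilation \(r\) spans \(2r + 1\) input samples. The function name `dilated_conv1d` is hypothetical and no padding is applied.

```python
import jax.numpy as jnp

def dilated_conv1d(x, kernel, rate):
    """Valid 1D cross-correlation with (rate - 1) gaps between kernel taps."""
    k = len(kernel)
    span = (k - 1) * rate + 1  # effective receptive field; 2r + 1 when k = 3
    n_out = len(x) - span + 1
    out = sum(kernel[j] * x[j * rate: j * rate + n_out] for j in range(k))
    return out, span

x = jnp.arange(10.0)
out, span = dilated_conv1d(x, jnp.array([1.0, 0.0, -1.0]), rate=3)
print(span, out)  # each output is x[i] - x[i + 6]
```

An ASPP-style module would run several such filters with different rates on the same input and concatenate the results.
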
DeepLab also used a Conditional Random Field (CRF) (chapter 05) as a post-processing step to refine segmentation boundaries by encouraging spatially nearby pixels with similar colours to share the same label.
Instance segmentation combines detection and segmentation: it identifies each individual object instance and produces a pixel-level mask for each. Two cars in a scene get two separate masks, not just "car" for both.
Mask R-CNN (He et al., 2017) extends Faster R-CNN by adding a small segmentation head that predicts a binary mask for each detected object. The architecture is Faster R-CNN + a mask branch: the mask branch takes the per-RoI features and outputs an \(m \times m\) binary mask per class. The per-RoI features are extracted with RoIAlign instead of RoI pooling: bilinear interpolation at precisely sampled points rather than quantised grid cells, which avoids the spatial misalignment that quantisation causes. This small change significantly improves mask quality.
Mask R-CNN is trained with a multi-task loss: classification loss + box regression loss + mask loss (per-pixel binary cross-entropy). The mask branch predicts a mask for every class independently. During training only the mask for the ground-truth class contributes to the loss, and at inference only the mask for the predicted class is used, which decouples mask prediction from classification and improves both.
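
The core of RoIAlign is bilinear sampling at real-valued locations. The sketch below is simplified: it takes one sample at the centre of each bin, whereas Mask R-CNN averages several samples per bin. The function names are hypothetical, the feature map has a single channel, and integer coordinates are treated as the sample locations of the feature values.

```python
import math
import jax.numpy as jnp

def bilinear_sample(feature_map, y, x):
    """Bilinearly interpolate an (H, W) feature map at a real-valued point (y, x)."""
    H, W = feature_map.shape
    y0 = min(max(math.floor(y), 0), H - 2)
    x0 = min(max(math.floor(x), 0), W - 2)
    dy, dx = y - y0, x - x0
    top = (1 - dx) * feature_map[y0, x0] + dx * feature_map[y0, x0 + 1]
    bottom = (1 - dx) * feature_map[y0 + 1, x0] + dx * feature_map[y0 + 1, x0 + 1]
    return (1 - dy) * top + dy * bottom

def roi_align(feature_map, roi, out_size=2):
    """One bilinear sample per bin centre; roi = (x1, y1, x2, y2), real-valued."""
    x1, y1, x2, y2 = roi
    bin_h, bin_w = (y2 - y1) / out_size, (x2 - x1) / out_size
    return jnp.array([[bilinear_sample(feature_map,
                                       y1 + (i + 0.5) * bin_h,
                                       x1 + (j + 0.5) * bin_w)
                       for j in range(out_size)]
                      for i in range(out_size)])

feature_map = jnp.arange(16.0).reshape(4, 4)  # value at (y, x) is 4y + x
aligned = roi_align(feature_map, roi=(0.25, 0.75, 2.25, 2.75))
print(aligned)
```

Because no coordinate is rounded, shifting the RoI by a fraction of a cell shifts the sampled values smoothly, which is the alignment property that mask prediction relies on.
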
Panoptic segmentation unifies semantic and instance segmentation into a single task. Every pixel gets both a class label (semantic) and an instance ID (instance, for "thing" classes like cars and people). "Stuff" classes (sky, road, grass) get only semantic labels because they are amorphous regions without countable instances.
The panoptic quality (PQ) metric evaluates this by decomposing into a segmentation quality (average IoU of matched segments) and a recognition quality (F1 score of matched segments):

\[\text{PQ} = \underbrace{\frac{\sum_{(p,g) \in \text{TP}} \text{IoU}(p,g)}{|\text{TP}|}}_{\text{SQ}} \times \underbrace{\frac{|\text{TP}|}{|\text{TP}| + \frac{1}{2}|\text{FP}| + \frac{1}{2}|\text{FN}|}}_{\text{RQ}}\]
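
The formula can be evaluated directly once segments have been matched. In this sketch the function name and the example numbers are illustrative; the inputs are the IoUs of the matched (TP) segment pairs and the counts of unmatched predicted (FP) and ground truth (FN) segments.

```python
def panoptic_quality(matched_ious, n_fp, n_fn):
    """Return (PQ, SQ, RQ) from matched-segment IoUs and unmatched counts."""
    n_tp = len(matched_ious)
    if n_tp == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / n_tp                # segmentation quality
    rq = n_tp / (n_tp + 0.5 * n_fp + 0.5 * n_fn) # recognition quality (F1)
    return sq * rq, sq, rq

pq, sq, rq = panoptic_quality([0.8, 0.6, 0.9], n_fp=1, n_fn=1)
print(f"PQ={pq:.3f}  SQ={sq:.3f}  RQ={rq:.3f}")
```
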

Real-time segmentation is critical for applications like autonomous driving and augmented reality, where latency budgets are tight (often under 30 milliseconds per frame).
BiSeNet (Bilateral Segmentation Network, Yu et al., 2018) uses two parallel paths: a spatial path with wide, shallow layers that preserves spatial detail, and a context path with deep, narrow layers that captures semantics. The outputs are fused, giving both speed and accuracy.
DDRNet (Deep Dual-Resolution Network, Hong et al., 2021) maintains two branches at different resolutions throughout the network, with repeated information exchange between them. The high-resolution branch preserves spatial detail while the low-resolution branch captures global context. Multiple bilateral fusion modules merge information in both directions.
The general trend in real-time segmentation is to avoid the heavy encoder-decoder pattern and instead maintain sufficient spatial resolution throughout the network, trading some accuracy for dramatically lower latency.

Coding Tasks (use Colab or a notebook)

Implement IoU computation and Non-Maximum Suppression from scratch. Apply NMS to a set of overlapping bounding boxes and visualise the result.

import jax.numpy as jnp
import matplotlib.pyplot as plt
import matplotlib.patches as patches

def compute_iou(box1, box2):
    """Compute IoU between two boxes [x1, y1, x2, y2]."""
    x1 = jnp.maximum(box1[0], box2[0])
    y1 = jnp.maximum(box1[1], box2[1])
    x2 = jnp.minimum(box1[2], box2[2])
    y2 = jnp.minimum(box1[3], box2[3])

    intersection = jnp.maximum(0, x2 - x1) * jnp.maximum(0, y2 - y1)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - intersection

    return intersection / (union + 1e-6)

def nms(boxes, scores, iou_threshold=0.5):
    """Non-Maximum Suppression."""
    order = jnp.argsort(-scores)  # sort by descending confidence
    keep = []

    order_list = order.tolist()  # candidate indices, best score first

    while order_list:
        idx = order_list[0]
        keep.append(idx)
        order_list = order_list[1:]

        new_order = []
        for j in order_list:
            iou = compute_iou(boxes[idx], boxes[j])
            if iou < iou_threshold:
                new_order.append(j)
        order_list = new_order

    return keep

# Example: overlapping detections of the same object
boxes = jnp.array([
    [50, 60, 150, 160],   # high confidence
    [55, 65, 155, 165],   # overlapping duplicate
    [52, 58, 148, 158],   # overlapping duplicate
    [200, 100, 300, 200], # different object
    [205, 105, 305, 205], # overlapping duplicate
])
scores = jnp.array([0.95, 0.80, 0.70, 0.90, 0.60])

keep = nms(boxes, scores, iou_threshold=0.5)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
colors = ['#3498db', '#e74c3c', '#27ae60', '#9b59b6', '#f39c12']

for ax, title, indices in zip(axes, ['Before NMS', 'After NMS'],
                               [range(len(boxes)), keep]):
    ax.set_xlim(0, 400); ax.set_ylim(0, 300)
    ax.set_aspect('equal'); ax.invert_yaxis()
    ax.set_title(title)
    for i in indices:
        b = boxes[i]
        rect = patches.Rectangle((b[0], b[1]), b[2]-b[0], b[3]-b[1],
                                  linewidth=2, edgecolor=colors[i],
                                  facecolor='none')
        ax.add_patch(rect)
        ax.text(b[0], b[1]-5, f'{scores[i]:.2f}', color=colors[i], fontsize=10)

plt.tight_layout(); plt.show()
print(f"Kept {len(keep)} of {len(boxes)} boxes after NMS")

Implement a simplified Region Proposal Network (RPN). Given a feature map, generate anchor boxes at multiple scales and aspect ratios, and predict objectness scores and box offsets.

import jax
import jax.numpy as jnp
import matplotlib.pyplot as plt
import matplotlib.patches as patches

def generate_anchors(feature_h, feature_w, stride, scales, ratios):
    """Generate anchor boxes for each position on the feature map."""
    anchors = []
    for y in range(feature_h):
        for x in range(feature_w):
            cx = (x + 0.5) * stride
            cy = (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w = s * jnp.sqrt(r)
                    h = s / jnp.sqrt(r)
                    anchors.append([cx - w/2, cy - h/2, cx + w/2, cy + h/2])
    return jnp.array(anchors)

def rpn_forward(feature_map, params):
    """Simplified RPN: predicts objectness and box offsets per anchor."""
    H, W, C = feature_map.shape
    n_anchors = params['cls_w'].shape[1]

    # Slide a 1x1 conv over the feature map (simplified)
    cls_scores = feature_map.reshape(-1, C) @ params['cls_w']  # (H*W, n_anchors)
    box_offsets = feature_map.reshape(-1, C) @ params['reg_w']  # (H*W, n_anchors*4)

    cls_scores = jax.nn.sigmoid(cls_scores)
    return cls_scores.ravel(), box_offsets.reshape(-1, 4)

# Setup
feature_h, feature_w, channels = 4, 4, 16
stride = 16  # each feature map cell covers 16x16 pixels
scales = [32, 64, 128]
ratios = [0.5, 1.0, 2.0]
n_anchors_per_pos = len(scales) * len(ratios)

key = jax.random.PRNGKey(42)
k1, k2, k3 = jax.random.split(key, 3)

feature_map = jax.random.normal(k1, (feature_h, feature_w, channels))
params = {
    'cls_w': jax.random.normal(k2, (channels, n_anchors_per_pos)) * 0.01,
    'reg_w': jax.random.normal(k3, (channels, n_anchors_per_pos * 4)) * 0.01,
}

anchors = generate_anchors(feature_h, feature_w, stride, scales, ratios)
scores, offsets = rpn_forward(feature_map, params)

print(f"Feature map: {feature_h}x{feature_w}, stride={stride}")
print(f"Anchors per position: {n_anchors_per_pos}")
print(f"Total anchors: {len(anchors)}")
print(f"Objectness scores shape: {scores.shape}")
print(f"Box offsets shape: {offsets.shape}")

# Visualise anchors for one position
fig, ax = plt.subplots(figsize=(6, 6))
img_size = feature_h * stride
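# The anchors are larger than this toy image, so widen the view to show them in full
ax.set_xlim(-1.5 * img_size, 2.5 * img_size); ax.set_ylim(-1.5 * img_size, 2.5 * img_size)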
ax.invert_yaxis(); ax.set_aspect('equal')

pos_idx = feature_h // 2 * feature_w + feature_w // 2  # centre position
colors = ['#3498db', '#e74c3c', '#27ae60']
for i, s in enumerate(scales):
    for j, r in enumerate(ratios):
        idx = pos_idx * n_anchors_per_pos + i * len(ratios) + j
        a = anchors[idx]
        rect = patches.Rectangle((a[0], a[1]), a[2]-a[0], a[3]-a[1],
                                  linewidth=1.5, edgecolor=colors[i],
                                  facecolor='none', linestyle=['--', '-', ':'][j])
        ax.add_patch(rect)

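# Mark the centre of the chosen cell (not the image centre), where all nine anchors are centred
cx, cy = (feature_w // 2 + 0.5) * stride, (feature_h // 2 + 0.5) * stride
ax.scatter([cx], [cy], c='red', s=50, zorder=5)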
ax.set_title(f'Anchors at centre position\n3 scales × 3 ratios = {n_anchors_per_pos}')
ax.grid(True, alpha=0.3)
plt.tight_layout(); plt.show()

Implement a simplified U-Net encoder-decoder with skip connections for 1D segmentation (binary labelling of a 1D signal).

import jax
import jax.numpy as jnp
import matplotlib.pyplot as plt

def conv1d_same(x, kernel):
    """1D cross-correlation with edge padding; output has the same length as x."""
    pad = len(kernel) // 2
    x_pad = jnp.pad(x, pad, mode='edge')
    # Vectorised sliding dot product (replaces a per-sample Python loop)
    return jnp.correlate(x_pad, kernel, mode='valid')[:len(x)]

def downsample(x):
    return x[::2]

def upsample(x, target_len):
    return jnp.interp(jnp.linspace(0, 1, target_len), jnp.linspace(0, 1, len(x)), x)

def unet_1d(x, params):
    """Simplified 1D U-Net with 2 encoder/decoder levels."""
    # Encoder
    e1 = jnp.maximum(0, conv1d_same(x, params['enc1']))
    e1_down = downsample(e1)

    e2 = jnp.maximum(0, conv1d_same(e1_down, params['enc2']))
    e2_down = downsample(e2)

    # Bottleneck
    bottleneck = jnp.maximum(0, conv1d_same(e2_down, params['bottleneck']))

    # Decoder with skip connections (added here for simplicity; the original U-Net concatenates)
    d2_up = upsample(bottleneck, len(e2))
    d2 = jnp.maximum(0, conv1d_same(d2_up + e2, params['dec2']))  # skip connection

    d1_up = upsample(d2, len(e1))
    d1 = conv1d_same(d1_up + e1, params['dec1'])  # skip connection

    return jax.nn.sigmoid(d1)

# Create signal with labelled regions
n = 128
t = jnp.linspace(0, 4 * jnp.pi, n)
signal = jnp.sin(t) + 0.5 * jnp.sin(3 * t)
labels = (signal > 0.5).astype(jnp.float32)  # binary segmentation target

key = jax.random.PRNGKey(42)
keys = jax.random.split(key, 5)
params = {
    'enc1': jax.random.normal(keys[0], (5,)) * 0.3,
    'enc2': jax.random.normal(keys[1], (5,)) * 0.3,
    'bottleneck': jax.random.normal(keys[2], (3,)) * 0.3,
    'dec2': jax.random.normal(keys[3], (5,)) * 0.3,
    'dec1': jax.random.normal(keys[4], (5,)) * 0.3,
}

def loss_fn(params, signal, labels):
    pred = unet_1d(signal, params)
    return -jnp.mean(labels * jnp.log(pred + 1e-7) + (1 - labels) * jnp.log(1 - pred + 1e-7))

grad_fn = jax.jit(jax.grad(loss_fn))
lr = 0.05

for step in range(500):
    grads = grad_fn(params, signal, labels)
    params = {k: params[k] - lr * grads[k] for k in params}

pred = unet_1d(signal, params)

fig, axes = plt.subplots(3, 1, figsize=(12, 7), sharex=True)
axes[0].plot(t, signal, color='#3498db', linewidth=1.5)
axes[0].set_title('Input Signal'); axes[0].set_ylabel('Value')

axes[1].fill_between(t, 0, labels, alpha=0.3, color='#27ae60')
axes[1].set_title('Ground Truth Labels'); axes[1].set_ylabel('Label')

axes[2].plot(t, pred, color='#e74c3c', linewidth=1.5)
axes[2].fill_between(t, 0, (pred > 0.5).astype(float), alpha=0.2, color='#e74c3c')
axes[2].set_title('U-Net Prediction'); axes[2].set_ylabel('Probability')
axes[2].set_xlabel('t')

plt.tight_layout(); plt.show()
print(f"Final loss: {loss_fn(params, signal, labels):.4f}")
print(f"Pixel accuracy: {jnp.mean((pred > 0.5) == labels):.2%}")