Video and 3D Vision¶

Video and 3D vision extend image understanding into the temporal and spatial domains. This file covers optical flow, video classification (3D CNNs, TimeSformer), object tracking (SORT, DeepSORT), action recognition, depth estimation (monocular and stereo), point clouds, NeRFs, and 3D Gaussian splatting.

Files 01-04 treated images as isolated snapshots. But the visual world is continuous: objects move, scenes change, and depth exists. This file extends computer vision into the temporal domain (video) and the spatial domain (3D), covering how models understand motion, track objects, estimate depth, and reconstruct scenes.
A video is a sequence of images (frames) captured over time. At 30 frames per second, a 10-second clip contains 300 frames. The key challenge is modelling the temporal dimension: how do objects move, how do scenes evolve, and how can we relate information across frames?
Optical flow estimates the apparent motion of pixels between two consecutive frames. For each pixel in frame \(t\), optical flow produces a 2D displacement vector \((u, v)\) pointing to where that pixel moved in frame \(t+1\). The result is a dense motion field the same size as the image.

Two consecutive video frames and the optical flow field between them, visualised as coloured arrows showing pixel motion direction and magnitude

Optical flow is computed under the brightness constancy assumption: a pixel's intensity does not change as it moves. If a pixel at position \((x, y)\) in frame \(t\) has intensity \(I(x, y, t)\) and moves by \((u, v)\) in a small time interval \(\delta t\):

\[I(x + u\delta t, \, y + v\delta t, \, t + \delta t) = I(x, y, t)\]

Taking a first-order Taylor expansion (chapter 03) and dividing by \(\delta t\):

\[I_x u + I_y v + I_t = 0\]

where \(I_x, I_y\) are the spatial gradients (Sobel, file 01) and \(I_t\) is the temporal gradient (difference between consecutive frames). This is the optical flow constraint equation. One equation, two unknowns (\(u, v\)): we need an additional constraint.
Lucas-Kanade assumes the flow is constant within a small window (e.g., 5x5 pixels). This gives an overdetermined system (25 equations, 2 unknowns) solved by least squares (the normal equation from chapter 06):

\[ \begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} \sum I_x^2 & \sum I_x I_y \\ \sum I_x I_y & \sum I_y^2 \end{bmatrix}^{-1} \begin{bmatrix} -\sum I_x I_t \\ -\sum I_y I_t \end{bmatrix} \]

The 2x2 matrix is the structure tensor from file 01 (the same matrix used in Harris corner detection). Lucas-Kanade works well for small motions but fails when objects move more than a few pixels between frames.
Farneback's method fits a polynomial expansion to the neighbourhood of each pixel and estimates the displacement field that best explains the change between frames. It produces dense flow (a vector for every pixel) and handles larger motions than Lucas-Kanade.
Modern deep learning optical flow methods (FlowNet, RAFT) learn to predict optical flow end-to-end from pairs of frames. RAFT (Recurrent All-Pairs Field Transforms, Teed and Deng, 2020) computes a 4D correlation volume between all pairs of pixels in the two frames and iteratively refines the flow estimate using a GRU-based update operator. RAFT achieves state-of-the-art accuracy and has become the standard flow backbone.
Two-stream networks (Simonyan and Zisserman, 2014) were an early approach to video understanding. One stream processes a single RGB frame (appearance), the other processes a stack of optical flow frames (motion). The two streams are fused at the end (by averaging or concatenation). This architecture explicitly separates "what things look like" from "how they move."
3D Convolutional Networks extend 2D convolutions to the temporal dimension. A 3D convolution applies a filter of size \(k \times k \times k_t\) that spans both spatial and temporal dimensions, directly learning spatiotemporal features.
C3D (Tran et al., 2015) stacked 3D convolutions with 3x3x3 filters, showing that temporal convolutions can learn motion features without explicit optical flow. The cost is high: 3D convolutions have \(k_t\) times more parameters and computation than their 2D counterparts.
I3D (Inflated 3D, Carreira and Zisserman, 2017) took a more practical approach: start with a pre-trained 2D CNN (like Inception or ResNet) and "inflate" all 2D filters to 3D by repeating the weights along the temporal dimension and dividing by \(k_t\). This transfers ImageNet pre-training to video while adding temporal modelling. A 2D \(k \times k\) filter becomes a \(k \times k \times k_t\) filter initialised as \(W_{\text{3D}}[:,:,j] = W_{\text{2D}} / k_t\) for all temporal positions \(j\).
SlowFast Networks (Feichtenhofer et al., 2019) use two parallel pathways operating at different temporal resolutions:
- The Slow pathway processes frames at a low frame rate (e.g., every 16th frame) with high spatial resolution and many channels, capturing fine spatial detail.
- The Fast pathway processes frames at a high frame rate (every 2nd frame) with reduced spatial resolution and fewer channels (typically \(1/8\) of the Slow pathway), capturing rapid temporal changes.
- Lateral connections fuse information from Fast to Slow via strided convolutions.
The insight is that spatial and temporal information have different bandwidth requirements: object appearance changes slowly, but motion can be rapid. SlowFast matches this asymmetry by design.
TimeSformer (Bertasius et al., 2021) applies the Vision Transformer to video. It decomposes full spatiotemporal attention (which would be prohibitively expensive: \(O((T \times N)^2)\) for \(T\) frames and \(N\) patches per frame) into divided attention: each block alternates between temporal attention (each patch attends across time at the same spatial position) and spatial attention (each patch attends across space within the same frame). This reduces the cost from \(O(T^2 N^2)\) to \(O(T^2 + N^2)\).
VideoMAE (Tong et al., 2022) extends the masked autoencoder idea (file 04) to video. An extremely high masking ratio (90-95%) is used because video has high temporal redundancy: neighbouring frames look nearly identical, so masking most patches still leaves enough information for reconstruction. VideoMAE pre-trains a ViT backbone on unlabelled video and transfers to downstream tasks.
Action recognition classifies a video clip into one of many action categories (e.g., "running," "cooking," "playing guitar"). It is the video analogue of image classification. Standard benchmarks include Kinetics-400 (400 action classes, ~300K clips), Something-Something (174 fine-grained actions requiring temporal reasoning), and ActivityNet (200 classes with long, untrimmed videos).
Temporal action detection goes beyond classification: given a long untrimmed video, find the start time, end time, and class of each action. This is the temporal analogue of object detection. Methods like ActionFormer use a Transformer to process temporal features and predict action boundaries.
Video object tracking follows a specific object across frames after it is identified in the first frame.
SORT (Simple Online and Realtime Tracking, Bewley et al., 2016) combines a detection model (which detects objects in each frame independently) with the Kalman filter for motion prediction and the Hungarian algorithm for assignment.
The Kalman filter maintains a state estimate (position, velocity, size) for each tracked object and predicts where it will be in the next frame using a linear motion model. When a new detection arrives, the Kalman filter updates its estimate by combining the prediction with the observation, weighted by their respective uncertainties. This is Bayesian updating (chapter 05) applied to tracking.
The Hungarian algorithm solves the bilinear assignment problem: given \(M\) tracked objects and \(N\) new detections, find the optimal one-to-one matching that minimises total cost (using IoU distance from file 03). Unmatched detections start new tracks; unmatched tracks are terminated after a grace period.
DeepSORT extends SORT by adding a deep appearance feature: each detected object is passed through a small CNN that produces an appearance embedding (a descriptor vector). The matching cost combines IoU distance with cosine distance (chapter 01) in embedding space. This handles occlusion and re-identification: even if an object disappears behind another for several frames, its appearance embedding allows re-matching when it reappears.
ByteTrack (Zhang et al., 2022) improves tracking by using every detection, including low-confidence ones. Most trackers discard detections below a confidence threshold. ByteTrack first matches high-confidence detections to existing tracks, then matches the remaining low-confidence detections to unmatched tracks. This recovers objects that are temporarily occluded or blurry (and thus have low detection confidence).
3D vision recovers the third spatial dimension that is lost in the 2D image projection (file 01).
Depth estimation predicts the distance from the camera to each point in the scene.
Stereo depth uses two cameras separated by a known baseline \(b\). The same point appears at different horizontal positions in the left and right images (this offset is called disparity \(d\)). Depth is inversely proportional to disparity:

\[Z = \frac{f \cdot b}{d}\]

where \(f\) is the focal length and \(b\) is the baseline distance. Computing disparity requires finding corresponding points between the two images (stereo matching), which is a 1D search along the horizontal scan line (because the cameras are horizontally aligned, a point at the same height in 3D projects to the same row in both images).
Monocular depth estimation predicts depth from a single image, which is fundamentally ill-posed (infinitely many 3D scenes can produce the same 2D image). Yet humans do it effortlessly using cues like relative size, texture gradients, occlusion, and atmospheric haze. Deep networks learn these cues from training data.
Models like MiDaS and Depth Anything predict relative depth maps (ranking which objects are closer) from single images. They are trained on diverse datasets with a scale-invariant loss and produce remarkably accurate results despite the theoretical ambiguity.
Point clouds are sets of 3D points \((x, y, z)\), optionally with colour or other attributes, captured by LiDAR sensors or stereo reconstruction. Unlike images, point clouds are unordered and irregularly spaced.
PointNet (Qi et al., 2017) processes point clouds directly by applying shared MLPs to each point independently, then aggregating with max pooling (which is permutation-invariant, solving the ordering problem). PointNet++ adds hierarchical grouping to capture local structure at multiple scales.
Neural Radiance Fields (NeRFs) (Mildenhall et al., 2020) represent a 3D scene as a continuous function that maps a 3D position \((x, y, z)\) and viewing direction \((\theta, \phi)\) to a colour \((r, g, b)\) and density \(\sigma\). This function is parameterised by an MLP:

\[F_\theta: (x, y, z, \theta, \phi) \to (r, g, b, \sigma)\]

To render a pixel, a ray is cast from the camera through that pixel into the scene. Points are sampled along the ray, and the MLP predicts colour and density at each point. The pixel colour is computed by volume rendering: integrating colour weighted by density along the ray:

\[C(\mathbf{r}) = \int_{t_n}^{t_f} T(t) \cdot \sigma(\mathbf{r}(t)) \cdot \mathbf{c}(\mathbf{r}(t), \mathbf{d}) \, dt\]

where \(T(t) = \exp(-\int_{t_n}^{t} \sigma(\mathbf{r}(s)) \, ds)\) is the accumulated transmittance (how much light has been absorbed so far). In practice, this integral is approximated by sampling \(N\) points along the ray and summing:

\[\hat{C} = \sum_{i=1}^{N} T_i \cdot (1 - \exp(-\sigma_i \delta_i)) \cdot c_i\]

NeRF is trained by minimising the MSE between rendered pixels and ground truth pixels from a set of posed photographs. After training, the NeRF can render photorealistic novel views from any camera position. The limitation is speed: rendering requires evaluating the MLP millions of times (one per sample point per pixel), making real-time rendering difficult.
3D Gaussian Splatting (Kerbl et al., 2023) addresses NeRF's speed limitation by representing the scene as a collection of 3D Gaussian primitives rather than a continuous volumetric function. Each Gaussian has a 3D position (mean), a 3D covariance matrix (controlling shape and orientation), opacity, and colour (represented as spherical harmonics for view-dependent effects).
Rendering projects each 3D Gaussian onto the image plane (producing a 2D Gaussian "splat"), sorts by depth, and composites front-to-back using alpha blending. This is a rasterisation process that runs on the GPU in real time (100+ FPS), orders of magnitude faster than NeRF's ray marching. Gaussian splatting matches or exceeds NeRF quality while enabling real-time rendering.
SLAM (Simultaneous Localisation and Mapping) is the problem of building a map of an unknown environment while simultaneously tracking the camera's position within it. It is fundamental to robotics, autonomous driving, and AR.
Visual odometry estimates camera motion from frame to frame by tracking features across images. Feature points (SIFT, ORB from file 01) are matched between consecutive frames, and the camera's rotation and translation are estimated from the correspondences using the essential matrix (which encodes the geometric relationship between two views, derived from the intrinsic and extrinsic parameters of file 01).
Feature-based SLAM extends visual odometry by maintaining a persistent map. ORB-SLAM (Mur-Artal et al., 2015) is the most widely used feature-based SLAM system. It has three parallel threads:
1. Tracking: match ORB features in each new frame to the map, estimate camera pose using PnP (Perspective-n-Point) and RANSAC
2. Local mapping: triangulate new map points from matched features, optimise their positions using bundle adjustment (minimising reprojection error across all views that see each point)
3. Loop closure: detect when the camera revisits a previously mapped area (using bag-of-visual-words), then correct the accumulated drift by globally optimising the map
LiDAR SLAM uses 3D point clouds from LiDAR sensors instead of (or in addition to) camera images. LiDAR provides direct depth measurements, making the geometry estimation more robust but at higher hardware cost. Methods like LOAM (LiDAR Odometry and Mapping) register point clouds between consecutive scans using iterative closest point (ICP) alignment.
Visual-Inertial SLAM fuses camera data with measurements from an IMU (accelerometer + gyroscope). The IMU provides high-frequency rotation and acceleration estimates that bridge the gaps between camera frames and handle fast motion or temporary visual feature loss.
VR/AR applications are among the most demanding consumers of computer vision.
Pose estimation determines the position and orientation of the human body (or face, or hands) from images. Body pose is typically represented as a set of 2D or 3D keypoint locations (joints: shoulders, elbows, wrists, hips, knees, ankles). Models like OpenPose and MediaPipe predict these keypoints using heatmap regression: for each joint, the model outputs a heatmap where the peak indicates the joint's location.
Top-down methods first detect people with a bounding box detector (file 03), then estimate the pose within each box. Bottom-up methods detect all keypoints in the image first, then group them into individuals using part affinity fields (vector fields that encode the association between connected joints).
Scene reconstruction builds a 3D model of the environment from sensor data. In AR, this enables placing virtual objects on real surfaces, occluding virtual objects behind real ones, and casting virtual shadows. Real-time scene reconstruction methods (like depth sensor-based systems in ARKit and ARCore) build a sparse mesh of the environment that updates as the user moves.
Real-time rendering constraints in VR are extreme: both eyes need separate renders at 90+ FPS (to avoid motion sickness), with latency under 20 milliseconds from head movement to display update. Techniques like foveated rendering (rendering at high resolution only where the user is looking, using eye tracking) and reprojection (warping the previous frame based on new head pose to fill the gap while the next frame renders) are essential for meeting these constraints.
The convergence of real-time neural rendering (3D Gaussian splatting), robust tracking (visual-inertial SLAM), and efficient pose estimation is making photorealistic, interactive AR/VR experiences increasingly feasible.

Coding Tasks (use CoLab or notebook)¶

Implement the Lucas-Kanade optical flow algorithm from scratch. Compute flow between two synthetic frames where a square moves to the right.

import jax.numpy as jnp
import matplotlib.pyplot as plt

def lucas_kanade(frame1, frame2, window_size=5):
    """Lucas-Kanade optical flow."""
    # Compute gradients
    Ix = jnp.zeros_like(frame1)
    Iy = jnp.zeros_like(frame1)
    It = frame2 - frame1

    # Sobel-like gradients
    Ix = Ix.at[1:-1, :].set((frame1[2:, :] - frame1[:-2, :]) / 2)
    Iy = Iy.at[:, 1:-1].set((frame1[:, 2:] - frame1[:, :-2]) / 2)

    H, W = frame1.shape
    half_w = window_size // 2
    u = jnp.zeros_like(frame1)
    v = jnp.zeros_like(frame1)

    for i in range(half_w, H - half_w):
        for j in range(half_w, W - half_w):
            Ix_win = Ix[i-half_w:i+half_w+1, j-half_w:j+half_w+1].ravel()
            Iy_win = Iy[i-half_w:i+half_w+1, j-half_w:j+half_w+1].ravel()
            It_win = It[i-half_w:i+half_w+1, j-half_w:j+half_w+1].ravel()

            A = jnp.stack([Ix_win, Iy_win], axis=1)
            ATA = A.T @ A
            ATb = -A.T @ It_win

            # Check if the system is well-conditioned
            det = ATA[0,0] * ATA[1,1] - ATA[0,1] * ATA[1,0]
            if jnp.abs(det) > 1e-6:
                flow = jnp.linalg.solve(ATA, ATb)
                u = u.at[i, j].set(flow[0])
                v = v.at[i, j].set(flow[1])

    return u, v

# Create two frames: a white square that moves right
frame1 = jnp.zeros((64, 64))
frame1 = frame1.at[20:40, 15:35].set(1.0)

frame2 = jnp.zeros((64, 64))
frame2 = frame2.at[20:40, 20:40].set(1.0)  # shifted 5 pixels right

u, v = lucas_kanade(frame1, frame2, window_size=7)

# Visualise
fig, axes = plt.subplots(1, 3, figsize=(14, 4))
axes[0].imshow(frame1, cmap='gray'); axes[0].set_title('Frame 1'); axes[0].axis('off')
axes[1].imshow(frame2, cmap='gray'); axes[1].set_title('Frame 2'); axes[1].axis('off')

# Quiver plot of flow (subsample for clarity)
step = 4
Y, X = jnp.mgrid[0:64:step, 0:64:step]
axes[2].imshow(frame1, cmap='gray', alpha=0.5)
axes[2].quiver(X, Y, u[::step, ::step], v[::step, ::step],
               color='#e74c3c', scale=50, width=0.005)
axes[2].set_title('Optical Flow'); axes[2].axis('off')

plt.tight_layout(); plt.show()

# Check average flow in the moving region
region_u = u[20:40, 15:35]
print(f"Average horizontal flow in object region: {region_u[region_u != 0].mean():.2f} pixels")

Implement a simple Kalman filter for 2D object tracking. Simulate a noisy trajectory and show how the Kalman filter smooths the estimates.

import jax
import jax.numpy as jnp
import matplotlib.pyplot as plt

def kalman_predict(x, P, F, Q):
    """Kalman filter prediction step."""
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    return x_pred, P_pred

def kalman_update(x_pred, P_pred, z, H, R):
    """Kalman filter update step."""
    y = z - H @ x_pred                        # innovation
    S = H @ P_pred @ H.T + R                  # innovation covariance
    K = P_pred @ H.T @ jnp.linalg.inv(S)      # Kalman gain
    x_updated = x_pred + K @ y
    P_updated = (jnp.eye(len(x_pred)) - K @ H) @ P_pred
    return x_updated, P_updated

# State: [x, y, vx, vy]
dt = 1.0
F = jnp.array([[1, 0, dt, 0],    # state transition
                [0, 1, 0, dt],
                [0, 0, 1, 0],
                [0, 0, 0, 1]])
H = jnp.array([[1, 0, 0, 0],     # observation: we measure x, y
                [0, 1, 0, 0]])
Q = jnp.eye(4) * 0.01            # process noise
R = jnp.eye(2) * 4.0             # measurement noise (noisy detector)

# Simulate ground truth: circular motion
n_steps = 50
t = jnp.linspace(0, 2 * jnp.pi, n_steps)
true_x = 10 * jnp.cos(t) + 20
true_y = 10 * jnp.sin(t) + 20

# Noisy observations
key = jax.random.PRNGKey(42)
noise = jax.random.normal(key, (n_steps, 2)) * 2.0
obs_x = true_x + noise[:, 0]
obs_y = true_y + noise[:, 1]

# Run Kalman filter
x = jnp.array([obs_x[0], obs_y[0], 0.0, 0.0])  # initial state
P = jnp.eye(4) * 10.0                             # initial uncertainty

kalman_x, kalman_y = [], []
for i in range(n_steps):
    x, P = kalman_predict(x, P, F, Q)
    z = jnp.array([obs_x[i], obs_y[i]])
    x, P = kalman_update(x, P, z, H, R)
    kalman_x.append(x[0])
    kalman_y.append(x[1])

kalman_x = jnp.array(kalman_x)
kalman_y = jnp.array(kalman_y)

# Visualise
plt.figure(figsize=(8, 8))
plt.plot(true_x, true_y, 'k-', linewidth=2, label='Ground Truth')
plt.scatter(obs_x, obs_y, c='#e74c3c', s=20, alpha=0.5, label='Noisy Observations')
plt.plot(kalman_x, kalman_y, '#3498db', linewidth=2, label='Kalman Filter')
plt.legend(); plt.grid(alpha=0.3)
plt.title('Kalman Filter Tracking')
plt.xlabel('x'); plt.ylabel('y')
plt.axis('equal'); plt.show()

obs_error = jnp.mean(jnp.sqrt((obs_x - true_x)**2 + (obs_y - true_y)**2))
kalman_error = jnp.mean(jnp.sqrt((kalman_x - true_x)**2 + (kalman_y - true_y)**2))
print(f"Observation RMSE: {obs_error:.2f}")
print(f"Kalman filter RMSE: {kalman_error:.2f}")
print(f"Error reduction: {(1 - kalman_error/obs_error) * 100:.1f}%")

Implement a simplified NeRF-style volume rendering pipeline. Cast rays through a simple 3D scene (spheres of known colour and density) and render an image by integrating along each ray.

import jax
import jax.numpy as jnp
import matplotlib.pyplot as plt

def render_ray(origin, direction, spheres, n_samples=64, t_near=1.0, t_far=6.0):
    """Volume render a single ray through a scene of spheres."""
    t_vals = jnp.linspace(t_near, t_far, n_samples)
    deltas = jnp.concatenate([jnp.diff(t_vals), jnp.array([1e-3])])

    colour = jnp.zeros(3)
    transmittance = 1.0

    for i in range(n_samples):
        point = origin + t_vals[i] * direction

        # Compute density and colour at this point
        density = 0.0
        point_colour = jnp.zeros(3)

        for center, radius, col, sigma in spheres:
            dist = jnp.linalg.norm(point - center)
            # Soft sphere: density falls off with distance from surface
            d = jnp.exp(-jnp.maximum(0, dist - radius) * sigma) * sigma
            density += d
            point_colour += d * jnp.array(col)

        # Normalise colour by total density
        point_colour = jnp.where(density > 1e-6, point_colour / density, point_colour)

        # Volume rendering equation
        alpha = 1.0 - jnp.exp(-density * deltas[i])
        colour += transmittance * alpha * point_colour
        transmittance *= (1.0 - alpha)

    return colour

# Scene: three coloured spheres
spheres = [
    (jnp.array([0.0, 0.0, 4.0]), 0.8, [1.0, 0.2, 0.2], 5.0),   # red
    (jnp.array([1.5, 0.5, 5.0]), 0.6, [0.2, 1.0, 0.2], 5.0),   # green
    (jnp.array([-1.0, -0.5, 3.5]), 0.5, [0.2, 0.2, 1.0], 5.0), # blue
]

# Camera setup
img_h, img_w = 64, 64
focal = 60.0
origin = jnp.array([0.0, 0.0, 0.0])

image = jnp.zeros((img_h, img_w, 3))
for i in range(img_h):
    for j in range(img_w):
        # Compute ray direction
        px = (j - img_w / 2) / focal
        py = -(i - img_h / 2) / focal
        direction = jnp.array([px, py, 1.0])
        direction = direction / jnp.linalg.norm(direction)

        colour = render_ray(origin, direction, spheres)
        image = image.at[i, j].set(jnp.clip(colour, 0, 1))

plt.figure(figsize=(6, 6))
plt.imshow(image)
plt.title('NeRF-style Volume Rendering\n(3 spheres)')
plt.axis('off')
plt.tight_layout(); plt.show()
print(f"Image shape: {image.shape}")
print(f"Rendered {img_h * img_w} rays with 64 samples each")