Multivariate Calculus¶

Multivariate calculus extends derivatives and integrals to functions of many variables, which is essential since ML models have millions of parameters. This file covers partial derivatives, gradients, the Jacobian, the Hessian, and the multivariable chain rule that makes backpropagation possible.

So far, our functions have taken a single input \(x\) and produced a single output \(f(x)\). But in ML, we almost never work with just one variable.
Consider a function of two variables, like \(f(x, y) = x^2 + y^2\). This defines a surface in 3D space, a bowl shape. We want to know: if we nudge \(x\) a little while keeping \(y\) fixed, how does \(f\) change? That is a partial derivative.
The partial derivative of \(f\) with respect to \(x\), written \(\frac{\partial f}{\partial x}\), treats every other variable as a constant and differentiates normally with respect to \(x\).
For \(f(x, y) = x^2y + 3x - 2y\):

\[\frac{\partial f}{\partial x} = 2xy + 3 \qquad \frac{\partial f}{\partial y} = x^2 - 2\]

To compute \(\frac{\partial f}{\partial x}\), we treated \(y\) as a constant, so \(x^2y\) differentiated to \(2xy\), \(3x\) to \(3\), and \(-2y\) to \(0\).
To compute \(\frac{\partial f}{\partial y}\), we treated \(x\) as a constant, so \(x^2y\) differentiated to \(x^2\), \(3x\) to \(0\), and \(-2y\) to \(-2\).
Geometrically, taking a partial derivative with respect to \(x\) is like slicing the 3D surface with a plane parallel to the \(xz\)-plane (at a fixed \(y\) value) and finding the slope of the resulting curve.

Partial derivative: slice the surface by holding one variable constant

The gradient collects all the partial derivatives into a single vector:

\[\nabla f = \left(\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n}\right)\]

For \(f(x, y) = x^2 + y^2\): \(\nabla f(x, y) = (2x, 2y)\). At the point \((1, 2)\): \(\nabla f(1, 2) = (2, 4)\).
The gradient has two key properties:
- Direction: it points in the direction of steepest increase. Imagine a hiker on a mountain. The gradient at their position points straight uphill, along the steepest path.
- Magnitude: \(\|\nabla f\|\) gives the rate of increase in that steepest direction. A large gradient means the terrain is steep; a small gradient means it is nearly flat.

Gradient vectors point uphill, perpendicular to contour lines

Since the gradient points uphill, moving in the opposite direction (\(-\nabla f\)) goes downhill, towards lower values. This simple idea is the basis of gradient descent, an optimisation technique we will explore in detail in later chapters. For now, the key takeaway is that the gradient tells you which way is "up" and how steep the climb is.
The directional derivative generalises partial derivatives. Instead of asking "how does \(f\) change along the \(x\)-axis?", it asks "how does \(f\) change along any direction \(\mathbf{u}\)?" It is computed as the dot product of the gradient with a unit vector:

\[D_{\mathbf{u}} f = \nabla f \cdot \mathbf{u}\]

For \(f(x, y) = x^2 + y^2\) at \((1, 2)\) in the direction of \(\mathbf{v} = (3, 4)\): first normalise to get \(\mathbf{u} = (3/5, 4/5)\), then \(D_{\mathbf{u}} f = (2, 4) \cdot (3/5, 4/5) = 6/5 + 16/5 = 22/5\).
Partial derivatives are special cases of directional derivatives where the direction is along a coordinate axis. If the directional derivative is zero in some direction, the function is flat in that direction at that point.
Contour lines (or level curves) connect points where a function has the same value. For \(f(x, y) = x^2 + y^2\), the contour lines are circles centred at the origin: \(x^2 + y^2 = c\) for different values of \(c\).
Contour lines never cross each other (a point cannot have two different function values).
The gradient is always perpendicular to the contour lines, pointing from lower to higher values.
Closely spaced contour lines indicate steep terrain; widely spaced lines indicate gentle slopes.
So far, our functions produced a single output. But many functions produce multiple outputs. A function \(\mathbf{F}: \mathbb{R}^n \to \mathbb{R}^m\) takes \(n\) inputs and produces \(m\) outputs. The Jacobian matrix organises all the partial derivatives of such a vector-valued function:

\[ J = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix} \]

Each row of the Jacobian is the gradient of one output component. For a function with 3 inputs and 2 outputs, the Jacobian is a \(2 \times 3\) matrix.
The Jacobian generalises the derivative to vector-valued functions.
Just as the derivative of a scalar function tells you how much the output changes per unit input change, the Jacobian tells you how each output changes with respect to each input.
The determinant of the Jacobian measures how much a transformation locally stretches or compresses space.
If the determinant is 2, small regions double in area. If it is 0, the transformation squashes space to a lower dimension (recall from our chapter on matrices that a zero determinant means a singular, non-invertible transformation).
When several transformations are composed (one feeding into the next), the Jacobian of the overall mapping is the product of the individual Jacobians. We will see this idea become central in later chapters.
Where the gradient captures first-order information (slopes), the Hessian matrix captures second-order information (curvature).
For a scalar function \(f(x_1, \ldots, x_n)\), the Hessian is the \(n \times n\) matrix of all second partial derivatives:

\[ H = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix} \]

For \(f(x, y) = x^3 + 2xy^2 - y^3\), the gradient is \((3x^2 + 2y^2,\; 4xy - 3y^2)\), and the Hessian is:

\[ H = \begin{bmatrix} 6x & 4y \\ 4y & 4x - 6y \end{bmatrix} \]

The diagonal entries (\(6x\) and \(4x - 6y\)) tell you how the slope in the \(x\)-direction changes as you move in \(x\), and similarly for \(y\).
The off-diagonal entries (\(4y\)) tell you how the slope in one direction changes as you move in the other direction.
Clairaut's theorem guarantees that for functions with continuous second derivatives, the mixed partial derivatives are equal: \(\frac{\partial^2 f}{\partial x \partial y} = \frac{\partial^2 f}{\partial y \partial x}\).
This means the Hessian is symmetric, which (as we saw in the matrices chapter) guarantees real eigenvalues and orthogonal eigenvectors.
The Hessian tells us about the shape of the function near a critical point (where the gradient is zero):
- If \(H\) is positive definite (all eigenvalues positive), the point is a local minimum, the surface curves upward in every direction like a bowl.
- If \(H\) is negative definite (all eigenvalues negative), the point is a local maximum, the surface curves downward like an inverted bowl.
- If \(H\) has both positive and negative eigenvalues, the point is a saddle point, the surface curves up in some directions and down in others, like a mountain pass.
The multivariate chain rule extends the chain rule to functions of several variables. If \(z = f(x, y)\) where \(x = g(t)\) and \(y = h(t)\), then:

\[\frac{dz}{dt} = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt}\]

Each path from \(t\) to \(z\) contributes a term: the partial derivative along that path times the derivative of the intermediate variable with respect to \(t\).
For example, if \(z = x^2 y + 3x - y^2\), \(x = \cos(t)\), \(y = \sin(t)\):

\[\frac{dz}{dt} = (2xy + 3)(-\sin t) + (x^2 - 2y)(\cos t)\]

Beyond computing derivatives by hand, there are three approaches:
- Numerical differentiation: approximate \(f'(x) \approx \frac{f(x+h) - f(x-h)}{2h}\) for small \(h\). Simple but noisy and inaccurate.
- Symbolic differentiation: apply differentiation rules algebraically to produce an exact formula. Can produce expressions that grow exponentially large.
- Automatic differentiation (autodiff): tracks the chain of operations and computes exact derivatives efficiently. This is what JAX, PyTorch, and TensorFlow use. It gives exact numerical values (not approximate) without producing bloated symbolic expressions.

Coding Tasks (use CoLab or notebook)¶

Compute the gradient of \(f(x, y) = x^2 y + 3x - 2y\) at the point \((1, 2)\) using jax.grad. Since \(f\) takes a vector input, use jax.grad with argnums.

import jax
import jax.numpy as jnp

def f(x, y):
    return x**2 * y + 3*x - 2*y

df_dx = jax.grad(f, argnums=0)
df_dy = jax.grad(f, argnums=1)

x, y = 1.0, 2.0
print(f"∂f/∂x = {df_dx(x, y):.4f}  (expected: {2*x*y + 3:.4f})")
print(f"∂f/∂y = {df_dy(x, y):.4f}  (expected: {x**2 - 2:.4f})")

Compute the Jacobian of a vector-valued function using jax.jacobian. Compare with manual calculation.

import jax
import jax.numpy as jnp

def F(x):
    return jnp.array([x[0]**2 + x[1], x[0] * x[1]**2])

J = jax.jacobian(F)
x = jnp.array([1.0, 2.0])
print(f"Jacobian at (1,2):\n{J(x)}")
# Expected: [[2*x[0], 1], [x[1]**2, 2*x[0]*x[1]]] = [[2, 1], [4, 4]]

Compute the Hessian of \(f(x, y) = x^3 + 2xy^2 - y^3\) using jax.hessian and verify it is symmetric.

import jax
import jax.numpy as jnp

def f(xy):
    x, y = xy[0], xy[1]
    return x**3 + 2*x*y**2 - y**3

H = jax.hessian(f)
point = jnp.array([1.0, 2.0])
hess = H(point)
print(f"Hessian:\n{hess}")
print(f"Symmetric: {jnp.allclose(hess, hess.T)}")
# Expected: [[6x, 4y], [4y, 4x-6y]] = [[6, 8], [8, -8]]

Build a minimal autodiff engine from scratch.

Each Var tracks its value and how to propagate gradients backward through the chain rule.
Try extending it with more operations (division, power, etc.).

This is the foundations of how JAX, PyTorch and Numpy were designed.

class Var:
    def __init__(self, val, children=(), backward_fn=None):
        self.val = val
        self.grad = 0.0
        self.children = children
        self.backward_fn = backward_fn

    def __add__(self, other):
        out = Var(self.val + other.val, children=(self, other))
        def _backward():
            self.grad += out.grad    # d(a+b)/da = 1
            other.grad += out.grad   # d(a+b)/db = 1
        out.backward_fn = _backward
        return out

    def __mul__(self, other):
        out = Var(self.val * other.val, children=(self, other))
        def _backward():
            self.grad += other.val * out.grad  # d(a*b)/da = b
            other.grad += self.val * out.grad  # d(a*b)/db = a
        out.backward_fn = _backward
        return out

    def backward(self):
        # topological sort then propagate gradients
        # we will go through this in data structures and algorithms
        order, visited = [], set()
        def topo(v):
            if v not in visited:
                visited.add(v)
                for c in v.children:
                    topo(c)
                order.append(v)
        topo(self)
        self.grad = 1.0
        for v in reversed(order):
            if v.backward_fn:
                v.backward_fn()

# f(x, y) = x*x*y + x  at (3, 2)
x = Var(3.0)
y = Var(2.0)
f = x * x * y + x       # = 3*3*2 + 3 = 21

f.backward()
print(f"f = {f.val}")           # 21.0
print(f"df/dx = {x.grad}")     # 2*x*y + 1 = 13.0
print(f"df/dy = {y.grad}")     # x*x = 9.0