Skip to content

Matrix Operations

Matrix operations are the computational engine of deep learning. This file covers matrix addition, scalar multiplication, matrix-vector products, matrix multiplication, element-wise operations, Kronecker products, and broadcasting -- the operations behind every forward pass and gradient update.

  • Matrices can be added and scaled just like vectors.

  • For addition, both matrices must have the same dimensions, and you add element by element:

\[ \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} + \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix} = \begin{bmatrix} 6 & 8 \\ 10 & 12 \end{bmatrix} \]
  • For scalar multiplication, you multiply every element by the scalar:
\[ 3 \times \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} = \begin{bmatrix} 3 & 6 \\ 9 & 12 \end{bmatrix} \]
  • The simplest thing you can do with a matrix is multiply it by a vector. Matrix-vector multiplication \(A\mathbf{x}\) combines the columns of \(A\) using the entries of \(\mathbf{x}\) as weights:
\[ \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \begin{bmatrix} 5 \\ 6 \end{bmatrix} = 5 \begin{bmatrix} 1 \\ 3 \end{bmatrix} + 6 \begin{bmatrix} 2 \\ 4 \end{bmatrix} = \begin{bmatrix} 17 \\ 39 \end{bmatrix} \]
  • This is the core operation in ML. Every neural network layer computes \(A\mathbf{x} + \mathbf{b}\): a matrix times an input vector, plus a bias.

  • The general case is matrix multiplication. Given \(A\) (\(m \times n\)) and \(B\) (\(n \times p\)), the product \(C = AB\) is an \(m \times p\) matrix where each element is a dot product:

\[C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}\]
  • Each entry in the result is the dot product of a row from \(A\) with a column from \(B\). The inner dimensions must match (\(n\)), and the result takes the outer dimensions (\(m \times p\)).

  • Another way to see it: each column of the result is a weighted sum of the columns of \(A\), where the weights come from the corresponding column of \(B\).

  • If \(B\) has column \([2, 3]^T\), the result column is \(2 \times (\text{column 1 of } A) + 3 \times (\text{column 2 of } A)\).

  • A useful special case: multiplying a matrix by its transpose always gives a square matrix. \(AA^T\) is \(m \times m\) and \(A^TA\) is \(n \times n\):

\[ \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} \begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix} = \begin{bmatrix} 14 & 32 \\ 32 & 77 \end{bmatrix} \]
  • Matrix multiplication has important rules:

    • Not commutative: \(AB \neq BA\) in general. The order matters.

    • Associative: \((AB)C = A(BC)\). You can group multiplications however you like.

    • Distributive: \(A(B + C) = AB + AC\).

    • Identity: \(AI = IA = A\).

  • The Hadamard product (element-wise product) multiplies two matrices of the same size entry by entry, written \(A \odot B\):

\[ \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \odot \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix} = \begin{bmatrix} 5 & 12 \\ 21 & 32 \end{bmatrix} \]
  • Unlike standard matrix multiplication, the Hadamard product is commutative (\(A \odot B = B \odot A\)) and requires both matrices to have the same dimensions. It is used heavily in ML for gating: multiplying element-wise by a mask of values between 0 and 1 controls how much of each entry "passes through."

  • The outer product of two vectors \(\mathbf{u}\) and \(\mathbf{v}\) produces a matrix: \(\mathbf{u}\mathbf{v}^T\). Each entry is the product of one element from \(\mathbf{u}\) and one from \(\mathbf{v}\):

\[ \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \begin{bmatrix} 4 & 5 \end{bmatrix} = \begin{bmatrix} 4 & 5 \\ 8 & 10 \\ 12 & 15 \end{bmatrix} \]
  • The result always has rank 1, because every row is a scaled version of \(\mathbf{v}^T\). Any matrix can be written as a sum of rank-1 outer products, which is exactly what SVD does (covered in decompositions).

  • Matrix multiplication is computationally expensive. Multiplying two \(n \times n\) matrices takes \(O(n^3)\) operations. For a \(1000 \times 1000\) matrix, that is a billion multiplications.

  • When matrices are sparse (mostly zeros), naive multiplication wastes time multiplying by zero. The Compressed Sparse Row (CSR) format stores only the nonzero elements along with their positions:

    • Values: the nonzero entries in row order
    • Column indices: which column each value belongs to
    • Row offsets: where each row starts in the values list
  • For example, the matrix:

\[ A = \begin{bmatrix} 5 & 0 & 0 & 2 \\ 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & -1 \end{bmatrix} \]
  • Is stored as: values = [5, 2, 3, -1], columns = [0, 3, 2, 3], row offsets = [0, 2, 3, 4]. This skips all the zeros and makes sparse operations much faster.

  • A core use of matrices is solving systems of linear equations. The system \(A\mathbf{x} = \mathbf{b}\) asks: "what vector \(\mathbf{x}\), when transformed by \(A\), produces \(\mathbf{b}\)?"

  • For example, say you are buying fruit. Apples cost \(x_1\) dollars each and bananas cost \(x_2\) dollars each. You know that 2 apples and 1 banana cost $5, and 1 apple and 3 bananas cost $10. In matrix form:

\[ \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 5 \\ 10 \end{bmatrix} \]
  • Multiplying the matrix by the vector row by row (each row dotted with \([x_1, x_2]^T\)) gives two equations:
\[2x_1 + 1x_2 = 5 \qquad \text{(row 1)} \qquad \qquad x_1 + 3x_2 = 10 \qquad \text{(row 2)}\]
  • From row 1, \(x_2 = 5 - 2x_1\). Substituting into row 2: \(x_1 + 3(5 - 2x_1) = 10\), which gives \(x_1 = 1\), then \(x_2 = 3\). Apples cost $1 and bananas cost $3.

  • Verify — it checks out:

\[ \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix} \begin{bmatrix} 1 \\ 3 \end{bmatrix} = \begin{bmatrix} 2 + 3 \\ 1 + 9 \end{bmatrix} = \begin{bmatrix} 5 \\ 10 \end{bmatrix} \]
  • If \(A\) has an inverse, the solution is simply \(\mathbf{x} = A^{-1}\mathbf{b}\). But computing the inverse directly is expensive and numerically unstable. In practice, we use decompositions instead.

  • Not every matrix is square, and not every square matrix is invertible. The pseudo-inverse \(A^+\) generalises the inverse to any matrix. It always exists and provides the "best possible" inverse:

\[A^+ = (A^TA)^{-1}A^T\]
  • When \(A\) is lower triangular, solving \(L\mathbf{x} = \mathbf{b}\) is easy by forward substitution: solve for \(x_1\) first, then use it to find \(x_2\), and so on down.

  • When \(A\) is upper triangular, solving \(U\mathbf{x} = \mathbf{b}\) works by back substitution: solve for the last variable first, then work upward.

  • This is why decomposing a matrix into triangular factors (as we will see in decompositions) is so useful. It turns a hard problem into two easy ones.

Coding Tasks (use CoLab or notebook)

  1. Multiply two matrices and verify the dimensions. Then swap the order and observe that the result changes (or that it fails if dimensions don't match).

    import jax.numpy as jnp
    
    A = jnp.array([[1.0, 2.0],
                   [3.0, 4.0]])
    B = jnp.array([[5.0, 6.0],
                   [7.0, 8.0]])
    
    print(f"A @ B:\n{A @ B}")
    print(f"B @ A:\n{B @ A}")
    print(f"Equal: {jnp.allclose(A @ B, B @ A)}")
    

  2. Solve a system of linear equations \(A\mathbf{x} = \mathbf{b}\) and verify the solution by multiplying back. Try changing \(\mathbf{b}\) to see how the solution shifts.

    import jax.numpy as jnp
    
    A = jnp.array([[2.0, 1.0],
                   [5.0, 3.0]])
    b = jnp.array([4.0, 7.0])
    
    x = jnp.linalg.solve(A, b)
    print(f"Solution x: {x}")
    print(f"A @ x: {A @ x}")