Matrix Operations¶
Matrix operations are the computational engine of deep learning. This file covers matrix addition, scalar multiplication, matrix-vector products, matrix multiplication, element-wise operations, Kronecker products, and broadcasting -- the operations behind every forward pass and gradient update.
Matrices can be added and scaled just like vectors.
For addition, both matrices must have the same dimensions, and you add element by element: \((A + B)_{ij} = A_{ij} + B_{ij}\).
For scalar multiplication, you multiply every element by the scalar: \((cA)_{ij} = c\,A_{ij}\).
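A minimal NumPy sketch of both operations (the matrices are made up for illustration):

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[10, 20],
              [30, 40]])

print(A + B)   # element-wise sum: [[11, 22], [33, 44]]
print(3 * A)   # every element scaled by 3: [[3, 6], [9, 12]]
```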
The simplest thing you can do with a matrix is multiply it by a vector. Matrix-vector multiplication \(A\mathbf{x}\) combines the columns of \(A\) using the entries of \(\mathbf{x}\) as weights: \(A\mathbf{x} = x_1\mathbf{a}_1 + x_2\mathbf{a}_2 + \cdots + x_n\mathbf{a}_n\), where \(\mathbf{a}_i\) is the \(i\)-th column of \(A\).
This is the core operation in ML. Every neural network layer computes \(A\mathbf{x} + \mathbf{b}\): a matrix times an input vector, plus a bias.
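As a sketch, here is a single layer-style computation in NumPy (the weights, input, and bias are invented for illustration):

```python
import numpy as np

A = np.array([[1.0, 2.0, 0.0],
              [0.5, -1.0, 3.0]])   # 2x3 weight matrix
x = np.array([1.0, 2.0, 3.0])      # input vector
b = np.array([0.1, -0.2])          # bias

y = A @ x + b                      # matrix-vector product plus bias
print(y)                           # [5.1, 7.3]

# Same result seen as a weighted sum of the columns of A
print(x[0] * A[:, 0] + x[1] * A[:, 1] + x[2] * A[:, 2] + b)
```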
The general case is matrix multiplication. Given \(A\) (\(m \times n\)) and \(B\) (\(n \times p\)), the product \(C = AB\) is an \(m \times p\) matrix where each element is a dot product: \(C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}\).
Each entry in the result is the dot product of a row from \(A\) with a column from \(B\). The inner dimensions must match (\(n\)), and the result takes the outer dimensions (\(m \times p\)).
Another way to see it: each column of the result is a weighted sum of the columns of \(A\), where the weights come from the corresponding column of \(B\).
If \(B\) has column \([2, 3]^T\), the result column is \(2 \times (\text{column 1 of } A) + 3 \times (\text{column 2 of } A)\).
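A small sketch checking both views of the product (the matrices are chosen arbitrarily, with the first column of \(B\) equal to \([2, 3]^T\)):

```python
import numpy as np

A = np.array([[1, 4],
              [2, 5],
              [3, 6]])    # 3x2
B = np.array([[2, 0],
              [3, 1]])    # 2x2

C = A @ B                 # inner dimensions (2) match, result is 3x2
print(C)

# Column view: the first column of C is 2*(column 1 of A) + 3*(column 2 of A)
print(2 * A[:, 0] + 3 * A[:, 1])   # equals C[:, 0]
```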
A useful special case: multiplying a matrix by its transpose always gives a square matrix. \(AA^T\) is \(m \times m\) and \(A^TA\) is \(n \times n\).
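A quick shape check in NumPy (the matrix is random, purely for illustration):

```python
import numpy as np

A = np.random.randn(4, 3)   # m = 4, n = 3

print((A @ A.T).shape)      # (4, 4) -- m x m
print((A.T @ A).shape)      # (3, 3) -- n x n
```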
Matrix multiplication has important rules:

- Not commutative: \(AB \neq BA\) in general. The order matters.
- Associative: \((AB)C = A(BC)\). You can group multiplications however you like.
- Distributive: \(A(B + C) = AB + AC\).
- Identity: \(AI = IA = A\).
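These rules are easy to check numerically; a sketch with random \(3 \times 3\) matrices (allowing for floating-point rounding):

```python
import numpy as np

rng = np.random.default_rng(0)
A, B, C = (rng.standard_normal((3, 3)) for _ in range(3))
I = np.eye(3)

print(np.allclose(A @ B, B @ A))                         # almost surely False: not commutative
print(np.allclose((A @ B) @ C, A @ (B @ C)))             # True: associative
print(np.allclose(A @ (B + C), A @ B + A @ C))           # True: distributive
print(np.allclose(A @ I, A) and np.allclose(I @ A, A))   # True: identity
```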
The Hadamard product (element-wise product) multiplies two matrices of the same size entry by entry, written \(A \odot B\): \((A \odot B)_{ij} = A_{ij} B_{ij}\).
Unlike standard matrix multiplication, the Hadamard product is commutative (\(A \odot B = B \odot A\)) and requires both matrices to have the same dimensions. It is used heavily in ML for gating: multiplying element-wise by a mask of values between 0 and 1 controls how much of each entry "passes through."
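In NumPy the `*` operator on arrays is exactly this element-wise product; a small sketch with a made-up 0/1 gating mask:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[0.5, 0.0],
              [1.0, 2.0]])

print(A * B)      # Hadamard product: [[0.5, 0.0], [3.0, 8.0]]

# Gating: a mask of 0s and 1s controls which entries pass through
mask = np.array([[1.0, 0.0],
                 [0.0, 1.0]])
print(A * mask)   # [[1.0, 0.0], [0.0, 4.0]]
```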
The outer product of two vectors \(\mathbf{u}\) and \(\mathbf{v}\) produces a matrix: \(\mathbf{u}\mathbf{v}^T\). Each entry is the product of one element from \(\mathbf{u}\) and one from \(\mathbf{v}\): \((\mathbf{u}\mathbf{v}^T)_{ij} = u_i v_j\).
The result always has rank 1, because every row is a scaled version of \(\mathbf{v}^T\). Any matrix can be written as a sum of rank-1 outer products, which is exactly what SVD does (covered in decompositions).
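A short sketch with two small vectors (values invented for illustration):

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0])

M = np.outer(u, v)                # 3x2 matrix with entry (i, j) = u[i] * v[j]
print(M)
print(np.linalg.matrix_rank(M))   # 1: every row is a multiple of v
```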
Matrix multiplication is computationally expensive. Multiplying two \(n \times n\) matrices takes \(O(n^3)\) operations. For a \(1000 \times 1000\) matrix, that is a billion multiplications.
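A rough timing sketch; the exact numbers depend on your machine and BLAS library, but in the \(O(n^3)\) model doubling \(n\) costs roughly \(2^3 = 8\) times more work:

```python
import time
import numpy as np

for n in (500, 1000):
    A = np.random.randn(n, n)
    B = np.random.randn(n, n)
    start = time.perf_counter()
    A @ B
    print(n, time.perf_counter() - start)
```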
When matrices are sparse (mostly zeros), naive multiplication wastes time multiplying by zero. The Compressed Sparse Row (CSR) format stores only the nonzero elements along with their positions:
- Values: the nonzero entries in row order
- Column indices: which column each value belongs to
- Row offsets: where each row starts in the values list
For example, the matrix

\[
\begin{bmatrix} 5 & 0 & 0 & 2 \\ 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & -1 \end{bmatrix}
\]

is stored as: values = [5, 2, 3, -1], columns = [0, 3, 2, 3], row offsets = [0, 2, 3, 4]. This skips all the zeros and makes sparse operations much faster.
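SciPy's sparse module provides this format directly; a sketch using the same matrix:

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([[5, 0, 0, 2],
                  [0, 0, 3, 0],
                  [0, 0, 0, -1]])

S = csr_matrix(dense)
print(S.data)      # [ 5  2  3 -1]  -- nonzero values in row order
print(S.indices)   # [0 3 2 3]      -- column index of each value
print(S.indptr)    # [0 2 3 4]      -- row offsets into the values list
```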
A core use of matrices is solving systems of linear equations. The system \(A\mathbf{x} = \mathbf{b}\) asks: "what vector \(\mathbf{x}\), when transformed by \(A\), produces \(\mathbf{b}\)?"
For example, say you are buying fruit. Apples cost \(x_1\) dollars each and bananas cost \(x_2\) dollars each. You know that 2 apples and 1 banana cost $5, and 1 apple and 3 bananas cost $10. In matrix form:

\[
\begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 5 \\ 10 \end{bmatrix}
\]

Multiplying the matrix by the vector row by row (each row dotted with \([x_1, x_2]^T\)) gives two equations: \(2x_1 + x_2 = 5\) and \(x_1 + 3x_2 = 10\).
From row 1, \(x_2 = 5 - 2x_1\). Substituting into row 2: \(x_1 + 3(5 - 2x_1) = 10\), which gives \(x_1 = 1\), then \(x_2 = 3\). Apples cost $1 and bananas cost $3.
Verify by multiplying back: \(2(1) + 1(3) = 5\) and \(1(1) + 3(3) = 10\), so it checks out.
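The same system solved numerically, as a minimal sketch:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([5.0, 10.0])

x = np.linalg.solve(A, b)   # solves A x = b without forming the inverse
print(x)                    # [1. 3.]  -- apples cost $1, bananas cost $3
print(A @ x)                # [ 5. 10.] -- multiplies back to b
```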
If \(A\) has an inverse, the solution is simply \(\mathbf{x} = A^{-1}\mathbf{b}\). But computing the inverse directly is expensive and numerically unstable. In practice, we use decompositions instead.
Not every matrix is square, and not every square matrix is invertible. The pseudo-inverse \(A^+\) generalises the inverse to any matrix. It always exists and provides the "best possible" inverse: \(A^+\mathbf{b}\) is the least-squares solution of \(A\mathbf{x} = \mathbf{b}\), and the minimum-norm one when many solutions exist.
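A sketch with a tall \(3 \times 2\) matrix (values made up), comparing the pseudo-inverse with NumPy's least-squares routine:

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])   # 3x2: not square, no ordinary inverse
b = np.array([1.0, 2.0, 4.0])

x = np.linalg.pinv(A) @ b    # least-squares solution to A x ~= b
print(x)                     # approximately [1.33, 2.33]
print(np.linalg.lstsq(A, b, rcond=None)[0])   # same answer
```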
When \(A\) is lower triangular, solving \(L\mathbf{x} = \mathbf{b}\) is easy by forward substitution: solve for \(x_1\) first, then use it to find \(x_2\), and so on down.
When \(A\) is upper triangular, solving \(U\mathbf{x} = \mathbf{b}\) works by back substitution: solve for the last variable first, then work upward.
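A minimal forward-substitution sketch in NumPy (the triangular system is invented; back substitution is the mirror image, and `scipy.linalg.solve_triangular` provides both):

```python
import numpy as np

def forward_substitution(L, b):
    """Solve L x = b for a lower-triangular L, one entry at a time."""
    n = len(b)
    x = np.zeros(n)
    for i in range(n):
        x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
    return x

L = np.array([[2.0, 0.0, 0.0],
              [1.0, 3.0, 0.0],
              [4.0, -1.0, 2.0]])
b = np.array([4.0, 5.0, 6.0])

x = forward_substitution(L, b)
print(x)                       # [ 2.   1.  -0.5]
print(np.allclose(L @ x, b))   # True
```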
This is why decomposing a matrix into triangular factors (as we will see in decompositions) is so useful. It turns a hard problem into two easy ones.
Coding Tasks (use Colab or a notebook)¶
- Multiply two matrices and verify the dimensions. Then swap the order and observe that the result changes (or that it fails if dimensions don't match).
- Solve a system of linear equations \(A\mathbf{x} = \mathbf{b}\) and verify the solution by multiplying back. Try changing \(\mathbf{b}\) to see how the solution shifts.