Probability Distributions¶

Probability distributions describe how random outcomes are spread across possible values. This file catalogues the key discrete and continuous distributions -- Bernoulli, binomial, Poisson, Gaussian, exponential, beta, and more -- giving the formulas, intuitions, and ML applications (loss functions, priors, noise models) for each.

In Chapter 4 we introduced random variables, PMFs, PDFs, and CDFs. Here we catalogue the most important probability distributions you will encounter in ML and statistics, giving the intuition, formula, mean, and variance for each.
Quick recap of the three core functions (see Chapter 4 for full definitions):
- PMF \(P(X = x)\): gives the probability of each discrete outcome. The bars in a bar chart.
- PDF \(f(x)\): gives the density at each point for continuous variables. The area under the curve between two points is the probability.
- CDF \(F(x) = P(X \le x)\): the cumulative probability up to \(x\). Always goes from 0 to 1 and never decreases.
The support of a distribution is the set of values where the PMF or PDF is positive. For a die roll, the support is \(\{1,2,3,4,5,6\}\). For the normal distribution, the support is all real numbers \((-\infty, \infty)\).
Distributions divide cleanly into two families: discrete (countable outcomes, use PMFs) and continuous (uncountable outcomes, use PDFs).
Bernoulli distribution: the simplest distribution. A single trial with two outcomes: success (1) with probability \(p\) and failure (0) with probability \(1-p\).

\[P(X = x) = p^x (1 - p)^{1-x}, \quad x \in \{0, 1\}\]

Mean: \(E[X] = p\). Variance: \(\text{Var}(X) = p(1-p)\).
Every coin flip, every yes/no classification, every binary outcome is a Bernoulli trial. In ML, the output of a sigmoid function is exactly the \(p\) parameter of a Bernoulli distribution.
Binomial distribution: count the number of successes in \(n\) independent Bernoulli trials, each with the same probability \(p\).

\[P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, \ldots, n\]

The binomial coefficient \(\binom{n}{k}\) from file 01 counts how many ways to arrange \(k\) successes among \(n\) trials.
Mean: \(E[X] = np\). Variance: \(\text{Var}(X) = np(1-p)\).

Bernoulli as a single bar chart vs Binomial as a distribution over counts

Example: flip a biased coin (\(p = 0.7\)) eight times. The probability of getting exactly 6 heads is \(\binom{8}{6}(0.7)^6(0.3)^2 = 28 \times 0.1176 \times 0.09 \approx 0.296\).
Poisson distribution: counts the number of events in a fixed interval of time or space, given a known average rate \(\lambda\). Useful when events are rare and independent.

\[P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \ldots\]

Mean: \(E[X] = \lambda\). Variance: \(\text{Var}(X) = \lambda\). The mean equals the variance, which is a signature property.
Examples: emails per hour (\(\lambda = 5\)), typos per page, server requests per second. In ML, Poisson regression models count data where a linear model would predict negative counts.
As \(n \to \infty\) and \(p \to 0\) with \(np = \lambda\) held constant, the Binomial\((n,p)\) converges to Poisson\((\lambda)\). This is why the Poisson works well for rare events in large populations.
Geometric distribution: counts the number of trials until the first success. "How many coins do I flip before I get my first heads?"

\[P(X = k) = (1-p)^{k-1} p, \quad k = 1, 2, 3, \ldots\]

Mean: \(E[X] = 1/p\). Variance: \(\text{Var}(X) = (1-p)/p^2\).
The geometric distribution is memoryless: the probability of waiting \(k\) more trials for success does not depend on how many trials you have already waited. This makes it special among discrete distributions.
Negative Binomial distribution: generalises the geometric by counting trials until the \(r\)-th success (geometric is the special case \(r=1\)).

\[P(X = k) = \binom{k-1}{r-1} p^r (1-p)^{k-r}, \quad k = r, r+1, r+2, \ldots\]

Mean: \(E[X] = r/p\). Variance: \(\text{Var}(X) = r(1-p)/p^2\).
The Negative Binomial is also used in practice to model overdispersed count data (where the variance exceeds the mean), which the Poisson cannot handle.
Now we move to continuous distributions.
Uniform distribution: all values in an interval \([a, b]\) are equally likely. The PDF is a flat rectangle.

\[f(x) = \frac{1}{b - a}, \quad a \le x \le b\]

Mean: \(E[X] = \frac{a+b}{2}\). Variance: \(\text{Var}(X) = \frac{(b-a)^2}{12}\).
Random number generators produce Uniform(0,1) samples as their starting point. Other distributions are generated by transforming these uniform samples.
Normal (Gaussian) distribution: the most important distribution in statistics. It arises naturally from the Central Limit Theorem (see Chapter 4): averages of many independent random variables tend toward a normal distribution regardless of the original distribution.

\[f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)\]

Mean: \(E[X] = \mu\). Variance: \(\text{Var}(X) = \sigma^2\).
The standard normal has \(\mu = 0\) and \(\sigma = 1\). Any normal variable \(X\) can be standardised to a standard normal \(Z\) using \(Z = (X - \mu)/\sigma\).

Bell curve with 68-95-99.7 empirical rule regions shaded

The empirical rule (68-95-99.7 rule) says:
- About 68% of data falls within \(\pm 1\sigma\) of the mean
- About 95% falls within \(\pm 2\sigma\)
- About 99.7% falls within \(\pm 3\sigma\)
In ML, normal distributions appear everywhere: weight initialisation, noise in data augmentation, the assumption behind MSE loss (which implicitly assumes Gaussian errors), and the reparameterisation trick in variational autoencoders.
Exponential distribution: models the time between events in a Poisson process. If events arrive at rate \(\lambda\), the waiting time between them follows Exponential\((\lambda)\).

\[f(x) = \lambda e^{-\lambda x}, \quad x \ge 0\]

Mean: \(E[X] = 1/\lambda\). Variance: \(\text{Var}(X) = 1/\lambda^2\).
Like the geometric distribution for discrete variables, the exponential is memoryless: \(P(X > s + t | X > s) = P(X > t)\). The probability of waiting another \(t\) units does not depend on how long you have already waited.
Gamma distribution: generalises the exponential. It models the time until the \(\alpha\)-th event in a Poisson process (exponential is \(\alpha = 1\)).

\[f(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}, \quad x > 0\]

Here \(\alpha\) (shape) controls the shape and \(\beta\) (rate) controls the scale. \(\Gamma(\alpha)\) is the gamma function, which extends factorials to real numbers: \(\Gamma(n) = (n-1)!\) for positive integers.
Mean: \(E[X] = \alpha/\beta\). Variance: \(\text{Var}(X) = \alpha/\beta^2\).
Beta distribution: defined on the interval \([0, 1]\), making it perfect for modelling probabilities, proportions, and rates.

\[f(x) = \frac{x^{\alpha - 1}(1 - x)^{\beta - 1}}{B(\alpha, \beta)}, \quad 0 \le x \le 1\]

The denominator \(B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}\) is the beta function, a normalising constant.
Mean: \(E[X] = \frac{\alpha}{\alpha + \beta}\). Variance: \(\text{Var}(X) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}\).
The Beta distribution is the conjugate prior for the Bernoulli and Binomial likelihoods. This means if your prior is Beta and your data is Bernoulli, the posterior is also Beta, which makes Bayesian updating analytically tractable. We will use this in file 04.

Four common distribution shapes: Uniform, Exponential, Beta, Poisson

Chi-squared distribution (\(\chi^2\)): if you take \(k\) independent standard normal random variables and sum their squares, the result follows a \(\chi^2\) distribution with \(k\) degrees of freedom.

\[f(x) = \frac{1}{2^{k/2}\Gamma(k/2)} x^{k/2 - 1} e^{-x/2}, \quad x > 0\]

Mean: \(E[X] = k\). Variance: \(\text{Var}(X) = 2k\).
The \(\chi^2\) distribution is actually a special case of the Gamma distribution with \(\alpha = k/2\) and \(\beta = 1/2\). It appears in hypothesis testing (the chi-squared test from Chapter 4), goodness-of-fit tests, and in computing confidence intervals for variance.
Student's t-distribution: looks like a normal distribution but with heavier tails. It arises when you estimate the mean of a normally distributed population using a small sample and the population variance is unknown.

\[f(x) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\!\left(\frac{\nu}{2}\right)} \left(1 + \frac{x^2}{\nu}\right)^{-(\nu+1)/2}\]

The parameter \(\nu\) (nu) is the degrees of freedom. As \(\nu \to \infty\), the t-distribution converges to the standard normal. With small \(\nu\), the heavier tails give more probability to extreme values, reflecting the extra uncertainty from a small sample.
Mean: \(E[X] = 0\) (for \(\nu > 1\)). Variance: \(\text{Var}(X) = \frac{\nu}{\nu - 2}\) (for \(\nu > 2\)).
The t-distribution is used in t-tests (Chapter 4) and shows up in Bayesian inference as a marginal distribution when integrating out unknown variance.
To summarise the key distributions:

Distribution	Type	Support	Mean	Variance
Bernoulli\((p)\)	Discrete	\(\{0,1\}\)	\(p\)	\(p(1-p)\)
Binomial\((n,p)\)	Discrete	\(\{0,\ldots,n\}\)	\(np\)	\(np(1-p)\)
Poisson\((\lambda)\)	Discrete	\(\{0,1,2,\ldots\}\)	\(\lambda\)	\(\lambda\)
Geometric\((p)\)	Discrete	\(\{1,2,3,\ldots\}\)	\(1/p\)	\((1-p)/p^2\)
Uniform\((a,b)\)	Continuous	\([a,b]\)	\((a+b)/2\)	\((b-a)^2/12\)
Normal\((\mu,\sigma^2)\)	Continuous	\((-\infty,\infty)\)	\(\mu\)	\(\sigma^2\)
Exponential\((\lambda)\)	Continuous	\([0,\infty)\)	\(1/\lambda\)	\(1/\lambda^2\)
Gamma\((\alpha,\beta)\)	Continuous	\((0,\infty)\)	\(\alpha/\beta\)	\(\alpha/\beta^2\)
Beta\((\alpha,\beta)\)	Continuous	\([0,1]\)	\(\alpha/(\alpha+\beta)\)	see above
\(\chi^2(k)\)	Continuous	\((0,\infty)\)	\(k\)	\(2k\)
Student's \(t(\nu)\)	Continuous	\((-\infty,\infty)\)	\(0\)	\(\nu/(\nu-2)\)

Coding Tasks (use CoLab or notebook)¶

Plot the Binomial PMF for \(n=20\) with several values of \(p\). Observe how the shape shifts from left-skewed to symmetric to right-skewed.

import jax.numpy as jnp
import matplotlib.pyplot as plt
from math import comb

n = 20
ks = jnp.arange(0, n + 1)

fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharey=True)
for ax, p, color in zip(axes, [0.2, 0.5, 0.8], ["#e74c3c", "#3498db", "#27ae60"]):
    pmf = jnp.array([comb(n, int(k)) * p**k * (1-p)**(n-k) for k in ks])
    ax.bar(ks, pmf, color=color, alpha=0.7)
    ax.set_title(f"Binomial(n={n}, p={p})")
    ax.set_xlabel("k")
axes[0].set_ylabel("P(X = k)")
plt.tight_layout()
plt.show()

Verify the Poisson approximation to the Binomial. Set \(n = 1000\), \(p = 0.003\), and compare Binomial\((n, p)\) with Poisson\((\lambda = np)\).

import jax.numpy as jnp
import matplotlib.pyplot as plt
from math import comb, factorial, exp

n, p = 1000, 0.003
lam = n * p
ks = jnp.arange(0, 15)

binom_pmf = jnp.array([comb(n, int(k)) * p**k * (1-p)**(n-k) for k in ks])
poisson_pmf = jnp.array([lam**k * exp(-lam) / factorial(int(k)) for k in ks])

plt.figure(figsize=(8, 4))
plt.bar(ks - 0.15, binom_pmf, width=0.3, color="#3498db", alpha=0.7, label=f"Binomial({n},{p})")
plt.bar(ks + 0.15, poisson_pmf, width=0.3, color="#e74c3c", alpha=0.7, label=f"Poisson({lam})")
plt.xlabel("k")
plt.ylabel("P(X = k)")
plt.title("Poisson approximation to Binomial")
plt.legend()
plt.show()

Sample from a Normal distribution and verify the empirical rule. Count what fraction of samples fall within 1, 2, and 3 standard deviations.

import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(42)
mu, sigma = 5.0, 2.0
samples = mu + sigma * jax.random.normal(key, shape=(100_000,))

for k in [1, 2, 3]:
    within = jnp.abs(samples - mu) <= k * sigma
    print(f"Within {k}σ: {within.mean():.4f} (expected: {[0.6827, 0.9545, 0.9973][k-1]:.4f})")

Explore the Beta distribution by varying \(\alpha\) and \(\beta\). Plot several shapes and see how the distribution changes from uniform to skewed to concentrated.

import jax
import jax.numpy as jnp
import matplotlib.pyplot as plt

x = jnp.linspace(0.01, 0.99, 200)

def beta_pdf(x, a, b):
    # Unnormalised for shape comparison
    return x**(a-1) * (1-x)**(b-1)

plt.figure(figsize=(10, 5))
params = [(1,1,"Uniform"), (2,5,"Left skew"), (5,2,"Right skew"),
          (5,5,"Symmetric"), (0.5,0.5,"U-shape")]
colors = ["#999", "#e74c3c", "#3498db", "#27ae60", "#9b59b6"]

for (a, b, label), color in zip(params, colors):
    y = beta_pdf(x, a, b)
    y = y / jnp.trapezoid(y, x)  # normalise
    plt.plot(x, y, label=f"α={a}, β={b} ({label})", color=color, linewidth=2)

plt.xlabel("x")
plt.ylabel("Density")
plt.title("Beta distribution shapes")
plt.legend()
plt.grid(alpha=0.3)
plt.show()