Probability Concepts¶
Probability theory formalises uncertainty and provides the rules for reasoning under it. This file covers sample spaces, events, axioms of probability, conditional probability, independence, Bayes' theorem, and the frequentist vs. Bayesian interpretations -- the mathematical framework behind every generative and discriminative model in ML.
-
Probability assigns a number between 0 and 1 to an event, measuring how likely it is to happen.
-
A probability of 0 means impossible, 1 means certain, and 0.5 means even odds, like a fair coin toss.
-
There are two main interpretations. The frequentist view says probability is the long-run relative frequency: flip a fair coin 10,000 times and heads will appear roughly 50% of the time.
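This long-run frequency is easy to watch converge in a quick simulation (a sketch using Python's standard library; the seed is arbitrary):

```python
import random

random.seed(0)

# Flip a fair coin n times and report the relative frequency of heads.
# As n grows, the frequency settles towards 0.5.
for n in [10, 100, 1_000, 10_000]:
    flips = [random.random() < 0.5 for _ in range(n)]
    freq = sum(flips) / n
    print(f"n = {n:>6}: relative frequency of heads = {freq:.3f}")
```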
-
The Bayesian view says probability is a degree of belief: you might say there is a 70% chance it rains tomorrow, even though tomorrow only happens once.
-
Both interpretations use the same mathematical rules. The difference is philosophical, but it matters in ML. Frequentist methods give you point estimates. Bayesian methods give you full distributions over parameters.
-
The sample space \(S\) is the set of all possible outcomes of an experiment. Flip a coin: \(S = \{H, T\}\). Roll a die: \(S = \{1, 2, 3, 4, 5, 6\}\).
-
An event is any subset of the sample space. "Rolling an even number" is the event \(A = \{2, 4, 6\}\), which is a subset of \(S\).
-
The probability of an event when all outcomes are equally likely is simply counting (from file 01): \(P(A) = \frac{|A|}{|S|}\), the number of outcomes in \(A\) divided by the number of outcomes in \(S\).
- For the even-number example: \(P(\text{even}) = \frac{3}{6} = 0.5\).
- The complement of event \(A\), written \(A'\) or \(A^c\), is everything in \(S\) that is not in \(A\). Since every outcome is either in \(A\) or not: \(P(A) + P(A^c) = 1\), so \(P(A^c) = 1 - P(A)\).
-
Complements are often the easier route. Instead of counting all the ways to get at least one head in 5 coin flips, count the one way to get no heads and subtract: \(P(\text{at least one head}) = 1 - P(\text{all tails}) = 1 - (0.5)^5 = 0.969\).
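The shortcut is easy to verify by brute force, since 5 flips have only \(2^5 = 32\) outcomes (a minimal enumeration sketch):

```python
from itertools import product

# Enumerate all 2^5 = 32 outcomes of five coin flips.
outcomes = list(product("HT", repeat=5))
p_at_least_one_head = sum(1 for o in outcomes if "H" in o) / len(outcomes)
p_via_complement = 1 - 0.5 ** 5

print(p_at_least_one_head)  # 0.96875
print(p_via_complement)     # 0.96875
```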
-
Two events are mutually exclusive (disjoint) if they cannot both happen: \(A \cap B = \emptyset\). Rolling a 2 and rolling a 5 on a single die are mutually exclusive.
-
The addition rule for mutually exclusive events is straightforward: \(P(A \cup B) = P(A) + P(B)\).
- When events can overlap, you need the general addition rule to avoid double-counting the intersection: \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\).
-
This mirrors the inclusion-exclusion principle from counting. A Venn diagram shows why: the intersection gets counted once in \(P(A)\) and again in \(P(B)\), so we subtract it once.
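Inclusion-exclusion can be checked by treating events as plain sets of outcomes. A minimal sketch on one die, with the illustrative choices \(A\) = even and \(B\) = at least 4:

```python
S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # even
B = {4, 5, 6}   # at least 4

def p(event):
    # Equally likely outcomes: probability is just counting.
    return len(event) / len(S)

lhs = p(A | B)                  # direct: P(A ∪ B)
rhs = p(A) + p(B) - p(A & B)    # inclusion-exclusion
print(lhs, rhs)  # both 2/3
```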
-
Joint probability \(P(A \cap B)\) is the probability that both \(A\) and \(B\) occur. In a deck of cards, \(P(\text{red} \cap \text{king}) = \frac{2}{52}\) because there are 2 red kings.
-
Marginal probability is the probability of a single event regardless of others. \(P(\text{red}) = \frac{26}{52} = 0.5\) is a marginal probability. If you have a joint distribution over two variables, the marginal is obtained by summing (or integrating) over the other variable.
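Marginalisation is just summation over the joint table. A toy sketch for one card draw (colour by king-or-not):

```python
# Joint distribution over (colour, rank-class) for a single draw from 52 cards.
joint = {
    ("red", "king"): 2 / 52,
    ("red", "other"): 24 / 52,
    ("black", "king"): 2 / 52,
    ("black", "other"): 24 / 52,
}

# Marginal P(red): sum the joint over everything that is not colour.
p_red = sum(prob for (colour, _), prob in joint.items() if colour == "red")
print(p_red)  # 26/52
```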
-
Conditional probability answers: given that \(B\) has already happened, what is the probability of \(A\)? We shrink the sample space from \(S\) down to \(B\), and ask what fraction of \(B\) also belongs to \(A\): \(P(A | B) = \frac{P(A \cap B)}{P(B)}\), defined whenever \(P(B) > 0\).
-
Example: you draw a card and someone tells you it is red. What is the probability it is a king? There are 26 red cards and 2 of them are kings, so \(P(\text{king} | \text{red}) = \frac{2}{26} = \frac{1}{13}\). Using the formula: \(P(\text{king} \cap \text{red}) / P(\text{red}) = \frac{2/52}{26/52} = \frac{1}{13}\).
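The same computation can be done by literally shrinking the sample space: build the deck, filter to the red cards, and count kings inside the filtered set (an illustrative sketch; the rank and colour labels are arbitrary):

```python
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
colours = ["red", "red", "black", "black"]  # hearts/diamonds vs clubs/spades

deck = [(rank, colour) for rank in ranks for colour in colours]  # 52 cards

# Condition on "red": shrink the sample space to the 26 red cards.
red = [card for card in deck if card[1] == "red"]
p_king_given_red = sum(1 for rank, _ in red if rank == "K") / len(red)
print(p_king_given_red)  # 2/26 = 1/13
```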
-
Two events are independent if knowing one happened tells you nothing about the other. Formally: \(P(A \cap B) = P(A) \cdot P(B)\).
-
Equivalently, \(P(A | B) = P(A)\). Two flips of separate coins are independent events. Two cards drawn without replacement are not independent (the first draw changes what remains).
-
Independence is a massive simplifier. For independent events, joint probabilities factor into products, which makes computation tractable. Many ML models assume independence between features (e.g. Naive Bayes) precisely because of this simplification.
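A sketch of why factorisation matters: with \(n\) independent binary features, the joint is a product of \(n\) marginals instead of a table of \(2^n\) entries (the marginals below are made-up numbers):

```python
# Illustrative marginals P(x_i = 1) for three independent binary features.
p = [0.9, 0.2, 0.5]

def joint(x, p):
    """P(x) under independence: the product of per-feature marginals."""
    prob = 1.0
    for xi, pi in zip(x, p):
        prob *= pi if xi == 1 else 1 - pi
    return prob

print(joint((1, 0, 1), p))  # 0.9 * 0.8 * 0.5
```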
-
The multiplication rule for any two events rearranges the conditional probability formula: \(P(A \cap B) = P(A | B) \cdot P(B) = P(B | A) \cdot P(A)\).
-
For independent events, this simplifies to \(P(A \cap B) = P(A) \cdot P(B)\) since the conditional equals the marginal.
-
Bayes' theorem is one of the most important results in probability and the foundation of Bayesian ML. It lets you reverse the direction of a conditional probability: \(P(A | B) = \frac{P(B | A) \cdot P(A)}{P(B)}\).
- The theorem follows directly from writing \(P(A \cap B)\) two ways: \(P(B|A) \cdot P(A) = P(A|B) \cdot P(B)\), then solving for \(P(A|B)\).
-
Each component has a name:
- Prior \(P(A)\): your initial belief before seeing evidence
- Likelihood \(P(B|A)\): how probable the evidence is, assuming \(A\) is true
- Evidence \(P(B)\): the total probability of seeing the evidence, acts as a normaliser
- Posterior \(P(A|B)\): your updated belief after seeing the evidence
-
Let us work through the classic medical diagnosis example. Suppose a disease affects 1% of the population. A test correctly identifies 95% of sick people (95% sensitivity) and correctly identifies 90% of healthy people (90% specificity).
-
You test positive. What is the probability you actually have the disease?
-
Let \(D\) = having the disease, \(+\) = testing positive.
- Prior: \(P(D) = 0.01\)
- Likelihood: \(P(+ | D) = 0.95\)
- False positive rate: \(P(+ | D') = 0.10\)
-
We need \(P(+)\). By the law of total probability: \(P(+) = P(+ | D) \cdot P(D) + P(+ | D') \cdot P(D') = 0.95 \times 0.01 + 0.10 \times 0.99 = 0.1085\).
- Now apply Bayes' theorem: \(P(D | +) = \frac{P(+ | D) \cdot P(D)}{P(+)} = \frac{0.0095}{0.1085} \approx 0.0876\).
-
Despite the test's 95% sensitivity, a positive result only gives you about an 8.8% chance of having the disease. The prior matters enormously. Because the disease is rare, most positive results are false positives. This is a crucial insight for any classification problem in ML: when classes are imbalanced, accuracy alone is misleading.
-
The law of total probability partitions the sample space into mutually exclusive, exhaustive events \(B_1, B_2, \ldots, B_n\) and expresses any event \(A\) as: \(P(A) = \sum_{i=1}^{n} P(A | B_i) \cdot P(B_i)\).
-
This is exactly what we used to compute \(P(+)\) in the medical example: we split the population into "has disease" and "does not have disease."
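The arithmetic itself is a few lines of code (no simulation, just the formulas with the numbers from the example):

```python
p_d = 0.01                # prior P(D): disease prevalence
p_pos_given_d = 0.95      # sensitivity P(+|D)
p_pos_given_not_d = 0.10  # false positive rate P(+|D') = 1 - specificity

# Law of total probability: P(+) = P(+|D)P(D) + P(+|D')P(D')
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)

# Bayes' theorem: P(D|+) = P(+|D)P(D) / P(+)
p_d_given_pos = p_pos_given_d * p_d / p_pos
print(f"P(+)   = {p_pos:.4f}")          # 0.1085
print(f"P(D|+) = {p_d_given_pos:.4f}")  # 0.0876
```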
-
The chain rule of probability generalises the multiplication rule to any number of events: \(P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1) \cdot P(A_2 | A_1) \cdot P(A_3 | A_1 \cap A_2) \cdots P(A_n | A_1 \cap \cdots \cap A_{n-1})\).
-
Each factor conditions on everything that came before. This is the backbone of autoregressive language models: the probability of a sentence is the product of each word's probability given all previous words.
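A toy sketch of that factorisation, with made-up conditional tables standing in for a trained language model:

```python
# Illustrative (made-up) conditional probability tables.
p_w1 = {"the": 0.6, "a": 0.4}
p_w2 = {("the",): {"cat": 0.7, "dog": 0.3}}
p_w3 = {("the", "cat"): {"sat": 0.8, "ran": 0.2}}

sentence = ("the", "cat", "sat")

# Chain rule: P(w1, w2, w3) = P(w1) · P(w2|w1) · P(w3|w1, w2)
prob = (
    p_w1[sentence[0]]
    * p_w2[sentence[:1]][sentence[1]]
    * p_w3[sentence[:2]][sentence[2]]
)
print(prob)  # 0.6 * 0.7 * 0.8
```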
-
Conditional independence means two events are independent given a third. \(A\) and \(B\) are conditionally independent given \(C\) if: \(P(A \cap B | C) = P(A | C) \cdot P(B | C)\).
-
Events can be marginally dependent but conditionally independent, or vice versa. For example, two students' exam scores may be correlated (both depend on the difficulty of the exam), but given the exam difficulty, their scores are independent.
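The exam example can be simulated directly (a sketch with made-up Gaussian noise; conditioning on difficulty is approximated by restricting to a narrow band around one value):

```python
import random

random.seed(1)
n = 100_000

# Latent exam difficulty drives both students' scores.
samples = []
for _ in range(n):
    difficulty = random.gauss(0, 1)
    score1 = difficulty + random.gauss(0, 1)  # shared cause + own noise
    score2 = difficulty + random.gauss(0, 1)
    samples.append((difficulty, score1, score2))

def corr(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    vx = sum((x - mx) ** 2 for x in xs) / len(xs)
    vy = sum((y - my) ** 2 for y in ys) / len(ys)
    return cov / (vx * vy) ** 0.5

# Marginally, the two scores are correlated (about 0.5 in theory)...
c_marginal = corr([s1 for _, s1, _ in samples], [s2 for _, _, s2 in samples])

# ...but at near-fixed difficulty, the dependence vanishes (about 0).
band = [(s1, s2) for d, s1, s2 in samples if abs(d) < 0.1]
c_conditional = corr([s1 for s1, _ in band], [s2 for _, s2 in band])
print(f"marginal corr:    {c_marginal:.3f}")
print(f"conditional corr: {c_conditional:.3f}")
```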
-
Conditional independence is the key assumption behind graphical models like Bayesian networks. It lets you factorise complex joint distributions into manageable pieces, making inference computationally feasible.
Coding Tasks (use Colab or a notebook)¶
-
Simulate the medical diagnosis problem. Generate a population of 100,000 people, apply the disease prevalence and test accuracy, and verify that Bayes' theorem gives the correct posterior.
```python
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(42)
n = 100_000

# Generate population
k1, k2 = jax.random.split(key)
has_disease = jax.random.bernoulli(k1, p=0.01, shape=(n,))

# Generate test results
k3, k4 = jax.random.split(k2)
# Sensitivity: P(+|D) = 0.95, Specificity: P(-|D') = 0.90
test_positive = jnp.where(
    has_disease,
    jax.random.bernoulli(k3, p=0.95, shape=(n,)),
    jax.random.bernoulli(k4, p=0.10, shape=(n,)),
)

# Among those who tested positive, what fraction actually has the disease?
positives = test_positive.astype(bool)
true_positives = (has_disease & positives).sum()
total_positives = positives.sum()
print(f"Total positive tests: {total_positives}")
print(f"True positives: {true_positives}")
print(f"P(Disease | Positive) = {true_positives / total_positives:.4f}")
print(f"Bayes' formula: {0.95 * 0.01 / 0.1085:.4f}")
```
-
Verify the addition rule by simulation. Generate random events A and B with known probabilities and overlap, then check that \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\).
```python
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
n = 200_000

k1, k2 = jax.random.split(key)
# Two independent uniforms define the events: A = first < 0.4, B = second < 0.6
vals_a = jax.random.uniform(k1, shape=(n,))
vals_b = jax.random.uniform(k2, shape=(n,))
A = vals_a < 0.4
B = vals_b < 0.6

p_a = A.mean()
p_b = B.mean()
p_a_and_b = (A & B).mean()
p_a_or_b = (A | B).mean()
print(f"P(A) = {p_a:.4f}")
print(f"P(B) = {p_b:.4f}")
print(f"P(A ∩ B) = {p_a_and_b:.4f}")
print(f"P(A ∪ B) simulated = {p_a_or_b:.4f}")
print(f"P(A) + P(B) - P(A∩B) = {p_a + p_b - p_a_and_b:.4f}")
```
-
Demonstrate how conditioning on evidence affects probabilities. Simulate rolling two dice and compute \(P(\text{sum} = 7)\), then \(P(\text{sum} = 7 | \text{first die} = 3)\) — these turn out to be equal, because a sum of 7 is independent of the first die (every first-die value leaves exactly one completing second-die value). Then repeat with \(\text{sum} = 8\), where conditioning does shift the probability.
```python
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(1)
n = 500_000

k1, k2 = jax.random.split(key)
d1 = jax.random.randint(k1, shape=(n,), minval=1, maxval=7)
d2 = jax.random.randint(k2, shape=(n,), minval=1, maxval=7)
total = d1 + d2

# Unconditional
p_sum7 = (total == 7).mean()
print(f"P(sum=7) = {p_sum7:.4f} (exact: {6/36:.4f})")

# Conditional on first die = 3: unchanged, since sum=7 is independent of d1
mask = d1 == 3
p_sum7_given_d1_3 = (total[mask] == 7).mean()
print(f"P(sum=7 | d1=3) = {p_sum7_given_d1_3:.4f} (exact: {1/6:.4f})")

# Sum = 8 is NOT independent of the first die: conditioning shifts it
p_sum8 = (total == 8).mean()
p_sum8_given_d1_3 = (total[mask] == 8).mean()
print(f"P(sum=8) = {p_sum8:.4f} (exact: {5/36:.4f})")
print(f"P(sum=8 | d1=3) = {p_sum8_given_d1_3:.4f} (exact: {1/6:.4f})")
```
-
Implement Bayes' theorem as a function and use it to update beliefs iteratively. Start with a uniform prior over a coin's bias and update after observing each flip.
```python
import jax.numpy as jnp
import matplotlib.pyplot as plt

def bayes_update(prior, likelihood):
    """Multiply prior by likelihood and normalise."""
    posterior = prior * likelihood
    return posterior / posterior.sum()

# Discretise possible bias values
theta = jnp.linspace(0, 1, 200)
prior = jnp.ones_like(theta)  # uniform prior
prior = prior / prior.sum()

# Observed flips: 1=heads, 0=tails
flips = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]

plt.figure(figsize=(10, 5))
plt.plot(theta, prior, "--", color="#999", label="prior")
for i, flip in enumerate(flips):
    likelihood = theta if flip == 1 else (1 - theta)
    prior = bayes_update(prior, likelihood)
    if i in [0, 2, 4, 9]:
        plt.plot(theta, prior, label=f"after {i+1} flips", linewidth=2)
plt.xlabel("Coin bias θ")
plt.ylabel("Belief (normalised)")
plt.title("Bayesian updating: belief about coin bias")
plt.legend()
plt.grid(alpha=0.3)
plt.show()
```