Information Theory¶
Information theory quantifies information, surprise, and the difference between probability distributions. This file covers entropy, cross-entropy, KL divergence, mutual information, and surprisal -- the concepts behind every classification loss function, VAE objective, and data compression scheme used in ML.
-
Information theory, founded by Claude Shannon in 1948, gives us a mathematical framework for quantifying information. It answers questions like: how surprised should you be by an event? How much information does a message carry? How different are two probability distributions?
-
These questions sound abstract, but they are the foundation of ML loss functions, data compression, and communication systems. Cross-entropy loss, the most common loss function in classification, comes directly from information theory.
-
Start with the simplest question: how much information does a single event carry?
-
Surprisal (also called self-information) measures how surprising an event is. If something very likely happens, you learn almost nothing. If something rare happens, you learn a lot.
-
If you live in a desert and someone tells you it is sunny, that is not very informative. If they tell you it is snowing, that is extremely informative. Surprisal formalises this intuition:
\[
I(x) = -\log_2 p(x)
\]
-
The unit is bits when we use \(\log_2\). A fair coin flip has surprisal \(-\log_2(0.5) = 1\) bit. An event with probability \(1/8\) has surprisal \(\log_2(8) = 3\) bits.
-
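A quick numeric check of the surprisal values just quoted (the `surprisal` helper is our own naming, not a library function):

```python
import jax.numpy as jnp

def surprisal(p):
    """Self-information in bits of an event with probability p."""
    return -jnp.log2(p)

print(f"Fair coin flip: {surprisal(0.5):.1f} bits")    # 1 bit
print(f"p = 1/8 event:  {surprisal(0.125):.1f} bits")  # 3 bits
print(f"Certain event:  {surprisal(1.0):.1f} bits")    # 0 bits
```
-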
Why logarithm and not just \(1/p\)? Three reasons:
- A certain event (\(p = 1\)) should give zero information: \(\log(1) = 0\) but \(1/1 = 1\).
- Independent events should have additive information: \(\log(1/p_1 p_2) = \log(1/p_1) + \log(1/p_2)\).
- We want a smooth, well-behaved function. \(1/p\) explodes; \(\log(1/p)\) grows gently.
-
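The additivity property is easy to verify numerically. A sketch, using our own `surprisal` helper and two independent events, a fair coin and a fair die:

```python
import jax.numpy as jnp

def surprisal(p):
    return -jnp.log2(p)

# Two independent events: a fair coin (p = 1/2) and a fair die roll (p = 1/6)
p1, p2 = 0.5, 1 / 6
joint = p1 * p2  # independence: probabilities multiply

print(f"surprisal(joint)  = {surprisal(joint):.4f} bits")
print(f"sum of surprisals = {surprisal(p1) + surprisal(p2):.4f} bits")
# The two match: information from independent events adds.
```
-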
Entropy is the expected surprisal, the average amount of information you get per event sampled from a distribution. It measures the uncertainty or "unpredictability" of the distribution:
\[
H(X) = -\sum_x p(x) \log_2 p(x)
\]
-
A fair coin has entropy \(H = -0.5\log_2(0.5) - 0.5\log_2(0.5) = 1\) bit. Maximum uncertainty.
-
A biased coin with \(p = 0.9\) has entropy \(H = -0.9\log_2(0.9) - 0.1\log_2(0.1) \approx 0.469\) bits. Less uncertain, so less entropy.
-
A deterministic event (\(p = 1\)) has entropy \(H = 0\). No uncertainty at all.
-
Entropy is maximised when all outcomes are equally likely. For \(n\) equally likely outcomes, \(H = \log_2 n\). A fair die has entropy \(\log_2 6 \approx 2.585\) bits.
-
The practical meaning of entropy is compression. Shannon's source coding theorem says you cannot compress data below its entropy rate without losing information. An image whose pixels are independent and uniformly distributed (maximum entropy) cannot be compressed. An image that is mostly white (low entropy) compresses well.
-
For a quick sense of scale: a grayscale pixel (256 values) has a maximum entropy of 8 bits. A 1080p grayscale image has at most \(1920 \times 1080 \times 8 \approx 16.6\) million bits. Real images have much lower entropy because neighbouring pixels are correlated, which is why JPEG compression works.
-
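The arithmetic behind that 1080p figure, using the numbers from the text:

```python
import jax.numpy as jnp

bits_per_pixel = jnp.log2(256.0)  # max entropy of a 256-level grayscale pixel
total_bits = 1920 * 1080 * bits_per_pixel  # upper bound for a 1080p frame

print(f"Max entropy per pixel: {bits_per_pixel:.0f} bits")
print(f"Whole 1080p frame:     {total_bits / 1e6:.1f} million bits")
```
-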
For continuous random variables, the discrete sum becomes an integral. Differential entropy is:
\[
h(X) = -\int p(x) \log_2 p(x) \, dx
\]
-
A Gaussian with variance \(\sigma^2\) has differential entropy \(h = \frac{1}{2}\log_2(2\pi e \sigma^2)\). Among all distributions with the same variance, the Gaussian has the maximum entropy. This is one reason the Gaussian is so common in modelling: it makes the fewest assumptions beyond the specified mean and variance.
-
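A sketch of the maximum-entropy claim, comparing the Gaussian against a uniform distribution with the same variance (a uniform of width \(w\) has variance \(w^2/12\) and differential entropy \(\log_2 w\)):

```python
import jax.numpy as jnp

sigma = 2.0

# Gaussian with variance sigma^2: h = (1/2) log2(2 pi e sigma^2)
h_gauss = 0.5 * jnp.log2(2 * jnp.pi * jnp.e * sigma**2)

# Uniform with the same variance: width w satisfies w^2 / 12 = sigma^2
w = sigma * jnp.sqrt(12.0)
h_unif = jnp.log2(w)  # differential entropy of Uniform(0, w)

print(f"Gaussian: {h_gauss:.4f} bits")
print(f"Uniform : {h_unif:.4f} bits")
# The Gaussian wins, consistent with its maximum-entropy property.
```
-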
Mutual information measures how much knowing one variable tells you about another. It is the reduction in uncertainty about \(X\) when you observe \(Y\):
\[
I(X; Y) = H(X) - H(X \mid Y)
\]
Equivalently:
\[
I(X; Y) = \sum_{x, y} p(x, y) \log_2 \frac{p(x, y)}{p(x)\, p(y)}
\]
-
If \(X\) and \(Y\) are independent, \(p(x,y) = p(x)p(y)\) and mutual information is zero. The more dependent they are, the higher the mutual information.
-
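This can be checked directly from a joint probability table. A minimal sketch (the `mutual_information` helper is our own):

```python
import jax.numpy as jnp

def mutual_information(pxy):
    """I(X;Y) in bits from a joint probability table pxy[x, y]."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    ratio = pxy / (px * py)
    # Guard zero-probability cells: they contribute nothing to the sum
    safe = jnp.where(pxy > 0, ratio, 1.0)
    return jnp.sum(jnp.where(pxy > 0, pxy * jnp.log2(safe), 0.0))

# Independent joint: p(x,y) = p(x) p(y) -> MI is zero
indep = jnp.outer(jnp.array([0.5, 0.5]), jnp.array([0.3, 0.7]))
# Perfectly correlated: X = Y -> MI equals H(X) = 1 bit
corr = jnp.array([[0.5, 0.0], [0.0, 0.5]])

print(f"Independent: {mutual_information(indep):.4f} bits")
print(f"Correlated : {mutual_information(corr):.4f} bits")
```
-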
In ML, mutual information is used in feature selection (pick features with high MI with the target), in information bottleneck methods, and in evaluating clustering quality.
-
Cross-entropy measures the average number of bits needed to encode events from distribution \(p\) using a code optimised for distribution \(q\):
\[
H(p, q) = -\sum_x p(x) \log_2 q(x)
\]
-
If \(q\) matches \(p\) perfectly, cross-entropy equals entropy: \(H(p, p) = H(p)\). If \(q\) is a bad approximation, cross-entropy is higher. The "extra" bits come from the mismatch.
-
This is exactly why cross-entropy is the standard loss function for classification in ML. The true labels define \(p\) (a one-hot distribution), and the model's predicted probabilities define \(q\). Minimising cross-entropy pushes \(q\) toward \(p\):
\[
\mathcal{L} = -\sum_c y_c \log \hat{y}_c
\]
-
For a single sample with true class \(c\), this simplifies to \(\mathcal{L} = -\log \hat{y}_c\). The loss is the surprisal of the true class under the model's predictions. If the model assigns high probability to the correct class, the loss is low.
-
KL divergence (Kullback-Leibler divergence, also called relative entropy) measures how much one distribution differs from another:
\[
D_{\text{KL}}(p \| q) = \sum_x p(x) \log_2 \frac{p(x)}{q(x)}
\]
KL divergence is the "extra cost" of using distribution \(q\) instead of the true distribution \(p\). It is always non-negative (\(D_{\text{KL}} \ge 0\)) and equals zero only when \(p = q\).
-
KL divergence is not symmetric: \(D_{\text{KL}}(p \| q) \ne D_{\text{KL}}(q \| p)\). This asymmetry matters. \(D_{\text{KL}}(p \| q)\) penalises \(q\) for placing low probability where \(p\) has high probability (because \(\log(p/q)\) blows up). \(D_{\text{KL}}(q \| p)\) penalises the reverse.
-
This asymmetry leads to two styles of approximation:
- Minimising \(D_{\text{KL}}(p \| q)\) produces moment-matching behaviour: \(q\) covers all modes of \(p\) but may be too spread out.
- Minimising \(D_{\text{KL}}(q \| p)\) produces mode-seeking behaviour: \(q\) concentrates on one mode of \(p\) but may miss others. This is what variational inference uses.
-
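A discrete sketch of this asymmetry, with an illustrative bimodal \(p\) and two candidate approximations (the distributions and the `kl` helper are our own choices):

```python
import jax.numpy as jnp

def kl(p, q):
    """D_KL(p || q) in bits; zero-probability terms of p contribute nothing."""
    q = jnp.clip(q, 1e-12, 1.0)
    safe_p = jnp.where(p > 0, p, 1.0)
    return jnp.sum(jnp.where(p > 0, p * jnp.log2(safe_p / q), 0.0))

# Bimodal "true" distribution over 4 states: two sharp modes
p = jnp.array([0.49, 0.01, 0.01, 0.49])
q_wide = jnp.array([0.25, 0.25, 0.25, 0.25])    # spread out, covers both modes
q_narrow = jnp.array([0.97, 0.01, 0.01, 0.01])  # locks onto one mode

print(f"D_KL(p||q_wide)   = {kl(p, q_wide):.3f}  (forward KL rewards covering)")
print(f"D_KL(p||q_narrow) = {kl(p, q_narrow):.3f}  (large: p has mass q_narrow ignores)")
print(f"D_KL(q_narrow||p) = {kl(q_narrow, p):.3f}  (reverse KL tolerates missing a mode)")
```

Forward KL punishes the mode-missing `q_narrow` far more than the reverse KL does, which is why minimising the reverse direction yields mode-seeking solutions.
-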
Cross-entropy decomposes as \(H(p, q) = H(p) + D_{\text{KL}}(p \| q)\). Since \(H(p)\) is constant with respect to the model, minimising cross-entropy \(H(p, q)\) is equivalent to minimising \(D_{\text{KL}}(p \| q)\). This is why we can use cross-entropy loss and know that we are also minimising the KL divergence between the true and predicted distributions.
-
KL divergence plays a central role in Bayesian updating. The posterior \(P(\theta | D)\) is the distribution closest to the prior \(P(\theta)\) (in KL divergence terms) that is consistent with the observed data. Each new observation updates the posterior, reducing uncertainty about \(\theta\).
-
In variational autoencoders (VAEs), the loss function has two terms: a reconstruction loss (cross-entropy) and a KL divergence term that regularises the latent space to stay close to a standard normal distribution.
-
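For a diagonal Gaussian encoder, that KL term has the well-known closed form \(D_{\text{KL}}(\mathcal{N}(\mu, \sigma^2) \| \mathcal{N}(0, 1)) = \frac{1}{2}(\mu^2 + \sigma^2 - 1 - \ln \sigma^2)\) per latent dimension (in nats). A sketch:

```python
import jax.numpy as jnp

def kl_to_standard_normal(mu, sigma):
    """D_KL( N(mu, sigma^2) || N(0, 1) ) per dimension, in nats."""
    return 0.5 * (mu**2 + sigma**2 - 1.0 - jnp.log(sigma**2))

# Latent exactly standard normal -> zero KL penalty
print(f"mu=0, sigma=1  : {kl_to_standard_normal(0.0, 1.0):.4f} nats")
# Shifted / sharpened latent -> positive penalty
print(f"mu=2, sigma=0.5: {kl_to_standard_normal(2.0, 0.5):.4f} nats")
```
-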
To tie everything together: entropy tells you the intrinsic uncertainty in a distribution, cross-entropy tells you how well your model approximates reality, and KL divergence tells you the gap between the two. These three quantities form the backbone of modern ML optimisation.
Coding Tasks (use Colab or a notebook)¶
-
Compute the entropy of various distributions and verify that the uniform distribution has maximum entropy for a given number of outcomes.
```python
import jax.numpy as jnp

def entropy(p):
    """Compute entropy in bits. Filter out zero-probability events."""
    p = p[p > 0]
    return -jnp.sum(p * jnp.log2(p))

# Fair die
fair = jnp.ones(6) / 6
print(f"Fair die entropy: {entropy(fair):.4f} bits (max = log2(6) = {jnp.log2(6.):.4f})")

# Loaded die
loaded = jnp.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.5])
print(f"Loaded die entropy: {entropy(loaded):.4f} bits")

# Deterministic
det = jnp.array([0.0, 0.0, 0.0, 0.0, 0.0, 1.0])
print(f"Deterministic: {entropy(det):.4f} bits")

# Fair coin
coin = jnp.array([0.5, 0.5])
print(f"Fair coin entropy: {entropy(coin):.4f} bits")
```
-
Compute cross-entropy and KL divergence between a true distribution and several approximations. Verify that \(D_{\text{KL}}(p \| q) = H(p, q) - H(p)\).
```python
import jax.numpy as jnp

def cross_entropy(p, q):
    return -jnp.sum(p * jnp.log2(jnp.clip(q, 1e-10, 1.0)))

def kl_divergence(p, q):
    mask = p > 0
    return jnp.sum(jnp.where(mask, p * jnp.log2(p / jnp.clip(q, 1e-10, 1.0)), 0.0))

def entropy(p):
    p = p[p > 0]
    return -jnp.sum(p * jnp.log2(p))

p = jnp.array([0.4, 0.3, 0.2, 0.1])  # true distribution

for name, q in [("perfect match", p),
                ("slight mismatch", jnp.array([0.35, 0.30, 0.25, 0.10])),
                ("big mismatch", jnp.array([0.1, 0.1, 0.1, 0.7]))]:
    h_p = entropy(p)
    h_pq = cross_entropy(p, q)
    kl = kl_divergence(p, q)
    print(f"{name:20s}: H(p)={h_p:.4f}, H(p,q)={h_pq:.4f}, "
          f"KL={kl:.4f}, H(p,q)-H(p)={h_pq - h_p:.4f}")
```
-
Show that KL divergence is not symmetric by computing \(D_{\text{KL}}(p \| q)\) and \(D_{\text{KL}}(q \| p)\) for two different distributions.
```python
import jax.numpy as jnp

def kl_div(p, q):
    mask = p > 0
    return float(jnp.sum(jnp.where(mask, p * jnp.log2(p / jnp.clip(q, 1e-10, 1.0)), 0.0)))

p = jnp.array([0.9, 0.1])
q = jnp.array([0.5, 0.5])

print(f"D_KL(p || q) = {kl_div(p, q):.4f}")
print(f"D_KL(q || p) = {kl_div(q, p):.4f}")
print("Not the same! KL divergence is asymmetric.")
```
-
Simulate cross-entropy loss during training. Create a "true" one-hot label and show how the loss decreases as the model's predicted probabilities improve.
```python
import jax.numpy as jnp
import matplotlib.pyplot as plt

# True label: class 2 out of 4
true_label = jnp.array([0, 0, 1, 0])

# Simulate improving predictions
steps = []
losses = []
for confidence in jnp.linspace(0.25, 0.99, 50):
    # Model becomes more confident in class 2
    remaining = (1 - confidence) / 3
    pred = jnp.array([remaining, remaining, confidence, remaining])
    loss = -jnp.sum(true_label * jnp.log(jnp.clip(pred, 1e-10, 1.0)))
    steps.append(float(confidence))
    losses.append(float(loss))

plt.figure(figsize=(8, 4))
plt.plot(steps, losses, color="#e74c3c", linewidth=2)
plt.xlabel("Model confidence in true class")
plt.ylabel("Cross-entropy loss")
plt.title("Cross-entropy loss decreases as predictions improve")
plt.grid(alpha=0.3)
plt.show()
```