Hypothesis Testing

Hypothesis testing provides a rigorous framework for deciding whether observed effects are real or due to chance. This file covers null and alternative hypotheses, p-values, significance levels, t-tests, chi-squared tests, ANOVA, and Type I/II errors -- the same logic used in A/B testing and model comparison.

  • Statistics is not just about describing data. Often you need to make a decision: does a new drug work? Is one algorithm faster than another? Has the average changed? Hypothesis testing gives you a structured framework for answering these questions using data.

  • The idea is simple: assume nothing has changed (the "null hypothesis"), then check whether the data is so extreme that this assumption becomes hard to believe.

  • The null hypothesis (\(H_0\)) is the default claim, usually a statement of "no effect" or "no difference." For example: "the average delivery time is still 30 minutes" or "the new model is no better than the old one."

  • The alternative hypothesis (\(H_1\) or \(H_a\)) is what you suspect might be true instead: "the average delivery time has changed" or "the new model is better."

  • You never prove \(H_1\) directly. Instead, you ask: if \(H_0\) were true, how likely is it that I would see data this extreme? If it is very unlikely, you reject \(H_0\) in favour of \(H_1\).

  • The test statistic is a single number that summarises how far your sample result is from what \(H_0\) predicts. Different tests use different formulas, but the logic is always the same: measure the distance between observed and expected.

  • The p-value is the probability of observing a test statistic at least as extreme as yours, assuming \(H_0\) is true. A small p-value means the data is surprising under \(H_0\).

  • The significance level (\(\alpha\)) is the threshold you set before looking at the data. If \(p \le \alpha\), you reject \(H_0\). Common choices are \(\alpha = 0.05\) (5%) and \(\alpha = 0.01\) (1%).

(Figure: normal curve with rejection regions shaded, the test statistic marked, and the p-value area highlighted.)

  • The shaded tails are the rejection regions. If your test statistic lands there, the data is surprising enough under \(H_0\) that you reject it. The green area shows the p-value for a particular test statistic.

  • Here is the step-by-step procedure:

    • Step 1: State \(H_0\) and \(H_1\)
    • Step 2: Choose a significance level \(\alpha\)
    • Step 3: Collect data and compute the test statistic
    • Step 4: Find the p-value (or compare the test statistic to a critical value)
    • Step 5: If \(p \le \alpha\), reject \(H_0\). Otherwise, fail to reject \(H_0\)
  • Worked example: A factory claims their bolts have a mean length of 10 cm. You measure 36 bolts and find a sample mean of 10.3 cm. The known population standard deviation is 0.9 cm. Is there evidence that the mean has changed?

  • \(H_0\): \(\mu = 10\), \(H_1\): \(\mu \neq 10\), \(\alpha = 0.05\)

  • Test statistic (z-test, since \(\sigma\) is known and \(n\) is large):

\[z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} = \frac{10.3 - 10}{0.9 / \sqrt{36}} = \frac{0.3}{0.15} = 2.0\]
  • For a two-tailed test at \(\alpha = 0.05\), the critical values are \(\pm 1.96\). Our \(z = 2.0 > 1.96\), so we reject \(H_0\). The p-value is approximately 0.046, which is less than 0.05.

  • Conclusion: there is statistically significant evidence that the mean bolt length differs from 10 cm.

  • A one-tailed test checks for an effect in one specific direction (\(H_1\): \(\mu > 10\) or \(\mu < 10\)). The entire \(\alpha\) goes into one tail, making it easier to reject \(H_0\) in that direction but impossible to detect an effect in the opposite direction.

  • A two-tailed test checks for any difference (\(H_1\): \(\mu \neq 10\)). The \(\alpha\) is split between both tails (\(\alpha/2\) each). This is more conservative but catches effects in either direction.
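The difference between the two choices is easy to see numerically. The sketch below reuses the \(z = 2.0\) from the bolt example and computes both p-values with the normal CDF from `jax.scipy.stats`:

```python
import jax.numpy as jnp
from jax.scipy.stats import norm

z = 2.0  # test statistic from the bolt example

# One-tailed (H1: mu > 10): all of alpha sits in the upper tail
p_one = 1 - norm.cdf(z)

# Two-tailed (H1: mu != 10): alpha is split between both tails,
# so the p-value doubles
p_two = 2 * (1 - norm.cdf(jnp.abs(z)))

print(f"one-tailed p = {p_one:.4f}")  # ≈ 0.0228
print(f"two-tailed p = {p_two:.4f}")  # ≈ 0.0455
```

The same data is twice as "significant" under the one-tailed test, which is exactly why the direction must be chosen before looking at the data.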

  • Even with a good procedure, mistakes happen. There are exactly two types of errors:

(Figure: 2x2 grid showing Type I and Type II errors, reality vs decision.)

  • Type I Error (false positive): you reject \(H_0\) when it is actually true. The probability of this is \(\alpha\), which you control by choosing your significance level. Like a fire alarm going off when there is no fire.

  • Type II Error (false negative): you fail to reject \(H_0\) when it is actually false. The probability of this is \(\beta\). Like a fire alarm staying silent during a real fire.

  • Power is \(1 - \beta\), the probability of correctly rejecting a false \(H_0\). Higher power means you are better at detecting real effects. Power increases when:

    • The true effect size is larger (bigger differences are easier to detect)
    • The sample size is larger (more data = more precision)
    • The significance level \(\alpha\) is larger (but this raises Type I error risk)
    • The variability is lower (less noise)
  • There is a tension between Type I and Type II errors. Lowering \(\alpha\) (being more cautious about false positives) increases \(\beta\) (more false negatives). You cannot minimise both simultaneously with a fixed sample size.
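Power can be estimated by simulation: generate data where \(H_0\) is false and count how often the test rejects. The sketch below assumes a true mean of 55 against \(H_0\): \(\mu = 50\) (a hypothetical setup; the theoretical power for these numbers is about 0.78):

```python
import jax
import jax.numpy as jnp
from jax.scipy.stats import norm

key = jax.random.PRNGKey(1)
mu_0, mu_true = 50.0, 55.0   # H0 is false: the true mean is 55
sigma, n, alpha = 10.0, 30, 0.05
n_experiments = 5_000

# Draw all samples at once: one row per simulated experiment
samples = mu_true + sigma * jax.random.normal(key, shape=(n_experiments, n))
z = (samples.mean(axis=1) - mu_0) / (sigma / jnp.sqrt(n))
p_values = 2 * (1 - norm.cdf(jnp.abs(z)))

# Empirical power: fraction of experiments that correctly reject H0
power = jnp.mean(p_values <= alpha)
print(f"Empirical power: {power:.3f}")
```

Rerunning with a smaller true shift or a smaller \(n\) drops the power visibly, which is the tension described above in action.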

  • Parametric tests assume the data follows a specific distribution (usually normal). They are more powerful when the assumptions hold.

  • Z-test: compares a sample mean to a known value when \(\sigma\) is known and \(n\) is large (\(n \ge 30\)). Test statistic:

\[z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}\]
  • T-test: like the z-test, but used when \(\sigma\) is unknown (estimated from the sample) or \(n\) is small. Uses the t-distribution, which has heavier tails than the normal. The heavier tails account for the extra uncertainty from estimating \(\sigma\).
\[t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}\]
  • The t-distribution has a parameter called degrees of freedom (\(df = n - 1\)). As \(df\) increases, the t-distribution approaches the normal distribution.

  • There are several flavours of t-test:

    • One-sample t-test: is the sample mean different from a specific value?
    • Independent two-sample t-test: are the means of two separate groups different?
    • Paired t-test: are the means of two related measurements different (e.g. before and after treatment on the same subjects)?
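A one-sample t-test can be computed directly from the formula above. The sketch below uses a hypothetical small sample of delivery times against \(H_0\): \(\mu = 30\); SciPy supplies the t-distribution tail probability, since `jax.scipy.stats` does not provide the t CDF:

```python
import jax.numpy as jnp
from scipy.stats import t as t_dist  # SciPy for the t-distribution tail

# Hypothetical delivery times in minutes; H0: mu = 30
x = jnp.array([31.2, 29.8, 33.1, 30.5, 32.4, 28.9, 31.7, 30.2])
mu_0 = 30.0
n = x.shape[0]

x_bar = x.mean()
s = x.std(ddof=1)                       # sample std (n - 1 in the denominator)
t_stat = (x_bar - mu_0) / (s / jnp.sqrt(n))
df = n - 1                              # degrees of freedom

p_value = 2 * t_dist.sf(abs(float(t_stat)), df)  # two-tailed
print(f"t = {float(t_stat):.3f}, df = {df}, p = {p_value:.4f}")
```

With only 8 observations the heavier t tails matter: the same statistic that would be significant under a z-test can fail to clear the t critical value.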
  • ANOVA (Analysis of Variance): tests whether three or more group means are equal. Instead of running multiple t-tests (which inflates the Type I error rate), ANOVA does a single test by comparing the variance between groups to the variance within groups.

\[F = \frac{\text{variance between groups}}{\text{variance within groups}}\]
  • A large \(F\) ratio means the groups differ more than you would expect from random variation alone.
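The \(F\) ratio can be computed by hand from its definition. A minimal sketch with three hypothetical equal-sized groups:

```python
import jax.numpy as jnp

# Three hypothetical groups of measurements (equal sizes for simplicity)
groups = [
    jnp.array([5.1, 4.9, 5.4, 5.0, 5.2]),
    jnp.array([5.6, 5.8, 5.5, 5.9, 5.7]),
    jnp.array([4.8, 5.0, 4.7, 4.9, 5.1]),
]
k = len(groups)            # number of groups
n = groups[0].shape[0]     # observations per group
N = k * n

grand_mean = jnp.concatenate(groups).mean()
group_means = jnp.array([g.mean() for g in groups])

# Between-group variance: spread of the group means around the grand mean
ss_between = n * jnp.sum((group_means - grand_mean) ** 2)
ms_between = ss_between / (k - 1)

# Within-group variance: spread of each point around its own group mean
ss_within = sum(jnp.sum((g - g.mean()) ** 2) for g in groups)
ms_within = ss_within / (N - k)

F = ms_between / ms_within
print(f"F = {float(F):.2f}")
```

Here the second group sits visibly above the others, so the between-group variance dwarfs the within-group noise and \(F\) comes out large.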

  • Non-parametric tests make fewer assumptions about the data distribution. They work on ranks rather than raw values, making them robust to outliers and non-normality.

  • Chi-square test (\(\chi^2\)): tests whether observed frequencies match expected frequencies. Used for categorical data. For example: do the proportions of red, blue, and green cars match the manufacturer's claimed proportions?

\[\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}\]
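The car-colour example above translates directly into this formula. A sketch with hypothetical counts (SciPy supplies the chi-square tail probability):

```python
import jax.numpy as jnp
from scipy.stats import chi2 as chi2_dist  # SciPy for the tail probability

# Hypothetical sample of 100 cars vs the manufacturer's claimed colour mix
observed = jnp.array([45.0, 35.0, 20.0])   # red, blue, green counts
claimed = jnp.array([0.5, 0.3, 0.2])       # claimed proportions
expected = claimed * observed.sum()        # 50, 30, 20

chi2 = jnp.sum((observed - expected) ** 2 / expected)
df = observed.shape[0] - 1                 # categories minus one
p_value = chi2_dist.sf(float(chi2), df)
print(f"chi-square = {float(chi2):.3f}, df = {df}, p = {p_value:.3f}")
```

A large p-value here means the observed mix is entirely consistent with the claimed proportions, so there is no reason to reject \(H_0\).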
  • Mann-Whitney U test: the non-parametric alternative to the independent two-sample t-test. It tests whether one group tends to have larger values than the other by comparing ranks.

  • Wilcoxon signed-rank test: the non-parametric alternative to the paired t-test. Compares paired observations by looking at the magnitude and direction of differences.

  • Kruskal-Wallis test: the non-parametric alternative to one-way ANOVA. Tests whether multiple groups come from the same distribution by comparing ranks across all groups.
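All three rank tests are available in `scipy.stats`. The sketch below applies them to hypothetical data (the Wilcoxon call is for illustration only; it treats the two samples as if they were paired):

```python
import numpy as np
from scipy.stats import mannwhitneyu, wilcoxon, kruskal

rng = np.random.default_rng(0)
a = rng.normal(100, 5, size=25)
b = rng.normal(103, 5, size=25)   # slightly shifted
c = rng.normal(101, 5, size=25)

u_stat, p_u = mannwhitneyu(a, b)  # two independent groups
w_stat, p_w = wilcoxon(a, b)      # paired samples (illustrative here)
h_stat, p_h = kruskal(a, b, c)    # three or more groups

print(f"Mann-Whitney U: p = {p_u:.4f}")
print(f"Wilcoxon:       p = {p_w:.4f}")
print(f"Kruskal-Wallis: p = {p_h:.4f}")
```

Because all three operate on ranks, replacing one observation with an extreme outlier barely changes the statistics, unlike their parametric counterparts.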

  • Goodness-of-fit tests check whether your data follows a specific theoretical distribution. The chi-square goodness-of-fit test compares observed bin counts to expected counts under the hypothesised distribution.

  • Normality tests specifically check whether data is normally distributed. Common ones include the Shapiro-Wilk test (powerful for small samples) and the Kolmogorov-Smirnov test (compares the sample CDF to the theoretical CDF).
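A quick sketch of the Shapiro-Wilk test on two hypothetical samples, one genuinely normal and one exponentially distributed (so clearly non-normal):

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(42)
normal_data = rng.normal(0.0, 1.0, size=200)
skewed_data = rng.exponential(1.0, size=200)  # clearly non-normal

stat_n, p_normal = shapiro(normal_data)
stat_s, p_skewed = shapiro(skewed_data)
print(f"Shapiro-Wilk, normal data: p = {p_normal:.4f}")
print(f"Shapiro-Wilk, skewed data: p = {p_skewed:.6f}")
```

Note the direction of the hypotheses: here \(H_0\) is "the data is normal," so a *small* p-value is evidence of non-normality, while a large one only means normality was not ruled out.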

  • In ML, hypothesis testing appears when you compare model performance. If model A achieves 92% accuracy and model B achieves 91%, is the difference real or just noise? A paired t-test on cross-validation scores can answer this.
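A sketch of that comparison, using hypothetical 5-fold cross-validation accuracies and SciPy's paired t-test:

```python
import numpy as np
from scipy.stats import ttest_rel  # paired t-test

# Hypothetical 5-fold cross-validation accuracies for two models;
# the pairing matters because both models are scored on the same folds
scores_a = np.array([0.92, 0.91, 0.93, 0.92, 0.90])
scores_b = np.array([0.91, 0.90, 0.92, 0.91, 0.90])

t_stat, p_value = ttest_rel(scores_a, scores_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

Pairing by fold removes the fold-to-fold variation, so even a small but consistent gap between the models can reach significance.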

Coding Tasks (use Colab or a notebook)

  1. Perform a z-test for the bolt factory example from the text. Compute the test statistic, p-value, and make a decision.

    import jax.numpy as jnp
    
    x_bar = 10.3    # sample mean
    mu_0 = 10.0     # null hypothesis value
    sigma = 0.9     # known population std
    n = 36           # sample size
    alpha = 0.05
    
    # Test statistic
    z = (x_bar - mu_0) / (sigma / jnp.sqrt(n))
    print(f"z = {z:.4f}")
    
    # p-value (two-tailed) using the normal CDF
    # For |z| = 2.0, p ≈ 0.0455
    from jax.scipy.stats import norm
    p_value = 2 * (1 - norm.cdf(jnp.abs(z)))
    print(f"p-value = {p_value:.4f}")
    print(f"Reject H₀? {p_value <= alpha}")
    

  2. Simulate Type I error: when \(H_0\) is true, how often do we mistakenly reject it? Run 10,000 experiments and check that the rejection rate matches \(\alpha\).

    import jax
    import jax.numpy as jnp
    from jax.scipy.stats import norm
    
    key = jax.random.PRNGKey(0)
    mu_0 = 50.0
    sigma = 10.0
    n = 30
    alpha = 0.05
    n_experiments = 10_000
    
    rejections = 0
    for i in range(n_experiments):
        key, subkey = jax.random.split(key)
        # Data generated under H0: the true mean really is mu_0
        sample = mu_0 + sigma * jax.random.normal(subkey, shape=(n,))
        z = (sample.mean() - mu_0) / (sigma / jnp.sqrt(n))
        p_value = 2 * (1 - norm.cdf(jnp.abs(z)))
        if p_value <= alpha:
            rejections += 1
    
    print(f"Rejection rate: {rejections/n_experiments:.4f}")
    print(f"Expected (α):   {alpha}")
    

  3. Compare a t-test and a Mann-Whitney U test on two groups. Generate data where one group has a slightly higher mean and see which test detects the difference.

    import jax
    import jax.numpy as jnp
    
    key = jax.random.PRNGKey(99)
    k1, k2 = jax.random.split(key)
    
    group_a = jax.random.normal(k1, shape=(25,)) * 5 + 100
    group_b = jax.random.normal(k2, shape=(25,)) * 5 + 103  # slightly higher mean
    
    # Two-sample t-test (equal variance assumed)
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = group_a.mean(), group_b.mean()
    pooled_var = ((n_a - 1) * group_a.var() + (n_b - 1) * group_b.var()) / (n_a + n_b - 2)
    se = jnp.sqrt(pooled_var * (1/n_a + 1/n_b))
    t_stat = (mean_a - mean_b) / se
    print(f"T-test statistic: {t_stat:.4f}")
    
    # Mann-Whitney U for group B: count pairs where a group_b value
    # exceeds a group_a value
    u_stat = jnp.sum(group_a[:, None] < group_b[None, :])
    print(f"Mann-Whitney U:   {u_stat}")
    print(f"\nGroup A mean: {mean_a:.2f}, Group B mean: {mean_b:.2f}")