Hypothesis Testing

Hypothesis testing provides a rigorous framework for deciding whether observed effects are real or due to chance. This file covers null and alternative hypotheses, p-values, significance levels, t-tests, chi-squared tests, ANOVA, and Type I/II errors -- the same logic used in A/B testing and model comparison.

  • Statistics is not just about describing data. Often you need to make a decision: does a new drug work? Is one algorithm faster than another? Has the average changed? Hypothesis testing gives you a structured framework for answering these questions using data.

  • The idea is simple: assume nothing has changed (the "null hypothesis"), then check whether the data is so extreme that this assumption becomes hard to believe.

  • The null hypothesis (\(H_0\)) is the default claim, usually a statement of "no effect" or "no difference." For example: "the average delivery time is still 30 minutes" or "the new model is no better than the old one."

  • The alternative hypothesis (\(H_1\) or \(H_a\)) is what you suspect might be true instead: "the average delivery time has changed" or "the new model is better."

  • You never prove \(H_1\) directly. Instead, you ask: if \(H_0\) were true, how likely is it that I would see data this extreme? If it is very unlikely, you reject \(H_0\) in favour of \(H_1\).

  • The test statistic is a single number that summarises how far your sample result is from what \(H_0\) predicts. Different tests use different formulas, but the logic is always the same: measure the distance between observed and expected.

  • The p-value is the probability of observing a test statistic at least as extreme as yours, assuming \(H_0\) is true. A small p-value means the data is surprising under \(H_0\).

  • The significance level (\(\alpha\)) is the threshold you set before looking at the data. If \(p \le \alpha\), you reject \(H_0\). Common choices are \(\alpha = 0.05\) (5%) and \(\alpha = 0.01\) (1%).

(Figure: normal curve with rejection regions shaded, the test statistic marked, and the p-value area highlighted.)

  • The shaded tails are the rejection regions. If your test statistic lands there, the data is surprising enough under \(H_0\) that you reject it. The green area shows the p-value for a particular test statistic.

  • Here is the step-by-step procedure:

    • Step 1: State \(H_0\) and \(H_1\)
    • Step 2: Choose a significance level \(\alpha\)
    • Step 3: Collect data and compute the test statistic
    • Step 4: Find the p-value (or compare the test statistic to a critical value)
    • Step 5: If \(p \le \alpha\), reject \(H_0\). Otherwise, fail to reject \(H_0\)
  • Worked example: A factory claims their bolts have a mean length of 10 cm. You measure 36 bolts and find a sample mean of 10.3 cm. The known population standard deviation is 0.9 cm. Is there evidence that the mean has changed?

  • \(H_0\): \(\mu = 10\), \(H_1\): \(\mu \neq 10\), \(\alpha = 0.05\)

  • Test statistic (z-test, since \(\sigma\) is known and \(n\) is large):

\[z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} = \frac{10.3 - 10}{0.9 / \sqrt{36}} = \frac{0.3}{0.15} = 2.0\]
  • For a two-tailed test at \(\alpha = 0.05\), the critical values are \(\pm 1.96\). Our \(z = 2.0 > 1.96\), so we reject \(H_0\). The p-value is approximately 0.046, which is less than 0.05.

  • Conclusion: there is statistically significant evidence that the mean bolt length differs from 10 cm.

  • A one-tailed test checks for an effect in one specific direction (\(H_1\): \(\mu > 10\) or \(\mu < 10\)). The entire \(\alpha\) goes into one tail, making it easier to reject \(H_0\) in that direction but impossible to detect an effect in the opposite direction.

  • A two-tailed test checks for any difference (\(H_1\): \(\mu \neq 10\)). The \(\alpha\) is split between both tails (\(\alpha/2\) each). This is more conservative but catches effects in either direction.
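The difference between the two choices is easy to see numerically. The sketch below reuses the \(z = 2.0\) from the bolt example and computes both p-values with the normal CDF from `jax.scipy.stats`:

```python
import jax.numpy as jnp
from jax.scipy.stats import norm

z = 2.0  # test statistic from the bolt example

# One-tailed (H1: mu > 10): all of alpha sits in the upper tail
p_one = 1 - norm.cdf(z)

# Two-tailed (H1: mu != 10): alpha is split between both tails,
# so the p-value doubles
p_two = 2 * (1 - norm.cdf(jnp.abs(z)))

print(f"one-tailed p = {p_one:.4f}")  # ≈ 0.0228
print(f"two-tailed p = {p_two:.4f}")  # ≈ 0.0455
```

The same data is twice as "significant" under the one-tailed test, which is exactly why the direction must be chosen before looking at the data.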

  • Even with a good procedure, mistakes happen. There are exactly two types of errors:

(Figure: 2x2 grid showing Type I and Type II errors, reality vs decision.)

  • Type I Error (false positive): you reject \(H_0\) when it is actually true. The probability of this is \(\alpha\), which you control by choosing your significance level. Like a fire alarm going off when there is no fire.

  • Type II Error (false negative): you fail to reject \(H_0\) when it is actually false. The probability of this is \(\beta\). Like a fire alarm staying silent during a real fire.

  • Power is \(1 - \beta\), the probability of correctly rejecting a false \(H_0\). Higher power means you are better at detecting real effects. Power increases when:

    • The true effect size is larger (bigger differences are easier to detect)
    • The sample size is larger (more data = more precision)
    • The significance level \(\alpha\) is larger (but this raises Type I error risk)
    • The variability is lower (less noise)
  • There is a tension between Type I and Type II errors. Lowering \(\alpha\) (being more cautious about false positives) increases \(\beta\) (more false negatives). You cannot minimise both simultaneously with a fixed sample size.
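Power can be estimated by simulation: generate data where \(H_0\) is false and count how often the test rejects. The sketch below assumes a true mean of 55 against \(H_0\): \(\mu = 50\) (a hypothetical setup; the theoretical power for these numbers is about 0.78):

```python
import jax
import jax.numpy as jnp
from jax.scipy.stats import norm

key = jax.random.PRNGKey(1)
mu_0, mu_true = 50.0, 55.0   # H0 is false: the true mean is 55
sigma, n, alpha = 10.0, 30, 0.05
n_experiments = 5_000

# Draw all samples at once: one row per simulated experiment
samples = mu_true + sigma * jax.random.normal(key, shape=(n_experiments, n))
z = (samples.mean(axis=1) - mu_0) / (sigma / jnp.sqrt(n))
p_values = 2 * (1 - norm.cdf(jnp.abs(z)))

# Empirical power: fraction of experiments that correctly reject H0
power = jnp.mean(p_values <= alpha)
print(f"Empirical power: {power:.3f}")
```

Rerunning with a smaller true shift or a smaller \(n\) drops the power visibly, which is the tension described above in action.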

  • Parametric tests assume the data follows a specific distribution (usually normal). They are more powerful when the assumptions hold.

  • Z-test: compares a sample mean to a known value when \(\sigma\) is known and \(n\) is large (\(n \ge 30\)). Test statistic:

\[z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}\]
  • T-test: like the z-test, but used when \(\sigma\) is unknown (estimated from the sample) or \(n\) is small. Uses the t-distribution, which has heavier tails than the normal. The heavier tails account for the extra uncertainty from estimating \(\sigma\).
\[t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}\]
  • The t-distribution has a parameter called degrees of freedom (\(df = n - 1\)). As \(df\) increases, the t-distribution approaches the normal distribution.

  • There are several flavours of t-test:

    • One-sample t-test: is the sample mean different from a specific value?
    • Independent two-sample t-test: are the means of two separate groups different?
    • Paired t-test: are the means of two related measurements different (e.g. before and after treatment on the same subjects)?
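A one-sample t-test can be computed directly from the formula above. The sketch below uses a hypothetical small sample of delivery times against \(H_0\): \(\mu = 30\); SciPy supplies the t-distribution tail probability, since `jax.scipy.stats` does not provide the t CDF:

```python
import jax.numpy as jnp
from scipy.stats import t as t_dist  # SciPy for the t-distribution tail

# Hypothetical delivery times in minutes; H0: mu = 30
x = jnp.array([31.2, 29.8, 33.1, 30.5, 32.4, 28.9, 31.7, 30.2])
mu_0 = 30.0
n = x.shape[0]

x_bar = x.mean()
s = x.std(ddof=1)                       # sample std (n - 1 in the denominator)
t_stat = (x_bar - mu_0) / (s / jnp.sqrt(n))
df = n - 1                              # degrees of freedom

p_value = 2 * t_dist.sf(abs(float(t_stat)), df)  # two-tailed
print(f"t = {float(t_stat):.3f}, df = {df}, p = {p_value:.4f}")
```

With only 8 observations the heavier t tails matter: the same statistic that would be significant under a z-test can fail to clear the t critical value.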
  • ANOVA (Analysis of Variance): tests whether three or more group means are equal. Instead of running multiple t-tests (which inflates the Type I error rate), ANOVA does a single test by comparing the variance between groups to the variance within groups.

\[F = \frac{\text{variance between groups}}{\text{variance within groups}}\]
  • A large \(F\) ratio means the groups differ more than you would expect from random variation alone.
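The \(F\) ratio can be computed by hand from its definition. A minimal sketch with three hypothetical equal-sized groups:

```python
import jax.numpy as jnp

# Three hypothetical groups of measurements (equal sizes for simplicity)
groups = [
    jnp.array([5.1, 4.9, 5.4, 5.0, 5.2]),
    jnp.array([5.6, 5.8, 5.5, 5.9, 5.7]),
    jnp.array([4.8, 5.0, 4.7, 4.9, 5.1]),
]
k = len(groups)            # number of groups
n = groups[0].shape[0]     # observations per group
N = k * n

grand_mean = jnp.concatenate(groups).mean()
group_means = jnp.array([g.mean() for g in groups])

# Between-group variance: spread of the group means around the grand mean
ss_between = n * jnp.sum((group_means - grand_mean) ** 2)
ms_between = ss_between / (k - 1)

# Within-group variance: spread of each point around its own group mean
ss_within = sum(jnp.sum((g - g.mean()) ** 2) for g in groups)
ms_within = ss_within / (N - k)

F = ms_between / ms_within
print(f"F = {float(F):.2f}")
```

Here the second group sits visibly above the others, so the between-group variance dwarfs the within-group noise and \(F\) comes out large.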

  • Non-parametric tests make fewer assumptions about the data distribution. They work on ranks rather than raw values, making them robust to outliers and non-normality.

  • Chi-square test (\(\chi^2\)): tests whether observed frequencies match expected frequencies. Used for categorical data. For example: do the proportions of red, blue, and green cars match the manufacturer's claimed proportions?

\[\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}\]
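The car-colour example above translates directly into this formula. A sketch with hypothetical counts (SciPy supplies the chi-square tail probability):

```python
import jax.numpy as jnp
from scipy.stats import chi2 as chi2_dist  # SciPy for the tail probability

# Hypothetical sample of 100 cars vs the manufacturer's claimed colour mix
observed = jnp.array([45.0, 35.0, 20.0])   # red, blue, green counts
claimed = jnp.array([0.5, 0.3, 0.2])       # claimed proportions
expected = claimed * observed.sum()        # 50, 30, 20

chi2 = jnp.sum((observed - expected) ** 2 / expected)
df = observed.shape[0] - 1                 # categories minus one
p_value = chi2_dist.sf(float(chi2), df)
print(f"chi-square = {float(chi2):.3f}, df = {df}, p = {p_value:.3f}")
```

A large p-value here means the observed mix is entirely consistent with the claimed proportions, so there is no reason to reject \(H_0\).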
  • Mann-Whitney U test: the non-parametric alternative to the independent two-sample t-test. It tests whether one group tends to have larger values than the other by comparing ranks.

  • Wilcoxon signed-rank test: the non-parametric alternative to the paired t-test. Compares paired observations by looking at the magnitude and direction of differences.

  • Kruskal-Wallis test: the non-parametric alternative to one-way ANOVA. Tests whether multiple groups come from the same distribution by comparing ranks across all groups.
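All three rank tests are available in `scipy.stats`. The sketch below applies them to hypothetical data (the Wilcoxon call is for illustration only; it treats the two samples as if they were paired):

```python
import numpy as np
from scipy.stats import mannwhitneyu, wilcoxon, kruskal

rng = np.random.default_rng(0)
a = rng.normal(100, 5, size=25)
b = rng.normal(103, 5, size=25)   # slightly shifted
c = rng.normal(101, 5, size=25)

u_stat, p_u = mannwhitneyu(a, b)  # two independent groups
w_stat, p_w = wilcoxon(a, b)      # paired samples (illustrative here)
h_stat, p_h = kruskal(a, b, c)    # three or more groups

print(f"Mann-Whitney U: p = {p_u:.4f}")
print(f"Wilcoxon:       p = {p_w:.4f}")
print(f"Kruskal-Wallis: p = {p_h:.4f}")
```

Because all three operate on ranks, replacing one observation with an extreme outlier barely changes the statistics, unlike their parametric counterparts.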

  • Goodness-of-fit tests check whether your data follows a specific theoretical distribution. The chi-square goodness-of-fit test compares observed bin counts to expected counts under the hypothesised distribution.

  • Normality tests specifically check whether data is normally distributed. Common ones include the Shapiro-Wilk test (powerful for small samples) and the Kolmogorov-Smirnov test (compares the sample CDF to the theoretical CDF).
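A quick sketch of the Shapiro-Wilk test on two hypothetical samples, one genuinely normal and one exponentially distributed (so clearly non-normal):

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(42)
normal_data = rng.normal(0.0, 1.0, size=200)
skewed_data = rng.exponential(1.0, size=200)  # clearly non-normal

stat_n, p_normal = shapiro(normal_data)
stat_s, p_skewed = shapiro(skewed_data)
print(f"Shapiro-Wilk, normal data: p = {p_normal:.4f}")
print(f"Shapiro-Wilk, skewed data: p = {p_skewed:.6f}")
```

Note the direction of the hypotheses: here \(H_0\) is "the data is normal," so a *small* p-value is evidence of non-normality, while a large one only means normality was not ruled out.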

  • In ML, hypothesis testing appears when you compare model performance. If model A achieves 92% accuracy and model B achieves 91%, is the difference real or just noise? A paired t-test on cross-validation scores can answer this.
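A sketch of that comparison, using hypothetical 5-fold cross-validation accuracies and SciPy's paired t-test:

```python
import numpy as np
from scipy.stats import ttest_rel  # paired t-test

# Hypothetical 5-fold cross-validation accuracies for two models;
# the pairing matters because both models are scored on the same folds
scores_a = np.array([0.92, 0.91, 0.93, 0.92, 0.90])
scores_b = np.array([0.91, 0.90, 0.92, 0.91, 0.90])

t_stat, p_value = ttest_rel(scores_a, scores_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

Pairing by fold removes the fold-to-fold variation, so even a small but consistent gap between the models can reach significance.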

Coding Tasks (use Colab or a notebook)

  1. Perform a z-test for the bolt factory example from the text. Compute the test statistic, p-value, and make a decision.

    import jax.numpy as jnp
    
    x_bar = 10.3    # sample mean
    mu_0 = 10.0     # null hypothesis value
    sigma = 0.9     # known population std
    n = 36           # sample size
    alpha = 0.05
    
    # Test statistic
    z = (x_bar - mu_0) / (sigma / jnp.sqrt(n))
    print(f"z = {z:.4f}")
    
    # p-value (two-tailed) using the normal CDF
    # For |z| = 2.0, p ≈ 0.0455
    from jax.scipy.stats import norm
    p_value = 2 * (1 - norm.cdf(jnp.abs(z)))
    print(f"p-value = {p_value:.4f}")
    print(f"Reject H₀? {p_value <= alpha}")
    

  2. Simulate Type I error: when \(H_0\) is true, how often do we mistakenly reject it? Run 10,000 experiments and check that the rejection rate matches \(\alpha\).

    import jax
    import jax.numpy as jnp
    from jax.scipy.stats import norm
    
    key = jax.random.PRNGKey(0)
    mu_0 = 50.0
    sigma = 10.0
    n = 30
    alpha = 0.05
    n_experiments = 10_000
    
    rejections = 0
    for i in range(n_experiments):
        key, subkey = jax.random.split(key)
        # Data generated under H0: the true mean really is mu_0
        sample = mu_0 + sigma * jax.random.normal(subkey, shape=(n,))
        z = (sample.mean() - mu_0) / (sigma / jnp.sqrt(n))
        p_value = 2 * (1 - norm.cdf(jnp.abs(z)))
        if p_value <= alpha:
            rejections += 1
    
    print(f"Rejection rate: {rejections/n_experiments:.4f}")
    print(f"Expected (α):   {alpha}")
    

  3. Compare a t-test and a Mann-Whitney U test on two groups. Generate data where one group has a slightly higher mean and see which test detects the difference.

    import jax
    import jax.numpy as jnp
    
    key = jax.random.PRNGKey(99)
    k1, k2 = jax.random.split(key)
    
    group_a = jax.random.normal(k1, shape=(25,)) * 5 + 100
    group_b = jax.random.normal(k2, shape=(25,)) * 5 + 103  # slightly higher mean
    
    # Two-sample t-test (equal variance assumed)
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = group_a.mean(), group_b.mean()
    pooled_var = ((n_a - 1) * group_a.var() + (n_b - 1) * group_b.var()) / (n_a + n_b - 2)
    se = jnp.sqrt(pooled_var * (1/n_a + 1/n_b))
    t_stat = (mean_a - mean_b) / se
    print(f"T-test statistic: {t_stat:.4f}")
    
    # Mann-Whitney U for group B: count pairs where a group_b value
    # exceeds a group_a value
    u_stat = jnp.sum(group_a[:, None] < group_b[None, :])
    print(f"Mann-Whitney U:   {u_stat}")
    print(f"\nGroup A mean: {mean_a:.2f}, Group B mean: {mean_b:.2f}")