Statistical Measures

Statistical measures summarise data with single numbers that capture spread, position, shape, and association. This file covers variance, standard deviation, quartiles, skewness, kurtosis, covariance, correlation, and z-scores -- the toolkit for exploratory data analysis and feature engineering in ML.

In the previous file we introduced moments as a family of summary statistics. Here we unpack the practical tools that flow from them: measures of dispersion, position, shape, and association.
Dispersion answers the question: how spread out is the data? Two classrooms can have the same average test score, but very different spreads.

Two distributions with the same mean but different spreads

The narrow (blue) distribution has low variance: most values cluster tightly around the mean. The wide (red) distribution has high variance: values are scattered further out.
Variance is the average squared distance from the mean. We square to avoid positive and negative deviations cancelling each other out.

\[\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2\]

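When working with a sample (not the full population), we divide by \(N - 1\) instead of \(N\). This correction (called Bessel's correction) is needed because deviations are measured from the sample mean \(\bar{x}\), which always sits at least as close to the sample points as the true mean \(\mu\) does. Dividing by \(N\) would therefore underestimate the true variance on average: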

\[s^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2\]
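The two formulas above differ only in the divisor. A minimal sketch, using a small made-up dataset, computes both by hand and then checks them against `jnp.var`, which selects the divisor through its `ddof` argument (`ddof=0` is the default):

```python
import jax.numpy as jnp

# Illustrative data (made up for this example); its mean is exactly 5.
x = jnp.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

n = x.shape[0]
sq_dev = (x - jnp.mean(x)) ** 2

pop_var = jnp.sum(sq_dev) / n           # population variance: divide by N
sample_var = jnp.sum(sq_dev) / (n - 1)  # sample variance: divide by N - 1

print(f"population variance: {float(pop_var):.4f}")
print(f"sample variance:     {float(sample_var):.4f}")

# jnp.var exposes the same choice through ddof ("delta degrees of freedom").
print(f"jnp.var(ddof=0):     {float(jnp.var(x)):.4f}")
print(f"jnp.var(ddof=1):     {float(jnp.var(x, ddof=1)):.4f}")
```

The sample variance is always the larger of the two, and the gap shrinks as \(N\) grows.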

Standard deviation is the square root of variance: \(\sigma = \sqrt{\sigma^2}\). It brings the measure back to the original units. If your data is in centimetres, variance is in cm\(^2\), but standard deviation is back in cm.
Mean Absolute Deviation (MAD) is a simpler alternative. Instead of squaring, take the absolute value of each deviation:

\[\text{MAD} = \frac{1}{N} \sum_{i=1}^{N} |x_i - \mu|\]

MAD is more robust to outliers than variance because it does not amplify large deviations by squaring them. However, variance is more mathematically convenient (it decomposes nicely in proofs and ML optimisation).
Position answers a different question: where does a specific value sit relative to the rest of the data?
Quartiles split sorted data into four equal parts. Q1 (25th percentile) is the value below which 25% of data falls. Q2 is the median (50th percentile). Q3 is the 75th percentile.
The Interquartile Range (IQR) is \(Q3 - Q1\). It captures the spread of the middle 50% of data, ignoring extremes.
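As a quick illustration, the quartiles and IQR of a small made-up dataset can be computed with `jnp.percentile`. This sketch relies on the function's default linear interpolation between sorted values; other quantile conventions exist and can give slightly different quartiles on small samples.

```python
import jax.numpy as jnp

# Illustrative data: nine evenly spaced values, already sorted.
data = jnp.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])

# Request the 25th, 50th, and 75th percentiles in one call.
q1, q2, q3 = jnp.percentile(data, jnp.array([25.0, 50.0, 75.0]))
iqr = q3 - q1

print(f"Q1 = {float(q1)}, median = {float(q2)}, Q3 = {float(q3)}")
print(f"IQR = {float(iqr)}")
```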

Box plot showing Q1, median, Q3, IQR, whiskers, and an outlier

The box plot is one of the most useful visualisations in statistics. The box spans Q1 to Q3, the line inside is the median, whiskers extend to the most extreme non-outlier values, and dots beyond the whiskers are outliers.
Percentiles generalise quartiles. The \(p\)-th percentile is the value below which \(p\%\) of observations fall. Q1 is the 25th percentile, the median is the 50th, and Q3 is the 75th.
The z-score tells you how many standard deviations a value is from the mean:

\[z = \frac{x - \mu}{\sigma}\]

A z-score of 2 means the value is 2 standard deviations above the mean. A z-score of \(-1.5\) means it is 1.5 standard deviations below. Applying this transformation to every value of a feature is called standardisation. It is used heavily in ML for feature scaling, because the rescaled feature has mean 0 and standard deviation 1 whatever its original units.
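A minimal sketch of standardisation on a made-up feature (the name `heights_cm` is purely illustrative) shows the mean and standard deviation of the z-scores landing at 0 and 1:

```python
import jax.numpy as jnp

# Hypothetical feature values in centimetres.
heights_cm = jnp.array([150.0, 160.0, 165.0, 170.0, 180.0, 195.0])

mu = jnp.mean(heights_cm)
sigma = jnp.std(heights_cm)  # population standard deviation (ddof=0)

z = (heights_cm - mu) / sigma

print("z-scores:", z)
print(f"mean of z: {float(jnp.mean(z)):.6f}")
print(f"std of z:  {float(jnp.std(z)):.6f}")
```

Standardisation only shifts and rescales the data. It does not change the shape of the distribution, so a skewed feature stays skewed.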
Shape describes the geometry of a distribution beyond its centre and spread.
Skewness (the standardised 3rd moment from the previous file) measures asymmetry. A perfectly symmetric distribution like the normal curve has skewness of zero. Positive skewness means a longer right tail (e.g. income distributions). Negative skewness means a longer left tail (e.g. age at retirement).

\[\text{Skewness} = \frac{1}{N} \sum_{i=1}^{N} \left(\frac{x_i - \mu}{\sigma}\right)^3\]

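Kurtosis (the standardised 4th moment) measures tail heaviness. The normal distribution has kurtosis of 3. Distributions with heavier tails (more prone to outliers) have kurtosis greater than 3. Many libraries report excess kurtosis instead, which subtracts 3 so that the normal distribution scores 0.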

\[\text{Kurtosis} = \frac{1}{N} \sum_{i=1}^{N} \left(\frac{x_i - \mu}{\sigma}\right)^4\]
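Both formulas are the mean of a power of the z-scores, so one small helper covers them. The sketch below uses a hypothetical helper named `standardised_moment` and two made-up datasets: one symmetric, one with a long right tail.

```python
import jax.numpy as jnp

def standardised_moment(x, k):
    """Mean of the k-th power of the z-scores (population convention)."""
    z = (x - jnp.mean(x)) / jnp.std(x)
    return jnp.mean(z ** k)

# Illustrative data: one symmetric sample, one with a long right tail.
symmetric = jnp.array([1.0, 2.0, 3.0, 4.0, 5.0])
right_tailed = jnp.array([1.0, 1.0, 2.0, 2.0, 3.0, 10.0])

for name, x in [("symmetric", symmetric), ("right-tailed", right_tailed)]:
    skew = standardised_moment(x, 3)
    kurt = standardised_moment(x, 4)
    print(f"{name:>12}: skewness = {float(skew):.3f}, kurtosis = {float(kurt):.3f}")
```

The symmetric sample has skewness of zero, while the single large value in the second sample pulls its skewness positive.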

Correlation measures the strength and direction of a relationship between two variables. It answers: when one variable goes up, does the other tend to go up, go down, or do nothing?

Three scatter plots showing positive, no, and negative correlation

Pearson correlation (\(r\)) measures linear association. It ranges from \(-1\) (perfect negative) through \(0\) (none) to \(+1\) (perfect positive).

\[r = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2} \cdot \sqrt{\sum (y_i - \bar{y})^2}}\]

If you recall dot products from Chapter 1, Pearson correlation is essentially the cosine similarity between the mean-centred versions of \(\mathbf{x}\) and \(\mathbf{y}\).
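The cosine-similarity view can be checked numerically. This sketch, on made-up data, mean-centres both vectors, computes their cosine similarity, and compares the result with `jnp.corrcoef`:

```python
import jax.numpy as jnp

# Illustrative paired observations.
x = jnp.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = jnp.array([2.0, 1.0, 4.0, 3.0, 6.0])

# Mean-centre each variable.
xc = x - jnp.mean(x)
yc = y - jnp.mean(y)

# Cosine similarity of the centred vectors.
cosine = jnp.dot(xc, yc) / (jnp.linalg.norm(xc) * jnp.linalg.norm(yc))

# Off-diagonal entry of the 2x2 correlation matrix is Pearson r.
r_builtin = jnp.corrcoef(x, y)[0, 1]

print(f"cosine of centred vectors: {float(cosine):.4f}")
print(f"jnp.corrcoef:              {float(r_builtin):.4f}")
```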
Spearman correlation (\(\rho\)) measures monotonic association. Instead of using raw values, it ranks them first and then computes Pearson correlation on the ranks. This makes it robust to outliers and works even when the relationship is nonlinear, as long as it is consistently increasing or decreasing.
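To see the difference between linear and monotonic association, the sketch below uses an exponential relationship: it always increases, but is far from a straight line. The `ranks` helper is a simplification that assumes there are no tied values; a full implementation would assign tied values their average rank.

```python
import jax.numpy as jnp

def ranks(v):
    # argsort of argsort gives 0-based ranks. Assumes no ties.
    return jnp.argsort(jnp.argsort(v)).astype(jnp.float32)

x = jnp.arange(1.0, 9.0)  # 1, 2, ..., 8
y = jnp.exp(x)            # strictly increasing, strongly nonlinear

pearson_r = jnp.corrcoef(x, y)[0, 1]
spearman_rho = jnp.corrcoef(ranks(x), ranks(y))[0, 1]

print(f"Pearson r:    {float(pearson_r):.4f}")
print(f"Spearman rho: {float(spearman_rho):.4f}")
```

Spearman reaches 1 because the ranks of `x` and `y` agree exactly, while Pearson falls well short of 1 because the points do not lie on a line.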
Geometric mean is the appropriate average when values multiply together, like growth rates. If your investment grows by 10%, then 20%, then 30%, the average growth factor is not the arithmetic mean of those rates. Instead:

\[\bar{x}_{\text{geo}} = \left(\prod_{i=1}^{N} x_i\right)^{1/N}\]

For growth rates specifically, convert percentages to factors first (1.10, 1.20, 1.30), compute the geometric mean, then subtract 1.
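The recipe above, applied to the 10%, 20%, 30% example, can be sketched as follows. The geometric mean is computed through logarithms (the exponential of the mean log), which is equivalent to the product formula and avoids overflow on long series.

```python
import jax.numpy as jnp

rates = jnp.array([0.10, 0.20, 0.30])  # yearly growth rates from the text
factors = 1.0 + rates                  # 1.10, 1.20, 1.30

# Geometric mean via logs: exp(mean(log(x))) == (prod x)^(1/N)
geo_factor = jnp.exp(jnp.mean(jnp.log(factors)))
geo_rate = geo_factor - 1.0
arith_rate = jnp.mean(rates)

print(f"arithmetic mean rate: {float(arith_rate):.4%}")
print(f"geometric mean rate:  {float(geo_rate):.4%}")

# Compounding the geometric rate for 3 periods reproduces the total growth.
print(f"total growth:         {float(jnp.prod(factors)):.4f}")
print(f"geo_factor ** 3:      {float(geo_factor ** 3):.4f}")
```

The geometric rate comes out slightly below the arithmetic 20%, and it is the only constant rate that compounds to the same total.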
Exponential Moving Average (EMA) gives more weight to recent observations. Unlike a simple moving average where all points in the window are equally weighted, EMA decays exponentially:

\[\text{EMA}_t = \alpha \cdot x_t + (1 - \alpha) \cdot \text{EMA}_{t-1}\]

The smoothing factor \(\alpha\) (between 0 and 1) controls how quickly old observations lose influence. Higher \(\alpha\) means more responsive to recent changes, lower \(\alpha\) means smoother. In ML, EMA is used in optimisers like Adam and in batch normalisation's running statistics.
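To see where the exponential decay comes from, substitute the recurrence into itself repeatedly. Starting from an initial value \(\text{EMA}_0\):

\[\text{EMA}_t = \alpha x_t + \alpha(1-\alpha)\, x_{t-1} + \alpha(1-\alpha)^2\, x_{t-2} + \dots = \alpha \sum_{k=0}^{t-1} (1-\alpha)^k\, x_{t-k} + (1-\alpha)^t\, \text{EMA}_0\]

Each step back in time multiplies the weight by \((1 - \alpha)\). With \(\alpha = 0.5\), for example, the weights on \(x_t, x_{t-1}, x_{t-2}\) are \(0.5, 0.25, 0.125\).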
Outlier detection identifies data points that are unusually far from the rest. Two common methods:
- IQR method: a point is an outlier if it falls below \(Q1 - 1.5 \times \text{IQR}\) or above \(Q3 + 1.5 \times \text{IQR}\)
- Z-score method: a point is an outlier if \(|z| > 3\) (more than 3 standard deviations from the mean)
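The IQR method is more robust because quartiles are barely moved by a few extreme values, whereas the mean and standard deviation used in the z-score are themselves inflated by the outliers being tested. The z-score method works well when data is approximately normal but can miss outliers when the sample is small or the distribution is heavily skewed.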

Coding Tasks (use CoLab or notebook)

Compute variance, standard deviation, and MAD for a dataset and compare them. Observe what happens when you add an extreme outlier.

import jax.numpy as jnp

data = jnp.array([4, 8, 6, 5, 3, 7, 9, 5, 6, 7], dtype=jnp.float32)

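mean = jnp.mean(data)
variance = jnp.var(data)              # population variance (ddof=0); pass ddof=1 for the sample version
std = jnp.std(data)                   # square root of the variance, same ddof convention
mad = jnp.mean(jnp.abs(data - mean))  # mean absolute deviation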

print("Original data:")
print(f"  Variance: {variance:.3f}, Std: {std:.3f}, MAD: {mad:.3f}")

# Add an outlier and recompute
data_outlier = jnp.append(data, 100.0)
mean2 = jnp.mean(data_outlier)
var2 = jnp.var(data_outlier)
std2 = jnp.std(data_outlier)
mad2 = jnp.mean(jnp.abs(data_outlier - mean2))

print("\nWith outlier (100):")
print(f"  Variance: {var2:.3f}, Std: {std2:.3f}, MAD: {mad2:.3f}")

Compute Pearson and Spearman correlation between two variables. Experiment with different relationships.

import jax.numpy as jnp

# Perfect linear relationship
x = jnp.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=jnp.float32)
y = 2 * x + 1  # try changing this!

def pearson(a, b):
    a_c = a - jnp.mean(a)
    b_c = b - jnp.mean(b)
    return jnp.sum(a_c * b_c) / (jnp.sqrt(jnp.sum(a_c**2)) * jnp.sqrt(jnp.sum(b_c**2)))

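def spearman(a, b):
    # argsort of argsort gives 0-based ranks. Tied values get arbitrary distinct
    # ranks rather than averaged ranks, so this is exact only for data without ties.
    rank_a = jnp.argsort(jnp.argsort(a)).astype(jnp.float32)
    rank_b = jnp.argsort(jnp.argsort(b)).astype(jnp.float32)
    return pearson(rank_a, rank_b)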

print(f"Pearson r:  {pearson(x, y):.4f}")
print(f"Spearman ρ: {spearman(x, y):.4f}")

Implement outlier detection using both the IQR and z-score methods, then compare their results on skewed data.

import jax.numpy as jnp

data = jnp.array([2, 3, 3, 4, 5, 5, 5, 6, 6, 7, 50], dtype=jnp.float32)

# IQR method
q1, q3 = jnp.percentile(data, 25), jnp.percentile(data, 75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = data[(data < lower) | (data > upper)]
print(f"IQR bounds: [{lower:.1f}, {upper:.1f}]")
print(f"IQR outliers: {iqr_outliers}")

# Z-score method
z_scores = (data - jnp.mean(data)) / jnp.std(data)
z_outliers = data[jnp.abs(z_scores) > 3]
print(f"\nZ-scores: {z_scores}")
print(f"Z-score outliers (|z| > 3): {z_outliers}")

Compute and plot an Exponential Moving Average with different smoothing factors on noisy data.

import jax
import jax.numpy as jnp
import matplotlib.pyplot as plt

# Generate noisy data: a linear trend plus standard normal noise
key = jax.random.PRNGKey(0)
noise = jax.random.normal(key, shape=(50,))
signal = jnp.linspace(0, 5, 50) + noise

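def ema(data, alpha):
    # Seed the average with the first observation, then apply the recurrence.
    # A plain Python loop is fine for 50 points; for long sequences,
    # jax.lax.scan is the usual way to express this kind of recurrence.
    result = jnp.zeros_like(data)
    result = result.at[0].set(data[0])
    for t in range(1, len(data)):
        result = result.at[t].set(alpha * data[t] + (1 - alpha) * result[t - 1])
    return result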

plt.figure(figsize=(10, 4))
plt.plot(signal, "o", alpha=0.3, label="raw data", color="#999")
for alpha, color in [(0.1, "#e74c3c"), (0.3, "#3498db"), (0.7, "#27ae60")]:
    plt.plot(ema(signal, alpha), label=f"α={alpha}", color=color, linewidth=2)
plt.legend()
plt.title("EMA with different smoothing factors")
plt.show()