The problem
The standard deviation is a measure of the spread of the data: roughly speaking, it is the typical distance of the data points from the mean.
We are rarely interested in the amount of variation in our sample: the sample standard deviation is only useful as an approximation of the population standard deviation.
When our sample is the whole population of interest, then the formula for the standard deviation is:
\(\sigma = \sqrt{\frac{\sum_{i=1}^{N}(x_i-\mu)^2}{N}}\)
Where:
- σ is the population standard deviation.
- xᵢ is a data point.
- µ is the population mean.
- N is the population size.
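As a quick sanity check, the formula can be applied directly in R. The eight values below are a toy population chosen purely for illustration, because their mean and standard deviation come out to whole numbers:

```r
x  <- c(2, 4, 4, 4, 5, 5, 7, 9)  # a toy population of N = 8 values
N  <- length(x)
mu <- mean(x)                    # population mean: 5

# population standard deviation: divide the sum of squared
# deviations from the population mean by N
sigma <- sqrt(sum((x - mu)^2) / N)
sigma                            # 2
```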
But the problem is that most of the time, our sample is only a small subset of the population, so we do not know the value of µ. Instead, we want to use x̄ (the sample mean) as an approximation of µ.
Intuition
If we want to estimate the average distance of the data from µ (i.e. the population standard deviation), we have to account for two things:
- the average distance of the data from x̄ (i.e. the sample standard deviation), and
- the average distance of x̄ from µ.
Because x̄ is calculated from the sample itself, the data are on average closer to x̄ than to µ, so the sample standard deviation (i.e. division by n) will be an underestimation of the true standard deviation of the population. To correct for this underestimation, we divide by a smaller number, n-1, which inflates the estimate slightly:
\(s = \sqrt{\frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}}\)
Where:
- s is the sample standard deviation.
- xᵢ is a data point.
- x̄ is the sample mean.
- n is the sample size.
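Two quick checks in R support this reasoning. This is a sketch assuming a normal sample with a hypothetical known population mean of 50: first, the sum of squared deviations about x̄ is never larger than about µ, because x̄ is precisely the value that minimizes it; second, R's built-in `sd()` already uses the n-1 denominator:

```r
set.seed(42)
mu <- 50                             # hypothetical known population mean
x  <- rnorm(30, mean = mu, sd = 10)  # a sample of n = 30 observations
n  <- length(x)

# the sample mean minimizes the sum of squared deviations,
# so deviations about x-bar understate deviations about mu
sum((x - mean(x))^2) <= sum((x - mu)^2)                 # TRUE

# R's built-in sd() already divides by n-1
all.equal(sd(x), sqrt(sum((x - mean(x))^2) / (n - 1)))  # TRUE
```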
R simulation
Here’s a simulation in R demonstrating that dividing by n-1 provides a better estimate of the population standard deviation.
First, let’s create the population:
```r
# create a population of 1000 observations
set.seed(1)
N = 1000
pop = runif(n = N, min = 1, max = 100)

# calculate the population mean
mean(pop)
# 50.46948

# calculate the population standard deviation
sqrt(sum( (pop - mean(pop))^2 ) / N)
# 28.53604
```
So, the true (population) standard deviation is 28.54.
Now, let’s take some samples from this population and try different ways of calculating the sample standard deviation to see which one is the closest to 28.54:
```r
# create 10,000 samples of 30 observations each
# and calculate the standard deviation
n = 30

# dividing by n
sd_n = replicate(10000, {               # repeat 10,000 times
  s = sample(pop, n)                    # take a sample
  sqrt(sum( (s - mean(s))^2 ) / n)      # calculate sd
})

# dividing by n-1
sd_nMinus1 = replicate(10000, {         # repeat 10,000 times
  s = sample(pop, n)                    # take a sample
  sqrt(sum( (s - mean(s))^2 ) / (n-1))  # calculate sd
})

# dividing by n, but using the true population mean
sd_n_popMean = replicate(10000, {       # repeat 10,000 times
  s = sample(pop, n)                    # take a sample
  sqrt(sum( (s - mean(pop))^2 ) / n)    # calculate sd using the pop mean
})

# average results of the simulations
mean(sd_n)          # 27.96303
mean(sd_nMinus1)    # 28.46863
mean(sd_n_popMean)  # 28.42268
```
This confirms that dividing by n-1 when calculating the sample standard deviation yields a better estimate of the population standard deviation. This is only true when we use the sample mean as an estimate of the population mean: if the population mean is known, there is no need for the “n-1” correction, as the third simulation shows.
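In practice there is no need to write the formula by hand: R’s built-in `sd()` uses the n-1 denominator internally, so it can replace the manual calculation in the simulation. A minimal sketch, re-creating the same population as above:

```r
# re-create the same population as above
set.seed(1)
pop = runif(n = 1000, min = 1, max = 100)
n = 30

# sd() divides by n-1, so this reproduces the sd_nMinus1 simulation
est = mean(replicate(10000, sd(sample(pop, n))))
est  # close to the true value of 28.54
```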