Let’s start by creating IQ, a normally distributed numerical variable, with a mean of 100 and a standard deviation of 15, that represents the IQ scores of a sample of 100 participants:
set.seed(1)
# create IQ by sampling data from a normal distribution
IQ = rnorm(n = 100, mean = 100, sd = 15)
Next, we will (1) summarize this variable and (2) describe its distribution.
1. Summary statistics
# get the mean and important percentiles
summary(IQ)
# outputs:
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#   66.78   92.59  101.71  101.63  110.37  136.02
Reminder: The 1st quartile is the 25th percentile, meaning that 25% of the data fall below this value. Similarly, the 3rd quartile is the 75th percentile, meaning that 75% of the data fall below this value.
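As a quick check of this reminder, the sketch below recomputes the two quartiles with quantile() and measures the share of observations below each one. The names q1 and q3 are introduced here for illustration only, and the sample is recreated so that the example runs on its own.

```r
# recreate the sample so the example is self-contained
set.seed(1)
IQ = rnorm(n = 100, mean = 100, sd = 15)

# 25th and 75th percentiles (quantile() is what summary() uses for its quartiles)
q1 = quantile(IQ, probs = 0.25)
q3 = quantile(IQ, probs = 0.75)

# proportion of observations that fall below each quartile
mean(IQ < q1) # 0.25
mean(IQ < q3) # 0.75
```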
We can visualize these numbers by looking at the boxplot:
# box plot
boxplot(IQ, col = '#D3E5EE')
# show the 5 numbers that describe the boxplot
# (fivenum() returns the overall minimum and maximum, which coincide with
# the whisker ends only when the boxplot shows no outliers)
text(x = 1.33, y = fivenum(IQ),
     labels = paste(round(fivenum(IQ), 2),
                    c('(Minimum, excluding outliers)',
                      '(25th percentile)',
                      '(Median)',
                      '(75th percentile)',
                      '(Maximum, excluding outliers)')))
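If you want the exact numbers that boxplot() draws, boxplot.stats() returns them without plotting anything. This is only a minimal sketch: the name bs is an illustrative variable, and the sample is recreated so that the example runs on its own.

```r
# recreate the sample so the example is self-contained
set.seed(1)
IQ = rnorm(n = 100, mean = 100, sd = 15)

# stats: lower whisker, lower hinge, median, upper hinge, upper whisker
# out:   any points beyond the whiskers (drawn individually as outliers)
bs = boxplot.stats(IQ)
bs$stats
bs$out

# the hinges and the median are the same values that fivenum() returns
all.equal(bs$stats[2:4], fivenum(IQ)[2:4])
```

When bs$out is empty, the whisker ends are simply the minimum and maximum of the sample.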
Here’s a summary of what we can learn from this boxplot:
| Statistic | Calculation | Value | Interpretation |
|---|---|---|---|
| Range | The range is the difference between the maximum and minimum values: 136.02 – 66.78 = 69.24 | 69.24 | In our sample, IQ scores differ by up to 69.24 points between one person and another. |
| Interquartile range (IQR) | The IQR is the difference between the 75th percentile and the 25th percentile. Using the values labelled on the boxplot (computed by fivenum(), which can differ slightly from the quartiles reported by summary()): 110.40 – 92.34 = 18.06 | 18.06 | The middle 50% of the data are spread over only 26% of the range (IQR / range = 18.06 / 69.24 = 0.26). This means that the data are not uniformly distributed: many data points are clustered around the median. |
| Median | The median is the 50th percentile. Since our sample contains an even number of values, there are 2 middle values and the median is their average: (101.12 + 102.30) / 2 = 101.71 | 101.71 | A typical person in our sample has an IQ of about 101.7: half the sample has a lower IQ score and half has a higher one. |
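The three statistics in this table can also be computed directly. The following is a rough sketch rather than the only way to do it; the variable names (iq_range, iqr_hinges, iqr_quantile, middle) are made up for this example, and the sample is recreated so that the code runs on its own.

```r
# recreate the sample so the example is self-contained
set.seed(1)
IQ = rnorm(n = 100, mean = 100, sd = 15)

# range: maximum minus minimum
iq_range = diff(range(IQ))

# IQR from the hinges labelled on the boxplot (elements 2 and 4 of fivenum)
iqr_hinges = diff(fivenum(IQ)[c(2, 4)])
# IQR() relies on quantile() instead, so it can differ slightly from the hinges
iqr_quantile = IQR(IQ)

# share of the range covered by the middle 50% of the data
iqr_hinges / iq_range

# median as the average of the 2 middle values (n = 100 is even)
middle = sort(IQ)[c(50, 51)]
mean(middle)
```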
2. Describing the distribution
A frequency histogram divides the numerical variable into bins and shows, on the y-axis, the number of observations that fall into each bin:
# frequency histogram
min(IQ) # 66.7795
max(IQ) # 136.0243
# so create a custom x-axis, starting from 65 up to 140
xvalues = seq(65, 140, by = 5)
hist(IQ, breaks = xvalues,
     col = '#D3E5EE', border = '#98C2D7',
     labels = TRUE, # show bin values
     xaxt = 'n') # remove x-axis text
axis(1, at = xvalues) # add the custom x-axis
This histogram shows, for example, that 17 values fall between 105 and 110, and only 3 values are above 125.
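To read these counts without looking at the plot, one option is to bin the values yourself. The sketch below assumes the same breaks as the histogram above; the names bins, counts and h are illustrative, and the sample is recreated so that the example runs on its own.

```r
# recreate the sample so the example is self-contained
set.seed(1)
IQ = rnorm(n = 100, mean = 100, sd = 15)

# same bin edges as the custom x-axis of the histogram
xvalues = seq(65, 140, by = 5)

# assign each value to a bin; like hist(), cut() uses right-closed intervals by default
bins = cut(IQ, breaks = xvalues)
counts = table(bins)
counts

# compare with the counts computed by hist() itself (without drawing the plot)
h = hist(IQ, breaks = xvalues, plot = FALSE)
all(h$counts == as.vector(counts))
```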
Now, instead of showing the frequency (i.e. the number of occurrences) of values that fall into each bin, we can plot a density histogram. Here the y-axis shows the density, scaled so that the area of each bar (height × bin width) equals the proportion of values in that bin and the total area equals 1:
# density histogram
hist(IQ, breaks = 10, freq = FALSE,
     col = '#D3E5EE', border = '#98C2D7')
# probability density curve
lines(density(IQ), lwd = 2, col = '#3E7FA3')
# show the mean ± 1 standard deviation
abline(v = mean(IQ), lwd = 2)
legend(x = mean(IQ) - 4, y = 0.033, # x and y coordinates of the legend, picked manually by trial and error
       'Mean', adj = 0.3)
abline(v = mean(IQ) + sd(IQ), lwd = 2, lty = 2)
legend(x = mean(IQ) + sd(IQ) - 7, y = 0.033,
       'Mean + 1 SD', adj = 0.2)
abline(v = mean(IQ) - sd(IQ), lwd = 2, lty = 2)
legend(x = mean(IQ) - sd(IQ) - 7, y = 0.033,
       'Mean - 1 SD', adj = 0.2)
The density histogram and the density curve suggest that the IQ scores are normally distributed, since the distribution is bell-shaped and symmetric (i.e. not skewed).
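To make the y-axis of the density histogram more concrete, the following sketch checks that the area of each bar is the proportion of values in its bin, and that the areas add up to 1. It also reports the share of the sample within 1 standard deviation of the mean. The names h and bar_areas are illustrative, and the sample is recreated so that the example runs on its own.

```r
# recreate the sample so the example is self-contained
set.seed(1)
IQ = rnorm(n = 100, mean = 100, sd = 15)

# compute the histogram without drawing it
h = hist(IQ, breaks = 10, plot = FALSE)

# area of each bar = density (height) x bin width = proportion of values in the bin
bar_areas = h$density * diff(h$breaks)
all.equal(bar_areas, h$counts / length(IQ))

# the bar areas add up to 1
sum(bar_areas)

# proportion of the sample within 1 standard deviation of the mean
# (roughly 0.68 would be expected for a normal distribution)
mean(abs(IQ - mean(IQ)) <= sd(IQ))
```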
Another graphical way to assess normality is to look at the normal Q-Q plot:
# normal Q-Q plot
qqnorm(IQ, col = '#5398BE', pch = 16)
qqline(IQ)
The points are reasonably close to the diagonal line and they do not show any non-linear pattern, also suggesting that the IQ scores are normally distributed, with a mean and standard deviation of:
mean(IQ) # outputs: 101.63
sd(IQ) # outputs: 13.47
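For readers curious about what the normal Q-Q plot actually draws, here is a rough sketch that rebuilds its coordinates by hand: the sorted sample values are paired with the quantiles of a standard normal distribution at the probabilities given by ppoints(). The names theoretical, observed and qq are illustrative, and the sample is recreated so that the example runs on its own.

```r
# recreate the sample so the example is self-contained
set.seed(1)
IQ = rnorm(n = 100, mean = 100, sd = 15)

n = length(IQ)
# theoretical quantiles of a standard normal distribution
theoretical = qnorm(ppoints(n))
# sample quantiles: the observed values, sorted
observed = sort(IQ)

# the same coordinates as qqnorm(), obtained without drawing the plot
qq = qqnorm(IQ, plot.it = FALSE)
all.equal(sort(qq$x), theoretical)

# a correlation close to 1 means the points lie close to a straight line
cor(theoretical, observed)
```

This correlation is only an informal companion to the visual check, not a formal test of normality.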