Let’s start by creating *IQ*, a normally distributed numerical variable, with a mean of 100 and a standard deviation of 15, that represents the IQ scores of a sample of 100 participants:

    set.seed(1)

    # create IQ by sampling data from a normal distribution
    IQ = rnorm(n = 100, mean = 100, sd = 15)

Next, we will (1) summarize this variable and (2) describe its distribution.

## 1. Summary statistics

    # get the mean and important percentiles
    summary(IQ)
    # outputs:
    #    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    #   66.78   92.59  101.71  101.63  110.37  136.02

Reminder: The 1st quartile is the 25th percentile, meaning that 25% of the data fall below this value. Likewise, the 3rd quartile is the 75th percentile, meaning that 75% of the data fall below this value.
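As a quick check of this definition, the sketch below recomputes the two quartiles with `quantile()` and measures the share of scores that fall below each one. It recreates `IQ` with the same seed so that it runs on its own; the names `q` and `share_below` are illustrative only.

```r
set.seed(1)
IQ = rnorm(n = 100, mean = 100, sd = 15)

# 25th and 75th percentiles (the same quartiles that summary() reports)
q = quantile(IQ, probs = c(0.25, 0.75))

# proportion of scores strictly below each quartile
share_below = c(mean(IQ < q[1]), mean(IQ < q[2]))

print(q)
print(share_below)
```

With 100 distinct values, each quartile falls between two neighboring observations, so the shares come out at the nominal 25% and 75%.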

We can visualize these numbers by looking at the boxplot:

    # box plot
    boxplot(IQ, col = '#D3E5EE')

    # show the 5 numbers that describe the boxplot
    # (fivenum() returns the true minimum and maximum; these coincide with
    # the whisker ends only when the sample has no outliers)
    text(x = 1.33, y = fivenum(IQ),
         labels = paste(round(fivenum(IQ), 2),
                        c('(Minimum, excluding outliers)',
                          '(25th percentile)',
                          '(Median)',
                          '(75th percentile)',
                          '(Maximum, excluding outliers)')))

Here’s a summary of what we can learn from this boxplot:

| Quantity | Definition/Calculation | Value | Interpretation |
|---|---|---|---|
| Range | The range is the difference between the maximum and minimum values: 136.02 - 66.78 = 69.24 | 69.24 | According to our sample, IQ scores can differ by up to 69.24 points between one person and another. |
| Interquartile range (IQR) | IQR is the difference between the 75th percentile and the 25th percentile: 110.4 - 92.34 = 18.06 (using the values labeled on the boxplot; `fivenum()` computes hinges, which can differ slightly from the quartiles reported by `summary()`) | 18.06 | 50% of the data are spread over 26% of the range (IQR / range = 18.06 / 69.24 = 0.26). This means that the data are not uniformly distributed; instead, many data points are clustered around the median. |
| Median | The median is the 50th percentile. Since our sample contains an even number of values, there are 2 middle values and the median is their average: (101.12 + 102.30) / 2 = 101.71 | 101.71 | A typical person in our sample has an IQ of 101.7: half the sample has a lower IQ score and half has a higher one. |
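The quantities in this table can also be computed directly rather than read off the plot. The following is a minimal sketch that recreates `IQ` with the same seed; the variable names (`iq_range`, `iqr_hinges`, `iqr_quartiles`, `iq_median`) are illustrative. It computes the IQR twice, once from the hinges that `fivenum()` returns and once with `IQR()`, which uses the same quartiles as `summary()`, so the two results may differ slightly.

```r
set.seed(1)
IQ = rnorm(n = 100, mean = 100, sd = 15)

# range: maximum minus minimum
iq_range = diff(range(IQ))

# IQR from the hinges returned by fivenum() (positions 2 and 4)
five = fivenum(IQ)
iqr_hinges = five[4] - five[2]

# IQR from the quartiles that summary() reports
iqr_quartiles = IQR(IQ)

# median: with 100 values, the average of the 50th and 51st sorted values
middle = sort(IQ)[50:51]
iq_median = mean(middle)

print(round(c(range = iq_range, iqr_hinges = iqr_hinges,
              iqr_quartiles = iqr_quartiles, median = iq_median), 2))

# share of the range covered by the middle 50% of the data
print(round(iqr_hinges / iq_range, 2))
```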

## 2. Describing the distribution

A **frequency histogram** divides the numerical variable into bins and shows, on the y-axis, the number of observations that fall into each bin:

    # frequency histogram
    min(IQ) # 66.7795
    max(IQ) # 136.0243

    # so create a custom x-axis, starting from 65 up to 140
    xvalues = seq(65, 140, by = 5)

    hist(IQ, breaks = xvalues,
         col = '#D3E5EE', border = '#98C2D7',
         labels = TRUE, # show the count above each bin
         xaxt = 'n')    # remove x-axis text

    axis(1, at = xvalues) # add the custom x-axis

This histogram shows, for example, that 17 values fall between 105 and 110, and only 3 values are above 125.
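The bin counts can also be read as numbers instead of from the bar labels. The sketch below, which recreates `IQ` and `xvalues` so that it runs on its own, asks `hist()` for the counts without drawing anything and cross-checks them with `cut()` and `table()`. The names `h`, `bins` and `counts_cut` are illustrative.

```r
set.seed(1)
IQ = rnorm(n = 100, mean = 100, sd = 15)

xvalues = seq(65, 140, by = 5)

# compute the histogram without drawing it
h = hist(IQ, breaks = xvalues, plot = FALSE)

# one row per bin: lower edge, upper edge, number of observations
bins = data.frame(from = head(h$breaks, -1),
                  to = tail(h$breaks, -1),
                  count = h$counts)
print(bins)

# the same counts obtained by cutting IQ at the bin edges
# (right = TRUE matches hist()'s default of right-closed bins)
counts_cut = as.vector(table(cut(IQ, breaks = xvalues, right = TRUE)))
```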

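Now, instead of showing the frequency (i.e. the number of occurrences) of values that fall into each bin, we can plot a **density histogram**, whose y-axis shows the density of each bin: the proportion of observations in the bin divided by the bin width. The area of each bar (rather than its height) is therefore the proportion of observations in that bin, and the areas of all bars sum to 1: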

    # density histogram
    hist(IQ, breaks = 10, freq = FALSE,
         col = '#D3E5EE', border = '#98C2D7')

    # probability density curve
    lines(density(IQ), lwd = 2, col = '#3E7FA3')

    # show the mean ± 1 standard deviation
    abline(v = mean(IQ), lwd = 2)
    legend(x = mean(IQ) - 4, y = 0.033, # x and y coordinates of the legend, picked manually by trial and error
           'Mean', adj = 0.3)

    abline(v = mean(IQ) + sd(IQ), lwd = 2, lty = 2)
    legend(x = mean(IQ) + sd(IQ) - 7, y = 0.033,
           'Mean + 1 SD', adj = 0.2)

    abline(v = mean(IQ) - sd(IQ), lwd = 2, lty = 2)
    legend(x = mean(IQ) - sd(IQ) - 7, y = 0.033,
           'Mean - 1 SD', adj = 0.2)

The density histogram and the density curve suggest that the IQ scores are normally distributed, since the distribution is bell-shaped and symmetric (i.e. not skewed).
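To make the y-axis of the density histogram concrete, the sketch below recomputes the bar heights by hand and checks that the bar areas add up to 1. It recreates `IQ` with the same seed; `density_by_hand` and `bar_area` are illustrative names, and `breaks = 10` is only a suggestion to `hist()`, so the bin edges it actually chose are read from `h$breaks`.

```r
set.seed(1)
IQ = rnorm(n = 100, mean = 100, sd = 15)

# compute the histogram without drawing it
h = hist(IQ, breaks = 10, plot = FALSE)

bin_width = diff(h$breaks)

# density = proportion of observations in the bin / bin width
density_by_hand = (h$counts / length(IQ)) / bin_width

# bar area = density * width = proportion of observations in the bin
bar_area = h$density * bin_width

print(sum(bar_area)) # total area under the histogram
```

Because the heights are divided by the bin width, they depend on the binning and are not probabilities themselves; only the areas are.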

Another graphical way to assess normality is to look at the normal Q-Q plot:

    # normal Q-Q plot
    qqnorm(IQ, col = '#5398BE', pch = 16)
    qqline(IQ)

The points are reasonably close to the diagonal line and they do not show any non-linear pattern, also suggesting that the IQ scores are normally distributed, with a mean and standard deviation of:

    mean(IQ) # outputs: 101.63
    sd(IQ) # outputs: 13.47
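The Q-Q plot and these two numbers are connected: for approximately normal data, a straight line through the Q-Q points has an intercept near the mean and a slope near the standard deviation. The sketch below rebuilds the Q-Q coordinates by hand and fits such a line. It recreates `IQ` with the same seed, and the names `theoretical`, `observed` and `fit` are illustrative. The least-squares fit is an assumption of this sketch, not what `qqline()` does; `qqline()` draws its line through the 1st and 3rd quartiles.

```r
set.seed(1)
IQ = rnorm(n = 100, mean = 100, sd = 15)

n = length(IQ)

# theoretical quantiles of a standard normal at evenly spaced probabilities
theoretical = qnorm(ppoints(n))

# sample quantiles are simply the sorted data
observed = sort(IQ)

# qqnorm() uses the same coordinates (plot.it = FALSE skips the drawing)
qq = qqnorm(IQ, plot.it = FALSE)

# least-squares line through the Q-Q points (an assumption of this sketch)
fit = lm(observed ~ theoretical)

print(coef(fit))                  # intercept and slope of the fitted line
print(cor(observed, theoretical)) # close to 1 when the points lie on a line
```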