How to Describe/Summarize Numerical Data in R (Example)

Let’s start by creating IQ, a normally distributed numerical variable, with a mean of 100 and a standard deviation of 15, that represents the IQ scores of a sample of 100 participants:


# create IQ by sampling data from a normal distribution
IQ = rnorm(n = 100, mean = 100, sd = 15)
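Because rnorm() draws random numbers, each run produces a slightly different sample, so your numbers will not match the outputs shown in this article exactly. Here is a minimal sketch of making a run reproducible, assuming an arbitrary seed of 123 (this is not the seed behind the outputs in this article):

```r
# fix the random seed so the same sample is drawn on every run
# (123 is an arbitrary, assumed value, not the one used for this article)
set.seed(123)
IQ = rnorm(n = 100, mean = 100, sd = 15)

# number of simulated participants
length(IQ)
```

Rerunning both lines together gives the same 100 values every time.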

Next, we will (1) summarize this variable and (2) describe its distribution.

1. Summary statistics

# get the mean and important percentiles
summary(IQ)
# outputs:
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   66.78   92.59  101.71  101.63  110.37  136.02 

Reminder: The 1st quartile is the 25th percentile, meaning that 25% of the data fall below this value. Similarly, the 3rd quartile is the 75th percentile, meaning that 75% of the data fall below this value.
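To check this reminder on a sample, the sketch below computes the two quartiles with quantile() and then the share of values falling below each one. The seed is an arbitrary assumption, so the quartiles will differ from the output above:

```r
set.seed(123) # arbitrary seed, assumed for reproducibility
IQ = rnorm(n = 100, mean = 100, sd = 15)

# 25th and 75th percentiles
q = quantile(IQ, probs = c(0.25, 0.75))
q

# proportion of the sample that falls below each quartile
mean(IQ < q[1]) # share below the 1st quartile
mean(IQ < q[2]) # share below the 3rd quartile
```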

We can visualize these numbers by looking at the boxplot:

# box plot
boxplot(IQ, col = '#D3E5EE')

# show the 5 numbers that describe the boxplot
# (boxplot.stats gives the whisker ends, which exclude outliers, unlike fivenum)
stats = boxplot.stats(IQ)$stats
text(x = 1.33,
     y = stats,
     labels = paste(round(stats, 2),
     c('(Minimum, excluding outliers)',
       '(25th percentile)',
       '(Median)',
       '(75th percentile)',
       '(Maximum, excluding outliers)')))


Boxplot of IQ scores

Here’s a summary of what we can learn from this boxplot:

Range: The range is the difference between the maximum and minimum values: 136.02 - 66.78 = 69.24. According to our sample, IQ scores can differ by up to 69.24 points between one person and another.
Interquartile range (IQR): The IQR is the difference between the 75th percentile and the 25th percentile. Using the hinges labeled on the boxplot (which can differ slightly from the quartiles that summary() reports): 110.4 - 92.34 = 18.06. So 50% of the data are spread over 26% of the range (IQR / range = 18.06 / 69.24 = 0.26). This means that the data are not uniformly distributed; instead, many data points are clustered around the median.
Median: The median is the 50th percentile. Since our sample contains an even number of values, it has 2 middle values and the median is their average: (101.12 + 102.30) / 2 = 101.71. A typical person in our sample has an IQ of 101.7: half the sample has a lower IQ score and half has a higher one.
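The same three quantities can be computed directly instead of being read off the plot. The sketch below assumes an arbitrary seed, so its values will differ from the ones quoted above. It takes the IQR from the boxplot hinges returned by fivenum(); the built-in IQR() function uses quantile() instead and may give a slightly different result:

```r
set.seed(123) # arbitrary seed, assumed for reproducibility
IQ = rnorm(n = 100, mean = 100, sd = 15)

# range: maximum minus minimum
iq_range = diff(range(IQ))

# interquartile range from the boxplot hinges (2nd and 4th values of fivenum)
hinges = fivenum(IQ)[c(2, 4)]
iq_iqr = diff(hinges)

# share of the range covered by the middle 50% of the data
iq_iqr / iq_range

# median of an even-sized sample: average of the two middle sorted values
sorted = sort(IQ)
n = length(sorted)
manual_median = (sorted[n / 2] + sorted[n / 2 + 1]) / 2
manual_median
median(IQ) # same value
```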

2. Describing the distribution

A frequency histogram divides the numerical variable into bins and shows, on the y-axis, the number of observations that fall into each bin:

# frequency histogram
min(IQ) # 66.7795
max(IQ) # 136.0243
# so create a custom x-axis, starting from 65 up to 140
xvalues = seq(65, 140, by = 5)

hist(IQ, breaks = xvalues, col = '#D3E5EE', border = '#98C2D7',
     labels = TRUE, # show the count above each bar
     xaxt = 'n') # remove x-axis text
axis(1, at = xvalues) # add the custom x-axis


Frequency histogram of IQ scores

This histogram shows, for example, that 17 values fall between 105 and 110, and only 3 values are above 125.
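The counts printed above the bars can also be obtained without reading them off the plot. In the sketch below, the seed is an arbitrary assumption, and the bin edges are derived from the data rather than fixed at 65 to 140, because hist() stops with an error if the breaks do not span the whole sample:

```r
set.seed(123) # arbitrary seed, assumed for reproducibility
IQ = rnorm(n = 100, mean = 100, sd = 15)

# bin edges in steps of 5, wide enough to cover this sample
xvalues = seq(floor(min(IQ) / 5) * 5, ceiling(max(IQ) / 5) * 5, by = 5)

# assign each score to a bin; like hist(), intervals are closed on the right
bins = cut(IQ, breaks = xvalues, include.lowest = TRUE)
counts = table(bins)
counts

# the same counts as computed by hist(), without drawing the plot
h = hist(IQ, breaks = xvalues, plot = FALSE)
h$counts
```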

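Now, instead of showing the frequency (i.e. the number of occurrences) of values that fall into each bin, we can plot a density histogram. Its y-axis shows the density, scaled so that the area of each bar (height × bin width) equals the proportion of observations in that bin, and the areas of all bars sum to 1: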

# density histogram
hist(IQ, breaks = 10, freq = FALSE, col = '#D3E5EE', border = '#98C2D7')

# probability density curve
lines(density(IQ), lwd = 2, col = '#3E7FA3')

# show the mean ± 1 standard deviation
abline(v = mean(IQ), lwd = 2)
legend(x = mean(IQ) - 4, y = 0.033, # x and y coordinates of the legend, picked manually by trial and error
       'Mean', adj = 0.3)

abline(v = mean(IQ) + sd(IQ), lwd = 2, lty = 2)
legend(x = mean(IQ) + sd(IQ) - 7, y = 0.033, 'Mean + 1 SD', adj = 0.2)

abline(v = mean(IQ) - sd(IQ), lwd = 2, lty = 2)
legend(x = mean(IQ) - sd(IQ) - 7, y = 0.033, 'Mean - 1 SD', adj = 0.2)


Density histogram showing the mean and the variability in the distribution of IQ scores
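In a density histogram, the proportion of the sample in a bin is the area of its bar, not its height. A small sketch of that relationship, assuming an arbitrary seed:

```r
set.seed(123) # arbitrary seed, assumed for reproducibility
IQ = rnorm(n = 100, mean = 100, sd = 15)

# compute the histogram without drawing it
h = hist(IQ, breaks = 10, plot = FALSE)

# proportion of the sample in each bin = bar height (density) x bin width
bin_width = diff(h$breaks)
proportions = h$density * bin_width

# the bars of a density histogram have a total area of 1
sum(proportions)

# the areas match the relative frequencies of the bins
all.equal(proportions, h$counts / length(IQ))
```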

The density histogram and the density curve suggest that the IQ scores are normally distributed, since the distribution is bell-shaped and symmetric (i.e. not skewed).
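Two quick numerical checks can back up this visual impression. For a symmetric distribution the mean and the median should be close, and for a normal distribution roughly 68% of values lie within 1 standard deviation of the mean. The sketch below assumes an arbitrary seed, and a sample of 100 will only approximate the theoretical proportion:

```r
set.seed(123) # arbitrary seed, assumed for reproducibility
IQ = rnorm(n = 100, mean = 100, sd = 15)

# for a symmetric distribution the mean and the median are close
mean(IQ) - median(IQ)

# proportion of scores within 1 standard deviation of the mean
within_1sd = abs(IQ - mean(IQ)) <= sd(IQ)
mean(within_1sd)

# proportion expected under a normal distribution (about 0.68)
expected = pnorm(1) - pnorm(-1)
expected
```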

Another graphical way to assess normality is to look at the normal Q-Q plot:

# normal Q-Q plot
qqnorm(IQ, col = '#5398BE', pch = 16)
# add the reference line (drawn through the 1st and 3rd quartiles)
qqline(IQ)


Normal Q-Q plot of IQ scores

The points are reasonably close to the diagonal line and do not show any non-linear pattern, which also suggests that the IQ scores are normally distributed, with a mean and standard deviation of:

mean(IQ) # outputs: 101.63
sd(IQ) # outputs: 13.47
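To see what the normal Q-Q plot is actually plotting, it can be rebuilt by hand: the sorted observations go on the y-axis and the matching quantiles of a standard normal distribution go on the x-axis. The sketch below assumes an arbitrary seed; the variable names theoretical and observed are illustrative:

```r
set.seed(123) # arbitrary seed, assumed for reproducibility
IQ = rnorm(n = 100, mean = 100, sd = 15)

n = length(IQ)

# theoretical quantiles of a standard normal at n evenly spaced probabilities
theoretical = qnorm(ppoints(n))

# sample quantiles are simply the sorted observations
observed = sort(IQ)

# same points as qqnorm(IQ)
plot(theoretical, observed, col = '#5398BE', pch = 16,
     xlab = 'Theoretical Quantiles', ylab = 'Sample Quantiles')

# a correlation close to 1 means the points fall close to a straight line
cor(theoretical, observed)
```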

Further reading