How to Describe/Summarize Numerical Data in R (Example)

Let’s start by creating IQ, a normally distributed numerical variable, with a mean of 100 and a standard deviation of 15, that represents the IQ scores of a sample of 100 participants:

set.seed(1)

# create IQ by sampling data from a normal distribution
IQ = rnorm(n = 100, mean = 100, sd = 15)

Next, we will (1) summarize this variable and (2) describe its distribution.

1. Summary statistics

# get the mean and important percentiles
summary(IQ)
# outputs:
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   66.78   92.59  101.71  101.63  110.37  136.02

Reminder: The 1st quartile is the 25th percentile, meaning that 25% of the data fall below this value. Likewise, the 3rd quartile is the 75th percentile, meaning that 75% of the data fall below this value.
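The reminder above can be checked directly on the data. The following is a minimal sketch, assuming IQ is regenerated with the same seed as above; the names q and below are illustrative and not part of the original example:

```r
set.seed(1)
IQ = rnorm(n = 100, mean = 100, sd = 15)

# 25th and 75th percentiles (the quartiles reported by summary())
q = quantile(IQ, probs = c(0.25, 0.75))

# proportion of observations strictly below each quartile
below = c(mean(IQ < q[1]), mean(IQ < q[2]))
below
```

With 100 distinct values, the two proportions should come out as 0.25 and 0.75.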

We can visualize these numbers by looking at the boxplot:

# box plot
boxplot(IQ, col = '#D3E5EE')

# show the 5 numbers that describe the boxplot
# (boxplot.stats() returns the whisker ends, which exclude outliers, unlike fivenum())
box_stats = boxplot.stats(IQ)$stats
text(x = 1.33,
     y = box_stats,
     labels = paste(round(box_stats, 2),
     c('(Minimum, excluding outliers)',
       '(25th percentile)',
       '(Median)',
       '(75th percentile)',
       '(Maximum, excluding outliers)')))

Output:

Here’s a summary of what we can learn from this boxplot:

Quantity	Definition/Calculation	Value	Interpretation
Range	The range is the difference between the maximum and minimum values: 136.02 – 66.78 = 69.24	69.24	According to our sample, IQ scores can differ by up to 69.24 points between one person and another.
Interquartile range (IQR)	IQR is the difference between the 75th percentile and the 25th percentile: 110.4 – 92.34 = 18.06	18.06	50% of the data are spread over 26% of the range (IQR / range = 18.06 / 69.24 = 0.26). This means that the data are not uniformly distributed; instead, many data points are clustered around the median.
Median	The median is the 50th percentile. Since our sample contains an even number of values, there are 2 middle values and the median is their average: (101.12 + 102.30) / 2 = 101.71	101.71	A typical person in our sample has an IQ of about 101.7: half the sample has a lower IQ score and half has a higher one.
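The quantities in the table can also be computed in R rather than by hand. The sketch below assumes the same seed as above, and the variable names are illustrative. Note that a boxplot is drawn from the hinges returned by fivenum(), which for an even sample size can differ slightly from the quartiles computed by quantile(), summary() and IQR(). This is presumably why the 25th and 75th percentiles in the table do not exactly match the output of summary(IQ).

```r
set.seed(1)
IQ = rnorm(n = 100, mean = 100, sd = 15)

# range: difference between the maximum and the minimum
iq_range = diff(range(IQ))

# interquartile range based on the boxplot hinges (2nd and 4th values of fivenum())
iqr_hinges = diff(fivenum(IQ)[c(2, 4)])

# interquartile range based on quantile(), as used by summary() and IQR()
iqr_quantile = IQR(IQ)

# median by hand: average of the 2 middle values of the sorted sample
sorted_IQ = sort(IQ)
median_by_hand = mean(sorted_IQ[c(50, 51)])

c(range = iq_range, IQR_hinges = iqr_hinges,
  IQR_quantile = iqr_quantile, median = median_by_hand)
```

Either definition of the IQR leads to the same interpretation here; the difference only matters when comparing numbers across functions.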

2. Describing the distribution

A frequency histogram divides the numerical variable into bins and shows, on the y-axis, the number of observations that fall into each bin:

# frequency histogram
min(IQ) # 66.7795
max(IQ) # 136.0243
# so create a custom x-axis, starting from 65 up to 140
xvalues = seq(65, 140, by = 5)

hist(IQ, breaks = xvalues, col = '#D3E5EE', border = '#98C2D7',
     labels = TRUE, # show the count above each bar
     xaxt = 'n') # remove x-axis text
axis(1, at = xvalues) # add the custom x-axis

Output:

This histogram shows, for example, that 17 values fall between 105 and 110, and only 3 values are above 125.
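The counts shown above the bars can also be read off without drawing anything. The following is a minimal sketch, assuming the same seed and the same breaks as above; bin_labels and counts are illustrative names:

```r
set.seed(1)
IQ = rnorm(n = 100, mean = 100, sd = 15)
xvalues = seq(65, 140, by = 5)

# compute the histogram without drawing it
h = hist(IQ, breaks = xvalues, plot = FALSE)

# one count per bin, labelled with the bin limits
# (by default each bin includes its upper limit; the first bin also includes its lower limit)
lower = xvalues[-length(xvalues)]
upper = xvalues[-1]
bin_labels = paste0('(', lower, ', ', upper, ']')
counts = setNames(h$counts, bin_labels)
counts

# number of values above 125
sum(IQ > 125)
```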

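Now, instead of showing the frequency (i.e. the number of occurrences) of values that fall into each bin, we can plot a density histogram. Here the y-axis shows the density: the area of each bar (its height times the bin width) is the proportion of observations that fall into that bin, so the areas of all bars sum to 1: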

# density histogram
hist(IQ, breaks = 10, freq = FALSE, col = '#D3E5EE', border = '#98C2D7')

# probability density curve
lines(density(IQ), lwd = 2, col = '#3E7FA3')

# show the mean ± 1 standard deviation
abline(v = mean(IQ), lwd = 2)
legend(x = mean(IQ) - 4, y = 0.033, # x and y coordinates of the legend, picked manually by trial and error
       'Mean', adj = 0.3)

abline(v = mean(IQ) + sd(IQ), lwd = 2, lty = 2)
legend(x = mean(IQ) + sd(IQ) - 7, y = 0.033, 'Mean + 1 SD', adj = 0.2)

abline(v = mean(IQ) - sd(IQ), lwd = 2, lty = 2)
legend(x = mean(IQ) - sd(IQ) - 7, y = 0.033, 'Mean - 1 SD', adj = 0.2)

Output:

Figure: density histogram showing the mean and the variability in the distribution of IQ scores

The density histogram and the density curve suggest that the IQ scores are normally distributed, since the distribution is bell-shaped and symmetric (i.e. not skewed).
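To make the y-axis of the density histogram concrete, the sketch below recovers the proportion of observations in each bin from the bar heights, and also computes the share of the sample that lies within 1 standard deviation of the mean (about 68% would be expected for a normal distribution). It assumes the same seed as above; the variable names are illustrative:

```r
set.seed(1)
IQ = rnorm(n = 100, mean = 100, sd = 15)

# compute the histogram without drawing it
h = hist(IQ, breaks = 10, plot = FALSE)

# proportion of observations in each bin = bar height (density) x bin width
bin_widths = diff(h$breaks)
bin_proportions = h$density * bin_widths

# the bar areas add up to 1
sum(bin_proportions)

# share of the sample within mean ± 1 SD
within_1sd = mean(abs(IQ - mean(IQ)) < sd(IQ))
within_1sd
```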

Another graphical way to assess normality is to look at the normal Q-Q plot:

# normal Q-Q plot
qqnorm(IQ, col = '#5398BE', pch = 16)
qqline(IQ)

Output:

The points are reasonably close to the diagonal line and they do not show any non-linear pattern, also suggesting that the IQ scores are normally distributed, with a mean and standard deviation of:

mean(IQ) # outputs: 101.63
sd(IQ) # outputs: 13.47
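For readers who want to see what the normal Q-Q plot is made of, here is a minimal sketch that rebuilds its points by hand, assuming the same seed as above. The names theoretical and sample_q are illustrative, and the correlation at the end is only an informal summary of how straight the points are, not a formal normality test:

```r
set.seed(1)
IQ = rnorm(n = 100, mean = 100, sd = 15)
n = length(IQ)

# x-axis: theoretical quantiles of a standard normal distribution
theoretical = qnorm(ppoints(n))

# y-axis: sample quantiles, i.e. the sorted observations
sample_q = sort(IQ)

# the same points as computed by qqnorm(), here without drawing the plot
qq = qqnorm(IQ, plot.it = FALSE)
all.equal(sort(qq$x), theoretical)
all.equal(sort(qq$y), sample_q)

# values close to 1 mean the points lie near a straight line
cor(theoretical, sample_q)
```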
