# How to Describe/Summarize Numerical Data in R (Example)

Let’s start by creating IQ, a normally distributed numerical variable, with a mean of 100 and a standard deviation of 15, that represents the IQ scores of a sample of 100 participants:

```r
set.seed(1)

# create IQ by sampling data from a normal distribution
IQ = rnorm(n = 100, mean = 100, sd = 15)
```

Next, we will (1) summarize this variable and (2) describe its distribution.

## 1. Summary statistics

```r
# get the mean and important percentiles
summary(IQ)
# outputs:
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#   66.78   92.59  101.71  101.63  110.37  136.02
```

Reminder: The 1st quartile is the 25th percentile, meaning that 25% of the data fall below this value. Likewise, the 3rd quartile is the 75th percentile, meaning that 75% of the data fall below it.
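This reading of the quartiles can be checked directly. The following is a minimal sketch, assuming the same seed and sample as above; `q` and `share_below` are names introduced here for illustration only.

```r
set.seed(1)
IQ = rnorm(n = 100, mean = 100, sd = 15)

# 1st and 3rd quartiles (R's default quantile definition, type 7)
q = quantile(IQ, probs = c(0.25, 0.75))

# share of observations strictly below each quartile
share_below = c(mean(IQ < q[["25%"]]), mean(IQ < q[["75%"]]))
share_below
```

With 100 distinct values, each quartile falls between two neighboring observations, so the shares should come out at 0.25 and 0.75.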

We can visualize these numbers by looking at the boxplot:

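```r
# box plot
boxplot(IQ, col = '#D3E5EE')

# the 5 numbers that the boxplot draws: whisker ends, hinges and median
# (boxplot.stats() excludes outliers from the whisker ends; fivenum() does not)
box_values = boxplot.stats(IQ)$stats

# show the 5 numbers that describe the boxplot
text(x = 1.33,
     y = box_values,
     labels = paste(round(box_values, 2),
                    c('(Minimum, excluding outliers)',
                      '(25th percentile)',
                      '(Median)',
                      '(75th percentile)',
                      '(Maximum, excluding outliers)')))
```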

Output: [boxplot of IQ with the five summary values labeled beside the box]

Here’s a summary of what we can learn from this boxplot:
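The numbers behind the boxplot can also be extracted directly. This is a minimal sketch using base R's `boxplot.stats()`, assuming the same seed and sample as above; `bs` is a name introduced here for illustration.

```r
set.seed(1)
IQ = rnorm(n = 100, mean = 100, sd = 15)

bs = boxplot.stats(IQ)

# lower whisker, lower hinge, median, upper hinge, upper whisker
bs$stats

# height of the box: the spread of the middle half of the data
diff(bs$stats[c(2, 4)])

# observations beyond the whiskers, drawn as separate points in the plot
bs$out
```

The median locates the center of the scores, the box height measures the spread of the middle 50%, and `bs$out` lists any observations lying more than 1.5 box heights beyond the hinges.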

## 2. Describing the distribution

A frequency histogram divides the numerical variable into bins and shows, on the y-axis, the number of observations that fall into each bin:

```r
# frequency histogram
min(IQ) # 66.7795
max(IQ) # 136.0243
# so create a custom x-axis, starting from 65 up to 140
xvalues = seq(65, 140, by = 5)

hist(IQ, breaks = xvalues, col = '#D3E5EE', border = '#98C2D7',
     labels = TRUE, # show the count above each bar
     xaxt = 'n')    # suppress the default x-axis
axis(1, at = xvalues) # add the custom x-axis
```

Output: [frequency histogram of IQ in bins of width 5, with the count shown above each bar]

This histogram shows, for example, that 17 values fall between 105 and 110, and only 3 values are above 125.
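The counts shown above the bars can also be obtained as a table, without drawing anything. The following is a minimal sketch, assuming the same seed, sample and breaks as above; `h` and `bin_counts` are names introduced here for illustration.

```r
set.seed(1)
IQ = rnorm(n = 100, mean = 100, sd = 15)
xvalues = seq(65, 140, by = 5)

# compute the histogram without plotting it
h = hist(IQ, breaks = xvalues, plot = FALSE)

# one row per bin: its lower and upper limits and the number of observations in it
bin_counts = data.frame(lower = head(h$breaks, -1),
                        upper = tail(h$breaks, -1),
                        count = h$counts)
bin_counts
```

By default, `hist()` treats each bin as closed on the right, so a value of exactly 110 would be counted in the 105 to 110 bin.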

Now, instead of showing the frequency (i.e. the number of occurrences) of values that fall into each bin, we can plot a density histogram, whose y-axis is scaled so that the area of each bar equals the proportion of observations in that bin (the bar areas therefore sum to 1):

```r
# density histogram
hist(IQ, breaks = 10, freq = FALSE, col = '#D3E5EE', border = '#98C2D7')

# probability density curve
lines(density(IQ), lwd = 2, col = '#3E7FA3')

# show the mean ± 1 standard deviation
abline(v = mean(IQ), lwd = 2)
legend(x = mean(IQ) - 4, y = 0.033, # x and y coordinates of the legend, picked manually by trial and error
       'Mean', adj = 0.3)

abline(v = mean(IQ) + sd(IQ), lwd = 2, lty = 2)
legend(x = mean(IQ) + sd(IQ) - 7, y = 0.033, 'Mean + 1 SD', adj = 0.2)

abline(v = mean(IQ) - sd(IQ), lwd = 2, lty = 2)
legend(x = mean(IQ) - sd(IQ) - 7, y = 0.033, 'Mean - 1 SD', adj = 0.2)
```

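Output: [density histogram of IQ with the density curve and vertical lines at the mean and at the mean ± 1 SD]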
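To see how the y-axis of a density histogram is scaled, the bar heights can be recomputed by hand. This is a minimal sketch, assuming the same seed and sample as above; `h`, `bin_width` and `by_hand` are names introduced here for illustration.

```r
set.seed(1)
IQ = rnorm(n = 100, mean = 100, sd = 15)

# compute the histogram without plotting it
h = hist(IQ, breaks = 10, plot = FALSE)
bin_width = diff(h$breaks)

# bar height = proportion of observations in the bin / bin width
by_hand = h$counts / (length(IQ) * bin_width)
all.equal(by_hand, h$density)

# bar areas are proportions, so together they add up to 1
sum(h$density * bin_width)
```

Because the heights are divided by the bin width, a bar's height is not itself a probability; its area is.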

The density histogram and the density curve suggest that the IQ scores are approximately normally distributed, since the distribution is bell-shaped and symmetric (i.e. not skewed).

Another graphical way to assess normality is to look at the normal Q-Q plot:

```r
# normal Q-Q plot
qqnorm(IQ, col = '#5398BE', pch = 16)
qqline(IQ)
```

Output: [normal Q-Q plot of IQ with its reference line]
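What the Q-Q plot compares can be reproduced by hand: the sorted scores are plotted against the quantiles a standard normal distribution would give at the same ranks. The following is a minimal sketch, assuming the same seed and sample as above; `theoretical` and `observed` are names introduced here for illustration.

```r
set.seed(1)
IQ = rnorm(n = 100, mean = 100, sd = 15)

# theoretical quantiles of a standard normal at n evenly spaced probabilities
theoretical = qnorm(ppoints(length(IQ)))

# sample quantiles: the observed scores, sorted
observed = sort(IQ)

# the closer this correlation is to 1, the straighter the Q-Q plot
cor(theoretical, observed)
```

Plotting `observed` against `theoretical` should give the same points as `qqnorm(IQ)`.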

The points lie reasonably close to the diagonal line and show no non-linear pattern, which also suggests that the IQ scores are normally distributed, with a mean and standard deviation of:

```r
mean(IQ) # outputs: 101.63
sd(IQ) # outputs: 13.47
```
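These two estimates allow one more rough check of normality: under a normal distribution, about 68% of values lie within one standard deviation of the mean. This is a minimal sketch, assuming the same seed and sample as above; `within_1sd` and `expected_1sd` are names introduced here for illustration.

```r
set.seed(1)
IQ = rnorm(n = 100, mean = 100, sd = 15)

# observed share of scores within 1 SD of the sample mean
within_1sd = mean(abs(IQ - mean(IQ)) <= sd(IQ))

# share expected under a normal distribution (about 0.683)
expected_1sd = pnorm(1) - pnorm(-1)

c(observed = within_1sd, expected = expected_1sd)
```

With only 100 observations, the observed share will not match the expected one exactly; a large gap would be a reason to look at the distribution more closely.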