How to Describe/Summarize Categorical Data in R (Example)

Let’s start by creating our own data set with 2 categorical variables, gender and smoking:

set.seed(10)

# create 2 categorical variables with 80 observations each
gender = sample(c('Female', 'Male'), 80, replace = TRUE)
smoking = sample(c('Past smoker', 'Current smoker', 'Non-smoker'), 80, replace = TRUE)

# pack these variables into a data frame
dat = data.frame(gender = as.factor(gender),
                 smoking = as.factor(smoking))

# what our data look like
str(dat)
# outputs:
#'data.frame':	80 obs. of  2 variables:
# $ gender : Factor w/ 2 levels "Female","Male": 1 1 2 2 2 1 2 2 1 1 ...
# $ smoking: Factor w/ 3 levels "Current smoker",..: 2 1 3 2 1 1 1 2 2 2 ...

Next, we will create a frequency table and a bar plot to summarize these data one variable at a time. Then we will create a contingency table and a stacked bar plot to describe the relationship between the 2 variables.

1. Summarizing gender and smoking, one variable at a time

A frequency table shows the number of occurrences of each category of a variable:

table(dat$gender)
# outputs:
# Female   Male 
#     40     40 

table(dat$smoking)
# outputs:
# Current smoker     Non-smoker    Past smoker 
#             26             30             24 

Interpretation: Our sample consists of 40 females and 40 males. Of these 80 participants, 26 are current smokers, 24 are past smokers, and 30 are non-smokers.

We can use bar plots to visualize these 2 frequency tables:

par(mfrow=c(1,2)) # show the following plots side by side

barplot(table(dat$gender), ylab = 'Number of participants')
barplot(table(dat$smoking), ylab = 'Number of participants')

Output:

bar plots showing the frequency on the y-axis of the 2 variables

Returning to tables, instead of showing the number of occurrences of each category, we can show the proportion of each category:

prop.table(table(dat$gender))
# outputs:
# Female   Male 
#    0.5    0.5 

prop.table(table(dat$smoking))
# outputs:
# Current smoker     Non-smoker    Past smoker 
#          0.325          0.375          0.300 

Interpretation: 50% of the participants are females and 50% are males. Of all participants, 32.5% are current smokers, 30% are past smokers, and 37.5% are non-smokers.
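Since the interpretation is stated in percentages, it can help to print the table that way too. The following is a minimal sketch, assuming we rebuild the smoking frequency table by hand from the counts shown above instead of re-running sample() (the names smoking_counts and smoking_pct are illustrative and not part of the original code):

```r
# counts copied from the output of table(dat$smoking) above
smoking_counts = as.table(c('Current smoker' = 26,
                            'Non-smoker' = 30,
                            'Past smoker' = 24))

# proportions converted to percentages, rounded to 1 decimal place
smoking_pct = round(100 * prop.table(smoking_counts), 1)
smoking_pct
```

With your own data, you would replace smoking_counts with table(dat$smoking).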

Again, we can create bar plots to visualize these tables:

par(mfrow=c(1,2)) # show the following plots side by side

barplot(prop.table(table(dat$gender)), ylab = 'Proportion of participants')
barplot(prop.table(table(dat$smoking)), ylab = 'Proportion of participants')

Output:

bar plots showing the proportion on the y-axis of the 2 variables

2. Describing the relationship between gender and smoking

A contingency table (a.k.a. a 2-way frequency table or a frequency table with 2 variables) describes the relationship between 2 categorical variables. Each cell in this table corresponds to the number of occurrences of a particular combination of values of the 2 variables.

table(dat$gender, dat$smoking)
# outputs:
#          Current smoker Non-smoker Past smoker
#   Female             12         17          11
#   Male               14         13          13

Interpretation: For instance, in our sample, 12 participants are females and current smokers.
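Row and column totals are often useful next to the raw counts. Here is a minimal sketch, assuming we rebuild the contingency table by hand from the counts printed above so that the result does not depend on the random number generator (the names counts and counts_with_totals are illustrative):

```r
# contingency table rebuilt from the output of table(dat$gender, dat$smoking);
# matrix() fills column by column, so each pair of values is (Female, Male)
counts = as.table(matrix(c(12, 14, 17, 13, 11, 13), nrow = 2,
                         dimnames = list(gender = c('Female', 'Male'),
                                         smoking = c('Current smoker', 'Non-smoker', 'Past smoker'))))

# add a 'Sum' row (column totals) and a 'Sum' column (row totals)
counts_with_totals = addmargins(counts)
counts_with_totals
```

The bottom-right cell of counts_with_totals is the total sample size (80), and the margins reproduce the two one-variable frequency tables from the first section.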

We can also show the proportion of individuals in each cell. There are 3 ways to do this:

1. Showing total proportions

Here, each cell represents the count of individuals in this category divided by the total sample size:

table1 = prop.table(table(dat$gender, dat$smoking))
table1
# outputs:
#          Current smoker Non-smoker Past smoker
#   Female         0.1500     0.2125      0.1375
#   Male           0.1750     0.1625      0.1625

# we can add margins to the table to make it clearer
addmargins(table1)
# outputs:
#          Current smoker Non-smoker Past smoker    Sum
#   Female         0.1500     0.2125      0.1375 0.5000
#   Male           0.1750     0.1625      0.1625 0.5000
#   Sum            0.3250     0.3750      0.3000 1.0000

(Notice that all the cells in the table sum up to 1)

Interpretation: The first number, 0.15, is the proportion of individuals in our sample who are both female and current smokers. More generally, female non-smokers form the largest group in our sample (21.25%), and female past smokers form the smallest (13.75%).

2. Showing row proportions

Here, each cell represents the count of individuals in this category divided by the row total:

table2 = prop.table(table(dat$gender, dat$smoking), 1)
table2
# outputs:
#          Current smoker Non-smoker Past smoker
#   Female          0.300      0.425       0.275
#   Male            0.350      0.325       0.325

addmargins(table2)
# outputs:
#          Current smoker Non-smoker Past smoker   Sum
#   Female          0.300      0.425       0.275 1.000
#   Male            0.350      0.325       0.325 1.000
#   Sum             0.650      0.750       0.600 2.000

(Notice that each row sums up to 1)

Interpretation: For instance, 0.3 is the proportion of females who currently smoke (this is different from the proportion of current smokers who are females). More generally, in our sample, the largest share of females are non-smokers (42.5%), while the largest share of males are current smokers (35%).
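With row proportions, only the row totals are meaningful: the bottom "Sum" row that addmargins() adds by default mixes proportions computed within different rows. A minimal sketch of one way to keep only the useful margin, assuming the contingency table is rebuilt by hand from the counts shown earlier (counts, row_props and row_props_with_sum are illustrative names):

```r
# contingency table rebuilt from the counts shown earlier
counts = as.table(matrix(c(12, 14, 17, 13, 11, 13), nrow = 2,
                         dimnames = list(gender = c('Female', 'Male'),
                                         smoking = c('Current smoker', 'Non-smoker', 'Past smoker'))))

# row proportions: each cell divided by its row total
row_props = prop.table(counts, 1)

# margin = 2 adds only a 'Sum' column, i.e. the total of each row
row_props_with_sum = addmargins(row_props, 2)
row_props_with_sum
```

Each entry of the added "Sum" column equals 1, which confirms that the proportions were computed within rows.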

3. Showing column proportions

Here, each cell represents the count of individuals in this category divided by the column total:

table3 = prop.table(table(dat$gender, dat$smoking), 2)
table3
# outputs:
#          Current smoker Non-smoker Past smoker
#   Female      0.4615385  0.5666667   0.4583333
#   Male        0.5384615  0.4333333   0.5416667

addmargins(table3)
# outputs:
#          Current smoker Non-smoker Past smoker       Sum
#   Female      0.4615385  0.5666667   0.4583333 1.4865385
#   Male        0.5384615  0.4333333   0.5416667 1.5134615
#   Sum         1.0000000  1.0000000   1.0000000 3.0000000

(Notice that each column sums up to 1)

Interpretation: For instance, 0.4615… is the proportion of current smokers who are females (this is different from the proportion of females who are current smokers). More generally, in our sample, most current and past smokers are males (53.85% and 54.17%, respectively) and most non-smokers are females (56.67%).
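The difference between row and column proportions comes down to which total goes in the denominator. The sketch below computes both by hand for the same cell, assuming the contingency table is rebuilt from the counts shown earlier (all object names are illustrative):

```r
# contingency table rebuilt from the counts shown earlier
counts = as.table(matrix(c(12, 14, 17, 13, 11, 13), nrow = 2,
                         dimnames = list(gender = c('Female', 'Male'),
                                         smoking = c('Current smoker', 'Non-smoker', 'Past smoker'))))

# same numerator (12 female current smokers), different denominators
p_current_given_female = counts['Female', 'Current smoker'] / sum(counts['Female', ])          # 12 / 40
p_female_given_current = counts['Female', 'Current smoker'] / sum(counts[, 'Current smoker'])  # 12 / 26

# for column proportions, margin = 1 adds only a 'Sum' row (the total of each column)
col_props_with_sum = addmargins(prop.table(counts, 2), 1)
col_props_with_sum
```

Here p_current_given_female matches the first cell of the row-proportion table (0.3) and p_female_given_current matches the first cell of the column-proportion table (about 0.4615).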

We can use a stacked bar plot to visualize a contingency table:

library(scales) # to show percentages on the y-axis

x = barplot(prop.table(table(dat$gender, dat$smoking)),
            col = c('#F2A6B8', '#B5D3E3'), # one color per gender
            legend = TRUE,
            ylim = c(0, 0.4),
            yaxt = 'n', # remove y-axis
            ylab = 'Percent of participants')

# creating a y-axis with percentages
yticks = seq(0, 0.4, by = 0.05)
axis(2, at = yticks, labels = percent(yticks))

# showing the values of each category
y = prop.table(table(dat$gender, dat$smoking))
text(x, y[1,]/2, labels = paste0(as.character(y[1,]*100), '%')) # placing female values
text(x, y[1,] + y[2,]/2, labels = paste0(as.character(y[2,]*100), '%')) # placing male values

Output:

stacked bar plot

Another plot that is related to the stacked bar plot is the mosaic plot:

mosaicplot(prop.table(table(dat$smoking, dat$gender)),
           col = c('#F2A6B8', '#B5D3E3'),
           main = '')

Output:

Mosaic plot

The width of each column of this mosaic plot is proportional to the share of the corresponding smoking category in the sample. For example, we can see that non-smokers are a bigger category than past smokers since their column has a wider base.

The horizontal “gender” splits in this plot show that non-smoker is the only category with more females than males.
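Both readings of the mosaic plot can be checked numerically. This is a minimal sketch, assuming the contingency table is rebuilt by hand from the counts shown earlier (counts, widths and more_females are illustrative names):

```r
# contingency table rebuilt from the counts shown earlier
counts = as.table(matrix(c(12, 14, 17, 13, 11, 13), nrow = 2,
                         dimnames = list(gender = c('Female', 'Male'),
                                         smoking = c('Current smoker', 'Non-smoker', 'Past smoker'))))

# relative column widths: the share of each smoking category in the sample
widths = prop.table(margin.table(counts, 2))
widths

# smoking categories in which females outnumber males
more_females = names(which(counts['Female', ] > counts['Male', ]))
more_females
```

The widths are 0.325, 0.375 and 0.3, so the non-smoker column is the widest, and more_females contains only "Non-smoker".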
