An Example of Using Marginal and Conditional Distributions

The conditional distribution of a variable (for example, heights) is its distribution given a particular value of another variable (for example, gender). Plotting the conditional distribution of heights for each gender is a way of visualizing the relationship between the two variables.

The marginal distribution of heights is the distribution of heights for everybody, regardless of gender. Plotting the marginal distribution of heights shows how likely each value of the variable is across the whole sample.

If heights is a numerical variable

The density plot below represents the marginal and conditional distributions of heights:

  • The dashed curve is the marginal distribution of heights — the distribution of everybody’s heights regardless of their gender.
  • The pink and blue curves are the conditional distributions:
    • The pink curve is the distribution of heights given that the gender is female.
    • The blue curve is the distribution of heights given that the gender is male.
conditional distribution plots of height

The conditional distribution of heights for males is shifted to the right, which means that, on average, males are taller than females. The marginal distribution is bimodal, which reflects the fact that it is a mixture of the two conditional distributions.
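
To make the link between the three curves concrete, here is a minimal sketch that works with theoretical normal densities instead of the simulated sample. It assumes the same parameters as the simulation in the R code section (means of 63 and 69 inches, standard deviations of 2.5 and 3) and equal shares of females and males. The names f.female, f.male and f.all are made up for this illustration.

```r
# assumed parameters, matching the simulation settings used in this article
p.female = 0.5 # 100 of the 200 simulated participants are female
p.male = 1 - p.female

# conditional densities of height given gender
f.female = function(h) dnorm(h, mean = 63, sd = 2.5)
f.male = function(h) dnorm(h, mean = 69, sd = 3)

# marginal density: the conditional densities weighted by the share of each gender
f.all = function(h) p.female * f.female(h) + p.male * f.male(h)

# like any density, the marginal should integrate to (approximately) 1
# finite bounds are used because they comfortably cover both groups
total.area = integrate(f.all, lower = 40, upper = 95)$value
print(total.area)

# draw the three theoretical curves
curve(f.all, from = 50, to = 85, lty = 2, lwd = 2, ylim = c(0, 0.2),
      xlab = 'Height (inch)', ylab = 'Density')
curve(f.female, add = TRUE, col = '#E75A7C', lwd = 2)
curve(f.male, add = TRUE, col = '#5398BE', lwd = 2)
```

With unequal group sizes, only p.female would change, and the marginal curve would lean toward the larger group.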

If heights is a categorical variable

Now consider the case where heights is a binary categorical variable that can take on 2 values: “Tall” or “Normal/Short”.

The following bar plot represents the marginal distribution of heights (regardless of gender):

marginal distribution of a categorical variable: heights

The bar plot shows that 13% of all participants are tall, and 87% are either normal or short.

The following stacked bar plot represents the distribution of heights conditional on gender:

distribution of the categorical variable, heights, conditional on gender

Given that the person is female, the probability of being tall is only 3%; given that the person is male, this probability is 23%.

The conditional distribution of a categorical variable can also be represented in a table:

               Female   Male
Normal/Short   97%      77%
Tall           3%       23%
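
The marginal percentages (13% tall, 87% normal or short) can be recovered from this conditional table by weighting each column by the share of that gender. The sketch below uses the percentages reported above and assumes equal shares of females and males, as in the simulated data (100 each). The object names cond, p.gender and marginal are illustrative.

```r
# conditional distribution of heights given gender (values from the table)
cond = matrix(c(0.97, 0.03,  # Female column
                0.77, 0.23), # Male column
              nrow = 2,
              dimnames = list(c('Normal/Short', 'Tall'), c('Female', 'Male')))

# share of each gender among participants (assumed: 100 females, 100 males)
p.gender = c(Female = 0.5, Male = 0.5)

# marginal distribution = conditional distributions weighted by the gender shares
marginal = drop(cond %*% p.gender)
print(marginal) # Normal/Short 0.87, Tall 0.13
```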

R code

Here’s the R code that generated these plots:

## marginal and conditional distributions of a numerical variable
#################################################################

set.seed(1)
# sample female heights from a normal distribution
# with mean = 63 and std = 2.5
female.heights = rnorm(n = 100, mean = 63, sd = 2.5)

# sample male heights from a normal distribution
# with mean = 69 and std = 3
male.heights = rnorm(n = 100, mean = 69, sd = 3)

# combine female and male data
all.heights = c(male.heights, female.heights)


plot(density(all.heights), # plot density of all heights
     lwd = 2, # line thickness
     ylim = c(0, 0.2), # y-axis limits
     lty = 2, # dashed line
     main = '', # remove title
     xlab = 'Height (inch)') # x-axis label

polygon(density(male.heights), # plot density of male heights
        col = rgb(0.325, 0.596, 0.745, alpha = 0.5), # fill color
        border = '#5398BE', # line color
        lwd = 2) # line thickness

polygon(density(female.heights), # plot density of female heights
        col = rgb(0.906, 0.353, 0.486, alpha = 0.5), # fill color
        border = '#E75A7C', # line color
        lwd = 2) # line width

legend(x = 75, y = 0.2, # coordinates of the legend
       legend = c("females", "males", "everyone"),
       col = c("#E75A7C", "#5398BE", "black"),
       lty = c(1, 1, 2),
       lwd = 2)



## marginal and conditional distributions of a categorical variable
###################################################################

dat = data.frame(gender = c(rep('Female', 100), rep('Male', 100)),
                heights = c(female.heights, male.heights))

# males at or above 71 inches are tall
dat$isTall[dat$gender == 'Male' & dat$heights >= 71] = 'Tall'
# males below 71 inches are normal/short
dat$isTall[dat$gender == 'Male' & dat$heights < 71] = 'Normal/Short'

# females at or above 67 inches are tall
dat$isTall[dat$gender == 'Female' & dat$heights >= 67] = 'Tall'
# females below 67 inches are normal/short
dat$isTall[dat$gender == 'Female' & dat$heights < 67] = 'Normal/Short'


# marginal distribution of isTall
library(scales) # to show percentages on the y-axis
x = barplot(prop.table(table(dat$isTall)),
            col = c('#86DEB7', '#5398BE'),
            legend.text = TRUE,
            yaxt = 'n', # remove y-axis
            ylab = 'Percent of participants',
            ylim = c(0, 1),
            main = 'Marginal distribution of heights')
# creating a y-axis with percentages (ticks cover the full 0-100% range)
yticks = seq(0, 1, by = 0.05)
axis(2, at = yticks, labels = percent(yticks))
# showing the values of each category
y = prop.table(table(dat$isTall))
text(x[1], y[1]/2, labels = paste0(round(y[1]*100, 1), '%')) # placing Normal/Short values
text(x[2], y[2]/2, labels = paste0(round(y[2]*100, 1), '%')) # placing Tall values



# distribution of isTall conditional on gender
library(scales) # to show percentages on the y-axis
# margin = 2 gives proportions within each gender (each column sums to 1)
x = barplot(prop.table(table(dat$isTall, dat$gender), 2),
            col = c('#86DEB7', '#5398BE'),
            legend.text = TRUE,
            yaxt = 'n', # remove y-axis
            ylab = 'Percent of participants',
            ylim = c(0, 1),
            main = 'Conditional distribution of heights')
# creating a y-axis with percentages
yticks = seq(0, 1, by = 0.05)
axis(2, at = yticks, labels = percent(yticks))
# showing the values of each category
y = prop.table(table(dat$isTall, dat$gender), 2)
text(x, y[1,]/2, labels = paste0(round(y[1,]*100, 1), '%')) # placing Normal/Short values
text(x, y[1,] + y[2,]/2, labels = paste0(round(y[2,]*100, 1), '%')) # placing Tall values

For a tutorial on how to use the functions table() and prop.table(), see: How to Describe/Summarize Categorical Data in R (Example).
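
As a quick illustration of how table() and prop.table() produce marginal and conditional distributions, here is a small sketch on a hypothetical toy data set (4 females and 4 males, invented for this example and unrelated to the simulated heights). The key point is the margin argument: leaving it out divides by the grand total, while margin = 2 divides each column by its own total.

```r
# hypothetical toy data: 4 females (1 tall) and 4 males (2 tall)
toy = data.frame(gender = rep(c('Female', 'Male'), each = 4),
                 isTall = c('Tall', 'Normal/Short', 'Normal/Short', 'Normal/Short',
                            'Tall', 'Tall', 'Normal/Short', 'Normal/Short'))

# counts of each height category within each gender
counts = table(toy$isTall, toy$gender)

# marginal distribution of isTall: proportions over everybody
marginal = prop.table(table(toy$isTall))
print(marginal) # Normal/Short 0.625, Tall 0.375

# conditional distribution of isTall given gender: each column sums to 1
conditional = prop.table(counts, 2)
print(conditional) # Female: 0.75 / 0.25, Male: 0.50 / 0.50
```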

Further reading