Visualizing Multivariate Categorical Data

To visualize a small data set containing multiple categorical (or qualitative) variables, you can create either a bar plot, a balloon plot or a mosaic plot.

For large multivariate categorical data sets, you need specialized statistical techniques dedicated to categorical data analysis, such as simple and multiple correspondence analysis. These methods make it possible to analyze and visualize the associations between a large number of qualitative variables.

Here, you’ll see examples of graphs, in the R programming language, for visualizing the frequency distribution of categorical variables contained in small contingency tables. We also provide the R code for computing a simple correspondence analysis.

Prerequisites

Load required R packages and set the default theme:

library(ggplot2)
library(ggpubr)
theme_set(theme_pubr())
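
The later sections also rely on vcd, FactoMineR and factoextra, and a library() call fails if its package is not installed. Here is a minimal sketch for checking what is available before you start. The names pkgs and missing_pkgs are illustrative, and the sketch only reports missing packages rather than installing them:

```r
# Packages used across this article
pkgs <- c("ggplot2", "ggpubr", "vcd", "FactoMineR", "factoextra")

# requireNamespace() returns TRUE/FALSE without attaching the package
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```

Anything reported as missing can then be installed with install.packages().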

Bar plots of contingency tables

Demo data set: HairEyeColor (distribution of hair and eye color and sex in 592 statistics students)

  • Prepare and inspect the data:
data("HairEyeColor")
df <- as.data.frame(HairEyeColor)
head(df)
##    Hair   Eye  Sex Freq
## 1 Black Brown Male   32
## 2 Brown Brown Male   53
## 3   Red Brown Male   10
## 4 Blond Brown Male    3
## 5 Black  Blue Male   11
## 6 Brown  Blue Male   50
  • Create the bar graph:
    • Hair color on x-axis
    • Change bar fill by Eye color
    • Split the graph into multiple panels by Sex
ggplot(df, aes(x = Hair, y = Freq))+
  # geom_col() plots the Freq values as given (same as geom_bar(stat = "identity"))
  geom_col(
    aes(fill = Eye), color = "white",
    position = position_dodge(0.9)
    )+
  facet_wrap(~Sex) + 
  fill_palette("jco")
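
Before reading the panels, it can help to confirm that the long-format data frame still matches the original table, and to keep in mind that the two Sex panels are based on different group sizes. The following is a small base R sketch; the column name Prop and the object names are assumptions made for illustration:

```r
df <- as.data.frame(HairEyeColor)

# One row per Hair x Eye x Sex combination: 4 * 4 * 2 = 32 rows
n_cells <- nrow(df)

# The counts should add up to the 592 students in the data set
total <- sum(df$Freq)

# Number of students in each Sex panel
sex_totals <- tapply(df$Freq, df$Sex, sum)

# Within-sex proportions, so that bars are comparable across panels
df$Prop <- df$Freq / ave(df$Freq, df$Sex, FUN = sum)

# Collapse over Sex to get the two-way Hair x Eye table
hair_eye <- xtabs(Freq ~ Hair + Eye, data = df)
```

Using y = Prop instead of y = Freq in the bar plot above would show the share of each hair and eye color combination within each sex rather than the raw counts.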

Balloon plot

A balloon plot is an alternative to a bar plot for visualizing a large categorical data set. We’ll use the function ggballoonplot() [in ggpubr], which draws a graphical matrix of a contingency table, where each cell contains a dot whose size reflects the relative magnitude of the corresponding component.

Demo data set: housetasks (a contingency table containing the frequency with which 13 household tasks are carried out by the wife, the husband, alternately or jointly within a couple).

housetasks <- read.delim(
  system.file("demo-data/housetasks.txt", package = "ggpubr"),
  row.names = 1
  )
head(housetasks, 4)
##            Wife Alternating Husband Jointly
## Laundry     156          14       2       4
## Main_meal   124          20       5       4
## Dinner       77          11       7      13
## Breakfeast   82          36      15       7
  • Create a simple balloon plot of a contingency table. Change the fill color by the values in the cells.
ggballoonplot(housetasks, fill = "value")+
  scale_fill_viridis_c(option = "C")

  • Visualize a grouped frequency table. Demo data set: HairEyeColor. Create a multi-panel plot by Sex:
df <- as.data.frame(HairEyeColor)
ggballoonplot(df, x = "Hair", y = "Eye", size = "Freq",
              fill = "Freq", facet.by = "Sex",
              ggtheme = theme_bw()) +
  scale_fill_viridis_c(option = "C")
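
To make the idea behind a balloon plot concrete, here is a rough equivalent built with plain ggplot2: one point per cell, with point area and fill mapped to the frequency. This is a sketch of the concept, not a reproduction of what ggballoonplot() does internally; the max_size value and the outline color are arbitrary choices:

```r
library(ggplot2)

df <- as.data.frame(HairEyeColor)

p <- ggplot(df, aes(x = Hair, y = Eye)) +
  # shape 21 is a filled circle, so both size and fill can carry Freq
  geom_point(aes(size = Freq, fill = Freq), shape = 21, color = "grey30") +
  # scale_size_area() makes the point area proportional to the value
  scale_size_area(max_size = 12) +
  scale_fill_viridis_c(option = "C") +
  facet_wrap(~Sex) +
  theme_bw()

p
```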

Mosaic plot

A mosaic plot is basically an area-proportional visualization of observed frequencies, composed of tiles (corresponding to the cells) created by recursive vertical and horizontal splits of a rectangle. The area of each tile is proportional to the corresponding cell entry, given the dimensions of previous splits.

A mosaic plot can be created using either the function mosaicplot() [in graphics] or the function mosaic() [in vcd package]. Read more at: Visualizing Multi-way Contingency Tables with vcd.

Example of mosaic plot:

library(vcd)
mosaic(HairEyeColor, shade = TRUE, legend = TRUE) 
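
For comparison, the base graphics function mosaicplot() mentioned above needs no extra package. Here is a minimal sketch, first on the two-way Hair by Eye table (collapsed over Sex) and then on the full three-way table; the plot titles and the object name hair_eye are illustrative:

```r
# Two-way table: sum the counts over Sex
hair_eye <- margin.table(HairEyeColor, margin = c(1, 2))

# shade = TRUE colors the tiles by the residuals of a log-linear model
mosaicplot(hair_eye, shade = TRUE, las = 1,
           main = "Hair and eye color")

# The same call works on the full Hair x Eye x Sex table
mosaicplot(HairEyeColor, shade = TRUE, las = 1,
           main = "Hair, eye color and sex")
```

Starting with the two-way table is often easier to read, since each additional variable adds another level of splits to the tiles.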

Correspondence analysis

Correspondence analysis can be used to summarize and visualize the information contained in a large contingency table formed by two categorical variables.

Required packages: FactoMineR for the analysis and factoextra for the visualization

library(FactoMineR)
library(factoextra)
res.ca <- CA(housetasks, graph = FALSE)
fviz_ca_biplot(res.ca, repel = TRUE)

The biplot suggests that:

  • Tasks such as dinner, breakfast and laundry are done more often by the wife
  • Driving and repairs are done more frequently by the husband
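
To give an idea of what a simple correspondence analysis computes, the sketch below works through the standard textbook formulation in base R: a singular value decomposition of the standardized residuals of a two-way table. It is not FactoMineR's own code, and it uses the Hair by Eye table from HairEyeColor (an assumption made so that the example runs without extra packages) rather than housetasks. All object names are illustrative:

```r
# Hair x Eye counts, summed over Sex (plain matrix, not a "table" object)
N <- unclass(margin.table(HairEyeColor, margin = c(1, 2)))

P     <- N / sum(N)   # correspondence matrix (relative frequencies)
rmass <- rowSums(P)   # row masses
cmass <- colSums(P)   # column masses

# Standardized residuals from the independence model
S <- diag(1 / sqrt(rmass)) %*% (P - outer(rmass, cmass)) %*% diag(1 / sqrt(cmass))

sv  <- svd(S)
eig <- sv$d^2         # principal inertias (eigenvalues)

# Total inertia equals the Pearson chi-square statistic divided by n
total_inertia <- sum(eig)
chi2 <- unname(chisq.test(N)$statistic)

# Principal coordinates of the rows and columns (the points in a CA plot)
row_coord <- sweep(sv$u, 2, sv$d, "*") / sqrt(rmass)
col_coord <- sweep(sv$v, 2, sv$d, "*") / sqrt(cmass)
rownames(row_coord) <- rownames(N)
rownames(col_coord) <- colnames(N)

# Percentage of inertia explained by each dimension
round(100 * eig / total_inertia, 1)
```

The first two columns of row_coord and col_coord are the coordinates that a two-dimensional CA biplot displays, up to the sign of each axis. Rows and columns that lie in the same direction from the origin tend to occur together more often than independence would predict.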

Read more at: Correspondence analysis in R