Factoextra R Package: Easy Multivariate Data Analyses and Elegant Visualization

This article has been updated, you are now consulting an old release of this article!


Description

The R package factoextra provides easy-to-use functions to extract and visualize the output of PCA (Principal Component Analysis), CA (Correspondence Analysis) and MCA (Multiple Correspondence Analysis) functions from several packages: PCA, CA, MCA [FactoMineR]; prcomp and princomp [stats]; dudi.pca, dudi.coa, dudi.acm [ade4]; ca [ca]; corresp [MASS]. The ggplot2 plotting system is used.

  • Principal Component Analysis (PCA) is used to summarize the information contained in continuous (i.e., quantitative) multivariate data by reducing the dimensionality of the data without losing important information.

  • Correspondence Analysis (CA) is an extension of Principal Component Analysis suited to analysing a large contingency table formed by two qualitative (i.e., categorical) variables.

  • Multiple Correspondence Analysis (MCA) is an adaptation of CA to a data table containing more than two categorical variables.

multivariate analysis - factoextra

For each R package mentioned in the figure above, only the functions that are compatible with factoextra are shown.

As mentioned above, there are different solutions to perform PCA, CA and MCA in R (FactoMineR, ade4, stats, ca, MASS). The results are presented differently depending on the package used.

The package factoextra has flexible and easy-to-use methods to quickly extract and visualize the results of the analysis from the above packages. The ggplot2 plotting system is used for the data visualization.
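
To make the difference in output layout concrete, here is a small base R sketch that runs the same PCA with prcomp() and princomp() from the stats package and compares the names of the result components. The object names (res.prcomp, res.princomp) are chosen for illustration only.

```r
# Same analysis on the iris measurements, two functions from the stats package
data(iris)
X <- iris[, -5]

res.prcomp   <- prcomp(X, scale. = TRUE)   # PCA on standardized variables
res.princomp <- princomp(X, cor = TRUE)    # PCA on the correlation matrix

# The result objects do not use the same component names
names(res.prcomp)
names(res.princomp)

# Coordinates of individuals: $x for prcomp, $scores for princomp
head(res.prcomp$x, 3)
head(res.princomp$scores, 3)

# Loadings: $rotation for prcomp, $loadings for princomp
res.prcomp$rotation
unclass(res.princomp$loadings)
```

Any code written against one of these objects has to be adapted for the other, which is the kind of difference a common extraction layer hides.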

But wasn’t this problem solved already?

The answer to this question is no. One previous solution is the ggbiplot package, which provides only a function to draw a biplot of individuals and variables from PCA output.

Why should I use factoextra?

  1. factoextra can handle the results of PCA, CA and MCA from several packages, to extract and visualize the most important information contained in your data.

  2. After PCA, CA or MCA, the most important row/column variables can be highlighted using:
  • their cos2 values: the quality of their representation on the factor map
  • their contributions to the definition of the principal dimensions

If you want to do this, there is no other package that does it: use factoextra, it’s simple.

  3. PCA and MCA are sometimes used for prediction problems: we can predict the coordinates of new supplementary variables (quantitative and qualitative) and supplementary individuals using the information provided by a previously performed PCA. This can be done easily using FactoMineR, and it is also described, step by step, using the built-in R function prcomp().

If you want to make predictions with PCA and visualize the position of the supplementary variables/individuals on the factor map using ggplot2, then factoextra can help you. It’s quick: write less and do more…

  4. If you use ade4 and FactoMineR (the most used R packages for factor analyses) and you want to easily make a beautiful ggplot2 visualization, then use factoextra: it’s flexible, and it has methods for these packages and more.

  5. Several functions from different packages are available in R for performing PCA, CA or MCA. However, the components of the output vary from package to package. No matter which package is used, factoextra can give you a human-understandable output.
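
The prediction of supplementary individuals mentioned in the list above can be sketched with base R alone. The split of iris into "active" and "supplementary" rows below is an arbitrary choice made for illustration; predict() is the standard method for prcomp objects.

```r
data(iris)

# Illustrative split: the last 10 rows are treated as supplementary individuals
active  <- iris[1:140, -5]
ind.sup <- iris[141:150, -5]

# PCA computed on the active individuals only
res.pca <- prcomp(active, scale. = TRUE)

# predict() centers and scales the new rows with the means and standard
# deviations of the active data, then projects them onto the principal axes
ind.sup.coord <- predict(res.pca, newdata = ind.sup)
head(ind.sup.coord, 3)

# The same computation written out by hand
by.hand <- scale(ind.sup, center = res.pca$center, scale = res.pca$scale) %*%
  res.pca$rotation
all.equal(unname(ind.sup.coord), unname(by.hand))
```

The predicted coordinates can then be added to a factor map like any other set of points.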

Install and load factoextra

The package devtools is required for the installation, as factoextra is hosted on GitHub.

# install.packages("devtools")
library("devtools")
install_github("kassambara/factoextra")

Load factoextra:

library("factoextra")

Main functions in factoextra package

To read more about a given function, click on the corresponding link in the tables below.

Extract data from the output of PCA, CA and MCA

Functions | Description
get_eig, get_eigenvalue | Extract the eigenvalues/variances of dimensions.
get_pca, get_pca_ind, get_pca_var | Extract all the results (coordinates, squared cosine, contributions) for the active individuals/variables from Principal Component Analysis (PCA) outputs.
get_ca, get_ca_col, get_ca_row | Extract all the results (coordinates, squared cosine, contributions) for the active column/row variables from Correspondence Analysis (CA) outputs.
get_mca, get_mca_ind, get_mca_var | Extract results from Multiple Correspondence Analysis (MCA) outputs.
facto_summarize | Subset and summarize the output of factor analyses.

Visualization of PCA, CA and MCA

Functions | Description
fviz_eig (or fviz_eigenvalue) | Extract and visualize the eigenvalues/variances of dimensions.
fviz_pca_var, fviz_pca_ind, fviz_pca_biplot (or fviz_pca) | Graph of individuals/variables from the output of Principal Component Analysis (PCA).
fviz_ca_row, fviz_ca_col, fviz_ca_biplot (or fviz_ca) | Graph of column/row variables from the output of Correspondence Analysis (CA).
fviz_mca_var, fviz_mca_ind, fviz_mca_biplot (or fviz_mca) | Graph of individuals/variables from the output of Multiple Correspondence Analysis (MCA).
fviz_cos2 | Visualize the quality of the representation of the row/column variables from the results of PCA, CA, MCA functions.
fviz_contrib | Visualize the contributions of row/column elements from the results of PCA, CA, MCA functions.

Principal component analysis

A principal component analysis (PCA) is performed using the built-in R function prcomp() and the iris data:

data(iris)
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
# The variable Species (index = 5) is removed
# before the PCA; scale. = TRUE standardizes the variables
res.pca <- prcomp(iris[, -5], scale. = TRUE)
# Extract eigenvalues/variances
get_eig(res.pca)
      eigenvalue variance.percent cumulative.variance.percent
Dim.1 2.91849782       72.9624454                    72.96245
Dim.2 0.91403047       22.8507618                    95.81321
Dim.3 0.14675688        3.6689219                    99.48213
Dim.4 0.02071484        0.5178709                   100.00000
# Visualize eigenvalues/variances
fviz_eig(res.pca)

# Add labels, change theme
fviz_screeplot(res.pca,  addlabels=TRUE, hjust = -0.3) + 
    theme_minimal()
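
For reference, the eigenvalue table shown above can be rebuilt from the prcomp object with base R. This is a minimal sketch of the arithmetic, not necessarily how get_eig() is implemented internally; the data frame name eig is illustrative.

```r
data(iris)
res.pca <- prcomp(iris[, -5], scale. = TRUE)

# Eigenvalues are the variances of the principal components
eigenvalue <- res.pca$sdev^2

# Share of the total variance carried by each dimension, and running total
variance.percent <- 100 * eigenvalue / sum(eigenvalue)
cumulative.variance.percent <- cumsum(variance.percent)

eig <- data.frame(eigenvalue, variance.percent, cumulative.variance.percent,
                  row.names = paste0("Dim.", seq_along(eigenvalue)))
round(eig, 2)
```

With standardized variables the eigenvalues sum to the number of variables (4 here), so the first dimension alone carries about 73% of the variance.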

# Variables
#++++++++++++++++++++++
# Extract the results for variables
var <- get_pca_var(res.pca)
var
Principal Component Analysis Results for variables
 ===================================================
  Name       Description                                    
1 "$coord"   "Coordinates for the variables"                
2 "$cor"     "Correlations between variables and dimensions"
3 "$cos2"    "Cos2 for the variables"                       
4 "$contrib" "contributions of the variables"               
# Coordinates of variables
head(var$coord)
                  Dim.1       Dim.2       Dim.3       Dim.4
Sepal.Length  0.8901688 -0.36082989  0.27565767  0.03760602
Sepal.Width  -0.4601427 -0.88271627 -0.09361987 -0.01777631
Petal.Length  0.9915552 -0.02341519 -0.05444699 -0.11534978
Petal.Width   0.9649790 -0.06399985 -0.24298265  0.07535950
# Contribution of variables
head(var$contrib)
                 Dim.1       Dim.2     Dim.3     Dim.4
Sepal.Length 27.150969 14.24440565 51.777574  6.827052
Sepal.Width   7.254804 85.24748749  5.972245  1.525463
Petal.Length 33.687936  0.05998389  2.019990 64.232089
Petal.Width  31.906291  0.44812296 40.230191 27.415396
# Graph of variables
fviz_pca_var(res.pca)

# Change color and theme
fviz_pca_var(res.pca, col.var="steelblue")+
  theme_minimal()

# Control variable colors using their contributions
# Use gradient color
fviz_pca_var(res.pca, col.var="contrib")+
scale_color_gradient2(low="white", mid="blue", 
      high="red", midpoint = 96) + theme_minimal()

# Variable contributions on axis 1
fviz_contrib(res.pca, choice="var", axes = 1 )

# Variable contributions on axes 1 + 2
fviz_contrib(res.pca, choice="var", axes = 1:2)
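
The coordinates, cos2 and contributions of variables reported above can be reproduced by hand from a prcomp object. The sketch below uses base R only and illustrative object names (var.coord, var.cos2, var.contrib); it shows the usual definitions, under the assumption that the variables were standardized as in this example. The sign of a whole column of coordinates may differ between platforms.

```r
data(iris)
res.pca <- prcomp(iris[, -5], scale. = TRUE)

# Coordinates of variables: loadings multiplied by the component standard
# deviations (for standardized data, also the variable/dimension correlations)
var.coord <- sweep(res.pca$rotation, 2, res.pca$sdev, "*")
colnames(var.coord) <- paste0("Dim.", seq_len(ncol(var.coord)))

# cos2: squared coordinates, i.e. quality of representation on each dimension
var.cos2 <- var.coord^2

# Contributions (in %): share of each variable in the cos2 total of a dimension
var.contrib <- 100 * sweep(var.cos2, 2, colSums(var.cos2), "/")
round(var.contrib, 2)
```

Each column of var.contrib sums to 100, so a variable with a contribution above 100 / 4 = 25% on a dimension contributes more than an average variable would.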

# Individuals
# +++++++++++++++++++
# Extract the results for individuals
ind <- get_pca_ind(res.pca)
ind
Principal Component Analysis Results for individuals
 ===================================================
  Name       Description                       
1 "$coord"   "Coordinates for the individuals" 
2 "$cos2"    "Cos2 for the individuals"        
3 "$contrib" "contributions of the individuals"
# Coordinates of individuals
head(ind$coord)
      Dim.1      Dim.2       Dim.3        Dim.4
1 -2.257141 -0.4784238  0.12727962  0.024087508
2 -2.074013  0.6718827  0.23382552  0.102662845
3 -2.356335  0.3407664 -0.04405390  0.028282305
4 -2.291707  0.5953999 -0.09098530 -0.065735340
5 -2.381863 -0.6446757 -0.01568565 -0.035802870
6 -2.068701 -1.4842053 -0.02687825  0.006586116
# Graph of individuals
fviz_pca_ind(res.pca)

# Use text only
fviz_pca_ind(res.pca, geom="text")

# Control automatically the color of individuals using the cos2
# cos2 = the quality of the individuals on the factor map
# Use points only
# Use gradient color
fviz_pca_ind(res.pca, col.ind="cos2", geom = "point") + 
   scale_color_gradient2(low="blue", mid="white",
      high="red", midpoint=0.6)+ theme_minimal()
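
The cos2 of an individual measures how much of its squared distance to the origin is captured by a given dimension. A minimal base R sketch follows, assuming all components are kept so that the squared distance is the sum of squared coordinates. Summing cos2 over dimensions 1 and 2 as the quality on the first factor map is the usual convention, stated here as an assumption about what is plotted; the object names are illustrative.

```r
data(iris)
res.pca <- prcomp(iris[, -5], scale. = TRUE)

# Coordinates of individuals on the principal components
ind.coord <- res.pca$x

# Squared distance of each individual to the origin (all components kept)
d2 <- rowSums(ind.coord^2)

# cos2 per dimension: each row of squared coordinates divided by its d2
ind.cos2 <- ind.coord^2 / d2

# Quality of representation on the first factor map (dimensions 1 and 2)
quality.12 <- rowSums(ind.cos2[, 1:2])
head(round(quality.12, 3))
```

Values close to 1 indicate individuals that are well represented on the map; values close to 0 indicate points whose position on the map should be interpreted with caution.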

# Color by groups
p <- fviz_pca_ind(res.pca, geom = "point",
    habillage=iris$Species, addEllipses=TRUE,
    ellipse.level= 0.95)+ theme_minimal()
print(p)

# Change color using RColorBrewer palettes
p + scale_color_brewer(palette ="Set1")

# Change color manually
p + scale_color_manual(values=c("#999999", "#E69F00", "#56B4E9"))

# Biplot of individuals and variables
# ++++++++++++++++++++++++++
fviz_pca_biplot(res.pca)

# Only variables are labelled
fviz_pca_biplot(res.pca, label="var", habillage=iris$Species,
      addEllipses=TRUE, ellipse.level=0.95) +
  theme_minimal()

Correspondence Analysis

The function CA() in the FactoMineR package is used:

# Install and load FactoMineR to compute CA
# install.packages("FactoMineR")
library("FactoMineR")
data("housetasks")
res.ca <- CA(housetasks, graph = FALSE)
# Result for column variables
get_ca_col(res.ca)
Correspondence Analysis - Results for columns
 ===================================================
  Name       Description                   
1 "$coord"   "Coordinates for the columns" 
2 "$cos2"    "Cos2 for the columns"        
3 "$contrib" "contributions of the columns"
4 "$inertia" "Inertia of the columns"      
# Result for row variables
get_ca_row(res.ca)
Correspondence Analysis - Results for rows
 ===================================================
  Name       Description                
1 "$coord"   "Coordinates for the rows" 
2 "$cos2"    "Cos2 for the rows"        
3 "$contrib" "contributions of the rows"
4 "$inertia" "Inertia of the rows"      
# Visualize row contributions on axis 1
fviz_contrib(res.ca, choice ="row", axes = 1)

# Visualize column contributions on axis 1
fviz_contrib(res.ca, choice ="col", axes = 1)

# Graph of row variables
fviz_ca_row(res.ca)

# Select and visualize rows with cos2 >= 0.9
fviz_ca_row(res.ca, select.row = list(cos2 = 0.9))

# Graph of column points
fviz_ca_col(res.ca)

# Symmetric biplot of rows and columns
fviz_ca_biplot(res.ca)

# Asymmetric biplot, use arrows for columns
fviz_ca_biplot(res.ca, map ="rowprincipal",
               arrow = c(FALSE, TRUE)) +
  theme_minimal()
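
To show what a correspondence analysis computes, here is a base R sketch of CA through the singular value decomposition of the standardized residuals of a contingency table. The table N is invented for illustration (it is not the housetasks data), and the object names are hypothetical; this is the textbook computation, not necessarily the code path used by CA().

```r
# Hypothetical contingency table (counts invented for illustration)
N <- matrix(c(30, 10,  5,
              12, 25,  8,
               4,  9, 27),
            nrow = 3, byrow = TRUE,
            dimnames = list(task = c("A", "B", "C"),
                            who  = c("X", "Y", "Z")))

P <- N / sum(N)           # correspondence matrix (relative frequencies)
r <- rowSums(P)           # row masses
c.mass <- colSums(P)      # column masses

# Standardized residuals from independence, then SVD
S <- diag(1 / sqrt(r)) %*% (P - r %o% c.mass) %*% diag(1 / sqrt(c.mass))
dec <- svd(S)

eig <- dec$d^2            # principal inertias (eigenvalues)

# Principal coordinates of rows and columns
row.coord <- diag(1 / sqrt(r)) %*% dec$u %*% diag(dec$d)
col.coord <- diag(1 / sqrt(c.mass)) %*% dec$v %*% diag(dec$d)

# Total inertia equals the chi-square statistic divided by the grand total
total.inertia <- sum(eig)
chi2 <- chisq.test(N, correct = FALSE)$statistic
all.equal(total.inertia, unname(chi2) / sum(N))
```

The last singular value is numerically zero: a table with 3 rows and 3 columns has at most 2 non-trivial dimensions.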

Multiple Correspondence Analysis

The function MCA() in the FactoMineR package is used:

library(FactoMineR)
data(poison)
res.mca <- MCA(poison, quanti.sup = 1:2,
              quali.sup = 3:4, graph=FALSE)
# Extract the results for variable categories
get_mca_var(res.mca)
Multiple Correspondence Analysis Results for variables
 ===================================================
  Name       Description                  
1 "$coord"   "Coordinates for categories" 
2 "$cos2"    "Cos2 for categories"        
3 "$contrib" "contributions of categories"
# Extract the results for individuals
get_mca_ind(res.mca)
Multiple Correspondence Analysis Results for individuals
 ===================================================
  Name       Description                       
1 "$coord"   "Coordinates for the individuals" 
2 "$cos2"    "Cos2 for the individuals"        
3 "$contrib" "contributions of the individuals"
# Visualize variable category contributions on axis 1
fviz_contrib(res.mca, choice ="var", axes = 1)

# Visualize individual contributions on axis 1
# select the top 20
fviz_contrib(res.mca, choice ="ind", axes = 1, top = 20)

# Graph of individuals
# ++++++++++++++++++++++++++++
fviz_mca_ind(res.mca, col.ind = "blue")+
   theme_minimal()

# Color individuals by groups
grp <- as.factor(poison[, "Vomiting"])
fviz_mca_ind(res.mca, label="none", habillage=grp)

# Add ellipses
p <- fviz_mca_ind(res.mca, label="none", habillage=grp,
             addEllipses=TRUE, ellipse.level=0.95)
print(p)

# Change group colors using RColorBrewer color palettes
p + scale_color_brewer(palette="Paired") +
     theme_minimal()

p + scale_color_brewer(palette="Set1") +
     theme_minimal()

# Graph of variable categories
# ++++++++++++++++++++++++++++
fviz_mca_var(res.mca)

# Select the top 10 contributing variable categories
fviz_mca_var(res.mca, select.var = list(contrib = 10))

# Select by names
fviz_mca_var(res.mca,
 select.var= list(name = c("Courg_n", "Fever_y", "Fever_n")))

# biplot
# ++++++++++++++++++++++++++
fviz_mca_biplot(res.mca)

# Select the top 30 contributing individuals
# And the top 10 variables
fviz_mca_biplot(res.mca,
               select.ind = list(contrib = 30),
               select.var = list(contrib = 10))
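
MCA can be understood as a correspondence analysis applied to the indicator matrix of the categorical variables (one 0/1 column per category). The base R sketch below illustrates that idea on a small invented data frame; the data, variable names and object names are hypothetical, and this is not the code used by MCA().

```r
# Hypothetical survey with three categorical variables (values invented)
df <- data.frame(
  fever  = c("y", "y", "n", "n", "y", "n", "y", "n"),
  nausea = c("y", "n", "n", "y", "y", "n", "n", "n"),
  food   = c("fish", "fish", "cheese", "egg", "fish", "cheese", "egg", "egg"),
  stringsAsFactors = TRUE
)

# Indicator (complete disjunctive) matrix: one 0/1 column per category
Z <- do.call(cbind, lapply(names(df), function(v) {
  f <- df[[v]]
  z <- outer(as.character(f), levels(f), "==") * 1
  colnames(z) <- paste(v, levels(f), sep = "_")
  z
}))

# Correspondence analysis of the indicator matrix
P <- Z / sum(Z)
r <- rowSums(P)
c.mass <- colSums(P)
S <- diag(1 / sqrt(r)) %*% (P - r %o% c.mass) %*% diag(1 / sqrt(c.mass))
eig <- svd(S)$d^2

# Total inertia depends only on the numbers of categories and variables:
# (number of categories / number of variables) - 1
c(total.inertia = sum(eig), expected = ncol(Z) / ncol(df) - 1)
```

Because the total inertia is fixed by the coding of the data, the percentages of variance in MCA are typically lower than in PCA and should be read with that in mind.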

Infos

This analysis has been performed using R software (ver. 3.1.2) and factoextra (ver. 1.0.2)

