Factoextra R Package: Easy Multivariate Data Analyses and Elegant Visualization
Description
The R package factoextra provides easy-to-use functions to extract and visualize the output of PCA (Principal Component Analysis), CA (Correspondence Analysis) and MCA (Multiple Correspondence Analysis) functions from several packages: PCA, CA, MCA [FactoMineR]; prcomp and princomp [stats]; dudi.pca, dudi.coa, dudi.acm [ade4]; ca [ca]; corresp [MASS]. The ggplot2 plotting system is used.
Principal Component Analysis (PCA) is used to summarize the information contained in continuous (i.e., quantitative) multivariate data by reducing the dimensionality of the data without losing important information.
Correspondence Analysis (CA) is an extension of Principal Component Analysis suited to analyzing a large contingency table formed by two qualitative (categorical) variables.
Multiple Correspondence Analysis (MCA) is an adaptation of CA to a data table containing more than two categorical variables.
For each of the R packages mentioned above, only the functions that are compatible with factoextra are supported.
As mentioned above, there are several ways to perform PCA, CA and MCA in R (FactoMineR, ade4, stats, ca, MASS), and each package presents its results differently.
The factoextra package provides flexible, easy-to-use methods to quickly extract and visualize the results of the analysis from any of these packages. The ggplot2 plotting system is used for data visualization.
But wasn’t this problem solved already?
The answer is no. One earlier solution, the ggbiplot package, provides only a single function for drawing a biplot of individuals and variables from PCA output.
Why should I use factoextra?
factoextra can handle the results of PCA, CA and MCA from several packages, extracting and visualizing the most important information contained in your data.
- After PCA, CA or MCA, the most important row/column variables can be highlighted using:
- their cos2 values: information about the quality of their representation on the factor map
- their contributions to the definition of the principal dimensions
If you want to do this, no other package offers it; use factoextra, it's simple.
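As a base-R sketch of what these two quantities are in the PCA case (using the iris example that appears later in this article): for variables, cos2 is the squared coordinate on a dimension, and a variable's contribution to a dimension is its cos2 divided by that dimension's total cos2, times 100. This is intended to mirror the $cos2 and $contrib components of get_pca_var(); treat it as a sketch, not a definition of factoextra's internals.

```r
# Sketch: cos2 and contributions for PCA variables, base R only
res <- prcomp(iris[, -5], scale = TRUE)
# Variable coordinates: loadings scaled by the component standard deviations
coord <- res$rotation %*% diag(res$sdev)
cos2 <- coord^2                                       # quality of representation
contrib <- sweep(cos2, 2, colSums(cos2), "/") * 100   # % contribution per dimension
round(contrib[, 1:2], 2)
```

With a scaled PCA and all components kept, each variable's cos2 values sum to 1 across dimensions, and contributions sum to 100% within each dimension.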
- PCA and MCA are sometimes used for prediction problems: the coordinates of new supplementary variables (quantitative and qualitative) and supplementary individuals can be predicted using the information provided by a previously performed PCA. This can be done easily with FactoMineR, and it is also described, step by step, using the built-in R function prcomp().
If you want to make predictions with PCA and visualize the position of the supplementary variables/individuals on the factor map using ggplot2, then factoextra can help you. It's quick: write less, do more…
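As a sketch of the underlying arithmetic with prcomp (here new_ind is just a stand-in for genuinely new observations), a supplementary individual is projected by applying the centering/scaling learned from the active individuals and then the rotation:

```r
# Sketch: projecting supplementary individuals onto an existing PCA
res.pca <- prcomp(iris[, -5], scale = TRUE)
new_ind <- iris[1:3, -5]   # stand-in for new observations
# Center/scale with the training parameters, then rotate into PC space
coord_sup <- scale(new_ind,
                   center = res.pca$center,
                   scale  = res.pca$scale) %*% res.pca$rotation
# Base R's predict() method performs the same computation
same <- all.equal(unname(coord_sup),
                  unname(predict(res.pca, newdata = new_ind)))
```

These supplementary coordinates can then be overlaid on a factor map produced by fviz_pca_ind().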
If you use ade4 or FactoMineR (the most widely used R packages for factor analyses) and you want to easily produce a beautiful ggplot2 visualization, then use factoextra: it's flexible, and it has methods for these packages and more.
Several functions from different packages are available in R for performing PCA, CA or MCA. However, the components of the output vary from package to package. Whatever package you use, factoextra gives you a human-understandable output.
Install and load factoextra
The devtools package is required for the installation, as factoextra is hosted on GitHub.
# install.packages("devtools")
library("devtools")
install_github("kassambara/factoextra")
Load factoextra:
library("factoextra")
Main functions in factoextra package
The main functions are summarized in the tables below.
Extract data from the output of PCA, CA and MCA
Functions | Description |
---|---|
get_eig, get_eigenvalue | Extract and visualize the eigenvalues/variances of dimensions. |
get_pca, get_pca_ind, get_pca_var | Extract all the results (coordinates, squared cosine, contributions) for the active individuals/variables from Principal Component Analysis (PCA) outputs. |
get_ca, get_ca_col, get_ca_row | Extract all the results (coordinates, squared cosine, contributions) for the active column/row variables from Correspondence Analysis outputs. |
get_mca, get_mca_ind, get_mca_var | Extract results from Multiple Correspondence Analysis outputs |
facto_summarize | Subset and summarize the output of factor analyses |
Visualization of PCA, CA and MCA
Functions | Description |
---|---|
fviz_eig (or fviz_eigenvalue) | Visualize the eigenvalues/variances of dimensions (scree plot). |
fviz_pca_var, fviz_pca_ind, fviz_pca_biplot (or fviz_pca) | Graph of individuals/variables from the output of Principal Component Analysis (PCA). |
fviz_ca_row, fviz_ca_col, fviz_ca_biplot (or fviz_ca) | Graph of column/row variables from the output of Correspondence Analysis (CA). |
fviz_mca_var, fviz_mca_ind, fviz_mca_biplot (or fviz_mca) | Graph of individuals/variables from the output of Multiple Correspondence Analysis (MCA). |
fviz_cos2 | Visualize the quality of the representation of the row/column variable from the results of PCA, CA, MCA functions |
fviz_contrib | Visualize the contributions of row/column elements from the results of PCA, CA, MCA functions |
Principal component analysis
A principal component analysis (PCA) is performed using the built-in R function prcomp() and the iris data:
data(iris)
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
# The variable Species (index = 5) is removed
# before PCA analysis
res.pca <- prcomp(iris[, -5], scale = TRUE)
# Extract eigenvalues/variances
get_eig(res.pca)
eigenvalue variance.percent cumulative.variance.percent
Dim.1 2.91849782 72.9624454 72.96245
Dim.2 0.91403047 22.8507618 95.81321
Dim.3 0.14675688 3.6689219 99.48213
Dim.4 0.02071484 0.5178709 100.00000
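For reference, the table above can be reproduced from the prcomp object alone: the eigenvalues are the squared component standard deviations, and get_eig() simply standardizes this computation across the supported packages.

```r
# Sketch: rebuilding the eigenvalue table from a prcomp fit, base R only
res.pca <- prcomp(iris[, -5], scale = TRUE)   # as computed above
eig <- res.pca$sdev^2
variance.percent <- 100 * eig / sum(eig)
round(data.frame(eigenvalue = eig,
                 variance.percent = variance.percent,
                 cumulative.variance.percent = cumsum(variance.percent)), 3)
```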
# Visualize eigenvalues/variances
fviz_eig(res.pca)
# Add labels, change theme
fviz_screeplot(res.pca, addlabels=TRUE, hjust = -0.3) +
theme_minimal()
# Variables
#++++++++++++++++++++++
# Extract the results for variables
var <- get_pca_var(res.pca)
var
Principal Component Analysis Results for variables
===================================================
Name Description
1 "$coord" "Coordinates for the variables"
2 "$cor" "Correlations between variables and dimensions"
3 "$cos2" "Cos2 for the variables"
4 "$contrib" "contributions of the variables"
# Coordinates of variables
head(var$coord)
Dim.1 Dim.2 Dim.3 Dim.4
Sepal.Length 0.8901688 -0.36082989 0.27565767 0.03760602
Sepal.Width -0.4601427 -0.88271627 -0.09361987 -0.01777631
Petal.Length 0.9915552 -0.02341519 -0.05444699 -0.11534978
Petal.Width 0.9649790 -0.06399985 -0.24298265 0.07535950
# Contribution of variables
head(var$contrib)
Dim.1 Dim.2 Dim.3 Dim.4
Sepal.Length 27.150969 14.24440565 51.777574 6.827052
Sepal.Width 7.254804 85.24748749 5.972245 1.525463
Petal.Length 33.687936 0.05998389 2.019990 64.232089
Petal.Width 31.906291 0.44812296 40.230191 27.415396
# Graph of variables
fviz_pca_var(res.pca)
# Change color and theme
fviz_pca_var(res.pca, col.var="steelblue")+
theme_minimal()
# Control variable colors using their contributions
# Use gradient color
fviz_pca_var(res.pca, col.var="contrib")+
scale_color_gradient2(low="white", mid="blue",
high="red", midpoint = 96) + theme_minimal()
# Variable contributions on axis 1
fviz_contrib(res.pca, choice="var", axes = 1 )
# Variable contributions on axes 1 + 2
fviz_contrib(res.pca, choice="var", axes = 1:2)
# Individuals
# +++++++++++++++++++
# Extract the results for individuals
ind <- get_pca_ind(res.pca)
ind
Principal Component Analysis Results for individuals
===================================================
Name Description
1 "$coord" "Coordinates for the individuals"
2 "$cos2" "Cos2 for the individuals"
3 "$contrib" "contributions of the individuals"
# Coordinates of individuals
head(ind$coord)
Dim.1 Dim.2 Dim.3 Dim.4
1 -2.257141 -0.4784238 0.12727962 0.024087508
2 -2.074013 0.6718827 0.23382552 0.102662845
3 -2.356335 0.3407664 -0.04405390 0.028282305
4 -2.291707 0.5953999 -0.09098530 -0.065735340
5 -2.381863 -0.6446757 -0.01568565 -0.035802870
6 -2.068701 -1.4842053 -0.02687825 0.006586116
# Graph of individuals
fviz_pca_ind(res.pca)
# Use text only
fviz_pca_ind(res.pca, geom="text")
# Control automatically the color of individuals using the cos2
# cos2 = the quality of the individuals on the factor map
# Use points only
# Use gradient color
fviz_pca_ind(res.pca, col.ind="cos2", geom = "point") +
scale_color_gradient2(low="blue", mid="white",
high="red", midpoint=0.6)+ theme_minimal()
# Color by groups
p <- fviz_pca_ind(res.pca, geom = "point",
habillage=iris$Species, addEllipses=TRUE,
ellipse.level= 0.95)+ theme_minimal()
print(p)
# Change color using RColorBrewer palettes
p + scale_color_brewer(palette ="Set1")
# Change color manually
p + scale_color_manual(values=c("#999999", "#E69F00", "#56B4E9"))
# Biplot of individuals and variables
# ++++++++++++++++++++++++++
fviz_pca_biplot(res.pca)
# Only variables are labelled
fviz_pca_biplot(res.pca, label="var", habillage=iris$Species,
addEllipses=TRUE, ellipse.level=0.95) +
theme_minimal()
Correspondence Analysis
The function CA() in FactoMineR package is used:
# Install and load FactoMineR to compute CA
# install.packages("FactoMineR")
library("FactoMineR")
data("housetasks")
res.ca <- CA(housetasks, graph = FALSE)
# Result for column variables
get_ca_col(res.ca)
Correspondence Analysis - Results for columns
===================================================
Name Description
1 "$coord" "Coordinates for the columns"
2 "$cos2" "Cos2 for the columns"
3 "$contrib" "contributions of the columns"
4 "$inertia" "Inertia of the columns"
# Result for row variables
get_ca_row(res.ca)
Correspondence Analysis - Results for rows
===================================================
Name Description
1 "$coord" "Coordinates for the rows"
2 "$cos2" "Cos2 for the rows"
3 "$contrib" "contributions of the rows"
4 "$inertia" "Inertia of the rows"
# Visualize row contributions on axis 1
fviz_contrib(res.ca, choice ="row", axes = 1)
# Visualize column contributions on axis 1
fviz_contrib(res.ca, choice ="col", axes = 1)
# Graph of row variables
fviz_ca_row(res.ca)
# Select and visualize rows with cos2 >= 0.9
fviz_ca_row(res.ca, select.row = list(cos2 = 0.9))
# Graph of column points
fviz_ca_col(res.ca)
# Symmetric biplot of rows and columns
fviz_ca_biplot(res.ca)
# Asymmetric biplot; use arrows for columns
fviz_ca_biplot(res.ca, map ="rowprincipal",
arrow = c(FALSE, TRUE)) +
theme_minimal()
Multiple Correspondence Analysis
The function MCA() in FactoMineR package is used:
library(FactoMineR)
data(poison)
res.mca <- MCA(poison, quanti.sup = 1:2,
quali.sup = 3:4, graph=FALSE)
# Extract the results for variable categories
get_mca_var(res.mca)
Multiple Correspondence Analysis Results for variables
===================================================
Name Description
1 "$coord" "Coordinates for categories"
2 "$cos2" "Cos2 for categories"
3 "$contrib" "contributions of categories"
# Extract the results for individuals
get_mca_ind(res.mca)
Multiple Correspondence Analysis Results for individuals
===================================================
Name Description
1 "$coord" "Coordinates for the individuals"
2 "$cos2" "Cos2 for the individuals"
3 "$contrib" "contributions of the individuals"
# Visualize variable category contributions on axis 1
fviz_contrib(res.mca, choice ="var", axes = 1)
# Visualize individual contributions on axis 1
# select the top 20
fviz_contrib(res.mca, choice ="ind", axes = 1, top = 20)
# Graph of individuals
# ++++++++++++++++++++++++++++
fviz_mca_ind(res.mca, col.ind = "blue")+
theme_minimal()
# Color individuals by groups
grp <- as.factor(poison[, "Vomiting"])
fviz_mca_ind(res.mca, label="none", habillage=grp)
# Add ellipses
p <- fviz_mca_ind(res.mca, label="none", habillage=grp,
addEllipses=TRUE, ellipse.level=0.95)
print(p)
# Change group colors using RColorBrewer color palettes
p + scale_color_brewer(palette="Paired") +
theme_minimal()
p + scale_color_brewer(palette="Set1") +
theme_minimal()
# Graph of variable categories
# ++++++++++++++++++++++++++++
fviz_mca_var(res.mca)
# Select the top 10 contributing variable categories
fviz_mca_var(res.mca, select.var = list(contrib = 10))
# Select by names
fviz_mca_var(res.mca,
select.var= list(name = c("Courg_n", "Fever_y", "Fever_n")))
# biplot
# ++++++++++++++++++++++++++
fviz_mca_biplot(res.mca)
# Select the top 30 contributing individuals
# And the top 10 variables
fviz_mca_biplot(res.mca,
select.ind = list(contrib = 30),
select.var = list(contrib = 10))
Infos
This analysis was performed using R software (ver. 3.1.2) and factoextra (ver. 1.0.2).