
Multiple Correspondence Analysis Essentials: Interpretation and application to investigate the associations between categories of multiple qualitative variables - R software and data mining

Wed, 01 Jul 2015 15:48:33 +0200

Required packages
Load FactoMineR and factoextra
Data format
Exploratory data analysis
Multiple Correspondence Analysis (MCA)
Summary of MCA outputs
Interpretation of MCA outputs
Eigenvalues/variances and screeplot
MCA scatter plot: Biplot of individuals and variable categories
Variable categories
Individuals
MCA using supplementary individuals and variables
Filter the MCA result
Dimension description
Infos
References and further reading

As described in my previous article, the simple correspondence analysis (CA) is used to analyse the contingency table formed by two categorical variables.

To learn more about CA, read this article: Correspondence Analysis in R: The Ultimate Guide for the Analysis, the Visualization and the Interpretation.

Multiple Correspondence Analysis (MCA) is an extension of simple CA to analyse a data table containing more than two categorical variables.

MCA is generally used to analyse survey data.

The objectives are to identify:

Groups of individuals with similar profiles in their answers to the questions
The associations between variable categories

There are several R functions from different packages to compute MCA, including:

MCA() [in FactoMineR package]
dudi.mca() [in ade4 package]

These packages also provide standard functions to visualize the results of the analysis. The factoextra package can also be used to easily generate elegant graphs.

This article describes how to perform and interpret multiple correspondence analysis using the FactoMineR package.

Required packages

The FactoMineR (for computing MCA) and factoextra (for MCA visualization) packages are used.

These packages can be installed as follows:

install.packages("FactoMineR")

# install.packages("devtools")
devtools::install_github("kassambara/factoextra")

Note that, for factoextra a version >= 1.0.2 is required for this tutorial. If it’s already installed on your computer, you should re-install it to have the most updated version.
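The version requirement above can be checked from R before going further. The following is a minimal sketch; the helper name has_min_version is hypothetical and is not part of FactoMineR or factoextra.

```r
# Hypothetical helper: TRUE if `pkg` is installed at version >= `required`
has_min_version <- function(pkg, required) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    return(FALSE)  # package is not installed at all
  }
  utils::packageVersion(pkg) >= package_version(required)
}

# FALSE here suggests factoextra should be (re-)installed as shown above
has_min_version("factoextra", "1.0.2")
```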

Load FactoMineR and factoextra

library("FactoMineR")
library("factoextra")

Data format

We’ll use the data set poison [in FactoMineR]:

data(poison)
head(poison[, 1:7])

  Age Time   Sick Sex   Nausea Vomiting Abdominals
1   9   22 Sick_y   F Nausea_y  Vomit_n     Abdo_y
2   5    0 Sick_n   F Nausea_n  Vomit_n     Abdo_n
3   6   16 Sick_y   F Nausea_n  Vomit_y     Abdo_y
4   9    0 Sick_n   F Nausea_n  Vomit_n     Abdo_n
5   7   14 Sick_y   M Nausea_n  Vomit_y     Abdo_y
6  72    9 Sick_y   M Nausea_n  Vomit_n     Abdo_y

An image of the data is shown below:

These data come from a survey carried out on primary school children who suffered from food poisoning. They were asked about their symptoms and about what they ate.

The data contains 55 rows (children, individuals) and 15 columns (variables).

Only some of these individuals (children) and variables will be used to perform the multiple correspondence analysis (MCA).

The coordinates of the remaining individuals and variables on the factor map will be predicted after the MCA.

In MCA terminology, our data contains:

Active individuals (rows 1:55): Individuals that are used during the correspondence analysis.
Active variables (columns 5:15): Variables that are used for the MCA.
Supplementary variables: They don’t participate in the MCA. The coordinates of these variables will be predicted.
Supplementary continuous variables: Columns 1 and 2, corresponding to the columns Age and Time, respectively.
Supplementary qualitative variables: Columns 3 and 4, corresponding to the columns Sick and Sex, respectively. These factor variables will be used to color individuals by groups.
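The column roles listed above can be written down once and checked for consistency. This is only a sketch: the object name col_roles is hypothetical, while the indexes are the ones given in the list.

```r
# Column roles in poison, as described above
col_roles <- list(
  quanti.sup = 1:2,   # Age, Time
  quali.sup  = 3:4,   # Sick, Sex
  active     = 5:15   # symptoms and foods eaten
)

# Each of the 15 columns should have exactly one role
all_cols <- sort(unlist(col_roles, use.names = FALSE))
stopifnot(identical(all_cols, 1:15))

# Number of columns per role
lengths(col_roles)
```

Keeping the indexes in one place makes it less likely that a column is used as both active and supplementary.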

Subset only active individuals and variables for multiple correspondence analysis:

poison.active <- poison[1:55, 5:15]
head(poison.active[, 1:6])

    Nausea Vomiting Abdominals   Fever   Diarrhae   Potato
1 Nausea_y  Vomit_n     Abdo_y Fever_y Diarrhea_y Potato_y
2 Nausea_n  Vomit_n     Abdo_n Fever_n Diarrhea_n Potato_y
3 Nausea_n  Vomit_y     Abdo_y Fever_y Diarrhea_y Potato_y
4 Nausea_n  Vomit_n     Abdo_n Fever_n Diarrhea_n Potato_y
5 Nausea_n  Vomit_y     Abdo_y Fever_y Diarrhea_y Potato_y
6 Nausea_n  Vomit_n     Abdo_y Fever_y Diarrhea_y Potato_y

Exploratory data analysis

The function summary() can be used to compute the frequency of variable categories. As the data table contains a large number of variables, we’ll display only the results for the first 4 variables.

Statistical summaries:

# Summary of the first 4 variables
summary(poison.active)[, 1:4]

      Nausea        Vomiting     Abdominals       Fever     
 "Nausea_n:43  " "Vomit_n:33  " "Abdo_n:18  " "Fever_n:20  "
 "Nausea_y:12  " "Vomit_y:22  " "Abdo_y:37  " "Fever_y:35  "

It’s also possible to plot the frequency of variable categories:

for (i in 1:ncol(poison.active)) {
  plot(poison.active[,i], main=colnames(poison.active)[i],
       ylab = "Count", col="steelblue", las = 2)
  }

The graphs above can be used to identify variable categories with a very low frequency. Such rare categories can distort the analysis.
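Rare categories can also be listed programmatically rather than read off the bar plots. The sketch below uses a hypothetical helper, rare_categories(), and a toy data frame rather than the poison data. The cut-off min_prop is an assumption to tune for your own data, not a rule from FactoMineR.

```r
# Hypothetical helper: names of categories whose relative frequency is below `min_prop`
rare_categories <- function(df, min_prop = 0.05) {
  rare <- character(0)
  for (v in names(df)) {
    props <- prop.table(table(df[[v]]))   # relative frequency of each category
    rare <- c(rare, names(props)[as.vector(props) < min_prop])
  }
  rare
}

# Toy data (not the poison data): Fish_n occurs once in 5 rows
toy <- data.frame(
  Fish = factor(c("Fish_y", "Fish_y", "Fish_y", "Fish_y", "Fish_n")),
  Mayo = factor(c("Mayo_y", "Mayo_n", "Mayo_y", "Mayo_n", "Mayo_y"))
)
rare_categories(toy, min_prop = 0.25)
```

The same call could be applied to the active data, for example rare_categories(poison.active).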

Multiple Correspondence Analysis (MCA)

The function MCA() [in FactoMineR package] can be used. A simplified format is:

MCA(X, ncp = 5, graph = TRUE)

X : a data frame with n rows (individuals) and p columns (categorical variables)
ncp : number of dimensions kept in the final results.
graph : a logical value. If TRUE a graph is displayed.

In the R code below, the MCA is performed only on the active individuals/variables:

res.mca <- MCA(poison.active, graph = FALSE)

The output of the function MCA() is a list. Printing it shows the available components:

print(res.mca)

**Results of the Multiple Correspondence Analysis (MCA)**
The analysis was performed on 55 individuals, described by 11 variables
*The results are available in the following objects:

   name              description                       
1  "$eig"            "eigenvalues"                     
2  "$var"            "results for the variables"       
3  "$var$coord"      "coord. of the categories"        
4  "$var$cos2"       "cos2 for the categories"         
5  "$var$contrib"    "contributions of the categories" 
6  "$var$v.test"     "v-test for the categories"       
7  "$ind"            "results for the individuals"     
8  "$ind$coord"      "coord. for the individuals"      
9  "$ind$cos2"       "cos2 for the individuals"        
10 "$ind$contrib"    "contributions of the individuals"
11 "$call"           "intermediate results"            
12 "$call$marge.col" "weights of columns"              
13 "$call$marge.li"  "weights of rows"

The object created by the function MCA() stores its results as a list of components. These are described in the next sections.

Summary of MCA outputs

The function summary.MCA() [in FactoMineR] is used to print a summary of multiple correspondence analysis results:

summary(object, nb.dec = 3, nbelements = 10,
        ncp = 3, file = "", ...)

object: an object of class MCA
nb.dec: number of decimals printed
nbelements: number of elements (individuals, categories, variables) to be written. To have all the elements, use nbelements = Inf.
ncp: number of dimensions to be printed
file: an optional file name for exporting the summaries.

Print the summary of the MCA for the dimensions 1 and 2:

summary(res.mca, nb.dec = 2, ncp = 2)



Eigenvalues
                      Dim.1  Dim.2  Dim.3  Dim.4  Dim.5  Dim.6  Dim.7  Dim.8  Dim.9 Dim.10 Dim.11
Variance               0.34   0.13   0.11   0.10   0.08   0.07   0.06   0.06   0.04   0.01   0.01
% of var.             33.52  12.91  10.73   9.59   7.88   7.11   6.02   5.58   4.12   1.30   1.23
Cumulative % of var.  33.52  46.44  57.17  66.76  74.64  81.75  87.77  93.35  97.47  98.77 100.00

Individuals (the 10 first)
             Dim.1   ctr  cos2   Dim.2   ctr  cos2  
1          | -0.45  1.11  0.35 | -0.26  0.98  0.12 |
2          |  0.84  3.79  0.56 | -0.03  0.01  0.00 |
3          | -0.45  1.09  0.55 |  0.14  0.26  0.05 |
4          |  0.88  4.20  0.75 | -0.09  0.10  0.01 |
5          | -0.45  1.09  0.55 |  0.14  0.26  0.05 |
6          | -0.36  0.70  0.02 | -0.44  2.68  0.04 |
7          | -0.45  1.09  0.55 |  0.14  0.26  0.05 |
8          | -0.64  2.23  0.62 | -0.01  0.00  0.00 |
9          | -0.45  1.11  0.35 | -0.26  0.98  0.12 |
10         | -0.14  0.11  0.04 |  0.12  0.21  0.03 |

Categories (the 10 first)
             Dim.1   ctr  cos2 v.test   Dim.2   ctr  cos2 v.test  
Nausea_n   |  0.27  1.52  0.26   3.72 |  0.12  0.81  0.05   1.69 |
Nausea_y   | -0.96  5.43  0.26  -3.72 | -0.43  2.91  0.05  -1.69 |
Vomit_n    |  0.48  3.73  0.34   4.31 | -0.41  7.07  0.25  -3.68 |
Vomit_y    | -0.72  5.60  0.34  -4.31 |  0.61 10.61  0.25   3.68 |
Abdo_n     |  1.32 15.42  0.85   6.76 | -0.04  0.03  0.00  -0.18 |
Abdo_y     | -0.64  7.50  0.85  -6.76 |  0.02  0.01  0.00   0.18 |
Fever_n    |  1.17 13.54  0.78   6.51 | -0.17  0.78  0.02  -0.97 |
Fever_y    | -0.67  7.74  0.78  -6.51 |  0.10  0.45  0.02   0.97 |
Diarrhea_n |  1.18 13.80  0.80   6.57 |  0.00  0.00  0.00  -0.02 |
Diarrhea_y | -0.68  7.88  0.80  -6.57 |  0.00  0.00  0.00   0.02 |

Categorical variables (eta2)
             Dim.1 Dim.2  
Nausea     |  0.26  0.05 |
Vomiting   |  0.34  0.25 |
Abdominals |  0.85  0.00 |
Fever      |  0.78  0.02 |
Diarrhae   |  0.80  0.00 |
Potato     |  0.03  0.40 |
Fish       |  0.01  0.03 |
Mayo       |  0.38  0.03 |
Courgette  |  0.02  0.45 |
Cheese     |  0.19  0.05 |

The result of the function summary() contains 4 tables:

Table 1 - Eigenvalues: contains the variances and the percentage of variance retained by each dimension.
Table 2 - Individuals: contains the coordinates, the contribution and the cos2 (quality of representation, between 0 and 1) of the first 10 active individuals on dimensions 1 and 2.
Table 3 - Categories: contains the coordinates, the contribution and the cos2 (quality of representation, between 0 and 1) of the first 10 active variable categories on dimensions 1 and 2. This table also contains a column called v.test. The value of the v.test generally lies between -2 and 2. For a given variable category, if the absolute value of the v.test is greater than 2, the coordinate is significantly different from 0.
Table 4 - Categorical variables (eta2): contains the squared correlation ratio between each variable and the dimensions.
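The rule of thumb for the v.test in table 3 can be made explicit. The sketch below assumes the v.test behaves approximately like a standard normal statistic, so that an absolute value above 1.96 corresponds to roughly the 5% level. The three values are copied from the Dim.2 column of the summary above.

```r
# v.test values on Dim.2, copied from the summary table above
v_test <- c(Nausea_n = 1.69, Vomit_n = -3.68, Abdo_n = -0.18)

# Two-sided p-value under a standard normal approximation (an assumption)
p_value <- 2 * pnorm(-abs(v_test))

# Flag coordinates that differ significantly from 0 at roughly the 5% level
significant <- abs(v_test) > 1.96

data.frame(v_test, p_value = signif(p_value, 2), significant)
```

Of these three categories, only Vomit_n is flagged on dimension 2.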

To export the summary to a file, use the code: summary(res.mca, file = "myfile.txt")
To display the summary of more than 10 elements, use the argument nbelements in the function summary().

Interpretation of MCA outputs

MCA results are interpreted in the same way as the results of a simple correspondence analysis (CA).

I recommend reading the interpretation of simple CA, which is described comprehensively in my previous post: Correspondence Analysis in R: The Ultimate Guide for the Analysis, the Visualization and the Interpretation.

Eigenvalues/variances and screeplot

The proportion of variance retained by the different dimensions (axes) can be extracted using the function get_eigenvalue() [in factoextra] as follows:

eigenvalues <- get_eigenvalue(res.mca)
head(round(eigenvalues, 2))

      eigenvalue variance.percent cumulative.variance.percent
Dim.1       0.34            33.52                       33.52
Dim.2       0.13            12.91                       46.44
Dim.3       0.11            10.73                       57.17
Dim.4       0.10             9.59                       66.76
Dim.5       0.08             7.88                       74.64
Dim.6       0.07             7.11                       81.75
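As a sanity check on this table, the total inertia and the number of dimensions can be derived from the number of variables and categories. The sketch below relies on standard properties of MCA (total inertia K/J - 1 and at most K - J dimensions, for J variables and K categories). Comparing each dimension with the average share of inertia is a common heuristic, stated here as an assumption and not as the article's rule.

```r
n_variables  <- 11   # active variables (columns 5:15)
n_categories <- 22   # two categories per variable

total_inertia   <- n_categories / n_variables - 1    # K/J - 1
max_dimensions  <- n_categories - n_variables        # K - J
mean_eigenvalue <- total_inertia / max_dimensions    # equals 1/J

# Percentages of variance copied from the summary output
pct <- c(33.52, 12.91, 10.73, 9.59, 7.88, 7.11, 6.02, 5.58, 4.12, 1.30, 1.23)

# Dimensions whose share of inertia exceeds the average share
which(pct > 100 * mean_eigenvalue / total_inertia)
```

Because the total inertia equals 1 here, each eigenvalue multiplied by 100 is directly the percentage of variance, which is what the table shows (0.34 and 33.52% for Dim.1).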

The function fviz_screeplot() [in factoextra package] can be used to draw the scree plot (the percentages of inertia explained by the MCA dimensions):

fviz_screeplot(res.mca)

Read more about eigenvalues and screeplot: Eigenvalues data visualization

MCA scatter plot: Biplot of individuals and variable categories

The function plot.MCA() [in FactoMineR package] can be used. A simplified format is:

plot(x, axes = c(1,2), choix=c("ind", "var"))

x: An object of class MCA
axes: A numeric vector of length 2 specifying the components to plot
choix: The graph to be plotted. Possible values are "ind" for the individuals and "var" for the variables

FactoMineR base graph for MCA:

plot(res.mca)

It’s also possible to use the function fviz_mca_biplot() [in factoextra package] to draw a nice-looking plot:

fviz_mca_biplot(res.mca)

# Change the theme
fviz_mca_biplot(res.mca) +
  theme_minimal()

Read more about fviz_mca_biplot(): fviz_mca_biplot

The graph above shows a global pattern within the data. Rows (individuals) are represented by blue points and columns (variable categories) by red triangles.

The distance between any two row points or any two column points gives a measure of their similarity (or dissimilarity).

Row points with similar profiles are close to each other on the factor map. The same holds true for column points.

Variable categories

The function get_mca_var() [in factoextra] is used to extract the results for variable categories. This function returns a list containing the coordinates, the cos2 and the contribution of variable categories:

var <- get_mca_var(res.mca)
var

Multiple Correspondence Analysis Results for variables
 ===================================================
  Name       Description                  
1 "$coord"   "Coordinates for categories" 
2 "$cos2"    "Cos2 for categories"        
3 "$contrib" "contributions of categories"

Correlation between variables and principal dimensions

Variables can be visualized as follows:

plot(res.mca, choix = "var")

The plot above helps to identify the variables that are the most correlated with each dimension. The squared correlations between variables and the dimensions are used as coordinates.
It can be seen that the variables Diarrhae, Abdominals and Fever are the most correlated with dimension 1. Similarly, the variables Courgette and Potato are the most correlated with dimension 2.
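To make the squared correlation between a variable and a dimension concrete, here is a minimal sketch of the correlation ratio (eta2): the share of the variance of the coordinates on one dimension that is explained by the categories of a variable. The helper eta2() and the toy coordinates are hypothetical and are not taken from the poison data.

```r
# Hypothetical helper: correlation ratio (eta2) between a numeric score and a factor
eta2 <- function(score, group) {
  group_means <- ave(score, group)                 # each value replaced by its group mean
  between <- sum((group_means - mean(score))^2)    # between-group sum of squares
  total   <- sum((score - mean(score))^2)          # total sum of squares
  between / total
}

# Toy coordinates of 4 individuals on one dimension
coord <- c(-1.0, -0.8, 0.9, 0.9)

eta2(coord, factor(c("y", "y", "n", "n")))  # categories well separated: close to 1
eta2(coord, factor(c("y", "n", "y", "n")))  # categories mixed: close to 0
```

A variable whose categories split the individuals cleanly along a dimension gets a value near 1 on that dimension, as Abdominals does on dimension 1 in the summary.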

Coordinates of variable categories

head(round(var$coord, 2))

         Dim 1 Dim 2 Dim 3 Dim 4 Dim 5
Nausea_n  0.27  0.12 -0.27  0.03  0.07
Nausea_y -0.96 -0.43  0.95 -0.12 -0.26
Vomit_n   0.48 -0.41  0.08  0.27  0.05
Vomit_y  -0.72  0.61 -0.13 -0.41 -0.08
Abdo_n    1.32 -0.04 -0.01 -0.15 -0.07
Abdo_y   -0.64  0.02  0.00  0.07  0.03

Use the function fviz_mca_var() [in factoextra] to visualize only variable categories:

# Default plot
fviz_mca_var(res.mca)

It’s possible to change the color and the shape of the variable points using the arguments col.var and shape.var as follows:

fviz_mca_var(res.mca, col.var="black", shape.var = 15)

Note that it’s also possible to draw the graph of variables using only the FactoMineR base graph. The argument invisible is used to hide the individual points:

# Hide individuals
plot(res.mca, invisible="ind")

Contribution of variable categories to the dimensions

The contribution of the variable categories (in %) to the definition of the dimensions can be extracted as follows:

head(round(var$contrib,2))

         Dim 1 Dim 2 Dim 3 Dim 4 Dim 5
Nausea_n  1.52  0.81  4.67  0.08  0.49
Nausea_y  5.43  2.91 16.73  0.30  1.76
Vomit_n   3.73  7.07  0.36  4.26  0.19
Vomit_y   5.60 10.61  0.54  6.39  0.29
Abdo_n   15.42  0.03  0.00  0.73  0.18
Abdo_y    7.50  0.01  0.00  0.36  0.09

The variable categories with the largest values contribute the most to the definition of the dimensions.

The different categories in the table are:

categories <- rownames(var$coord)
length(categories)

[1] 22

print(categories)

 [1] "Nausea_n"   "Nausea_y"   "Vomit_n"    "Vomit_y"    "Abdo_n"     "Abdo_y"     "Fever_n"   
 [8] "Fever_y"    "Diarrhea_n" "Diarrhea_y" "Potato_n"   "Potato_y"   "Fish_n"     "Fish_y"    
[15] "Mayo_n"     "Mayo_y"     "Courg_n"    "Courg_y"    "Cheese_n"   "Cheese_y"   "Icecream_n"
[22] "Icecream_y"

It’s possible to use the function corrplot() [in corrplot package] to highlight the most contributing categories for each dimension:

library("corrplot")
corrplot(var$contrib, is.corr = FALSE)

The function fviz_contrib() [in factoextra] can be used to draw a bar plot of variable contributions:

# Contributions of variables on Dim.1
fviz_contrib(res.mca, choice = "var", axes = 1)

If the contribution of variable categories were uniform, the expected value would be 1/number_of_categories = 1/22 = 4.5%.
The red dashed line on the graph above indicates the expected average contribution. For a given dimension, any category with a contribution larger than this threshold could be considered as important in contributing to that dimension.
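The threshold can also be applied directly to the contribution values. The sketch below uses only the six Dim 1 contributions printed earlier by head(round(var$contrib, 2)), so it illustrates the rule and does not rank all 22 categories.

```r
# Contributions (%) to Dim 1 for the first six categories, copied from the table above
contrib_dim1 <- c(Nausea_n = 1.52, Nausea_y = 5.43, Vomit_n = 3.73,
                  Vomit_y = 5.60, Abdo_n = 15.42, Abdo_y = 7.50)

n_categories <- 22
expected <- 100 / n_categories   # about 4.5%

# Categories contributing more than the expected average
names(contrib_dim1)[contrib_dim1 > expected]
```

With the full result, the same comparison could be run on var$contrib[, 1].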

It can be seen that the categories Abdo_n, Diarrhea_n, Fever_n and Mayo_n are the most important in the definition of the first dimension.

# Contributions of categories on Dim.2
fviz_contrib(res.mca, choice = "var", axes = 2)

The categories Courg_n, Potato_n, Vomit_y and Icecream_n contribute the most to dimension 2.

# Total contribution on Dim.1 and Dim.2
fviz_contrib(res.mca, choice = "var", axes = 1:2)

The total contribution of a category to explaining the variation retained by Dim.1 and Dim.2 is calculated as follows: (C1 * Eig1) + (C2 * Eig2).

C1 and C2 are the contributions of the category to dimensions 1 and 2, respectively. Eig1 and Eig2 are the eigenvalues of dimensions 1 and 2, respectively.

The expected average contribution of a category for Dim.1 and Dim.2 is: (4.5 * Eig1) + (4.5 * Eig2) = (4.5 * 0.34) + (4.5 * 0.13) = 2.12%
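The arithmetic above can be reproduced in a few lines. This sketch uses the rounded eigenvalues from the summary and the contributions of Abdo_n printed earlier. The variable names are illustrative only.

```r
eig <- c(Dim1 = 0.34, Dim2 = 0.13)   # rounded eigenvalues from the summary
expected_single <- 4.5               # expected contribution (%) on a single dimension

# Reference value for Dim.1 and Dim.2 combined
expected_total <- sum(expected_single * eig)

# Same weighting applied to one category (Abdo_n: 15.42% on Dim 1, 0.03% on Dim 2)
abdo_n_total <- sum(c(15.42, 0.03) * eig)

c(expected = expected_total, Abdo_n = abdo_n_total)
```

Abdo_n lies well above the reference value, which is consistent with its position among the top contributors to dimension 1.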

If your data contains many categories, the top contributing categories can be displayed as follow:

fviz_contrib(res.mca, choice = "var", axes = 1, top = 10)

Read more about fviz_contrib(): fviz_contrib

A second option is to draw a scatter plot of categories and to highlight categories according to the amount of their contribution. The function fviz_mca_var() is used.

Note that, with the factoextra package, the color or the transparency of the variable categories can be automatically controlled by the value of their contributions, their cos2, or their coordinates on the x or y axis.

# Control category point colors using their contribution
# Possible values for the argument col.var are:
  # "cos2", "contrib", "coord", "x", "y"
fviz_mca_var(res.mca, col.var = "contrib")

# Change the gradient color
fviz_mca_var(res.mca, col.var = "contrib") +
  scale_color_gradient2(low = "white", mid = "blue",
                        high = "red", midpoint = 2) +
  theme_minimal()

The scatter plot is also helpful to highlight the most important categories in the determination of the dimensions.

In addition, we can get an idea of which pole of the dimensions the categories are actually contributing to.

It can be seen that the categories Abdo_n, Diarrhea_n, Fever_n and Mayo_n make an important contribution to the positive pole of the first dimension, while the categories Fever_y and Diarrhea_y make a major contribution to the negative pole of the first dimension, and so on.

It’s also possible to automatically control the transparency of variable categories by their contributions. The argument alpha.var is used:

# Control the transparency of categories using their contribution
# Possible values for the argument alpha.var are :
  # "cos2", "contrib", "coord", "x", "y"
fviz_mca_var(res.mca, alpha.var="contrib")+
  theme_minimal()

It’s possible to select and display only the top contributing categories as illustrated in the R code below.

# Select the top 10 contributing categories
fviz_mca_var(res.mca, select.var=list(contrib=10))

Variable category/individual selections are discussed in detail in the next sections.

Read more about fviz_mca_var(): fviz_mca_var

Cos2 : The quality of representation of variable categories

The two dimensions 1 and 2 are sufficient to retain 46% of the total inertia contained in the data.

However, not all the points are equally well displayed in the two dimensions.

The quality of representation of the categories on the factor map is called the squared cosine (cos2) or the squared correlation.

The cos2 measures the degree of association between variable categories and a particular axis.

The cos2 of variable categories can be extracted as follows:

head(var$cos2)

             Dim 1        Dim 2        Dim 3       Dim 4       Dim 5
Nausea_n 0.2562007 0.0528025759 2.527485e-01 0.004084375 0.019466197
Nausea_y 0.2562007 0.0528025759 2.527485e-01 0.004084375 0.019466197
Vomit_n  0.3442016 0.2511603912 1.070855e-02 0.112294813 0.004126898
Vomit_y  0.3442016 0.2511603912 1.070855e-02 0.112294813 0.004126898
Abdo_n   0.8451157 0.0006215864 1.262496e-05 0.011479077 0.002374929
Abdo_y   0.8451157 0.0006215864 1.262496e-05 0.011479077 0.002374929

The values of the cos2 lie between 0 and 1.

The sum of the cos2 of a given category over all the MCA dimensions is equal to one.

The quality of representation of a variable category or an individual in n dimensions is simply the sum of the squared cosine of that variable category or individual over the n dimensions.

If a variable category is well represented by two dimensions, the sum of its cos2 over these two dimensions is close to one.

For some of the categories, more than 2 dimensions are required to perfectly represent the data.
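As a small worked example of the quality of representation in two dimensions, the sketch below adds the Dim 1 and Dim 2 cos2 values for three categories. The numbers are copied from the head(var$cos2) output above.

```r
# cos2 on the first two dimensions, copied from the output above
cos2 <- rbind(
  Nausea_n = c(0.2562007, 0.0528025759),
  Vomit_n  = c(0.3442016, 0.2511603912),
  Abdo_n   = c(0.8451157, 0.0006215864)
)
colnames(cos2) <- c("Dim 1", "Dim 2")

# Quality of representation on the plane spanned by dimensions 1 and 2
quality <- rowSums(cos2)
round(quality, 2)
```

Abdo_n is well represented on this plane, whereas Nausea_n would need further dimensions (its cos2 on Dim 3 is about 0.25).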

Visualize the cos2 of variable categories using corrplot:

library("corrplot")
corrplot(var$cos2, is.corr=FALSE)

The function fviz_cos2() [in factoextra] can be used to draw a bar plot of the cos2 of variable categories:

# Cos2 of variable categories on Dim.1 and Dim.2
fviz_cos2(res.mca, choice = "var", axes = 1:2)

Note that the variable categories Fish_n, Fish_y, Icecream_n and Icecream_y are not very well represented by the first two dimensions. This implies that the position of the corresponding points on the scatter plot should be interpreted with some caution. A higher dimensional solution is probably necessary.

Read more about fviz_cos2(): fviz_cos2

Individuals

The function get_mca_ind() [in factoextra] is used to extract the results for individuals. This function returns a list containing the coordinates, the cos2 and the contributions of individuals:

ind <- get_mca_ind(res.mca)
ind

Multiple Correspondence Analysis Results for individuals
 ===================================================
  Name       Description                       
1 "$coord"   "Coordinates for the individuals" 
2 "$cos2"    "Cos2 for the individuals"        
3 "$contrib" "contributions of the individuals"

The results for individuals give the same information as described for variable categories. For this reason, I’ll just display the results for individuals in this section without comment.

Coordinates of individuals

head(ind$coord)

       Dim 1       Dim 2       Dim 3       Dim 4       Dim 5
1 -0.4525811 -0.26415072  0.17151614  0.01369348 -0.11696806
2  0.8361700 -0.03193457 -0.07208249 -0.08550351  0.51978710
3 -0.4481892  0.13538726 -0.22484048 -0.14170168 -0.05004753
4  0.8803694 -0.08536230 -0.02052044 -0.07275873 -0.22935022
5 -0.4481892  0.13538726 -0.22484048 -0.14170168 -0.05004753
6 -0.3594324 -0.43604390 -1.20932223  1.72464616  0.04348157

Use the function fviz_mca_ind() [in factoextra] to visualize only individuals:

fviz_mca_ind(res.mca)

Read more about fviz_mca_ind(): fviz_mca_ind

Note that it’s also possible to draw the graph of individuals using only the FactoMineR base graph. The argument invisible is used to hide the variable categories on the factor map:

# Hide variable categories
plot(res.mca, invisible="var")

Contribution of individuals to the dimensions

head(ind$contrib)

     Dim 1      Dim 2        Dim 3        Dim 4      Dim 5
1 1.110927 0.98238297  0.498254685  0.003555817 0.31554778
2 3.792117 0.01435818  0.088003703  0.138637089 6.23134138
3 1.089470 0.25806722  0.856229950  0.380768961 0.05776914
4 4.203611 0.10259105  0.007132055  0.100387990 1.21319013
5 1.089470 0.25806722  0.856229950  0.380768961 0.05776914
6 0.700692 2.67693398 24.769968729 56.404214518 0.04360547

Note that you can use the previously mentioned corrplot() function to visualize the contribution of individuals.

Use the function fviz_contrib() [in factoextra] to visualize the contributions of individuals on dimensions 1+2:

fviz_contrib(res.mca, choice = "ind", axes = 1:2, top = 20)

If the individual contributions were uniform, the expected value would be 1/nrow(poison) = 1/55 = 1.8%.
The expected average contribution (reference line) of an individual for Dim.1 and Dim.2 is: (1.8 * Eig1) + (1.8 * Eig2) = (1.8 * 0.34) + (1.8 * 0.13) = 0.85%.

Draw a scatter plot of individuals and highlight them according to the amount of their contribution. The function fviz_mca_ind() [in factoextra] is used:

# Control individual colors using their contribution
# Possible values for the argument col.ind are :
  # "cos2", "contrib", "coord", "x", "y"
fviz_mca_ind(res.mca, col.ind = "contrib") +
  scale_color_gradient2(low = "white", mid = "blue",
                        high = "red", midpoint = 0.85) +
  theme_minimal()

Note that it’s also possible to automatically control the transparency of individuals by their contributions using the argument alpha.ind:

# Control the transparency of individuals using their contribution
# Possible values for the argument alpha.ind are :
  # "cos2", "contrib", "coord", "x", "y"
fviz_mca_ind(res.mca, alpha.ind="contrib")

Cos2 : The quality of representation of individuals

head(ind$cos2)

       Dim 1        Dim 2        Dim 3        Dim 4        Dim 5
1 0.34652591 0.1180447167 0.0497683175 0.0003172275 0.0231460846
2 0.55589562 0.0008108236 0.0041310808 0.0058126211 0.2148103098
3 0.54813888 0.0500176790 0.1379484860 0.0547920948 0.0068349171
4 0.74773962 0.0070299584 0.0004062504 0.0051072923 0.0507479873
5 0.54813888 0.0500176790 0.1379484860 0.0547920948 0.0068349171
6 0.02485357 0.0365775483 0.2813443706 0.5722083217 0.0003637178

Note that the value of the cos2 is between 0 and 1. A cos2 close to 1 corresponds to a variable category or individual that is well represented on the factor map.

The function fviz_cos2() [in factoextra] can be used to draw a bar plot of the cos2 of individuals:

# Cos2 of individuals on Dim.1 and Dim.2
fviz_cos2(res.mca, choice = "ind", axes = 1:2, top = 20)

Change the color of individuals by groups

As mentioned above, our data contains supplementary qualitative variables: Columns 3 and 4 corresponding to the columns Sick and Sex, respectively. These factor variables will be used to color individuals by groups.

sick <- as.factor(poison$Sick)
head(sick)

[1] Sick_y Sick_n Sick_y Sick_n Sick_y Sick_y
Levels: Sick_n Sick_y

sex <- as.factor(poison$Sex)
head(sex)

[1] F F F F M M
Levels: F M

Individuals factor map:

# Default plot
fviz_mca_ind(res.mca, label ="none")

Change individual colors by groups using the levels of the variable sick. The argument habillage is used:

fviz_mca_ind(res.mca, label = "none", habillage=sick)

Add concentration ellipses around each group: the argument addEllipses draws the ellipses, while habillage specifies the factor variable used to color the observations by groups.

fviz_mca_ind(res.mca, label="none", habillage = sick,
             addEllipses = TRUE, ellipse.level = 0.95)

Now, let’s:

make a biplot of individuals and variable categories
change the color of individuals by groups (sick levels)
show only the labels for variables

fviz_mca_biplot(res.mca, 
  habillage = sick, addEllipses = TRUE,
  label = "var", shape.var = 15) +
  scale_color_brewer(palette="Dark2")+
  theme_minimal()

Note that it’s possible to color the individuals using any of the qualitative variables in the initial data table (poison).

Let’s color the individuals by groups using the levels of the variable Vomiting:

fviz_mca_ind(res.mca, 
  habillage = poison$Vomiting, addEllipses = TRUE) +
  scale_color_brewer(palette="Dark2")+
  theme_minimal()

It’s also possible to use the index of the column as follows (habillage = 2):

fviz_mca_ind(res.mca, 
  habillage = 2, addEllipses = TRUE) +
  scale_color_brewer(palette="Dark2")+
  theme_minimal()

You can also use the function plotellipses() [in FactoMineR] to draw confidence ellipses around the categories. The simplified format is:

plotellipses(model, keepvar="all", axes = c(1,2))

model: object of class MCA or PCA
keepvar: a boolean or numeric vector of indexes of variables or a character vector of names of variables. If keepvar is "all", "quali" or "quali.sup", the variables plotted are, respectively, all the categorical variables, only those used to compute the dimensions (active variables), or only the supplementary categorical variables. If keepvar is a numeric vector of indexes or a character vector of names of variables, only the relevant variables are plotted.

plotellipses(res.mca, keepvar=1)

plotellipses(res.mca, keepvar=1:4)

plotellipses(res.mca, keepvar="Vomiting")

plotellipses(res.mca, keepvar=c("Vomiting", "Fever"))

plotellipses(res.mca, keepvar="all")

MCA using supplementary individuals and variables

As described above, the data set poison contains:

supplementary continuous variables (quanti.sup = 1:2, corresponding to the columns Age and Time, respectively)
supplementary qualitative variables (quali.sup = 3:4, corresponding to the columns Sick and Sex, respectively). These factor variables are used to color individuals by groups.

The data doesn’t contain supplementary individuals. However, for demonstration, we’ll use the individuals 53:55 as supplementary individuals. The coordinates of these individuals will be predicted from the parameters of the MCA on the active individuals (1:52).

Supplementary variables and individuals are not used for the determination of the principal dimensions. Their coordinates are predicted using only the information provided by the performed multiple correspondence analysis on active variables/individuals.
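To give an idea of how such a prediction can work, the sketch below applies the usual MCA transition formula for a single dimension: the coordinate of an individual is the mean of the coordinates of the categories it belongs to, divided by the square root of the eigenvalue. This formula is stated here as an assumption about the method and is not taken from this article. The helper sup_ind_coord() and the numbers are hypothetical.

```r
# Hypothetical helper: coordinate of a supplementary individual on one dimension
# cat_coord: coordinates, on that dimension, of the categories chosen by the individual
#            (one category per active variable)
# eig:       eigenvalue of that dimension
sup_ind_coord <- function(cat_coord, eig) {
  mean(cat_coord) / sqrt(eig)
}

# Toy numbers (not from the poison data): 4 active variables, eigenvalue 0.25
sup_ind_coord(cat_coord = c(0.5, 1.5, -0.5, 0.5), eig = 0.25)
```

In practice MCA() performs this computation itself when ind.sup is supplied, so the helper is only meant to show why no refitting is needed.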

To specify supplementary individuals and variables, the function MCA() can be used as follows:

MCA(X,  ncp = 5, ind.sup = NULL,
    quanti.sup=NULL, quali.sup=NULL, graph=TRUE, axes = c(1,2))

X: a data frame. Rows are individuals and columns are variables.
ncp: number of dimensions kept in the final results.
ind.sup: a numeric vector specifying the indexes of the supplementary individuals
quanti.sup, quali.sup: numeric vectors specifying, respectively, the indexes of the supplementary quantitative and qualitative variables
graph: a logical value. If TRUE a graph is displayed.
axes: a vector of length 2 specifying the components to be plotted

Example of usage:

res.mca <- MCA(poison, ind.sup=53:55, 
               quanti.sup = 1:2, quali.sup = 3:4,  graph=FALSE)

The summary of the MCA is:

summary(res.mca, nb.dec = 2, ncp = 2)



Eigenvalues
                      Dim.1  Dim.2  Dim.3  Dim.4  Dim.5  Dim.6  Dim.7  Dim.8  Dim.9 Dim.10 Dim.11
Variance               0.33   0.13   0.11   0.10   0.09   0.07   0.06   0.06   0.04   0.01   0.01
% of var.             32.88  13.04  10.63   9.67   8.60   6.66   6.40   5.94   3.89   1.33   0.95
Cumulative % of var.  32.88  45.92  56.56  66.23  74.83  81.49  87.89  93.83  97.72  99.05 100.00

Individuals (the 10 first)
             Dim.1   ctr  cos2   Dim.2   ctr  cos2  
1          | -0.44  1.14  0.35 | -0.27  1.10  0.13 |
2          |  0.85  4.23  0.54 | -0.01  0.00  0.00 |
3          | -0.43  1.09  0.50 |  0.13  0.24  0.04 |
4          |  0.91  4.81  0.77 | -0.03  0.01  0.00 |
5          | -0.43  1.09  0.50 |  0.13  0.24  0.04 |
6          | -0.34  0.67  0.02 | -0.45  2.93  0.04 |
7          | -0.43  1.09  0.50 |  0.13  0.24  0.04 |
8          | -0.63  2.32  0.61 | -0.02  0.00  0.00 |
9          | -0.44  1.14  0.35 | -0.27  1.10  0.13 |
10         | -0.12  0.08  0.03 |  0.14  0.27  0.04 |

Supplementary individuals
             Dim.1  cos2   Dim.2  cos2  
53         |  1.08  0.36 |  0.52  0.08 |
54         | -0.12  0.03 |  0.14  0.04 |
55         | -0.43  0.50 |  0.13  0.04 |

Categories (the 10 first)
             Dim.1   ctr  cos2 v.test   Dim.2   ctr  cos2 v.test  
Nausea_n   |  0.29  1.78  0.28   3.77 |  0.13  0.94  0.06   1.72 |
Nausea_y   | -0.97  5.94  0.28  -3.77 | -0.44  3.12  0.06  -1.72 |
Vomit_n    |  0.46  3.56  0.33   4.13 | -0.39  6.57  0.24  -3.53 |
Vomit_y    | -0.73  5.70  0.33  -4.13 |  0.63 10.51  0.24   3.53 |
Abdo_n     |  1.32 15.80  0.85   6.58 |  0.02  0.01  0.00   0.12 |
Abdo_y     | -0.64  7.68  0.85  -6.58 | -0.01  0.01  0.00  -0.12 |
Fever_n    |  1.17 13.89  0.79   6.35 | -0.12  0.36  0.01  -0.65 |
Fever_y    | -0.68  8.00  0.79  -6.35 |  0.07  0.21  0.01   0.65 |
Diarrhea_n |  1.26 15.31  0.85   6.57 |  0.04  0.04  0.00   0.20 |
Diarrhea_y | -0.67  8.10  0.85  -6.57 | -0.02  0.02  0.00  -0.20 |

Categorical variables (eta2)
             Dim.1 Dim.2  
Nausea     |  0.28  0.06 |
Vomiting   |  0.33  0.24 |
Abdominals |  0.85  0.00 |
Fever      |  0.79  0.01 |
Diarrhae   |  0.85  0.00 |
Potato     |  0.03  0.40 |
Fish       |  0.01  0.03 |
Mayo       |  0.33  0.04 |
Courgette  |  0.02  0.48 |
Cheese     |  0.13  0.03 |

Supplementary categories
             Dim.1  cos2 v.test   Dim.2  cos2 v.test  
Sick_n     |  1.42  0.89   6.75 |  0.00  0.00   0.01 |
Sick_y     | -0.63  0.89  -6.75 |  0.00  0.00  -0.01 |
F          | -0.03  0.00  -0.23 |  0.11  0.01   0.83 |
M          |  0.03  0.00   0.23 | -0.12  0.01  -0.83 |

Supplementary categorical variables (eta2)
             Dim.1 Dim.2  
Sick       |  0.89  0.00 |
Sex        |  0.00  0.01 |

Supplementary continuous variables
             Dim.1   Dim.2  
Age        |  0.00 | -0.01 |
Time       | -0.84 | -0.08 |

For supplementary individuals and variable categories, only the coordinates and the quality of representation (cos2) on the factor map are shown. They don’t contribute to the construction of the dimensions, so no contribution (ctr) is reported.

Make a biplot of individuals and variable categories

FactoMineR base graph:

plot(res.mca)

Active individuals are in blue
Supplementary individuals are in darkblue
Active variable categories are in red
Supplementary variable categories are in darkgreen

Use factoextra:

fviz_mca_biplot(res.mca) +
  theme_minimal()

Visualize supplementary variables

The graph below highlights the correlation between variables (active & supplementary) and dimensions:

plot(res.mca, choix ="var")

Supplementary qualitative variable categories

All the results (coordinates, cos2, v.test and eta2) for the supplementary qualitative variable categories can be extracted as follows:

res.mca$quali.sup

$coord
             Dim 1         Dim 2       Dim 3        Dim 4       Dim 5
Sick_n  1.41809140  0.0020394048  0.13199139 -0.016036841 -0.08354663
Sick_y -0.63026284 -0.0009064021 -0.05866284  0.007127485  0.03713184
F      -0.03108147  0.1123143957  0.05033124 -0.055927173 -0.06832928
M       0.03356798 -0.1212995474 -0.05435774  0.060401347  0.07379562

$cos2
             Dim 1        Dim 2       Dim 3        Dim 4       Dim 5
Sick_n 0.893770319 1.848521e-06 0.007742990 0.0001143023 0.003102240
Sick_y 0.893770319 1.848521e-06 0.007742990 0.0001143023 0.003102240
F      0.001043342 1.362369e-02 0.002735892 0.0033780765 0.005042401
M      0.001043342 1.362369e-02 0.002735892 0.0033780765 0.005042401

$v.test
            Dim 1        Dim 2      Dim 3       Dim 4      Dim 5
Sick_n  6.7514655  0.009709509  0.6284047 -0.07635063 -0.3977615
Sick_y -6.7514655 -0.009709509 -0.6284047  0.07635063  0.3977615
F      -0.2306739  0.833551410  0.3735378 -0.41506855 -0.5071119
M       0.2306739 -0.833551410 -0.3735378  0.41506855  0.5071119

$eta2
           Dim 1        Dim 2       Dim 3        Dim 4       Dim 5
Sick 0.893770319 1.848521e-06 0.007742990 0.0001143023 0.003102240
Sex  0.001043342 1.362369e-02 0.002735892 0.0033780765 0.005042401

Factor map :

fviz_mca_var(res.mca) + theme_minimal()

# Hide active variables
fviz_mca_var(res.mca, invisible ="var") +
  theme_minimal()

# Hide supplementary qualitative variables
fviz_mca_var(res.mca, invisible ="quali.sup") +
  theme_minimal()

Supplementary variable categories are shown in darkgreen color.

Supplementary quantitative variables

The coordinates of supplementary quantitative variables are:

res.mca$quanti.sup

$coord
            Dim 1       Dim 2       Dim 3       Dim 4       Dim 5
Age   0.003934896 -0.00741340 -0.26494536  0.20015501  0.02928483
Time -0.838158507 -0.08330586 -0.08718851 -0.08421599 -0.02316931

Graph using FactoMineR base graph:

plot(res.mca, choix="quanti.sup")

Visualize supplementary individuals

The results for supplementary individuals can be extracted as follows:

res.mca$ind.sup

$coord
        Dim 1     Dim 2      Dim 3      Dim 4      Dim 5
53  1.0835684 0.5172478  0.5794063  0.5390903  0.4553650
54 -0.1249473 0.1417271 -0.1765234 -0.1526587 -0.2779565
55 -0.4315948 0.1270468 -0.2071580 -0.1186804 -0.1891760

$cos2
        Dim 1      Dim 2      Dim 3      Dim 4      Dim 5
53 0.36304957 0.08272764 0.10380536 0.08986204 0.06411692
54 0.03157652 0.04062716 0.06302535 0.04713607 0.15626590
55 0.50232519 0.04352713 0.11572730 0.03798314 0.09650827

Factor map for individuals:

fviz_mca_ind(res.mca) +
  theme_minimal()

# Show the label of ind.sup only
fviz_mca_ind(res.mca, label="ind.sup") +
  theme_minimal()

Supplementary individuals are shown in darkblue.

Filter the MCA result

If you have many individuals/variable categories, it’s possible to visualize only some of them using the arguments select.ind and select.var.

select.ind, select.var: a selection of individuals/variable categories to be drawn. Allowed values are NULL or a list containing the arguments name, cos2 or contrib:

name: is a character vector containing individuals/variable category names to be drawn
cos2: if cos2 is in [0, 1], ex: 0.6, then individuals/variable categories with a cos2 > 0.6 are drawn; if cos2 > 1, ex: 5, then the top 5 active individuals/variable categories and the top 5 supplementary individuals/variable categories with the highest cos2 are drawn
contrib: if contrib > 1, ex: 5, then the top 5 individuals/variable categories with the highest contributions are drawn
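
The selection rule described above can be sketched in base R. This is only an illustration of the logic, not the code used by factoextra: the helper select_by() is hypothetical and the cos2 values are made up for the example.

```r
# Hypothetical helper mimicking the rule described above:
#   value in [0, 1] -> keep the elements whose score reaches the threshold
#   value > 1       -> keep the top 'value' elements
select_by <- function(score, value) {
  if (value <= 1) {
    names(score)[score >= value]
  } else {
    ranked <- sort(score, decreasing = TRUE)
    names(ranked)[seq_len(min(value, length(ranked)))]
  }
}

# Illustrative cos2 values (not computed from res.mca)
cos2 <- c(Nausea_n = 0.28, Vomit_n = 0.33, Abdo_n = 0.85,
          Fever_n = 0.79, Diarrhea_n = 0.86)

select_by(cos2, 0.4)  # categories passing the cos2 threshold
select_by(cos2, 2)    # the two categories with the highest cos2
```

The same two-mode logic applies to contrib, except that only values > 1 (top n) are meaningful there.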

# Visualize variable categories with cos2 >= 0.4
fviz_mca_var(res.mca, select.var = list(cos2 = 0.4))

# Top 10 active variables with the highest cos2
fviz_mca_var(res.mca, select.var= list(cos2 = 10))

The top 10 active variable categories and the top 10 supplementary variable categories are shown.

# Select by names
name <- list(name = c("Fever_n", "Abdo_y", "Diarrhea_n", "Fever_y", "Vomit_y", "Vomit_n"))
fviz_mca_var(res.mca, select.var = name)

# Top 5 contributing individuals and variable categories
fviz_mca_biplot(res.mca, select.ind = list(contrib = 5), 
               select.var = list(contrib = 5)) +
  theme_minimal()

Supplementary individuals/variable categories are not shown because they don’t contribute to the construction of the axes.

Dimension description

The function dimdesc() [in FactoMineR] can be used to identify the variables most correlated with a given dimension.

A simplified format is :

dimdesc(res, axes = 1:2, proba = 0.05)

res : an object of class MCA
axes : a numeric vector specifying the dimensions to be described
proba : the significance level

Example of usage :

res.desc <- dimdesc(res.mca, axes = c(1,2))
# Description of dimension 1
res.desc$`Dim 1`

$quanti
     correlation     p.value
Time  -0.8381585 9.12658e-15

$quali
                  R2      p.value
Sick       0.8937703 5.368221e-26
Abdominals 0.8493262 3.429439e-22
Diarrhae   0.8467702 5.229788e-22
Fever      0.7916690 1.168654e-18
Vomiting   0.3348718 7.001487e-06
Mayo       0.3257425 9.967995e-06
Nausea     0.2794053 5.623583e-05
Cheese     0.1344785 7.495656e-03

$category
             Estimate      p.value
Sick_n      0.5872910 5.368221e-26
Abdo_n      0.5632879 3.429439e-22
Diarrhea_n  0.5545730 5.229788e-22
Fever_n     0.5297728 1.168654e-18
Vomit_n     0.3410366 7.001487e-06
Mayo_n      0.4325471 9.967995e-06
Nausea_n    0.3597065 5.623583e-05
Cheese_n    0.3290968 7.495656e-03
Cheese_y   -0.3290968 7.495656e-03
Nausea_y   -0.3597065 5.623583e-05
Mayo_y     -0.4325471 9.967995e-06
Vomit_y    -0.3410366 7.001487e-06
Fever_y    -0.5297728 1.168654e-18
Diarrhea_y -0.5545730 5.229788e-22
Abdo_y     -0.5632879 3.429439e-22
Sick_y     -0.5872910 5.368221e-26

# Description of dimension 2
res.desc$`Dim 2`

$quali
                 R2      p.value
Courgette 0.4839477 1.039252e-08
Potato    0.4020987 4.489421e-07
Vomiting  0.2449186 1.917736e-04
Icecream  0.1366683 6.989716e-03

$category
             Estimate      p.value
Courg_n     0.4261065 1.039252e-08
Potato_y    0.4910893 4.489421e-07
Vomit_y     0.1836850 1.917736e-04
Icecream_n  0.2863045 6.989716e-03
Icecream_y -0.2863045 6.989716e-03
Vomit_n    -0.1836850 1.917736e-04
Potato_n   -0.4910893 4.489421e-07
Courg_y    -0.4261065 1.039252e-08
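
Note that the R2 reported for Sick on dimension 1 (0.8937703) is the same value as its eta2 in res.mca$quali.sup. As far as I understand the FactoMineR documentation, this R2 comes from a one-way analysis of variance of the individual coordinates on each qualitative variable. Below is a minimal base R sketch of that idea, using made-up coordinates and a hypothetical two-level factor (not the poison data):

```r
# Made-up coordinates of six individuals on one dimension
coord <- c(-0.6, -0.5, -0.7, 0.9, 1.1, 1.0)
# Hypothetical qualitative variable observed on the same individuals
sick <- factor(c("y", "y", "y", "n", "n", "n"))

# R2 of the one-way ANOVA model coord ~ sick
fit <- lm(coord ~ sick)
r2 <- summary(fit)$r.squared

# Same quantity by hand: between-group sum of squares / total sum of squares
group_mean <- tapply(coord, sick, mean)
group_size <- tapply(coord, sick, length)
ss_between <- sum(group_size * (group_mean - mean(coord))^2)
ss_total   <- sum((coord - mean(coord))^2)
eta2 <- ss_between / ss_total

c(r2 = r2, eta2 = eta2)
```

A value close to 1 means that the categories of the variable separate the individuals well along the dimension.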

Infos

This analysis has been performed using R software (ver. 3.2.1), FactoMineR (ver. 1.30) and factoextra (ver. 1.0.2)

References and further reading

Bendixen M. 1995, Compositional perceptual mapping using chi-squared tree analysis and Correspondence Analysis, Journal of Marketing Management, 11, 571-581.
Bendixen M. 2003, A Practical Guide to the Use of Correspondence Analysis in Marketing Research, Marketing Bulletin, 14, Technical Note 2. http://marketing-bulletin.massey.ac.nz/V14/MB_V14_T2_Bendixen.pdf
Greenacre M. Contribution biplots. http://www.econ.upf.edu/docs/papers/downloads/1162.pdf
François Husson, http://factominer.free.fr/contact/index.html

ca package and factoextra : Correspondence Analysis - R software and data mining

Sun, 28 Jun 2015 12:01:34 +0200

Required packages
Load ca and factoextra
Data format
Correspondence analysis (CA)
Summary of CA outputs
Interpretation of CA outputs
Eigenvalues and scree plot
Biplot of row and column variables
References and further reading
Infos

As described here, correspondence analysis is used to analyse the contingency table formed by two qualitative variables.

This article describes how to perform a correspondence analysis using the ca package.

Required packages

ca (for computing CA) and factoextra (for CA visualization) packages are used.

These packages can be installed as follows:

install.packages("ca")

# install.packages("devtools")
devtools::install_github("kassambara/factoextra")

Note that factoextra version >= 1.0.1 is required for this tutorial. If an older version is already installed on your computer, re-install it to get the latest version.

Load ca and factoextra

library("ca")
library("factoextra")

Data format

We’ll use the data set housetasks [in factoextra]; the same table is also available in the ade4 package.

data(housetasks)
head(housetasks, 13)

           Wife Alternating Husband Jointly
Laundry     156          14       2       4
Main_meal   124          20       5       4
Dinner       77          11       7      13
Breakfeast   82          36      15       7
Tidying      53          11       1      57
Dishes       32          24       4      53
Shopping     33          23       9      55
Official     12          46      23      15
Driving      10          51      75       3
Finances     13          13      21      66
Insurance     8           1      53      77
Repairs       0           3     160       2
Holidays      0           1       6     153

The data is a contingency table containing 13 house tasks and their distribution within the couple:

rows are the different tasks
values are the frequencies of the tasks done:
- by the wife only
- alternatively
- by the husband only
- or jointly

Correspondence analysis (CA)

The function ca() [in ca package] can be used. A simplified format is :

ca(obj,  nd = NA)

obj : a data frame, matrix or table (contingency table)
nd : number of dimensions to be included in the output

Example of usage :

res.ca <- ca(housetasks, nd = 3)

The output of the function ca() is structured as a list including :

names(res.ca)

 [1] "sv"         "nd"         "rownames"   "rowmass"    "rowdist"    "rowinertia" "rowcoord"  
 [8] "rowsup"     "colnames"   "colmass"    "coldist"    "colinertia" "colcoord"   "colsup"    
[15] "call"

The standard coordinates of row variables can be extracted as follows:

res.ca$rowcoord

                 Dim1       Dim2       Dim3
Laundry    -1.3461225 -0.7425167 -0.8885935
Main_meal  -1.1883460 -0.7347025 -0.4602894
Dinner     -0.9399625 -0.4618664 -0.5819061
Breakfeast -0.6902730 -0.6787794  0.6183521
Tidying    -0.5344773  0.6511077 -0.2643198
Dishes     -0.2564623  0.6625334  0.7489349
Shopping   -0.1597173  0.6045960  0.5684434
Official    0.3075858 -0.3801811  2.5905284
Driving     1.0067309 -0.9795065  1.5274961
Finances    0.3674852  0.9262210  0.0976236
Insurance   0.8782125  0.7102288 -0.8118104
Repairs     2.0748608 -1.2955835 -1.3244577
Holidays    0.3426748  2.1511592 -0.3635596

The standard coordinates of columns are:

res.ca$colcoord

                   Dim1       Dim2       Dim3
Wife        -1.13682130 -0.5474873 -0.5608580
Alternating -0.08439706 -0.4371162  2.3807453
Husband      1.57560041 -0.9023133 -0.5298508
Jointly      0.20280133  1.5389023 -0.1302974
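
To see where these standard coordinates come from, the computation can be reproduced in base R with a singular value decomposition. This is a sketch of the usual CA algebra, not the code of the ca package; the table is typed in from the listing above, and the signs of the axes returned by svd() are arbitrary, so they may be flipped relative to the output of ca().

```r
# housetasks contingency table, typed in from the listing above
housetasks <- matrix(
  c(156, 14,   2,   4,
    124, 20,   5,   4,
     77, 11,   7,  13,
     82, 36,  15,   7,
     53, 11,   1,  57,
     32, 24,   4,  53,
     33, 23,   9,  55,
     12, 46,  23,  15,
     10, 51,  75,   3,
     13, 13,  21,  66,
      8,  1,  53,  77,
      0,  3, 160,   2,
      0,  1,   6, 153),
  ncol = 4, byrow = TRUE,
  dimnames = list(
    c("Laundry", "Main_meal", "Dinner", "Breakfeast", "Tidying", "Dishes",
      "Shopping", "Official", "Driving", "Finances", "Insurance",
      "Repairs", "Holidays"),
    c("Wife", "Alternating", "Husband", "Jointly")))

P <- housetasks / sum(housetasks)   # correspondence matrix
rmass <- rowSums(P)                 # row masses
cmass <- colSums(P)                 # column masses

# Standardized residuals and their singular value decomposition
S <- (P - outer(rmass, cmass)) / sqrt(outer(rmass, cmass))
dec <- svd(S)
sv <- dec$d[1:3]                    # singular values (square roots of eigenvalues)

# Standard coordinates, then principal coordinates = standard * singular value
row_std  <- dec$u[, 1:3] / sqrt(rmass)
row_prin <- sweep(row_std, 2, sv, "*")
rownames(row_std) <- rownames(row_prin) <- rownames(housetasks)

round(sv^2, 4)             # eigenvalues (principal inertias)
round(row_std[1:3, ], 3)   # standard coordinates of the first rows
round(row_prin[1:3, ], 3)  # principal coordinates of the first rows
```

Standard coordinates are what res.ca$rowcoord stores; principal coordinates, which are scaled by the singular values, are the ones reported by summary().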

Note that the methods print() and summary() are available for ca objects.

# printing method
print(x)

# Summary method
summary(object, scree = TRUE, rows = TRUE, columns = TRUE)

x, object: CA object
scree: If TRUE, the scree plot is included in the output
rows: If TRUE, the results for rows are included in the output
columns: If TRUE, the results for columns are included in the output

Summary of CA outputs

summary(res.ca)


Principal inertias (eigenvalues):

 dim    value      %   cum%   scree plot               
 1      0.542889  48.7  48.7  ************             
 2      0.445003  39.9  88.6  **********               
 3      0.127048  11.4 100.0  ***                      
        -------- -----                                 
 Total: 1.114940 100.0                                 


Rows:
     name   mass  qlt  inr    k=1 cor ctr    k=2 cor ctr    k=3 cor ctr  
1  | Lndr |  101 1000  120 | -992 740 183 | -495 185  56 | -317  75  80 |
2  | Mn_m |   88 1000   81 | -876 742 124 | -490 232  47 | -164  26  19 |
3  | Dnnr |   62 1000   34 | -693 777  55 | -308 154  13 | -207  70  21 |
4  | Brkf |   80 1000   37 | -509 505  38 | -453 400  37 |  220  95  31 |
5  | Tdyn |   70 1000   22 | -394 440  20 |  434 535  30 |  -94  25   5 |
6  | Dshs |   65 1000   18 | -189 118   4 |  442 646  28 |  267 236  36 |
7  | Shpp |   69 1000   13 | -118  64   2 |  403 748  25 |  203 189  22 |
8  | Offc |   55 1000   48 |  227  53   5 | -254  66   8 |  923 881 369 |
9  | Drvn |   80 1000   91 |  742 432  81 | -653 335  76 |  544 233 186 |
10 | Fnnc |   65 1000   27 |  271 161   9 |  618 837  56 |   35   3   1 |
11 | Insr |   80 1000   52 |  647 576  61 |  474 309  40 | -289 115  53 |
12 | Rprs |   95 1000  281 | 1529 707 407 | -864 226 159 | -472  67 166 |
13 | Hldy |   92 1000  176 |  252  30  11 | 1435 962 425 | -130   8  12 |

Columns:
    name   mass  qlt  inr    k=1 cor ctr    k=2 cor ctr    k=3 cor ctr  
1 | Wife |  344 1000  270 | -838 802 445 | -365 152 103 | -200  46 108 |
2 | Altr |  146 1000  106 |  -62   5   1 | -292 105  28 |  849 890 825 |
3 | Hsbn |  218 1000  342 | 1161 772 542 | -602 208 178 | -189  20  61 |
4 | Jntl |  292 1000  282 |  149  21  12 | 1027 977 691 |  -46   2   5 |

The result of the function summary() contains 3 tables:

Table 1 - Eigenvalues: table 1 contains the eigenvalues and the percentage of inertia retained by each dimension. Additionally, accumulated percentages and a scree plot are shown.
Table 2 contains the results for row variables (all values are multiplied by 1000):
- k=1, k=2 and k=3: the principal coordinates on the first 3 dimensions.
- cor and ctr: the squared correlations (cor, also called cos2) and the contributions (ctr) of the points, expressed per mille.
- mass: the mass of each point, i.e. its row total divided by the grand total (X1000).
- qlt: the total quality of representation of the point by the 3 included dimensions (X1000). In our example, it is the sum of the squared correlations over the three included dimensions.
- inr: the inertia of the point (per mille of the total inertia).
Table 3 contains the results for column variables (the same as the row variables).
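
As a quick check of how these columns relate to each other, the values printed for Laundry can be recomputed by hand. The sketch below only uses the rounded numbers shown in the summary, so it reproduces the table up to rounding; the formulas are the standard CA definitions, not code taken from the ca package.

```r
# Laundry row as printed by summary(res.ca), rescaled from per mille
mass   <- 101 / 1000
coord  <- c(-992, -495, -317) / 1000      # principal coordinates, k = 1..3
cor_pm <- c(740, 185, 75)                 # squared correlations (per mille)
eig    <- c(0.542889, 0.445003, 0.127048) # principal inertias

# Quality: sum of the squared correlations over the included dimensions
qlt <- sum(cor_pm)

# Contribution to each dimension: mass * coord^2 / eigenvalue (per mille)
ctr <- round(1000 * mass * coord^2 / eig)

# Share of the total inertia carried by the point (per mille)
inr <- round(1000 * mass * sum(coord^2) / sum(eig))

qlt
ctr
inr
```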

The function summary.ca() returns a list : list(scree, rows, columns).

Use the R code below to get the table containing the results for rows:

summary(res.ca)$rows

   name mass  qlt  inr  k=1 cor ctr  k=2 cor ctr  k=3 cor ctr
1  Lndr  101 1000  120 -992 740 183 -495 185  56 -317  75  80
2  Mn_m   88 1000   81 -876 742 124 -490 232  47 -164  26  19
3  Dnnr   62 1000   34 -693 777  55 -308 154  13 -207  70  21
4  Brkf   80 1000   37 -509 505  38 -453 400  37  220  95  31
5  Tdyn   70 1000   22 -394 440  20  434 535  30  -94  25   5
6  Dshs   65 1000   18 -189 118   4  442 646  28  267 236  36
7  Shpp   69 1000   13 -118  64   2  403 748  25  203 189  22
8  Offc   55 1000   48  227  53   5 -254  66   8  923 881 369
9  Drvn   80 1000   91  742 432  81 -653 335  76  544 233 186
10 Fnnc   65 1000   27  271 161   9  618 837  56   35   3   1
11 Insr   80 1000   52  647 576  61  474 309  40 -289 115  53
12 Rprs   95 1000  281 1529 707 407 -864 226 159 -472  67 166
13 Hldy   92 1000  176  252  30  11 1435 962 425 -130   8  12

The summary for column variables is:

summary(res.ca)$columns

  name mass  qlt  inr  k=1 cor ctr  k=2 cor ctr  k=3 cor ctr
1 Wife  344 1000  270 -838 802 445 -365 152 103 -200  46 108
2 Altr  146 1000  106  -62   5   1 -292 105  28  849 890 825
3 Hsbn  218 1000  342 1161 772 542 -602 208 178 -189  20  61
4 Jntl  292 1000  282  149  21  12 1027 977 691  -46   2   5

Interpretation of CA outputs

The interpretation of correspondence analysis has been described in my previous post: Correspondence Analysis in R: The Ultimate Guide for the Analysis, the Visualization and the Interpretation.

Eigenvalues and scree plot

The proportion of inertia explained by the principal dimensions can be extracted using the function get_eigenvalue() [in factoextra] as follows:

eigenvalues <- get_eigenvalue(res.ca)
eigenvalues

      eigenvalue variance.percent cumulative.variance.percent
Dim.1  0.5428893         48.69222                    48.69222
Dim.2  0.4450028         39.91269                    88.60491
Dim.3  0.1270484         11.39509                   100.00000
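
The two percentage columns are derived directly from the eigenvalues. A minimal base R sketch, with the eigenvalues copied from the output above:

```r
# Eigenvalues copied from the get_eigenvalue() output
eig <- c(Dim.1 = 0.5428893, Dim.2 = 0.4450028, Dim.3 = 0.1270484)

# Percentage of inertia explained by each dimension, and its running total
variance.percent <- 100 * eig / sum(eig)
cumulative <- cumsum(variance.percent)

round(cbind(eigenvalue = eig, variance.percent, cumulative), 2)
```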

The function fviz_screeplot() [in factoextra package] can be used to draw the scree plot (the percentages of inertia explained by the CA dimensions):

fviz_screeplot(res.ca)

Read more about eigenvalues and screeplot: Eigenvalues data visualization

Biplot of row and column variables

The plot() method [in ca package] can be used:

plot(res.ca)

It’s also possible to use the function fviz_ca_biplot() [in factoextra]:

fviz_ca_biplot(res.ca)

Read more about fviz_ca_biplot(): fviz_ca_biplot

References and further reading

Correspondence Analysis in R: The Ultimate Guide for the Analysis, the Visualization and the Interpretation
Correspondence Analysis using ade4 and factoextra
Oleg Nenadić and Michael Greenacre. Correspondence Analysis in R, with Two- and Three-dimensional Graphics: The ca Package. Journal of Statistical Software, May 2007. http://www.jstatsoft.org/v20/i03/paper

Infos

This analysis has been performed using R software (ver. 3.1.2), ca (ver. 0.58) and factoextra (ver. 1.0.2)

MASS package and factoextra : Correspondence Analysis - R software and data mining

Wed, 24 Jun 2015 07:37:48 +0200

Required packages
Load MASS and factoextra
Data format
Correspondence analysis (CA)
Interpretation of CA outputs
Eigenvalues and scree plot
Biplot of row and column variables
Row variables
Column variables
References and further reading
Infos

As illustrated in my previous article, correspondence analysis (CA) is used to analyse the contingency table formed by two categorical variables.

This article describes how to perform correspondence analysis using the MASS package.

Required packages

MASS (for computing CA) and factoextra (for CA visualization) packages are used.

These packages can be installed as follows:

install.packages("MASS")

# install.packages("devtools")
devtools::install_github("kassambara/factoextra")

Note that factoextra version >= 1.0.1 is required for this tutorial. If an older version is already installed on your computer, re-install it to get the latest version.

Load MASS and factoextra

library("MASS")
library("factoextra")

Data format

We’ll use the data set housetasks [in factoextra].

data(housetasks)
head(housetasks)

           Wife Alternating Husband Jointly
Laundry     156          14       2       4
Main_meal   124          20       5       4
Dinner       77          11       7      13
Breakfeast   82          36      15       7
Tidying      53          11       1      57
Dishes       32          24       4      53

The data is a contingency table containing 13 house tasks and their distribution within the couple:

rows are the different tasks
values are the frequencies of the tasks done :
- by the wife only
- alternatively
- by the husband only
- or jointly

Correspondence analysis (CA)

The function corresp() [in MASS package] can be used. A simplified format is :

corresp(x,  nf = 1)

x : a data frame, matrix or table (contingency table)
nf : number of dimensions to be included in the output

Example of usage :

res.ca <- corresp(housetasks, nf= 3)

The output of the function corresp() is an object of class correspondence structured as a list including :

names(res.ca)

[1] "cor"    "rscore" "cscore" "Freq"

cor: the canonical correlations, i.e. the square roots of the eigenvalues
rscore, cscore: the row and column scores
Freq: the initial contingency table
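
The link between cor and the eigenvalues can be checked on a small example. The sketch below uses a made-up 3 x 3 table (not the housetasks data) and keeps all the available dimensions, in which case the sum of the eigenvalues multiplied by the grand total should equal the Pearson chi-square statistic of the table.

```r
library(MASS)

# Made-up 3 x 3 contingency table, for illustration only
tab <- matrix(c(40, 10, 10,
                10, 30, 20,
                10, 20, 50),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("r1", "r2", "r3"), c("c1", "c2", "c3")))

# A 3 x 3 table has at most min(3, 3) - 1 = 2 dimensions
res <- corresp(tab, nf = 2)

eig <- res$cor^2            # eigenvalues (principal inertias)
total_inertia <- sum(eig)

# Total inertia times the grand total vs. the chi-square statistic
chi2 <- unname(chisq.test(tab)$statistic)
c(inertia_x_n = total_inertia * sum(tab), chi2 = chi2)
```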

Interpretation of CA outputs

For the interpretation of the results, read this article: Correspondence Analysis in R: The Ultimate Guide for the Analysis, the Visualization and the Interpretation.

Eigenvalues and scree plot

The proportion of inertia explained by the principal axes can be obtained using the function get_eigenvalue() [in factoextra] as follows:

eigenvalues <- get_eigenvalue(res.ca)
eigenvalues

      eigenvalue variance.percent cumulative.variance.percent
Dim.1  0.5428893         48.69222                    48.69222
Dim.2  0.4450028         39.91269                    88.60491
Dim.3  0.1270484         11.39509                   100.00000

The function fviz_screeplot() [in factoextra package] can be used to draw the scree plot (the percentages of inertia explained by the CA dimensions):

fviz_screeplot(res.ca)

Read more about eigenvalues and screeplot: Eigenvalues data visualization

Biplot of row and column variables

You can use the base R function biplot(res.ca), or the function fviz_ca_biplot() [in factoextra package] to draw a nicer-looking plot:

fviz_ca_biplot(res.ca)

# Change the theme
fviz_ca_biplot(res.ca) +
  theme_minimal()

Read more about fviz_ca_biplot(): fviz_ca_biplot

Row variables

The function get_ca_row() [in factoextra] is used to extract the results for row variables. This function returns a list containing the coordinates, the cos2, the contributions and the inertia of row variables. The function fviz_ca_row() [in factoextra] is used to visualize only row points.

row <- get_ca_row(res.ca)
row

Correspondence Analysis - Results for rows
 ===================================================
  Name       Description                
1 "$coord"   "Coordinates for the rows" 
2 "$cos2"    "Cos2 for the rows"        
3 "$contrib" "contributions of the rows"
4 "$inertia" "Inertia of the rows"

# Coordinates
head(row$coord)

                Dim.1      Dim.2       Dim.3
Laundry    -0.9918368 -0.4953220 -0.31672897
Main_meal  -0.8755855 -0.4901092 -0.16406487
Dinner     -0.6925740 -0.3081043 -0.20741377
Breakfeast -0.5086002 -0.4528038  0.22040453
Tidying    -0.3938084  0.4343444 -0.09421375
Dishes     -0.1889641  0.4419662  0.26694926

# Visualize row variables only 
fviz_ca_row(res.ca) +
  theme_minimal()
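
The cos2 values are tied to the coordinates shown above: for a given row, cos2 on a dimension is the squared coordinate divided by the squared distance of the point to the origin. A minimal sketch for Laundry, with the coordinates copied from row$coord. It assumes, as is the case here, that all 3 dimensions of the table are kept, so that the squared distance is simply the sum of the squared coordinates.

```r
# Principal coordinates of Laundry on the 3 dimensions (from row$coord)
laundry <- c(Dim.1 = -0.9918368, Dim.2 = -0.4953220, Dim.3 = -0.31672897)

# cos2 = squared coordinate / squared distance to the origin
cos2 <- laundry^2 / sum(laundry^2)

round(cos2, 3)
sum(cos2)  # the quality over all kept dimensions
```

The result can be compared with head(row$cos2).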

Column variables

The result for columns gives the same information as described for rows.

col <- get_ca_col(res.ca)
# Coordinates
head(col$coord)

                  Dim.1      Dim.2       Dim.3
Wife        -0.83762154 -0.3652207 -0.19991139
Alternating -0.06218462 -0.2915938  0.84858939
Husband      1.16091847 -0.6019199 -0.18885924
Jointly      0.14942609  1.0265791 -0.04644302

# Visualize column variables only 
fviz_ca_col(res.ca) +
  theme_minimal()

References and further reading

Correspondence Analysis in R: The Ultimate Guide for the Analysis, the Visualization and the Interpretation
Correspondence Analysis using ade4 and factoextra
Oleg Nenadić and Michael Greenacre. Correspondence Analysis in R, with Two- and Three-dimensional Graphics: The ca Package. Journal of Statistical Software, May 2007. http://www.jstatsoft.org/v20/i03/paper

Infos

This analysis has been performed using R software (ver. 3.1.2), MASS and factoextra (ver. 1.0.2)

Correspondence Analysis in R: The Ultimate Guide for the Analysis, the Visualization and the Interpretation - R software and data mining

Mon, 22 Jun 2015 08:17:47 +0200

How is this article organized?
Required packages
Load FactoMineR and factoextra
Data format: Contingency tables
Exploratory data analysis (EDA)
Correspondence analysis (CA)
Summary of CA outputs
Interpretation of CA outputs
Biplot of rows and columns
Correspondence analysis using supplementary rows and columns
Filter CA results
Dimension description
CA and outliers
Infos

Correspondence analysis (CA) is an extension of Principal Component Analysis (PCA) suited to handle qualitative variables (or categorical data).

CA is used to analyze frequencies formed by categorical data (i.e., a contingency table), and it provides factor scores (coordinates) for both the rows and the columns of the contingency table. These coordinates are used to visualize graphically the association between row and column variables in the contingency table.

This article describes how to compute and interpret a correspondence analysis using FactoMineR and factoextra R packages.

The mathematical procedures of CA have been described in my previous tutorial. In the current tutorial, we’ll focus on the practical application and interpretation of correspondence analysis rather than the mathematical and statistical details.

How is this article organized?

This article is organized in 5 main parts:

Part I describes the exploratory data analysis tools for contingency tables
Part II shows how to use FactoMineR package for computing correspondence analysis (CA)
Part III is a step-by-step guide for interpreting and visualizing the output of CA
Part IV explains symmetric and asymmetric biplots. This section is very important and we’ll see why.
Part V covers how to apply correspondence analysis using supplementary rows and columns. This is important if you want to make predictions with CA.

The last sections of this guide also describe how to filter CA results in order to keep only the most contributing variables. Finally, we’ll see how to deal with outliers.

Required packages

Several functions from different R packages can be used to perform correspondence analysis:

CA() [in FactoMineR package]
ca() [in ca package]
dudi.coa() [in ade4 package]
corresp() [in MASS package]

In this tutorial, FactoMineR (for computing CA) and factoextra (for CA visualization) packages are used.

Note that, no matter what function you decide to use for computing CA, the output can be visualized using the R functions available in factoextra package, as described in the next sections.

FactoMineR and factoextra R packages can be installed as follows:

install.packages("FactoMineR")

# install.packages("devtools")
devtools::install_github("kassambara/factoextra")

Note that factoextra version >= 1.0.2 is required for this tutorial. If an older version is already installed on your computer, re-install it to get the latest version.

Load FactoMineR and factoextra

library("FactoMineR")
library("factoextra")

Data format: Contingency tables

We’ll use the data set housetasks [in factoextra].

data(housetasks)
# head(housetasks)

An image of the data is shown below:

The data is a contingency table containing 13 house tasks and their distribution within the couple:

rows are the different tasks
values are the frequencies of the tasks done :
by the wife only
alternatively
by the husband only
or jointly

Exploratory data analysis (EDA)

Most of the EDA methods presented here (graphical matrix, mosaic/association plots and Chi-square statistic) have already been described in my previous tutorial: correspondence analysis basics.

If you’re already familiar with these approaches, you can skip this section.

Visual inspection

The above contingency table is not very large. Therefore, it’s easy to visually inspect and interpret row and column profiles:

It’s evident that the house tasks Laundry, Main_meal and Dinner are most frequently done by the “Wife”.
Repairs and Driving are predominantly done by the husband
Holidays are frequently associated with the column “jointly”

Visualize a contingency table using graphical matrix

It’s also possible to visualize a contingency table using the function balloonplot() [in gplots package]. This function draws a graphical matrix where each cell contains a dot whose size reflects the relative magnitude of the corresponding component.

To execute the R code below, you should install the package gplots: install.packages("gplots").

library("gplots")
# 1. convert the data as a table
dt <- as.table(as.matrix(housetasks))
# 2. Graph
balloonplot(t(dt), main ="housetasks", xlab ="", ylab="",
            label = FALSE, show.margins = FALSE)

Note that, row and column sums are printed by default in the bottom and right margins, respectively. These values can be hidden using the argument show.margins = FALSE.

Mosaic / association plots

The function mosaicplot() from the built-in R package graphics can also be used to visualize a contingency table.

library("graphics")
mosaicplot(dt, shade = TRUE, las=2,
           main = "housetasks")

The argument shade is used to color the graph
The argument las = 2 produces vertical labels

The surface of an element of the mosaic reflects the relative magnitude of its value.

Blue color indicates that the observed value is higher than the expected value if the data were random
Red color specifies that the observed value is lower than the expected value if the data were random

From this mosaic plot, it can be seen that the house tasks Laundry, Main_meal, Dinner and Breakfeast (blue color) are mainly done by the wife in our example.

It’s also possible to use the package vcd to make a mosaic plot (function mosaic()) or an association plot (function assoc()).

# install.packages("vcd")
library("vcd")
# plot just a subset of the table
assoc(head(dt), shade = TRUE, las=3)

Chi-square statistic

Another method to analyse a frequency table is to use the Chi-square test of independence. The Chi-square test evaluates whether there is a significant dependence between row and column categories.

The Chi-square statistic can be easily computed using the function chisq.test() as follows:

chisq <- chisq.test(housetasks)
chisq


    Pearson's Chi-squared test

data:  housetasks
X-squared = 1944.456, df = 36, p-value < 2.2e-16

In our example, the row and the column variables are significantly associated (p-value < 2.2e-16).
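
To make explicit what chisq.test() computes, the statistic can be rebuilt by hand from the observed and expected counts. The sketch below types the housetasks table in as a plain matrix (values copied from the article, row and column names omitted) so that it runs without any package.

```r
# housetasks counts (columns: Wife, Alternating, Husband, Jointly)
housetasks <- matrix(
  c(156, 14,   2,   4,
    124, 20,   5,   4,
     77, 11,   7,  13,
     82, 36,  15,   7,
     53, 11,   1,  57,
     32, 24,   4,  53,
     33, 23,   9,  55,
     12, 46,  23,  15,
     10, 51,  75,   3,
     13, 13,  21,  66,
      8,  1,  53,  77,
      0,  3, 160,   2,
      0,  1,   6, 153),
  ncol = 4, byrow = TRUE)

n <- sum(housetasks)

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(housetasks), colSums(housetasks)) / n

# Pearson chi-square statistic, degrees of freedom and p-value
chi2 <- sum((housetasks - expected)^2 / expected)
df <- (nrow(housetasks) - 1) * (ncol(housetasks) - 1)
p_value <- pchisq(chi2, df, lower.tail = FALSE)

c(chi2 = chi2, df = df, p_value = p_value)
chi2 / n   # total inertia analysed by CA
```

The p-value is so small that it may be displayed as 0, which is why some outputs report "p-value = 0". The ratio chi2 / n is the total inertia that correspondence analysis decomposes into dimensions.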

Correspondence analysis (CA)

The EDA methods described in the previous sections are useful only for small contingency tables. For a large contingency table, statistical approaches such as CA are required to reduce the dimension of the data without losing the most important information. In other words, CA is used to graphically visualize row points and column points in a low dimensional space.

The function CA() [in FactoMineR package] can be used. A simplified format is :

CA(X, ncp = 5, graph = TRUE)

X : a data frame (contingency table)
ncp : number of dimensions kept in the final results.
graph : a logical value. If TRUE a graph is displayed.

Example of usage :

res.ca <- CA(housetasks, graph = FALSE)

The output of the function CA() is a list including :

print(res.ca)

**Results of the Correspondence Analysis (CA)**
The row variable has  13  categories; the column variable has 4 categories
The chi square of independence between the two variables is equal to 1944.456 (p-value =  0 ).
*The results are available in the following objects:

   name              description                   
1  "$eig"            "eigenvalues"                 
2  "$col"            "results for the columns"     
3  "$col$coord"      "coord. for the columns"      
4  "$col$cos2"       "cos2 for the columns"        
5  "$col$contrib"    "contributions of the columns"
6  "$row"            "results for the rows"        
7  "$row$coord"      "coord. for the rows"         
8  "$row$cos2"       "cos2 for the rows"           
9  "$row$contrib"    "contributions of the rows"   
10 "$call"           "summary called parameters"   
11 "$call$marge.col" "weights of the columns"      
12 "$call$marge.row" "weights of the rows"

The object created by the function CA() contains a lot of information stored in several lists and matrices. These values are described in the next sections.

Summary of CA outputs

The function summary.CA() is used to print a summary of correspondence analysis results:

summary(object, nb.dec = 3, nbelements = 10, 
        ncp = TRUE, file ="", ...)

object: an object of class CA
nb.dec: number of decimals printed
nbelements: number of row/column variables to be written. To have all the elements, use nbelements = Inf.
ncp: Number of dimensions to be printed
file: an optional file name for exporting the summaries.

Print the summary of the CA analysis for the dimensions 1 and 2:

summary(res.ca, nb.dec = 2, ncp = 2)


Call:
rmarkdown::render("factominer-correspondance-analysis.Rmd", encoding = "UTF-8") 

The chi square of independence between the two variables is equal to 1944.456 (p-value =  0 ).

Eigenvalues
                      Dim.1  Dim.2  Dim.3  Dim.4
Variance               0.54   0.45   0.13   0.00
% of var.             48.69  39.91  11.40   0.00
Cumulative % of var.  48.69  88.60 100.00 100.00

Rows (the 10 first)
               Dim.1    ctr   cos2    Dim.2    ctr   cos2  
Laundry     |  -0.99  18.29   0.74 |   0.50   5.56   0.18 |
Main_meal   |  -0.88  12.39   0.74 |   0.49   4.74   0.23 |
Dinner      |  -0.69   5.47   0.78 |   0.31   1.32   0.15 |
Breakfeast  |  -0.51   3.82   0.50 |   0.45   3.70   0.40 |
Tidying     |  -0.39   2.00   0.44 |  -0.43   2.97   0.54 |
Dishes      |  -0.19   0.43   0.12 |  -0.44   2.84   0.65 |
Shopping    |  -0.12   0.18   0.06 |  -0.40   2.52   0.75 |
Official    |   0.23   0.52   0.05 |   0.25   0.80   0.07 |
Driving     |   0.74   8.08   0.43 |   0.65   7.65   0.34 |
Finances    |   0.27   0.88   0.16 |  -0.62   5.56   0.84 |

Columns
               Dim.1    ctr   cos2    Dim.2    ctr   cos2  
Wife        |  -0.84  44.46   0.80 |   0.37  10.31   0.15 |
Alternating |  -0.06   0.10   0.00 |   0.29   2.78   0.11 |
Husband     |   1.16  54.23   0.77 |   0.60  17.79   0.21 |
Jointly     |   0.15   1.20   0.02 |  -1.03  69.12   0.98 |

The result of the function summary() contains the chi-square statistic and 3 tables:

Table 1 - Eigenvalues: table 1 contains the variances and the percentage of variances retained by each dimension.
Table 2 contains the coordinates, the contribution and the cos2 (quality of representation [in 0-1]) of the first 10 active row variables on the dimensions 1 and 2.
Table 3 contains the coordinates, the contribution and the cos2 (quality of representation [in 0-1]) of the first 10 active column variables on the dimensions 1 and 2.

Note that,

to export the summary into a file, use summary(res.ca, file = "myfile.txt")
to display the summary of more than 10 elements, use the argument nbelements in the function summary()

Interpretation of CA outputs

Significance of the association between rows and columns

To interpret correspondence analysis, the first step is to evaluate whether there is a significant dependency between the rows and columns.

There are two methods to inspect the significance:

Using the trace
Using the Chi-square statistic

The trace is the total inertia of the table (i.e., the sum of the eigenvalues). The square root of the trace is interpreted as the correlation coefficient between rows and columns.

The correlation coefficient is calculated as follows:

eig <- get_eigenvalue(res.ca)
trace <- sum(eig$eigenvalue) 
cor.coef <- sqrt(trace)
cor.coef

[1] 1.055907

Note that, as a rule of thumb 0.2 is the threshold above which the correlation can be considered as important (Bendixen 1995, 576; Healey 2013, 289-290).

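In our example, the correlation coefficient is 1.06, indicating a strong association between row and column variables. Note that the total inertia is bounded by min(number of rows, number of columns) - 1 rather than by 1, so this coefficient can exceed 1 for tables larger than 2 x 2.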

A more rigorous method is to use the chi-square statistic for examining the association. This appears at the top of the report generated by the function summary.CA(). A high chi-square statistic means a strong link between row and column variables.

In our example, the association is highly significant (chi-square: 1944.456, p-value < 2.2e-16, printed as 0).

Note that the chi-square statistic = trace * n, where n is the grand total of the table (total frequency); see the R code below:

# Chi-square statistics
chi2 <- trace*sum(as.matrix(housetasks))
chi2

[1] 1944.456

# Degree of freedom
df <- (nrow(housetasks) - 1) * (ncol(housetasks) - 1)
# P-value
pval <- pchisq(chi2, df = df, lower.tail = FALSE)
pval

[1] 0

Eigenvalues and scree plot

How many dimensions are sufficient for the data interpretation?

The number of dimensions to retain in the solution can be determined by examining the table of eigenvalues.

As mentioned above, trace is the total sum of eigenvalues. For a given axis, the ratio of the axis eigenvalue to the trace is called the percentage of variance (or total inertia or chi-square value) explained by that axis.

The proportion of variance retained by the different dimensions (axes) can be extracted using the function get_eigenvalue() [in factoextra] as follows:

eigenvalues <- get_eigenvalue(res.ca)
head(round(eigenvalues, 2))

      eigenvalue variance.percent cumulative.variance.percent
Dim.1       0.54            48.69                       48.69
Dim.2       0.45            39.91                       88.60
Dim.3       0.13            11.40                      100.00
Dim.4       0.00             0.00                      100.00

Eigenvalues correspond to the amount of information retained by each axis. Dimensions are ordered decreasingly and listed according to the amount of variance explained in the solution. Dimension 1 explains the most variance in the solution, followed by dimension 2 and so on.

There is no “rule of thumb” to choose the number of dimensions to keep for the data interpretation. It depends on the research question and the researcher’s need. For example, if you are satisfied with 80% of the total inertia explained then use the number of dimensions necessary to achieve that.

Another method is to visually inspect the scree plot, in which dimensions are ordered decreasingly according to the amount of explained inertia.

The function fviz_screeplot() [in factoextra package] can be used to draw the scree plot (the percentages of inertia explained by the CA dimensions):

fviz_screeplot(res.ca)

The point at which the scree plot shows a bend (so called “elbow”) can be considered as indicating an optimal dimensionality.

It’s also possible to calculate an average eigenvalue above which the axis should be kept in the solution.

Our data contains 13 rows and 4 columns.

If the data were random, the expected value of the eigenvalue for each axis would be 1/(nrow(housetasks)-1) = 1/12 = 8.33% in terms of rows.

Likewise, the average axis should account for 1/(ncol(housetasks)-1) = 1/3 = 33.33% in terms of the 4 columns.

Any axis with a contribution larger than the maximum of these two percentages should be considered as important and included in the solution for the interpretation of the data (see, Bendixen 1995, 577).
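The comparison can also be done numerically. The sketch below uses only the table dimensions (13 rows, 4 columns) and the percentages of variance reported above; the variable names are illustrative and not part of any package.

```r
# Dimensions of the housetasks table, as stated in the text
n_row <- 13
n_col <- 4

# Average share of inertia per axis (in %), in terms of rows and of columns
avg_row <- 100 / (n_row - 1)
avg_col <- 100 / (n_col - 1)

# Keep the axes whose share exceeds the larger of the two averages
threshold <- max(avg_row, avg_col)

# Percentages of variance copied from the eigenvalue table above
pct <- c(Dim.1 = 48.69, Dim.2 = 39.91, Dim.3 = 11.40)

round(c(avg_row = avg_row, avg_col = avg_col), 2)
names(pct)[pct > threshold]   # axes to retain under this criterion
```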

The R code below draws the scree plot with a red dashed line specifying the average eigenvalue:

fviz_screeplot(res.ca) +
 geom_hline(yintercept=33.33, linetype=2, color="red")

According to the graph above, only dimensions 1 and 2 should be used in the solution. Dimension 3 explains only 11.4% of the total inertia, which is below the average eigenvalue (33.33%) and too little to be kept for further analysis.

Note that you can use more than 2 dimensions. However, the supplementary dimensions are unlikely to contribute significantly to the interpretation of the nature of the association between the rows and columns.

Dimensions 1 and 2 explain approximately 48.7% and 39.9% of the total inertia respectively. This corresponds to a cumulative total of 88.6% of total inertia retained by the 2 dimensions.

The higher the retention, the more subtlety in the original data is retained in the low-dimensional solution (Mike Bendixen, 2003).

Read more about eigenvalues and screeplot: Eigenvalues data visualization

CA scatter plot: Biplot of row and column variables

The function plot.CA()[in FactoMineR] can be used to plot the coordinates of rows and columns presented in the correspondence analysis output.

A simplified format is :

plot.CA(x, axes = c(1,2), col.row = "blue", col.col = "red")

x : An object of class CA
axes : A numeric vector of length 2 specifying the dimensions to plot
col.row, col.col : colors for rows and columns respectively

FactoMineR base graph for CA:

plot(res.ca)

It’s also possible to use the function fviz_ca_biplot()[in factoextra package] to draw a nice looking plot:

fviz_ca_biplot(res.ca)

# Change the theme
fviz_ca_biplot(res.ca) +
  theme_minimal()

Read more about fviz_ca_biplot(): fviz_ca_biplot

The graph above is called a symmetric plot and shows a global pattern within the data. Rows are represented by blue points and columns by red triangles.

The distance between any row points or column points gives a measure of their similarity (or dissimilarity).

Row points with similar profiles are close together on the factor map. The same holds true for column points.

This graph shows that :

housetasks such as dinner, breakfeast, laundry are done more often by the wife
Driving and repairs are done by the husband
……

A symmetric plot represents the row and column profiles simultaneously in a common space (Bendixen, 2003). In this case, only the distance between row points or the distance between column points can be really interpreted.
The distance between any row and column items is not meaningful! You can only make general statements about the observed pattern.
In order to interpret the distance between column and row points, the column profiles must be presented in row space or vice-versa. This type of map is called asymmetric biplot and is discussed at the end of this article.

The next step for the interpretation is to determine which row and column variables contribute the most in the definition of the different dimensions retained in the model.

Row variables

The function get_ca_row()[in factoextra] is used to extract the results for row variables. This function returns a list containing the coordinates, the cos2, the contribution and the inertia of row variables:

row <- get_ca_row(res.ca)
row

Correspondence Analysis - Results for rows
 ===================================================
  Name       Description                
1 "$coord"   "Coordinates for the rows" 
2 "$cos2"    "Cos2 for the rows"        
3 "$contrib" "contributions of the rows"
4 "$inertia" "Inertia of the rows"

Coordinates of rows

head(row$coord)

                Dim 1      Dim 2       Dim 3
Laundry    -0.9918368  0.4953220 -0.31672897
Main_meal  -0.8755855  0.4901092 -0.16406487
Dinner     -0.6925740  0.3081043 -0.20741377
Breakfeast -0.5086002  0.4528038  0.22040453
Tidying    -0.3938084 -0.4343444 -0.09421375
Dishes     -0.1889641 -0.4419662  0.26694926

The data indicate the coordinates of each row point in each dimension (1, 2 and 3)

Use the function fviz_ca_row() [in factoextra] to visualize only row points:

# Default plot
fviz_ca_row(res.ca)

It’s possible to change the color and the shape of the row points using the arguments col.row and shape.row as follows:

fviz_ca_row(res.ca, col.row="steelblue", shape.row = 15)

Note that, it’s also possible to make the graph of rows only using FactoMineR base graph. The argument invisible is used to hide the column points:

# Hide columns
plot(res.ca, invisible="col")

Read more about fviz_ca_row(): fviz_ca_row

Contribution of rows to the dimensions

The contribution of rows (in %) to the definition of the dimensions can be extracted as follows:

head(row$contrib)

                Dim 1    Dim 2    Dim 3
Laundry    18.2867003 5.563891 7.968424
Main_meal  12.3888433 4.735523 1.858689
Dinner      5.4713982 1.321022 2.096926
Breakfeast  3.8249284 3.698613 3.069399
Tidying     1.9983518 2.965644 0.488734
Dishes      0.4261663 2.844117 3.634294

The rows with the largest values contribute the most to the definition of the dimensions.

It’s possible to use the function corrplot to highlight the most contributing variables for each dimension:

library("corrplot")
corrplot(row$contrib, is.corr=FALSE)

The function fviz_contrib()[in factoextra] can be used to draw a bar plot of row contributions:

# Contributions of rows on Dim.1
fviz_contrib(res.ca, choice = "row", axes = 1)

If the row contributions were uniform, the expected value would be 1/nrow(housetasks) = 1/13 = 7.69%.
The red dashed line on the graph above indicates the expected average contribution. For a given dimension, any row with a contribution larger than this threshold could be considered as important in contributing to that dimension.

It can be seen that the row items Repairs, Laundry, Main_meal and Driving are the most important in the definition of the first dimension.

# Contributions of rows on Dim.2
fviz_contrib(res.ca, choice = "row", axes = 2)

The row items Holidays and Repairs contribute the most to the dimension 2.

# Total contribution on Dim.1 and Dim.2
fviz_contrib(res.ca, choice = "row", axes = 1:2)

The total contribution of a row to explaining the variations retained by Dim.1 and Dim.2 is calculated as follows: (C1 * Eig1) + (C2 * Eig2).

C1 and C2 are the contributions of the row to dimensions 1 and 2, respectively. Eig1 and Eig2 are the eigenvalues of dimensions 1 and 2, respectively.

The expected average contribution of a row for Dim.1 and Dim.2 is : (7.69 * Eig1) + (7.69 * Eig2) = (7.69 * 0.54) + (7.69 * 0.44) = 7.53%
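As a rough check of this arithmetic, the sketch below recomputes the reference value and compares it with the total contribution of one row (Laundry). The eigenvalues are the rounded figures quoted in the text above and the contributions are copied from the output of head(row$contrib); the variable names are illustrative.

```r
# Rounded eigenvalues as quoted in the text above
eig1 <- 0.54
eig2 <- 0.44

# Expected contribution (in %) of a row on one axis if contributions were uniform
n_rows <- 13
uniform_contrib <- 100 / n_rows

# Reference value for Dim.1 and Dim.2 taken together
expected_avg <- uniform_contrib * eig1 + uniform_contrib * eig2

# Total contribution of Laundry: (C1 * Eig1) + (C2 * Eig2)
laundry_total <- 18.2867003 * eig1 + 5.563891 * eig2

round(c(expected = expected_avg, laundry = laundry_total), 2)
laundry_total > expected_avg   # is Laundry above the reference line?
```

Because the inputs are rounded, the result can differ slightly in the second decimal from the value drawn by fviz_contrib().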

If your data contains many row items, the top contributing rows can be displayed as follows:

fviz_contrib(res.ca, choice = "row", axes = 1, top = 5)

Read more about fviz_contrib(): fviz_contrib

A second option is to draw a scatter plot of row points and to highlight rows according to the amount of their contributions. The function fviz_ca_row() is used.

Note that, using factoextra package, the color or the transparency of the row variables can be automatically controlled by the value of their contributions, their cos2, their coordinates on x or y axis.

# Control row point colors using their contribution
# Possible values for the argument col.row are :
  # "cos2", "contrib", "coord", "x", "y"
fviz_ca_row(res.ca, col.row = "contrib")

# Change the gradient color
fviz_ca_row(res.ca, col.row="contrib")+
scale_color_gradient2(low="white", mid="blue", 
                      high="red", midpoint=10)+theme_minimal()

The scatter plot is also helpful to highlight the most important row variables in the determination of the dimensions.

In addition we can have an idea of what pole of the dimensions the row categories are actually contributing to.

It is evident that the row categories Repairs and Driving have an important contribution to the positive pole of the first dimension, while the categories Laundry and Main_meal have a major contribution to the negative pole of the first dimension, and so on.

In other words, dimension 1 is mainly defined by the opposition of Repairs and Driving (positive pole), and Laundry and Main_meal (negative pole).

It’s also possible to control automatically the transparency of rows by their contributions. The argument alpha.row is used:

# Control the transparency of rows using their contribution
# Possible values for the argument alpha.row are :
  # "cos2", "contrib", "coord", "x", "y"
fviz_ca_row(res.ca, alpha.row="contrib")+
  theme_minimal()

It’s possible to select and display only the top contributing rows as illustrated in the R code below.

# Select the top 5 contributing rows
fviz_ca_row(res.ca, alpha.row="contrib", select.row=list(contrib=5))

Row/column selections are discussed in detail in the next sections.

The contribution of row/column variables can be visualized using the so-called contribution biplots (discussed in the last sections of this article).

Read more about fviz_ca_row(): fviz_ca_row

Cos2 : The quality of representation of rows

The result of the analysis shows that, the contingency table has been successfully represented in low dimension space using correspondence analysis. The two dimensions 1 and 2 are sufficient to retain 88.6% of the total inertia contained in the data.

However, not all the points are equally well displayed in the two dimensions.

The quality of representation of the rows on the factor map is called the squared cosine (cos2) or the squared correlations.

The cos2 measures the degree of association between rows/columns and a particular axis.

The cos2 of rows can be extracted as follow:

head(row$cos2)

               Dim 1     Dim 2      Dim 3
Laundry    0.7399874 0.1845521 0.07546047
Main_meal  0.7416028 0.2323593 0.02603787
Dinner     0.7766401 0.1537032 0.06965666
Breakfeast 0.5049433 0.4002300 0.09482670
Tidying    0.4398124 0.5350151 0.02517249
Dishes     0.1181178 0.6461525 0.23572969

The values of the cos2 lie between 0 and 1.

The sum of the cos2 for rows on all the CA dimensions is equal to one.

The quality of representation of a row or column in n dimensions is simply the sum of the squared cosine of that row or column over the n dimensions.

If a row item is well represented by two dimensions, the sum of its cos2 on these two dimensions is close to one.

For some of the row items, more than 2 dimensions are required to perfectly represent the data.
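The following sketch illustrates this with the six rows printed above: the cos2 values are typed in by hand from the output of head(row$cos2), and the quality of representation on the first two dimensions is obtained by summing the first two columns. The object names are illustrative.

```r
# cos2 values copied from the output of head(row$cos2) shown above
cos2 <- matrix(c(0.7399874, 0.1845521, 0.07546047,
                 0.7416028, 0.2323593, 0.02603787,
                 0.7766401, 0.1537032, 0.06965666,
                 0.5049433, 0.4002300, 0.09482670,
                 0.4398124, 0.5350151, 0.02517249,
                 0.1181178, 0.6461525, 0.23572969),
               ncol = 3, byrow = TRUE,
               dimnames = list(c("Laundry", "Main_meal", "Dinner",
                                 "Breakfeast", "Tidying", "Dishes"),
                               c("Dim 1", "Dim 2", "Dim 3")))

# Quality of representation on the plane formed by dimensions 1 and 2
quality_2d <- rowSums(cos2[, 1:2])
round(quality_2d, 3)

# Over all dimensions the cos2 of each row sums to (approximately) one
rowSums(cos2)
```

Among these six rows, Dishes has the lowest quality on the first plane because a sizeable part of its cos2 lies on dimension 3.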

Visualize the cos2 of rows using corrplot:

library("corrplot")
corrplot(row$cos2, is.corr=FALSE)

The function fviz_cos2()[in factoextra] can be used to draw a bar plot of rows cos2:

# Cos2 of rows on Dim.1 and Dim.2
fviz_cos2(res.ca, choice = "row", axes = 1:2)

Note that all row points except Official are well represented by the first two dimensions. This implies that the position of the point corresponding to the item Official on the scatter plot should be interpreted with some caution. A higher dimensional solution is probably necessary for the item Official.

Read more about fviz_cos2(): fviz_cos2

Column variables

The function get_ca_col() [in factoextra] is used to extract the results for column variables. This function returns a list containing the coordinates, the cos2, the contribution and the inertia of column variables:

col <- get_ca_col(res.ca)
col

Correspondence Analysis - Results for columns
 ===================================================
  Name       Description                   
1 "$coord"   "Coordinates for the columns" 
2 "$cos2"    "Cos2 for the columns"        
3 "$contrib" "contributions of the columns"
4 "$inertia" "Inertia of the columns"

The results for columns give the same information as described for rows. For this reason, I’ll just display the results for columns in this section without comment.

Coordinates of columns

head(col$coord)

                  Dim 1      Dim 2       Dim 3
Wife        -0.83762154  0.3652207 -0.19991139
Alternating -0.06218462  0.2915938  0.84858939
Husband      1.16091847  0.6019199 -0.18885924
Jointly      0.14942609 -1.0265791 -0.04644302

Use the function fviz_ca_col() [in factoextra] to visualize only column points:

fviz_ca_col(res.ca)

Note that it’s also possible to make the graph of columns only using FactoMineR base graph. The argument invisible is used to hide the rows on the factor map:

# Hide rows
plot(res.ca, invisible="row")

Read more about fviz_ca_col(): fviz_ca_col

Contribution of columns to the dimensions

head(col$contrib)

                Dim 1     Dim 2      Dim 3
Wife        44.462018 10.312237 10.8220753
Alternating  0.103739  2.782794 82.5492464
Husband     54.233879 17.786612  6.1331792
Jointly      1.200364 69.118357  0.4954991

Note that, you can use the previously mentioned corrplot() function to visualize the contribution of columns.

Use the function fviz_contrib() [in factoextra] to visualize column contributions on dimensions 1+2:

fviz_contrib(res.ca, choice = "col", axes = 1:2)

If the column contributions were uniform, the expected value would be 1/ncol(housetasks) = 1/4 = 25%.
The expected average contribution (reference line) of a column for Dim.1 and Dim.2 is : (25 * Eig1) + (25 * Eig2) = (25 * 0.54) + (25 * 0.44) = 24.5%.

Draw a scatter plot of column points and highlight columns according to the amount of their contributions. The function fviz_ca_col() [in factoextra] is used:

# Control column point colors using their contribution
# Possible values for the argument col.col are :
  # "cos2", "contrib", "coord", "x", "y"
fviz_ca_col(res.ca, col.col="contrib")+
scale_color_gradient2(low="white", mid="blue", 
                      high="red", midpoint=24.5)+theme_minimal()

Note that, it’s also possible to control automatically the transparency of columns by their contributions using the argument alpha.col:

# Control the transparency of columns using their contribution
# Possible values for the argument alpha.col are :
  # "cos2", "contrib", "coord", "x", "y"
fviz_ca_col(res.ca, alpha.col="contrib")

Cos2 : The quality of representation of columns

head(col$cos2)

                  Dim 1     Dim 2       Dim 3
Wife        0.801875947 0.1524482 0.045675847
Alternating 0.004779897 0.1051016 0.890118521
Husband     0.772026244 0.2075420 0.020431728
Jointly     0.020705858 0.9772939 0.002000236

Note that the value of the cos2 is between 0 and 1. A cos2 close to 1 corresponds to a column/row variable that is well represented on the factor map.

The function fviz_cos2() [in factoextra] can be used to draw a bar plot of columns cos2:

# Cos2 of columns on Dim.1 and Dim.2
fviz_cos2(res.ca, choice = "col", axes = 1:2)

Note that, only the column item Alternating is not very well displayed on the first two dimensions. The position of this item must be interpreted with caution in the space formed by dimensions 1 and 2.

Biplot of rows and columns

Symmetric biplot

As mentioned above, the standard plot of correspondence analysis is a symmetric biplot in which both rows (blue points) and columns (red triangles) are represented in the same space using the principal coordinates. These coordinates represent the row and column profiles. In this case, only the distance between row points or the distance between column points can be really interpreted.

With a symmetric plot, the inter-distance between rows and columns can’t be interpreted. Only general statements can be made about the pattern.

fviz_ca_biplot(res.ca)+
  theme_minimal()

Remove the points from the graph, use texts only :

fviz_ca_biplot(res.ca, geom="text")

Note that, allowed values for the argument geom are the combination of :

“point” to show only points (dots)
“text” to show only labels
c(“point”, “text”) to show both types

Note that, in order to interpret the distance between column points and row points, the simplest way is to make an asymmetric plot (Bendixen, 2003). This means that, the column profiles must be presented in row space or vice-versa.

Read more about fviz_ca_biplot(): fviz_ca_biplot

Asymmetric biplot for correspondence analysis

To make an asymmetric plot, row (or column) points are plotted from the standard coordinates (S) and the profiles of the columns (or the rows) are plotted from the principal coordinates (P) (Bendixen 2003).

For a given axis, the standard and principal coordinates are related as follows:

P = sqrt(eigenvalue) * S

P: the principal coordinate of a row (or a column) on the axis
S: the standard coordinate of a row (or a column) on the axis
eigenvalue: the eigenvalue of the axis
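A small numerical illustration of this relation is sketched below. It runs the textbook SVD formulation of CA on a hypothetical 3 x 3 table in base R (not on housetasks and not through FactoMineR); all object names are assumptions made for the example.

```r
# Hypothetical 3 x 3 contingency table (illustrative counts only)
tab <- matrix(c(20,  5,  5,
                 4, 18,  8,
                 6,  7, 27),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("A", "B", "C"), c("X", "Y", "Z")))

P  <- tab / sum(tab)
r  <- as.numeric(rowSums(P))       # row masses
cm <- as.numeric(colSums(P))       # column masses

S   <- diag(1 / sqrt(r)) %*% (P - outer(r, cm)) %*% diag(1 / sqrt(cm))
dec <- svd(S)
k   <- min(dim(tab)) - 1
eig <- dec$d[1:k]^2                # eigenvalues of the axes

# Standard coordinates of the rows
row_std <- diag(1 / sqrt(r)) %*% dec$u[, 1:k, drop = FALSE]

# Principal coordinates: P = sqrt(eigenvalue) * S, axis by axis
row_principal <- sweep(row_std, 2, sqrt(eig), "*")

# Mass-weighted variance on each axis:
colSums(r * row_std^2)        # equals 1 for standard coordinates
colSums(r * row_principal^2)  # equals the eigenvalue for principal coordinates
```

This is why points in standard coordinates are usually spread further from the origin than points in principal coordinates: each eigenvalue is at most 1, so the scaling factor sqrt(eigenvalue) shrinks the axis.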

Depending on the situation, other types of display can be set using the argument map for the function fviz_ca_biplot() [in factoextra]. This is inspired by the ca package (Michael Greenacre).

The allowed options for the argument map are:

“rowprincipal” or “colprincipal” - these are the so-called asymmetric biplots, with either rows in principal coordinates and columns in standard coordinates, or vice versa (also known as row-metric-preserving or column-metric-preserving respectively).

“rowprincipal”: columns are represented in row space
“colprincipal”: rows are represented in column space

“symbiplot” - both rows and columns are scaled to have variances equal to the singular values (square roots of eigenvalues), which gives a symmetric biplot but does not preserve row or column metrics.
“rowgab” or “colgab”: Asymmetric maps proposed by Gabriel & Odoroff (1990):

“rowgab”: rows in principal coordinates and columns in standard coordinates multiplied by the mass.
“colgab”: columns in principal coordinates and rows in standard coordinates multiplied by the mass.

“rowgreen” or “colgreen”: The so-called contribution biplots showing visually the most contributing points (Greenacre 2006b).

“rowgreen”: rows in principal coordinates and columns in standard coordinates multiplied by square root of the mass.
“colgreen”: columns in principal coordinates and rows in standard coordinates multiplied by the square root of the mass.

The R code below draws a standard asymmetric biplot:

fviz_ca_biplot(res.ca, map = "rowprincipal", arrows = c(TRUE, TRUE))

The argument arrows is a vector of two logicals specifying if the plot should contain points (FALSE, default) or arrows (TRUE). First value sets the rows and the second value sets the columns.

If the angle between two arrows is acute, then there is a strong association between the corresponding row and column.

To interpret the distance between rows and a column, you should perpendicularly project row points on the column arrow.

Contribution biplot

In correspondence analysis, biplot is a graphical display of rows and columns in 2 or 3 dimensions.

In the standard symmetric biplot (mentioned in the previous sections), it’s difficult to know the most contributing points to the solution of the CA.

Michael Greenacre proposed a new scaling display (called contribution biplot) which incorporates the contribution of points. In this display, points that contribute very little to the solution are close to the center of the biplot and are relatively unimportant to the interpretation.

A contribution biplot can be drawn using the argument map = “rowgreen” or map = “colgreen”.

Firstly, you have to decide whether to analyse the contributions of rows or columns to the definition of the axes.

In our example we’ll interpret the contribution of rows to the axes. The argument map =“colgreen” is used. In this case, remember that columns are in principal coordinates and rows in standard coordinates multiplied by the square root of the mass. For a given row, the square of the new coordinate on an axis i is exactly the contribution of this row to the inertia of the axis i.

fviz_ca_biplot(res.ca, map = "colgreen",
               arrows = c(TRUE, FALSE))

In the graph above, the position of the column profile points is unchanged relative to that in the conventional biplot. However, the distances of the row points from the plot origin are related to their contributions to the two-dimensional factor map.
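The link between these rescaled coordinates and the contributions can be checked numerically. The sketch below uses the textbook SVD formulation of CA on a hypothetical 3 x 3 table in base R; it does not call FactoMineR or factoextra, and every object name is illustrative. Contributions are expressed here as proportions (multiply by 100 to get percentages).

```r
# Hypothetical 3 x 3 contingency table (illustrative counts only)
tab <- matrix(c(20,  5,  5,
                 4, 18,  8,
                 6,  7, 27),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("A", "B", "C"), c("X", "Y", "Z")))

P  <- tab / sum(tab)
r  <- as.numeric(rowSums(P))       # row masses
cm <- as.numeric(colSums(P))       # column masses

S   <- diag(1 / sqrt(r)) %*% (P - outer(r, cm)) %*% diag(1 / sqrt(cm))
dec <- svd(S)
k   <- min(dim(tab)) - 1
d   <- dec$d[1:k]
eig <- d^2

# Standard and principal coordinates of the rows
row_std       <- diag(1 / sqrt(r)) %*% dec$u[, 1:k, drop = FALSE]
row_principal <- sweep(row_std, 2, d, "*")

# "Contribution" coordinates: standard coordinates times sqrt(mass)
row_green <- sqrt(r) * row_std

# Squared contribution coordinates ...
contrib_from_coord <- row_green^2
# ... versus the usual definition: mass * principal coord^2 / eigenvalue
contrib_direct <- sweep(r * row_principal^2, 2, eig, "/")

all.equal(contrib_from_coord, contrib_direct)
colSums(contrib_from_coord)   # contributions sum to 1 on each axis
```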

The closer an arrow is (in terms of angular distance) to an axis the greater is the contribution of the row category on that axis relative to the other axis. If the arrow is halfway between the two, its row category contributes to the two axes to the same extent.

It is evident that the row category Repairs has an important contribution to the positive pole of the first dimension, while the categories Laundry and Main_meal have a major contribution to the negative pole of the first dimension.
Dimension 2 is mainly defined by the row category Holidays.
The row category Driving contributes to the two axes to the same extent.

Plot rows or columns only

It’s also possible to draw the rows or columns only using the function fviz_ca_biplot() (instead of using fviz_ca_row() and fviz_ca_col()).

Plot rows only by hiding the columns (invisible =“col”):

fviz_ca_biplot(res.ca, invisible = "col")+
  theme_minimal()

Plot columns only by hiding the rows (invisible =“row”):

fviz_ca_biplot(res.ca, invisible = "row")+
  theme_minimal()

Correspondence analysis using supplementary rows and columns

Data

We’ll use the data set children [in FactoMineR package]. It contains 18 rows and 8 columns:

data(children)
# head(children)

The data used here is a contingency table describing the answers given by different categories of people to the following question: What are the reasons that can make a woman or a couple hesitate to have children?

Only some of the rows and columns will be used to perform the correspondence analysis (CA).

The coordinates of the remaining (supplementary) rows/columns on the factor map will be predicted after the CA.

In CA terminology, our data contains :

Active rows (rows 1:14) : Rows that are used during the correspondence analysis.
Supplementary rows (row.sup 15:18) : The coordinates of these rows will be predicted using the CA information and parameters obtained with active rows/columns
Active columns (columns 1:5) : Columns that are used for the correspondence analysis.
Supplementary columns (col.sup 6:8) : As with supplementary rows, the coordinates of these columns will also be predicted.

CA with supplementary rows/columns

As mentioned above, supplementary rows and columns are not used for the definition of the principal dimensions. Their coordinates are predicted using only the information provided by the CA performed on active rows/columns.
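To give an idea of how such a prediction works, here is a minimal sketch based on the standard transition formula of CA: a row is projected by multiplying its profile by the standard coordinates of the active columns. The active table and the supplementary row below are hypothetical, the code is base R only, and FactoMineR's internal code may be organised differently.

```r
# Hypothetical active table and one hypothetical supplementary row
active <- matrix(c(20,  5,  5,
                    4, 18,  8,
                    6,  7, 27),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("A", "B", "C"), c("X", "Y", "Z")))
sup_row <- c(X = 10, Y = 2, Z = 3)

# CA of the active table only (SVD of standardized residuals)
P  <- active / sum(active)
r  <- as.numeric(rowSums(P))
cm <- as.numeric(colSums(P))
S   <- diag(1 / sqrt(r)) %*% (P - outer(r, cm)) %*% diag(1 / sqrt(cm))
dec <- svd(S)
k   <- min(dim(active)) - 1
d   <- dec$d[1:k]

# Standard coordinates of the active columns
col_std <- diag(1 / sqrt(cm)) %*% dec$v[, 1:k, drop = FALSE]
# Principal coordinates of the active rows (for comparison)
row_principal <- diag(1 / sqrt(r)) %*% dec$u[, 1:k, drop = FALSE] %*% diag(d, k)

# Supplementary row: profile (row proportions) times standard column coordinates
sup_profile <- sup_row / sum(sup_row)
sup_coord   <- sup_profile %*% col_std
sup_coord

# Sanity check: the same formula applied to the active rows
# reproduces their principal coordinates
active_profiles <- active / rowSums(active)
all.equal(active_profiles %*% col_std, row_principal, check.attributes = FALSE)
```

The supplementary row plays no part in the decomposition itself, which is why it has coordinates and cos2 but no contribution.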

To specify supplementary rows/columns, the function CA() [in FactoMineR] can be used as follows:

CA(X,  ncp = 5, row.sup = NULL, col.sup = NULL,
   graph = TRUE)

X : a data frame (contingency table)
row.sup : a numeric vector specifying the indexes of the supplementary rows
col.sup : a numeric vector specifying the indexes of the supplementary columns
ncp : number of dimensions kept in the final results.
graph : a logical value. If TRUE a graph is displayed.

Example of usage :

res.ca <- CA(children, row.sup = 15:18, col.sup = 6:8,
             graph = FALSE)

The summary of the CA is :

summary(res.ca, nb.dec = 2, ncp = 2)


Call:
rmarkdown::render("factominer-correspondance-analysis.Rmd", encoding = "UTF-8") 

The chi square of independence between the two variables is equal to 98.80159 (p-value =  9.748064e-05 ).

Eigenvalues
                      Dim.1  Dim.2  Dim.3  Dim.4  Dim.5
Variance               0.04   0.01   0.01   0.01   0.00
% of var.             57.04  21.13  11.76  10.06   0.00
Cumulative % of var.  57.04  78.17  89.94 100.00 100.00

Rows (the 10 first)
                      Dim.1   ctr  cos2   Dim.2   ctr  cos2  
money               | -0.12  4.55  0.43 |  0.02  0.37  0.01 |
future              |  0.18 17.57  0.72 | -0.10 14.59  0.22 |
unemployment        | -0.21 22.62  0.87 | -0.07  6.78  0.10 |
circumstances       |  0.40  6.27  0.58 |  0.33 11.54  0.40 |
hard                | -0.25  2.99  0.88 |  0.07  0.59  0.06 |
economic            |  0.35 12.00  0.48 |  0.32 26.60  0.40 |
egoism              |  0.06  0.68  0.07 | -0.03  0.34  0.01 |
employment          | -0.14  2.62  0.16 |  0.22 17.55  0.41 |
finances            | -0.24  2.79  0.28 | -0.21  5.69  0.21 |
war                 |  0.22  2.17  0.75 | -0.07  0.69  0.09 |

Columns
                      Dim.1   ctr  cos2   Dim.2   ctr  cos2  
unqualified         | -0.21 25.11  0.68 | -0.08 10.08  0.10 |
cep                 | -0.14 18.30  0.64 |  0.06  8.08  0.11 |
bepc                |  0.11  6.76  0.31 | -0.03  1.25  0.02 |
high_school_diploma |  0.27 37.98  0.76 | -0.12 20.10  0.15 |
university          |  0.23 11.86  0.31 |  0.32 60.49  0.59 |

Supplementary rows
                      Dim.1 cos2   Dim.2 cos2  
comfort             |  0.21 0.07 |  0.70 0.78 |
disagreement        |  0.15 0.13 |  0.12 0.09 |
world               |  0.52 0.88 |  0.14 0.07 |
to_live             |  0.31 0.14 |  0.50 0.37 |

Supplementary columns
                      Dim.1  cos2   Dim.2  cos2  
thirty              |  0.11  0.14 | -0.06  0.04 |
fifty               | -0.02  0.01 |  0.05  0.09 |
more_fifty          | -0.18  0.29 | -0.05  0.02 |

For the supplementary rows/columns, the coordinates and the quality of representation (cos2) on the factor maps are displayed. They don’t contribute to the dimensions.

Make a biplot of rows and columns

FactoMineR base graph:

plot(res.ca)

Active rows are in blue
Supplementary rows are in darkblue
Columns are in red
Supplementary columns are in darkred

Use factoextra:

fviz_ca_biplot(res.ca) +
  theme_minimal()

It’s also possible to hide supplementary rows and columns using the argument invisible:

fviz_ca_biplot(res.ca, invisible = c("row.sup", "col.sup") ) +
  theme_minimal()

The argument invisible is also available in FactoMineR base graph.

Visualize supplementary rows

All the results (coordinates and cos2) for the supplementary rows can be extracted as follows:

res.ca$row.sup

$coord
                 Dim 1     Dim 2      Dim 3      Dim 4
comfort      0.2096705 0.7031677 0.07111168  0.3071354
disagreement 0.1462777 0.1190106 0.17108916 -0.3132169
world        0.5233045 0.1429707 0.08399269 -0.1063597
to_live      0.3083067 0.5020193 0.52093397  0.2557357

$cos2
                  Dim 1      Dim 2       Dim 3      Dim 4
comfort      0.06892759 0.77524032 0.007928672 0.14790342
disagreement 0.13132177 0.08692632 0.179649183 0.60210272
world        0.87587685 0.06537746 0.022564054 0.03618163
to_live      0.13899699 0.36853645 0.396830367 0.09563620

Factor map for rows :

fviz_ca_row(res.ca) +
  theme_minimal()

Supplementary rows are shown in darkblue color.

Visualize supplementary columns

Factor map for columns:

fviz_ca_col(res.ca) +
  theme_minimal()

Supplementary columns are shown in darkred.

The results for supplementary columns can be extracted as follows:

res.ca$col.sup

$coord
                 Dim 1       Dim 2       Dim 3       Dim 4
thirty      0.10541339 -0.05969594 -0.10322613  0.06977996
fifty      -0.01706444  0.04907657 -0.01568923 -0.01306117
more_fifty -0.17706810 -0.04813788  0.10077299 -0.08517528

$cos2
               Dim 1      Dim 2       Dim 3       Dim 4
thirty     0.1375601 0.04411543 0.131910759 0.060278490
fifty      0.0108695 0.08990298 0.009188167 0.006367804
more_fifty 0.2860989 0.02114509 0.092666735 0.066200714

Filter CA results

If you have many row/column variables, it’s possible to visualize only some of them using the arguments select.row and select.col.

select.col, select.row: a selection of columns/rows to be drawn. Allowed values are NULL or a list containing the arguments name, cos2 or contrib:

name: is a character vector containing column/row names to be drawn
cos2: if cos2 is in [0, 1], ex: 0.6, then columns/rows with a cos2 > 0.6 are drawn
if cos2 > 1, ex: 5, then the top 5 active columns/rows and top 5 supplementary columns/rows with the highest cos2 are drawn
contrib: if contrib > 1, ex: 5, then the top 5 columns/rows with the highest contributions are drawn

# Visualize rows with cos2 >= 0.8
fviz_ca_row(res.ca, select.row = list(cos2 = 0.8))

# Top 5 active rows and 5 suppl. rows with the highest cos2
fviz_ca_row(res.ca, select.row = list(cos2 = 5))

The top 5 active rows and the top 5 supplementary rows are shown.

# Select by names
name <- list(name = c("employment", "fear", "future"))
fviz_ca_row(res.ca, select.row = name)

# Top 5 contributing rows and columns
fviz_ca_biplot(res.ca, select.row = list(contrib = 5), 
               select.col = list(contrib = 5)) +
  theme_minimal()

Supplementary rows/columns are not shown because they don’t contribute to the construction of the axes.
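The selection rules described above can be mimicked in base R. The helper select_points() below is hypothetical and is not the factoextra implementation; it only illustrates the logic (threshold when cos2 is in [0, 1], top n otherwise). The cos2 and contribution values are the Dim.1 values of a few rows copied from the CA summary, whereas the real functions work on the axes being plotted.

```r
# Dim.1 statistics of a few active rows, copied from the CA summary
stats <- data.frame(
  contrib = c(4.55, 17.57, 22.62, 6.27, 2.99, 12.00),
  cos2    = c(0.43, 0.72, 0.87, 0.58, 0.88, 0.48),
  row.names = c("money", "future", "unemployment",
                "circumstances", "hard", "economic")
)

# Hypothetical helper mimicking select.row / select.col
select_points <- function(stats, name = NULL, cos2 = NULL, contrib = NULL) {
  keep <- rownames(stats)
  if (!is.null(name)) keep <- intersect(keep, name)
  if (!is.null(cos2)) {
    if (cos2 <= 1) {
      # cos2 in [0, 1]: used as a threshold (assumed inclusive here)
      keep <- intersect(keep, rownames(stats)[stats$cos2 >= cos2])
    } else {
      # cos2 > 1: keep the top n points by cos2
      top <- rownames(stats)[order(stats$cos2, decreasing = TRUE)]
      keep <- intersect(keep, head(top, cos2))
    }
  }
  if (!is.null(contrib)) {
    # keep the top n points by contribution
    top <- rownames(stats)[order(stats$contrib, decreasing = TRUE)]
    keep <- intersect(keep, head(top, contrib))
  }
  keep
}

select_points(stats, cos2 = 0.8)      # threshold on cos2
select_points(stats, contrib = 2)     # two largest contributions
select_points(stats, name = c("money", "fear"))  # unknown names are dropped
```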

Dimension description

The function dimdesc() [in FactoMineR] can be used to identify the variables most correlated with a given dimension.

A simplified format is :

dimdesc(res, axes = 1:2, proba = 0.05)

res : an object of class CA
axes : a numeric vector specifying the dimensions to be described
proba : the significance level

Example of usage :

res.desc <- dimdesc(res.ca, axes = c(1,2))
# Description of dimension 1
res.desc$`Dim 1`

$row
                     coord
hard          -0.249984356
finances      -0.236995598
unemployment  -0.212227692
work          -0.211677086
employment    -0.136754598
money         -0.115267468
housing       -0.006680991
egoism         0.059889455
health         0.111651752
disagreement   0.146277736
future         0.176449413
fear           0.203347917
comfort        0.209670471
war            0.216824026
to_live        0.308306674
economic       0.353963920
circumstances  0.400922001
world          0.523304472

$col
                          coord
unqualified         -0.20931790
more_fifty          -0.17706810
cep                 -0.13857658
fifty               -0.01706444
thirty               0.10541339
bepc                 0.10875778
university           0.23123279
high_school_diploma  0.27403930

# Description of dimension 2
res.desc$`Dim 2`

$row
                    coord
finances      -0.20598461
future        -0.09786326
war           -0.07466267
unemployment  -0.07071770
fear          -0.05806796
egoism        -0.02566733
health         0.00429124
money          0.02004613
hard           0.06765048
work           0.10888448
disagreement   0.11901056
housing        0.12824218
world          0.14297067
employment     0.21539408
economic       0.32072390
circumstances  0.33098674
to_live        0.50201935
comfort        0.70316769

$col
                          coord
high_school_diploma -0.12134373
unqualified         -0.08072742
thirty              -0.05969594
more_fifty          -0.04813788
bepc                -0.02848299
fifty                0.04907657
cep                  0.05604703
university           0.31785751
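Judging from the output above, the description of a dimension for a CA amounts to listing the row and column coordinates in increasing order. The snippet below is a minimal base R sketch of that ordering, not the dimdesc() source code; the vector coord.dim1 holds a few Dim 1 row coordinates copied from the output.

```r
# Dim 1 coordinates of a few rows, copied from the dimdesc() output
coord.dim1 <- c(money = -0.115267468, future = 0.176449413,
                unemployment = -0.212227692, hard = -0.249984356,
                world = 0.523304472)

# Sort in increasing order, as in the $row table
sorted <- sort(coord.dim1)
data.frame(coord = sorted)

# The points at the two ends of the axis drive its interpretation
names(sorted)[1]
names(sorted)[length(sorted)]
```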

CA and outliers

If one or more “outliers” are present in the contingency table, they can dominate the interpretation of the axes (Bendixen M. 2003).

Outliers are points that have high absolute coordinate values and high contributions. They are represented, on the graph, very far from the centroid. In this case, the remaining row/column points tend to be tightly clustered in the graph, which becomes difficult to interpret.

In the CA output, the coordinates of row/column points represent the number of standard deviations the row/column is away from the barycentre (Bendixen M. 2003).

Outliers are points that are at least one standard deviation away from the barycentre. They also contribute significantly to the interpretation of one pole of an axis (Bendixen M. 2003).

There are no apparent outliers in our data.

If there are outliers in the data, they must be removed or treated as supplementary points when re-running the correspondence analysis.
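The rule above can be turned into a quick check. This is a sketch under stated assumptions: flag_outliers() is a hypothetical helper, "high contribution" is taken to mean above the average contribution (100 divided by the number of active rows, 14 here), and only the Dim.1 values of the ten rows printed in the CA summary are used.

```r
# Dim.1 coordinates and contributions (%) of the ten rows printed
# in the CA summary
rows <- data.frame(
  coord = c(-0.12, 0.18, -0.21, 0.40, -0.25, 0.35, 0.06, -0.14, -0.24, 0.22),
  ctr   = c(4.55, 17.57, 22.62, 6.27, 2.99, 12.00, 0.68, 2.62, 2.79, 2.17),
  row.names = c("money", "future", "unemployment", "circumstances", "hard",
                "economic", "egoism", "employment", "finances", "war")
)

# Hypothetical helper: a point is flagged when it lies at least
# coord.min away from the barycentre AND contributes more than ctr.min
flag_outliers <- function(x, coord.min = 1, ctr.min = 100 / nrow(x)) {
  rownames(x)[abs(x$coord) >= coord.min & x$ctr > ctr.min]
}

# 14 active rows in the analysis, so the average contribution is 100/14
flag_outliers(rows, ctr.min = 100 / 14)
```

No row is flagged on the first dimension, since every absolute coordinate is well below 1.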

Infos

This analysis has been performed using R software (ver. 3.1.2), FactoMineR (ver. 1.29) and factoextra (ver. 1.0.2)

References and further reading:

Bendixen M. 1995, Compositional perceptual mapping using chi-squared tree analysis and Correspondence Analysis, Journal of Marketing Management, 11, 571-581.
Bendixen M. 2003, A Practical Guide to the Use of Correspondence Analysis in Marketing Research, Marketing Bulletin, 14, Technical Note 2. http://marketing-bulletin.massey.ac.nz/V14/MB_V14_T2_Bendixen.pdf
Alberti G. 2013, An R Script to Facilitate Correspondence Analysis. A Guide to the Use and the Interpretation of Results from an Archaeological Perspective, Archeologia e Calcolatori, 24, 25-53. http://soi.cnr.it/archcalc/indice/PDF24/02_Alberti.pdf
Greenacre M., Contribution biplots. http://www.econ.upf.edu/docs/papers/downloads/1162.pdf
Healey J.F. 2013, The Essentials of Statistics. A Tool for Social Research, 3rd ed., Belmont, Wadsworth.
Doey L. and Kurta J. 2011, Correspondence Analysis applied to psychological research, Tutorials in Quantitative Methods for Psychology, 7(1), 5-14. http://www.tqmp.org/RegularArticles/vol07-1/p005/p005.pdf
Husson F., FactoMineR. http://factominer.free.fr

ade4 and factoextra : Correspondence Analysis - R software and data mining

Sun, 21 Jun 2015 19:40:26 +0200

Correspondence Analysis (CA) is an adaptation of Principal Component Analysis used to analyse a contingency (or frequency) table formed by two qualitative variables.

A comprehensive guide for CA computing, analysis and visualization has been provided in my previous post: Correspondence Analysis in R: The Ultimate Guide for the Analysis, the Visualization and the Interpretation.

The basic idea and the mathematical procedures of correspondence analysis are covered here: Correspondence analysis basics

This current R tutorial describes how to compute CA using R software and ade4 package.

Required packages

The R packages ade4 (for computing CA) and factoextra (for CA visualization) are used.

They can be installed as follows:

install.packages("ade4")

# install.packages("devtools")
devtools::install_github("kassambara/factoextra")

Note that factoextra version >= 1.0.2 is required for this tutorial. If it’s already installed on your computer, re-install it to get the latest version.

Load ade4 and factoextra

library("ade4")
library("factoextra")

Data format: Contingency tables

We’ll use the data set housetasks taken from the package ade4.

data(housetasks)
# head(housetasks)

An image of the data is shown below:

The data is a contingency table containing 13 household tasks and their distribution within the couple:

rows are the different tasks
values are the frequencies of the tasks done :
by the wife only
alternatively
by the husband only
or jointly

Note that, it’s possible to visualize a contingency table using the functions: balloonplot() [in gplots package], mosaicplot() [in graphics package], assoc() [in vcd package].

To learn more about these functions, read this article: Correspondence Analysis in R: The Ultimate Guide for the Analysis, the Visualization and the Interpretation

Correspondence analysis (CA)

The function dudi.coa() [in ade4 package] can be used. A simplified format is :

dudi.coa(df, scannf = TRUE, nf = 2)

df : a data frame (contingency table)
scannf : a logical value specifying whether the eigenvalues bar plot should be displayed
nf : number of dimensions kept in the final results.

Example of usage:

res.ca <- dudi.coa(housetasks, scannf = FALSE, nf = 5)

Eigenvalues and scree plot

Extract the eigenvalues

Eigenvalues measure the amount of variation retained by a principal axis :

summary(res.ca)

Class: coa dudi
Call: dudi.coa(df = housetasks, scannf = FALSE, nf = 5)

Total inertia: 1.115

Eigenvalues:
    Ax1     Ax2     Ax3 
 0.5429  0.4450  0.1270 

Projected inertia (%):
    Ax1     Ax2     Ax3 
  48.69   39.91   11.40 

Cumulative projected inertia (%):
    Ax1   Ax1:2   Ax1:3 
  48.69   88.60  100.00

You can also use the function get_eigenvalue() [in factoextra package] to extract the eigenvalues :

eig.val <- get_eigenvalue(res.ca)
head(eig.val)

      eigenvalue variance.percent cumulative.variance.percent
Dim.1  0.5428893         48.69222                    48.69222
Dim.2  0.4450028         39.91269                    88.60491
Dim.3  0.1270484         11.39509                   100.00000
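The two percentage columns follow directly from the eigenvalues: each eigenvalue is divided by the total inertia (the sum of the eigenvalues), and the results are accumulated. A minimal base R check, using the eigenvalues printed above:

```r
# Eigenvalues copied from the output above
eig <- c(Dim.1 = 0.5428893, Dim.2 = 0.4450028, Dim.3 = 0.1270484)

# Share of the total inertia carried by each dimension
variance.percent <- 100 * eig / sum(eig)
# Running total: inertia retained by the first k dimensions
cumulative.percent <- cumsum(variance.percent)

round(cbind(eigenvalue = eig, variance.percent, cumulative.percent), 2)
```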

Make a scree plot using ade4 base graphics

The function screeplot() can be used to draw the amount of inertia (variance) retained by the dimensions.

A simplified format is:

screeplot(x, npcs = length(x$eig), type = c("barplot", "lines"))

x : an object of class dudi
npcs : the number of components to be plotted
type : the type of plot

Example of usage :

screeplot(res.ca, main ="Screeplot - Eigenvalues")

~89% of the information contained in the data is retained by the first two dimensions.

Make the scree plot using factoextra

It’s also possible to use the function fviz_screeplot() [in factoextra] to make the scree plot. In the R code below, we’ll draw the percentage of variances retained by each component :

fviz_screeplot(res.ca, ncp=3)

Read more about eigenvalues and screeplot: Eigenvalues data visualization

CA scatter plot: Biplot of row and column variables

The function scatter() or biplot() can be used as follows:

# Remove the scree plot (posieig ="none")
scatter(res.ca, posieig = "none")

NULL

By default, the scree plot is displayed on the scatter plot. The argument posieig = "none" is used to remove the scree plot.

Note that, if you want to remove row or column labels the argument clab.row = 0 or clab.col = 0 can be used.

Biplot can be drawn using the combination of the two functions below :

s.label() to plot rows or columns as points
s.arrow() to add rows or columns as arrows

# Plot of rows as points
s.label(res.ca$li, xax = 1, yax = 2)
# Add column variables as arrows
s.arrow(res.ca$co, add.plot = TRUE)

It’s also possible to use the function fviz_ca_biplot()[in factoextra package] to draw a nice looking plot:

fviz_ca_biplot(res.ca)

# Change the theme
fviz_ca_biplot(res.ca) +
  theme_minimal()

The graph above is called a symmetric plot; it represents the row and column profiles. Rows are represented by blue points and columns by red triangles.

Read more about fviz_ca_biplot(): fviz_ca_biplot

Row variables

The simplest way is to use the function get_ca_row() [in factoextra] to extract the results for row variables. This function returns a list containing the coordinates, the cos2 and the contribution of row variables:

row <- get_ca_row(res.ca)
row

Correspondence Analysis - Results for rows
 ===================================================
  Name       Description                
1 "$coord"   "Coordinates for the rows" 
2 "$cos2"    "Cos2 for the rows"        
3 "$contrib" "contributions of the rows"
4 "$inertia" "Inertia of the rows"

# Print the coordinates
head(row$coord)

               Dim.1      Dim.2       Dim.3
Laundry    0.9918368 -0.4953220 -0.31672897
Main_meal  0.8755855 -0.4901092 -0.16406487
Dinner     0.6925740 -0.3081043 -0.20741377
Breakfeast 0.5086002 -0.4528038  0.22040453
Tidying    0.3938084  0.4343444 -0.09421375
Dishes     0.1889641  0.4419662  0.26694926

In the next section, I’ll show how to extract row coordinates, cos2 and contribution using ade4 base code.

Coordinates of rows

The coordinates of the rows on the factor map are :

head(res.ca$li)

               Axis1      Axis2       Axis3
Laundry    0.9918368 -0.4953220 -0.31672897
Main_meal  0.8755855 -0.4901092 -0.16406487
Dinner     0.6925740 -0.3081043 -0.20741377
Breakfeast 0.5086002 -0.4528038  0.22040453
Tidying    0.3938084  0.4343444 -0.09421375
Dishes     0.1889641  0.4419662  0.26694926

Use the function fviz_ca_row() [in factoextra package] to visualize only row points:

# Default plot
fviz_ca_row(res.ca)

Note that, it’s also possible to plot rows only using the ade4 base graph:

s.label(res.ca$li, xax = 1, yax = 2)

Contribution of rows to the dimensions

The cos2 and the contributions of rows / columns are calculated using the function inertia.dudi() as follows:

inertia <- inertia.dudi(res.ca, row.inertia = TRUE,
                        col.inertia = TRUE)

Note that, the contributions and the cos2 are printed in 1/10 000. The sign is the sign of the coordinates.

The contributions can be printed in % as follows (shown here for columns; use inertia$row.abs for rows):

# absolute contributions of columns
contrib <- inertia$col.abs/100
head(contrib)

            Comp1 Comp2 Comp3
Wife        44.46 10.31 10.82
Alternating  0.10  2.78 82.55
Husband     54.23 17.79  6.13
Jointly      1.20 69.12  0.50

Recall that, as mentioned above, the simplest way is to use the function get_ca_row() [in factoextra package]. It provides a list of matrices containing all the results for the active rows (coordinates, squared cosines and contributions).

row <- get_ca_row(res.ca)
row

Correspondence Analysis - Results for rows
 ===================================================
  Name       Description                
1 "$coord"   "Coordinates for the rows" 
2 "$cos2"    "Cos2 for the rows"        
3 "$contrib" "contributions of the rows"
4 "$inertia" "Inertia of the rows"

# Row contributions
row$contrib

           Dim.1 Dim.2 Dim.3
Laundry    18.29  5.56  7.97
Main_meal  12.39  4.74  1.86
Dinner      5.47  1.32  2.10
Breakfeast  3.82  3.70  3.07
Tidying     2.00  2.97  0.49
Dishes      0.43  2.84  3.63
Shopping    0.18  2.52  2.22
Official    0.52  0.80 36.94
Driving     8.08  7.65 18.60
Finances    0.88  5.56  0.06
Insurance   6.15  4.02  5.25
Repairs    40.73 15.88 16.60
Holidays    1.08 42.45  1.21

The row categories with the largest values contribute the most to the definition of the dimensions.
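A contribution can be recomputed by hand from the standard CA definition (stated here as background, not taken from the ade4 code): the mass of the row times its squared coordinate, divided by the eigenvalue of the dimension.

\[ contrib = 100 \times \frac{mass \times coord^2}{eigenvalue} \]

A minimal check for the row Laundry on Dim.1, assuming its row total of 176 out of a grand total of 1744 in the housetasks table:

```r
# Values for the row Laundry
mass  <- 176 / 1744   # row total / grand total
coord <- 0.9918368    # coordinate on Dim.1 (from res.ca$li)
eig   <- 0.5428893    # eigenvalue of Dim.1

# Contribution (in %) of Laundry to Dim.1
contrib <- 100 * mass * coord^2 / eig
round(contrib, 2)
```

The result agrees with the value reported for Laundry in the Dim.1 column of row$contrib.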

The function fviz_contrib()[in factoextra] can be used to visualize the most important row variables:

# Contributions of rows on Dim.1
fviz_contrib(res.ca, choice = "row", axes = 1)

The red dashed line represents the expected average row contributions if the contributions were uniform: 1/nrow(housetasks) = 1/13 = 7.69%.
For a given dimension, any row with a contribution above this threshold could be considered as important in contributing to that dimension.

The row items Repairs, Laundry, Main_meal and Driving contribute the most to the definition of the first axis.
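The same selection can be made numerically. The sketch below uses the Dim.1 contributions printed by row$contrib and the uniform threshold of 100/13; the object names are chosen for the example.

```r
# Dim.1 contributions (%) copied from row$contrib
contrib.dim1 <- c(Laundry = 18.29, Main_meal = 12.39, Dinner = 5.47,
                  Breakfeast = 3.82, Tidying = 2.00, Dishes = 0.43,
                  Shopping = 0.18, Official = 0.52, Driving = 8.08,
                  Finances = 0.88, Insurance = 6.15, Repairs = 40.73,
                  Holidays = 1.08)

# Expected contribution if all rows contributed equally
threshold <- 100 / length(contrib.dim1)

# Rows above the threshold, largest first
important <- sort(contrib.dim1[contrib.dim1 > threshold], decreasing = TRUE)
names(important)
```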

# Contributions of rows on Dim.2
fviz_contrib(res.ca, choice = "row", axes = 2)

Read more about fviz_contrib(): fviz_contrib

Using factoextra package, the color of rows can be automatically controlled by the value of their contributions

fviz_ca_row(res.ca, col.row="contrib")+
scale_color_gradient2(low="white", mid="blue", 
                      high="red", midpoint=10)+theme_minimal()

The graph above highlights the most important rows in the correspondence analysis solution.

Read more about fviz_ca_row(): fviz_ca_row

Cos2 : quality of representation of rows on the factor map

A high cos2 indicates a good representation of the rows on the factor map.
A low cos2 indicates that the row is poorly represented by the principal dimensions.

The cos2 of the rows are (factoextra code) :

head(row$cos2)

            Dim.1  Dim.2  Dim.3
Laundry    0.7400 0.1846 0.0755
Main_meal  0.7416 0.2324 0.0260
Dinner     0.7766 0.1537 0.0697
Breakfeast 0.5049 0.4002 0.0948
Tidying    0.4398 0.5350 0.0252
Dishes     0.1181 0.6462 0.2357

Note that, the ade4 code is:

# relative contributions of rows
cos2 <- abs(inertia$row.rel/10000)
head(cos2)

The values of the cos2 lie between 0 and 1.
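The cos2 of a row on a dimension is its squared coordinate divided by its squared distance to the origin, i.e. the sum of its squared coordinates over all dimensions. The check below works here because housetasks has 4 columns, so the 3 dimensions kept are all the non-trivial ones; with fewer dimensions retained, the denominator would be incomplete.

```r
# Coordinates of the row Laundry on the three dimensions (from res.ca$li)
laundry <- c(Dim.1 = 0.9918368, Dim.2 = -0.4953220, Dim.3 = -0.31672897)

# Squared cosine on each dimension
cos2 <- laundry^2 / sum(laundry^2)
round(cos2, 4)

# Quality of representation on the first factor map (Dim.1 and Dim.2)
sum(cos2[c("Dim.1", "Dim.2")])
```

Across all dimensions the cos2 of a point sum to 1, which is why a value close to 1 on a factor map means that little information is lost by the projection.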

The function fviz_cos2()[in factoextra] can be used to draw a bar plot of rows cos2:

# Cos2 of rows on Dim.1 and Dim.2
fviz_cos2(res.ca, choice = "row", axes = 1:2)

Note that all row points except Official are well represented by the first two dimensions. The position of the point corresponding to the item Official on the scatter plot should be interpreted with some caution.

Using factoextra package, the color of rows can be automatically controlled by the value of their cos2.

fviz_ca_row(res.ca, col.row="cos2")+
scale_color_gradient2(low="white", mid="blue", 
      high="red", midpoint=0.5) + theme_minimal()

Read more about fviz_cos2(): fviz_cos2

Column variables

The function get_ca_col() [in factoextra] is used to extract the results for column variables. This function returns a list containing the coordinates, the cos2 and the contributions of column variables:

col <- get_ca_col(res.ca)
col

Correspondence Analysis - Results for columns
 ===================================================
  Name       Description                   
1 "$coord"   "Coordinates for the columns" 
2 "$cos2"    "Cos2 for the columns"        
3 "$contrib" "contributions of the columns"
4 "$inertia" "Inertia of the columns"

# Coordinates
col$coord

                  Dim.1      Dim.2       Dim.3
Wife         0.83762154 -0.3652207 -0.19991139
Alternating  0.06218462 -0.2915938  0.84858939
Husband     -1.16091847 -0.6019199 -0.18885924
Jointly     -0.14942609  1.0265791 -0.04644302

The result for columns gives the same information as described for rows. For this reason, I’ll just display the results for columns in this section without commenting.

Coordinates of columns

The coordinates of the columns on the factor maps can be extracted as follows:

# ade4 code
head(res.ca$co)

                  Comp1      Comp2       Comp3
Wife         0.83762154 -0.3652207 -0.19991139
Alternating  0.06218462 -0.2915938  0.84858939
Husband     -1.16091847 -0.6019199 -0.18885924
Jointly     -0.14942609  1.0265791 -0.04644302

Use the function fviz_ca_col() [in factoextra] to visualize only column points:

fviz_ca_col(res.ca)

Note that, it’s also possible to plot columns only using the ade4 base graph:

s.label(res.ca$co, xax = 1, yax = 2)

Contribution of columns

The contributions can be printed in % as follows:

# absolute contributions of columns
# ade4 code
contrib <- inertia$col.abs/100
head(contrib)

            Comp1 Comp2 Comp3
Wife        44.46 10.31 10.82
Alternating  0.10  2.78 82.55
Husband     54.23 17.79  6.13
Jointly      1.20 69.12  0.50

It’s simpler to use the function get_ca_col() [in factoextra package]. It provides a list of matrices containing all the results for the active columns (coordinates, squared cosines and contributions).

columns <- get_ca_col(res.ca)
columns

Correspondence Analysis - Results for columns
 ===================================================
  Name       Description                   
1 "$coord"   "Coordinates for the columns" 
2 "$cos2"    "Cos2 for the columns"        
3 "$contrib" "contributions of the columns"
4 "$inertia" "Inertia of the columns"

# Contributions of columns
head(columns$contrib)

            Dim.1 Dim.2 Dim.3
Wife        44.46 10.31 10.82
Alternating  0.10  2.78 82.55
Husband     54.23 17.79  6.13
Jointly      1.20 69.12  0.50

Use the function fviz_contrib()[factoextra package] to visualize the most contributing columns :

# Contributions of columns on Dim.1
fviz_contrib(res.ca, choice = "col", axes = 1)

# Contributions of columns on Dim.2
fviz_contrib(res.ca, choice = "col", axes = 2)

Read more about fviz_contrib(): fviz_contrib

Draw a scatter plot of column points and highlight columns according to the amount of their contributions. The function fviz_ca_col() [in factoextra] is used:

# Control column point colors using their contribution
# Possible values for the argument col.col are :
  # "cos2", "contrib", "coord", "x", "y"
fviz_ca_col(res.ca, col.col="contrib")+
scale_color_gradient2(low="white", mid="blue", 
                      high="red", midpoint=24.5)+theme_minimal()

Cos2 : The quality of representation of columns

# relative contributions of columns
cos2 <- abs(inertia$col.rel)/10000
head(cos2)

             Comp1  Comp2  Comp3 con.tra
Wife        0.8019 0.1524 0.0457  0.2700
Alternating 0.0048 0.1051 0.8901  0.1057
Husband     0.7720 0.2075 0.0204  0.3421
Jointly     0.0207 0.9773 0.0020  0.2823

The function fviz_cos2()[in factoextra] can be used to draw a bar plot of columns cos2:

# Cos2 of columns on Dim.1 and Dim.2
fviz_cos2(res.ca, choice = "col", axes = 1:2)

Read more about fviz_cos2(): fviz_cos2

Correspondence analysis using supplementary rows and columns

Data

We’ll use the data set children available on STHDA website. It contains 18 rows and 8 columns:

ff <- "https://www.sthda.com/sthda/RDoc/data/ca-children.txt"
children <- read.table(file = ff, sep ="\t", 
                       header = TRUE, row.names = 1)

Only some of the rows and columns will be used to compute the correspondence analysis (CA).

The coordinates of the remaining (supplementary) rows/columns on the factor map will be predicted after the CA.

In CA terminology, our data contains :

Active rows (rows 1:14) : Rows that are used during the correspondence analysis.
Supplementary rows (row.sup 15:18) : The coordinates of these rows will be predicted using the CA information and parameters obtained with active rows/columns
Active columns (columns 1:5) : Columns that are used for the correspondence analysis.
Supplementary columns (col.sup 6:8) : As with supplementary rows, the coordinates of these columns will also be predicted.

R functions

The functions suprow() and supcol() [in ade4 package] are used to calculate the coordinates of supplementary rows and columns, respectively.

The simplified formats are :

# For supplementary rows
suprow(x, Xsup)

# For supplementary columns
supcol(x, Xsup)

Supplementary rows

# Data for the supplementary rows
row.sup <- children[15:18, 1:5, drop = FALSE]
head(row.sup)

             unqualified cep bepc high_school_diploma university
comfort                2   4    3                   1          4
disagreement           2   8    2                   5          2
world                  1   5    4                   6          3
to_live                3   3    1                   3          4

STEP 1/2 - CA using active rows/columns:

d.active <- children[1:14, 1:5]
res.ca <- dudi.coa(d.active, scannf = FALSE, nf =5)

STEP 2/2 - Predict the coordinates of the supplementary rows:

row.sup.ca <- suprow(res.ca, row.sup)
names(row.sup.ca)

[1] "tabsup" "lisup"

# coordinates 
row.sup.coord <- row.sup.ca$lisup
head(row.sup.coord)

                 Axis1     Axis2      Axis3      Axis4
comfort      0.2096705 0.7031677 0.07111168  0.3071354
disagreement 0.1462777 0.1190106 0.17108916 -0.3132169
world        0.5233045 0.1429707 0.08399269 -0.1063597
to_live      0.3083067 0.5020193 0.52093397  0.2557357

How to visualize supplementary rows on the factor map?

The function fviz_add() is used :

# Plot of active rows
p <- fviz_ca_row(res.ca)
# Add supplementary rows
fviz_add(p, row.sup.coord, color ="darkgreen")

Supplementary columns

# Data for the supplementary columns
col.sup <- children[1:14, 6:8, drop = FALSE]
head(col.sup)

              thirty fifty more_fifty
money             59    66         70
future           115   117         86
unemployment      79    88        177
circumstances      9     8          5
hard               2    17         18
economic          18    19         17

Recall that rows 15:18 are supplementary rows. We don’t want them in the current analysis, which is why I extracted only rows 1:14.

Predict the coordinates of the supplementary columns :

col.sup.ca <- supcol(res.ca, col.sup)
names(col.sup.ca)

[1] "tabsup" "cosup"

# coordinates 
col.sup.coord <- col.sup.ca$cosup
head(col.sup.coord)

                 Comp1       Comp2       Comp3       Comp4
thirty      0.10541339 -0.05969594 -0.10322613  0.06977996
fifty      -0.01706444  0.04907657 -0.01568923 -0.01306117
more_fifty -0.17706810 -0.04813788  0.10077299 -0.08517528

Visualize supplementary columns on the factor map using factoextra :

# Plot of active columns
p <- fviz_ca_col(res.ca)
# Add supplementary columns
fviz_add(p, col.sup.coord , color ="darkgreen")

Infos

This analysis has been performed using R software (ver. 3.1.2), ade4 (ver. 1.6-2) and factoextra (ver. 1.0.2)

Correspondence analysis basics - R software and data mining

Mon, 01 Jun 2015 00:01:30 +0200

Required package
Load FactoMineR and factoextra
Data format: Contingency tables
Visualize a contingency table using graphical matrix
Row sums and column sums
Row variables
Column variables
Association between row and column variables
- Chi-square test
- Chi-square statistic and the total inertia
Graphical representation of a contingency table: Mosaic plot
G-test: Likelihood ratio test
- Likelihood ratio test in R
- Interpret the association between rows and columns using likelihood ratio
Correspondence analysis
CA - Singular value decomposition of the standardized residuals
Packages in R
Infos

Correspondence analysis (CA) is an extension of Principal Component Analysis (PCA) suited to analyze frequencies formed by qualitative variables (i.e, contingency table).

This R tutorial describes the idea and the mathematical procedures of Correspondence Analysis (CA) using R software.

The mathematical procedures of CA are complex and require matrix algebra.

In this tutorial, I put a lot of effort into writing all the formulas in a very simple format so that every beginner can understand the methods.

Required package

FactoMineR (for computing CA) and factoextra (for CA visualization) packages are used.

These packages can be installed as follows:

install.packages("FactoMineR")

# install.packages("devtools")
devtools::install_github("kassambara/factoextra")

Load FactoMineR and factoextra

library("FactoMineR")
library("factoextra")

Data format: Contingency tables

We’ll use the data set housetasks [in factoextra]:

data(housetasks)
head(housetasks)

           Wife Alternating Husband Jointly
Laundry     156          14       2       4
Main_meal   124          20       5       4
Dinner       77          11       7      13
Breakfeast   82          36      15       7
Tidying      53          11       1      57
Dishes       32          24       4      53

An image of the data is shown below:

The data is a contingency table containing 13 household tasks and their distribution within the couple:

rows are the different tasks
values are the frequencies of the tasks done:
by the wife only
alternatively
by the husband only
or jointly

As the above contingency table is not very large, with a quick visual examination it can be seen that:

The house tasks Laundry, Main_meal and Dinner are dominant in the column Wife
Repairs are dominant in the column Husband
Holidays are dominant in the column Jointly
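This visual reading can be confirmed with a one-liner that returns, for each task, the column holding the largest count. The sketch below rebuilds five rows of the housetasks table by hand so that it runs on its own; the object names tasks and dominant are chosen for the example.

```r
# Five rows copied from the housetasks table
tasks <- matrix(c(156, 14,   2,   4,
                  124, 20,   5,   4,
                   77, 11,   7,  13,
                    0,  3, 160,   2,
                    0,  1,   6, 153),
                ncol = 4, byrow = TRUE,
                dimnames = list(
                  c("Laundry", "Main_meal", "Dinner", "Repairs", "Holidays"),
                  c("Wife", "Alternating", "Husband", "Jointly")))

# For each row, name of the column with the largest frequency
dominant <- colnames(tasks)[apply(tasks, 1, which.max)]
names(dominant) <- rownames(tasks)
dominant
```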

Visualize a contingency table using graphical matrix

To easily interpret the contingency table, a graphical matrix can be drawn using the function balloonplot() [in gplots package]. In this graph, each cell contains a dot whose size reflects the relative magnitude of the value it contains.

library("gplots")
# 1. convert the data as a table
dt <- as.table(as.matrix(housetasks))
# 2. Graph
balloonplot(t(dt), main ="housetasks", xlab ="", ylab="",
            label = FALSE, show.margins = FALSE)

For a very large contingency table, the visual interpretation would be very hard. Other methods are required such as correspondence analysis.

I will describe step by step many tools and statistical approaches to visualize, analyse and interpret a contingency table.

Row sums and column sums

Row sums (row.sum) and column sums (col.sum) are called row margins and column margins, respectively. They can be calculated as follows:

# Row margins
row.sum <- apply(housetasks, 1, sum)
head(row.sum)

   Laundry  Main_meal     Dinner Breakfeast    Tidying     Dishes 
       176        153        108        140        122        113

# Column margins
col.sum <- apply(housetasks, 2, sum)
head(col.sum)

       Wife Alternating     Husband     Jointly 
        600         254         381         509

# grand total
n <- sum(housetasks)

The grand total is the total sum of all values in the contingency table.
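As an alternative to apply(), base R provides rowSums(), colSums() and addmargins(). The sketch below uses only the first six rows of housetasks (typed in by hand so that the example is self-contained), so its column sums and grand total differ from those of the full table.

```r
# First six rows of housetasks
tasks <- matrix(c(156, 14,  2,  4,
                  124, 20,  5,  4,
                   77, 11,  7, 13,
                   82, 36, 15,  7,
                   53, 11,  1, 57,
                   32, 24,  4, 53),
                ncol = 4, byrow = TRUE,
                dimnames = list(
                  c("Laundry", "Main_meal", "Dinner",
                    "Breakfeast", "Tidying", "Dishes"),
                  c("Wife", "Alternating", "Husband", "Jointly")))

row.sum <- rowSums(tasks)   # row margins
col.sum <- colSums(tasks)   # column margins (six rows only)
n <- sum(tasks)             # grand total (six rows only)

# Table with a "Sum" row and a "Sum" column appended
addmargins(as.table(tasks))
```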

The contingency table with row and column margins is shown below:

	Wife	Alternating	Husband	Jointly	TOTAL
Laundry	156	14	2	4	176
Main_meal	124	20	5	4	153
Dinner	77	11	7	13	108
Breakfeast	82	36	15	7	140
Tidying	53	11	1	57	122
Dishes	32	24	4	53	113
Shopping	33	23	9	55	120
Official	12	46	23	15	96
Driving	10	51	75	3	139
Finances	13	13	21	66	113
Insurance	8	1	53	77	139
Repairs	0	3	160	2	165
Holidays	0	1	6	153	160
TOTAL	600	254	381	509	1744

Row margins: light gray
Column margins: light blue
The grand total (the total of all values in the table): pink

Row variables

To compare rows, we can analyse their profiles in order to identify similar row variables.

Row profiles

The profile of a given row is calculated by dividing each value in the row by the row margin (i.e., the sum of all values in that row). The formula is:

\[ row.profile = \frac{row}{row.sum} \]

For example, the Wife component of the Laundry row profile is 156/176 = 88.6%.

The R code below can be used to compute row profiles:

row.profile <- housetasks/row.sum
# head(row.profile)

	Wife	Alternating	Husband	Jointly	TOTAL
Laundry	0.88636364	0.079545455	0.011363636	0.02272727	1
Main_meal	0.81045752	0.130718954	0.032679739	0.02614379	1
Dinner	0.71296296	0.101851852	0.064814815	0.12037037	1
Breakfeast	0.58571429	0.257142857	0.107142857	0.05000000	1
Tidying	0.43442623	0.090163934	0.008196721	0.46721311	1
Dishes	0.28318584	0.212389381	0.035398230	0.46902655	1
Shopping	0.27500000	0.191666667	0.075000000	0.45833333	1
Official	0.12500000	0.479166667	0.239583333	0.15625000	1
Driving	0.07194245	0.366906475	0.539568345	0.02158273	1
Finances	0.11504425	0.115044248	0.185840708	0.58407080	1
Insurance	0.05755396	0.007194245	0.381294964	0.55395683	1
Repairs	0.00000000	0.018181818	0.969696970	0.01212121	1
Holidays	0.00000000	0.006250000	0.037500000	0.95625000	1
TOTAL	0.34403670	0.145642202	0.218463303	0.29185780	1

In the table above, the row TOTAL (in light blue) is called the average row profile (or the marginal profile of columns, or the column margin).

The average row profile is computed as follows:

\[ average.rp = \frac{column.sum}{grand.total} \]

For the housetasks data, the average row profile is (600/1744, 254/1744, 381/1744, 509/1744). It can be computed in R as follows:

# Column sums
col.sum <- apply(housetasks, 2, sum)
# average row profile = Column sums / grand total
average.rp <- col.sum/n 
average.rp

       Wife Alternating     Husband     Jointly 
  0.3440367   0.1456422   0.2184633   0.2918578
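As a cross-check, the same row profiles can be obtained with the base R function prop.table(). The sketch below rebuilds the housetasks counts from the table above as a plain matrix (an assumption made only to keep the example self-contained; the variable weighted.mean.rp is also introduced here for illustration). It suggests that the average row profile is simply the mean of the row profiles, weighted by the relative size of each row.

```r
# housetasks counts, copied from the contingency table in this article
housetasks <- matrix(
  c(156, 14,   2,   4,
    124, 20,   5,   4,
     77, 11,   7,  13,
     82, 36,  15,   7,
     53, 11,   1,  57,
     32, 24,   4,  53,
     33, 23,   9,  55,
     12, 46,  23,  15,
     10, 51,  75,   3,
     13, 13,  21,  66,
      8,  1,  53,  77,
      0,  3, 160,   2,
      0,  1,   6, 153),
  ncol = 4, byrow = TRUE,
  dimnames = list(
    c("Laundry", "Main_meal", "Dinner", "Breakfeast", "Tidying", "Dishes",
      "Shopping", "Official", "Driving", "Finances", "Insurance",
      "Repairs", "Holidays"),
    c("Wife", "Alternating", "Husband", "Jointly")))

# Row profiles: each row divided by its row sum
row.profile <- prop.table(housetasks, margin = 1)
# Average row profile: column sums / grand total
average.rp <- colSums(housetasks) / sum(housetasks)
# Relative size of each row: row sums / grand total
row.weight <- rowSums(housetasks) / sum(housetasks)
# Weighted mean of the row profiles (row i is multiplied by row.weight[i])
weighted.mean.rp <- colSums(row.profile * row.weight)
round(rbind(average.rp, weighted.mean.rp), 4)
```

Both rows of the printed result should be identical, and every row of row.profile sums to 1.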

Distance (or similarity) between row profiles

If we want to compare 2 rows (row1 and row2), we need to compute the squared distance between their profiles as follows:

\[ d^2(row_1, row_2) = \sum{\frac{(row.profile_1 - row.profile_2)^2}{average.profile}} \]

This distance is called Chi-square distance.

For example, the distance between the rows Laundry and Main_meal is:

\[ d^2(Laundry, Main\_meal) = \frac{(0.886-0.810)^2}{0.344} + \frac{(0.0795-0.131)^2}{0.146} + ... = 0.037 \]

The distance between Laundry and Main_meal can be calculated in R as follows:

# Laundry and Main_meal profiles
laundry.p <- row.profile["Laundry",]
main_meal.p <- row.profile["Main_meal",]
# Distance between Laundry and Main_meal
d2 <- sum(((laundry.p - main_meal.p)^2) / average.rp)
d2

[1] 0.03684787

The distance between Laundry and Driving is:

# Driving profile
driving.p <- row.profile["Driving",]
# Distance between Laundry and Driving
d2 <- sum(((laundry.p - driving.p)^2) / average.rp)
d2

[1] 3.772028

Note that the rows Laundry and Main_meal are very close (d2 ~ 0.037, similar profiles) compared to the rows Laundry and Driving (d2 ~ 3.77).

You can also compute the squared distance between each row profile and the average row profile in order to view rows that are the most similar or different to the average row.

Squared distance between each row profile and the average row profile

\[ d^2(row_i, average.profile) = \sum{\frac{(row.profile_i - average.profile)^2}{average.profile}} \]

The R code below computes the distance from the average profile for all the row variables:

d2.row <- apply(row.profile, 1, 
        function(row.p, av.p){sum(((row.p - av.p)^2)/av.p)}, 
        average.rp)
as.matrix(round(d2.row,3))

            [,1]
Laundry    1.329
Main_meal  1.034
Dinner     0.618
Breakfeast 0.512
Tidying    0.353
Dishes     0.302
Shopping   0.218
Official   0.968
Driving    1.274
Finances   0.456
Insurance  0.727
Repairs    3.307
Holidays   2.140

The rows Repairs, Holidays, Laundry and Driving have the most different profiles from the average profile.

Distance matrix

In this section the squared distance is computed between each row profile and the other rows in the contingency table.

The result is a distance matrix (a kind of dissimilarity matrix: small values indicate similar profiles).

The custom R function below is used to compute the distance matrix:

## data: a data frame or matrix whose rows are profiles
## average.profile: average profile
dist.matrix <- function(data, average.profile){
  # Transpose so that each profile is a column of mat
  mat <- as.matrix(t(data))
  n <- ncol(mat)
  dist.mat <- matrix(NA, n, n)
  diag(dist.mat) <- 0
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      # Squared Chi-square distance between profiles i and j
      d2 <- sum(((mat[, i] - mat[, j])^2) / average.profile)
      dist.mat[i, j] <- dist.mat[j, i] <- d2
    }
  }
  colnames(dist.mat) <- rownames(dist.mat) <- colnames(mat)
  dist.mat
}

Compute and visualize the distance between row profiles. The package corrplot is required for the visualization. It can be installed as follows: install.packages("corrplot").

# Distance matrix
dist.mat <- dist.matrix(row.profile, average.rp)
dist.mat <-round(dist.mat, 2)
# Visualize the matrix
library("corrplot")
corrplot(dist.mat, type="upper",  is.corr = FALSE)

The size of the circle is proportional to the magnitude of the distance between row profiles.
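The same squared distances can also be obtained without an explicit loop. The sketch below is one possible alternative, not the article's method: it relies on the fact that the Chi-square distance is an ordinary Euclidean distance once each profile value is divided by the square root of the average profile. The housetasks matrix is rebuilt from the counts in this article so that the example runs on its own; scaled and d2.mat are names introduced here.

```r
# housetasks counts, copied from the contingency table in this article
housetasks <- matrix(
  c(156, 14,   2,   4,
    124, 20,   5,   4,
     77, 11,   7,  13,
     82, 36,  15,   7,
     53, 11,   1,  57,
     32, 24,   4,  53,
     33, 23,   9,  55,
     12, 46,  23,  15,
     10, 51,  75,   3,
     13, 13,  21,  66,
      8,  1,  53,  77,
      0,  3, 160,   2,
      0,  1,   6, 153),
  ncol = 4, byrow = TRUE,
  dimnames = list(
    c("Laundry", "Main_meal", "Dinner", "Breakfeast", "Tidying", "Dishes",
      "Shopping", "Official", "Driving", "Finances", "Insurance",
      "Repairs", "Holidays"),
    c("Wife", "Alternating", "Husband", "Jointly")))

row.profile <- prop.table(housetasks, margin = 1)
average.rp <- colSums(housetasks) / sum(housetasks)

# Divide each column of the profiles by sqrt(average profile)
scaled <- sweep(row.profile, 2, sqrt(average.rp), "/")
# Squared Euclidean distances between scaled profiles
# = squared Chi-square distances between the original profiles
d2.mat <- as.matrix(dist(scaled))^2

keep <- c("Laundry", "Main_meal", "Driving")
round(d2.mat[keep, keep], 3)
```

The entries for Laundry/Main_meal and Laundry/Driving should match the two distances computed by hand earlier in this section.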

When the data contains many categories, correspondence analysis is very useful to visualize the similarity between items.

Row mass and inertia

The row mass (or row weight) is the relative frequency of a given row, i.e. the row sum divided by the grand total. It’s calculated as follows:

\[ row.mass = \frac{row.sum}{grand.total} \]

row.sum <- apply(housetasks, 1, sum)
grand.total <- sum(housetasks)
row.mass <- row.sum/grand.total
head(row.mass)

   Laundry  Main_meal     Dinner Breakfeast    Tidying     Dishes 
0.10091743 0.08772936 0.06192661 0.08027523 0.06995413 0.06479358

The Row inertia is calculated as the row mass multiplied by the squared distance between the row and the average row profile:

\[ row.inertia = row.mass * d^2(row) \]

The inertia of a row (or a column) is the amount of information it contains.
The total inertia is the total information contained in the data table. It’s computed as the sum of the row inertias (or, equivalently, as the sum of the column inertias).

# Row inertia
row.inertia <- row.mass * d2.row
head(row.inertia)

   Laundry  Main_meal     Dinner Breakfeast    Tidying     Dishes 
0.13415976 0.09069235 0.03824633 0.04112368 0.02466697 0.01958732

# Total inertia
sum(row.inertia)

[1] 1.11494

The total inertia corresponds to the amount of the information the data contains.

Row summary

The result for rows can be summarized as follows:

row <- cbind.data.frame(d2 = d2.row, mass = row.mass, inertia = row.inertia)
round(row,3)

              d2  mass inertia
Laundry    1.329 0.101   0.134
Main_meal  1.034 0.088   0.091
Dinner     0.618 0.062   0.038
Breakfeast 0.512 0.080   0.041
Tidying    0.353 0.070   0.025
Dishes     0.302 0.065   0.020
Shopping   0.218 0.069   0.015
Official   0.968 0.055   0.053
Driving    1.274 0.080   0.102
Finances   0.456 0.065   0.030
Insurance  0.727 0.080   0.058
Repairs    3.307 0.095   0.313
Holidays   2.140 0.092   0.196

Column variables

Column profiles

Column profiles are calculated in the same way as row profiles.

The profile of a given column is computed as follows:

\[ col.profile = \frac{col}{col.sum} \]

The R code below can be used to compute column profiles:

col.profile <- t(housetasks)/col.sum
col.profile <- as.data.frame(t(col.profile))
# head(col.profile)

	Wife	Alternating	Husband	Jointly	TOTAL
Laundry	0.26000000	0.055118110	0.005249344	0.007858546	0.10091743
Main_meal	0.20666667	0.078740157	0.013123360	0.007858546	0.08772936
Dinner	0.12833333	0.043307087	0.018372703	0.025540275	0.06192661
Breakfeast	0.13666667	0.141732283	0.039370079	0.013752456	0.08027523
Tidying	0.08833333	0.043307087	0.002624672	0.111984283	0.06995413
Dishes	0.05333333	0.094488189	0.010498688	0.104125737	0.06479358
Shopping	0.05500000	0.090551181	0.023622047	0.108055010	0.06880734
Official	0.02000000	0.181102362	0.060367454	0.029469548	0.05504587
Driving	0.01666667	0.200787402	0.196850394	0.005893910	0.07970183
Finances	0.02166667	0.051181102	0.055118110	0.129666012	0.06479358
Insurance	0.01333333	0.003937008	0.139107612	0.151277014	0.07970183
Repairs	0.00000000	0.011811024	0.419947507	0.003929273	0.09461009
Holidays	0.00000000	0.003937008	0.015748031	0.300589391	0.09174312
TOTAL	1.00000000	1.000000000	1.000000000	1.000000000	1.00000000

In the table above, the column TOTAL is called the average column profile (or the marginal profile of rows).

The average column profile is calculated as follows:

\[ average.cp = \frac{row.sum}{grand.total} \]

For the housetasks data, the average column profile is (176/1744, 153/1744, 108/1744, 140/1744, …). It can be computed in R as follows:

# Row sums
row.sum <- apply(housetasks, 1, sum)
# average column profile= row sums/grand total
average.cp <- row.sum/n 
head(average.cp)

   Laundry  Main_meal     Dinner Breakfeast    Tidying     Dishes 
0.10091743 0.08772936 0.06192661 0.08027523 0.06995413 0.06479358

Distance (similarity) between column profiles

If we want to compare columns, we need to compute the squared distance between their profiles as follows:

\[ d^2(col_1, col_2) = \sum{\frac{(col.profile_1 - col.profile_2)^2}{average.profile}} \]

For example, the distance between the columns Wife and Husband is:

\[ d^2(Wife, Husband) = \frac{(0.26-0.005)^2}{0.10} + \frac{(0.21-0.013)^2}{0.09} + ... = 4.05 \]

The distance between Wife and Husband can be calculated in R as follows:

# Wife and Husband profiles
wife.p <- col.profile[, "Wife"]
husband.p <- col.profile[, "Husband"]
# Distance between Wife and Husband
d2 <- sum(((wife.p - husband.p)^2) / average.cp)
d2

[1] 4.050311

You can also compute the squared distance between each column profile and the average column profile

Squared distance between each column profile and the average column profile

\[ d^2(col_i, average.profile) = \sum{\frac{(col.profile_i - average.profile)^2}{average.profile}} \]

The R code below computes the distance from the average profile for all the column variables

d2.col <- apply(col.profile, 2, 
        function(col.p, av.p){sum(((col.p - av.p)^2)/av.p)}, 
        average.cp)
round(d2.col,3)

       Wife Alternating     Husband     Jointly 
      0.875       0.809       1.746       1.078

Distance matrix

# Distance matrix
dist.mat <- dist.matrix(t(col.profile), average.cp)
dist.mat <-round(dist.mat, 2)
dist.mat

            Wife Alternating Husband Jointly
Wife        0.00        1.71    4.05    2.93
Alternating 1.71        0.00    2.67    2.58
Husband     4.05        2.67    0.00    3.70
Jointly     2.93        2.58    3.70    0.00

# Visualize the matrix
library("corrplot")
corrplot(dist.mat, type="upper", order="hclust", is.corr = FALSE)

Column mass and inertia

The column mass (or column weight) is the relative frequency of each column, i.e. the column sum divided by the grand total. It’s calculated as follows:

\[ col.mass = \frac{col.sum}{grand.total} \]

col.sum <- apply(housetasks, 2, sum)
grand.total <- sum(housetasks)
col.mass <- col.sum/grand.total
head(col.mass)

       Wife Alternating     Husband     Jointly 
  0.3440367   0.1456422   0.2184633   0.2918578

The column inertia is calculated as the column mass multiplied by the squared distance between the column and the average column profile:

\[ col.inertia = col.mass * d^2(col) \]

col.inertia <- col.mass * d2.col
head(col.inertia)

       Wife Alternating     Husband     Jointly 
  0.3010185   0.1178242   0.3813729   0.3147248

# total inertia
sum(col.inertia)

[1] 1.11494

Recall that the total inertia corresponds to the amount of the information the data contains. Note that, the total inertia obtained using column profile is the same as the one obtained when analyzing row profile. That’s normal, because we are analyzing the same data with just a different angle of view.
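The equality of the two totals can be seen directly from the definitions. In the short derivation below, the notation is introduced only for this purpose: $p_{ij}$ is a cell value divided by the grand total, $r_i$ is the row mass and $c_j$ is the column mass.

\[ \sum_i r_i \sum_j \frac{(p_{ij}/r_i - c_j)^2}{c_j} = \sum_i \sum_j \frac{(p_{ij} - r_i c_j)^2}{r_i c_j} = \sum_j c_j \sum_i \frac{(p_{ij}/c_j - r_i)^2}{r_i} \]

The left-hand side is the sum of the row inertias and the right-hand side is the sum of the column inertias. Both reduce to the middle expression, which treats rows and columns symmetrically.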

Column summary

The result for columns can be summarized as follows:

col <- cbind.data.frame(d2 = d2.col, mass = col.mass, 
                        inertia = col.inertia)
round(col,3)

               d2  mass inertia
Wife        0.875 0.344   0.301
Alternating 0.809 0.146   0.118
Husband     1.746 0.218   0.381
Jointly     1.078 0.292   0.315

Association between row and column variables

When the contingency table is not very large (as above), it’s easy to visually inspect and interpret row and column profiles:

The house tasks Laundry, Main_meal and Dinner are more frequently done by the “Wife”.
Repairs and Driving are predominantly done by the husband.
Holidays are more frequently taken jointly.

Larger contingency tables are hard to interpret visually, and other methods are needed to help with this process.

Another statistical method that can be applied to contingency table is the Chi-square test of independence.

Chi-square test

The Chi-square test is used to examine whether rows and columns of a contingency table are statistically significantly associated.

Null hypothesis (H0): the row and the column variables of the contingency table are independent.
Alternative hypothesis (H1): row and column variables are dependent

For each cell of the table, we have to calculate the expected value under the null hypothesis.

For a given cell, the expected value is calculated as follows:

\[ e = \frac{row.sum * col.sum}{grand.total} \]

The Chi-square statistic is calculated as follows:

\[ \chi^2 = \sum{\frac{(o - e)^2}{e}} \]

o is the observed value
e is the expected value

This calculated Chi-square statistic is compared to the critical value (obtained from statistical tables) with $df = (r - 1)(c - 1)$ degrees of freedom and p = 0.05.

r is the number of rows in the contingency table
c is the number of columns in the contingency table

If the calculated Chi-square statistic is greater than the critical value, then we must conclude that the row and the column variables are not independent of each other. This implies that they are significantly associated.
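The formulas above can be applied by hand in a few lines of base R. This is a minimal sketch, assuming the housetasks counts are rebuilt as a plain matrix from the table in this article; the names expected, chi2, crit and p.value are introduced here for illustration. The critical value is taken from qchisq() instead of a statistical table.

```r
# housetasks counts, copied from the contingency table in this article
housetasks <- matrix(
  c(156, 14,   2,   4,
    124, 20,   5,   4,
     77, 11,   7,  13,
     82, 36,  15,   7,
     53, 11,   1,  57,
     32, 24,   4,  53,
     33, 23,   9,  55,
     12, 46,  23,  15,
     10, 51,  75,   3,
     13, 13,  21,  66,
      8,  1,  53,  77,
      0,  3, 160,   2,
      0,  1,   6, 153),
  ncol = 4, byrow = TRUE,
  dimnames = list(
    c("Laundry", "Main_meal", "Dinner", "Breakfeast", "Tidying", "Dishes",
      "Shopping", "Official", "Driving", "Finances", "Insurance",
      "Repairs", "Holidays"),
    c("Wife", "Alternating", "Husband", "Jointly")))

n <- sum(housetasks)
# Expected counts under independence: row.sum * col.sum / grand.total
expected <- outer(rowSums(housetasks), colSums(housetasks)) / n
# Chi-square statistic: sum over all cells of (o - e)^2 / e
chi2 <- sum((housetasks - expected)^2 / expected)
# Degrees of freedom: (r - 1)(c - 1)
df <- (nrow(housetasks) - 1) * (ncol(housetasks) - 1)
# Critical value at the 5% level, and p-value of the observed statistic
crit <- qchisq(0.95, df)
p.value <- pchisq(chi2, df, lower.tail = FALSE)

c(chi2 = chi2, df = df, critical = crit)
chi2 > crit   # TRUE means the independence hypothesis is rejected
```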

Note that the Chi-square test should only be applied when the expected frequency of every cell is at least 5.

The Chi-square statistic can be easily computed using the function chisq.test() as follows:

chisq <- chisq.test(housetasks)
chisq


    Pearson's Chi-squared test

data:  housetasks
X-squared = 1944.456, df = 36, p-value < 2.2e-16

In our example, the row and the column variables are statistically significantly associated (p-value < 2.2e-16).

Note that, while Chi-square test can help to establish dependence between rows and the columns, the nature of the dependency is unknown.

The observed and the expected counts can be extracted from the result of the test as follow:

# Observed counts
chisq$observed

           Wife Alternating Husband Jointly
Laundry     156          14       2       4
Main_meal   124          20       5       4
Dinner       77          11       7      13
Breakfeast   82          36      15       7
Tidying      53          11       1      57
Dishes       32          24       4      53
Shopping     33          23       9      55
Official     12          46      23      15
Driving      10          51      75       3
Finances     13          13      21      66
Insurance     8           1      53      77
Repairs       0           3     160       2
Holidays      0           1       6     153

# Expected counts
round(chisq$expected,2)

            Wife Alternating Husband Jointly
Laundry    60.55       25.63   38.45   51.37
Main_meal  52.64       22.28   33.42   44.65
Dinner     37.16       15.73   23.59   31.52
Breakfeast 48.17       20.39   30.58   40.86
Tidying    41.97       17.77   26.65   35.61
Dishes     38.88       16.46   24.69   32.98
Shopping   41.28       17.48   26.22   35.02
Official   33.03       13.98   20.97   28.02
Driving    47.82       20.24   30.37   40.57
Finances   38.88       16.46   24.69   32.98
Insurance  47.82       20.24   30.37   40.57
Repairs    56.77       24.03   36.05   48.16
Holidays   55.05       23.30   34.95   46.70

As mentioned above the Chi-square statistic is 1944.456196.

Which are the most contributing cells to the definition of the total Chi-square statistic?

To find the cells that contribute the most to the total Chi-square score, compute the Pearson residual of each cell:

\[ r = \frac{o - e}{\sqrt{e}} \]

The above formula returns the so-called Pearson residuals (r), referred to in this article as standardized residuals. The squared residual of a cell is that cell's term in the Chi-square statistic.

Cells with the highest absolute residuals contribute the most to the total Chi-square score.

Pearson residuals can be easily extracted from the output of the function chisq.test():

round(chisq$residuals, 3)

             Wife Alternating Husband Jointly
Laundry    12.266      -2.298  -5.878  -6.609
Main_meal   9.836      -0.484  -4.917  -6.084
Dinner      6.537      -1.192  -3.416  -3.299
Breakfeast  4.875       3.457  -2.818  -5.297
Tidying     1.702      -1.606  -4.969   3.585
Dishes     -1.103       1.859  -4.163   3.486
Shopping   -1.289       1.321  -3.362   3.376
Official   -3.659       8.563   0.443  -2.459
Driving    -5.469       6.836   8.100  -5.898
Finances   -4.150      -0.852  -0.742   5.750
Insurance  -5.758      -4.277   4.107   5.720
Repairs    -7.534      -4.290  20.646  -6.651
Holidays   -7.419      -4.620  -4.897  15.556

Let’s visualize Pearson residuals using the package corrplot:

library(corrplot)
corrplot(chisq$residuals, is.corr = FALSE)

For a given cell, the size of the circle is proportional to the amount of the cell contribution.

The sign of the standardized residuals is also very important to interpret the association between rows and columns as explained in the block below.

Positive residuals are in blue. Positive values in cells specify an attraction (positive association) between the corresponding row and column variables.

In the image above, it’s evident that there is an association between the column Wife and the rows Laundry and Main_meal.
There is a strong positive association between the column Husband and the row Repairs.

Negative residuals are in red. This implies a repulsion (negative association) between the corresponding row and column variables. For example, the column Wife is negatively associated (~ “not associated”) with the row Repairs. There is a repulsion between the column Husband and the rows Laundry and Main_meal.

Note that, correspondence analysis is just the singular value decomposition of the standardized residuals. This will be explained in the next section.

The contribution (in %) of a given cell to the total Chi-square score is calculated as follows:

\[ contrib = 100 \times \frac{r^2}{\chi^2} \]

r is the residual of the cell

# Contribution in percentage (%)
contrib <- 100*chisq$residuals^2/chisq$statistic
round(contrib, 3)

            Wife Alternating Husband Jointly
Laundry    7.738       0.272   1.777   2.246
Main_meal  4.976       0.012   1.243   1.903
Dinner     2.197       0.073   0.600   0.560
Breakfeast 1.222       0.615   0.408   1.443
Tidying    0.149       0.133   1.270   0.661
Dishes     0.063       0.178   0.891   0.625
Shopping   0.085       0.090   0.581   0.586
Official   0.688       3.771   0.010   0.311
Driving    1.538       2.403   3.374   1.789
Finances   0.886       0.037   0.028   1.700
Insurance  1.705       0.941   0.868   1.683
Repairs    2.919       0.947  21.921   2.275
Holidays   2.831       1.098   1.233  12.445

# Visualize the contribution
corrplot(contrib, is.corr = FALSE)

The relative contribution of each cell to the total Chi-square score gives some indication of the nature of the dependency between rows and columns of the contingency table.

It can be seen that:

The column “Wife” is strongly associated with Laundry, Main_meal, Dinner
The column “Husband” is strongly associated with the row Repairs
The column “Jointly” is strongly associated with the row Holidays

From the image above, it can be seen that the most contributing cells to the Chi-square are Wife/Laundry (7.738%), Wife/Main_meal (4.976%), Husband/Repairs (21.921%) and Jointly/Holidays (12.445%).

These cells contribute about 47.1% to the total Chi-square score and thus account for a large part of the difference between expected and observed values.

This confirms the earlier visual interpretation of the data. As stated earlier, visual interpretation may be complex when the contingency table is very large. In this case, the contribution of one cell to the total Chi-square score becomes a useful way of establishing the nature of dependency.
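For a larger table, reading the top cells off a plot becomes impractical, so it can help to rank them programmatically. The following is a minimal sketch, assuming the housetasks counts are rebuilt as a matrix from the table in this article; the names cells and top4 are introduced here for illustration.

```r
# housetasks counts, copied from the contingency table in this article
housetasks <- matrix(
  c(156, 14,   2,   4,
    124, 20,   5,   4,
     77, 11,   7,  13,
     82, 36,  15,   7,
     53, 11,   1,  57,
     32, 24,   4,  53,
     33, 23,   9,  55,
     12, 46,  23,  15,
     10, 51,  75,   3,
     13, 13,  21,  66,
      8,  1,  53,  77,
      0,  3, 160,   2,
      0,  1,   6, 153),
  ncol = 4, byrow = TRUE,
  dimnames = list(
    c("Laundry", "Main_meal", "Dinner", "Breakfeast", "Tidying", "Dishes",
      "Shopping", "Official", "Driving", "Finances", "Insurance",
      "Repairs", "Holidays"),
    c("Wife", "Alternating", "Husband", "Jointly")))

chisq <- chisq.test(housetasks)
# Contribution of each cell in percentage (%)
contrib <- 100 * chisq$residuals^2 / chisq$statistic

# One line per cell, sorted by decreasing contribution
cells <- as.data.frame(as.table(contrib))
names(cells) <- c("row", "column", "contrib")
cells <- cells[order(cells$contrib, decreasing = TRUE), ]

# The four most contributing cells and their cumulated contribution
top4 <- head(cells, 4)
top4
sum(top4$contrib)
```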

Chi-square statistic and the total inertia

As mentioned above, the total inertia is the amount of the information contained in the data table.

It’s called $\phi^2$ (squared phi) and is calculated as follow:

\[ \phi^2 = \frac{\chi^2}{grand.total} \]

phi2 <- as.numeric(chisq$statistic/sum(housetasks))
phi2

[1] 1.11494

The square root of $\phi^2$ is called the trace and may be interpreted as a correlation coefficient (Bendixen, 2003). Any value of the trace > 0.2 indicates a significant dependency between rows and columns (Bendixen, 2003).

Graphical representation of a contingency table: Mosaic plot

Mosaic plot is used to visualize a contingency table in order to examine the association between categorical variables.

The function mosaicplot() [in graphics package] can be used.

library("graphics")
# Mosaic plot of observed values
mosaicplot(housetasks,  las=2, col="steelblue",
           main = "housetasks - observed counts")

# Mosaic plot of expected values
mosaicplot(chisq$expected,  las=2, col = "gray",
           main = "housetasks - expected counts")

In these plots, column variables are split first (vertical split) and then row variables are split (horizontal split). For each cell, the height of bars is proportional to the observed relative frequency it contains:

\[ \frac{cell.value}{column.sum} \]

The blue plot, is the mosaic plot of the observed values. The gray one is the mosaic plot of the expected values under null hypothesis.

If row and column variables were completely independent the mosaic bars for the observed values (blue graph) would be aligned as the mosaic bars for the expected values (gray graph).

It’s also possible to color the mosaic plot according to the value of the standardized residuals:

mosaicplot(housetasks, shade = TRUE, las=2,main = "housetasks")

The argument shade is used to color the graph
The argument las = 2 produces vertical labels

This plot clearly shows that Laundry, Main_meal, Dinner and Breakfeast are more often done by the “Wife”.
Repairs are done by the Husband.

G-test: Likelihood ratio test

The G–test of independence is an alternative to the chi-square test of independence, and they will give approximately the same conclusion.

The test is based on the likelihood ratio defined as follows:

\[ ratio = \frac{o}{e} \]

o is the observed value
e is the expected value under null hypothesis

This likelihood ratio, or its logarithm, can be used to compute a p-value. When the logarithm of the likelihood ratio is used, the statistic is known as a log-likelihood ratio statistic.

This test is called the G-test (or likelihood ratio test, or maximum likelihood statistical significance test) and can be used in situations where Chi-square tests were previously recommended.

The G-test is generally defined as follows:

\[ G = 2 * \sum{o * log(\frac{o}{e})} \]

o is the observed frequency in a cell
e is the expected frequency under the null hypothesis
log is the natural logarithm
The sum is taken over all non-empty cells.

The distribution of G is approximately a chi-squared distribution, with the same number of degrees of freedom as in the corresponding chi-squared test:

\[df = (r - 1)(c - 1)\]

r is the number of rows in the contingency table
c is the number of columns in the contingency table

The commonly used Pearson Chi-square test is, in fact, just an approximation of the log-likelihood ratio on which the G-tests are based.

Remember that, the Chi-square formula is:

\[ \chi^2 = \sum{\frac{(o - e)^2}{e}} \]
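To make the comparison concrete, both statistics can be computed by hand from the same expected counts. The sketch below uses base R only and assumes the housetasks counts are rebuilt as a matrix from the table in this article; empty cells are skipped in the G sum, as stated in the definition above. The names nz, G and chi2 are introduced here for illustration.

```r
# housetasks counts, copied from the contingency table in this article
housetasks <- matrix(
  c(156, 14,   2,   4,
    124, 20,   5,   4,
     77, 11,   7,  13,
     82, 36,  15,   7,
     53, 11,   1,  57,
     32, 24,   4,  53,
     33, 23,   9,  55,
     12, 46,  23,  15,
     10, 51,  75,   3,
     13, 13,  21,  66,
      8,  1,  53,  77,
      0,  3, 160,   2,
      0,  1,   6, 153),
  ncol = 4, byrow = TRUE,
  dimnames = list(
    c("Laundry", "Main_meal", "Dinner", "Breakfeast", "Tidying", "Dishes",
      "Shopping", "Official", "Driving", "Finances", "Insurance",
      "Repairs", "Holidays"),
    c("Wife", "Alternating", "Husband", "Jointly")))

n <- sum(housetasks)
# Expected counts under the null hypothesis of independence
expected <- outer(rowSums(housetasks), colSums(housetasks)) / n

# G statistic: 2 * sum(o * log(o / e)), summed over non-empty cells only
nz <- housetasks > 0
G <- 2 * sum(housetasks[nz] * log(housetasks[nz] / expected[nz]))

# Pearson Chi-square statistic, for comparison
chi2 <- sum((housetasks - expected)^2 / expected)

# Both statistics are referred to a chi-squared distribution with the same df
df <- (nrow(housetasks) - 1) * (ncol(housetasks) - 1)
p.value <- pchisq(G, df, lower.tail = FALSE)

c(G = G, chi2 = chi2, df = df)
```

On these data the two statistics are of the same order of magnitude and lead to the same conclusion.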

Likelihood ratio test in R

The functions likelihood.test() [in Deducer package] or G.test() [in RVAideMemoire package] can be used to perform a G-test on a contingency table.

We’ll use the package RVAideMemoire, which can be installed as follows: install.packages("RVAideMemoire").

The function G.test() works like chisq.test():

library("RVAideMemoire")
gtest <- G.test(as.matrix(housetasks))
gtest


    G-test

data:  as.matrix(housetasks)
G = 1907.658, df = 36, p-value < 2.2e-16

Interpret the association between rows and columns using likelihood ratio

To interpret the association between the rows and the columns of the contingency table, the likelihood ratio can be used as an index (i):

\[ ratio = \frac{o}{e} \]

For a given cell,

If ratio > 1, there is an “attraction” (association) between the corresponding column and row
If ratio < 1, there is a “repulsion” between the corresponding column and row

The ratio can be calculated as follow:

ratio <- chisq$observed/chisq$expected
round(ratio,3)

            Wife Alternating Husband Jointly
Laundry    2.576       0.546   0.052   0.078
Main_meal  2.356       0.898   0.150   0.090
Dinner     2.072       0.699   0.297   0.412
Breakfeast 1.702       1.766   0.490   0.171
Tidying    1.263       0.619   0.038   1.601
Dishes     0.823       1.458   0.162   1.607
Shopping   0.799       1.316   0.343   1.570
Official   0.363       3.290   1.097   0.535
Driving    0.209       2.519   2.470   0.074
Finances   0.334       0.790   0.851   2.001
Insurance  0.167       0.049   1.745   1.898
Repairs    0.000       0.125   4.439   0.042
Holidays   0.000       0.043   0.172   3.276

Note that, you can also use the R code : gtest$observed/gtest$expected

The package corrplot can be used to make a graph of the likelihood ratio:

corrplot(ratio, is.corr = FALSE)

The image above confirms our previous observations:

The rows Laundry, Main_meal and Dinner are associated with the column Wife
Repairs are done more often by the Husband
Holidays are taken Jointly

Let’s take the log of the ratio to show attraction and repulsion in different colors:

If ratio < 1 => log(ratio) < 0 (negative values) => red color
If ratio > 1 => log(ratio) > 0 (positive values) => blue color

We’ll also add a small value (0.5) to all ratios to avoid log(0). Note that this shifts the sign change from ratio = 1 to ratio = 0.5, so cells with a ratio between 0.5 and 1 are drawn as weak attractions:

corrplot(log2(ratio + 0.5), is.corr = FALSE)

Correspondence analysis

Correspondence analysis (CA) is particularly useful for large contingency tables.

It is used to graphically visualize row points and column points in a low dimensional space.

CA is a dimension reduction method applied to a contingency table. The information retained by each dimension is called its eigenvalue.

The total information (or inertia) contained in the data is called phi squared ($\phi^2$) and can be calculated as follows:

\[ \phi^2 = \frac{\chi^2}{grand.total} \]

For a given axis, the eigenvalue ($\lambda$) is computed as follow:

\[ \lambda_{axis} = \sum{\frac{row.sum}{grand.total} * row.coord^2} \]

Or equivalently

\[ \lambda_{axis} = \sum{\frac{col.sum}{grand.total} * col.coord^2} \]

row.coord and col.coord are the coordinates of row and column variables on the axis.

The association index between a row and a column can be computed from their coordinates on the principal axes as follows:

\[ i = 1 + \sum{\frac{row.coord * col.coord}{\sqrt{\lambda}}} \]

$\lambda$ is the eigenvalue of the axis
The sum is taken over all axes

If there is an attraction, the corresponding row and column coordinates have the same sign on the axes. If there is a repulsion, the corresponding row and column coordinates have different signs on the axes. A high absolute value indicates a strong attraction or repulsion.

CA - Singular value decomposition of the standardized residuals

Correspondence analysis (CA) is used to represent graphically the table of distances between row variables or between column variables.

CA approach includes the following steps:

STEP 1. Compute the standardized residuals

The matrix of standardized residuals (S) is:

\[ S = \frac{o - e}{\sqrt{e}} \]

In fact, the entries of S are the signed square roots of the terms making up the $\chi^2$ statistic.

STEP 2. Compute the singular value decomposition (SVD) of the standardized residuals.

Let M be: $M = \frac{1}{\sqrt{grand.total}} \times S$

SVD means that we want to find orthogonal matrices U and V, together with a diagonal matrix $\Delta$, such that:

\[ M = U \Delta V^T \]

(Phillip M. Yelland, 2010)

$U$ is a matrix containing row eigenvectors
$\Delta$ is the diagonal matrix. The numbers on the diagonal of the matrix are called singular values (SV). The eigenvalues are the squared SV.
$V$ is a matrix containing column eigenvectors

The eigenvalue of a given axis is:

\[ \lambda = \delta^2 \]

$\delta$ is the singular value

The coordinates of row variables on a given axis are:

\[ row.coord = \frac{U * \delta }{\sqrt{row.mass}} \]

The coordinates of columns are:

\[ col.coord = \frac{V * \delta }{\sqrt{col.mass}} \]

Compute SVD in R:

# Grand total
n <- sum(housetasks)
# Standardized residuals
residuals <- chisq$residuals/sqrt(n)
# Number of dimensions
nb.axes <- min(nrow(residuals)-1, ncol(residuals)-1)
# Singular value decomposition
res.svd <- svd(residuals, nu = nb.axes, nv = nb.axes)
res.svd

$d
[1] 7.368102e-01 6.670853e-01 3.564385e-01 1.012225e-16

$u
             [,1]        [,2]        [,3]
 [1,] -0.42762952 -0.23587902 -0.28228398
 [2,] -0.35197789 -0.21761257 -0.13633376
 [3,] -0.23391020 -0.11493572 -0.14480767
 [4,] -0.19557424 -0.19231779  0.17519699
 [5,] -0.14136307  0.17221046 -0.06990952
 [6,] -0.06528142  0.16864510  0.19063825
 [7,] -0.04189568  0.15859251  0.14910925
 [8,]  0.07216535 -0.08919754  0.60778606
 [9,]  0.28421536 -0.27652950  0.43123528
[10,]  0.09354184  0.23576569  0.02484968
[11,]  0.24793268  0.20050833 -0.22918636
[12,]  0.63820133 -0.39850534 -0.40738669
[13,]  0.10379321  0.65156733 -0.11011902

$v
            [,1]       [,2]       [,3]
[1,] -0.66679846 -0.3211267 -0.3289692
[2,] -0.03220853 -0.1668171  0.9085662
[3,]  0.73643655 -0.4217418 -0.2476526
[4,]  0.10956112  0.8313745 -0.0703917

sv <- res.svd$d[1:nb.axes] # singular value
u <-res.svd$u
v <- res.svd$v

Eigenvalues and screeplot

# Eigenvalues
eig <- sv^2
# Variances in percentage
variance <- eig*100/sum(eig)
# Cumulative variances
cumvar <- cumsum(variance)

eig<- data.frame(eig = eig, variance = variance,
                     cumvariance = cumvar)
head(eig)

        eig variance cumvariance
1 0.5428893 48.69222    48.69222
2 0.4450028 39.91269    88.60491
3 0.1270484 11.39509   100.00000

barplot(eig[, 2], names.arg=1:nrow(eig), 
       main = "Variances",
       xlab = "Dimensions",
       ylab = "Percentage of variances",
       col ="steelblue")
# Add connected line segments to the plot
lines(x = 1:nrow(eig), eig[, 2], 
      type="b", pch=19, col = "red")

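How many dimensions to retain?

The maximum number of axes in the CA is:

\[ nb.axes = min( r-1, c-1) \]

r and c are respectively the number of rows and columns in the table.

The elbow method can be used: look for the point on the scree plot after which the remaining eigenvalues are all relatively small.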

Row coordinates

We can use the function apply to perform arbitrary operations on the rows and columns of a matrix.

A simplified format is:

apply(X, MARGIN, FUN, ...)

X: a matrix
MARGIN: allowed values can be 1 or 2. 1 specifies that we want to operate on the rows of the matrix. 2 specifies that we want to operate on the columns.
FUN: the function to be applied
…: optional arguments to FUN

# row sum
row.sum <- apply(housetasks, 1, sum)
# row mass
row.mass <- row.sum/n

# row coord = sv * u /sqrt(row.mass)
cc <- t(apply(u, 1, '*', sv)) # each row X sv
row.coord <- apply(cc, 2, '/', sqrt(row.mass))
rownames(row.coord) <- rownames(housetasks)
colnames(row.coord) <- paste0("Dim.", 1:nb.axes)
round(row.coord,3)

            Dim.1  Dim.2  Dim.3
Laundry    -0.992 -0.495 -0.317
Main_meal  -0.876 -0.490 -0.164
Dinner     -0.693 -0.308 -0.207
Breakfeast -0.509 -0.453  0.220
Tidying    -0.394  0.434 -0.094
Dishes     -0.189  0.442  0.267
Shopping   -0.118  0.403  0.203
Official    0.227 -0.254  0.923
Driving     0.742 -0.653  0.544
Finances    0.271  0.618  0.035
Insurance   0.647  0.474 -0.289
Repairs     1.529 -0.864 -0.472
Holidays    0.252  1.435 -0.130

# plot
plot(row.coord, pch=19, col = "blue")
text(row.coord, labels =rownames(row.coord), pos = 3, col ="blue")
abline(v=0, h=0, lty = 2)

Column coordinates

# Coordinates of columns
col.sum <- apply(housetasks, 2, sum)
col.mass <- col.sum/n
# coordinates sv * v /sqrt(col.mass)
cc <- t(apply(v, 1, '*', sv))
col.coord <- apply(cc, 2, '/', sqrt(col.mass))
rownames(col.coord) <- colnames(housetasks)
colnames(col.coord) <- paste0("Dim", 1:nb.axes)
head(col.coord)

                   Dim1       Dim2        Dim3
Wife        -0.83762154 -0.3652207 -0.19991139
Alternating -0.06218462 -0.2915938  0.84858939
Husband      1.16091847 -0.6019199 -0.18885924
Jointly      0.14942609  1.0265791 -0.04644302

# plot
plot(col.coord, pch=17, col = "red")
text(col.coord, labels =rownames(col.coord), pos = 3, col ="red")
abline(v=0, h=0, lty = 2)

Biplot of rows and columns to view the association

xlim <- range(c(row.coord[,1], col.coord[,1]))*1.1
ylim <- range(c(row.coord[,2], col.coord[,2]))*1.1
# Plot of rows
plot(row.coord, pch=19, col = "blue", xlim = xlim, ylim = ylim)
text(row.coord, labels =rownames(row.coord), pos = 3, col ="blue")
# Plot of columns
points(col.coord, pch=17, col = "red")
text(col.coord, labels =rownames(col.coord), pos = 3, col ="red")
abline(v=0, h=0, lty = 2)

You can interpret the distance between row points or between column points, but the distance between a row point and a column point is not meaningful.

Diagnostic

Recall that the total inertia contained in the data is:

\[ \phi^2 = \frac{\chi^2}{n} = 1.11494 \]

Our two-dimensional plot captures about 88% of the total inertia of the table.
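The identity between the total inertia, the chi-square statistic and the eigenvalues can be checked numerically. The sketch below uses a small hypothetical table (invented counts, not housetasks) and assumes the eigenvalues come from the SVD of the standardized residuals.

```r
# Hypothetical 4 x 3 contingency table (invented counts, not housetasks)
tab <- matrix(c(20,  5,  3,
                 8, 15,  6,
                 4,  7, 18,
                10,  9, 11),
              nrow = 4, byrow = TRUE)
n <- sum(tab)

# Total inertia from the Pearson chi-square statistic
chi2 <- unname(chisq.test(tab)$statistic)
phi2 <- chi2 / n

# Total inertia as the sum of the eigenvalues (squared singular values)
P <- tab / n
E <- outer(rowSums(P), colSums(P))
eig <- svd((P - E) / sqrt(E))$d^2

phi2
sum(eig)
# Share of inertia captured by the first two axes, in percent
100 * sum(eig[1:2]) / sum(eig)
```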

Contribution of rows and columns

The contributions of the rows/columns to the definition of a principal axis are :

\[ row.contrib = \frac{row.mass * row.coord^2}{eigenvalue} \]

\[ col.contrib = \frac{col.mass * col.coord^2}{eigenvalue} \]
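As a sanity check on these two formulas, the sketch below applies them to a small hypothetical table (invented counts, not housetasks) and verifies that the contributions to each axis add up to 100%. It assumes coordinates of the form sv * u / sqrt(mass), as used in this article; sweep() is used here instead of nested apply() calls.

```r
# Hypothetical 4 x 3 contingency table (invented counts, not housetasks)
tab <- matrix(c(20,  5,  3,
                 8, 15,  6,
                 4,  7, 18,
                10,  9, 11),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("r", 1:4), paste0("c", 1:3)))
P <- tab / sum(tab)
row.mass <- rowSums(P)
col.mass <- colSums(P)
E <- outer(row.mass, col.mass)
res <- svd((P - E) / sqrt(E))
k <- min(nrow(tab) - 1, ncol(tab) - 1)  # number of axes (2 here)
sv <- res$d[1:k]
eig <- sv^2

# Principal coordinates
row.coord <- sweep(res$u[, 1:k], 2, sv, "*") / sqrt(row.mass)
col.coord <- sweep(res$v[, 1:k], 2, sv, "*") / sqrt(col.mass)

# Contributions in %: mass * coord^2 / eigenvalue
row.contrib <- 100 * sweep(row.mass * row.coord^2, 2, eig, "/")
col.contrib <- 100 * sweep(col.mass * col.coord^2, 2, eig, "/")

colSums(row.contrib)
colSums(col.contrib)
```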

Contribution of rows in %

# contrib <- row.mass * row.coord^2/eigenvalue
cc <- apply(row.coord^2, 2, "*", row.mass)
row.contrib <- t(apply(cc, 1, "/", eig[1:nb.axes,1])) *100
round(row.contrib, 2)

           Dim.1 Dim.2 Dim.3
Laundry    18.29  5.56  7.97
Main_meal  12.39  4.74  1.86
Dinner      5.47  1.32  2.10
Breakfeast  3.82  3.70  3.07
Tidying     2.00  2.97  0.49
Dishes      0.43  2.84  3.63
Shopping    0.18  2.52  2.22
Official    0.52  0.80 36.94
Driving     8.08  7.65 18.60
Finances    0.88  5.56  0.06
Insurance   6.15  4.02  5.25
Repairs    40.73 15.88 16.60
Holidays    1.08 42.45  1.21

corrplot(row.contrib, is.cor = FALSE)

Contribution of columns in %

# contrib <- col.mass * col.coord^2/eigenvalue
cc <- apply(col.coord^2, 2, "*", col.mass)
col.contrib <- t(apply(cc, 1, "/", eig[1:nb.axes,1])) *100
round(col.contrib, 2)

             Dim1  Dim2  Dim3
Wife        44.46 10.31 10.82
Alternating  0.10  2.78 82.55
Husband     54.23 17.79  6.13
Jointly      1.20 69.12  0.50

corrplot(col.contrib, is.cor = FALSE)

Quality of the representation

The quality of the representation is called COS2.

The quality of the representation of a row on an axis is:

\[ row.cos2 = \frac{row.coord^2}{d^2} \]

row.coord is the coordinate of the row on the axis
$d^2$ is the squared distance from the average profile

Recall that the distance between each row profile and the average row profile is:

\[ d^2(row_i, average.profile) = \sum{\frac{(row.profile_i - average.profile)^2}{average.profile}} \]

row.profile <- housetasks/row.sum
head(round(row.profile, 3))

            Wife Alternating Husband Jointly
Laundry    0.886       0.080   0.011   0.023
Main_meal  0.810       0.131   0.033   0.026
Dinner     0.713       0.102   0.065   0.120
Breakfeast 0.586       0.257   0.107   0.050
Tidying    0.434       0.090   0.008   0.467
Dishes     0.283       0.212   0.035   0.469

average.profile <- col.sum/n
head(round(average.profile, 3))

       Wife Alternating     Husband     Jointly 
      0.344       0.146       0.218       0.292

The R code below computes the squared distance from the average profile for every row:

# average.profile (defined above) is the average row profile
d2.row <- apply(row.profile, 1, 
                function(row.p, av.p){sum(((row.p - av.p)^2)/av.p)}, 
                average.profile)
head(round(d2.row,3))

   Laundry  Main_meal     Dinner Breakfeast    Tidying     Dishes 
     1.329      1.034      0.618      0.512      0.353      0.302

The cos2 of rows on the factor map are:

row.cos2 <- apply(row.coord^2, 2, "/", d2.row)
round(row.cos2, 3)

           Dim.1 Dim.2 Dim.3
Laundry    0.740 0.185 0.075
Main_meal  0.742 0.232 0.026
Dinner     0.777 0.154 0.070
Breakfeast 0.505 0.400 0.095
Tidying    0.440 0.535 0.025
Dishes     0.118 0.646 0.236
Shopping   0.064 0.748 0.189
Official   0.053 0.066 0.881
Driving    0.432 0.335 0.233
Finances   0.161 0.837 0.003
Insurance  0.576 0.309 0.115
Repairs    0.707 0.226 0.067
Holidays   0.030 0.962 0.008
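The rows of the table above each sum to 1 because all the axes are included. The sketch below reproduces that property on a small hypothetical table (invented counts, not housetasks), assuming the same definitions of coordinates and of the distance to the average profile.

```r
# Hypothetical 4 x 3 contingency table (invented counts, not housetasks)
tab <- matrix(c(20,  5,  3,
                 8, 15,  6,
                 4,  7, 18,
                10,  9, 11),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("r", 1:4), paste0("c", 1:3)))
P <- tab / sum(tab)
row.mass <- rowSums(P)
col.mass <- colSums(P)   # also the average row profile
E <- outer(row.mass, col.mass)
res <- svd((P - E) / sqrt(E))
k <- min(nrow(tab) - 1, ncol(tab) - 1)
row.coord <- sweep(res$u[, 1:k], 2, res$d[1:k], "*") / sqrt(row.mass)

# Squared chi-square distance of each row profile to the average profile
row.profile <- tab / rowSums(tab)
dev2 <- sweep(row.profile, 2, col.mass, "-")^2
d2.row <- rowSums(sweep(dev2, 2, col.mass, "/"))

# cos2 = coord^2 / d2; over all axes each row sums to 1
row.cos2 <- row.coord^2 / d2.row
rowSums(row.cos2)
```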

Visualize the cos2:

corrplot(row.cos2, is.cor = FALSE)

Cos2 of columns

\[ col.cos2 = \frac{col.coord^2}{d^2} \]

col.profile <- t(housetasks)/col.sum
col.profile <- t(col.profile)
#head(round(col.profile, 3))

average.profile <- row.sum/n
#head(round(average.profile, 3))

The R code below computes the squared distance from the average profile for every column:

d2.col <- apply(col.profile, 2, 
        function(col.p, av.p){sum(((col.p - av.p)^2)/av.p)}, 
        average.profile)
#round(d2.col,3)

The cos2 of columns on the factor map are:

col.cos2 <- apply(col.coord^2, 2, "/", d2.col)
round(col.cos2, 3)

             Dim1  Dim2  Dim3
Wife        0.802 0.152 0.046
Alternating 0.005 0.105 0.890
Husband     0.772 0.208 0.020
Jointly     0.021 0.977 0.002

Visualize the cos2:

corrplot(col.cos2, is.cor = FALSE)

Supplementary rows/columns

The supplementary row coordinates

\[ sup.row.coord = sup.row.profile * \frac{v}{\sqrt{col.mass}} \]

# Supplementary row
sup.row <- as.data.frame(housetasks["Dishes",, drop = FALSE])
# Supplementary row profile
sup.row.sum <- apply(sup.row, 1, sum)
sup.row.profile <- sweep(sup.row, 1, sup.row.sum, "/")
# V/sqrt(col.mass)
vv <- sweep(v, 1, sqrt(col.mass), FUN = "/")
# Supplementary row coord
sup.row.coord <- as.matrix(sup.row.profile) %*% vv
sup.row.coord

             [,1]      [,2]      [,3]
Dishes -0.1889641 0.4419662 0.2669493

## COS2 = coord^2 / squared distance from the average profile
# col.mass (= col.sum/n) is the average row profile
sup.d2.row <- apply(sup.row.profile, 1, 
        function(row.p, av.p){sum(((row.p - av.p)^2)/av.p)}, 
        col.mass)
sup.row.cos2 <- sweep(sup.row.coord^2, 1, sup.d2.row, FUN = "/")
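The supplementary row formula is easy to validate: applied to a row that was part of the analysis, it should give back that row's active coordinates, just as Dishes does above. The sketch below checks this on a small hypothetical table (invented counts, not housetasks), assuming the SVD of the standardized residuals.

```r
# Hypothetical 4 x 3 contingency table (invented counts, not housetasks)
tab <- matrix(c(20,  5,  3,
                 8, 15,  6,
                 4,  7, 18,
                10,  9, 11),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("r", 1:4), paste0("c", 1:3)))
P <- tab / sum(tab)
row.mass <- rowSums(P)
col.mass <- colSums(P)
E <- outer(row.mass, col.mass)
res <- svd((P - E) / sqrt(E))
k <- min(nrow(tab) - 1, ncol(tab) - 1)

# Active row coordinates: sv * u / sqrt(row.mass)
row.coord <- sweep(res$u[, 1:k], 2, res$d[1:k], "*") / sqrt(row.mass)

# Same row treated as supplementary: profile %*% (v / sqrt(col.mass))
vv <- res$v[, 1:k] / sqrt(col.mass)
profile.r2 <- tab["r2", ] / sum(tab["r2", ])
sup.coord <- profile.r2 %*% vv

sup.coord
row.coord[2, ]
```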

Packages in R

There are many packages for CA:

FactoMineR
ade4
ca

library(FactoMineR)
res.ca <- CA(housetasks, graph = FALSE)
# print
res.ca

**Results of the Correspondence Analysis (CA)**
The row variable has  13  categories; the column variable has 4 categories
The chi square of independence between the two variables is equal to 1944.456 (p-value =  0 ).
*The results are available in the following objects:

   name              description                   
1  "$eig"            "eigenvalues"                 
2  "$col"            "results for the columns"     
3  "$col$coord"      "coord. for the columns"      
4  "$col$cos2"       "cos2 for the columns"        
5  "$col$contrib"    "contributions of the columns"
6  "$row"            "results for the rows"        
7  "$row$coord"      "coord. for the rows"         
8  "$row$cos2"       "cos2 for the rows"           
9  "$row$contrib"    "contributions of the rows"   
10 "$call"           "summary called parameters"   
11 "$call$marge.col" "weights of the columns"      
12 "$call$marge.row" "weights of the rows"

# eigenvalue
head(res.ca$eig)[, 1:2]

        eigenvalue percentage of variance
dim 1 5.428893e-01           4.869222e+01
dim 2 4.450028e-01           3.991269e+01
dim 3 1.270484e-01           1.139509e+01
dim 4 5.119700e-33           4.591904e-31

# barplot of percentage of variance
barplot(res.ca$eig[,2], names.arg = rownames(res.ca$eig))

# Plot row points
plot(res.ca, invisible ="col")

# Plot column points (hide the rows)
plot(res.ca, invisible ="row")

# Biplot of rows and columns
plot(res.ca)

Infos

This analysis has been performed using R software (ver. 3.1.2), FactoMineR (ver. 1.29) and factoextra (ver. 1.0.2)

Phillip M. Yelland. An introduction to correspondence analysis. Mathematica Journal. 2010. http://www.mathematica-journal.com/data/uploads/2010/09/Yelland.pdf
Ricco RAKOTOMALALA (article in French). Analyse factorielle des correspondances. University Lyon 2. http://eric.univ-lyon2.fr/~ricco/cours/slides/AFC.pdf
Bendixen M. 2003, A Practical Guide to the Use of Correspondence Analysis in Marketing Research, Marketing Bulletin, 2003, 14, Technical Note 2. http://marketing-bulletin.massey.ac.nz/V14/MB_V14_T2_Bendixen.pdf

Principal component analysis in R : prcomp() vs. princomp() - R software and data mining

Mon, 25 May 2015 09:13:12 +0200

Packages in R for principal component analysis
prcomp() and princomp() functions
Install factoextra for visualization
Prepare the data
Use the R function prcomp() for PCA
Variances of the principal components
Graph of variables : The correlation circle
Graph of individuals
Prediction using Principal Component Analysis
Infos

The basics of Principal Component Analysis (PCA) have already been described in my previous article : PCA basics.

This R tutorial describes how to perform a Principal Component Analysis (PCA) using the built-in R functions prcomp() and princomp().

You will learn how to :

determine the number of components to retain for summarizing the information in your data
calculate the coordinates, the cos2 and the contribution of variables
calculate the coordinates, the cos2 and the contribution of individuals
interpret the correlation circle of PCA
make a prediction with PCA

Packages in R for principal component analysis

There are two general methods to perform PCA in R :

Spectral decomposition, which works on the covariance / correlation matrix of the variables
Singular value decomposition, which works directly on the (centered and possibly scaled) data matrix

The singular value decomposition method is the preferred analysis for numerical accuracy.

There are several functions from different packages for performing PCA :

The functions prcomp() and princomp() from the built-in R stats package
PCA() from FactoMineR package. Read more here : PCA with FactoMineR
dudi.pca() from ade4 package. Read more here : PCA with ade4

The functions prcomp() and princomp() are described in the next section.

prcomp() and princomp() functions

The function princomp() uses the spectral decomposition approach.

The functions prcomp() and PCA()[FactoMineR] use the singular value decomposition (SVD).

According to R help, SVD has slightly better numerical accuracy. Therefore, prcomp() is the preferred function.
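Both approaches lead to the same eigenvalues on well-behaved data. As a rough illustration, the sketch below compares them on the built-in USArrests data set (my choice of example data, not the data used in this article).

```r
# Built-in data set, used here only for illustration
X <- USArrests

pc.svd <- prcomp(X, scale. = TRUE)   # singular value decomposition
pc.eig <- princomp(X, cor = TRUE)    # spectral decomposition
ev     <- eigen(cor(X))$values       # eigenvalues of the correlation matrix

# Squared standard deviations of the components are the eigenvalues
pc.svd$sdev^2
unname(pc.eig$sdev^2)
ev
```

The difference between the two functions is in how the result is reached numerically, not in what is estimated.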

The simplified formats of these two functions are :

prcomp(x, scale = FALSE)

princomp(x, cor = FALSE, scores = TRUE)

Arguments for prcomp() :

x : a numeric matrix or data frame
scale : a logical value indicating whether the variables should be scaled to have unit variance before the analysis takes place

Arguments for princomp() :

x : a numeric matrix or data frame
cor : a logical value. If TRUE, the analysis uses the correlation matrix, which amounts to centering and scaling the data before the analysis
scores : a logical value. If TRUE, the coordinates on each principal component are calculated

The elements of the outputs returned by the functions prcomp() and princomp() include :

prcomp() name	princomp() name	Description
sdev	sdev	the standard deviations of the principal components
rotation	loadings	the matrix of variable loadings (columns are eigenvectors)
center	center	the variable means (means that were subtracted)
scale	scale	the variable standard deviations (the scalings applied to each variable )
x	scores	The coordinates of the individuals (observations) on the principal components.

In the following sections, we’ll focus only on the function prcomp().

Install factoextra for visualization

The package factoextra is used for the visualization of the principal component analysis results.

factoextra can be installed and loaded as follows :

# install.packages("devtools")
devtools::install_github("kassambara/factoextra")

# load
library("factoextra")

Prepare the data

We’ll use the data set decathlon2 from the package factoextra :

library("factoextra")
data(decathlon2)

This data is a subset of the decathlon data in the FactoMineR package.

The data describes athletes’ performance during two sporting events (Decastar and OlympicG). It contains 27 individuals (athletes) described by 13 variables.

Only some of these individuals and variables will be used to perform the principal component analysis (PCA).

The coordinates of the remaining individuals and variables on the factor map will be predicted after the PCA.

In PCA terminology, our data contains :

Active individuals (in blue, rows 1:23) : Individuals that are used during the principal component analysis.
Supplementary individuals (in green, rows 24:27) : The coordinates of these individuals will be predicted using the PCA information and parameters obtained with active individuals/variables
Active variables (in pink, columns 1:10) : Variables that are used for the principal component analysis.
Supplementary variables : As with supplementary individuals, the coordinates of these variables will also be predicted.
Supplementary continuous variables : Columns 11 and 12 corresponding respectively to the rank and the points of athletes.
Supplementary qualitative variables : Column 13 corresponding to the two athletic meetings (2004 Olympic Games or 2004 Decastar). This factor variable will be used to color individuals by groups.

Extract only active individuals and variables for principal component analysis:

decathlon2.active <- decathlon2[1:23, 1:10]
head(decathlon2.active[, 1:6])

          X100m Long.jump Shot.put High.jump X400m X110m.hurdle
SEBRLE    11.04      7.58    14.83      2.07 49.81        14.69
CLAY      10.76      7.40    14.26      1.86 49.37        14.05
BERNARD   11.02      7.23    14.25      1.92 48.93        14.99
YURKOV    11.34      7.09    15.19      2.10 50.42        15.31
ZSIVOCZKY 11.13      7.30    13.48      2.01 48.62        14.17
McMULLEN  10.83      7.31    13.76      2.13 49.91        14.38

Use the R function prcomp() for PCA

res.pca <- prcomp(decathlon2.active, scale = TRUE)

The values returned by the function prcomp() are :

names(res.pca)

[1] "sdev"     "rotation" "center"   "scale"    "x"

sdev : the standard deviations of the principal components (the square roots of the eigenvalues)

head(res.pca$sdev)

[1] 2.0308159 1.3559244 1.1131668 0.9052294 0.8375875 0.6502944

rotation : the matrix of variable loadings (columns are eigenvectors)

head(unclass(res.pca$rotation)[, 1:4])

                    PC1         PC2        PC3         PC4
X100m        -0.4188591  0.13230683 -0.2708996  0.03708806
Long.jump     0.3910648 -0.20713320  0.1711752 -0.12746997
Shot.put      0.3613881 -0.06298590 -0.4649778  0.14191803
High.jump     0.3004132  0.34309742 -0.2965280  0.15968342
X400m        -0.3454786 -0.21400770 -0.2547084  0.47592968
X110m.hurdle -0.3762651  0.01824645 -0.4032525 -0.01866477

center, scale : the centering and scaling used, or FALSE

Variances of the principal components

The variance retained by each principal component can be obtained as follows :

# Eigenvalues
eig <- (res.pca$sdev)^2

# Variances in percentage
variance <- eig*100/sum(eig)

# Cumulative variances
cumvar <- cumsum(variance)

eig.decathlon2.active <- data.frame(eig = eig, variance = variance,
                     cumvariance = cumvar)
head(eig.decathlon2.active)

        eig  variance cumvariance
1 4.1242133 41.242133    41.24213
2 1.8385309 18.385309    59.62744
3 1.2391403 12.391403    72.01885
4 0.8194402  8.194402    80.21325
5 0.7015528  7.015528    87.22878
6 0.4228828  4.228828    91.45760

Note that you can use the function summary() to extract the eigenvalues and variances from an object of class prcomp.

summary(res.pca)

You can also use the package factoextra. It’s simple :

library("factoextra")
eig.val <- get_eigenvalue(res.pca)
head(eig.val)

      eigenvalue variance.percent cumulative.variance.percent
Dim.1  4.1242133        41.242133                    41.24213
Dim.2  1.8385309        18.385309                    59.62744
Dim.3  1.2391403        12.391403                    72.01885
Dim.4  0.8194402         8.194402                    80.21325
Dim.5  0.7015528         7.015528                    87.22878
Dim.6  0.4228828         4.228828                    91.45760

What do eigenvalues mean?

Recall that eigenvalues measure the variability retained by each PC. They are large for the first PCs and small for the subsequent PCs.

The importance of principal components (PCs) can be visualized with a scree plot.

Scree plot using base graphics :

barplot(eig.decathlon2.active[, 2], names.arg=1:nrow(eig.decathlon2.active), 
       main = "Variances",
       xlab = "Principal Components",
       ylab = "Percentage of variances",
       col ="steelblue")
# Add connected line segments to the plot
lines(x = 1:nrow(eig.decathlon2.active), 
      eig.decathlon2.active[, 2], 
      type="b", pch=19, col = "red")

About 60% of the information (variance) contained in the data is retained by the first two principal components.

Scree plot using factoextra :

fviz_screeplot(res.pca, ncp=10)

It’s also possible to visualize the eigenvalues instead of the variances :

fviz_screeplot(res.pca, ncp=10, choice="eigenvalue")

Graph of variables : The correlation circle

A simple method to extract the results, for variables, from a PCA output is to use the function get_pca_var() [factoextra]. This function provides a list of matrices containing all the results for the active variables (coordinates, correlation between variables and axes, squared cosine and contributions)

var <- get_pca_var(res.pca)
var

Principal Component Analysis Results for variables
 ===================================================
  Name       Description                                    
1 "$coord"   "Coordinates for the variables"                
2 "$cor"     "Correlations between variables and dimensions"
3 "$cos2"    "Cos2 for the variables"                       
4 "$contrib" "contributions of the variables"

# Coordinates of variables
var$coord[, 1:4]

                    Dim.1       Dim.2       Dim.3       Dim.4
X100m        -0.850625692  0.17939806 -0.30155643  0.03357320
Long.jump     0.794180641 -0.28085695  0.19054653 -0.11538956
Shot.put      0.733912733 -0.08540412 -0.51759781  0.12846837
High.jump     0.610083985  0.46521415 -0.33008517  0.14455012
X400m        -0.701603377 -0.29017826 -0.28353292  0.43082552
X110m.hurdle -0.764125197  0.02474081 -0.44888733 -0.01689589
Discus        0.743209016 -0.04966086 -0.17652518  0.39500915
Pole.vault   -0.217268042 -0.80745110 -0.09405773 -0.33898477
Javeline      0.428226639 -0.38610928 -0.60412432 -0.33173454
X1500m        0.004278487 -0.78448019  0.21947068  0.44800961

In this section I’ll show you, step by step, how to calculate the coordinates, the cos2 and the contribution of variables.

Coordinates of variables on the principal components

The correlation between variables and principal components is used as coordinates. It can be calculated as follows :

Variable correlations with PCs = loadings * the component standard deviations.

# Helper function : 
# Correlation between variables and principal components
var_cor_func <- function(var.loadings, comp.sdev){
  var.loadings*comp.sdev
  }

# Variable correlation/coordinates
loadings <- res.pca$rotation
sdev <- res.pca$sdev

var.coord <- var.cor <- t(apply(loadings, 1, var_cor_func, sdev))
head(var.coord[, 1:4])

                    PC1         PC2        PC3         PC4
X100m        -0.8506257  0.17939806 -0.3015564  0.03357320
Long.jump     0.7941806 -0.28085695  0.1905465 -0.11538956
Shot.put      0.7339127 -0.08540412 -0.5175978  0.12846837
High.jump     0.6100840  0.46521415 -0.3300852  0.14455012
X400m        -0.7016034 -0.29017826 -0.2835329  0.43082552
X110m.hurdle -0.7641252  0.02474081 -0.4488873 -0.01689589
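That loadings multiplied by the component standard deviations really are correlations can be checked directly with cor(). The sketch below does so on the built-in USArrests data set (my choice of example data); it relies on the PCA being run on scaled variables, as in this article.

```r
# Built-in data set, used here only for illustration
res <- prcomp(USArrests, scale. = TRUE)

# Loadings times component standard deviations (column by column)
var.coord <- sweep(res$rotation, 2, res$sdev, "*")

# Direct correlations between the variables and the component scores
var.cor <- cor(USArrests, res$x)

round(var.coord, 3)
round(var.cor, 3)
```

sweep() with MARGIN = 2 multiplies each column of the loadings matrix by the matching standard deviation, which is a compact alternative to the row-wise apply() used above.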

Graph of variables using R base graph

# Plot the correlation circle
a <- seq(0, 2*pi, length = 100)
plot( cos(a), sin(a), type = 'l', col="gray",
      xlab = "PC1",  ylab = "PC2")

abline(h = 0, v = 0, lty = 2)

# Add active variables
arrows(0, 0, var.coord[, 1], var.coord[, 2], 
      length = 0.1, angle = 15, code = 2)

# Add labels
text(var.coord, labels=rownames(var.coord), cex = 1, adj=1)

Graph of variables using factoextra

fviz_pca_var(res.pca)

Read more about the function fviz_pca_var() : Graph of variables - Principal Component Analysis

How to interpret the correlation plot?

The graph of variables shows the relationships between all variables :

Positively correlated variables are grouped together.
Negatively correlated variables are positioned on opposite sides of the plot origin (opposed quadrants).
The distance between a variable and the origin measures the quality of the variable on the factor map. Variables that are far from the origin are well represented on the factor map.

Cos2 : quality of representation for variables on the factor map

The cos2 of variables are calculated as the squared coordinates : var.cos2 = var.coord * var.coord

var.cos2 <- var.coord^2
head(var.cos2[, 1:4])

                   PC1          PC2        PC3          PC4
X100m        0.7235641 0.0321836641 0.09093628 0.0011271597
Long.jump    0.6307229 0.0788806285 0.03630798 0.0133147506
Shot.put     0.5386279 0.0072938636 0.26790749 0.0165041211
High.jump    0.3722025 0.2164242070 0.10895622 0.0208947375
X400m        0.4922473 0.0842034209 0.08039091 0.1856106269
X110m.hurdle 0.5838873 0.0006121077 0.20149984 0.0002854712

Using the factoextra package, the color of variables can be automatically controlled by the value of their cos2.

fviz_pca_var(res.pca, col.var="cos2") +
scale_color_gradient2(low="white", mid="blue", 
      high="red", midpoint=0.5) + theme_minimal()

Contributions of the variables to the principal components

The contribution of a variable to a given principal component is (in percentage) : (var.cos2 * 100) / (total cos2 of the component)

comp.cos2 <- apply(var.cos2, 2, sum)

contrib <- function(var.cos2, comp.cos2){var.cos2*100/comp.cos2}

var.contrib <- t(apply(var.cos2,1, contrib, comp.cos2))
head(var.contrib[, 1:4])

                   PC1        PC2       PC3         PC4
X100m        17.544293  1.7505098  7.338659  0.13755240
Long.jump    15.293168  4.2904162  2.930094  1.62485936
Shot.put     13.060137  0.3967224 21.620432  2.01407269
High.jump     9.024811 11.7715838  8.792888  2.54987951
X400m        11.935544  4.5799296  6.487636 22.65090599
X110m.hurdle 14.157544  0.0332933 16.261261  0.03483735
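Two properties follow from this definition: the contributions to each component sum to 100, and they are equal to 100 times the squared loadings. The sketch below checks both on the built-in USArrests data set (my choice of example data), using sweep() as a vectorized alternative to the helper function above.

```r
# Built-in data set, used here only for illustration
res <- prcomp(USArrests, scale. = TRUE)

var.coord <- sweep(res$rotation, 2, res$sdev, "*")
var.cos2  <- var.coord^2

# Contribution in %: cos2 divided by the total cos2 of the component
var.contrib <- 100 * sweep(var.cos2, 2, colSums(var.cos2), "/")

round(var.contrib, 2)
colSums(var.contrib)
```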

Highlight the most important (i.e., contributing) variables :

fviz_pca_var(res.pca, col.var="contrib") +
scale_color_gradient2(low="white", mid="blue", 
      high="red", midpoint=50) + theme_minimal()

You can also use the function fviz_contrib() described here : Principal Component Analysis: How to reveal the most important variables in your data?

Graph of individuals

Coordinates of individuals on the principal components

ind.coord <- res.pca$x
head(ind.coord[, 1:4])

                 PC1        PC2        PC3         PC4
SEBRLE     0.1912074 -1.5541282 -0.6283688  0.08205241
CLAY       0.7901217 -2.4204156  1.3568870  1.26984296
BERNARD   -1.3292592 -1.6118687 -0.1961500 -1.92092203
YURKOV    -0.8694134  0.4328779 -2.4739822  0.69723814
ZSIVOCZKY -0.1057450  2.0233632  1.3049312 -0.09929630
McMULLEN   0.1185550  0.9916237  0.8435582  1.31215266

Cos2 : quality of representation for individuals on the principal components

To calculate the cos2 of individuals, two simple steps are required :

Calculate the squared distance between each individual and the PCA center of gravity

d2 = [(var1_ind_i - mean_var1)/sd_var1]^2 + … + [(var10_ind_i - mean_var10)/sd_var10]^2

Calculate the cos2 = ind.coord^2/d2

# Compute the square of the distance between an individual and the
# center of gravity
center <- res.pca$center
scale<- res.pca$scale
getdistance <- function(ind_row, center, scale){
  return(sum(((ind_row-center)/scale)^2))
  }
d2 <- apply(decathlon2.active,1,getdistance, center, scale)

# Compute the cos2
cos2 <- function(ind.coord, d2){return(ind.coord^2/d2)}
ind.cos2 <- apply(ind.coord, 2, cos2, d2)
head(ind.cos2[, 1:4])

                  PC1        PC2         PC3         PC4
SEBRLE    0.007530179 0.49747323 0.081325232 0.001386688
CLAY      0.048701249 0.45701660 0.143628117 0.125791741
BERNARD   0.197199804 0.28996555 0.004294015 0.411819183
YURKOV    0.096109800 0.02382571 0.778230322 0.061812637
ZSIVOCZKY 0.001574385 0.57641944 0.239754152 0.001388216
McMULLEN  0.002175437 0.15219499 0.110137872 0.266486530

The sum of each row is 1 if we consider all 10 components.
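This row-sum property can be verified on any data set. The sketch below uses the built-in USArrests data (my choice of example data) and computes the squared distances with scale() and rowSums() rather than a row-wise helper.

```r
# Built-in data set, used here only for illustration
res <- prcomp(USArrests, scale. = TRUE)

# Standardize with the center and scale stored in the PCA object
Z  <- scale(USArrests, center = res$center, scale = res$scale)
d2 <- rowSums(Z^2)        # squared distance to the center of gravity

ind.cos2 <- res$x^2 / d2  # each row is divided by its own d2
head(round(ind.cos2, 3))
range(rowSums(ind.cos2))  # all equal to 1 when every component is kept
```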

Contribution of individuals to the principal components

The contribution of individuals (in percentage) to the principal components can be computed as follows :

100 * (1 / number_of_individuals)*(ind.coord^2 / comp_sdev^2)

# Contributions of individuals
contrib <- function(ind.coord, comp.sdev, n.ind){
  100*(1/n.ind)*ind.coord^2/comp.sdev^2
}

ind.contrib <- t(apply(ind.coord,1, contrib, 
                       res.pca$sdev, nrow(ind.coord)))
head(ind.contrib[, 1:4])

                 PC1        PC2        PC3         PC4
SEBRLE    0.03854254  5.7118249  1.3854184  0.03572215
CLAY      0.65814114 13.8541889  6.4600973  8.55568792
BERNARD   1.86273218  6.1441319  0.1349983 19.57827284
YURKOV    0.79686310  0.4431309 21.4755770  2.57939100
ZSIVOCZKY 0.01178829  9.6816398  5.9748485  0.05231437
McMULLEN  0.01481737  2.3253860  2.4967890  9.13531719

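Note that prcomp() computes the component variances with the divisor n - 1, so the contributions computed this way sum to 100 * (n - 1)/n per column (about 95.7 here, with n = 23) rather than exactly 100. Dividing ind.coord^2 by its column totals instead gives contributions that sum to exactly 100.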
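Because prcomp() uses the divisor n - 1 for the component variances, it is worth checking the column totals of the contributions. The sketch below does this on the built-in USArrests data (my choice of example data, n = 50) and also shows a normalized variant whose columns sum to exactly 100; the names contrib.formula and contrib.norm are mine.

```r
# Built-in data set, used here only for illustration
res <- prcomp(USArrests, scale. = TRUE)
n <- nrow(res$x)

# Formula used in this section: 100 * (1/n) * coord^2 / sdev^2
contrib.formula <- 100 * (1 / n) * sweep(res$x^2, 2, res$sdev^2, "/")
colSums(contrib.formula)   # 100 * (n - 1) / n for every component

# Normalized variant: divide by the column totals of coord^2
contrib.norm <- 100 * sweep(res$x^2, 2, colSums(res$x^2), "/")
colSums(contrib.norm)      # exactly 100 for every component
```

The two versions differ only by the constant factor (n - 1)/n, so the ranking of individuals is the same.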

Graph of individuals : base graph

plot(ind.coord[,1], ind.coord[,2], pch = 19,  
     xlab="PC1 - 41.2%",ylab="PC2 - 18.4%")
abline(h=0, v=0, lty = 2)
text(ind.coord[,1], ind.coord[,2], labels=rownames(ind.coord),
        cex=0.7, pos = 3)

Biplot of individuals and variables :

biplot(res.pca, cex = 0.8, col = c("black", "red") )

Graph of individuals : factoextra

Extract the results for the individuals

factoextra provides, with less code, a list of matrices containing all the results for the active individuals (coordinates, square cosine, contributions).

ind <- get_pca_ind(res.pca)
ind

Principal Component Analysis Results for individuals
 ===================================================
  Name       Description                       
1 "$coord"   "Coordinates for the individuals" 
2 "$cos2"    "Cos2 for the individuals"        
3 "$contrib" "contributions of the individuals"

# Coordinates for individuals
head(ind$coord[, 1:4])

               Dim.1      Dim.2      Dim.3       Dim.4
SEBRLE     0.1912074 -1.5541282 -0.6283688  0.08205241
CLAY       0.7901217 -2.4204156  1.3568870  1.26984296
BERNARD   -1.3292592 -1.6118687 -0.1961500 -1.92092203
YURKOV    -0.8694134  0.4328779 -2.4739822  0.69723814
ZSIVOCZKY -0.1057450  2.0233632  1.3049312 -0.09929630
McMULLEN   0.1185550  0.9916237  0.8435582  1.31215266

Graph of individuals using factoextra

Note that, in the R code below, the argument data is required only when res.pca is an object of class princomp or prcomp (two functions from the built-in R stats package).

In other words, if res.pca is a result of PCA functions from FactoMineR or ade4 package, the argument data can be omitted.

Yes, factoextra can also handle the output of FactoMineR and ade4 packages.

Default individuals factor map :

fviz_pca_ind(res.pca)

Automatically control the color of individuals using the cos2 values (the quality of the individuals on the factor map) :

fviz_pca_ind(res.pca, col.ind="cos2") +
scale_color_gradient2(low="white", mid="blue", 
    high="red", midpoint=0.50) + theme_minimal()

Read more about fviz_pca_ind() : Graph of individuals - principal component analysis

Make a biplot of individuals and variables :

fviz_pca_biplot(res.pca,  geom = "text") +
  theme_minimal()

Read more about fviz_pca_biplot() : Biplot of individuals and variables - principal component analysis

Prediction using Principal Component Analysis

Supplementary quantitative variables

As described above, the data set decathlon2 contains some supplementary continuous variables in columns 11 and 12, corresponding respectively to the rank and the points of athletes.

# Data for the supplementary quantitative variables
quanti.sup <- decathlon2[1:23, 11:12, drop = FALSE]
head(quanti.sup)

          Rank Points
SEBRLE       1   8217
CLAY         2   8122
BERNARD      4   8067
YURKOV       5   8036
ZSIVOCZKY    7   8004
McMULLEN     8   7995

Recall that rows 24:27 are supplementary individuals. We don’t want them in the current analysis, which is why I extracted only rows 1:23.

In this section we’ll see how to calculate the predicted coordinates of these two variables using the information provided by the previously performed principal component analysis.

Two simple steps are required :

Calculate the correlation between each supplementary quantitative variable and the principal components
Make a factor map of all variables (active and supplementary ones) to visualize the position of the supplementary variables

The R code below can be used :

# Calculate the correlations between supplementary variables
# and the principal components
ind.coord <- res.pca$x
quanti.coord <- cor(quanti.sup, ind.coord)
head(quanti.coord[, 1:4])

              PC1         PC2        PC3         PC4
Rank   -0.7014777  0.24519443  0.1834294  0.05575186
Points  0.9637075 -0.07768262 -0.1580225 -0.16623092

# Variable factor maps
#++++++++++++++++++
# Plot the correlation circle
a <- seq(0, 2*pi, length = 100)
plot( cos(a), sin(a), type = 'l', col="gray",
      xlab = "PC1",  ylab = "PC2")
abline(h = 0, v = 0, lty = 2)
# Add active variables
var.coord <- get_pca_var(res.pca)$coord
arrows(0 ,0, x1=var.coord[,1], y1 = var.coord[,2], 
       col="black", length = 0.09)
text(var.coord[,1], var.coord[,2],
     labels=rownames(var.coord), cex=0.8)
# Add supplementary quantitative variables
arrows(0 ,0, x1= quanti.coord[,1], y1 = quanti.coord[,2], 
       col="blue", lty =2, length = 0.09)
text(quanti.coord[,1], quanti.coord[,2],
     labels=rownames(quanti.coord), cex=0.8, col ="blue")

It’s also possible to make the graph of variables using factoextra:

# Plot of active variables
p <- fviz_pca_var(res.pca)
# Add supplementary quantitative variables
fviz_add(p, quanti.coord, color ="blue", geom="arrow")

# get the cos2 of the supplementary quantitative variables
(quanti.coord^2)[, 1:4]

             PC1         PC2        PC3        PC4
Rank   0.4920710 0.060120310 0.03364635 0.00310827
Points 0.9287322 0.006034589 0.02497110 0.02763272

Supplementary qualitative variables

The data set decathlon2 contains a supplementary qualitative variable in column 13, corresponding to the type of competition.

Qualitative variables can be helpful for interpreting the data and for coloring individuals by groups :

# Data for the supplementary qualitative variables
quali.sup <- as.factor(decathlon2[1:23, 13])
head(quali.sup)

[1] Decastar Decastar Decastar Decastar Decastar Decastar
Levels: Decastar OlympicG

Color individuals by groups :

fviz_pca_ind(res.pca, 
  habillage = quali.sup, addEllipses = TRUE, ellipse.level = 0.68) +
  theme_minimal()

Note that the argument habillage is used to specify the variable containing the groups of individuals.

It’s very easy to get the coordinates for the levels of a supplementary qualitative variable. The helper function below can be used :

# Return the coordinates of the group levels
# x : coordinates of individuals on the x axis
# y : coordinates of individuals on the y axis
# groups : factor giving the group of each individual
get_coord_quali<-function(x, y, groups){
  data.frame(
    x= tapply(x, groups, mean),
    y = tapply(y, groups, mean)
  )
}

Calculate the coordinates on components 1 and 2 :

coord.quali <- get_coord_quali(ind.coord[,1], ind.coord[,2],
                               groups = quali.sup)
coord.quali

                 x          y
Decastar -1.313921 -0.1191322
OlympicG  1.204428  0.1092046
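As a cross-check of the helper above, the same group means can be computed with aggregate(). The sketch below is self-contained and therefore uses the built-in iris data with prcomp() instead of decathlon2; the object names (pca, scores, centers, by_tapply) are illustrative only.

```r
# Illustrative data: PCA on the four numeric iris columns
pca <- prcomp(iris[, -5], scale. = TRUE)
scores <- pca$x[, 1:2]
groups <- iris$Species

# Mean coordinates of each group level on PC1 and PC2
centers <- aggregate(scores, by = list(group = groups), FUN = mean)
centers

# Same computation with tapply(), one axis at a time
by_tapply <- data.frame(
  x = tapply(scores[, 1], groups, mean),
  y = tapply(scores[, 2], groups, mean)
)
by_tapply
```

Both approaches return one row per level; aggregate() is convenient when more than two components are needed at once.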

Supplementary individuals

The decathlon2 data set contains some supplementary individuals in rows 24 to 27.

# Data for the supplementary individuals
ind.sup <- decathlon2[24:27, 1:10, drop = FALSE]
ind.sup[, 1:6]

        X100m Long.jump Shot.put High.jump X400m X110m.hurdle
KARPOV  11.02      7.30    14.77      2.04 48.37        14.09
WARNERS 11.11      7.60    14.31      1.98 48.68        14.23
Nool    10.80      7.53    14.26      1.88 48.81        14.80
Drews   10.87      7.38    13.07      1.88 48.51        14.01

Remember that columns 11:13 are supplementary variables. We don’t want them in the current analysis, which is why I extracted only columns 1:10. I also used the argument drop = FALSE to preserve the type of the data (a data.frame).

In this section we’ll see how to predict the coordinates of the supplementary individuals using only the information provided by the previously performed principal component analysis.

A simple function to predict the coordinates of new individuals

One simple approach is to use the function predict() from the built-in R stats package :

ind.sup.coord <- predict(res.pca, newdata = ind.sup)
ind.sup.coord[, 1:4]

               PC1         PC2       PC3        PC4
KARPOV   0.7772521 -0.76237804 1.5971253  1.6863286
WARNERS -0.3779697  0.11891968 1.7005146 -0.6908084
Nool    -0.5468405 -1.93402211 0.4724184 -2.2283706
Drews   -1.0848227 -0.01703198 2.9818031 -1.5006207

Calculate the predicted coordinates by hand

Two simple steps are required :

Center and scale the values for the supplementary individuals using the center and the scale of the PCA.
Calculate the predicted coordinates by multiplying the scaled values with the eigenvectors (loadings) of the principal components.

The R code below can be used :

# Centering and scaling the supplementary individuals
scale_func <- function(ind_row, center, scale){
  (ind_row-center)/scale
}

ind.scaled <- t(apply(ind.sup, 1, scale_func, res.pca$center, res.pca$scale))

# Coordinates of the individuals
pca.loadings <- res.pca$rotation
coord_func <- function(ind, loadings){
  r <- loadings*ind
  r <- apply(r, 2, sum)
  r
}

ind.sup.coord <- t(apply(ind.scaled, 1, coord_func, pca.loadings ))
ind.sup.coord[, 1:4]

               PC1         PC2       PC3        PC4
KARPOV   0.7772521 -0.76237804 1.5971253  1.6863286
WARNERS -0.3779697  0.11891968 1.7005146 -0.6908084
Nool    -0.5468405 -1.93402211 0.4724184 -2.2283706
Drews   -1.0848227 -0.01703198 2.9818031 -1.5006207
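The two steps above can also be written as a single matrix product. The sketch below is self-contained, so it uses the built-in iris data rather than decathlon2: rows 1 to 140 play the role of active individuals and rows 141 to 150 the role of supplementary ones. This split and the object names are assumptions made for illustration.

```r
# Illustrative split: active rows are used for the PCA, the rest are projected
active <- iris[1:140, -5]
sup <- iris[141:150, -5]

pca <- prcomp(active, scale. = TRUE)

# Step 1: center and scale the new rows with the parameters of the PCA
sup.scaled <- scale(sup, center = pca$center, scale = pca$scale)

# Step 2: project the scaled rows onto the loadings
sup.coord <- sup.scaled %*% pca$rotation

# The result should agree with predict()
all.equal(sup.coord, predict(pca, newdata = sup), check.attributes = FALSE)
```

The matrix form avoids the row-by-row apply() calls and scales better when many individuals have to be projected.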

Make a factor map including the supplementary individuals using factoextra

# Plot of active individuals
p <- fviz_pca_ind(res.pca)
# Add supplementary individuals
fviz_add(p, ind.sup.coord, color ="blue")

Infos

This analysis has been performed using R software (ver. 3.1.2) and factoextra (ver. 1.0.2)

Gregory B. Anderson, principal component analysis in R, https://www.ime.usp.br/~pavan/pdf/MAE0330-PCA-R-2013

Principal component analysis : the basics you should read - R software and data mining

Mon, 25 May 2015 09:06:37 +0200

What is principal component analysis?
PCA basics
Main purpose of PCA
Basic statistics - Covariance between two variables
Covariance/correlation matrix
Interpretation of the covariance matrix
How to minimize the distortion in the data ?
PCA terminologies : Eigenvalues / eigenvectors
Steps for principal component analysis
Compute principal component analysis (step by step)
Packages in R for the principal component analysis
Infos

What is principal component analysis?

Principal component analysis (PCA) is used to summarize the information in a data set described by multiple variables.

Note that the information in a data set is the total variation it contains.

PCA reduces the dimensionality of data containing a large set of variables. This is achieved by transforming the initial variables into a new, smaller set of variables without losing the most important information in the original data set.

These new variables correspond to linear combinations of the original variables and are called principal components.

This article describes, step by step, how PCA works using R software.

PCA basics

Understanding the details of PCA requires knowledge of linear algebra. In this section, we’ll explain the basics with simple graphical representation of the data.

In Figure 1A below, the data are represented in the X-Y coordinate system. The dimension reduction is achieved by identifying the principal directions, called principal components, along which the data vary.

PCA assumes that the directions with the largest variances are the most “important” (i.e., the most principal).

In the figure below, the PC1 axis is the first principal direction along which the samples show the largest variation. The PC2 axis is the second most important direction and it is orthogonal to the PC1 axis.

The dimensionality of our two-dimensional data can be reduced to a single dimension by projecting each sample onto the first principal component (Figure 1B).

Main purpose of PCA

The main goals of principal component analysis are :

to identify hidden patterns in a data set
to reduce the dimensionality of the data by removing the noise and redundancy in the data
to identify correlated variables

The PCA method is particularly useful when the variables within the data set are highly correlated.

Correlation indicates that there is redundancy in the data. Due to this redundancy, PCA can be used to reduce the original variables into a smaller number of new variables ( = principal components) explaining most of the variance in the original variables.

How to remove the redundancy?

PCA is traditionally performed on the covariance matrix or the correlation matrix.

Basic statistics - Covariance between two variables

Let x and y be two variables, each with n observations.

The variance of x is :

\[\sigma^2_{xx} = \frac{\sum_i(x_i - m_x)(x_i - m_x)}{n - 1}\]

The variance of y is :

\[\sigma^2_{yy} = \frac{\sum_i(y_i - m_y)(y_i - m_y)}{n - 1}\]

The covariance of x and y is :

\[\sigma^2_{xy} = \frac{\sum_i(x_i - m_x)(y_i - m_y)}{n - 1}\]

$m_x$ and $m_y$ are the means of x and y variables, respectively.

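The covariance measures the degree of the linear relationship between x and y: it is positive when the two variables tend to increase together and negative when one tends to decrease as the other increases.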
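The formulas above can be checked numerically. The sketch below is a minimal illustration that picks two columns of the built-in iris data (an arbitrary choice) and compares the hand computation with the built-in var() and cov() functions.

```r
x <- iris$Sepal.Length
y <- iris$Petal.Length
n <- length(x)

# Variance of x and covariance of x and y, following the formulas above
var.x <- sum((x - mean(x)) * (x - mean(x))) / (n - 1)
cov.xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Compare with the built-in functions
c(by.hand = var.x, builtin = var(x))
c(by.hand = cov.xy, builtin = cov(x, y))
```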

Covariance/correlation matrix

A covariance matrix contains the covariances between all possible pairs of variables in the data set :

df <- iris[, -5]
res.cov <- cov(df)
round(res.cov,2)

             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length         0.69       -0.04         1.27        0.52
Sepal.Width         -0.04        0.19        -0.33       -0.12
Petal.Length         1.27       -0.33         3.12        1.30
Petal.Width          0.52       -0.12         1.30        0.58

Note that the covariance matrix is symmetric. In the table above, the covariance between Sepal.Length and Sepal.Width equals the covariance between Sepal.Width and Sepal.Length (-0.04).

Interpretation of the covariance matrix

The diagonal elements are the variances of the different variables. Large diagonal values correspond to a strong signal.

diag(res.cov)

Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
   0.6856935    0.1899794    3.1162779    0.5810063

The off-diagonal values are the covariances between variables. They reflect distortions in the data (noise, redundancy, …). Large off-diagonal values correspond to high distortions in our data.

The aim of PCA is to minimize these distortions and to summarize the essential information in the data.

How to minimize the distortion in the data ?

In the covariance table above, the off-diagonal values are different from zero. This indicates the presence of redundancy in the data. In other words, there is a certain amount of correlation between variables.

This kind of matrix, with non-zero off-diagonal values, is called a “non-diagonal” matrix.

We need to redefine our initial variables (x, y, z, …) in order to diagonalize the covariance matrix.

This means that we want to change the covariance matrix so that the off-diagonal elements are zero (i.e., zero correlation between pairs of distinct variables).

The new variables (x’, y’, z’, …) are linear combinations of the old ones :

\[X' = a_1X + a_2Y + a_3Z + \dots\]

\[Y' = b_1X + b_2Y + b_3Z + \dots\]

In PCA, the constants a1, a2, …, an and b1, b2, …, bn are calculated such that the covariance matrix of the new variables is diagonal.
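To make this concrete, the sketch below builds the new variables for the iris data and checks that their covariance matrix is diagonal. It takes the constants from eigen(), which is described in the next section; the names A and new.vars are illustrative.

```r
C <- cov(iris[, -5])

# Column 1 of A holds (a1, a2, ...), column 2 holds (b1, b2, ...), and so on
A <- eigen(C)$vectors

# New variables: linear combinations of the original ones
new.vars <- as.matrix(iris[, -5]) %*% A

# Their covariance matrix is diagonal (off-diagonal values are zero up to rounding)
C.new <- cov(new.vars)
round(C.new, 6)
```

The diagonal of C.new contains the variances of the new variables, which are the eigenvalues discussed next.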

PCA terminologies : Eigenvalues / eigenvectors

Eigenvalues : The numbers on the diagonal of the diagonalized covariance matrix are called eigenvalues of the covariance matrix. Large eigenvalues correspond to large variances.

Eigenvectors : The directions of the new rotated axes are called the eigenvectors of the covariance matrix.

Eigenvalues and eigenvectors can be easily calculated in R as follows :

eigen(res.cov)

$values
[1] 4.22824171 0.24267075 0.07820950 0.02383509

$vectors
            [,1]        [,2]        [,3]       [,4]
[1,]  0.36138659 -0.65658877 -0.58202985  0.3154872
[2,] -0.08452251 -0.73016143  0.59791083 -0.3197231
[3,]  0.85667061  0.17337266  0.07623608 -0.4798390
[4,]  0.35828920  0.07548102  0.54583143  0.7536574

The first principal components of the data are the directions that explain the largest share of the variance. They are given by the eigenvectors of the covariance matrix associated with the largest eigenvalues.
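Since each eigenvalue is the variance along its eigenvector, the share of the total variance carried by each direction follows directly from the eigenvalues. The sketch below is a minimal illustration on the iris covariance matrix; with the eigenvalues printed above, the first direction carries roughly 92% of the total variance (4.228 out of about 4.573).

```r
res.cov <- cov(iris[, -5])
eig <- eigen(res.cov)

# Total variance = sum of the diagonal of the covariance matrix
#                = sum of the eigenvalues
total.var <- sum(diag(res.cov))
all.equal(total.var, sum(eig$values))

# Percentage of the total variance carried by each eigenvector
prop.var <- 100 * eig$values / total.var
round(prop.var, 1)

# Eigenvectors have unit length and are mutually orthogonal
round(crossprod(eig$vectors), 6)
```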

Steps for principal component analysis

The procedure includes 5 simple steps :

Prepare the data :

Center the data : subtract the mean from each variable. This produces a data set whose mean is zero.
Scale the data : if the variances of the variables in your data are significantly different, it’s a good idea to scale the data to unit variance. This is achieved by dividing each variable by its standard deviation.

Calculate the covariance/correlation matrix
Calculate the eigenvectors and the eigenvalues of the covariance matrix
Choose principal components : eigenvectors are ordered by eigenvalues from the highest to the lowest. The number of chosen eigenvectors will be the number of dimensions of the new data set. eigenvectors = (eig_1, eig_2,…, eig_n)
Compute the new dataset :

transpose the eigenvectors : rows are eigenvectors
transpose the adjusted data (rows are variables and columns are individuals)
new.data = eigenvectors.transposed × adjustedData.transposed (matrix product)

Compute principal component analysis (step by step)

The data set iris is used : columns are variables and rows are observations:

df <- iris[, -5]
head(df)

  Sepal.Length Sepal.Width Petal.Length Petal.Width
1          5.1         3.5          1.4         0.2
2          4.9         3.0          1.4         0.2
3          4.7         3.2          1.3         0.2
4          4.6         3.1          1.5         0.2
5          5.0         3.6          1.4         0.2
6          5.4         3.9          1.7         0.4

1. Center and scale the data

df.scaled <- scale(df, center = TRUE, scale = TRUE)

2. Compute the correlation matrix :

# Correlation matrix
res.cor <- cor(df.scaled)
round(res.cor, 2)

             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length         1.00       -0.12         0.87        0.82
Sepal.Width         -0.12        1.00        -0.43       -0.37
Petal.Length         0.87       -0.43         1.00        0.96
Petal.Width          0.82       -0.37         0.96        1.00

3. Calculate the eigenvectors/eigenvalues of the correlation matrix :

# Calculate eigenvectors/eigenvalues
res.eig <- eigen(res.cor)
res.eig

$values
[1] 2.91849782 0.91403047 0.14675688 0.02071484

$vectors
           [,1]        [,2]       [,3]       [,4]
[1,]  0.5210659 -0.37741762  0.7195664  0.2612863
[2,] -0.2693474 -0.92329566 -0.2443818 -0.1235096
[3,]  0.5804131 -0.02449161 -0.1421264 -0.8014492
[4,]  0.5648565 -0.06694199 -0.6342727  0.5235971

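4. Choose the principal components : the eigenvalues returned by eigen() are sorted in decreasing order. The first eigenvalue (2.9) is much larger than the second (0.9), which is itself much larger than the last two. The largest eigenvalues correspond to the first principal components of the data. All four eigenvectors are kept in the code below.

5. Compute the new dataset :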

# Transpose eigenvectors
eigenvectors.t <- t(res.eig$vectors)
# Transpose the adjusted data
df.scaled.t <- t(df.scaled)
# The new dataset
df.new <- eigenvectors.t %*% df.scaled.t
# Transpose the new data and rename columns
df.new <- t(df.new)
colnames(df.new) <- c("PC1", "PC2", "PC3", "PC4")
head(df.new)

           PC1        PC2         PC3          PC4
[1,] -2.257141 -0.4784238  0.12727962  0.024087508
[2,] -2.074013  0.6718827  0.23382552  0.102662845
[3,] -2.356335  0.3407664 -0.04405390  0.028282305
[4,] -2.291707  0.5953999 -0.09098530 -0.065735340
[5,] -2.381863 -0.6446757 -0.01568565 -0.035802870
[6,] -2.068701 -1.4842053 -0.02687825  0.006586116
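A quick way to validate the step-by-step result is to compare it with prcomp() from the built-in stats package. The sketch below repeats the computation in compact form; the names by.hand and pca are illustrative. Eigenvectors are only defined up to their sign, so the scores are compared in absolute value.

```r
df.scaled <- scale(iris[, -5], center = TRUE, scale = TRUE)
res.eig <- eigen(cor(df.scaled))

# Same result as transposing, multiplying and transposing back
by.hand <- df.scaled %*% res.eig$vectors

pca <- prcomp(iris[, -5], center = TRUE, scale. = TRUE)

# Eigenvalues are the squared standard deviations of the components
all.equal(res.eig$values, pca$sdev^2)

# Scores agree up to the sign of each column
all.equal(abs(by.hand), abs(pca$x), check.attributes = FALSE)
```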

Packages in R for the principal component analysis

There are several functions from different packages for performing PCA :

The functions prcomp() and princomp() from the built-in R stats package. Read more here: prcomp and princomp
PCA() from FactoMineR package. Read more here : PCA with FactoMineR
dudi.pca() from ade4 package. Read more here : PCA with ade4

Infos

This analysis has been performed using R software (ver. 3.1.2) and ggplot2 (ver. 1.0.0)

Gregory B. Anderson, principal component analysis in R, https://www.ime.usp.br/~pavan/pdf/MAE0330-PCA-R-2013
Wikibooks, http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Dimensionality_Reduction/Principal_Component_Analysis
Carlos Pinto, Data reduction, https://medicine.tcd.ie/neuropsychiatric-genetics/assets/pdf/2009_7_PCA_+_Factor_analyses.pdf

ade4 and factoextra : Principal Component Analysis - R software and data mining

Sun, 15 Mar 2015 08:43:27 +0100

Required packages
Prepare the data
Principal component analysis
Variances of the principal components
Graph of variables : the circle of correlations
Graph of individuals
Principal component analysis using supplementary individuals and variables
- Supplementary individuals
- Supplementary quantitative variables
Infos

This R tutorial describes how to perform a Principal Component Analysis (PCA) using R software and ade4 package.

Required packages

The package ade4 can be installed and loaded as follows :

install.packages("ade4")

library("ade4")

The package factoextra is used for the visualization of the principal component analysis results.

factoextra can be installed as follows :

# install.packages("devtools")
devtools::install_github("kassambara/factoextra")

Load it :

library("factoextra")

Prepare the data

We’ll use the decathlon2 data set from the package factoextra :

library("factoextra")

data(decathlon2)
head(decathlon2[, 1:6])

           X100m Long.jump Shot.put High.jump X400m X110m.hurdle
SEBRLE     11.04      7.58    14.83      2.07 49.81        14.69
CLAY       10.76      7.40    14.26      1.86 49.37        14.05
BERNARD    11.02      7.23    14.25      1.92 48.93        14.99
YURKOV     11.34      7.09    15.19      2.10 50.42        15.31
ZSIVOCZKY  11.13      7.30    13.48      2.01 48.62        14.17
McMULLEN   10.83      7.31    13.76      2.13 49.91        14.38

This data set is a subset of the decathlon data in the FactoMineR package.

As illustrated below, the data used here describe athletes’ performance during two sporting events (Decastar and OlympicG). They contain 27 individuals (athletes) described by 13 variables.

Only some of these individuals and variables will be used to perform the principal component analysis (PCA).

The coordinates of the remaining individuals and variables on the factor map will be predicted after the PCA.

In PCA terminology, our data contains :

Active individuals (in blue, rows 1:23) : Individuals that are used during the principal component analysis.
Supplementary individuals (in green, rows 24:27) : The coordinates of these individuals will be predicted using the PCA information and parameters obtained with the active individuals/variables.
Active variables (in pink, columns 1:10) : Variables that are used for the principal component analysis.
Supplementary variables : As for supplementary individuals, the coordinates of these variables will also be predicted.
Supplementary continuous variables : Columns 11 and 12, corresponding respectively to the rank and the points of athletes.
Supplementary qualitative variables : Column 13, corresponding to the two athletic meetings (2004 Olympic Game or 2004 Decastar). This factor variable will be used to color individuals by groups.

Extract only active individuals and variables for principal component analysis:

decathlon2.active <- decathlon2[1:23, 1:10]
head(decathlon2.active[, 1:6])

           X100m Long.jump Shot.put High.jump X400m X110m.hurdle
SEBRLE     11.04      7.58    14.83      2.07 49.81        14.69
CLAY       10.76      7.40    14.26      1.86 49.37        14.05
BERNARD    11.02      7.23    14.25      1.92 48.93        14.99
YURKOV     11.34      7.09    15.19      2.10 50.42        15.31
ZSIVOCZKY  11.13      7.30    13.48      2.01 48.62        14.17
McMULLEN   10.83      7.31    13.76      2.13 49.91        14.38

Principal component analysis

The function dudi.pca() [in ade4 package] can be used. A simplified format is :

dudi.pca(df, center = TRUE,  scale = TRUE, 
         scannf = TRUE, nf = 2)

df : a data frame. Rows are individuals and columns are numeric variables
center : a logical value specifying whether the variables should be shifted to be zero centered.
scale : a logical value. If TRUE, the data are scaled to unit variance before the analysis. This standardization to the same scale prevents some variables from becoming dominant just because of their large measurement units.
scannf : a logical value specifying whether the scree plot should be displayed
nf : number of dimensions kept in the final results.

In the R code below, the PCA is performed only on the active individuals/variables :

library("ade4")
res.pca <- dudi.pca(decathlon2.active, scannf = FALSE, nf = 5)

Variances of the principal components

Extract the eigenvalues

Eigenvalues measure the amount of variation retained by a principal component :

summary(res.pca)

Class: pca dudi
Call: dudi.pca(df = decathlon2.active, scannf = FALSE, nf = 5)

Total inertia: 10

Eigenvalues:
    Ax1     Ax2     Ax3     Ax4     Ax5 
 4.1242  1.8385  1.2391  0.8194  0.7016 

Projected inertia (%):
    Ax1     Ax2     Ax3     Ax4     Ax5 
 41.242  18.385  12.391   8.194   7.016 

Cumulative projected inertia (%):
    Ax1   Ax1:2   Ax1:3   Ax1:4   Ax1:5 
  41.24   59.63   72.02   80.21   87.23 

(Only 5 dimensions (out of 10) are shown)

You can also use the package factoextra to extract the eigenvalues :

library("factoextra")
eig.val <- get_eigenvalue(res.pca)
head(eig.val)

      eigenvalue variance.percent cumulative.variance.percent
Dim 1  4.1242133        41.242133                    41.24213
Dim 2  1.8385309        18.385309                    59.62744
Dim 3  1.2391403        12.391403                    72.01885
Dim 4  0.8194402         8.194402                    80.21325
Dim 5  0.7015528         7.015528                    87.22878
Dim 6  0.4228828         4.228828                    91.45760

Make a scree plot using ade4 base graphics

The function screeplot() can be used to represent the amount of inertia (variance) associated with each principal component (PC).

A simplified format is :

screeplot(x, npcs = length(x$eig), type = c("barplot", "lines"))

x : an object of class dudi
npcs : the number of components to be plotted
type : the type of plot

Example of usage :

screeplot(res.pca, main ="Screeplot - Eigenvalues")

You can also customize the plot using the standard barplot() function. In the R code below, we’ll draw the percentage of variances retained by each component :

# barplot() returns the x positions of the bar midpoints
bp <- barplot(eig.val[, 2], names.arg=1:nrow(eig.val), 
       main = "Variances",
       xlab = "Principal Components",
       ylab = "Percentage of variances",
       col ="steelblue")
# Add connected line segments through the bar midpoints
lines(x = bp, y = eig.val[, 2], 
      type="b", pch=19, col = "red")

About 60% of the information (variance) contained in the data is retained by the first two principal components (59.6% in the table above).

Make the scree plot using the package factoextra

fviz_screeplot(res.pca, ncp=10)

Graph of variables : the circle of correlations

Coordinates of variables on the principal components

The coordinates of the variables on the factor map are :

# Column coordinates
head(res.pca$co)

                  Comp1       Comp2      Comp3       Comp4      Comp5
X100m         0.8506257 -0.17939806 -0.3015564  0.03357320  0.1944440
Long.jump    -0.7941806  0.28085695  0.1905465 -0.11538956 -0.2331567
Shot.put     -0.7339127  0.08540412 -0.5175978  0.12846837  0.2488129
High.jump    -0.6100840 -0.46521415 -0.3300852  0.14455012 -0.4027002
X400m         0.7016034  0.29017826 -0.2835329  0.43082552 -0.1039085
X110m.hurdle  0.7641252 -0.02474081 -0.4488873 -0.01689589 -0.2242200

Graph of variables using ade4 base graph

The function s.corcircle() can be used to plot the correlation circle. A simplified format is :

s.corcircle(dfxy, label = row.names(dfxy), grid = TRUE,
            box = FALSE)

dfxy : a data frame specifying the coordinates of variables
label : a vector of strings specifying point labels
grid : a logical value specifying whether a grid in the background of the plot should be drawn
box : a logical value indicating whether a box should be drawn

# Graph of variables
s.corcircle(res.pca$co)

Graph of variables using factoextra

The function fviz_pca_var() is used to visualize variables :

# Default plot
fviz_pca_var(res.pca)

# Change color and theme
fviz_pca_var(res.pca, col.var="steelblue")+
  theme_minimal()

Read more about the function fviz_pca_var() : Graph of variables - Principal Component Analysis

How to calculate the cos2 and the contribution of variables?

The cos2 and the contributions of variables (columns) / individuals (rows) are calculated using the function inertia.dudi() as follows :

inertia <- inertia.dudi(res.pca, row.inertia = TRUE,
                        col.inertia = TRUE)

Note that the contributions and the cos2 are printed in units of 1/10 000. The sign is the sign of the coordinates.

Cos2 : quality of the representation for variables on the factor map

The squared coordinates of variables are called cos2.

A high cos2 indicates a good representation of the variable on the principal component. In this case the variable is positioned close to the circumference of the correlation circle.
A low cos2 indicates that the variable is not perfectly represented by the PCs. In this case the variable is close to the center of the circle.

The cos2 of the variables are :

# relative contributions of columns
var.cos2 <- abs(inertia$col.rel/10000)
head(var.cos2)

              Comp1  Comp2  Comp3  Comp4  Comp5 con.tra
X100m        0.7236 0.0322 0.0909 0.0011 0.0378     0.1
Long.jump    0.6307 0.0789 0.0363 0.0133 0.0544     0.1
Shot.put     0.5386 0.0073 0.2679 0.0165 0.0619     0.1
High.jump    0.3722 0.2164 0.1090 0.0209 0.1622     0.1
X400m        0.4922 0.0842 0.0804 0.1856 0.0108     0.1
X110m.hurdle 0.5839 0.0006 0.2015 0.0003 0.0503     0.1

It can also be calculated as follows :

# squared coordinates
head(res.pca$co^2)

                 Comp1        Comp2      Comp3        Comp4      Comp5
X100m        0.7235641 0.0321836641 0.09093628 0.0011271597 0.03780845
Long.jump    0.6307229 0.0788806285 0.03630798 0.0133147506 0.05436203
Shot.put     0.5386279 0.0072938636 0.26790749 0.0165041211 0.06190783
High.jump    0.3722025 0.2164242070 0.10895622 0.0208947375 0.16216747
X400m        0.4922473 0.0842034209 0.08039091 0.1856106269 0.01079698
X110m.hurdle 0.5838873 0.0006121077 0.20149984 0.0002854712 0.05027463
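For a PCA on standardized variables, the coordinate of a variable on a component is its correlation with that component, which is why the squared coordinates measure the quality of representation. The sketch below illustrates this with prcomp() on the built-in iris data (not with ade4 or decathlon2, so that it is self-contained); the names var.coord and var.cos2 are illustrative.

```r
X <- scale(iris[, -5])
pca <- prcomp(X)

# Variable coordinates: loadings multiplied by the component standard deviations
var.coord <- sweep(pca$rotation, 2, pca$sdev, FUN = "*")
var.cos2 <- var.coord^2

# These coordinates are the correlations between variables and components
all.equal(var.coord, cor(X, pca$x), check.attributes = FALSE)

# Over all components, the cos2 of each variable sum to 1
rowSums(var.cos2)
```

A variable whose cos2 on the first two components sums to nearly 1 therefore lies close to the circumference of the correlation circle.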

Using the factoextra package, the color of variables can be automatically controlled by the value of their cos2 :

fviz_pca_var(res.pca, col.var="cos2")+
scale_color_gradient2(low="white", mid="blue", 
      high="red", midpoint=0.5) + theme_minimal()

Contributions of the variables to the principal components

The contributions can be printed in % as follows :

# absolute contribution of columns
var.contrib <- inertia$col.abs/100
head(var.contrib)

             Comp1 Comp2 Comp3 Comp4 Comp5
X100m        17.54  1.75  7.34  0.14  5.39
Long.jump    15.29  4.29  2.93  1.62  7.75
Shot.put     13.06  0.40 21.62  2.01  8.82
High.jump     9.02 11.77  8.79  2.55 23.12
X400m        11.94  4.58  6.49 22.65  1.54
X110m.hurdle 14.16  0.03 16.26  0.03  7.17

Note that you can also use the function get_pca_var() [from factoextra package]. It provides a list of matrices containing all the results for the active variables (coordinates, correlation between variables and axes, squared cosine and contributions).

var <- get_pca_var(res.pca)
names(var)

[1] "coord"   "cor"     "cos2"    "contrib"

# Contributions of variables
head(var$contrib)

                 Dim.1      Dim.2     Dim.3       Dim.4     Dim.5
X100m        17.544293  1.7505098  7.338659  0.13755240  5.389252
Long.jump    15.293168  4.2904162  2.930094  1.62485936  7.748815
Shot.put     13.060137  0.3967224 21.620432  2.01407269  8.824401
High.jump     9.024811 11.7715838  8.792888  2.54987951 23.115504
X400m        11.935544  4.5799296  6.487636 22.65090599  1.539012
X110m.hurdle 14.157544  0.0332933 16.261261  0.03483735  7.166193
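The contribution of a variable to a component is its cos2 on that component divided by the eigenvalue of the component, expressed in percent. With the values shown above, X100m on the first axis gives 0.7236 / 4.1242 ≈ 17.54%. The sketch below reproduces the calculation with prcomp() on the built-in iris data, as a self-contained illustration; the object names are illustrative.

```r
pca <- prcomp(iris[, -5], scale. = TRUE)
eig <- pca$sdev^2

# cos2 of the variables: squared (loadings x component standard deviations)
var.cos2 <- sweep(pca$rotation, 2, pca$sdev, FUN = "*")^2

# Contribution (in %) = cos2 on the axis / eigenvalue of the axis
var.contrib <- 100 * sweep(var.cos2, 2, eig, FUN = "/")
round(var.contrib, 2)

# The contributions to a given component sum to 100
colSums(var.contrib)
```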

Using the factoextra package, the color of variables can be automatically controlled by the value of their contributions :

fviz_pca_var(res.pca, col.var="contrib") +
scale_color_gradient2(low="white", mid="blue", 
      high="red", midpoint=50) + theme_minimal()

This is helpful to highlight the most important variables for the principal components.

The most important variables for a given PC can be visualized using the function fviz_pca_contrib()[factoextra package] :

(factoextra >= 1.0.1 is required)

# Contributions of variables on PC1
fviz_pca_contrib(res.pca, choice = "var", axes = 1)

# Contributions of variables on PC2
fviz_pca_contrib(res.pca, choice = "var", axes = 2)

Read more about fviz_pca_contrib() : Principal Component Analysis: How to reveal the most important variables in your data?

Graph of individuals

Coordinates of individuals on the principal components

The coordinates of the individuals on the factor maps can be extracted as follows :

# The row coordinates
head(res.pca$li)

                Axis1      Axis2      Axis3       Axis4       Axis5
SEBRLE     -0.1955047  1.5890567 -0.6424912  0.08389652 -1.16829387
CLAY       -0.8078795  2.4748137  1.3873827  1.29838232  0.82498206
BERNARD     1.3591340  1.6480950 -0.2005584 -1.96409420 -0.08419345
YURKOV      0.8889532 -0.4426067 -2.5295843  0.71290837 -0.40782264
ZSIVOCZKY   0.1081216 -2.0688377  1.3342591 -0.10152796  0.20145217
McMULLEN   -0.1212195 -1.0139102  0.8625170  1.34164291 -1.62151286

Cos2 : quality of the representation for individuals on the principal components

# relative contributions of rows
ind.cos2 <- abs(inertia$row.rel)/10000
head(ind.cos2)

            Axis1  Axis2  Axis3  Axis4  Axis5 con.tra
SEBRLE     0.0075 0.4975 0.0813 0.0014 0.2689  0.0221
CLAY       0.0487 0.4570 0.1436 0.1258 0.0508  0.0583
BERNARD    0.1972 0.2900 0.0043 0.4118 0.0008  0.0407
YURKOV     0.0961 0.0238 0.7782 0.0618 0.0202  0.0357
ZSIVOCZKY  0.0016 0.5764 0.2398 0.0014 0.0055  0.0323
McMULLEN   0.0022 0.1522 0.1101 0.2665 0.3893  0.0294

Contribution of the individuals to the principal components

The contributions can be printed in % as follows :

# absolute contributions of rows
ind.contrib <- inertia$row.abs/100
head(ind.contrib)

           Axis1 Axis2 Axis3 Axis4 Axis5
SEBRLE      0.04  5.97  1.45  0.04  8.46
CLAY        0.69 14.48  6.75  8.94  4.22
BERNARD     1.95  6.42  0.14 20.47  0.04
YURKOV      0.83  0.46 22.45  2.70  1.03
ZSIVOCZKY   0.01 10.12  6.25  0.05  0.25
McMULLEN    0.02  2.43  2.61  9.55 16.29

It’s also possible to use the function get_pca_ind() [from factoextra package]. It provides a list of matrices containing all the results for the active individuals (coordinates, squared cosine and contributions).

ind <- get_pca_ind(res.pca)
names(ind)

[1] "coord"   "cos2"    "contrib"

# Contributions of individuals
head(ind$contrib)

                Dim.1      Dim.2      Dim.3       Dim.4       Dim.5
SEBRLE     0.04029447  5.9714533  1.4483919  0.03734589  8.45894063
CLAY       0.68805664 14.4839248  6.7537381  8.94458283  4.21794385
BERNARD    1.94740183  6.4234107  0.1411345 20.46819433  0.04393073
YURKOV     0.83308415  0.4632733 22.4517396  2.69663605  1.03075263
ZSIVOCZKY  0.01232413 10.1217143  6.2464325  0.05469230  0.25151025
McMULLEN   0.01549089  2.4310854  2.6102794  9.55055888 16.29493304
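Both quantities can be derived from the coordinates alone when all individuals have the same weight. The cos2 of an individual on an axis is its squared coordinate divided by its squared distance to the origin, and its contribution is its squared coordinate divided by the sum of squared coordinates on that axis. With the values above, SEBRLE on the first axis gives 0.1955² / (23 × 4.1242) ≈ 0.0403%. The sketch below uses prcomp() on the built-in iris data as a self-contained illustration; the object names are illustrative.

```r
pca <- prcomp(iris[, -5], scale. = TRUE)
ind.coord <- pca$x   # all components are kept

# cos2: squared coordinate / squared distance of the individual to the origin
ind.cos2 <- ind.coord^2 / rowSums(ind.coord^2)

# Contribution (%): squared coordinate / sum of squared coordinates on the axis
ind.contrib <- 100 * sweep(ind.coord^2, 2, colSums(ind.coord^2), FUN = "/")

head(round(ind.cos2, 3))
head(round(ind.contrib, 3))
```

The cos2 of an individual sum to 1 across all components, and the contributions to a given component sum to 100 across individuals.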

Use the function fviz_pca_contrib()[factoextra package] to visualize the most contributing individuals :

(factoextra >= 1.0.1 is required)

# Contributions of individuals on PC1
fviz_pca_contrib(res.pca, choice = "ind", axes = 1)

# Contributions of individuals on PC2
fviz_pca_contrib(res.pca, choice = "ind", axes = 2)

Read more about fviz_pca_contrib() : Principal Component Analysis: How to reveal the most important variables in your data?

Graph of individuals using ade4 base graph

The function s.label() can be used. A simplified format is :

s.label(dfxy, xax = 1, yax = 2)

dfxy : a data frame with at least two coordinates
xax : a numeric value specifying the column number containing x values
yax : a numeric value specifying the column number containing y values

Factor map of individuals :

s.label(res.pca$li, xax = 1, yax = 2)

Biplot of individuals and variables using ade4

A biplot can be drawn using the combination of the two functions below :

s.label() to plot individuals
s.arrow() to add variables

# Plot of individuals
s.label(res.pca$li, xax = 1, yax = 2)
# Add variables
s.arrow(7*res.pca$c1, add.plot = TRUE)

It’s also possible to use the function scatter() or biplot() :

scatter(res.pca)

# Remove the scree plot (posieig ="none")
# Remove row labels (clab.row = 0)
scatter(res.pca,  posieig = "none", clab.row = 0)

Note that the argument clab.col = 0 can be used to remove variable labels.

Graph of individuals using factoextra

The function fviz_pca_ind() is used to visualize individuals :

fviz_pca_ind(res.pca)

The color of individuals can be controlled automatically using their cos2 values (the quality of representation of the individuals on the factor map) :

fviz_pca_ind(res.pca, col.ind="cos2") +
scale_color_gradient2(low="white", mid="blue", 
                  high="red", midpoint=0.50)

Change the theme :

fviz_pca_ind(res.pca, col.ind="cos2") +
scale_color_gradient2(low="white", mid="blue", 
    high="red", midpoint=0.50) + theme_minimal()

Read more about fviz_pca_ind() : Graph of individuals - principal component analysis

Make a biplot of individuals and variables :

fviz_pca_biplot(res.pca, geom = "text") +
  theme_minimal()

Read more about fviz_pca_biplot() : Biplot of individuals and variables - principal component analysis

Change the color of individuals by groups

The decathlon2 data set contains a supplementary qualitative variable in column 13, corresponding to the type of competition.

Qualitative variables can be helpful for interpreting the data and for coloring individuals by groups :

# Data for the supplementary qualitative variables
quali.sup <- as.factor(decathlon2[1:23, 13])
head(quali.sup)

[1] Decastar Decastar Decastar Decastar Decastar Decastar
Levels: Decastar OlympicG

The function s.class() can be used to visualize the classes (groups) of points :

s.class(dfxy, fac, xax = 1, yax = 2, col)

dfxy : a data frame containing the two columns for x and y axes
fac : a factor variable partitioning the individuals in classes
xax, yax : a numeric value specifying the column number containing x and y values
col : a vector of colors used to draw each class in a different color

Color individuals by groups :

s.class(res.pca$li, fac = quali.sup, xax = 1, yax = 2)

# Change the colors
s.class(res.pca$li, fac = quali.sup, col = c("blue", "red"))

# Make a biplot
# clab.row : hide the label for rows (individuals)
res <- scatter(res.pca, clab.row = 0, posieig = "none")
s.class(res.pca$li, fac = quali.sup, col = c("blue", "red"),
        add.plot = TRUE)

# Customize the biplot
# - remove row labels (clab.row = 0)
# - hide the scree plot (posieig = "none")
# - remove stars (cstar = 0)
# - remove ellipse (cellipse = 0)
res <- scatter(res.pca, clab.row = 0, posieig = "none")
s.class(res.pca$li, fac = quali.sup, col = c("blue", "red"),
        add.plot = TRUE, cstar = 0, cellipse = 0)

# remove labels for classes (clabel = 0)
res <- scatter(res.pca, clab.row = 0, posieig = "none")
s.class(res.pca$li, fac = quali.sup, col = c("blue", "red"),
        add.plot = TRUE, cstar = 0, cellipse = 0, clabel = 0)

It’s also possible to use factoextra :

fviz_pca_ind(res.pca, habillage = quali.sup,
     addEllipses =TRUE, ellipse.level = 0.68) +
  theme_minimal()

Elegant biplot using factoextra and iris data :

data(iris)

head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

# The variable Species (index = 5) is removed
# before PCA analysis
iris.pca <- dudi.pca(iris[,-5], scannf = FALSE, nf = 2)

Now, let’s :

make a biplot of individuals and variables
change the color of individuals by groups
change the transparency of variables by their cos2 values
show only the labels for variables

fviz_pca_biplot(iris.pca, 
  habillage = iris$Species, addEllipses = TRUE,
  col.var = "red", alpha.var ="cos2",
  label = "var") +
  scale_color_brewer(palette="Dark2")+
  theme_minimal()

Principal component analysis using supplementary individuals and variables

As described above, the data set decathlon2 contains supplementary continuous variables (quanti.sup, columns 11:12), a supplementary qualitative variable (quali.sup, column 13) and supplementary individuals (ind.sup, rows 24:27).

Supplementary variables / individuals are not used to compute the principal components. Their coordinates are predicted using only the information provided by the principal component analysis performed on the active variables / individuals.

The functions suprow() and supcol() [in ade4 package] are used to calculate the coordinates of supplementary rows (individuals) and columns (variables), respectively.

The simplified formats are :

# For supplementary individuals (rows)
suprow(x, Xsup)

# For supplementary variables (columns)
supcol(x, Xsup)
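
To make the idea of "predicted" coordinates concrete, here is a minimal sketch in base R. It does not use ade4 : prcomp() and the built-in iris data are swapped in, and the split into active rows (1:140) and supplementary rows (141:150) is an arbitrary assumption made only for illustration. The supplementary rows are centered and scaled with the means and standard deviations of the active rows, then projected on the loadings :

```r
# Active and supplementary individuals (arbitrary split, for illustration only)
active <- iris[1:140, 1:4]
sup    <- iris[141:150, 1:4]

# PCA on the active individuals only
pca <- prcomp(active, scale. = TRUE)

# Center and scale the supplementary rows with the ACTIVE means and sds,
# then project them on the loadings (rotation matrix)
sup.scaled <- scale(sup, center = pca$center, scale = pca$scale)
sup.coord  <- sup.scaled %*% pca$rotation

# predict() on a prcomp object performs the same computation
same <- all.equal(unname(sup.coord), unname(predict(pca, newdata = sup)))
same
```

The supplementary rows never enter the computation of the components; they are only positioned afterwards, which is the behaviour described above for suprow().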

Supplementary individuals

# Data for the supplementary individuals
ind.sup <- decathlon2[24:27, 1:10, drop = FALSE]
ind.sup[, 1:6]

         X100m Long.jump Shot.put High.jump X400m X110m.hurdle
KARPOV   11.02      7.30    14.77      2.04 48.37        14.09
WARNERS  11.11      7.60    14.31      1.98 48.68        14.23
Nool     10.80      7.53    14.26      1.88 48.81        14.80
Drews    10.87      7.38    13.07      1.88 48.51        14.01

Predict the coordinates of the supplementary individuals :

ind.sup.pca <- suprow(res.pca, ind.sup)
names(ind.sup.pca)

[1] "tabsup" "lisup"

# coordinates 
ind.sup.coord <- ind.sup.pca$lisup
head(ind.sup.coord)

              Axis1       Axis2     Axis3      Axis4      Axis5
KARPOV   -0.7947206  0.77951227 1.6330203  1.7242283 0.75070396
WARNERS   0.3864645 -0.12159237 1.7387332 -0.7063341 0.03230011
Nool      0.5591306  1.97748871 0.4830358 -2.2784526 0.25461493
Drews     1.1092038  0.01741477 3.0488182 -1.5343468 0.32642192

How to visualize supplementary individuals on the factor map?

The function fviz_add() is used :

# Plot of active individuals
p <- fviz_pca_ind(res.pca)
# Add supplementary individuals
fviz_add(p, ind.sup.coord, color ="blue")

How to calculate the cos2 (quality of the representation) for supplementary individuals?

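# cos2 of a row = squared coordinates divided by their sum.
# Note : the sum runs over the retained axes only (5 here), so each row sums to 1
# and the values are relative to the retained subspace, not to the full space.
cos2.func <- function(x){x^2/sum(x^2)}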
ind.sup.cos2 <- t(apply(ind.sup.coord, 1, cos2.func))
head(ind.sup.cos2)

              Axis1        Axis2      Axis3     Axis4        Axis5
KARPOV   0.08486144 8.164458e-02 0.35831467 0.3994579 0.0757214366
WARNERS  0.04050537 4.009646e-03 0.81989704 0.1353050 0.0002829447
Nool     0.03218782 4.026179e-01 0.02402281 0.5344967 0.0066747159
Drews    0.09473792 2.335268e-05 0.71575477 0.1812793 0.0082046453

Supplementary quantitative variables

# Data for the supplementary quantitative variables
quanti.sup <- decathlon2[1:23, 11:12, drop = FALSE]
head(quanti.sup)

           Rank Points
SEBRLE        1   8217
CLAY          2   8122
BERNARD       4   8067
YURKOV        5   8036
ZSIVOCZKY     7   8004
McMULLEN      8   7995

Remember that rows 24:27 are supplementary individuals. We don’t want them in the current analysis, which is why only rows 1:23 are extracted.

Predict the coordinates of the supplementary variables :

(You have to scale the supplementary variables before the analysis as the PCA has been performed on scaled data.)

quanti.pca <- supcol(res.pca, scale(quanti.sup))
names(quanti.pca)

[1] "tabsup" "cosup"

# coordinates 
quanti.coord <- quanti.pca$cosup
head(quanti.coord)

            Comp1      Comp2      Comp3      Comp4      Comp5
Rank    0.6860587 -0.2398049  0.1793975  0.0545264 0.07220371
Points -0.9425246  0.0759751 -0.1545490 -0.1625770 0.03046248
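
In a standardized PCA, the coordinate of a supplementary quantitative variable on a component can be read as its correlation with that component. The sketch below illustrates this in base R. It is not the ade4 code : prcomp() and the built-in mtcars data are swapped in, and the choice of qsec as the "supplementary" variable is a hypothetical example.

```r
# Active variables and one supplementary quantitative variable
# (arbitrary choice, for illustration only)
active  <- mtcars[, c("mpg", "disp", "hp", "drat", "wt")]
sup.var <- mtcars$qsec

# Standardized PCA on the active variables only
pca <- prcomp(active, scale. = TRUE)

# Coordinates of the supplementary variable :
# its correlation with the scores of each component
sup.coord <- cor(sup.var, pca$x)
round(sup.coord, 3)

# cos2 of the supplementary variable : squared coordinates
sup.cos2 <- sup.coord^2
round(sup.cos2, 3)
```

The coordinates printed above (for example 0.94 in absolute value for Points on the first axis) are slightly smaller than the correlations reported by FactoMineR for the same variable (0.96). The ratio is close to sqrt(22/23), which would be consistent with scale() using an n - 1 denominator while the PCA uses n. This is an inference from the printed numbers, not something stated in the package documentation quoted here.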

Visualize supplementary variables on the factor map using factoextra :

# Plot of active variables
p <- fviz_pca_var(res.pca)
# Add supplementary quantitative variables
fviz_add(p, quanti.coord, geom="arrow", color ="blue")

# Get the cos2 of the supplementary quantitative variables
(quanti.coord^2)[, 1:4]

           Comp1       Comp2      Comp3       Comp4
Rank   0.4706766 0.057506383 0.03218347 0.002973128
Points 0.8883526 0.005772216 0.02388540 0.026431296

Infos

This analysis has been performed using R software (ver. 3.1.2) and ggplot2 (ver. 1.0.0)

FactoMineR and factoextra : Principal Component Analysis Visualization - R software and data mining

Wed, 11 Mar 2015 21:35:38 +0100

Install and load FactoMineR package
Install and load factoextra for visualization
Prepare the data
Exploratory data analysis
- Descriptive statistics
- Correlation matrix
Principal component analysis
- Variances of the principal components
Graph of individuals and variables
Variables factor map : The correlation circle
Graph of individuals
Principal component analysis using supplementary individuals and variables
Dimension description
Infos

Principal component analysis (PCA) allows us to summarize the variation (information) in a data set described by multiple variables. Each variable can be considered as a different dimension. With more than 3 variables, it becomes very difficult to visualize the resulting multi-dimensional space.

The goal of principal component analysis is to transform the initial variables into a new set of variables that explain the variation in the data. These new variables are linear combinations of the original ones and are called principal components.

PCA reduces the dimensionality of multivariate data to two or three principal components that can be visualized graphically, with minimal loss of information.
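
The statement that principal components are linear combinations of the original variables can be checked by hand. The following is a minimal sketch using base R prcomp() and the built-in iris data (both are assumptions made for the example, not the functions used in the rest of this tutorial) :

```r
# Standardize the four numeric columns of iris
X <- scale(iris[, 1:4])

# PCA on the standardized data
pca <- prcomp(X)

# Loadings : one column of coefficients per principal component
loadings <- pca$rotation

# Scores recomputed by hand : each component is a weighted sum of the variables
scores.manual <- X %*% loadings

# The manual scores match the ones returned by prcomp()
same <- all.equal(unname(scores.manual), unname(pca$x))
same

# Share of the total variance kept by the first two components
var.2d <- sum(pca$sdev[1:2]^2) / sum(pca$sdev^2)
var.2d
```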

Several functions from different packages are available in R for performing PCA : prcomp() and princomp() (built-in R stats package), PCA() (FactoMineR package), dudi.pca() (ade4 package).

This R tutorial describes :

How to perform a principal component analysis using R software and FactoMineR package
How to visualize the output of the PCA using the R package factoextra

Install and load FactoMineR package

FactoMineR (Husson et al.) is one of the most powerful R packages and my favorite one for performing multivariate exploratory data analysis. Rich documentation is available on the FactoMineR official website (http://factominer.free.fr/index.html) and on YouTube. Many thanks to François Husson for this effort…

FactoMineR can be installed and loaded as follows :

install.packages("FactoMineR")

library("FactoMineR")

Install and load factoextra for visualization

The package factoextra has flexible methods for the classes PCA, prcomp, princomp and dudi in order to extract and visualize quickly the results of the analysis. The ggplot2 plotting system is used for the data visualization.

Install factoextra as follows :

library("devtools")
install_github("kassambara/factoextra")

Load it :

library("factoextra")

Prepare the data

We’ll use the data set decathlon2 from the package factoextra :

data(decathlon2)
head(decathlon2[, 1:6])

           X100m Long.jump Shot.put High.jump X400m X110m.hurdle
SEBRLE     11.04      7.58    14.83      2.07 49.81        14.69
CLAY       10.76      7.40    14.26      1.86 49.37        14.05
BERNARD    11.02      7.23    14.25      1.92 48.93        14.99
YURKOV     11.34      7.09    15.19      2.10 50.42        15.31
ZSIVOCZKY  11.13      7.30    13.48      2.01 48.62        14.17
McMULLEN   10.83      7.31    13.76      2.13 49.91        14.38

This data set is just a subset of the decathlon data in the FactoMineR package.

As illustrated below, the data used here describe athletes’ performance during two sporting events (Decastar and OlympicG). They contain 27 individuals (athletes) described by 13 variables :

Only some of these individuals and variables will be used to perform the principal component analysis (PCA).

The coordinates of the remaining individuals and variables on the factor map will be predicted after the PCA.

In PCA terminology, our data contains :

Active individuals (in blue, rows 1:23) : Individuals that are used during the principal component analysis.
Supplementary individuals (in green, rows 24:27) : The coordinates of these individuals will be predicted using the PCA information and parameters obtained with the active individuals/variables.
Active variables (in pink, columns 1:10) : Variables that are used for the principal component analysis.
Supplementary variables : As with supplementary individuals, the coordinates of these variables will also be predicted.
Supplementary continuous variables : Columns 11 and 12, corresponding respectively to the rank and the points of athletes.
Supplementary qualitative variables : Column 13, corresponding to the two athletic meetings (2004 Olympic Games or 2004 Decastar). This factor variable will be used to color individuals by groups.

Extract only active individuals and variables for principal component analysis:

decathlon2.active <- decathlon2[1:23, 1:10]
head(decathlon2.active[, 1:6])

           X100m Long.jump Shot.put High.jump X400m X110m.hurdle
SEBRLE     11.04      7.58    14.83      2.07 49.81        14.69
CLAY       10.76      7.40    14.26      1.86 49.37        14.05
BERNARD    11.02      7.23    14.25      1.92 48.93        14.99
YURKOV     11.34      7.09    15.19      2.10 50.42        15.31
ZSIVOCZKY  11.13      7.30    13.48      2.01 48.62        14.17
McMULLEN   10.83      7.31    13.76      2.13 49.91        14.38

Exploratory data analysis

Before principal component analysis, we can perform some exploratory data analysis such as descriptive statistics, correlation matrix and scatter plot matrix.

Descriptive statistics

decathlon2.active_stats <- data.frame(
  Min = apply(decathlon2.active, 2, min), # minimum
  Q1 = apply(decathlon2.active, 2, quantile, 1/4), # First quartile
  Med = apply(decathlon2.active, 2, median), # median
  Mean = apply(decathlon2.active, 2, mean), # mean
  Q3 = apply(decathlon2.active, 2, quantile, 3/4), # Third quartile
  Max = apply(decathlon2.active, 2, max) # Maximum
  )
decathlon2.active_stats <- round(decathlon2.active_stats, 1)
head(decathlon2.active_stats)

              Min   Q1  Med Mean   Q3  Max
X100m        10.4 10.8 11.0 11.0 11.2 11.6
Long.jump     6.8  7.2  7.3  7.3  7.5  8.0
Shot.put     12.7 14.2 14.7 14.6 15.1 16.4
High.jump     1.9  1.9  2.0  2.0  2.1  2.1
X400m        46.8 49.0 49.4 49.4 50.0 51.2
X110m.hurdle 14.0 14.2 14.4 14.5 14.9 15.7

Note that you can also use the built-in R function summary() for descriptive statistics, but I don’t like the format of its output on a data frame.

Correlation matrix

The correlation between variables can be calculated as follows :

cor.mat <- round(cor(decathlon2.active),2)
head(cor.mat[, 1:6])

             X100m Long.jump Shot.put High.jump X400m X110m.hurdle
X100m         1.00     -0.76    -0.45     -0.40  0.59         0.73
Long.jump    -0.76      1.00     0.44      0.34 -0.51        -0.59
Shot.put     -0.45      0.44     1.00      0.53 -0.31        -0.38
High.jump    -0.40      0.34     0.53      1.00 -0.37        -0.25
X400m         0.59     -0.51    -0.31     -0.37  1.00         0.58
X110m.hurdle  0.73     -0.59    -0.38     -0.25  0.58         1.00

Visualize the correlation matrix using a correlogram : the package corrplot is required.

# install.packages("corrplot")
library("corrplot")
corrplot(cor.mat, type="upper", order="hclust", 
         tl.col="black", tl.srt=45)

Read more about visualizing correlation matrix : Correlation matrix visualization

Make a scatter plot matrix showing the correlation coefficients between variables and the significance levels : the package PerformanceAnalytics is required.

# install.packages("PerformanceAnalytics")
library("PerformanceAnalytics")
chart.Correlation(decathlon2.active[, 1:6], histogram=TRUE, pch=19)

You can read more about this plot here : Correlation matrix visualization

Principal component analysis

The function PCA() [in FactoMineR package] can be used. A simplified format is :

PCA(X, scale.unit = TRUE, ncp = 5, graph = TRUE)

X : a data frame. Rows are individuals and columns are numeric variables
scale.unit : a logical value. If TRUE, the data are scaled to unit variance before the analysis. This standardization to the same scale prevents some variables from dominating just because of their large measurement units.
ncp : number of dimensions kept in the final results.
graph : a logical value. If TRUE a graph is displayed.
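
To see why scaling matters, here is a small sketch in base R. It uses prcomp() and the built-in USArrests data instead of PCA() and decathlon2; the argument scale. = TRUE of prcomp() is assumed to play a role comparable to scale.unit = TRUE. Without scaling, the variable with the largest variance drives the first component :

```r
# PCA without and with scaling to unit variance
pca.raw    <- prcomp(USArrests, scale. = FALSE)
pca.scaled <- prcomp(USArrests, scale. = TRUE)

# Share of the total variance carried by the first component
prop.raw    <- pca.raw$sdev[1]^2 / sum(pca.raw$sdev^2)
prop.scaled <- pca.scaled$sdev[1]^2 / sum(pca.scaled$sdev^2)
round(c(raw = prop.raw, scaled = prop.scaled), 3)

# Variable with the largest absolute loading on PC1 when the data are not scaled
top.var <- rownames(pca.raw$rotation)[which.max(abs(pca.raw$rotation[, 1]))]
top.var

# Variances of the raw variables, for comparison
sapply(USArrests, var)
```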

In the R code below, the PCA is performed only on the active individuals/variables :

library("FactoMineR")
res.pca <- PCA(decathlon2.active, graph = FALSE)

The output of the function PCA() is a list including :

print(res.pca)

**Results for the Principal Component Analysis (PCA)**
The analysis was performed on 23 individuals, described by 10 variables
*The results are available in the following objects:

   name               description                          
1  "$eig"             "eigenvalues"                        
2  "$var"             "results for the variables"          
3  "$var$coord"       "coord. for the variables"           
4  "$var$cor"         "correlations variables - dimensions"
5  "$var$cos2"        "cos2 for the variables"             
6  "$var$contrib"     "contributions of the variables"     
7  "$ind"             "results for the individuals"        
8  "$ind$coord"       "coord. for the individuals"         
9  "$ind$cos2"        "cos2 for the individuals"           
10 "$ind$contrib"     "contributions of the individuals"   
11 "$call"            "summary statistics"                 
12 "$call$centre"     "mean of the variables"              
13 "$call$ecart.type" "standard error of the variables"    
14 "$call$row.w"      "weights for the individuals"        
15 "$call$col.w"      "weights for the variables"

The object created by the function PCA() contains a lot of information stored in several lists and matrices. These values are described in the next sections.

Variances of the principal components

The proportion of variance retained by the principal components can be extracted as follows :

eigenvalues <- res.pca$eig
head(eigenvalues[, 1:2])

       eigenvalue percentage of variance
comp 1  4.1242133              41.242133
comp 2  1.8385309              18.385309
comp 3  1.2391403              12.391403
comp 4  0.8194402               8.194402
comp 5  0.7015528               7.015528
comp 6  0.4228828               4.228828

Eigenvalues correspond to the amount of variation explained by each principal component (PC). They are largest for the first PC and decrease for the subsequent PCs.
A PC with an eigenvalue > 1 accounts for more variance than one of the original variables in standardized data. This is commonly used as a cutoff point to determine the number of PCs to retain.
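
As a quick illustration of this cutoff, the sketch below reuses the six eigenvalues printed above (typed in by hand, so the block runs on its own). With 10 standardized variables the eigenvalues sum to 10, so an eigenvalue of 1 corresponds to the variance of one original variable :

```r
# First six eigenvalues, copied from the output above
eig <- c(4.1242133, 1.8385309, 1.2391403, 0.8194402, 0.7015528, 0.4228828)

# Number of active variables in the analysis
n.var <- 10

# Percentage of variance = eigenvalue / number of variables * 100
pct <- eig / n.var * 100
round(pct, 2)

# Number of components with an eigenvalue > 1
# (the remaining eigenvalues are not printed, but they are smaller than 0.42)
n.keep <- sum(eig > 1)
n.keep
```

Here three components pass the cutoff. This rule is only a common heuristic; the scree plot described next is another way to choose.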

Make a scree plot using base graphics : A scree plot is a graph of the eigenvalues/variances associated with components.

barplot(eigenvalues[, 2], names.arg=1:nrow(eigenvalues), 
       main = "Variances",
       xlab = "Principal Components",
       ylab = "Percentage of variances",
       col ="steelblue")
# Add connected line segments to the plot
lines(x = 1:nrow(eigenvalues), eigenvalues[, 2], 
      type="b", pch=19, col = "red")

About 60% of the information (variance) contained in the data is retained by the first two principal components.
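
This figure is simply the cumulative sum of the percentages of variance. A minimal sketch, with the percentages copied by hand from the table above and an 80% target chosen arbitrarily for the example :

```r
# Percentages of variance of the first six components (from the output above)
pct <- c(41.242133, 18.385309, 12.391403, 8.194402, 7.015528, 4.228828)

# Cumulative percentage of variance
cum.pct <- cumsum(pct)
round(cum.pct, 1)

# Smallest number of components needed to reach at least 80% of the variance
n.80 <- which(cum.pct >= 80)[1]
n.80
```

The second value of cum.pct is the share retained by the first two components.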

Make the scree plot using the package factoextra :

fviz_screeplot(res.pca, ncp=10)

Graph of individuals and variables

The function plot.PCA() can be used. A simplified format is :

plot.PCA(x, axes = c(1,2), choix=c("ind", "var"))

x : An object of class PCA
axes : A numeric vector of length 2 specifying the component to plot
choix : The graph to be plotted. Possible values are “ind” for the individuals and “var” for the variables

Variables factor map : The correlation circle

Coordinates of variables on the principal components

head(res.pca$var$coord)

                  Dim.1       Dim.2      Dim.3       Dim.4      Dim.5
X100m        -0.8506257 -0.17939806  0.3015564  0.03357320 -0.1944440
Long.jump     0.7941806  0.28085695 -0.1905465 -0.11538956  0.2331567
Shot.put      0.7339127  0.08540412  0.5175978  0.12846837 -0.2488129
High.jump     0.6100840 -0.46521415  0.3300852  0.14455012  0.4027002
X400m        -0.7016034  0.29017826  0.2835329  0.43082552  0.1039085
X110m.hurdle -0.7641252 -0.02474081  0.4488873 -0.01689589  0.2242200

Cos2 : quality of variables on the factor map

The quality of representation of the variables on the principal components is called the cos2.

head(res.pca$var$cos2)

                 Dim.1        Dim.2      Dim.3        Dim.4      Dim.5
X100m        0.7235641 0.0321836641 0.09093628 0.0011271597 0.03780845
Long.jump    0.6307229 0.0788806285 0.03630798 0.0133147506 0.05436203
Shot.put     0.5386279 0.0072938636 0.26790749 0.0165041211 0.06190783
High.jump    0.3722025 0.2164242070 0.10895622 0.0208947375 0.16216747
X400m        0.4922473 0.0842034209 0.08039091 0.1856106269 0.01079698
X110m.hurdle 0.5838873 0.0006121077 0.20149984 0.0002854712 0.05027463
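
In a standardized PCA, the cos2 of a variable on a component is its squared coordinate (squared correlation with the component). For example, the Dim.1 coordinate of X100m printed earlier is -0.8506257, and its square is 0.7235641, the cos2 shown above. The sketch below reproduces the computation in base R, with prcomp() and the iris data swapped in as assumptions for a self-contained example :

```r
# Standardized data and PCA
X <- scale(iris[, 1:4])
pca <- prcomp(X)

# Correlation between each variable and each component
var.cor <- cor(X, pca$x)

# cos2 : squared correlations
var.cos2 <- var.cor^2
round(var.cos2, 3)

# When all components are kept, the cos2 of each variable sum to 1
cos2.total <- rowSums(var.cos2)
cos2.total

# Check with the values printed in this tutorial
x100m.cos2 <- (-0.8506257)^2
x100m.cos2
```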

Contributions of the variables to the principal components

The contribution (in percentage) of a variable to a given principal component is : (var.cos2 * 100) / (total cos2 of the component)

head(res.pca$var$contrib)

                 Dim.1      Dim.2     Dim.3       Dim.4     Dim.5
X100m        17.544293  1.7505098  7.338659  0.13755240  5.389252
Long.jump    15.293168  4.2904162  2.930094  1.62485936  7.748815
Shot.put     13.060137  0.3967224 21.620432  2.01407269  8.824401
High.jump     9.024811 11.7715838  8.792888  2.54987951 23.115504
X400m        11.935544  4.5799296  6.487636 22.65090599  1.539012
X110m.hurdle 14.157544  0.0332933 16.261261  0.03483735  7.166193
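
The formula can be checked against the numbers printed above. In a standardized PCA the total cos2 of a component equals its eigenvalue, so the sketch below divides by the first eigenvalue. The second part applies the same formula to a full cos2 matrix, using base R prcomp() and iris as stand-ins (assumptions for the example) :

```r
# Values printed in this tutorial for X100m on Dim.1
cos2.x100m <- 0.7235641
eig1 <- 4.1242133   # eigenvalue of Dim.1 = total cos2 of the component

# Contribution of X100m to Dim.1, in percent
contrib.x100m <- 100 * cos2.x100m / eig1
contrib.x100m

# Same formula on a full cos2 matrix
X <- scale(iris[, 1:4])
pca <- prcomp(X)
var.cos2 <- cor(X, pca$x)^2
var.contrib <- sweep(var.cos2, 2, colSums(var.cos2), "/") * 100
round(var.contrib, 2)

# The contributions to each component sum to 100
contrib.total <- colSums(var.contrib)
contrib.total
```

The first result agrees with the value 17.544293 shown in the table.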

Graph of variables using FactoMineR base graph

plot(res.pca, choix = "var")

Graph of variables using factoextra

The function fviz_pca_var() is used to visualize variables :

# Default plot
fviz_pca_var(res.pca)

# Change color and theme
fviz_pca_var(res.pca, col.var="steelblue")+
  theme_minimal()

Note that, with the factoextra package, the color or the transparency of variables can be controlled automatically by the value of their contributions, their cos2, or their coordinates on the x or y axis.

# Control variable colors using their contribution
# Possible values for the argument col.var are :
  # "cos2", "contrib", "coord", "x", "y"
fviz_pca_var(res.pca, col.var="contrib")

# Change the gradient color
fviz_pca_var(res.pca, col.var="contrib")+
scale_color_gradient2(low="white", mid="blue", 
                      high="red", midpoint=55)+theme_bw()

This is helpful to highlight the most important variables in the determination of the principal components.

It’s also possible to control automatically the transparency of variables by their contributions :

# Control the transparency of variables using their contribution
# Possible values for the argument alpha.var are :
  # "cos2", "contrib", "coord", "x", "y"
fviz_pca_var(res.pca, alpha.var="contrib")+
  theme_minimal()

Read more about ggplot2 and colors here : ggplot2 colors - How to change colors automatically and manually?

Graph of individuals

Coordinates of individuals on the principal components

head(res.pca$ind$coord)

                Dim.1      Dim.2      Dim.3       Dim.4       Dim.5
SEBRLE      0.1955047  1.5890567  0.6424912  0.08389652  1.16829387
CLAY        0.8078795  2.4748137 -1.3873827  1.29838232 -0.82498206
BERNARD    -1.3591340  1.6480950  0.2005584 -1.96409420  0.08419345
YURKOV     -0.8889532 -0.4426067  2.5295843  0.71290837  0.40782264
ZSIVOCZKY  -0.1081216 -2.0688377 -1.3342591 -0.10152796 -0.20145217
McMULLEN    0.1212195 -1.0139102 -0.8625170  1.34164291  1.62151286

Cos2 : quality of representation of individuals on the principal components

head(res.pca$ind$cos2)

                 Dim.1      Dim.2       Dim.3       Dim.4        Dim.5
SEBRLE     0.007530179 0.49747323 0.081325232 0.001386688 0.2689026575
CLAY       0.048701249 0.45701660 0.143628117 0.125791741 0.0507850580
BERNARD    0.197199804 0.28996555 0.004294015 0.411819183 0.0007567259
YURKOV     0.096109800 0.02382571 0.778230322 0.061812637 0.0202279796
ZSIVOCZKY  0.001574385 0.57641944 0.239754152 0.001388216 0.0054654972
McMULLEN   0.002175437 0.15219499 0.110137872 0.266486530 0.3892621478

Contribution of individuals to the principal components

head(res.pca$ind$contrib)

                Dim.1      Dim.2      Dim.3       Dim.4       Dim.5
SEBRLE     0.04029447  5.9714533  1.4483919  0.03734589  8.45894063
CLAY       0.68805664 14.4839248  6.7537381  8.94458283  4.21794385
BERNARD    1.94740183  6.4234107  0.1411345 20.46819433  0.04393073
YURKOV     0.83308415  0.4632733 22.4517396  2.69663605  1.03075263
ZSIVOCZKY  0.01232413 10.1217143  6.2464325  0.05469230  0.25151025
McMULLEN   0.01549089  2.4310854  2.6102794  9.55055888 16.29493304
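
For individuals with equal weights (assumed here, as no row weights were specified in the call to PCA()), the contribution to a component can be computed as 100 * coord^2 / (n * eigenvalue), where n is the number of active individuals. The sketch below checks this against the printed values for SEBRLE, then applies the same idea to a prcomp() result on iris (a base R stand-in, not the FactoMineR object) :

```r
# Values printed in this tutorial
coord.sebrle <- 0.1955047   # Dim.1 coordinate of SEBRLE
eig1 <- 4.1242133           # eigenvalue of Dim.1
n <- 23                     # number of active individuals

# Contribution of SEBRLE to Dim.1, in percent
contrib.sebrle <- 100 * coord.sebrle^2 / (n * eig1)
contrib.sebrle

# Same idea on a full matrix of scores :
# squared coordinates divided by their column totals
pca <- prcomp(iris[, 1:4], scale. = TRUE)
ind.contrib <- sweep(pca$x^2, 2, colSums(pca$x^2), "/") * 100
head(round(ind.contrib, 3))

# The contributions to each component sum to 100
ind.total <- colSums(ind.contrib)
ind.total
```

The first result agrees with the value 0.04029447 shown in the table.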

Graph of individuals using FactoMineR base graph

plot(res.pca, choix = "ind")

Graph of individuals using factoextra

The function fviz_pca_ind() is used to visualize individuals :

fviz_pca_ind(res.pca)

Remove the points from the graph, use texts only :

fviz_pca_ind(res.pca, geom="text")

Note that, allowed values for the argument geom are :

“point” to show only points (dots)
“text” to show only labels
c(“point”, “text”) to show both types

Color individuals automatically by their cos2 values (the quality of representation of the individuals on the factor map) :

fviz_pca_ind(res.pca, col.ind="cos2") +
scale_color_gradient2(low="white", mid="blue", 
                      high="red", midpoint=0.50)

Read more about ggplot2 and colors here : ggplot2 colors - How to change colors automatically and manually?

Change the theme :

fviz_pca_ind(res.pca,  col.ind="cos2") +
scale_color_gradient2(low="white", mid="blue", 
                      high="red", midpoint=0.50)+
  theme_minimal()

Read more about ggplot2 themes here : ggplot2 themes and background colors

Make a biplot of individuals and variables :

fviz_pca_biplot(res.pca,  geom = "text")

Change the color of individuals by groups

We will use the iris data set in this section :

data(iris)

head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

# The variable Species (index = 5) is removed
# before PCA analysis
iris.pca <- PCA(iris[,-5], graph = FALSE)

Individuals factor map :

# Default plot
fviz_pca_ind(iris.pca, label="none")

Change individual colors by groups :

fviz_pca_ind(iris.pca,  label="none", habillage=iris$Species)

Add ellipses of point concentrations : the argument habillage is used to specify the factor variable for coloring the observations by groups.

fviz_pca_ind(iris.pca, label="none", habillage=iris$Species,
             addEllipses=TRUE, ellipse.level=0.95)

Now, let’s :

make a biplot of individuals and variables
change the color of individuals by groups
change the transparency of variables by their cos2 values
show only the labels for variables

fviz_pca_biplot(iris.pca, 
  habillage = iris$Species, addEllipses = TRUE,
  col.var = "red", alpha.var ="cos2",
  label = "var") +
  scale_color_brewer(palette="Dark2")+
  theme_minimal()

Principal component analysis using supplementary individuals and variables

Supplementary variables and individuals are not used for the determination of the principal components. Their coordinates are predicted using only the information provided by the principal component analysis performed on the active variables/individuals.

To specify supplementary individuals and variables, the function PCA() can be used as follows :

PCA(X, scale.unit = TRUE, ncp = 5, ind.sup = NULL,
    quanti.sup=NULL, quali.sup=NULL, graph=TRUE, axes = c(1,2))

X : a data frame. Rows are individuals and columns are numeric variables.
scale.unit : a logical value. If TRUE, the data are scaled to unit variance before the analysis.
ncp : number of dimensions kept in the final results.
ind.sup : a numeric vector specifying the indexes of the supplementary individuals
quanti.sup, quali.sup : a numeric vector specifying, respectively, the indexes of the supplementary quantitative and qualitative variables
graph : a logical value. If TRUE a graph is displayed.
axes : a vector of length 2 specifying the components to be plotted

Example of usage :

res.pca <- PCA(decathlon2, ind.sup=24:27, 
               quanti.sup = 11:12, quali.sup = 13, graph=FALSE)

Visualize supplementary quantitative variables

All the results (coordinates, correlation and cos2) for the supplementary quantitative variables can be extracted as follows :

res.pca$quanti.sup

$coord
            Dim.1       Dim.2      Dim.3       Dim.4       Dim.5
Rank   -0.7014777 -0.24519443 -0.1834294  0.05575186 -0.07382647
Points  0.9637075  0.07768262  0.1580225 -0.16623092 -0.03114711

$cor
            Dim.1       Dim.2      Dim.3       Dim.4       Dim.5
Rank   -0.7014777 -0.24519443 -0.1834294  0.05575186 -0.07382647
Points  0.9637075  0.07768262  0.1580225 -0.16623092 -0.03114711

$cos2
           Dim.1       Dim.2      Dim.3      Dim.4        Dim.5
Rank   0.4920710 0.060120310 0.03364635 0.00310827 0.0054503477
Points 0.9287322 0.006034589 0.02497110 0.02763272 0.0009701427

Variables factor map using FactoMineR base graph :

plot(res.pca, choix = "var")

Supplementary quantitative variables are shown in blue color and dashed lines.

It’s also possible to make the variables factor map using factoextra :

fviz_pca_var(res.pca)

Visualize supplementary individuals

The data set decathlon2 contains some supplementary individuals, in rows 24 to 27.

# Data for the supplementary individuals
ind.sup <- decathlon2[24:27, 1:10]
ind.sup[, 1:6]

         X100m Long.jump Shot.put High.jump X400m X110m.hurdle
KARPOV   11.02      7.30    14.77      2.04 48.37        14.09
WARNERS  11.11      7.60    14.31      1.98 48.68        14.23
Nool     10.80      7.53    14.26      1.88 48.81        14.80
Drews    10.87      7.38    13.07      1.88 48.51        14.01

Individuals factor map using FactoMineR base graph :

plot(res.pca, choix="ind")

Supplementary individuals are shown in blue. The levels of the supplementary qualitative variable are shown in magenta.

The results for supplementary individuals can be extracted as follows :

res.pca$ind.sup

$coord
              Dim.1       Dim.2      Dim.3      Dim.4       Dim.5
KARPOV    0.7947206  0.77951227 -1.6330203  1.7242283 -0.75070396
WARNERS  -0.3864645 -0.12159237 -1.7387332 -0.7063341 -0.03230011
Nool     -0.5591306  1.97748871 -0.4830358 -2.2784526 -0.25461493
Drews    -1.1092038  0.01741477 -3.0488182 -1.5343468 -0.32642192

$cos2
              Dim.1        Dim.2      Dim.3      Dim.4        Dim.5
KARPOV   0.05104677 4.911173e-02 0.21553730 0.24028620 0.0455487744
WARNERS  0.02422707 2.398250e-03 0.49039677 0.08092862 0.0001692349
Nool     0.02897149 3.623868e-01 0.02162236 0.48108780 0.0060077529
Drews    0.09207094 2.269527e-05 0.69560547 0.17617609 0.0079736753

$dist
 KARPOV  WARNERS     Nool    Drews  
3.517470 2.482899 3.284943 3.655527
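
The $cos2 values above can be recovered from $coord and $dist : the cos2 of an individual on a component is its squared coordinate divided by its squared distance to the origin. A minimal check with the numbers printed for KARPOV (typed in by hand) :

```r
# Coordinates and distance to the origin printed above for KARPOV
coord.karpov <- c(Dim.1 = 0.7947206, Dim.2 = 0.77951227, Dim.3 = -1.6330203,
                  Dim.4 = 1.7242283, Dim.5 = -0.75070396)
dist.karpov <- 3.517470

# cos2 = squared coordinate / squared distance to the origin
cos2.karpov <- coord.karpov^2 / dist.karpov^2
round(cos2.karpov, 4)

# Quality of representation on the first five components together
cos2.5d <- sum(cos2.karpov)
cos2.5d
```

The total over the five components is below 1 because the distance is measured in the full space, not only on the retained components.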

Supplementary qualitative variables

The data set decathlon2 contains a supplementary qualitative variable in column 13, corresponding to the type of competition.

A qualitative variable can be helpful for interpreting the data and for coloring individuals by groups.

The argument habillage is used to specify the index of the supplementary qualitative variable :

plot(res.pca, choix = "ind", habillage = 13)

It’s also possible to use factoextra :

fviz_pca_ind(res.pca, habillage = 13,
  addEllipses =TRUE, ellipse.level = 0.68) +
  scale_color_brewer(palette="Dark2") +
  theme_minimal()

Supplementary individuals are shown in blue.

The results concerning the supplementary qualitative variable are :

res.pca$quali.sup

$coord
             Dim.1      Dim.2       Dim.3      Dim.4      Dim.5
Decastar -1.343451  0.1218097 -0.03789524  0.1808357  0.1343364
OlympicG  1.231497 -0.1116589  0.03473730 -0.1657661 -0.1231417

$cos2
             Dim.1       Dim.2        Dim.3      Dim.4       Dim.5
Decastar 0.9051233 0.007440939 0.0007201669 0.01639956 0.009050062
OlympicG 0.9051233 0.007440939 0.0007201669 0.01639956 0.009050062

$v.test
             Dim.1      Dim.2      Dim.3      Dim.4      Dim.5
Decastar -2.970766  0.4034256 -0.1528767  0.8971036  0.7202457
OlympicG  2.970766 -0.4034256  0.1528767 -0.8971036 -0.7202457

$dist
Decastar OlympicG 
1.412108 1.294433 

$eta2
                Dim 1      Dim 2       Dim 3      Dim 4      Dim 5
Competition 0.4011568 0.00739783 0.001062332 0.03658159 0.02357972

Dimension description

The function dimdesc() can be used to identify the most correlated variables with a given principal component.

A simplified format is :

dimdesc(res, axes = 1:3, proba = 0.05)

res : an object of class PCA
axes : a numeric vector specifying the dimensions to be described
proba : the significance level
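
To give an idea of what the $quanti table of such a description contains, here is a rough sketch in base R. It is not the dimdesc() implementation : it assumes that the table is built from a correlation test between each variable and the coordinates of the individuals on the dimension, keeps the variables with a p-value below proba, and sorts them by decreasing correlation. prcomp() and iris are used as stand-ins.

```r
# Standardized data and PCA
X <- scale(iris[, 1:4])
pca <- prcomp(X)

# Coordinates of the individuals on the first dimension
dim1 <- pca$x[, 1]

# Significance threshold
proba <- 0.05

# Correlation test between each variable and the dimension
desc <- t(sapply(colnames(X), function(v) {
  test <- cor.test(X[, v], dim1)
  c(correlation = unname(test$estimate), p.value = test$p.value)
}))

# Keep the significant variables, sorted by decreasing correlation
desc <- desc[desc[, "p.value"] < proba, , drop = FALSE]
desc.sorted <- desc[order(desc[, "correlation"], decreasing = TRUE), , drop = FALSE]
desc.sorted
```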

Example of usage :

res.desc <- dimdesc(res.pca, axes = c(1,2))
# Description of dimension 1
res.desc$Dim.1

$quanti
             correlation      p.value
Points         0.9637075 1.605675e-13
Long.jump      0.7941806 6.059893e-06
Discus         0.7432090 4.842563e-05
Shot.put       0.7339127 6.723102e-05
High.jump      0.6100840 1.993677e-03
Javeline       0.4282266 4.149192e-02
Rank          -0.7014777 1.917657e-04
X400m         -0.7016034 1.910387e-04
X110m.hurdle  -0.7641252 2.195812e-05
X100m         -0.8506257 2.727129e-07

$quali
                   R2     p.value
Competition 0.4011568 0.001177378

$category
          Estimate     p.value
OlympicG  1.287474 0.001177378
Decastar -1.287474 0.001177378

# Description of dimension 2
res.desc$Dim.2

$quanti
           correlation      p.value
Pole.vault   0.8074511 3.205016e-06
X1500m       0.7844802 9.384747e-06
High.jump   -0.4652142 2.529390e-02

Infos

This analysis has been performed using R software (ver. 3.1.2) and ggplot2 (ver. )