HCPC - Hierarchical Clustering on Principal Components: Essentials

Clustering is one of the important data mining methods for discovering knowledge in multivariate data sets. The goal is to identify groups (i.e. clusters) of similar objects within a data set of interest. To learn more about clustering, you can read our book entitled “Practical Guide to Cluster Analysis in R” (https://goo.gl/DmJ5y5).

Briefly, the two most common clustering strategies are:

Hierarchical clustering, used for identifying groups of similar observations in a data set.
Partitioning clustering such as k-means algorithm, used for splitting a data set into several groups.

The HCPC (Hierarchical Clustering on Principal Components) approach combines the three standard methods used in multivariate data analysis (Husson, Josse, and Pagès 2010):

Principal component methods (PCA, CA, MCA, FAMD, MFA),
Hierarchical clustering and
Partitioning clustering, particularly the k-means method.

This chapter describes WHY and HOW to combine principal components and clustering methods. Finally, we demonstrate how to compute and visualize HCPC using R software.

Contents:

Why HCPC
Algorithm of the HCPC method
Computation
Summary
Further reading
References

Why HCPC

Combining principal component methods and clustering methods is useful in at least three situations.

Case 1: Continuous variables

When you have a multidimensional data set containing many continuous variables, principal component analysis (PCA) can be used to reduce the dimension of the data to a few continuous variables that carry the most important information. You can then perform cluster analysis on the PCA results.

The PCA step can be considered as a denoising step which can lead to a more stable clustering. This might be very useful if you have a large data set with multiple variables, such as in gene expression data.
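As a rough illustration of this dimension-reduction step, the sketch below uses base R's prcomp() (not FactoMineR) on the USArrests data to see how much variance the leading components carry. The 90% threshold is an arbitrary assumption made for the example, not a recommendation from the source.

```r
# Standardize the four continuous variables of USArrests, then run PCA
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Proportion of variance carried by each component, and the running total
var_prop <- pca$sdev^2 / sum(pca$sdev^2)
cum_var  <- cumsum(var_prop)
print(round(cum_var, 3))

# Keep the smallest number of components reaching the (assumed) 90% threshold
n_keep <- which(cum_var >= 0.90)[1]

# These scores are what a clustering step would then work on
scores <- pca$x[, seq_len(n_keep), drop = FALSE]
dim(scores)
```

The trailing components that are dropped are the ones treated as noise; clustering is then run on the retained scores only.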

Case 2: Clustering on categorical data

To perform cluster analysis on categorical data, correspondence analysis (CA, for analyzing a contingency table) and multiple correspondence analysis (MCA, for analyzing several categorical variables) can be used to transform the categorical variables into a small set of continuous variables (the principal components). Cluster analysis can then be applied to the (M)CA results.

In this case, the (M)CA method can be considered as a pre-processing step that makes it possible to cluster categorical data.

Case 3: Clustering on mixed data

When you have mixed data containing both continuous and categorical variables, you can first perform FAMD (factor analysis of mixed data) or MFA (multiple factor analysis). Next, you can apply cluster analysis on the FAMD/MFA outputs.
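A minimal sketch of the mixed-data case, assuming the built-in iris data set (four continuous variables and one categorical variable) as a stand-in for real mixed data. The choices ncp = 3 and nb.clust = -1 are illustrative assumptions, not settings taken from the source.

```r
library(FactoMineR)

# iris mixes 4 continuous variables with 1 categorical variable (Species)
res.famd <- FAMD(iris, ncp = 3, graph = FALSE)

# Cluster on the FAMD components; nb.clust = -1 cuts the tree automatically
res.hcpc.famd <- HCPC(res.famd, nb.clust = -1, graph = FALSE)

# Cluster sizes
table(res.hcpc.famd$data.clust$clust)
```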

Algorithm of the HCPC method

The algorithm of the HCPC method, as implemented in the FactoMineR package, can be summarized as follows:

Compute principal component methods: PCA, (M)CA or MFA depending on the types of variables in the data set and the structure of the data set. At this step, you can choose the number of dimensions to be retained in the output by specifying the argument ncp. The default value is 5.
Compute hierarchical clustering: Hierarchical clustering is performed using Ward’s criterion on the selected principal components. Ward’s criterion is used because, like principal component analysis, it is based on the multidimensional variance.
Choose the number of clusters based on the hierarchical tree: An initial partitioning is performed by cutting the hierarchical tree.
Perform k-means clustering to improve the initial partition obtained from hierarchical clustering. The final partitioning solution, obtained after consolidation with k-means, can be (slightly) different from the one obtained with the hierarchical clustering.
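The four steps above can be sketched with base R functions only. This is an approximation written for illustration, not the FactoMineR implementation: prcomp() standardizes slightly differently from PCA(), and the number of clusters k = 4 is an assumption fixed by hand rather than suggested by the tree.

```r
k <- 4  # number of clusters, assumed for this illustration

# Step 1: PCA on standardized data, keeping the first 3 components
pca    <- prcomp(USArrests, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:3]

# Step 2: hierarchical clustering with Ward's criterion on the components
tree <- hclust(dist(scores), method = "ward.D2")

# Step 3: initial partition obtained by cutting the tree into k groups
init <- cutree(tree, k = k)

# Step 4: k-means consolidation, started from the centres of the initial groups
centers <- apply(scores, 2, function(col) tapply(col, init, mean))
km      <- kmeans(scores, centers = centers, iter.max = 10)
final   <- km$cluster

# Cross-tabulate the partition before and after consolidation
table(initial = init, consolidated = final)
```

Off-diagonal counts in the cross-table correspond to individuals that k-means moved to another cluster, which is why the final partition can differ slightly from the tree cut.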

Computation

R packages

We’ll use two R packages: i) FactoMineR for computing HCPC and ii) factoextra for visualizing the results.

To install the packages, type this:

install.packages(c("FactoMineR", "factoextra"))

After the installation, load the packages as follows:

library(factoextra)
library(FactoMineR)

R function

The function HCPC() [in FactoMineR package] can be used to compute hierarchical clustering on principal components.

A simplified format is:

HCPC(res, nb.clust = 0, min = 3, max = NULL, graph = TRUE)

res: Either the result of a factor analysis or a data frame.
nb.clust: an integer specifying the number of clusters. Possible values are:
- 0: the tree is cut at the level the user clicks on
- -1: the tree is automatically cut at the suggested level
- Any positive integer: the tree is cut into nb.clust clusters
min, max: the minimum and the maximum number of clusters to be generated, respectively
graph: if TRUE, graphics are displayed
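As a small sketch of how nb.clust changes the result, the example below runs HCPC() twice on the same PCA: once with the automatic cut and once with a fixed number of clusters. The object names hc.auto and hc.three are hypothetical, chosen for this illustration.

```r
library(FactoMineR)

res.pca <- PCA(USArrests, ncp = 3, graph = FALSE)

# Automatic cut at the suggested level (between min and max clusters)
hc.auto <- HCPC(res.pca, nb.clust = -1, graph = FALSE)

# Fixed cut: ask for exactly 3 clusters
hc.three <- HCPC(res.pca, nb.clust = 3, graph = FALSE)

# Number of clusters obtained in each case
length(unique(hc.auto$data.clust$clust))
length(unique(hc.three$data.clust$clust))
```

Setting graph = FALSE avoids the interactive plot, which is convenient in scripts and reports.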

Case of continuous variables

We start by computing the principal component analysis (PCA). The argument ncp = 3 is used in the function PCA() to keep only the first three principal components. Next, HCPC is applied to the result of the PCA.

library(FactoMineR)
# Compute PCA with ncp = 3
res.pca <- PCA(USArrests, ncp = 3, graph = FALSE)
# Compute hierarchical clustering on principal components
res.hcpc <- HCPC(res.pca, graph = FALSE)

To visualize the dendrogram generated by the hierarchical clustering, we’ll use the function fviz_dend() [factoextra package]:

fviz_dend(res.hcpc, 
          cex = 0.7,                     # Label size
          palette = "jco",               # Color palette see ?ggpubr::ggpar
          rect = TRUE, rect_fill = TRUE, # Add rectangle around groups
          rect_border = "jco",           # Rectangle color
          labels_track_height = 0.8      # Augment the room for labels
          )

The dendrogram suggests a 4-cluster solution.

It’s possible to visualize individuals on the principal component map and to color them according to the cluster they belong to. The function fviz_cluster() [in factoextra] can be used to visualize the clusters of individuals.

fviz_cluster(res.hcpc,
             repel = TRUE,            # Avoid label overlapping
             show.clust.cent = TRUE, # Show cluster centers
             palette = "jco",         # Color palette see ?ggpubr::ggpar
             ggtheme = theme_minimal(),
             main = "Factor map"
             )

You can also draw a three dimensional plot combining the hierarchical clustering and the factorial map using the R base function plot():

# Principal components + tree
plot(res.hcpc, choice = "3D.map")

The function HCPC() returns a list containing:

data.clust: The original data with a supplementary column called clust containing the partition.
desc.var: The variables describing clusters
desc.ind: The most typical individuals of each cluster
desc.axes: The axes describing clusters

To display the original data with cluster assignments, type this:

head(res.hcpc$data.clust, 10)

##             Murder Assault UrbanPop Rape clust
## Alabama       13.2     236       58 21.2     3
## Alaska        10.0     263       48 44.5     4
## Arizona        8.1     294       80 31.0     4
## Arkansas       8.8     190       50 19.5     3
## California     9.0     276       91 40.6     4
## Colorado       7.9     204       78 38.7     4
## Connecticut    3.3     110       77 11.1     2
## Delaware       5.9     238       72 15.8     2
## Florida       15.4     335       80 31.9     4
## Georgia       17.4     211       60 25.8     3

In the table above, the last column contains the cluster assignments.
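A quick way to get a feel for the partition is to count the individuals in each cluster and to average the original variables by cluster. The sketch below recomputes the HCPC so that it runs on its own; it uses nb.clust = -1 explicitly (automatic cut), and the names clust.data, sizes and means are introduced here for illustration.

```r
library(FactoMineR)

res.pca  <- PCA(USArrests, ncp = 3, graph = FALSE)
res.hcpc <- HCPC(res.pca, nb.clust = -1, graph = FALSE)

clust.data <- res.hcpc$data.clust

# Number of individuals per cluster
sizes <- table(clust.data$clust)
sizes

# Mean of each original variable within each cluster
means <- aggregate(. ~ clust, data = clust.data, FUN = mean)
means
```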

To display the quantitative variables that best describe each cluster, type this:

res.hcpc$desc.var$quanti

Here, we show only some columns of interest: “Mean in category”, “Overall mean” and “p.value”.

## $`1`
##          Mean in category Overall mean  p.value
## UrbanPop             52.1        65.54 9.68e-05
## Murder                3.6         7.79 5.57e-05
## Rape                 12.2        21.23 5.08e-05
## Assault              78.5       170.76 3.52e-06
## 
## $`2`
##          Mean in category Overall mean p.value
## UrbanPop            73.88        65.54 0.00522
## Murder               5.66         7.79 0.01759
## 
## $`3`
##          Mean in category Overall mean  p.value
## Murder               13.9         7.79 1.32e-05
## Assault             243.6       170.76 6.97e-03
## UrbanPop             53.8        65.54 1.19e-02
## 
## $`4`
##          Mean in category Overall mean  p.value
## Rape                 33.2        21.23 8.69e-08
## Assault             257.4       170.76 1.32e-05
## UrbanPop             76.0        65.54 2.45e-03
## Murder               10.8         7.79 3.58e-03

From the output above, it can be seen that:

the variables UrbanPop, Murder, Rape and Assault are all significantly associated with cluster 1. For example, the mean value of the Assault variable in cluster 1 is 78.5, which is lower than its overall mean (170.76) across all clusters. It can therefore be concluded that cluster 1 is characterized by a low rate of Assault compared to the data set as a whole.
the variables UrbanPop and Murder are the ones significantly associated with cluster 2.

…and so on …

Similarly, to show principal dimensions that are the most associated with clusters, type this:

res.hcpc$desc.axes$quanti

## $`1`
##       Mean in category Overall mean  p.value
## Dim.1            -1.96    -5.64e-16 2.27e-07
## 
## $`2`
##       Mean in category Overall mean  p.value
## Dim.2            0.743    -5.37e-16 0.000336
## 
## $`3`
##       Mean in category Overall mean  p.value
## Dim.1            1.061    -5.64e-16 3.96e-02
## Dim.3            0.397     3.54e-17 4.25e-02
## Dim.2           -1.477    -5.37e-16 5.72e-06
## 
## $`4`
##       Mean in category Overall mean  p.value
## Dim.1             1.89    -5.64e-16 6.15e-07

The results above indicate that individuals in cluster 1 have strongly negative coordinates on axis 1, whereas individuals in cluster 4 have high positive coordinates on that axis. Individuals in cluster 2 have high coordinates on the second axis. Individuals in the third cluster have positive coordinates on axes 1 and 3 and strongly negative coordinates on axis 2.

Finally, representative individuals of each cluster can be extracted as follows:

res.hcpc$desc.ind$para

## Cluster: 1
##         Idaho  South Dakota         Maine          Iowa New Hampshire 
##         0.367         0.499         0.501         0.553         0.589 
## -------------------------------------------------------- 
## Cluster: 2
##         Ohio     Oklahoma Pennsylvania       Kansas      Indiana 
##        0.280        0.505        0.509        0.604        0.710 
## -------------------------------------------------------- 
## Cluster: 3
##        Alabama South Carolina        Georgia      Tennessee      Louisiana 
##          0.355          0.534          0.614          0.852          0.878 
## -------------------------------------------------------- 
## Cluster: 4
##   Michigan    Arizona New Mexico   Maryland      Texas 
##      0.325      0.453      0.518      0.901      0.924

For each cluster, the 5 individuals closest to the cluster center are shown, together with the distance between each individual and the cluster center. For example, representative individuals for cluster 1 include: Idaho, South Dakota, Maine, Iowa and New Hampshire.
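The idea behind these representative individuals can be sketched in base R: compute each cluster's center in the space of the retained components, then rank the cluster's members by their Euclidean distance to it. This is an illustration under assumptions (prcomp() scores, a Ward tree cut at k = 4, no k-means consolidation), so the names and distances need not match the FactoMineR output above.

```r
# Component scores and a 4-cluster partition from a Ward tree
pca    <- prcomp(USArrests, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:3]
part   <- cutree(hclust(dist(scores), method = "ward.D2"), k = 4)

# For each cluster: distance of every member to the cluster center,
# keeping the 5 smallest distances
closest <- lapply(split(seq_len(nrow(scores)), part), function(idx) {
  members <- scores[idx, , drop = FALSE]
  centre  <- colMeans(members)
  d <- sqrt(rowSums(sweep(members, 2, centre)^2))
  head(sort(d), 5)
})

closest
```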

Case of categorical variables

For categorical variables, compute CA or MCA and then apply the function HCPC() on the results as described above.

Here, we’ll use the tea data [in FactoMineR] as a demo data set: Rows represent the individuals and columns represent categorical variables.

We start by performing an MCA on the individuals. We keep the first 20 axes of the MCA, which retain 87% of the information.

# Loading data
library(FactoMineR)
data(tea)
# Performing MCA
res.mca <- MCA(tea, 
               ncp = 20,            # Number of components kept
               quanti.sup = 19,     # Quantitative supplementary variables
               quali.sup = c(20:36), # Qualitative supplementary variables
               graph=FALSE)
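The share of information retained by the first axes can be checked from the eigenvalue table stored in the MCA result. The sketch below repeats the MCA so that it runs on its own; it assumes that the third column of res.mca$eig holds the cumulative percentage of variance, as in current FactoMineR versions.

```r
library(FactoMineR)
data(tea)

res.mca <- MCA(tea, ncp = 20, quanti.sup = 19, quali.sup = 20:36,
               graph = FALSE)

# Eigenvalue table: eigenvalue, % of variance, cumulative % of variance
eig <- res.mca$eig

# Cumulative percentage of variance for the first 20 axes
round(eig[1:20, 3], 1)
```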

Next, we apply hierarchical clustering on the results of the MCA:

res.hcpc <- HCPC(res.mca, graph = FALSE, max = 3)

The results can be visualized as follows:

# Dendrogram
fviz_dend(res.hcpc, show_labels = FALSE)
# Individuals factor map
fviz_cluster(res.hcpc, geom = "point", main = "Factor map")

As mentioned above, clusters can be described by i) variables and/or categories, ii) principal axes and iii) individuals. In the example below, we display only a subset of the results.

Description by variables and categories

# Description by variables
res.hcpc$desc.var$test.chi2

##          p.value df
## where   8.47e-79  4
## how     3.14e-47  4
## price   1.86e-28 10
## tearoom 9.62e-19  2

# Description by variable categories
res.hcpc$desc.var$category

## $`1`
##                     Cla/Mod Mod/Cla Global  p.value
## where=chain store      85.9    93.8   64.0 2.09e-40
## how=tea bag            84.1    81.2   56.7 1.48e-25
## tearoom=Not.tearoom    70.7    97.2   80.7 1.08e-18
## price=p_branded        83.2    44.9   31.7 1.63e-09
## 
## $`2`
##                 Cla/Mod Mod/Cla Global  p.value
## where=tea shop     90.0    84.4   10.0 3.70e-30
## how=unpackaged     66.7    75.0   12.0 5.35e-20
## price=p_upscale    49.1    81.2   17.7 2.39e-17
## Tea=green          27.3    28.1   11.0 4.44e-03
## 
## $`3`
##                            Cla/Mod Mod/Cla Global  p.value
## where=chain store+tea shop    85.9    72.8   26.0 5.73e-34
## how=tea bag+unpackaged        67.0    68.5   31.3 1.38e-19
## tearoom=tearoom               77.6    48.9   19.3 1.25e-16
## pub=pub                       63.5    43.5   21.0 1.13e-09

The variables that best characterize the clusters are “where” and “how”. Each cluster is characterized by a category of these two variables. For example, individuals who belong to the first cluster buy tea as tea bags in chain stores.

Description by principal components

res.hcpc$desc.axes

Description by individuals

res.hcpc$desc.ind$para

Summary

We described how to compute hierarchical clustering on principal components (HCPC). This approach is useful in several situations, including:

When you have a large data set containing continuous variables, a principal component analysis can be used to reduce the dimension of the data before the hierarchical clustering analysis.
When you have a data set containing categorical variables, a (multiple) correspondence analysis can be used to transform the categorical variables into a few continuous principal components, which can be used as the input of the cluster analysis.

We used the FactoMineR package to compute the HCPC and the factoextra R package for ggplot2-based elegant data visualization.

References

Husson, F., J. Josse, and J. Pagès. 2010. “Principal Component Methods - Hierarchical Clustering - Partitional Clustering: Why Would We Need to Choose for Visualizing Data?” Unpublished Data. https://www.sthda.com/english/upload/hcpc_husson_josse.pdf.

Comments

Comment

Mukesh

Member

#924 12/03/2021 at 05h08

Hi Kassambara
I found factoextra and FactoMineR very useful for cluster dendrogram and factor map.
But I want to update about one issue that went may be unnoticed by users.
If you look at the cities in different groups, Missouri is in group 4 in cluster dendrogram, but in factor map it comes along with group 2 cities. I faced the same problem with my data.
Cluster dendrogram and factor map are not 100% match for grouping of the factor (cities in this case).
The grouping of cities in the res.gcpc$data.clust also does not agree with the cluster dendorgam but it agrees with the factor map.
Can you please help to resolve the issue?

Comment

Visitor

#672 01/11/2019 at 19h48

Hello,

First off, thank you and congratulations on a fantastic website. I've learned so much here.

I am working with financial time series data (monthly percentage returns from several indices and portfolios over a 10 year time horizon) and attempting to cluster the strategies using the HCPC method. Seems like I can only get the mapping to work if I transpose my data so that the time index is on the x and the strategy names are on the y of the data matrix. In PCA speak, I believe this means that the dates are the variables and strategies are the individuals. We are used to working with return data using the opposite setup: the time index on the y and the names on the x (this would make the strategies the variables and the dates the individuals). Indeed, I find the results of the PCA more intuitive using this format. But the HCPC mapping doesn't make sense. Instead of clustering by strategy (which is intuitive) it clusters by date. Is there a way to cluster by strategy without having to transpose my data i.e. keep the strategy as the variable and the date as the individual?

Thank you!

Comment

Visitor

#631 10/17/2018 at 16h46

Hi,

After installing the last version, I have always the same issue as Luigi :

The clusters shown with fviz_dend() are not the same as the ones of all other commands (fviz_clust(), plot(res.hcpc, "3D.map"), or res.hcpc$data.clust. I did not fix the nb.clust.
I think something is going wrong with fviz_dend().

Thank you for your great tool !

Oriane

Comment

Sara

Visitor

#507 05/29/2018 at 22h34

Hi,
I'm trying to do MCA + HCPC to find patterns in my data. The result from MCA gives 100 dimensions (I have 20 variables, including spanish province that has 52 categories).
When I try to do clustering usin HCPC, I receive an error because it tries to allocate a vector of 1.8 Gb.
When I try with kk=10 I receive a tree with 4 clusters, but when I click in the graph it returns the error.
In the help documentation I've read "If the ascending clustering is constructed from a data-frame with a lot of rows (individuals), it is possible to first perform a partition with kk clusters and then construct the tree from the (weighted) kk clusters." I have 80000 observations with 20 variables, so I think I should do this, but can't find any help about how to do it.
Could you please give me an example or show me haow I can do this?
Thanks,
Sara

Comment

Edgar Espinosa-Trujillo

Member

#428 04/05/2018 at 05h28

CONGRATULATIONS¡¡

Comment

kassambara

Administrator

#324 12/15/2017 at 21h30

fixed now, thanks. Install the latest developmental version as follow:

Code R :

 
devtools::install_github("kassambara/factoextra")

Comment

Luigi

Visitor

#323 12/15/2017 at 13h26

Hi,

One issue: for some reason, fviz_dend is not "cutting" the tree that is specified with
nb.clust (I think it chooses the "min" level).
e.g.
fviz_dend(res.hcpc, show_labels = FALSE)

plot(res.hcpc, "3D.map") does give me the "correct" result based on nbclust.

Am I doing something wrong when using fviz_dend?

Really great tools and website!

Thanks
-Luigi
