Advanced Clustering

Hierarchical K-Means Clustering: Optimize Clusters

K-means represents one of the most popular clustering algorithm. However, it has some limitations: it requires the user to specify the number of clusters in advance and selects initial centroids randomly. The final k-means clustering solution is very sensitive to this initial random selection of cluster centers. The result might be (slightly) different each time you compute k-means.

In this chapter, we described an hybrid method, named hierarchical k-means clustering (hkmeans), for improving k-means results.



Contents:

Related Book

Practical Guide to Cluster Analysis in R

Algorithm

The algorithm is summarized as follow:

  1. Compute hierarchical clustering and cut the tree into k-clusters
  2. Compute the center (i.e the mean) of each cluster
  3. Compute k-means by using the set of cluster centers (defined in step 2) as the initial cluster centers

Note that, k-means algorithm will improve the initial partitioning generated at the step 2 of the algorithm. Hence, the initial partitioning can be slightly different from the final partitioning obtained in the step 4.

R code

The R function hkmeans() [in factoextra], provides an easy solution to compute the hierarchical k-means clustering. The format of the result is similar to the one provided by the standard kmeans() function (see Chapter @ref(kmeans-clustering)).

To install factoextra, type this: install.packages(“factoextra”).

We’ll use the USArrest data set and we start by standardizing the data:

df <- scale(USArrests)
# Compute hierarchical k-means clustering
library(factoextra)
res.hk <-hkmeans(df, 4)
# Elements returned by hkmeans()
names(res.hk)
##  [1] "cluster"      "centers"      "totss"        "withinss"    
##  [5] "tot.withinss" "betweenss"    "size"         "iter"        
##  [9] "ifault"       "data"         "hclust"

To print all the results, type this:

# Print the results
res.hk
# Visualize the tree
fviz_dend(res.hk, cex = 0.6, palette = "jco", 
          rect = TRUE, rect_border = "jco", rect_fill = TRUE)

# Visualize the hkmeans final clusters
fviz_cluster(res.hk, palette = "jco", repel = TRUE,
             ggtheme = theme_classic())

Summary

We described hybrid hierarchical k-means clustering for improving k-means results.



(Next Lesson) Fuzzy Clustering Essentials
Back to Advanced Clustering

Comments ( 4 )

  • Bryan Hutchinson

    Dear Alboukadel

    If I were to use hierarchical k-means clustering, what function method should I use in the eclust function when doing cluster validation to initialise the function?

  • YURI COSTA

    Hello, I have a practical question. I need to classify salinity zones for the estuarine environment. There is a theoretical classification (Venice System) that defines four zones,
    Euhalina (40 ~ 30),
    Polyhalina (30 ~ 18),
    Mesohalina (18 ~ 5),
    Oligohalina (5 ~ 0.5).
    I have data collected over the salinity gradient at different periods and would like to apply this classification to my ecosystems. In the real world, there is some overlap in tidal zone edges, but I still find it interesting to try to apply some clustering technique to these systems. What I imagine I need is an algorithm that classifies the zones of each system and that I provide the rating ranges. That is, I have points in space that have different salinity values ​​and I would like to group them according to a rule (based on intervals) defined by me. Would you have something to suggest me?

  • S3-T1 D'Amato

    how can I do a cluster validation analysis of the clusters obtained by hkmeans ?
    I tried using fviz_silhouette, but I get an error message saying that hkmeans objects are not supported by fviz_silhouette.

Give a comment

Want to post an issue with R? If yes, please make sure you have read this: How to Include Reproducible R Script Examples in Datanovia Comments

Teacher
Alboukadel Kassambara
Role : Founder of Datanovia
Read More