The Guide for Clustering Analysis on Real Data: 4 steps you should know - Unsupervised Machine Learning


The volume of data collected every day in fields such as biomedicine, security, marketing, web search and geospatial analysis, or by other automated equipment, exceeds what humans can process unaided. Consequently, unsupervised machine learning techniques, such as clustering, are used to discover knowledge from big data.

Clustering approaches classify samples into groups (i.e., clusters) containing objects with similar profiles. In our previous post, we clarified distance measures for assessing similarity between observations.

In this chapter, we’ll describe the steps to follow for clustering real data using k-means clustering.



1 Required packages

The following packages will be used:

  • cluster for clustering analyses
  • factoextra for visualizing clusters using the ggplot2 plotting system

Install the factoextra package as follows:

if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/factoextra")

The cluster package can be installed using the code below:

install.packages("cluster")

Load packages:

library(cluster)
library(factoextra)
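
Before loading, it can help to check which of the required packages are already installed. The sketch below uses a hypothetical helper, missing_pkgs(), built on base R's requireNamespace(); it only reports what is missing and does not install anything.

```r
# Hypothetical helper (not part of any package): return the names of the
# packages that cannot be found in the current library paths.
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages used in this chapter
needed <- c("cluster", "factoextra")
missing_pkgs(needed)
```

An empty result means both packages are available; any name that is returned still needs to be installed with the commands given above.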

2 Data preparation

We’ll use the built-in R data set USArrests, which can be loaded and prepared as follows:

# Load the data set
data(USArrests)
# Remove any missing values (i.e., NA, "not available")
# that might be present in the data
USArrests <- na.omit(USArrests)
# View the first 6 rows of the data
head(USArrests, n = 6)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7

In this data set, columns are variables and rows are observations (i.e., samples).
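
Whether na.omit() actually removed anything can be checked directly. The following is a minimal sketch using base R only; the object names na_counts and cleaned are chosen here for illustration.

```r
data(USArrests)

# Number of observations (rows) and variables (columns)
dim(USArrests)

# Number of missing values in each variable
na_counts <- colSums(is.na(USArrests))
na_counts

# na.omit() drops a row only if at least one of its values is missing,
# so the row count is unchanged when the data is complete
cleaned <- na.omit(USArrests)
nrow(cleaned) == nrow(USArrests)
```

For USArrests every count is zero, so the call to na.omit() is a safeguard rather than a necessity.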

Before running k-means clustering, we’ll inspect the data by computing some descriptive statistics, such as the mean and the standard deviation of each variable.

The apply() function applies a given function (e.g., min(), max(), mean(), …) to the data set. Its second argument (MARGIN) takes the value:

  • 1: to apply the function to the rows
  • 2: to apply the function to the columns

desc_stats <- data.frame(
  Min = apply(USArrests, 2, min), # minimum
  Med = apply(USArrests, 2, median), # median
  Mean = apply(USArrests, 2, mean), # mean
  SD = apply(USArrests, 2, sd), # Standard deviation
  Max = apply(USArrests, 2, max) # Maximum
  )
desc_stats <- round(desc_stats, 1)
head(desc_stats)
##           Min   Med  Mean   SD   Max
## Murder    0.8   7.2   7.8  4.4  17.4
## Assault  45.0 159.0 170.8 83.3 337.0
## UrbanPop 32.0  66.0  65.5 14.5  91.0
## Rape      7.3  20.1  21.2  9.4  46.0
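
To make the role of the second argument of apply() concrete, here is a small sketch on a toy matrix (the matrix m is made up for illustration and is unrelated to USArrests):

```r
# A 2 x 3 toy matrix:
#      a b c
# row1 1 3 5
# row2 2 4 6
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("row1", "row2"), c("a", "b", "c")))

row_sums <- apply(m, 1, sum)  # MARGIN = 1: one value per row    -> 9 12
col_sums <- apply(m, 2, sum)  # MARGIN = 2: one value per column -> 3 7 11

row_sums
col_sums
```

With MARGIN = 2, as used for desc_stats above, each statistic is computed once per variable.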

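Note that the variables have very different means and variances. Because k-means relies on distances between observations, a variable with a large scale (here, Assault) would otherwise dominate the result, so the variables must be standardized to make them comparable.

Standardization consists of transforming the variables so that they have mean zero and standard deviation one. The scale() function can be used as follows: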

df <- scale(USArrests)
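
As a quick sanity check, the effect of scale() can be verified and reproduced by hand. This is a minimal sketch; murder_z is a name introduced here for illustration.

```r
df <- scale(USArrests)

# Column means are numerically zero and standard deviations are one
round(colMeans(df), 10)
apply(df, 2, sd)

# The same transformation written out for a single variable:
# subtract the mean, then divide by the standard deviation
murder_z <- (USArrests$Murder - mean(USArrests$Murder)) / sd(USArrests$Murder)
all.equal(murder_z, unname(df[, "Murder"]))
```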

3 Assessing the clusterability

The function get_clust_tendency() [in factoextra] can be used to assess whether the data contains meaningful clusters. It computes the Hopkins statistic and can draw an ordered dissimilarity image for visual inspection.

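library("factoextra")
# n = 40: number of points sampled to compute the statistic
res <- get_clust_tendency(df, n = 40, graph = FALSE)
# Hopkins statistic
res$hopkins_stat
## [1] 0.3440875
# The ordered dissimilarity image is returned only when graph = TRUE;
# with graph = FALSE, res$plot is NULL
res$plot
## NULL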

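The value of the Hopkins statistic (0.34) is below 0.5. Under the convention used here (values near 0 indicate a clustering tendency, values near 0.5 indicate uniformly distributed data), this suggests that the data is clusterable. Some implementations report the complement of this value, so check the documentation of the installed version before interpreting it. Additionally, when the function is called with graph = TRUE, the ordered dissimilarity image can be inspected for patterns (i.e., clusters).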
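
To show what the statistic measures, here is a simplified base R sketch. It is an illustration only, not the factoextra implementation: the function name hopkins_sketch() is hypothetical, the artificial points are drawn uniformly within the range of each variable, and the result follows the convention where values below 0.5 suggest a clustering tendency. Its output will not match res$hopkins_stat exactly.

```r
# Hypothetical helper: simplified Hopkins statistic.
# Assumes n > 1 and n <= nrow(x).
hopkins_sketch <- function(x, n, seed = 123) {
  x <- as.matrix(x)
  set.seed(seed)

  # w: distance from each of n sampled real points to its nearest
  # real neighbour (excluding the point itself)
  idx <- sample(nrow(x), n)
  d_real <- as.matrix(dist(x))
  diag(d_real) <- Inf
  w <- apply(d_real[idx, , drop = FALSE], 1, min)

  # u: distance from each of n artificial points, drawn uniformly within
  # the range of each variable, to its nearest real point
  u_pts <- apply(x, 2, function(col) runif(n, min(col), max(col)))
  u <- apply(u_pts, 1, function(p) min(sqrt(colSums((t(x) - p)^2))))

  # Clustered data has small w relative to u, giving a value below 0.5
  sum(w) / (sum(u) + sum(w))
}

df <- scale(USArrests)
h <- hopkins_sketch(df, n = 40)
h
```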

4 Estimate the number of clusters in the data

Because k-means clustering requires the number of clusters to be specified in advance, we’ll use the function clusGap() [in cluster] to compute the gap statistic for estimating the optimal number of clusters. The function fviz_gap_stat() [in factoextra] is used to visualize the gap statistic plot.

library("cluster")
set.seed(123)
# Compute the gap statistic
gap_stat <- clusGap(df, FUN = kmeans, nstart = 25, 
                    K.max = 10, B = 500) 
# Plot the result
library(factoextra)
fviz_gap_stat(gap_stat)

The gap statistic suggests a 4-cluster solution.
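
The suggested number of clusters can also be extracted programmatically rather than read off the plot. The sketch below uses maxSE() [in cluster] with the "firstSEmax" rule; that this matches the rule behind the plot is an assumption, and B is reduced here only to keep the example fast, so the value of k_hat may differ from the result quoted in the text.

```r
library(cluster)

df <- scale(USArrests)
set.seed(123)

# Fewer bootstrap samples than in the text (B = 50 instead of 500)
gap_stat <- clusGap(df, FUN = kmeans, nstart = 25,
                    K.max = 10, B = 50, verbose = FALSE)

# gap_stat$Tab holds one row per k, including the gap value and its
# simulation standard error
tab <- gap_stat$Tab
k_hat <- maxSE(f = tab[, "gap"], SE.f = tab[, "SE.sim"],
               method = "firstSEmax", SE.factor = 1)
k_hat
```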

It’s also possible to use the function NbClust() [in the NbClust package].

5 Compute k-means clustering

K-means clustering with k = 4:

# Compute k-means
set.seed(123)
km.res <- kmeans(df, 4, nstart = 25)
head(km.res$cluster, 20)
##     Alabama      Alaska     Arizona    Arkansas  California    Colorado 
##           4           3           3           4           3           3 
## Connecticut    Delaware     Florida     Georgia      Hawaii       Idaho 
##           2           2           3           4           2           1 
##    Illinois     Indiana        Iowa      Kansas    Kentucky   Louisiana 
##           3           2           1           2           1           4 
##       Maine    Maryland 
##           1           3
# Visualize clusters using factoextra
fviz_cluster(km.res, USArrests)
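
To interpret the clusters, it is often useful to look at their sizes and at the mean of each variable per cluster in the original units. A minimal base R sketch follows; cluster_means is a name introduced here.

```r
df <- scale(USArrests)
set.seed(123)
km.res <- kmeans(df, 4, nstart = 25)

# Number of states in each cluster
table(km.res$cluster)

# Mean of each variable by cluster, computed on the unscaled data
cluster_means <- aggregate(USArrests,
                           by = list(cluster = km.res$cluster),
                           FUN = mean)
cluster_means
```

Clustering is done on the standardized data, but the summary uses USArrests so that the means stay in interpretable units.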

6 Cluster validation statistics: Inspect cluster silhouette plot

Recall that the silhouette (\(S_i\)) measures how similar an object \(i\) is to the other objects in its own cluster versus those in the neighbor cluster. \(S_i\) values range from 1 to -1:

  • A value of \(S_i\) close to 1 indicates that the object is well clustered. In other words, the object \(i\) is similar to the other objects in its group.
  • A value of \(S_i\) close to -1 indicates that the object is poorly clustered, and that assignment to some other cluster would probably improve the overall results.

sil <- silhouette(km.res$cluster, dist(df))
rownames(sil) <- rownames(USArrests)
head(sil[, 1:3])
##            cluster neighbor  sil_width
## Alabama          4        3 0.48577530
## Alaska           3        4 0.05825209
## Arizona          3        2 0.41548326
## Arkansas         4        2 0.11870947
## California       3        2 0.43555885
## Colorado         3        2 0.32654235
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1   13          0.37
## 2       2   16          0.34
## 3       3   13          0.27
## 4       4    8          0.39
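
The silhouette width of a single observation can be reproduced by hand, which makes the definition concrete: \(S_i = (b_i - a_i) / \max(a_i, b_i)\), where \(a_i\) is the mean distance from \(i\) to the other members of its own cluster and \(b_i\) is the smallest mean distance from \(i\) to the members of any other cluster. The helper sil_by_hand() below is hypothetical and assumes that the cluster of \(i\) has at least two members.

```r
library(cluster)

df <- scale(USArrests)
set.seed(123)
km <- kmeans(df, 4, nstart = 25)
d <- as.matrix(dist(df))

# Hypothetical helper: silhouette width of observation i
sil_by_hand <- function(i, cl, d) {
  own <- cl == cl[i]
  own[i] <- FALSE                      # exclude the observation itself
  a <- mean(d[i, own])                 # mean distance within own cluster
  others <- setdiff(unique(cl), cl[i])
  b <- min(sapply(others, function(k) mean(d[i, cl == k])))
  (b - a) / max(a, b)
}

s1 <- sil_by_hand(1, km$cluster, d)    # first row: Alabama
sil <- silhouette(km$cluster, dist(df))
all.equal(s1, unname(sil[1, "sil_width"]))
```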

It can be seen that some samples have negative silhouette values. Some natural questions are:

Which samples are these? To which cluster are they closer?

This can be determined from the output of the function silhouette() as follows:

neg_sil_index <- which(sil[, "sil_width"] < 0)
sil[neg_sil_index, , drop = FALSE]
##          cluster neighbor   sil_width
## Missouri       3        2 -0.07318144

7 eclust(): Enhanced clustering analysis

The function eclust() [in factoextra] provides several advantages compared to the standard packages used for clustering analysis:

  • It simplifies the workflow of clustering analysis
  • It can be used to compute hierarchical clustering and partitioning clustering with a single function call
  • It automatically computes the gap statistic to estimate the right number of clusters
  • It automatically provides silhouette information
  • It draws beautiful graphs using ggplot2

7.1 K-means clustering using eclust()

# Compute k-means
res.km <- eclust(df, "kmeans")

# Gap statistic plot
fviz_gap_stat(res.km$gap_stat)

# Silhouette plot
fviz_silhouette(res.km)
##   cluster size ave.sil.width
## 1       1   13          0.27
## 2       2   13          0.37
## 3       3    8          0.39
## 4       4   16          0.34

7.2 Hierarchical clustering using eclust()

# Enhanced hierarchical clustering
res.hc <- eclust(df, "hclust") # compute hclust
fviz_dend(res.hc, rect = TRUE) # dendrogram

The R code below generates the silhouette plot and the scatter plot for hierarchical clustering.

fviz_silhouette(res.hc) # silhouette plot
fviz_cluster(res.hc) # scatter plot
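
For comparison, a similar hierarchical clustering can be sketched with base R alone. The choices of Euclidean distance, Ward linkage ("ward.D2") and k = 4 are assumptions made for this illustration and may not match what eclust() selects.

```r
df <- scale(USArrests)

# Agglomerative clustering on Euclidean distances with Ward linkage
hc <- hclust(dist(df), method = "ward.D2")

# Cut the tree into 4 groups
hc_groups <- cutree(hc, k = 4)
table(hc_groups)

# Cross-tabulate against a k-means partition with the same k
set.seed(123)
km <- kmeans(df, 4, nstart = 25)
table(hclust = hc_groups, kmeans = km$cluster)
```

Cluster labels are arbitrary in both methods, so agreement shows up as a table with one dominant cell per row rather than as a filled diagonal.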

8 Infos

This analysis has been performed using R software (ver. 3.2.3)








