<?xml version="1.0" encoding="UTF-8" ?>
<!-- RSS generated by PHPBoost on Sat, 09 May 2026 15:33:15 +0200 -->

<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
	<channel>
		<title><![CDATA[Last articles - STHDA : Cluster Validation Essentials]]></title>
		<atom:link href="https://www.sthda.com/english/syndication/rss/articles/29" rel="self" type="application/rss+xml"/>
		<link>https://www.sthda.com</link>
		<description><![CDATA[Last articles - STHDA : Cluster Validation Essentials]]></description>
		<copyright>(C) 2005-2026 PHPBoost</copyright>
		<language>en</language>
		<generator>PHPBoost</generator>
		
		
		<item>
			<title><![CDATA[Computing P-value for Hierarchical Clustering]]></title>
			<link>https://www.sthda.com/english/articles/29-cluster-validation-essentials/99-computing-p-value-for-hierarchical-clustering/</link>
			<guid>https://www.sthda.com/english/articles/29-cluster-validation-essentials/99-computing-p-value-for-hierarchical-clustering/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">
<p>Clusters can appear in a data set purely by chance, due to noise or sampling error. This article describes the R package <strong>pvclust</strong> <span class="citation">(Suzuki and Shimodaira 2015)</span>, which uses bootstrap resampling to <strong>compute a p-value</strong> for each cluster in a <strong>hierarchical clustering</strong>.</p>
<br/>
<p>Contents:</p>
<div id="TOC">
<ul>
<li><a href="#algorithm">Algorithm</a></li>
<li><a href="#required-packages">Required packages</a></li>
<li><a href="#data-preparation">Data preparation</a></li>
<li><a href="#compute-p-value-for-hierarchical-clustering">Compute p-value for hierarchical clustering</a><ul>
<li><a href="#description-of-pvclust-function">Description of pvclust() function</a></li>
<li><a href="#usage-of-pvclust-function">Usage of pvclust() function</a></li>
</ul></li>
<li><a href="#references">References</a></li>
</ul>
</div>
<br/>
<p>Related Book:</p>
<div class = "small-block content-privileged-friends cluster-book">
    <center>
        <a href = "https://www.sthda.com/english/web/5-bookadvisor/17-practical-guide-to-cluster-analysis-in-r/">
          <img src = "https://www.sthda.com/english/sthda-upload/images/cluster-analysis/clustering-book-cover.png" /><br/>
      Practical Guide to Cluster Analysis in R
      </a>
      </center>
</div>
<div class="spacer"></div>
<div id="algorithm" class="section level2">
<h2>Algorithm</h2>
<ol style="list-style-type: decimal">
<li>Generate thousands of bootstrap samples by randomly sampling elements of the data</li>
<li>Compute hierarchical clustering on each bootstrap copy</li>
<li>For each cluster:
<ul>
<li>Compute the <em>bootstrap probability</em> (<em>BP</em>) value, which corresponds to the frequency with which the cluster is identified across the bootstrap copies.</li>
<li>Compute the <em>approximately unbiased</em> (AU) probability values (p-values) by multiscale bootstrap resampling</li>
</ul></li>
</ol>
<div class="success">
<p>
Clusters with AU >= 95% are considered to be strongly supported by data.
</p>
</div>
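<p>To make the bootstrap probability idea concrete, here is a minimal illustration (a sketch only, not the <strong>pvclust</strong> implementation) that resamples the rows of a small built-in data set, re-runs <em>hclust</em>() on the columns each time, and counts how often a hypothetical candidate cluster reappears:</p>
<pre class="r"><code># Illustration of the bootstrap probability (BP) idea -- not pvclust itself
set.seed(123)
x <- scale(USArrests)             # demo data; the 4 columns will be clustered
target <- c("Murder", "Assault")  # hypothetical candidate cluster
B <- 200                          # number of bootstrap copies
hits <- 0
for (b in seq_len(B)) {
  xb <- x[sample(nrow(x), replace = TRUE), ]    # bootstrap the rows
  hc <- hclust(dist(t(xb)), method = "average") # cluster the columns
  # column groups obtained by cutting the tree into 2, ..., (p - 1) clusters
  groups <- unlist(lapply(2:(ncol(xb) - 1), function(k)
    split(colnames(xb), cutree(hc, k))), recursive = FALSE)
  if (any(vapply(groups, function(g) setequal(g, target), logical(1))))
    hits <- hits + 1
}
hits / B  # crude BP estimate for the candidate cluster</code></pre>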
</div>
<div id="required-packages" class="section level2">
<h2>Required packages</h2>
<ol style="list-style-type: decimal">
<li>Install <strong>pvclust</strong>:</li>
</ol>
<pre class="r"><code>install.packages("pvclust")</code></pre>
<ol start="2" style="list-style-type: decimal">
<li>Load <strong>pvclust</strong>:</li>
</ol>
<pre class="r"><code>library(pvclust)</code></pre>
</div>
<div id="data-preparation" class="section level2">
<h2>Data preparation</h2>
<p>We’ll use the <em>lung</em> data set [in the <em>pvclust</em> package]. It contains the expression profiles of 916 genes across 73 lung tissue samples, including 67 tumors. Columns are samples and rows are genes.</p>
<pre class="r"><code>library(pvclust)
# Load the data
data("lung")
head(lung[, 1:4])</code></pre>
<pre><code>##               fetal_lung 232-97_SCC 232-97_node 68-96_Adeno
## IMAGE:196992       -0.40       4.28        3.68       -1.35
## IMAGE:587847       -2.22       5.21        4.75       -0.91
## IMAGE:1049185      -1.35      -0.84       -2.88        3.35
## IMAGE:135221        0.68       0.56       -0.45       -0.20
## IMAGE:298560          NA       4.14        3.58       -0.40
## IMAGE:119882       -3.23      -2.84       -2.72       -0.83</code></pre>
<pre class="r"><code># Dimension of the data
dim(lung)</code></pre>
<pre><code>## [1] 916  73</code></pre>
<p>We’ll use only a subset of the data set for the clustering analysis. The R function <em>sample</em>() can be used to extract a random subset of 30 samples:</p>
<pre class="r"><code>set.seed(123)
ss <- sample(1:73, 30) # extract 30 of the 73 samples
df <- lung[, ss]</code></pre>
</div>
<div id="compute-p-value-for-hierarchical-clustering" class="section level2">
<h2>Compute p-value for hierarchical clustering</h2>
<div id="description-of-pvclust-function" class="section level3">
<h3>Description of pvclust() function</h3>
<p>The function <em>pvclust</em>() can be used as follows:</p>
<pre class="r"><code>pvclust(data, method.hclust = "average",
        method.dist = "correlation", nboot = 1000)</code></pre>
<p>Note that the computation time can be sharply reduced by using the parallel version, <em>parPvclust</em>(). (Read ?parPvclust for more information.)</p>
<pre class="r"><code>parPvclust(cl=NULL, data, method.hclust = "average",
           method.dist = "correlation", nboot = 1000,
           iseed = NULL)</code></pre>
<div class="block">
<ul>
<li>
<strong>data</strong>: numeric data matrix or data frame.
</li>
<li>
<strong>method.hclust</strong>: the agglomerative method used in hierarchical clustering. Possible values are one of “average”, “ward”, “single”, “complete”, “mcquitty”, “median” or “centroid”. The default is “average”. See method argument in <strong>?hclust</strong>.
</li>
<li>
<strong>method.dist</strong>: the distance measure to be used. Possible values are one of “correlation”, “uncentered”, “abscor” or those which are allowed for the <strong>method</strong> argument of the <strong>dist()</strong> function, such as “euclidean” and “manhattan”.
</li>
<li>
<strong>nboot</strong>: the number of bootstrap replications. The default is 1000.
</li>
<li>
<strong>iseed</strong>: an integer for the random seed. Use the iseed argument to obtain reproducible results.
</li>
</ul>
</div>
<p>The function <em>pvclust</em>() returns an object of class <em>pvclust</em> containing many elements, including <em>hclust</em>, which holds the hierarchical clustering result computed on the original data by the function <em>hclust</em>().</p>
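<p>As a quick sketch of this structure (assuming a fitted object named <em>res.pv</em>, created in the next section), the stored tree and per-cluster values can be inspected directly:</p>
<pre class="r"><code># res.pv is a fitted pvclust object (see the next section)
res.pv$hclust         # the hclust result computed on the original data
plot(res.pv$hclust)   # plain dendrogram, without p-values
res.pv$edges          # per-cluster AU/BP values and their standard errors</code></pre>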
</div>
<div id="usage-of-pvclust-function" class="section level3">
<h3>Usage of pvclust() function</h3>
<p><em>pvclust</em>() performs clustering on the columns of the data set, which correspond to samples in our case. If you want to perform the clustering on the variables (here, genes) you have to transpose the data set using the function <em>t</em>().</p>
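<p>For example, a hedged sketch of clustering the genes rather than the samples (not run here, since bootstrapping all 916 genes would be slow):</p>
<pre class="r"><code># Cluster the rows (genes) instead of the columns (samples):
# transpose so that genes become columns before calling pvclust()
res.genes <- pvclust(t(df), method.dist = "cor",
                     method.hclust = "average", nboot = 10)</code></pre>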
<p>The R code below computes <em>pvclust</em>() using 10 as the number of bootstrap replications (for speed):</p>
<pre class="r"><code>library(pvclust)
set.seed(123)
res.pv <- pvclust(df, method.dist="cor", 
                  method.hclust="average", nboot = 10)</code></pre>
<pre class="r"><code># Default plot
plot(res.pv, hang = -1, cex = 0.5)
pvrect(res.pv)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/018-p-value-for-hierarchical-clustering-pvclust-p-value-hierarchical-clustering-1.png" width="518.4" /></p>
<div class="success">
<p>
Values on the dendrogram are <em>AU p-values</em> (red, left), <em>BP values</em> (green, right), and <em>cluster labels</em> (grey, bottom). Clusters with AU ≥ 95% are indicated by the rectangles and are considered to be strongly supported by the data.
</p>
</div>
<p>To extract the objects from the significant clusters, use the function <em>pvpick</em>():</p>
<pre class="r"><code>clusters <- pvpick(res.pv)
clusters</code></pre>
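<p>The returned object is a list; a brief sketch of pulling out the members of the first significant cluster (assuming at least one cluster passes the threshold):</p>
<pre class="r"><code># pvpick() returns a list with components "clusters" and "edges"
length(clusters$clusters)   # number of significant clusters
clusters$clusters[[1]]      # sample names in the first significant cluster</code></pre>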
<p>Parallel computation can be applied as follows:</p>
<pre class="r"><code># Create a parallel socket cluster
library(parallel)
cl <- makeCluster(2, type = "PSOCK")
# parallel version of pvclust
res.pv <- parPvclust(cl, df, nboot=1000)
stopCluster(cl)</code></pre>
</div>
</div>
<div id="references" class="section level2 unnumbered">
<h2>References</h2>
<div id="refs" class="references">
<div id="ref-suzuki2015">
<p>Suzuki, Ryota, and Hidetoshi Shimodaira. 2015. <em>Pvclust: Hierarchical Clustering with P-Values via Multiscale Bootstrap Resampling</em>. <a href="https://CRAN.R-project.org/package=pvclust" class="uri">https://CRAN.R-project.org/package=pvclust</a>.</p>
</div>
</div>
</div>
</div><!--end rdoc-->

<!-- END HTML -->]]></description>
			<pubDate>Thu, 07 Sep 2017 11:52:00 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Choosing the Best Clustering Algorithms]]></title>
			<link>https://www.sthda.com/english/articles/29-cluster-validation-essentials/98-choosing-the-best-clustering-algorithms/</link>
			<guid>https://www.sthda.com/english/articles/29-cluster-validation-essentials/98-choosing-the-best-clustering-algorithms/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">
<p><strong>Choosing the best clustering method</strong> for a given data set can be a hard task for the analyst. This article describes the R package <strong>clValid</strong> <span class="citation">(Brock et al. 2008)</span>, which can be used to compare multiple clustering algorithms simultaneously, in a single function call, in order to identify the best clustering approach and the optimal number of clusters.</p>
<div class="block">
<p>
We’ll start by describing the different measures in the clValid package for comparing clustering algorithms. Next, we’ll present the function <em>clValid</em>(). Finally, we’ll provide R scripts for validating clustering results and comparing clustering algorithms.
</p>
</div>
<br/>
<p>Contents: </p>
<div id="TOC">
<ul>
<li><a href="#measures-for-comparing-clustering-algorithms">Measures for comparing clustering algorithms</a></li>
<li><a href="#compare-clustering-algorithms-in-r">Compare clustering algorithms in R</a></li>
<li><a href="#summary">Summary</a></li>
<li><a href="#references">References</a></li>
</ul>
</div>
<br/>
<p>Related Book:</p>
<div class = "small-block content-privileged-friends cluster-book">
    <center>
        <a href = "https://www.sthda.com/english/web/5-bookadvisor/17-practical-guide-to-cluster-analysis-in-r/">
          <img src = "https://www.sthda.com/english/sthda-upload/images/cluster-analysis/clustering-book-cover.png" /><br/>
      Practical Guide to Cluster Analysis in R
      </a>
      </center>
</div>
<div class="spacer"></div>
<div id="measures-for-comparing-clustering-algorithms" class="section level2">
<h2>Measures for comparing clustering algorithms</h2>
<p>The clValid package compares clustering algorithms using two cluster validation measures:</p>
<ol style="list-style-type: decimal">
<li><p><em>Internal measures</em>, which use intrinsic information in the data to assess the quality of the clustering. Internal measures include the connectivity, the silhouette coefficient and the Dunn index, as described in Chapter @ref(cluster-validation-statistics) (Cluster Validation Statistics).</p></li>
<li><p><em>Stability measures</em>, a special version of internal measures, which evaluate the consistency of a clustering result by comparing it with the clusters obtained after each column is removed, one at a time.</p></li>
</ol>
<p>Cluster stability measures include:</p>
<ul>
<li>The average proportion of non-overlap (APN)</li>
<li>The average distance (AD)</li>
<li>The average distance between means (ADM)</li>
<li>The figure of merit (FOM)</li>
</ul>
<p>The APN, AD, and ADM are all based on the cross-classification table of the original clustering on the full data with the clustering based on the removal of one column.</p>
<ul>
<li><p>The APN measures the average proportion of observations not placed in the same cluster by clustering based on the full data and clustering based on the data with a single column removed.</p></li>
<li><p>The AD measures the average distance between observations placed in the same cluster under both cases (full data set and removal of one column).</p></li>
<li><p>The ADM measures the average distance between cluster centers for observations placed in the same cluster under both cases.</p></li>
<li><p>The FOM measures the average intra-cluster variance of the deleted column, where the clustering is based on the remaining (undeleted) columns.</p></li>
</ul>
<div class="warning">
<p>
The values of APN, ADM and FOM range from 0 to 1, with smaller values corresponding to more consistent clustering results. AD takes values between 0 and infinity, and smaller values are also preferred.
</p>
</div>
<div class="notice">
<p>
Note that the clValid package also provides biological validation measures, which evaluate the ability of a clustering algorithm to produce biologically meaningful clusters. A typical application is microarray or RNA-seq data, where observations correspond to genes.
</p>
</div>
</div>
<div id="compare-clustering-algorithms-in-r" class="section level2">
<h2>Compare clustering algorithms in R</h2>
<p>We’ll use the function <strong>clValid</strong>() [in the <em>clValid</em> package], whose simplified format is as follows:</p>
<pre class="r"><code>clValid(obj, nClust, clMethods = "hierarchical", 
        validation = "stability", maxitems = 600,
        metric = "euclidean", method = "average")</code></pre>
<div class="block">
<ul>
<li>
<strong>obj</strong>: A numeric matrix or data frame. Rows are the items to be clustered and columns are samples.
</li>
<li>
<strong>nClust</strong>: A numeric vector specifying the numbers of clusters to be evaluated. e.g., 2:10
</li>
<li>
<strong>clMethods</strong>: The clustering method to be used. Available options are “hierarchical”, “kmeans”, “diana”, “fanny”, “som”, “model”, “sota”, “pam”, “clara”, and “agnes”, with multiple choices allowed.
</li>
<li>
<strong>validation</strong>: The type of validation measures to be used. Allowed values are “internal”, “stability”, and “biological”, with multiple choices allowed.
</li>
<li>
<strong>maxitems</strong>: The maximum number of items (rows in matrix) which can be clustered.
</li>
<li>
<strong>metric</strong>: The metric used to determine the distance matrix. Possible choices are “euclidean”, “correlation”, and “manhattan”.
</li>
<li>
<strong>method</strong>: For hierarchical clustering (hclust and agnes), the agglomeration method to be used. Available choices are “ward”, “single”, “complete” and “average”.
</li>
</ul>
</div>
<p>As an example, consider the iris data set; the <em>clValid</em>() function can be used as follows.</p>
<p>We start with internal cluster validation measures, which include the connectivity, the silhouette width and the Dunn index. These internal measures can be computed simultaneously for multiple clustering algorithms, in combination with a range of cluster numbers.</p>
<pre class="r"><code>library(clValid)
# Iris data set:
# - Remove Species column and scale
df <- scale(iris[, -5])
# Compute clValid
clmethods <- c("hierarchical","kmeans","pam")
intern <- clValid(df, nClust = 2:6, 
              clMethods = clmethods, validation = "internal")
# Summary
summary(intern)</code></pre>
<pre><code>##  Length   Class    Mode 
##       1 clValid      S4</code></pre>
<div class="success">
<p>
It can be seen that hierarchical clustering with two clusters performs the best in each case (i.e., for connectivity, Dunn and Silhouette measures). Regardless of the clustering algorithm, the optimal number of clusters seems to be two using the three measures.
</p>
</div>
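<p>The internal measures can also be inspected across the tested numbers of clusters; a brief sketch using the plot method for clValid objects (one panel per measure and clustering method):</p>
<pre class="r"><code># Plot connectivity, Dunn index and silhouette width
# against the number of clusters, for each clustering method
plot(intern)
# Display only the optimal scores
optimalScores(intern)</code></pre>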
<p>The stability measures can be computed as follows:</p>
<pre class="r"><code># Stability measures
clmethods <- c("hierarchical","kmeans","pam")
stab <- clValid(df, nClust = 2:6, clMethods = clmethods, 
                validation = "stability")
# Display only optimal Scores
optimalScores(stab)</code></pre>
<pre><code>##       Score       Method Clusters
## APN 0.00327 hierarchical        2
## AD  1.00429          pam        6
## ADM 0.01609 hierarchical        2
## FOM 0.45575          pam        6</code></pre>
<div class="success">
<p>
For the APN and ADM measures, hierarchical clustering with two clusters again gives the best score. For the other measures, PAM with six clusters has the best score.
</p>
</div>
</div>
<div id="summary" class="section level2">
<h2>Summary</h2>
<p>Here, we described how to compare clustering algorithms using the <em>clValid</em> R package.</p>
</div>
<div id="references" class="section level2 unnumbered">
<h2>References</h2>
<div id="refs" class="references">
<div id="ref-brock2008">
<p>Brock, Guy, Vasyl Pihur, Susmita Datta, and Somnath Datta. 2008. “ClValid: An R Package for Cluster Validation.” <em>Journal of Statistical Software</em> 25 (4): 1–22. <a href="https://www.jstatsoft.org/v025/i04" class="uri">https://www.jstatsoft.org/v025/i04</a>.</p>
</div>
</div>
</div>
</div><!--end rdoc-->

<!-- END HTML -->]]></description>
			<pubDate>Thu, 07 Sep 2017 11:33:00 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Cluster Validation Statistics: Must Know Methods]]></title>
			<link>https://www.sthda.com/english/articles/29-cluster-validation-essentials/97-cluster-validation-statistics-must-know-methods/</link>
			<guid>https://www.sthda.com/english/articles/29-cluster-validation-essentials/97-cluster-validation-statistics-must-know-methods/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">
<p>The term <strong>cluster validation</strong> is used to describe the procedure of evaluating the goodness of clustering algorithm results. This is important to avoid finding patterns in random data, as well as in situations where you want to compare two clustering algorithms.</p>
<p>Generally, clustering validation statistics can be categorized into 3 classes <span class="citation">(Charrad et al. 2014,<span class="citation">Brock et al. (2008)</span>, <span class="citation">Theodoridis and Koutroumbas (2008)</span>)</span>:</p>
<ol style="list-style-type: decimal">
<li><p><strong>Internal cluster validation</strong>, which uses the internal information of the clustering process to evaluate the goodness of a clustering structure without reference to external information. It can be also used for estimating the number of clusters and the appropriate clustering algorithm without any external data.</p></li>
<li><p><strong>External cluster validation</strong>, which consists in comparing the results of a cluster analysis to an externally known result, such as externally provided class labels. It measures the extent to which cluster labels match externally supplied class labels. Since we know the “true” cluster number in advance, this approach is mainly used for selecting the right clustering algorithm for a specific data set.</p></li>
<li><p><strong>Relative cluster validation</strong>, which evaluates the clustering structure by varying different parameter values for the same algorithm (e.g., varying the number of clusters k). It’s generally used for determining the optimal number of clusters.</p></li>
</ol>
<div class="block">
<p>
In this chapter, we start by describing the different methods for clustering validation. Next, we’ll demonstrate how to compare the quality of clustering results obtained with different clustering algorithms. Finally, we’ll provide R scripts for validating clustering results.
</p>
</div>
<div class="notice">
<p>
In all the examples presented here, we’ll apply k-means, PAM and hierarchical clustering. Note that, the functions used in this article can be applied to evaluate the validity of any other clustering methods.
</p>
</div>
<br/>
<p>Contents:</p>
<div id="TOC">
<ul>
<li><a href="#internal-measures-for-cluster-validation">Internal measures for cluster validation</a><ul>
<li><a href="#silhouette-coefficient">Silhouette coefficient</a></li>
<li><a href="#dunn-index">Dunn index</a></li>
</ul></li>
<li><a href="#external-measures-for-clustering-validation">External measures for clustering validation</a></li>
<li><a href="#computing-cluster-validation-statistics-in-r">Computing cluster validation statistics in R</a><ul>
<li><a href="#required-r-packages">Required R packages</a></li>
<li><a href="#data-preparation">Data preparation</a></li>
<li><a href="#clustering-analysis">Clustering analysis</a></li>
<li><a href="#cluster-validation">Cluster validation</a></li>
<li><a href="#external-clustering-validation">External clustering validation</a></li>
</ul></li>
<li><a href="#summary">Summary</a></li>
<li><a href="#references">References</a></li>
</ul>
</div>
<br/>
<p>Related Book:</p>
<div class = "small-block content-privileged-friends cluster-book">
    <center>
        <a href = "https://www.sthda.com/english/web/5-bookadvisor/17-practical-guide-to-cluster-analysis-in-r/">
          <img src = "https://www.sthda.com/english/sthda-upload/images/cluster-analysis/clustering-book-cover.png" /><br/>
      Practical Guide to Cluster Analysis in R
      </a>
      </center>
</div>
<div class="spacer"></div>
<div id="internal-measures-for-cluster-validation" class="section level2">
<h2>Internal measures for cluster validation</h2>
<p>In this section, we describe the most widely used clustering validation indices. Recall that the goal of partitioning clustering algorithms (Part @ref(partitioning-clustering)) is to split the data set into clusters of objects, such that:</p>
<ul>
<li>the objects in the same cluster are as similar as possible,</li>
<li>and the objects in different clusters are highly distinct</li>
</ul>
<div class="success">
<p>
That is, we want the average distance within cluster to be as small as possible; and the average distance between clusters to be as large as possible.
</p>
</div>
<p>Internal validation measures often reflect the <strong>compactness</strong>, the <strong>connectedness</strong> and the <strong>separation</strong> of the cluster partitions.</p>
<div class="block">
<ol style="list-style-type: decimal">
<li>
<p>
<strong>Compactness</strong> or cluster cohesion: Measures how close the objects within the same cluster are. A lower <strong>within-cluster variation</strong> is an indicator of good compactness (i.e., a good clustering). The different indices for evaluating the compactness of clusters are based on distance measures, such as the cluster-wise average/median distances between observations.
</p>
</li>
<li>
<strong>Separation</strong>: Measures how well-separated a cluster is from other clusters. The indices used as separation measures include:
<ul>
<li>
distances between cluster centers
</li>
<li>
the pairwise minimum distances between objects in different clusters
</li>
</ul>
</li>
<li>
<p>
<strong>Connectivity</strong>: measures the extent to which items are placed in the same cluster as their nearest neighbors in the data space. The connectivity takes values between 0 and infinity and should be minimized.
</p>
</li>
</ol>
</div>
<p>Generally, most of the indices used for internal clustering validation combine compactness and separation measures as follows:</p>
<p><span class="math display">\[
Index = \frac{(\alpha \times Separation)}{(\beta \times Compactness)}
\]</span></p>
<p>Where <span class="math inline">\(\alpha\)</span> and <span class="math inline">\(\beta\)</span> are weights.</p>
<div class="success">
<p>
In this section, we’ll describe the two commonly used indices for assessing the goodness of clustering: the <strong>silhouette width</strong> and the <strong>Dunn index</strong>. These internal measures can also be used to determine the optimal number of clusters in the data.
</p>
</div>
<div id="silhouette-coefficient" class="section level3">
<h3>Silhouette coefficient</h3>
<p>The silhouette analysis measures how well an observation is clustered and it estimates the <strong>average distance between clusters</strong>. The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters.</p>
<p>For each observation <span class="math inline">\(i\)</span>, the silhouette width <span class="math inline">\(s_i\)</span> is calculated as follows:</p>
<ol style="list-style-type: decimal">
<li><p>For each observation <span class="math inline">\(i\)</span>, calculate the average dissimilarity <span class="math inline">\(a_i\)</span> between <span class="math inline">\(i\)</span> and all other points of the cluster to which i belongs.</p></li>
<li><p>For all other clusters <span class="math inline">\(C\)</span>, to which i does not belong, calculate the average dissimilarity <span class="math inline">\(d(i, C)\)</span> of <span class="math inline">\(i\)</span> to all observations of C. The smallest of these <span class="math inline">\(d(i,C)\)</span> is defined as <span class="math inline">\(b_i= \min_C d(i,C)\)</span>. The value of <span class="math inline">\(b_i\)</span> can be seen as the dissimilarity between <span class="math inline">\(i\)</span> and its “neighbor” cluster, i.e., the nearest one to which it does not belong.</p></li>
<li><p>Finally the silhouette width of the observation <span class="math inline">\(i\)</span> is defined by the formula: <span class="math inline">\(S_i = (b_i - a_i)/max(a_i, b_i)\)</span>.</p></li>
</ol>
<p>Silhouette width can be interpreted as follows:</p>
<div class="block">
<ul>
<li>
<p>
Observations with a large <span class="math inline"><em>S</em><sub><em>i</em></sub></span> (almost 1) are very well clustered.
</p>
</li>
<li>
<p>
A small <span class="math inline"><em>S</em><sub><em>i</em></sub></span> (around 0) means that the observation lies between two clusters.
</p>
</li>
<li>
<p>
Observations with a negative <span class="math inline"><em>S</em><sub><em>i</em></sub></span> are probably placed in the wrong cluster.
</p>
</li>
</ul>
</div>
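<p>As a small illustration of these definitions, silhouette widths can be computed directly with the <em>cluster</em> package; a sketch on the scaled iris data used later in this chapter:</p>
<pre class="r"><code>library(cluster)
set.seed(123)
# k-means with 3 clusters on the scaled iris data (Species column removed)
df <- scale(iris[, -5])
km <- kmeans(df, centers = 3, nstart = 25)
# One row per observation: cluster, neighbor cluster and silhouette width
sil <- silhouette(km$cluster, dist(df))
head(sil[, 1:3])
mean(sil[, "sil_width"])  # average silhouette width</code></pre>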
</div>
<div id="dunn-index" class="section level3">
<h3>Dunn index</h3>
<p>The <strong>Dunn index</strong> is another internal clustering validation measure, which can be computed as follows:</p>
<ol style="list-style-type: decimal">
<li>For each cluster, compute the distance between each of the objects in the cluster and the objects in the other clusters</li>
<li><p>Use the minimum of this pairwise distance as the inter-cluster separation (<em>min.separation</em>)</p></li>
<li>For each cluster, compute the distance between the objects in the same cluster.</li>
<li><p>Use the maximal intra-cluster distance (i.e., the maximum diameter) as the intra-cluster compactness</p></li>
<li><p>Calculate the <em>Dunn index</em> (D) as follows:</p></li>
</ol>
<p><span class="math display">\[
D = \frac{min.separation}{max.diameter}
\]</span></p>
<div class="success">
<p>
If the data set contains compact and well-separated clusters, the diameter of the clusters is expected to be small and the distance between the clusters is expected to be large. Thus, Dunn index should be maximized.
</p>
</div>
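<p>A short hedged sketch of computing the Dunn index directly, using the <em>dunn</em>() function from the <em>clValid</em> package on a k-means partition of the scaled iris data:</p>
<pre class="r"><code>library(clValid)
set.seed(123)
df <- scale(iris[, -5])
km <- kmeans(df, centers = 3, nstart = 25)
# dunn() takes a distance matrix and an integer vector of cluster memberships
dunn(as.matrix(dist(df)), km$cluster)</code></pre>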
</div>
</div>
<div id="external-measures-for-clustering-validation" class="section level2">
<h2>External measures for clustering validation</h2>
<p>The aim is to compare the identified clusters (by k-means, pam or hierarchical clustering) to an external reference.</p>
<p>It’s possible to quantify the agreement between the partition and the external reference using either the corrected <em>Rand index</em> or <em>Meila’s variation of information (VI) index</em>, both of which are implemented in the R function <em>cluster.stats</em>() [<em>fpc</em> package].</p>
<p>The corrected <em>Rand index</em> varies from -1 (no agreement) to 1 (perfect agreement).</p>
<div class="success">
<p>
External clustering validation can be used to select a suitable clustering algorithm for a given data set.
</p>
</div>
</div>
<div id="computing-cluster-validation-statistics-in-r" class="section level2">
<h2>Computing cluster validation statistics in R</h2>
<div id="required-r-packages" class="section level3">
<h3>Required R packages</h3>
<p>The following R packages are required in this chapter:</p>
<ul>
<li><em>factoextra</em> for data visualization</li>
<li><em>fpc</em> for computing clustering validation statistics</li>
<li><p><em>NbClust</em> for determining the optimal number of clusters in the data set.</p></li>
<li><p>Install the packages:</p></li>
</ul>
<pre class="r"><code>install.packages(c("factoextra", "fpc", "NbClust"))</code></pre>
<ul>
<li>Load the packages:</li>
</ul>
<pre class="r"><code>library(factoextra)
library(fpc)
library(NbClust)</code></pre>
</div>
<div id="data-preparation" class="section level3">
<h3>Data preparation</h3>
<p>We’ll use the built-in R data set iris:</p>
<pre class="r"><code># Excluding the column "Species" at position 5
df <- iris[, -5]
# Standardize
df <- scale(df)</code></pre>
</div>
<div id="clustering-analysis" class="section level3">
<h3>Clustering analysis</h3>
<p>We’ll use the function <em>eclust</em>() [enhanced clustering, in <em>factoextra</em>] which provides several advantages:</p>
<ul>
<li>It simplifies the workflow of clustering analysis</li>
<li>It can be used to compute hierarchical clustering and partitioning clustering in a single line function call</li>
<li>Compared to the standard partitioning functions (kmeans, pam, clara and fanny), which require the user to specify the number of clusters, the function <em>eclust</em>() automatically computes the gap statistic for estimating the right number of clusters.</li>
<li>It provides silhouette information for all partitioning methods and hierarchical clustering</li>
<li>It draws beautiful graphs using ggplot2</li>
</ul>
<p>The simplified format of the <em>eclust</em>() function is as follows:</p>
<pre class="r"><code>eclust(x, FUNcluster = "kmeans", hc_metric = "euclidean", ...)</code></pre>
<div class="block">
<ul>
<li>
<strong>x</strong>: numeric vector, data matrix or data frame
</li>
<li>
<strong>FUNcluster</strong>: a clustering function including “kmeans”, “pam”, “clara”, “fanny”, “hclust”, “agnes” and “diana”. Abbreviation is allowed.
</li>
<li>
<strong>hc_metric</strong>: character string specifying the metric to be used for calculating dissimilarities between observations. Allowed values are those accepted by the function <em>dist</em>() [including “euclidean”, “manhattan”, “maximum”, “canberra”, “binary”, “minkowski”] and correlation based distance measures [“pearson”, “spearman” or “kendall”]. Used only when FUNcluster is a hierarchical clustering function such as one of “hclust”, “agnes” or “diana”.
</li>
<li>
<strong>…</strong>: other arguments to be passed to FUNcluster.
</li>
</ul>
</div>
<p>The function <strong>eclust()</strong> returns an object of class <strong>eclust</strong> containing the result of the standard function used (e.g., kmeans, pam, hclust, agnes, diana, etc.).</p>
<p>It includes also:</p>
<ul>
<li><strong>cluster</strong>: the cluster assignment of observations after cutting the tree</li>
<li><strong>nbclust</strong>: the number of clusters</li>
<li><strong>silinfo</strong>: the silhouette information of observations</li>
<li><strong>size</strong>: the size of clusters</li>
<li><strong>data</strong>: a matrix containing the original or the standardized data (if stand = TRUE)</li>
<li><strong>gap_stat</strong>: containing gap statistics</li>
</ul>
<p>To compute a partitioning clustering, such as k-means clustering with k = 3, type this:</p>
<pre class="r"><code># K-means clustering
km.res <- eclust(df, "kmeans", k = 3, nstart = 25, graph = FALSE)
# Visualize k-means clusters
fviz_cluster(km.res, geom = "point", ellipse.type = "norm",
             palette = "jco", ggtheme = theme_minimal())</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/016-cluster-validation-statistics-k-means-clustering-1.png" width="384" /></p>
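<p>The components listed above can be pulled straight out of the result; a brief sketch:</p>
<pre class="r"><code># A few of the components returned by eclust()
km.res$nbclust         # number of clusters
km.res$size            # size of each cluster
head(km.res$cluster)   # cluster assignment of the first observations</code></pre>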
<p>To compute a hierarchical clustering, use this:</p>
<pre class="r"><code># Hierarchical clustering
hc.res <- eclust(df, "hclust", k = 3, hc_metric = "euclidean", 
                 hc_method = "ward.D2", graph = FALSE)
# Visualize dendrograms
fviz_dend(hc.res, show_labels = FALSE,
         palette = "jco", as.ggplot = TRUE)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/016-cluster-validation-statistics-hierarchical-clustering-1.png" width="518.4" /></p>
</div>
<div id="cluster-validation" class="section level3">
<h3>Cluster validation</h3>
<div id="silhouette-plot" class="section level4">
<h4>Silhouette plot</h4>
<p>Recall that the silhouette coefficient (<span class="math inline">\(S_i\)</span>) measures how similar an object <span class="math inline">\(i\)</span> is to the other objects in its own cluster versus those in the neighbor cluster. <span class="math inline">\(S_i\)</span> values range from 1 to -1:</p>
<ul>
<li>A value of <span class="math inline">\(S_i\)</span> close to 1 indicates that the object is well clustered. In other words, the object <span class="math inline">\(i\)</span> is similar to the other objects in its group.</li>
<li>A value of <span class="math inline">\(S_i\)</span> close to -1 indicates that the object is poorly clustered, and that assignment to some other cluster would probably improve the overall results.</li>
</ul>
<p>It’s possible to draw silhouette coefficients of observations using the function <em>fviz_silhouette</em>() [<em>factoextra</em> package], which will also print a summary of the silhouette analysis output. To avoid this, you can use the option <em>print.summary = FALSE</em>.</p>
<pre class="r"><code>fviz_silhouette(km.res, palette = "jco", 
                ggtheme = theme_classic())</code></pre>
<pre><code>##   cluster size ave.sil.width
## 1       1   50          0.64
## 2       2   47          0.35
## 3       3   53          0.39</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/016-cluster-validation-statistics-silhouette-coefficient-1.png" width="518.4" /></p>
<p>Silhouette information can be extracted as follow:</p>
<pre class="r"><code># Silhouette information
silinfo <- km.res$silinfo
names(silinfo)
# Silhouette widths of each observation
head(silinfo$widths[, 1:3], 10)
# Average silhouette width of each cluster
silinfo$clus.avg.widths
# The total average (mean of all individual silhouette widths)
silinfo$avg.width
# The size of each clusters
km.res$size</code></pre>
<p>It can be seen that several samples in cluster 2 have a negative silhouette coefficient. This means that they are not in the right cluster. We can find the names of these samples and determine the clusters they are closer to (their neighbor clusters), as follows:</p>
<pre class="r"><code># Silhouette width of observation
sil <- km.res$silinfo$widths[, 1:3]
# Objects with negative silhouette
neg_sil_index <- which(sil[, 'sil_width'] < 0)
sil[neg_sil_index, , drop = FALSE]</code></pre>
<pre><code>##     cluster neighbor sil_width
## 112       2        3   -0.0106
## 128       2        3   -0.0249</code></pre>
</div>
<div id="computing-dunn-index-and-other-cluster-validation-statistics" class="section level4">
<h4>Computing Dunn index and other cluster validation statistics</h4>
<p>The function <em>cluster.stats</em>() [<em>fpc</em> package] and the function <em>NbClust()</em> [in <em>NbClust</em> package] can be used to compute <em>Dunn index</em> and many other indices.</p>
<p>The simplified format is:</p>
<pre class="r"><code>cluster.stats(d = NULL, clustering, al.clustering = NULL)</code></pre>
<div class="block">
<ul>
<li>
<strong>d</strong>: a distance object between cases as generated by the <strong>dist()</strong> function
</li>
<li>
<strong>clustering</strong>: vector containing the cluster number of each observation
</li>
<li>
<strong>alt.clustering</strong>: vector such as for clustering, indicating an alternative clustering
</li>
</ul>
</div>
<p>The function <em>cluster.stats</em>() returns a list containing many components useful for analyzing the intrinsic characteristics of a clustering:</p>
<ul>
<li><strong>cluster.number</strong>: number of clusters</li>
<li><strong>cluster.size</strong>: vector containing the number of points in each cluster</li>
<li><strong>average.distance</strong>, <strong>median.distance</strong>: vector containing the cluster-wise within average/median distances</li>
<li><strong>average.between</strong>: average distance between clusters. We want it to be as large as possible</li>
<li><strong>average.within</strong>: average distance within clusters. We want it to be as small as possible</li>
<li><strong>clus.avg.silwidths</strong>: vector of cluster average silhouette widths. Recall that the <strong>silhouette width</strong> is also an estimate of the average distance between clusters. Its value lies between -1 and 1, with a value of 1 indicating a very good cluster.</li>
<li><strong>within.cluster.ss</strong>: a generalization of the within clusters sum of squares (k-means objective function), which is obtained if d is a Euclidean distance matrix.</li>
<li><strong>dunn, dunn2</strong>: Dunn index</li>
<li><strong>corrected.rand, vi</strong>: two indices to assess the similarity of two clusterings: the corrected Rand index and Meila’s VI</li>
</ul>
<p>All the above elements can be used to evaluate the internal quality of clustering.</p>
<p>In the following sections, we’ll compute the clustering quality statistics for k-means. Look at the <strong>within.cluster.ss</strong> (within clusters sum of squares), the <strong>average.within</strong> (average distance within clusters) and <strong>clus.avg.silwidths</strong> (vector of cluster average silhouette widths).</p>
<pre class="r"><code>library(fpc)
# Statistics for k-means clustering
km_stats <- cluster.stats(dist(df),  km.res$cluster)
# Dunn index
km_stats$dunn</code></pre>
<pre><code>## [1] 0.0265</code></pre>
<p>To display all statistics, type this:</p>
<pre class="r"><code>km_stats</code></pre>
<div class="notice">
<p>
Read the documentation of <em>cluster.stats</em>() for details about all the available indices.
</p>
</div>
</div>
</div>
<div id="external-clustering-validation" class="section level3">
<h3>External clustering validation</h3>
<p>Among the values returned by the function <strong>cluster.stats</strong>(), there are two indices for assessing the similarity of two clusterings, namely the corrected Rand index and Meila’s VI.</p>
<p>We know that the iris data contains exactly 3 groups of species.</p>
<p><span class="question">Does the K-means clustering matches with the true structure of the data?</span></p>
<p>We can use the function <strong>cluster.stats</strong>() to answer this question.</p>
<p>Let’s start by computing a cross-tabulation between the k-means clusters and the reference Species labels:</p>
<pre class="r"><code>table(iris$Species, km.res$cluster)</code></pre>
<pre><code>##             
##               1  2  3
##   setosa     50  0  0
##   versicolor  0 11 39
##   virginica   0 36 14</code></pre>
<p>It can be seen that:</p>
<ul>
<li>All setosa observations (n = 50) have been classified in cluster 1</li>
<li>A large number of versicolor observations (n = 39) have been classified in cluster 3; some of them (n = 11) have been classified in cluster 2.</li>
<li>A large number of virginica observations (n = 36) have been classified in cluster 2; some of them (n = 14) have been classified in cluster 3.</li>
</ul>
<p>It’s possible to quantify the agreement between Species and the k-means clusters using either the corrected Rand index or Meila’s VI, as follows:</p>
<pre class="r"><code>library("fpc")
# Compute cluster stats
species <- as.numeric(iris$Species)
clust_stats <- cluster.stats(d = dist(df), 
                             species, km.res$cluster)
# Corrected Rand index
clust_stats$corrected.rand</code></pre>
<pre><code>## [1] 0.62</code></pre>
<pre class="r"><code># VI
clust_stats$vi</code></pre>
<pre><code>## [1] 0.748</code></pre>
<div class="success">
<p>
The corrected <strong>Rand index</strong> provides a measure for assessing the similarity between two partitions, adjusted for chance. Its range is -1 (no agreement) to 1 (perfect agreement). Agreement between the species labels and the cluster solution is 0.62 using the corrected <strong>Rand index</strong> and 0.748 using Meila’s VI.
</p>
</div>
<p>The same analysis can be computed for both PAM and hierarchical clustering:</p>
<pre class="r"><code># Agreement between species and pam clusters
pam.res <- eclust(df, "pam", k = 3, graph = FALSE)
table(iris$Species, pam.res$cluster)
cluster.stats(d = dist(df), 
              species, pam.res$cluster)$vi
# Agreement between species and HC clusters
res.hc <- eclust(df, "hclust", k = 3, graph = FALSE)
table(iris$Species, res.hc$cluster)
cluster.stats(d = dist(df), 
              species, res.hc$cluster)$vi</code></pre>
<div class="success">
<p>
External clustering validation, can be used to select suitable clustering algorithm for a given data set.
</p>
</div>
</div>
</div>
<div id="summary" class="section level2">
<h2>Summary</h2>
<p>We described how to validate clustering results using the silhouette method and the Dunn index. This task is facilitated by the combination of two R functions: <em>eclust</em>() and <em>fviz_silhouette</em>() [in the factoextra package]. We also demonstrated how to assess the agreement between a clustering result and an external reference.
In the next chapters, we’ll show how to i) choose the appropriate clustering algorithm for your data; and ii) compute p-values for hierarchical clustering.</p>
</div>
<div id="references" class="section level2 unnumbered">
<h2>References</h2>
<div id="refs" class="references">
<div id="ref-brock2008">
<p>Brock, Guy, Vasyl Pihur, Susmita Datta, and Somnath Datta. 2008. “ClValid: An R Package for Cluster Validation.” <em>Journal of Statistical Software</em> 25 (4): 1–22. <a href="https://www.jstatsoft.org/v025/i04" class="uri">https://www.jstatsoft.org/v025/i04</a>.</p>
</div>
<div id="ref-charrad2014">
<p>Charrad, Malika, Nadia Ghazzali, Véronique Boiteau, and Azam Niknafs. 2014. “NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set.” <em>Journal of Statistical Software</em> 61: 1–36. <a href="http://www.jstatsoft.org/v61/i06/paper" class="uri">http://www.jstatsoft.org/v61/i06/paper</a>.</p>
</div>
<div id="ref-theodoridis2008">
<p>Theodoridis, Sergios, and Konstantinos Koutroumbas. 2008. <em>Pattern Recognition</em>. 2nd ed. Academic Press.</p>
</div>
</div>
</div>
</div><!--end rdoc-->
 
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
  (function () {
    var script = document.createElement("script");
    script.type = "text/javascript";
    script.src  = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
    document.getElementsByTagName("head")[0].appendChild(script);
  })();
</script>

<!-- END HTML -->]]></description>
			<pubDate>Thu, 07 Sep 2017 11:07:00 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Determining The Optimal Number Of Clusters: 3 Must Know Methods]]></title>
			<link>https://www.sthda.com/english/articles/29-cluster-validation-essentials/96-determining-the-optimal-number-of-clusters-3-must-know-methods/</link>
			<guid>https://www.sthda.com/english/articles/29-cluster-validation-essentials/96-determining-the-optimal-number-of-clusters-3-must-know-methods/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">
<p>Determining the <strong>optimal number of clusters</strong> in a data set is a fundamental issue in partitioning clustering, such as k-means clustering (Chapter @ref(kmeans-clustering)), which requires the user to specify the number of clusters k to be generated.</p>
<p>Unfortunately, there is no definitive answer to this question. The optimal number of clusters is somewhat subjective and depends on the method used for measuring similarities and the parameters used for partitioning.
A simple and popular solution consists of inspecting the dendrogram produced using hierarchical clustering (Chapter @ref(agglomerative-clustering)) to see if it suggests a particular number of clusters. Unfortunately, this approach is also subjective.</p>
<div class="block">
<p>
In this chapter, we’ll describe different methods for determining the optimal number of clusters for k-means, k-medoids (PAM) and hierarchical clustering.
</p>
</div>
<p>These methods include direct methods and statistical testing methods:</p>
<ol style="list-style-type: decimal">
<li><p>Direct methods: consist of optimizing a criterion, such as the within-cluster sums of squares or the average silhouette. The corresponding methods are named the <em>elbow</em> and <em>silhouette</em> methods, respectively.</p></li>
<li><p>Statistical testing methods: consist of comparing evidence against a null hypothesis. An example is the <em>gap statistic</em>.</p></li>
</ol>
<p>In addition to the <em>elbow</em>, <em>silhouette</em> and <em>gap statistic</em> methods, there are more than thirty other indices and methods that have been published for identifying the optimal number of clusters. We’ll provide R code for computing all of these indices in order to decide the best number of clusters using the “majority rule”.</p>
<p>For each of these methods:</p>
<ul>
<li>We’ll describe the basic idea and the algorithm</li>
<li>We’ll provide easy-to-use R code with many examples for determining the optimal number of clusters and visualizing the output.</li>
</ul>
<br/>
<p>Contents:</p>
<div id="TOC">
<ul>
<li><a href="#elbow-method">Elbow method</a></li>
<li><a href="#average-silhouette-method">Average silhouette method</a></li>
<li><a href="#gap-statistic-method">Gap statistic method</a></li>
<li><a href="#computing-the-number-of-clusters-using-r">Computing the number of clusters using R</a><ul>
<li><a href="#required-r-packages">Required R packages</a></li>
<li><a href="#data-preparation">Data preparation</a></li>
<li><a href="#fviz_nbclust-function-elbow-silhouhette-and-gap-statistic-methods">fviz_nbclust() function: Elbow, Silhouhette and Gap statistic methods</a></li>
<li><a href="#nbclust-function-30-indices-for-choosing-the-best-number-of-clusters">NbClust() function: 30 indices for choosing the best number of clusters</a></li>
</ul></li>
<li><a href="#summary">Summary</a></li>
<li><a href="#references">References</a></li>
</ul>
</div>
<br/>
<p>Related Book:</p>
<div class = "small-block content-privileged-friends cluster-book">
    <center>
        <a href = "https://www.sthda.com/english/web/5-bookadvisor/17-practical-guide-to-cluster-analysis-in-r/">
          <img src = "https://www.sthda.com/english/sthda-upload/images/cluster-analysis/clustering-book-cover.png" /><br/>
      Practical Guide to Cluster Analysis in R
      </a>
      </center>
</div>
<div class="spacer"></div>
<div id="elbow-method" class="section level2">
<h2>Elbow method</h2>
<p>Recall that, the basic idea behind partitioning methods, such as k-means clustering (Chapter @ref(kmeans-clustering)), is to define clusters such that the total intra-cluster variation [or total within-cluster sum of square (WSS)] is minimized. The total WSS measures the compactness of the clustering and we want it to be as small as possible.</p>
<p>The Elbow method looks at the total WSS as a function of the number of clusters: one should choose a number of clusters such that adding another cluster doesn’t substantially improve the total WSS.</p>
<p>The optimal number of clusters can be defined as follows (a bare-bones R sketch is given after the list):</p>
<div class="block">
<ol style="list-style-type: decimal">
<li>
<p>
Compute clustering algorithm (e.g., k-means clustering) for different values of k. For instance, by varying k from 1 to 10 clusters.
</p>
</li>
<li>
<p>
For each k, calculate the total within-cluster sum of square (wss).
</p>
</li>
<li>
<p>
Plot the curve of wss according to the number of clusters k.
</p>
</li>
<li>
<p>
The location of a bend (knee) in the plot is generally considered as an indicator of the appropriate number of clusters.
</p>
</li>
</ol>
</div>
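<p>A bare-bones version of these steps, before using the dedicated functions described later in this chapter (a sketch on the scaled USArrests data used below):</p>
<pre class="r"><code>df <- scale(USArrests)
set.seed(123)
# Total within-cluster sum of squares (WSS) for k = 1 to 10
wss <- sapply(1:10, function(k) {
  kmeans(df, centers = k, nstart = 25)$tot.withinss
})
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")</code></pre>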
<div class="success">
<p>
Note that the elbow method is sometimes ambiguous. An alternative is the average silhouette method (Kaufman and Rousseeuw [1990]), which can also be used with any clustering approach.
</p>
</div>
</div>
<div id="average-silhouette-method" class="section level2">
<h2>Average silhouette method</h2>
<p>The average silhouette approach is described comprehensively in the chapter on cluster validation statistics (Chapter @ref(cluster-validation-statistics)). Briefly, it measures the quality of a clustering; that is, it determines how well each object lies within its cluster. A high average silhouette width indicates a good clustering.</p>
<p>The average silhouette method computes the average silhouette of observations for different values of k. The optimal number of clusters k is the one that maximizes the average silhouette over a range of possible values for k <span class="citation">(Kaufman and Rousseeuw 1990)</span>.</p>
<p>The algorithm is similar to the elbow method and can be computed as follows (a sketch is given after the list):</p>
<div class="block">
<ol style="list-style-type: decimal">
<li>
<p>
Compute clustering algorithm (e.g., k-means clustering) for different values of k. For instance, by varying k from 1 to 10 clusters.
</p>
</li>
<li>
<p>
For each k, calculate the average silhouette of observations (<em>avg.sil</em>).
</p>
</li>
<li>
<p>
Plot the curve of <em>avg.sil</em> according to the number of clusters k.
</p>
</li>
<li>
<p>
The location of the maximum is considered as the appropriate number of clusters.
</p>
</li>
</ol>
</div>
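<p>A corresponding sketch, computing the average silhouette width with the <em>cluster</em> package for k = 2 to 10 (the silhouette is undefined for k = 1):</p>
<pre class="r"><code>library(cluster)
df <- scale(USArrests)
set.seed(123)
# Average silhouette width for k = 2 to 10
avg.sil <- sapply(2:10, function(k) {
  km <- kmeans(df, centers = k, nstart = 25)
  mean(silhouette(km$cluster, dist(df))[, "sil_width"])
})
plot(2:10, avg.sil, type = "b",
     xlab = "Number of clusters k", ylab = "Average silhouette width")</code></pre>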
</div>
<div id="gap-statistic-method" class="section level2">
<h2>Gap statistic method</h2>
<p>The <em>gap statistic</em> was published by <a href="http://web.stanford.edu/~hastie/Papers/gap.pdf">R. Tibshirani, G. Walther, and T. Hastie (Stanford University, 2001)</a>. The approach can be applied to any clustering method.</p>
<p>The gap statistic compares the total within-cluster variation for different values of k with their expected values under a null reference distribution of the data. The estimate of the optimal number of clusters will be the value that maximizes the gap statistic (i.e., that yields the largest gap statistic). This means that the clustering structure is far away from a random uniform distribution of points.</p>
<p>The algorithm works as follows (a sketch using the clusGap() function is given after the list):</p>
<div class="block">
<ol style="list-style-type: decimal">
<li>
<p>
Cluster the observed data, varying the number of clusters from k = 1, …, <span class="math inline">\(k_{max}\)</span>, and compute the corresponding total within-cluster variation <span class="math inline">\(W_k\)</span>.
</p>
</li>
<li>
<p>
Generate B reference data sets with a random uniform distribution. Cluster each of these reference data sets with a varying number of clusters k = 1, …, <span class="math inline">\(k_{max}\)</span>, and compute the corresponding total within-cluster variation <span class="math inline">\(W_{kb}\)</span>.
</p>
</li>
<li>
<p>
Compute the estimated gap statistic as the deviation of the observed <span class="math inline">\(W_k\)</span> value from its expected value under the null hypothesis: <span class="math inline">\(Gap(k) = \frac{1}{B} \sum\limits_{b=1}^B log(W_{kb}^*) - log(W_k)\)</span>. Compute also the standard deviation of the statistic.
</p>
</li>
<li>
<p>
Choose the number of clusters as the smallest value of k such that the gap statistic is within one standard deviation of the gap at k+1: <span class="math inline">\(Gap(k) \geq Gap(k+1) - s_{k+1}\)</span>.
</p>
</li>
</ol>
</div>
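<p>A short sketch of these steps using the <em>clusGap</em>() function [in the <em>cluster</em> package], on the scaled USArrests data used below:</p>
<pre class="r"><code>library(cluster)
df <- scale(USArrests)
set.seed(123)
# B = 50 reference sets to keep this quick; larger values (e.g., 500) are more precise
gap_stat <- clusGap(df, FUN = kmeans, nstart = 25, K.max = 10, B = 50)
# Suggested k according to the rule described in step 4 above
print(gap_stat, method = "Tibs2001SEmax")</code></pre>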
<div class="warning">
<p>
Note that using B = 500 gives quite precise results, so that the gap plot is basically unchanged after another run.
</p>
</div>
</div>
<div id="computing-the-number-of-clusters-using-r" class="section level2">
<h2>Computing the number of clusters using R</h2>
<p>In this section, we’ll describe two functions for determining the optimal number of clusters:</p>
<ol style="list-style-type: decimal">
<li><p><em>fviz_nbclust</em>() function [in <em>factoextra</em> R package]: It can be used to compute the three different methods [elbow, silhouette and gap statistic] for any partitioning clustering method [K-means, K-medoids (PAM), CLARA, HCUT]. Note that the <em>hcut</em>() function is available only in the factoextra package. It computes hierarchical clustering and cuts the tree into k pre-specified clusters.</p></li>
<li><p><em>NbClust</em>() function [in <em>NbClust</em> R package] <span class="citation">(Charrad et al. 2014)</span>: It provides 30 indices for determining the relevant number of clusters and proposes to users the best clustering scheme from the different results obtained by varying all combinations of number of clusters, distance measures, and clustering methods. It can simultaneously compute all the indices and determine the number of clusters in a single function call.</p></li>
</ol>
<div id="required-r-packages" class="section level3">
<h3>Required R packages</h3>
<p>We’ll use the following R packages:</p>
<ul>
<li><em>factoextra</em> to determine the optimal number of clusters for a given clustering method and for data visualization.</li>
<li><em>NbClust</em> for computing about 30 methods at once, in order to find the optimal number of clusters.</li>
</ul>
<p>To install the packages, type this:</p>
<pre class="r"><code>pkgs <- c("factoextra",  "NbClust")
install.packages(pkgs)</code></pre>
<p>Load the packages as follow:</p>
<pre class="r"><code>library(factoextra)
library(NbClust)</code></pre>
</div>
<div id="data-preparation" class="section level3">
<h3>Data preparation</h3>
<p>We’ll use the USArrests data as a demo data set. We start by standardizing the data to make variables comparable.</p>
<pre class="r"><code># Standardize the data
df <- scale(USArrests)
head(df)</code></pre>
<pre><code>##            Murder Assault UrbanPop     Rape
## Alabama    1.2426   0.783   -0.521 -0.00342
## Alaska     0.5079   1.107   -1.212  2.48420
## Arizona    0.0716   1.479    0.999  1.04288
## Arkansas   0.2323   0.231   -1.074 -0.18492
## California 0.2783   1.263    1.759  2.06782
## Colorado   0.0257   0.399    0.861  1.86497</code></pre>
</div>
<div id="fviz_nbclust-function-elbow-silhouhette-and-gap-statistic-methods" class="section level3">
<h3>fviz_nbclust() function: Elbow, Silhouhette and Gap statistic methods</h3>
<p>The simplified format is as follow:</p>
<pre class="r"><code>fviz_nbclust(x, FUNcluster, method = c("silhouette", "wss", "gap_stat"))</code></pre>
<div class="block">
<ul>
<li>
<strong>x</strong>: numeric matrix or data frame
</li>
<li>
<strong>FUNcluster</strong>: a partitioning function. Allowed values include kmeans, pam, clara and hcut (for hierarchical clustering).
</li>
<li>
<strong>method</strong>: the method to be used for determining the optimal number of clusters.
</li>
</ul>
</div>
<p>The R code below determines the optimal number of clusters for k-means clustering:</p>
<pre class="r"><code># Elbow method
fviz_nbclust(df, kmeans, method = "wss") +
    geom_vline(xintercept = 4, linetype = 2)+
  labs(subtitle = "Elbow method")
# Silhouette method
fviz_nbclust(df, kmeans, method = "silhouette")+
  labs(subtitle = "Silhouette method")
# Gap statistic
# nboot = 50 to keep the function speedy. 
# recommended value: nboot= 500 for your analysis.
# Use verbose = FALSE to hide computing progression.
set.seed(123)
fviz_nbclust(df, kmeans, nstart = 25,  method = "gap_stat", nboot = 50)+
  labs(subtitle = "Gap statistic method")</code></pre>
<pre><code>## Clustering k = 1,2,..., K.max (= 10): .. done
## Bootstrapping, b = 1,2,..., B (= 50)  [one "." per sample]:
## .................................................. 50</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/015-determining-the-optimal-number-of-clusters-k-means-optimal-clusters-wss-silhouette-1.png" width="288" /><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/015-determining-the-optimal-number-of-clusters-k-means-optimal-clusters-wss-silhouette-2.png" width="288" /><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/015-determining-the-optimal-number-of-clusters-k-means-optimal-clusters-wss-silhouette-3.png" width="288" /></p>
<div class="success">
<ul>
<li>
Elbow method: 4-cluster solution suggested
</li>
<li>
Silhouette method: 2-cluster solution suggested
</li>
<li>
Gap statistic method: 4-cluster solution suggested
</li>
</ul>
</div>
<p>According to these observations, it’s possible to define k = 4 as the optimal number of clusters in the data.</p>
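<p>As a quick check, one could run k-means with the suggested k = 4 and visualize the partition. Below is a minimal sketch; the seed and nstart = 25 are illustrative choices:</p>
<pre class="r"><code># K-means with the suggested number of clusters (illustrative sketch)
set.seed(123)
km.res <- kmeans(df, centers = 4, nstart = 25)
# Visualize the partition on the first two principal components
fviz_cluster(km.res, data = df, palette = "jco",
             geom = "point", ggtheme = theme_classic())</code></pre>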
<div class="warning">
<p>
The disadvantage of the elbow and average silhouette methods is that they measure only a global clustering characteristic. A more sophisticated approach is the gap statistic, which provides a statistical procedure to formalize the elbow/silhouette heuristic in order to estimate the optimal number of clusters.
</p>
</div>
</div>
<div id="nbclust-function-30-indices-for-choosing-the-best-number-of-clusters" class="section level3">
<h3>NbClust() function: 30 indices for choosing the best number of clusters</h3>
<p>The simplified format of the function <em>NbClust</em>() is:</p>
<pre class="r"><code>NbClust(data = NULL, diss = NULL, distance = "euclidean",
        min.nc = 2, max.nc = 15, method = NULL)</code></pre>
<div class="block">
<ul>
<li>
<strong>data</strong>: matrix
</li>
<li>
<strong>diss</strong>: dissimilarity matrix to be used. By default, diss=NULL, but if it is replaced by a dissimilarity matrix, distance should be “NULL”
</li>
<li>
<strong>distance</strong>: the distance measure to be used to compute the dissimilarity matrix. Possible values include “euclidean”, “manhattan” or “NULL”.
</li>
<li>
<strong>min.nc, max.nc</strong>: minimal and maximal number of clusters, respectively
</li>
<li>
<strong>method</strong>: The cluster analysis method to be used including “ward.D”, “ward.D2”, “single”, “complete”, “average”, “kmeans” and more.
</li>
</ul>
</div>
<ul>
<li>To compute <em>NbClust</em>() for kmeans, use method = “kmeans”.</li>
<li>To compute <em>NbClust</em>() for hierarchical clustering, method should be one of c(“ward.D”, “ward.D2”, “single”, “complete”, “average”).</li>
</ul>
<p>The R code below computes <em>NbClust</em>() for k-means:</p>
<pre class="r"><code>library("NbClust")
nb <- NbClust(df, distance = "euclidean", min.nc = 2,
        max.nc = 10, method = "kmeans")</code></pre>
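<p>The same call also works for hierarchical clustering by changing the method argument. For example, a brief sketch with Ward linkage (the object name nb_hc is arbitrary, and the index results will generally differ from the k-means run above):</p>
<pre class="r"><code># NbClust for hierarchical clustering (Ward linkage)
nb_hc <- NbClust(df, distance = "euclidean", min.nc = 2,
                 max.nc = 10, method = "ward.D2")</code></pre>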
<p>The result of <em>NbClust</em>() (here, the k-means run) can be visualized using the function <em>fviz_nbclust</em>() [in <em>factoextra</em>], as follows:</p>
<pre class="r"><code>library("factoextra")
fviz_nbclust(nb)</code></pre>
<pre><code>## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 10 proposed  2 as the best number of clusters
## * 2 proposed  3 as the best number of clusters
## * 8 proposed  4 as the best number of clusters
## * 1 proposed  5 as the best number of clusters
## * 1 proposed  8 as the best number of clusters
## * 2 proposed  10 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  2 .</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/015-determining-the-optimal-number-of-clusters-nbclust-1.png" width="384" /></p>
<div class="success">
<ul>
<li>
2 indices proposed 0 as the best number of clusters.
</li>
<li>
10 indices proposed 2 as the best number of clusters.
</li>
<li>
2 indices proposed 3 as the best number of clusters.
</li>
<li>
8 indices proposed 4 as the best number of clusters.
</li>
</ul>
<p>
According to the majority rule, the best number of clusters is 2.
</p>
</div>
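<p>Beyond the plot, the number of clusters proposed by each index and the corresponding partition can be extracted from the NbClust result. A brief sketch, assuming the standard NbClust output elements Best.nc and Best.partition:</p>
<pre class="r"><code># Number of clusters proposed by each index
nb$Best.nc
# Cluster assignment for the majority-rule solution
head(nb$Best.partition)</code></pre>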
</div>
</div>
<div id="summary" class="section level2">
<h2>Summary</h2>
<p>In this article, we described different methods for choosing the optimal number of clusters in a data set. These methods include the elbow, the silhouette and the gap statistic methods.</p>
<p>We demonstrated how to compute these methods using the R function <em>fviz_nbclust</em>() [in <em>factoextra</em> R package]. Additionally, we described the <em>NbClust</em>() function [in <em>NbClust</em> R package], which can be used to simultaneously compute many indices and methods for determining the number of clusters.</p>
<p>After choosing the number of clusters k, the next step is to perform partitioning clustering, as described in the k-means clustering chapter (Chapter @ref(kmeans-clustering)).</p>
</div>
<div id="references" class="section level2 unnumbered">
<h2>References</h2>
<div id="refs" class="references">
<div id="ref-charrad2014">
<p>Charrad, Malika, Nadia Ghazzali, Véronique Boiteau, and Azam Niknafs. 2014. “NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set.” <em>Journal of Statistical Software</em> 61: 1–36. <a href="http://www.jstatsoft.org/v61/i06/paper" class="uri">http://www.jstatsoft.org/v61/i06/paper</a>.</p>
</div>
<div id="ref-kaufman1990">
<p>Kaufman, Leonard, and Peter Rousseeuw. 1990. <em>Finding Groups in Data: An Introduction to Cluster Analysis</em>.</p>
</div>
</div>
</div>
</div><!--end rdoc-->
 
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
  (function () {
    var script = document.createElement("script");
    script.type = "text/javascript";
    script.src  = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
    document.getElementsByTagName("head")[0].appendChild(script);
  })();
</script>

<!-- END HTML -->]]></description>
			<pubDate>Thu, 07 Sep 2017 09:14:00 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Assessing Clustering Tendency: Essentials]]></title>
			<link>https://www.sthda.com/english/articles/29-cluster-validation-essentials/95-assessing-clustering-tendency-essentials/</link>
			<guid>https://www.sthda.com/english/articles/29-cluster-validation-essentials/95-assessing-clustering-tendency-essentials/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">

<p>Before applying any clustering method on your data, it’s important to evaluate whether the data set contains meaningful clusters (i.e., non-random structures) or not. If it does, the next question is how many clusters there are. This process is defined as assessing the clustering tendency, or the feasibility of the clustering analysis.</p>
<p>A big issue in cluster analysis is that clustering methods will return clusters even if the data does not contain any. In other words, if you blindly apply a clustering method to a data set, it will divide the data into clusters because that is what it is supposed to do.</p>
<div class="block">
<p>
In this chapter, we start by describing why we should evaluate the clustering tendency before applying any clustering method to a data set. Next, we provide statistical and visual methods for assessing the clustering tendency.
</p>
</div>
<p>Contents:</p>
<div id="TOC">
<ul>
<li><a href="#required-r-packages">Required R packages</a></li>
<li><a href="#data-preparation">Data preparation</a></li>
<li><a href="#visual-inspection-of-the-data">Visual inspection of the data</a></li>
<li><a href="#why-assessing-clustering-tendency">Why assessing clustering tendency?</a></li>
<li><a href="#methods-for-assessing-clustering-tendency">Methods for assessing clustering tendency</a><ul>
<li><a href="#statistical-methods">Statistical methods</a></li>
<li><a href="#visual-methods">Visual methods</a></li>
</ul></li>
<li><a href="#summary">Summary</a></li>
<li><a href="#references">References</a></li>
</ul>
</div><br/>
<p>Related Book:</p>
<div class = "small-block content-privileged-friends cluster-book">
    <center>
        <a href = "https://www.sthda.com/english/web/5-bookadvisor/17-practical-guide-to-cluster-analysis-in-r/">
          <img src = "https://www.sthda.com/english/sthda-upload/images/cluster-analysis/clustering-book-cover.png" /><br/>
      Practical Guide to Cluster Analysis in R
      </a>
      </center>
</div>
<div class="spacer"></div>


<div id="required-r-packages" class="section level2">
<h2>Required R packages</h2>
<ul>
<li><em>factoextra</em> for data visualization</li>
<li><em>clustertend</em> for statistical assessment of clustering tendency</li>
</ul>
<p>To install the two packages, type this:</p>
<pre class="r"><code>install.packages(c("factoextra", "clustertend"))</code></pre>
</div>
<div id="data-preparation" class="section level2">
<h2>Data preparation</h2>
<p>We’ll use two data sets:</p>
<ul>
<li>the built-in R data set iris.</li>
<li>and a random data set generated from the iris data set.</li>
</ul>
<p>The iris data set looks like this:</p>
<pre class="r"><code>head(iris, 3)</code></pre>
<pre><code>##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa</code></pre>
<p>We start by excluding the column “Species” at position 5:</p>
<pre class="r"><code># Iris data set
df <- iris[, -5]
# Random data generated from the iris data set
random_df <- apply(df, 2, 
                function(x){runif(length(x), min(x), (max(x)))})
random_df <- as.data.frame(random_df)
# Standardize the data sets
df <- iris.scaled <- scale(df)
random_df <- scale(random_df)</code></pre>
</div>
<div id="visual-inspection-of-the-data" class="section level2">
<h2>Visual inspection of the data</h2>
<p>We start by visualizing the data to assess whether they contain any meaningful clusters.</p>
<p>As the data contain more than two variables, we need to reduce the dimensionality in order to draw a scatter plot. This can be done using the principal component analysis (PCA) algorithm (R function: <em>prcomp</em>()). After performing PCA, we use the function <em>fviz_pca_ind</em>() [<em>factoextra</em> R package] to visualize the output.</p>
<p>The iris and the random data sets can be illustrated as follows:</p>
<pre class="r"><code>library("factoextra")
# Plot the iris data set
fviz_pca_ind(prcomp(df), title = "PCA - Iris data", 
             habillage = iris$Species,  palette = "jco",
             geom = "point", ggtheme = theme_classic(),
             legend = "bottom")

# Plot the random df
fviz_pca_ind(prcomp(random_df), title = "PCA - Random data", 
             geom = "point", ggtheme = theme_classic())</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/014-assessing-clustering-tendency-principal-component-analysis-1.png" width="288" /><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/014-assessing-clustering-tendency-principal-component-analysis-2.png" width="288" /></p>
<div class="success">
<p>
It can be seen that the iris data set contains 3 real clusters. However, the randomly generated data set doesn’t contain any meaningful clusters.
</p>
</div>
</div>
<div id="why-assessing-clustering-tendency" class="section level2">
<h2>Why assess clustering tendency?</h2>
<p>In order to illustrate why it’s important to assess cluster tendency, we start by computing k-means clustering (Chapter @ref(kmeans-clustering)) and hierarchical clustering (Chapter @ref(agglomerative-clustering)) on the two data sets (the real and the random data). The functions <em>fviz_cluster</em>() and <em>fviz_dend</em>() [in <em>factoextra</em> R package] will be used to visualize the results.</p>
<pre class="r"><code>library(factoextra)
set.seed(123)
# K-means on iris dataset
km.res1 <- kmeans(df, 3)
fviz_cluster(list(data = df, cluster = km.res1$cluster),
             ellipse.type = "norm", geom = "point", stand = FALSE,
             palette = "jco", ggtheme = theme_classic())</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/014-assessing-clustering-tendency-k-means-real-data-1.png" width="288" /></p>
<pre class="r"><code># K-means on the random dataset
km.res2 <- kmeans(random_df, 3)
fviz_cluster(list(data = random_df, cluster = km.res2$cluster),
             ellipse.type = "norm", geom = "point", stand = FALSE,
             palette = "jco", ggtheme = theme_classic())

# Hierarchical clustering on the random dataset
fviz_dend(hclust(dist(random_df)), k = 3, k_colors = "jco",  
          as.ggplot = TRUE, show_labels = FALSE)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/014-assessing-clustering-tendency-k-means-random-data-1.png" width="288" /><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/014-assessing-clustering-tendency-k-means-random-data-2.png" width="288" /></p>
<div class="warning">
<p>
It can be seen that the k-means algorithm and hierarchical clustering impose a classification on the random, uniformly distributed data set even though it contains no meaningful clusters. This is why clustering tendency assessment methods should be used to evaluate the validity of a clustering analysis, that is, whether a given data set contains meaningful clusters.
</p>
</div>
</div>
<div id="methods-for-assessing-clustering-tendency" class="section level2">
<h2>Methods for assessing clustering tendency</h2>
<p>In this section, we’ll describe two methods for evaluating the clustering tendency: i) a statistical method (the <em>Hopkins statistic</em>) and ii) a visual method (the <em>Visual Assessment of cluster Tendency</em> (VAT) algorithm).</p>
<div id="statistical-methods" class="section level3">
<h3>Statistical methods</h3>
<p>The <em>Hopkins statistic</em> <span class="citation">(Lawson and Jurs 1990)</span> is used to assess the clustering tendency of a data set by measuring the probability that a given data set is generated by a uniform data distribution. In other words, it tests the spatial randomness of the data.</p>
<p>For example, let D be a real data set. The Hopkins statistic can be calculated as follows:</p>
<ol style="list-style-type: decimal">
<li><p>Sample uniformly <span class="math inline">\(n\)</span> points (<span class="math inline">\(p_1\)</span>,…, <span class="math inline">\(p_n\)</span>) from D.</p></li>
<li><p>Compute the distance, <span class="math inline">\(x_i\)</span>, from each real point to its nearest neighbor: For each point <span class="math inline">\(p_i \in D\)</span>, find its nearest neighbor <span class="math inline">\(p_j\)</span>; then compute the distance between <span class="math inline">\(p_i\)</span> and <span class="math inline">\(p_j\)</span> and denote it as <span class="math inline">\(x_i = dist(p_i, p_j)\)</span>.</p></li>
<li><p>Generate a simulated data set (<span class="math inline">\(random_D\)</span>) drawn from a random uniform distribution with <span class="math inline">\(n\)</span> points (<span class="math inline">\(q_1\)</span>,…, <span class="math inline">\(q_n\)</span>) and the same variation as the original real data set D.</p></li>
<li><p>Compute the distance, <span class="math inline">\(y_i\)</span>, from each artificial point to its nearest real data point: For each point <span class="math inline">\(q_i \in random_D\)</span>, find its nearest neighbor <span class="math inline">\(q_j\)</span> in D; then compute the distance between <span class="math inline">\(q_i\)</span> and <span class="math inline">\(q_j\)</span> and denote it <span class="math inline">\(y_i = dist(q_i, q_j)\)</span>.</p></li>
<li><p>Calculate the Hopkins statistic (H) as the mean nearest neighbor distance in the simulated data set divided by the sum of the mean nearest neighbor distances in the real and the simulated data sets.</p></li>
</ol>
<p>The formula is defined as follows:</p>
<p><span class="math display">\[H = \frac{\sum\limits_{i=1}^ny_i}{\sum\limits_{i=1}^nx_i + \sum\limits_{i=1}^ny_i}\]</span></p>
<p>How to interpret the Hopkins statistics? If <span class="math inline">\(D\)</span> were uniformly distributed, then <span class="math inline">\(\sum\limits_{i=1}^ny_i\)</span> and <span class="math inline">\(\sum\limits_{i=1}^nx_i\)</span> would be close to each other, and thus <span class="math inline">\(H\)</span> would be about 0.5. However, if clusters are present in D, then the distances for artificial points (<span class="math inline">\(\sum\limits_{i=1}^ny_i\)</span>) would be substantially larger than for the real ones (<span class="math inline">\(\sum\limits_{i=1}^nx_i\)</span>) in expectation, and thus the value of <span class="math inline">\(H\)</span> will increase <span class="citation">(Han, Kamber, and Pei 2012)</span>.</p>
<p>A value for <span class="math inline">\(H\)</span> higher than 0.75 indicates a clustering tendency at the 90% confidence level.</p>
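<p>To make the steps above concrete, here is a minimal hand-rolled sketch of the computation in R (for illustration only; the function name hopkins_sketch is arbitrary). The packages described below provide ready-made implementations:</p>
<pre class="r"><code># Illustrative sketch of the Hopkins statistic (not a package implementation)
hopkins_sketch <- function(data, n = 10, seed = 123) {
  set.seed(seed)
  data <- as.matrix(data)
  # Distance from a point to its nearest (other) neighbor in a reference set
  nn_dist <- function(point, ref) {
    d <- sqrt(rowSums(sweep(ref, 2, point)^2))
    min(d[d > 0])
  }
  # 1-2. Sample n real points and compute x_i, their nearest neighbor distances in D
  idx <- sample(nrow(data), n)
  x <- apply(data[idx, , drop = FALSE], 1, nn_dist, ref = data)
  # 3. Simulate n uniform points within the range of each variable
  q <- sapply(seq_len(ncol(data)),
              function(j) runif(n, min(data[, j]), max(data[, j])))
  # 4. y_i: distance from each simulated point to its nearest real point
  y <- apply(q, 1, nn_dist, ref = data)
  # 5. H = sum(y) / (sum(x) + sum(y))
  sum(y) / (sum(x) + sum(y))
}
hopkins_sketch(df)  # df: the scaled iris data prepared above</code></pre>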
<p>The null and the alternative hypotheses are defined as follows:</p>
<ul>
<li><strong>Null hypothesis</strong>: the data set D is uniformly distributed (i.e., no meaningful clusters)</li>
<li><strong>Alternative hypothesis</strong>: the data set D is not uniformly distributed (i.e., contains meaningful clusters)</li>
</ul>
<div class="success">
<p>
We can conduct the Hopkins Statistic test iteratively, using 0.5 as the threshold to reject the alternative hypothesis. That is, if H < 0.5, then it is unlikely that D has statistically significant clusters.
</p>
<p>
Put in other words, if the value of the Hopkins statistic is close to 1, then we can reject the null hypothesis and conclude that the data set D contains statistically significant clusters.
</p>
</div>
<p>Here, we present two R functions / packages to statistically evaluate clustering tendency by computing the Hopkins statistics:</p>
<ol style="list-style-type: decimal">
<li><code>get_clust_tendency()</code> function [in factoextra package]. It returns the Hopkins statistics as defined in the formula above. The result is a list containing two elements:
<ul>
<li>hopkins_stat</li>
<li>and plot</li>
</ul></li>
<li><code>hopkins()</code> function [in clustertend package]. It implements 1 - H, the complement of the definition of H provided here.</li>
</ol>
<p>In the R code below, we’ll use the factoextra R package. Make sure that you have the latest version (or install: <code>devtools::install_github("kassambara/factoextra")</code>).</p>
<pre class="r"><code>library(factoextra)
# Compute Hopkins statistic for iris dataset
res <- get_clust_tendency(df, n = nrow(df)-1, graph = FALSE)
res$hopkins_stat</code></pre>
<pre><code>## [1] 0.818</code></pre>
<pre class="r"><code># Compute Hopkins statistic for a random dataset
res <- get_clust_tendency(random_df, n = nrow(random_df)-1,
                          graph = FALSE)
res$hopkins_stat</code></pre>
<pre><code>## [1] 0.466</code></pre>
<div class="success">
<p>
It can be seen that the iris data set is highly clusterable (the <strong>H</strong> value = 0.82, which is far above the threshold of 0.5). However, the random_df data set is not clusterable (<span class="math inline"><em>H</em> = 0.47</span>).
</p>
</div>
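<p>For comparison, the clustertend package can be used as well. A brief sketch, assuming its hopkins() function takes the data and the number of points to sample; as noted above, it returns 1 - H, so values close to 0 suggest clusterable data:</p>
<pre class="r"><code>library(clustertend)
set.seed(123)
# Returns 1 - H (values close to 0 suggest a clusterable data set)
hopkins(df, n = nrow(df) - 1)</code></pre>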
</div>
<div id="visual-methods" class="section level3">
<h3>Visual methods</h3>
<p>The algorithm of the visual assessment of cluster tendency (VAT) approach (Bezdek and Hathaway, 2002) is as follows:</p>
<div class="block">
<ol style="list-style-type: decimal">
<li>
<p>
Compute the dissimilarity (DM) matrix between the objects in the data set using the Euclidean distance measure
</p>
</li>
<li>
<p>
Reorder the DM so that similar objects are close to one another. This process creates an ordered dissimilarity matrix (ODM)
</p>
</li>
<li>
<p>
The ODM is displayed as an ordered dissimilarity image (ODI), which is the visual output of VAT
</p>
</li>
</ol>
</div>
<p>For the visual assessment of clustering tendency, we start by computing the dissimilarity matrix between observations using the function <em>dist</em>(). Next the function <em>fviz_dist</em>() [factoextra package] is used to display the dissimilarity matrix.</p>
<pre class="r"><code>fviz_dist(dist(df), show_labels = FALSE)+
  labs(title = "Iris data")

fviz_dist(dist(random_df), show_labels = FALSE)+
  labs(title = "Random data")</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/014-assessing-clustering-tendency-dissimilarity-matrix-1.png" width="288" /><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/014-assessing-clustering-tendency-dissimilarity-matrix-2.png" width="288" /></p>
<ul>
<li>Red: high similarity (i.e., low dissimilarity) | Blue: low similarity</li>
</ul>
<p>The color level is proportional to the value of the dissimilarity between observations: pure red if <span class="math inline">\(dist(x_i, x_j) = 0\)</span> and pure blue if <span class="math inline">\(dist(x_i, x_j) = 1\)</span>. Objects belonging to the same cluster are displayed in consecutive order.</p>
<div class="success">
<p>
The dissimilarity matrix image confirms that there is a cluster structure in the iris data set but not in the random one.
</p>
</div>
<p>The VAT detects the clustering tendency in a visual form by counting the number of square shaped dark blocks along the diagonal in a VAT image.</p>
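<p>Note that, as described earlier, get_clust_tendency() also returns a plot element, so the Hopkins statistic and an ordered dissimilarity image can be obtained in one call. A brief sketch:</p>
<pre class="r"><code># Hopkins statistic and ordered dissimilarity image in one call
res <- get_clust_tendency(df, n = nrow(df) - 1, graph = TRUE)
res$hopkins_stat
res$plot</code></pre>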
</div>
</div>
<div id="summary" class="section level2">
<h2>Summary</h2>
<p>In this article, we described how to assess clustering tendency using the Hopkins statistic and a visual method. After showing that a data set is clusterable, the next step is to determine the optimal number of clusters in the data. This will be described in the next chapter.</p>
</div>
<div id="references" class="section level2 unnumbered">
<h2>References</h2>
<div id="refs" class="references">
<div id="ref-tagkey2012">
<p>Han, Jiawei, Micheline Kamber, and Jian Pei. 2012. <em>Data Mining: Concepts and Techniques</em>. 3rd ed. Boston: Morgan Kaufmann. <a href="https://doi.org/10.1016/B978-0-12-381479-1.00016-2" class="uri">https://doi.org/10.1016/B978-0-12-381479-1.00016-2</a>.</p>
</div>
<div id="ref-lawson1990">
<p>Lawson, Richard G., and Peter C. Jurs. 1990. “New Index for Clustering Tendency and Its Application to Chemical Problems.” <em>Journal of Chemical Information and Computer Sciences</em> 30 (1): 36–41. <a href="http://pubs.acs.org/doi/abs/10.1021/ci00065a010" class="uri">http://pubs.acs.org/doi/abs/10.1021/ci00065a010</a>.</p>
</div>
</div>
</div>


</div><!--end rdoc-->

 
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
  (function () {
    var script = document.createElement("script");
    script.type = "text/javascript";
    script.src  = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
    document.getElementsByTagName("head")[0].appendChild(script);
  })();
</script>

<!-- END HTML -->]]></description>
			<pubDate>Thu, 07 Sep 2017 08:45:00 +0200</pubDate>
			
		</item>
		
	</channel>
</rss>
