Easy Guides

Practical Guide to Cluster Analysis in R - Book

Wed, 08 Feb 2017 06:30:53 +0100

Introduction

Large amounts of data are collected every day from satellite images, bio-medical, security, marketing, web search, geo-spatial or other automatic equipment. Mining knowledge from these big data far exceeds human abilities.

Clustering is one of the important data mining methods for discovering knowledge in multidimensional data. The goal of clustering is to identify patterns or groups of similar objects within a data set of interest.

In the literature, it is referred to as “pattern recognition” or “unsupervised machine learning”: “unsupervised” because we are not guided by a priori ideas of which variables or samples belong in which clusters, and “learning” because the machine algorithm “learns” how to cluster.

Cluster analysis is popular in many fields, including:

In cancer research, for classifying patients into subgroups according to their gene expression profile. This can be useful for identifying the molecular profile of patients with a good or bad prognosis, as well as for understanding the disease.
In marketing, for market segmentation by identifying subgroups of customers with similar profiles who might be receptive to a particular form of advertising.
In city planning, for identifying groups of houses according to their type, value and location.

This book provides a practical guide to unsupervised machine learning, or cluster analysis, using R software. Additionally, we developed an R package named factoextra to easily create elegant ggplot2-based plots of cluster analysis results. Factoextra official online documentation: https://www.sthda.com/english/rpkgs/factoextra

Preview of the first 38 pages of the book: Practical Guide to Cluster Analysis in R (preview).

Download the ebook through payhip:

Order a physical copy from amazon:

Key features of this book

Although there are several good books on unsupervised machine learning/clustering and related topics, we felt that many of them are either too high-level, theoretical or too advanced. Our goal was to write a practical guide to cluster analysis, elegant visualization and interpretation.

The main parts of the book include:

distance measures,
partitioning clustering,
hierarchical clustering,
cluster validation methods, as well as,
advanced clustering methods such as fuzzy clustering, density-based clustering and model-based clustering.

The book presents the basic principles of these tasks and provides many examples in R. This book offers solid guidance in data mining for students and researchers.

Key features:

Covers clustering algorithms and their implementation
Key mathematical concepts are presented
Short, self-contained chapters with practical examples. This means that you don’t need to read the different chapters in sequence.

At the end of each chapter, we present R lab sections in which we systematically work through applications of the various methods discussed in that chapter.

How this book is organized

This book contains 5 parts. Part I (Chapters 1 to 3) provides a quick introduction to R (Chapter 1) and presents the required R packages and data format (Chapter 2) for cluster analysis and visualization.

The classification of objects into clusters requires some method for measuring the distance or the (dis)similarity between the objects. Chapter 3 covers the common distance measures used for assessing similarity between observations.
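As a minimal sketch of what such distance measures look like in practice, the base R code below computes a Euclidean and a correlation-based distance matrix. The choice of the built-in USArrests data and the object names d_euc and d_cor are assumptions made for illustration; they are not taken from the book chapter.

```r
# Standardize the variables so that none of them dominates the distance
df <- scale(USArrests)

# Euclidean distance between observations (rows)
d_euc <- dist(df, method = "euclidean")

# Correlation-based distance: 1 - Pearson correlation between rows
d_cor <- as.dist(1 - cor(t(df)))

# Inspect the distances between the first three states
round(as.matrix(d_euc)[1:3, 1:3], 2)
round(as.matrix(d_cor)[1:3, 1:3], 2)
```

The Euclidean distance compares magnitudes, whereas the correlation-based distance compares the shape of the profiles, so two observations can be close under one measure and far apart under the other.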

Part II starts with partitioning clustering methods, which include:

K-means clustering (Chapter 4),
K-Medoids or PAM (partitioning around medoids) algorithm (Chapter 5) and
CLARA algorithms (Chapter 6).

Partitioning clustering approaches subdivide the data set into a set of k groups, where k is the number of groups pre-specified by the analyst.

[Figure: cluster analysis in R]

In Part III, we consider the agglomerative hierarchical clustering method, which is an alternative approach to partitioning clustering for identifying groups in a data set. It does not require the number of clusters to be specified in advance. The result of hierarchical clustering is a tree-based representation of the objects, which is also known as a dendrogram (see the figure below).
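A minimal base R sketch of this idea is given below. The dataset (USArrests), the Ward linkage and the choice of 4 groups are illustrative assumptions; the tree itself is built without fixing the number of clusters, which is only chosen when the tree is cut.

```r
df <- scale(USArrests)

# Agglomerative clustering with Ward's criterion on Euclidean distances
hc <- hclust(dist(df), method = "ward.D2")

# The number of groups is chosen only now, when cutting the tree
grp <- cutree(hc, k = 4)
table(grp)

# Draw the dendrogram and frame the 4 groups
plot(hc, cex = 0.6, hang = -1)
rect.hclust(hc, k = 4, border = 2:5)
```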

In this part, we describe how to compute, visualize, interpret and compare dendrograms:

Agglomerative clustering (Chapter 7)
- Algorithm and steps
- Verify the cluster tree
- Cut the dendrogram into different groups
Compare dendrograms (Chapter 8)
- Visual comparison of two dendrograms
- Correlation matrix between a list of dendrograms
Visualize dendrograms (Chapter 9)
- Case of small data sets
- Case of dendrogram with large data sets: zoom, sub-tree, PDF
- Customize dendrograms using dendextend
Heatmap: static and interactive (Chapter 10)
- R base heat maps
- Pretty heat maps
- Interactive heat maps
- Complex heatmap
- Real application: gene expression data

In this section, you will learn how to generate and interpret the following plots.

Standard dendrogram with filled rectangle around clusters:

[Figure: dendrogram with rectangles drawn around the clusters]

Compare two dendrograms:

[Figure: comparison of two dendrograms]

Heatmap:

[Figure: heatmap]

Part IV describes clustering validation and evaluation strategies, which consist of measuring the goodness of clustering results. Before applying any clustering algorithm to a data set, the first thing to do is to assess the clustering tendency, that is, whether applying clustering is suitable for the data. If it is, the next question is how many clusters there are. Next, you can perform hierarchical clustering or partitioning clustering (with a pre-specified number of clusters). Finally, you can use a number of measures, described in this part, to evaluate the goodness of the clustering results.

The chapters included in Part IV are organized as follows:

Assessing clustering tendency (Chapter 11)
Determining the optimal number of clusters (Chapter 12)
Cluster validation statistics (Chapter 13)
Choosing the best clustering algorithms (Chapter 14)
Computing p-value for hierarchical clustering (Chapter 15)

In this section, you’ll learn how to create and interpret the plots hereafter.

Visual assessment of clustering tendency (left panel): Clustering tendency is detected in a visual form by counting the number of square shaped dark blocks along the diagonal in the image.
Determine the optimal number of clusters (right panel) in a data set using the gap statistic.

[Figure: visual assessment of clustering tendency (left panel) and gap statistic plot (right panel)]

Cluster validation using the silhouette coefficient (Si): A value of Si close to 1 indicates that the object is well clustered. A value of Si close to -1 indicates that the object is poorly clustered. The figure below shows the silhouette plot of a k-means clustering.

[Figure: silhouette plot of a k-means clustering]

Part V presents advanced clustering methods, including:

Hierarchical k-means clustering (Chapter 16)
Fuzzy clustering (Chapter 17)
Model-based clustering (Chapter 18)
DBSCAN: Density-Based Clustering (Chapter 19)

The hierarchical k-means clustering is a hybrid approach for improving k-means results.

In Fuzzy clustering, items can be a member of more than one cluster. Each item has a set of membership coefficients corresponding to the degree of being in a given cluster.
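To make the notion of membership coefficients concrete, here is a small sketch that computes fuzzy c-means style memberships for fixed cluster centers. It is an illustration under stated assumptions, not the algorithm covered in Chapter 17: the centers come from an ordinary k-means run, the fuzzifier m = 2 is a common default, and the names centers, d2, w and u are made up for this example.

```r
df <- scale(USArrests)
set.seed(123)
centers <- kmeans(df, centers = 3, nstart = 25)$centers

m <- 2  # fuzzifier (assumed value)

# Squared Euclidean distance from every observation to every center
d2 <- sapply(seq_len(nrow(centers)), function(k) {
  rowSums(sweep(df, 2, centers[k, ])^2)
})

# Membership of observation i in cluster k is proportional to
# d_ik^(-2 / (m - 1)); each row is normalized to sum to 1
w <- d2^(-1 / (m - 1))
u <- w / rowSums(w)
round(head(u), 2)
```

Each row of u sums to 1, and an observation lying between two centers receives comparable coefficients for both clusters instead of a hard assignment.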

In model-based clustering, the data are viewed as coming from a distribution that is a mixture of two or more clusters. It finds the best fit of models to data and estimates the number of clusters.

The density-based clustering (DBSCAN) is a partitioning method that was introduced in Ester et al. (1996). It can find clusters of different shapes and sizes in data containing noise and outliers.

[Figure: cluster analysis in R]

Hybrid hierarchical k-means clustering for optimizing clustering outputs - Unsupervised Machine Learning

Mon, 14 Nov 2016 10:43:38 +0100

1 How this article is organized
2 Required R packages
3 Data preparation
4 R function for clustering analyses
- 4.1 Example of k-means clustering
- 4.2 Example of hierarchical clustering
5 Combining hierarchical clustering and k-means
6 Infos

Clustering algorithms are used to split a dataset into several groups (i.e., clusters), so that the objects in the same group are as similar as possible and the objects in different groups are as dissimilar as possible.

The most popular clustering algorithms are:

k-means clustering, a partitioning method used for splitting a dataset into a set of k clusters.
hierarchical clustering, an alternative approach to k-means clustering for identifying groups in the dataset by using the pairwise distance matrix between observations as the clustering criterion.

However, each of these two standard clustering methods has its limitations. K-means clustering requires the user to specify the number of clusters in advance and selects initial centroids randomly. Agglomerative hierarchical clustering is good at identifying small clusters but not large ones.

In this article, we document hybrid approaches for easily mixing the best of k-means clustering and hierarchical clustering.

1 How this article is organized

We’ll start by demonstrating why we should combine k-means and hierarchical clustering. An application is provided using R software.

Finally, we’ll provide an easy-to-use R function (in the factoextra package) for computing hybrid hierarchical k-means clustering.

2 Required R packages

We’ll use the R package factoextra, which is very helpful for simplifying clustering workflows and for visualizing clusters using the ggplot2 plotting system.

Install the factoextra package as follows:

if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/factoextra")

Load the package:

library(factoextra)

3 Data preparation

We’ll use the USArrests dataset and we start by scaling the data:

# Load the data
data(USArrests)
# Scale the data
df <- scale(USArrests)
head(df)

##                Murder   Assault   UrbanPop         Rape
## Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
## Arizona    0.07163341 1.4788032  0.9989801  1.042878388
## Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144  1.7589234  2.067820292
## Colorado   0.02571456 0.3988593  0.8608085  1.864967207

If you want to understand why the data are scaled before the analysis, then you should read this section: Distances and scaling.
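In short, the variables of USArrests are measured on very different scales, so an unscaled distance would be driven mostly by the variable with the largest spread. The base R check below is a small sketch of that point; it only uses scale(), apply() and colMeans().

```r
df <- scale(USArrests)

# Before scaling, Assault has a much larger spread than the other variables
round(apply(USArrests, 2, sd), 2)

# After scaling, every column has mean 0 and standard deviation 1
round(colMeans(df), 10)
apply(df, 2, sd)
```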

4 R function for clustering analyses

We’ll use the function eclust() [in factoextra] which provides several advantages as described in the previous chapter: Visual Enhancement of Clustering Analysis.

eclust() stands for enhanced clustering. It simplifies the workflow of clustering analysis, and it can be used for computing hierarchical clustering and partitioning clustering in a single-line function call.

4.1 Example of k-means clustering

We’ll split the data into 4 clusters using k-means clustering as follows:

library("factoextra")
# K-means clustering
km.res <- eclust(df, "kmeans", k = 4,
                 nstart = 25, graph = FALSE)
# k-means group number of each observation
head(km.res$cluster, 15)

##     Alabama      Alaska     Arizona    Arkansas  California    Colorado 
##           4           3           3           4           3           3 
## Connecticut    Delaware     Florida     Georgia      Hawaii       Idaho 
##           2           2           3           4           2           1 
##    Illinois     Indiana        Iowa 
##           3           2           1

# Visualize k-means clusters
# (ellipse.type / ellipse.level replace the deprecated frame.type / frame.level)
fviz_cluster(km.res, ellipse.type = "norm", ellipse.level = 0.68)

# Visualize the silhouette of clusters
fviz_silhouette(km.res)

##   cluster size ave.sil.width
## 1       1   13          0.37
## 2       2   16          0.34
## 3       3   13          0.27
## 4       4    8          0.39

Note that the silhouette coefficient measures how well an observation is clustered and it estimates the average distance between clusters (i.e., the average silhouette width). Observations with a negative silhouette are probably placed in the wrong cluster. Read more here: cluster validation statistics
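For readers who want to see where these numbers come from, the sketch below computes the silhouette width by hand in base R, using the usual definition s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance to the other members of the observation's own cluster and b(i) is the smallest mean distance to any other cluster. The object names (km, d, cl, sil_width) are assumptions for this example, and the k-means run is a plain kmeans() call rather than the eclust() call used above.

```r
df <- scale(USArrests)
set.seed(123)
km <- kmeans(df, centers = 4, nstart = 25)

d  <- as.matrix(dist(df))  # pairwise Euclidean distances
cl <- km$cluster

sil_width <- sapply(seq_len(nrow(df)), function(i) {
  own <- cl == cl[i]
  own[i] <- FALSE                      # exclude the observation itself
  if (!any(own)) return(0)             # convention for singleton clusters
  a <- mean(d[i, own])                 # cohesion: own cluster
  b <- min(sapply(setdiff(unique(cl), cl[i]),
                  function(k) mean(d[i, cl == k])))  # nearest other cluster
  (b - a) / max(a, b)
})
names(sil_width) <- rownames(df)

# Observations with a negative silhouette width, if any
sil_width[sil_width < 0]
```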

Samples with negative silhouette coefficient:

# Silhouette width of observation
sil <- km.res$silinfo$widths[, 1:3]
# Objects with negative silhouette
neg_sil_index <- which(sil[, 'sil_width'] < 0)
sil[neg_sil_index, , drop = FALSE]

##          cluster neighbor   sil_width
## Missouri       3        2 -0.07318144

Read more about k-means clustering: K-means clustering

4.2 Example of hierarchical clustering

# Enhanced hierarchical clustering
res.hc <- eclust(df, "hclust", k = 4,
                method = "ward.D2", graph = FALSE) 
head(res.hc$cluster, 15)

##     Alabama      Alaska     Arizona    Arkansas  California    Colorado 
##           1           2           2           3           2           2 
## Connecticut    Delaware     Florida     Georgia      Hawaii       Idaho 
##           3           3           2           1           3           4 
##    Illinois     Indiana        Iowa 
##           2           3           4

# Dendrogram
fviz_dend(res.hc, rect = TRUE, show_labels = TRUE, cex = 0.5)

# Visualize the silhouette of clusters
fviz_silhouette(res.hc)

##   cluster size ave.sil.width
## 1       1    7          0.46
## 2       2   12          0.29
## 3       3   19          0.26
## 4       4   12          0.43

It can be seen that two samples have a negative silhouette coefficient, indicating that they are not in the right cluster. These samples are:

# Silhouette width of observation
sil <- res.hc$silinfo$widths[, 1:3]
# Objects with negative silhouette
neg_sil_index <- which(sil[, 'sil_width'] < 0)
sil[neg_sil_index, , drop = FALSE]

##          cluster neighbor   sil_width
## Kentucky       3        4 -0.06459230
## Arkansas       3        1 -0.08467352

Read more about hierarchical clustering: Hierarchical clustering

5 Combining hierarchical clustering and k-means

5.1 Why?

Recall that, in the k-means algorithm, a random set of observations is chosen as the initial centers.

The final k-means clustering solution is very sensitive to this initial random selection of cluster centers. The result might be (slightly) different each time you compute k-means.
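This sensitivity can be checked with a short experiment. The sketch below, which assumes the scaled USArrests data and k = 4 as in the rest of this article, records the total within-cluster sum of squares of ten single-start runs and compares it with a run that keeps the best of 25 random starts.

```r
df <- scale(USArrests)

# Total within-cluster sum of squares for 10 single-start runs
wss_single <- sapply(1:10, function(seed) {
  set.seed(seed)
  kmeans(df, centers = 4, nstart = 1)$tot.withinss
})
round(wss_single, 2)

# Same seed as the first run, but keeping the best of 25 random starts
set.seed(1)
wss_multi <- kmeans(df, centers = 4, nstart = 25)$tot.withinss
wss_multi
```

If the single-start values are not all identical, the runs have converged to different local optima, which is exactly the instability described above.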

To avoid this, a solution is to use a hybrid approach by combining the hierarchical clustering and the k-means methods. This process is named hybrid hierarchical k-means clustering (hkmeans).

5.2 How ?

The procedure is as follows:

1. Compute hierarchical clustering and cut the tree into k clusters
2. Compute the center (i.e., the mean) of each cluster
3. Compute k-means by using the set of cluster centers (defined in step 2) as the initial cluster centers

Note that the k-means algorithm will improve the initial partitioning generated at step 1 of the algorithm. Hence, the initial partitioning can be slightly different from the final partitioning obtained in step 3.
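The procedure can also be sketched with base R only, which makes each step explicit. The helper name hybrid_kmeans() and its arguments are hypothetical; it is not part of factoextra and only mirrors the steps listed above, assuming Euclidean distances and Ward linkage.

```r
# Hypothetical helper, written for illustration only
hybrid_kmeans <- function(x, k, hc_method = "ward.D2") {
  x <- as.matrix(x)
  # Hierarchical clustering, tree cut into k groups
  grp <- cutree(hclust(dist(x), method = hc_method), k = k)
  # Cluster centers: per-group mean of each variable
  centers <- rowsum(x, grp) / as.vector(table(grp))
  # k-means started from these centers instead of random ones
  km <- kmeans(x, centers = centers)
  list(initial = grp, final = km)
}

res <- hybrid_kmeans(scale(USArrests), k = 4)

# Cross-tabulate initial (rows) against final (columns) partitions
table(res$initial, res$final$cluster)
```

Because the starting centers are deterministic, repeated calls give the same partition, and the off-diagonal counts of the table show which observations k-means moved.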

5.3 R codes

5.3.1 Compute hierarchical clustering and cut the tree into k-clusters:

res.hc <- eclust(df, "hclust", k = 4,
                method = "ward.D2", graph = FALSE) 
grp <- res.hc$cluster

5.3.2 Compute the centers of clusters defined by hierarchical clustering:

Cluster centers are defined as the means of variables in clusters. The function aggregate() can be used to compute the mean per group in a data frame.

# Compute cluster centers
clus.centers <- aggregate(df, list(grp), mean)
clus.centers

##   Group.1     Murder    Assault   UrbanPop        Rape
## 1       1  1.5803956  0.9662584 -0.7775109  0.04844071
## 2       2  0.7298036  1.1188219  0.7571799  1.32135653
## 3       3 -0.3621789 -0.3444705  0.3953887 -0.21863180
## 4       4 -1.0782511 -1.1370610 -0.9296640 -1.00344660

# Remove the first column
clus.centers <- clus.centers[, -1]
clus.centers

##       Murder    Assault   UrbanPop        Rape
## 1  1.5803956  0.9662584 -0.7775109  0.04844071
## 2  0.7298036  1.1188219  0.7571799  1.32135653
## 3 -0.3621789 -0.3444705  0.3953887 -0.21863180
## 4 -1.0782511 -1.1370610 -0.9296640 -1.00344660

5.3.3 K-means clustering using hierarchical clustering defined cluster-centers

km.res2 <- eclust(df, "kmeans", k = clus.centers, graph = FALSE)
fviz_silhouette(km.res2)

##   cluster size ave.sil.width
## 1       1    8          0.39
## 2       2   13          0.27
## 3       3   16          0.34
## 4       4   13          0.37

5.3.4 Compare the results of hierarchical clustering and hybrid approach

The R code below compares the initial clusters defined using only hierarchical clustering and the final ones defined using hierarchical clustering + k-means:

# res.hc$cluster: Initial clusters defined using hierarchical clustering
# km.res2$cluster: Final clusters defined using k-means
table(km.res2$cluster, res.hc$cluster)

##    
##      1  2  3  4
##   1  7  0  1  0
##   2  0 12  1  0
##   3  0  0 16  0
##   4  0  0  1 12

It can be seen that 3 of the observations defined as belonging to cluster 3 by hierarchical clustering have been reclassified to clusters 1, 2, and 4 in the final solution defined by k-means clustering.

The difference can be easily visualized using the function fviz_dend() [in factoextra]. The labels are colored using k-means clusters:

fviz_dend(res.hc, k = 4, 
          k_colors = c("black", "red",  "blue", "green3"),
          label_cols =  km.res2$cluster[res.hc$order], cex = 0.6)

It can be seen that the hierarchical clustering result has been improved by the k-means algorithm.

5.3.5 Compare the results of standard k-means clustering and hybrid approach

# Final clusters defined using hierarchical k-means clustering
km.clust <- km.res$cluster

# Standard k-means clustering
set.seed(123)
res.km <- kmeans(df, centers = 4, iter.max = 100)


# comparison
table(km.clust, res.km$cluster)

##         
## km.clust  1  2  3  4
##        1 13  0  0  0
##        2  0 16  0  0
##        3  0  0 13  0
##        4  0  0  0  8

In our current example, there was no further improvement of the k-means clustering result by the hybrid approach. An improvement might be observed using another dataset.

5.4 hkmeans(): Easy-to-use function for hybrid hierarchical k-means clustering

The function hkmeans() [in factoextra] can be used to easily compute the hybrid approach of k-means on hierarchical clustering. The format of the result is similar to the one provided by the standard kmeans() function.

# Compute hierarchical k-means clustering
res.hk <- hkmeans(df, 4)
# Elements returned by hkmeans()
names(res.hk)

##  [1] "cluster"      "centers"      "totss"        "withinss"    
##  [5] "tot.withinss" "betweenss"    "size"         "iter"        
##  [9] "ifault"       "data"         "hclust"

# Print the results
res.hk

## Hierarchical K-means clustering with 4 clusters of sizes 8, 13, 16, 13
## 
## Cluster means:
##       Murder    Assault   UrbanPop        Rape
## 1  1.4118898  0.8743346 -0.8145211  0.01927104
## 2  0.6950701  1.0394414  0.7226370  1.27693964
## 3 -0.4894375 -0.3826001  0.5758298 -0.26165379
## 4 -0.9615407 -1.1066010 -0.9301069 -0.96676331
## 
## Clustering vector:
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              2              2              1              2 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              2              3              3              2              1 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              3              4              2              3              4 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              3              4              1              4              2 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              3              2              4              1              2 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              4              4              2              4              3 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              2              2              1              4              3 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              3              3              3              3              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              4              1              2              3              4 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              3              3              4              4              3 
## 
## Within cluster sum of squares by cluster:
## [1]  8.316061 19.922437 16.212213 11.952463
##  (between_SS / total_SS =  71.2 %)
## 
## Available components:
## 
##  [1] "cluster"      "centers"      "totss"        "withinss"    
##  [5] "tot.withinss" "betweenss"    "size"         "iter"        
##  [9] "ifault"       "data"         "hclust"

# Visualize the tree
fviz_dend(res.hk, cex = 0.6, rect = TRUE)

# Visualize the hkmeans final clusters
# (ellipse.type / ellipse.level replace the deprecated frame.type / frame.level)
fviz_cluster(res.hk, ellipse.type = "norm", ellipse.level = 0.68)

6 Infos

This analysis has been performed using R software (ver. 3.2.4)

Assessing clustering tendency: A vital issue - Unsupervised Machine Learning

Fri, 28 Oct 2016 06:29:34 +0200

1 Required packages
2 Data preparation
- 2.1 faithful dataset
- 2.2 Random uniformly distributed dataset
3 Why assessing clustering tendency?
4 Methods for assessing clustering tendency
- 4.1 Hopkins statistic
  - 4.1.1 Algorithm
  - 4.1.2 R function for computing Hopkins statistic
- 4.2 VAT: Visual Assessment of cluster Tendency
  - 4.2.1 VAT Algorithm
  - 4.2.2 R functions for VAT
5 A single function for Hopkins statistic and VAT
6 Infos

Clustering algorithms, including partitioning methods (K-means, PAM, CLARA and FANNY) and hierarchical clustering, are used to split the dataset into groups or clusters of similar objects.

Before applying any clustering method on the dataset, a natural question is:

Does the dataset contain any inherent clusters?

A big issue, in unsupervised machine learning, is that clustering methods will return clusters even if the data does not contain any clusters. In other words, if you blindly apply a clustering analysis on a dataset, it will divide the data into clusters because that is what it is supposed to do.

Therefore, before choosing a clustering approach, the analyst has to decide whether the dataset contains meaningful clusters (i.e., nonrandom structures) or not. If it does, the next question is how many clusters there are. This process is defined as assessing the clustering tendency, or the feasibility of the clustering analysis.

In this chapter:

We describe why we should evaluate the clustering tendency (i.e., clusterability) before applying any cluster analysis on a dataset.
We describe statistical and visual methods for assessing the clustering tendency
R lab sections containing many examples are also provided for computing clustering tendency and visualizing clusters

1 Required packages

The following R packages are required in this chapter:

factoextra for data visualization
clustertend for assessing clustering tendency
seriation for visual assessment of cluster tendency

factoextra can be installed as follows:

if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/factoextra")

Install clustertend and seriation:

install.packages("clustertend")
install.packages("seriation")

Load required packages:

library(factoextra)
library(clustertend)
library(seriation)

2 Data preparation

We’ll use two datasets: the built-in R dataset faithful and a simulated dataset.

2.1 faithful dataset

faithful dataset contains the waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park (Wyoming, USA).

# Load the data
data("faithful")
df <- faithful
head(df)

##   eruptions waiting
## 1     3.600      79
## 2     1.800      54
## 3     3.333      74
## 4     2.283      62
## 5     4.533      85
## 6     2.883      55

An illustration of the data can be drawn using the ggplot2 package as follows:

library("ggplot2")
ggplot(df, aes(x=eruptions, y=waiting)) +
  geom_point() +  # Scatter plot
  geom_density_2d() # Add 2d density estimation

2.2 Random uniformly distributed dataset

The R code below generates random uniform data with the same dimensions as the faithful dataset. The function runif(n, min, max) is used for generating values from a uniform distribution on the interval from min to max.

# Generate random dataset
set.seed(123)
n <- nrow(df)

random_df <- data.frame(
  x = runif(nrow(df), min(df$eruptions), max(df$eruptions)),
  y = runif(nrow(df), min(df$waiting), max(df$waiting)))

# Plot the data
ggplot(random_df, aes(x, y)) + geom_point()

Note that for a given real dataset, random uniform data can be generated in a single-line function call as follows:

random_df <- apply(df, 2, 
                function(x, n){runif(n, min(x), (max(x)))}, n)

3 Why assessing clustering tendency?

As shown above, we know that the faithful dataset contains 2 real clusters. However, the randomly generated dataset doesn’t contain any meaningful clusters.

The R code below computes k-means clustering and/or hierarchical clustering on the two datasets. The functions fviz_cluster() and fviz_dend() [in factoextra] will be used to visualize the results.

library(factoextra)
set.seed(123)
# K-means on faithful dataset
km.res1 <- kmeans(df, 2)
fviz_cluster(list(data = df, cluster = km.res1$cluster),
             ellipse.type = "norm", geom = "point", stand = FALSE)

# K-means on the random dataset
km.res2 <- kmeans(random_df, 2)
fviz_cluster(list(data = random_df, cluster = km.res2$cluster),
             ellipse.type = "norm", geom = "point", stand = FALSE)

# Hierarchical clustering on the random dataset
fviz_dend(hclust(dist(random_df)), k = 2,  cex = 0.5)

It can be seen that the k-means algorithm and hierarchical clustering impose a classification on the random uniformly distributed dataset even if there are no meaningful clusters present in it.

Clustering tendency assessment methods are used to avoid this issue.

4 Methods for assessing clustering tendency

Clustering tendency assessment determines whether a given dataset contains meaningful clusters (i.e., non-random structure).

In this section, we’ll describe two methods for determining the clustering tendency: i) a statistical method (the Hopkins statistic) and ii) a visual method (the Visual Assessment of cluster Tendency (VAT) algorithm).

4.1 Hopkins statistic

The Hopkins statistic is used to assess the clustering tendency of a dataset by measuring the probability that a given dataset is generated by a uniform data distribution. In other words, it tests the spatial randomness of the data.

4.1.1 Algorithm

Let D be a real dataset. The Hopkins statistic can be calculated as follows:

Sample uniformly $n$ points ($p_1$,…, $p_n$) from D.
For each point $p_i \in D$, find its nearest neighbor $p_j$; then compute the distance between $p_i$ and $p_j$ and denote it as $x_i = dist(p_i, p_j)$
Generate a simulated dataset ($random_D$) drawn from a random uniform distribution with $n$ points ($q_1$,…, $q_n$) and the same variation as the original real dataset D.
For each point $q_i \in random_D$, find its nearest neighbor $q_j$ in D; then compute the distance between $q_i$ and $q_j$ and denote it as $y_i = dist(q_i, q_j)$
Calculate the Hopkins statistic (H) as the sum of the nearest neighbor distances in the real dataset divided by the sum of the nearest neighbor distances in the real and in the simulated datasets.

The formula is defined as follows:

\[H = \frac{\sum\limits_{i=1}^nx_i}{\sum\limits_{i=1}^nx_i + \sum\limits_{i=1}^ny_i}\]

A value of H about 0.5 means that $\sum\limits_{i=1}^ny_i$ and $\sum\limits_{i=1}^nx_i$ are close to each other, and thus the data D is uniformly distributed.

The null and the alternative hypotheses are defined as follows:

Null hypothesis: the dataset D is uniformly distributed (i.e., no meaningful clusters)
Alternative hypothesis: the dataset D is not uniformly distributed (i.e., contains meaningful clusters)

If the value of the Hopkins statistic is close to zero, then we can reject the null hypothesis and conclude that the dataset D is significantly clusterable.
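The algorithm can be sketched in a few lines of base R. The helper below is hypothetical (it is not the clustertend implementation) and rests on two assumptions: the uniform points are drawn within the range of each variable, and the ratio is oriented so that values near 0 indicate clusters, in line with the interpretation above and with the hopkins() outputs shown in this chapter.

```r
# Hypothetical helper, written for illustration only
hopkins_sketch <- function(data, n) {
  X <- as.matrix(data)
  stopifnot(n >= 1, n < nrow(X))

  # x_i: distance from each of n sampled real points to its nearest
  # other real point
  D <- as.matrix(dist(X))
  diag(D) <- Inf
  idx <- sample(nrow(X), n)
  x <- apply(D[idx, , drop = FALSE], 1, min)

  # y_i: distance from each of n uniform random points to its nearest
  # real point
  Q <- matrix(NA_real_, nrow = n, ncol = ncol(X))
  for (j in seq_len(ncol(X))) {
    Q[, j] <- runif(n, min(X[, j]), max(X[, j]))
  }
  y <- apply(Q, 1, function(q) min(sqrt(colSums((t(X) - q)^2))))

  sum(x) / (sum(x) + sum(y))
}

set.seed(123)
h_faithful <- hopkins_sketch(scale(faithful), n = 50)

set.seed(123)
uniform_xy <- cbind(runif(272), runif(272))
h_uniform <- hopkins_sketch(uniform_xy, n = 50)

c(faithful = h_faithful, uniform = h_uniform)
```

Because the points are sampled at random, the value changes slightly from one seed to another; the comparison between the clustered and the uniform data is what matters here.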

4.1.2 R function for computing Hopkins statistic

The function hopkins() [in clustertend package] can be used to statistically evaluate clustering tendency in R. The simplified format is:

hopkins(data, n, byrow = F, header = F)

data: a data frame or matrix
n: the number of points to be selected from the data
byrow: logical value. If FALSE (default), the variables are taken by columns, otherwise the variables are taken by rows
header: logical. If FALSE (the default), the first column (or row) will be deleted in the calculation

library(clustertend)
# Compute Hopkins statistic for faithful dataset
set.seed(123)
hopkins(faithful, n = nrow(faithful)-1)

## $H
## [1] 0.1588201

# Compute Hopkins statistic for a random dataset
set.seed(123)
hopkins(random_df, n = nrow(random_df)-1)

## $H
## [1] 0.5388899

It can be seen that the faithful dataset is highly clusterable (H = 0.15, which is far below the threshold of 0.5). However, the random_df dataset is not clusterable ($H = 0.53$).

4.2 VAT: Visual Assessment of cluster Tendency

The visual assessment of cluster tendency (VAT) was originally described by Bezdek and Hathaway (2002). This approach can be used to visually inspect the clustering tendency of the dataset.

4.2.1 VAT Algorithm

The algorithm of VAT is as follows:

Compute the dissimilarity matrix (DM) between the objects in the dataset using the Euclidean distance measure
Reorder the DM so that similar objects are close to one another. This process creates an ordered dissimilarity matrix (ODM)
The ODM is displayed as an ordered dissimilarity image (ODI), which is the visual output of VAT
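These three steps can be sketched with base R alone. Note the assumption: the reordering below uses the leaf order of an average-linkage hierarchical clustering as a stand-in for the VAT ordering, so it illustrates the idea rather than reproducing the published algorithm. The object names dm, ord and odm are chosen for this example.

```r
df_scaled <- scale(faithful)

# Dissimilarity matrix (Euclidean distance)
dm <- dist(df_scaled)

# Reorder the objects so that similar ones are adjacent
# (hierarchical clustering order, used here instead of the VAT ordering)
ord <- hclust(dm, method = "average")$order
odm <- as.matrix(dm)[ord, ord]

# Ordered dissimilarity image: dark cells are small dissimilarities
image(odm, col = gray.colors(64, start = 0, end = 1),
      axes = FALSE, main = "Ordered dissimilarity image")
```

Dark square blocks along the diagonal correspond to groups of mutually similar observations.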

4.2.2 R functions for VAT

We start by scaling the data using the function scale(). Next, we compute the dissimilarity matrix between observations using the function dist(). Finally, the function dissplot() [in the package seriation] is used to display an ordered dissimilarity image.

The R code below computes the VAT algorithm for the faithful dataset:

library("seriation")
# faithful data: ordered dissimilarity image
df_scaled <- scale(faithful)
df_dist <- dist(df_scaled) 
dissplot(df_dist)

The gray level is proportional to the value of the dissimilarity between observations: pure black if $dist(x_i, x_j) = 0$ and pure white if $dist(x_i, x_j) = 1$. Objects belonging to the same cluster are displayed in consecutive order.

The VAT detects the clustering tendency in a visual form by counting the number of square shaped dark blocks along the diagonal in a VAT image.

The figure above suggests two clusters represented by two well-formed black blocks.

The same analysis can be done with the random dataset:

# random data: ordered dissimilarity image
random_df_scaled <- scale(random_df)
random_df_dist <- dist(random_df_scaled) 
dissplot(random_df_dist)

It can be seen that the random_df dataset doesn’t contain any evident clusters.

Now, we can perform k-means on faithful dataset and add cluster labels on the dissimilarity plot:

set.seed(123)
km.res <- kmeans(scale(faithful), 2)
dissplot(df_dist, labels = km.res$cluster)

After showing that the data is clusterable, the next step is to determine the optimal number of clusters in the data. This will be described in the next chapter.

5 A single function for Hopkins statistic and VAT

The function get_clust_tendency() [in factoextra package] can be used to compute the Hopkins statistic; it also provides an ordered dissimilarity image using ggplot2, in a single function call. The ordering of the dissimilarity matrix is done using hierarchical clustering.

# Cluster tendency
clustend <- get_clust_tendency(scale(faithful), 100)
# Hopkins statistic
clustend$hopkins_stat

## [1] 0.1482683

# Customize the plot
clustend$plot + 
  scale_fill_gradient(low = "steelblue", high = "white")
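To make the statistic less of a black box, here is a minimal base-R sketch of a Hopkins-type statistic. It assumes the convention used in this chapter, in which values far below 0.5 indicate clusterable data, and it uses plain Euclidean nearest-neighbour distances. The function name hopkins_sketch and its arguments are hypothetical, and the result is not expected to match get_clust_tendency() exactly.

```r
# Hypothetical helper: Hopkins-type statistic, small values = clusterable
hopkins_sketch <- function(x, m, seed = 123) {
  x <- as.matrix(x)
  n <- nrow(x)
  d <- ncol(x)
  set.seed(seed)

  # m real observations sampled without replacement
  idx <- sample(n, m)

  # m artificial points drawn uniformly within the range of each variable
  rand <- sapply(seq_len(d), function(j) runif(m, min(x[, j]), max(x[, j])))

  # Euclidean distance from a point p to every row of x
  dist_to_data <- function(p) {
    sqrt(rowSums((x - matrix(p, n, d, byrow = TRUE))^2))
  }

  # w: distance from each sampled real point to its nearest other real point
  w <- sapply(idx, function(i) min(dist_to_data(x[i, ])[-i]))

  # u: distance from each artificial point to its nearest real point
  u <- apply(rand, 1, function(p) min(dist_to_data(p)))

  sum(w) / (sum(u) + sum(w))
}

hopkins_sketch(scale(faithful), m = 100)
```

For clustered data the real points sit close to each other while the artificial points often fall in empty regions, so the ratio drops well below 0.5.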

6 Infos

This analysis has been performed using R software (ver. 3.2.4)

Cluster Analysis in R - Unsupervised machine learning

Fri, 29 Apr 2016 12:01:52 +0200

1 Introduction
- 1.1 Quick overview of machine learning
- 1.2 Applications of unsupervised machine learning
2 How this document is organized?
3 Data preparation
4 Installing and loading required R packages
5 Clarifying distance measures
6 Basic clustering methods
- 6.1 Partitioning clustering
- 6.2 Hierarchical clustering
7 Clustering validation
8 The guide for clustering analysis on a real data: 4 steps you should know
9 Visualization of clustering results
10 Advanced clustering methods
11 Infos

1 Introduction

1.1 Quick overview of machine learning

Huge amounts of multidimensional data are collected in fields such as marketing, biomedicine and geo-spatial analysis. Mining knowledge from these data has become a highly demanding task that far exceeds human ability. Unsupervised machine learning, or clustering, is one of the important data mining methods for discovering knowledge in multidimensional data.

Machine learning (ML) is divided into two different fields:

Supervised ML, defined as a set of tools used for prediction (linear models, logistic regression, linear discriminant analysis, classification trees, support vector machines and more)
Unsupervised ML, also known as clustering, is an exploratory data analysis technique used for identifying groups (i.e., clusters) in the data set of interest. Each group contains observations with a similar profile according to specific criteria. Similarity between observations is defined using inter-observation distance measures, including Euclidean and correlation-based distances.

This document describes the use of unsupervised machine learning approaches, including Principal Component Analysis (PCA) and clustering methods.

Principal Component Analysis (PCA) is a dimension reduction technique applied for simplifying the data and for visualizing the most important information in the data set
Clustering is applied for identifying groups (i.e., clusters) among the observations. Clustering can be subdivided into five general strategies:
- Partitioning methods
- Hierarchical clustering
- Fuzzy clustering
- Density-based clustering
- Model-based clustering

Note that it’s possible to cluster both observations (i.e., samples or individuals) and features (i.e., variables). Observations can be clustered on the basis of variables and variables can be clustered on the basis of observations.

1.2 Applications of unsupervised machine learning

Unsupervised ML is popular in many fields, including:

In cancer research, to classify patients into subgroups according to their gene expression profile. This can be useful for identifying the molecular profile of patients with a good or bad prognosis, as well as for understanding the disease.
In marketing for market segmentation by identifying subgroups of customers with similar profiles and who might be receptive to a particular form of advertising.
City-planning: for identifying groups of houses according to their type, value and location.

2 How this document is organized?

Here,

we start by describing the two standard clustering strategies [partitioning methods (k-means, PAM, CLARA) and hierarchical clustering] as well as how to assess the quality of a clustering analysis.
next, we provide a step-by-step guide for clustering analysis and an R package, named factoextra, for elegant ggplot2-based clustering visualization.
finally, we describe advanced clustering approaches to find patterns of any shape in large data sets with noise and outliers.

Data preparation
Installing and loading required R packages
Clarifying distance measures
Basic clustering methods

The guide for clustering analysis on a real data: 4 steps you should know?
Elegant Clustering Visualization

Advanced Clustering Methods

Clustering on categorical variables: CA, MCA –> HCPC (coming soon)

To be published late in 2016. Subscribe to our mailing list at: STHDA mailing list. You will be notified about this book.

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 United States License.

3 Data preparation

The built-in R dataset USArrests is used as demo data.

Remove missing data
Scale variables to make them comparable

# Load data
data("USArrests")
my_data <- USArrests

# Remove any missing value (i.e, NA values for not available)
my_data <- na.omit(my_data)

# Scale variables
my_data <- scale(my_data)

# View the first 3 rows
head(my_data, n = 3)

##             Murder   Assault   UrbanPop         Rape
## Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska  0.50786248 1.1068225 -1.2117642  2.484202941
## Arizona 0.07163341 1.4788032  0.9989801  1.042878388
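To see what scale() does to the variables, the quick check below confirms that each scaled column has mean 0 and standard deviation 1, and reproduces the standardization by hand for one variable. This is a sketch using only base R; the object names scaled and murder_manual are illustrative.

```r
data("USArrests")
scaled <- scale(na.omit(USArrests))

# Column means are (numerically) zero and standard deviations are one
round(colMeans(scaled), 10)
apply(scaled, 2, sd)

# Same standardization done by hand for one variable
murder_manual <- (USArrests$Murder - mean(USArrests$Murder)) / sd(USArrests$Murder)
all.equal(unname(scaled[, "Murder"]), murder_manual)
```

Standardizing in this way prevents variables measured on large scales (such as Assault) from dominating the distance computations.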

4 Installing and loading required R packages

Install required packages

cluster: for computing clustering
factoextra: for elegant ggplot2-based data visualization. See the online documentation at: https://www.sthda.com/english/rpkgs/factoextra/

# Install factoextra
install.packages("factoextra")

# Install cluster package
install.packages("cluster")

Loading required packages

library("cluster")
library("factoextra")

5 Clarifying distance measures

The classification of observations into groups requires some method for measuring the distance or the (dis)similarity between the observations.

In this chapter, we cover the common distance measures used for assessing similarity between observations. R code for computing and visualizing pairwise distances between observations is also provided.

How this chapter is organized?

Methods for measuring distances
Distances and scaling
Data preparation
R functions for computing distances
- The standard dist() function
- Correlation based distance measures
- The function daisy() in cluster package
Visualizing distance matrices

It’s simple to compute and visualize a distance matrix using the functions get_dist() and fviz_dist() in the factoextra R package:

get_dist(): for computing a distance matrix between the rows of a data matrix. Compared to the standard dist() function, it supports correlation-based distance measures including “pearson”, “kendall” and “spearman” methods.
fviz_dist(): for visualizing a distance matrix

res.dist <- get_dist(USArrests, stand = TRUE, method = "pearson")

fviz_dist(res.dist, 
   gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
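For readers who want to see what a correlation-based distance looks like without factoextra, the base-R sketch below assumes the common definition d = 1 - r, where r is the Pearson correlation between two observations (rows). Whether this matches get_dist() in every detail is an assumption; the object names d_cor and d_euc are illustrative.

```r
x <- scale(USArrests)

# Pearson correlation between observations (rows), turned into a distance
d_cor <- as.dist(1 - cor(t(x), method = "pearson"))

# Euclidean distance for comparison
d_euc <- dist(x, method = "euclidean")

# Inspect the first three observations under both measures
round(as.matrix(d_cor)[1:3, 1:3], 2)
round(as.matrix(d_euc)[1:3, 1:3], 2)
```

A correlation-based distance treats two observations as close when their profiles rise and fall together, even if their magnitudes differ, whereas the Euclidean distance compares the values themselves.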

Clustering - Unsupervised Machine Learning

6 Basic clustering methods

6.1 Partitioning clustering

Partitioning algorithms are clustering approaches that split the data sets, containing n observations, into a set of k groups (i.e. clusters). The algorithms require the analyst to specify the number of clusters to be generated.

This chapter describes the most commonly used partitioning algorithms including:

K-means clustering (MacQueen, 1967), in which each cluster is represented by the center or mean of the data points belonging to the cluster.
K-medoids clustering or PAM (Partitioning Around Medoids, Kaufman & Rousseeuw, 1990), in which each cluster is represented by one of the objects in the cluster. It’s a “non-parametric” alternative to k-means clustering. We’ll also describe a variant of PAM named CLARA (Clustering Large Applications), which is used for analyzing large data sets.
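The difference between the two representations can be checked directly. The sketch below uses the scaled USArrests data with k = 4 (an arbitrary choice for illustration) and verifies that k-means centers are the means of the cluster members, whereas PAM medoids are actual observations.

```r
library(cluster)

x <- scale(USArrests)
set.seed(123)

# K-means: each center is the column-wise mean of the cluster members
km <- kmeans(x, centers = 4, nstart = 25)
manual_centers <- apply(x, 2, function(col) tapply(col, km$cluster, mean))
all.equal(unname(manual_centers), unname(km$centers))

# PAM: each medoid is one of the observations in the data
pm <- pam(x, k = 4)
rownames(x)[pm$id.med]
```

Because a medoid must be a real observation, a single extreme point cannot drag it away the way it can shift a mean.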

For each of these methods, we provide:

the basic idea and the key mathematical concepts
the clustering algorithm and implementation in R software
R lab sections with many examples for computing clustering methods and visualizing the outputs

How this chapter is organized?

Required packages: cluster (for computing clustering algorithm) and factoextra (for elegant visualization)
K-means clustering
- Concept
- Algorithm
- R function for k-means clustering: stats::kmeans()
- Data format
- Compute k-means clustering
- Application of K-means clustering on real data
  - Data preparation and descriptive statistics
  - Determine the number of optimal clusters in the data: factoextra::fviz_nbclust()
  - Compute k-means clustering
  - Plot the result: factoextra::fviz_cluster()
PAM: Partitioning Around Medoids
- Concept
- Algorithm
- R function for computing PAM: cluster::pam() or fpc::pamk()
- Compute PAM
CLARA: Clustering Large Applications
- Concept
- Algorithm
- R function for computing CLARA: cluster::clara()
R packages and functions for visualizing partitioning clusters
- cluster::clusplot() function
- factoextra::fviz_cluster() function

Read more: Partitioning cluster analysis. If you are in a hurry, read the following quick-start guide.

K-means clustering: split the data into a set of k groups (i.e., clusters), where k must be specified by the analyst. Each cluster is represented by the mean of the points belonging to the cluster.

Determine the optimal number of clusters: use factoextra::fviz_nbclust()

library("factoextra")
fviz_nbclust(my_data, kmeans, method = "gap_stat")
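The gap statistic can also be computed directly with clusGap() and maxSE() from the cluster package, which makes the underlying numbers visible. The sketch below is only an illustration: K.max = 6 and B = 20 bootstrap samples are assumptions chosen to keep the run short, and the selection rule shown ("firstSEmax") is one of several available.

```r
library(cluster)

x <- scale(USArrests)
set.seed(123)

# Gap statistic for k = 1, ..., 6 with k-means as the clustering function
gap <- clusGap(x, FUNcluster = kmeans, nstart = 25,
               K.max = 6, B = 20, verbose = FALSE)

# Table with log(W), its expectation under the reference, the gap and its SE
gap$Tab

# Number of clusters suggested by the "firstSEmax" rule
k_hat <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")
k_hat
```

Larger values of B give more stable estimates at the cost of a longer run time.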

Compute and visualize k-means clustering

km.res <- kmeans(my_data, 4, nstart = 25)

# Visualize
library("factoextra")
fviz_cluster(km.res, data = my_data, ellipse.type = "convex")+
  theme_minimal()

PAM clustering: Partitioning Around Medoids. Robust alternative to k-means clustering, less sensitive to outliers.

# Compute PAM
library("cluster")
pam.res <- pam(my_data, 4)

# Visualize
fviz_cluster(pam.res)

6.2 Hierarchical clustering

Hierarchical clustering is an alternative approach to k-means clustering for identifying groups in the dataset. It does not require the number of clusters to be specified in advance.

Hierarchical clustering can be subdivided into two types:

Agglomerative hierarchical clustering (AHC), in which each observation is initially considered as a cluster of its own (leaf). Then, the most similar clusters are iteratively merged until there is just one single big cluster (root).
Divisive hierarchical clustering, which is the inverse of AHC. It begins with the root, in which all objects are included in one cluster. Then the most heterogeneous clusters are iteratively divided until all observations are in their own cluster.

The result of hierarchical clustering is a tree-based representation of the observations, called a dendrogram. Observations can be subdivided into groups by cutting the dendrogram at a desired similarity level.
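The two strategies can be compared side by side with agnes() (agglomerative) and diana() (divisive) from the cluster package. The sketch below is an illustration on the scaled USArrests data; Ward's method and k = 4 are arbitrary choices made for the example.

```r
library(cluster)

x <- scale(USArrests)

# Agglomerative (bottom-up) clustering with Ward's criterion
res.agnes <- agnes(x, method = "ward")

# Divisive (top-down) clustering
res.diana <- diana(x)

# Agglomerative and divisive coefficients (between 0 and 1)
res.agnes$ac
res.diana$dc

# Cut both trees into 4 groups and cross-tabulate the assignments
grp.agnes <- cutree(as.hclust(res.agnes), k = 4)
grp.diana <- cutree(as.hclust(res.diana), k = 4)
table(agnes = grp.agnes, diana = grp.diana)
```

The cross-table shows how far the bottom-up and top-down trees agree once both are cut into the same number of groups.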

This chapter provides:

The description of the different types of hierarchical clustering algorithms
R lab sections with many examples for computing hierarchical clustering, visualizing and comparing dendrograms
The interpretation of dendrograms
R codes for cutting the dendrograms into groups

How this chapter is organized?

Required R packages
Algorithm
Data preparation and descriptive statistics
R functions for hierarchical clustering
- hclust() function
- agnes() and diana() functions
Interpretation of the dendrogram
Cut the dendrogram into different groups
Hierarchical clustering and correlation based distance
What type of distance measures should we choose?
Comparing two dendrograms
- Tanglegram
- Correlation matrix between a list of dendrograms

Read more: Hierarchical clustering essentials. If you are in a hurry, read the following quick-start guide.

Install and load required packages (cluster, factoextra) as previously described
Compute and visualize hierarchical clustering using R base functions

# 1. Loading and preparing data
data("USArrests")
my_data <- scale(USArrests)

# 2. Compute dissimilarity matrix
d <- dist(my_data, method = "euclidean")

# Hierarchical clustering using Ward's method
res.hc <- hclust(d, method = "ward.D2" )

# Cut tree into 4 groups
grp <- cutree(res.hc, k = 4)

# Visualize
plot(res.hc, cex = 0.6) # plot tree
rect.hclust(res.hc, k = 4, border = 2:5) # add rectangle

Elegant visualization using factoextra functions: factoextra::hcut(), factoextra::fviz_dend()

library("factoextra")
# Compute hierarchical clustering and cut into 4 clusters
res <- hcut(USArrests, k = 4, stand = TRUE)

# Visualize
fviz_dend(res, rect = TRUE, cex = 0.5,
          k_colors = c("#00AFBB","#2E9FDF", "#E7B800", "#FC4E07"))

We’ll also see how to customize the dendrogram.

7 Clustering validation

Clustering validation includes three main tasks:

clustering tendency assesses whether applying clustering is suitable to your data.
clustering evaluation assesses the goodness or quality of the clustering.
clustering stability seeks to understand the sensitivity of the clustering result to various algorithmic parameters, for example, the number of clusters.

The aim of this part is to:

describe the different methods for clustering validation
compare the quality of clustering results obtained with different clustering algorithms
provide R lab section for validating clustering results

7.1 Assessing clustering tendency

Assessing clustering tendency consists of examining whether the data is clusterable, that is, whether the data contains any inherent grouping structure. This should be checked before applying clustering analysis.

In this chapter:

We describe why we should evaluate the clustering tendency before applying any cluster analysis on a dataset.
We describe statistical and visual methods for assessing the clustering tendency
R lab sections containing many examples are also provided for computing clustering tendency and visualizing clusters

How this chapter is organized?

Required packages
Data preparation
Why assessing clustering tendency?
Methods for assessing clustering tendency
- Hopkins statistic
  - Algorithm
  - R function for computing Hopkins statistic: clustertend::hopkins()
- VAT: Visual Assessment of cluster Tendency: seriation::dissplot()
  - VAT Algorithm
  - R functions for VAT
A single function for Hopkins statistic and VAT: factoextra::get_clust_tendency()

Read more: Assessing clustering tendency. If you are in a hurry, read the following quick-start guide.

Install and load factoextra as previously described
Assessing clustering tendency: use factoextra::get_clust_tendency(). Assess clustering tendency using Hopkins’ statistic and a visual approach. An ordered dissimilarity image (ODI) is shown.

Hopkins statistic: If the value of Hopkins statistic is close to zero (far below 0.5), then we can conclude that the dataset is significantly clusterable.
VAT (Visual Assessment of cluster Tendency): The VAT detects the clustering tendency in a visual form by counting the number of square shaped dark (or colored) blocks along the diagonal in a VAT image.

library("factoextra")
my_data <- scale(iris[, -5])
get_clust_tendency(my_data, n = 50,
                   gradient = list(low = "steelblue",  high = "white"))

## $hopkins_stat
## [1] 0.2002686
## 
## $plot

7.2 Determining the optimal number of clusters

As described above, partitioning methods such as k-means clustering require the user to specify the number of clusters to be generated.

In this chapter, we’ll describe different methods to determine the optimal number of clusters for k-means, PAM and hierarchical clustering.

How this chapter is organized?

Required packages
Data preparation
Example of partitioning method results
Example of hierarchical clustering results
Three popular methods for determining the optimal number of clusters
- Elbow method
  - Concept
  - Algorithm
  - R codes
- Average silhouette method
  - Concept
  - Algorithm
  - R codes
- Conclusions about elbow and silhouette methods
- Gap statistic method
  - Concept
  - Algorithm
  - R codes
NbClust: A Package providing 30 indices for determining the best number of clusters
- Overview of NbClust package
- NbClust R function
- Examples of usage
  - Compute only an index of interest
  - Compute all the 30 indices

Read more: Determining the optimal number of clusters. If you are in a hurry, read the following quick-start guide.

Estimate the number of clusters in the data using the gap statistic: factoextra::fviz_nbclust()

my_data <- scale(USArrests)
library("factoextra")
fviz_nbclust(my_data, kmeans, method = "gap_stat")
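The elbow method listed in the outline above can be sketched in a few lines of base R: compute the total within-cluster sum of squares for a range of k and look for the point where the curve flattens. The range 1 to 10 and nstart = 25 are assumptions made for this illustration.

```r
x <- scale(USArrests)
set.seed(123)

# Total within-cluster sum of squares for k = 1, ..., 10
wss <- sapply(1:10, function(k) {
  kmeans(x, centers = k, nstart = 25)$tot.withinss
})
round(wss, 1)

# Look for the "elbow" where adding clusters stops paying off
# (drawn only in an interactive session)
if (interactive()) {
  plot(1:10, wss, type = "b",
       xlab = "Number of clusters k",
       ylab = "Total within-cluster sum of squares")
}
```

For k = 1 the value equals the total sum of squares of the scaled data, (n - 1) times the number of variables, which gives a convenient sanity check.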

NbClust: A Package providing 30 indices for determining the best number of clusters

library("NbClust")
set.seed(123)
res.nbclust <- NbClust(my_data, distance = "euclidean",
                  min.nc = 2, max.nc = 10, 
                  method = "complete", index ="all")

Visualize using factoextra:

factoextra::fviz_nbclust(res.nbclust) + theme_minimal()

## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 1 proposed  1 as the best number of clusters
## * 9 proposed  2 as the best number of clusters
## * 4 proposed  3 as the best number of clusters
## * 6 proposed  4 as the best number of clusters
## * 2 proposed  5 as the best number of clusters
## * 1 proposed  8 as the best number of clusters
## * 1 proposed  10 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  2 .

7.3 Clustering validation statistics

A variety of measures have been proposed in the literature for evaluating clustering results. The term clustering validation is used to designate the procedure of evaluating the results of a clustering algorithm.

The aim of this chapter is to:

describe the different methods for clustering validation
compare the quality of clustering results obtained with different clustering algorithms
provide R lab section for validating clustering results

How this chapter is organized?

Required packages: cluster, factoextra, NbClust, fpc
Data preparation
Relative measures - Determine the optimal number of clusters: NbClust::NbClust()
Clustering analysis
- Example of partitioning method results
- Example of hierarchical clustering results
Internal clustering validation measures
- Silhouette analysis
  - Concept and algorithm
  - Interpretation of silhouette width
  - R functions for silhouette analysis: cluster::silhouette(), factoextra::fviz_silhouette()
- Dunn index
  - Concept and algorithm
  - R function for computing Dunn index: fpc::cluster.stats(), NbClust::NbClust()
- Clustering validation statistics: fpc::cluster.stats()
External clustering validation

Read more: Clustering Validation Statistics. If you are in a hurry, read the following quick-start guide.

Compute and visualize hierarchical clustering

Compute: factoextra::eclust()
Elegant visualization: factoextra::fviz_dend()

my_data <- scale(iris[, -5])

# Enhanced hierarchical clustering, cut in 3 groups
library("factoextra")
res.hc <- eclust(my_data, "hclust", k = 3, graph = FALSE) 

# Visualize
fviz_dend(res.hc, rect = TRUE, show_labels = FALSE)

Validate clustering results by inspecting the cluster silhouette plot

Recall that the silhouette ($S_i$) measures how similar an object $i$ is to the other objects in its own cluster versus those in the neighbor cluster. $S_i$ values range from 1 to -1:

A value of $S_i$ close to 1 indicates that the object is well clustered. In other words, the object $i$ is similar to the other objects in its group.
A value of $S_i$ close to -1 indicates that the object is poorly clustered, and that assignment to some other cluster would probably improve the overall results.
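The silhouette width of a single observation can be reproduced by hand. The sketch below assumes the standard definition $S_i = (b_i - a_i) / \max(a_i, b_i)$, where $a_i$ is the average distance from $i$ to the other members of its own cluster and $b_i$ is the smallest average distance from $i$ to the members of any other cluster. Ward clustering with k = 3 is used here only as an example partition, and the result is compared with cluster::silhouette().

```r
library(cluster)

x  <- scale(iris[, -5])
d  <- dist(x)
dm <- as.matrix(d)
cl <- cutree(hclust(d, method = "ward.D2"), k = 3)

i <- 1  # observation to inspect
same  <- cl == cl[i] & seq_along(cl) != i   # other members of i's cluster
other <- cl != cl[i]                        # members of the other clusters

a_i <- mean(dm[i, same])                           # cohesion
b_i <- min(tapply(dm[i, other], cl[other], mean))  # separation (nearest cluster)
s_i <- (b_i - a_i) / max(a_i, b_i)

# Compare with the value computed by the cluster package
sil <- silhouette(cl, d)
c(by_hand = s_i, cluster_pkg = sil[i, "sil_width"])
```

The cluster that attains the minimum in $b_i$ is the "neighbor" cluster reported in the silhouette output.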

# Visualize the silhouette plot
fviz_silhouette(res.hc)

##   cluster size ave.sil.width
## 1       1   49          0.63
## 2       2   30          0.44
## 3       3   71          0.32

Which samples have negative silhouette? To what cluster are they closer?

# Silhouette width of observations
sil <- res.hc$silinfo$widths[, 1:3]

# Objects with negative silhouette
neg_sil_index <- which(sil[, 'sil_width'] < 0)
sil[neg_sil_index, , drop = FALSE]

##     cluster neighbor   sil_width
## 84        3        2 -0.01269799
## 122       3        2 -0.01789603
## 62        3        2 -0.04756835
## 135       3        2 -0.05302402
## 73        3        2 -0.10091884
## 74        3        2 -0.14761137
## 114       3        2 -0.16107155
## 72        3        2 -0.23036371

7.4 How to choose the appropriate clustering algorithms for your data?

This chapter describes the R package clValid (G. Brock et al., 2008) which can be used for simultaneously comparing multiple clustering algorithms in a single function call for identifying the best clustering approach and the optimal number of clusters.

We’ll start by describing the different clustering validation measures in the package. Next, we’ll present the function clValid() and finally we’ll provide an R lab section for validating clustering results and comparing clustering algorithms.

How this chapter is organized?

Clustering validation measures in clValid package
- Internal validation measures
- Stability validation measures
- Biological validation measures
R function clValid()
- Format
- Examples of usage
  - Data
  - Compute clValid()

Read more: How to choose the appropriate clustering algorithms for your data?. If you are in a hurry, read the following quick-start guide.

my_data <- scale(USArrests)

# Compute clValid
library("clValid")
intern <- clValid(my_data, nClust = 2:6, 
              clMethods = c("hierarchical","kmeans","pam"),
              validation = "internal")
# Summary
summary(intern)

## 
## Clustering Methods:
##  hierarchical kmeans pam 
## 
## Cluster sizes:
##  2 3 4 5 6 
## 
## Validation Measures:
##                                  2       3       4       5       6
##                                                                   
## hierarchical Connectivity   6.6437  9.5615 13.9563 22.5782 31.2873
##              Dunn           0.2214  0.2214  0.2224  0.2046  0.2126
##              Silhouette     0.4085  0.3486  0.3637  0.3213  0.2720
## kmeans       Connectivity   6.6437 13.6484 16.2413 24.6639 33.7194
##              Dunn           0.2214  0.2224  0.2224  0.1983  0.2231
##              Silhouette     0.4085  0.3668  0.3573  0.3377  0.3079
## pam          Connectivity   6.6437 13.8302 20.4421 29.5726 38.2643
##              Dunn           0.2214  0.1376  0.1849  0.1849  0.2019
##              Silhouette     0.4085  0.3144  0.3390  0.3105  0.2630
## 
## Optimal Scores:
## 
##              Score  Method       Clusters
## Connectivity 6.6437 hierarchical 2       
## Dunn         0.2231 kmeans       6       
## Silhouette   0.4085 hierarchical 2

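It can be seen that all three methods give the same scores with two clusters, and that this two-cluster solution (reported for hierarchical clustering) is the best according to the connectivity and silhouette measures. The Dunn index, in contrast, is highest for k-means with six clusters.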
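To make the Dunn index in the table more concrete, the sketch below computes it by hand, assuming the definition "smallest distance between observations in different clusters divided by the largest distance between observations in the same cluster". The function name dunn_sketch is hypothetical, and average-linkage clustering with k = 2 is used only to obtain an example partition.

```r
# Hypothetical helper: Dunn index from a dissimilarity object and cluster labels
dunn_sketch <- function(d, cl) {
  d <- as.matrix(d)
  same <- outer(cl, cl, "==")   # TRUE where both observations share a cluster
  min(d[!same]) / max(d[same])  # minimum separation / maximum diameter
}

# Toy check: two tight groups far apart on a line
dunn_sketch(dist(c(0, 1, 10, 11)), c(1, 1, 2, 2))

# Example partition of the scaled USArrests data
x  <- scale(USArrests)
d  <- dist(x)
cl <- cutree(hclust(d, method = "average"), k = 2)
dunn_sketch(d, cl)
```

Higher values indicate compact, well-separated clusters. Because the index depends on single extreme distances, it can favor a different solution than the silhouette width, as in the table above.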

7.5 How to compute p-value for hierarchical clustering in R?

This chapter describes the R package pvclust (Suzuki et al., 2004), which uses bootstrap resampling techniques to compute a p-value for each cluster.

How this chapter is organized?

Concept
Algorithm
Required R packages
Data preparation
Compute p-value for hierarchical clustering
- Description of pvclust() function
- Usage of pvclust() function

Read more: How to compute p-value for hierarchical clustering in R?. If you are in a hurry, read the following quick-start guide.

Note that, pvclust() performs clustering on the columns of the dataset, which correspond to samples in our case.

library(pvclust)
# Data preparation
set.seed(123)
data("lung")
ss <- sample(1:73, 30) # randomly select 30 of the 73 samples
my_data <- lung[, ss]

# Compute pvclust
res.pv <- pvclust(my_data, method.dist="cor", 
                  method.hclust="average", nboot = 10)

## Bootstrap (r = 0.5)... Done.
## Bootstrap (r = 0.6)... Done.
## Bootstrap (r = 0.7)... Done.
## Bootstrap (r = 0.8)... Done.
## Bootstrap (r = 0.9)... Done.
## Bootstrap (r = 1.0)... Done.
## Bootstrap (r = 1.1)... Done.
## Bootstrap (r = 1.2)... Done.
## Bootstrap (r = 1.3)... Done.
## Bootstrap (r = 1.4)... Done.

# Default plot
plot(res.pv, hang = -1, cex = 0.5)
pvrect(res.pv)

Clusters with AU >= 95% are indicated by the rectangles and are considered to be strongly supported by the data.

8 The guide for clustering analysis on a real data: 4 steps you should know

In this chapter, we’ll describe the different steps to follow for computing clustering on real data using k-means clustering.

9 Visualization of clustering results

In this chapter, we’ll describe how to visualize the result of clustering using dendrograms as well as static and interactive heatmap.

A heat map is a false color image with a dendrogram added to the left side and to the top. It’s used to visualize hidden patterns in a data matrix in order to reveal associations between rows or columns.

9.1 Visual enhancement of clustering analysis

In this chapter, we provide some easy-to-use functions for enhancing the workflow of clustering analyses and we implemented ggplot2 method for visualizing the results: factoextra::eclust().

9.2 Beautiful dendrogram visualizations

9.3 Static and Interactive Heatmap

10 Advanced clustering methods

10.1 Fuzzy clustering analysis

Fuzzy clustering is also known as a soft clustering method. Standard clustering approaches (K-means, PAM) produce partitions in which each observation belongs to only one cluster. This is known as hard clustering.

In Fuzzy clustering, items can be a member of more than one cluster. Each item has a set of membership coefficients corresponding to the degree of being in a given cluster. The Fuzzy c-means method is the most popular fuzzy clustering algorithm. Read more: Fuzzy clustering analysis.

10.2 Model-based clustering

In model-based clustering, the data are viewed as coming from a distribution that is a mixture of two or more clusters. It finds the best fit of models to the data and estimates the number of clusters. Read more: Model-based clustering.

10.3 DBSCAN: Density-based clustering

DBSCAN is a density-based clustering method that was introduced in Ester et al. (1996). It can find clusters of different shapes and sizes in data containing noise and outliers. The basic idea behind the density-based clustering approach is derived from a human intuitive clustering method.

The description and implementation of DBSCAN in R are provided in this chapter: DBSCAN.

10.4 Hybrid clustering methods

11 Infos

This analysis has been performed using R software (ver. 3.2.4)

Martin Ester, Hans-Peter Kriegel, Joerg Sander, Xiaowei Xu (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Institute for Computer Science, University of Munich. Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96).

Fuzzy clustering analysis - Unsupervised Machine Learning

Wed, 27 Apr 2016 22:34:53 +0200

1 Required packages
2 Concept of fuzzy clustering
3 Algorithm of fuzzy clustering
- 3.1 R functions for fuzzy clustering
  - 3.1.1 fanny(): Fuzzy analysis clustering
  - 3.1.2 cmeans()
4 Infos

1 Required packages

Three R packages are required for this chapter:

cluster and e1071 for computing fuzzy clustering
factoextra for visualizing clusters

install.packages("cluster")
install.packages("e1071")
install.packages("factoextra")

2 Concept of fuzzy clustering

In K-means or PAM clustering, the data is divided into distinct clusters, where each element is assigned to exactly one cluster. This type of clustering is also known as hard clustering or non-fuzzy clustering. Unlike K-means, fuzzy clustering is considered soft clustering, in which each element has a probability of belonging to each cluster. In other words, each element has a set of membership coefficients corresponding to the degree of being in a given cluster.

Points close to the center of a cluster may be in the cluster to a higher degree than points at the edge of a cluster. The degree to which an element belongs to a given cluster is a numerical value in [0, 1].

Fuzzy c-means (FCM) algorithm is one of the most widely used fuzzy clustering algorithms. It was developed by Dunn in 1973 and improved by Bezdek in 1981. It’s frequently used in pattern recognition.

3 Algorithm of fuzzy clustering

The FCM algorithm is very similar to the k-means algorithm. Its aim is to minimize the objective function defined as follows:

\[ \sum\limits_{j=1}^k \sum\limits_{i=1}^n u_{ij}^m (x_i - \mu_j)^2 \]

Where,

$u_{ij}$ is the degree to which an observation $x_i$ belongs to a cluster $c_j$
$\mu_j$ is the center of the cluster j
$m$ is the fuzzifier
$n$ is the number of observations and $k$ is the number of clusters

Unlike in k-means, the inner sum runs over all $n$ observations, because every observation has a membership degree in every cluster. FCM therefore differs from k-means by using the membership values $u_{ij}$ and the fuzzifier $m$.

The membership degree $u_{ij}$ is defined as follows:

\[ u_{ij} = \frac{1}{\sum\limits_{l=1}^k \left( \frac{| x_i - \mu_j |}{| x_i - \mu_l |}\right)^{\frac{2}{m-1}}} \]

The degree of belonging, $u_{ij}$, is inversely related to the distance from $x_i$ to the center $\mu_j$ of the cluster j.

The parameter $m$ is a real number greater than 1 ($1.0 < m < \infty$) and it defines the level of cluster fuzziness. Note that a value of $m$ close to 1 gives a cluster solution which becomes increasingly similar to the solution of hard clustering such as k-means, whereas a value of $m$ approaching infinity leads to complete fuzziness.

Note that, a good choice is to use m = 2.0 (Hathaway and Bezdek 2001).

In fuzzy clustering, the centroid of a cluster is the mean of all points, weighted by their degree of belonging to the cluster:

\[ \mu_j = \frac{\sum\limits_{i=1}^n u_{ij}^m x_i}{\sum\limits_{i=1}^n u_{ij}^m} \]

Where,

$\mu_j$ is the centroid of the cluster j
$u_{ij}$ is the degree to which an observation $x_i$ belongs to a cluster $c_j$

The algorithm of fuzzy clustering can be summarized as follows:

Specify a number of clusters k (by the analyst)
Assign randomly to each point coefficients for being in the clusters.
Repeat until the maximum number of iterations (given by “maxit”) is reached, or when the algorithm has converged (that is, the coefficients’ change between two iterations is no more than $\epsilon$, the given sensitivity threshold):
- Compute the centroid for each cluster, using the formula above.
- For each point, compute its coefficients of being in the clusters, using the formula above.

The algorithm minimizes intra-cluster variance as well, but has the same problems as k-means; the minimum is a local minimum, and the results depend on the initial choice of weights. Hence, different initializations may lead to different results.
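The steps above can be sketched in base R as follows. This is a minimal illustration of the fuzzy c-means iteration (random initial memberships, weighted-mean centroids, membership update, stop on small change), not the implementation used by any package. The function name fcm_sketch and its arguments (maxit, eps, seed) are hypothetical.

```r
# Hypothetical helper: bare-bones fuzzy c-means
fcm_sketch <- function(x, k, m = 2, maxit = 100, eps = 1e-6, seed = 123) {
  x <- as.matrix(x)
  n <- nrow(x)
  set.seed(seed)

  # Step 2: random initial memberships, each row normalized to sum to 1
  u <- matrix(runif(n * k), n, k)
  u <- u / rowSums(u)

  for (it in seq_len(maxit)) {
    w <- u^m

    # Centroids: means of all points weighted by u_ij^m (k x p matrix)
    centers <- t(w) %*% x / colSums(w)

    # Squared Euclidean distances between points and centroids (n x k matrix)
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    d2 <- pmax(d2, .Machine$double.eps)  # guard against division by zero

    # Memberships: proportional to distance^(-2 / (m - 1)), rows sum to 1
    inv   <- d2^(-1 / (m - 1))
    u_new <- inv / rowSums(inv)

    # Stop when the coefficients change by less than eps
    converged <- max(abs(u_new - u)) < eps
    u <- u_new
    if (converged) break
  }

  list(membership = u, centers = centers, iterations = it)
}

res <- fcm_sketch(scale(USArrests), k = 3)
head(round(res$membership, 2), 3)
```

Running the function with different values of seed is a simple way to observe the dependence on the initial weights mentioned above.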

Using a mixture of Gaussians along with the expectation-maximization algorithm is a more statistically formalized method which includes some of these ideas: partial membership in classes.

3.1 R functions for fuzzy clustering

3.1.1 fanny(): Fuzzy analysis clustering

The function fanny() [in cluster package] can be used to compute fuzzy clustering. FANNY stands for fuzzy analysis clustering. A simplified format is:

fanny(x, k, memb.exp = 2, metric = "euclidean", 
      stand = FALSE, maxit = 500)

x: A data matrix or data frame or dissimilarity matrix
k: The desired number of clusters to be generated
memb.exp: The membership exponent (strictly larger than 1) used in the fit criteria. It’s also known as the fuzzifier
metric: The metric to be used for calculating dissimilarities between observations
stand: Logical; if true, the measurements in x are standardized before calculating the dissimilarities
maxit: maximal number of iterations

The function fanny() returns an object including the following components:

membership: matrix containing the degree to which each observation belongs to a given cluster. Column names are the clusters and rows are observations
coeff: Dunn’s partition coefficient F(k) of the clustering, where k is the number of clusters. F(k) is the sum of all squared membership coefficients, divided by the number of observations. Its value is between 1/k and 1. The normalized form of the coefficient is also given. It is defined as $(F(k) - 1/k) / (1 - 1/k)$, and ranges between 0 and 1. A low value of Dunn’s coefficient indicates a very fuzzy clustering, whereas a value close to 1 indicates a near-crisp clustering.
clustering: the clustering vector containing the nearest crisp grouping of observations
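
As a rough check of the definition of Dunn’s partition coefficient given above, the two values can be recomputed by hand from any membership matrix. The helper dunn_coefficient() and the small matrix u below are hypothetical and used for illustration only.

```r
# Dunn's partition coefficient from a membership matrix
# (rows = observations, columns = clusters)
dunn_coefficient <- function(u) {
  k <- ncol(u)
  n <- nrow(u)
  f <- sum(u^2) / n                       # F(k)
  c(dunn_coeff = f,
    normalized = (f - 1 / k) / (1 - 1 / k))
}

# Toy membership matrix: 3 observations, 2 clusters
u <- rbind(c(0.9, 0.1),
           c(0.5, 0.5),
           c(0.2, 0.8))
dunn_coefficient(u)
```

The same helper can be applied to the membership component returned by fanny() to compare with its coeff component.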

A subset of USArrests data is used in the following example:

library(cluster)
set.seed(123)
# Load the data
data("USArrests")

# Subset of USArrests
ss <- sample(1:50, 20)
df <- scale(USArrests[ss,])

# Compute fuzzy clustering
res.fanny <- fanny(df, 4)

# Cluster plot using fviz_cluster()
# You can use also : clusplot(res.fanny)
library(factoextra)
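# Note: recent factoextra versions use ellipse.type / ellipse.level
# (the older frame.type / frame.level arguments are deprecated)
fviz_cluster(res.fanny, ellipse.type = "norm",
             ellipse.level = 0.68)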

# Silhouette plot
fviz_silhouette(res.fanny, label = TRUE)

##   cluster size ave.sil.width
## 1       1    4          0.52
## 2       2    6          0.10
## 3       3    6          0.41
## 4       4    4          0.04

The result of the fanny() function can be printed as follows:

print(res.fanny)

## Fuzzy Clustering object of class 'fanny' :                      
## m.ship.expon.        2
## objective     6.052789
## tolerance        1e-15
## iterations         215
## converged            1
## maxit              500
## n                   20
## Membership coefficients (in %, rounded):
##              [,1] [,2] [,3] [,4]
## Iowa           75   11    7    7
## Rhode Island   26   32   21   21
## Maryland        8   19   37   37
## Tennessee      10   24   33   33
## Utah           23   36   20   20
## Arizona        10   23   34   34
## Mississippi    16   25   29   29
## Wisconsin      65   15   10   10
## Virginia       17   37   23   23
## Maine          63   15   11   11
## Texas           8   25   33   33
## Louisiana       9   22   35   35
## Montana        41   26   17   17
## Michigan        8   20   36   36
## Arkansas       19   30   25   25
## New York        9   24   34   34
## Florida        10   21   35   35
## Alaska         15   24   31   31
## Hawaii         27   34   20   20
## New Jersey     16   37   23   23
## Fuzzyness coefficients:
## dunn_coeff normalized 
## 0.31337355 0.08449807 
## Closest hard clustering:
##         Iowa Rhode Island     Maryland    Tennessee         Utah 
##            1            2            3            4            2 
##      Arizona  Mississippi    Wisconsin     Virginia        Maine 
##            3            4            1            2            1 
##        Texas    Louisiana      Montana     Michigan     Arkansas 
##            3            4            1            3            2 
##     New York      Florida       Alaska       Hawaii   New Jersey 
##            3            3            4            2            2 
## 
## Available components:
##  [1] "membership"  "coeff"       "memb.exp"    "clustering"  "k.crisp"    
##  [6] "objective"   "convergence" "diss"        "call"        "silinfo"    
## [11] "data"

The different components can be extracted using the code below:

# Membership coefficient
res.fanny$membership

##                    [,1]      [,2]       [,3]       [,4]
## Iowa         0.75234997 0.1056742 0.07098791 0.07098791
## Rhode Island 0.26129280 0.3198982 0.20940449 0.20940449
## Maryland     0.07559096 0.1906031 0.36690296 0.36690296
## Tennessee    0.10351700 0.2444743 0.32600436 0.32600436
## Utah         0.23177048 0.3631831 0.20252321 0.20252321
## Arizona      0.09505979 0.2329621 0.33598906 0.33598906
## Mississippi  0.15957721 0.2511123 0.29465525 0.29465525
## Wisconsin    0.65274007 0.1530047 0.09712764 0.09712764
## Virginia     0.16856415 0.3654879 0.23297397 0.23297397
## Maine        0.62818484 0.1532966 0.10925930 0.10925930
## Texas        0.08407125 0.2465250 0.33470188 0.33470188
## Louisiana    0.09152177 0.2159634 0.34625741 0.34625741
## Montana      0.40788012 0.2556886 0.16821562 0.16821562
## Michigan     0.07811792 0.1957270 0.36307753 0.36307753
## Arkansas     0.19473888 0.2992279 0.25301662 0.25301662
## New York     0.08723572 0.2392572 0.33675356 0.33675356
## Florida      0.09725070 0.2073927 0.34767830 0.34767830
## Alaska       0.14688036 0.2428630 0.30512830 0.30512830
## Hawaii       0.26945561 0.3356724 0.19743602 0.19743602
## New Jersey   0.16160093 0.3720897 0.23315470 0.23315470

# Visualize using corrplot
library(corrplot)
corrplot(res.fanny$membership, is.corr = FALSE)

# Dunn's partition coefficient
res.fanny$coeff

## dunn_coeff normalized 
## 0.31337355 0.08449807

# Observation groups
res.fanny$clustering

##         Iowa Rhode Island     Maryland    Tennessee         Utah 
##            1            2            3            4            2 
##      Arizona  Mississippi    Wisconsin     Virginia        Maine 
##            3            4            1            2            1 
##        Texas    Louisiana      Montana     Michigan     Arkansas 
##            3            4            1            3            2 
##     New York      Florida       Alaska       Hawaii   New Jersey 
##            3            3            4            2            2

3.1.2 cmeans()

It’s also possible to use the function cmeans() [in e1071 package] for computing fuzzy clustering.

cmeans(x, centers, iter.max = 100, dist = "euclidean", m = 2)

x: a data matrix where columns are variables and rows are observations
centers: Number of clusters or initial values for cluster centers
iter.max: Maximum number of iterations
dist: Possible values are “euclidean” or “manhattan”
m: A number greater than 1 giving the degree of fuzzification.

The function cmeans() returns an object of class fclust which is a list containing the following components:

centers: the final cluster centers
size: the number of data points in each cluster of the closest hard clustering
cluster: a vector of integers containing the indices of the clusters where the data points are assigned to for the closest hard clustering, as obtained by assigning points to the (first) class with maximal membership.
iter: the number of iterations performed
membership: a matrix with the membership values of the data points to the clusters
withinerror: the value of the objective function
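
The relation between the membership matrix and the closest hard clustering can be illustrated with a small, made-up membership matrix (the values below are hypothetical, not cmeans() output). Each observation is assigned to the first cluster with maximal membership:

```r
# Toy membership matrix: rows = observations, columns = clusters
membership <- rbind(
  a = c(0.70, 0.20, 0.10),
  b = c(0.15, 0.60, 0.25),
  c = c(0.40, 0.40, 0.20)   # tie between clusters 1 and 2
)

# Closest hard clustering: index of the (first) maximal membership
hard <- apply(membership, 1, which.max)
hard

# Cluster sizes of the hard clustering
table(hard)
```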

set.seed(123)
library(e1071)
cm <- cmeans(df, 4)
cm

## Fuzzy c-means clustering with 4 clusters
## 
## Cluster centers:
##       Murder    Assault   UrbanPop       Rape
## 1  0.6290005  0.9705484  0.5006389  0.8647698
## 2  0.8560350  0.3375298 -0.7294688  0.2002994
## 3 -1.2101485 -1.2476750 -0.7277747 -1.1534135
## 4 -0.7314218 -0.6647441  1.0032068 -0.3335272
## 
## Memberships:
##                        1           2          3          4
## Iowa         0.005939255 0.009155372 0.96585947 0.01904590
## Rhode Island 0.104616576 0.098854401 0.20500209 0.59152694
## Maryland     0.697459281 0.227720539 0.02731256 0.04750762
## Tennessee    0.078024194 0.872296030 0.02111342 0.02856636
## Utah         0.049301432 0.044484100 0.08442894 0.82178552
## Arizona      0.740498081 0.118781050 0.03988867 0.10083220
## Mississippi  0.179555100 0.624367937 0.10296383 0.09311313
## Wisconsin    0.024017906 0.033630983 0.83136508 0.11098604
## Virginia     0.155690387 0.395730684 0.19167059 0.25690834
## Maine        0.021165990 0.034336946 0.89152511 0.05297195
## Texas        0.545608753 0.240753676 0.05410235 0.15953522
## Louisiana    0.275003950 0.617629141 0.04197257 0.06539434
## Montana      0.062161310 0.135620851 0.66557661 0.13664123
## Michigan     0.848927329 0.096168273 0.01784963 0.03705477
## Arkansas     0.131803310 0.565593614 0.18039386 0.12220922
## New York     0.694179984 0.131927283 0.04157413 0.13231860
## Florida      0.711655719 0.173670792 0.03979837 0.07487512
## Alaska       0.369474028 0.381553979 0.11356564 0.13540635
## Hawaii       0.064103932 0.066647766 0.14874490 0.72050340
## New Jersey   0.082015921 0.059546923 0.05743425 0.80100291
## 
## Closest hard clustering:
##         Iowa Rhode Island     Maryland    Tennessee         Utah 
##            3            4            1            2            4 
##      Arizona  Mississippi    Wisconsin     Virginia        Maine 
##            1            2            3            2            3 
##        Texas    Louisiana      Montana     Michigan     Arkansas 
##            1            2            3            1            2 
##     New York      Florida       Alaska       Hawaii   New Jersey 
##            1            1            2            4            4 
## 
## Available components:
## [1] "centers"     "size"        "cluster"     "membership"  "iter"       
## [6] "withinerror" "call"

fviz_cluster(list(data = df, cluster = cm$cluster), ellipse.type = "norm",
             ellipse.level = 0.68)

4 Infos

This analysis has been performed using R software (ver. 3.2.4)

J. C. Dunn (1973): A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters. Journal of Cybernetics 3: 32-57
J. C. Bezdek (1981): Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York
Tariq Rashid: “Clustering”

Static and Interactive Heatmap in R - Unsupervised Machine Learning

Wed, 06 Apr 2016 22:55:51 +0200

In this article, we’ll describe how to draw static and interactive heatmaps in R. The following R packages and functions will be used:

heatmap(): an R base function for drawing a simple heatmap
heatmap.2() [in gplots package]: a function for drawing an enhanced heatmap
d3heatmap: an R package for drawing interactive heatmap
ComplexHeatmap: an R/bioconductor package for drawing, annotating and arranging complex heatmaps (very useful for genomic data analysis)

1 Data

The built-in mtcars R data is used:

df <- as.matrix(scale(mtcars))

2 Draw a heat map using R base function

The built-in R heatmap function [in stats package] can be used.

A simplified format is:

heatmap(x, scale = "row")

x: a numeric matrix
scale: a character indicating if the values should be centered and scaled in either the row direction or the column direction, or none. Allowed values are in c(“row”, “column”, “none”). Default is “row”.
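
To make the difference between the scaling directions concrete, here is a small base R sketch. It assumes that scale = "row" centers each row on its mean and divides it by its standard deviation, which is what t(scale(t(x))) computes; the object names are chosen for illustration.

```r
x <- as.matrix(mtcars)

# Column-wise standardization: each variable has mean 0 and sd 1
# (this is what the article does before calling heatmap(..., scale = "none"))
col_scaled <- scale(x)

# Row-wise standardization: each observation has mean 0 and sd 1
row_scaled <- t(scale(t(x)))

round(colMeans(col_scaled), 10)
round(rowMeans(row_scaled)[1:3], 10)
```

For data such as mtcars, where variables are measured in different units, standardizing by column before drawing the heatmap is usually the more meaningful choice.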

# Default plot
heatmap(df, scale = "none")

# Use custom colors
col<- colorRampPalette(c("red", "white", "blue"))(256)
heatmap(scale(as.matrix(mtcars)), scale = "none",
        col =  col)

The R code below will customize the heatmap as follows:

An RColorBrewer color palette name is used to change the appearance
The arguments RowSideColors and ColSideColors are used to annotate rows and columns respectively. The expected values for these options are a vector containing color names specifying the classes for rows/columns.

# Use RColorBrewer color palette names
library("RColorBrewer")
col <- colorRampPalette(brewer.pal(10, "RdYlBu"))(256)
heatmap(df, scale = "none", col =  col, 
        RowSideColors = rep(c("blue", "pink"), each = 16),
        ColSideColors = c(rep("purple", 5), rep("orange", 6)))
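
In practice, the side colors usually come from a grouping variable rather than from a fixed pattern. The sketch below, given for illustration, derives RowSideColors from the number of cylinders (cyl) in mtcars; the palette cyl_palette is an arbitrary choice.

```r
df <- scale(mtcars)

# One color per level of the grouping variable
cyl_palette <- c("4" = "green", "6" = "gray", "8" = "darkred")

# Look up the color of each row; the vector must have one entry per row
row_cols <- unname(cyl_palette[as.character(mtcars$cyl)])

heatmap(df, scale = "none", RowSideColors = row_cols)
```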

3 Enhanced heat map

The function heatmap.2() [in gplots package] provides many extensions to the standard R heatmap() function presented in the previous section.

# install.packages("gplots")
library("gplots")
heatmap.2(df, scale = "none", col = bluered(100), 
          trace = "none", density.info = "none")

Other arguments can be used including:

labRow, labCol
margins
hclustfun: hclustfun = function(x) hclust(x, method = "ward.D")
keysize

In the R code above, bluered() function [in gplots package] is used to generate a smoothly varying set of colors. You can also use the following color generator functions:

colorpanel(n, low, mid, high)
- n: Desired number of color elements to be generated
- low, mid, high: Colors to use for the lowest, middle, and highest values. mid may be omitted.
redgreen(n)
greenred(n)
bluered(n)
redblue(n)
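
If gplots is not installed, a similar blue-white-red palette can be approximated with base R. This is only an approximation of bluered(), given as an illustration; the intermediate colors are not guaranteed to be identical.

```r
# 100 colors going from blue through white to red
pal <- colorRampPalette(c("blue", "white", "red"))(100)

heatmap(scale(mtcars), scale = "none", col = pal)
```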

4 Interactive heatmap

The package d3heatmap can be used to produce an interactive heatmap:

It can be installed as follows:

if (!require("devtools")) install.packages("devtools")
devtools::install_github("rstudio/d3heatmap")

The function d3heatmap() is used to create the interactive heatmap:

The possibilities below are provided:

Put the mouse on a heatmap cell of interest to view the row and the column names as well as the corresponding value.
Select an area for zooming. After zooming, click on the heatmap again to go back to the previous display.

library("d3heatmap")
d3heatmap(scale(mtcars), colors = "RdBu",
          k_row = 4, k_col = 2)

colors: Either an RColorBrewer color palette name (e.g. “YlOrRd” or “Blues”), or a vector of colors to interpolate in hexadecimal “#RRGGBB” format, or a color interpolation function like colorRamp. Read this: available colors in R
k_row, k_col: an integer specifying the desired number of groups by which to color the dendrogram’s branches in row and column, respectively.

To further customize the heatmap, read ?d3heatmap.

5 Enhancing heatmaps using dendextend

The package dendextend can be used to enhance functions from other packages. The mtcars data is used in the following sections. We’ll start by defining the order and the appearance for rows and columns using dendextend. These results are used in other functions from other packages.

The order and the appearance for rows and columns can be defined as follows:

library(dendextend)
# order for rows
Rowv  <- mtcars %>% scale %>% dist %>% hclust %>% as.dendrogram %>%
   set("branches_k_color", k = 3) %>% set("branches_lwd", 1.2) %>%
   ladderize

# Order for columns
# We must transpose the data
Colv  <- mtcars %>% scale %>% t %>% dist %>% hclust %>% as.dendrogram %>%
   set("branches_k_color", k = 2, value = c("orange", "blue")) %>%
   set("branches_lwd", 1.2) %>%
   ladderize

The arguments above can be used in the functions below:

The standard heatmap() function [in stats package]:

heatmap(scale(mtcars), Rowv = Rowv, Colv = Colv,
        scale = "none")
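
For comparison, plain (uncolored) dendrograms can be built with base R alone and passed to heatmap() in the same way. This is a minimal sketch; the names Rowv_base and Colv_base are chosen for illustration.

```r
df <- scale(mtcars)

# Row and column dendrograms without dendextend
Rowv_base <- as.dendrogram(hclust(dist(df)))
Colv_base <- as.dendrogram(hclust(dist(t(df))))

heatmap(df, Rowv = Rowv_base, Colv = Colv_base, scale = "none")

# Row names in the order given by the dendrogram leaves
rownames(df)[order.dendrogram(Rowv_base)]
```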

The enhanced heatmap.2() function [in gplots package]:

library(gplots)
heatmap.2(scale(mtcars), scale = "none", col = bluered(100), 
          Rowv = Rowv, Colv = Colv,
          trace = "none", density.info = "none")

The interactive heatmap generator d3heatmap() function [in d3heatmap package]:

library("d3heatmap")
d3heatmap(scale(mtcars), colors = "RdBu",
          Rowv = Rowv, Colv = Colv)

6 Complex heatmap

ComplexHeatmap is an R/Bioconductor package, developed by Zuguang Gu, which provides a flexible solution to arrange and annotate multiple heatmaps. It also allows visualizing the association between data from different sources.

6.1 Install and load ComplexHeatmap package

The latest version can be installed as follows:

if (!require("devtools")) install.packages("devtools")
devtools::install_github("jokergoo/ComplexHeatmap")

Loading:

library("ComplexHeatmap")

6.2 Main function: Heatmap()

The main function from ComplexHeatmap package is Heatmap(). A simplified format is:

Heatmap(matrix, col, name)

matrix: a numeric or character matrix
col: a vector of colors (discrete color mapping) or a color mapping function (if the matrix is continuous numbers)
name: the name of the heatmap

6.3 Single heatmap

A single heatmap can be used to visualize a data set containing continuous or discrete values.

In the example below we’ll visualize the built-in mtcars data set.

Recall that, the mtcars data comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

data(mtcars)
head(mtcars[, 1:6])

##                    mpg cyl disp  hp drat    wt
## Mazda RX4         21.0   6  160 110 3.90 2.620
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875
## Datsun 710        22.8   4  108  93 3.85 2.320
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215
## Hornet Sportabout 18.7   8  360 175 3.15 3.440
## Valiant           18.1   6  225 105 2.76 3.460

Before drawing the heatmap, the data is first scaled using the R base scale() function.

df <- scale(mtcars)
Heatmap(df, name = "mtcars")

6.3.1 Colors

The argument col is used to specify colors. As our data matrix contains continuous values, the option col should be a color mapping function. In this case, the colorRamp2() function [in circlize] can be used as follows:

library("circlize")
Heatmap(df, name = "mtcars",
        col = colorRamp2(c(-2, 0, 2), c("green", "white", "red")))

The two arguments for colorRamp2() are a vector of break values and the corresponding colors.
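
To illustrate the idea of a color mapping function defined by breaks, here is a simplified base R sketch. The function map_color() is hypothetical: it interpolates linearly in RGB space and clamps values outside the breaks, whereas colorRamp2() may interpolate in a different color space, so intermediate colors can differ.

```r
# Map numeric values to colors by linear interpolation between break colors
map_color <- function(x, breaks, colors) {
  # Values outside the range of breaks take the color of the nearest break
  x <- pmin(pmax(x, min(breaks)), max(breaks))
  rgb_mat <- t(col2rgb(colors))  # one row per break color
  channels <- apply(rgb_mat, 2, function(ch) approx(breaks, ch, xout = x)$y)
  channels <- matrix(channels, ncol = 3)
  rgb(channels, maxColorValue = 255)
}

map_color(c(-3, -2, 0, 1, 2, 5),
          breaks = c(-2, 0, 2),
          colors = c("green", "white", "red"))
```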

It’s also possible to use RColorBrewer color palettes:

library("RColorBrewer")
Heatmap(df, name = "mtcars",
        col = colorRamp2(c(-2, 0, 2), brewer.pal(n=3, name="RdBu")))

In the next sections, we’ll use the following custom color palette:

mycol <- colorRamp2(c(-2, 0, 2), c("blue", "white", "red"))

6.3.2 Titles

The heatmap name, column title and row title can be changed as follows:

Heatmap(df, name = "mtcars", col = mycol,
        column_title = "Column title",
        row_title = "Row title")

Note that, the default side for the row title is “left” and the default side for the column title is “top”. This can be changed using the options below:

row_title_side: Allowed values are “left” or “right” (e.g.: row_title_side = “right” )
column_title_side: Allowed values are “top” or “bottom” (e.g.: column_title_side = “bottom” )

It’s also possible to modify the font size and face of titles using the options:

row_title_gp: graphic parameters for drawing row text
column_title_gp: graphic parameters for drawing column text

For instance,

Heatmap(df, name = "mtcars", col = mycol,
        column_title = "Column title",
        column_title_gp = gpar(fontsize = 14, fontface = "bold"),
        row_title = "Row title",
        row_title_gp = gpar(fontsize = 14, fontface = "bold"))

In the R code above, the possible values for fontface can be an integer or string: 1 = plain, 2 = bold, 3 = italic and 4 = bold italic. If a string, then valid values are: “plain”, “bold”, “italic”, “oblique”, and “bold.italic”.

6.3.3 Row and column names

Show row/column names:
- show_row_names: whether to show row names. Default value is TRUE
- show_column_names: whether to show column names. Default value is TRUE

Heatmap(df, name = "mtcars", show_row_names = FALSE)

Change font size and face:
- row_names_gp: graphical parameters for drawing row names
- column_names_gp: graphical parameters for drawing column names

Heatmap(df, name = "mtcars", 
        row_names_gp = gpar(fontsize = 14, fontface = "bold",
                            col = c("blue", "red")))

6.3.4 Clustering

6.3.4.1 Change the appearance of clustering

By default, rows and columns are clustered. This can be disabled using the arguments:

cluster_rows = FALSE. If TRUE, rows are clustered
cluster_columns = FALSE. If TRUE, columns are clustered

Row clustering is disabled using the R code below:

# Inactivate cluster on rows
Heatmap(df, name = "mtcars", col = mycol, cluster_rows = FALSE)

In some cases, we want to cluster rows/columns without showing the dendrogram on the final image. In this case, use the options:

show_row_dend: logical value; whether to show the row dendrogram (named show_row_hclust in older versions of ComplexHeatmap)
show_column_dend: logical value; whether to show the column dendrogram (formerly show_column_hclust)

It’s also possible to change the side of row and column dendrograms using the arguments:

row_dend_side: The allowed values are “left” or “right” (formerly row_hclust_side)
column_dend_side: The allowed values are “top” or “bottom” (formerly column_hclust_side)

If you want to change the height of column clusters or the width of row clusters, you can use the options column_dend_height and row_dend_width as follows:

Heatmap(df, name = "mtcars", col = mycol,
        column_dend_height = unit(2, "cm"),
        row_dend_width = unit(2, "cm") )

We can also customize the appearance of the dendrogram using the function color_branches() [in dendextend package]:

# install.packages("dendextend")
library(dendextend)
row_dend = hclust(dist(df)) # row clustering
col_dend = hclust(dist(t(df))) # column clustering
Heatmap(df, name = "mtcars", col = mycol,
        cluster_rows = color_branches(row_dend, k = 4),
        cluster_columns = color_branches(col_dend, k = 2))

6.3.4.2 Metric for clustering

The arguments clustering_distance_rows and clustering_distance_columns are used to specify the metric for row and column clustering, respectively. Default values are “euclidean”.

Allowed values are:

A Pre-defined character which is in (“euclidean”, “maximum”, “manhattan”, “canberra”, “binary”, “minkowski”, “pearson”, “spearman”, “kendall”):

Heatmap(df, name = "mtcars", clustering_distance_rows = "pearson",
        clustering_distance_columns = "pearson")

A Pre-defined function, such as dist(), to calculate distance from matrix (m):

Heatmap(df, name = "mtcars", 
        clustering_distance_rows = function(m) dist(m))

A Self-defined function which calculates distance from two vectors:

Heatmap(df, name = "mtcars", 
        clustering_distance_rows = function(x, y) 1 - cor(x, y))

Note that, in the R code above, the examples are generally shown for the argument clustering_distance_rows, which specifies the metric for row clustering. We recommend using the same metric for the argument clustering_distance_columns (metric for column clustering).
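
To see what a correlation-based metric looks like outside of Heatmap(), the sketch below builds a distance object from 1 - Pearson correlation between rows using base R only. It assumes this is the kind of dissimilarity meant by “pearson”; the object names are chosen for illustration.

```r
df <- scale(mtcars)

# Correlation-based dissimilarity between rows (observations)
pearson_dist <- as.dist(1 - cor(t(df)))

# The result is a regular "dist" object, usable with hclust()
hc <- hclust(pearson_dist, method = "complete")
plot(hc, cex = 0.6)
```

Values close to 0 indicate rows with very similar profiles, and values close to 2 indicate rows with opposite profiles.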

As an illustration, the R code below applies a self-defined pairwise distance function that is robust to outliers:

# Clustering metric function
robust_dist = function(x, y) {
    qx = quantile(x, c(0.1, 0.9))
    qy = quantile(y, c(0.1, 0.9))
    l = x > qx[1] & x < qx[2] & y > qy[1] & y < qy[2]
    x = x[l]
    y = y[l]
    sqrt(sum((x - y)^2))
}
# Heatmap
Heatmap(df, name = "mtcars", 
    clustering_distance_rows = robust_dist,
    clustering_distance_columns = robust_dist,
    col = colorRamp2(c(-2, 0, 2), c("purple", "white", "orange")))

6.3.4.3 Clustering methods

The arguments clustering_method_rows and clustering_method_columns can be used to specify the method for making hierarchical clustering. Allowed values are those supported by hclust() function including “ward.D”, “ward.D2”, “single”, “complete”, “average”, … (see ?hclust).

As an example:

Heatmap(df, name = "mtcars", clustering_method_rows = "ward.D",
        clustering_method_columns = "ward.D")

6.3.5 Split heatmap by rows

There are many ways to split the heatmap. One solution is to apply k-means using the argument km.

It’s important to use the set.seed() function when performing k-means so that the results obtained can be reproduced precisely at a later time.

set.seed(2)
# split into 2 groups
Heatmap(df, name = "mtcars", col = mycol, km = 2)
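
The effect of set.seed() can be checked with the base kmeans() function. This sketch assumes that the km argument relies on R’s random number generator in the same way; it only demonstrates that fixing the seed makes a k-means partition reproducible.

```r
df <- scale(mtcars)

# Same seed, same random starting centers, same partition
set.seed(2)
km1 <- kmeans(df, centers = 2)
set.seed(2)
km2 <- kmeans(df, centers = 2)

identical(km1$cluster, km2$cluster)
```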

It’s also possible to use split argument to specify row classes as a vector. In the following example we’ll use the levels of the factor variable cyl [in mtcars] to split the heatmap by rows. Recall that cyl corresponds to the number of cylinders.

# split by a vector specifying row classes
Heatmap(df, name = "mtcars", col = mycol, 
        split = mtcars$cyl )

Note that split can also be a data frame in which different combinations of levels split the rows of the heatmap.

# Split by combining multiple variables
Heatmap(df, name ="mtcars", col = mycol,
        split = data.frame(cyl = mtcars$cyl, am = mtcars$am))

# Combine km and split
Heatmap(df, name ="mtcars", col = mycol,
        km = 2, split =  mtcars$cyl)

If you want to use another partitioning method, rather than k-means, you can easily do it by just assigning the partitioning vector to split. In the R code below, we’ll use the pam() function [in cluster package]. pam() stands for Partitioning (clustering) of the data into k clusters “around medoids”, a more robust version of k-means.

# install.packages("cluster")
library("cluster")
set.seed(2)
pa = pam(df, k = 3)
Heatmap(df, name = "mtcars", col = mycol,
        split = paste0("pam", pa$clustering))
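
Any other partitioning vector can be prepared in the same way. As an illustration, the sketch below derives three groups from a hierarchical clustering with base R; the prefix "hc" and the choice of Ward linkage are arbitrary.

```r
df <- scale(mtcars)

# Cut a hierarchical tree into 3 groups
hc <- hclust(dist(df), method = "ward.D2")
groups <- paste0("hc", cutree(hc, k = 3))

table(groups)
# The vector can then be passed to Heatmap() through the split argument,
# e.g. split = groups
```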

It’s also possible to combine user defined dendrograms and split. In this case, split can be specified as a single number:

library(dendextend)
row_dend = hclust(dist(df)) # row clustering
row_dend = color_branches(row_dend, k = 4)
Heatmap(df, name = "mtcars", col = mycol,
        cluster_rows = row_dend, split = 2)

6.4 Heatmap annotation

The HeatmapAnnotation class is used to define annotation on row or column. A simplified format is:

HeatmapAnnotation(df, name, col, show_legend)

df: a data.frame with column names
name: the name of the heatmap annotation
col: a list of colors which contains color mapping to columns in df

For the example below, we’ll transpose our data to have the observations in columns and the variables in rows.

6.4.1 Prepare the data

# Transpose
df <- t(df)
# Heatmap of the transposed data
Heatmap(df, name ="mtcars", col = mycol)

6.4.2 Simple annotation

In simple annotation a vector, containing discrete or continuous values, is used to annotate rows or columns.

We’ll use the qualitative variables cyl (levels = “4”, “6” and “8”) and am (levels = “0” and “1”), and the continuous variable mpg to annotate columns.

For each of these 3 variables, custom colors are defined as follows:

# Annotation data frame
annot_df <- data.frame(cyl = mtcars$cyl, am = mtcars$am,  mpg = mtcars$mpg)

# Define colors for each level of qualitative variables
# Define gradient color for continuous variable (mpg)
col = list(cyl = c("4" = "green", "6" = "gray", "8" = "darkred"),
            am = c("0" = "yellow", "1" = "orange"),
            mpg = colorRamp2(c(17, 25), c("lightblue", "purple"))
            )

# Create the heatmap annotation
ha <- HeatmapAnnotation(df = annot_df, col = col)

# Combine the heatmap and the annotation
Heatmap(df, name = "mtcars", col = mycol,
        top_annotation = ha)
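
Before building the annotation, it can be worth checking that every level of a qualitative variable has a color. The base R sketch below does this for cyl and am; the names annot_check and discrete_cols are hypothetical and used only for this check.

```r
annot_check <- data.frame(cyl = mtcars$cyl, am = mtcars$am)
discrete_cols <- list(
  cyl = c("4" = "green", "6" = "gray", "8" = "darkred"),
  am  = c("0" = "yellow", "1" = "orange")
)

# Levels present in the data but absent from the color mapping
missing_levels <- lapply(names(discrete_cols), function(v) {
  setdiff(unique(as.character(annot_check[[v]])), names(discrete_cols[[v]]))
})
names(missing_levels) <- names(discrete_cols)
missing_levels
```

An empty character vector for each variable means that all levels are covered.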

It’s possible to hide the annotation legend using the argument show_legend = FALSE as follows:

ha <- HeatmapAnnotation(df = annot_df, col = col, show_legend = FALSE)
Heatmap(df, name = "mtcars", col = mycol, top_annotation = ha)

Annotation names can be added using the R code hereafter. The function qq() [in GetoptLong package], for simple variable interpolation in texts, is required.

library("GetoptLong")
# Combine Heatmap and annotation
ha <- HeatmapAnnotation(df = annot_df, col = col, show_legend = FALSE)
Heatmap(df, name = "mtcars", col = mycol, top_annotation = ha)
# Add annotation names on the right
for(an in colnames(annot_df)) {
    seekViewport(qq("annotation_@{an}"))
    grid.text(an, unit(1, "npc") + unit(2, "mm"), 0.5,
              default.units = "npc", just = "left")
}

To add annotation names on the left, use the code below:

# Annotation names on the left
for(an in colnames(annot_df)) {
    seekViewport(qq("annotation_@{an}"))
    # Anchor at the left edge of the annotation and right-justify the text
    grid.text(an, unit(0, "npc") - unit(2, "mm"), 0.5,
              default.units = "npc", just = "right")
}

6.4.3 Complex annotation

In this section we’ll see how to combine heatmap and some basic graphs to show the data distribution. For simple annotation graphics, the following functions can be used: anno_points(), anno_barplot(), anno_boxplot(), anno_density() and anno_histogram().

An example is shown below:

# Define some graphics to display the distribution of columns
.hist = anno_histogram(df, gp = gpar(fill = "lightblue"))
.density = anno_density(df, type = "line", gp = gpar(col = "blue"))
ha_mix_top = HeatmapAnnotation(hist = .hist, density = .density)

# Define some graphics to display the distribution of rows
.violin = anno_density(df, type = "violin", 
                       gp = gpar(fill = "lightblue"), which = "row")
.boxplot = anno_boxplot(df, which = "row")
ha_mix_right = HeatmapAnnotation(violin = .violin, bxplt = .boxplot,
                              which = "row", width = unit(4, "cm"))

# Combine annotation with heatmap
Heatmap(df, name = "mtcars", col = mycol,
        column_names_gp = gpar(fontsize = 8),
        top_annotation = ha_mix_top, 
        top_annotation_height = unit(4, "cm")) + ha_mix_right

Note that, it’s also possible to use the argument bottom_annotation.

6.5 Combine multiple heatmaps

Multiple heatmaps can be arranged as follows:

# Heatmap 1
ht1 = Heatmap(df, name = "ht1", col = mycol, km = 2,
              column_names_gp = gpar(fontsize = 9))
# Heatmap 2
ht2 = Heatmap(df, name = "ht2", 
        col = colorRamp2(c(-2, 0, 2), c("green", "white", "red")),
        column_names_gp = gpar(fontsize = 9))
# Combine the two heatmaps
ht1 + ht2

You can use the option width = unit(3, "cm") to control the size of the heatmaps.

Note that when combining multiple heatmaps, the first heatmap is considered as the main heatmap. Some settings of the remaining heatmaps are auto-adjusted according to the setting of the main heatmap. These include: removing row clusters and titles, and adding splitting.

The draw() function can be used to customize the appearance of the final image:

draw(ht1 + ht2, 
    # Titles
    row_title = "Two heatmaps, row title", 
    row_title_gp = gpar(col = "red"),
    column_title = "Two heatmaps, column title", 
    column_title_side = "bottom",
    # Gap between heatmaps
    gap = unit(0.5, "cm"))

Legends can be removed using the arguments show_heatmap_legend = FALSE, show_annotation_legend = FALSE.

6.6 Real application

6.7 Gene expression matrix

In gene expression data, rows are genes and columns are samples. More information about genes can be attached after the expression heatmap such as gene length and type of genes.

expr = readRDS(paste0(system.file(package = "ComplexHeatmap"),
                      "/extdata/gene_expression.rds"))
mat = as.matrix(expr[, grep("cell", colnames(expr))])

type = gsub("s\\d+_", "", colnames(mat))
ha = HeatmapAnnotation(df = data.frame(type = type))

Heatmap(mat, name = "expression", km = 5, top_annotation = ha, 
    top_annotation_height = unit(4, "mm"), 
    show_row_names = FALSE, show_column_names = FALSE) +
Heatmap(expr$length, name = "length", width = unit(5, "mm"),
        col = colorRamp2(c(0, 100000), c("white", "orange"))) +
Heatmap(expr$type, name = "type", width = unit(5, "mm")) +
Heatmap(expr$chr, name = "chr", width = unit(5, "mm"),
        col = rand_color(length(unique(expr$chr))))

It’s also possible to visualize genomic alterations and to integrate different molecular levels (gene expression, DNA methylation, …). Read the vignette for further examples.

6.8 Visualize the distribution of columns in a matrix

The function densityHeatmap() is used.

densityHeatmap(scale(mtcars))

The dashed lines on the heatmap correspond to the five quantile numbers. The text for the five quantile levels is added on the right of the heatmap.
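
The quantiles can also be inspected numerically with base R. The sketch below assumes that the five quantile levels are 0%, 25%, 50%, 75% and 100%, which is the default of quantile(); this is an assumption, not something stated by densityHeatmap().

```r
# Five quantiles for each column of the scaled data
q <- apply(scale(mtcars), 2, quantile, probs = c(0, 0.25, 0.5, 0.75, 1))
round(q[, 1:4], 2)
```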

7 Infos

This analysis has been performed using R software (ver. 3.2.3) and ComplexHeatmap (ver. )

The Guide for Clustering Analysis on a Real Data: 4 steps you should know - Unsupervised Machine Learning

Mon, 21 Dec 2015 11:45:58 +0100

1 Required packages
2 Data preparation
3 Assessing the clusterability
4 Estimate the number of clusters in the data
5 Compute k-means clustering
6 Cluster validation statistics: Inspect cluster silhouette plot
7 eclust(): Enhanced clustering analysis
- 7.1 K-means clustering using eclust()
- 7.2 Hierarchical clustering using eclust()
8 Infos

Large amounts of data are collected every day in many fields (biomedical research, security, marketing, web search, geo-spatial applications and other automated equipment), far exceeding what humans can analyze unaided. Consequently, unsupervised machine learning techniques, such as clustering, are used for discovering knowledge in big data.

Clustering approaches classify samples into groups (i.e., clusters) containing objects with similar profiles. In our previous post, we clarified distance measures for assessing similarity between observations.

In this chapter, we’ll describe the steps to follow for clustering a real data set using k-means.

1 Required packages

The following packages will be used:

cluster for clustering analyses
factoextra for visualizing clusters using ggplot2 plotting system

Install the factoextra package as follows:

if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/factoextra")

The cluster package can be installed using the code below:

install.packages("cluster")

Load packages:

library(cluster)
library(factoextra)

2 Data preparation

We’ll use the built-in R data set USArrests, which can be loaded and prepared as follows:

# Load the data set
data(USArrests)

# Remove any missing values (i.e., NA values, for "not available")
# that might be present in the data
USArrests <- na.omit(USArrests)

# View the first 6 rows of the data
head(USArrests, n = 6)

##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7

In this data set, columns are variables and rows are observations (i.e., samples).

To inspect the data before the K-means clustering we’ll compute some descriptive statistics such as the mean and the standard deviation of the variables.

The apply() function is used to apply a given function (e.g., min(), max(), mean(), …) to the data set. Its second argument can take the value:

1: for applying the function on the rows
2: for applying the function on the columns

desc_stats <- data.frame(
  Min = apply(USArrests, 2, min), # minimum
  Med = apply(USArrests, 2, median), # median
  Mean = apply(USArrests, 2, mean), # mean
  SD = apply(USArrests, 2, sd), # Standard deviation
  Max = apply(USArrests, 2, max) # Maximum
  )
desc_stats <- round(desc_stats, 1)
head(desc_stats)

##           Min   Med  Mean   SD   Max
## Murder    0.8   7.2   7.8  4.4  17.4
## Assault  45.0 159.0 170.8 83.3 337.0
## UrbanPop 32.0  66.0  65.5 14.5  91.0
## Rape      7.3  20.1  21.2  9.4  46.0

Note that the variables have very different means and variances. They must be standardized to make them comparable.

Standardization consists of transforming the variables so that they have mean zero and standard deviation one. The scale() function can be used as follows:

df <- scale(USArrests)
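
As a quick sanity check of the standardization described above, the following sketch recomputes the column means and standard deviations after scaling. It uses only base R; the object names col_means and col_sds are introduced here for illustration.

```r
data(USArrests)
df <- scale(USArrests)

# After standardization, each column should have mean 0 and SD 1
col_means <- colMeans(df)
col_sds   <- apply(df, 2, sd)

print(round(col_means, 10))
print(col_sds)
```

The centering and scaling values that were applied remain available in attr(df, "scaled:center") and attr(df, "scaled:scale"), which is useful when results have to be translated back to the original units.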

3 Assessing the clusterability

The function get_clust_tendency() [in factoextra] can be used. It computes the Hopkins statistic and provides a visual approach.

library("factoextra")
# graph = TRUE is needed for res$plot to be created
res <- get_clust_tendency(df, 40, graph = TRUE)
# Hopkins statistic
res$hopkins_stat

## [1] 0.3440875

# Visualize the dissimilarity matrix
res$plot

The value of the Hopkins statistic is well below 0.5, which, under the convention used by this version of factoextra, indicates that the data is highly clusterable. Additionally, it can be seen that the ordered dissimilarity image contains patterns (i.e., clusters).

4 Estimate the number of clusters in the data

Because k-means clustering requires the number of clusters to be specified in advance, we’ll use the function clusGap() [in cluster] to compute the gap statistic for estimating the optimal number of clusters. The function fviz_gap_stat() [in factoextra] is used to visualize the gap statistic plot.

library("cluster")
set.seed(123)
# Compute the gap statistic
gap_stat <- clusGap(df, FUN = kmeans, nstart = 25, 
                    K.max = 10, B = 500) 
# Plot the result
library(factoextra)
fviz_gap_stat(gap_stat)

The gap statistic suggests a 4-cluster solution.

It’s also possible to use the function NbClust() [in the NbClust package].
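
The number of clusters suggested by the gap statistic can also be read off programmatically rather than from the plot. The sketch below assumes the maxSE() helper from the cluster package with its "firstSEmax" rule. B is reduced to 50 only to keep the run short, so the selected k is not guaranteed to match the plot above.

```r
library(cluster)

df <- scale(USArrests)
set.seed(123)
# Fewer bootstrap samples than in the text (B = 50), for speed only
gap_stat <- clusGap(df, FUN = kmeans, nstart = 25,
                    K.max = 10, B = 50, verbose = FALSE)

# Apply the "first SE max" rule to the gap values and their standard errors
k_hat <- maxSE(f = gap_stat$Tab[, "gap"],
               SE.f = gap_stat$Tab[, "SE.sim"],
               method = "firstSEmax")
print(k_hat)
```

Other selection rules (for example "Tibs2001SEmax" or "globalmax") can give a different k on the same gap curve, so it is worth checking how sensitive the choice is.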

5 Compute k-means clustering

K-means clustering with k = 4:

# Compute k-means
set.seed(123)
km.res <- kmeans(df, 4, nstart = 25)
head(km.res$cluster, 20)

##     Alabama      Alaska     Arizona    Arkansas  California    Colorado 
##           4           3           3           4           3           3 
## Connecticut    Delaware     Florida     Georgia      Hawaii       Idaho 
##           2           2           3           4           2           1 
##    Illinois     Indiana        Iowa      Kansas    Kentucky   Louisiana 
##           3           2           1           2           1           4 
##       Maine    Maryland 
##           1           3

# Visualize clusters using factoextra
fviz_cluster(km.res, USArrests)

6 Cluster validation statistics: Inspect cluster silhouette plot

Recall that the silhouette width ($S_i$) measures how similar an object $i$ is to the other objects in its own cluster versus those in the neighbor cluster. $S_i$ values range from 1 to -1:

A value of $S_i$ close to 1 indicates that the object is well clustered. In other words, the object $i$ is similar to the other objects in its group.
A value of $S_i$ close to -1 indicates that the object is poorly clustered, and that assignment to some other cluster would probably improve the overall results.
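
To make the definition concrete, the sketch below computes the silhouette width of a single observation by hand and compares it with the value returned by silhouette() [in cluster]. It assumes the usual formula $S_i = (b_i - a_i) / \max(a_i, b_i)$, where $a_i$ is the mean distance from $i$ to the other members of its own cluster and $b_i$ is the smallest mean distance from $i$ to the members of any other cluster. The variable names are introduced here for illustration.

```r
library(cluster)

df <- scale(USArrests)
set.seed(123)
km <- kmeans(df, 4, nstart = 25)
d  <- as.matrix(dist(df))

i   <- 1                      # first observation (Alabama)
own <- km$cluster[i]

# a_i: mean distance to the other members of the same cluster
members <- setdiff(which(km$cluster == own), i)
a_i <- mean(d[i, members])

# b_i: smallest mean distance to the members of another cluster
others <- setdiff(unique(km$cluster), own)
b_i <- min(sapply(others, function(k) mean(d[i, km$cluster == k])))

s_i <- (b_i - a_i) / max(a_i, b_i)

# Value computed by the cluster package for the same observation
sil <- silhouette(km$cluster, dist(df))
print(c(by_hand = s_i, from_package = unname(sil[i, "sil_width"])))
```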

sil <- silhouette(km.res$cluster, dist(df))
rownames(sil) <- rownames(USArrests)
head(sil[, 1:3])

##            cluster neighbor  sil_width
## Alabama          4        3 0.48577530
## Alaska           3        4 0.05825209
## Arizona          3        2 0.41548326
## Arkansas         4        2 0.11870947
## California       3        2 0.43555885
## Colorado         3        2 0.32654235

fviz_silhouette(sil)

##   cluster size ave.sil.width
## 1       1   13          0.37
## 2       2   16          0.34
## 3       3   13          0.27
## 4       4    8          0.39

It can be seen that some samples have a negative silhouette value. Some natural questions are:

Which samples are these? To what cluster are they closer?

This can be determined from the output of the function silhouette() as follows:

neg_sil_index <- which(sil[, "sil_width"] < 0)
sil[neg_sil_index, , drop = FALSE]

##          cluster neighbor   sil_width
## Missouri       3        2 -0.07318144

7 eclust(): Enhanced clustering analysis

The function eclust() [in factoextra] provides several advantages compared to the standard packages used for clustering analysis:

It simplifies the workflow of clustering analysis
It can be used to compute hierarchical clustering and partitioning clustering in a single line function call
The function eclust() computes automatically the gap statistic for estimating the right number of clusters.
It automatically provides silhouette information
It draws beautiful graphs using ggplot2

7.1 K-means clustering using eclust()

# Compute k-means
res.km <- eclust(df, "kmeans")

# Gap statistic plot
fviz_gap_stat(res.km$gap_stat)

# Silhouette plot
fviz_silhouette(res.km)

##   cluster size ave.sil.width
## 1       1   13          0.27
## 2       2   13          0.37
## 3       3    8          0.39
## 4       4   16          0.34

7.2 Hierarchical clustering using eclust()

# Enhanced hierarchical clustering
res.hc <- eclust(df, "hclust") # compute hclust
fviz_dend(res.hc, rect = TRUE) # dendrogram

The R code below generates the silhouette plot and the scatter plot for hierarchical clustering.

fviz_silhouette(res.hc) # silhouette plot
fviz_cluster(res.hc) # scatter plot

8 Infos

This analysis has been performed using R software (ver. 3.2.3)

Model-Based Clustering - Unsupervised Machine Learning

Sun, 20 Dec 2015 17:12:40 +0100

1 Concept
2 Model parameters
3 Advantage of model-based clustering
4 Example of data
5 Mclust(): R function for computing model-based clustering
6 Example of cluster analysis using Mclust()
7 Infos

1 Concept

The traditional clustering methods such as hierarchical clustering and partitioning algorithms (k-means and others) are heuristic and are not based on formal models.

An alternative is model-based clustering, in which the data are considered as coming from a distribution that is a mixture of two or more components (i.e., clusters) (Chris Fraley and Adrian E. Raftery, 2002 and 2012).

Each component k (i.e., group or cluster) is modeled by the normal or Gaussian distribution, which is characterized by the parameters:

$\mu_k$: mean vector,
$\Sigma_k$: covariance matrix,
An associated probability in the mixture. Each point has a probability of belonging to each cluster.
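
The three ingredients listed above can be illustrated with a one-dimensional toy mixture. This is a minimal sketch in base R: the mixing proportions, means and standard deviations are hypothetical values chosen for illustration, not estimates from any data set.

```r
# Hypothetical two-component Gaussian mixture in one dimension
mix_prop <- c(0.4, 0.6)   # probability of each component in the mixture
mu       <- c(2.0, 4.3)   # component means
sigma    <- c(0.3, 0.4)   # component standard deviations

x <- c(1.8, 3.0, 4.5)     # points at which to evaluate the model

# Weighted density of each component at each point (rows = points)
comp <- sapply(1:2, function(k) mix_prop[k] * dnorm(x, mu[k], sigma[k]))

# Mixture density: sum of the weighted component densities
mix_density <- rowSums(comp)

# Probability that each point belongs to each component
post <- comp / mix_density
print(round(post, 4))
```

Each row of post sums to one, which is the sense in which every point has a probability of belonging to each cluster.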

2 Model parameters

The model parameters can be estimated using the EM (Expectation-Maximization) algorithm initialized by hierarchical model-based clustering. Each cluster k is centered at the mean $\mu_k$, with increased density for points near the mean.

Geometric features (shape, volume, orientation) of each cluster are determined by the covariance matrix $\Sigma_k$.

Different possible parameterizations of $\Sigma_k$ are available in the R package mclust (see ?mclustModelNames).

The available model options, in mclust package, are represented by identifiers including: EII, VII, EEI, VEI, EVI, VVI, EEE, EEV, VEV and VVV.

The first letter of each identifier refers to volume, the second to shape and the third to orientation. E stands for “equal”, V for “variable” and I for “coordinate axes”.

For example:

EVI denotes a model in which the volumes of all clusters are equal (E), the shapes of the clusters may vary (V), and the orientation is the identity (I) or “coordinate axes”.
EEE means that the clusters have the same volume, shape and orientation in p-dimensional space.
VEI means that the clusters have variable volume, the same shape and orientation equal to coordinate axes.
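
The volume, shape and orientation referred to above can be extracted from a covariance matrix by an eigenvalue decomposition. The sketch below assumes the usual parameterization $\Sigma_k = \lambda_k D_k A_k D_k^T$, which is not spelled out in the text above: $\lambda_k$ is a scalar controlling volume, $A_k$ a diagonal matrix with determinant 1 controlling shape, and $D_k$ an orthogonal matrix controlling orientation. The matrix Sigma is a hypothetical example.

```r
# Hypothetical 2 x 2 covariance matrix of one cluster
Sigma <- matrix(c(2.0, 0.8,
                  0.8, 1.0), nrow = 2, byrow = TRUE)

e <- eigen(Sigma, symmetric = TRUE)
p <- nrow(Sigma)

lambda <- prod(e$values)^(1 / p)   # volume: scalar
A      <- diag(e$values / lambda)  # shape: diagonal, determinant 1
D      <- e$vectors                # orientation: orthogonal matrix

# Rebuild the covariance matrix from its three geometric parts
Sigma_rebuilt <- lambda * D %*% A %*% t(D)
print(all.equal(Sigma, Sigma_rebuilt))
```

Under this reading, an identifier such as VEI says that lambda may differ between clusters, A is shared, and D is the identity matrix.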

The mclust package uses maximum likelihood to fit all these models, with different covariance matrix parameterizations, for a range of k components. The “best model” is selected using the Bayesian Information Criterion or BIC. In mclust, a larger BIC score indicates stronger evidence for the corresponding model.

3 Advantage of model-based clustering

The key advantage of the model-based approach, compared to the standard clustering methods (k-means, hierarchical clustering, …), is that it suggests both the number of clusters and an appropriate model.

4 Example of data

We’ll use the bivariate faithful data set which contains the waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park (Wyoming, USA).

# Load the data
data("faithful")
head(faithful)

##   eruptions waiting
## 1     3.600      79
## 2     1.800      54
## 3     3.333      74
## 4     2.283      62
## 5     4.533      85
## 6     2.883      55

An illustration of the data can be drawn using the ggplot2 package as follows:

library("ggplot2")
ggplot(faithful, aes(x=eruptions, y=waiting)) +
  geom_point() +  # Scatter plot
  geom_density2d() # Add 2d density estimation

5 Mclust(): R function for computing model-based clustering

The function Mclust() [in mclust package] can be used to compute model-based clustering.

Install and load the package as follows:

# Install
install.packages("mclust")

# Load
library("mclust")

The function Mclust() provides the optimal mixture model estimation according to BIC. A simplified format is:

Mclust(data, G = NULL)

data: A numeric vector, matrix or data frame. Categorical variables are not allowed. If a matrix or data frame, rows correspond to observations and columns correspond to variables.
G: An integer vector specifying the numbers of mixture components (clusters) for which the BIC is to be calculated. The default is G=1:9.

The function Mclust() returns an object of class ‘Mclust’ containing the following elements:

modelName: A character string denoting the model at which the optimal BIC occurs.
G: The optimal number of mixture components (i.e: number of clusters)
BIC: All BIC values
bic: Optimal BIC value
loglik: The loglikelihood corresponding to the optimal BIC
df: The number of estimated parameters
z: A matrix whose $[i,k]^{th}$ entry is the probability that observation $i$ in the data belongs to the $k^{th}$ class. Column names are cluster numbers, and rows are observations
classification: The cluster number of each observation, i.e. map(z)
uncertainty: The uncertainty associated with the classification

6 Example of cluster analysis using Mclust()

library(mclust)
# Model-based-clustering
mc <- Mclust(faithful)
# Print a summary
summary(mc)

## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm 
## ----------------------------------------------------
## 
## Mclust EEE (ellipsoidal, equal volume, shape and orientation) model with 3 components:
## 
##  log.likelihood   n df       BIC       ICL
##       -1126.361 272 11 -2314.386 -2360.865
## 
## Clustering table:
##   1   2   3 
## 130  97  45

# Values returned by Mclust()
names(mc)

##  [1] "call"           "data"           "modelName"      "n"             
##  [5] "d"              "G"              "BIC"            "bic"           
##  [9] "loglik"         "df"             "hypvol"         "parameters"    
## [13] "z"              "classification" "uncertainty"

# Optimal selected model
mc$modelName

## [1] "EEE"

# Optimal number of clusters
mc$G

## [1] 3

# Probability for an observation to be in a given cluster
head(mc$z)

##           [,1]         [,2]         [,3]
## 1 2.181744e-02 1.130837e-08 9.781825e-01
## 2 2.475031e-21 1.000000e+00 3.320864e-13
## 3 2.521625e-03 2.051823e-05 9.974579e-01
## 4 6.553336e-14 9.999998e-01 1.664978e-07
## 5 9.838967e-01 7.642900e-20 1.610327e-02
## 6 2.104355e-07 9.975388e-01 2.461029e-03

# Cluster assignment of each observation
head(mc$classification, 10)

##  1  2  3  4  5  6  7  8  9 10 
##  3  2  3  2  1  2  1  3  2  1

# Uncertainty associated with the classification
head(mc$uncertainty)

##            1            2            3            4            5 
## 2.181745e-02 3.321787e-13 2.542143e-03 1.664978e-07 1.610327e-02 
##            6 
## 2.461239e-03
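
The classification and uncertainty values follow directly from the membership probabilities in z. The sketch below rebuilds them for four of the observations printed above (rows 1, 3, 5 and 6), assuming that the classification is the column with the largest probability and that the uncertainty is one minus that probability. Because the probabilities are copied from rounded output, the results agree with the printed uncertainties only to about six decimal places.

```r
# Rows 1, 3, 5 and 6 of head(mc$z), copied from the output above
z <- rbind(
  "1" = c(2.181744e-02, 1.130837e-08, 9.781825e-01),
  "3" = c(2.521625e-03, 2.051823e-05, 9.974579e-01),
  "5" = c(9.838967e-01, 7.642900e-20, 1.610327e-02),
  "6" = c(2.104355e-07, 9.975388e-01, 2.461029e-03)
)

# Most probable cluster for each observation
cls <- apply(z, 1, which.max)

# Uncertainty: probability mass not assigned to the chosen cluster
unc <- 1 - apply(z, 1, max)

print(cls)
print(signif(unc, 4))
```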

Model-based clustering results can be drawn using the function plot.Mclust():

plot(x, what = c("BIC", "classification", "uncertainty", "density"),
     xlab = NULL, ylab = NULL, addEllipses = TRUE, main = TRUE, ...)

# BIC values used for choosing the number of clusters
plot(mc, "BIC")

# Classification: plot showing the clustering
plot(mc, "classification")

# Classification uncertainty
plot(mc, "uncertainty")

# Estimated density. Contour plot
plot(mc, "density")

Clusters generated by Mclust() can be drawn using the function fviz_cluster() [in factoextra package]. Read more about factoextra: https://www.sthda.com/english/wiki/factoextra-r-package-quick-multivariate-data-analysis-pca-ca-mca-and-visualization-r-software-and-data-mining

library(factoextra)
fviz_cluster(mc, frame.type = "norm", geom = "point")

7 Infos

This analysis has been performed using R software (ver. 3.2.3)

Chris Fraley, A. E. Raftery, T. B. Murphy and L. Scrucca (2012). mclust Version 4 for R: Normal Mixture Modeling for Model-Based Clustering, Classification, and Density Estimation. Technical Report No. 597, Department of Statistics, University of Washington.
Chris Fraley and A. E. Raftery (2002). Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association 97: 611-631.

Beautiful dendrogram visualizations in R: 5+ must known methods - Unsupervised Machine Learning

Sun, 06 Dec 2015 12:47:59 +0100

A variety of functions exist in R for visualizing and customizing dendrograms. The aim of this article is to describe 5+ methods for drawing a beautiful dendrogram using R software.

We start by computing hierarchical clustering using the data set USArrests:

# Load data
data(USArrests)

# Compute distances and hierarchical clustering
dd <- dist(scale(USArrests), method = "euclidean")
hc <- hclust(dd, method = "ward.D2")

1 plot.hclust(): R base function

As you already know, the standard R function plot.hclust() can be used to draw a dendrogram from the results of hierarchical clustering analyses (computed using hclust() function).

A simplified format is:

plot(x, labels = NULL, hang = 0.1, 
     main = "Cluster dendrogram", sub = NULL,
     xlab = NULL, ylab = "Height", ...)

x: an object of the type produced by hclust()
labels: A character vector of labels for the leaves of the tree. The default value is row names. If labels = FALSE, no labels are drawn.
hang: The fraction of the plot height by which labels should hang below the rest of the plot. A negative value will cause the labels to hang down from 0.
main, sub, xlab, ylab: character strings for title.

# Default plot
plot(hc)

# Put the labels at the same height: hang = -1
plot(hc, hang = -1, cex = 0.6)
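
A base R dendrogram is often combined with a cut of the tree. The following sketch, which uses only functions from the stats package, cuts the tree into 4 groups and outlines them on the plot. The choice of k = 4 is an assumption made for illustration.

```r
# Same clustering as above
hc <- hclust(dist(scale(USArrests), method = "euclidean"),
             method = "ward.D2")

# Cluster membership of each state when the tree is cut into 4 groups
grp <- cutree(hc, k = 4)
print(table(grp))

# Draw the dendrogram and outline the 4 groups
plot(hc, hang = -1, cex = 0.6)
rect.hclust(hc, k = 4, border = 2:5)
```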

2 plot.dendrogram() function

In order to visualize the result of a hierarchical clustering analysis using the function plot.dendrogram(), we must first convert it to a dendrogram.

The format of the function plot.dendrogram() is:

plot(x, type = c("rectangle", "triangle"), horiz = FALSE)

x: an object of class dendrogram
type: type of plot. Possible values are “rectangle” or “triangle”
horiz: logical indicating if the dendrogram should be drawn horizontally or not

# Convert hclust into a dendrogram and plot
hcd <- as.dendrogram(hc)
# Default plot
plot(hcd, type = "rectangle", ylab = "Height")

# Triangle plot
plot(hcd, type = "triangle", ylab = "Height")

# Zoom in to the first dendrogram
plot(hcd, xlim = c(1, 20), ylim = c(1,8))

The above dendrogram can be customized using the arguments:

nodePar: a list of plotting parameters to use for the nodes (see ?points). Default value is NULL. The list may contain components named pch, cex, col, xpd, and/or bg each of which can have length two for specifying separate attributes for inner nodes and leaves.
edgePar: a list of plotting parameters to use for the edge segments (see ?segments). The list may contain components named col, lty and lwd (for the segments). As with nodePar, each can have length two for differentiating leaves and inner nodes.
leaflab: a string specifying how leaves are labeled. The default “perpendicular” writes text vertically; “textlike” writes text horizontally (in a rectangle), and “none” suppresses leaf labels.

# Define nodePar
nodePar <- list(lab.cex = 0.6, pch = c(NA, 19), 
                cex = 0.7, col = "blue")
# Customized plot; remove labels
plot(hcd, ylab = "Height", nodePar = nodePar, leaflab = "none")

# Horizontal plot
plot(hcd,  xlab = "Height",
     nodePar = nodePar, horiz = TRUE)

# Change edge color
plot(hcd,  xlab = "Height", nodePar = nodePar, 
     edgePar = list(col = 2:3, lwd = 2:1))

3 Phylogenetic trees

The package ape (Analyses of Phylogenetics and Evolution) can be used to produce a more sophisticated dendrogram.

The function plot.phylo() can be used for plotting a dendrogram. A simplified format is:

plot(x, type = "phylogram", show.tip.label = TRUE,
     edge.color = "black", edge.width = 1, edge.lty = 1,
     tip.color = "black")

x: an object of class “phylo”
type: the type of phylogeny to be drawn. Possible values are: “phylogram” (the default), “cladogram”, “fan”, “unrooted” and “radial”
show.tip.label: if TRUE, labels are shown
edge.color, edge.width, edge.lty: line color, width and type to be used for edge
tip.color: color used for labels

# install.packages("ape")
library("ape")
# Default plot
plot(as.phylo(hc), cex = 0.6, label.offset = 0.5)

# Cladogram
plot(as.phylo(hc), type = "cladogram", cex = 0.6, 
     label.offset = 0.5)

# Unrooted
plot(as.phylo(hc), type = "unrooted", cex = 0.6,
     no.margin = TRUE)

# Fan
plot(as.phylo(hc), type = "fan")

# Radial
plot(as.phylo(hc), type = "radial")

# Cut the dendrogram into 4 clusters
colors = c("red", "blue", "green", "black")
clus4 = cutree(hc, 4)
plot(as.phylo(hc), type = "fan", tip.color = colors[clus4],
     label.offset = 1, cex = 0.7)

# Change the appearance
# change edge and label (tip)
plot(as.phylo(hc), type = "cladogram", cex = 0.6,
     edge.color = "steelblue", edge.width = 2, edge.lty = 2,
     tip.color = "steelblue")

4 ggdendro package: ggplot2 and dendrogram

The R package ggdendro can be used to extract the plot data from dendrogram and for drawing a dendrogram using ggplot2.

4.1 Installation and loading

ggdendro can be installed as follows:

install.packages("ggdendro")

ggdendro requires the package ggplot2. Make sure that ggplot2 is installed and loaded before using ggdendro.

Load ggdendro as follows:

library("ggplot2")
library("ggdendro")

4.2 Visualize dendrogram using ggdendrogram() function

The function ggdendrogram() creates a dendrogram plot using ggplot2.

# Visualization using the default theme named theme_dendro()
ggdendrogram(hc)

# Rotate the plot and remove default theme
ggdendrogram(hc, rotate = TRUE, theme_dendro = FALSE)

4.3 Extract dendrogram plot data

The function dendro_data() can be used for extracting the data. It returns a list of data frames which can be extracted using the functions below:

segment(): To extract the data for dendrogram line segments
label(): To extract the labels

# Build dendrogram object from hclust results
dend <- as.dendrogram(hc)

# Extract the data (for rectangular lines)
# Type can be "rectangle" or "triangle"
dend_data <- dendro_data(dend, type = "rectangle")
# What dend_data contains
names(dend_data)

## [1] "segments"    "labels"      "leaf_labels" "class"

# Extract data for line segments
head(dend_data$segments)

##           x         y     xend      yend
## 1 19.771484 13.516242 8.867188 13.516242
## 2  8.867188 13.516242 8.867188  6.461866
## 3  8.867188  6.461866 4.125000  6.461866
## 4  4.125000  6.461866 4.125000  2.714554
## 5  4.125000  2.714554 2.500000  2.714554
## 6  2.500000  2.714554 2.500000  1.091092

# Extract data for labels
head(dend_data$labels)

##   x y          label
## 1 1 0        Alabama
## 2 2 0      Louisiana
## 3 3 0        Georgia
## 4 4 0      Tennessee
## 5 5 0 North Carolina
## 6 6 0    Mississippi

dend_data can be used to draw a customized dendrogram using ggplot2:

# Plot line segments and add labels
p <- ggplot(dend_data$segments) + 
  geom_segment(aes(x = x, y = y, xend = xend, yend = yend))+
  geom_text(data = dend_data$labels, aes(x, y, label = label),
            hjust = 1, angle = 90, size = 3)+
  ylim(-3, 15)
print(p)

5 dendextend package: Extending R’s dendrogram functionality

The package dendextend contains many functions for changing the appearance of a dendrogram and for comparing dendrograms.

In this section we’ll use the chaining operator (%>%) to simplify our code.

5.1 Chaining

The chaining operator (%>%) turns x %>% f(y) into f(x, y) so you can use it to rewrite multiple operations such that they can be read from left-to-right, top-to-bottom. For instance, the results of the two R codes below are equivalent.

Standard R code for creating a dendrogram:

data <- scale(USArrests)
dist.res <- dist(data)
hc <- hclust(dist.res, method = "ward.D2")
dend <- as.dendrogram(hc)
plot(dend)

R code for creating the same dendrogram using the chaining operator (%>% is available once dendextend or magrittr is loaded):

dend <- USArrests %>% # data
        scale %>% # Scale the data
        dist %>% # calculate a distance matrix, 
        hclust(method = "ward.D2") %>% # Hierarchical clustering 
        as.dendrogram # Turn the object into a dendrogram.
plot(dend)
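
The same left-to-right style can be written without any extra package by using R's native pipe |>. This is a sketch that assumes R 4.1.0 or later, which is more recent than the R version used in this article. It checks that the nested and the piped forms build the same tree; the names d1 and d2 are introduced here for illustration.

```r
# Nested calls: read from the inside out
d1 <- as.dendrogram(hclust(dist(scale(USArrests)), method = "ward.D2"))

# Native pipe: read from left to right (requires R >= 4.1.0)
d2 <- USArrests |>
  scale() |>
  dist() |>
  hclust(method = "ward.D2") |>
  as.dendrogram()

# Both dendrograms list the leaves in the same order
print(identical(labels(d1), labels(d2)))
```

Unlike %>%, the native pipe requires the parentheses after each function name, so scale() cannot be shortened to scale.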

5.2 Installation and loading

Install the stable version as follows:

install.packages('dendextend')

Loading:

library(dendextend)

5.3 How to change a dendrogram

The function set() can be used to change the parameters with dendextend.

The format is:

set(object, what, value)

object: a dendrogram object
what: a character indicating what is the property of the tree that should be set/updated
value: a vector with the value to set in the tree (the type of the value depends on the “what”).

Possible values for the argument what include:

Value for the argument what	Description
labels	set the labels
labels_colors and labels_cex	Set the color and the size of labels, respectively
leaves_pch, leaves_cex and leaves_col	set the point type, size and color for leaves, respectively
nodes_pch, nodes_cex and nodes_col	set the point type, size and color for nodes, respectively
hang_leaves	hang the leaves
branches_k_color	color the branches
branches_col, branches_lwd and branches_lty	Set the color, the line width and the line type of branches, respectively
by_labels_branches_col, by_labels_branches_lwd and by_labels_branches_lty	Set the color, the line width and the line type of branches with specific labels, respectively
clear_branches and clear_leaves	Clear branches and leaves, respectively

5.4 Create a simple dendrogram

# Create a dendrogram and plot it
dend <- USArrests[1:5,] %>%  scale %>% 
        dist %>% hclust %>% as.dendrogram

dend %>% plot

# Get the labels of the tree
labels(dend)

## [1] "Alaska"     "Arizona"    "California" "Alabama"    "Arkansas"

5.5 Change labels

This section describes how to change label names as well as the color and the size for labels.

# Change the labels, and then plot:
dend %>% set("labels", c("a", "b", "c", "d", "e")) %>% plot

# Change color and size for labels
dend %>% set("labels_col", c("green", "blue")) %>% # change color
  set("labels_cex", 2) %>% # Change size
  plot(main = "Change the color \nand size") # plot

# Color labels by specifying the number of cluster (k)
dend %>% set("labels_col", value = c("green", "blue"), k=2) %>% 
          plot(main = "Color labels \nper cluster")
abline(h = 2, lty = 2)

In the R code above, the color vector is shorter than the number of labels. Hence, its values are recycled.
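
Recycling here follows R's general rule for extending a short vector to a required length. The sketch below illustrates that rule in base R for a tree with 5 leaves. It demonstrates the concept only; it makes no claim about how dendextend implements the recycling internally.

```r
cols     <- c("green", "blue")   # two colors supplied
n_labels <- 5                    # five leaves to color

# Repeat the colors in order until every label has one
recycled <- rep_len(cols, n_labels)
print(recycled)
```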

5.6 Change the points of a dendrogram nodes/leaves

# Change the type, the color and the size of node points
# +++++++++++++++++++++++++++++
dend %>% set("nodes_pch", 19) %>%  # node point type
  set("nodes_cex", 2) %>%  # node point size
  set("nodes_col", "blue") %>% # node point color
  plot(main = "Node points")

# Change the type, the color and the size of leaf points
# +++++++++++++++++++++++++++++
dend %>% set("leaves_pch", 19) %>%  # leaf point type
  set("leaves_cex", 2) %>%  # leaf point size
  set("leaves_col", "blue") %>% # leaf point color
  plot(main = "Leaves points")

# Specify different point types and colors for each leaf
dend %>% set("leaves_pch", c(17, 18, 19)) %>%  # leaf point type
  set("leaves_cex", 2) %>%  # leaf point size
  set("leaves_col", c("blue", "red", "green")) %>% # leaf point color
  plot(main = "Leaves points")

5.7 Change the color of branches

Branches can be colored according to the k clusters obtained by cutting the tree:

# Default colors
dend %>% set("branches_k_color", k = 2) %>% 
  plot(main = "Default colors")

# Customized colors
dend %>% set("branches_k_color", 
             value = c("red", "blue"), k = 2) %>% 
   plot(main = "Customized colors")

It’s also possible to use the function color_branches().

5.8 Adding colored rectangles

Clusters can be highlighted by adding colored rectangles. This is done using the rect.dendrogram() function (modeled on the rect.hclust() function). One advantage of rect.dendrogram() over rect.hclust() is that it also works on horizontally plotted trees:

# Vertical plot
dend %>% set("branches_k_color", k = 3) %>% plot
dend %>% rect.dendrogram(k=3, border = 8, lty = 5, lwd = 2)

# Horizontal plot
dend %>% set("branches_k_color", k = 3) %>% plot(horiz = TRUE)
dend %>% rect.dendrogram(k = 3, horiz = TRUE, border = 8, lty = 5, lwd = 2)

5.9 Adding colored bars

This is useful for annotating the items in the clusters:

grp <- c(1,1,1, 2,2)
k_3 <- cutree(dend,k = 3, order_clusters_as_data = FALSE) 
# The FALSE above makes sure we get the clusters in the order of the
# dendrogram, and not in that of the original data. It is like:
# cutree(dend, k = 3)[order.dendrogram(dend)]

the_bars <- cbind(grp, k_3)

dend %>% set("labels", "") %>% plot
colored_bars(colors = the_bars, dend = dend)
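
The ordering issue mentioned in the comments above can be seen with base R alone. In this sketch, cutree() on an hclust object returns the clusters in the order of the original data, and indexing by the leaf order rearranges them to follow the dendrogram from left to right. Only the ordering is illustrated here; the cluster numbers that dendextend assigns with order_clusters_as_data = FALSE may differ from these.

```r
hc <- hclust(dist(scale(USArrests[1:5, ])))

# Clusters in the order of the rows of the data
k3_data <- cutree(hc, k = 3)

# Same clusters, rearranged to follow the leaves of the dendrogram
k3_dend <- k3_data[hc$order]

print(k3_data)
print(k3_dend)
```

Colored bars drawn under a dendrogram must follow the second ordering, otherwise each bar segment ends up under the wrong leaf.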

5.10 ggplot2 integration

The following 2 steps are used:

Transform a dendrogram into a ggdend object using as.ggdend() function
Make the plot using the function ggplot()

dend <- iris[1:30,-5] %>% scale %>% dist %>% 
   hclust %>% as.dendrogram %>%
   set("branches_k_color", k=3) %>% set("branches_lwd", 1.2) %>%
   set("labels_colors") %>% set("labels_cex", c(.9,1.2)) %>% 
   set("leaves_pch", 19) %>% set("leaves_col", c("blue", "red"))
# plot the dend in usual "base" plotting engine:
plot(dend)

Produce the same plot with ggplot2 using the function ggplot():

library(ggplot2)
# Rectangle dendrogram using ggplot2
ggd1 <- as.ggdend(dend)
ggplot(ggd1)

# Change the theme to the default ggplot2 theme
ggplot(ggd1, horiz = TRUE, theme = NULL)

# Theme minimal
ggplot(ggd1, theme = theme_minimal())

# Create a radial plot and remove labels
ggplot(ggd1, labels = FALSE) + 
  scale_y_reverse(expand = c(0.2, 0)) +
  coord_polar(theta="x")

5.11 pvclust and dendextend

The package dendextend can be used to enhance many packages, including pvclust. Recall that pvclust is used for calculating p-values for hierarchical clustering.

pvclust can be used as follows:

library(pvclust)
data(lung) # 916 genes for 73 subjects
set.seed(1234)
result <- pvclust(lung[1:100, 1:10], method.dist="cor", 
                  method.hclust="average", nboot=10)

## Bootstrap (r = 0.5)... Done.
## Bootstrap (r = 0.6)... Done.
## Bootstrap (r = 0.7)... Done.
## Bootstrap (r = 0.8)... Done.
## Bootstrap (r = 0.9)... Done.
## Bootstrap (r = 1.0)... Done.
## Bootstrap (r = 1.1)... Done.
## Bootstrap (r = 1.2)... Done.
## Bootstrap (r = 1.3)... Done.
## Bootstrap (r = 1.4)... Done.

# Default plot of the result
plot(result)
pvrect(result)

# pvclust and dendextend
result %>% as.dendrogram %>% 
  set("branches_k_color", k = 2, value = c("purple", "orange")) %>%
  plot
result %>% text
result %>% pvrect

6 Infos

This analysis has been performed using R software (ver. 3.2.1)

Determining the optimal number of clusters: 3 must known methods - Unsupervised Machine Learning

Sun, 22 Nov 2015 04:37:09 +0100

The first step in clustering analysis is to assess whether the dataset is clusterable. This has been described in a chapter entitled: Assessing Clustering Tendency.

Partitioning methods, such as k-means clustering, also require the user to specify the number of clusters to be generated.

One fundamental question is: If the data is clusterable, then how to choose the right number of expected clusters (k)?

Unfortunately, there is no definitive answer to this question. The optimal clustering is somewhat subjective and depends on the method used for measuring similarities and the parameters used for partitioning.

A simple and popular solution consists of inspecting the dendrogram produced using hierarchical clustering to see if it suggests a particular number of clusters. Unfortunately this approach is, again, subjective.

In this article, we’ll describe different methods for determining the optimal number of clusters for k-means, PAM and hierarchical clustering. These methods include direct methods and statistical testing methods.

Direct methods consist of optimizing a criterion, such as the within-cluster sum of squares or the average silhouette. The corresponding methods are named elbow and silhouette methods, respectively.
Testing methods consist of comparing evidence against a null hypothesis. An example is the gap statistic.
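
As a first illustration of a direct method, the sketch below computes the total within-cluster sum of squares of k-means for k = 1 to 10 on the scaled iris measurements, using base R only. The range of k and the value of nstart are assumptions made for illustration.

```r
# Scaled iris measurements (species column removed)
iris.scaled <- scale(iris[, -5])

set.seed(123)
k_values <- 1:10

# Total within-cluster sum of squares for each number of clusters
wss <- sapply(k_values, function(k) {
  kmeans(iris.scaled, centers = k, nstart = 25)$tot.withinss
})

# Elbow plot: look for the k after which the curve flattens
plot(k_values, wss, type = "b", pch = 19,
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The curve always decreases as k grows, so the criterion is not minimized directly. The elbow method instead looks for the point beyond which adding another cluster brings little further reduction.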

In addition to elbow, silhouette and gap statistic methods, there are more than thirty other indices and methods that have been published for identifying the optimal number of clusters. We’ll provide R codes for computing all these 30 indices in order to decide the best number of clusters using the “majority rule”.

For each of these methods:

We’ll describe the basic idea, the algorithm and the key mathematical concept
We’ll provide easy-to-use R code with many examples for determining the optimal number of clusters and visualizing the output

1 Required packages

The following packages will be used:

cluster for computing pam and for analyzing cluster silhouettes
factoextra for visualizing clusters using ggplot2 plotting system
NbClust for finding the optimal number of clusters

Install the factoextra package as follows:

if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/factoextra")

The remaining packages can be installed using the code below:

pkgs <- c("cluster",  "NbClust")
install.packages(pkgs)

Load packages:

library(factoextra)
library(cluster)
library(NbClust)

2 Data preparation

The data set iris is used. We start by excluding the species column and scaling the data using the function scale():

# Load the data
data(iris)
head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

# Remove species column (5) and scale the data
iris.scaled <- scale(iris[, -5])

This iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
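As a quick sanity check on the scaling step, the sketch below verifies that each scaled column has mean 0 and standard deviation 1. The names col.means and col.sds are introduced here for illustration only.

```r
# Standardize the four numeric columns of iris (drop Species, column 5)
data(iris)
iris.scaled <- scale(iris[, -5])

# After scale(), every column should have mean ~0 and standard deviation 1
col.means <- colMeans(iris.scaled)
col.sds   <- apply(iris.scaled, 2, sd)

round(col.means, 10)
col.sds
```

Scaling matters here because the clustering methods below rely on Euclidean distances, which would otherwise be dominated by the variables with the largest variance.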

3 Example of partitioning method results

The functions kmeans() [in stats package] and pam() [in cluster package] are described in this section. We’ll split the data into 3 clusters as follows:

# K-means clustering
set.seed(123)
km.res <- kmeans(iris.scaled, 3, nstart = 25)
# k-means group number of each observation
km.res$cluster

##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 3 3 3 2 3 3 3 3 3 3 3 3 2 3 3 3 3
##  [71] 2 3 3 3 3 2 2 2 3 3 3 3 3 3 3 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2
## [106] 2 3 2 2 2 2 2 2 3 3 2 2 2 2 3 2 3 2 3 2 2 3 2 2 2 2 2 2 3 3 2 2 2 3 2
## [141] 2 2 3 2 2 2 3 2 2 3

# Visualize k-means clusters
fviz_cluster(km.res, data = iris.scaled, geom = "point",
             stand = FALSE, frame.type = "norm")

# PAM clustering
library("cluster")
pam.res <- pam(iris.scaled, 3)
pam.res$cluster

##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 3 3 3 2 3 3 3 3 3 3 3 3 2 3 3 3 3
##  [71] 3 3 3 3 3 2 2 2 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2
## [106] 2 3 2 2 2 2 2 2 3 2 2 2 2 2 3 2 3 2 3 2 2 3 3 2 2 2 2 2 3 3 2 2 2 3 2
## [141] 2 2 3 2 2 2 3 2 2 3

# Visualize pam clusters
fviz_cluster(pam.res, stand = FALSE, geom = "point",
             frame.type = "norm")

Read more about partitioning methods: Partitioning clustering

4 Example of hierarchical clustering results

The built-in R function hclust() is used:

# Compute pairwise distance matrix
dist.res <- dist(iris.scaled, method = "euclidean")
# Hierarchical clustering results
hc <- hclust(dist.res, method = "complete")
# Visualization of hclust
plot(hc, labels = FALSE, hang = -1)
# Add rectangle around 3 groups
rect.hclust(hc, k = 3, border = 2:4)

# Cut into 3 groups
hc.cut <- cutree(hc, k = 3)
head(hc.cut, 20)

##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
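To get a feel for what the three groups contain, one option is to cross-tabulate the cluster labels against the known species. This is only an illustrative sketch; the object name tab is an assumption, and the species labels are of course not used by the clustering itself.

```r
# Recompute the hierarchical clustering so the sketch is self-contained
data(iris)
iris.scaled <- scale(iris[, -5])
hc <- hclust(dist(iris.scaled, method = "euclidean"), method = "complete")
hc.cut <- cutree(hc, k = 3)

# Rows: cluster labels from cutree(); columns: true species
tab <- table(cluster = hc.cut, species = iris$Species)
tab
```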

Read more about hierarchical clustering: Hierarchical clustering

5 Three popular methods for determining the optimal number of clusters

In this section we describe the three most popular methods: i) the elbow method, ii) the silhouette method and iii) the gap statistic.

5.1 Elbow method

5.1.1 Concept

Recall that the basic idea behind partitioning methods, such as k-means clustering, is to define clusters such that the total intra-cluster variation (known as total within-cluster variation or total within-cluster sum of squares) is minimized:

$minimize\left(\sum\limits_{k=1}^{K} W(C_k)\right)$,

Where $C_k$ is the $k^{th}$ cluster, $K$ is the number of clusters and $W(C_k)$ is the within-cluster variation.

The total within-cluster sum of square (wss) measures the compactness of the clustering and we want it to be as small as possible.
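The quantity above can be computed by hand. The minimal sketch below assumes that $W(C_k)$ is the sum of squared Euclidean distances between the points of cluster k and its centroid, and compares the result with the tot.withinss value reported by kmeans(). The names wss.by.cluster and total.wss are illustrative.

```r
data(iris)
iris.scaled <- scale(iris[, -5])

set.seed(123)
km <- kmeans(iris.scaled, centers = 3, nstart = 25)

# W(C_k): sum of squared distances of the points in cluster k to its centroid
wss.by.cluster <- sapply(1:3, function(k) {
  pts <- iris.scaled[km$cluster == k, , drop = FALSE]
  centroid <- colMeans(pts)
  sum(sweep(pts, 2, centroid)^2)
})

# Total within-cluster sum of squares: the sum of W(C_k) over all clusters
total.wss <- sum(wss.by.cluster)
all.equal(total.wss, km$tot.withinss)
```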

5.1.2 Algorithm

The optimal number of clusters can be defined as follows:

Compute clustering algorithm (e.g., k-means clustering) for different values of k. For instance, by varying k from 1 to 10 clusters
For each k, calculate the total within-cluster sum of square (wss)
Plot the curve of wss according to the number of clusters k.
The location of a bend (knee) in the plot is generally considered as an indicator of the appropriate number of clusters.

5.1.3 R codes

5.1.3.1 Elbow method for k-means clustering

set.seed(123)
# Compute and plot wss for k = 1 to k = 15
k.max <- 15 # Maximal number of clusters
data <- iris.scaled
wss <- sapply(1:k.max, 
        function(k){kmeans(data, k, nstart=10 )$tot.withinss})

plot(1:k.max, wss,
       type="b", pch = 19, frame = FALSE, 
       xlab="Number of clusters K",
       ylab="Total within-clusters sum of squares")
abline(v = 3, lty =2)

The elbow method suggests a 3-cluster solution.

The elbow method is implemented in the factoextra package and can be easily computed using the function fviz_nbclust(), whose format is:

fviz_nbclust(x, FUNcluster, method = c("silhouette", "wss"))

x: numeric matrix or data frame
FUNcluster: a partitioning function such as kmeans, pam, clara etc
method: the method to be used for determining the optimal number of clusters.

The R code below computes the elbow method for kmeans():

fviz_nbclust(iris.scaled, kmeans, method = "wss") +
    geom_vline(xintercept = 3, linetype = 2)

Three clusters are suggested.

5.1.3.2 Elbow method for PAM clustering

It’s possible to use the function fviz_nbclust() as follows:

fviz_nbclust(iris.scaled, pam, method = "wss") +
  geom_vline(xintercept = 3, linetype = 2)

Three clusters are suggested.

5.1.3.3 Elbow method for hierarchical clustering

We’ll use a helper function hcut() [in factoextra package] which will compute hierarchical clustering (HC) algorithm and cut the dendrogram in k clusters:

fviz_nbclust(iris.scaled, hcut, method = "wss") +
  geom_vline(xintercept = 3, linetype = 2)

Three clusters are suggested.
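For readers who prefer to see the computation spelled out, the elbow curve for hierarchical clustering can also be sketched in base R. This is not how fviz_nbclust() is implemented internally; it is a minimal illustration that cuts the dendrogram with cutree() and uses a hypothetical helper, total_wss(), to compute the total within-cluster sum of squares.

```r
data(iris)
iris.scaled <- scale(iris[, -5])
hc <- hclust(dist(iris.scaled, method = "euclidean"), method = "complete")

# Hypothetical helper: total within-cluster sum of squares for labels cl
total_wss <- function(x, cl) {
  sum(sapply(unique(cl), function(k) {
    pts <- x[cl == k, , drop = FALSE]
    sum(sweep(pts, 2, colMeans(pts))^2)
  }))
}

# wss for k = 1 to k = 10 clusters obtained by cutting the same tree
k.max <- 10
wss <- sapply(1:k.max, function(k) total_wss(iris.scaled, cutree(hc, k = k)))
round(wss, 2)
```

The curve can then be drawn with plot(1:k.max, wss, type = "b"). Because the partitions obtained by cutting one tree are nested, wss can only decrease as k grows, so the bend is what matters, not the minimum.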

Note that the elbow method is sometimes ambiguous. An alternative is the average silhouette method (Kaufman and Rousseeuw [1990]), which can also be used with any clustering approach.

5.2 Average silhouette method

5.2.1 Concept

The average silhouette approach will be described comprehensively in the chapter on cluster validation statistics. Briefly, it measures the quality of a clustering. That is, it determines how well each object lies within its cluster. A high average silhouette width indicates a good clustering.

The average silhouette method computes the average silhouette of observations for different values of k. The optimal number of clusters k is the one that maximizes the average silhouette over a range of possible values for k (Kaufman and Rousseeuw [1990]).

5.2.2 Algorithm

The algorithm is similar to the elbow method and can be computed as follows:

Compute clustering algorithm (e.g., k-means clustering) for different values of k. For instance, by varying k from 2 to 10 clusters (the silhouette width is not defined for a single cluster)
For each k, calculate the average silhouette of observations (avg.sil)
Plot the curve of avg.sil according to the number of clusters k.
The location of the maximum is considered as the appropriate number of clusters.

5.2.3 R codes

The function silhouette() [in cluster package] is used to compute the average silhouette width.

5.2.3.1 Average silhouette method for k-means clustering

The R code below determines the optimal number of clusters k for k-means clustering:

library(cluster)
k.max <- 15
data <- iris.scaled
sil <- rep(0, k.max)

# Compute the average silhouette width for 
# k = 2 to k = 15
for(i in 2:k.max){
  km.res <- kmeans(data, centers = i, nstart = 25)
  ss <- silhouette(km.res$cluster, dist(data))
  sil[i] <- mean(ss[, 3])
}

# Plot the  average silhouette width
plot(1:k.max, sil, type = "b", pch = 19, 
     frame = FALSE, xlab = "Number of clusters k")
abline(v = which.max(sil), lty = 2)

The function fviz_nbclust() [in factoextra package] can also be used. It just requires the cluster package to be installed:

require(cluster)
fviz_nbclust(iris.scaled, kmeans, method = "silhouette")

Two clusters are suggested.

5.2.3.2 Average silhouette method for PAM clustering

require(cluster)
fviz_nbclust(iris.scaled, pam, method = "silhouette")

Two clusters are suggested.
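The same curve can be sketched without factoextra, since pam() already stores the average silhouette width of its solution in the silinfo component. The loop below is only an illustration (sil.pam and best.k are names chosen here), not the internal code of fviz_nbclust().

```r
library(cluster)
data(iris)
iris.scaled <- scale(iris[, -5])

k.max <- 10
sil.pam <- rep(0, k.max)   # the silhouette is undefined for k = 1, left at 0

# Average silhouette width of the PAM solution for k = 2 to k = 10
for (k in 2:k.max) {
  sil.pam[k] <- pam(iris.scaled, k)$silinfo$avg.width
}

# Number of clusters with the largest average silhouette width
best.k <- which.max(sil.pam)
round(sil.pam, 3)
best.k
```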

5.2.3.3 Average silhouette method for hierarchical clustering

require(cluster)
fviz_nbclust(iris.scaled, hcut, method = "silhouette",
             hc_method = "complete")

Three clusters are suggested.

5.3 Conclusions about elbow and silhouette methods

A three-cluster solution is suggested using k-means, PAM and hierarchical clustering in combination with the elbow method.
The average silhouette method gives a two-cluster solution using the k-means and PAM algorithms. Combining hierarchical clustering and the silhouette method returns 3 clusters.

According to these observations, it’s possible to define k = 3 as the optimal number of clusters in the data.

The disadvantage of the elbow and average silhouette methods is that they measure a global clustering characteristic only. A more sophisticated method is to use the gap statistic, which provides a statistical procedure to formalize the elbow/silhouette heuristic in order to estimate the optimal number of clusters.

5.4 Gap statistic method

5.4.1 Concept

The gap statistic was published by R. Tibshirani, G. Walther, and T. Hastie (Stanford University, 2001). The approach can be applied to any clustering method (K-means clustering, hierarchical clustering, …).

The gap statistic compares the total within intracluster variation for different values of k with their expected values under null reference distribution of the data, i.e. a distribution with no obvious clustering.

Recall that, the total within intra-cluster variation for a given k clusters is the total within sum of square ($w_k$).

The reference dataset is generated using Monte Carlo simulations of the sampling process. That is, for each variable ($x_i$) in the data set we compute its range [$min(x_i), max(x_i)$] and generate values for the n points uniformly from the interval min to max.

Note that, the function runif(n, min, max) can be used to generate random uniform distribution.
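A minimal sketch of this sampling step is shown below: one reference data set is drawn column by column, uniformly over the observed range of each variable. The object name ref.data is an assumption made for illustration.

```r
data(iris)
iris.scaled <- scale(iris[, -5])

set.seed(123)
n <- nrow(iris.scaled)

# One reference data set: each variable is drawn uniformly on [min, max]
ref.data <- apply(iris.scaled, 2, function(x) {
  runif(n, min = min(x), max = max(x))
})

dim(ref.data)
# Ranges of the simulated variables stay inside the observed ranges
apply(ref.data, 2, range)
```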

For the observed data and the reference data, the total intracluster variation is computed using different values of k. The gap statistic for a given k is defined as follows:

\[ Gap_n(k) = E_n^*\{log(W_k)\} - log(W_k) \]

Where $E_n^*$ denotes the expectation under a sample of size $n$ from the reference distribution. $E_n^*$ is defined via bootstrapping (B) by generating B copies of the reference datasets and, by computing the average $log(W_k^*)$.

Note that, the logarithm of the $W_k$ values is used, as they can be quite large.

The gap statistic measures the deviation of the observed $W_k$ value from its expected value under the null hypothesis.

The estimate of the optimal number of clusters $\hat{k}$ is the value of k that maximizes $Gap_n(k)$ (i.e., that yields the largest gap statistic). This means that the clustering structure is far away from the uniform distribution of points.

Note that using B = 500 gives quite precise results, so that the gap plot is basically unchanged after another run.

The standard deviation ($sd_k$) of $log(W_k^*)$ is also computed in order to define the standard error ($s_k$) of the simulation as follows:

\[ s_k = sd_k \times \sqrt{1 + 1/B} \]

Finally, a more robust approach is to choose the optimal number of clusters K as the smallest k such that:

\[Gap(k) \geq Gap(k+1) - s_{k+1}\]

That is, we choose the smallest value of k such that the gap statistic is within one standard deviation of the gap at k+1.

5.4.2 Algorithm

The algorithm involves the following steps (Read the original paper of the gap statistic):

Cluster the observed data, varying the number of clusters from k = 1, …, $k_{max}$, and compute the corresponding $W_k$.
Generate B reference data sets and cluster each of them with varying number of clusters k = 1, …, $k_{max}$. Compute the estimated gap statistic $Gap(k) = \frac{1}{B} \sum\limits_{b=1}^B log(W_{kb}^*) - log(W_k)$.
Let $\bar{w} = (1/B) \sum_b log(W^*_{kb})$, compute the standard deviation $sd_k = \sqrt{(1/B) \sum_b (log(W^*_{kb}) - \bar{w})^2}$ and define $s_k = sd_k \times \sqrt{1 + 1/B}$.
Choose the number of clusters as the smallest k such that $Gap(k) \geq Gap(k+1) - s_{k+1}$.
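The four steps above can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the clusGap() implementation: $W_k$ is taken to be the k-means tot.withinss, the reference data are drawn uniformly over each variable's range, and small values of k.max and B are used to keep it fast. The helper log_wk() and all object names are hypothetical, and the numbers need not match those returned by clusGap().

```r
data(iris)
x <- scale(iris[, -5])
set.seed(123)
k.max <- 6
B <- 20   # small on purpose; the text recommends a much larger B

# Assumed helper: log(W_k) using the k-means total within-cluster sum of squares
log_wk <- function(data, k) {
  log(kmeans(data, centers = k, nstart = 10, iter.max = 50)$tot.withinss)
}

# Step 1: log(W_k) on the observed data
logW <- sapply(1:k.max, function(k) log_wk(x, k))

# Step 2: log(W*_kb) on B uniform reference data sets (rows = b, columns = k)
logW.ref <- matrix(NA_real_, nrow = B, ncol = k.max)
for (b in 1:B) {
  ref <- apply(x, 2, function(v) runif(nrow(x), min(v), max(v)))
  logW.ref[b, ] <- sapply(1:k.max, function(k) log_wk(ref, k))
}
gap <- colMeans(logW.ref) - logW

# Step 3: standard deviation (divisor B) and simulation standard error s_k
sd.k <- apply(logW.ref, 2, function(v) sqrt(mean((v - mean(v))^2)))
s.k <- sd.k * sqrt(1 + 1 / B)

# Step 4: smallest k such that Gap(k) >= Gap(k+1) - s_{k+1}
ok <- gap[-k.max] >= gap[-1] - s.k[-1]
k.hat <- if (any(ok)) which(ok)[1] else k.max
round(cbind(logW, gap, s.k), 4)
k.hat
```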

5.4.3 R codes

5.4.3.1 R function for computing the gap statistic

The R function clusGap() [in cluster package] can be used to estimate the number of clusters in the data by applying the gap statistic.

A simplified format is:

clusGap(x, FUNcluster, K.max, B = 100, verbose = TRUE, ...)

x: numeric matrix or data frame
FUNcluster: a function (e.g.: kmeans, pam, …) which accepts i) a data matrix like x as first argument; ii) the number of clusters desired (k >= 2) as a second argument; and returns a list containing a component named cluster which is a vector of length $n = nrow(x)$ of integers in 1:k determining the clustering or grouping of the n observations.
K.max: the maximum number of clusters to consider, must be at least two.
B: the number of Monte Carlo (“bootstrap”) samples.
verbose: if TRUE, the computing progression is shown.
…: Further arguments for FUNcluster(), see kmeans example below.

The clusGap() function returns an object of class “clusGap” whose main component is Tab, a matrix with K.max rows and 4 columns named “logW”, “E.logW”, “gap” and “SE.sim”. Recall that $gap = E.logW - logW$ and SE.sim is the standard error of gap.

5.4.3.2 Gap statistic for k-means clustering

The R code below shows some examples using the clusGap() function.

We’ll use B = 50 to keep the function speedy. Note that, it’s recommended to use B = 500 for your analysis.

The output of clusGap() function can be visualized using the function fviz_gap_stat() [in factoextra].

# Compute gap statistic
library(cluster)
set.seed(123)
gap_stat <- clusGap(iris.scaled, FUN = kmeans, nstart = 25,
                    K.max = 10, B = 50)

# Print the result
print(gap_stat, method = "firstmax")

## Clustering Gap statistic ["clusGap"].
## B=50 simulated reference sets, k = 1..10
##  --> Number of clusters (method 'firstmax'): 3
##           logW   E.logW       gap     SE.sim
##  [1,] 4.534565 4.754595 0.2200304 0.02504585
##  [2,] 4.021316 4.489687 0.4683711 0.02742112
##  [3,] 3.806577 4.295715 0.4891381 0.02384746
##  [4,] 3.699263 4.143675 0.4444115 0.02093871
##  [5,] 3.589284 4.052262 0.4629781 0.02036366
##  [6,] 3.519726 3.972254 0.4525278 0.02049566
##  [7,] 3.448288 3.905945 0.4576568 0.02106987
##  [8,] 3.398210 3.850807 0.4525967 0.01969193
##  [9,] 3.334279 3.802315 0.4680368 0.01905974
## [10,] 3.250246 3.759661 0.5094149 0.01928183

# Base plot of gap statistic
plot(gap_stat, frame = FALSE, xlab = "Number of clusters k")
abline(v = 3, lty = 2)

# Use factoextra
fviz_gap_stat(gap_stat)

In our example, the algorithm suggests k = 3

The optimal number of clusters, k, is computed using the “firstmax” method (see ?cluster::maxSE). The criterion proposed by Tibshirani et al. (2001) can be used as follows:

# Print
print(gap_stat, method = "Tibs2001SEmax")
# Plot
fviz_gap_stat(gap_stat, 
              maxSE = list(method = "Tibs2001SEmax"))
# Relax the gap test to be within two standard errors
fviz_gap_stat(gap_stat, 
          maxSE = list(method = "Tibs2001SEmax", SE.factor = 2))
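To see how the selection rules differ, the function maxSE() [in cluster package] can be applied directly to the gap and SE.sim columns. The sketch below copies those two columns from the k-means output printed above, so it runs without recomputing clusGap(). The object names are illustrative, and the last line re-derives the Tibshirani criterion by hand as a cross-check.

```r
library(cluster)

# "gap" and "SE.sim" columns copied from the printed clusGap() table (k = 1..10)
gap <- c(0.2200304, 0.4683711, 0.4891381, 0.4444115, 0.4629781,
         0.4525278, 0.4576568, 0.4525967, 0.4680368, 0.5094149)
se  <- c(0.02504585, 0.02742112, 0.02384746, 0.02093871, 0.02036366,
         0.02049566, 0.02106987, 0.01969193, 0.01905974, 0.01928183)

# Number of clusters selected by three of the rules offered by maxSE()
k.firstmax <- maxSE(gap, se, method = "firstmax")
k.tibs     <- maxSE(gap, se, method = "Tibs2001SEmax")
k.global   <- maxSE(gap, se, method = "globalmax")
c(firstmax = k.firstmax, Tibs2001SEmax = k.tibs, globalmax = k.global)

# Manual check of the rule: smallest k with Gap(k) >= Gap(k+1) - s_{k+1}
k.manual <- which(gap[-10] >= gap[-1] - se[-1])[1]
```

With these particular values, “firstmax” selects k = 3, whereas the Tibshirani criterion already stops at k = 2 because Gap(2) lies within one standard error of Gap(3). The global maximum is at k = 10. The choice of rule can therefore change the answer, which is worth keeping in mind when reading the gap plot.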

5.4.3.3 Gap statistic for PAM clustering

We don’t need the argument “nstart” which is specific to kmeans() function.

# Compute gap statistic
set.seed(123)
gap_stat <- clusGap(iris.scaled, FUN = pam, K.max = 10, B = 50)
# Plot gap statistic
fviz_gap_stat(gap_stat)

A three-cluster solution is suggested.

5.4.3.4 Gap statistic for hierarchical clustering

# Compute gap statistic
set.seed(123)
gap_stat <- clusGap(iris.scaled, FUN = hcut, K.max = 10, B = 50)
# Plot gap statistic
fviz_gap_stat(gap_stat)

A three-cluster solution is suggested.

6 NbClust: A Package providing 30 indices for determining the best number of clusters

6.1 Overview of NbClust package

As mentioned in the introduction of this article, many indices have been proposed in the literature for determining the optimal number of clusters in a partitioning of a data set during the clustering process.

NbClust package, published by Charrad et al., 2014, provides 30 indices for determining the relevant number of clusters and proposes to users the best clustering scheme from the different results obtained by varying all combinations of number of clusters, distance measures, and clustering methods.

An important advantage of NbClust is that the user can simultaneously compute multiple indices and determine the number of clusters in a single function call.

The indices provided in the NbClust package include the gap statistic, the silhouette method and 28 other indices described comprehensively in the original paper of Charrad et al., 2014.

6.2 NbClust R function

The simplified format of the function NbClust() is:

NbClust(data = NULL, diss = NULL, distance = "euclidean",
        min.nc = 2, max.nc = 15, method = NULL, index = "all")

data: matrix
diss: dissimilarity matrix to be used. By default, diss=NULL, but if it is replaced by a dissimilarity matrix, distance should be “NULL”
distance: the distance measure to be used to compute the dissimilarity matrix. Possible values include “euclidean”, “manhattan” or “NULL”.
min.nc, max.nc: minimal and maximal number of clusters, respectively
method: The cluster analysis method to be used including “ward.D”, “ward.D2”, “single”, “complete”, “average” and more
index: the index to be calculated including “silhouette”, “gap” and more.

The value of NbClust() function includes the following elements:

All.index: Values of indices for each partition of the dataset obtained with a number of clusters between min.nc and max.nc
All.CriticalValues: Critical values of some indices for each partition obtained with a number of clusters between min.nc and max.nc
Best.nc: Best number of clusters proposed by each index and the corresponding index value
Best.partition: Partition that corresponds to the best number of clusters

6.3 Examples of usage

Note that the user can request indices one by one, by setting the argument index to the name of the index of interest, for example index = “gap”.

In this case, NbClust function displays:

the gap statistic values of the partitions obtained with number of clusters varying from min.nc to max.nc ($All.index)
the optimal number of clusters ($Best.nc)
and the partition corresponding to the best number of clusters ($Best.partition)

6.3.1 Compute only an index of interest

The following example determines the number of clusters using the gap statistic:

library("NbClust")
set.seed(123)
res.nb <- NbClust(iris.scaled, distance = "euclidean",
                  min.nc = 2, max.nc = 10, 
                  method = "complete", index ="gap") 
res.nb # print the results

## $All.index
##       2       3       4       5       6       7       8       9      10 
## -0.2899 -0.2303 -0.6915 -0.8606 -1.0506 -1.3223 -1.3303 -1.4759 -1.5551 
## 
## $All.CriticalValues
##       2       3       4       5       6       7       8       9      10 
## -0.0539  0.4694  0.1787  0.2009  0.2848  0.0230  0.1631  0.0988  0.1708 
## 
## $Best.nc
## Number_clusters     Value_Index 
##          3.0000         -0.2303 
## 
## $Best.partition
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 3 3 3 2 3 2 3 2 3 2 2 3 2 3 3 3 3 2 2 2
##  [71] 3 3 3 3 3 3 3 3 3 2 2 2 2 3 3 3 3 2 3 2 2 3 2 2 2 3 3 3 2 2 3 3 3 3 3
## [106] 3 2 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [141] 3 3 3 3 3 3 3 3 3 3

The elements returned by the function NbClust() are accessible using the R code below:

# All gap statistic values
res.nb$All.index

# Best number of clusters
res.nb$Best.nc

# Best partition
res.nb$Best.partition

6.3.2 Compute all the 30 indices

The following example computes all the 30 indices, in a single function call, for determining the number of clusters and suggests to the user the best clustering scheme. The description of the indices is available in the NbClust documentation (see ?NbClust).

To compute multiple indices simultaneously, the possible values for the argument index can be i) “alllong” or ii) “all”. The option “alllong” requires more time, as the run of some indices, such as Gamma, Tau, Gap and Gplus, is computationally very expensive. The user can avoid computing these four indices by setting the argument index to “all”. In this case, only 26 indices are calculated.

With the “alllong” option, the output of the NbClust function contains:

all validation indices
critical values for Duda, Gap, PseudoT2 and Beale indices
the number of clusters corresponding to the optimal score for each index
the best number of clusters proposed by NbClust according to the majority rule
the best partition

The R code below computes NbClust() with index = “all”:

nb <- NbClust(iris.scaled, distance = "euclidean", min.nc = 2,
        max.nc = 10, method = "complete", index ="all")

# Print the result
nb

It’s possible to visualize the result using the function fviz_nbclust() [in factoextra], as follow:

fviz_nbclust(nb) + theme_minimal()

## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 1 proposed  1 as the best number of clusters
## * 2 proposed  2 as the best number of clusters
## * 18 proposed  3 as the best number of clusters
## * 3 proposed  10 as the best number of clusters
## 
## Conclusion
## =========================
## * Accoridng to the majority rule, the best number of clusters is  3 .
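The majority rule itself is just a tally of votes. The sketch below rebuilds a hypothetical vote vector from the counts printed above (it does not call NbClust()) and picks the most frequent proposal. With a real result, the votes would presumably be read from the “Number_clusters” row of nb$Best.nc, which is an assumption about the object layout.

```r
# Hypothetical vote vector reproducing the counts printed above:
# 2 indices proposed 0, 1 proposed 1, 2 proposed 2, 18 proposed 3, 3 proposed 10
votes <- rep(c(0, 1, 2, 3, 10), times = c(2, 1, 2, 18, 3))

# Count how many indices voted for each number of clusters
tally <- table(votes)
tally

# Majority rule: the number of clusters with the most votes
best.k <- as.integer(names(tally)[which.max(tally)])
best.k
```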

In summary, 18 indices proposed 3 as the best number of clusters, 3 proposed 10 and 2 proposed 2. According to the majority rule, the best number of clusters is 3.

7 Infos

This analysis has been performed using R software (ver. 3.2.1)

Charrad M., Ghazzali N., Boiteau V., Niknafs A. (2014). NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set. Journal of Statistical Software, 61(6), 1-36.
Kaufman, L. and Rousseeuw, P.J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York.
Tibshirani, R., Walther, G. and Hastie, T. (2001). Estimating the number of data clusters via the Gap statistic. Journal of the Royal Statistical Society B, 63, 411–423.