<?xml version="1.0" encoding="UTF-8" ?>
<!-- RSS generated by PHPBoost on Thu, 07 May 2026 08:10:12 +0200 -->

<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
	<channel>
		<title><![CDATA[Last articles - STHDA : Cluster Analysis in R: Practical Guide]]></title>
		<atom:link href="https://www.sthda.com/english/syndication/rss/articles/25" rel="self" type="application/rss+xml"/>
		<link>https://www.sthda.com</link>
		<description><![CDATA[Last articles - STHDA : Cluster Analysis in R: Practical Guide]]></description>
		<copyright>(C) 2005-2026 PHPBoost</copyright>
		<language>en</language>
		<generator>PHPBoost</generator>
		
		
		<item>
			<title><![CDATA[Types of Clustering Methods: Overview and Quick Start R Code ]]></title>
			<link>https://www.sthda.com/english/articles/25-cluster-analysis-in-r-practical-guide/111-types-of-clustering-methods-overview-and-quick-start-r-code/</link>
			<guid>https://www.sthda.com/english/articles/25-cluster-analysis-in-r-practical-guide/111-types-of-clustering-methods-overview-and-quick-start-r-code/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">
<p><strong>Clustering methods</strong> are used to identify groups of similar objects in multivariate data sets collected from fields such as marketing, biomedicine and geo-spatial analysis. There are different <strong>types of clustering</strong> methods, including:</p>
<ul>
<li>Partitioning methods</li>
<li>Hierarchical clustering</li>
<li>Fuzzy clustering</li>
<li>Density-based clustering</li>
<li>Model-based clustering</li>
</ul>
<p><img src="https://www.sthda.com/english/sthda-upload/images/cluster-analysis/types-of-clustering-methods.png" alt="Types of clustering methods" /></p>
<div class="block">
<p>
In this article, we provide an overview of clustering methods and quick start R code to perform cluster analysis in R:
</p>
<ul>
<li>
We start by presenting the required R packages and the data format for cluster analysis and visualization.
</li>
<li>
Next, we describe the two standard <em>clustering techniques</em> [partitioning methods (k-means, PAM, CLARA) and hierarchical clustering] as well as how to assess the quality of a clustering analysis.
</li>
<li>
Finally, we describe advanced clustering approaches to find patterns of any shape in large data sets with noise and outliers.
</li>
</ul>
</div>
<br/>
<p>Contents:</p>
<div id="TOC">
<ul>
<li><a href="#installing-and-loading-required-r-packages">Installing and loading required R packages</a></li>
<li><a href="#data-preparation">Data preparation</a></li>
<li><a href="#distance-measures">Distance measures</a></li>
<li><a href="#partitioning-clustering">Partitioning clustering</a></li>
<li><a href="#hierarchical-clustering">Hierarchical clustering</a></li>
<li><a href="#clustering-validation-and-evaluation">Clustering validation and evaluation</a><ul>
<li><a href="#assessing-clustering-tendency">Assessing clustering tendency</a></li>
<li><a href="#determining-the-optimal-number-of-clusters">Determining the optimal number of clusters</a></li>
<li><a href="#clustering-validation-statistics">Clustering validation statistics</a></li>
<li><a href="#see-also">See also:</a></li>
</ul></li>
<li><a href="#advanced-clustering-methods">Advanced clustering methods</a><ul>
<li><a href="#hybrid-clustering-methods">Hybrid clustering methods</a></li>
<li><a href="#fuzzy-clustering">Fuzzy clustering</a></li>
<li><a href="#model-based-clustering">Model-based clustering</a></li>
<li><a href="#dbscan-density-based-clustering">DBSCAN: Density-based clustering</a></li>
</ul></li>
<li><a href="#references">References</a></li>
</ul>
</div>
<br/>
<p>Related Book:</p>
<div class = "small-block content-privileged-friends cluster-book">
    <center>
        <a href = "https://www.sthda.com/english/web/5-bookadvisor/17-practical-guide-to-cluster-analysis-in-r/">
          <img src = "https://www.sthda.com/english/sthda-upload/images/cluster-analysis/clustering-book-cover.png" /><br/>
      Practical Guide to Cluster Analysis in R
      </a>
      </center>
</div>
<div class="spacer"></div>
<div id="installing-and-loading-required-r-packages" class="section level2">
<h2>Installing and loading required R packages</h2>
<p>We’ll use mainly two R packages:</p>
<ul>
<li>cluster package: for computing clustering</li>
<li>factoextra package: for elegant ggplot2-based data visualization. Online documentation at: <a href="https://www.sthda.com/english/rpkgs/factoextra/" class="uri">https://www.sthda.com/english/rpkgs/factoextra/</a></li>
</ul>
<p>Accessory packages:</p>
<ul>
<li>magrittr for piping: %>%</li>
</ul>
<p>Install:</p>
<pre class="r"><code>install.packages("factoextra")
install.packages("cluster")
install.packages("magrittr")</code></pre>
<p>Load packages:</p>
<pre class="r"><code>library("cluster")
library("factoextra")
library("magrittr")</code></pre>
</div>
<div id="data-preparation" class="section level2">
<h2>Data preparation</h2>
<ul>
<li>Demo data set: the built-in R data set named USArrests</li>
<li>Remove missing data</li>
<li>Scale variables to make them comparable</li>
</ul>
<p>Read more: <a href="https://www.sthda.com/english/articles/26-clustering-basics/85-data-preparation-and-essential-r-packages-for-cluster-analysis/">Data Preparation and Essential R Packages for Cluster Analysis</a></p>
<pre class="r"><code># Load  and prepare the data
data("USArrests")
my_data <- USArrests %>%
  na.omit() %>%          # Remove missing values (NA)
  scale()                # Scale variables
# View the first 3 rows
head(my_data, n = 3)</code></pre>
<pre><code>##         Murder Assault UrbanPop     Rape
## Alabama 1.2426   0.783   -0.521 -0.00342
## Alaska  0.5079   1.107   -1.212  2.48420
## Arizona 0.0716   1.479    0.999  1.04288</code></pre>
</div>
<div id="distance-measures" class="section level2">
<h2>Distance measures</h2>
<p>The classification of objects into clusters requires some method for measuring the distance or the (dis)similarity between the objects. The chapter <a href="https://www.sthda.com/english/articles/26-clustering-basics/86-clustering-distance-measures-essentials/">Clustering Distance Measures Essentials</a> covers the common distance measures used for assessing similarity between observations.</p>
<p>It’s simple to compute and visualize a distance matrix using the functions <a href="https://www.sthda.com/english/rpkgs/factoextra/dist.html">get_dist() and fviz_dist()</a> [factoextra R package]:</p>
<ul>
<li><code>get_dist()</code>: for computing a distance matrix between the rows of a data matrix. Compared to the standard <code>dist()</code> function, it supports correlation-based distance measures including “pearson”, “kendall” and “spearman” methods.</li>
<li><code>fviz_dist()</code>: for visualizing a distance matrix</li>
</ul>
<pre class="r"><code>res.dist <- get_dist(USArrests, stand = TRUE, method = "pearson")
fviz_dist(res.dist, 
   gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/060-types-of-clustering-methods-distance-matrix-1.png" width="518.4" /></p>
<p>Read more: <a href="https://www.sthda.com/english/articles/26-clustering-basics/86-clustering-distance-measures-essentials/">Clustering Distance Measures Essentials</a></p>
</div>
<div id="partitioning-clustering" class="section level2">
<h2>Partitioning clustering</h2>
<p>Partitioning algorithms are clustering techniques that subdivide the data sets into a set of k groups, where k is the number of groups pre-specified by the analyst.</p>
<p>There are different types of partitioning clustering methods. The most popular is <a href="https://www.sthda.com/english/articles/27-partitioning-clustering-essentials/87-k-means-clustering-essentials/">K-means clustering</a> <span class="citation">(MacQueen 1967)</span>, in which each cluster is represented by the center, or mean, of the data points belonging to the cluster. The K-means method is sensitive to outliers.</p>
<p>An alternative to k-means clustering is the <a href="https://www.sthda.com/english/articles/27-partitioning-clustering-essentials/88-k-medoids-essentials/">K-medoids clustering</a> or PAM (Partitioning Around Medoids, Kaufman &amp; Rousseeuw, 1990), which is less sensitive to outliers compared to k-means.</p>
<p>Read more: <a href="https://www.sthda.com/english/articles/27-partitioning-clustering-essentials/">Partitioning Clustering methods</a>.</p>
<p>The following R code shows how to determine the optimal number of clusters and how to compute k-means and PAM clustering in R.</p>
<ol style="list-style-type: decimal">
<li><a href="https://www.sthda.com/english/articles/29-cluster-validation-essentials/96-determining-the-optimal-number-of-clusters-3-must-know-methods/">Determining the optimal number of clusters</a>: use <code>factoextra::fviz_nbclust()</code></li>
</ol>
<pre class="r"><code>library("factoextra")
fviz_nbclust(my_data, kmeans, method = "gap_stat")</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/060-types-of-clustering-methods-optimal-number-of-clusters-1.png" width="384" /></p>
<p>Suggested number of clusters: 3</p>
<p><a href="https://www.sthda.com/english/articles/27-partitioning-clustering-essentials/87-k-means-clustering-essentials/">Compute and visualize k-means clustering</a></p>
<pre class="r"><code>set.seed(123)
km.res <- kmeans(my_data, 3, nstart = 25)
# Visualize
library("factoextra")
fviz_cluster(km.res, data = my_data,
             ellipse.type = "convex",
             palette = "jco",
             ggtheme = theme_minimal())</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/060-types-of-clustering-methods-k-means-plot-ggplot2-factoextra-1.png" width="480" /></p>
<p>Similarly, the <a href="https://www.sthda.com/english/articles/27-partitioning-clustering-essentials/88-k-medoids-essentials/">k-medoids/PAM clustering</a> can be computed as follows:</p>
<pre class="r"><code># Compute PAM
library("cluster")
pam.res <- pam(my_data, 3)
# Visualize
fviz_cluster(pam.res)</code></pre>
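<p>For larger data sets, the CLARA algorithm (Clustering Large Applications) applies PAM to repeated samples of the data. A minimal sketch using the <code>clara()</code> function [cluster package]; the <code>samples</code> value is illustrative and USArrests is used only to keep the example self-contained:</p>
<pre class="r"><code># CLARA: PAM-like clustering for large data sets [cluster package]
library("cluster")
library("factoextra")
data("USArrests")
my_data <- scale(na.omit(USArrests))
clara.res <- clara(my_data, 3, samples = 50)
# Visualize
fviz_cluster(clara.res)</code></pre>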
</div>
<div id="hierarchical-clustering" class="section level2">
<h2>Hierarchical clustering</h2>
<p>Hierarchical clustering is an alternative to partitioning clustering for identifying groups in a data set. It does not require pre-specifying the number of clusters to be generated.</p>
<p>The result of hierarchical clustering is a tree-based representation of the objects, also known as a dendrogram. Observations can be subdivided into groups by cutting the dendrogram at a desired similarity level.</p>
<p>R code to compute and visualize hierarchical clustering:</p>
<pre class="r"><code># Compute hierarchical clustering
res.hc <- USArrests %>%
  scale() %>%                    # Scale the data
  dist(method = "euclidean") %>% # Compute dissimilarity matrix
  hclust(method = "ward.D2")     # Compute hierarchical clustering
# Visualize using factoextra
# Cut in 4 groups and color by groups
fviz_dend(res.hc, k = 4, # Cut in four groups
          cex = 0.5, # label size
          k_colors = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),
          color_labels_by_k = TRUE, # color labels by groups
          rect = TRUE # Add rectangle around groups
          )</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/060-types-of-clustering-methods-hierarchical-clustering-r-1.png" width="518.4" /></p>
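<p>The group membership obtained by cutting the dendrogram can also be extracted programmatically with the base R function <code>cutree()</code>. A minimal sketch, recomputing the same tree as above:</p>
<pre class="r"><code># Cut the dendrogram into 4 groups [base R]
res.hc <- hclust(dist(scale(USArrests), method = "euclidean"),
                 method = "ward.D2")
grp <- cutree(res.hc, k = 4)
table(grp)                       # group sizes
rownames(USArrests)[grp == 1]    # members of group 1</code></pre>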
<p>Read more: <a href="https://www.sthda.com/english/articles/28-hierarchical-clustering-essentials/">Hierarchical clustering</a></p>
<p>See also:</p>
<ul>
<li><a href="https://www.sthda.com/english/articles/28-hierarchical-clustering-essentials/94-divisive-hierarchical-clustering-essentials/">Divisive Clustering</a></li>
<li><a href="https://www.sthda.com/english/articles/28-hierarchical-clustering-essentials/91-comparing-dendrograms-essentials/">Compare Dendrograms</a></li>
<li><a href="https://www.sthda.com/english/articles/28-hierarchical-clustering-essentials/92-visualizing-dendrograms-ultimate-guide/">Visualize Dendrograms</a></li>
<li><a href="https://www.sthda.com/english/articles/28-hierarchical-clustering-essentials/93-heatmap-static-and-interactive-absolute-guide/">Heatmap: Static and Interactive</a></li>
</ul>
</div>
<div id="clustering-validation-and-evaluation" class="section level2">
<h2>Clustering validation and evaluation</h2>
<p>Clustering validation and evaluation strategies consist of measuring the goodness of clustering results. Before applying any clustering algorithm to a data set, the first thing to do is to assess the <em>clustering tendency</em>, that is, whether the data contain any inherent grouping structure.</p>
<p>If yes, then how many clusters are there? Next, you can perform hierarchical clustering or partitioning clustering (with a pre-specified number of clusters). Finally, you can use a number of measures, described in this chapter, to evaluate the goodness of the clustering results.</p>
<p>Read more: <a href="https://www.sthda.com/english/articles/29-cluster-validation-essentials/">Cluster Validation Essentials</a></p>
<div id="assessing-clustering-tendency" class="section level3">
<h3>Assessing clustering tendency</h3>
<p>To assess the clustering tendency, the Hopkins statistic and a visual approach can be used. Both can be computed using the function <code>get_clust_tendency()</code> [factoextra package], which also creates an ordered dissimilarity image (ODI).</p>
<ul>
<li><em>Hopkins statistic</em>: if the value of the Hopkins statistic is close to 1 (far above 0.5), then we can conclude that the data set is significantly clusterable.</li>
<li><em>Visual approach</em>: the visual approach detects the clustering tendency by counting the number of square-shaped dark (or colored) blocks along the diagonal in the ordered dissimilarity image.</li>
</ul>
<p>R code:</p>
<pre class="r"><code>gradient.color <- list(low = "steelblue",  high = "white")
iris[, -5] %>%    # Remove column 5 (Species)
  scale() %>%     # Scale variables
  get_clust_tendency(n = 50, gradient = gradient.color)</code></pre>
<pre><code>## $hopkins_stat
## [1] 0.2
## 
## $plot</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/060-types-of-clustering-methods-clustering-tendency-1.png" width="432" /></p>
<p>Read more: <a href="https://www.sthda.com/english/articles/29-cluster-validation-essentials/95-assessing-clustering-tendency-essentials/">Assessing Clustering Tendency</a></p>
</div>
<div id="determining-the-optimal-number-of-clusters" class="section level3">
<h3>Determining the optimal number of clusters</h3>
<p>There are different methods for <a href="https://www.sthda.com/english/articles/29-cluster-validation-essentials/96-determining-the-optimal-number-of-clusters-3-must-know-methods/">determining the optimal number of clusters</a>.</p>
<p>In the R code below, we’ll use the <code>NbClust</code> R package, which provides 30 indices for determining the best number of clusters. First, install it using <code>install.packages("NbClust")</code>, then type this:</p>
<pre class="r"><code>set.seed(123)
# Compute
library("NbClust")
res.nbclust <- USArrests %>%
  scale() %>%
  NbClust(distance = "euclidean",
          min.nc = 2, max.nc = 10, 
          method = "complete", index ="all") </code></pre>
<pre class="r"><code># Visualize
library(factoextra)
fviz_nbclust(res.nbclust, ggtheme = theme_minimal())</code></pre>
<pre><code>## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 1 proposed  1 as the best number of clusters
## * 9 proposed  2 as the best number of clusters
## * 4 proposed  3 as the best number of clusters
## * 6 proposed  4 as the best number of clusters
## * 2 proposed  5 as the best number of clusters
## * 1 proposed  8 as the best number of clusters
## * 1 proposed  10 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  2 .</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/060-types-of-clustering-methods-determine-the-number-of-clusters-nbclust-1.png" width="518.4" /></p>
<p>Read more: <a href="https://www.sthda.com/english/articles/29-cluster-validation-essentials/96-determining-the-optimal-number-of-clusters-3-must-know-methods/">Determining the Optimal Number of Clusters</a></p>
</div>
<div id="clustering-validation-statistics" class="section level3">
<h3>Clustering validation statistics</h3>
<p>A variety of measures have been proposed in the literature for evaluating clustering results. The term clustering validation is used to designate the procedure of evaluating the results of a clustering algorithm.</p>
<p>The <em>silhouette plot</em> is one of the many measures for inspecting and validating clustering results. Recall that the silhouette (<span class="math inline">\(S_i\)</span>) measures how similar an object <span class="math inline">\(i\)</span> is to the other objects in its own cluster versus those in the neighboring cluster. <span class="math inline">\(S_i\)</span> values range from -1 to 1:</p>
<ul>
<li>A value of <span class="math inline">\(S_i\)</span> close to 1 indicates that the object is well clustered. In other words, the object <span class="math inline">\(i\)</span> is similar to the other objects in its group.</li>
<li>A value of <span class="math inline">\(S_i\)</span> close to -1 indicates that the object is poorly clustered, and that assignment to some other cluster would probably improve the overall results.</li>
</ul>
<p>In the following R code, we’ll compute and evaluate the result of hierarchical clustering methods.</p>
<ol style="list-style-type: decimal">
<li>Compute and visualize hierarchical clustering:</li>
</ol>
<pre class="r"><code>set.seed(123)
# Enhanced hierarchical clustering, cut in 3 groups
res.hc <- iris[, -5] %>%
  scale() %>%
  eclust("hclust", k = 3, graph = FALSE)
# Visualize with factoextra
fviz_dend(res.hc, palette = "jco",
          rect = TRUE, show_labels = FALSE)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/060-types-of-clustering-methods-hierarchical-clustering-1.png" width="518.4" /></p>
<ol start="2" style="list-style-type: decimal">
<li>Inspect the silhouette plot:</li>
</ol>
<pre class="r"><code>fviz_silhouette(res.hc)</code></pre>
<pre><code>##   cluster size ave.sil.width
## 1       1   49          0.63
## 2       2   30          0.44
## 3       3   71          0.32</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/060-types-of-clustering-methods-silhouette-plot-1.png" width="518.4" /></p>
<ol start="3" style="list-style-type: decimal">
<li>Which samples have negative silhouette widths? To which cluster are they closer?</li>
</ol>
<pre class="r"><code># Silhouette width of observations
sil <- res.hc$silinfo$widths[, 1:3]
# Objects with negative silhouette
neg_sil_index <- which(sil[, 'sil_width'] < 0)
sil[neg_sil_index, , drop = FALSE]</code></pre>
<pre><code>##     cluster neighbor sil_width
## 84        3        2   -0.0127
## 122       3        2   -0.0179
## 62        3        2   -0.0476
## 135       3        2   -0.0530
## 73        3        2   -0.1009
## 74        3        2   -0.1476
## 114       3        2   -0.1611
## 72        3        2   -0.2304</code></pre>
<p>Read more: <a href="https://www.sthda.com/english/articles/29-cluster-validation-essentials/97-cluster-validation-statistics-must-know-methods/">Cluster Validation Statistics</a></p>
</div>
<div id="see-also" class="section level3">
<h3>See also:</h3>
<ul>
<li><a href="https://www.sthda.com/english/articles/29-cluster-validation-essentials/98-choosing-the-best-clustering-algorithms/">Choosing the Best Clustering Algorithms</a></li>
<li><a href="https://www.sthda.com/english/articles/29-cluster-validation-essentials/99-computing-p-value-for-hierarchical-clustering/">Computing p-value for Hierarchical Clustering</a></li>
</ul>
</div>
</div>
<div id="advanced-clustering-methods" class="section level2">
<h2>Advanced clustering methods</h2>
<div id="hybrid-clustering-methods" class="section level3">
<h3>Hybrid clustering methods</h3>
<ul>
<li><a href="https://www.sthda.com/english/articles/30-advanced-clustering/100-hierarchical-k-means-clustering-optimize-clusters/">Hierarchical K-means Clustering</a>: a hybrid approach for improving k-means results</li>
<li><a href="https://www.sthda.com/english/articles/22-principal-component-methods/74-hcpc-hierarchical-clustering-on-principal-components/">HCPC: Hierarchical clustering on principal components</a></li>
</ul>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/060-types-of-clustering-methods-hcpc-1.png" width="518.4" /></p>
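<p>The hierarchical K-means approach can be sketched with the <code>hkmeans()</code> function [factoextra], which first runs hierarchical clustering and uses the resulting cluster centers as starting points for k-means. A minimal example; the choice of k = 3 and the USArrests data are illustrative:</p>
<pre class="r"><code># Hierarchical K-means clustering [factoextra]
library("factoextra")
my_data <- scale(na.omit(USArrests))
res.hk <- hkmeans(my_data, 3)
# Visualize the final k-means result
fviz_cluster(res.hk, palette = "jco", ggtheme = theme_minimal())</code></pre>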
</div>
<div id="fuzzy-clustering" class="section level3">
<h3>Fuzzy clustering</h3>
<p><em>Fuzzy clustering</em> is also known as soft clustering. Standard clustering approaches (K-means, PAM) produce partitions in which each observation belongs to exactly one cluster; this is known as hard clustering.</p>
<p>In <em>fuzzy clustering</em>, items can be members of more than one cluster. Each item has a set of membership coefficients corresponding to its degree of belonging to a given cluster. The <em>fuzzy c-means</em> method is the most popular fuzzy clustering algorithm.</p>
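<p>A minimal fuzzy clustering sketch using the <code>fanny()</code> function [cluster package]; the membership matrix gives, for each observation, its degree of belonging to each cluster (the USArrests data and k = 3 are illustrative):</p>
<pre class="r"><code># Fuzzy clustering with fanny() [cluster package]
library("cluster")
my_data <- scale(na.omit(USArrests))
fanny.res <- fanny(my_data, 3)
head(fanny.res$membership, 3)   # membership coefficients
head(fanny.res$clustering, 3)   # nearest crisp clustering</code></pre>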
<p>Read more: <a href="https://www.sthda.com/english/articles/30-advanced-clustering/101-fuzzy-clustering-essentials/">Fuzzy Clustering</a>.</p>
</div>
<div id="model-based-clustering" class="section level3">
<h3>Model-based clustering</h3>
<p>In <em>model-based clustering</em>, the data are viewed as coming from a distribution that is a mixture of two or more clusters. The method finds the best fit of a model to the data and estimates the number of clusters.</p>
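<p>A minimal model-based clustering sketch using the <code>Mclust()</code> function [mclust package], which selects the mixture model and the number of clusters by BIC (install the package first with <code>install.packages("mclust")</code>; the USArrests data are illustrative):</p>
<pre class="r"><code># Model-based clustering with Mclust() [mclust package]
library("mclust")
my_data <- scale(na.omit(USArrests))
mc <- Mclust(my_data)    # model and number of clusters chosen by BIC
summary(mc)              # optimal model name and cluster count
head(mc$classification)  # cluster assignments</code></pre>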
<p>Read more: <a href="https://www.sthda.com/english/articles/30-advanced-clustering/104-model-based-clustering-essentials/">Model-Based Clustering</a>.</p>
</div>
<div id="dbscan-density-based-clustering" class="section level3">
<h3>DBSCAN: Density-based clustering</h3>
<p>DBSCAN is a partitioning method introduced in Ester et al. (1996). It can find clusters of different shapes and sizes in data containing noise and outliers <span class="citation">(Ester et al. 1996)</span>. The basic idea behind the density-based clustering approach is derived from intuitive human clustering.</p>
<p>The description and implementation of DBSCAN in R are provided at this link: <a href="https://www.sthda.com/english/articles/30-advanced-clustering/105-dbscan-density-based-clustering-essentials/">DBSCAN: Density-Based Clustering</a>.</p>
<p><img src="https://www.sthda.com/english/sthda-upload/images/cluster-analysis/dbscan-idea.png" alt="Density based clustering" /></p>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/060-types-of-clustering-methods-dbscan-1.png" width="518.4" /></p>
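<p>A minimal DBSCAN sketch using the <code>dbscan()</code> function [fpc package] on the multishapes demo data [factoextra]; the <code>eps</code> and <code>MinPts</code> values are illustrative and should be tuned to the data (e.g., with a k-nearest-neighbor distance plot):</p>
<pre class="r"><code># Density-based clustering with dbscan() [fpc package]
library("fpc")
library("factoextra")
data("multishapes", package = "factoextra")
df <- multishapes[, 1:2]
db <- fpc::dbscan(df, eps = 0.15, MinPts = 5)
# Cluster 0 corresponds to noise points (outliers)
fviz_cluster(db, data = df, geom = "point")</code></pre>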
</div>
</div>
<div id="references" class="section level2 unnumbered">
<h2>References</h2>
<div id="refs" class="references">
<div id="ref-ester1996">
<p>Ester, Martin, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.” In, 226–31. AAAI Press.</p>
</div>
<div id="ref-macqueen1967">
<p>MacQueen, J. 1967. “Some Methods for Classification and Analysis of Multivariate Observations.” In <em>Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics</em>, 281–97. Berkeley, Calif.: University of California Press. <a href="http://projecteuclid.org:443/euclid.bsmsp/1200512992" class="uri">http://projecteuclid.org:443/euclid.bsmsp/1200512992</a>.</p>
</div>
</div>
</div>
</div><!--end rdoc-->
 
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
  (function () {
    var script = document.createElement("script");
    script.type = "text/javascript";
    script.src  = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
    document.getElementsByTagName("head")[0].appendChild(script);
  })();
</script>

<!-- END HTML -->]]></description>
			<pubDate>Sat, 23 Sep 2017 12:23:00 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Clustering Example: 4 Steps You Should Know]]></title>
			<link>https://www.sthda.com/english/articles/25-cluster-analysis-in-r-practical-guide/108-clustering-example-4-steps-you-should-know/</link>
			<guid>https://www.sthda.com/english/articles/25-cluster-analysis-in-r-practical-guide/108-clustering-example-4-steps-you-should-know/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">





<p>This article describes a <em>k-means</em> <strong>clustering example</strong> and provides a step-by-step guide summarizing the different steps to follow for conducting a cluster analysis on a real data set using R software.</p>
<p>We’ll use mainly two R packages:</p>
<ul>
<li>cluster: for cluster analysis, and</li>
<li>factoextra: for the visualization of the analysis results.</li>
</ul>
<p>Install these packages as follows:</p>
<pre class="r"><code>install.packages(c("cluster", "factoextra"))</code></pre>
<p>A rigorous cluster analysis can be conducted following the steps below:</p>
<div class="block">
<ol style="list-style-type: decimal">
<li>
Data preparation
</li>
<li>
<a href="https://www.sthda.com/english/articles/29-cluster-validation-essentials/95-assessing-clustering-tendency-essentials/">Assessing clustering tendency (i.e., the clusterability of the data)</a>
</li>
<li>
<a href="https://www.sthda.com/english/articles/29-cluster-validation-essentials/96-determining-the-optimal-number-of-clusters-3-must-know-methods/">Defining the optimal number of clusters</a>
</li>
<li>
<a href="https://www.sthda.com/english/articles/27-partitioning-clustering-essentials/">Computing partitioning cluster analyses (e.g.: k-means, pam)</a> or <a href="https://www.sthda.com/english/articles/28-hierarchical-clustering-essentials/">hierarchical clustering</a>
</li>
<li>
<a href="https://www.sthda.com/english/articles/29-cluster-validation-essentials/97-cluster-validation-statistics-must-know-methods/">Validating clustering analyses: silhouette plot</a>
</li>
</ol>
</div>
<p>Here, we provide quick R scripts to perform all these steps.</p>
<p>Contents:</p>
<div id="TOC">
<ul>
<li><a href="#data-preparation">Data preparation</a></li>
<li><a href="#assessing-the-clusterability">Assessing the clusterability</a></li>
<li><a href="#estimate-the-number-of-clusters-in-the-data">Estimate the number of clusters in the data</a></li>
<li><a href="#compute-k-means-clustering">Compute k-means clustering</a></li>
<li><a href="#cluster-validation-statistics-inspect-cluster-silhouette-plot">Cluster validation statistics: Inspect cluster silhouette plot</a></li>
<li><a href="#eclust-enhanced-clustering-analysis">eclust(): Enhanced clustering analysis</a><ul>
<li><a href="#k-means-clustering-using-eclust">K-means clustering using eclust()</a></li>
<li><a href="#hierachical-clustering-using-eclust">Hierachical clustering using eclust()</a></li>
</ul></li>
</ul>
</div><br/>
<p>Related Book:</p>
<div class = "small-block content-privileged-friends cluster-book">
    <center>
        <a href = "https://www.sthda.com/english/web/5-bookadvisor/17-practical-guide-to-cluster-analysis-in-r/">
          <img src = "https://www.sthda.com/english/sthda-upload/images/cluster-analysis/clustering-book-cover.png" /><br/>
      Practical Guide to Cluster Analysis in R
      </a>
      </center>
</div>
<div class="spacer"></div>


<div id="data-preparation" class="section level2">
<h2>Data preparation</h2>
<p>We’ll use the demo data set USArrests. We start by standardizing the data using the <em>scale</em>() function:</p>
<pre class="r"><code># Load the data set
data(USArrests)
# Standardize
df <- scale(USArrests)</code></pre>
</div>
<div id="assessing-the-clusterability" class="section level2">
<h2>Assessing the clusterability</h2>
<p>The function <em>get_clust_tendency</em>() [factoextra package] can be used. It computes the <a href="https://www.sthda.com/english/articles/29-cluster-validation-essentials/95-assessing-clustering-tendency-essentials/#statistical-methods"><em>Hopkins statistic</em></a> and provides a visual approach.</p>
<pre class="r"><code>library("factoextra")
res <- get_clust_tendency(df, 40, graph = TRUE)
# Hopkins statistic
res$hopkins_stat</code></pre>
<pre><code>## [1] 0.656</code></pre>
<pre class="r"><code># Visualize the dissimilarity matrix
print(res$plot)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/058-clustering-example-cluster-tendency-1.png" width="432" /></p>
<div class="success">
<p>
The value of the <a href="https://www.sthda.com/english/articles/29-cluster-validation-essentials/95-assessing-clustering-tendency-essentials/#statistical-methods">Hopkins statistic</a> is above 0.5, indicating that the data set is clusterable. Additionally, it can be seen that the ordered dissimilarity image contains patterns (i.e., clusters).
</p>
</div>
</div>
<div id="estimate-the-number-of-clusters-in-the-data" class="section level2">
<h2>Estimate the number of clusters in the data</h2>
<p>As k-means clustering requires specifying the number of clusters to generate, we’ll use the function clusGap() [cluster package] to compute the <a href="https://www.sthda.com/english/articles/29-cluster-validation-essentials/96-determining-the-optimal-number-of-clusters-3-must-know-methods/#gap-statistic-method">gap statistic</a> for estimating the optimal number of clusters. The function <em>fviz_gap_stat</em>() [factoextra] is used to visualize the gap statistic plot.</p>
<pre class="r"><code>library("cluster")
set.seed(123)
# Compute the gap statistic
gap_stat <- clusGap(df, FUN = kmeans, nstart = 25, 
                    K.max = 10, B = 100) 
# Plot the result
library(factoextra)
fviz_gap_stat(gap_stat)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/058-clustering-example-number-of-clusters-gap-statistic-1.png" width="518.4" /></p>
<div class="success">
<p>
The gap statistic suggests a 4-cluster solution.
</p>
</div>
<div class="notice">
<p>
It’s also possible to use the function <a href="https://www.sthda.com/english/articles/29-cluster-validation-essentials/96-determining-the-optimal-number-of-clusters-3-must-know-methods/#nbclust-function-30-indices-for-choosing-the-best-number-of-clusters"><strong>NbClust()</strong></a> [in the <strong>NbClust</strong> package].
</p>
</div>
</div>
<div id="compute-k-means-clustering" class="section level2">
<h2>Compute k-means clustering</h2>
<p><a href="https://www.sthda.com/english/articles/27-partitioning-clustering-essentials/87-k-means-clustering-essentials/">K-means clustering</a> with k = 4:</p>
<pre class="r"><code># Compute k-means
set.seed(123)
km.res <- kmeans(df, 4, nstart = 25)
head(km.res$cluster, 20)</code></pre>
<pre><code>##     Alabama      Alaska     Arizona    Arkansas  California    Colorado 
##           4           3           3           4           3           3 
## Connecticut    Delaware     Florida     Georgia      Hawaii       Idaho 
##           2           2           3           4           2           1 
##    Illinois     Indiana        Iowa      Kansas    Kentucky   Louisiana 
##           3           2           1           2           1           4 
##       Maine    Maryland 
##           1           3</code></pre>
<pre class="r"><code># Visualize clusters using factoextra
fviz_cluster(km.res, USArrests)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/058-clustering-example-k-means-factoextra-1.png" width="518.4" /></p>
</div>
<div id="cluster-validation-statistics-inspect-cluster-silhouette-plot" class="section level2">
<h2>Cluster validation statistics: Inspect cluster silhouette plot</h2>
<p>Recall that the <a href="https://www.sthda.com/english/articles/29-cluster-validation-essentials/97-cluster-validation-statistics-must-know-methods/#silhouette-coefficient">silhouette coefficient</a> (<span class="math inline">\(S_i\)</span>) measures how similar an object <span class="math inline">\(i\)</span> is to the other objects in its own cluster versus those in the neighboring cluster. <span class="math inline">\(S_i\)</span> values range from -1 to 1:</p>
<ul>
<li>A value of <span class="math inline">\(S_i\)</span> close to 1 indicates that the object is well clustered. In other words, the object <span class="math inline">\(i\)</span> is similar to the other objects in its group.</li>
<li>A value of <span class="math inline">\(S_i\)</span> close to -1 indicates that the object is poorly clustered, and that assignment to some other cluster would probably improve the overall results.</li>
</ul>
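<p>Concretely, <span class="math inline">\(S_i = (b_i - a_i)/\max(a_i, b_i)\)</span>, where <span class="math inline">\(a_i\)</span> is the average distance from object <span class="math inline">\(i\)</span> to the other members of its own cluster and <span class="math inline">\(b_i\)</span> is the average distance to the members of the nearest other cluster. The base R sketch below computes <span class="math inline">\(S_i\)</span> by hand for a single object (the function name <em>sil_width</em> is illustrative; in practice, use <em>silhouette</em>() [cluster package] as shown next):</p>

```r
# Hand-computed silhouette width S_i for object i (illustrative sketch;
# use cluster::silhouette() for real analyses).
sil_width <- function(i, labels, d) {
  d <- as.matrix(d)
  own <- labels == labels[i]
  # a_i: average distance to the other members of i's own cluster
  a_i <- mean(d[i, own & seq_along(labels) != i])
  # b_i: average distance to the members of the nearest other cluster
  b_i <- min(tapply(d[i, !own], labels[!own], mean))
  (b_i - a_i) / max(a_i, b_i)
}
```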
<pre class="r"><code>sil <- silhouette(km.res$cluster, dist(df))
rownames(sil) <- rownames(USArrests)
head(sil[, 1:3])</code></pre>
<pre><code>##            cluster neighbor sil_width
## Alabama          4        3    0.4858
## Alaska           3        4    0.0583
## Arizona          3        2    0.4155
## Arkansas         4        2    0.1187
## California       3        2    0.4356
## Colorado         3        2    0.3265</code></pre>
<pre class="r"><code>fviz_silhouette(sil)</code></pre>
<pre><code>##   cluster size ave.sil.width
## 1       1   13          0.37
## 2       2   16          0.34
## 3       3   13          0.27
## 4       4    8          0.39</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/058-clustering-example-silhouette-plot-1.png" width="518.4" /></p>
<p>It can be seen that some samples have negative silhouette values. Some natural questions are:</p>
<p><span class="question">Which samples are these? To what cluster are they closer?</span></p>
<p>This can be determined from the output of the function <em>silhouette</em>() as follows:</p>
<pre class="r"><code>neg_sil_index <- which(sil[, "sil_width"] < 0)
sil[neg_sil_index, , drop = FALSE]</code></pre>
<pre><code>##          cluster neighbor sil_width
## Missouri       3        2   -0.0732</code></pre>
</div>
<div id="eclust-enhanced-clustering-analysis" class="section level2">
<h2>eclust(): Enhanced clustering analysis</h2>
<p>The function <a href="https://www.sthda.com/english/articles/25-cluster-analysis-in-r-practical-guide/106-cluster-analysis-in-r-simplified-and-enhanced/">eclust()</a>[factoextra package] provides several advantages compared to the standard packages used for clustering analysis:</p>
<ul>
<li>It simplifies the workflow of clustering analysis</li>
<li>It can be used to compute hierarchical clustering and partitioning clustering in a single line function call</li>
<li>The function eclust() automatically computes the gap statistic for estimating the optimal number of clusters.</li>
<li>It automatically provides silhouette information</li>
<li>It draws beautiful graphs using ggplot2</li>
</ul>
<div id="k-means-clustering-using-eclust" class="section level3">
<h3>K-means clustering using eclust()</h3>
<pre class="r"><code># Compute k-means
res.km <- eclust(df, "kmeans", nstart = 25)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/058-clustering-example-eclust-k-means-1.png" width="518.4" /></p>
<pre class="r"><code># Gap statistic plot
fviz_gap_stat(res.km$gap_stat)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/058-clustering-example-eclust-k-means-2.png" width="518.4" /></p>
<pre class="r"><code># Silhouette plot
fviz_silhouette(res.km)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/058-clustering-example-eclust-k-means-3.png" width="518.4" /></p>
</div>
<div id="hierachical-clustering-using-eclust" class="section level3">
<h3>Hierarchical clustering using eclust()</h3>
<pre class="r"><code> # Enhanced hierarchical clustering
res.hc <- eclust(df, "hclust") # compute hclust</code></pre>
<pre><code>## Clustering k = 1,2,..., K.max (= 10): .. done
## Bootstrapping, b = 1,2,..., B (= 100)  [one "." per sample]:
## .................................................. 50 
## .................................................. 100</code></pre>
<pre class="r"><code>fviz_dend(res.hc, rect = TRUE) # dendrogram</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/058-clustering-example-eclust-hierarchical-clustering-1.png" width="518.4" /></p>
<p>The R code below generates the silhouette plot and the scatter plot for hierarchical clustering.</p>
<pre class="r"><code>fviz_silhouette(res.hc) # silhouette plot
fviz_cluster(res.hc) # scatter plot</code></pre>
</div>
</div>


</div><!--end rdoc-->

 
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
  (function () {
    var script = document.createElement("script");
    script.type = "text/javascript";
    script.src  = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
    document.getElementsByTagName("head")[0].appendChild(script);
  })();
</script>

<!-- END HTML -->]]></description>
			<pubDate>Fri, 22 Sep 2017 22:20:00 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Cluster Analysis in R Simplified and Enhanced]]></title>
			<link>https://www.sthda.com/english/articles/25-cluster-analysis-in-r-practical-guide/106-cluster-analysis-in-r-simplified-and-enhanced/</link>
			<guid>https://www.sthda.com/english/articles/25-cluster-analysis-in-r-practical-guide/106-cluster-analysis-in-r-simplified-and-enhanced/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">
<p>In <strong>R</strong>, standard clustering methods (partitioning and hierarchical clustering) can be computed using the R packages <em>stats</em> and <em>cluster</em>. However, the workflow generally requires multiple steps and multiple lines of R code.</p>
<p>This article describes some easy-to-use wrapper functions, in the <strong>factoextra</strong> R package, for simplifying and improving <strong>cluster analysis</strong> in <strong>R</strong>. These functions include:</p>
<ol style="list-style-type: decimal">
<li><p><em>get_dist</em>() &amp; <em>fviz_dist</em>() for computing and visualizing the distance matrix between the rows of a data matrix. Compared to the standard <em>dist</em>() function, get_dist() supports <em>correlation-based distance measures</em>, including the “pearson”, “kendall” and “spearman” methods.</p></li>
<li><em>eclust</em>(): enhanced cluster analysis. It has several advantages:
<ul>
<li>It simplifies the workflow of clustering analysis</li>
<li>It can be used to compute <a href="https://www.sthda.com/english/articles/28-hierarchical-clustering-essentials/"><em>hierarchical clustering</em></a> and <a href="https://www.sthda.com/english/articles/27-partitioning-clustering-essentials/"><em>partitioning clustering</em></a> in a single line function call</li>
<li>Compared to the standard partitioning functions (kmeans, pam, clara and fanny), which require the user to specify the optimal number of clusters, the function eclust() automatically computes the <em>gap statistic</em> for estimating the right number of clusters.</li>
<li>For hierarchical clustering, correlation-based metric is allowed</li>
<li>It provides <a href="https://www.sthda.com/english/articles/29-cluster-validation-essentials/97-cluster-validation-statistics-must-know-methods/#silhouette-coefficient">silhouette information</a> for all partitioning methods and hierarchical clustering</li>
<li>It creates beautiful graphs using ggplot2</li>
</ul></li>
</ol>
<div id="required-packages" class="section level2">
<h2>Required packages</h2>
<p>We’ll use the factoextra package for an enhanced cluster analysis and visualization.</p>
<ul>
<li>Install factoextra:</li>
</ul>
<pre class="r"><code>install.packages("factoextra")</code></pre>
<ul>
<li>Load factoextra</li>
</ul>
<pre class="r"><code>library(factoextra)</code></pre>
</div>
<div id="data-preparation" class="section level2">
<h2>Data preparation</h2>
<p>The built-in R dataset <strong>USArrests</strong> is used:</p>
<pre class="r"><code># Load and scale the dataset
data("USArrests")
df <- scale(USArrests)
head(df)</code></pre>
<pre><code>##            Murder Assault UrbanPop     Rape
## Alabama    1.2426   0.783   -0.521 -0.00342
## Alaska     0.5079   1.107   -1.212  2.48420
## Arizona    0.0716   1.479    0.999  1.04288
## Arkansas   0.2323   0.231   -1.074 -0.18492
## California 0.2783   1.263    1.759  2.06782
## Colorado   0.0257   0.399    0.861  1.86497</code></pre>
</div>
<div id="distance-matrix-computation-and-visualization" class="section level2">
<h2>Distance matrix computation and visualization</h2>
<pre class="r"><code>library(factoextra)
# Correlation-based distance method
res.dist <- get_dist(df, method = "pearson")
head(round(as.matrix(res.dist), 2))[, 1:6]</code></pre>
<pre><code>##            Alabama Alaska Arizona Arkansas California Colorado
## Alabama       0.00   0.71    1.45     0.09       1.87     1.69
## Alaska        0.71   0.00    0.83     0.37       0.81     0.52
## Arizona       1.45   0.83    0.00     1.18       0.29     0.60
## Arkansas      0.09   0.37    1.18     0.00       1.59     1.37
## California    1.87   0.81    0.29     1.59       0.00     0.11
## Colorado      1.69   0.52    0.60     1.37       0.11     0.00</code></pre>
<pre class="r"><code># Visualize the dissimilarity matrix
fviz_dist(res.dist, lab_size = 8)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/057-cluster-analysis-in-r-simplified-and-enhanced-distance-matrix-computation-visualization-1.png" width="518.4" /></p>
<div class="success">
<p>
In the plot above, similar objects are close to one another. Red color corresponds to small distances and blue color indicates large distances between observations.
</p>
</div>
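<p>Under the hood, a correlation-based distance is typically defined as 1 minus the correlation between rows. A minimal base R sketch of this convention (an assumption about get_dist()'s internals; the made-up helper <em>cor_dist</em> covers only the “pearson” case):</p>

```r
# Correlation-based distance: 1 - Pearson correlation between rows
# (a base-R sketch of the convention behind get_dist(df, method = "pearson")).
cor_dist <- function(x) as.dist(1 - cor(t(x)))

data("USArrests")
df <- scale(USArrests)
res <- as.matrix(cor_dist(df))
```

Because the result is a regular "dist" object, it can be passed to hclust() or fviz_dist() like any other distance matrix.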
</div>
<div id="enhanced-clustering-analysis" class="section level2">
<h2>Enhanced clustering analysis</h2>
<p>The standard R code for computing hierarchical clustering looks like this:</p>
<pre class="r"><code># Load and scale the dataset
data("USArrests")
df <- scale(USArrests)
# Compute dissimilarity matrix
res.dist <- dist(df, method = "euclidean")
# Compute hierarchical clustering
res.hc <- hclust(res.dist, method = "ward.D2")
# Visualize
plot(res.hc, cex = 0.5)</code></pre>
<p>In this section we’ll describe the <em>eclust</em>() function [<em>factoextra</em> package] to simplify the workflow. The format is as follows:</p>
<pre class="r"><code>eclust(x, FUNcluster = "kmeans", hc_metric = "euclidean", ...)</code></pre>
<div class="block">
<ul>
<li>
x: numeric vector, data matrix or data frame
</li>
<li>
FUNcluster: a clustering function including “kmeans”, “pam”, “clara”, “fanny”, “hclust”, “agnes” and “diana”. Abbreviation is allowed.
</li>
<li>
hc_metric: character string specifying the metric to be used for calculating dissimilarities between observations. Allowed values are those accepted by the function dist() [including “euclidean”, “manhattan”, “maximum”, “canberra”, “binary”, “minkowski”] and correlation based distance measures [“pearson”, “spearman” or “kendall”]. Used only when FUNcluster is a hierarchical clustering function such as one of “hclust”, “agnes” or “diana”.
</li>
<li>
…: other arguments to be passed to FUNcluster.
</li>
</ul>
</div>
<p>In the following R code, we’ll show some examples for enhanced k-means clustering and hierarchical clustering. Note that the same analysis can be done for PAM, CLARA, FANNY, AGNES and DIANA.</p>
<pre class="r"><code>library("factoextra")
# Enhanced k-means clustering
res.km <- eclust(df, "kmeans", nstart = 25)</code></pre>
<pre><code>## Clustering k = 1,2,..., K.max (= 10): .. done
## Bootstrapping, b = 1,2,..., B (= 100)  [one "." per sample]:
## .................................................. 50 
## .................................................. 100</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/057-cluster-analysis-in-r-simplified-and-enhanced-eclust-1.png" width="518.4" /></p>
<pre class="r"><code># Gap statistic plot
fviz_gap_stat(res.km$gap_stat)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/057-cluster-analysis-in-r-simplified-and-enhanced-eclust-2.png" width="518.4" /></p>
<pre class="r"><code># Silhouette plot
fviz_silhouette(res.km)</code></pre>
<pre><code>##   cluster size ave.sil.width
## 1       1    8          0.39
## 2       2   16          0.34
## 3       3   13          0.37
## 4       4   13          0.27</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/057-cluster-analysis-in-r-simplified-and-enhanced-eclust-3.png" width="518.4" /></p>
<pre class="r"><code># Optimal number of clusters using gap statistics
res.km$nbclust</code></pre>
<pre><code>## [1] 4</code></pre>
<pre class="r"><code># Print result
 res.km</code></pre>
<pre><code>## K-means clustering with 4 clusters of sizes 8, 16, 13, 13
## 
## Cluster means:
##   Murder Assault UrbanPop    Rape
## 1  1.412   0.874   -0.815  0.0193
## 2 -0.489  -0.383    0.576 -0.2617
## 3 -0.962  -1.107   -0.930 -0.9668
## 4  0.695   1.039    0.723  1.2769
## 
## Clustering vector:
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              4              4              1              4 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              4              2              2              4              1 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              2              3              4              2              3 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              2              3              1              3              4 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              2              4              3              1              4 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              3              3              4              3              2 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              4              4              1              3              2 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              2              2              2              2              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              3              1              4              2              3 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              2              2              3              3              2 
## 
## Within cluster sum of squares by cluster:
## [1]  8.32 16.21 11.95 19.92
##  (between_SS / total_SS =  71.2 %)
## 
## Available components:
## 
##  [1] "cluster"      "centers"      "totss"        "withinss"    
##  [5] "tot.withinss" "betweenss"    "size"         "iter"        
##  [9] "ifault"       "clust_plot"   "silinfo"      "nbclust"     
## [13] "data"         "gap_stat"</code></pre>
<pre class="r"><code> # Enhanced hierarchical clustering
 res.hc <- eclust(df, "hclust") # compute hclust</code></pre>
<pre><code>## Clustering k = 1,2,..., K.max (= 10): .. done
## Bootstrapping, b = 1,2,..., B (= 100)  [one "." per sample]:
## .................................................. 50 
## .................................................. 100</code></pre>
<pre class="r"><code> fviz_dend(res.hc, rect = TRUE) # dendrogram</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/057-cluster-analysis-in-r-simplified-and-enhanced-eclust-4.png" width="518.4" /></p>
<pre class="r"><code> fviz_silhouette(res.hc) # silhouette plot</code></pre>
<pre><code>##   cluster size ave.sil.width
## 1       1   19          0.26
## 2       2   19          0.28
## 3       3   12          0.43</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/057-cluster-analysis-in-r-simplified-and-enhanced-eclust-5.png" width="518.4" /></p>
<pre class="r"><code> fviz_cluster(res.hc) # scatter plot</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/057-cluster-analysis-in-r-simplified-and-enhanced-eclust-6.png" width="518.4" /></p>
<p>It’s also possible to specify the number of clusters as follows:</p>
<pre class="r"><code>eclust(df, "kmeans", k = 4)</code></pre>
</div>
</div><!--end rdoc-->

<!-- END HTML -->]]></description>
			<pubDate>Tue, 12 Sep 2017 03:56:00 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[DBSCAN: Density-Based Clustering Essentials]]></title>
			<link>https://www.sthda.com/english/articles/30-advanced-clustering/105-dbscan-density-based-clustering-essentials/</link>
			<guid>https://www.sthda.com/english/articles/30-advanced-clustering/105-dbscan-density-based-clustering-essentials/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">
<p><strong>DBSCAN</strong> (<strong>Density-Based Spatial Clustering of Applications with Noise</strong>) is a <strong>density-based clustering</strong> algorithm, introduced in Ester et al. 1996, which can be used to identify clusters of any shape in a data set containing noise and outliers.</p>
<p>The basic idea behind the density-based clustering approach is derived from how humans intuitively cluster. For instance, by looking at the figure below, one can easily identify four clusters along with several points of noise, because of the differences in the density of points.</p>
<p>Clusters are dense regions in the data space, separated by regions of lower density of points. The DBSCAN algorithm is based on this intuitive notion of “clusters” and “noise”. The key idea is that for each point of a cluster, the neighborhood of a given radius has to contain at least a minimum number of points.</p>
<p><img src="https://www.sthda.com/english/sthda-upload/images/cluster-analysis/dbscan-idea.png" alt="DBSCAN idea" /> (From Ester et al. 1996)</p>
<div class="block">
<p>
In this chapter, we’ll describe the DBSCAN algorithm and demonstrate how to compute DBSCAN using the <em>fpc</em> R package.
</p>
</div>
<br/>
<p>Contents: </p>
<div id="TOC">
<ul>
<li><a href="#why-dbscan">Why DBSCAN?</a></li>
<li><a href="#algorithm">Algorithm</a></li>
<li><a href="#advantages">Advantages</a></li>
<li><a href="#parameter-estimation">Parameter estimation</a></li>
<li><a href="#computing-dbscan">Computing DBSCAN</a></li>
<li><a href="#method-for-determining-the-optimal-eps-value">Method for determining the optimal eps value</a></li>
<li><a href="#cluster-predictions-with-dbscan-algorithm">Cluster predictions with DBSCAN algorithm</a></li>
</ul>
</div>
<br/>
<p>Related Book:</p>
<div class = "small-block content-privileged-friends cluster-book">
    <center>
        <a href = "https://www.sthda.com/english/web/5-bookadvisor/17-practical-guide-to-cluster-analysis-in-r/">
          <img src = "https://www.sthda.com/english/sthda-upload/images/cluster-analysis/clustering-book-cover.png" /><br/>
      Practical Guide to Cluster Analysis in R
      </a>
      </center>
</div>
<div class="spacer"></div>
<div id="why-dbscan" class="section level2">
<h2>Why DBSCAN?</h2>
<p>Partitioning methods (K-means, PAM clustering) and hierarchical clustering are suitable for finding spherical-shaped or convex clusters. In other words, they work well only for compact and well-separated clusters. Moreover, they are severely affected by the presence of noise and outliers in the data.</p>
<p>Unfortunately, real-life data can contain: i) clusters of arbitrary shape, such as those shown in the figure below (oval, linear and “S”-shaped clusters); ii) many outliers and noise.</p>
<p>The figure below shows a data set containing nonconvex clusters and outliers/noises. The simulated data set <em>multishapes</em> [in <em>factoextra</em> package] is used.</p>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/023-dbscan-density-based-clustering-data-dbscan-1.png" width="336" /></p>
<p>The plot above contains 5 clusters and outliers, including:</p>
<ul>
<li>2 oval clusters</li>
<li>2 linear clusters</li>
<li>1 compact cluster</li>
</ul>
<p>Given such data, the k-means algorithm has difficulty identifying these clusters with arbitrary shapes. To illustrate this, the following R code applies the k-means algorithm to the multishapes data set. The function <em>fviz_cluster</em>() [<em>factoextra</em> package] is used to visualize the clusters.</p>
<p>First, install factoextra: install.packages(“factoextra”); then compute and visualize k-means clustering using the data set multishapes:</p>
<pre class="r"><code>library(factoextra)
data("multishapes")
df <- multishapes[, 1:2]
set.seed(123)
km.res <- kmeans(df, 5, nstart = 25)
fviz_cluster(km.res, df,  geom = "point", 
             ellipse= FALSE, show.clust.cent = FALSE,
             palette = "jco", ggtheme = theme_classic())</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/023-dbscan-density-based-clustering-k-means-multishapes-1.png" width="336" /></p>
<div class="success">
<p>
We know there are 5 clusters in the data, but it can be seen that the k-means method identifies them inaccurately.
</p>
</div>
</div>
<div id="algorithm" class="section level2">
<h2>Algorithm</h2>
<p>The goal is to identify dense regions, which can be measured by the number of objects close to a given point.</p>
<p>Two important parameters are required for DBSCAN: <em>epsilon</em> (“eps”) and <em>minimum points</em> (“MinPts”). The parameter <em>eps</em> defines the radius of the neighborhood around a point x, called the <span class="math inline">\(\epsilon\)</span>-neighborhood of x. The parameter <em>MinPts</em> is the minimum number of neighbors within the “eps” radius.</p>
<p>Any point x in the data set, with a neighbor count greater than or equal to <em>MinPts</em>, is marked as a <em>core point</em>. We say that x is a <em>border point</em> if the number of its neighbors is less than MinPts, but it belongs to the <span class="math inline">\(\epsilon\)</span>-neighborhood of some core point z. Finally, if a point is neither a core nor a border point, it is called a noise point or an outlier.</p>
<p>The figure below shows the different types of points (core, border and outlier points) using MinPts = 6. Here x is a core point because <span class="math inline">\(neighbours_\epsilon(x) = 6\)</span>, y is a border point because <span class="math inline">\(neighbours_\epsilon(y) < MinPts\)</span>, but it belongs to the <span class="math inline">\(\epsilon\)</span>-neighborhood of the core point x. Finally, z is a noise point.</p>
<p><img src="https://www.sthda.com/english/sthda-upload/images/cluster-analysis/dbscan-principle.png" alt="DBSCAN principle" /></p>
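<p>The core/border/noise distinction can be made concrete with a few lines of base R (an illustrative sketch; the function name <em>point_types</em> is made up, and the neighbor count here includes the point itself, following the usual DBSCAN convention):</p>

```r
# Classify each point as "core", "border" or "noise" (illustrative sketch;
# the neighbour count includes the point itself).
point_types <- function(x, eps, MinPts) {
  d <- as.matrix(dist(x))
  in_eps <- d <= eps                        # logical neighbourhood matrix
  core <- rowSums(in_eps) >= MinPts         # enough neighbours: core point
  # border: not core, but inside the eps-neighbourhood of some core point
  border <- !core & apply(in_eps, 1, function(nb) any(nb & core))
  ifelse(core, "core", ifelse(border, "border", "noise"))
}
```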
<p>We start by defining three terms required for understanding the DBSCAN algorithm:</p>
<ul>
<li><em>Direct density reachable</em>: A point “A” is directly density reachable from another point “B” if: i) “A” is in the <span class="math inline">\(\epsilon\)</span>-neighborhood of “B” and ii) “B” is a core point.</li>
<li><em>Density reachable</em>: A point “A” is density reachable from “B” if there is a set of core points leading from “B” to “A”.</li>
<li><em>Density connected</em>: Two points “A” and “B” are density connected if there is a core point “C” such that both “A” and “B” are density reachable from “C”.</li>
</ul>
<p>A density-based cluster is defined as a group of density connected points. The algorithm of density-based clustering (DBSCAN) works as follow:</p>
<div class="block">
<ol style="list-style-type: decimal">
<li>
<p>
For each point <span class="math inline"><em>x</em><sub><em>i</em></sub></span>, compute the distance between <span class="math inline"><em>x</em><sub><em>i</em></sub></span> and the other points, and find all neighbor points within distance <em>eps</em> of the starting point (<span class="math inline"><em>x</em><sub><em>i</em></sub></span>). Each point with a neighbor count greater than or equal to <em>MinPts</em> is marked as a <em>core point</em> or <em>visited</em>.
</p>
</li>
<li>
<p>
For each <em>core point</em>, if it’s not already assigned to a cluster, create a new cluster. Find recursively all its density connected points and assign them to the same cluster as the core point.
</p>
</li>
<li>
<p>
Iterate through the remaining unvisited points in the data set.
</p>
</li>
</ol>
<p>
Those points that do not belong to any cluster are treated as outliers or noise.
</p>
</div>
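<p>The three steps above can be sketched directly in base R. This naive implementation (the name <em>naive_dbscan</em> is illustrative) computes all pairwise distances up front, so it is only suitable for small data sets; use <em>fpc::dbscan</em>() or <em>dbscan::dbscan</em>() in practice:</p>

```r
# Naive DBSCAN sketch following the three steps above (illustrative only).
# Label 0 means noise/unassigned; neighbour counts include the point itself.
naive_dbscan <- function(x, eps, MinPts) {
  x <- as.matrix(x)
  n <- nrow(x)
  d <- as.matrix(dist(x))            # all pairwise distances
  labels <- rep(0L, n)
  cl <- 0L
  for (i in seq_len(n)) {
    if (labels[i] != 0L) next        # already assigned to a cluster
    nb <- which(d[i, ] <= eps)       # eps-neighbourhood of x_i
    if (length(nb) < MinPts) next    # not a core point (may become border later)
    cl <- cl + 1L                    # start a new cluster from this core point
    labels[i] <- cl
    queue <- setdiff(nb, i)
    while (length(queue) > 0L) {     # expand: collect density-connected points
      j <- queue[1]
      queue <- queue[-1]
      if (labels[j] == 0L) {
        labels[j] <- cl              # border or core point joins the cluster
        nb_j <- which(d[j, ] <= eps)
        if (length(nb_j) >= MinPts)  # j is itself a core point: keep expanding
          queue <- c(queue, nb_j[labels[nb_j] == 0L])
      }
    }
  }
  labels                             # points still labelled 0 are noise
}
```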
</div>
<div id="advantages" class="section level2">
<h2>Advantages</h2>
<ol style="list-style-type: decimal">
<li>Unlike K-means, DBSCAN does not require the user to specify the number of clusters to be generated</li>
<li>DBSCAN can find any shape of clusters. The cluster doesn’t have to be circular.</li>
<li>DBSCAN can identify outliers</li>
</ol>
</div>
<div id="parameter-estimation" class="section level2">
<h2>Parameter estimation</h2>
<ul>
<li><p>MinPts: the larger the data set, the larger the value of MinPts should be. MinPts must be at least 3.</p></li>
<li><p><span class="math inline">\(\epsilon\)</span>: The value for <span class="math inline">\(\epsilon\)</span> can then be chosen by using a k-distance graph, plotting the distance to the k = minPts nearest neighbor. Good values of <span class="math inline">\(\epsilon\)</span> are where this plot shows a strong bend.</p></li>
</ul>
</div>
<div id="computing-dbscan" class="section level2">
<h2>Computing DBSCAN</h2>
<p>Here, we’ll use the R package <em>fpc</em> to compute DBSCAN. It’s also possible to use the package <em>dbscan</em>, which provides a faster re-implementation of the DBSCAN algorithm than the fpc package.</p>
<p>We’ll also use the <em>factoextra</em> package for visualizing clusters.</p>
<p>First, install the packages as follow:</p>
<pre class="r"><code>install.packages("fpc")
install.packages("dbscan")
install.packages("factoextra")</code></pre>
<p>The R code below computes and visualizes DBSCAN using multishapes data set [factoextra R package]:</p>
<pre class="r"><code># Load the data 
data("multishapes", package = "factoextra")
df <- multishapes[, 1:2]
# Compute DBSCAN using fpc package
library("fpc")
set.seed(123)
db <- fpc::dbscan(df, eps = 0.15, MinPts = 5)
# Plot DBSCAN results
library("factoextra")
fviz_cluster(db, data = df, stand = FALSE,
             ellipse = FALSE, show.clust.cent = FALSE,
             geom = "point",palette = "jco", ggtheme = theme_classic())</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/023-dbscan-density-based-clustering-density-based-clustering-1.png" width="336" /></p>
<div class="notice">
<p>
Note that the function <em>fviz_cluster</em>() uses different point symbols for core points (i.e., seed points) and border points. Black points correspond to outliers. You can play with <em>eps</em> and <em>MinPts</em> to change the cluster configuration.
</p>
</div>
<div class="success">
<p>
It can be seen that DBSCAN performs better on these data sets than the k-means algorithm and identifies the correct set of clusters.
</p>
</div>
<p>The result of the <em>fpc::dbscan</em>() function can be displayed as follow:</p>
<pre class="r"><code>print(db)</code></pre>
<pre><code>## dbscan Pts=1100 MinPts=5 eps=0.15
##         0   1   2   3  4  5
## border 31  24   1   5  7  1
## seed    0 386 404  99 92 50
## total  31 410 405 104 99 51</code></pre>
<p>In the table above, the column names are cluster numbers. Cluster 0 corresponds to outliers (black points in the DBSCAN plot). The function <em>print.dbscan</em>() shows, for each cluster, the number of points that are seed (core) points and border points.</p>
<pre class="r"><code># Cluster membership. Noise/outlier observations are coded as 0
# A random subset is shown
db$cluster[sample(1:1089, 20)]</code></pre>
<pre><code>##  [1] 1 3 2 4 3 1 2 4 2 2 2 2 2 2 1 4 1 1 1 0</code></pre>
<p>DBSCAN algorithm requires users to specify the optimal <em>eps</em> values and the parameter <em>MinPts</em>. In the R code above, we used <em>eps = 0.15</em> and <em>MinPts = 5</em>. One limitation of DBSCAN is that it is sensitive to the choice of <span class="math inline">\(\epsilon\)</span>, in particular if clusters have different densities. If <span class="math inline">\(\epsilon\)</span> is too small, sparser clusters will be defined as noise. If <span class="math inline">\(\epsilon\)</span> is too large, denser clusters may be merged together. This implies that, if there are clusters with different local densities, then a single <span class="math inline">\(\epsilon\)</span> value may not suffice.</p>
<p>A natural question is:</p>
<div class="block">
<p>
How to define the optimal value of <span class="math inline">\(\epsilon\)</span>?
</p>
</div>
</div>
<div id="method-for-determining-the-optimal-eps-value" class="section level2">
<h2>Method for determining the optimal eps value</h2>
<p>The method proposed here consists of computing the k-nearest neighbor distances in a matrix of points.</p>
<p>The idea is to calculate the average of the distances from every point to its k nearest neighbors. The value of k is specified by the user and corresponds to <em>MinPts</em>.</p>
<p>Next, these k-distances are plotted in an ascending order. The aim is to determine the “knee”, which corresponds to the optimal <em>eps</em> parameter.</p>
<p>A knee corresponds to a threshold where a sharp change occurs along the k-distance curve.</p>
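<p>The k-distance curve itself is easy to compute in base R (an illustrative sketch; <em>k_distances</em> is a made-up helper, and the kNNdistplot() function introduced next does all of this for you):</p>

```r
# Sorted distances from each point to its k-th nearest neighbour
# (k = MinPts); the "knee" of this curve suggests a good eps value.
k_distances <- function(x, k) {
  d <- as.matrix(dist(x))
  # sort each row; position 1 is the self-distance (0), so the
  # k-th nearest neighbour sits at position k + 1
  sort(apply(d, 1, function(row) sort(row)[k + 1]))
}
```

For example, plot(k_distances(df, k = 5), type = "l") reproduces the idea behind the k-distance plot shown below.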
<p>The function <em>kNNdistplot</em>() [in <em>dbscan</em> package] can be used to draw the k-distance plot:</p>
<pre class="r"><code>dbscan::kNNdistplot(df, k =  5)
abline(h = 0.15, lty = 2)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/023-dbscan-density-based-clustering-k-nearest-neighbor-distance-1.png" width="384" /></p>
<div class="success">
<p>
It can be seen that the optimal <em>eps</em> value is around a distance of 0.15.
</p>
</div>
</div>
<div id="cluster-predictions-with-dbscan-algorithm" class="section level2">
<h2>Cluster predictions with DBSCAN algorithm</h2>
<p>The function <em>predict.dbscan(object, data, newdata)</em> [in <em>fpc</em> package] can be used to predict the clusters for the points in <em>newdata</em>. For more details, read the documentation (<em>?predict.dbscan</em>).</p>
</div>
</div><!--end rdoc-->
 
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
  (function () {
    var script = document.createElement("script");
    script.type = "text/javascript";
    script.src  = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
    document.getElementsByTagName("head")[0].appendChild(script);
  })();
</script>

<!-- END HTML -->]]></description>
			<pubDate>Thu, 07 Sep 2017 20:02:00 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Model Based Clustering Essentials]]></title>
			<link>https://www.sthda.com/english/articles/30-advanced-clustering/104-model-based-clustering-essentials/</link>
			<guid>https://www.sthda.com/english/articles/30-advanced-clustering/104-model-based-clustering-essentials/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">
<p>The traditional clustering methods, such as hierarchical clustering (Chapter @ref(agglomerative-clustering)) and k-means clustering (Chapter @ref(kmeans-clustering)), are heuristic and are not based on formal models. Furthermore, the k-means algorithm is commonly initialized randomly, so different runs of k-means will often yield different results. Additionally, k-means requires the user to specify the optimal number of clusters.</p>
<p>An alternative is <strong>model-based clustering</strong>, which considers the data as coming from a distribution that is a mixture of two or more clusters <span class="citation">(Fraley and Raftery 2002, <span class="citation">Fraley et al. (2012)</span>)</span>. Unlike k-means, model-based clustering uses a soft assignment, where each data point has a probability of belonging to each cluster.</p>
<div class="block">
<p>
In this chapter, we illustrate model-based clustering using the R package <em>mclust</em>.
</p>
</div>
<br/>
<p>Contents:</p>
<div id="TOC">
<ul>
<li><a href="#concept-of-model-based-clustering">Concept of model-based clustering</a></li>
<li><a href="#estimating-model-parameters">Estimating model parameters</a></li>
<li><a href="#choosing-the-best-model">Choosing the best model</a></li>
<li><a href="#computing-model-based-clustering-in-r">Computing model-based clustering in R</a></li>
<li><a href="#visualizing-model-based-clustering">Visualizing model-based clustering</a></li>
<li><a href="#references">References</a></li>
</ul>
</div>
<br/>
<p>Related Book:</p>
<div class = "small-block content-privileged-friends cluster-book">
    <center>
        <a href = "https://www.sthda.com/english/web/5-bookadvisor/17-practical-guide-to-cluster-analysis-in-r/">
          <img src = "https://www.sthda.com/english/sthda-upload/images/cluster-analysis/clustering-book-cover.png" /><br/>
      Practical Guide to Cluster Analysis in R
      </a>
      </center>
</div>
<div class="spacer"></div>
<div id="concept-of-model-based-clustering" class="section level2">
<h2>Concept of model-based clustering</h2>
<p>In model-based clustering, the data are considered as coming from a mixture of densities.</p>
<p>Each component (i.e. cluster) k is modeled by a normal (Gaussian) distribution, which is characterized by the following parameters:</p>
<ul>
<li><span class="math inline">\(\mu_k\)</span>: mean vector,</li>
<li><span class="math inline">\(\Sigma_k\)</span>: covariance matrix,</li>
<li>An associated probability in the mixture. Each point has a probability of belonging to each cluster.</li>
</ul>
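<p>Putting these elements together, the density of the mixture can be written as the weighted sum of its K Gaussian components (a standard formulation; here <span class="math inline">\(\pi_k\)</span> denotes the mixing probability of cluster k):</p>
<p><span class="math display">\[
f(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x;\ \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} \pi_k = 1
\]</span></p>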
<p>For example, consider the “old faithful geyser data” [in MASS R package], which can be illustrated as follow using the ggpubr R package:</p>
<pre class="r"><code># Load the data
library("MASS")
data("geyser")
# Scatter plot
library("ggpubr")
ggscatter(geyser, x = "duration", y = "waiting")+
  geom_density2d() # Add 2D density</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/022-model-based-clustering-scatter-plot-1.png" width="336" /></p>
<p>The plot above suggests at least 3 clusters in the mixture. The shape of each of the 3 clusters appears to be approximately elliptical, suggesting three bivariate normal distributions. As the 3 ellipses seem similar in terms of volume, shape and orientation, we might anticipate that the three components of this mixture have homogeneous covariance matrices.</p>
</div>
<div id="estimating-model-parameters" class="section level2">
<h2>Estimating model parameters</h2>
<p>The model parameters can be estimated using the <em>Expectation-Maximization</em> (EM) algorithm, initialized by hierarchical model-based clustering. Each cluster k is centered at the mean <span class="math inline">\(\mu_k\)</span>, with increased density for points near the mean.</p>
<p>Geometric features (shape, volume, orientation) of each cluster are determined by the covariance matrix <span class="math inline">\(\Sigma_k\)</span>.</p>
<p>Different possible parameterizations of <span class="math inline">\(\Sigma_k\)</span> are available in the R package <em>mclust</em> (see <em>?mclustModelNames</em>).</p>
<p>The available model options, in <em>mclust</em> package, are represented by identifiers including: EII, VII, EEI, VEI, EVI, VVI, EEE, EEV, VEV and VVV.</p>
<p>The first identifier refers to volume, the second to shape and the third to orientation. E stands for “equal”, V for “variable” and I for “coordinate axes”.</p>
<p>For example:</p>
<ul>
<li>EVI denotes a model in which the volumes of all clusters are equal (E), the shapes of the clusters may vary (V), and the orientation is the identity (I), i.e. aligned with the coordinate axes.
</li>
<li>EEE means that the clusters have the same volume, shape and orientation in p-dimensional space.</li>
<li>VEI means that the clusters have variable volume, the same shape and orientation equal to coordinate axes.</li>
</ul>
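<p>As a quick way to decode these identifiers, the helper <em>mclustModelNames</em>() [in <em>mclust</em>] returns a human-readable description (a small sketch; the two identifiers queried below are arbitrary examples):</p>
<pre class="r"><code># Look up what a covariance-model identifier means
library(mclust)
mclustModelNames("EEE")  # ellipsoidal: equal volume, shape and orientation
mclustModelNames("VVI")  # diagonal: varying volume and shape</code></pre>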
</div>
<div id="choosing-the-best-model" class="section level2">
<h2>Choosing the best model</h2>
<p>The <em>mclust</em> package uses maximum likelihood to fit all these models, with different covariance matrix parameterizations, for a range of k components.</p>
<p>The best model is selected using the Bayesian Information Criterion or <em>BIC</em>. A large BIC score indicates strong evidence for the corresponding model.</p>
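<p>A small sketch of how this selection can be inspected in practice (the data set here is an arbitrary illustration; any numeric matrix works):</p>
<pre class="r"><code># Compare BIC values across covariance models and numbers of components
library(mclust)
mc <- Mclust(scale(iris[, -5]))
summary(mc$BIC)          # top-ranked models by BIC
plot(mc, what = "BIC")   # BIC curves for each covariance model</code></pre>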
</div>
<div id="computing-model-based-clustering-in-r" class="section level2">
<h2>Computing model-based clustering in R</h2>
<p>We start by installing the <em>mclust</em> package as follows: <em>install.packages(“mclust”)</em></p>
<div class="notice">
<p>
Note that, model-based clustering can be applied on univariate or multivariate data.
</p>
</div>
<p>Here, we illustrate model-based clustering on the diabetes data set [mclust package], which gives three measurements and the diagnosis for 145 subjects, described as follows:</p>
<pre class="r"><code>library("mclust")
data("diabetes")
head(diabetes, 3)</code></pre>
<pre><code>##    class glucose insulin sspg
## 1 Normal      80     356  124
## 2 Normal      97     289  117
## 3 Normal     105     319  143</code></pre>
<ul>
<li>class: the diagnosis: normal, chemically diabetic, and overtly diabetic. Excluded from the cluster analysis.</li>
<li>glucose: plasma glucose response to oral glucose</li>
<li>insulin: plasma insulin response to oral glucose</li>
<li>sspg: steady-state plasma glucose (measures insulin resistance)</li>
</ul>
<p>Model-based clustering can be computed using the function Mclust() as follows:</p>
<pre class="r"><code>library(mclust)
df <- scale(diabetes[, -1]) # Standardize the data
mc <- Mclust(df)            # Model-based-clustering
summary(mc)                 # Print a summary</code></pre>
<pre><code>## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm 
## ----------------------------------------------------
## 
## Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model with 3 components:
## 
##  log.likelihood   n df  BIC  ICL
##            -169 145 29 -483 -501
## 
## Clustering table:
##  1  2  3 
## 81 36 28</code></pre>
<p>For this data, it can be seen that model-based clustering selected a model with three components (i.e. clusters). The optimal selected model is the VVV model: the three components are ellipsoidal with varying volume, shape, and orientation. The summary also contains the clustering table, specifying the number of observations in each cluster.</p>
<p>You can access the results as follows:</p>
<pre class="r"><code>mc$modelName                # Optimal selected model ==> "VVV"
mc$G                        # Optimal number of clusters => 3
head(mc$z, 30)              # Probability of belonging to a given cluster
head(mc$classification, 30) # Cluster assignment of each observation</code></pre>
</div>
<div id="visualizing-model-based-clustering" class="section level2">
<h2>Visualizing model-based clustering</h2>
<p>Model-based clustering results can be drawn using the base function plot.Mclust() [in mclust package]. Here we’ll use the function <em>fviz_mclust</em>() [in <em>factoextra</em> package] to create beautiful plots based on ggplot2.</p>
<p>In situations where the data contain more than two variables, <em>fviz_mclust</em>() uses a principal component analysis to reduce the dimensionality of the data. The first two principal components are used to produce a scatter plot of the data. However, if you want to plot the data using only two variables of interest, say c(“insulin”, “sspg”), you can specify them in the <em>fviz_mclust</em>() function using the argument <em>choose.vars = c(“insulin”, “sspg”)</em>.</p>
<pre class="r"><code>library(factoextra)
# BIC values used for choosing the number of clusters
fviz_mclust(mc, "BIC", palette = "jco")
# Classification: plot showing the clustering
fviz_mclust(mc, "classification", geom = "point", 
            pointsize = 1.5, palette = "jco")
# Classification uncertainty
fviz_mclust(mc, "uncertainty", palette = "jco")</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/022-model-based-clustering-model-base-clustering-1.png" width="307.2" /><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/022-model-based-clustering-model-base-clustering-2.png" width="307.2" /><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/022-model-based-clustering-model-base-clustering-3.png" width="307.2" /></p>
<p>Note that, in the uncertainty plot, larger symbols indicate the more uncertain observations.</p>
</div>
<div id="references" class="section level2 unnumbered">
<h2>References</h2>
<div id="refs" class="references">
<div id="ref-fraley2002">
<p>Fraley, Chris, and Adrian E Raftery. 2002. “Model-Based Clustering, Discriminant Analysis, and Density Estimation.” <em>Journal of the American Statistical Association</em> 97 (458): 611–31.</p>
</div>
<div id="ref-fraley2012">
<p>Fraley, Chris, Adrian E. Raftery, T. Brendan Murphy, and Luca Scrucca. 2012. “Mclust Version 4 for R: Normal Mixture Modeling for Model-Based Clustering, Classification, and Density Estimation.” <em>Technical Report No. 597, Department of Statistics, University of Washington</em>. <a href="https://www.stat.washington.edu/research/reports/2012/tr597.pdf" class="uri">https://www.stat.washington.edu/research/reports/2012/tr597.pdf</a>.</p>
</div>
</div>
</div>
</div><!--end rdoc-->
 
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
  (function () {
    var script = document.createElement("script");
    script.type = "text/javascript";
    script.src  = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
    document.getElementsByTagName("head")[0].appendChild(script);
  })();
</script>

<!-- END HTML -->]]></description>
			<pubDate>Thu, 07 Sep 2017 19:46:00 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[cmeans() R function: Compute Fuzzy clustering]]></title>
			<link>https://www.sthda.com/english/articles/30-advanced-clustering/103-cmeans-r-function-compute-fuzzy-clustering/</link>
			<guid>https://www.sthda.com/english/articles/30-advanced-clustering/103-cmeans-r-function-compute-fuzzy-clustering/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">
<p>This article describes how to compute the <strong>fuzzy clustering</strong> using the function <strong>cmeans</strong>() [in <em>e1071</em> R package]. Previously, we explained <a href="https://www.sthda.com/english/articles/30-advanced-clustering/101-fuzzy-clustering-essentials/">what is fuzzy clustering</a> and how to compute the fuzzy clustering using the R function fanny()[in cluster package].</p>
<p>Related articles:</p>
<ul>
<li><a href="https://www.sthda.com/english/articles/30-advanced-clustering/101-fuzzy-clustering-essentials/">Fuzzy Clustering Essentials</a></li>
<li><a href="https://www.sthda.com/english/articles/30-advanced-clustering/102-fuzzy-c-means-clustering-algorithm/">Fuzzy C-Means Clustering Algorithm</a></li>
</ul>
<div id="cmeans-format" class="section level2">
<h2>cmeans() format</h2>
<p>The simplified format of the function <strong>cmeans</strong>() is as follows:</p>
<pre class="r"><code>cmeans(x, centers, iter.max = 100, dist = "euclidean", m = 2)</code></pre>
<div class="block">
<ul>
<li>
x: a data matrix where columns are variables and rows are observations
</li>
<li>
centers: Number of clusters or initial values for cluster centers
</li>
<li>
iter.max: Maximum number of iterations
</li>
<li>
dist: Possible values are “euclidean” or “manhattan”
</li>
<li>
m: A number greater than 1 giving the degree of fuzzification.
</li>
</ul>
</div>
<p>The function cmeans() returns an object of class fclust which is a list containing the following components:</p>
<ul>
<li>centers: the final cluster centers</li>
<li>size: the number of data points in each cluster of the closest hard clustering</li>
<li>cluster: a vector of integers containing the indices of the clusters to which the data points are assigned for the closest hard clustering, as obtained by assigning points to the (first) class with maximal membership.</li>
<li>iter: the number of iterations performed</li>
<li>membership: a matrix with the membership values of the data points to the clusters</li>
<li>withinerror: the value of the objective function</li>
</ul>
</div>
<div id="compute-fuzzy-c-means-clustering" class="section level2">
<h2>Compute fuzzy c-means clustering</h2>
<pre class="r"><code>set.seed(123)
# Load the data
data("USArrests")
# Subset of USArrests
ss <- sample(1:50, 20)
df <- scale(USArrests[ss,])
# Compute fuzzy clustering
library(e1071)
cm <- cmeans(df, 4)
cm</code></pre>
<pre><code>## Fuzzy c-means clustering with 4 clusters
## 
## Cluster centers:
##   Murder Assault UrbanPop   Rape
## 1  0.857   0.338   -0.729  0.200
## 2 -0.731  -0.665    1.003 -0.333
## 3 -1.210  -1.248   -0.728 -1.153
## 4  0.629   0.970    0.501  0.865
## 
## Memberships:
##                    1      2      3       4
## Iowa         0.00916 0.0191 0.9658 0.00594
## Rhode Island 0.09885 0.5915 0.2050 0.10463
## Maryland     0.22786 0.0475 0.0273 0.69731
## Tennessee    0.87231 0.0286 0.0211 0.07801
## Utah         0.04446 0.8218 0.0844 0.04929
## Arizona      0.11876 0.1008 0.0399 0.74056
## Mississippi  0.62441 0.0931 0.1030 0.17952
## Wisconsin    0.03363 0.1110 0.8313 0.02403
## Virginia     0.39552 0.2570 0.1918 0.15573
## Maine        0.03433 0.0530 0.8915 0.02117
## Texas        0.24082 0.1595 0.0541 0.54557
## Louisiana    0.61799 0.0653 0.0419 0.27473
## Montana      0.13551 0.1366 0.6657 0.06215
## Michigan     0.09620 0.0371 0.0178 0.84890
## Arkansas     0.56529 0.1223 0.1805 0.13188
## New York     0.13194 0.1323 0.0416 0.69421
## Florida      0.17377 0.0749 0.0398 0.71155
## Alaska       0.38155 0.1354 0.1136 0.36947
## Hawaii       0.06662 0.7206 0.1487 0.06410
## New Jersey   0.05957 0.8009 0.0575 0.08206
## 
## Closest hard clustering:
##         Iowa Rhode Island     Maryland    Tennessee         Utah 
##            3            2            4            1            2 
##      Arizona  Mississippi    Wisconsin     Virginia        Maine 
##            4            1            3            1            3 
##        Texas    Louisiana      Montana     Michigan     Arkansas 
##            4            1            3            4            1 
##     New York      Florida       Alaska       Hawaii   New Jersey 
##            4            4            1            2            2 
## 
## Available components:
## [1] "centers"     "size"        "cluster"     "membership"  "iter"       
## [6] "withinerror" "call"</code></pre>
<p>The different components can be extracted using the code below:</p>
<pre class="r"><code># Membership coefficient
head(cm$membership)</code></pre>
<pre><code>##                    1      2      3       4
## Iowa         0.00916 0.0191 0.9658 0.00594
## Rhode Island 0.09885 0.5915 0.2050 0.10463
## Maryland     0.22786 0.0475 0.0273 0.69731
## Tennessee    0.87231 0.0286 0.0211 0.07801
## Utah         0.04446 0.8218 0.0844 0.04929
## Arizona      0.11876 0.1008 0.0399 0.74056</code></pre>
<pre class="r"><code># Visualize using corrplot
library(corrplot)
corrplot(cm$membership, is.corr = FALSE)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/055-cmeans-c-means-clustering-1.png" width="384" /></p>
<pre class="r"><code># Observation groups/clusters
cm$cluster</code></pre>
<pre><code>##         Iowa Rhode Island     Maryland    Tennessee         Utah 
##            3            2            4            1            2 
##      Arizona  Mississippi    Wisconsin     Virginia        Maine 
##            4            1            3            1            3 
##        Texas    Louisiana      Montana     Michigan     Arkansas 
##            4            1            3            4            1 
##     New York      Florida       Alaska       Hawaii   New Jersey 
##            4            4            1            2            2</code></pre>
</div>
<div id="visualize-clusters" class="section level2">
<h2>Visualize clusters</h2>
<pre class="r"><code>library(factoextra)
fviz_cluster(list(data = df, cluster=cm$cluster), 
             ellipse.type = "norm",
             ellipse.level = 0.68,
             palette = "jco",
             ggtheme = theme_minimal())</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/055-cmeans-visualize-c-means-clusters-1.png" width="480" /></p>
</div>
</div>
</div><!--end rdoc-->

<!-- END HTML -->]]></description>
			<pubDate>Thu, 07 Sep 2017 19:26:00 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Fuzzy C-Means Clustering Algorithm]]></title>
			<link>https://www.sthda.com/english/articles/30-advanced-clustering/102-fuzzy-c-means-clustering-algorithm/</link>
			<guid>https://www.sthda.com/english/articles/30-advanced-clustering/102-fuzzy-c-means-clustering-algorithm/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">

<p>In our previous article, we described the basic concept of <a href="https://www.sthda.com/english/articles/30-advanced-clustering/101-fuzzy-clustering-essentials/"><strong>fuzzy clustering</strong></a> and we showed how to compute fuzzy clustering. In this article, we present the <strong>fuzzy c-means clustering algorithm</strong>, which is very similar to the <a href="https://www.sthda.com/english/articles/27-partitioning-clustering-essentials/87-k-means-clustering-essentials/">k-means algorithm</a>. Its aim is to minimize the objective function defined as follows:</p>
<p>
<span class="math"><span class="math display">\[
\sum\limits_{j=1}^k \sum\limits_{x_i \in C_j} u_{ij}^m (x_i - \mu_j)^2
\]</span></span>
</p>
<p>
Where,
</p>
<ul>
<li>
<span class="math"><span class="math inline">\(u_{ij}\)</span></span> is the degree to which an observation <span class="math"><span class="math inline">\(x_i\)</span></span> belongs to a cluster <span class="math"><span class="math inline">\(c_j\)</span></span>
</li>
<li>
<span class="math"><span class="math inline">\(\mu_j\)</span></span> is the center of the cluster j
</li>
<li>
<span class="math"><span class="math inline">\(m\)</span></span> is the fuzzifier.
</li>
</ul>
<p>
<span class="notice">It can be seen that, FCM differs from k-means by using the membership values <span class="math"><span class="math inline">\(u_{ij}\)</span></span> and the fuzzifier <span class="math"><span class="math inline">\(m\)</span></span>.</span>
</p>
<p>
The membership degree <span class="math"><span class="math inline">\(u_{ij}\)</span></span> is defined as follows:
</p>
<p>
<span class="math"><span class="math display">\[
u_{ij} = \frac{1}{\sum\limits_{l=1}^k \left( \frac{| x_i - c_j |}{| x_i - c_l |}\right)^{\frac{2}{m-1}}}
\]</span></span>
</p>
<p>
The degree of belonging, <span class="math"><span class="math inline">\(u_{ij}\)</span></span>, is linked inversely to the distance from x to the cluster center.
</p>
<p>
The parameter <span class="math"><span class="math inline">\(m\)</span></span> is a real number greater than 1 (<span class="math"><span class="math inline">\(1.0 < m < \infty\)</span></span>) and it defines the level of cluster fuzziness. Note that a value of <span class="math"><span class="math inline">\(m\)</span></span> close to 1 gives a cluster solution increasingly similar to that of hard clustering such as k-means, whereas a value of <span class="math"><span class="math inline">\(m\)</span></span> close to infinity leads to complete fuzziness.
</p>
<p>
<span class="success">Note that, a good choice is to use <strong>m = 2.0</strong> (Hathaway and Bezdek 2001).</span>
</p>
<p>
In <strong>fuzzy clustering</strong>, the centroid of a cluster is the mean of all points, weighted by their degree of belonging to the cluster:
</p>
<p>
<span class="math"><span class="math display">\[
C_j = \frac{\sum\limits_{x \in C_j} u_{ij}^m x}{\sum\limits_{x \in C_j} u_{ij}^m}
\]</span></span>
</p>
<p>
Where,
</p>
<ul>
<li>
<span class="math"><span class="math inline">\(C_j\)</span></span> is the centroid of the cluster j
</li>
<li>
<span class="math"><span class="math inline">\(u_{ij}\)</span></span> is the degree to which an observation <span class="math"><span class="math inline">\(x_i\)</span></span> belongs to a cluster <span class="math"><span class="math inline">\(c_j\)</span></span>
</li>
</ul>
<p>
The algorithm of fuzzy clustering can be summarized as follows:
</p>
<ol style="list-style-type: decimal">
<li>
Specify a number of clusters k (by the analyst)
</li>
<li>
Assign randomly to each point coefficients for being in the clusters.
</li>
<li>
Repeat until the maximum number of iterations (given by “maxit”) is reached, or when the algorithm has converged (that is, the coefficients’ change between two iterations is no more than <span class="math"><span class="math inline">\(\epsilon\)</span></span>, the given sensitivity threshold):
<ul>
<li>
Compute the centroid for each cluster, using the formula above.
</li>
<li>
For each point, compute its coefficients of being in the clusters, using the formula above.
</li>
</ul>
</li>
</ol>
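<p>
The update steps above can be sketched directly in base R (an illustrative, unoptimized implementation with an assumed random initialization; in practice, use <em>e1071::cmeans</em>() or <em>cluster::fanny</em>()):
</p>
<pre class="r"><code># Minimal fuzzy c-means sketch following the formulas above
fcm <- function(x, k, m = 2, max.iter = 100, eps = 1e-6) {
  x <- as.matrix(x); n <- nrow(x)
  u <- matrix(runif(n * k), n, k)
  u <- u / rowSums(u)                      # random membership coefficients
  for (it in seq_len(max.iter)) {
    um <- u^m
    centers <- t(um) %*% x / colSums(um)   # membership-weighted centroids
    # squared distances d[i, j] = ||x_i - c_j||^2
    d <- sapply(seq_len(k), function(j) colSums((t(x) - centers[j, ])^2))
    d <- pmax(d, 1e-12)
    # u[i, j] = 1 / sum_l (d[i, j] / d[i, l])^(1/(m-1))
    w <- d^(-1 / (m - 1))
    u.new <- w / rowSums(w)
    if (max(abs(u.new - u)) < eps) { u <- u.new; break }
    u <- u.new
  }
  list(centers = centers, membership = u, cluster = max.col(u))
}
res <- fcm(scale(USArrests), k = 2)</code></pre>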
<p>
The algorithm minimizes intra-cluster variance as well, but it has the same problems as k-means: the minimum is a local minimum, and the results depend on the initial choice of weights. Hence, different initializations may lead to different results.
</p>
<p>
Using a mixture of Gaussians along with the expectation-maximization algorithm is a more statistically formalized method that includes some of these ideas, namely partial membership in classes.
</p>
</div>


</div><!--end rdoc-->

 
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
  (function () {
    var script = document.createElement("script");
    script.type = "text/javascript";
    script.src  = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
    document.getElementsByTagName("head")[0].appendChild(script);
  })();
</script>

<!-- END HTML -->]]></description>
			<pubDate>Thu, 07 Sep 2017 16:44:00 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Fuzzy Clustering Essentials]]></title>
			<link>https://www.sthda.com/english/articles/30-advanced-clustering/101-fuzzy-clustering-essentials/</link>
			<guid>https://www.sthda.com/english/articles/30-advanced-clustering/101-fuzzy-clustering-essentials/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">
<p><strong>Fuzzy clustering</strong> is considered a soft clustering method, in which each element has a probability of belonging to each cluster. In other words, each element has a set of membership coefficients corresponding to its degree of belonging to a given cluster.</p>
<p>This is different from k-means and k-medoids clustering, where each object is assigned to exactly one cluster. K-means and k-medoids clustering are known as hard or non-fuzzy clustering.</p>
<p>In fuzzy clustering, points close to the center of a cluster may belong to the cluster to a higher degree than points on the edge of the cluster. The degree to which an element belongs to a given cluster is a numerical value varying from 0 to 1.</p>
<p>The <strong>fuzzy c-means</strong> (FCM) algorithm is one of the most widely used fuzzy clustering algorithms. The centroid of a cluster is calculated as the mean of all points, weighted by their degree of belonging to the cluster.</p>
<div class="block">
<p>
In this article, we’ll describe how to compute fuzzy clustering using the R software.
</p>
</div>
<br/>
<p>Related Book:</p>
<div class = "small-block content-privileged-friends cluster-book">
    <center>
        <a href = "https://www.sthda.com/english/web/5-bookadvisor/17-practical-guide-to-cluster-analysis-in-r/">
          <img src = "https://www.sthda.com/english/sthda-upload/images/cluster-analysis/clustering-book-cover.png" /><br/>
      Practical Guide to Cluster Analysis in R
      </a>
      </center>
</div>
<div class="spacer"></div>
<div id="required-r-packages" class="section level2">
<h2>Required R packages</h2>
<p>We’ll use the following R packages: 1) <em>cluster</em> for computing fuzzy clustering and 2) <em>factoextra</em> for visualizing clusters.</p>
</div>
<div id="computing-fuzzy-clustering" class="section level2">
<h2>Computing fuzzy clustering</h2>
<p>The function <em>fanny</em>() [<em>cluster</em> R package] can be used to compute fuzzy clustering. <strong>FANNY</strong> stands for <strong>fuzzy analysis clustering</strong>. A simplified format is:</p>
<pre class="r"><code>fanny(x, k, metric = "euclidean", stand = FALSE)</code></pre>
<div class="block">
<ul>
<li>
<strong>x</strong>: A data matrix or data frame or dissimilarity matrix
</li>
<li>
<strong>k</strong>: The desired number of clusters to be generated
</li>
<li>
<strong>metric</strong>: Metric for calculating dissimilarities between observations
</li>
<li>
<strong>stand</strong>: If TRUE, variables are standardized before calculating the dissimilarities
</li>
</ul>
</div>
<p>The function <em>fanny</em>() returns an object including the following components:</p>
<ul>
<li><strong>membership</strong>: matrix containing the degree to which each observation belongs to a given cluster. Column names are the clusters and rows are observations</li>
<li><strong>coeff</strong>: Dunn’s partition coefficient F(k) of the clustering, where k is the number of clusters. F(k) is the sum of all squared membership coefficients, divided by the number of observations. Its value is between 1/k and 1. The normalized form of the coefficient is also given. It is defined as <span class="math inline">\((F(k) - 1/k) / (1 - 1/k)\)</span>, and ranges between 0 and 1. A low value of Dunn’s coefficient indicates a very fuzzy clustering, whereas a value close to 1 indicates a near-crisp clustering.</li>
<li><strong>clustering</strong>: the clustering vector containing the nearest crisp grouping of observations</li>
</ul>
<p>For example, the R code below applies fuzzy clustering on the USArrests data set:</p>
<pre class="r"><code>library(cluster)
df <- scale(USArrests)     # Standardize the data
res.fanny <- fanny(df, 2)  # Compute fuzzy clustering with k = 2</code></pre>
<p>The different components can be extracted using the code below:</p>
<pre class="r"><code>head(res.fanny$membership, 3) # Membership coefficients</code></pre>
<pre><code>##          [,1]  [,2]
## Alabama 0.664 0.336
## Alaska  0.610 0.390
## Arizona 0.686 0.314</code></pre>
<pre class="r"><code>res.fanny$coeff # Dunn's partition coefficient</code></pre>
<pre><code>## dunn_coeff normalized 
##      0.555      0.109</code></pre>
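<p>As a quick arithmetic check, the normalized value can be recomputed from the definition given above:</p>
<pre class="r"><code># Normalized Dunn coefficient: (F(k) - 1/k) / (1 - 1/k)
k <- 2
F_k <- 0.555
(F_k - 1/k) / (1 - 1/k)  # 0.11, matching the reported value up to rounding</code></pre>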
<pre class="r"><code>head(res.fanny$clustering) # Observation groups</code></pre>
<pre><code>##    Alabama     Alaska    Arizona   Arkansas California   Colorado 
##          1          1          1          2          1          1</code></pre>
<p>To visualize observation groups, use the function <em>fviz_cluster</em>() [<em>factoextra</em> package]:</p>
<pre class="r"><code>library(factoextra)
fviz_cluster(res.fanny, ellipse.type = "norm", repel = TRUE,
             palette = "jco", ggtheme = theme_minimal(),
             legend = "right")</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/021-fuzzy-clustering-visualize-1.png" width="518.4" /></p>
<p>To evaluate the goodness of the clustering results, plot the silhouette coefficient as follows:</p>
<pre class="r"><code>fviz_silhouette(res.fanny, palette = "jco",
                ggtheme = theme_minimal())</code></pre>
<pre><code>##   cluster size ave.sil.width
## 1       1   22          0.32
## 2       2   28          0.44</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/021-fuzzy-clustering-silhouette-1.png" width="518.4" /></p>
</div>
<div id="summary" class="section level2">
<h2>Summary</h2>
<p>Fuzzy clustering is an alternative to k-means clustering, in which each data point has a membership coefficient for each cluster. Here, we demonstrated how to compute and visualize fuzzy clustering using the combination of the <em>cluster</em> and <em>factoextra</em> R packages.</p>
</div>
</div><!--end rdoc-->
 
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
  (function () {
    var script = document.createElement("script");
    script.type = "text/javascript";
    script.src  = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
    document.getElementsByTagName("head")[0].appendChild(script);
  })();
</script>

<!-- END HTML -->]]></description>
			<pubDate>Thu, 07 Sep 2017 15:50:00 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Hierarchical K-Means Clustering: Optimize Clusters]]></title>
			<link>https://www.sthda.com/english/articles/30-advanced-clustering/100-hierarchical-k-means-clustering-optimize-clusters/</link>
			<guid>https://www.sthda.com/english/articles/30-advanced-clustering/100-hierarchical-k-means-clustering-optimize-clusters/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">
<div id="hkmeans" class="section level1">
<h1>Hierarchical K-Means Clustering</h1>
<p>K-means (Chapter @ref(kmeans-clustering)) represents one of the most popular clustering algorithms. However, it has some limitations: it requires the user to specify the number of clusters in advance and it selects initial centroids randomly. The final k-means clustering solution is very sensitive to this initial random selection of cluster centers, so the result might be (slightly) different each time you compute k-means.</p>
<div class="block">
<p>
In this chapter, we describe a hybrid method, named <strong>hierarchical k-means clustering</strong> (hkmeans), for improving k-means results.
</p>
</div>
<br/>
<p>Related Book:</p>
<div class = "small-block content-privileged-friends cluster-book">
    <center>
        <a href = "https://www.sthda.com/english/web/5-bookadvisor/17-practical-guide-to-cluster-analysis-in-r/">
          <img src = "https://www.sthda.com/english/sthda-upload/images/cluster-analysis/clustering-book-cover.png" /><br/>
      Practical Guide to Cluster Analysis in R
      </a>
      </center>
</div>
<div class="spacer"></div>
<div id="algorithm" class="section level2">
<h2>Algorithm</h2>
<p>The algorithm is summarized as follow:</p>
<ol style="list-style-type: decimal">
<li>Compute hierarchical clustering and cut the tree into k-clusters</li>
<li>Compute the center (i.e the mean) of each cluster</li>
<li>Compute k-means by using the set of cluster centers (defined in step 2) as the initial cluster centers</li>
</ol>
<div class="notice">
<p>
Note that the k-means algorithm will improve the initial partitioning generated at step 2 of the algorithm. Hence, the initial partitioning can be slightly different from the final partitioning obtained in step 3.
</p>
</div>
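<p>The three steps above can be sketched manually with base R functions. This is an illustrative sketch only; the actual <em>hkmeans</em>() implementation in <em>factoextra</em> may differ in its details:</p>
<pre class="r"><code># Illustrative sketch of the hkmeans steps using base R
df <- scale(USArrests)
# Step 1: hierarchical clustering, tree cut into k = 4 clusters
grp <- cutree(hclust(dist(df), method = "ward.D2"), k = 4)
# Step 2: compute the center (i.e. the mean) of each cluster
centers <- aggregate(df, by = list(cluster = grp), FUN = mean)[, -1]
# Step 3: k-means using these centers as the initial cluster centers
res.km <- kmeans(df, centers = centers)</code></pre>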
</div>
<div id="r-code" class="section level2">
<h2>R code</h2>
<p>The R function <em>hkmeans</em>() [in the <em>factoextra</em> package] provides an easy solution to compute hierarchical k-means clustering. The format of the result is similar to the one returned by the standard kmeans() function (see Chapter @ref(kmeans-clustering)).</p>
<p>To install factoextra, type this: <em>install.packages(“factoextra”)</em>.</p>
<p>We’ll use the USArrest data set and we start by standardizing the data:</p>
<pre class="r"><code>df <- scale(USArrests)</code></pre>
<pre class="r"><code># Compute hierarchical k-means clustering
library(factoextra)
res.hk <- hkmeans(df, 4)
# Elements returned by hkmeans()
names(res.hk)</code></pre>
<pre><code>##  [1] "cluster"      "centers"      "totss"        "withinss"    
##  [5] "tot.withinss" "betweenss"    "size"         "iter"        
##  [9] "ifault"       "data"         "hclust"</code></pre>
<p>To print all the results, type this:</p>
<pre class="r"><code># Print the results
res.hk</code></pre>
<pre class="r"><code># Visualize the tree
fviz_dend(res.hk, cex = 0.6, palette = "jco", 
          rect = TRUE, rect_border = "jco", rect_fill = TRUE)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/020-hierarchical-k-means-clustering-hierarchical-k-means-clustering-1.png" width="518.4" /></p>
<pre class="r"><code># Visualize the hkmeans final clusters
fviz_cluster(res.hk, palette = "jco", repel = TRUE,
             ggtheme = theme_classic())</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/020-hierarchical-k-means-clustering-hierarchical-k-means-clustering-2.png" width="518.4" /></p>
</div>
<div id="summary" class="section level2">
<h2>Summary</h2>
<p>We described the hybrid <strong>hierarchical k-means clustering</strong> approach for improving k-means results.</p>
</div>
</div><!--end rdoc-->

<!-- END HTML -->]]></description>
			<pubDate>Thu, 07 Sep 2017 15:21:00 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Computing P-value for Hierarchical Clustering]]></title>
			<link>https://www.sthda.com/english/articles/29-cluster-validation-essentials/99-computing-p-value-for-hierarchical-clustering/</link>
			<guid>https://www.sthda.com/english/articles/29-cluster-validation-essentials/99-computing-p-value-for-hierarchical-clustering/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">
<p>Clusters can be found in a data set by chance, due to noise or sampling error. This article describes the R package <strong>pvclust</strong> <span class="citation">(Suzuki and Shimodaira 2015)</span>, which uses bootstrap resampling techniques to <strong>compute a p-value</strong> for each cluster in a <strong>hierarchical clustering</strong>.</p>
<br/>
<p>Contents:</p>
<div id="TOC">
<ul>
<li><a href="#algorithm">Algorithm</a></li>
<li><a href="#required-packages">Required packages</a></li>
<li><a href="#data-preparation">Data preparation</a></li>
<li><a href="#compute-p-value-for-hierarchical-clustering">Compute p-value for hierarchical clustering</a><ul>
<li><a href="#description-of-pvclust-function">Description of pvclust() function</a></li>
<li><a href="#usage-of-pvclust-function">Usage of pvclust() function</a></li>
</ul></li>
<li><a href="#references">References</a></li>
</ul>
</div>
<br/>
<p>Related Book:</p>
<div class = "small-block content-privileged-friends cluster-book">
    <center>
        <a href = "https://www.sthda.com/english/web/5-bookadvisor/17-practical-guide-to-cluster-analysis-in-r/">
          <img src = "https://www.sthda.com/english/sthda-upload/images/cluster-analysis/clustering-book-cover.png" /><br/>
      Practical Guide to Cluster Analysis in R
      </a>
      </center>
</div>
<div class="spacer"></div>
<div id="algorithm" class="section level2">
<h2>Algorithm</h2>
<ol style="list-style-type: decimal">
<li>Generate thousands of bootstrap samples by randomly sampling elements of the data</li>
<li>Compute hierarchical clustering on each bootstrap copy</li>
<li>For each cluster:
<ul>
<li>Compute the <em>bootstrap probability</em> (<em>BP</em>) value, which corresponds to the frequency with which the cluster is identified in the bootstrap copies</li>
<li>Compute the <em>approximately unbiased</em> (<em>AU</em>) probability value (p-value) by multiscale bootstrap resampling</li>
</ul></li>
</ol>
<div class="success">
<p>
Clusters with AU >= 95% are considered to be strongly supported by the data.
</p>
</div>
</div>
<div id="required-packages" class="section level2">
<h2>Required packages</h2>
<ol style="list-style-type: decimal">
<li>Install <strong>pvclust</strong>:</li>
</ol>
<pre class="r"><code>install.packages("pvclust")</code></pre>
<ol start="2" style="list-style-type: decimal">
<li>Load <strong>pvclust</strong>:</li>
</ol>
<pre class="r"><code>library(pvclust)</code></pre>
</div>
<div id="data-preparation" class="section level2">
<h2>Data preparation</h2>
<p>We’ll use <em>lung</em> data set [in <em>pvclust</em> package]. It contains the gene expression profile of 916 genes of 73 lung tissues including 67 tumors. Columns are samples and rows are genes.</p>
<pre class="r"><code>library(pvclust)
# Load the data
data("lung")
head(lung[, 1:4])</code></pre>
<pre><code>##               fetal_lung 232-97_SCC 232-97_node 68-96_Adeno
## IMAGE:196992       -0.40       4.28        3.68       -1.35
## IMAGE:587847       -2.22       5.21        4.75       -0.91
## IMAGE:1049185      -1.35      -0.84       -2.88        3.35
## IMAGE:135221        0.68       0.56       -0.45       -0.20
## IMAGE:298560          NA       4.14        3.58       -0.40
## IMAGE:119882       -3.23      -2.84       -2.72       -0.83</code></pre>
<pre class="r"><code># Dimension of the data
dim(lung)</code></pre>
<pre><code>## [1] 916  73</code></pre>
<p>We’ll use only a subset of the data set for the clustering analysis. The R function <em>sample</em>() can be used to extract a random subset of 30 samples:</p>
<pre class="r"><code>set.seed(123)
ss <- sample(1:73, 30) # extract a random subset of 30 samples out of 73
df <- lung[, ss]</code></pre>
</div>
<div id="compute-p-value-for-hierarchical-clustering" class="section level2">
<h2>Compute p-value for hierarchical clustering</h2>
<div id="description-of-pvclust-function" class="section level3">
<h3>Description of pvclust() function</h3>
<p>The function <em>pvclust</em>() can be used as follows:</p>
<pre class="r"><code>pvclust(data, method.hclust = "average",
        method.dist = "correlation", nboot = 1000)</code></pre>
<p>Note that the computation time can be strongly decreased using the parallel version, <em>parPvclust</em>(). (Read ?parPvclust for more information.)</p>
<pre class="r"><code>parPvclust(cl=NULL, data, method.hclust = "average",
           method.dist = "correlation", nboot = 1000,
           iseed = NULL)</code></pre>
<div class="block">
<ul>
<li>
<strong>data</strong>: numeric data matrix or data frame.
</li>
<li>
<strong>method.hclust</strong>: the agglomerative method used in hierarchical clustering. Possible values are one of “average”, “ward”, “single”, “complete”, “mcquitty”, “median” or “centroid”. The default is “average”. See method argument in <strong>?hclust</strong>.
</li>
<li>
<strong>method.dist</strong>: the distance measure to be used. Possible values are one of “correlation”, “uncentered”, “abscor” or those which are allowed for the <strong>method</strong> argument in the <strong>dist()</strong> function, such as “euclidean” and “manhattan”.
</li>
<li>
<strong>nboot</strong>: the number of bootstrap replications. The default is 1000.
</li>
<li>
<strong>iseed</strong>: an integer for random seeds. Use the iseed argument to achieve reproducible results.
</li>
</ul>
</div>
<p>The function <em>pvclust</em>() returns an object of class <em>pvclust</em> containing many elements, including <em>hclust</em>, which holds the hierarchical clustering result for the original data as generated by the function <em>hclust</em>().</p>
</div>
<div id="usage-of-pvclust-function" class="section level3">
<h3>Usage of pvclust() function</h3>
<p><em>pvclust</em>() performs clustering on the columns of the data set, which correspond to samples in our case. If you want to perform the clustering on the variables (here, genes), you have to transpose the data set using the function <em>t</em>().</p>
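<p>For example, to cluster the genes rather than the samples, transpose the data first (a sketch; note that clustering hundreds of genes with many bootstrap replications can be slow):</p>
<pre class="r"><code># Cluster genes (rows) instead of samples (columns)
res.gene <- pvclust(t(df), method.dist = "cor",
                    method.hclust = "average", nboot = 10)</code></pre>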
<p>The R code below computes <em>pvclust</em>() using 10 as the number of bootstrap replications (for speed):</p>
<pre class="r"><code>library(pvclust)
set.seed(123)
res.pv <- pvclust(df, method.dist="cor", 
                  method.hclust="average", nboot = 10)</code></pre>
<pre class="r"><code># Default plot
plot(res.pv, hang = -1, cex = 0.5)
pvrect(res.pv)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/018-p-value-for-hierarchical-clustering-pvclust-p-value-hierarchical-clustering-1.png" width="518.4" /></p>
<div class="success">
<p>
Values on the dendrogram are <em>AU p-values</em> (red, left), <em>BP values</em> (green, right), and <em>cluster labels</em> (grey, bottom). Clusters with AU >= 95% are indicated by the rectangles and are considered to be strongly supported by the data.
</p>
</div>
<p>To extract the objects from the significant clusters, use the function <em>pvpick</em>():</p>
<pre class="r"><code>clusters <- pvpick(res.pv)
clusters</code></pre>
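<p>By default, <em>pvpick</em>() selects clusters with AU >= 0.95; this threshold can be changed via its <em>alpha</em> argument:</p>
<pre class="r"><code># Relax the AU threshold to 90%
clusters90 <- pvpick(res.pv, alpha = 0.90)</code></pre>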
<p>Parallel computation can be applied as follows:</p>
<pre class="r"><code># Create a parallel socket cluster
library(parallel)
cl <- makeCluster(2, type = "PSOCK")
# parallel version of pvclust
res.pv <- parPvclust(cl, df, nboot=1000)
stopCluster(cl)</code></pre>
</div>
</div>
<div id="references" class="section level2 unnumbered">
<h2>References</h2>
<div id="refs" class="references">
<div id="ref-suzuki2015">
<p>Suzuki, Ryota, and Hidetoshi Shimodaira. 2015. <em>Pvclust: Hierarchical Clustering with P-Values via Multiscale Bootstrap Resampling</em>. <a href="https://CRAN.R-project.org/package=pvclust" class="uri">https://CRAN.R-project.org/package=pvclust</a>.</p>
</div>
</div>
</div>
</div><!--end rdoc-->

<!-- END HTML -->]]></description>
			<pubDate>Thu, 07 Sep 2017 11:52:00 +0200</pubDate>
			
		</item>
		
	</channel>
</rss>
