Clustering methods are used to identify groups of similar objects in a multivariate data sets collected from fields such as marketing, bio-medical and geo-spatial. They are different types of clustering methods, including:
In this article, we provide an overview of clustering methods and quick start R code to perform cluster analysis in R:
Contents:
Related Book:
We’ll use mainly two R packages:
Accessory packages:
Install:
install.packages("factoextra")
install.packages("cluster")
install.packages("magrittr")
Load packages:
library("cluster")
library("factoextra")
library("magrittr")
Read more: Data Preparation and Essential R Packages for Cluster Analysis
# Load and prepare the data
data("USArrests")
my_data <- USArrests %>%
na.omit() %>% # Remove missing values (NA)
scale() # Scale variables
# View the firt 3 rows
head(my_data, n = 3)
## Murder Assault UrbanPop Rape
## Alabama 1.2426 0.783 -0.521 -0.00342
## Alaska 0.5079 1.107 -1.212 2.48420
## Arizona 0.0716 1.479 0.999 1.04288
The classification of objects, into clusters, requires some methods for measuring the distance or the (dis)similarity between the objects. Chapter Clustering Distance Measures Essentials covers the common distance measures used for assessing similarity between observations.
It’s simple to compute and visualize distance matrix using the functions get_dist() and fviz_dist() [factoextra R package]:
get_dist()
: for computing a distance matrix between the rows of a data matrix. Compared to the standard dist()
function, it supports correlation-based distance measures including “pearson”, “kendall” and “spearman” methods.fviz_dist()
: for visualizing a distance matrixres.dist <- get_dist(USArrests, stand = TRUE, method = "pearson")
fviz_dist(res.dist,
gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
Read more: Clustering Distance Measures Essentials
Partitioning algorithms are clustering techniques that subdivide the data sets into a set of k groups, where k is the number of groups pre-specified by the analyst.
There are different types of partitioning clustering methods. The most popular is the K-means clustering (MacQueen 1967), in which, each cluster is represented by the center or means of the data points belonging to the cluster. The K-means method is sensitive to outliers.
An alternative to k-means clustering is the K-medoids clustering or PAM (Partitioning Around Medoids, Kaufman & Rousseeuw, 1990), which is less sensitive to outliers compared to k-means.
Read more: Partitioning Clustering methods.
The following R codes show how to determine the optimal number of clusters and how to compute k-means and PAM clustering in R.
factoextra::fviz_nbclust()
library("factoextra")
fviz_nbclust(my_data, kmeans, method = "gap_stat")
Suggested number of cluster: 3
Compute and visualize k-means clustering
set.seed(123)
km.res <- kmeans(my_data, 3, nstart = 25)
# Visualize
library("factoextra")
fviz_cluster(km.res, data = my_data,
ellipse.type = "convex",
palette = "jco",
ggtheme = theme_minimal())
Similarly, the k-medoids/pam clustering can be computed as follow:
# Compute PAM
library("cluster")
pam.res <- pam(my_data, 3)
# Visualize
fviz_cluster(pam.res)
Hierarchical clustering is an alternative approach to partitioning clustering for identifying groups in the dataset. It does not require to pre-specify the number of clusters to be generated.
The result of hierarchical clustering is a tree-based representation of the objects, which is also known as dendrogram. Observations can be subdivided into groups by cutting the dendrogram at a desired similarity level.
R code to compute and visualize hierarchical clustering:
# Compute hierarchical clustering
res.hc <- USArrests %>%
scale() %>% # Scale the data
dist(method = "euclidean") %>% # Compute dissimilarity matrix
hclust(method = "ward.D2") # Compute hierachical clustering
# Visualize using factoextra
# Cut in 4 groups and color by groups
fviz_dend(res.hc, k = 4, # Cut in four groups
cex = 0.5, # label size
k_colors = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),
color_labels_by_k = TRUE, # color labels by groups
rect = TRUE # Add rectangle around groups
)
Read more: Hierarchical clustering
See also:
Clustering validation and evaluation strategies, consist of measuring the goodness of clustering results. Before applying any clustering algorithm to a data set, the first thing to do is to assess the clustering tendency. That is, whether the data contains any inherent grouping structure.
If yes, then how many clusters are there. Next, you can perform hierarchical clustering or partitioning clustering (with a pre-specified number of clusters). Finally, you can use a number of measures, described in this chapter, to evaluate the goodness of the clustering results.
Read more: Cluster Validation Essentials
To assess the clustering tendency, the Hopkins’ statistic and a visual approach can be used. This can be performed using the function get_clust_tendency()
[factoextra package], which creates an ordered dissimilarity image (ODI).
R code:
gradient.color <- list(low = "steelblue", high = "white")
iris[, -5] %>% # Remove column 5 (Species)
scale() %>% # Scale variables
get_clust_tendency(n = 50, gradient = gradient.color)
## $hopkins_stat
## [1] 0.2
##
## $plot
Read more: Assessing Clustering Tendency
There are different methods for determining the optimal number of clusters.
In the R code below, we’ll use the NbClust
R package, which provides 30 indices for determining the best number of clusters. First, install it using install.packages("NbClust")
, then type this:
set.seed(123)
# Compute
library("NbClust")
res.nbclust <- USArrests %>%
scale() %>%
NbClust(distance = "euclidean",
min.nc = 2, max.nc = 10,
method = "complete", index ="all")
# Visualize
library(factoextra)
fviz_nbclust(res.nbclust, ggtheme = theme_minimal())
## Among all indices:
## ===================
## * 2 proposed 0 as the best number of clusters
## * 1 proposed 1 as the best number of clusters
## * 9 proposed 2 as the best number of clusters
## * 4 proposed 3 as the best number of clusters
## * 6 proposed 4 as the best number of clusters
## * 2 proposed 5 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
## * 1 proposed 10 as the best number of clusters
##
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is 2 .
Read more: Determining the Optimal Number of Clusters
A variety of measures has been proposed in the literature for evaluating clustering results. The term clustering validation is used to design the procedure of evaluating the results of a clustering algorithm.
The silhouette plot is one of the many measures for inspecting and validating clustering results. Recall that the silhouette (\(S_i\)) measures how similar an object \(i\) is to the the other objects in its own cluster versus those in the neighbor cluster. \(S_i\) values range from 1 to - 1:
In the following R code, we’ll compute and evaluate the result of hierarchical clustering methods.
set.seed(123)
# Enhanced hierarchical clustering, cut in 3 groups
res.hc <- iris[, -5] %>%
scale() %>%
eclust("hclust", k = 3, graph = FALSE)
# Visualize with factoextra
fviz_dend(res.hc, palette = "jco",
rect = TRUE, show_labels = FALSE)
fviz_silhouette(res.hc)
## cluster size ave.sil.width
## 1 1 49 0.63
## 2 2 30 0.44
## 3 3 71 0.32
# Silhouette width of observations
sil <- res.hc$silinfo$widths[, 1:3]
# Objects with negative silhouette
neg_sil_index <- which(sil[, 'sil_width'] < 0)
sil[neg_sil_index, , drop = FALSE]
## cluster neighbor sil_width
## 84 3 2 -0.0127
## 122 3 2 -0.0179
## 62 3 2 -0.0476
## 135 3 2 -0.0530
## 73 3 2 -0.1009
## 74 3 2 -0.1476
## 114 3 2 -0.1611
## 72 3 2 -0.2304
Read more: Cluster Validation Statistics
Fuzzy clustering is also known as soft method. Standard clustering approaches produce partitions (K-means, PAM), in which each observation belongs to only one cluster. This is known as hard clustering.
In Fuzzy clustering, items can be a member of more than one cluster. Each item has a set of membership coefficients corresponding to the degree of being in a given cluster. The Fuzzy c-means method is the most popular fuzzy clustering algorithm.
Read more: Fuzzy Clustering.
In model-based clustering, the data are viewed as coming from a distribution that is mixture of two ore more clusters. It finds best fit of models to data and estimates the number of clusters.
Read more: Model-Based Clustering.
DBSCAN is a partitioning method that has been introduced in Ester et al. (1996). It can find out clusters of different shapes and sizes from data containing noise and outliers (Ester et al. 1996). The basic idea behind density-based clustering approach is derived from a human intuitive clustering method.
The description and implementation of DBSCAN in R are provided at this link: DBSCAN: Density-Based Clustering.
Ester, Martin, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.” In, 226–31. AAAI Press.
MacQueen, J. 1967. “Some Methods for Classification and Analysis of Multivariate Observations.” In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, 281–97. Berkeley, Calif.: University of California Press. http://projecteuclid.org:443/euclid.bsmsp/1200512992.
This article describes k-means clustering example and provide a step-by-step guide summarizing the different steps to follow for conducting a cluster analysis on a real data set using R software.
We’ll use mainly two R packages:
Install these packages, as follow:
install.packages(c("cluster", "factoextra"))
A rigorous cluster analysis can be conducted in 3 steps mentioned below:
Here, we provide quick R scripts to perform all these steps.
Contents:
Related Book:
We’ll use the demo data set USArrests. We start by standardizing the data using the scale() function:
# Load the data set
data(USArrests)
# Standardize
df <- scale(USArrests)
The function get_clust_tendency() [factoextra package] can be used. It computes the Hopkins statistic and provides a visual approach.
library("factoextra")
res <- get_clust_tendency(df, 40, graph = TRUE)
# Hopskin statistic
res$hopkins_stat
## [1] 0.656
# Visualize the dissimilarity matrix
print(res$plot)
The value of the Hopkins statistic is significantly < 0.5, indicating that the data is highly clusterable. Additionally, It can be seen that the ordered dissimilarity image contains patterns (i.e., clusters).
As k-means clustering requires to specify the number of clusters to generate, we’ll use the function clusGap() [cluster package] to compute gap statistics for estimating the optimal number of clusters . The function fviz_gap_stat() [factoextra] is used to visualize the gap statistic plot.
library("cluster")
set.seed(123)
# Compute the gap statistic
gap_stat <- clusGap(df, FUN = kmeans, nstart = 25,
K.max = 10, B = 100)
# Plot the result
library(factoextra)
fviz_gap_stat(gap_stat)
The gap statistic suggests a 4 cluster solutions.
It’s also possible to use the function NbClust() [in NbClust] package.
It’s also possible to use the function NbClust() [NbClust package].
K-means clustering with k = 4:
# Compute k-means
set.seed(123)
km.res <- kmeans(df, 4, nstart = 25)
head(km.res$cluster, 20)
## Alabama Alaska Arizona Arkansas California Colorado
## 4 3 3 4 3 3
## Connecticut Delaware Florida Georgia Hawaii Idaho
## 2 2 3 4 2 1
## Illinois Indiana Iowa Kansas Kentucky Louisiana
## 3 2 1 2 1 4
## Maine Maryland
## 1 3
# Visualize clusters using factoextra
fviz_cluster(km.res, USArrests)
Recall that the silhouette measures (\(S_i\)) how similar an object \(i\) is to the the other objects in its own cluster versus those in the neighbor cluster. \(S_i\) values range from 1 to - 1:
sil <- silhouette(km.res$cluster, dist(df))
rownames(sil) <- rownames(USArrests)
head(sil[, 1:3])
## cluster neighbor sil_width
## Alabama 4 3 0.4858
## Alaska 3 4 0.0583
## Arizona 3 2 0.4155
## Arkansas 4 2 0.1187
## California 3 2 0.4356
## Colorado 3 2 0.3265
fviz_silhouette(sil)
## cluster size ave.sil.width
## 1 1 13 0.37
## 2 2 16 0.34
## 3 3 13 0.27
## 4 4 8 0.39
It can be seen that there are some samples which have negative silhouette values. Some natural questions are :
Which samples are these? To what cluster are they closer?
This can be determined from the output of the function silhouette() as follow:
neg_sil_index <- which(sil[, "sil_width"] < 0)
sil[neg_sil_index, , drop = FALSE]
## cluster neighbor sil_width
## Missouri 3 2 -0.0732
The function eclust()[factoextra package] provides several advantages compared to the standard packages used for clustering analysis:
# Compute k-means
res.km <- eclust(df, "kmeans", nstart = 25)
# Gap statistic plot
fviz_gap_stat(res.km$gap_stat)
# Silhouette plot
fviz_silhouette(res.km)
# Enhanced hierarchical clustering
res.hc <- eclust(df, "hclust") # compute hclust
## Clustering k = 1,2,..., K.max (= 10): .. done
## Bootstrapping, b = 1,2,..., B (= 100) [one "." per sample]:
## .................................................. 50
## .................................................. 100
fviz_dend(res.hc, rect = TRUE) # dendrogam
The R code below generates the silhouette plot and the scatter plot for hierarchical clustering.
fviz_silhouette(res.hc) # silhouette plot
fviz_cluster(res.hc) # scatter plot
In R, standard clustering methods (partitioning and hierarchical clustering) can be computed using the R packages stats and cluster. However the workflow, generally, requires multiple steps and multiple lines of R codes.
This article describes some easy-to-use wrapper functions, in the factoextra R package, for simplifying and improving cluster analysis in R. These functions include:
get_dist() & fviz_dist() for computing and visualizing distance matrix between rows of a data matrix. Compared to the standard dist() function, get_dist() supports correlation-based distance measures including “pearson”, “kendall” and “spearman” methods.
We’ll use the factoextra package for an enhanced cluster analysis and visualization.
install.packages("factoextra")
library(factoextra)
The built-in R dataset USArrests is used:
# Load and scale the dataset
data("USArrests")
df <- scale(USArrests)
head(df)
## Murder Assault UrbanPop Rape
## Alabama 1.2426 0.783 -0.521 -0.00342
## Alaska 0.5079 1.107 -1.212 2.48420
## Arizona 0.0716 1.479 0.999 1.04288
## Arkansas 0.2323 0.231 -1.074 -0.18492
## California 0.2783 1.263 1.759 2.06782
## Colorado 0.0257 0.399 0.861 1.86497
library(factoextra)
# Correlation-based distance method
res.dist <- get_dist(df, method = "pearson")
head(round(as.matrix(res.dist), 2))[, 1:6]
## Alabama Alaska Arizona Arkansas California Colorado
## Alabama 0.00 0.71 1.45 0.09 1.87 1.69
## Alaska 0.71 0.00 0.83 0.37 0.81 0.52
## Arizona 1.45 0.83 0.00 1.18 0.29 0.60
## Arkansas 0.09 0.37 1.18 0.00 1.59 1.37
## California 1.87 0.81 0.29 1.59 0.00 0.11
## Colorado 1.69 0.52 0.60 1.37 0.11 0.00
# Visualize the dissimilarity matrix
fviz_dist(res.dist, lab_size = 8)
In the plot above, similar objects are close to one another. Red color corresponds to small distance and blue color indicates big distance between observation.
The standard R code for computing hierarchical clustering looks like this:
# Load and scale the dataset
data("USArrests")
df <- scale(USArrests)
# Compute dissimilarity matrix
res.dist <- dist(df, method = "euclidean")
# Compute hierarchical clustering
res.hc <- hclust(res.dist, method = "ward.D2")
# Visualize
plot(res.hc, cex = 0.5)
In this section we’ll describe the eclust() function [factoextra package] to simplify the workflow. The format is as follow:
eclust(x, FUNcluster = "kmeans", hc_metric = "euclidean", ...)
In the following R code, we’ll show some examples for enhanced k-means clustering and hierarchical clustering. Note that the same analysis can be done for PAM, CLARA, FANNY, AGNES and DIANA.
library("factoextra")
# Enhanced k-means clustering
res.km <- eclust(df, "kmeans", nstart = 25)
## Clustering k = 1,2,..., K.max (= 10): .. done
## Bootstrapping, b = 1,2,..., B (= 100) [one "." per sample]:
## .................................................. 50
## .................................................. 100
# Gap statistic plot
fviz_gap_stat(res.km$gap_stat)
# Silhouette plot
fviz_silhouette(res.km)
## cluster size ave.sil.width
## 1 1 8 0.39
## 2 2 16 0.34
## 3 3 13 0.37
## 4 4 13 0.27
# Optimal number of clusters using gap statistics
res.km$nbclust
## [1] 4
# Print result
res.km
## K-means clustering with 4 clusters of sizes 8, 16, 13, 13
##
## Cluster means:
## Murder Assault UrbanPop Rape
## 1 1.412 0.874 -0.815 0.0193
## 2 -0.489 -0.383 0.576 -0.2617
## 3 -0.962 -1.107 -0.930 -0.9668
## 4 0.695 1.039 0.723 1.2769
##
## Clustering vector:
## Alabama Alaska Arizona Arkansas California
## 1 4 4 1 4
## Colorado Connecticut Delaware Florida Georgia
## 4 2 2 4 1
## Hawaii Idaho Illinois Indiana Iowa
## 2 3 4 2 3
## Kansas Kentucky Louisiana Maine Maryland
## 2 3 1 3 4
## Massachusetts Michigan Minnesota Mississippi Missouri
## 2 4 3 1 4
## Montana Nebraska Nevada New Hampshire New Jersey
## 3 3 4 3 2
## New Mexico New York North Carolina North Dakota Ohio
## 4 4 1 3 2
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 2 2 2 2 1
## South Dakota Tennessee Texas Utah Vermont
## 3 1 4 2 3
## Virginia Washington West Virginia Wisconsin Wyoming
## 2 2 3 3 2
##
## Within cluster sum of squares by cluster:
## [1] 8.32 16.21 11.95 19.92
## (between_SS / total_SS = 71.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault" "clust_plot" "silinfo" "nbclust"
## [13] "data" "gap_stat"
# Enhanced hierarchical clustering
res.hc <- eclust(df, "hclust") # compute hclust
## Clustering k = 1,2,..., K.max (= 10): .. done
## Bootstrapping, b = 1,2,..., B (= 100) [one "." per sample]:
## .................................................. 50
## .................................................. 100
fviz_dend(res.hc, rect = TRUE) # dendrogam
fviz_silhouette(res.hc) # silhouette plot
## cluster size ave.sil.width
## 1 1 19 0.26
## 2 2 19 0.28
## 3 3 12 0.43
fviz_cluster(res.hc) # scatter plot
It’s also possible to specify the number of clusters as follow:
eclust(df, "kmeans", k = 4)
DBSCAN (Density-Based Spatial Clustering and Application with Noise), is a density-based clusering algorithm, introduced in Ester et al. 1996, which can be used to identify clusters of any shape in a data set containing noise and outliers.
The basic idea behind the density-based clustering approach is derived from a human intuitive clustering method. For instance, by looking at the figure below, one can easily identify four clusters along with several points of noise, because of the differences in the density of points.
Clusters are dense regions in the data space, separated by regions of lower density of points. The DBSCAN algorithm is based on this intuitive notion of “clusters” and “noise”. The key idea is that for each point of a cluster, the neighborhood of a given radius has to contain at least a minimum number of points.
(From Ester et al. 1996)
In this chapter, we’ll describe the DBSCAN algorithm and demonstrate how to compute DBSCAN using the fpc R package.
Contents:
Related Book:
Partitioning methods (K-means, PAM clustering) and hierarchical clustering are suitable for finding spherical-shaped clusters or convex clusters. In other words, they work well only for compact and well separated clusters. Moreover, they are also severely affected by the presence of noise and outliers in the data.
Unfortunately, real life data can contain: i) clusters of arbitrary shape such as those shown in the figure below (oval, linear and “S” shape clusters); ii) many outliers and noise.
The figure below shows a data set containing nonconvex clusters and outliers/noises. The simulated data set multishapes [in factoextra package] is used.
The plot above contains 5 clusters and outliers, including:
Given such data, k-means algorithm has difficulties for identifying theses clusters with arbitrary shapes. To illustrate this situation, the following R code computes k-means algorithm on the multishapes data set. The function fviz_cluster()[factoextra package] is used to visualize the clusters.
First, install factoextra: install.packages(“factoextra”); then compute and visualize k-means clustering using the data set multishapes:
library(factoextra)
data("multishapes")
df <- multishapes[, 1:2]
set.seed(123)
km.res <- kmeans(df, 5, nstart = 25)
fviz_cluster(km.res, df, geom = "point",
ellipse= FALSE, show.clust.cent = FALSE,
palette = "jco", ggtheme = theme_classic())
We know there are 5 five clusters in the data, but it can be seen that k-means method inaccurately identify the 5 clusters.
The goal is to identify dense regions, which can be measured by the number of objects close to a given point.
Two important parameters are required for DBSCAN: epsilon (“eps”) and minimum points (“MinPts”). The parameter eps defines the radius of neighborhood around a point x. It’s called called the \(\epsilon\)-neighborhood of x. The parameter MinPts is the minimum number of neighbors within “eps” radius.
Any point x in the data set, with a neighbor count greater than or equal to MinPts, is marked as a core point. We say that x is border point, if the number of its neighbors is less than MinPts, but it belongs to the \(\epsilon\)-neighborhood of some core point z. Finally, if a point is neither a core nor a border point, then it is called a noise point or an outlier.
The figure below shows the different types of points (core, border and outlier points) using MinPts = 6. Here x is a core point because \(neighbours_\epsilon(x) = 6\), y is a border point because \(neighbours_\epsilon(y) < MinPts\), but it belongs to the \(\epsilon\)-neighborhood of the core point x. Finally, z is a noise point.
We start by defining 3 terms, required for understanding the DBSCAN algorithm:
A density-based cluster is defined as a group of density connected points. The algorithm of density-based clustering (DBSCAN) works as follow:
For each point x_{i}, compute the distance between x_{i} and the other points. Finds all neighbor points within distance eps of the starting point (x_{i}). Each point, with a neighbor count greater than or equal to MinPts, is marked as core point or visited.
For each core point, if it’s not already assigned to a cluster, create a new cluster. Find recursively all its density connected points and assign them to the same cluster as the core point.
Iterate through the remaining unvisited points in the data set.
Those points that do not belong to any cluster are treated as outliers or noise.
MinPts: The larger the data set, the larger the value of minPts should be chosen. minPts must be chosen at least 3.
\(\epsilon\): The value for \(\epsilon\) can then be chosen by using a k-distance graph, plotting the distance to the k = minPts nearest neighbor. Good values of \(\epsilon\) are where this plot shows a strong bend.
Here, we’ll use the R package fpc to compute DBSCAN. It’s also possible to use the package dbscan, which provides a faster re-implementation of DBSCAN algorithm compared to the fpc package.
We’ll also use the factoextra package for visualizing clusters.
First, install the packages as follow:
install.packages("fpc")
install.packages("dbscan")
install.packages("factoextra")
The R code below computes and visualizes DBSCAN using multishapes data set [factoextra R package]:
# Load the data
data("multishapes", package = "factoextra")
df <- multishapes[, 1:2]
# Compute DBSCAN using fpc package
library("fpc")
set.seed(123)
db <- fpc::dbscan(df, eps = 0.15, MinPts = 5)
# Plot DBSCAN results
library("factoextra")
fviz_cluster(db, data = df, stand = FALSE,
ellipse = FALSE, show.clust.cent = FALSE,
geom = "point",palette = "jco", ggtheme = theme_classic())
Note that, the function fviz_cluster() uses different point symbols for core points (i.e, seed points) and border points. Black points correspond to outliers. You can play with eps and MinPts for changing cluster configurations.
It can be seen that DBSCAN performs better for these data sets and can identify the correct set of clusters compared to k-means algorithms.
The result of the fpc::dbscan() function can be displayed as follow:
print(db)
## dbscan Pts=1100 MinPts=5 eps=0.15
## 0 1 2 3 4 5
## border 31 24 1 5 7 1
## seed 0 386 404 99 92 50
## total 31 410 405 104 99 51
In the table above, column names are cluster number. Cluster 0 corresponds to outliers (black points in the DBSCAN plot). The function print.dbscan() shows a statistic of the number of points belonging to the clusters that are seeds and border points.
# Cluster membership. Noise/outlier observations are coded as 0
# A random subset is shown
db$cluster[sample(1:1089, 20)]
## [1] 1 3 2 4 3 1 2 4 2 2 2 2 2 2 1 4 1 1 1 0
DBSCAN algorithm requires users to specify the optimal eps values and the parameter MinPts. In the R code above, we used eps = 0.15 and MinPts = 5. One limitation of DBSCAN is that it is sensitive to the choice of \(\epsilon\), in particular if clusters have different densities. If \(\epsilon\) is too small, sparser clusters will be defined as noise. If \(\epsilon\) is too large, denser clusters may be merged together. This implies that, if there are clusters with different local densities, then a single \(\epsilon\) value may not suffice.
A natural question is:
How to define the optimal value of ?
The method proposed here consists of computing the k-nearest neighbor distances in a matrix of points.
The idea is to calculate, the average of the distances of every point to its k nearest neighbors. The value of k will be specified by the user and corresponds to MinPts.
Next, these k-distances are plotted in an ascending order. The aim is to determine the “knee”, which corresponds to the optimal eps parameter.
A knee corresponds to a threshold where a sharp change occurs along the k-distance curve.
The function kNNdistplot() [in dbscan package] can be used to draw the k-distance plot:
dbscan::kNNdistplot(df, k = 5)
abline(h = 0.15, lty = 2)
It can be seen that the optimal eps value is around a distance of 0.15.
The function predict.dbscan(object, data, newdata) [in fpc package] can be used to predict the clusters for the points in newdata. For more details, read the documentation (?predict.dbscan).
The traditional clustering methods, such as hierarchical clustering (Chapter @ref(agglomerative-clustering)) and k-means clustering (Chapter @ref(kmeans-clustering)), are heuristic and are not based on formal models. Furthermore, k-means algorithm is commonly randomnly initialized, so different runs of k-means will often yield different results. Additionally, k-means requires the user to specify the the optimal number of clusters.
An alternative is model-based clustering, which consider the data as coming from a distribution that is mixture of two or more clusters (Fraley and Raftery 2002, Fraley et al. (2012)). Unlike k-means, the model-based clustering uses a soft assignment, where each data point has a probability of belonging to each cluster.
In this chapter, we illustrate model-based clustering using the R package mclust.
Contents:
Related Book:
In model-based clustering, the data is considered as coming from a mixture of density.
Each component (i.e. cluster) k is modeled by the normal or Gaussian distribution which is characterized by the parameters:
For example, consider the “old faithful geyser data” [in MASS R package], which can be illustrated as follow using the ggpubr R package:
# Load the data
library("MASS")
data("geyser")
# Scatter plot
library("ggpubr")
ggscatter(geyser, x = "duration", y = "waiting")+
geom_density2d() # Add 2D density
The plot above suggests at least 3 clusters in the mixture. The shape of each of the 3 clusters appears to be approximately elliptical suggesting three bivariate normal distributions. As the 3 ellipses seems to be similar in terms of volume, shape and orientation, we might anticipate that the three components of this mixture might have homogeneous covariance matrices.
The model parameters can be estimated using the Expectation-Maximization (EM) algorithm initialized by hierarchical model-based clustering. Each cluster k is centered at the means \(\mu_k\), with increased density for points near the mean.
Geometric features (shape, volume, orientation) of each cluster are determined by the covariance matrix \(\sum_k\).
Different possible parameterizations of \(\sum_k\) are available in the R package mclust (see ?mclustModelNames).
The available model options, in mclust package, are represented by identifiers including: EII, VII, EEI, VEI, EVI, VVI, EEE, EEV, VEV and VVV.
The first identifier refers to volume, the second to shape and the third to orientation. E stands for “equal”, V for “variable” and I for “coordinate axes”.
For example:
The Mclust package uses maximum likelihood to fit all these models, with different covariance matrix parameterizations, for a range of k components.
The best model is selected using the Bayesian Information Criterion or BIC. A large BIC score indicates strong evidence for the corresponding model.
We start by installing the mclust package as follow: install.packages(“mclust”)
Note that, model-based clustering can be applied on univariate or multivariate data.
Here, we illustrate model-based clustering on the diabetes data set [mclust package] giving three measurements and the diagnosis for 145 subjects described as follow:
library("mclust")
data("diabetes")
head(diabetes, 3)
## class glucose insulin sspg
## 1 Normal 80 356 124
## 2 Normal 97 289 117
## 3 Normal 105 319 143
Model-based clustering can be computed using the function Mclust() as follow:
library(mclust)
df <- scale(diabetes[, -1]) # Standardize the data
mc <- Mclust(df) # Model-based-clustering
summary(mc) # Print a summary
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model with 3 components:
##
## log.likelihood n df BIC ICL
## -169 145 29 -483 -501
##
## Clustering table:
## 1 2 3
## 81 36 28
For this data, it can be seen that model-based clustering selected a model with three components (i.e. clusters). The optimal selected model name is VVV model. That is the three components are ellipsoidal with varying volume, shape, and orientation. The summary contains also the clustering table specifying the number of observations in each clusters.
You can access to the results as follow:
mc$modelName # Optimal selected model ==> "VVV"
mc$G # Optimal number of cluster => 3
head(mc$z, 30) # Probality to belong to a given cluster
head(mc$classification, 30) # Cluster assignement of each observation
Model-based clustering results can be drawn using the base function plot.Mclust() [in mclust package]. Here we’ll use the function fviz_mclust() [in factoextra package] to create beautiful plots based on ggplot2.
In the situation, where the data contain more than two variables, fviz_mclust() uses a principal component analysis to reduce the dimensionnality of the data. The first two principal components are used to produce a scatter plot of the data. However, if you want to plot the data using only two variables of interest, let say here c(“insulin”, “sspg”), you can specify that in the fviz_mclust() function using the argument choose.vars = c(“insulin”, “sspg”).
library(factoextra)
# BIC values used for choosing the number of clusters
fviz_mclust(mc, "BIC", palette = "jco")
# Classification: plot showing the clustering
fviz_mclust(mc, "classification", geom = "point",
pointsize = 1.5, palette = "jco")
# Classification uncertainty
fviz_mclust(mc, "uncertainty", palette = "jco")
Note that, in the uncertainty plot, larger symbols indicate the more uncertain observations.
Fraley, Chris, and Adrian E Raftery. 2002. “Model-Based Clustering, Discriminant Analysis, and Density Estimation.” Journal of the American Statistical Association 97 (458): 611–31.
Fraley, Chris, Adrian E. Raftery, T. Brendan Murphy, and Luca Scrucca. 2012. “Mclust Version 4 for R: Normal Mixture Modeling for Model-Based Clustering, Classification, and Density Estimation.” Technical Report No. 597, Department of Statistics, University of Washington. https://www.stat.washington.edu/research/reports/2012/tr597.pdf.
This article describes how to compute the fuzzy clustering using the function cmeans() [in e1071 R package]. Previously, we explained what is fuzzy clustering and how to compute the fuzzy clustering using the R function fanny()[in cluster package].
Related articles:
The simplified format of the function cmeans() is as follow:
cmeans(x, centers, iter.max = 100, dist = "euclidean", m = 2)
The function cmeans() returns an object of class fclust which is a list containing the following components:
set.seed(123)
# Load the data
data("USArrests")
# Subset of USArrests
ss <- sample(1:50, 20)
df <- scale(USArrests[ss,])
# Compute fuzzy clustering
library(e1071)
cm <- cmeans(df, 4)
cm
## Fuzzy c-means clustering with 4 clusters
##
## Cluster centers:
## Murder Assault UrbanPop Rape
## 1 0.857 0.338 -0.729 0.200
## 2 -0.731 -0.665 1.003 -0.333
## 3 -1.210 -1.248 -0.728 -1.153
## 4 0.629 0.970 0.501 0.865
##
## Memberships:
## 1 2 3 4
## Iowa 0.00916 0.0191 0.9658 0.00594
## Rhode Island 0.09885 0.5915 0.2050 0.10463
## Maryland 0.22786 0.0475 0.0273 0.69731
## Tennessee 0.87231 0.0286 0.0211 0.07801
## Utah 0.04446 0.8218 0.0844 0.04929
## Arizona 0.11876 0.1008 0.0399 0.74056
## Mississippi 0.62441 0.0931 0.1030 0.17952
## Wisconsin 0.03363 0.1110 0.8313 0.02403
## Virginia 0.39552 0.2570 0.1918 0.15573
## Maine 0.03433 0.0530 0.8915 0.02117
## Texas 0.24082 0.1595 0.0541 0.54557
## Louisiana 0.61799 0.0653 0.0419 0.27473
## Montana 0.13551 0.1366 0.6657 0.06215
## Michigan 0.09620 0.0371 0.0178 0.84890
## Arkansas 0.56529 0.1223 0.1805 0.13188
## New York 0.13194 0.1323 0.0416 0.69421
## Florida 0.17377 0.0749 0.0398 0.71155
## Alaska 0.38155 0.1354 0.1136 0.36947
## Hawaii 0.06662 0.7206 0.1487 0.06410
## New Jersey 0.05957 0.8009 0.0575 0.08206
##
## Closest hard clustering:
## Iowa Rhode Island Maryland Tennessee Utah
## 3 2 4 1 2
## Arizona Mississippi Wisconsin Virginia Maine
## 4 1 3 1 3
## Texas Louisiana Montana Michigan Arkansas
## 4 1 3 4 1
## New York Florida Alaska Hawaii New Jersey
## 4 4 1 2 2
##
## Available components:
## [1] "centers" "size" "cluster" "membership" "iter"
## [6] "withinerror" "call"
The different components can be extracted using the code below:
# Membership coefficient
head(cm$membership)
## 1 2 3 4
## Iowa 0.00916 0.0191 0.9658 0.00594
## Rhode Island 0.09885 0.5915 0.2050 0.10463
## Maryland 0.22786 0.0475 0.0273 0.69731
## Tennessee 0.87231 0.0286 0.0211 0.07801
## Utah 0.04446 0.8218 0.0844 0.04929
## Arizona 0.11876 0.1008 0.0399 0.74056
# Visualize using corrplot
library(corrplot)
corrplot(cm$membership, is.corr = FALSE)
# Observation groups/clusters
cm$cluster
## Iowa Rhode Island Maryland Tennessee Utah
## 3 2 4 1 2
## Arizona Mississippi Wisconsin Virginia Maine
## 4 1 3 1 3
## Texas Louisiana Montana Michigan Arkansas
## 4 1 3 4 1
## New York Florida Alaska Hawaii New Jersey
## 4 4 1 2 2
library(factoextra)
fviz_cluster(list(data = df, cluster=cm$cluster),
ellipse.type = "norm",
ellipse.level = 0.68,
palette = "jco",
ggtheme = theme_minimal())
In our previous article, we described the basic concept of fuzzy clustering and we showed how to compute fuzzy clustering. In this current article, we’ll present the fuzzy c-means clustering algorithm, which is very similar to the k-means algorithm and the aim is to minimize the objective function defined as follow:
\[ \sum\limits_{j=1}^k \sum\limits_{x_i \in C_j} u_{ij}^m (x_i - \mu_j)^2 \]
Where,
It can be seen that, FCM differs from k-means by using the membership values \(u_{ij}\) and the fuzzifier \(m\).
The variable \(u_{ij}^m\) is defined as follow:
\[ u_{ij}^m = \frac{1}{\sum\limits_{l=1}^k \left( \frac{| x_i - c_j |}{| x_i - c_k |}\right)^{\frac{2}{m-1}}} \]
The degree of belonging, \(u_{ij}\), is linked inversely to the distance from x to the cluster center.
The parameter \(m\) is a real number greater than 1 (\(1.0 < m < \infty\)) and it defines the level of cluster fuzziness. Note that, a value of \(m\) close to 1 gives a cluster solution which becomes increasingly similar to the solution of hard clustering such as k-means; whereas a value of \(m\) close to infinite leads to complete fuzzyness.
Note that, a good choice is to use m = 2.0 (Hathaway and Bezdek 2001).
In fuzzy clustering the centroid of a cluster is he mean of all points, weighted by their degree of belonging to the cluster:
\[ C_j = \frac{\sum\limits_{x \in C_j} u_{ij}^m x}{\sum\limits_{x \in C_j} u_{ij}^m} \]
Where,
The algorithm of fuzzy clustering can be summarize as follow:
The algorithm minimizes intra-cluster variance as well, but has the same problems as k-means; the minimum is a local minimum, and the results depend on the initial choice of weights. Hence, different initializations may lead to different results.
Using a mixture of Gaussians along with the expectation-maximization algorithm is a more statistically formalized method which includes some of these ideas: partial membership in classes.
The fuzzy clustering is considered as soft clustering, in which each element has a probability of belonging to each cluster. In other words, each element has a set of membership coefficients corresponding to the degree of being in a given cluster.
This is different from k-means and k-medoid clustering, where each object is affected exactly to one cluster. K-means and k-medoids clustering are known as hard or non-fuzzy clustering.
In fuzzy clustering, points close to the center of a cluster, may be in the cluster to a higher degree than points in the edge of a cluster. The degree, to which an element belongs to a given cluster, is a numerical value varying from 0 to 1.
The fuzzy c-means (FCM) algorithm is one of the most widely used fuzzy clustering algorithms. The centroid of a cluster is calculated as the mean of all points, weighted by their degree of belonging to the cluster:
In this article, we’ll describe how to compute fuzzy clustering using the R software.
Related Book:
We’ll use the following R packages: 1) cluster for computing fuzzy clustering and 2) factoextra for visualizing clusters.
The function fanny() [cluster R package] can be used to compute fuzzy clustering. FANNY stands for fuzzy analysis clustering. A simplified format is:
fanny(x, k, metric = "euclidean", stand = FALSE)
The function fanny() returns an object including the following components:
For example, the R code below applies fuzzy clustering on the USArrests data set:
library(cluster)
df <- scale(USArrests) # Standardize the data
res.fanny <- fanny(df, 2) # Compute fuzzy clustering with k = 2
The different components can be extracted using the code below:
head(res.fanny$membership, 3) # Membership coefficients
## [,1] [,2]
## Alabama 0.664 0.336
## Alaska 0.610 0.390
## Arizona 0.686 0.314
res.fanny$coeff # Dunn's partition coefficient
## dunn_coeff normalized
## 0.555 0.109
head(res.fanny$clustering) # Observation groups
## Alabama Alaska Arizona Arkansas California Colorado
## 1 1 1 2 1 1
To visualize observation groups, use the function fviz_cluster() [factoextra package]:
library(factoextra)
fviz_cluster(res.fanny, ellipse.type = "norm", repel = TRUE,
palette = "jco", ggtheme = theme_minimal(),
legend = "right")
To evaluate the goodnesss of the clustering results, plot the silhouette coefficient as follow:
fviz_silhouette(res.fanny, palette = "jco",
ggtheme = theme_minimal())
## cluster size ave.sil.width
## 1 1 22 0.32
## 2 2 28 0.44
Fuzzy clustering is an alternative to k-means clustering, where each data point has membership coefficient to each cluster. Here, we demonstrated how to compute and visualize fuzzy clustering using the combination of cluster and factoextra R packages.
K-means (Chapter @ref(kmeans-clustering)) represents one of the most popular clustering algorithm. However, it has some limitations: it requires the user to specify the number of clusters in advance and selects initial centroids randomly. The final k-means clustering solution is very sensitive to this initial random selection of cluster centers. The result might be (slightly) different each time you compute k-means.
In this chapter, we described an hybrid method, named hierarchical k-means clustering (hkmeans), for improving k-means results.
Related Book:
The algorithm is summarized as follow:
Note that, k-means algorithm will improve the initial partitioning generated at the step 2 of the algorithm. Hence, the initial partitioning can be slightly different from the final partitioning obtained in the step 4.
The R function hkmeans() [in factoextra], provides an easy solution to compute the hierarchical k-means clustering. The format of the result is similar to the one provided by the standard kmeans() function (see Chapter @ref(kmeans-clustering)).
To install factoextra, type this: install.packages(“factoextra”).
We’ll use the USArrest data set and we start by standardizing the data:
df <- scale(USArrests)
# Compute hierarchical k-means clustering
library(factoextra)
res.hk <-hkmeans(df, 4)
# Elements returned by hkmeans()
names(res.hk)
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault" "data" "hclust"
To print all the results, type this:
# Print the results
res.hk
# Visualize the tree
fviz_dend(res.hk, cex = 0.6, palette = "jco",
rect = TRUE, rect_border = "jco", rect_fill = TRUE)
# Visualize the hkmeans final clusters
fviz_cluster(res.hk, palette = "jco", repel = TRUE,
ggtheme = theme_classic())
We described hybrid hierarchical k-means clustering for improving k-means results.
Clusters can be found in a data set by chance due to clustering noise or sampling error. This article describes the R package pvclust (Suzuki and Shimodaira 2015) which uses bootstrap resampling techniques to compute p-value for each hierarchical clusters.
Contents:
Related Book:
Clusters with AU >= 95% are considered to be strongly supported by data.
install.packages("pvclust")
library(pvclust)
We’ll use lung data set [in pvclust package]. It contains the gene expression profile of 916 genes of 73 lung tissues including 67 tumors. Columns are samples and rows are genes.
library(pvclust)
# Load the data
data("lung")
head(lung[, 1:4])
## fetal_lung 232-97_SCC 232-97_node 68-96_Adeno
## IMAGE:196992 -0.40 4.28 3.68 -1.35
## IMAGE:587847 -2.22 5.21 4.75 -0.91
## IMAGE:1049185 -1.35 -0.84 -2.88 3.35
## IMAGE:135221 0.68 0.56 -0.45 -0.20
## IMAGE:298560 NA 4.14 3.58 -0.40
## IMAGE:119882 -3.23 -2.84 -2.72 -0.83
# Dimension of the data
dim(lung)
## [1] 916 73
We’ll use only a subset of the data set for the clustering analysis. The R function sample() can be used to extract a random subset of 30 samples:
set.seed(123)
ss <- sample(1:73, 30) # extract 20 samples out of
df <- lung[, ss]
The function pvclust() can be used as follow:
pvclust(data, method.hclust = "average",
method.dist = "correlation", nboot = 1000)
Note that, the computation time can be strongly decreased using parallel computation version called parPvclust(). (Read ?parPvclust() for more information.)
parPvclust(cl=NULL, data, method.hclust = "average",
method.dist = "correlation", nboot = 1000,
iseed = NULL)
The function pvclust() returns an object of class pvclust containing many elements including hclust which contains hierarchical clustering result for the original data generated by the function hclust().
pvclust() performs clustering on the columns of the data set, which correspond to samples in our case. If you want to perform the clustering on the variables (here, genes) you have to transpose the data set using the function t().
The R code below computes pvclust() using 10 as the number of bootstrap replications (for speed):
library(pvclust)
set.seed(123)
res.pv <- pvclust(df, method.dist="cor",
method.hclust="average", nboot = 10)
# Default plot
plot(res.pv, hang = -1, cex = 0.5)
pvrect(res.pv)
Values on the dendrogram are AU p-values (Red, left), BP values (green, right), and clusterlabels (grey, bottom). Clusters with AU > = 95% are indicated by the rectangles and are considered to be strongly supported by data.
To extract the objects from the significant clusters, use the function pvpick():
clusters <- pvpick(res.pv)
clusters
Parallel computation can be applied as follow:
# Create a parallel socket cluster
library(parallel)
cl <- makeCluster(2, type = "PSOCK")
# parallel version of pvclust
res.pv <- parPvclust(cl, df, nboot=1000)
stopCluster(cl)
Suzuki, Ryota, and Hidetoshi Shimodaira. 2015. Pvclust: Hierarchical Clustering with P-Values via Multiscale Bootstrap Resampling. https://CRAN.R-project.org/package=pvclust.