<?xml version="1.0" encoding="UTF-8" ?>
<!-- RSS generated by PHPBoost on Sat, 04 Apr 2026 08:36:09 +0200 -->

<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
	<channel>
		<title><![CDATA[Last articles - STHDA : Advanced Clustering]]></title>
		<atom:link href="https://www.sthda.com/english/syndication/rss/articles/30" rel="self" type="application/rss+xml"/>
		<link>https://www.sthda.com</link>
		<description><![CDATA[Last articles - STHDA : Advanced Clustering]]></description>
		<copyright>(C) 2005-2026 PHPBoost</copyright>
		<language>en</language>
		<generator>PHPBoost</generator>
		
		
		<item>
			<title><![CDATA[DBSCAN: Density-Based Clustering Essentials]]></title>
			<link>https://www.sthda.com/english/articles/30-advanced-clustering/105-dbscan-density-based-clustering-essentials/</link>
			<guid>https://www.sthda.com/english/articles/30-advanced-clustering/105-dbscan-density-based-clustering-essentials/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">
<p><strong>DBSCAN</strong> (<strong>Density-Based Spatial Clustering of Applications with Noise</strong>) is a <strong>density-based clustering</strong> algorithm, introduced in Ester et al. 1996, which can be used to identify clusters of any shape in a data set containing noise and outliers.</p>
<p>The basic idea behind the density-based clustering approach is derived from a human intuitive clustering method. For instance, by looking at the figure below, one can easily identify four clusters along with several points of noise, because of the differences in the density of points.</p>
<p>Clusters are dense regions in the data space, separated by regions of lower density of points. The DBSCAN algorithm is based on this intuitive notion of “clusters” and “noise”. The key idea is that for each point of a cluster, the neighborhood of a given radius has to contain at least a minimum number of points.</p>
<p><img src="https://www.sthda.com/english/sthda-upload/images/cluster-analysis/dbscan-idea.png" alt="DBSCAN idea" /> (From Ester et al. 1996)</p>
<div class="block">
<p>
In this chapter, we’ll describe the DBSCAN algorithm and demonstrate how to compute DBSCAN using the <em>fpc</em> R package.
</p>
</div>
<br/>
<p>Contents: </p>
<div id="TOC">
<ul>
<li><a href="#why-dbscan">Why DBSCAN?</a></li>
<li><a href="#algorithm">Algorithm</a></li>
<li><a href="#advantages">Advantages</a></li>
<li><a href="#parameter-estimation">Parameter estimation</a></li>
<li><a href="#computing-dbscan">Computing DBSCAN</a></li>
<li><a href="#method-for-determining-the-optimal-eps-value">Method for determining the optimal eps value</a></li>
<li><a href="#cluster-predictions-with-dbscan-algorithm">Cluster predictions with DBSCAN algorithm</a></li>
</ul>
</div>
<br/>
<p>Related Book:</p>
<div class = "small-block content-privileged-friends cluster-book">
    <center>
        <a href = "https://www.sthda.com/english/web/5-bookadvisor/17-practical-guide-to-cluster-analysis-in-r/">
          <img src = "https://www.sthda.com/english/sthda-upload/images/cluster-analysis/clustering-book-cover.png" /><br/>
      Practical Guide to Cluster Analysis in R
      </a>
      </center>
</div>
<div class="spacer"></div>
<div id="why-dbscan" class="section level2">
<h2>Why DBSCAN?</h2>
<p>Partitioning methods (K-means, PAM clustering) and hierarchical clustering are suitable for finding spherical-shaped or convex clusters. In other words, they work well only for compact and well-separated clusters. Moreover, they are severely affected by the presence of noise and outliers in the data.</p>
<p>Unfortunately, real-life data can contain: i) clusters of arbitrary shape, such as those shown in the figure below (oval, linear and “S”-shaped clusters); ii) many outliers and noise.</p>
<p>The figure below shows a data set containing nonconvex clusters and outliers/noise. The simulated data set <em>multishapes</em> [in the <em>factoextra</em> package] is used.</p>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/023-dbscan-density-based-clustering-data-dbscan-1.png" width="336" /></p>
<p>The plot above contains 5 clusters and outliers, including:</p>
<ul>
<li>2 oval clusters</li>
<li>2 linear clusters</li>
<li>1 compact cluster</li>
</ul>
<p>Given such data, the k-means algorithm has difficulty identifying these clusters with arbitrary shapes. To illustrate this situation, the following R code computes the k-means algorithm on the multishapes data set. The function <em>fviz_cluster</em>() [<em>factoextra</em> package] is used to visualize the clusters.</p>
<p>First, install factoextra: install.packages(“factoextra”); then compute and visualize k-means clustering using the data set multishapes:</p>
<pre class="r"><code>library(factoextra)
data("multishapes")
df <- multishapes[, 1:2]
set.seed(123)
km.res <- kmeans(df, 5, nstart = 25)
fviz_cluster(km.res, df,  geom = "point", 
             ellipse= FALSE, show.clust.cent = FALSE,
             palette = "jco", ggtheme = theme_classic())</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/023-dbscan-density-based-clustering-k-means-multishapes-1.png" width="336" /></p>
<div class="success">
<p>
We know there are 5 clusters in the data, but it can be seen that the k-means method identifies these 5 clusters inaccurately.
</p>
</div>
</div>
<div id="algorithm" class="section level2">
<h2>Algorithm</h2>
<p>The goal is to identify dense regions, which can be measured by the number of objects close to a given point.</p>
<p>Two important parameters are required for DBSCAN: <em>epsilon</em> (“eps”) and <em>minimum points</em> (“MinPts”). The parameter <em>eps</em> defines the radius of the neighborhood around a point x. It’s called the <span class="math inline">\(\epsilon\)</span>-neighborhood of x. The parameter <em>MinPts</em> is the minimum number of neighbors within the “eps” radius.</p>
<p>Any point x in the data set, with a neighbor count greater than or equal to <em>MinPts</em>, is marked as a <em>core point</em>. We say that x is a <em>border point</em> if the number of its neighbors is less than MinPts, but it belongs to the <span class="math inline">\(\epsilon\)</span>-neighborhood of some core point z. Finally, if a point is neither a core nor a border point, then it is called a noise point or an outlier.</p>
<p>The figure below shows the different types of points (core, border and outlier points) using MinPts = 6. Here x is a core point because <span class="math inline">\(neighbours_\epsilon(x) = 6\)</span>, y is a border point because <span class="math inline">\(neighbours_\epsilon(y) < MinPts\)</span>, but it belongs to the <span class="math inline">\(\epsilon\)</span>-neighborhood of the core point x. Finally, z is a noise point.</p>
<p><img src="https://www.sthda.com/english/sthda-upload/images/cluster-analysis/dbscan-principle.png" alt="DBSCAN principle" /></p>
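<p>To make these definitions concrete, here is a minimal base-R sketch (not part of the original article; the helper name <em>classify_points</em> is ours) that labels each point of a data matrix as core, border or noise for given <em>eps</em> and <em>MinPts</em>. Note that conventions differ between implementations on whether a point counts itself as a neighbor; here it is included:</p>
<pre class="r"><code>classify_points <- function(df, eps, MinPts) {
  d <- as.matrix(dist(df))
  # Neighbor counts within eps; each point counts itself here
  n_neighbors <- rowSums(d <= eps)
  is_core <- n_neighbors >= MinPts
  # Border points: not core, but inside the eps-neighborhood of a core point
  is_border <- !is_core &
    sapply(seq_len(nrow(d)), function(i) any(is_core & d[i, ] <= eps))
  ifelse(is_core, "core", ifelse(is_border, "border", "noise"))
}</code></pre>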
<p>We start by defining 3 terms, required for understanding the DBSCAN algorithm:</p>
<ul>
<li><em>Direct density reachable</em>: A point “A” is directly density reachable from another point “B” if: i) “A” is in the <span class="math inline">\(\epsilon\)</span>-neighborhood of “B” and ii) “B” is a core point.</li>
<li><em>Density reachable</em>: A point “A” is density reachable from “B” if there is a chain of core points leading from “B” to “A”.</li>
<li><em>Density connected</em>: Two points “A” and “B” are density connected if there is a core point “C”, such that both “A” and “B” are density reachable from “C”.</li>
</ul>
<p>A density-based cluster is defined as a group of density connected points. The density-based clustering (DBSCAN) algorithm works as follows:</p>
<div class="block">
<ol style="list-style-type: decimal">
<li>
<p>
For each point <span class="math inline"><em>x</em><sub><em>i</em></sub></span>, compute the distance between <span class="math inline"><em>x</em><sub><em>i</em></sub></span> and the other points. Find all neighbor points within distance <em>eps</em> of the starting point (<span class="math inline"><em>x</em><sub><em>i</em></sub></span>). Each point with a neighbor count greater than or equal to <em>MinPts</em> is marked as a <em>core point</em> and flagged as <em>visited</em>.
</p>
</li>
<li>
<p>
For each <em>core point</em>, if it’s not already assigned to a cluster, create a new cluster. Find recursively all its density connected points and assign them to the same cluster as the core point.
</p>
</li>
<li>
<p>
Iterate through the remaining unvisited points in the data set.
</p>
</li>
</ol>
<p>
Those points that do not belong to any cluster are treated as outliers or noise.
</p>
</div>
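<p>The three steps above can be sketched in a few lines of base R (an illustrative O(n²) toy implementation, not the <em>fpc</em> code; use the <em>fpc</em> or <em>dbscan</em> packages in practice):</p>
<pre class="r"><code>naive_dbscan <- function(df, eps, MinPts) {
  d <- as.matrix(dist(df))
  n <- nrow(d)
  labels <- rep(0L, n)                      # 0 = noise / unassigned
  # Step 1: core points have at least MinPts neighbors (self included)
  is_core <- rowSums(d <= eps) >= MinPts
  cluster_id <- 0L
  for (i in seq_len(n)) {
    if (!is_core[i] || labels[i] != 0L) next
    cluster_id <- cluster_id + 1L           # Step 2: start a new cluster
    queue <- i
    while (length(queue) > 0) {
      p <- queue[1]; queue <- queue[-1]
      if (labels[p] == 0L) {
        labels[p] <- cluster_id
        # Only core points expand the cluster; border points do not
        if (is_core[p]) queue <- c(queue, which(d[p, ] <= eps & labels == 0L))
      }
    }
  }
  labels                                    # Step 3: remaining 0s are noise
}</code></pre>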
</div>
<div id="advantages" class="section level2">
<h2>Advantages</h2>
<ol style="list-style-type: decimal">
<li>Unlike K-means, DBSCAN does not require the user to specify the number of clusters to be generated</li>
<li>DBSCAN can find any shape of clusters. The cluster doesn’t have to be circular.</li>
<li>DBSCAN can identify outliers</li>
</ol>
</div>
<div id="parameter-estimation" class="section level2">
<h2>Parameter estimation</h2>
<ul>
<li><p>MinPts: The larger the data set, the larger the value of MinPts should be. MinPts must be at least 3.</p></li>
<li><p><span class="math inline">\(\epsilon\)</span>: The value for <span class="math inline">\(\epsilon\)</span> can then be chosen by using a k-distance graph, plotting the distance to the k = MinPts nearest neighbor, sorted in ascending order. Good values of <span class="math inline">\(\epsilon\)</span> are where this plot shows a strong bend.</p></li>
</ul>
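<p>The k-distance used by this heuristic can be computed by hand (base-R sketch; the helper name <em>k_distance</em> is ours — the <em>dbscan</em> package provides <em>kNNdist</em>() and <em>kNNdistplot</em>() for the same purpose):</p>
<pre class="r"><code>k_distance <- function(df, k) {
  d <- as.matrix(dist(df))
  # Each sorted row starts with the zero self-distance,
  # so the k-th nearest neighbor sits at position k + 1
  apply(d, 1, function(row) sort(row)[k + 1])
}
# plot(sort(k_distance(df, 5)), type = "l") gives the k-distance curve;
# eps is read off at the knee</code></pre>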
</div>
<div id="computing-dbscan" class="section level2">
<h2>Computing DBSCAN</h2>
<p>Here, we’ll use the R package <em>fpc</em> to compute DBSCAN. It’s also possible to use the package <em>dbscan</em>, which provides a faster re-implementation of the DBSCAN algorithm compared to the fpc package.</p>
<p>We’ll also use the <em>factoextra</em> package for visualizing clusters.</p>
<p>First, install the packages as follows:</p>
<pre class="r"><code>install.packages("fpc")
install.packages("dbscan")
install.packages("factoextra")</code></pre>
<p>The R code below computes and visualizes DBSCAN using multishapes data set [factoextra R package]:</p>
<pre class="r"><code># Load the data 
data("multishapes", package = "factoextra")
df <- multishapes[, 1:2]
# Compute DBSCAN using fpc package
library("fpc")
set.seed(123)
db <- fpc::dbscan(df, eps = 0.15, MinPts = 5)
# Plot DBSCAN results
library("factoextra")
fviz_cluster(db, data = df, stand = FALSE,
             ellipse = FALSE, show.clust.cent = FALSE,
             geom = "point", palette = "jco", ggtheme = theme_classic())</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/023-dbscan-density-based-clustering-density-based-clustering-1.png" width="336" /></p>
<div class="notice">
<p>
Note that the function <em>fviz_cluster</em>() uses different point symbols for core points (i.e., seed points) and border points. Black points correspond to outliers. You can play with <em>eps</em> and <em>MinPts</em> to change the cluster configurations.
</p>
</div>
<div class="success">
<p>
It can be seen that DBSCAN performs better on these data and, unlike the k-means algorithm, identifies the correct set of clusters.
</p>
</div>
<p>The result of the <em>fpc::dbscan</em>() function can be displayed as follows:</p>
<pre class="r"><code>print(db)</code></pre>
<pre><code>## dbscan Pts=1100 MinPts=5 eps=0.15
##         0   1   2   3  4  5
## border 31  24   1   5  7  1
## seed    0 386 404  99 92 50
## total  31 410 405 104 99 51</code></pre>
<p>In the table above, the column names are cluster numbers. Cluster 0 corresponds to outliers (black points in the DBSCAN plot). The function <em>print.dbscan</em>() shows, for each cluster, the number of points that are seed points and border points.</p>
<pre class="r"><code># Cluster membership. Noise/outlier observations are coded as 0
# A random subset is shown
db$cluster[sample(1:1089, 20)]</code></pre>
<pre><code>##  [1] 1 3 2 4 3 1 2 4 2 2 2 2 2 2 1 4 1 1 1 0</code></pre>
<p>The DBSCAN algorithm requires users to specify the optimal <em>eps</em> value and the parameter <em>MinPts</em>. In the R code above, we used <em>eps = 0.15</em> and <em>MinPts = 5</em>. One limitation of DBSCAN is that it is sensitive to the choice of <span class="math inline">\(\epsilon\)</span>, in particular if clusters have different densities. If <span class="math inline">\(\epsilon\)</span> is too small, sparser clusters will be defined as noise. If <span class="math inline">\(\epsilon\)</span> is too large, denser clusters may be merged together. This implies that, if there are clusters with different local densities, then a single <span class="math inline">\(\epsilon\)</span> value may not suffice.</p>
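<p>This sensitivity can be checked empirically by refitting DBSCAN over a small grid of <em>eps</em> values (illustrative sketch; the grid values here are arbitrary and the resulting cluster counts depend on the data):</p>
<pre class="r"><code>library("fpc")
for (eps in c(0.05, 0.15, 0.5)) {
  db <- fpc::dbscan(df, eps = eps, MinPts = 5)
  n_clusters <- length(setdiff(unique(db$cluster), 0))  # cluster 0 = noise
  cat("eps =", eps, "->", n_clusters, "clusters\n")
}</code></pre>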
<p>A natural question is:</p>
<div class="block">
<p>
How to define the optimal value of <span class="math inline">\(\epsilon\)</span>?
</p>
</div>
</div>
<div id="method-for-determining-the-optimal-eps-value" class="section level2">
<h2>Method for determining the optimal eps value</h2>
<p>The method proposed here consists of computing the k-nearest neighbor distances for the points of a data matrix.</p>
<p>The idea is to calculate the average of the distances from every point to its k nearest neighbors. The value of k is specified by the user and corresponds to <em>MinPts</em>.</p>
<p>Next, these k-distances are plotted in ascending order. The aim is to determine the “knee”, which corresponds to the optimal <em>eps</em> parameter.</p>
<p>A knee corresponds to a threshold where a sharp change occurs along the k-distance curve.</p>
<p>The function <em>kNNdistplot</em>() [in <em>dbscan</em> package] can be used to draw the k-distance plot:</p>
<pre class="r"><code>dbscan::kNNdistplot(df, k =  5)
abline(h = 0.15, lty = 2)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/023-dbscan-density-based-clustering-k-nearest-neighbor-distance-1.png" width="384" /></p>
<div class="success">
<p>
It can be seen that the optimal <em>eps</em> value is around a distance of 0.15.
</p>
</div>
</div>
<div id="cluster-predictions-with-dbscan-algorithm" class="section level2">
<h2>Cluster predictions with DBSCAN algorithm</h2>
<p>The function <em>predict.dbscan(object, data, newdata)</em> [in <em>fpc</em> package] can be used to predict the clusters for the points in <em>newdata</em>. For more details, read the documentation (<em>?predict.dbscan</em>).</p>
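<p>A hedged usage sketch, following the argument list in <em>?predict.dbscan</em> (the new points below are hypothetical; <em>data</em> must be the matrix used for the original fit):</p>
<pre class="r"><code>library("fpc")
# db is the fpc::dbscan() result computed above, df the original data
newdata <- cbind(x = c(0, -1), y = c(0.5, 1))  # hypothetical new points
predict(db, data = df, newdata = newdata)      # predicted cluster labels</code></pre>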
</div>
</div><!--end rdoc-->
 
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
  (function () {
    var script = document.createElement("script");
    script.type = "text/javascript";
    script.src  = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
    document.getElementsByTagName("head")[0].appendChild(script);
  })();
</script>

<!-- END HTML -->]]></description>
			<pubDate>Thu, 07 Sep 2017 20:02:00 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Model Based Clustering Essentials]]></title>
			<link>https://www.sthda.com/english/articles/30-advanced-clustering/104-model-based-clustering-essentials/</link>
			<guid>https://www.sthda.com/english/articles/30-advanced-clustering/104-model-based-clustering-essentials/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">
<p>The traditional clustering methods, such as hierarchical clustering (Chapter @ref(agglomerative-clustering)) and k-means clustering (Chapter @ref(kmeans-clustering)), are heuristic and are not based on formal models. Furthermore, the k-means algorithm is commonly initialized randomly, so different runs of k-means will often yield different results. Additionally, k-means requires the user to specify the optimal number of clusters.</p>
<p>An alternative is <strong>model-based clustering</strong>, which considers the data as coming from a distribution that is a mixture of two or more clusters <span class="citation">(Fraley and Raftery 2002, <span class="citation">Fraley et al. (2012)</span>)</span>. Unlike k-means, model-based clustering uses a soft assignment, where each data point has a probability of belonging to each cluster.</p>
<div class="block">
<p>
In this chapter, we illustrate model-based clustering using the R package <em>mclust</em>.
</p>
</div>
<br/>
<p>Contents:</p>
<div id="TOC">
<ul>
<li><a href="#concept-of-model-based-clustering">Concept of model-based clustering</a></li>
<li><a href="#estimating-model-parameters">Estimating model parameters</a></li>
<li><a href="#choosing-the-best-model">Choosing the best model</a></li>
<li><a href="#computing-model-based-clustering-in-r">Computing model-based clustering in R</a></li>
<li><a href="#visualizing-model-based-clustering">Visualizing model-based clustering</a></li>
<li><a href="#references">References</a></li>
</ul>
</div>
<br/>
<p>Related Book:</p>
<div class = "small-block content-privileged-friends cluster-book">
    <center>
        <a href = "https://www.sthda.com/english/web/5-bookadvisor/17-practical-guide-to-cluster-analysis-in-r/">
          <img src = "https://www.sthda.com/english/sthda-upload/images/cluster-analysis/clustering-book-cover.png" /><br/>
      Practical Guide to Cluster Analysis in R
      </a>
      </center>
</div>
<div class="spacer"></div>
<div id="concept-of-model-based-clustering" class="section level2">
<h2>Concept of model-based clustering</h2>
<p>In model-based clustering, the data are considered as coming from a mixture of distributions.</p>
<p>Each component (i.e. cluster) k is modeled by the normal or Gaussian distribution, which is characterized by the following parameters:</p>
<ul>
<li><span class="math inline">\(\mu_k\)</span>: mean vector,</li>
<li><span class="math inline">\(\Sigma_k\)</span>: covariance matrix,</li>
<li>An associated probability in the mixture. Each point has a probability of belonging to each cluster.</li>
</ul>
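<p>Putting these ingredients together (a standard formulation, added here for clarity; <span class="math inline">\(\pi_k\)</span> denotes the mixing proportions), the density of a mixture with K components is:</p>
<p><span class="math display">\[f(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x;\, \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} \pi_k = 1\]</span></p>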
<p>For example, consider the “old faithful geyser data” [in the MASS R package], which can be illustrated as follows using the ggpubr R package:</p>
<pre class="r"><code># Load the data
library("MASS")
data("geyser")
# Scatter plot
library("ggpubr")
ggscatter(geyser, x = "duration", y = "waiting")+
  geom_density2d() # Add 2D density</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/022-model-based-clustering-scatter-plot-1.png" width="336" /></p>
<p>The plot above suggests at least 3 clusters in the mixture. The shape of each of the 3 clusters appears to be approximately elliptical, suggesting three bivariate normal distributions. As the 3 ellipses seem to be similar in terms of volume, shape and orientation, we might anticipate that the three components of this mixture have homogeneous covariance matrices.</p>
</div>
<div id="estimating-model-parameters" class="section level2">
<h2>Estimating model parameters</h2>
<p>The model parameters can be estimated using the <em>Expectation-Maximization</em> (EM) algorithm initialized by hierarchical model-based clustering. Each cluster k is centered at the means <span class="math inline">\(\mu_k\)</span>, with increased density for points near the mean.</p>
<p>Geometric features (shape, volume, orientation) of each cluster are determined by the covariance matrix <span class="math inline">\(\Sigma_k\)</span>.</p>
<p>Different possible parameterizations of <span class="math inline">\(\Sigma_k\)</span> are available in the R package <em>mclust</em> (see <em>?mclustModelNames</em>).</p>
<p>The available model options, in <em>mclust</em> package, are represented by identifiers including: EII, VII, EEI, VEI, EVI, VVI, EEE, EEV, VEV and VVV.</p>
<p>The first identifier refers to volume, the second to shape and the third to orientation. E stands for “equal”, V for “variable” and I for “coordinate axes”.</p>
<p>For example:</p>
<ul>
<li>EVI denotes a model in which the volumes of all clusters are equal (E), the shapes of the clusters may vary (V), and the orientation is the identity (I), i.e. aligned with the coordinate axes.</li>
<li>EEE means that the clusters have the same volume, shape and orientation in p-dimensional space.</li>
<li>VEI means that the clusters have variable volume, the same shape and orientation equal to coordinate axes.</li>
</ul>
</div>
<div id="choosing-the-best-model" class="section level2">
<h2>Choosing the best model</h2>
<p>The <em>mclust</em> package uses maximum likelihood to fit all these models, with different covariance matrix parameterizations, for a range of k components.</p>
<p>The best model is selected using the Bayesian Information Criterion or <em>BIC</em>. A large BIC score indicates strong evidence for the corresponding model.</p>
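<p>For instance (sketch; <em>df</em> stands for a standardized data matrix such as the one prepared in the next section), the BIC values themselves can be computed and inspected with <em>mclustBIC</em>():</p>
<pre class="r"><code>library("mclust")
bic <- mclustBIC(df)   # BIC for all model identifiers over a range of components
summary(bic)           # top models ranked by BIC
plot(bic)              # BIC curves, one per model identifier</code></pre>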
</div>
<div id="computing-model-based-clustering-in-r" class="section level2">
<h2>Computing model-based clustering in R</h2>
<p>We start by installing the <em>mclust</em> package as follows: <em>install.packages(“mclust”)</em></p>
<div class="notice">
<p>
Note that model-based clustering can be applied to univariate or multivariate data.
</p>
</div>
<p>Here, we illustrate model-based clustering on the diabetes data set [mclust package], which gives three measurements and the diagnosis for 145 subjects, described as follows:</p>
<pre class="r"><code>library("mclust")
data("diabetes")
head(diabetes, 3)</code></pre>
<pre><code>##    class glucose insulin sspg
## 1 Normal      80     356  124
## 2 Normal      97     289  117
## 3 Normal     105     319  143</code></pre>
<ul>
<li>class: the diagnosis: normal, chemically diabetic, and overtly diabetic. Excluded from the cluster analysis.</li>
<li>glucose: plasma glucose response to oral glucose</li>
<li>insulin: plasma insulin response to oral glucose</li>
<li>sspg: steady-state plasma glucose (measures insulin resistance)</li>
</ul>
<p>Model-based clustering can be computed using the function Mclust() as follows:</p>
<pre class="r"><code>library(mclust)
df <- scale(diabetes[, -1]) # Standardize the data
mc <- Mclust(df)            # Model-based-clustering
summary(mc)                 # Print a summary</code></pre>
<pre><code>## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm 
## ----------------------------------------------------
## 
## Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model with 3 components:
## 
##  log.likelihood   n df  BIC  ICL
##            -169 145 29 -483 -501
## 
## Clustering table:
##  1  2  3 
## 81 36 28</code></pre>
<p>For these data, it can be seen that model-based clustering selected a model with three components (i.e. clusters). The optimal selected model is the VVV model; that is, the three components are ellipsoidal with varying volume, shape, and orientation. The summary also contains the clustering table, specifying the number of observations in each cluster.</p>
<p>You can access the results as follows:</p>
<pre class="r"><code>mc$modelName                # Optimal selected model => "VVV"
mc$G                        # Optimal number of clusters => 3
head(mc$z, 30)              # Probability of belonging to a given cluster
head(mc$classification, 30) # Cluster assignment of each observation</code></pre>
</div>
<div id="visualizing-model-based-clustering" class="section level2">
<h2>Visualizing model-based clustering</h2>
<p>Model-based clustering results can be drawn using the base function plot.Mclust() [in mclust package]. Here, we’ll use the function <em>fviz_mclust</em>() [in <em>factoextra</em> package] to create beautiful ggplot2-based plots.</p>
<p>In the situation where the data contain more than two variables, <em>fviz_mclust</em>() uses principal component analysis to reduce the dimensionality of the data. The first two principal components are used to produce a scatter plot of the data. However, if you want to plot the data using only two variables of interest, say c(“insulin”, “sspg”), you can specify them in the <em>fviz_mclust</em>() function using the argument <em>choose.vars = c(“insulin”, “sspg”)</em>.</p>
<pre class="r"><code>library(factoextra)
# BIC values used for choosing the number of clusters
fviz_mclust(mc, "BIC", palette = "jco")
# Classification: plot showing the clustering
fviz_mclust(mc, "classification", geom = "point", 
            pointsize = 1.5, palette = "jco")
# Classification uncertainty
fviz_mclust(mc, "uncertainty", palette = "jco")</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/022-model-based-clustering-model-base-clustering-1.png" width="307.2" /><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/022-model-based-clustering-model-base-clustering-2.png" width="307.2" /><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/022-model-based-clustering-model-base-clustering-3.png" width="307.2" /></p>
<p>Note that, in the uncertainty plot, larger symbols indicate the more uncertain observations.</p>
</div>
<div id="references" class="section level2 unnumbered">
<h2>References</h2>
<div id="refs" class="references">
<div id="ref-fraley2002">
<p>Fraley, Chris, and Adrian E Raftery. 2002. “Model-Based Clustering, Discriminant Analysis, and Density Estimation.” <em>Journal of the American Statistical Association</em> 97 (458): 611–31.</p>
</div>
<div id="ref-fraley2012">
<p>Fraley, Chris, Adrian E. Raftery, T. Brendan Murphy, and Luca Scrucca. 2012. “Mclust Version 4 for R: Normal Mixture Modeling for Model-Based Clustering, Classification, and Density Estimation.” <em>Technical Report No. 597, Department of Statistics, University of Washington</em>. <a href="https://www.stat.washington.edu/research/reports/2012/tr597.pdf" class="uri">https://www.stat.washington.edu/research/reports/2012/tr597.pdf</a>.</p>
</div>
</div>
</div>
</div><!--end rdoc-->
 
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
  (function () {
    var script = document.createElement("script");
    script.type = "text/javascript";
    script.src  = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
    document.getElementsByTagName("head")[0].appendChild(script);
  })();
</script>

<!-- END HTML -->]]></description>
			<pubDate>Thu, 07 Sep 2017 19:46:00 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[cmeans() R function: Compute Fuzzy clustering]]></title>
			<link>https://www.sthda.com/english/articles/30-advanced-clustering/103-cmeans-r-function-compute-fuzzy-clustering/</link>
			<guid>https://www.sthda.com/english/articles/30-advanced-clustering/103-cmeans-r-function-compute-fuzzy-clustering/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">
<p>This article describes how to compute the <strong>fuzzy clustering</strong> using the function <strong>cmeans</strong>() [in <em>e1071</em> R package]. Previously, we explained <a href="https://www.sthda.com/english/articles/30-advanced-clustering/101-fuzzy-clustering-essentials/">what fuzzy clustering is</a> and how to compute it using the R function fanny() [in cluster package].</p>
<p>Related articles:</p>
<ul>
<li><a href="https://www.sthda.com/english/articles/30-advanced-clustering/101-fuzzy-clustering-essentials/">Fuzzy Clustering Essentials</a></li>
<li><a href="https://www.sthda.com/english/articles/30-advanced-clustering/102-fuzzy-c-means-clustering-algorithm/">Fuzzy C-Means Clustering Algorithm</a></li>
</ul>
<div id="cmeans-format" class="section level2">
<h2>cmeans() format</h2>
<p>The simplified format of the function <strong>cmeans</strong>() is as follows:</p>
<pre class="r"><code>cmeans(x, centers, iter.max = 100, dist = "euclidean", m = 2)</code></pre>
<div class="block">
<ul>
<li>
x: a data matrix where columns are variables and rows are observations
</li>
<li>
centers: Number of clusters or initial values for cluster centers
</li>
<li>
iter.max: Maximum number of iterations
</li>
<li>
dist: Possible values are “euclidean” or “manhattan”
</li>
<li>
m: A number greater than 1 giving the degree of fuzzification.
</li>
</ul>
</div>
<p>The function cmeans() returns an object of class fclust, which is a list containing the following components:</p>
<ul>
<li>centers: the final cluster centers</li>
<li>size: the number of data points in each cluster of the closest hard clustering</li>
<li>cluster: a vector of integers containing the indices of the clusters to which the data points are assigned for the closest hard clustering, as obtained by assigning points to the (first) class with maximal membership.</li>
<li>iter: the number of iterations performed</li>
<li>membership: a matrix with the membership values of the data points to the clusters</li>
<li>withinerror: the value of the objective function</li>
</ul>
</div>
<div id="compute-fuzzy-c-means-clustering" class="section level2">
<h2>Compute fuzzy c-means clustering</h2>
<pre class="r"><code>set.seed(123)
# Load the data
data("USArrests")
# Subset of USArrests
ss <- sample(1:50, 20)
df <- scale(USArrests[ss,])
# Compute fuzzy clustering
library(e1071)
cm <- cmeans(df, 4)
cm</code></pre>
<pre><code>## Fuzzy c-means clustering with 4 clusters
## 
## Cluster centers:
##   Murder Assault UrbanPop   Rape
## 1  0.857   0.338   -0.729  0.200
## 2 -0.731  -0.665    1.003 -0.333
## 3 -1.210  -1.248   -0.728 -1.153
## 4  0.629   0.970    0.501  0.865
## 
## Memberships:
##                    1      2      3       4
## Iowa         0.00916 0.0191 0.9658 0.00594
## Rhode Island 0.09885 0.5915 0.2050 0.10463
## Maryland     0.22786 0.0475 0.0273 0.69731
## Tennessee    0.87231 0.0286 0.0211 0.07801
## Utah         0.04446 0.8218 0.0844 0.04929
## Arizona      0.11876 0.1008 0.0399 0.74056
## Mississippi  0.62441 0.0931 0.1030 0.17952
## Wisconsin    0.03363 0.1110 0.8313 0.02403
## Virginia     0.39552 0.2570 0.1918 0.15573
## Maine        0.03433 0.0530 0.8915 0.02117
## Texas        0.24082 0.1595 0.0541 0.54557
## Louisiana    0.61799 0.0653 0.0419 0.27473
## Montana      0.13551 0.1366 0.6657 0.06215
## Michigan     0.09620 0.0371 0.0178 0.84890
## Arkansas     0.56529 0.1223 0.1805 0.13188
## New York     0.13194 0.1323 0.0416 0.69421
## Florida      0.17377 0.0749 0.0398 0.71155
## Alaska       0.38155 0.1354 0.1136 0.36947
## Hawaii       0.06662 0.7206 0.1487 0.06410
## New Jersey   0.05957 0.8009 0.0575 0.08206
## 
## Closest hard clustering:
##         Iowa Rhode Island     Maryland    Tennessee         Utah 
##            3            2            4            1            2 
##      Arizona  Mississippi    Wisconsin     Virginia        Maine 
##            4            1            3            1            3 
##        Texas    Louisiana      Montana     Michigan     Arkansas 
##            4            1            3            4            1 
##     New York      Florida       Alaska       Hawaii   New Jersey 
##            4            4            1            2            2 
## 
## Available components:
## [1] "centers"     "size"        "cluster"     "membership"  "iter"       
## [6] "withinerror" "call"</code></pre>
<p>The different components can be extracted using the code below:</p>
<pre class="r"><code># Membership coefficient
head(cm$membership)</code></pre>
<pre><code>##                    1      2      3       4
## Iowa         0.00916 0.0191 0.9658 0.00594
## Rhode Island 0.09885 0.5915 0.2050 0.10463
## Maryland     0.22786 0.0475 0.0273 0.69731
## Tennessee    0.87231 0.0286 0.0211 0.07801
## Utah         0.04446 0.8218 0.0844 0.04929
## Arizona      0.11876 0.1008 0.0399 0.74056</code></pre>
<pre class="r"><code># Visualize using corrplot
library(corrplot)
corrplot(cm$membership, is.corr = FALSE)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/055-cmeans-c-means-clustering-1.png" width="384" /></p>
<pre class="r"><code># Observation groups/clusters
cm$cluster</code></pre>
<pre><code>##         Iowa Rhode Island     Maryland    Tennessee         Utah 
##            3            2            4            1            2 
##      Arizona  Mississippi    Wisconsin     Virginia        Maine 
##            4            1            3            1            3 
##        Texas    Louisiana      Montana     Michigan     Arkansas 
##            4            1            3            4            1 
##     New York      Florida       Alaska       Hawaii   New Jersey 
##            4            4            1            2            2</code></pre>
</div>
<div id="visualize-clusters" class="section level2">
<h2>Visualize clusters</h2>
<pre class="r"><code>library(factoextra)
fviz_cluster(list(data = df, cluster = cm$cluster), 
             ellipse.type = "norm",
             ellipse.level = 0.68,
             palette = "jco",
             ggtheme = theme_minimal())</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/055-cmeans-visualize-c-means-clusters-1.png" width="480" /></p>
</div>
</div>
</div><!--end rdoc-->

<!-- END HTML -->]]></description>
			<pubDate>Thu, 07 Sep 2017 19:26:00 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Fuzzy C-Means Clustering Algorithm]]></title>
			<link>https://www.sthda.com/english/articles/30-advanced-clustering/102-fuzzy-c-means-clustering-algorithm/</link>
			<guid>https://www.sthda.com/english/articles/30-advanced-clustering/102-fuzzy-c-means-clustering-algorithm/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">

<p>In our previous article, we described the basic concept of <a href="https://www.sthda.com/english/articles/30-advanced-clustering/101-fuzzy-clustering-essentials/"><strong>fuzzy clustering</strong></a> and showed how to compute it. In the current article, we present the <strong>fuzzy c-means clustering algorithm</strong>, which is very similar to the <a href="https://www.sthda.com/english/articles/27-partitioning-clustering-essentials/87-k-means-clustering-essentials/">k-means algorithm</a>. The aim is to minimize the objective function defined as follows:</p>
<p>
<span class="math"><span class="math display">\[
\sum\limits_{j=1}^k \sum\limits_{x_i \in C_j} u_{ij}^m (x_i - \mu_j)^2
\]</span></span>
</p>
<p>
Where,
</p>
<ul>
<li>
<span class="math"><span class="math inline">\(u_{ij}\)</span></span> is the degree to which an observation <span class="math"><span class="math inline">\(x_i\)</span></span> belongs to a cluster <span class="math"><span class="math inline">\(c_j\)</span></span>
</li>
<li>
<span class="math"><span class="math inline">\(\mu_j\)</span></span> is the center of the cluster j
</li>
<li>
<span class="math"><span class="math inline">\(m\)</span></span> is the fuzzifier.
</li>
</ul>
<p>
<span class="notice">FCM thus differs from k-means by its use of the membership values <span class="math"><span class="math inline">\(u_{ij}\)</span></span> and the fuzzifier <span class="math"><span class="math inline">\(m\)</span></span>.</span>
</p>
<p>
The membership degree <span class="math"><span class="math inline">\(u_{ij}\)</span></span> is defined as follows:
</p>
<p>
<span class="math"><span class="math display">\[
u_{ij} = \frac{1}{\sum\limits_{l=1}^k \left( \frac{| x_i - c_j |}{| x_i - c_l |}\right)^{\frac{2}{m-1}}}
\]</span></span>
</p>
<p>
The degree of belonging, <span class="math"><span class="math inline">\(u_{ij}\)</span></span>, is inversely related to the distance from <span class="math"><span class="math inline">\(x_i\)</span></span> to the center of cluster j: the closer an observation is to a cluster center, the higher its membership in that cluster.
</p>
<p>
The parameter <span class="math"><span class="math inline">\(m\)</span></span> is a real number greater than 1 (<span class="math"><span class="math inline">\(1.0 < m < \infty\)</span></span>) that defines the level of cluster fuzziness. A value of <span class="math"><span class="math inline">\(m\)</span></span> close to 1 gives a cluster solution increasingly similar to hard clustering, such as k-means, whereas a value of <span class="math"><span class="math inline">\(m\)</span></span> approaching infinity leads to complete fuzziness.
</p>
<p>
<span class="success">A commonly recommended choice is <strong>m = 2.0</strong> (Hathaway and Bezdek 2001).</span>
</p>
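<p>
To see the effect of the fuzzifier in practice, the sketch below contrasts a small and a large value of <span class="math"><span class="math inline">\(m\)</span></span> using the <em>cmeans</em>() function [<em>e1071</em> R package]. This is an illustrative sketch, assuming the <em>e1071</em> package is installed:
</p>
<pre class="r"><code># Effect of the fuzzifier m (illustrative sketch; requires the e1071 package)
library(e1071)
set.seed(123)
df <- scale(USArrests)

cm1 <- cmeans(df, centers = 2, m = 1.1)  # m close to 1: nearly hard clustering
cm3 <- cmeans(df, centers = 2, m = 3)    # large m: much fuzzier memberships

# Maximum membership per observation: close to 1 when m is near 1,
# drifting toward 1/k = 0.5 as m grows
summary(apply(cm1$membership, 1, max))
summary(apply(cm3$membership, 1, max))</code></pre>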
<p>
In <strong>fuzzy clustering</strong>, the centroid of a cluster is the mean of all points, weighted by their degree of belonging to the cluster:
</p>
<p>
<span class="math"><span class="math display">\[
C_j = \frac{\sum\limits_{x \in C_j} u_{ij}^m x}{\sum\limits_{x \in C_j} u_{ij}^m}
\]</span></span>
</p>
<p>
Where,
</p>
<ul>
<li>
<span class="math"><span class="math inline">\(C_j\)</span></span> is the centroid of the cluster j
</li>
<li>
<span class="math"><span class="math inline">\(u_{ij}\)</span></span> is the degree to which an observation <span class="math"><span class="math inline">\(x_i\)</span></span> belongs to a cluster <span class="math"><span class="math inline">\(c_j\)</span></span>
</li>
</ul>
<p>
The fuzzy clustering algorithm can be summarized as follows:
</p>
<ol style="list-style-type: decimal">
<li>
Specify the number of clusters k (chosen by the analyst).
</li>
<li>
Randomly assign to each point a set of coefficients for belonging to the clusters.
</li>
<li>
Repeat until the maximum number of iterations (given by “maxit”) is reached, or when the algorithm has converged (that is, the coefficients’ change between two iterations is no more than <span class="math"><span class="math inline">\(\epsilon\)</span></span>, the given sensitivity threshold):
<ul>
<li>
Compute the centroid for each cluster, using the formula above.
</li>
<li>
For each point, compute its coefficients of being in the clusters, using the formula above.
</li>
</ul>
</li>
</ol>
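<p>
To make the two update formulas concrete, here is a minimal, self-contained R sketch of the iteration above. It is an illustration only, not the implementation used by dedicated packages:
</p>
<pre class="r"><code># Minimal fuzzy c-means iteration (illustrative sketch)
set.seed(123)
x <- scale(USArrests)   # data matrix (n x p)
k <- 2                  # number of clusters
m <- 2                  # fuzzifier

# Step 2: random initial membership coefficients (rows sum to 1)
u <- matrix(runif(nrow(x) * k), nrow(x), k)
u <- u / rowSums(u)

for (iter in 1:100) {   # step 3: iterate until convergence or maxit
  # Centroid update: mean of all points, weighted by u^m
  centers <- t(u^m) %*% x / colSums(u^m)
  # Euclidean distance of each point to each center
  d <- sqrt(sapply(1:k, function(j) rowSums(sweep(x, 2, centers[j, ])^2)))
  d <- pmax(d, 1e-10)   # guard against division by zero
  # Membership update: u_ij = 1 / sum_l (d_ij / d_il)^(2/(m-1))
  u.new <- 1 / (d^(2 / (m - 1)) * rowSums(d^(-2 / (m - 1))))
  if (max(abs(u.new - u)) < 1e-6) break  # sensitivity threshold epsilon
  u <- u.new
}
head(rowSums(u), 3)  # memberships of each observation sum to 1 (up to rounding)</code></pre>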
<p>
The algorithm minimizes intra-cluster variance as well, but has the same problems as k-means; the minimum is a local minimum, and the results depend on the initial choice of weights. Hence, different initializations may lead to different results.
</p>
<p>
Using a mixture of Gaussians together with the expectation-maximization algorithm is a more statistically formalized method that incorporates some of these ideas, notably partial membership in classes.
</p>
</div>


</div><!--end rdoc-->

 
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
  (function () {
    var script = document.createElement("script");
    script.type = "text/javascript";
    script.src  = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
    document.getElementsByTagName("head")[0].appendChild(script);
  })();
</script>

<!-- END HTML -->]]></description>
			<pubDate>Thu, 07 Sep 2017 16:44:00 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Fuzzy Clustering Essentials]]></title>
			<link>https://www.sthda.com/english/articles/30-advanced-clustering/101-fuzzy-clustering-essentials/</link>
			<guid>https://www.sthda.com/english/articles/30-advanced-clustering/101-fuzzy-clustering-essentials/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">
<p><strong>Fuzzy clustering</strong> is considered a soft clustering method, in which each element has a probability of belonging to each cluster. In other words, each element has a set of membership coefficients corresponding to its degree of belonging to each cluster.</p>
<p>This is different from k-means and k-medoids clustering, where each object is assigned to exactly one cluster. K-means and k-medoids clustering are therefore known as hard, or non-fuzzy, clustering.</p>
<p>In fuzzy clustering, points close to the center of a cluster may belong to it to a higher degree than points on the edge of the cluster. The degree to which an element belongs to a given cluster is a numerical value varying from 0 to 1.</p>
<p>The <strong>fuzzy c-means</strong> (FCM) algorithm is one of the most widely used fuzzy clustering algorithms. The centroid of a cluster is calculated as the mean of all points, weighted by their degree of belonging to the cluster.</p>
<div class="block">
<p>
In this article, we’ll describe how to compute fuzzy clustering using the R software.
</p>
</div>
<br/>
<p>Related Book:</p>
<div class = "small-block content-privileged-friends cluster-book">
    <center>
        <a href = "https://www.sthda.com/english/web/5-bookadvisor/17-practical-guide-to-cluster-analysis-in-r/">
          <img src = "https://www.sthda.com/english/sthda-upload/images/cluster-analysis/clustering-book-cover.png" /><br/>
      Practical Guide to Cluster Analysis in R
      </a>
      </center>
</div>
<div class="spacer"></div>
<div id="required-r-packages" class="section level2">
<h2>Required R packages</h2>
<p>We’ll use the following R packages: 1) <em>cluster</em> for computing fuzzy clustering and 2) <em>factoextra</em> for visualizing clusters.</p>
</div>
<div id="computing-fuzzy-clustering" class="section level2">
<h2>Computing fuzzy clustering</h2>
<p>The function <em>fanny</em>() [<em>cluster</em> R package] can be used to compute fuzzy clustering. <strong>FANNY</strong> stands for <strong>fuzzy analysis clustering</strong>. A simplified format is:</p>
<pre class="r"><code>fanny(x, k, metric = "euclidean", stand = FALSE)</code></pre>
<div class="block">
<ul>
<li>
<strong>x</strong>: A data matrix or data frame or dissimilarity matrix
</li>
<li>
<strong>k</strong>: The desired number of clusters to be generated
</li>
<li>
<strong>metric</strong>: Metric for calculating dissimilarities between observations
</li>
<li>
<strong>stand</strong>: If TRUE, variables are standardized before calculating the dissimilarities
</li>
</ul>
</div>
<p>The function <em>fanny</em>() returns an object including the following components:</p>
<ul>
<li><strong>membership</strong>: matrix containing the degree to which each observation belongs to a given cluster. Column names are the clusters and rows are observations</li>
<li><strong>coeff</strong>: Dunn’s partition coefficient F(k) of the clustering, where k is the number of clusters. F(k) is the sum of all squared membership coefficients, divided by the number of observations. Its value is between 1/k and 1. The normalized form of the coefficient is also given. It is defined as <span class="math inline">\((F(k) - 1/k) / (1 - 1/k)\)</span>, and ranges between 0 and 1. A low value of Dunn’s coefficient indicates a very fuzzy clustering, whereas a value close to 1 indicates a near-crisp clustering.</li>
<li><strong>clustering</strong>: the clustering vector containing the nearest crisp grouping of observations</li>
</ul>
<p>For example, the R code below applies fuzzy clustering on the USArrests data set:</p>
<pre class="r"><code>library(cluster)
df <- scale(USArrests)     # Standardize the data
res.fanny <- fanny(df, 2)  # Compute fuzzy clustering with k = 2</code></pre>
<p>The different components can be extracted using the code below:</p>
<pre class="r"><code>head(res.fanny$membership, 3) # Membership coefficients</code></pre>
<pre><code>##          [,1]  [,2]
## Alabama 0.664 0.336
## Alaska  0.610 0.390
## Arizona 0.686 0.314</code></pre>
<pre class="r"><code>res.fanny$coeff # Dunn's partition coefficient</code></pre>
<pre><code>## dunn_coeff normalized 
##      0.555      0.109</code></pre>
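<p>
As a quick sanity check, Dunn’s coefficient can be recomputed by hand from the membership matrix, using the definition given above:
</p>
<pre class="r"><code>library(cluster)
df <- scale(USArrests)
res.fanny <- fanny(df, 2)

k  <- 2
Fk <- sum(res.fanny$membership^2) / nrow(df)   # sum of squared memberships / n
c(dunn_coeff = Fk, normalized = (Fk - 1/k) / (1 - 1/k))</code></pre>
<p>
Both values should agree with the output of <em>res.fanny$coeff</em>.
</p>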
<pre class="r"><code>head(res.fanny$clustering) # Observation groups</code></pre>
<pre><code>##    Alabama     Alaska    Arizona   Arkansas California   Colorado 
##          1          1          1          2          1          1</code></pre>
<p>To visualize observation groups, use the function <em>fviz_cluster</em>() [<em>factoextra</em> package]:</p>
<pre class="r"><code>library(factoextra)
fviz_cluster(res.fanny, ellipse.type = "norm", repel = TRUE,
             palette = "jco", ggtheme = theme_minimal(),
             legend = "right")</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/021-fuzzy-clustering-visualize-1.png" width="518.4" /></p>
<p>To evaluate the goodness of the clustering results, plot the silhouette coefficient as follows:</p>
<pre class="r"><code>fviz_silhouette(res.fanny, palette = "jco",
                ggtheme = theme_minimal())</code></pre>
<pre><code>##   cluster size ave.sil.width
## 1       1   22          0.32
## 2       2   28          0.44</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/021-fuzzy-clustering-silhouette-1.png" width="518.4" /></p>
</div>
<div id="summary" class="section level2">
<h2>Summary</h2>
<p>Fuzzy clustering is an alternative to k-means clustering, where each data point has membership coefficient to each cluster. Here, we demonstrated how to compute and visualize fuzzy clustering using the combination of <em>cluster</em> and <em>factoextra</em> R packages.</p>
</div>
</div><!--end rdoc-->
 
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
  (function () {
    var script = document.createElement("script");
    script.type = "text/javascript";
    script.src  = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
    document.getElementsByTagName("head")[0].appendChild(script);
  })();
</script>

<!-- END HTML -->]]></description>
			<pubDate>Thu, 07 Sep 2017 15:50:00 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Hierarchical K-Means Clustering: Optimize Clusters]]></title>
			<link>https://www.sthda.com/english/articles/30-advanced-clustering/100-hierarchical-k-means-clustering-optimize-clusters/</link>
			<guid>https://www.sthda.com/english/articles/30-advanced-clustering/100-hierarchical-k-means-clustering-optimize-clusters/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">
<div id="hkmeans" class="section level1">
<h1>Hierarchical K-Means Clustering</h1>
<p>K-means (Chapter @ref(kmeans-clustering)) is one of the most popular clustering algorithms. However, it has some limitations: it requires the user to specify the number of clusters in advance, and it selects the initial centroids randomly. The final k-means solution is very sensitive to this initial random selection of cluster centers, so the result might be (slightly) different each time you compute k-means.</p>
<div class="block">
<p>
In this chapter, we describe a hybrid method, named <strong>hierarchical k-means clustering</strong> (hkmeans), for improving k-means results.
</p>
</div>
<br/>
<p>Related Book:</p>
<div class = "small-block content-privileged-friends cluster-book">
    <center>
        <a href = "https://www.sthda.com/english/web/5-bookadvisor/17-practical-guide-to-cluster-analysis-in-r/">
          <img src = "https://www.sthda.com/english/sthda-upload/images/cluster-analysis/clustering-book-cover.png" /><br/>
      Practical Guide to Cluster Analysis in R
      </a>
      </center>
</div>
<div class="spacer"></div>
<div id="algorithm" class="section level2">
<h2>Algorithm</h2>
<p>The algorithm is summarized as follows:</p>
<ol style="list-style-type: decimal">
<li>Compute hierarchical clustering and cut the tree into k-clusters</li>
<li>Compute the center (i.e. the mean) of each cluster</li>
<li>Compute k-means by using the set of cluster centers (defined in step 2) as the initial cluster centers</li>
</ol>
<div class="notice">
<p>
Note that the k-means step will improve the initial partitioning generated at step 2 of the algorithm. Hence, the initial partitioning can be slightly different from the final partitioning obtained at step 3.
</p>
</div>
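<p>
The three steps can also be sketched directly with base R functions. This is a minimal illustration of the idea, assuming Ward linkage for the hierarchical step:
</p>
<pre class="r"><code># Manual sketch of hierarchical k-means (steps 1-3 above)
df <- scale(USArrests)
k  <- 4

# Step 1: hierarchical clustering, tree cut into k clusters
res.hc <- hclust(dist(df), method = "ward.D2")
grp    <- cutree(res.hc, k = k)

# Step 2: the mean of each hierarchical cluster
centers <- apply(df, 2, function(v) tapply(v, grp, mean))

# Step 3: k-means seeded with those centers
res.km <- kmeans(df, centers = centers)
table(initial = grp, final = res.km$cluster)</code></pre>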
</div>
<div id="r-code" class="section level2">
<h2>R code</h2>
<p>The R function <em>hkmeans</em>() [in <em>factoextra</em>] provides an easy solution for computing hierarchical k-means clustering. The format of the result is similar to that of the standard kmeans() function (see Chapter @ref(kmeans-clustering)).</p>
<p>To install factoextra, type this: <em>install.packages(“factoextra”)</em>.</p>
<p>We’ll use the USArrest data set and we start by standardizing the data:</p>
<pre class="r"><code>df <- scale(USArrests)</code></pre>
<pre class="r"><code># Compute hierarchical k-means clustering
library(factoextra)
res.hk <- hkmeans(df, 4)
# Elements returned by hkmeans()
names(res.hk)</code></pre>
<pre><code>##  [1] "cluster"      "centers"      "totss"        "withinss"    
##  [5] "tot.withinss" "betweenss"    "size"         "iter"        
##  [9] "ifault"       "data"         "hclust"</code></pre>
<p>To print all the results, type this:</p>
<pre class="r"><code># Print the results
res.hk</code></pre>
<pre class="r"><code># Visualize the tree
fviz_dend(res.hk, cex = 0.6, palette = "jco", 
          rect = TRUE, rect_border = "jco", rect_fill = TRUE)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/020-hierarchical-k-means-clustering-hierarchical-k-means-clustering-1.png" width="518.4" /></p>
<pre class="r"><code># Visualize the hkmeans final clusters
fviz_cluster(res.hk, palette = "jco", repel = TRUE,
             ggtheme = theme_classic())</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/020-hierarchical-k-means-clustering-hierarchical-k-means-clustering-2.png" width="518.4" /></p>
</div>
<div id="summary" class="section level2">
<h2>Summary</h2>
<p>We described hybrid <strong>hierarchical k-means clustering</strong> for improving k-means results.</p>
</div>
</div><!--end rdoc-->

<!-- END HTML -->]]></description>
			<pubDate>Thu, 07 Sep 2017 15:21:00 +0200</pubDate>
			
		</item>
		
	</channel>
</rss>
