Last articles - STHDA : Partitioning Clustering Essentials

CLARA - Clustering Large Applications

Mon, 04 Sep 2017 22:23:00 +0200

CLARA (Clustering Large Applications, (Kaufman and Rousseeuw 1990)) is an extension to k-medoids methods (Chapter @ref(k-medoids)) to deal with data containing a large number of objects (more than several thousand observations) in order to reduce computing time and RAM storage problem. This is achieved using the sampling approach.

Contents:

CLARA concept
CLARA Algorithm
Computing CLARA in R
Summary
References

CLARA concept

Instead of finding medoids for the entire data set, CLARA considers a small sample of the data with fixed size (sampsize) and applies the PAM algorithm (Chapter @ref(k-medoids)) to generate an optimal set of medoids for the sample. The quality of resulting medoids is measured by the average dissimilarity between every object in the entire data set and the medoid of its cluster, defined as the cost function.

CLARA repeats the sampling and clustering processes a pre-specified number of times in order to minimize the sampling bias. The final clustering results correspond to the set of medoids with the minimal cost. The CLARA algorithm is summarized in the next section.

CLARA Algorithm

The algorithm is as follow:

Split randomly the data sets in multiple subsets with fixed size (sampsize)
Compute PAM algorithm on each subset and choose the corresponding k representative objects (medoids). Assign each observation of the entire data set to the closest medoid.
Calculate the mean (or the sum) of the dissimilarities of the observations to their closest medoid. This is used as a measure of the goodness of the clustering.
Retain the sub-dataset for which the mean (or sum) is minimal. A further analysis is carried out on the final partition.

Note that, each sub-data set is forced to contain the medoids obtained from the best sub-data set until then. Randomly drawn observations are added to this set until sampsize has been reached.

Computing CLARA in R

Data format and preparation

To compute the CLARA algorithm in R, the data should be prepared as indicated in Chapter @ref(data-preparation-and-r-packages).

Here, we’ll generate use a random data set. To make the result reproducible, we start by using the function set.seed().

set.seed(1234)
# Generate 500 objects, divided into 2 clusters.
df <- rbind(cbind(rnorm(200,0,8), rnorm(200,0,8)),
           cbind(rnorm(300,50,8), rnorm(300,50,8)))
# Specify column and row names
colnames(df) <- c("x", "y")
rownames(df) <- paste0("S", 1:nrow(df))
# Previewing the data
head(df, nrow = 6)

##         x    y
## S1  -9.66 3.88
## S2   2.22 5.57
## S3   8.68 1.48
## S4 -18.77 5.61
## S5   3.43 2.49
## S6   4.05 6.08

Required R packages and functions

The function clara() [cluster package] can be used to compute CLARA. The simplified format is as follow:

clara(x, k, metric = "euclidean", stand = FALSE, 
      samples = 5, pamLike = FALSE)

x: a numeric data matrix or data frame, each row corresponds to an observation, and each column corresponds to a variable. Missing values (NAs) are allowed.
k: the number of clusters.
metric: the distance metrics to be used. Available options are “euclidean” and “manhattan”. Euclidean distances are root sum-of-squares of differences, and manhattan distances are the sum of absolute differences. Read more on distance measures (Chapter ). Note that, manhattan distance is less sensitive to outliers.
stand: logical value; if true, the variables (columns) in x are standardized before calculating the dissimilarities. Note that, it’s recommended to standardize variables before clustering.
samples: number of samples to be drawn from the data set. Default value is 5 but it’s recommended a much larger value.
pamLike: logical indicating if the same algorithm in the pam() function should be used. This should be always true.

To create a beautiful graph of the clusters generated with the pam() function, will use the factoextra package.

Installing required packages:

install.packages(c("cluster", "factoextra"))

Loading the packages:

library(cluster)
library(factoextra)

Estimating the optimal number of clusters

To estimate the optimal number of clusters in your data, it’s possible to use the average silhouette method as described in PAM clustering chapter (Chapter @ref(k-medoids)). The R function fviz_nbclust() [factoextra package] provides a solution to facilitate this step.

library(cluster)
library(factoextra)
fviz_nbclust(df, clara, method = "silhouette")+
  theme_classic()

From the plot, the suggested number of clusters is 2. In the next section, we’ll classify the observations into 2 clusters.

Computing CLARA

The R code below computes PAM algorithm with k = 2:

# Compute CLARA
clara.res <- clara(df, 2, samples = 50, pamLike = TRUE)
# Print components of clara.res
print(clara.res)

## Call:     clara(x = df, k = 2, samples = 50, pamLike = TRUE) 
## Medoids:
##          x     y
## S121 -1.53  1.15
## S455 48.36 50.23
## Objective function:   9.88
## Clustering vector:    Named int [1:500] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "names")= chr [1:500] "S1" "S2" "S3" "S4" "S5" "S6" "S7" ...
## Cluster sizes:            200 300 
## Best sample:
##  [1] S37  S49  S54  S63  S68  S71  S76  S80  S82  S101 S103 S108 S109 S118
## [15] S121 S128 S132 S138 S144 S162 S203 S210 S216 S231 S234 S249 S260 S261
## [29] S286 S299 S304 S305 S312 S315 S322 S350 S403 S450 S454 S455 S456 S465
## [43] S488 S497
## 
## Available components:
##  [1] "sample"     "medoids"    "i.med"      "clustering" "objective" 
##  [6] "clusinfo"   "diss"       "call"       "silinfo"    "data"

The output of the function clara() includes the following components:

medoids: Objects that represent clusters
clustering: a vector containing the cluster number of each object
sample: labels or case numbers of the observations in the best sample, that is, the sample used by the clara algorithm for the final partition.

If you want to add the point classifications to the original data, use this:

dd <- cbind(df, cluster = clara.res$cluster)
head(dd, n = 4)

##         x    y cluster
## S1  -9.66 3.88       1
## S2   2.22 5.57       1
## S3   8.68 1.48       1
## S4 -18.77 5.61       1

You can access to the results returned by clara() as follow:

# Medoids
clara.res$medoids

##          x     y
## S121 -1.53  1.15
## S455 48.36 50.23

# Clustering
head(clara.res$clustering, 10)

##  S1  S2  S3  S4  S5  S6  S7  S8  S9 S10 
##   1   1   1   1   1   1   1   1   1   1

The medoids are S121, S455

Visualizing CLARA clusters

To visualize the partitioning results, we’ll use the function fviz_cluster() [factoextra package]. It draws a scatter plot of data points colored by cluster numbers.

fviz_cluster(clara.res, 
             palette = c("#00AFBB", "#FC4E07"), # color palette
             ellipse.type = "t", # Concentration ellipse
             geom = "point", pointsize = 1,
             ggtheme = theme_classic()
             )

Summary

The CLARA (Clustering Large Applications) algorithm is an extension to the PAM (Partitioning Around Medoids) clustering method for large data sets. It intended to reduce the computation time in the case of large data set.

As almost all partitioning algorithm, it requires the user to specify the appropriate number of clusters to be produced. This can be estimated using the function fviz_nbclust [in factoextra R package].

The R function clara() [cluster package] can be used to compute CLARA algorithm. The simplified format is clara(x, k, pamLike = TRUE), where “x” is the data and k is the number of clusters to be generated.

After, computing CLARA, the R function fviz_cluster() [factoextra package] can be used to visualize the results. The format is fviz_cluster(clara.res), where clara.res is the CLARA results.

References

Kaufman, Leonard, and Peter Rousseeuw. 1990. Finding Groups in Data: An Introduction to Cluster Analysis.

K-Medoids Essentials

Mon, 04 Sep 2017 22:09:00 +0200

The k-medoids algorithm is a clustering approach related to k-means clustering (chapter @ref(kmeans-clustering)) for partitioning a data set into k groups or clusters. In k-medoids clustering, each cluster is represented by one of the data point in the cluster. These points are named cluster medoids.

The term medoid refers to an object within a cluster for which average dissimilarity between it and all the other the members of the cluster is minimal. It corresponds to the most centrally located point in the cluster. These objects (one per cluster) can be considered as a representative example of the members of that cluster which may be useful in some situations. Recall that, in k-means clustering, the center of a given cluster is calculated as the mean value of all the data points in the cluster.

K-medoid is a robust alternative to k-means clustering. This means that, the algorithm is less sensitive to noise and outliers, compared to k-means, because it uses medoids as cluster centers instead of means (used in k-means).

The k-medoids algorithm requires the user to specify k, the number of clusters to be generated (like in k-means clustering). A useful approach to determine the optimal number of clusters is the silhouette method, described in the next sections.

The most common k-medoids clustering methods is the PAM algorithm (Partitioning Around Medoids, (Kaufman and Rousseeuw 1990)).

In this article, We’ll describe the PAM algorithm and provide practical examples using R software. In the next chapter, we’ll also discuss a variant of PAM named CLARA (Clustering Large Applications) which is used for analyzing large data sets.

Contents:

PAM concept
PAM algorithm
Computing PAM in R
Summary
References

PAM concept

The use of means implies that k-means clustering is highly sensitive to outliers. This can severely affects the assignment of observations to clusters. A more robust algorithm is provided by the PAM algorithm.

PAM algorithm

The PAM algorithm is based on the search for k representative objects or medoids among the observations of the data set.

After finding a set of k medoids, clusters are constructed by assigning each observation to the nearest medoid. Next, each selected medoid m and each non-medoid data point are swapped and the objective function is computed. The objective function corresponds to the sum of the dissimilarities of all objects to their nearest medoid.

The SWAP step attempts to improve the quality of the clustering by exchanging selected objects (medoids) and non-selected objects. If the objective function can be reduced by interchanging a selected object with an unselected object, then the swap is carried out. This is continued until the objective function can no longer be decreased. The goal is to find k representative objects which minimize the sum of the dissimilarities of the observations to their closest representative object.

In summary, PAM algorithm proceeds in two phases as follow:

Build phase:

Select k objects to become the medoids, or in case these objects were provided use them as the medoids;
Calculate the dissimilarity matrix if it was not provided;
Assign every object to its closest medoid;

Swap phase: 4. For each cluster search if any of the object of the cluster decreases the average dissimilarity coefficient; if it does, select the entity that decreases this coefficient the most as the medoid for this cluster; 5. If at least one medoid has changed go to (3), else end the algorithm.

As mentioned above, the PAM algorithm works with a matrix of dissimilarity, and to compute this matrix the algorithm can use two metrics:

The euclidean distances, that are the root sum-of-squares of differences;
And, the Manhattan distance that are the sum of absolute distances.

Note that, in practice, you should get similar results most of the time, using either euclidean or Manhattan distance. If your data contains outliers, Manhattan distance should give more robust results, whereas euclidean would be influenced by unusual values.

Read more on distance measures in Chapter @ref(clustering-distance-measures).

Computing PAM in R

Data

We’ll use the demo data sets “USArrests”, which we start by scaling (Chapter @ref(data-preparation-and-r-packages) using the R function scale() as follow:

data("USArrests")      # Load the data set
df <- scale(USArrests) # Scale the data
head(df, n = 3)        # View the firt 3 rows of the data

##         Murder Assault UrbanPop     Rape
## Alabama 1.2426   0.783   -0.521 -0.00342
## Alaska  0.5079   1.107   -1.212  2.48420
## Arizona 0.0716   1.479    0.999  1.04288

Required R packages and functions

The function pam() [cluster package] and pamk() [fpc package] can be used to compute PAM.

The function pamk() does not require a user to decide the number of clusters K.

In the following examples, we’ll describe only the function pam(), which simplified format is:

pam(x, k, metric = "euclidean", stand = FALSE)

x: possible values includes:
- Numeric data matrix or numeric data frame: each row corresponds to an observation, and each column corresponds to a variable.
- Dissimilarity matrix: in this case x is typically the output of daisy() or dist()
k: The number of clusters
metric: the distance metrics to be used. Available options are “euclidean” and “manhattan”.
stand: logical value; if true, the variables (columns) in x are standardized before calculating the dissimilarities. Ignored when x is a dissimilarity matrix.

To create a beautiful graph of the clusters generated with the pam() function, will use the factoextra package.

Installing required packages:

install.packages(c("cluster", "factoextra"))

Loading the packages:

library(cluster)
library(factoextra)

Estimating the optimal number of clusters

To estimate the optimal number of clusters, we’ll use the average silhouette method. The idea is to compute PAM algorithm using different values of clusters k. Next, the average clusters silhouette is drawn according to the number of clusters. The average silhouette measures the quality of a clustering. A high average silhouette width indicates a good clustering. The optimal number of clusters k is the one that maximize the average silhouette over a range of possible values for k (Kaufman and Rousseeuw 1990).

The R function fviz_nbclust() [factoextra package] provides a convenient solution to estimate the optimal number of clusters.

library(cluster)
library(factoextra)
fviz_nbclust(df, pam, method = "silhouette")+
  theme_classic()

From the plot, the suggested number of clusters is 2. In the next section, we’ll classify the observations into 2 clusters.

Computing PAM clustering

The R code below computes PAM algorithm with k = 2:

pam.res <- pam(df, 2)
print(pam.res)

## Medoids:
##            ID Murder Assault UrbanPop   Rape
## New Mexico 31  0.829   1.371    0.308  1.160
## Nebraska   27 -0.801  -0.825   -0.245 -0.505
## Clustering vector:
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              1              1              2              1 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              1              2              2              1              1 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              2              2              1              2              2 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              2              2              1              2              1 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              2              1              2              1              1 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              2              2              1              2              2 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              1              1              1              2              2 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              2              2              2              2              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              2              1              1              2              2 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              2              2              2              2              2 
## Objective function:
## build  swap 
##  1.44  1.37 
## 
## Available components:
##  [1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 
##  [6] "clusinfo"   "silinfo"    "diss"       "call"       "data"

The printed output shows:

the cluster medoids: a matrix, which rows are the medoids and columns are variables
the clustering vector: A vector of integers (from 1:k) indicating the cluster to which each point is allocated

If you want to add the point classifications to the original data, use this:

dd <- cbind(USArrests, cluster = pam.res$cluster)
head(dd, n = 3)

##         Murder Assault UrbanPop Rape cluster
## Alabama   13.2     236       58 21.2       1
## Alaska    10.0     263       48 44.5       1
## Arizona    8.1     294       80 31.0       1

Accessing to the results of the pam() function

The function pam() returns an object of class pam which components include:

medoids: Objects that represent clusters
clustering: a vector containing the cluster number of each object

These components can be accessed as follow:

# Cluster medoids: New Mexico, Nebraska
pam.res$medoids

##            Murder Assault UrbanPop   Rape
## New Mexico  0.829   1.371    0.308  1.160
## Nebraska   -0.801  -0.825   -0.245 -0.505

# Cluster numbers
head(pam.res$clustering)

##    Alabama     Alaska    Arizona   Arkansas California   Colorado 
##          1          1          1          2          1          1

Visualizing PAM clusters

To visualize the partitioning results, we’ll use the function fviz_cluster() [factoextra package]. It draws a scatter plot of data points colored by cluster numbers. If the data contains more than 2 variables, the Principal Component Analysis (PCA) algorithm is used to reduce the dimensionality of the data. In this case, the first two principal dimensions are used to plot the data.

fviz_cluster(pam.res, 
             palette = c("#00AFBB", "#FC4E07"), # color palette
             ellipse.type = "t", # Concentration ellipse
             repel = TRUE, # Avoid label overplotting (slow)
             ggtheme = theme_classic()
             )

Summary

The K-medoids algorithm, PAM, is a robust alternative to k-means for partitioning a data set into clusters of observation.

In k-medoids method, each cluster is represented by a selected object within the cluster. The selected objects are named medoids and corresponds to the most centrally located points within the cluster.

The PAM algorithm requires the user to know the data and to indicate the appropriate number of clusters to be produced. This can be estimated using the function fviz_nbclust [in factoextra R package].

The R function pam() [cluster package] can be used to compute PAM algorithm. The simplified format is pam(x, k), where “x” is the data and k is the number of clusters to be generated.

After, performing PAM clustering, the R function fviz_cluster() [factoextra package] can be used to visualize the results. The format is fviz_cluster(pam.res), where pam.res is the PAM results.

Note that, for large data sets, () may need too much memory or too much computation time. In this case, the function () is preferable. This should not be a problem for modern computers.

References

Kaufman, Leonard, and Peter Rousseeuw. 1990. Finding Groups in Data: An Introduction to Cluster Analysis.

K-Means Clustering Essentials

Mon, 04 Sep 2017 21:17:00 +0200

K-means clustering (MacQueen 1967) is the most commonly used unsupervised machine learning algorithm for partitioning a given data set into a set of k groups (i.e. k clusters), where k represents the number of groups pre-specified by the analyst. It classifies objects in multiple groups (i.e., clusters), such that objects within the same cluster are as similar as possible (i.e., high intra-class similarity), whereas objects from different clusters are as dissimilar as possible (i.e., low inter-class similarity). In k-means clustering, each cluster is represented by its center (i.e, centroid) which corresponds to the mean of points assigned to the cluster.

In this article, we’ll describe the k-means algorithm and provide practical examples using R software.

Contents:

K-means basic ideas
K-means algorithm
Computing k-means clustering in R
K-means clustering advantages and disadvantages
Alternative to k-means clustering
Summary
References

K-means basic ideas

The basic idea behind k-means clustering consists of defining clusters so that the total intra-cluster variation (known as total within-cluster variation) is minimized.

There are several k-means algorithms available. The standard algorithm is the Hartigan-Wong algorithm (Hartigan and Wong 1979), which defines the total within-cluster variation as the sum of squared distances Euclidean distances between items and the corresponding centroid:

\[ W(C_k) = \sum\limits_{x_i \in C_k} (x_i - \mu_k)^2 \]

\(x_i\) design a data point belonging to the cluster \(C_k\)
\(\mu_k\) is the mean value of the points assigned to the cluster \(C_k\)

Each observation (\(x_i\)) is assigned to a given cluster such that the sum of squares (SS) distance of the observation to their assigned cluster centers \(\mu_k\) is a minimum.

We define the total within-cluster variation as follow:

\[ tot.withinss = \sum\limits_{k=1}^k W(C_k) = \sum\limits_{k=1}^k \sum\limits_{x_i \in C_k} (x_i - \mu_k)^2 \]

The total within-cluster sum of square measures the compactness (i.e goodness) of the clustering and we want it to be as small as possible.

K-means algorithm

The first step when using k-means clustering is to indicate the number of clusters (k) that will be generated in the final solution.

The algorithm starts by randomly selecting k objects from the data set to serve as the initial centers for the clusters. The selected objects are also known as cluster means or centroids.

Next, each of the remaining objects is assigned to it’s closest centroid, where closest is defined using the Euclidean distance (Chapter @ref(clustering-distance-measures) between the object and the cluster mean. This step is called “cluster assignment step”. Note that, to use correlation distance, the data are input as z-scores.

After the assignment step, the algorithm computes the new mean value of each cluster. The term cluster “centroid update” is used to design this step. Now that the centers have been recalculated, every observation is checked again to see if it might be closer to a different cluster. All the objects are reassigned again using the updated cluster means.

The cluster assignment and centroid update steps are iteratively repeated until the cluster assignments stop changing (i.e until convergence is achieved). That is, the clusters formed in the current iteration are the same as those obtained in the previous iteration.

K-means algorithm can be summarized as follow:

Specify the number of clusters (K) to be created (by the analyst)
Select randomly k objects from the dataset as the initial cluster centers or means
Assigns each observation to their closest centroid, based on the Euclidean distance between the object and the centroid
For each of the k clusters update the cluster centroid by calculating the new mean values of all the data points in the cluster. The centoid of a K_th cluster is a vector of length p containing the means of all variables for the observations in the k_th cluster; p is the number of variables.
Iteratively minimize the total within sum of square. That is, iterate steps 3 and 4 until the cluster assignments stop changing or the maximum number of iterations is reached. By default, the R software uses 10 as the default value for the maximum number of iterations.

Computing k-means clustering in R

Data

We’ll use the demo data sets “USArrests”. The data should be prepared as described in chapter @ref(data-preparation-and-r-packages). The data must contains only continuous variables, as the k-means algorithm uses variable means. As we don’t want the k-means algorithm to depend to an arbitrary variable unit, we start by scaling the data using the R function scale() as follow:

data("USArrests")      # Loading the data set
df <- scale(USArrests) # Scaling the data
# View the firt 3 rows of the data
head(df, n = 3)

##         Murder Assault UrbanPop     Rape
## Alabama 1.2426   0.783   -0.521 -0.00342
## Alaska  0.5079   1.107   -1.212  2.48420
## Arizona 0.0716   1.479    0.999  1.04288

Required R packages and functions

The standard R function for k-means clustering is kmeans() [stats package], which simplified format is as follow:

kmeans(x, centers, iter.max = 10, nstart = 1)

x: numeric matrix, numeric data frame or a numeric vector
centers: Possible values are the number of clusters (k) or a set of initial (distinct) cluster centers. If a number, a random set of (distinct) rows in x is chosen as the initial centers.
iter.max: The maximum number of iterations allowed. Default value is 10.
nstart: The number of random starting partitions when centers is a number. Trying nstart > 1 is often recommended.

To create a beautiful graph of the clusters generated with the kmeans() function, will use the factoextra package.

Installing factoextra package as:

install.packages("factoextra")

Loading factoextra:

library(factoextra)

Estimating the optimal number of clusters

The k-means clustering requires the users to specify the number of clusters to be generated.

One fundamental question is: How to choose the right number of expected clusters (k)?

Different methods will be presented in the chapter “cluster evaluation and validation statistics”.

Here, we provide a simple solution. The idea is to compute k-means clustering using different values of clusters k. Next, the wss (within sum of square) is drawn according to the number of clusters. The location of a bend (knee) in the plot is generally considered as an indicator of the appropriate number of clusters.

The R function fviz_nbclust() [in factoextra package] provides a convenient solution to estimate the optimal number of clusters.

library(factoextra)
fviz_nbclust(df, kmeans, method = "wss") +
    geom_vline(xintercept = 4, linetype = 2)

The plot above represents the variance within the clusters. It decreases as k increases, but it can be seen a bend (or “elbow”) at k = 4. This bend indicates that additional clusters beyond the fourth have little value.. In the next section, we’ll classify the observations into 4 clusters.

Computing k-means clustering

As k-means clustering algorithm starts with k randomly selected centroids, it’s always recommended to use the set.seed() function in order to set a seed for R’s random number generator. The aim is to make reproducible the results, so that the reader of this article will obtain exactly the same results as those shown below.

The R code below performs k-means clustering with k = 4:

# Compute k-means with k = 4
set.seed(123)
km.res <- kmeans(df, 4, nstart = 25)

As the final result of k-means clustering result is sensitive to the random starting assignments, we specify nstart = 25. This means that R will try 25 different random starting assignments and then select the best results corresponding to the one with the lowest within cluster variation. The default value of nstart in R is one. But, it’s strongly recommended to compute k-means clustering with a large value of nstart such as 25 or 50, in order to have a more stable result.

# Print the results
print(km.res)

## K-means clustering with 4 clusters of sizes 13, 16, 13, 8
## 
## Cluster means:
##   Murder Assault UrbanPop    Rape
## 1 -0.962  -1.107   -0.930 -0.9668
## 2 -0.489  -0.383    0.576 -0.2617
## 3  0.695   1.039    0.723  1.2769
## 4  1.412   0.874   -0.815  0.0193
## 
## Clustering vector:
##        Alabama         Alaska        Arizona       Arkansas     California 
##              4              3              3              4              3 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              3              2              2              3              4 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              2              1              3              2              1 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              2              1              4              1              3 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              2              3              1              4              3 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              1              1              3              1              2 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              3              3              4              1              2 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              2              2              2              2              4 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              1              4              3              2              1 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              2              2              1              1              2 
## 
## Within cluster sum of squares by cluster:
## [1] 11.95 16.21 19.92  8.32
##  (between_SS / total_SS =  71.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

The printed output displays:

the cluster means or centers: a matrix, which rows are cluster number (1 to 4) and columns are variables
the clustering vector: A vector of integers (from 1:k) indicating the cluster to which each point is allocated

It’s possible to compute the mean of each variables by clusters using the original data:

aggregate(USArrests, by=list(cluster=km.res$cluster), mean)

##   cluster Murder Assault UrbanPop Rape
## 1       1   3.60    78.5     52.1 12.2
## 2       2   5.66   138.9     73.9 18.8
## 3       3  10.82   257.4     76.0 33.2
## 4       4  13.94   243.6     53.8 21.4

If you want to add the point classifications to the original data, use this:

dd <- cbind(USArrests, cluster = km.res$cluster)
head(dd)

##            Murder Assault UrbanPop Rape cluster
## Alabama      13.2     236       58 21.2       4
## Alaska       10.0     263       48 44.5       3
## Arizona       8.1     294       80 31.0       3
## Arkansas      8.8     190       50 19.5       4
## California    9.0     276       91 40.6       3
## Colorado      7.9     204       78 38.7       3

Accessing to the results of kmeans() function

kmeans() function returns a list of components, including:

cluster: A vector of integers (from 1:k) indicating the cluster to which each point is allocated
centers: A matrix of cluster centers (cluster means)
totss: The total sum of squares (TSS), i.e \(\sum{(x_i - \bar{x})^2}\). TSS measures the total variance in the data.
withinss: Vector of within-cluster sum of squares, one component per cluster
tot.withinss: Total within-cluster sum of squares, i.e. \(sum(withinss)\)
betweenss: The between-cluster sum of squares, i.e. \(totss - tot.withinss\)
size: The number of observations in each cluster

These components can be accessed as follow:

# Cluster number for each of the observations
km.res$cluster

head(km.res$cluster, 4)

##  Alabama   Alaska  Arizona Arkansas 
##        4        3        3        4

…..

# Cluster size
km.res$size

## [1] 13 16 13  8

# Cluster means
km.res$centers

##   Murder Assault UrbanPop    Rape
## 1 -0.962  -1.107   -0.930 -0.9668
## 2 -0.489  -0.383    0.576 -0.2617
## 3  0.695   1.039    0.723  1.2769
## 4  1.412   0.874   -0.815  0.0193

Visualizing k-means clusters

It is a good idea to plot the cluster results. These can be used to assess the choice of the number of clusters as well as comparing two different cluster analyses.

Now, we want to visualize the data in a scatter plot with coloring each data point according to its cluster assignment.

The problem is that the data contains more than 2 variables and the question is what variables to choose for the xy scatter plot.

A solution is to reduce the number of dimensions by applying a dimensionality reduction algorithm, such as Principal Component Analysis (PCA), that operates on the four variables and outputs two new variables (that represent the original variables) that you can use to do the plot.

In other words, if we have a multi-dimensional data set, a solution is to perform Principal Component Analysis (PCA) and to plot data points according to the first two principal components coordinates. }

The function fviz_cluster() [factoextra package] can be used to easily visualize k-means clusters. It takes k-means results and the original data as arguments. In the resulting plot, observations are represented by points, using principal components if the number of variables is greater than 2. It’s also possible to draw concentration ellipse around each cluster.

fviz_cluster(km.res, data = df,
             palette = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"), 
             ellipse.type = "euclid", # Concentration ellipse
             star.plot = TRUE, # Add segments from centroids to items
             repel = TRUE, # Avoid label overplotting (slow)
             ggtheme = theme_minimal()
             )

K-means clustering advantages and disadvantages

K-means clustering is very simple and fast algorithm. It can efficiently deal with very large data sets. However there are some weaknesses, including:

It assumes prior knowledge of the data and requires the analyst to choose the appropriate number of cluster (k) in advance
The final results obtained is sensitive to the initial random selection of cluster centers. Why is it a problem? Because, for every different run of the algorithm on the same dataset, you may choose different set of initial centers. This may lead to different clustering results on different runs of the algorithm.
It’s sensitive to outliers.
If you rearrange your data, it’s very possible that you’ll get a different solution every time you change the ordering of your data.

Possible solutions to these weaknesses, include:

Solution to issue 1: Compute k-means for a range of k values, for example by varying k between 2 and 10. Then, choose the best k by comparing the clustering results obtained for the different k values.
Solution to issue 2: Compute K-means algorithm several times with different initial cluster centers. The run with the lowest total within-cluster sum of square is selected as the final clustering solution.
To avoid distortions caused by excessive outliers, it’s possible to use PAM algorithm, which is less sensitive to outliers.

Alternative to k-means clustering

A robust alternative to k-means is PAM, which is based on medoids. As discussed in the next chapter, the PAM clustering can be computed using the function pam() [cluster package]. The function pamk( ) [fpc package] is a wrapper for PAM that also prints the suggested number of clusters based on optimum average silhouette width.

Summary

K-means clustering can be used to classify observations into k groups, based on their similarity. Each group is represented by the mean value of points in the group, known as the cluster centroid.

K-means algorithm requires users to specify the number of cluster to generate. The R function kmeans() [stats package] can be used to compute k-means algorithm. The simplified format is kmeans(x, centers), where “x” is the data and centers is the number of clusters to be produced.

After, computing k-means clustering, the R function fviz_cluster() [factoextra package] can be used to visualize the results. The format is fviz_cluster(km.res, data), where km.res is k-means results and data corresponds to the original data sets.

References

Hartigan, JA, and MA Wong. 1979. “Algorithm AS 136: A K-means clustering algorithm.” Applied Statistics. Royal Statistical Society, 100–108.

MacQueen, J. 1967. “Some Methods for Classification and Analysis of Multivariate Observations.” In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, 281–97. Berkeley, Calif.: University of California Press. http://projecteuclid.org:443/euclid.bsmsp/1200512992.