Articles - Partitioning Clustering Essentials

CLARA - Clustering Large Applications

CLARA (Clustering Large Applications, (Kaufman and Rousseeuw 1990)) is an extension to k-medoids methods (Chapter @ref(k-medoids)) to deal with data containing a large number of objects (more than several thousand observations) in order to reduce computing time and RAM storage problem. This is achieved using the sampling approach.


Contents:


Related Books:

CLARA concept

Instead of finding medoids for the entire data set, CLARA considers a small sample of the data with fixed size (sampsize) and applies the PAM algorithm (Chapter @ref(k-medoids)) to generate an optimal set of medoids for the sample. The quality of resulting medoids is measured by the average dissimilarity between every object in the entire data set and the medoid of its cluster, defined as the cost function.

CLARA repeats the sampling and clustering processes a pre-specified number of times in order to minimize the sampling bias. The final clustering results correspond to the set of medoids with the minimal cost. The CLARA algorithm is summarized in the next section.

CLARA Algorithm

The algorithm is as follow:

  1. Split randomly the data sets in multiple subsets with fixed size (sampsize)

  2. Compute PAM algorithm on each subset and choose the corresponding k representative objects (medoids). Assign each observation of the entire data set to the closest medoid.

  3. Calculate the mean (or the sum) of the dissimilarities of the observations to their closest medoid. This is used as a measure of the goodness of the clustering.

  4. Retain the sub-dataset for which the mean (or sum) is minimal. A further analysis is carried out on the final partition.

Note that, each sub-data set is forced to contain the medoids obtained from the best sub-data set until then. Randomly drawn observations are added to this set until sampsize has been reached.

Computing CLARA in R

Data format and preparation

To compute the CLARA algorithm in R, the data should be prepared as indicated in Chapter @ref(data-preparation-and-r-packages).

Here, we’ll generate use a random data set. To make the result reproducible, we start by using the function set.seed().

set.seed(1234)
# Generate 500 objects, divided into 2 clusters.
df <- rbind(cbind(rnorm(200,0,8), rnorm(200,0,8)),
           cbind(rnorm(300,50,8), rnorm(300,50,8)))
# Specify column and row names
colnames(df) <- c("x", "y")
rownames(df) <- paste0("S", 1:nrow(df))
# Previewing the data
head(df, nrow = 6)
##         x    y
## S1  -9.66 3.88
## S2   2.22 5.57
## S3   8.68 1.48
## S4 -18.77 5.61
## S5   3.43 2.49
## S6   4.05 6.08

Required R packages and functions

The function clara() [cluster package] can be used to compute CLARA. The simplified format is as follow:

clara(x, k, metric = "euclidean", stand = FALSE, 
      samples = 5, pamLike = FALSE)
  • x: a numeric data matrix or data frame, each row corresponds to an observation, and each column corresponds to a variable. Missing values (NAs) are allowed.
  • k: the number of clusters.
  • metric: the distance metrics to be used. Available options are “euclidean” and “manhattan”. Euclidean distances are root sum-of-squares of differences, and manhattan distances are the sum of absolute differences. Read more on distance measures (Chapter ). Note that, manhattan distance is less sensitive to outliers.
  • stand: logical value; if true, the variables (columns) in x are standardized before calculating the dissimilarities. Note that, it’s recommended to standardize variables before clustering.
  • samples: number of samples to be drawn from the data set. Default value is 5 but it’s recommended a much larger value.
  • pamLike: logical indicating if the same algorithm in the pam() function should be used. This should be always true.

To create a beautiful graph of the clusters generated with the pam() function, will use the factoextra package.

  1. Installing required packages:
install.packages(c("cluster", "factoextra"))
  1. Loading the packages:
library(cluster)
library(factoextra)

Estimating the optimal number of clusters

To estimate the optimal number of clusters in your data, it’s possible to use the average silhouette method as described in PAM clustering chapter (Chapter @ref(k-medoids)). The R function fviz_nbclust() [factoextra package] provides a solution to facilitate this step.

library(cluster)
library(factoextra)
fviz_nbclust(df, clara, method = "silhouette")+
  theme_classic()

From the plot, the suggested number of clusters is 2. In the next section, we’ll classify the observations into 2 clusters.

Computing CLARA

The R code below computes PAM algorithm with k = 2:

# Compute CLARA
clara.res <- clara(df, 2, samples = 50, pamLike = TRUE)
# Print components of clara.res
print(clara.res)
## Call:     clara(x = df, k = 2, samples = 50, pamLike = TRUE) 
## Medoids:
##          x     y
## S121 -1.53  1.15
## S455 48.36 50.23
## Objective function:   9.88
## Clustering vector:    Named int [1:500] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "names")= chr [1:500] "S1" "S2" "S3" "S4" "S5" "S6" "S7" ...
## Cluster sizes:            200 300 
## Best sample:
##  [1] S37  S49  S54  S63  S68  S71  S76  S80  S82  S101 S103 S108 S109 S118
## [15] S121 S128 S132 S138 S144 S162 S203 S210 S216 S231 S234 S249 S260 S261
## [29] S286 S299 S304 S305 S312 S315 S322 S350 S403 S450 S454 S455 S456 S465
## [43] S488 S497
## 
## Available components:
##  [1] "sample"     "medoids"    "i.med"      "clustering" "objective" 
##  [6] "clusinfo"   "diss"       "call"       "silinfo"    "data"

The output of the function clara() includes the following components:

  • medoids: Objects that represent clusters
  • clustering: a vector containing the cluster number of each object
  • sample: labels or case numbers of the observations in the best sample, that is, the sample used by the clara algorithm for the final partition.

If you want to add the point classifications to the original data, use this:

dd <- cbind(df, cluster = clara.res$cluster)
head(dd, n = 4)
##         x    y cluster
## S1  -9.66 3.88       1
## S2   2.22 5.57       1
## S3   8.68 1.48       1
## S4 -18.77 5.61       1

You can access to the results returned by clara() as follow:

# Medoids
clara.res$medoids
##          x     y
## S121 -1.53  1.15
## S455 48.36 50.23
# Clustering
head(clara.res$clustering, 10)
##  S1  S2  S3  S4  S5  S6  S7  S8  S9 S10 
##   1   1   1   1   1   1   1   1   1   1

The medoids are S121, S455

Visualizing CLARA clusters

To visualize the partitioning results, we’ll use the function fviz_cluster() [factoextra package]. It draws a scatter plot of data points colored by cluster numbers.

fviz_cluster(clara.res, 
             palette = c("#00AFBB", "#FC4E07"), # color palette
             ellipse.type = "t", # Concentration ellipse
             geom = "point", pointsize = 1,
             ggtheme = theme_classic()
             )

Summary

The CLARA (Clustering Large Applications) algorithm is an extension to the PAM (Partitioning Around Medoids) clustering method for large data sets. It intended to reduce the computation time in the case of large data set.

As almost all partitioning algorithm, it requires the user to specify the appropriate number of clusters to be produced. This can be estimated using the function fviz_nbclust [in factoextra R package].

The R function clara() [cluster package] can be used to compute CLARA algorithm. The simplified format is clara(x, k, pamLike = TRUE), where “x” is the data and k is the number of clusters to be generated.

After, computing CLARA, the R function fviz_cluster() [factoextra package] can be used to visualize the results. The format is fviz_cluster(clara.res), where clara.res is the CLARA results.

References

Kaufman, Leonard, and Peter Rousseeuw. 1990. Finding Groups in Data: An Introduction to Cluster Analysis.