Data Preparation and Essential R Packages for Cluster Analysis

In this chapter, we start by presenting the data format and preparation for cluster analysis. Next, we introduce two main R packages - cluster and factoextra - for computing and visualizing clusters.


Data preparation

To perform a cluster analysis in R, the data should generally be prepared as follows:

  1. Rows are observations (individuals) and columns are variables

  2. Any missing value in the data must be removed or estimated.

  3. The data must be standardized (i.e., scaled) to make variables comparable. Recall that standardization consists of transforming the variables so that they have a mean of zero and a standard deviation of one. Read more about data standardization in the chapter on clustering distance measures.
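
Step 2 leaves a choice between removing and estimating missing values. The sketch below shows both options on a small made-up data frame. The data frame `toy` and its values are hypothetical, and mean imputation is only one simple way to estimate a missing value; it may not be the best choice for a given data set.

```r
# Hypothetical toy data with one missing value per column (illustration only)
toy <- data.frame(
  x = c(1, 2, NA, 4),
  y = c(10, NA, 30, 40)
)

# Option 1: remove every row that contains at least one missing value
toy_removed <- na.omit(toy)

# Option 2: estimate each missing value by the mean of its column
toy_imputed <- as.data.frame(
  lapply(toy, function(col) {
    col[is.na(col)] <- mean(col, na.rm = TRUE)
    col
  })
)

nrow(toy_removed)   # number of rows left after removal
anyNA(toy_imputed)  # checks whether any missing value remains
```

Removal is simpler but discards whole observations, while imputation keeps every row at the cost of introducing estimated values.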

Here, we’ll use the built-in R data set “USArrests”, which contains statistics on arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973. It also includes the percentage of the population living in urban areas.

data("USArrests")  # Load the data set
df <- USArrests    # Use df as shorter name
  1. To remove any missing value that might be present in the data, type this:
df <- na.omit(df)
  2. As we don’t want the clustering algorithm to depend on an arbitrary variable unit, we start by scaling/standardizing the data using the R function scale():
df <- scale(df)
head(df, n = 3)
##         Murder Assault UrbanPop     Rape
## Alabama 1.2426   0.783   -0.521 -0.00342
## Alaska  0.5079   1.107   -1.212  2.48420
## Arizona 0.0716   1.479    0.999  1.04288
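
As a quick check of what scale() does, the following sketch standardizes USArrests by hand and compares the result with the output of scale(). The object names m, df_scaled and m_manual are introduced here for illustration only.

```r
data("USArrests")
m <- as.matrix(na.omit(USArrests))

# Standardize with scale(): subtract the column mean, divide by the column sd
df_scaled <- scale(m)

# The same transformation written out by hand
m_manual <- sweep(m, 2, colMeans(m), "-")
m_manual <- sweep(m_manual, 2, apply(m, 2, sd), "/")

# After standardization, each column has mean (numerically) zero and sd one
round(colMeans(df_scaled), 10)
apply(df_scaled, 2, sd)

# Both approaches agree up to floating-point error
all.equal(m_manual, df_scaled, check.attributes = FALSE)
```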

Required R Packages

In this book, we’ll mainly use the following two R packages: cluster, for computing clustering algorithms, and factoextra, for visualizing the results.

factoextra contains many functions for cluster analysis and visualization, including:

Functions                      Description
dist(fviz_dist, get_dist)      Distance Matrix Computation and Visualization
get_clust_tendency             Assessing Clustering Tendency
fviz_nbclust(fviz_gap_stat)    Determining the Optimal Number of Clusters
fviz_dend                      Enhanced Visualization of Dendrogram
fviz_cluster                   Visualize Clustering Results
fviz_mclust                    Visualize Model-based Clustering Results
fviz_silhouette                Visualize Silhouette Information from Clustering
hcut                           Computes Hierarchical Clustering and Cut the Tree
hkmeans                        Hierarchical k-means clustering
eclust                         Visual enhancement of clustering analysis

To install the two packages, type this:

install.packages(c("cluster", "factoextra"))
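
Once installed, the packages are loaded with library(). The sketch below is a minimal, illustrative use of both packages on the scaled USArrests data. The choice of PAM (k-medoids) with four clusters is an assumption made for the example, not a recommendation from this chapter, and the object names res.dist and pam.res are hypothetical.

```r
library(cluster)     # clustering algorithms, e.g. pam()
library(factoextra)  # visualization helpers, e.g. fviz_dist(), fviz_cluster()

data("USArrests")
df <- scale(na.omit(USArrests))

# Euclidean distance matrix between states, shown as a heatmap
res.dist <- get_dist(df, method = "euclidean")
fviz_dist(res.dist)

# PAM clustering with an assumed k = 4 (illustrative choice only)
pam.res <- pam(df, k = 4)

# Plot the clusters on the first two principal components
fviz_cluster(pam.res)
```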