<?xml version="1.0" encoding="UTF-8" ?>
<!-- RSS generated by PHPBoost on Thu, 07 May 2026 08:10:12 +0200 -->

<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
	<channel>
		<title><![CDATA[Last articles - STHDA : Cluster Analysis in R: Practical Guide]]></title>
		<atom:link href="https://www.sthda.com/english/syndication/rss/articles/25" rel="self" type="application/rss+xml"/>
		<link>https://www.sthda.com</link>
		<description><![CDATA[Last articles - STHDA : Cluster Analysis in R: Practical Guide]]></description>
		<copyright>(C) 2005-2026 PHPBoost</copyright>
		<language>en</language>
		<generator>PHPBoost</generator>
		
		
		<item>
			<title><![CDATA[Types of Clustering Methods: Overview and Quick Start R Code ]]></title>
			<link>https://www.sthda.com/english/articles/25-cluster-analysis-in-r-practical-guide/111-types-of-clustering-methods-overview-and-quick-start-r-code/</link>
			<guid>https://www.sthda.com/english/articles/25-cluster-analysis-in-r-practical-guide/111-types-of-clustering-methods-overview-and-quick-start-r-code/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">
<p><strong>Clustering methods</strong> are used to identify groups of similar objects in multivariate data sets collected from fields such as marketing, biomedicine and geo-spatial analysis. There are different <strong>types of clustering</strong> methods, including:</p>
<ul>
<li>Partitioning methods</li>
<li>Hierarchical clustering</li>
<li>Fuzzy clustering</li>
<li>Density-based clustering</li>
<li>Model-based clustering</li>
</ul>
<p><img src="https://www.sthda.com/english/sthda-upload/images/cluster-analysis/types-of-clustering-methods.png" alt="Types of clustering methods" /></p>
<div class="block">
<p>
In this article, we provide an overview of clustering methods and quick start R code to perform cluster analysis in R:
</p>
<ul>
<li>
We start by presenting the required R packages and the data format for cluster analysis and visualization.
</li>
<li>
Next, we describe the two standard <em>clustering techniques</em> [partitioning methods (k-means, PAM, CLARA) and hierarchical clustering] as well as how to assess the quality of a clustering analysis.
</li>
<li>
Finally, we describe advanced clustering approaches to find patterns of any shape in large data sets with noise and outliers.
</li>
</ul>
</div>
<br/>
<p>Contents:</p>
<div id="TOC">
<ul>
<li><a href="#installing-and-loading-required-r-packages">Installing and loading required R packages</a></li>
<li><a href="#data-preparation">Data preparation</a></li>
<li><a href="#distance-measures">Distance measures</a></li>
<li><a href="#partitioning-clustering">Partitioning clustering</a></li>
<li><a href="#hierarchical-clustering">Hierarchical clustering</a></li>
<li><a href="#clustering-validation-and-evaluation">Clustering validation and evaluation</a><ul>
<li><a href="#assessing-clustering-tendency">Assessing clustering tendency</a></li>
<li><a href="#determining-the-optimal-number-of-clusters">Determining the optimal number of clusters</a></li>
<li><a href="#clustering-validation-statistics">Clustering validation statistics</a></li>
<li><a href="#see-also">See also:</a></li>
</ul></li>
<li><a href="#advanced-clustering-methods">Advanced clustering methods</a><ul>
<li><a href="#hybrid-clustering-methods">Hybrid clustering methods</a></li>
<li><a href="#fuzzy-clustering">Fuzzy clustering</a></li>
<li><a href="#model-based-clustering">Model-based clustering</a></li>
<li><a href="#dbscan-density-based-clustering">DBSCAN: Density-based clustering</a></li>
</ul></li>
<li><a href="#references">References</a></li>
</ul>
</div>
<br/>
<p>Related Book:</p>
<div class = "small-block content-privileged-friends cluster-book">
    <center>
        <a href = "https://www.sthda.com/english/web/5-bookadvisor/17-practical-guide-to-cluster-analysis-in-r/">
          <img src = "https://www.sthda.com/english/sthda-upload/images/cluster-analysis/clustering-book-cover.png" /><br/>
      Practical Guide to Cluster Analysis in R
      </a>
      </center>
</div>
<div class="spacer"></div>
<div id="installing-and-loading-required-r-packages" class="section level2">
<h2>Installing and loading required R packages</h2>
<p>We’ll use mainly two R packages:</p>
<ul>
<li>cluster package: for computing clustering</li>
<li>factoextra package: for elegant ggplot2-based data visualization. Online documentation at: <a href="https://www.sthda.com/english/rpkgs/factoextra/" class="uri">https://www.sthda.com/english/rpkgs/factoextra/</a></li>
</ul>
<p>Accessory packages:</p>
<ul>
<li>magrittr for piping: %>%</li>
</ul>
<p>Install:</p>
<pre class="r"><code>install.packages("factoextra")
install.packages("cluster")
install.packages("magrittr")</code></pre>
<p>Load packages:</p>
<pre class="r"><code>library("cluster")
library("factoextra")
library("magrittr")</code></pre>
</div>
<div id="data-preparation" class="section level2">
<h2>Data preparation</h2>
<ul>
<li>Demo data set: the built-in R data set named USArrests</li>
<li>Remove missing data</li>
<li>Scale variables to make them comparable</li>
</ul>
<p>Read more: <a href="https://www.sthda.com/english/articles/26-clustering-basics/85-data-preparation-and-essential-r-packages-for-cluster-analysis/">Data Preparation and Essential R Packages for Cluster Analysis</a></p>
<pre class="r"><code># Load  and prepare the data
data("USArrests")
my_data <- USArrests %>%
  na.omit() %>%          # Remove missing values (NA)
  scale()                # Scale variables
# View the first 3 rows
head(my_data, n = 3)</code></pre>
<pre><code>##         Murder Assault UrbanPop     Rape
## Alabama 1.2426   0.783   -0.521 -0.00342
## Alaska  0.5079   1.107   -1.212  2.48420
## Arizona 0.0716   1.479    0.999  1.04288</code></pre>
</div>
<div id="distance-measures" class="section level2">
<h2>Distance measures</h2>
<p>The classification of objects into clusters requires some method for measuring the distance or the (dis)similarity between the objects. The chapter <a href="https://www.sthda.com/english/articles/26-clustering-basics/86-clustering-distance-measures-essentials/">Clustering Distance Measures Essentials</a> covers the common distance measures used for assessing similarity between observations.</p>
<p>It’s simple to compute and visualize a distance matrix using the functions <a href="https://www.sthda.com/english/rpkgs/factoextra/dist.html">get_dist() and fviz_dist()</a> [factoextra R package]:</p>
<ul>
<li><code>get_dist()</code>: for computing a distance matrix between the rows of a data matrix. Compared to the standard <code>dist()</code> function, it supports correlation-based distance measures including “pearson”, “kendall” and “spearman” methods.</li>
<li><code>fviz_dist()</code>: for visualizing a distance matrix</li>
</ul>
<pre class="r"><code>res.dist <- get_dist(USArrests, stand = TRUE, method = "pearson")
fviz_dist(res.dist, 
   gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/060-types-of-clustering-methods-distance-matrix-1.png" width="518.4" /></p>
<p>Read more: <a href="https://www.sthda.com/english/articles/26-clustering-basics/86-clustering-distance-measures-essentials/">Clustering Distance Measures Essentials</a></p>
</div>
<div id="partitioning-clustering" class="section level2">
<h2>Partitioning clustering</h2>
<p>Partitioning algorithms are clustering techniques that subdivide the data sets into a set of k groups, where k is the number of groups pre-specified by the analyst.</p>
<p>There are different types of partitioning clustering methods. The most popular is <a href="https://www.sthda.com/english/articles/27-partitioning-clustering-essentials/87-k-means-clustering-essentials/">K-means clustering</a> <span class="citation">(MacQueen 1967)</span>, in which each cluster is represented by the center, or mean, of the data points belonging to the cluster. The K-means method is sensitive to outliers.</p>
<p>An alternative to k-means clustering is the <a href="https://www.sthda.com/english/articles/27-partitioning-clustering-essentials/88-k-medoids-essentials/">K-medoids clustering</a> or PAM (Partitioning Around Medoids, Kaufman &amp; Rousseeuw, 1990), which is less sensitive to outliers compared to k-means.</p>
<p>Read more: <a href="https://www.sthda.com/english/articles/27-partitioning-clustering-essentials/">Partitioning Clustering methods</a>.</p>
<p>The following R code shows how to determine the optimal number of clusters and how to compute k-means and PAM clustering in R.</p>
<ol style="list-style-type: decimal">
<li><a href="https://www.sthda.com/english/articles/29-cluster-validation-essentials/96-determining-the-optimal-number-of-clusters-3-must-know-methods/">Determining the optimal number of clusters</a>: use <code>factoextra::fviz_nbclust()</code></li>
</ol>
<pre class="r"><code>library("factoextra")
fviz_nbclust(my_data, kmeans, method = "gap_stat")</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/060-types-of-clustering-methods-optimal-number-of-clusters-1.png" width="384" /></p>
<p>Suggested number of clusters: 3</p>
<p><a href="https://www.sthda.com/english/articles/27-partitioning-clustering-essentials/87-k-means-clustering-essentials/">Compute and visualize k-means clustering</a></p>
<pre class="r"><code>set.seed(123)
km.res <- kmeans(my_data, 3, nstart = 25)
# Visualize
library("factoextra")
fviz_cluster(km.res, data = my_data,
             ellipse.type = "convex",
             palette = "jco",
             ggtheme = theme_minimal())</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/060-types-of-clustering-methods-k-means-plot-ggplot2-factoextra-1.png" width="480" /></p>
<p>Similarly, the <a href="https://www.sthda.com/english/articles/27-partitioning-clustering-essentials/88-k-medoids-essentials/">k-medoids/PAM clustering</a> can be computed as follows:</p>
<pre class="r"><code># Compute PAM
library("cluster")
pam.res <- pam(my_data, 3)
# Visualize
fviz_cluster(pam.res)</code></pre>
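<p>For larger data sets, the CLARA algorithm (Clustering Large Applications) applies PAM to repeated samples of the data. A minimal sketch using the <code>clara()</code> function [cluster package]; the <code>samples</code> value is illustrative and USArrests is used only to keep the example self-contained:</p>
<pre class="r"><code># CLARA: PAM-like clustering for large data sets [cluster package]
library("cluster")
library("factoextra")
data("USArrests")
my_data <- scale(na.omit(USArrests))
clara.res <- clara(my_data, 3, samples = 50)
# Visualize
fviz_cluster(clara.res)</code></pre>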
</div>
<div id="hierarchical-clustering" class="section level2">
<h2>Hierarchical clustering</h2>
<p>Hierarchical clustering is an alternative to partitioning clustering for identifying groups in a data set. It does not require pre-specifying the number of clusters to be generated.</p>
<p>The result of hierarchical clustering is a tree-based representation of the objects, also known as a dendrogram. Observations can be subdivided into groups by cutting the dendrogram at a desired similarity level.</p>
<p>R code to compute and visualize hierarchical clustering:</p>
<pre class="r"><code># Compute hierarchical clustering
res.hc <- USArrests %>%
  scale() %>%                    # Scale the data
  dist(method = "euclidean") %>% # Compute dissimilarity matrix
  hclust(method = "ward.D2")     # Compute hierarchical clustering
# Visualize using factoextra
# Cut in 4 groups and color by groups
fviz_dend(res.hc, k = 4, # Cut in four groups
          cex = 0.5, # label size
          k_colors = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),
          color_labels_by_k = TRUE, # color labels by groups
          rect = TRUE # Add rectangle around groups
          )</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/060-types-of-clustering-methods-hierarchical-clustering-r-1.png" width="518.4" /></p>
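<p>The group membership obtained by cutting the dendrogram can also be extracted programmatically with the base R function <code>cutree()</code>. A minimal sketch, recomputing the same tree as above:</p>
<pre class="r"><code># Cut the dendrogram into 4 groups [base R]
res.hc <- hclust(dist(scale(USArrests), method = "euclidean"),
                 method = "ward.D2")
grp <- cutree(res.hc, k = 4)
table(grp)                       # group sizes
rownames(USArrests)[grp == 1]    # members of group 1</code></pre>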
<p>Read more: <a href="https://www.sthda.com/english/articles/28-hierarchical-clustering-essentials/">Hierarchical clustering</a></p>
<p>See also:</p>
<ul>
<li><a href="https://www.sthda.com/english/articles/28-hierarchical-clustering-essentials/94-divisive-hierarchical-clustering-essentials/">Divisive Clustering</a></li>
<li><a href="https://www.sthda.com/english/articles/28-hierarchical-clustering-essentials/91-comparing-dendrograms-essentials/">Compare Dendrograms</a></li>
<li><a href="https://www.sthda.com/english/articles/28-hierarchical-clustering-essentials/92-visualizing-dendrograms-ultimate-guide/">Visualize Dendrograms</a></li>
<li><a href="https://www.sthda.com/english/articles/28-hierarchical-clustering-essentials/93-heatmap-static-and-interactive-absolute-guide/">Heatmap: Static and Interactive</a></li>
</ul>
</div>
<div id="clustering-validation-and-evaluation" class="section level2">
<h2>Clustering validation and evaluation</h2>
<p>Clustering validation and evaluation strategies consist of measuring the goodness of clustering results. Before applying any clustering algorithm to a data set, the first thing to do is to assess the <em>clustering tendency</em>, that is, whether the data contain any inherent grouping structure.</p>
<p>If yes, then how many clusters are there? Next, you can perform hierarchical clustering or partitioning clustering (with a pre-specified number of clusters). Finally, you can use a number of measures, described in this chapter, to evaluate the goodness of the clustering results.</p>
<p>Read more: <a href="https://www.sthda.com/english/articles/29-cluster-validation-essentials/">Cluster Validation Essentials</a></p>
<div id="assessing-clustering-tendency" class="section level3">
<h3>Assessing clustering tendency</h3>
<p>To assess the clustering tendency, the Hopkins statistic and a visual approach can be used. Both can be computed using the function <code>get_clust_tendency()</code> [factoextra package], which also creates an ordered dissimilarity image (ODI).</p>
<ul>
<li><em>Hopkins statistic</em>: if the value of the Hopkins statistic is close to 1 (far above 0.5), then we can conclude that the data set is significantly clusterable.</li>
<li><em>Visual approach</em>: the visual approach detects the clustering tendency by counting the number of square-shaped dark (or colored) blocks along the diagonal in the ordered dissimilarity image.</li>
</ul>
<p>R code:</p>
<pre class="r"><code>gradient.color <- list(low = "steelblue",  high = "white")
iris[, -5] %>%    # Remove column 5 (Species)
  scale() %>%     # Scale variables
  get_clust_tendency(n = 50, gradient = gradient.color)</code></pre>
<pre><code>## $hopkins_stat
## [1] 0.2
## 
## $plot</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/060-types-of-clustering-methods-clustering-tendency-1.png" width="432" /></p>
<p>Read more: <a href="https://www.sthda.com/english/articles/29-cluster-validation-essentials/95-assessing-clustering-tendency-essentials/">Assessing Clustering Tendency</a></p>
</div>
<div id="determining-the-optimal-number-of-clusters" class="section level3">
<h3>Determining the optimal number of clusters</h3>
<p>There are different methods for <a href="https://www.sthda.com/english/articles/29-cluster-validation-essentials/96-determining-the-optimal-number-of-clusters-3-must-know-methods/">determining the optimal number of clusters</a>.</p>
<p>In the R code below, we’ll use the <code>NbClust</code> R package, which provides 30 indices for determining the best number of clusters. First, install it using <code>install.packages("NbClust")</code>, then type this:</p>
<pre class="r"><code>set.seed(123)
# Compute
library("NbClust")
res.nbclust <- USArrests %>%
  scale() %>%
  NbClust(distance = "euclidean",
          min.nc = 2, max.nc = 10, 
          method = "complete", index ="all") </code></pre>
<pre class="r"><code># Visualize
library(factoextra)
fviz_nbclust(res.nbclust, ggtheme = theme_minimal())</code></pre>
<pre><code>## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 1 proposed  1 as the best number of clusters
## * 9 proposed  2 as the best number of clusters
## * 4 proposed  3 as the best number of clusters
## * 6 proposed  4 as the best number of clusters
## * 2 proposed  5 as the best number of clusters
## * 1 proposed  8 as the best number of clusters
## * 1 proposed  10 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  2 .</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/060-types-of-clustering-methods-determine-the-number-of-clusters-nbclust-1.png" width="518.4" /></p>
<p>Read more: <a href="https://www.sthda.com/english/articles/29-cluster-validation-essentials/96-determining-the-optimal-number-of-clusters-3-must-know-methods/">Determining the Optimal Number of Clusters</a></p>
</div>
<div id="clustering-validation-statistics" class="section level3">
<h3>Clustering validation statistics</h3>
<p>A variety of measures have been proposed in the literature for evaluating clustering results. The term clustering validation is used to designate the procedure of evaluating the results of a clustering algorithm.</p>
<p>The <em>silhouette plot</em> is one of the many measures for inspecting and validating clustering results. Recall that the silhouette (<span class="math inline">\(S_i\)</span>) measures how similar an object <span class="math inline">\(i\)</span> is to the other objects in its own cluster versus those in the neighboring cluster. <span class="math inline">\(S_i\)</span> values range from -1 to 1:</p>
<ul>
<li>A value of <span class="math inline">\(S_i\)</span> close to 1 indicates that the object is well clustered. In other words, the object <span class="math inline">\(i\)</span> is similar to the other objects in its group.</li>
<li>A value of <span class="math inline">\(S_i\)</span> close to -1 indicates that the object is poorly clustered, and that assignment to some other cluster would probably improve the overall results.</li>
</ul>
<p>In the following R code, we’ll compute and evaluate the result of hierarchical clustering methods.</p>
<ol style="list-style-type: decimal">
<li>Compute and visualize hierarchical clustering:</li>
</ol>
<pre class="r"><code>set.seed(123)
# Enhanced hierarchical clustering, cut in 3 groups
res.hc <- iris[, -5] %>%
  scale() %>%
  eclust("hclust", k = 3, graph = FALSE)
# Visualize with factoextra
fviz_dend(res.hc, palette = "jco",
          rect = TRUE, show_labels = FALSE)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/060-types-of-clustering-methods-hierarchical-clustering-1.png" width="518.4" /></p>
<ol start="2" style="list-style-type: decimal">
<li>Inspect the silhouette plot:</li>
</ol>
<pre class="r"><code>fviz_silhouette(res.hc)</code></pre>
<pre><code>##   cluster size ave.sil.width
## 1       1   49          0.63
## 2       2   30          0.44
## 3       3   71          0.32</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/060-types-of-clustering-methods-silhouette-plot-1.png" width="518.4" /></p>
<ol start="3" style="list-style-type: decimal">
<li>Which samples have negative silhouette widths? To which cluster are they closer?</li>
</ol>
<pre class="r"><code># Silhouette width of observations
sil <- res.hc$silinfo$widths[, 1:3]
# Objects with negative silhouette
neg_sil_index <- which(sil[, 'sil_width'] < 0)
sil[neg_sil_index, , drop = FALSE]</code></pre>
<pre><code>##     cluster neighbor sil_width
## 84        3        2   -0.0127
## 122       3        2   -0.0179
## 62        3        2   -0.0476
## 135       3        2   -0.0530
## 73        3        2   -0.1009
## 74        3        2   -0.1476
## 114       3        2   -0.1611
## 72        3        2   -0.2304</code></pre>
<p>Read more: <a href="https://www.sthda.com/english/articles/29-cluster-validation-essentials/97-cluster-validation-statistics-must-know-methods/">Cluster Validation Statistics</a></p>
</div>
<div id="see-also" class="section level3">
<h3>See also:</h3>
<ul>
<li><a href="https://www.sthda.com/english/articles/29-cluster-validation-essentials/98-choosing-the-best-clustering-algorithms/">Choosing the Best Clustering Algorithms</a></li>
<li><a href="https://www.sthda.com/english/articles/29-cluster-validation-essentials/99-computing-p-value-for-hierarchical-clustering/">Computing p-value for Hierarchical Clustering</a></li>
</ul>
</div>
</div>
<div id="advanced-clustering-methods" class="section level2">
<h2>Advanced clustering methods</h2>
<div id="hybrid-clustering-methods" class="section level3">
<h3>Hybrid clustering methods</h3>
<ul>
<li><a href="https://www.sthda.com/english/articles/30-advanced-clustering/100-hierarchical-k-means-clustering-optimize-clusters/">Hierarchical K-means Clustering</a>: a hybrid approach for improving k-means results</li>
<li><a href="https://www.sthda.com/english/articles/22-principal-component-methods/74-hcpc-hierarchical-clustering-on-principal-components/">HCPC: Hierarchical clustering on principal components</a></li>
</ul>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/060-types-of-clustering-methods-hcpc-1.png" width="518.4" /></p>
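<p>The hierarchical K-means approach can be sketched with the <code>hkmeans()</code> function [factoextra], which first runs hierarchical clustering and uses the resulting cluster centers as starting points for k-means. A minimal example; the choice of k = 3 and the USArrests data are illustrative:</p>
<pre class="r"><code># Hierarchical K-means clustering [factoextra]
library("factoextra")
my_data <- scale(na.omit(USArrests))
res.hk <- hkmeans(my_data, 3)
# Visualize the final k-means result
fviz_cluster(res.hk, palette = "jco", ggtheme = theme_minimal())</code></pre>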
</div>
<div id="fuzzy-clustering" class="section level3">
<h3>Fuzzy clustering</h3>
<p><em>Fuzzy clustering</em> is also known as soft clustering. Standard clustering approaches (K-means, PAM) produce partitions in which each observation belongs to exactly one cluster; this is known as hard clustering.</p>
<p>In <em>fuzzy clustering</em>, items can be members of more than one cluster. Each item has a set of membership coefficients corresponding to its degree of belonging to a given cluster. The <em>fuzzy c-means</em> method is the most popular fuzzy clustering algorithm.</p>
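<p>A minimal fuzzy clustering sketch using the <code>fanny()</code> function [cluster package]; the membership matrix gives, for each observation, its degree of belonging to each cluster (the USArrests data and k = 3 are illustrative):</p>
<pre class="r"><code># Fuzzy clustering with fanny() [cluster package]
library("cluster")
my_data <- scale(na.omit(USArrests))
fanny.res <- fanny(my_data, 3)
head(fanny.res$membership, 3)   # membership coefficients
head(fanny.res$clustering, 3)   # nearest crisp clustering</code></pre>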
<p>Read more: <a href="https://www.sthda.com/english/articles/30-advanced-clustering/101-fuzzy-clustering-essentials/">Fuzzy Clustering</a>.</p>
</div>
<div id="model-based-clustering" class="section level3">
<h3>Model-based clustering</h3>
<p>In <em>model-based clustering</em>, the data are viewed as coming from a distribution that is a mixture of two or more clusters. The method finds the best fit of a model to the data and estimates the number of clusters.</p>
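<p>A minimal model-based clustering sketch using the <code>Mclust()</code> function [mclust package], which selects the mixture model and the number of clusters by BIC (install the package first with <code>install.packages("mclust")</code>; the USArrests data are illustrative):</p>
<pre class="r"><code># Model-based clustering with Mclust() [mclust package]
library("mclust")
my_data <- scale(na.omit(USArrests))
mc <- Mclust(my_data)    # model and number of clusters chosen by BIC
summary(mc)              # optimal model name and cluster count
head(mc$classification)  # cluster assignments</code></pre>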
<p>Read more: <a href="https://www.sthda.com/english/articles/30-advanced-clustering/104-model-based-clustering-essentials/">Model-Based Clustering</a>.</p>
</div>
<div id="dbscan-density-based-clustering" class="section level3">
<h3>DBSCAN: Density-based clustering</h3>
<p>DBSCAN is a partitioning method introduced in Ester et al. (1996). It can find clusters of different shapes and sizes in data containing noise and outliers <span class="citation">(Ester et al. 1996)</span>. The basic idea behind the density-based clustering approach is derived from intuitive human clustering.</p>
<p>The description and implementation of DBSCAN in R are provided at this link: <a href="https://www.sthda.com/english/articles/30-advanced-clustering/105-dbscan-density-based-clustering-essentials/">DBSCAN: Density-Based Clustering</a>.</p>
<p><img src="https://www.sthda.com/english/sthda-upload/images/cluster-analysis/dbscan-idea.png" alt="Density based clustering" /></p>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/060-types-of-clustering-methods-dbscan-1.png" width="518.4" /></p>
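<p>A minimal DBSCAN sketch using the <code>dbscan()</code> function [fpc package] on the multishapes demo data [factoextra]; the <code>eps</code> and <code>MinPts</code> values are illustrative and should be tuned to the data (e.g., with a k-nearest-neighbor distance plot):</p>
<pre class="r"><code># Density-based clustering with dbscan() [fpc package]
library("fpc")
library("factoextra")
data("multishapes", package = "factoextra")
df <- multishapes[, 1:2]
db <- fpc::dbscan(df, eps = 0.15, MinPts = 5)
# Cluster 0 corresponds to noise points (outliers)
fviz_cluster(db, data = df, geom = "point")</code></pre>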
</div>
</div>
<div id="references" class="section level2 unnumbered">
<h2>References</h2>
<div id="refs" class="references">
<div id="ref-ester1996">
<p>Ester, Martin, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.” In, 226–31. AAAI Press.</p>
</div>
<div id="ref-macqueen1967">
<p>MacQueen, J. 1967. “Some Methods for Classification and Analysis of Multivariate Observations.” In <em>Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics</em>, 281–97. Berkeley, Calif.: University of California Press. <a href="http://projecteuclid.org:443/euclid.bsmsp/1200512992" class="uri">http://projecteuclid.org:443/euclid.bsmsp/1200512992</a>.</p>
</div>
</div>
</div>
</div><!--end rdoc-->
 
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
  (function () {
    var script = document.createElement("script");
    script.type = "text/javascript";
    script.src  = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
    document.getElementsByTagName("head")[0].appendChild(script);
  })();
</script>

<!-- END HTML -->]]></description>
			<pubDate>Sat, 23 Sep 2017 12:23:00 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Clustering Example: 4 Steps You Should Know]]></title>
			<link>https://www.sthda.com/english/articles/25-cluster-analysis-in-r-practical-guide/108-clustering-example-4-steps-you-should-know/</link>
			<guid>https://www.sthda.com/english/articles/25-cluster-analysis-in-r-practical-guide/108-clustering-example-4-steps-you-should-know/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">





<p>This article describes a <em>k-means</em> <strong>clustering example</strong> and provides a step-by-step guide summarizing the different steps to follow for conducting a cluster analysis on a real data set using R software.</p>
<p>We’ll use mainly two R packages:</p>
<ul>
<li>cluster: for cluster analysis, and</li>
<li>factoextra: for the visualization of the analysis results.</li>
</ul>
<p>Install these packages as follows:</p>
<pre class="r"><code>install.packages(c("cluster", "factoextra"))</code></pre>
<p>A rigorous cluster analysis can be conducted following the steps below:</p>
<div class="block">
<ol style="list-style-type: decimal">
<li>
Data preparation
</li>
<li>
<a href="https://www.sthda.com/english/articles/29-cluster-validation-essentials/95-assessing-clustering-tendency-essentials/">Assessing clustering tendency (i.e., the clusterability of the data)</a>
</li>
<li>
<a href="https://www.sthda.com/english/articles/29-cluster-validation-essentials/96-determining-the-optimal-number-of-clusters-3-must-know-methods/">Defining the optimal number of clusters</a>
</li>
<li>
<a href="https://www.sthda.com/english/articles/27-partitioning-clustering-essentials/">Computing partitioning cluster analyses (e.g.: k-means, pam)</a> or <a href="https://www.sthda.com/english/articles/28-hierarchical-clustering-essentials/">hierarchical clustering</a>
</li>
<li>
<a href="https://www.sthda.com/english/articles/29-cluster-validation-essentials/97-cluster-validation-statistics-must-know-methods/">Validating clustering analyses: silhouette plot</a>
</li>
</ol>
</div>
<p>Here, we provide quick R scripts to perform all these steps.</p>
<p>Contents:</p>
<div id="TOC">
<ul>
<li><a href="#data-preparation">Data preparation</a></li>
<li><a href="#assessing-the-clusterability">Assessing the clusterability</a></li>
<li><a href="#estimate-the-number-of-clusters-in-the-data">Estimate the number of clusters in the data</a></li>
<li><a href="#compute-k-means-clustering">Compute k-means clustering</a></li>
<li><a href="#cluster-validation-statistics-inspect-cluster-silhouette-plot">Cluster validation statistics: Inspect cluster silhouette plot</a></li>
<li><a href="#eclust-enhanced-clustering-analysis">eclust(): Enhanced clustering analysis</a><ul>
<li><a href="#k-means-clustering-using-eclust">K-means clustering using eclust()</a></li>
<li><a href="#hierachical-clustering-using-eclust">Hierachical clustering using eclust()</a></li>
</ul></li>
</ul>
</div><br/>
<p>Related Book:</p>
<div class = "small-block content-privileged-friends cluster-book">
    <center>
        <a href = "https://www.sthda.com/english/web/5-bookadvisor/17-practical-guide-to-cluster-analysis-in-r/">
          <img src = "https://www.sthda.com/english/sthda-upload/images/cluster-analysis/clustering-book-cover.png" /><br/>
      Practical Guide to Cluster Analysis in R
      </a>
      </center>
</div>
<div class="spacer"></div>


<div id="data-preparation" class="section level2">
<h2>Data preparation</h2>
<p>We’ll use the demo data set USArrests. We start by standardizing the data using the <em>scale</em>() function:</p>
<pre class="r"><code># Load the data set
data(USArrests)
# Standardize
df <- scale(USArrests)</code></pre>
</div>
<div id="assessing-the-clusterability" class="section level2">
<h2>Assessing the clusterability</h2>
<p>The function <em>get_clust_tendency</em>() [factoextra package] can be used. It computes the <a href="https://www.sthda.com/english/articles/29-cluster-validation-essentials/95-assessing-clustering-tendency-essentials/#statistical-methods"><em>Hopkins statistic</em></a> and provides a visual approach.</p>
<pre class="r"><code>library("factoextra")
res <- get_clust_tendency(df, 40, graph = TRUE)
# Hopkins statistic
res$hopkins_stat</code></pre>
<pre><code>## [1] 0.656</code></pre>
<pre class="r"><code># Visualize the dissimilarity matrix
print(res$plot)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/058-clustering-example-cluster-tendency-1.png" width="432" /></p>
<div class="success">
<p>
The value of the <a href="https://www.sthda.com/english/articles/29-cluster-validation-essentials/95-assessing-clustering-tendency-essentials/#statistical-methods">Hopkins statistic</a> is above 0.5, indicating that the data set is clusterable. Additionally, it can be seen that the ordered dissimilarity image contains patterns (i.e., clusters).
</p>
</div>
</div>
<div id="estimate-the-number-of-clusters-in-the-data" class="section level2">
<h2>Estimate the number of clusters in the data</h2>
<p>As k-means clustering requires specifying the number of clusters to generate, we’ll use the function clusGap() [cluster package] to compute the <a href="https://www.sthda.com/english/articles/29-cluster-validation-essentials/96-determining-the-optimal-number-of-clusters-3-must-know-methods/#gap-statistic-method">gap statistic</a> for estimating the optimal number of clusters. The function <em>fviz_gap_stat</em>() [factoextra] is used to visualize the gap statistic plot.</p>
<pre class="r"><code>library("cluster")
set.seed(123)
# Compute the gap statistic
gap_stat <- clusGap(df, FUN = kmeans, nstart = 25, 
                    K.max = 10, B = 100) 
# Plot the result
library(factoextra)
fviz_gap_stat(gap_stat)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/058-clustering-example-number-of-clusters-gap-statistic-1.png" width="518.4" /></p>
<div class="success">
<p>
The gap statistic suggests a 4-cluster solution.
</p>
</div>
<div class="notice">
<p>
It’s also possible to use the function <a href="https://www.sthda.com/english/articles/29-cluster-validation-essentials/96-determining-the-optimal-number-of-clusters-3-must-know-methods/#nbclust-function-30-indices-for-choosing-the-best-number-of-clusters"><strong>NbClust()</strong></a> [in the <strong>NbClust</strong> package].
</p>
</div>
</div>
<div id="compute-k-means-clustering" class="section level2">
<h2>Compute k-means clustering</h2>
<p><a href="https://www.sthda.com/english/articles/27-partitioning-clustering-essentials/87-k-means-clustering-essentials/">K-means clustering</a> with k = 4:</p>
<pre class="r"><code># Compute k-means
set.seed(123)
km.res <- kmeans(df, 4, nstart = 25)
head(km.res$cluster, 20)</code></pre>
<pre><code>##     Alabama      Alaska     Arizona    Arkansas  California    Colorado 
##           4           3           3           4           3           3 
## Connecticut    Delaware     Florida     Georgia      Hawaii       Idaho 
##           2           2           3           4           2           1 
##    Illinois     Indiana        Iowa      Kansas    Kentucky   Louisiana 
##           3           2           1           2           1           4 
##       Maine    Maryland 
##           1           3</code></pre>
<pre class="r"><code># Visualize clusters using factoextra
fviz_cluster(km.res, USArrests)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/058-clustering-example-k-means-factoextra-1.png" width="518.4" /></p>
</div>
<div id="cluster-validation-statistics-inspect-cluster-silhouette-plot" class="section level2">
<h2>Cluster validation statistics: Inspect cluster silhouette plot</h2>
<p>Recall that the <a href="https://www.sthda.com/english/articles/29-cluster-validation-essentials/97-cluster-validation-statistics-must-know-methods/#silhouette-coefficient">silhouette coefficient</a> (<span class="math inline">\(S_i\)</span>) measures how similar an object <span class="math inline">\(i\)</span> is to the other objects in its own cluster versus those in the neighboring cluster. <span class="math inline">\(S_i\)</span> values range from -1 to 1:</p>
<ul>
<li>A value of <span class="math inline">\(S_i\)</span> close to 1 indicates that the object is well clustered. In other words, the object <span class="math inline">\(i\)</span> is similar to the other objects in its group.</li>
<li>A value of <span class="math inline">\(S_i\)</span> close to -1 indicates that the object is poorly clustered, and that assignment to some other cluster would probably improve the overall results.</li>
</ul>
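<p>Concretely, <span class="math inline">\(S_i = (b_i - a_i)/\max(a_i, b_i)\)</span>, where <span class="math inline">\(a_i\)</span> is the average distance from object <span class="math inline">\(i\)</span> to the other members of its own cluster and <span class="math inline">\(b_i\)</span> is the average distance to the members of the nearest other cluster. The base R sketch below computes <span class="math inline">\(S_i\)</span> by hand for a single object (the function name <em>sil_width</em> is illustrative; in practice, use <em>silhouette</em>() [cluster package] as shown next):</p>

```r
# Hand-computed silhouette width S_i for object i (illustrative sketch;
# use cluster::silhouette() for real analyses).
sil_width <- function(i, labels, d) {
  d <- as.matrix(d)
  own <- labels == labels[i]
  # a_i: average distance to the other members of i's own cluster
  a_i <- mean(d[i, own & seq_along(labels) != i])
  # b_i: average distance to the members of the nearest other cluster
  b_i <- min(tapply(d[i, !own], labels[!own], mean))
  (b_i - a_i) / max(a_i, b_i)
}
```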
<pre class="r"><code>sil <- silhouette(km.res$cluster, dist(df))
rownames(sil) <- rownames(USArrests)
head(sil[, 1:3])</code></pre>
<pre><code>##            cluster neighbor sil_width
## Alabama          4        3    0.4858
## Alaska           3        4    0.0583
## Arizona          3        2    0.4155
## Arkansas         4        2    0.1187
## California       3        2    0.4356
## Colorado         3        2    0.3265</code></pre>
<pre class="r"><code>fviz_silhouette(sil)</code></pre>
<pre><code>##   cluster size ave.sil.width
## 1       1   13          0.37
## 2       2   16          0.34
## 3       3   13          0.27
## 4       4    8          0.39</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/058-clustering-example-silhouette-plot-1.png" width="518.4" /></p>
<p>It can be seen that some samples have negative silhouette values. Some natural questions are:</p>
<p><span class="question">Which samples are these? To what cluster are they closer?</span></p>
<p>This can be determined from the output of the function <em>silhouette</em>() as follows:</p>
<pre class="r"><code>neg_sil_index <- which(sil[, "sil_width"] < 0)
sil[neg_sil_index, , drop = FALSE]</code></pre>
<pre><code>##          cluster neighbor sil_width
## Missouri       3        2   -0.0732</code></pre>
</div>
<div id="eclust-enhanced-clustering-analysis" class="section level2">
<h2>eclust(): Enhanced clustering analysis</h2>
<p>The function <a href="https://www.sthda.com/english/articles/25-cluster-analysis-in-r-practical-guide/106-cluster-analysis-in-r-simplified-and-enhanced/">eclust()</a>[factoextra package] provides several advantages compared to the standard packages used for clustering analysis:</p>
<ul>
<li>It simplifies the workflow of clustering analysis</li>
<li>It can be used to compute hierarchical clustering and partitioning clustering in a single line function call</li>
<li>The function eclust() automatically computes the gap statistic for estimating the optimal number of clusters.</li>
<li>It automatically provides silhouette information</li>
<li>It draws beautiful graphs using ggplot2</li>
</ul>
<div id="k-means-clustering-using-eclust" class="section level3">
<h3>K-means clustering using eclust()</h3>
<pre class="r"><code># Compute k-means
res.km <- eclust(df, "kmeans", nstart = 25)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/058-clustering-example-eclust-k-means-1.png" width="518.4" /></p>
<pre class="r"><code># Gap statistic plot
fviz_gap_stat(res.km$gap_stat)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/058-clustering-example-eclust-k-means-2.png" width="518.4" /></p>
<pre class="r"><code># Silhouette plot
fviz_silhouette(res.km)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/058-clustering-example-eclust-k-means-3.png" width="518.4" /></p>
</div>
<div id="hierachical-clustering-using-eclust" class="section level3">
<h3>Hierarchical clustering using eclust()</h3>
<pre class="r"><code> # Enhanced hierarchical clustering
res.hc <- eclust(df, "hclust") # compute hclust</code></pre>
<pre><code>## Clustering k = 1,2,..., K.max (= 10): .. done
## Bootstrapping, b = 1,2,..., B (= 100)  [one "." per sample]:
## .................................................. 50 
## .................................................. 100</code></pre>
<pre class="r"><code>fviz_dend(res.hc, rect = TRUE) # dendrogram</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/058-clustering-example-eclust-hierarchical-clustering-1.png" width="518.4" /></p>
<p>The R code below generates the silhouette plot and the scatter plot for hierarchical clustering.</p>
<pre class="r"><code>fviz_silhouette(res.hc) # silhouette plot
fviz_cluster(res.hc) # scatter plot</code></pre>
</div>
</div>


</div><!--end rdoc-->

 
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
  (function () {
    var script = document.createElement("script");
    script.type = "text/javascript";
    script.src  = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
    document.getElementsByTagName("head")[0].appendChild(script);
  })();
</script>

<!-- END HTML -->]]></description>
			<pubDate>Fri, 22 Sep 2017 22:20:00 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Cluster Analysis in R Simplified and Enhanced]]></title>
			<link>https://www.sthda.com/english/articles/25-cluster-analysis-in-r-practical-guide/106-cluster-analysis-in-r-simplified-and-enhanced/</link>
			<guid>https://www.sthda.com/english/articles/25-cluster-analysis-in-r-practical-guide/106-cluster-analysis-in-r-simplified-and-enhanced/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">
<p>In <strong>R</strong>, standard clustering methods (partitioning and hierarchical clustering) can be computed using the R packages <em>stats</em> and <em>cluster</em>. However, the workflow generally requires multiple steps and multiple lines of R code.</p>
<p>This article describes some easy-to-use wrapper functions, in the <strong>factoextra</strong> R package, for simplifying and improving <strong>cluster analysis</strong> in <strong>R</strong>. These functions include:</p>
<ol style="list-style-type: decimal">
<li><p><em>get_dist</em>() &amp; <em>fviz_dist</em>() for computing and visualizing the distance matrix between the rows of a data matrix. Compared to the standard <em>dist</em>() function, get_dist() supports <em>correlation-based distance measures</em>, including the “pearson”, “kendall” and “spearman” methods.</p></li>
<li><em>eclust</em>(): enhanced cluster analysis. It has several advantages:
<ul>
<li>It simplifies the workflow of clustering analysis</li>
<li>It can be used to compute <a href="https://www.sthda.com/english/articles/28-hierarchical-clustering-essentials/"><em>hierarchical clustering</em></a> and <a href="https://www.sthda.com/english/articles/27-partitioning-clustering-essentials/"><em>partitioning clustering</em></a> in a single line function call</li>
<li>Compared to the standard partitioning functions (kmeans, pam, clara and fanny), which require the user to specify the optimal number of clusters, the function eclust() automatically computes the <em>gap statistic</em> for estimating the right number of clusters.</li>
<li>For hierarchical clustering, correlation-based metric is allowed</li>
<li>It provides <a href="https://www.sthda.com/english/articles/29-cluster-validation-essentials/97-cluster-validation-statistics-must-know-methods/#silhouette-coefficient">silhouette information</a> for all partitioning methods and hierarchical clustering</li>
<li>It creates beautiful graphs using ggplot2</li>
</ul></li>
</ol>
<div id="required-packages" class="section level2">
<h2>Required packages</h2>
<p>We’ll use the factoextra package for an enhanced cluster analysis and visualization.</p>
<ul>
<li>Install factoextra:</li>
</ul>
<pre class="r"><code>install.packages("factoextra")</code></pre>
<ul>
<li>Load factoextra</li>
</ul>
<pre class="r"><code>library(factoextra)</code></pre>
</div>
<div id="data-preparation" class="section level2">
<h2>Data preparation</h2>
<p>The built-in R dataset <strong>USArrests</strong> is used:</p>
<pre class="r"><code># Load and scale the dataset
data("USArrests")
df <- scale(USArrests)
head(df)</code></pre>
<pre><code>##            Murder Assault UrbanPop     Rape
## Alabama    1.2426   0.783   -0.521 -0.00342
## Alaska     0.5079   1.107   -1.212  2.48420
## Arizona    0.0716   1.479    0.999  1.04288
## Arkansas   0.2323   0.231   -1.074 -0.18492
## California 0.2783   1.263    1.759  2.06782
## Colorado   0.0257   0.399    0.861  1.86497</code></pre>
</div>
<div id="distance-matrix-computation-and-visualization" class="section level2">
<h2>Distance matrix computation and visualization</h2>
<pre class="r"><code>library(factoextra)
# Correlation-based distance method
res.dist <- get_dist(df, method = "pearson")
head(round(as.matrix(res.dist), 2))[, 1:6]</code></pre>
<pre><code>##            Alabama Alaska Arizona Arkansas California Colorado
## Alabama       0.00   0.71    1.45     0.09       1.87     1.69
## Alaska        0.71   0.00    0.83     0.37       0.81     0.52
## Arizona       1.45   0.83    0.00     1.18       0.29     0.60
## Arkansas      0.09   0.37    1.18     0.00       1.59     1.37
## California    1.87   0.81    0.29     1.59       0.00     0.11
## Colorado      1.69   0.52    0.60     1.37       0.11     0.00</code></pre>
<pre class="r"><code># Visualize the dissimilarity matrix
fviz_dist(res.dist, lab_size = 8)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/057-cluster-analysis-in-r-simplified-and-enhanced-distance-matrix-computation-visualization-1.png" width="518.4" /></p>
<div class="success">
<p>
In the plot above, similar objects are close to one another. Red color corresponds to small distances and blue color indicates large distances between observations.
</p>
</div>
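<p>Under the hood, a correlation-based distance is typically defined as 1 minus the correlation between rows. A minimal base R sketch of this convention (an assumption about get_dist()'s internals; the made-up helper <em>cor_dist</em> covers only the “pearson” case):</p>

```r
# Correlation-based distance: 1 - Pearson correlation between rows
# (a base-R sketch of the convention behind get_dist(df, method = "pearson")).
cor_dist <- function(x) as.dist(1 - cor(t(x)))

data("USArrests")
df <- scale(USArrests)
res <- as.matrix(cor_dist(df))
```

Because the result is a regular "dist" object, it can be passed to hclust() or fviz_dist() like any other distance matrix.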
</div>
<div id="enhanced-clustering-analysis" class="section level2">
<h2>Enhanced clustering analysis</h2>
<p>The standard R code for computing hierarchical clustering looks like this:</p>
<pre class="r"><code># Load and scale the dataset
data("USArrests")
df <- scale(USArrests)
# Compute dissimilarity matrix
res.dist <- dist(df, method = "euclidean")
# Compute hierarchical clustering
res.hc <- hclust(res.dist, method = "ward.D2")
# Visualize
plot(res.hc, cex = 0.5)</code></pre>
<p>In this section we’ll describe the <em>eclust</em>() function [<em>factoextra</em> package] to simplify the workflow. The format is as follows:</p>
<pre class="r"><code>eclust(x, FUNcluster = "kmeans", hc_metric = "euclidean", ...)</code></pre>
<div class="block">
<ul>
<li>
x: numeric vector, data matrix or data frame
</li>
<li>
FUNcluster: a clustering function including “kmeans”, “pam”, “clara”, “fanny”, “hclust”, “agnes” and “diana”. Abbreviation is allowed.
</li>
<li>
hc_metric: character string specifying the metric to be used for calculating dissimilarities between observations. Allowed values are those accepted by the function dist() [including “euclidean”, “manhattan”, “maximum”, “canberra”, “binary”, “minkowski”] and correlation based distance measures [“pearson”, “spearman” or “kendall”]. Used only when FUNcluster is a hierarchical clustering function such as one of “hclust”, “agnes” or “diana”.
</li>
<li>
…: other arguments to be passed to FUNcluster.
</li>
</ul>
</div>
<p>In the following R code, we’ll show some examples for enhanced k-means clustering and hierarchical clustering. Note that the same analysis can be done for PAM, CLARA, FANNY, AGNES and DIANA.</p>
<pre class="r"><code>library("factoextra")
# Enhanced k-means clustering
res.km <- eclust(df, "kmeans", nstart = 25)</code></pre>
<pre><code>## Clustering k = 1,2,..., K.max (= 10): .. done
## Bootstrapping, b = 1,2,..., B (= 100)  [one "." per sample]:
## .................................................. 50 
## .................................................. 100</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/057-cluster-analysis-in-r-simplified-and-enhanced-eclust-1.png" width="518.4" /></p>
<pre class="r"><code># Gap statistic plot
fviz_gap_stat(res.km$gap_stat)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/057-cluster-analysis-in-r-simplified-and-enhanced-eclust-2.png" width="518.4" /></p>
<pre class="r"><code># Silhouette plot
fviz_silhouette(res.km)</code></pre>
<pre><code>##   cluster size ave.sil.width
## 1       1    8          0.39
## 2       2   16          0.34
## 3       3   13          0.37
## 4       4   13          0.27</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/057-cluster-analysis-in-r-simplified-and-enhanced-eclust-3.png" width="518.4" /></p>
<pre class="r"><code># Optimal number of clusters using gap statistics
res.km$nbclust</code></pre>
<pre><code>## [1] 4</code></pre>
<pre class="r"><code># Print result
 res.km</code></pre>
<pre><code>## K-means clustering with 4 clusters of sizes 8, 16, 13, 13
## 
## Cluster means:
##   Murder Assault UrbanPop    Rape
## 1  1.412   0.874   -0.815  0.0193
## 2 -0.489  -0.383    0.576 -0.2617
## 3 -0.962  -1.107   -0.930 -0.9668
## 4  0.695   1.039    0.723  1.2769
## 
## Clustering vector:
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              4              4              1              4 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              4              2              2              4              1 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              2              3              4              2              3 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              2              3              1              3              4 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              2              4              3              1              4 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              3              3              4              3              2 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              4              4              1              3              2 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              2              2              2              2              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              3              1              4              2              3 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              2              2              3              3              2 
## 
## Within cluster sum of squares by cluster:
## [1]  8.32 16.21 11.95 19.92
##  (between_SS / total_SS =  71.2 %)
## 
## Available components:
## 
##  [1] "cluster"      "centers"      "totss"        "withinss"    
##  [5] "tot.withinss" "betweenss"    "size"         "iter"        
##  [9] "ifault"       "clust_plot"   "silinfo"      "nbclust"     
## [13] "data"         "gap_stat"</code></pre>
<pre class="r"><code> # Enhanced hierarchical clustering
 res.hc <- eclust(df, "hclust") # compute hclust</code></pre>
<pre><code>## Clustering k = 1,2,..., K.max (= 10): .. done
## Bootstrapping, b = 1,2,..., B (= 100)  [one "." per sample]:
## .................................................. 50 
## .................................................. 100</code></pre>
<pre class="r"><code> fviz_dend(res.hc, rect = TRUE) # dendrogram</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/057-cluster-analysis-in-r-simplified-and-enhanced-eclust-4.png" width="518.4" /></p>
<pre class="r"><code> fviz_silhouette(res.hc) # silhouette plot</code></pre>
<pre><code>##   cluster size ave.sil.width
## 1       1   19          0.26
## 2       2   19          0.28
## 3       3   12          0.43</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/057-cluster-analysis-in-r-simplified-and-enhanced-eclust-5.png" width="518.4" /></p>
<pre class="r"><code> fviz_cluster(res.hc) # scatter plot</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/057-cluster-analysis-in-r-simplified-and-enhanced-eclust-6.png" width="518.4" /></p>
<p>It’s also possible to specify the number of clusters as follows:</p>
<pre class="r"><code>eclust(df, "kmeans", k = 4)</code></pre>
</div>
</div><!--end rdoc-->

<!-- END HTML -->]]></description>
			<pubDate>Tue, 12 Sep 2017 03:56:00 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[DBSCAN: Density-Based Clustering Essentials]]></title>
			<link>https://www.sthda.com/english/articles/30-advanced-clustering/105-dbscan-density-based-clustering-essentials/</link>
			<guid>https://www.sthda.com/english/articles/30-advanced-clustering/105-dbscan-density-based-clustering-essentials/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">
<p><strong>DBSCAN</strong> (<strong>Density-Based Spatial Clustering of Applications with Noise</strong>) is a <strong>density-based clustering</strong> algorithm, introduced in Ester et al. 1996, which can be used to identify clusters of any shape in a data set containing noise and outliers.</p>
<p>The basic idea behind the density-based clustering approach is derived from how humans intuitively cluster. For instance, by looking at the figure below, one can easily identify four clusters along with several points of noise, because of the differences in the density of points.</p>
<p>Clusters are dense regions in the data space, separated by regions of lower density of points. The DBSCAN algorithm is based on this intuitive notion of “clusters” and “noise”. The key idea is that for each point of a cluster, the neighborhood of a given radius has to contain at least a minimum number of points.</p>
<p><img src="https://www.sthda.com/english/sthda-upload/images/cluster-analysis/dbscan-idea.png" alt="DBSCAN idea" /> (From Ester et al. 1996)</p>
<div class="block">
<p>
In this chapter, we’ll describe the DBSCAN algorithm and demonstrate how to compute DBSCAN using the <em>fpc</em> R package.
</p>
</div>
<br/>
<p>Contents: </p>
<div id="TOC">
<ul>
<li><a href="#why-dbscan">Why DBSCAN?</a></li>
<li><a href="#algorithm">Algorithm</a></li>
<li><a href="#advantages">Advantages</a></li>
<li><a href="#parameter-estimation">Parameter estimation</a></li>
<li><a href="#computing-dbscan">Computing DBSCAN</a></li>
<li><a href="#method-for-determining-the-optimal-eps-value">Method for determining the optimal eps value</a></li>
<li><a href="#cluster-predictions-with-dbscan-algorithm">Cluster predictions with DBSCAN algorithm</a></li>
</ul>
</div>
<br/>
<p>Related Book:</p>
<div class = "small-block content-privileged-friends cluster-book">
    <center>
        <a href = "https://www.sthda.com/english/web/5-bookadvisor/17-practical-guide-to-cluster-analysis-in-r/">
          <img src = "https://www.sthda.com/english/sthda-upload/images/cluster-analysis/clustering-book-cover.png" /><br/>
      Practical Guide to Cluster Analysis in R
      </a>
      </center>
</div>
<div class="spacer"></div>
<div id="why-dbscan" class="section level2">
<h2>Why DBSCAN?</h2>
<p>Partitioning methods (K-means, PAM clustering) and hierarchical clustering are suitable for finding spherical-shaped or convex clusters. In other words, they work well only for compact and well-separated clusters. Moreover, they are severely affected by the presence of noise and outliers in the data.</p>
<p>Unfortunately, real-life data can contain: i) clusters of arbitrary shape, such as those shown in the figure below (oval, linear and “S”-shaped clusters); ii) many outliers and noise.</p>
<p>The figure below shows a data set containing nonconvex clusters and outliers/noises. The simulated data set <em>multishapes</em> [in <em>factoextra</em> package] is used.</p>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/023-dbscan-density-based-clustering-data-dbscan-1.png" width="336" /></p>
<p>The plot above contains 5 clusters and outliers, including:</p>
<ul>
<li>2 oval clusters</li>
<li>2 linear clusters</li>
<li>1 compact cluster</li>
</ul>
<p>Given such data, the k-means algorithm has difficulty identifying these clusters with arbitrary shapes. To illustrate this, the following R code applies the k-means algorithm to the multishapes data set. The function <em>fviz_cluster</em>() [<em>factoextra</em> package] is used to visualize the clusters.</p>
<p>First, install factoextra: install.packages(“factoextra”); then compute and visualize k-means clustering using the data set multishapes:</p>
<pre class="r"><code>library(factoextra)
data("multishapes")
df <- multishapes[, 1:2]
set.seed(123)
km.res <- kmeans(df, 5, nstart = 25)
fviz_cluster(km.res, df,  geom = "point", 
             ellipse= FALSE, show.clust.cent = FALSE,
             palette = "jco", ggtheme = theme_classic())</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/023-dbscan-density-based-clustering-k-means-multishapes-1.png" width="336" /></p>
<div class="success">
<p>
We know there are 5 clusters in the data, but it can be seen that the k-means method identifies them inaccurately.
</p>
</div>
</div>
<div id="algorithm" class="section level2">
<h2>Algorithm</h2>
<p>The goal is to identify dense regions, which can be measured by the number of objects close to a given point.</p>
<p>Two important parameters are required for DBSCAN: <em>epsilon</em> (“eps”) and <em>minimum points</em> (“MinPts”). The parameter <em>eps</em> defines the radius of the neighborhood around a point x, called the <span class="math inline">\(\epsilon\)</span>-neighborhood of x. The parameter <em>MinPts</em> is the minimum number of neighbors within the “eps” radius.</p>
<p>Any point x in the data set, with a neighbor count greater than or equal to <em>MinPts</em>, is marked as a <em>core point</em>. We say that x is a <em>border point</em> if the number of its neighbors is less than MinPts, but it belongs to the <span class="math inline">\(\epsilon\)</span>-neighborhood of some core point z. Finally, if a point is neither a core nor a border point, it is called a noise point or an outlier.</p>
<p>The figure below shows the different types of points (core, border and outlier points) using MinPts = 6. Here x is a core point because <span class="math inline">\(neighbours_\epsilon(x) = 6\)</span>, y is a border point because <span class="math inline">\(neighbours_\epsilon(y) < MinPts\)</span>, but it belongs to the <span class="math inline">\(\epsilon\)</span>-neighborhood of the core point x. Finally, z is a noise point.</p>
<p><img src="https://www.sthda.com/english/sthda-upload/images/cluster-analysis/dbscan-principle.png" alt="DBSCAN principle" /></p>
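<p>The core/border/noise distinction can be made concrete with a few lines of base R (an illustrative sketch; the function name <em>point_types</em> is made up, and the neighbor count here includes the point itself, following the usual DBSCAN convention):</p>

```r
# Classify each point as "core", "border" or "noise" (illustrative sketch;
# the neighbour count includes the point itself).
point_types <- function(x, eps, MinPts) {
  d <- as.matrix(dist(x))
  in_eps <- d <= eps                        # logical neighbourhood matrix
  core <- rowSums(in_eps) >= MinPts         # enough neighbours: core point
  # border: not core, but inside the eps-neighbourhood of some core point
  border <- !core & apply(in_eps, 1, function(nb) any(nb & core))
  ifelse(core, "core", ifelse(border, "border", "noise"))
}
```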
<p>We start by defining three terms required for understanding the DBSCAN algorithm:</p>
<ul>
<li><em>Direct density reachable</em>: A point “A” is directly density reachable from another point “B” if: i) “A” is in the <span class="math inline">\(\epsilon\)</span>-neighborhood of “B” and ii) “B” is a core point.</li>
<li><em>Density reachable</em>: A point “A” is density reachable from “B” if there is a set of core points leading from “B” to “A”.</li>
<li><em>Density connected</em>: Two points “A” and “B” are density connected if there is a core point “C” such that both “A” and “B” are density reachable from “C”.</li>
</ul>
<p>A density-based cluster is defined as a group of density connected points. The algorithm of density-based clustering (DBSCAN) works as follow:</p>
<div class="block">
<ol style="list-style-type: decimal">
<li>
<p>
For each point <span class="math inline"><em>x</em><sub><em>i</em></sub></span>, compute the distance between <span class="math inline"><em>x</em><sub><em>i</em></sub></span> and the other points, and find all neighbor points within distance <em>eps</em> of the starting point (<span class="math inline"><em>x</em><sub><em>i</em></sub></span>). Each point with a neighbor count greater than or equal to <em>MinPts</em> is marked as a <em>core point</em> or <em>visited</em>.
</p>
</li>
<li>
<p>
For each <em>core point</em>, if it’s not already assigned to a cluster, create a new cluster. Find recursively all its density connected points and assign them to the same cluster as the core point.
</p>
</li>
<li>
<p>
Iterate through the remaining unvisited points in the data set.
</p>
</li>
</ol>
<p>
Those points that do not belong to any cluster are treated as outliers or noise.
</p>
</div>
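<p>The three steps above can be sketched directly in base R. This naive implementation (the name <em>naive_dbscan</em> is illustrative) computes all pairwise distances up front, so it is only suitable for small data sets; use <em>fpc::dbscan</em>() or <em>dbscan::dbscan</em>() in practice:</p>

```r
# Naive DBSCAN sketch following the three steps above (illustrative only).
# Label 0 means noise/unassigned; neighbour counts include the point itself.
naive_dbscan <- function(x, eps, MinPts) {
  x <- as.matrix(x)
  n <- nrow(x)
  d <- as.matrix(dist(x))            # all pairwise distances
  labels <- rep(0L, n)
  cl <- 0L
  for (i in seq_len(n)) {
    if (labels[i] != 0L) next        # already assigned to a cluster
    nb <- which(d[i, ] <= eps)       # eps-neighbourhood of x_i
    if (length(nb) < MinPts) next    # not a core point (may become border later)
    cl <- cl + 1L                    # start a new cluster from this core point
    labels[i] <- cl
    queue <- setdiff(nb, i)
    while (length(queue) > 0L) {     # expand: collect density-connected points
      j <- queue[1]
      queue <- queue[-1]
      if (labels[j] == 0L) {
        labels[j] <- cl              # border or core point joins the cluster
        nb_j <- which(d[j, ] <= eps)
        if (length(nb_j) >= MinPts)  # j is itself a core point: keep expanding
          queue <- c(queue, nb_j[labels[nb_j] == 0L])
      }
    }
  }
  labels                             # points still labelled 0 are noise
}
```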
</div>
<div id="advantages" class="section level2">
<h2>Advantages</h2>
<ol style="list-style-type: decimal">
<li>Unlike K-means, DBSCAN does not require the user to specify the number of clusters to be generated</li>
<li>DBSCAN can find any shape of clusters. The cluster doesn’t have to be circular.</li>
<li>DBSCAN can identify outliers</li>
</ol>
</div>
<div id="parameter-estimation" class="section level2">
<h2>Parameter estimation</h2>
<ul>
<li><p>MinPts: the larger the data set, the larger the value of MinPts should be. MinPts must be at least 3.</p></li>
<li><p><span class="math inline">\(\epsilon\)</span>: The value for <span class="math inline">\(\epsilon\)</span> can then be chosen by using a k-distance graph, plotting the distance to the k = minPts nearest neighbor. Good values of <span class="math inline">\(\epsilon\)</span> are where this plot shows a strong bend.</p></li>
</ul>
</div>
<div id="computing-dbscan" class="section level2">
<h2>Computing DBSCAN</h2>
<p>Here, we’ll use the R package <em>fpc</em> to compute DBSCAN. It’s also possible to use the package <em>dbscan</em>, which provides a faster re-implementation of the DBSCAN algorithm than the fpc package.</p>
<p>We’ll also use the <em>factoextra</em> package for visualizing clusters.</p>
<p>First, install the packages as follow:</p>
<pre class="r"><code>install.packages("fpc")
install.packages("dbscan")
install.packages("factoextra")</code></pre>
<p>The R code below computes and visualizes DBSCAN using multishapes data set [factoextra R package]:</p>
<pre class="r"><code># Load the data 
data("multishapes", package = "factoextra")
df <- multishapes[, 1:2]
# Compute DBSCAN using fpc package
library("fpc")
set.seed(123)
db <- fpc::dbscan(df, eps = 0.15, MinPts = 5)
# Plot DBSCAN results
library("factoextra")
fviz_cluster(db, data = df, stand = FALSE,
             ellipse = FALSE, show.clust.cent = FALSE,
             geom = "point",palette = "jco", ggtheme = theme_classic())</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/023-dbscan-density-based-clustering-density-based-clustering-1.png" width="336" /></p>
<div class="notice">
<p>
Note that the function <em>fviz_cluster</em>() uses different point symbols for core points (i.e., seed points) and border points. Black points correspond to outliers. You can play with <em>eps</em> and <em>MinPts</em> to change the cluster configuration.
</p>
</div>
<div class="success">
<p>
It can be seen that DBSCAN performs better on these data sets than the k-means algorithm and identifies the correct set of clusters.
</p>
</div>
<p>The result of the <em>fpc::dbscan</em>() function can be displayed as follow:</p>
<pre class="r"><code>print(db)</code></pre>
<pre><code>## dbscan Pts=1100 MinPts=5 eps=0.15
##         0   1   2   3  4  5
## border 31  24   1   5  7  1
## seed    0 386 404  99 92 50
## total  31 410 405 104 99 51</code></pre>
<p>In the table above, the column names are cluster numbers. Cluster 0 corresponds to outliers (black points in the DBSCAN plot). The function <em>print.dbscan</em>() shows, for each cluster, the number of points that are seed (core) points and border points.</p>
<pre class="r"><code># Cluster membership. Noise/outlier observations are coded as 0
# A random subset is shown
db$cluster[sample(1:1089, 20)]</code></pre>
<pre><code>##  [1] 1 3 2 4 3 1 2 4 2 2 2 2 2 2 1 4 1 1 1 0</code></pre>
<p>DBSCAN algorithm requires users to specify the optimal <em>eps</em> values and the parameter <em>MinPts</em>. In the R code above, we used <em>eps = 0.15</em> and <em>MinPts = 5</em>. One limitation of DBSCAN is that it is sensitive to the choice of <span class="math inline">\(\epsilon\)</span>, in particular if clusters have different densities. If <span class="math inline">\(\epsilon\)</span> is too small, sparser clusters will be defined as noise. If <span class="math inline">\(\epsilon\)</span> is too large, denser clusters may be merged together. This implies that, if there are clusters with different local densities, then a single <span class="math inline">\(\epsilon\)</span> value may not suffice.</p>
<p>A natural question is:</p>
<div class="block">
<p>
How to define the optimal value of <span class="math inline">\(\epsilon\)</span>?
</p>
</div>
</div>
<div id="method-for-determining-the-optimal-eps-value" class="section level2">
<h2>Method for determining the optimal eps value</h2>
<p>The method proposed here consists of computing the k-nearest neighbor distances in a matrix of points.</p>
<p>The idea is to calculate the average of the distances from every point to its k nearest neighbors. The value of k is specified by the user and corresponds to <em>MinPts</em>.</p>
<p>Next, these k-distances are plotted in an ascending order. The aim is to determine the “knee”, which corresponds to the optimal <em>eps</em> parameter.</p>
<p>A knee corresponds to a threshold where a sharp change occurs along the k-distance curve.</p>
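<p>The k-distance curve itself is easy to compute in base R (an illustrative sketch; <em>k_distances</em> is a made-up helper, and the kNNdistplot() function introduced next does all of this for you):</p>

```r
# Sorted distances from each point to its k-th nearest neighbour
# (k = MinPts); the "knee" of this curve suggests a good eps value.
k_distances <- function(x, k) {
  d <- as.matrix(dist(x))
  # sort each row; position 1 is the self-distance (0), so the
  # k-th nearest neighbour sits at position k + 1
  sort(apply(d, 1, function(row) sort(row)[k + 1]))
}
```

For example, plot(k_distances(df, k = 5), type = "l") reproduces the idea behind the k-distance plot shown below.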
<p>The function <em>kNNdistplot</em>() [in <em>dbscan</em> package] can be used to draw the k-distance plot:</p>
<pre class="r"><code>dbscan::kNNdistplot(df, k =  5)
abline(h = 0.15, lty = 2)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/023-dbscan-density-based-clustering-k-nearest-neighbor-distance-1.png" width="384" /></p>
<div class="success">
<p>
It can be seen that the optimal <em>eps</em> value is around a distance of 0.15.
</p>
</div>
</div>
<div id="cluster-predictions-with-dbscan-algorithm" class="section level2">
<h2>Cluster predictions with DBSCAN algorithm</h2>
<p>The function <em>predict.dbscan(object, data, newdata)</em> [in <em>fpc</em> package] can be used to predict the clusters for the points in <em>newdata</em>. For more details, read the documentation (<em>?predict.dbscan</em>).</p>
</div>
</div><!--end rdoc-->
 
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
  (function () {
    var script = document.createElement("script");
    script.type = "text/javascript";
    script.src  = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
    document.getElementsByTagName("head")[0].appendChild(script);
  })();
</script>

<!-- END HTML -->]]></description>
			<pubDate>Thu, 07 Sep 2017 20:02:00 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Model Based Clustering Essentials]]></title>
			<link>https://www.sthda.com/english/articles/30-advanced-clustering/104-model-based-clustering-essentials/</link>
			<guid>https://www.sthda.com/english/articles/30-advanced-clustering/104-model-based-clustering-essentials/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">
<p>The traditional clustering methods, such as hierarchical clustering (Chapter @ref(agglomerative-clustering)) and k-means clustering (Chapter @ref(kmeans-clustering)), are heuristic and are not based on formal models. Furthermore, the k-means algorithm is commonly initialized randomly, so different runs of k-means will often yield different results. Additionally, k-means requires the user to specify the optimal number of clusters.</p>
<p>An alternative is <strong>model-based clustering</strong>, which considers the data as coming from a distribution that is a mixture of two or more clusters <span class="citation">(Fraley and Raftery 2002, <span class="citation">Fraley et al. (2012)</span>)</span>. Unlike k-means, model-based clustering uses a soft assignment, where each data point has a probability of belonging to each cluster.</p>
<div class="block">
<p>
In this chapter, we illustrate model-based clustering using the R package <em>mclust</em>.
</p>
</div>
<br/>
<p>Contents:</p>
<div id="TOC">
<ul>
<li><a href="#concept-of-model-based-clustering">Concept of model-based clustering</a></li>
<li><a href="#estimating-model-parameters">Estimating model parameters</a></li>
<li><a href="#choosing-the-best-model">Choosing the best model</a></li>
<li><a href="#computing-model-based-clustering-in-r">Computing model-based clustering in R</a></li>
<li><a href="#visualizing-model-based-clustering">Visualizing model-based clustering</a></li>
<li><a href="#references">References</a></li>
</ul>
</div>
<br/>
<p>Related Book:</p>
<div class = "small-block content-privileged-friends cluster-book">
    <center>
        <a href = "https://www.sthda.com/english/web/5-bookadvisor/17-practical-guide-to-cluster-analysis-in-r/">
          <img src = "https://www.sthda.com/english/sthda-upload/images/cluster-analysis/clustering-book-cover.png" /><br/>
      Practical Guide to Cluster Analysis in R
      </a>
      </center>
</div>
<div class="spacer"></div>
<div id="concept-of-model-based-clustering" class="section level2">
<h2>Concept of model-based clustering</h2>
<p>In model-based clustering, the data are considered as coming from a mixture of densities.</p>
<p>Each component (i.e. cluster) k is modeled by a normal (Gaussian) distribution, which is characterized by the following parameters:</p>
<ul>
<li><span class="math inline">\(\mu_k\)</span>: mean vector,</li>
<li><span class="math inline">\(\Sigma_k\)</span>: covariance matrix,</li>
<li>An associated probability in the mixture. Each point has a probability of belonging to each cluster.</li>
</ul>
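<p>Putting these elements together, the density of the mixture can be written as the weighted sum of its K Gaussian components (a standard formulation; here <span class="math inline">\(\pi_k\)</span> denotes the mixing probability of cluster k):</p>
<p><span class="math display">\[
f(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x;\ \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} \pi_k = 1
\]</span></p>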
<p>For example, consider the “old faithful geyser data” [in MASS R package], which can be illustrated as follow using the ggpubr R package:</p>
<pre class="r"><code># Load the data
library("MASS")
data("geyser")
# Scatter plot
library("ggpubr")
ggscatter(geyser, x = "duration", y = "waiting")+
  geom_density2d() # Add 2D density</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/022-model-based-clustering-scatter-plot-1.png" width="336" /></p>
<p>The plot above suggests at least 3 clusters in the mixture. The shape of each of the 3 clusters appears to be approximately elliptical, suggesting three bivariate normal distributions. As the 3 ellipses seem similar in terms of volume, shape and orientation, we might anticipate that the three components of this mixture have homogeneous covariance matrices.</p>
</div>
<div id="estimating-model-parameters" class="section level2">
<h2>Estimating model parameters</h2>
<p>The model parameters can be estimated using the <em>Expectation-Maximization</em> (EM) algorithm, initialized by hierarchical model-based clustering. Each cluster k is centered at the mean <span class="math inline">\(\mu_k\)</span>, with increased density for points near the mean.</p>
<p>Geometric features (shape, volume, orientation) of each cluster are determined by the covariance matrix <span class="math inline">\(\Sigma_k\)</span>.</p>
<p>Different possible parameterizations of <span class="math inline">\(\Sigma_k\)</span> are available in the R package <em>mclust</em> (see <em>?mclustModelNames</em>).</p>
<p>The available model options, in <em>mclust</em> package, are represented by identifiers including: EII, VII, EEI, VEI, EVI, VVI, EEE, EEV, VEV and VVV.</p>
<p>The first identifier refers to volume, the second to shape and the third to orientation. E stands for “equal”, V for “variable” and I for “coordinate axes”.</p>
<p>For example:</p>
<ul>
<li>EVI denotes a model in which the volumes of all clusters are equal (E), the shapes of the clusters may vary (V), and the orientation is the identity (I), i.e. aligned with the coordinate axes.
</li>
<li>EEE means that the clusters have the same volume, shape and orientation in p-dimensional space.</li>
<li>VEI means that the clusters have variable volume, the same shape and orientation equal to coordinate axes.</li>
</ul>
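<p>As a quick way to decode these identifiers, the helper <em>mclustModelNames</em>() [in <em>mclust</em>] returns a human-readable description (a small sketch; the two identifiers queried below are arbitrary examples):</p>
<pre class="r"><code># Look up what a covariance-model identifier means
library(mclust)
mclustModelNames("EEE")  # ellipsoidal: equal volume, shape and orientation
mclustModelNames("VVI")  # diagonal: varying volume and shape</code></pre>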
</div>
<div id="choosing-the-best-model" class="section level2">
<h2>Choosing the best model</h2>
<p>The <em>mclust</em> package uses maximum likelihood to fit all these models, with different covariance matrix parameterizations, for a range of k components.</p>
<p>The best model is selected using the Bayesian Information Criterion or <em>BIC</em>. A large BIC score indicates strong evidence for the corresponding model.</p>
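<p>A small sketch of how this selection can be inspected in practice (the data set here is an arbitrary illustration; any numeric matrix works):</p>
<pre class="r"><code># Compare BIC values across covariance models and numbers of components
library(mclust)
mc <- Mclust(scale(iris[, -5]))
summary(mc$BIC)          # top-ranked models by BIC
plot(mc, what = "BIC")   # BIC curves for each covariance model</code></pre>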
</div>
<div id="computing-model-based-clustering-in-r" class="section level2">
<h2>Computing model-based clustering in R</h2>
<p>We start by installing the <em>mclust</em> package as follows: <em>install.packages(“mclust”)</em></p>
<div class="notice">
<p>
Note that, model-based clustering can be applied on univariate or multivariate data.
</p>
</div>
<p>Here, we illustrate model-based clustering on the diabetes data set [mclust package], which gives three measurements and the diagnosis for 145 subjects, described as follows:</p>
<pre class="r"><code>library("mclust")
data("diabetes")
head(diabetes, 3)</code></pre>
<pre><code>##    class glucose insulin sspg
## 1 Normal      80     356  124
## 2 Normal      97     289  117
## 3 Normal     105     319  143</code></pre>
<ul>
<li>class: the diagnosis: normal, chemically diabetic, and overtly diabetic. Excluded from the cluster analysis.</li>
<li>glucose: plasma glucose response to oral glucose</li>
<li>insulin: plasma insulin response to oral glucose</li>
<li>sspg: steady-state plasma glucose (measures insulin resistance)</li>
</ul>
<p>Model-based clustering can be computed using the function Mclust() as follows:</p>
<pre class="r"><code>library(mclust)
df <- scale(diabetes[, -1]) # Standardize the data
mc <- Mclust(df)            # Model-based-clustering
summary(mc)                 # Print a summary</code></pre>
<pre><code>## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm 
## ----------------------------------------------------
## 
## Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model with 3 components:
## 
##  log.likelihood   n df  BIC  ICL
##            -169 145 29 -483 -501
## 
## Clustering table:
##  1  2  3 
## 81 36 28</code></pre>
<p>For this data, it can be seen that model-based clustering selected a model with three components (i.e. clusters). The optimal selected model is the VVV model: the three components are ellipsoidal with varying volume, shape, and orientation. The summary also contains the clustering table, specifying the number of observations in each cluster.</p>
<p>You can access the results as follows:</p>
<pre class="r"><code>mc$modelName                # Optimal selected model ==> "VVV"
mc$G                        # Optimal number of clusters => 3
head(mc$z, 30)              # Probability of belonging to a given cluster
head(mc$classification, 30) # Cluster assignment of each observation</code></pre>
</div>
<div id="visualizing-model-based-clustering" class="section level2">
<h2>Visualizing model-based clustering</h2>
<p>Model-based clustering results can be drawn using the base function plot.Mclust() [in mclust package]. Here we’ll use the function <em>fviz_mclust</em>() [in <em>factoextra</em> package] to create beautiful plots based on ggplot2.</p>
<p>In situations where the data contain more than two variables, <em>fviz_mclust</em>() uses a principal component analysis to reduce the dimensionality of the data. The first two principal components are used to produce a scatter plot of the data. However, if you want to plot the data using only two variables of interest, say c(“insulin”, “sspg”), you can specify them in the <em>fviz_mclust</em>() function using the argument <em>choose.vars = c(“insulin”, “sspg”)</em>.</p>
<pre class="r"><code>library(factoextra)
# BIC values used for choosing the number of clusters
fviz_mclust(mc, "BIC", palette = "jco")
# Classification: plot showing the clustering
fviz_mclust(mc, "classification", geom = "point", 
            pointsize = 1.5, palette = "jco")
# Classification uncertainty
fviz_mclust(mc, "uncertainty", palette = "jco")</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/022-model-based-clustering-model-base-clustering-1.png" width="307.2" /><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/022-model-based-clustering-model-base-clustering-2.png" width="307.2" /><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/022-model-based-clustering-model-base-clustering-3.png" width="307.2" /></p>
<p>Note that, in the uncertainty plot, larger symbols indicate the more uncertain observations.</p>
</div>
<div id="references" class="section level2 unnumbered">
<h2>References</h2>
<div id="refs" class="references">
<div id="ref-fraley2002">
<p>Fraley, Chris, and Adrian E Raftery. 2002. “Model-Based Clustering, Discriminant Analysis, and Density Estimation.” <em>Journal of the American Statistical Association</em> 97 (458): 611–31.</p>
</div>
<div id="ref-fraley2012">
<p>Fraley, Chris, Adrian E. Raftery, T. Brendan Murphy, and Luca Scrucca. 2012. “Mclust Version 4 for R: Normal Mixture Modeling for Model-Based Clustering, Classification, and Density Estimation.” <em>Technical Report No. 597, Department of Statistics, University of Washington</em>. <a href="https://www.stat.washington.edu/research/reports/2012/tr597.pdf" class="uri">https://www.stat.washington.edu/research/reports/2012/tr597.pdf</a>.</p>
</div>
</div>
</div>
</div><!--end rdoc-->
 
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
  (function () {
    var script = document.createElement("script");
    script.type = "text/javascript";
    script.src  = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
    document.getElementsByTagName("head")[0].appendChild(script);
  })();
</script>

<!-- END HTML -->]]></description>
			<pubDate>Thu, 07 Sep 2017 19:46:00 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[cmeans() R function: Compute Fuzzy clustering]]></title>
			<link>https://www.sthda.com/english/articles/30-advanced-clustering/103-cmeans-r-function-compute-fuzzy-clustering/</link>
			<guid>https://www.sthda.com/english/articles/30-advanced-clustering/103-cmeans-r-function-compute-fuzzy-clustering/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">
<p>This article describes how to compute the <strong>fuzzy clustering</strong> using the function <strong>cmeans</strong>() [in <em>e1071</em> R package]. Previously, we explained <a href="https://www.sthda.com/english/articles/30-advanced-clustering/101-fuzzy-clustering-essentials/">what is fuzzy clustering</a> and how to compute the fuzzy clustering using the R function fanny()[in cluster package].</p>
<p>Related articles:</p>
<ul>
<li><a href="https://www.sthda.com/english/articles/30-advanced-clustering/101-fuzzy-clustering-essentials/">Fuzzy Clustering Essentials</a></li>
<li><a href="https://www.sthda.com/english/articles/30-advanced-clustering/102-fuzzy-c-means-clustering-algorithm/">Fuzzy C-Means Clustering Algorithm</a></li>
</ul>
<div id="cmeans-format" class="section level2">
<h2>cmeans() format</h2>
<p>The simplified format of the function <strong>cmeans</strong>() is as follows:</p>
<pre class="r"><code>cmeans(x, centers, iter.max = 100, dist = "euclidean", m = 2)</code></pre>
<div class="block">
<ul>
<li>
x: a data matrix where columns are variables and rows are observations
</li>
<li>
centers: Number of clusters or initial values for cluster centers
</li>
<li>
iter.max: Maximum number of iterations
</li>
<li>
dist: Possible values are “euclidean” or “manhattan”
</li>
<li>
m: A number greater than 1 giving the degree of fuzzification.
</li>
</ul>
</div>
<p>The function cmeans() returns an object of class fclust which is a list containing the following components:</p>
<ul>
<li>centers: the final cluster centers</li>
<li>size: the number of data points in each cluster of the closest hard clustering</li>
<li>cluster: a vector of integers containing the indices of the clusters to which the data points are assigned for the closest hard clustering, as obtained by assigning points to the (first) class with maximal membership.</li>
<li>iter: the number of iterations performed</li>
<li>membership: a matrix with the membership values of the data points to the clusters</li>
<li>withinerror: the value of the objective function</li>
</ul>
</div>
<div id="compute-fuzzy-c-means-clustering" class="section level2">
<h2>Compute fuzzy c-means clustering</h2>
<pre class="r"><code>set.seed(123)
# Load the data
data("USArrests")
# Subset of USArrests
ss <- sample(1:50, 20)
df <- scale(USArrests[ss,])
# Compute fuzzy clustering
library(e1071)
cm <- cmeans(df, 4)
cm</code></pre>
<pre><code>## Fuzzy c-means clustering with 4 clusters
## 
## Cluster centers:
##   Murder Assault UrbanPop   Rape
## 1  0.857   0.338   -0.729  0.200
## 2 -0.731  -0.665    1.003 -0.333
## 3 -1.210  -1.248   -0.728 -1.153
## 4  0.629   0.970    0.501  0.865
## 
## Memberships:
##                    1      2      3       4
## Iowa         0.00916 0.0191 0.9658 0.00594
## Rhode Island 0.09885 0.5915 0.2050 0.10463
## Maryland     0.22786 0.0475 0.0273 0.69731
## Tennessee    0.87231 0.0286 0.0211 0.07801
## Utah         0.04446 0.8218 0.0844 0.04929
## Arizona      0.11876 0.1008 0.0399 0.74056
## Mississippi  0.62441 0.0931 0.1030 0.17952
## Wisconsin    0.03363 0.1110 0.8313 0.02403
## Virginia     0.39552 0.2570 0.1918 0.15573
## Maine        0.03433 0.0530 0.8915 0.02117
## Texas        0.24082 0.1595 0.0541 0.54557
## Louisiana    0.61799 0.0653 0.0419 0.27473
## Montana      0.13551 0.1366 0.6657 0.06215
## Michigan     0.09620 0.0371 0.0178 0.84890
## Arkansas     0.56529 0.1223 0.1805 0.13188
## New York     0.13194 0.1323 0.0416 0.69421
## Florida      0.17377 0.0749 0.0398 0.71155
## Alaska       0.38155 0.1354 0.1136 0.36947
## Hawaii       0.06662 0.7206 0.1487 0.06410
## New Jersey   0.05957 0.8009 0.0575 0.08206
## 
## Closest hard clustering:
##         Iowa Rhode Island     Maryland    Tennessee         Utah 
##            3            2            4            1            2 
##      Arizona  Mississippi    Wisconsin     Virginia        Maine 
##            4            1            3            1            3 
##        Texas    Louisiana      Montana     Michigan     Arkansas 
##            4            1            3            4            1 
##     New York      Florida       Alaska       Hawaii   New Jersey 
##            4            4            1            2            2 
## 
## Available components:
## [1] "centers"     "size"        "cluster"     "membership"  "iter"       
## [6] "withinerror" "call"</code></pre>
<p>The different components can be extracted using the code below:</p>
<pre class="r"><code># Membership coefficient
head(cm$membership)</code></pre>
<pre><code>##                    1      2      3       4
## Iowa         0.00916 0.0191 0.9658 0.00594
## Rhode Island 0.09885 0.5915 0.2050 0.10463
## Maryland     0.22786 0.0475 0.0273 0.69731
## Tennessee    0.87231 0.0286 0.0211 0.07801
## Utah         0.04446 0.8218 0.0844 0.04929
## Arizona      0.11876 0.1008 0.0399 0.74056</code></pre>
<pre class="r"><code># Visualize using corrplot
library(corrplot)
corrplot(cm$membership, is.corr = FALSE)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/055-cmeans-c-means-clustering-1.png" width="384" /></p>
<pre class="r"><code># Observation groups/clusters
cm$cluster</code></pre>
<pre><code>##         Iowa Rhode Island     Maryland    Tennessee         Utah 
##            3            2            4            1            2 
##      Arizona  Mississippi    Wisconsin     Virginia        Maine 
##            4            1            3            1            3 
##        Texas    Louisiana      Montana     Michigan     Arkansas 
##            4            1            3            4            1 
##     New York      Florida       Alaska       Hawaii   New Jersey 
##            4            4            1            2            2</code></pre>
</div>
<div id="visualize-clusters" class="section level2">
<h2>Visualize clusters</h2>
<pre class="r"><code>library(factoextra)
fviz_cluster(list(data = df, cluster=cm$cluster), 
             ellipse.type = "norm",
             ellipse.level = 0.68,
             palette = "jco",
             ggtheme = theme_minimal())</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/055-cmeans-visualize-c-means-clusters-1.png" width="480" /></p>
</div>
</div>
</div><!--end rdoc-->

<!-- END HTML -->]]></description>
			<pubDate>Thu, 07 Sep 2017 19:26:00 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Fuzzy C-Means Clustering Algorithm]]></title>
			<link>https://www.sthda.com/english/articles/30-advanced-clustering/102-fuzzy-c-means-clustering-algorithm/</link>
			<guid>https://www.sthda.com/english/articles/30-advanced-clustering/102-fuzzy-c-means-clustering-algorithm/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">

<p>In our previous article, we described the basic concept of <a href="https://www.sthda.com/english/articles/30-advanced-clustering/101-fuzzy-clustering-essentials/"><strong>fuzzy clustering</strong></a> and we showed how to compute fuzzy clustering. In this article, we present the <strong>fuzzy c-means clustering algorithm</strong>, which is very similar to the <a href="https://www.sthda.com/english/articles/27-partitioning-clustering-essentials/87-k-means-clustering-essentials/">k-means algorithm</a>. Its aim is to minimize the objective function defined as follows:</p>
<p>
<span class="math"><span class="math display">\[
\sum\limits_{j=1}^k \sum\limits_{x_i \in C_j} u_{ij}^m (x_i - \mu_j)^2
\]</span></span>
</p>
<p>
Where,
</p>
<ul>
<li>
<span class="math"><span class="math inline">\(u_{ij}\)</span></span> is the degree to which an observation <span class="math"><span class="math inline">\(x_i\)</span></span> belongs to a cluster <span class="math"><span class="math inline">\(c_j\)</span></span>
</li>
<li>
<span class="math"><span class="math inline">\(\mu_j\)</span></span> is the center of the cluster j
</li>
<li>
<span class="math"><span class="math inline">\(m\)</span></span> is the fuzzifier.
</li>
</ul>
<p>
<span class="notice">It can be seen that, FCM differs from k-means by using the membership values <span class="math"><span class="math inline">\(u_{ij}\)</span></span> and the fuzzifier <span class="math"><span class="math inline">\(m\)</span></span>.</span>
</p>
<p>
The membership degree <span class="math"><span class="math inline">\(u_{ij}\)</span></span> is defined as follows:
</p>
<p>
<span class="math"><span class="math display">\[
u_{ij} = \frac{1}{\sum\limits_{l=1}^k \left( \frac{| x_i - c_j |}{| x_i - c_l |}\right)^{\frac{2}{m-1}}}
\]</span></span>
</p>
<p>
The degree of belonging, <span class="math"><span class="math inline">\(u_{ij}\)</span></span>, is linked inversely to the distance from x to the cluster center.
</p>
<p>
The parameter <span class="math"><span class="math inline">\(m\)</span></span> is a real number greater than 1 (<span class="math"><span class="math inline">\(1.0 < m < \infty\)</span></span>) and it defines the level of cluster fuzziness. Note that a value of <span class="math"><span class="math inline">\(m\)</span></span> close to 1 gives a cluster solution increasingly similar to that of hard clustering such as k-means, whereas a value of <span class="math"><span class="math inline">\(m\)</span></span> close to infinity leads to complete fuzziness.
</p>
<p>
<span class="success">Note that, a good choice is to use <strong>m = 2.0</strong> (Hathaway and Bezdek 2001).</span>
</p>
<p>
In <strong>fuzzy clustering</strong>, the centroid of a cluster is the mean of all points, weighted by their degree of belonging to the cluster:
</p>
<p>
<span class="math"><span class="math display">\[
C_j = \frac{\sum\limits_{x \in C_j} u_{ij}^m x}{\sum\limits_{x \in C_j} u_{ij}^m}
\]</span></span>
</p>
<p>
Where,
</p>
<ul>
<li>
<span class="math"><span class="math inline">\(C_j\)</span></span> is the centroid of the cluster j
</li>
<li>
<span class="math"><span class="math inline">\(u_{ij}\)</span></span> is the degree to which an observation <span class="math"><span class="math inline">\(x_i\)</span></span> belongs to a cluster <span class="math"><span class="math inline">\(c_j\)</span></span>
</li>
</ul>
<p>
The algorithm of fuzzy clustering can be summarized as follows:
</p>
<ol style="list-style-type: decimal">
<li>
Specify a number of clusters k (by the analyst)
</li>
<li>
Assign randomly to each point coefficients for being in the clusters.
</li>
<li>
Repeat until the maximum number of iterations (given by “maxit”) is reached, or when the algorithm has converged (that is, the coefficients’ change between two iterations is no more than <span class="math"><span class="math inline">\(\epsilon\)</span></span>, the given sensitivity threshold):
<ul>
<li>
Compute the centroid for each cluster, using the formula above.
</li>
<li>
For each point, compute its coefficients of being in the clusters, using the formula above.
</li>
</ul>
</li>
</ol>
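<p>
The update steps above can be sketched directly in base R (an illustrative, unoptimized implementation with an assumed random initialization; in practice, use <em>e1071::cmeans</em>() or <em>cluster::fanny</em>()):
</p>
<pre class="r"><code># Minimal fuzzy c-means sketch following the formulas above
fcm <- function(x, k, m = 2, max.iter = 100, eps = 1e-6) {
  x <- as.matrix(x); n <- nrow(x)
  u <- matrix(runif(n * k), n, k)
  u <- u / rowSums(u)                      # random membership coefficients
  for (it in seq_len(max.iter)) {
    um <- u^m
    centers <- t(um) %*% x / colSums(um)   # membership-weighted centroids
    # squared distances d[i, j] = ||x_i - c_j||^2
    d <- sapply(seq_len(k), function(j) colSums((t(x) - centers[j, ])^2))
    d <- pmax(d, 1e-12)
    # u[i, j] = 1 / sum_l (d[i, j] / d[i, l])^(1/(m-1))
    w <- d^(-1 / (m - 1))
    u.new <- w / rowSums(w)
    if (max(abs(u.new - u)) < eps) { u <- u.new; break }
    u <- u.new
  }
  list(centers = centers, membership = u, cluster = max.col(u))
}
res <- fcm(scale(USArrests), k = 2)</code></pre>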
<p>
The algorithm minimizes intra-cluster variance as well, but it has the same problems as k-means: the minimum is a local minimum, and the results depend on the initial choice of weights. Hence, different initializations may lead to different results.
</p>
<p>
Using a mixture of Gaussians along with the expectation-maximization algorithm is a more statistically formalized method that includes some of these ideas, namely partial membership in classes.
</p>
</div>


</div><!--end rdoc-->

 
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
  (function () {
    var script = document.createElement("script");
    script.type = "text/javascript";
    script.src  = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
    document.getElementsByTagName("head")[0].appendChild(script);
  })();
</script>

<!-- END HTML -->]]></description>
			<pubDate>Thu, 07 Sep 2017 16:44:00 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Fuzzy Clustering Essentials]]></title>
			<link>https://www.sthda.com/english/articles/30-advanced-clustering/101-fuzzy-clustering-essentials/</link>
			<guid>https://www.sthda.com/english/articles/30-advanced-clustering/101-fuzzy-clustering-essentials/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">
<p><strong>Fuzzy clustering</strong> is considered a soft clustering method, in which each element has a probability of belonging to each cluster. In other words, each element has a set of membership coefficients corresponding to its degree of belonging to a given cluster.</p>
<p>This is different from k-means and k-medoids clustering, where each object is assigned to exactly one cluster. K-means and k-medoids clustering are known as hard or non-fuzzy clustering.</p>
<p>In fuzzy clustering, points close to the center of a cluster may belong to the cluster to a higher degree than points on the edge of the cluster. The degree to which an element belongs to a given cluster is a numerical value varying from 0 to 1.</p>
<p>The <strong>fuzzy c-means</strong> (FCM) algorithm is one of the most widely used fuzzy clustering algorithms. The centroid of a cluster is calculated as the mean of all points, weighted by their degree of belonging to the cluster.</p>
<div class="block">
<p>
In this article, we’ll describe how to compute fuzzy clustering using the R software.
</p>
</div>
<br/>
<p>Related Book:</p>
<div class = "small-block content-privileged-friends cluster-book">
    <center>
        <a href = "https://www.sthda.com/english/web/5-bookadvisor/17-practical-guide-to-cluster-analysis-in-r/">
          <img src = "https://www.sthda.com/english/sthda-upload/images/cluster-analysis/clustering-book-cover.png" /><br/>
      Practical Guide to Cluster Analysis in R
      </a>
      </center>
</div>
<div class="spacer"></div>
<div id="required-r-packages" class="section level2">
<h2>Required R packages</h2>
<p>We’ll use the following R packages: 1) <em>cluster</em> for computing fuzzy clustering and 2) <em>factoextra</em> for visualizing clusters.</p>
</div>
<div id="computing-fuzzy-clustering" class="section level2">
<h2>Computing fuzzy clustering</h2>
<p>The function <em>fanny</em>() [<em>cluster</em> R package] can be used to compute fuzzy clustering. <strong>FANNY</strong> stands for <strong>fuzzy analysis clustering</strong>. A simplified format is:</p>
<pre class="r"><code>fanny(x, k, metric = "euclidean", stand = FALSE)</code></pre>
<div class="block">
<ul>
<li>
<strong>x</strong>: A data matrix or data frame or dissimilarity matrix
</li>
<li>
<strong>k</strong>: The desired number of clusters to be generated
</li>
<li>
<strong>metric</strong>: Metric for calculating dissimilarities between observations
</li>
<li>
<strong>stand</strong>: If TRUE, variables are standardized before calculating the dissimilarities
</li>
</ul>
</div>
<p>The function <em>fanny</em>() returns an object including the following components:</p>
<ul>
<li><strong>membership</strong>: matrix containing the degree to which each observation belongs to a given cluster. Column names are the clusters and rows are observations</li>
<li><strong>coeff</strong>: Dunn’s partition coefficient F(k) of the clustering, where k is the number of clusters. F(k) is the sum of all squared membership coefficients, divided by the number of observations. Its value is between 1/k and 1. The normalized form of the coefficient is also given. It is defined as <span class="math inline">\((F(k) - 1/k) / (1 - 1/k)\)</span>, and ranges between 0 and 1. A low value of Dunn’s coefficient indicates a very fuzzy clustering, whereas a value close to 1 indicates a near-crisp clustering.</li>
<li><strong>clustering</strong>: the clustering vector containing the nearest crisp grouping of observations</li>
</ul>
<p>For example, the R code below applies fuzzy clustering on the USArrests data set:</p>
<pre class="r"><code>library(cluster)
df <- scale(USArrests)     # Standardize the data
res.fanny <- fanny(df, 2)  # Compute fuzzy clustering with k = 2</code></pre>
<p>The different components can be extracted using the code below:</p>
<pre class="r"><code>head(res.fanny$membership, 3) # Membership coefficients</code></pre>
<pre><code>##          [,1]  [,2]
## Alabama 0.664 0.336
## Alaska  0.610 0.390
## Arizona 0.686 0.314</code></pre>
<pre class="r"><code>res.fanny$coeff # Dunn's partition coefficient</code></pre>
<pre><code>## dunn_coeff normalized 
##      0.555      0.109</code></pre>
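<p>As a quick arithmetic check, the normalized value can be recomputed from the definition given above:</p>
<pre class="r"><code># Normalized Dunn coefficient: (F(k) - 1/k) / (1 - 1/k)
k <- 2
F_k <- 0.555
(F_k - 1/k) / (1 - 1/k)  # 0.11, matching the reported value up to rounding</code></pre>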
<pre class="r"><code>head(res.fanny$clustering) # Observation groups</code></pre>
<pre><code>##    Alabama     Alaska    Arizona   Arkansas California   Colorado 
##          1          1          1          2          1          1</code></pre>
<p>To visualize observation groups, use the function <em>fviz_cluster</em>() [<em>factoextra</em> package]:</p>
<pre class="r"><code>library(factoextra)
fviz_cluster(res.fanny, ellipse.type = "norm", repel = TRUE,
             palette = "jco", ggtheme = theme_minimal(),
             legend = "right")</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/021-fuzzy-clustering-visualize-1.png" width="518.4" /></p>
<p>To evaluate the goodness of the clustering results, plot the silhouette coefficient as follows:</p>
<pre class="r"><code>fviz_silhouette(res.fanny, palette = "jco",
                ggtheme = theme_minimal())</code></pre>
<pre><code>##   cluster size ave.sil.width
## 1       1   22          0.32
## 2       2   28          0.44</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/021-fuzzy-clustering-silhouette-1.png" width="518.4" /></p>
</div>
<div id="summary" class="section level2">
<h2>Summary</h2>
<p>Fuzzy clustering is an alternative to k-means clustering, in which each data point has a membership coefficient for each cluster. Here, we demonstrated how to compute and visualize fuzzy clustering using the combination of the <em>cluster</em> and <em>factoextra</em> R packages.</p>
</div>
</div><!--end rdoc-->
 
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
  (function () {
    var script = document.createElement("script");
    script.type = "text/javascript";
    script.src  = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
    document.getElementsByTagName("head")[0].appendChild(script);
  })();
</script>

<!-- END HTML -->]]></description>
			<pubDate>Thu, 07 Sep 2017 15:50:00 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Hierarchical K-Means Clustering: Optimize Clusters]]></title>
			<link>https://www.sthda.com/english/articles/30-advanced-clustering/100-hierarchical-k-means-clustering-optimize-clusters/</link>
			<guid>https://www.sthda.com/english/articles/30-advanced-clustering/100-hierarchical-k-means-clustering-optimize-clusters/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">
<div id="hkmeans" class="section level1">
<h1>Hierarchical K-Means Clustering</h1>
<p>K-means (Chapter @ref(kmeans-clustering)) represents one of the most popular clustering algorithms. However, it has some limitations: it requires the user to specify the number of clusters in advance and it selects initial centroids randomly. The final k-means clustering solution is very sensitive to this initial random selection of cluster centers, so the result might be (slightly) different each time you compute k-means.</p>
<div class="block">
<p>
In this chapter, we describe a hybrid method, named <strong>hierarchical k-means clustering</strong> (hkmeans), for improving k-means results.
</p>
</div>
<br/>
<p>Related Book:</p>
<div class = "small-block content-privileged-friends cluster-book">
    <center>
        <a href = "https://www.sthda.com/english/web/5-bookadvisor/17-practical-guide-to-cluster-analysis-in-r/">
          <img src = "https://www.sthda.com/english/sthda-upload/images/cluster-analysis/clustering-book-cover.png" /><br/>
      Practical Guide to Cluster Analysis in R
      </a>
      </center>
</div>
<div class="spacer"></div>
<div id="algorithm" class="section level2">
<h2>Algorithm</h2>
<p>The algorithm is summarized as follow:</p>
<ol style="list-style-type: decimal">
<li>Compute hierarchical clustering and cut the tree into k-clusters</li>
<li>Compute the center (i.e the mean) of each cluster</li>
<li>Compute k-means by using the set of cluster centers (defined in step 2) as the initial cluster centers</li>
</ol>
<div class="notice">
<p>
Note that the k-means algorithm will improve the initial partitioning generated at step 2 of the algorithm. Hence, the initial partitioning can be slightly different from the final partitioning obtained in step 3.
</p>
</div>
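<p>The three steps above can be sketched manually with base R functions. This is an illustrative sketch only; the actual <em>hkmeans</em>() implementation in <em>factoextra</em> may differ in its details:</p>
<pre class="r"><code># Illustrative sketch of the hkmeans steps using base R
df <- scale(USArrests)
# Step 1: hierarchical clustering, tree cut into k = 4 clusters
grp <- cutree(hclust(dist(df), method = "ward.D2"), k = 4)
# Step 2: compute the center (i.e. the mean) of each cluster
centers <- aggregate(df, by = list(cluster = grp), FUN = mean)[, -1]
# Step 3: k-means using these centers as the initial cluster centers
res.km <- kmeans(df, centers = centers)</code></pre>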
</div>
<div id="r-code" class="section level2">
<h2>R code</h2>
<p>The R function <em>hkmeans</em>() [in the <em>factoextra</em> package] provides an easy solution to compute hierarchical k-means clustering. The format of the result is similar to the one returned by the standard kmeans() function (see Chapter @ref(kmeans-clustering)).</p>
<p>To install factoextra, type this: <em>install.packages(“factoextra”)</em>.</p>
<p>We’ll use the USArrest data set and we start by standardizing the data:</p>
<pre class="r"><code>df <- scale(USArrests)</code></pre>
<pre class="r"><code># Compute hierarchical k-means clustering
library(factoextra)
res.hk <- hkmeans(df, 4)
# Elements returned by hkmeans()
names(res.hk)</code></pre>
<pre><code>##  [1] "cluster"      "centers"      "totss"        "withinss"    
##  [5] "tot.withinss" "betweenss"    "size"         "iter"        
##  [9] "ifault"       "data"         "hclust"</code></pre>
<p>To print all the results, type this:</p>
<pre class="r"><code># Print the results
res.hk</code></pre>
<pre class="r"><code># Visualize the tree
fviz_dend(res.hk, cex = 0.6, palette = "jco", 
          rect = TRUE, rect_border = "jco", rect_fill = TRUE)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/020-hierarchical-k-means-clustering-hierarchical-k-means-clustering-1.png" width="518.4" /></p>
<pre class="r"><code># Visualize the hkmeans final clusters
fviz_cluster(res.hk, palette = "jco", repel = TRUE,
             ggtheme = theme_classic())</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/020-hierarchical-k-means-clustering-hierarchical-k-means-clustering-2.png" width="518.4" /></p>
</div>
<div id="summary" class="section level2">
<h2>Summary</h2>
<p>We described the hybrid <strong>hierarchical k-means clustering</strong> approach for improving k-means results.</p>
</div>
</div><!--end rdoc-->

<!-- END HTML -->]]></description>
			<pubDate>Thu, 07 Sep 2017 15:21:00 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Computing P-value for Hierarchical Clustering]]></title>
			<link>https://www.sthda.com/english/articles/29-cluster-validation-essentials/99-computing-p-value-for-hierarchical-clustering/</link>
			<guid>https://www.sthda.com/english/articles/29-cluster-validation-essentials/99-computing-p-value-for-hierarchical-clustering/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">
<p>Clusters can be found in a data set by chance, due to noise or sampling error. This article describes the R package <strong>pvclust</strong> <span class="citation">(Suzuki and Shimodaira 2015)</span>, which uses bootstrap resampling techniques to <strong>compute a p-value</strong> for each cluster in a <strong>hierarchical clustering</strong>.</p>
<br/>
<p>Contents:</p>
<div id="TOC">
<ul>
<li><a href="#algorithm">Algorithm</a></li>
<li><a href="#required-packages">Required packages</a></li>
<li><a href="#data-preparation">Data preparation</a></li>
<li><a href="#compute-p-value-for-hierarchical-clustering">Compute p-value for hierarchical clustering</a><ul>
<li><a href="#description-of-pvclust-function">Description of pvclust() function</a></li>
<li><a href="#usage-of-pvclust-function">Usage of pvclust() function</a></li>
</ul></li>
<li><a href="#references">References</a></li>
</ul>
</div>
<br/>
<p>Related Book:</p>
<div class = "small-block content-privileged-friends cluster-book">
    <center>
        <a href = "https://www.sthda.com/english/web/5-bookadvisor/17-practical-guide-to-cluster-analysis-in-r/">
          <img src = "https://www.sthda.com/english/sthda-upload/images/cluster-analysis/clustering-book-cover.png" /><br/>
      Practical Guide to Cluster Analysis in R
      </a>
      </center>
</div>
<div class="spacer"></div>
<div id="algorithm" class="section level2">
<h2>Algorithm</h2>
<ol style="list-style-type: decimal">
<li>Generate thousands of bootstrap samples by randomly sampling elements of the data</li>
<li>Compute hierarchical clustering on each bootstrap copy</li>
<li>For each cluster:
<ul>
<li>Compute the <em>bootstrap probability</em> (<em>BP</em>) value, which corresponds to the frequency with which the cluster is identified in the bootstrap copies</li>
<li>Compute the <em>approximately unbiased</em> (<em>AU</em>) probability value (p-value) by multiscale bootstrap resampling</li>
</ul></li>
</ol>
<div class="success">
<p>
Clusters with AU >= 95% are considered to be strongly supported by the data.
</p>
</div>
</div>
<div id="required-packages" class="section level2">
<h2>Required packages</h2>
<ol style="list-style-type: decimal">
<li>Install <strong>pvclust</strong>:</li>
</ol>
<pre class="r"><code>install.packages("pvclust")</code></pre>
<ol start="2" style="list-style-type: decimal">
<li>Load <strong>pvclust</strong>:</li>
</ol>
<pre class="r"><code>library(pvclust)</code></pre>
</div>
<div id="data-preparation" class="section level2">
<h2>Data preparation</h2>
<p>We’ll use <em>lung</em> data set [in <em>pvclust</em> package]. It contains the gene expression profile of 916 genes of 73 lung tissues including 67 tumors. Columns are samples and rows are genes.</p>
<pre class="r"><code>library(pvclust)
# Load the data
data("lung")
head(lung[, 1:4])</code></pre>
<pre><code>##               fetal_lung 232-97_SCC 232-97_node 68-96_Adeno
## IMAGE:196992       -0.40       4.28        3.68       -1.35
## IMAGE:587847       -2.22       5.21        4.75       -0.91
## IMAGE:1049185      -1.35      -0.84       -2.88        3.35
## IMAGE:135221        0.68       0.56       -0.45       -0.20
## IMAGE:298560          NA       4.14        3.58       -0.40
## IMAGE:119882       -3.23      -2.84       -2.72       -0.83</code></pre>
<pre class="r"><code># Dimension of the data
dim(lung)</code></pre>
<pre><code>## [1] 916  73</code></pre>
<p>We’ll use only a subset of the data set for the clustering analysis. The R function <em>sample</em>() can be used to extract a random subset of 30 samples:</p>
<pre class="r"><code>set.seed(123)
ss <- sample(1:73, 30) # extract a random subset of 30 samples out of 73
df <- lung[, ss]</code></pre>
</div>
<div id="compute-p-value-for-hierarchical-clustering" class="section level2">
<h2>Compute p-value for hierarchical clustering</h2>
<div id="description-of-pvclust-function" class="section level3">
<h3>Description of pvclust() function</h3>
<p>The function <em>pvclust</em>() can be used as follows:</p>
<pre class="r"><code>pvclust(data, method.hclust = "average",
        method.dist = "correlation", nboot = 1000)</code></pre>
<p>Note that the computation time can be strongly decreased using the parallel version, <em>parPvclust</em>(). (Read ?parPvclust for more information.)</p>
<pre class="r"><code>parPvclust(cl=NULL, data, method.hclust = "average",
           method.dist = "correlation", nboot = 1000,
           iseed = NULL)</code></pre>
<div class="block">
<ul>
<li>
<strong>data</strong>: numeric data matrix or data frame.
</li>
<li>
<strong>method.hclust</strong>: the agglomerative method used in hierarchical clustering. Possible values are one of “average”, “ward”, “single”, “complete”, “mcquitty”, “median” or “centroid”. The default is “average”. See method argument in <strong>?hclust</strong>.
</li>
<li>
<strong>method.dist</strong>: the distance measure to be used. Possible values are one of “correlation”, “uncentered”, “abscor” or those which are allowed for the <strong>method</strong> argument in the <strong>dist()</strong> function, such as “euclidean” and “manhattan”.
</li>
<li>
<strong>nboot</strong>: the number of bootstrap replications. The default is 1000.
</li>
<li>
<strong>iseed</strong>: an integer for random seeds. Use the iseed argument to achieve reproducible results.
</li>
</ul>
</div>
<p>The function <em>pvclust</em>() returns an object of class <em>pvclust</em> containing many elements, including <em>hclust</em>, which holds the hierarchical clustering result for the original data as generated by the function <em>hclust</em>().</p>
</div>
<div id="usage-of-pvclust-function" class="section level3">
<h3>Usage of pvclust() function</h3>
<p><em>pvclust</em>() performs clustering on the columns of the data set, which correspond to samples in our case. If you want to perform the clustering on the variables (here, genes), you have to transpose the data set using the function <em>t</em>().</p>
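<p>For example, to cluster the genes rather than the samples, transpose the data first (a sketch; note that clustering hundreds of genes with many bootstrap replications can be slow):</p>
<pre class="r"><code># Cluster genes (rows) instead of samples (columns)
res.gene <- pvclust(t(df), method.dist = "cor",
                    method.hclust = "average", nboot = 10)</code></pre>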
<p>The R code below computes <em>pvclust</em>() using 10 as the number of bootstrap replications (for speed):</p>
<pre class="r"><code>library(pvclust)
set.seed(123)
res.pv <- pvclust(df, method.dist="cor", 
                  method.hclust="average", nboot = 10)</code></pre>
<pre class="r"><code># Default plot
plot(res.pv, hang = -1, cex = 0.5)
pvrect(res.pv)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/018-p-value-for-hierarchical-clustering-pvclust-p-value-hierarchical-clustering-1.png" width="518.4" /></p>
<div class="success">
<p>
Values on the dendrogram are <em>AU p-values</em> (red, left), <em>BP values</em> (green, right), and <em>cluster labels</em> (grey, bottom). Clusters with AU >= 95% are indicated by the rectangles and are considered to be strongly supported by the data.
</p>
</div>
<p>To extract the objects from the significant clusters, use the function <em>pvpick</em>():</p>
<pre class="r"><code>clusters <- pvpick(res.pv)
clusters</code></pre>
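<p>By default, <em>pvpick</em>() selects clusters with AU >= 0.95; this threshold can be changed via its <em>alpha</em> argument:</p>
<pre class="r"><code># Relax the AU threshold to 90%
clusters90 <- pvpick(res.pv, alpha = 0.90)</code></pre>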
<p>Parallel computation can be applied as follows:</p>
<pre class="r"><code># Create a parallel socket cluster
library(parallel)
cl <- makeCluster(2, type = "PSOCK")
# parallel version of pvclust
res.pv <- parPvclust(cl, df, nboot=1000)
stopCluster(cl)</code></pre>
</div>
</div>
<div id="references" class="section level2 unnumbered">
<h2>References</h2>
<div id="refs" class="references">
<div id="ref-suzuki2015">
<p>Suzuki, Ryota, and Hidetoshi Shimodaira. 2015. <em>Pvclust: Hierarchical Clustering with P-Values via Multiscale Bootstrap Resampling</em>. <a href="https://CRAN.R-project.org/package=pvclust" class="uri">https://CRAN.R-project.org/package=pvclust</a>.</p>
</div>
</div>
</div>
</div><!--end rdoc-->

<!-- END HTML -->]]></description>
			<pubDate>Thu, 07 Sep 2017 11:52:00 +0200</pubDate>
			
		</item>
		
	</channel>
</rss>
