<?xml version="1.0" encoding="UTF-8" ?>
<!-- RSS generated by PHPBoost on Wed, 10 Jun 2026 03:18:06 +0200 -->

<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
	<channel>
		<title><![CDATA[Easy Guides]]></title>
		<atom:link href="https://www.sthda.com/english/syndication/rss/wiki/34" rel="self" type="application/rss+xml"/>
		<link>https://www.sthda.com</link>
		<description><![CDATA[Last articles of the category: Cluster Analysis in R - Unsupervised machine learning]]></description>
		<copyright>(C) 2005-2026 PHPBoost</copyright>
		<language>en</language>
		<generator>PHPBoost</generator>
		
		
		<item>
			<title><![CDATA[Practical Guide to Cluster Analysis in R - Book]]></title>
			<link>https://www.sthda.com/english/wiki/practical-guide-to-cluster-analysis-in-r-book</link>
			<guid>https://www.sthda.com/english/wiki/practical-guide-to-cluster-analysis-in-r-book</guid>
			<description><![CDATA[<!-- START HTML -->

  <!--====================== start from here when you copy to sthda================-->  
  <div id="rdoc">

<div id="introduction" class="section level2">
<h2>Introduction</h2>
<p>Large amounts of data are collected every day from satellite images, biomedical, security, marketing, web search, geospatial and other automatic equipment. Mining knowledge from such large data sets far exceeds human abilities.</p>
<p><strong>Clustering</strong> is one of the important data mining methods for discovering knowledge in multidimensional data. The goal of clustering is to identify patterns or groups of similar objects within a data set of interest.</p>
<p>In the literature, it is referred to as “pattern recognition” or “unsupervised machine learning”: “unsupervised” because we are not guided by a priori ideas of which variables or samples belong in which clusters, and “learning” because the machine algorithm “learns” how to cluster.</p>
<p>Cluster analysis is popular in many fields, including:</p>
<ul>
<li><p>In <em>cancer research</em>, for classifying patients into subgroups according to their gene expression profile. This can be useful for identifying the molecular profile of patients with a good or bad prognosis, as well as for understanding the disease.</p></li>
<li><p>In <em>marketing</em>, for <em>market segmentation</em> by identifying subgroups of customers with similar profiles who might be receptive to a particular form of advertising.</p></li>
<li><p>In <em>city planning</em>, for identifying groups of houses according to their type, value and location.</p></li>
</ul>
<br/>
<div class="block">
This book provides a practical guide to unsupervised machine learning or cluster analysis using R software. Additionally, we developed an R package named <a href="https://www.sthda.com/english/rpkgs/factoextra"><em>factoextra</em></a> to easily create elegant ggplot2-based plots of cluster analysis results. Official online documentation of factoextra: <a href="https://www.sthda.com/english/rpkgs/factoextra" class="uri">https://www.sthda.com/english/rpkgs/factoextra</a>
</div>
<p><br/></p>
<p><img src="https://www.sthda.com/english/sthda/RDoc/images/clustering-e1-cover.png" alt="clustering book cover" /></p>
<p><strong>Preview of the first 38 pages</strong> of the book: <a href="https://www.sthda.com/sthda/ebooks/clustering_english_edition1_preview.pdf">Practical Guide to Cluster Analysis in R (preview)</a>.</p>
<p><a href="https://www.sthda.com/sthda/ebooks/clustering_english_edition1_preview.pdf" target="_blank"><img src="https://www.sthda.com/english/sthda/RDoc/images/preview.png" alt ="Preview"/></a></p>
<p><strong>Download the ebook</strong> through <a href="https://payhip.com/b/MOUP">payhip</a>:</p>
<p><a href="https://payhip.com/b/MOUP" target="_blank"><img src="https://www.sthda.com/english/sthda/RDoc/images/download-now.png" alt ="payhip"/></a></p>
<p><strong>Order a physical copy</strong> from <a href="https://www.amazon.com/dp/1542462703/">amazon</a>:</p>
<p><a href="https://www.amazon.com/dp/1542462703/" target="_blank"><img src="https://www.sthda.com/english/sthda/RDoc/images/amazon.png" alt ="Amazon"/></a></p>
</div>
<div id="key-features-of-this-book" class="section level2">
<h2>Key features of this book</h2>
<p>Although there are several good books on unsupervised machine learning/clustering and related topics, we felt that many of them are either too high-level, theoretical or too advanced. Our goal was to write a practical guide to cluster analysis, elegant visualization and interpretation.</p>
<p>The main parts of the book include:</p>
<ul>
<li><em>distance measures</em>,</li>
<li><em>partitioning clustering</em>,</li>
<li><em>hierarchical clustering</em>,</li>
<li><em>cluster validation methods</em>, as well as,</li>
<li><em>advanced clustering methods</em> such as fuzzy clustering, density-based clustering and model-based clustering.</li>
</ul>
<p>The book presents the basic principles of these tasks and provides many examples in R. It offers solid guidance in data mining for students and researchers.</p>
<p>Key features:</p>
<ul>
<li>Covers clustering algorithms and their implementation</li>
<li>Key mathematical concepts are presented</li>
<li>Short, self-contained chapters with practical examples. This means that you don’t need to read the different chapters in sequence.</li>
</ul>
<br/>
<div class="block">
At the end of each chapter, we present R lab sections in which we systematically work through applications of the various methods discussed in that chapter.
</div>
<p><br/></p>
</div>
<div id="how-this-book-is-organized" class="section level1">
<h1>How this book is organized</h1>
<p><img src="https://www.sthda.com/english/sthda/RDoc/images/clustering-e1-book-plan.png" alt="clustering plan" /></p>
<p>This book contains 5 parts. Part I (Chapters 1 to 3) provides a quick introduction to R (Chapter 1) and presents the required R packages and data formats (Chapter 2) for cluster analysis and visualization.</p>
<p>The classification of objects, into clusters, requires some methods for measuring the distance or the (dis)similarity between the objects. Chapter 3 covers the common distance measures used for assessing similarity between observations.</p>
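<p>As a minimal sketch of what computing such distances can look like in base R, the code below uses the built-in <strong>USArrests</strong> data purely as a stand-in (the choice of data set, the five-row subset and the object names are assumptions made for illustration):</p>
<pre class="r"><code># Standardize the variables of a small subset, then compute pairwise distances
data(USArrests)
df <- scale(USArrests[1:5, ])             # first five states only, for readability
d_euc <- dist(df, method = "euclidean")   # Euclidean distance
d_man <- dist(df, method = "manhattan")   # Manhattan (city-block) distance
# Correlation-based dissimilarity between observations (rows)
d_cor <- as.dist(1 - cor(t(df)))
round(as.matrix(d_euc), 2)                # distance matrix in readable form</code></pre>
<p>Different distance measures can lead to different clusterings, so the choice should reflect what "similar" means for the data at hand.</p>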
<p>Part II starts with partitioning clustering methods, which include:</p>
<ul>
<li>K-means clustering (Chapter 4),</li>
<li>K-Medoids or PAM (partitioning around medoids) algorithm (Chapter 5) and</li>
<li>CLARA algorithms (Chapter 6).</li>
</ul>
<p>Partitioning clustering approaches subdivide the data sets into a set of k groups, where k is the number of groups pre-specified by the analyst.</p>
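<p>A short sketch of a partitioning method with a pre-specified k, using the base R function <strong>kmeans()</strong>. The data set (<strong>USArrests</strong>), the value k = 4 and the seed are illustrative assumptions, not values prescribed by the book:</p>
<pre class="r"><code>data(USArrests)
df <- scale(USArrests)                       # standardize the variables
set.seed(123)                                # k-means starts from random centers
km <- kmeans(df, centers = 4, nstart = 25)   # k = 4 is chosen by the analyst
km$size                                      # number of observations per cluster
head(km$cluster)                             # cluster number of the first observations</code></pre>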
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/ebook/clustering/cluster-analysis-book-edition1-cluster-plots-1.png" alt="cluster analysis in R" width="518.4" style="margin-bottom:10px;" />
<p class="caption">
cluster analysis in R
</p>
</div>
<p>In Part III, we consider the agglomerative hierarchical clustering method, which is an alternative approach to partitioning clustering for identifying groups in a data set. It does not require the number of clusters to be specified in advance. The result of hierarchical clustering is a tree-based representation of the objects, also known as a <em>dendrogram</em> (see the figure below).</p>
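<p>For illustration, a hierarchical clustering can be sketched with the base R functions <strong>hclust()</strong> and <strong>cutree()</strong>. The data set, the Ward linkage and the cut into 4 groups are assumptions chosen for this example:</p>
<pre class="r"><code>data(USArrests)
df <- scale(USArrests)
d <- dist(df, method = "euclidean")   # pairwise distances between observations
hc <- hclust(d, method = "ward.D2")   # agglomerative clustering, Ward linkage
plot(hc, cex = 0.6, hang = -1)        # draw the dendrogram
grp <- cutree(hc, k = 4)              # the number of groups is chosen afterwards
table(grp)                            # size of each group</code></pre>
<p>Note that the tree is built once; cutting it at a different height or with a different k does not require recomputing the clustering.</p>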
<p>In this part, we describe how to compute, visualize, interpret and compare dendrograms:</p>
<ul>
<li>Agglomerative clustering (Chapter 7)
<ul>
<li>Algorithm and steps</li>
<li>Verify the cluster tree</li>
<li>Cut the dendrogram into different groups</li>
</ul></li>
<li>Compare dendrograms (Chapter 8)
<ul>
<li>Visual comparison of two dendrograms</li>
<li>Correlation matrix between a list of dendrograms</li>
</ul></li>
<li>Visualize dendrograms (Chapter 9)
<ul>
<li>Case of small data sets</li>
<li>Case of dendrogram with large data sets: zoom, sub-tree, PDF</li>
<li>Customize dendrograms using dendextend</li>
</ul></li>
<li>Heatmap: static and interactive (Chapter 10)
<ul>
<li>R base heat maps</li>
<li>Pretty heat maps</li>
<li>Interactive heat maps</li>
<li>Complex heatmap</li>
<li>Real application: gene expression data</li>
</ul></li>
</ul>
<p>In this section, you will learn how to generate and interpret the following plots.</p>
<ul>
<li><strong>Standard dendrogram with filled rectangle around clusters</strong>:</li>
</ul>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/ebook/clustering/cluster-analysis-book-edition1-dendrogram-1.png" alt="cluster analysis in R" width="518.4" style="margin-bottom:10px;" />
<p class="caption">
cluster analysis in R
</p>
</div>
<p><br/></p>
<ul>
<li><strong>Compare two dendrograms</strong>:</li>
</ul>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/ebook/clustering/cluster-analysis-book-edition1-compare-dendrogram-tanglegram-1-1.png" alt="cluster analysis in R" width="518.4" style="margin-bottom:10px;" />
<p class="caption">
cluster analysis in R
</p>
</div>
<p><br/></p>
<ul>
<li><strong>Heatmap</strong>:</li>
</ul>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/ebook/clustering/cluster-analysis-book-edition1-pheatmap-1-1.png" alt="cluster analysis in R" width="518.4" style="margin-bottom:10px;" />
<p class="caption">
cluster analysis in R
</p>
</div>
<p><br/></p>
<p>Part IV describes clustering validation and evaluation strategies, which consist of measuring the goodness of clustering results. Before applying any clustering algorithm to a data set, the first thing to do is to assess the <em>clustering tendency</em>, that is, whether applying clustering is suitable for the data. If it is, the next question is how many clusters there are. Next, you can perform hierarchical clustering or partitioning clustering (with a pre-specified number of clusters). Finally, you can use a number of measures, described in this part, to evaluate the goodness of the clustering results.</p>
<p>The different chapters included in Part IV are organized as follows:</p>
<ul>
<li><p>Assessing clustering tendency (Chapter 11)</p></li>
<li><p>Determining the optimal number of clusters (Chapter 12)</p></li>
<li><p>Cluster validation statistics (Chapter 13)</p></li>
<li><p>Choosing the best clustering algorithms (Chapter 14)</p></li>
<li><p>Computing p-value for hierarchical clustering (Chapter 15)</p></li>
</ul>
<p>In this section, you’ll learn how to create and interpret the plots hereafter.</p>
<ul>
<li><strong>Visual assessment of clustering tendency</strong> (left panel): Clustering tendency is detected in a visual form by counting the number of square shaped dark blocks along the diagonal in the image.</li>
<li><strong>Determine the optimal number of clusters</strong> (right panel) in a data set using the gap statistic.</li>
</ul>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/ebook/clustering/cluster-analysis-book-edition1-clustering-tendency-1-1.png" alt="cluster analysis in R" width="307.2" style="margin-bottom:10px;" /><img src="https://www.sthda.com/english/sthda/RDoc/figure/ebook/clustering/cluster-analysis-book-edition1-clustering-tendency-1-2.png" alt="cluster analysis in R" width="307.2" style="margin-bottom:10px;" />
<p class="caption">
cluster analysis in R
</p>
</div>
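<p>As a rough sketch, the gap statistic can be computed with <strong>clusGap()</strong> from the <strong>cluster</strong> package (assumed to be installed; it ships with most R distributions). The data set, the maximum number of clusters and the number of bootstrap samples below are illustrative assumptions:</p>
<pre class="r"><code>library(cluster)
data(USArrests)
df <- scale(USArrests)
set.seed(123)                          # the reference data sets are simulated
gap <- clusGap(df, FUNcluster = kmeans, K.max = 8, B = 50, nstart = 25)
# Pick k using the "first SE max" rule on the gap values and their standard errors
k_best <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")
k_best</code></pre>
<p>Other selection rules are available through the <code>method</code> argument of <strong>maxSE()</strong> and may suggest a different number of clusters.</p>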
<ul>
<li>Cluster validation using the <em>silhouette coefficient</em> (Si): A value of Si close to 1 indicates that the object is well clustered. A value of Si close to -1 indicates that the object is poorly clustered. The figure below shows the silhouette plot of a k-means clustering.</li>
</ul>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/ebook/clustering/cluster-analysis-book-edition1-silhouette-coefficient-1-1.png" alt="cluster analysis in R" width="518.4" style="margin-bottom:10px;" />
<p class="caption">
cluster analysis in R
</p>
</div>
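<p>A minimal sketch of how silhouette coefficients can be obtained for a k-means result, using <strong>silhouette()</strong> from the <strong>cluster</strong> package (assumed to be installed). The data set, k = 4 and the object names are assumptions made for the example:</p>
<pre class="r"><code>library(cluster)
data(USArrests)
df <- scale(USArrests)
set.seed(123)
km <- kmeans(df, centers = 4, nstart = 25)
sil <- silhouette(km$cluster, dist(df))     # one Si value per observation
summary(sil)$avg.width                      # average silhouette width
neg <- which(sil[, "sil_width"] < 0)        # observations with negative Si
rownames(df)[neg]                           # candidates for misassignment</code></pre>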
<p>Part V presents advanced clustering methods, including:</p>
<ul>
<li>Hierarchical k-means clustering (Chapter 16)</li>
<li>Fuzzy clustering (Chapter 17)</li>
<li>Model-based clustering (Chapter 18)</li>
<li>DBSCAN: Density-Based Clustering (Chapter 19)</li>
</ul>
<p>The <em>hierarchical k-means clustering</em> is a hybrid approach for improving k-means results.</p>
<p>In <em>Fuzzy clustering</em>, items can be a member of more than one cluster. Each item has a set of membership coefficients corresponding to the degree of being in a given cluster.</p>
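<p>To make the idea of membership coefficients concrete, here is a small sketch using <strong>fanny()</strong> from the <strong>cluster</strong> package (assumed to be installed). The data set and the choice of 2 clusters are illustrative assumptions:</p>
<pre class="r"><code>library(cluster)
data(USArrests)
df <- scale(USArrests)
fz <- fanny(df, k = 2)            # fuzzy clustering into 2 clusters
head(round(fz$membership, 2))     # membership coefficients; each row sums to 1
head(fz$clustering)               # hard assignment: cluster with the highest membership</code></pre>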
<p>In <em>model-based clustering</em>, the data are viewed as coming from a distribution that is a mixture of two or more clusters. It finds the best fit of models to the data and estimates the number of clusters.</p>
<p>The <em>density-based clustering</em> (DBSCAN) is a partitioning method introduced in Ester et al. (1996). It can find clusters of different shapes and sizes in data containing noise and outliers.</p>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/ebook/clustering/cluster-analysis-book-edition1-dbscan-1-1.png" alt="cluster analysis in R" width="432" style="margin-bottom:10px;" />
<p class="caption">
cluster analysis in R
</p>
</div>
</div>

<script>jQuery(document).ready(function () {
    jQuery('h1').addClass('wiki_paragraph1');
    jQuery('h2').addClass('wiki_paragraph2');
    jQuery('h3').addClass('wiki_paragraph3');
    jQuery('h4').addClass('wiki_paragraph4');
    });//add phpboost class to header</script>
<style>.content{padding:0px;}</style>
</div><!--end rdoc-->
<!--====================== stop here when you copy to sthda================-->


<!-- END HTML -->]]></description>
			<pubDate>Wed, 08 Feb 2017 06:30:53 +0100</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Hybrid hierarchical k-means clustering for optimizing clustering outputs - Unsupervised Machine Learning]]></title>
			<link>https://www.sthda.com/english/wiki/hybrid-hierarchical-k-means-clustering-for-optimizing-clustering-outputs-unsupervised-machine-learning</link>
			<guid>https://www.sthda.com/english/wiki/hybrid-hierarchical-k-means-clustering-for-optimizing-clustering-outputs-unsupervised-machine-learning</guid>
			<description><![CDATA[<!-- START HTML -->


  <!--====================== start from here when you copy to sthda================-->  
  <div id="rdoc">


<div id="TOC">
<ul>
<li><a href="#how-this-article-is-organized"><span class="toc-section-number">1</span> How this article is organized</a></li>
<li><a href="#required-r-packages"><span class="toc-section-number">2</span> Required R packages</a></li>
<li><a href="#data-preparation"><span class="toc-section-number">3</span> Data preparation</a></li>
<li><a href="#r-function-for-clustering-analyses"><span class="toc-section-number">4</span> R function for clustering analyses</a><ul>
<li><a href="#example-of-k-means-clustering"><span class="toc-section-number">4.1</span> Example of k-means clustering</a></li>
<li><a href="#example-of-hierarchical-clustering"><span class="toc-section-number">4.2</span> Example of hierarchical clustering</a></li>
</ul></li>
<li><a href="#combining-hierarchical-clustering-and-k-means"><span class="toc-section-number">5</span> Combining hierarchical clustering and k-means</a><ul>
<li><a href="#why"><span class="toc-section-number">5.1</span> Why?</a></li>
<li><a href="#how"><span class="toc-section-number">5.2</span> How?</a></li>
<li><a href="#r-codes"><span class="toc-section-number">5.3</span> R codes</a><ul>
<li><a href="#compute-hierarchical-clustering-and-cut-the-tree-into-k-clusters"><span class="toc-section-number">5.3.1</span> Compute hierarchical clustering and cut the tree into k-clusters:</a></li>
<li><a href="#compute-the-centers-of-clusters-defined-by-hierarchical-clustering"><span class="toc-section-number">5.3.2</span> Compute the centers of clusters defined by hierarchical clustering:</a></li>
<li><a href="#k-means-clustering-using-hierarchical-clustering-defined-cluster-centers"><span class="toc-section-number">5.3.3</span> K-means clustering using hierarchical clustering defined cluster-centers</a></li>
<li><a href="#compare-the-results-of-hierarchical-clustering-and-hybrid-approach"><span class="toc-section-number">5.3.4</span> Compare the results of hierarchical clustering and hybrid approach</a></li>
<li><a href="#compare-the-results-of-standard-k-means-clustering-and-hybrid-approach"><span class="toc-section-number">5.3.5</span> Compare the results of standard k-means clustering and hybrid approach</a></li>
</ul></li>
<li><a href="#hkmeans-easy-to-use-function-for-hybrid-hierarchical-k-means-clustering"><span class="toc-section-number">5.4</span> hkmeans(): Easy-to-use function for hybrid hierarchical k-means clustering</a></li>
</ul></li>
<li><a href="#infos"><span class="toc-section-number">6</span> Infos</a></li>
</ul>
</div>

<p><br/></p>
<p><strong>Clustering algorithms</strong> are used to split a dataset into several groups (i.e., clusters), so that the objects in the same group are as similar as possible and the objects in different groups are as dissimilar as possible.</p>
<p>The most popular clustering algorithms are:</p>
<ul>
<li><a href="https://www.sthda.com/english/english/wiki/partitioning-cluster-analysis-quick-start-guide-unsupervised-machine-learning">k-means clustering</a>, a partitioning method used for splitting a dataset into a set of k clusters.</li>
<li><a href="https://www.sthda.com/english/english/wiki/hierarchical-clustering-essentials-unsupervised-machine-learning">hierarchical clustering</a>, an alternative approach to k-means clustering for identifying groups in the dataset, using the <a href="https://www.sthda.com/english/english/wiki/clarifying-distance-measures-unsupervised-machine-learning">pairwise distance matrix</a> between observations as the clustering criterion.</li>
</ul>
<p>However, each of these two standard clustering methods has its limitations. K-means clustering requires the user to specify the number of clusters in advance and selects initial centroids randomly. Agglomerative hierarchical clustering is good at identifying small clusters but not large ones.</p>
<p>In this article, we document hybrid approaches for easily mixing the best of k-means clustering and hierarchical clustering.</p>
<div id="how-this-article-is-organized" class="section level1">
<h1><span class="header-section-number">1</span> How this article is organized</h1>
<p>We’ll start by demonstrating why we should combine <strong>k-means</strong> and <strong>hierarchical clustering</strong>. An application is provided using <strong>R software</strong>.</p>
<p>Finally, we’ll provide an easy-to-use <strong>R</strong> function (in the <strong>factoextra</strong> package) for computing <strong>hybrid hierarchical k-means clustering</strong>.</p>
</div>
<div id="required-r-packages" class="section level1">
<h1><span class="header-section-number">2</span> Required R packages</h1>
<p>We’ll use the R package <strong>factoextra</strong>, which is very helpful for simplifying clustering workflows and for visualizing clusters using the <strong>ggplot2</strong> plotting system.</p>
<p>Install the <strong>factoextra</strong> package as follows:</p>
<pre class="r"><code>if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/factoextra")</code></pre>
<p>Load the package:</p>
<pre class="r"><code>library(factoextra)</code></pre>
</div>
<div id="data-preparation" class="section level1">
<h1><span class="header-section-number">3</span> Data preparation</h1>
<p>We’ll use the USArrests dataset and start by scaling the data:</p>
<pre class="r"><code># Load the data
data(USArrests)
# Scale the data
df <- scale(USArrests)
head(df)</code></pre>
<pre><code>##                Murder   Assault   UrbanPop         Rape
## Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
## Arizona    0.07163341 1.4788032  0.9989801  1.042878388
## Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144  1.7589234  2.067820292
## Colorado   0.02571456 0.3988593  0.8608085  1.864967207</code></pre>
<p><span class="warning"> If you want to understand why the data are scaled before the analysis, then you should read this section: <a href="https://www.sthda.com/english/english/wiki/clarifying-distance-measures-unsupervised-machine-learning#distances-and-scaling">Distances and scaling</a>.</span></p>
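<p>As a quick check of what scaling does, the sketch below verifies that each scaled variable has mean 0 and standard deviation 1, and prints the standard deviations of the raw variables for comparison (the object name <code>df</code> follows the code above):</p>
<pre class="r"><code>data(USArrests)
df <- scale(USArrests)
round(colMeans(df), 10)     # each column mean is 0 (up to rounding)
apply(df, 2, sd)            # each column standard deviation is 1
# Standard deviations of the raw variables: the one with the largest spread
# would dominate Euclidean distances if the data were left unscaled
apply(USArrests, 2, sd)</code></pre>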
</div>
<div id="r-function-for-clustering-analyses" class="section level1">
<h1><span class="header-section-number">4</span> R function for clustering analyses</h1>
<p>We’ll use the function <strong>eclust()</strong> [in <strong>factoextra</strong>] which provides several advantages as described in the previous chapter: <a href="https://www.sthda.com/english/english/wiki/visual-enhancement-of-clustering-analysis-unsupervised-machine-learning">Visual Enhancement of Clustering Analysis</a>.</p>
<p><strong>eclust()</strong> stands for enhanced clustering. It simplifies the workflow of clustering analysis and can be used for computing <a href="https://www.sthda.com/english/english/wiki/hierarchical-clustering-essentials-unsupervised-machine-learning">hierarchical clustering</a> and <a href="https://www.sthda.com/english/english/wiki/partitioning-cluster-analysis-quick-start-guide-unsupervised-machine-learning">partitioning clustering</a> in a single function call.</p>
<div id="example-of-k-means-clustering" class="section level2">
<h2><span class="header-section-number">4.1</span> Example of k-means clustering</h2>
<p>We’ll split the data into 4 clusters using <strong>k-means clustering</strong> as follows:</p>
<pre class="r"><code>library("factoextra")
# K-means clustering
km.res <- eclust(df, "kmeans", k = 4,
                 nstart = 25, graph = FALSE)
# k-means group number of each observation
head(km.res$cluster, 15)</code></pre>
<pre><code>##     Alabama      Alaska     Arizona    Arkansas  California    Colorado 
##           4           3           3           4           3           3 
## Connecticut    Delaware     Florida     Georgia      Hawaii       Idaho 
##           2           2           3           4           2           1 
##    Illinois     Indiana        Iowa 
##           3           2           1</code></pre>
<pre class="r"><code># Visualize k-means clusters
# (older factoextra versions used frame.type and frame.level instead)
fviz_cluster(km.res, ellipse.type = "norm", ellipse.level = 0.68)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/hierarchical-k-means-clustering-k-means-clustering-1.png" width="518.4" /></p>
<pre class="r"><code># Visualize the silhouette of clusters
fviz_silhouette(km.res)</code></pre>
<pre><code>##   cluster size ave.sil.width
## 1       1   13          0.37
## 2       2   16          0.34
## 3       3   13          0.27
## 4       4    8          0.39</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/hierarchical-k-means-clustering-k-means-clustering-2.png" width="518.4" /></p>
<p><span class="warning">Note that the <strong>silhouette coefficient</strong> measures how well an observation is clustered: it compares the average distance from the observation to the other members of its own cluster with its average distance to the members of the nearest neighboring cluster. The <strong>average silhouette width</strong> summarizes these values over a cluster. Observations with a negative silhouette are probably placed in the wrong cluster. Read more here: <a href="https://www.sthda.com/english/english/wiki/clustering-validation-statistics-4-vital-things-everyone-should-know-unsupervised-machine-learning">cluster validation statistics</a></span></p>
<p><strong>Samples with negative silhouette coefficient</strong>:</p>
<pre class="r"><code># Silhouette width of observation
sil <- km.res$silinfo$widths[, 1:3]
# Objects with negative silhouette
neg_sil_index <- which(sil[, 'sil_width'] < 0)
sil[neg_sil_index, , drop = FALSE]</code></pre>
<pre><code>##          cluster neighbor   sil_width
## Missouri       3        2 -0.07318144</code></pre>
<p>Read more about k-means clustering: <a href="https://www.sthda.com/english/english/wiki/partitioning-cluster-analysis-quick-start-guide-unsupervised-machine-learning">K-means clustering</a></p>
</div>
<div id="example-of-hierarchical-clustering" class="section level2">
<h2><span class="header-section-number">4.2</span> Example of hierarchical clustering</h2>
<pre class="r"><code># Enhanced hierarchical clustering
res.hc <- eclust(df, "hclust", k = 4,
                method = "ward.D2", graph = FALSE) 
head(res.hc$cluster, 15)</code></pre>
<pre><code>##     Alabama      Alaska     Arizona    Arkansas  California    Colorado 
##           1           2           2           3           2           2 
## Connecticut    Delaware     Florida     Georgia      Hawaii       Idaho 
##           3           3           2           1           3           4 
##    Illinois     Indiana        Iowa 
##           2           3           4</code></pre>
<pre class="r"><code># Dendrogram
fviz_dend(res.hc, rect = TRUE, show_labels = TRUE, cex = 0.5) </code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/hierarchical-k-means-clustering-hierarchical-clustering-1.png" width="518.4" /></p>
<pre class="r"><code># Visualize the silhouette of clusters
fviz_silhouette(res.hc)</code></pre>
<pre><code>##   cluster size ave.sil.width
## 1       1    7          0.46
## 2       2   12          0.29
## 3       3   19          0.26
## 4       4   12          0.43</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/hierarchical-k-means-clustering-hierarchical-clustering-2.png" width="518.4" /></p>
<p>It can be seen that two samples have a negative <strong>silhouette coefficient</strong>, indicating that they may not be in the right cluster. These samples are:</p>
<pre class="r"><code># Silhouette width of observation
sil <- res.hc$silinfo$widths[, 1:3]
# Objects with negative silhouette
neg_sil_index <- which(sil[, 'sil_width'] < 0)
sil[neg_sil_index, , drop = FALSE]</code></pre>
<pre><code>##          cluster neighbor   sil_width
## Kentucky       3        4 -0.06459230
## Arkansas       3        1 -0.08467352</code></pre>
<p>Read more about hierarchical clustering: <a href="https://www.sthda.com/english/english/wiki/hierarchical-clustering-essentials-unsupervised-machine-learning">Hierarchical clustering</a></p>
</div>
</div>
<div id="combining-hierarchical-clustering-and-k-means" class="section level1">
<h1><span class="header-section-number">5</span> Combining hierarchical clustering and k-means</h1>
<div id="why" class="section level2">
<h2><span class="header-section-number">5.1</span> Why?</h2>
<p>Recall that, in the <strong>k-means algorithm</strong>, a random set of observations is chosen as the initial centers.</p>
<p>The final <strong>k-means</strong> clustering solution is very sensitive to this initial random selection of cluster centers. The result might be (slightly) different each time you compute k-means.</p>
<p>To avoid this, a solution is to use a <strong>hybrid approach</strong> that combines the <strong>hierarchical clustering</strong> and the <strong>k-means</strong> methods. This process is named <strong>hybrid hierarchical k-means clustering</strong> (hkmeans).</p>
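<p>The sensitivity of standard k-means to its random initialization can be checked with a small experiment. The sketch below runs k-means from a single random start under several seeds and records the total within-cluster sum of squares; the seeds and the object name <code>wss</code> are arbitrary choices made for illustration:</p>
<pre class="r"><code>data(USArrests)
df <- scale(USArrests)
# One random start per seed, so each run depends entirely on its initial centers
wss <- sapply(1:10, function(s) {
  set.seed(s)
  kmeans(df, centers = 4, nstart = 1)$tot.withinss
})
round(wss, 2)   # values can differ between seeds when a run ends in a local optimum</code></pre>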
</div>
<div id="how" class="section level2">
<h2><span class="header-section-number">5.2</span> How?</h2>
<p>The procedure is as follows:</p>
<ol style="list-style-type: decimal">
<li>Compute <strong>hierarchical clustering</strong> and cut the tree into k-clusters</li>
<li>Compute the center (i.e., the mean) of each cluster</li>
<li>Compute k-means by using the set of cluster centers (defined in step 2) as the initial cluster centers</li>
</ol>
<p><span class="notice">Note that the k-means algorithm will improve the initial partitioning generated at step 1 of the algorithm. Hence, the initial partitioning can be slightly different from the final partitioning obtained at step 3.</span></p>
</div>
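<p>Before turning to the <strong>factoextra</strong> functions, the three steps of the procedure can be sketched with base R functions only. This is an illustrative outline under the same settings used in this article (scaled <strong>USArrests</strong> data, k = 4, Ward linkage); the object names are assumptions:</p>
<pre class="r"><code>data(USArrests)
df <- scale(USArrests)
k <- 4
# Step 1: hierarchical clustering, then cut the tree into k groups
hc  <- hclust(dist(df), method = "ward.D2")
grp <- cutree(hc, k = k)
# Step 2: center (mean of each variable) of every group
centers <- aggregate(df, list(grp), mean)[, -1]
# Step 3: k-means started from those centers instead of random ones
km <- kmeans(df, centers = as.matrix(centers))
table(kmeans = km$cluster, hclust = grp)   # initial vs final partition</code></pre>
<p>Because the initial centers are fixed by the tree, repeated runs of this sketch give the same partition without setting a seed.</p>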
<div id="r-codes" class="section level2">
<h2><span class="header-section-number">5.3</span> R codes</h2>
<div id="compute-hierarchical-clustering-and-cut-the-tree-into-k-clusters" class="section level3">
<h3><span class="header-section-number">5.3.1</span> Compute hierarchical clustering and cut the tree into k-clusters:</h3>
<pre class="r"><code>res.hc <- eclust(df, "hclust", k = 4,
                method = "ward.D2", graph = FALSE) 
grp <- res.hc$cluster</code></pre>
</div>
<div id="compute-the-centers-of-clusters-defined-by-hierarchical-clustering" class="section level3">
<h3><span class="header-section-number">5.3.2</span> Compute the centers of clusters defined by hierarchical clustering:</h3>
<p><strong>Cluster centers</strong> are defined as the means of variables in clusters. The function <strong>aggregate()</strong> can be used to compute the mean per group in a data frame.</p>
<pre class="r"><code># Compute cluster centers
clus.centers <- aggregate(df, list(grp), mean)
clus.centers</code></pre>
<pre><code>##   Group.1     Murder    Assault   UrbanPop        Rape
## 1       1  1.5803956  0.9662584 -0.7775109  0.04844071
## 2       2  0.7298036  1.1188219  0.7571799  1.32135653
## 3       3 -0.3621789 -0.3444705  0.3953887 -0.21863180
## 4       4 -1.0782511 -1.1370610 -0.9296640 -1.00344660</code></pre>
<pre class="r"><code># Remove the first column
clus.centers <- clus.centers[, -1]
clus.centers</code></pre>
<pre><code>##       Murder    Assault   UrbanPop        Rape
## 1  1.5803956  0.9662584 -0.7775109  0.04844071
## 2  0.7298036  1.1188219  0.7571799  1.32135653
## 3 -0.3621789 -0.3444705  0.3953887 -0.21863180
## 4 -1.0782511 -1.1370610 -0.9296640 -1.00344660</code></pre>
</div>
<div id="k-means-clustering-using-hierarchical-clustering-defined-cluster-centers" class="section level3">
<h3><span class="header-section-number">5.3.3</span> K-means clustering using hierarchical clustering defined cluster-centers</h3>
<pre class="r"><code>km.res2 <- eclust(df, "kmeans", k = clus.centers, graph = FALSE)
fviz_silhouette(km.res2)</code></pre>
<pre><code>##   cluster size ave.sil.width
## 1       1    8          0.39
## 2       2   13          0.27
## 3       3   16          0.34
## 4       4   13          0.37</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/hierarchical-k-means-clustering-k-means-and-hierarchical-clustering-1.png" width="518.4" /></p>
</div>
<div id="compare-the-results-of-hierarchical-clustering-and-hybrid-approach" class="section level3">
<h3><span class="header-section-number">5.3.4</span> Compare the results of hierarchical clustering and hybrid approach</h3>
<p>The R code below compares the initial clusters defined using only <strong>hierarchical clustering</strong> and the final ones defined using <strong>hierarchical clustering</strong> + <strong>k-means</strong>:</p>
<pre class="r"><code># res.hc$cluster: Initial clusters defined using hierarchical clustering
# km.res2$cluster: Final clusters defined using k-means
table(km.res2$cluster, res.hc$cluster)</code></pre>
<pre><code>##    
##      1  2  3  4
##   1  7  0  1  0
##   2  0 12  1  0
##   3  0  0 16  0
##   4  0  0  1 12</code></pre>
<p>It can be seen that 3 of the observations assigned to cluster 3 by <strong>hierarchical clustering</strong> have been reclassified to clusters 1, 2 and 4 in the final solution defined by k-means clustering.</p>
<p>The difference can be easily visualized using the function <strong>fviz_dend()</strong> [in <strong>factoextra</strong>]. The labels are colored using k-means clusters:</p>
<pre class="r"><code>fviz_dend(res.hc, k = 4, 
          k_colors = c("black", "red",  "blue", "green3"),
          label_cols =  km.res2$cluster[res.hc$order], cex = 0.6)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/hierarchical-k-means-clustering-k-means-hclust-1.png" width="518.4" /></p>
<p><span class="success">It can be seen that the hierarchical clustering result has been improved by the k-means algorithm.</span></p>
</div>
<div id="compare-the-results-of-standard-k-means-clustering-and-hybrid-approach" class="section level3">
<h3><span class="header-section-number">5.3.5</span> Compare the results of standard k-means clustering and hybrid approach</h3>
<pre class="r"><code># Final clusters defined using hierarchical k-means clustering
km.clust <- km.res$cluster

# Standard k-means clustering
set.seed(123)
res.km <- kmeans(df, centers = 4, iter.max = 100)


# comparison
table(km.clust, res.km$cluster)</code></pre>
<pre><code>##         
## km.clust  1  2  3  4
##        1 13  0  0  0
##        2  0 16  0  0
##        3  0  0 13  0
##        4  0  0  0  8</code></pre>
<p><span class="success">In our current example, there was no further improvement of the k-means clustering result by the hybrid approach. An improvement might be observed using another dataset.</span></p>
</div>
</div>
<div id="hkmeans-easy-to-use-function-for-hybrid-hierarchical-k-means-clustering" class="section level2">
<h2><span class="header-section-number">5.4</span> hkmeans(): Easy-to-use function for hybrid hierarchical k-means clustering</h2>
<p>The function <strong>hkmeans()</strong> [in <strong>factoextra</strong>] provides an easy way to compute hybrid hierarchical k-means clustering. The format of the result is similar to the one provided by the standard kmeans() function.</p>
<pre class="r"><code># Compute hierarchical k-means clustering
res.hk <- hkmeans(df, 4)
# Elements returned by hkmeans()
names(res.hk)</code></pre>
<pre><code>##  [1] "cluster"      "centers"      "totss"        "withinss"    
##  [5] "tot.withinss" "betweenss"    "size"         "iter"        
##  [9] "ifault"       "data"         "hclust"</code></pre>
<pre class="r"><code># Print the results
res.hk</code></pre>
<pre><code>## Hierarchical K-means clustering with 4 clusters of sizes 8, 13, 16, 13
## 
## Cluster means:
##       Murder    Assault   UrbanPop        Rape
## 1  1.4118898  0.8743346 -0.8145211  0.01927104
## 2  0.6950701  1.0394414  0.7226370  1.27693964
## 3 -0.4894375 -0.3826001  0.5758298 -0.26165379
## 4 -0.9615407 -1.1066010 -0.9301069 -0.96676331
## 
## Clustering vector:
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              2              2              1              2 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              2              3              3              2              1 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              3              4              2              3              4 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              3              4              1              4              2 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              3              2              4              1              2 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              4              4              2              4              3 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              2              2              1              4              3 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              3              3              3              3              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              4              1              2              3              4 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              3              3              4              4              3 
## 
## Within cluster sum of squares by cluster:
## [1]  8.316061 19.922437 16.212213 11.952463
##  (between_SS / total_SS =  71.2 %)
## 
## Available components:
## 
##  [1] "cluster"      "centers"      "totss"        "withinss"    
##  [5] "tot.withinss" "betweenss"    "size"         "iter"        
##  [9] "ifault"       "data"         "hclust"</code></pre>
<pre class="r"><code># Visualize the tree
fviz_dend(res.hk, cex = 0.6, rect = TRUE)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/hierarchical-k-means-clustering-hkmeans-hierarchical-k-means-clustering-1.png" width="518.4" /></p>
<pre class="r"><code># Visualize the hkmeans final clusters
fviz_cluster(res.hk, frame.type = "norm", frame.level = 0.68)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/hierarchical-k-means-clustering-hkmeans-hierarchical-k-means-clustering-2.png" width="518.4" /></p>
</div>
</div>
<div id="infos" class="section level1">
<h1><span class="header-section-number">6</span> Infos</h1>
<p><span class="warning">This analysis has been performed using <strong>R software</strong> (ver. 3.2.4)</span></p>
</div>

<script>jQuery(document).ready(function () {
    jQuery('#rdoc h1').addClass('wiki_paragraph1');
    jQuery('#rdoc h2').addClass('wiki_paragraph2');
    jQuery('#rdoc h3').addClass('wiki_paragraph3');
    jQuery('#rdoc h4').addClass('wiki_paragraph4');
    });//add phpboost class to header</script>
<style>.content{padding:0px;}</style>
</div><!--end rdoc-->
<!--====================== stop here when you copy to sthda================-->


<!-- END HTML -->]]></description>
			<pubDate>Mon, 14 Nov 2016 10:43:38 +0100</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Assessing clustering tendency: A vital issue - Unsupervised Machine Learning]]></title>
			<link>https://www.sthda.com/english/wiki/assessing-clustering-tendency-a-vital-issue-unsupervised-machine-learning</link>
			<guid>https://www.sthda.com/english/wiki/assessing-clustering-tendency-a-vital-issue-unsupervised-machine-learning</guid>
			<description><![CDATA[<!-- START HTML -->

  <!--====================== start from here when you copy to sthda================-->  
  <div id="rdoc">


<div id="TOC">
<ul>
<li><a href="#required-packages"><span class="toc-section-number">1</span> Required packages</a></li>
<li><a href="#data-preparation"><span class="toc-section-number">2</span> Data preparation</a><ul>
<li><a href="#faithful-dataset"><span class="toc-section-number">2.1</span> faithful dataset</a></li>
<li><a href="#random-uniformly-distributed-dataset"><span class="toc-section-number">2.2</span> Random uniformly distributed dataset</a></li>
</ul></li>
<li><a href="#why-assessing-clustering-tendency"><span class="toc-section-number">3</span> Why assessing clustering tendency?</a></li>
<li><a href="#methods-for-assessing-clustering-tendency"><span class="toc-section-number">4</span> Methods for assessing clustering tendency</a><ul>
<li><a href="#hopkins-statistic"><span class="toc-section-number">4.1</span> Hopkins statistic</a><ul>
<li><a href="#algorithm"><span class="toc-section-number">4.1.1</span> Algorithm</a></li>
<li><a href="#r-function-for-computing-hopkins-statistic"><span class="toc-section-number">4.1.2</span> R function for computing Hopkins statistic</a></li>
</ul></li>
<li><a href="#vat-visual-assessment-of-cluster-tendency"><span class="toc-section-number">4.2</span> VAT: Visual Assessment of cluster Tendency</a><ul>
<li><a href="#vat-algorithm"><span class="toc-section-number">4.2.1</span> VAT Algorithm</a></li>
<li><a href="#r-functions-for-vat"><span class="toc-section-number">4.2.2</span> R functions for VAT</a></li>
</ul></li>
</ul></li>
<li><a href="#a-single-function-for-hopkins-statistic-and-vat"><span class="toc-section-number">5</span> A single function for Hopkins statistic and VAT</a></li>
<li><a href="#infos"><span class="toc-section-number">6</span> Infos</a></li>
</ul>
</div>

<p><br/></p>
<p><strong>Clustering algorithms</strong>, including <a href="https://www.sthda.com/english/english/wiki/partitioning-cluster-analysis-quick-start-guide-unsupervised-machine-learning">partitioning methods</a> (K-means, PAM, CLARA and FANNY) and <a href="https://www.sthda.com/english/english/wiki/hierarchical-clustering-essentials-unsupervised-machine-learning">hierarchical clustering</a>, are used to split the dataset into groups or <strong>clusters</strong> of similar objects.</p>
<p>Before applying any clustering method on the dataset, a natural question is:</p>
<p><span class="question">Does the dataset contain any inherent clusters?</span></p>
<p>A big issue in <strong>unsupervised machine learning</strong> is that <strong>clustering methods</strong> will return clusters even if the data does not contain any clusters. In other words, if you blindly apply a clustering analysis on a dataset, it will divide the data into clusters because that is what it is supposed to do.</p>
<p>Therefore, before choosing a clustering approach, the analyst has to decide whether the dataset contains meaningful clusters (i.e., non-random structures) or not, and, if so, how many clusters there are. This process is known as <strong>assessing the clustering tendency</strong>, or the feasibility of the clustering analysis.</p>
<br/>
<div class="block">
<p>In this chapter:</p>
<ul>
<li>We describe why we should evaluate the <strong>clustering tendency</strong> (i.e., <strong>clusterability</strong>) before applying any cluster analysis on a dataset.</li>
<li>We describe statistical and visual methods for assessing the clustering tendency.</li>
<li>R lab sections containing many examples are also provided for computing clustering tendency and visualizing clusters.</li>
</ul>
</div>
<p><br/></p>
<div id="required-packages" class="section level1">
<h1><span class="header-section-number">1</span> Required packages</h1>
<p>The following R packages are required in this chapter:</p>
<ul>
<li><strong>factoextra</strong> for data visualization</li>
<li><strong>clustertend</strong> for assessing clustering tendency</li>
<li><strong>seriation</strong> for visual assessment of cluster tendency</li>
</ul>
<ol style="list-style-type: decimal">
<li><strong>factoextra</strong> can be installed as follows:</li>
</ol>
<pre class="r"><code>if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/factoextra")</code></pre>
<ol start="2" style="list-style-type: decimal">
<li>Install <strong>clustertend</strong> and <strong>seriation</strong>:</li>
</ol>
<pre class="r"><code>install.packages("clustertend")
install.packages("seriation")</code></pre>
<ol start="3" style="list-style-type: decimal">
<li>Load required packages:</li>
</ol>
<pre class="r"><code>library(factoextra)
library(clustertend)
library(seriation)</code></pre>
</div>
<div id="data-preparation" class="section level1">
<h1><span class="header-section-number">2</span> Data preparation</h1>
<p>We’ll use two datasets: the built-in R dataset <em>faithful</em> and a simulated dataset.</p>
<div id="faithful-dataset" class="section level2">
<h2><span class="header-section-number">2.1</span> faithful dataset</h2>
<p>The faithful dataset contains the waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park (Wyoming, USA).</p>
<pre class="r"><code># Load the data
data("faithful")
df <- faithful
head(df)</code></pre>
<pre><code>##   eruptions waiting
## 1     3.600      79
## 2     1.800      54
## 3     3.333      74
## 4     2.283      62
## 5     4.533      85
## 6     2.883      55</code></pre>
<p>An illustration of the data can be drawn using the <strong>ggplot2</strong> package as follows:</p>
<pre class="r"><code>library("ggplot2")
ggplot(df, aes(x=eruptions, y=waiting)) +
  geom_point() +  # Scatter plot
  geom_density_2d() # Add 2d density estimation</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/clustering-tendency-scatter-plot-faithful-cluster-tendency-1.png" width="518.4" /></p>
</div>
<div id="random-uniformly-distributed-dataset" class="section level2">
<h2><span class="header-section-number">2.2</span> Random uniformly distributed dataset</h2>
<p>The R code below generates a random uniform dataset with the same dimensions as the faithful dataset. The function <strong>runif(n, min, max)</strong> generates n values from a uniform distribution on the interval from min to max.</p>
<pre class="r"><code># Generate random dataset
set.seed(123)
n <- nrow(df)

random_df <- data.frame(
  x = runif(nrow(df), min(df$eruptions), max(df$eruptions)),
  y = runif(nrow(df), min(df$waiting), max(df$waiting)))

# Plot the data
ggplot(random_df, aes(x, y)) + geom_point()</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/clustering-tendency-uniform-data-1.png" width="518.4" /></p>
<br/>
<div class="block">
<p>Note that, for a given real dataset, random uniform data can be generated in a single function call as follows:</p>
<pre class="r"><code>random_df <- apply(df, 2, 
                function(x, n){runif(n, min(x), (max(x)))}, n)</code></pre>
</div>
<p><br/></p>
</div>
</div>
<div id="why-assessing-clustering-tendency" class="section level1">
<h1><span class="header-section-number">3</span> Why assessing clustering tendency?</h1>
<p>As shown above, we know that the <strong>faithful</strong> dataset contains 2 real clusters. However, the randomly generated dataset doesn’t contain any meaningful clusters.</p>
<p>The R code below computes <a href="https://www.sthda.com/english/english/wiki/partitioning-cluster-analysis-quick-start-guide-unsupervised-machine-learning">k-means clustering</a> and/or <a href="https://www.sthda.com/english/english/wiki/hierarchical-clustering-essentials-unsupervised-machine-learning"><strong>hierarchical clustering</strong></a> on the two datasets. The functions <strong>fviz_cluster()</strong> and <strong>fviz_dend()</strong> [in <strong>factoextra</strong>] will be used to visualize the results.</p>
<pre class="r"><code>library(factoextra)
set.seed(123)
# K-means on faithful dataset
km.res1 <- kmeans(df, 2)
fviz_cluster(list(data = df, cluster = km.res1$cluster),
             frame.type = "norm", geom = "point", stand = FALSE)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/clustering-tendency-k-means-1.png" width="518.4" /></p>
<pre class="r"><code># K-means on the random dataset
km.res2 <- kmeans(random_df, 2)
fviz_cluster(list(data = random_df, cluster = km.res2$cluster),
             frame.type = "norm", geom = "point", stand = FALSE)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/clustering-tendency-k-means-2.png" width="518.4" /></p>
<pre class="r"><code># Hierarchical clustering on the random dataset
fviz_dend(hclust(dist(random_df)), k = 2,  cex = 0.5)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/clustering-tendency-k-means-3.png" width="518.4" /></p>
<p><span class="success">It can be seen that the <strong>k-means algorithm</strong> and <strong>hierarchical clustering</strong> impose a classification on the random uniformly distributed dataset even though there are no meaningful clusters present in it.</span></p>
<p><span class="warning">Clustering tendency assessment methods are used to avoid this issue.</span></p>
</div>
<div id="methods-for-assessing-clustering-tendency" class="section level1">
<h1><span class="header-section-number">4</span> Methods for assessing clustering tendency</h1>
<p><strong>Clustering tendency assessment</strong> determines whether a given dataset contains meaningful clusters (i.e., <strong>non-random structure</strong>).</p>
<p>In this section, we’ll describe two methods for determining the clustering tendency: i) a statistical method (<strong>Hopkins statistic</strong>) and ii) a visual method (<strong>Visual Assessment of cluster Tendency</strong> (VAT) algorithm).</p>
<div id="hopkins-statistic" class="section level2">
<h2><span class="header-section-number">4.1</span> Hopkins statistic</h2>
<p><strong>Hopkins statistic</strong> is used to assess the <strong>clustering tendency</strong> of a dataset by measuring the probability that a given dataset is generated by a uniform data distribution. In other words it tests the <strong>spatial randomness</strong> of the data.</p>
<div id="algorithm" class="section level3">
<h3><span class="header-section-number">4.1.1</span> Algorithm</h3>
<p>Let D be a real dataset. The <strong>Hopkins statistic</strong> can be calculated as follows:</p>
<br/>
<div class="block">
<ol style="list-style-type: decimal">
<li>Sample uniformly <span class="math">\(n\)</span> points (<span class="math">\(p_1\)</span>,…, <span class="math">\(p_n\)</span>) from D.</li>
<li>For each sampled point <span class="math">\(p_i\)</span>, find its nearest neighbor <span class="math">\(p_j\)</span> in D; then compute the distance between <span class="math">\(p_i\)</span> and <span class="math">\(p_j\)</span> and denote it as <span class="math">\(x_i = dist(p_i, p_j)\)</span></li>
<li>Generate a simulated dataset (<span class="math">\(random_D\)</span>) drawn from a random uniform distribution with <span class="math">\(n\)</span> points (<span class="math">\(q_1\)</span>,…, <span class="math">\(q_n\)</span>) and the same variation as the original real dataset D.</li>
<li>For each point <span class="math">\(q_i \in random_D\)</span>, find its nearest neighbor <span class="math">\(q_j\)</span> in D; then compute the distance between <span class="math">\(q_i\)</span> and <span class="math">\(q_j\)</span> and denote it as <span class="math">\(y_i = dist(q_i, q_j)\)</span></li>
<li>Calculate the <strong>Hopkins statistic</strong> (H) as the sum of the nearest neighbor distances in the real dataset divided by the sum of the nearest neighbor distances in the real and in the simulated datasets.</li>
</ol>
<p>The formula is defined as follows:</p>
<span class="math">\[H = \frac{\sum\limits_{i=1}^nx_i}{\sum\limits_{i=1}^nx_i + \sum\limits_{i=1}^ny_i}\]</span>
</div>
<p><br/></p>
<p>A value of H about 0.5 means that <span class="math">\(\sum\limits_{i=1}^ny_i\)</span> and <span class="math">\(\sum\limits_{i=1}^nx_i\)</span> are close to each other, and thus the data D is uniformly distributed.</p>
<p>The null and the alternative hypotheses are defined as follow:</p>
<ul>
<li><strong>Null hypothesis</strong>: the dataset D is uniformly distributed (i.e., no meaningful clusters)</li>
<li><strong>Alternative hypothesis</strong>: the dataset D is not uniformly distributed (i.e., contains meaningful clusters)</li>
</ul>
<p><span class="success"> If the value of the <strong>Hopkins statistic</strong> is close to zero, then we can reject the null hypothesis and conclude that the dataset D is significantly clusterable.</span></p>
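<p>The five steps can be sketched in base R as follows. This is an illustrative implementation, not the source code of any package: the function name <code>hopkins_sketch()</code> and its arguments are hypothetical, and it follows the convention used in this chapter, in which the nearest neighbor distances of the real data are in the numerator so that values close to zero indicate clusterable data:</p>
<pre class="r"><code># Illustrative Hopkins statistic (hypothetical helper, base R only)
hopkins_sketch <- function(data, n) {
  X <- as.matrix(data)
  N <- nrow(X)
  stopifnot(n >= 1, n < N)

  # Step 1: sample n points from the real dataset D
  idx <- sample(N, n)

  # Step 2: distance from each sampled point to its nearest other point in D
  D <- as.matrix(dist(X))
  diag(D) <- Inf                       # a point is not its own neighbor
  x <- apply(D[idx, , drop = FALSE], 1, min)

  # Step 3: n uniform points within the observed range of each variable
  Q <- apply(X, 2, function(v) runif(n, min(v), max(v)))
  Q <- matrix(Q, nrow = n)             # keeps a matrix even when n = 1

  # Step 4: distance from each simulated point to its nearest point in D
  y <- apply(Q, 1, function(q) sqrt(min(colSums((t(X) - q)^2))))

  # Step 5: real-data distances relative to the total
  sum(x) / (sum(x) + sum(y))
}

set.seed(123)
h_faithful <- hopkins_sketch(faithful, n = 50)
h_faithful</code></pre>
<p>Because both the sampled points and the simulated points are random, the value changes from run to run unless a seed is set, and it will not match the output of a packaged implementation exactly.</p>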
</div>
<div id="r-function-for-computing-hopkins-statistic" class="section level3">
<h3><span class="header-section-number">4.1.2</span> R function for computing Hopkins statistic</h3>
<p>The function <strong>hopkins()</strong> [in <strong>clustertend</strong> package] can be used to statistically evaluate clustering tendency in <strong>R</strong>. The simplified format is:</p>
<pre class="r"><code>hopkins(data, n, byrow = F, header = F)</code></pre>
<br/>
<div class="block">
<ul>
<li><strong>data</strong>: a data frame or matrix</li>
<li><strong>n</strong>: the number of points to be selected from the data</li>
<li><strong>byrow</strong>: logical value. If FALSE (the default), the variables are taken by columns; otherwise the variables are taken by rows</li>
<li><strong>header</strong>: logical value. If FALSE (the default), the first column (or row) will be deleted in the calculation</li>
</ul>
</div>
<p><br/></p>
<pre class="r"><code>library(clustertend)
# Compute Hopkins statistic for faithful dataset
set.seed(123)
hopkins(faithful, n = nrow(faithful)-1)</code></pre>
<pre><code>## $H
## [1] 0.1588201</code></pre>
<pre class="r"><code># Compute Hopkins statistic for a random dataset
set.seed(123)
hopkins(random_df, n = nrow(random_df)-1)</code></pre>
<pre><code>## $H
## [1] 0.5388899</code></pre>
<p><span class="success">It can be seen that the <strong>faithful</strong> dataset is highly clusterable (<strong>H</strong> = 0.16, which is far below the threshold of 0.5). However, the <strong>random_df</strong> dataset is not clusterable (<span class="math">\(H = 0.54\)</span>).</span></p>
</div>
</div>
<div id="vat-visual-assessment-of-cluster-tendency" class="section level2">
<h2><span class="header-section-number">4.2</span> VAT: Visual Assessment of cluster Tendency</h2>
<p>The <strong>visual assessment of cluster tendency</strong> (VAT) was originally described by Bezdek and Hathaway (2002). This approach can be used to visually inspect the clustering tendency of the dataset.</p>
<div id="vat-algorithm" class="section level3">
<h3><span class="header-section-number">4.2.1</span> VAT Algorithm</h3>
<p>The VAT algorithm is as follows:</p>
<br/>
<div class="block">
<ol style="list-style-type: decimal">
<li>Compute the dissimilarity matrix (DM) between the objects in the dataset using <a href="https://www.sthda.com/english/english/wiki/clarifying-distance-measures-unsupervised-machine-learning"><strong>Euclidean distance measure</strong></a></li>
<li>Reorder the DM so that similar objects are close to one another. This process creates an <strong>ordered dissimilarity matrix</strong> (ODM)</li>
<li>The ODM is displayed as an <strong>ordered dissimilarity image</strong> (ODI), which is the visual output of VAT</li>
</ol>
</div>
<p><br/></p>
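<p>The three steps can be sketched with base R alone. Note that the reordering below uses the leaf order of a hierarchical clustering as a simple stand-in for the reordering procedure of the original VAT algorithm; the average linkage and the object names (<code>dm</code>, <code>ord</code>, <code>odm</code>) are assumptions made for illustration:</p>
<pre class="r"><code># Step 1: dissimilarity matrix (Euclidean distance on scaled data)
X <- scale(faithful)
dm <- as.matrix(dist(X))

# Step 2: reorder so that similar objects are next to each other
# (hclust leaf order used here in place of the original VAT ordering)
ord <- hclust(dist(X), method = "average")$order
odm <- dm[ord, ord]

# Step 3: display the ordered dissimilarity matrix as an image
# (dark = small dissimilarity, light = large dissimilarity)
image(odm, col = gray.colors(64, start = 0, end = 1),
      axes = FALSE, main = "Ordered dissimilarity image (sketch)")</code></pre>
<p>Dark square blocks along the diagonal of the resulting image correspond to groups of mutually similar observations.</p>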
</div>
<div id="r-functions-for-vat" class="section level3">
<h3><span class="header-section-number">4.2.2</span> R functions for VAT</h3>
<p>We start by <a href="https://www.sthda.com/english/english/wiki/clarifying-distance-measures-unsupervised-machine-learning">scaling the data</a> using the function <strong>scale()</strong>. Next, we compute the dissimilarity matrix between observations using the function <strong>dist()</strong>. Finally, the function <strong>dissplot()</strong> [in the package <strong>seriation</strong>] is used to display an <strong>ordered dissimilarity image</strong>.</p>
<p>The R code below applies the VAT algorithm to the <strong>faithful</strong> dataset:</p>
<pre class="r"><code>library("seriation")
# faithful data: ordered dissimilarity image
df_scaled <- scale(faithful)
df_dist <- dist(df_scaled) 
dissplot(df_dist)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/clustering-tendency-visual-assessment-cluster-tendency-1.png" width="518.4" /></p>
<p><span class="notice"> The gray level is proportional to the value of the dissimilarity between observations: pure black if <span class="math">\(dist(x_i, x_j) = 0\)</span> and pure white if <span class="math">\(dist(x_i, x_j) = 1\)</span>. Objects belonging to the same cluster are displayed in consecutive order. </span></p>
<p>The VAT detects the clustering tendency in a visual form by counting the number of square shaped dark blocks along the diagonal in a VAT image.</p>
<p><span class="success">The figure above suggests two clusters represented by two well-formed black blocks.</span></p>
<p>The same analysis can be done with the random dataset:</p>
<pre class="r"><code># Random data: ordered dissimilarity image
random_df_scaled <- scale(random_df)
random_df_dist <- dist(random_df_scaled) 
dissplot(random_df_dist)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/clustering-tendency-visual-assessment-cluster-tendency-random-1.png" width="518.4" /></p>
<p><span class="success">It can be seen that the <strong>random_df</strong> dataset doesn’t contain any evident clusters.</span></p>
<p>Now, we can perform k-means on the faithful dataset and add cluster labels to the dissimilarity plot:</p>
<pre class="r"><code>set.seed(123)
km.res <- kmeans(scale(faithful), 2)
dissplot(df_dist, labels = km.res$cluster)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/clustering-tendency-dissimilarity-plot-k-means-1.png" width="518.4" /></p>
<p><span class="warning">After showing that the data is clusterable, the next step is to determine the optimal number of clusters in the data. This will be described in the next chapter.</span></p>
</div>
</div>
</div>
<div id="a-single-function-for-hopkins-statistic-and-vat" class="section level1">
<h1><span class="header-section-number">5</span> A single function for Hopkins statistic and VAT</h1>
<p>The function <strong>get_clust_tendency()</strong> [in <strong>factoextra</strong> package] computes the Hopkins statistic and also provides an ordered dissimilarity image drawn with ggplot2, in a single function call. The dissimilarity matrix is ordered using hierarchical clustering.</p>
<pre class="r"><code># Cluster tendency
clustend <- get_clust_tendency(scale(faithful), 100)
# Hopkins statistic
clustend$hopkins_stat</code></pre>
<pre><code>## [1] 0.1482683</code></pre>
<pre class="r"><code># Customize the plot
clustend$plot + 
  scale_fill_gradient(low = "steelblue", high = "white")</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/clustering-tendency-ggplot2-factoextra-cluster-tendency-1.png" width="518.4" /></p>
</div>
<div id="infos" class="section level1">
<h1><span class="header-section-number">6</span> Infos</h1>
<p><span class="warning">This analysis has been performed using <strong>R software</strong> (ver. 3.2.4)</span></p>
</div>

<script>jQuery(document).ready(function () {
    jQuery('#rdoc h1').addClass('wiki_paragraph1');
    jQuery('#rdoc h2').addClass('wiki_paragraph2');
    jQuery('#rdoc h3').addClass('wiki_paragraph3');
    jQuery('#rdoc h4').addClass('wiki_paragraph4');
    });//add phpboost class to header</script>
<style>.content{padding:0px;}</style>
</div><!--end rdoc-->
<!--====================== stop here when you copy to sthda================-->
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
  (function () {
    var script = document.createElement("script");
    script.type = "text/javascript";
    script.src  = "https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
    document.getElementsByTagName("head")[0].appendChild(script);
  })();
</script>


<!-- END HTML -->]]></description>
			<pubDate>Fri, 28 Oct 2016 06:29:34 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Cluster Analysis in R - Unsupervised machine learning]]></title>
			<link>https://www.sthda.com/english/wiki/cluster-analysis-in-r-unsupervised-machine-learning</link>
			<guid>https://www.sthda.com/english/wiki/cluster-analysis-in-r-unsupervised-machine-learning</guid>
			<description><![CDATA[<!-- START HTML -->

  <!--====================== start from here when you copy to sthda================-->  
  <div id="rdoc">


<div id="TOC">
<ul>
<li><a href="#introduction"><span class="toc-section-number">1</span> Introduction</a><ul>
<li><a href="#quick-overview-of-machine-learning"><span class="toc-section-number">1.1</span> Quick overview of machine learning</a></li>
<li><a href="#applications-of-unsupervised-machine-learning"><span class="toc-section-number">1.2</span> Applications of unsupervised machine learning</a></li>
</ul></li>
<li><a href="#how-this-document-is-organized"><span class="toc-section-number">2</span> How this document is organized?</a></li>
<li><a href="#data-preparation"><span class="toc-section-number">3</span> Data preparation</a></li>
<li><a href="#installing-and-loading-required-r-packages"><span class="toc-section-number">4</span> Installing and loading required R packages</a></li>
<li><a href="#clarifying-distance-measures"><span class="toc-section-number">5</span> Clarifying distance measures</a></li>
<li><a href="#basic-clustering-methods"><span class="toc-section-number">6</span> Basic clustering methods</a><ul>
<li><a href="#partitioning-clustering"><span class="toc-section-number">6.1</span> Partitioning clustering</a></li>
<li><a href="#hierarchical-clustering"><span class="toc-section-number">6.2</span> Hierarchical clustering</a></li>
</ul></li>
<li><a href="#clustering-validation"><span class="toc-section-number">7</span> Clustering validation</a><ul>
<li><a href="#assessing-clustering-tendency"><span class="toc-section-number">7.1</span> Assessing clustering tendency</a></li>
<li><a href="#determining-the-optimal-number-of-clusters"><span class="toc-section-number">7.2</span> Determining the optimal number of clusters</a></li>
<li><a href="#clustering-validation-statistics"><span class="toc-section-number">7.3</span> Clustering validation statistics</a></li>
<li><a href="#how-to-choose-the-appropriate-clustering-algorithms-for-your-data"><span class="toc-section-number">7.4</span> How to choose the appropriate clustering algorithms for your data?</a></li>
<li><a href="#how-to-compute-p-value-for-hierarchical-clustering-in-r"><span class="toc-section-number">7.5</span> How to compute p-value for hierarchical clustering in R?</a></li>
</ul></li>
<li><a href="#the-guide-for-clustering-analysis-on-a-real-data-4-steps-you-should-know"><span class="toc-section-number">8</span> The guide for clustering analysis on a real data: 4 steps you should know</a></li>
<li><a href="#visualization-of-clustering-results"><span class="toc-section-number">9</span> Visualization of clustering results</a><ul>
<li><a href="#visual-enhancement-of-clustering-analysis"><span class="toc-section-number">9.1</span> Visual enhancement of clustering analysis</a></li>
<li><a href="#beautiful-dendrogram-visualizations"><span class="toc-section-number">9.2</span> Beautiful dendrogram visualizations</a></li>
<li><a href="#static-and-interactive-heatmap"><span class="toc-section-number">9.3</span> Static and Interactive Heatmap</a></li>
</ul></li>
<li><a href="#advanced-clustering-methods"><span class="toc-section-number">10</span> Advanced clustering methods</a><ul>
<li><a href="#fuzzy-clustering-analysis"><span class="toc-section-number">10.1</span> Fuzzy clustering analysis</a></li>
<li><a href="#model-based-clustering"><span class="toc-section-number">10.2</span> Model-based clustering</a></li>
<li><a href="#dbscan-density-based-clustering"><span class="toc-section-number">10.3</span> DBSCAN: Density-based clustering</a></li>
<li><a href="#hybrid-clustering-methods"><span class="toc-section-number">10.4</span> Hybrid clustering methods</a></li>
</ul></li>
<li><a href="#infos"><span class="toc-section-number">11</span> Infos</a></li>
</ul>
</div>

<style>#rdoc .course_material a{font-size:1.5em;} #rdoc .readmore a{font-size:1em;}</style>
<p><br/></p>
<p><img src="https://www.sthda.com/english/sthda/RDoc/images/clustering-toc.png" alt="unsupervised machine learning R" /></p>
<div id="introduction" class="section level1">
<h1><span class="header-section-number">1</span> Introduction</h1>
<div id="quick-overview-of-machine-learning" class="section level2">
<h2><span class="header-section-number">1.1</span> Quick overview of machine learning</h2>
<p>Huge amounts of multidimensional data have been collected in various fields such as marketing, bio-medicine and geo-spatial analysis. Mining knowledge from these <strong>big data</strong> has become a highly demanding field, because analyzing such volumes of data far exceeds human ability. <strong>Unsupervised Machine Learning</strong> or <strong>clustering</strong> is one of the important data mining methods for discovering knowledge in <strong>multidimensional data</strong>.</p>
<p><strong>Machine learning</strong> (<strong>ML</strong>) is divided into two different fields:</p>
<ul>
<li><strong>Supervised ML</strong>, defined as a set of tools used for prediction (linear model, logistic regression, linear discriminant analysis, classification trees, support vector machines and more)</li>
<li><strong>Unsupervised ML</strong>, also known as clustering, is an <strong>exploratory data analysis technique</strong> used for identifying groups (i.e., clusters) in the data set of interest. Each group contains observations with a similar profile according to a specific criterion. <strong>Similarity</strong> between observations is defined using inter-observation distance measures, including <strong>Euclidean</strong> and <strong>correlation-based</strong> distance measures.</li>
</ul>
<p><span class="success"> This document describes the use of <strong>unsupervised machine learning</strong> approaches, including <strong>Principal Component Analysis (PCA)</strong> and <strong>clustering methods</strong>. </span></p>
<ul>
<li><a href="https://www.sthda.com/english/english/wiki/factominer-and-factoextra-principal-component-analysis-visualization-r-software-and-data-mining"><strong>Principal Component Analysis</strong></a> (PCA) is a dimension reduction technique applied for simplifying the data and for visualizing the most important information in the data set</li>
<li><strong>Clustering</strong> is applied for identifying groups (i.e., clusters) among the observations. Clustering can be subdivided into five general strategies:
<ul>
<li><strong>Partitioning methods</strong></li>
<li><strong>Hierarchical clustering</strong></li>
<li><strong>Fuzzy clustering</strong></li>
<li><strong>Density-based clustering</strong></li>
<li><strong>Model-based clustering</strong></li>
</ul></li>
</ul>
<p><span class="notice">Note that it’s possible to cluster both observations (i.e., samples or individuals) and features (i.e., variables). Observations can be clustered on the basis of variables, and variables can be clustered on the basis of observations.</span></p>
</div>
<div id="applications-of-unsupervised-machine-learning" class="section level2">
<h2><span class="header-section-number">1.2</span> Applications of unsupervised machine learning</h2>
<p><strong>Unsupervised ML</strong> is popular in many fields, including:</p>
<ul>
<li>In <strong>cancer research</strong>, for classifying patients into subgroups according to their gene expression profile. This can be useful for identifying the molecular profile of patients with a good or bad prognosis, as well as for understanding the disease.</li>
<li>In <strong>marketing</strong> for <strong>market segmentation</strong> by identifying subgroups of customers with similar profiles and who might be receptive to a particular form of advertising.</li>
<li><strong>City-planning</strong>: for identifying groups of houses according to their type, value and location.</li>
</ul>
</div>
</div>
<div id="how-this-document-is-organized" class="section level1">
<h1><span class="header-section-number">2</span> How this document is organized?</h1>
<p>Here,</p>
<ul>
<li>we start by describing the two <strong>standard clustering strategies</strong> [<strong>partitioning methods</strong> (k-means, PAM, CLARA) and <strong>hierarchical clustering</strong>] as well as how to assess the quality of a clustering analysis.
</li>
<li>next, we provide a step-by-step <strong>guide for clustering analysis</strong> and an <strong>R package</strong>, named <a href="https://www.sthda.com/english/english/rpkgs/factoextra/">factoextra</a>, for ggplot2-based elegant clustering visualization.</li>
<li>finally, we describe advanced clustering approaches for finding patterns of any shape in large data sets with noise and outliers.</li>
</ul>
<br/>
<div class="block">
<ol style="list-style-type: decimal">
<li><p><strong>Data preparation</strong></p></li>
<li><p><strong>Installing and loading required R packages</strong></p></li>
<li><p><a href="https://www.sthda.com/english/english/wiki/clarifying-distance-measures-unsupervised-machine-learning">Clarifying distance measures</a></p></li>
<li><strong>Basic clustering methods</strong>
</li>
</ol>
<ul>
<li><a href="https://www.sthda.com/english/english/wiki/partitioning-cluster-analysis-quick-start-guide-unsupervised-machine-learning">Partitioning Cluster Analysis</a></li>
<li><a href="https://www.sthda.com/english/english/wiki/hierarchical-clustering-essentials-unsupervised-machine-learning">Hierarchical Clustering Essentials</a></li>
<li><strong>Evaluation of clustering</strong>
<ul>
<li><a href="https://www.sthda.com/english/english/wiki/assessing-clustering-tendency-a-vital-issue-unsupervised-machine-learning">Assessing Clustering Tendency</a></li>
<li><a href="https://www.sthda.com/english/english/wiki/determining-the-optimal-number-of-clusters-3-must-known-methods-unsupervised-machine-learning">Determining the Optimal Number of Clusters</a></li>
<li><a href="https://www.sthda.com/english/english/wiki/clustering-validation-statistics-4-vital-things-everyone-should-know-unsupervised-machine-learning">Clustering Validation Statistics</a></li>
<li><a href="https://www.sthda.com/english/english/wiki/how-to-choose-the-appropriate-clustering-algorithms-for-your-data-unsupervised-machine-learning">Compare Clustering Algorithms: How to Choose the Appropriate Clustering Algorithms for your Data?</a></li>
<li><a href="https://www.sthda.com/english/english/wiki/how-to-compute-p-value-for-hierarchical-clustering-in-r-unsupervised-machine-learning">How to Compute p-value for Hierarchical Clustering?</a></li>
</ul></li>
</ul>
<ol start="5" style="list-style-type: decimal">
<li><p><a href="https://www.sthda.com/english/english/wiki/the-guide-for-clustering-analysis-on-a-real-data-4-steps-you-should-know-unsupervised-machine-learning">The guide for clustering analysis on a real data: 4 steps you should know?</a></p></li>
<li><strong>Elegant Clustering Visualization</strong>
</li>
</ol>
<ul>
<li><a href="https://www.sthda.com/english/english/wiki/visual-enhancement-of-clustering-analysis-unsupervised-machine-learning">Visual enhancement of clustering analysis</a></li>
<li><a href="https://www.sthda.com/english/english/wiki/beautiful-dendrogram-visualizations-in-r-5-must-known-methods-unsupervised-machine-learning">Beautiful dendrogram visualizations in R: 5+ must known methods</a></li>
<li><a href="https://www.sthda.com/english/english/wiki/static-and-interactive-heatmap-in-r-unsupervised-machine-learning">Static and Interactive Heatmap in R</a></li>
</ul>
<ol start="7" style="list-style-type: decimal">
<li><strong>Advanced Clustering Methods</strong>
<ul>
<li><a href="https://www.sthda.com/english/english/wiki/fuzzy-clustering-analysis-unsupervised-machine-learning">Fuzzy clustering analysis</a></li>
<li><a href="https://www.sthda.com/english/english/wiki/model-based-clustering-unsupervised-machine-learning">Model-Based Clustering</a></li>
<li><a href="https://www.sthda.com/english/english/wiki/dbscan-density-based-clustering-for-discovering-clusters-in-large-datasets-with-noise-unsupervised-machine-learning">DBSCAN: density-based clustering for discovering clusters in large datasets with noise</a></li>
<li><a href="https://www.sthda.com/english/english/wiki/hybrid-hierarchical-k-means-clustering-for-optimizing-clustering-outputs-unsupervised-machine-learning">Hybrid hierarchical k-means clustering for optimizing clustering outputs - Hybrid approach (1/1)</a></li>
<li><a href="https://www.sthda.com/english/english/wiki/hcpc-hierarchical-clustering-on-principal-components-hybrid-approach-2-2-unsupervised-machine-learning">HCPC: Hierarchical clustering on principal components - Hybrid approach (2/2)</a></li>
</ul></li>
</ol>
<ul>
<li>Clustering on categorical variables: CA, MCA –> HCPC (coming soon)</li>
</ul>
</div>
<p><br/></p>
<p><a href = "http://eepurl.com/bZSqBr" target = "_blank"><img src="https://www.sthda.com/english/sthda/RDoc/images/clustering_cover.png" alt="clustering book" /></a></p>
<p><span class="success">To be published late in 2016. Subscribe to our mailing list at: <a href = "http://eepurl.com/bZSqBr" target = "_blank">STHDA mailing list</a>. You will be notified about this book.</span></p>
<p><span class="error">This work is licensed under the <a href="http://creativecommons.org/licenses/by-nc-nd/3.0/us/">Creative Commons Attribution-NonCommercial-NoDerivs 3.0</a> United States License.</span></p>
</div>
<div id="data-preparation" class="section level1">
<h1><span class="header-section-number">3</span> Data preparation</h1>
<p>The built-in R dataset <a href="https://www.sthda.com/english/english/wiki/r-built-in-data-sets">USArrests</a> is used as demo data. It is prepared in two steps:</p>
<ul>
<li>Remove missing data</li>
<li>Scale variables to make them comparable</li>
</ul>
<pre class="r"><code># Load data
data("USArrests")
my_data <- USArrests

# Remove any missing values (i.e., NA values, for "not available")
my_data <- na.omit(my_data)

# Scale variables
my_data <- scale(my_data)

# View the first 3 rows
head(my_data, n = 3)</code></pre>
<pre><code>##             Murder   Assault   UrbanPop         Rape
## Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska  0.50786248 1.1068225 -1.2117642  2.484202941
## Arizona 0.07163341 1.4788032  0.9989801  1.042878388</code></pre>
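<p>As a rough check of what the scaling step does, the sketch below reproduces <strong>scale</strong>() by hand: each column is centered on its mean and divided by its standard deviation. The object names (manual, auto) are illustrative only.</p>
<pre class="r"><code># Standardize each column by hand: (value - column mean) / column sd
data("USArrests")
manual <- sapply(USArrests, function(col) (col - mean(col)) / sd(col))
rownames(manual) <- rownames(USArrests)

# Compare with scale(); its extra attributes (center, scale) are ignored here
auto <- scale(USArrests)
all.equal(manual, auto, check.attributes = FALSE)

# After scaling, every column has mean 0 and standard deviation 1
round(colMeans(auto), 10)
apply(auto, 2, sd)</code></pre>
<p>Scaling matters because distance measures are sensitive to units: without it, a variable with large values, such as Assault, would dominate the distances.</p>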
</div>
<div id="installing-and-loading-required-r-packages" class="section level1">
<h1><span class="header-section-number">4</span> Installing and loading required R packages</h1>
<ol style="list-style-type: decimal">
<li><strong>Install required packages</strong></li>
</ol>
<br/>
<div class="block">
<ul>
<li><strong>cluster</strong>: for computing clustering</li>
<li><strong>factoextra</strong>: for elegant ggplot2-based data visualization. See the online documentation at: <a href="https://www.sthda.com/english/english/rpkgs/factoextra/" class="uri">https://www.sthda.com/english/rpkgs/factoextra/</a></li>
</ul>
</div>
<p><br/></p>
<pre class="r"><code># Install factoextra
install.packages("factoextra")

# Install cluster package
install.packages("cluster")</code></pre>
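<p>If some of the packages may already be installed, a small sketch like the following installs only the missing ones. The variable names (pkgs, to_install) are illustrative assumptions.</p>
<pre class="r"><code># Packages used in this document
pkgs <- c("cluster", "factoextra")

# Keep only the packages that cannot be found in the current library
to_install <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]

# Install the missing packages, if any
if (length(to_install) > 0) install.packages(to_install)</code></pre>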
<ol start="2" style="list-style-type: decimal">
<li><strong>Loading required packages</strong></li>
</ol>
<pre class="r"><code>library("cluster")
library("factoextra")</code></pre>
</div>
<div id="clarifying-distance-measures" class="section level1">
<h1><span class="header-section-number">5</span> Clarifying distance measures</h1>
<br/>
<div class="block">
The classification of observations into groups requires some method for measuring the distance or the (dis)similarity between the observations.
</div>
<p><br/></p>
<p>In this chapter, we cover the <strong>common distance measures</strong> used for assessing similarity between observations. R code for computing and visualizing pairwise distances between observations is also provided.</p>
<p><br/></p>
<p><span class="question">How this chapter is organized?</span></p>
<ul>
<li>Methods for measuring distances</li>
<li>Distances and scaling</li>
<li>Data preparation</li>
<li>R functions for computing distances
<ul>
<li>The standard dist() function</li>
<li>Correlation based distance measures</li>
<li>The function daisy() in cluster package</li>
</ul></li>
<li>Visualizing distance matrices</li>
</ul>
<p><br/>
<span class="success readmore">Read more: <a href="https://www.sthda.com/english/english/wiki/clarifying-distance-measures-unsupervised-machine-learning"><i class="fa fa-play"></i> Clarifying distance measures</a>.</span></p>
<p>It’s simple to compute and visualize a distance matrix using the functions <a href="https://www.sthda.com/english/english/rpkgs/factoextra/dist.html"><strong>get_dist</strong>() and <strong>fviz_dist</strong>()</a> in the <strong>factoextra</strong> R package:</p>
<ul>
<li><p><strong>get_dist</strong>(): for computing a distance matrix between the rows of a data matrix. Compared to the standard dist() function, it supports correlation-based distance measures including “pearson”, “kendall” and “spearman” methods.</p></li>
<li><p><strong>fviz_dist</strong>(): for visualizing a distance matrix</p></li>
</ul>
<pre class="r"><code>res.dist <- get_dist(USArrests, stand = TRUE, method = "pearson")

fviz_dist(res.dist, 
   gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))</code></pre>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/cluster-analysis-in-r-distance-matrix-1.png" alt="Clustering - Unsupervised Machine Learning" width="518.4" />
<p class="caption">
Clustering - Unsupervised Machine Learning
</p>
</div>
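<p>For comparison, here is a base R sketch of the two families of distances mentioned above. It assumes the common definition of a correlation-based distance, d = 1 - r, where r is the Pearson correlation between two observations; the object names are illustrative.</p>
<pre class="r"><code># Standardize the variables first
x <- scale(USArrests)

# Euclidean distance between observations (rows)
d.euc <- dist(x, method = "euclidean")

# Correlation-based distance: cor() works on columns, hence the transpose
res.cor <- cor(t(x), method = "pearson")
d.cor <- as.dist(1 - res.cor)

# Inspect the first 3 observations of each distance matrix
round(as.matrix(d.euc)[1:3, 1:3], 2)
round(as.matrix(d.cor)[1:3, 1:3], 2)</code></pre>
<p>With this definition, the correlation-based distance ranges from 0 (perfectly correlated profiles) to 2 (perfectly anti-correlated profiles), regardless of the magnitude of the values.</p>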
<p><span class="success readmore">Read more: <a href="https://www.sthda.com/english/english/wiki/clarifying-distance-measures-unsupervised-machine-learning"><i class="fa fa-play"></i> Clarifying distance measures</a>.</span></p>
</div>
<div id="basic-clustering-methods" class="section level1">
<h1><span class="header-section-number">6</span> Basic clustering methods</h1>
<div id="partitioning-clustering" class="section level2">
<h2><span class="header-section-number">6.1</span> Partitioning clustering</h2>
<p><br/></p>
<br/>
<div class="block">
<strong>Partitioning algorithms</strong> are clustering approaches that split a data set containing <em>n</em> observations into a set of <em>k</em> groups (i.e., <strong>clusters</strong>). The algorithms require the analyst to specify the number of clusters to be generated.
</div>
<p>This chapter describes the most commonly used <strong>partitioning algorithms</strong> including:</p>
<ul>
<li><strong>K-means clustering</strong> (MacQueen, 1967), in which each cluster is represented by the center (i.e., the mean) of the data points belonging to the cluster.</li>
<li><strong>K-medoids clustering</strong> or <strong>PAM</strong> (<strong>Partitioning Around Medoids</strong>, Kaufman &amp; Rousseeuw, 1990), in which each cluster is represented by one of the objects in the cluster. It’s a “non-parametric” alternative to k-means clustering. We’ll also describe a variant of <strong>PAM</strong> named <strong>CLARA</strong> (<strong>Clustering Large Applications</strong>), which is used for analyzing large data sets.</li>
</ul>
<p>For each of these methods, we provide:</p>
<ul>
<li>the basic idea and the key mathematical concepts</li>
<li>the clustering algorithm and implementation in R software</li>
<li>R lab sections with many examples for computing clustering methods and visualizing the outputs</li>
</ul>
<p><img src="https://www.sthda.com/english/sthda/RDoc/images/partitioning-clustering.png" alt="Partitioning cluster analysis" /></p>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/cluster-analysis-in-r-partitioning-clustering-1.png" alt="Clustering - Unsupervised Machine Learning" width="518.4" />
<p class="caption">
Clustering - Unsupervised Machine Learning
</p>
</div>
<p><br/></p>
<p><span class="question">How this chapter is organized?</span></p>
<ol style="list-style-type: decimal">
<li><strong>Required packages</strong>: <strong>cluster</strong> (for computing clustering algorithms) and <strong>factoextra</strong> (for elegant visualization)</li>
<li><strong>K-means clustering</strong>
<ul>
<li>Concept</li>
<li>Algorithm</li>
<li>R function for k-means clustering: <strong>stats::kmeans</strong>()</li>
<li>Data format</li>
<li>Compute k-means clustering</li>
<li>Application of K-means clustering on real data
<ul>
<li>Data preparation and descriptive statistics</li>
<li>Determine the optimal number of clusters in the data: <strong>factoextra::fviz_nbclust</strong>()</li>
<li>Compute k-means clustering</li>
<li>Plot the result: <strong>factoextra::fviz_cluster</strong>()</li>
</ul></li>
</ul></li>
<li><strong>PAM: Partitioning Around Medoids</strong>
<ul>
<li>Concept</li>
<li>Algorithm</li>
<li>R function for computing PAM: <strong>cluster::pam</strong>() or <strong>fpc::pamk</strong>()</li>
<li>Compute PAM</li>
</ul></li>
<li><strong>CLARA: Clustering Large Applications</strong>
<ul>
<li>Concept</li>
<li>Algorithm</li>
<li>R function for computing CLARA: <strong>cluster::clara</strong>()</li>
</ul></li>
<li><strong>R packages and functions for visualizing partitioning clusters</strong>
<ul>
<li><strong>cluster::clusplot</strong>() function</li>
<li><strong>factoextra::fviz_cluster</strong>() function</li>
</ul></li>
</ol>
<p><span class="success readmore">Read more: <a href="https://www.sthda.com/english/english/wiki/partitioning-cluster-analysis-quick-start-guide-unsupervised-machine-learning"><i class="fa fa-play"></i> Partitioning cluster analysis</a>. If you are in a hurry, read the following quick-start guide.</span> 
</p>
<ul>
<li><strong>K-means clustering</strong>: splits the data into a set of k groups (i.e., clusters), where k must be specified by the analyst. Each cluster is represented by the mean of the points belonging to the cluster.</li>
</ul>
<p><a href="https://www.sthda.com/english/english/wiki/determining-the-optimal-number-of-clusters-3-must-known-methods-unsupervised-machine-learning">Determine the optimal number of clusters</a>: use <strong>factoextra::fviz_nbclust</strong>()</p>
<pre class="r"><code>library("factoextra")
fviz_nbclust(my_data, kmeans, method = "gap_stat")</code></pre>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/cluster-analysis-in-r-optimal-number-of-clusters-1.png" alt="Clustering - Unsupervised Machine Learning" width="384" />
<p class="caption">
Clustering - Unsupervised Machine Learning
</p>
</div>
<p><a href="https://www.sthda.com/english/english/wiki/partitioning-cluster-analysis-quick-start-guide-unsupervised-machine-learning">Compute and visualize k-means clustering</a></p>
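<pre class="r"><code># k-means starts from random centers: fix the seed for reproducibility
set.seed(123)
km.res <- kmeans(my_data, 4, nstart = 25)

# Visualize
library("factoextra")
# ellipse.type replaces the frame.type argument, deprecated in recent factoextra versions
fviz_cluster(km.res, data = my_data, ellipse.type = "convex")+
  theme_minimal()</code></pre>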
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/cluster-analysis-in-r-k-means-plot-ggplot2-factoextra-1.png" alt="Clustering - Unsupervised Machine Learning" width="480" />
<p class="caption">
Clustering - Unsupervised Machine Learning
</p>
</div>
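<p>To make the idea of a cluster “represented by its mean” concrete, the following sketch recomputes the k-means centers by hand from the cluster assignments and inspects the within-cluster sums of squares. It uses only base R; the object names (x, km, manual_centers) are illustrative.</p>
<pre class="r"><code># Prepare the data and run k-means
x <- scale(USArrests)
set.seed(123)
km <- kmeans(x, centers = 4, nstart = 25)

# Recompute each center as the column means of the points in the cluster
manual_centers <- apply(x, 2, function(col) tapply(col, km$cluster, mean))

# The hand-computed centers match those returned by kmeans()
all.equal(unname(manual_centers), unname(km$centers))

# Cluster sizes and within-cluster sums of squares
km$size
km$withinss
km$tot.withinss</code></pre>
<p>The quantity tot.withinss is what k-means tries to minimize for a given k; it is also the basis of the elbow method for choosing k.</p>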
<ul>
<li><a href="https://www.sthda.com/english/english/wiki/partitioning-cluster-analysis-quick-start-guide-unsupervised-machine-learning#pam-partitioning-around-medoids"><strong>PAM clustering: Partitioning Around Medoids</strong></a>. Robust alternative to k-means clustering, less sensitive to outliers.</li>
</ul>
<pre class="r"><code># Compute PAM
library("cluster")
pam.res <- pam(my_data, 4)

# Visualize
fviz_cluster(pam.res)</code></pre>
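<p>Unlike k-means centers, PAM medoids are actual observations of the data set. The sketch below, which assumes the scaled USArrests data, shows how to retrieve them; the object names are illustrative.</p>
<pre class="r"><code>library("cluster")
x <- scale(USArrests)
pam.res <- pam(x, k = 4)

# Names of the observations (states) selected as medoids
rownames(pam.res$medoids)

# The medoids are rows of the input data
all.equal(unname(x[pam.res$id.med, ]), unname(pam.res$medoids))

# Number of observations assigned to each medoid
table(pam.res$clustering)</code></pre>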
<p><span class="success readmore">Read more: <a href="https://www.sthda.com/english/english/wiki/partitioning-cluster-analysis-quick-start-guide-unsupervised-machine-learning"><i class="fa fa-play"></i> Partitioning cluster analysis</a>.</span></p>
</div>
<div id="hierarchical-clustering" class="section level2">
<h2><span class="header-section-number">6.2</span> Hierarchical clustering</h2>
<br/>
<div class="block">
<strong>Hierarchical clustering</strong> is an alternative approach to k-means clustering for identifying groups in the dataset. It does not require the number of clusters to be specified in advance.
</div>
<p><br/></p>
<p><strong>Hierarchical clustering</strong> can be subdivided into two types:</p>
<ul>
<li><strong>Agglomerative hierarchical clustering</strong> (AHC), in which each observation is initially considered as a cluster of its own (<strong>leaf</strong>). Then, the most similar clusters are iteratively merged until there is just one single big cluster (<strong>root</strong>).</li>
<li><strong>Divisive hierarchical clustering</strong>, which is the inverse of AHC. It begins with the root, in which all objects are included in one cluster. Then the most heterogeneous clusters are iteratively divided until all observations are in their own cluster.</li>
</ul>
<p><span class="success">The result of hierarchical clustering is a tree-based representation of the observations, which is called a <strong>dendrogram</strong>. Observations can be subdivided into groups by cutting the dendrogram at a desired similarity level.</span></p>
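<p>As a minimal sketch of the two types, the code below runs an agglomerative and a divisive clustering with the <strong>agnes</strong>() and <strong>diana</strong>() functions of the cluster package, then cuts both trees into 4 groups. The choice of Ward’s method and of k = 4 is an assumption made for illustration.</p>
<pre class="r"><code>library("cluster")
x <- scale(USArrests)

# Agglomerative clustering (bottom-up), here with Ward's method
res.agnes <- agnes(x, method = "ward")

# Divisive clustering (top-down)
res.diana <- diana(x)

# Agglomerative and divisive coefficients:
# values closer to 1 suggest a stronger clustering structure
res.agnes$ac
res.diana$dc

# Cut both trees into 4 groups and cross-tabulate the assignments
grp.agnes <- cutree(as.hclust(res.agnes), k = 4)
grp.diana <- cutree(as.hclust(res.diana), k = 4)
table(grp.agnes, grp.diana)</code></pre>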
<p>This chapter provides:</p>
<ul>
<li>The description of the different types of <strong>hierarchical clustering algorithms</strong></li>
<li>R lab sections with many examples for <strong>computing hierarchical clustering</strong>, <strong>visualizing</strong> and <strong>comparing dendrograms</strong></li>
<li>The <strong>interpretation of dendrograms</strong></li>
<li>R code for <strong>cutting the dendrograms</strong> into groups</li>
</ul>
<p><img src="https://www.sthda.com/english/sthda/RDoc/images/hierarchical-clustering.png" alt="Hierarchical clustering" /></p>
<p><br/></p>
<p><span class="question">How this chapter is organized?</span></p>
<ol style="list-style-type: decimal">
<li><strong>Required R packages</strong></li>
<li><strong>Algorithm</strong></li>
<li><strong>Data preparation</strong> and descriptive statistics</li>
<li><strong>R functions</strong> for hierarchical clustering
<ul>
<li>hclust() function</li>
<li>agnes() and diana() functions</li>
</ul></li>
<li><strong>Interpretation</strong> of the dendrogram</li>
<li><strong>Cut the dendrogram</strong> into different groups</li>
<li><strong>Hierarchical clustering</strong> and <strong>correlation based distance</strong></li>
<li><strong>What type of distance measures should we choose?</strong></li>
<li><strong>Comparing two dendrograms</strong>
<ul>
<li>Tanglegram</li>
<li>Correlation matrix between a list of dendrograms</li>
</ul></li>
</ol>
<p><span class="success readmore">Read more: <a href="https://www.sthda.com/english/english/wiki/hierarchical-clustering-essentials-unsupervised-machine-learning"><i class="fa fa-play"></i> Hierarchical clustering essentials</a>. If you are in a hurry, read the following quick-start guide.</span></p>
<ol style="list-style-type: decimal">
<li><p><strong>Install and load required packages</strong> (cluster, factoextra) as previously described</p></li>
<li><p><strong>Compute and visualize hierarchical clustering</strong> using R base functions</p></li>
</ol>
<pre class="r"><code># 1. Loading and preparing data
data("USArrests")
my_data <- scale(USArrests)

# 2. Compute dissimilarity matrix
d <- dist(my_data, method = "euclidean")

# Hierarchical clustering using Ward's method
res.hc <- hclust(d, method = "ward.D2" )

# Cut tree into 4 groups
grp <- cutree(res.hc, k = 4)

# Visualize
plot(res.hc, cex = 0.6) # plot tree
rect.hclust(res.hc, k = 4, border = 2:5) # add rectangle</code></pre>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/cluster-analysis-in-r-hierarchical-clustering-r-base-1.png" alt="Clustering - Unsupervised Machine Learning" width="518.4" />
<p class="caption">
Clustering - Unsupervised Machine Learning
</p>
</div>
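<p>Once the tree is cut, the group memberships can be inspected with base R. The sketch below rebuilds the objects used above, lists the members of one cluster, shows that cutting by height gives the same partition as cutting by k, and computes the cophenetic correlation as a rough indication of how faithfully the tree preserves the original distances. The names h.cut and grp.h are illustrative.</p>
<pre class="r"><code># Rebuild the tree so that the example is self-contained
my_data <- scale(USArrests)
d <- dist(my_data, method = "euclidean")
res.hc <- hclust(d, method = "ward.D2")
grp <- cutree(res.hc, k = 4)

# Number of observations in each cluster
table(grp)

# Names of the observations in cluster 1
rownames(my_data)[grp == 1]

# Cutting at a height between the last merges that leave 4 clusters
nm <- length(res.hc$height)
h.cut <- mean(res.hc$height[c(nm - 3, nm - 2)])
grp.h <- cutree(res.hc, h = h.cut)
identical(grp, grp.h)

# Correlation between the original and the cophenetic distances
cor(d, cophenetic(res.hc))</code></pre>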
<ol start="3" style="list-style-type: decimal">
<li><strong>Elegant visualization</strong> using factoextra functions: <a href="https://www.sthda.com/english/english/rpkgs/factoextra/hcut.html"><strong>factoextra::hcut</strong>()</a>, <a href="https://www.sthda.com/english/english/rpkgs/factoextra/fviz_dend.html"><strong>factoextra::fviz_dend</strong>()</a></li>
</ol>
<pre class="r"><code>library("factoextra")
# Compute hierarchical clustering and cut into 4 clusters
res <- hcut(USArrests, k = 4, stand = TRUE)

# Visualize
fviz_dend(res, rect = TRUE, cex = 0.5,
          k_colors = c("#00AFBB","#2E9FDF", "#E7B800", "#FC4E07"))</code></pre>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/cluster-analysis-in-r-hierarchical-clustering2-1.png" alt="Clustering - Unsupervised Machine Learning" width="518.4" />
<p class="caption">
Clustering - Unsupervised Machine Learning
</p>
</div>
<p>We’ll also see how to <a href="https://www.sthda.com/english/english/wiki/beautiful-dendrogram-visualizations-in-r-5-must-known-methods-unsupervised-machine-learning">customize the dendrogram</a>:</p>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/cluster-analysis-in-r-unnamed-chunk-5-1.png" alt="Clustering - Unsupervised Machine Learning" width="518.4" />
<p class="caption">
Clustering - Unsupervised Machine Learning
</p>
</div>
<p><span class="success readmore">Read more: <a href="https://www.sthda.com/english/english/wiki/hierarchical-clustering-essentials-unsupervised-machine-learning"><i class="fa fa-play"></i> Hierarchical clustering essentials</a>.</span></p>
</div>
</div>
<div id="clustering-validation" class="section level1">
<h1><span class="header-section-number">7</span> Clustering validation</h1>
<br/>
<div class="block">
<p><strong>Clustering validation</strong> includes three main tasks:</p>
<ol style="list-style-type: decimal">
<li><strong>clustering tendency</strong> assesses whether applying clustering is suitable to your data.</li>
<li><strong>clustering evaluation</strong> assesses the goodness or quality of the clustering.</li>
<li><strong>clustering stability</strong> seeks to understand the sensitivity of the clustering result to various algorithmic parameters, for example, the number of clusters.</li>
</ol>
</div>
<p><br/></p>
<p>The aim of this part is to:</p>
<ul>
<li>describe the different methods for clustering validation</li>
<li>compare the quality of clustering results obtained with different clustering algorithms</li>
<li>provide an R lab section for validating clustering results</li>
</ul>
<div id="assessing-clustering-tendency" class="section level2">
<h2><span class="header-section-number">7.1</span> Assessing clustering tendency</h2>
<br/>
<div class="block">
<strong>Assessing clustering tendency</strong> consists of examining whether the data is clusterable, that is, whether the data contains any inherent grouping structure. This should be checked before applying clustering analysis.
</div>
<p><br/></p>
<p>In this chapter:</p>
<ul>
<li>We describe why we should evaluate the <strong>clustering tendency</strong> before applying any cluster analysis on a dataset.</li>
<li>We describe statistical and visual methods for assessing the clustering tendency</li>
<li>R lab sections containing many examples are also provided for computing clustering tendency and visualizing clusters</li>
</ul>
<p><br/></p>
<p><span class="question">How this chapter is organized?</span></p>
<ol style="list-style-type: decimal">
<li><strong>Required packages</strong></li>
<li><strong>Data preparation</strong></li>
<li><strong>Why assess clustering tendency</strong>?</li>
<li><strong>Methods for assessing clustering tendency</strong>
<ul>
<li><strong>Hopkins statistic</strong>
<ul>
<li>Algorithm</li>
<li>R function for computing Hopkins statistic: <strong>clustertend::hopkins</strong>()</li>
</ul></li>
<li><strong>VAT: Visual Assessment of cluster Tendency</strong>: <strong>seriation::dissplot</strong>()
<ul>
<li>VAT Algorithm</li>
<li>R functions for VAT</li>
</ul></li>
</ul></li>
<li><strong>A single function for Hopkins statistic and VAT</strong>: <strong>factoextra::get_clust_tendency</strong>()</li>
</ol>
<p><br/> <span class="success readmore">Read more: <a href="https://www.sthda.com/english/english/wiki/assessing-clustering-tendency-a-vital-issue-unsupervised-machine-learning"><i class="fa fa-play"></i> Assessing clustering tendency</a>. If you are in a hurry, read the following quick-start guide.</span> <br/></p>
<ol style="list-style-type: decimal">
<li><p><strong>Install and load factoextra</strong> as previously described</p></li>
<li><p><strong>Assessing clustering tendency</strong>: use factoextra::get_clust_tendency(), which computes the Hopkins statistic and draws an ordered dissimilarity image (ODI) for a visual assessment.</p></li>
</ol>
<br/>
<div class="block">
<ul>
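<li><p><strong>Hopkins statistic</strong>: With the definition used in this example, a value close to zero (far below 0.5) indicates that the dataset is significantly clusterable. Some package versions report the complementary quantity, for which values close to 1 indicate clusterable data, so check the documentation of the version you use.</p></li>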
<li><strong>VAT (Visual Assessment of cluster Tendency)</strong>: The VAT detects the clustering tendency in a visual form by counting the number of square shaped dark (or colored) blocks along the diagonal in a VAT image.</li>
</ul>
</div>
<p><br/></p>
<pre class="r"><code>library("factoextra")
my_data <- scale(iris[, -5])
get_clust_tendency(my_data, n = 50,
                   gradient = list(low = "steelblue",  high = "white"))</code></pre>
<pre><code>## $hopkins_stat
## [1] 0.2002686
## 
## $plot</code></pre>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/cluster-analysis-in-r-clustering-tendency-1.png" alt="Clustering - Unsupervised Machine Learning" width="432" />
<p class="caption">
Clustering - Unsupervised Machine Learning
</p>
</div>
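<p>To illustrate the idea behind the Hopkins statistic, here is a minimal base R sketch under the convention used above (values near 0 suggest clusterable data). It is not the implementation used by factoextra or clustertend: the function name hopkins_sketch and its arguments are hypothetical, and some formulations raise the distances to the power of the number of variables.</p>
<pre class="r"><code>hopkins_sketch <- function(x, m = 10, seed = 123) {
  set.seed(seed)
  x <- as.matrix(x)

  # w: distance from m sampled real points to their nearest other real point
  idx <- sample(nrow(x), m)
  d_real <- as.matrix(dist(x))
  diag(d_real) <- Inf
  w <- apply(d_real[idx, , drop = FALSE], 1, min)

  # u: distance from m uniform random points (drawn within the range of
  # each variable) to their nearest real point
  rand <- apply(x, 2, function(col) runif(m, min(col), max(col)))
  u <- apply(rand, 1, function(p) min(sqrt(colSums((t(x) - p)^2))))

  # Real points in clustered data lie much closer to their neighbors
  # than random points do, so the ratio is small
  sum(w) / (sum(u) + sum(w))
}

h <- hopkins_sketch(scale(iris[, -5]), m = 50)
h</code></pre>
<p>For uniformly distributed data, w and u are of similar size and the ratio is close to 0.5.</p>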
<p><span class="success readmore">Read more: <a href="https://www.sthda.com/english/english/wiki/assessing-clustering-tendency-a-vital-issue-unsupervised-machine-learning"><i class="fa fa-play"></i> Assessing clustering tendency</a>.</span></p>
</div>
<div id="determining-the-optimal-number-of-clusters" class="section level2">
<h2><span class="header-section-number">7.2</span> Determining the optimal number of clusters</h2>
<br/>
<div class="block">
As described above, <a href="https://www.sthda.com/english/english/wiki/partitioning-cluster-analysis-quick-start-guide-unsupervised-machine-learning"><strong>partitioning methods</strong></a>, such as <strong>k-means clustering</strong>, require the user to specify the number of clusters to be generated.
</div>
<p><br/></p>
<p>In this chapter, we’ll describe different methods to determine the optimal number of clusters for <strong>k-means</strong>, <strong>PAM</strong> and <strong>hierarchical</strong> clustering.</p>
<p><br/> <span class="question">How this chapter is organized?</span></p>
<ol style="list-style-type: decimal">
<li>Required packages</li>
<li>Data preparation</li>
<li>Example of partitioning method results</li>
<li>Example of hierarchical clustering results</li>
<li><strong>Three popular methods for determining the optimal number of clusters</strong>
<ul>
<li><strong>Elbow method</strong>
<ul>
<li>Concept</li>
<li>Algorithm</li>
<li>R code</li>
</ul></li>
<li><strong>Average silhouette method</strong>
<ul>
<li>Concept</li>
<li>Algorithm</li>
<li>R code</li>
</ul></li>
<li>Conclusions about elbow and silhouette methods</li>
<li><strong>Gap statistic method</strong>
<ul>
<li>Concept</li>
<li>Algorithm</li>
<li>R code</li>
</ul></li>
</ul></li>
<li><strong>NbClust: A Package providing 30 indices for determining the best number of clusters</strong>
<ul>
<li>Overview of NbClust package</li>
<li>NbClust R function</li>
<li>Examples of usage
<ul>
<li>Compute only an index of interest</li>
<li>Compute all the 30 indices</li>
</ul></li>
</ul></li>
</ol>
<p><br/> <span class="success readmore">Read more: <a href="https://www.sthda.com/english/english/wiki/determining-the-optimal-number-of-clusters-3-must-known-methods-unsupervised-machine-learning"><i class="fa fa-play"></i> Determining the optimal number of clusters</a>. If you are in a hurry, read the following quick-start guide.</span> <br/></p>
<ul>
<li><strong>Estimate the number of clusters in the data using the gap statistic</strong>: <strong>factoextra::fviz_nbclust</strong>()</li>
</ul>
<pre class="r"><code>my_data <- scale(USArrests)
library("factoextra")
fviz_nbclust(my_data, kmeans, method = "gap_stat")</code></pre>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/cluster-analysis-in-r-determine-the-number-of-clusters-gap-statistics-1.png" alt="Clustering - Unsupervised Machine Learning" width="384" />
<p class="caption">
Clustering - Unsupervised Machine Learning
</p>
</div>
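<p>The elbow and average silhouette methods listed above can also be sketched with base R and the cluster package. The code below is a simplified illustration, not the implementation used by fviz_nbclust(); the range of k values (up to 10) and the object names are assumptions.</p>
<pre class="r"><code>library("cluster")
x <- scale(USArrests)
set.seed(123)

# Elbow method: total within-cluster sum of squares for k = 1 to 10
wss <- sapply(1:10, function(k) {
  kmeans(x, centers = k, nstart = 25)$tot.withinss
})
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")

# Average silhouette method: mean silhouette width for k = 2 to 10
d <- dist(x)
avg.sil <- sapply(2:10, function(k) {
  km <- kmeans(x, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
plot(2:10, avg.sil, type = "b",
     xlab = "Number of clusters k",
     ylab = "Average silhouette width")

# Number of clusters with the highest average silhouette width
(2:10)[which.max(avg.sil)]</code></pre>
<p>With the elbow method, we look for the value of k beyond which adding a cluster no longer reduces the within-cluster sum of squares much; with the silhouette method, we look for the k that maximizes the average silhouette width.</p>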
<ul>
<li><strong>NbClust: A Package providing 30 indices for determining the best number of clusters</strong></li>
</ul>
<pre class="r"><code>library("NbClust")
set.seed(123)
res.nbclust <- NbClust(my_data, distance = "euclidean",
                  min.nc = 2, max.nc = 10, 
                  method = "complete", index ="all") </code></pre>
<p><strong>Visualize using factoextra</strong>:</p>
<pre class="r"><code>factoextra::fviz_nbclust(res.nbclust) + theme_minimal()</code></pre>
<pre><code>## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 1 proposed  1 as the best number of clusters
## * 9 proposed  2 as the best number of clusters
## * 4 proposed  3 as the best number of clusters
## * 6 proposed  4 as the best number of clusters
## * 2 proposed  5 as the best number of clusters
## * 1 proposed  8 as the best number of clusters
## * 1 proposed  10 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  2 .</code></pre>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/cluster-analysis-in-r-determine-the-number-of-clusters-nbclust-1.png" alt="Clustering - Unsupervised Machine Learning" width="518.4" />
<p class="caption">
Clustering - Unsupervised Machine Learning
</p>
</div>
<p><span class="success readmore">Read more: <a href="https://www.sthda.com/english/english/wiki/determining-the-optimal-number-of-clusters-3-must-known-methods-unsupervised-machine-learning"><i class="fa fa-play"></i> Determining the optimal number of clusters</a>.</span></p>
</div>
<div id="clustering-validation-statistics" class="section level2">
<h2><span class="header-section-number">7.3</span> Clustering validation statistics</h2>
<br/>
<div class="block">
A variety of measures have been proposed in the literature for evaluating clustering results. The term clustering validation designates the procedure of evaluating the results of a clustering algorithm.
</div>
<p><br/></p>
<p>The aim of this chapter is to:</p>
<ul>
<li>describe the different methods for clustering validation</li>
<li>compare the quality of clustering results obtained with different clustering algorithms</li>
<li>provide an R lab section for validating clustering results</li>
</ul>
<p><br/> <span class="question">How is this chapter organized?</span></p>
<ol style="list-style-type: decimal">
<li><strong>Required packages</strong>: cluster, factoextra, NbClust, fpc</li>
<li><strong>Data preparation</strong></li>
<li><strong>Relative measures - Determine the optimal number of clusters</strong>: NbClust::NbClust()</li>
<li><strong>Clustering analysis</strong>
<ul>
<li>Example of partitioning method results</li>
<li>Example of hierarchical clustering results</li>
</ul></li>
<li><strong>Internal clustering validation measures</strong>
<ul>
<li><strong>Silhouette analysis</strong>
<ul>
<li>Concept and algorithm</li>
<li>Interpretation of silhouette width</li>
<li>R functions for silhouette analysis: cluster::silhouette(), factoextra::fviz_silhouette()</li>
</ul></li>
<li><strong>Dunn index</strong>
<ul>
<li>Concept and algorithm</li>
<li>R function for computing Dunn index: fpc::cluster.stats(), NbClust::NbClust()</li>
</ul></li>
<li><strong>Clustering validation statistics</strong>: fpc::cluster.stats()</li>
</ul></li>
<li><strong>External clustering validation</strong></li>
</ol>
<p><br/> <span class="success readmore">Read more: <a href="https://www.sthda.com/english/english/wiki/clustering-validation-statistics-4-vital-things-everyone-should-know-unsupervised-machine-learning"><i class="fa fa-play"></i> Clustering Validation Statistics</a>. If you are in a hurry, read the following quick-start guide.</span> <br/></p>
<ol style="list-style-type: decimal">
<li><strong>Compute and visualize hierarchical clustering</strong></li>
</ol>
<ul>
<li>Compute: factoextra::eclust()</li>
<li>Elegant visualization: factoextra::fviz_dend()</li>
</ul>
<pre class="r"><code>my_data <- scale(iris[, -5])

# Enhanced hierarchical clustering, cut in 3 groups
library("factoextra")
res.hc <- eclust(my_data, "hclust", k = 3, graph = FALSE) 

# Visualize
fviz_dend(res.hc, rect = TRUE, show_labels = FALSE)</code></pre>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/cluster-analysis-in-r-unnamed-chunk-7-1.png" alt="Clustering - Unsupervised Machine Learning" width="518.4" />
<p class="caption">
Clustering - Unsupervised Machine Learning
</p>
</div>
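<p>For readers who prefer base R, a rough counterpart of the call above can be sketched as follows. It assumes Euclidean distances and Ward linkage ("ward.D2"), which we take to be the settings used by eclust() for "hclust"; check ?eclust before relying on this equivalence.</p>
<pre class="r"><code>my_data <- scale(iris[, -5])

# Dissimilarity matrix and hierarchical clustering (assumed settings)
d <- dist(my_data, method = "euclidean")
hc <- hclust(d, method = "ward.D2")

# Cut the tree in 3 groups and inspect the cluster sizes
grp <- cutree(hc, k = 3)
table(grp)

# Basic dendrogram with the 3 groups outlined
plot(hc, labels = FALSE, hang = -1)
rect.hclust(hc, k = 3, border = 2:4)</code></pre>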
<ol start="2" style="list-style-type: decimal">
<li><strong>Validate clustering results</strong> by inspecting the cluster <strong>silhouette plot</strong></li>
</ol>
<p>Recall that the <a href="https://www.sthda.com/english/english/wiki/clustering-validation-statistics-4-vital-things-everyone-should-know-unsupervised-machine-learning#silhouette-analysis">silhouette</a> (<span class="math">\(S_i\)</span>) measures how similar an object <span class="math">\(i\)</span> is to the other objects in its own cluster versus those in the neighbor cluster. <span class="math">\(S_i\)</span> values range from 1 to -1:</p>
<ul>
<li>A value of <span class="math">\(S_i\)</span> close to 1 indicates that the object is well clustered. In other words, the object <span class="math">\(i\)</span> is similar to the other objects in its group.</li>
<li>A value of <span class="math">\(S_i\)</span> close to -1 indicates that the object is poorly clustered, and that assignment to some other cluster would probably improve the overall results.</li>
</ul>
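<p>To make the definition concrete, the silhouette width of a single object can be computed by hand and compared with cluster::silhouette(). This is a sketch under stated assumptions: the helper sil_one() is hypothetical (it is not part of any package), the partition comes from a base R hclust() call, and the helper does not handle singleton clusters.</p>
<pre class="r"><code>library(cluster)
x <- scale(iris[, -5])
d <- as.matrix(dist(x))
grp <- cutree(hclust(dist(x), method = "ward.D2"), k = 3)

# Hypothetical helper: silhouette width of object i, (b - a) / max(a, b)
sil_one <- function(i) {
  own <- grp == grp[i]
  own[i] <- FALSE                          # exclude the object itself
  a <- mean(d[i, own])                     # mean distance to its own cluster
  others <- setdiff(unique(grp), grp[i])
  # mean distance to the nearest other cluster (the "neighbor")
  b <- min(sapply(others, function(g) mean(d[i, grp == g])))
  (b - a) / max(a, b)
}

sil <- silhouette(grp, dist(x))
sil_one(1)
unclass(sil)[1, "sil_width"]</code></pre>
<p>Both values should agree, since a is the average distance to the object's own cluster and b is the smallest average distance to any other cluster.</p>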
<pre class="r"><code># Visualize the silhouette plot
fviz_silhouette(res.hc)</code></pre>
<pre><code>##   cluster size ave.sil.width
## 1       1   49          0.63
## 2       2   30          0.44
## 3       3   71          0.32</code></pre>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/cluster-analysis-in-r-silhouette-plot-1.png" alt="Clustering - Unsupervised Machine Learning" width="518.4" />
<p class="caption">
Clustering - Unsupervised Machine Learning
</p>
</div>
<p><span class="question">Which samples have a negative silhouette? To which cluster are they closer?</span></p>
<pre class="r"><code># Silhouette width of observations
sil <- res.hc$silinfo$widths[, 1:3]

# Objects with negative silhouette
neg_sil_index <- which(sil[, 'sil_width'] < 0)
sil[neg_sil_index, , drop = FALSE]</code></pre>
<pre><code>##     cluster neighbor   sil_width
## 84        3        2 -0.01269799
## 122       3        2 -0.01789603
## 62        3        2 -0.04756835
## 135       3        2 -0.05302402
## 73        3        2 -0.10091884
## 74        3        2 -0.14761137
## 114       3        2 -0.16107155
## 72        3        2 -0.23036371</code></pre>
<p><br/> <span class="success readmore">Read more: <a href="https://www.sthda.com/english/english/wiki/clustering-validation-statistics-4-vital-things-everyone-should-know-unsupervised-machine-learning"><i class="fa fa-play"></i> Clustering Validation Statistics</a>.</span> <br/></p>
</div>
<div id="how-to-choose-the-appropriate-clustering-algorithms-for-your-data" class="section level2">
<h2><span class="header-section-number">7.4</span> How to choose the appropriate clustering algorithms for your data?</h2>
<br/>
<div class="block">
This chapter describes the R package <strong>clValid</strong> (G. Brock et al., 2008) which can be used for simultaneously <strong>comparing multiple clustering algorithms</strong> in a single function call for <strong>identifying the best clustering approach</strong> and the <strong>optimal number of clusters</strong>.
</div>
<p><br/></p>
<p>We’ll start by describing the different clustering validation measures in the package. Next, we’ll present the function clValid() and finally we’ll provide an R lab section for validating clustering results and comparing clustering algorithms.</p>
<p><br/> <span class="question">How is this chapter organized?</span></p>
<ol style="list-style-type: decimal">
<li><strong>Clustering validation measures in clValid package</strong>
<ul>
<li>Internal validation measures</li>
<li>Stability validation measures</li>
<li>Biological validation measures</li>
</ul></li>
<li><strong>R function clValid</strong>()
<ul>
<li>Format</li>
<li>Examples of usage
<ul>
<li>Data</li>
<li>Compute clValid()</li>
</ul></li>
</ul></li>
</ol>
<p><br/> <span class="success readmore">Read more: <a href="https://www.sthda.com/english/english/wiki/how-to-choose-the-appropriate-clustering-algorithms-for-your-data-unsupervised-machine-learning"><i class="fa fa-play"></i> How to choose the appropriate clustering algorithms for your data?</a>. If you are in a hurry, read the following quick-start guide.</span> <br/></p>
<pre class="r"><code>my_data <- scale(USArrests)

# Compute clValid
library("clValid")
intern <- clValid(my_data, nClust = 2:6, 
              clMethods = c("hierarchical","kmeans","pam"),
              validation = "internal")
# Summary
summary(intern)</code></pre>
<pre><code>## 
## Clustering Methods:
##  hierarchical kmeans pam 
## 
## Cluster sizes:
##  2 3 4 5 6 
## 
## Validation Measures:
##                                  2       3       4       5       6
##                                                                   
## hierarchical Connectivity   6.6437  9.5615 13.9563 22.5782 31.2873
##              Dunn           0.2214  0.2214  0.2224  0.2046  0.2126
##              Silhouette     0.4085  0.3486  0.3637  0.3213  0.2720
## kmeans       Connectivity   6.6437 13.6484 16.2413 24.6639 33.7194
##              Dunn           0.2214  0.2224  0.2224  0.1983  0.2231
##              Silhouette     0.4085  0.3668  0.3573  0.3377  0.3079
## pam          Connectivity   6.6437 13.8302 20.4421 29.5726 38.2643
##              Dunn           0.2214  0.1376  0.1849  0.1849  0.2019
##              Silhouette     0.4085  0.3144  0.3390  0.3105  0.2630
## 
## Optimal Scores:
## 
##              Score  Method       Clusters
## Connectivity 6.6437 hierarchical 2       
## Dunn         0.2231 kmeans       6       
## Silhouette   0.4085 hierarchical 2</code></pre>
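<p><span class="warning">According to the optimal scores above, hierarchical clustering with two clusters performs best for the connectivity and Silhouette measures, whereas the Dunn index is highest for k-means with six clusters (0.2231). Note that, for two clusters, the three methods give identical scores.</span></p>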
<p><br/> <span class="success readmore">Read more: <a href="https://www.sthda.com/english/english/wiki/how-to-choose-the-appropriate-clustering-algorithms-for-your-data-unsupervised-machine-learning"><i class="fa fa-play"></i> How to choose the appropriate clustering algorithms for your data?</a>.</span> <br/></p>
</div>
<div id="how-to-compute-p-value-for-hierarchical-clustering-in-r" class="section level2">
<h2><span class="header-section-number">7.5</span> How to compute p-value for hierarchical clustering in R?</h2>
<br/>
<div class="block">
This chapter describes the <strong>R</strong> package <strong>pvclust</strong> (Suzuki et al., 2004), which uses bootstrap resampling techniques to compute a <strong>p-value</strong> for each <strong>cluster</strong>.
</div>
<p><br/></p>
<p><span class="question">How is this chapter organized?</span>
<br/></p>
<ol style="list-style-type: decimal">
<li>Concept</li>
<li>Algorithm</li>
<li>Required R packages</li>
<li>Data preparation</li>
<li>Compute p-value for hierarchical clustering
<ul>
<li>Description of pvclust() function</li>
<li>Usage of pvclust() function</li>
</ul></li>
</ol>
<p><br/> <span class="success readmore">Read more: <a href="https://www.sthda.com/english/english/wiki/how-to-compute-p-value-for-hierarchical-clustering-in-r-unsupervised-machine-learning"><i class="fa fa-play"></i> How to compute p-value for hierarchical clustering in R?</a>. If you are in a hurry, read the following quick-start guide.</span> <br/></p>
<p>Note that <strong>pvclust()</strong> clusters the columns of the dataset, which correspond to samples in our case.</p>
<pre class="r"><code>library(pvclust)
# Data preparation
set.seed(123)
data("lung")
ss <- sample(1:73, 30) # extract 30 samples out of 73
my_data <- lung[, ss]</code></pre>
<pre class="r"><code># Compute pvclust
res.pv <- pvclust(my_data, method.dist="cor", 
                  method.hclust="average", nboot = 10)</code></pre>
<pre><code>## Bootstrap (r = 0.5)... Done.
## Bootstrap (r = 0.6)... Done.
## Bootstrap (r = 0.7)... Done.
## Bootstrap (r = 0.8)... Done.
## Bootstrap (r = 0.9)... Done.
## Bootstrap (r = 1.0)... Done.
## Bootstrap (r = 1.1)... Done.
## Bootstrap (r = 1.2)... Done.
## Bootstrap (r = 1.3)... Done.
## Bootstrap (r = 1.4)... Done.</code></pre>
<pre class="r"><code># Default plot
plot(res.pv, hang = -1, cex = 0.5)
pvrect(res.pv)</code></pre>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/cluster-analysis-in-r-pvclust-p-value-hierarchical-clustering-1.png" alt="Clustering - Unsupervised Machine Learning" width="518.4" />
<p class="caption">
Clustering - Unsupervised Machine Learning
</p>
</div>
<p><span class="warning"> Clusters with AU >= 95% are indicated by the rectangles and are considered to be strongly supported by the data.</span></p>
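<p>The members of the strongly supported clusters can also be listed with pvpick(). The sketch below is self-contained and therefore recomputes the clustering; the choice of the first 20 samples and nboot = 100 are assumptions made only to keep the run short.</p>
<pre class="r"><code>library(pvclust)
set.seed(123)
data("lung")

# Bootstrap hierarchical clustering on a subset of samples (columns)
res.pv <- pvclust(lung[, 1:20], method.dist = "cor",
                  method.hclust = "average", nboot = 100)

# Clusters whose AU p-value is at least 0.95
sig <- pvpick(res.pv, alpha = 0.95, pv = "au")
sig$clusters</code></pre>
<p>With so few bootstrap replicates the AU values are noisy; a larger nboot is advisable for a real analysis.</p>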
<p><br/> <span class="success readmore">Read more: <a href="https://www.sthda.com/english/english/wiki/how-to-compute-p-value-for-hierarchical-clustering-in-r-unsupervised-machine-learning"><i class="fa fa-play"></i> How to compute p-value for hierarchical clustering in R?</a>.</span> <br/></p>
</div>
</div>
<div id="the-guide-for-clustering-analysis-on-a-real-data-4-steps-you-should-know" class="section level1">
<h1><span class="header-section-number">8</span> The guide for clustering analysis on a real data: 4 steps you should know</h1>
<br/>
<div class="block">
<p>In this chapter we’ll describe the different steps to follow for computing <strong>clustering</strong> on real data using <strong>k-means clustering</strong>:</p>
<ol style="list-style-type: decimal">
<li>Data preparation</li>
<li><a href="https://www.sthda.com/english/english/wiki/assessing-clustering-tendency-a-vital-issue-unsupervised-machine-learning">Assessing clustering tendency (i.e., the clusterability of the data)</a></li>
<li><a href="https://www.sthda.com/english/english/wiki/determining-the-optimal-number-of-clusters-3-must-known-methods-unsupervised-machine-learning">Defining the optimal number of clusters</a></li>
<li><a href="https://www.sthda.com/english/english/wiki/determining-the-optimal-number-of-clusters-3-must-known-methods-unsupervised-machine-learning">Computing partitioning cluster analyses (e.g.: k-means, pam)</a> or <a href="https://www.sthda.com/english/english/wiki/hierarchical-clustering-essentials-unsupervised-machine-learning">hierarchical clustering analyses</a></li>
<li><a href="https://www.sthda.com/english/english/wiki/clustering-validation-statistics-4-vital-things-everyone-should-know-unsupervised-machine-learning">Validating clustering analyses: silhouette plot</a></li>
</ol>
</div>
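<p>The steps listed above can be sketched in a few lines of R. This is only an illustration under stated assumptions: the USArrests data stand in for "real data", the number of clusters is chosen by the average silhouette width, and the assessment of clustering tendency is skipped because it needs additional packages.</p>
<pre class="r"><code>library(cluster)
set.seed(123)

# Data preparation: remove missing values and standardize
df <- scale(na.omit(USArrests))
d <- dist(df)

# Number of clusters: average silhouette width of k-means partitions
ks <- 2:8
avg_sil <- sapply(ks, function(k) {
  km <- kmeans(df, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
k_best <- ks[which.max(avg_sil)]

# Partitioning with the selected number of clusters
km <- kmeans(df, centers = k_best, nstart = 25)

# Validation: silhouette plot of the final partition
plot(silhouette(km$cluster, d))</code></pre>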
<p><br/></p>
<p><br/> <span class="success readmore">Read more: <a href="https://www.sthda.com/english/english/wiki/the-guide-for-clustering-analysis-on-a-real-data-4-steps-you-should-know-unsupervised-machine-learning"><i class="fa fa-play"></i> The guide for clustering analysis on a real data: 4 steps you should know</a>.</span> <br/></p>
</div>
<div id="visualization-of-clustering-results" class="section level1">
<h1><span class="header-section-number">9</span> Visualization of clustering results</h1>
<p>In this chapter, we’ll describe how to visualize the result of clustering using dendrograms as well as static and <strong>interactive</strong> <strong>heatmaps</strong>.</p>
<p>A <strong>heat map</strong> is a false-color image with a dendrogram added to the left side and to the top. It’s used to visualize hidden patterns in a data matrix in order to reveal associations between rows or columns.</p>
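<p>As a minimal illustration, a static heat map can be drawn with the base R function heatmap(). The mtcars data set and the color palette are assumptions chosen for the example.</p>
<pre class="r"><code># Standardize the variables so that they are comparable
df <- scale(mtcars)

# Heat map with row and column dendrograms; the data are already scaled
hm <- heatmap(df, scale = "none", col = heat.colors(50))

# Row order induced by the row dendrogram
rownames(df)[hm$rowInd]</code></pre>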
<div id="visual-enhancement-of-clustering-analysis" class="section level2">
<h2><span class="header-section-number">9.1</span> Visual enhancement of clustering analysis</h2>
<br/>
<div class="block">
In this chapter, we provide some easy-to-use functions for enhancing the workflow of clustering analyses, and we implement ggplot2 methods for visualizing the results: <strong>factoextra::eclust</strong>().
</div>
<p><br/></p>
<p><br/> <span class="success readmore">Read more: <a href="https://www.sthda.com/english/english/wiki/visual-enhancement-of-clustering-analysis-unsupervised-machine-learning"><i class="fa fa-play"></i> Visual enhancement of clustering analysis</a>.</span> <br/></p>
</div>
<div id="beautiful-dendrogram-visualizations" class="section level2">
<h2><span class="header-section-number">9.2</span> Beautiful dendrogram visualizations</h2>
<p>Read more: <a href="https://www.sthda.com/english/english/wiki/beautiful-dendrogram-visualizations-in-r-5-must-known-methods-unsupervised-machine-learning">Beautiful dendrogram visualizations in R: 5+ must known methods</a></p>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/cluster-analysis-in-r-dendrogram-1.png" alt="Clustering - Unsupervised Machine Learning" width="518.4" />
<p class="caption">
Clustering - Unsupervised Machine Learning
</p>
</div>
</div>
<div id="static-and-interactive-heatmap" class="section level2">
<h2><span class="header-section-number">9.3</span> Static and Interactive Heatmap</h2>
<p>Read more: <a href="https://www.sthda.com/english/english/wiki/static-and-interactive-heatmap-in-r-unsupervised-machine-learning">Static and Interactive Heatmap in R</a></p>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/cluster-analysis-in-r-heatmap-1.png" alt="Clustering - Unsupervised Machine Learning" width="518.4" />
<p class="caption">
Clustering - Unsupervised Machine Learning
</p>
</div>
</div>
</div>
<div id="advanced-clustering-methods" class="section level1">
<h1><span class="header-section-number">10</span> Advanced clustering methods</h1>
<div id="fuzzy-clustering-analysis" class="section level2">
<h2><span class="header-section-number">10.1</span> Fuzzy clustering analysis</h2>
<p>Fuzzy clustering is also known as the <strong>soft method</strong>. Standard clustering approaches (K-means, PAM) produce partitions in which each observation belongs to only one cluster. This is known as <strong>hard clustering</strong>.</p>
<br/>
<div class="block">
In <strong>Fuzzy clustering</strong>, items can be a member of more than one cluster. Each item has a set of membership coefficients corresponding to the degree of being in a given cluster. The <strong>Fuzzy c-means</strong> method is the most popular fuzzy clustering algorithm. Read more: <a href="https://www.sthda.com/english/english/wiki/fuzzy-clustering-analysis-unsupervised-machine-learning">Fuzzy clustering analysis</a>.
</div>
<p><br/></p>
</div>
<div id="model-based-clustering" class="section level2">
<h2><span class="header-section-number">10.2</span> Model-based clustering</h2>
<br/>
<div class="block">
In model-based clustering, the data are viewed as coming from a distribution that is a mixture of two or more clusters. The method finds the best fit of models to the data and estimates the number of clusters. Read more: <a href="https://www.sthda.com/english/english/wiki/model-based-clustering-unsupervised-machine-learning">Model-based clustering</a>.
</div>
<p><br/></p>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/cluster-analysis-in-r-model-based-clustering-1.png" alt="Clustering - Unsupervised Machine Learning" width="518.4" /><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/cluster-analysis-in-r-model-based-clustering-2.png" alt="Clustering - Unsupervised Machine Learning" width="518.4" />
<p class="caption">
Clustering - Unsupervised Machine Learning
</p>
</div>
</div>
<div id="dbscan-density-based-clustering" class="section level2">
<h2><span class="header-section-number">10.3</span> DBSCAN: Density-based clustering</h2>
<br/>
<div class="block">
<p>DBSCAN is a partitioning method that was introduced in Ester et al. (1996). It can find clusters of different shapes and sizes in data containing noise and outliers. The basic idea behind the density-based clustering approach is derived from an intuitive human clustering method.</p>
The description and implementation of DBSCAN in R are provided in this chapter : <a href="https://www.sthda.com/english/english/wiki/dbscan-density-based-clustering-for-discovering-clusters-in-large-datasets-with-noise-unsupervised-machine-learning">DBSCAN</a>.
</div>
<p><br/></p>
<p><img src="https://www.sthda.com/english/sthda/RDoc/images/dbscan-idea.png" alt="Density based clustering basic idea" /></p>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/cluster-analysis-in-r-dbscan-1.png" alt="Clustering - Unsupervised Machine Learning" width="518.4" />
<p class="caption">
Clustering - Unsupervised Machine Learning
</p>
</div>
</div>
<div id="hybrid-clustering-methods" class="section level2">
<h2><span class="header-section-number">10.4</span> Hybrid clustering methods</h2>
<ul>
<li><a href="https://www.sthda.com/english/english/wiki/hybrid-hierarchical-k-means-clustering-for-optimizing-clustering-outputs-unsupervised-machine-learning">Hybrid hierarchical k-means clustering for optimizing clustering outputs - Hybrid approach (1/2)</a></li>
<li><a href="https://www.sthda.com/english/english/wiki/hcpc-hierarchical-clustering-on-principal-components-hybrid-approach-2-2-unsupervised-machine-learning">HCPC: Hierarchical clustering on principal components - Hybrid approach (2/2)</a></li>
</ul>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/cluster-analysis-in-r-hcpc-1.png" alt="Clustering - Unsupervised Machine Learning" width="518.4" />
<p class="caption">
Clustering - Unsupervised Machine Learning
</p>
</div>
</div>
</div>
<div id="infos" class="section level1">
<h1><span class="header-section-number">11</span> Infos</h1>
<p><span class="warning">This analysis has been performed using <strong>R software</strong> (ver. 3.2.4)</span></p>
<ul>
<li>Martin Ester, Hans-Peter Kriegel, Joerg Sander, Xiaowei Xu (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Institute for Computer Science, University of Munich. Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96). <a href="https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf">pdf</a></li>
</ul>
</div>

<script>jQuery(document).ready(function () {
    jQuery('h1').addClass('wiki_paragraph1');
    jQuery('h2').addClass('wiki_paragraph2');
    jQuery('h3').addClass('wiki_paragraph3');
    jQuery('h4').addClass('wiki_paragraph4');
    });//add phpboost class to header</script>
<style>.content{padding:0px;}</style>
</div><!--end rdoc-->
<!--====================== stop here when you copy to sthda================-->


<!-- END HTML -->]]></description>
			<pubDate>Fri, 29 Apr 2016 12:01:52 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Fuzzy clustering analysis - Unsupervised Machine Learning]]></title>
			<link>https://www.sthda.com/english/wiki/fuzzy-clustering-analysis-unsupervised-machine-learning</link>
			<guid>https://www.sthda.com/english/wiki/fuzzy-clustering-analysis-unsupervised-machine-learning</guid>
			<description><![CDATA[<!-- START HTML -->

  <!--====================== start from here when you copy to sthda================-->  
  <div id="rdoc">


<div id="TOC">
<ul>
<li><a href="#required-packages"><span class="toc-section-number">1</span> Required packages</a></li>
<li><a href="#concept-of-fuzzy-clustering"><span class="toc-section-number">2</span> Concept of fuzzy clustering</a></li>
<li><a href="#algorithm-of-fuzzy-clustering"><span class="toc-section-number">3</span> Algorithm of fuzzy clustering</a><ul>
<li><a href="#r-functions-for-fuzzy-clustering"><span class="toc-section-number">3.1</span> R functions for fuzzy clustering</a><ul>
<li><a href="#fanny-fuzzy-analysis-clustering"><span class="toc-section-number">3.1.1</span> fanny(): Fuzzy analysis clustering</a></li>
<li><a href="#cmeans"><span class="toc-section-number">3.1.2</span> cmeans()</a></li>
</ul></li>
</ul></li>
<li><a href="#infos"><span class="toc-section-number">4</span> Infos</a></li>
</ul>
</div>

<p><br/></p>
<div id="required-packages" class="section level1">
<h1><span class="header-section-number">1</span> Required packages</h1>
<p>Three R packages are required for this chapter:</p>
<ol style="list-style-type: decimal">
<li><strong>cluster</strong> and <strong>e1071</strong> for computing fuzzy clustering</li>
<li><strong>factoextra</strong> for visualizing clusters</li>
</ol>
<pre class="r"><code>install.packages("cluster")
install.packages("e1071")
install.packages("factoextra")</code></pre>
</div>
<div id="concept-of-fuzzy-clustering" class="section level1">
<h1><span class="header-section-number">2</span> Concept of fuzzy clustering</h1>
<p>In K-means or PAM clustering, the data is divided into distinct clusters, where each element is assigned to exactly one cluster. This type of clustering is also known as <strong>hard clustering</strong> or <strong>non-fuzzy</strong> clustering. Unlike K-means, <strong>Fuzzy clustering</strong> is considered a <strong>soft clustering</strong>, in which each element has a probability of belonging to each cluster. In other words, each element has a set of membership coefficients corresponding to the degree of being in a given cluster.</p>
<p>Points close to the center of a cluster may be in the cluster to a higher degree than points at the edge of a cluster. The degree to which an element belongs to a given cluster is a numerical value in [0, 1].</p>
<p><strong>Fuzzy c-means</strong> (FCM) algorithm is one of the most widely used fuzzy clustering algorithms. It was developed by Dunn in 1973 and improved by Bezdek in 1981. It’s frequently used in pattern recognition.</p>
</div>
<div id="algorithm-of-fuzzy-clustering" class="section level1">
<h1><span class="header-section-number">3</span> Algorithm of fuzzy clustering</h1>
<p>The FCM algorithm is very similar to the <strong>k-means algorithm</strong>; its aim is to minimize the objective function defined as follows:</p>
<p><span class="math">\[
\sum\limits_{j=1}^k \sum\limits_{i=1}^n u_{ij}^m (x_i - \mu_j)^2
\]</span></p>
<p>Where,</p>
<ul>
<li><span class="math">\(n\)</span> is the number of observations and <span class="math">\(k\)</span> is the number of clusters (the inner sum runs over all observations, since every observation belongs to every cluster to some degree)</li>
<li><span class="math">\(u_{ij}\)</span> is the degree to which an observation <span class="math">\(x_i\)</span> belongs to the cluster <span class="math">\(j\)</span></li>
<li><span class="math">\(\mu_j\)</span> is the center of the cluster <span class="math">\(j\)</span></li>
<li><span class="math">\(m\)</span> is the fuzzifier.</li>
</ul>
<p><span class="notice">It can be seen that FCM differs from k-means by using the membership values <span class="math">\(u_{ij}\)</span> and the fuzzifier <span class="math">\(m\)</span>.</span></p>
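<p>To see how the membership values and the fuzzifier enter the objective function, it can be evaluated on a toy example. The data, the cluster centers, the membership matrix and the helper fcm_objective() below are all hypothetical and serve only as an illustration.</p>
<pre class="r"><code># Toy data: 6 observations, 2 variables, k = 2 clusters
x <- rbind(c(1, 1), c(1.2, 0.8), c(0.9, 1.1),
           c(5, 5), c(5.1, 4.9), c(4.8, 5.2))
centers <- rbind(c(1, 1), c(5, 5))            # hypothetical cluster centers
u <- rbind(c(.9, .1), c(.9, .1), c(.9, .1),
           c(.1, .9), c(.1, .9), c(.1, .9))   # hypothetical memberships

# Hypothetical helper: weighted sum of squared distances to the centers
fcm_objective <- function(x, centers, u, m = 2) {
  # Squared Euclidean distances: observations in rows, centers in columns
  d2 <- sapply(seq_len(nrow(centers)), function(j) {
    rowSums(sweep(x, 2, centers[j, ])^2)
  })
  sum(u^m * d2)
}

fcm_objective(x, centers, u)          # fuzzy memberships
fcm_objective(x, centers, round(u))   # crisp (0/1) memberships</code></pre>
<p>With crisp memberships the expression reduces to the within-cluster sum of squares minimized by k-means.</p>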
<p>The membership <span class="math">\(u_{ij}\)</span> is defined as follows:</p>
<p><span class="math">\[
u_{ij} = \frac{1}{\sum\limits_{l=1}^k \left( \frac{| x_i - \mu_j |}{| x_i - \mu_l |}\right)^{\frac{2}{m-1}}}
\]</span></p>
<p>The degree of belonging, <span class="math">\(u_{ij}\)</span>, is inversely related to the distance from <span class="math">\(x_i\)</span> to the cluster center <span class="math">\(\mu_j\)</span>.</p>
<p>The parameter <span class="math">\(m\)</span> is a real number greater than 1 (<span class="math">\(1.0 < m < \infty\)</span>) and it defines the level of cluster fuzziness. Note that a value of <span class="math">\(m\)</span> close to 1 gives a cluster solution which becomes increasingly similar to the solution of hard clustering such as k-means, whereas a very large value of <span class="math">\(m\)</span> leads to complete fuzziness.</p>
<p><span class="success">Note that, a good choice is to use <strong>m = 2.0</strong> (Hathaway and Bezdek 2001).</span></p>
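<p>The effect of the fuzzifier can be checked numerically. The sketch below computes membership values from the distances to a set of cluster centers; the helper fcm_membership(), the toy observations and the centers are hypothetical, and the helper assumes that no observation coincides exactly with a center.</p>
<pre class="r"><code># Hypothetical helper: memberships from distances to the cluster centers
fcm_membership <- function(x, centers, m = 2) {
  # Euclidean distances: observations in rows, centers in columns
  d <- sapply(seq_len(nrow(centers)), function(j) {
    sqrt(rowSums(sweep(x, 2, centers[j, ])^2))
  })
  w <- d^(-2 / (m - 1))   # inverse-distance weights
  w / rowSums(w)          # each row sums to 1
}

x <- rbind(c(1, 1), c(2.5, 2.5), c(5, 5))    # toy observations
centers <- rbind(c(1.5, 1.5), c(4.5, 4.5))   # hypothetical centers

round(fcm_membership(x, centers, m = 1.1), 3)  # nearly crisp
round(fcm_membership(x, centers, m = 2), 3)
round(fcm_membership(x, centers, m = 10), 3)   # drifts toward 1/k</code></pre>
<p>Dividing the weights by their row sums is algebraically the same as the membership formula given above, written in a form that is easier to vectorize.</p>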
<p>In <strong>fuzzy clustering</strong>, the centroid of a cluster is the mean of all points, weighted by their degree of belonging to the cluster:</p>
<p><span class="math">\[
\mu_j = \frac{\sum\limits_{i=1}^n u_{ij}^m x_i}{\sum\limits_{i=1}^n u_{ij}^m}
\]</span></p>
<p>Where,</p>
<ul>
<li><span class="math">\(\mu_j\)</span> is the centroid of the cluster <span class="math">\(j\)</span></li>
<li><span class="math">\(n\)</span> is the number of observations</li>
<li><span class="math">\(u_{ij}\)</span> is the degree to which an observation <span class="math">\(x_i\)</span> belongs to the cluster <span class="math">\(j\)</span></li>
</ul>
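<p>The weighted mean can be written compactly with matrix operations. In the sketch below, the helper fcm_centers() and the toy data are hypothetical; the membership matrix has one row per observation and one column per cluster.</p>
<pre class="r"><code># Hypothetical helper: membership-weighted cluster centroids
fcm_centers <- function(x, u, m = 2) {
  w <- u^m                   # n x k matrix of weights
  t(w) %*% x / colSums(w)    # k x p matrix: one weighted mean per cluster
}

x <- rbind(c(1, 1), c(3, 1), c(5, 5), c(7, 5))

# Crisp memberships give the ordinary cluster means
u_hard <- rbind(c(1, 0), c(1, 0), c(0, 1), c(0, 1))
fcm_centers(x, u_hard)

# Fuzzy memberships pull each centroid slightly toward the other cluster
u_soft <- rbind(c(.9, .1), c(.7, .3), c(.2, .8), c(.1, .9))
fcm_centers(x, u_soft)</code></pre>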
<p>The algorithm of fuzzy clustering can be summarized as follows:</p>
<ol style="list-style-type: decimal">
<li>Specify a number of clusters k (by the analyst)</li>
<li>Assign randomly to each point coefficients for being in the clusters.</li>
<li>Repeat until the maximum number of iterations (given by “maxit”) is reached, or when the algorithm has converged (that is, the coefficients’ change between two iterations is no more than <span class="math">\(\epsilon\)</span>, the given sensitivity threshold):
<ul>
<li>Compute the centroid for each cluster, using the formula above.</li>
<li>For each point, compute its coefficients of being in the clusters, using the formula above.</li>
</ul></li>
</ol>
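<p>The steps above can be put together in a short base R sketch. It is meant for illustration only and is not the implementation used by fanny() or cmeans(); the function fcm_sketch() and its arguments are hypothetical, with "maxit" and "eps" playing the roles of the maximum number of iterations and the sensitivity threshold.</p>
<pre class="r"><code>fcm_sketch <- function(x, k, m = 2, maxit = 100, eps = 1e-6) {
  x <- as.matrix(x)
  n <- nrow(x)
  # Random membership coefficients, each row normalized to sum to 1
  u <- matrix(runif(n * k), n, k)
  u <- u / rowSums(u)
  for (iter in seq_len(maxit)) {
    # Centroids: membership-weighted means of the observations
    w <- u^m
    centers <- t(w) %*% x / colSums(w)
    # Memberships: recomputed from the distances to the centroids
    d <- sapply(seq_len(k), function(j) {
      sqrt(rowSums(sweep(x, 2, centers[j, ])^2))
    })
    d <- pmax(d, .Machine$double.eps)   # guard against a zero distance
    u_new <- d^(-2 / (m - 1))
    u_new <- u_new / rowSums(u_new)
    # Convergence: largest change in the coefficients is at most eps
    converged <- max(abs(u_new - u)) <= eps
    u <- u_new
    if (converged) break
  }
  list(membership = u, centers = centers, iterations = iter,
       clustering = max.col(u, ties.method = "first"))
}

set.seed(123)
res <- fcm_sketch(scale(USArrests), k = 2)
head(round(res$membership, 2))
table(res$clustering)</code></pre>
<p>The "clustering" component gives the nearest crisp grouping, obtained by assigning each observation to the cluster with its largest membership coefficient.</p>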
<p>Like k-means, the algorithm minimizes intra-cluster variance and shares the same weaknesses: the minimum found is only a local minimum, and the results depend on the initial choice of weights. Hence, different initializations may lead to different results.</p>
<p>Using a mixture of Gaussians along with the expectation-maximization algorithm is a more statistically formalized method which includes some of these ideas: partial membership in classes.</p>
<div id="r-functions-for-fuzzy-clustering" class="section level2">
<h2><span class="header-section-number">3.1</span> R functions for fuzzy clustering</h2>
<div id="fanny-fuzzy-analysis-clustering" class="section level3">
<h3><span class="header-section-number">3.1.1</span> fanny(): Fuzzy analysis clustering</h3>
<p>The function <strong>fanny()</strong> [in <strong>cluster</strong> package] can be used to compute <strong>fuzzy clustering</strong>. <strong>FANNY</strong> stands for <strong>fuzzy analysis clustering</strong>. A simplified format is:</p>
<pre class="r"><code>fanny(x, k, memb.exp = 2, metric = "euclidean", 
      stand = FALSE, maxit = 500)</code></pre>
<br/>
<div class="block">
<ul>
<li><strong>x</strong>: A data matrix or data frame or dissimilarity matrix</li>
<li><strong>k</strong>: The desired number of clusters to be generated</li>
<li><strong>memb.exp</strong>: The membership exponent (strictly larger than 1) used in the fit criteria. It’s also known as the fuzzifier</li>
<li><strong>metric</strong>: The metric to be used for calculating dissimilarities between observations</li>
<li><strong>stand</strong>: Logical; if true, the measurements in x are standardized before calculating the dissimilarities</li>
<li><strong>maxit</strong>: maximal number of iterations</li>
</ul>
</div>
<p><br/></p>
<p>The function <strong>fanny()</strong> returns an object including the following components:</p>
<ul>
<li><strong>membership</strong>: matrix containing the degree to which each observation belongs to a given cluster. Column names are the clusters and rows are observations</li>
<li><strong>coeff</strong>: <strong>Dunn’s partition coefficient</strong> F(k) of the clustering, where k is the number of clusters. F(k) is the sum of all squared membership coefficients, divided by the number of observations. Its value is between 1/k and 1. The normalized form of the coefficient is also given. It is defined as <span class="math">\((F(k) - 1/k) / (1 - 1/k)\)</span>, and ranges between 0 and 1. A low value of Dunn’s coefficient indicates a very fuzzy clustering, whereas a value close to 1 indicates a near-crisp clustering.</li>
<li><strong>clustering</strong>: the clustering vector containing the nearest crisp grouping of observations</li>
</ul>
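<p>As a rough illustration of the definition above, Dunn’s partition coefficient can be computed by hand from a membership matrix. The sketch below uses a small made-up matrix <em>u</em> (hypothetical values, not taken from any real data set); the names <em>u</em>, <em>Fk</em> and <em>Fk_norm</em> are chosen for illustration only.</p>
<pre class="r"><code># Hypothetical membership matrix: 4 observations (rows), 2 clusters (columns).
# Each row sums to 1.
u <- matrix(c(0.9, 0.1,
              0.8, 0.2,
              0.3, 0.7,
              0.5, 0.5), ncol = 2, byrow = TRUE)
k <- ncol(u)

# F(k): sum of all squared memberships divided by the number of observations
Fk <- sum(u^2) / nrow(u)

# Normalized form, rescaled to lie between 0 and 1
Fk_norm <- (Fk - 1/k) / (1 - 1/k)

c(dunn_coeff = Fk, normalized = Fk_norm)</code></pre>
<p>For this toy matrix F(k) = 0.645 and the normalized value is 0.29. The last observation, with memberships of 0.5 in both clusters, contributes the minimum possible amount (1/k) to the sum and pulls the coefficient toward the fuzzy end of the range.</p>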
<p>A subset of USArrests data is used in the following example:</p>
<pre class="r"><code>library(cluster)
set.seed(123)
# Load the data
data("USArrests")

# Subset of USArrests
ss <- sample(1:50, 20)
df <- scale(USArrests[ss,])

# Compute fuzzy clustering
res.fanny <- fanny(df, 4)

# Cluster plot using fviz_cluster()
# You can also use: clusplot(res.fanny)
library(factoextra)
fviz_cluster(res.fanny, frame.type = "norm",
             frame.level = 0.68)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/fuzzy-clustering-fuzzy-clustering-fanny-1.png" title="" alt="" width="518.4" /></p>
<pre class="r"><code># Silhouette plot
fviz_silhouette(res.fanny, label = TRUE)</code></pre>
<pre><code>##   cluster size ave.sil.width
## 1       1    4          0.52
## 2       2    6          0.10
## 3       3    6          0.41
## 4       4    4          0.04</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/fuzzy-clustering-fuzzy-clustering-fanny-2.png" title="" alt="" width="518.4" /></p>
<p>The result of the <strong>fanny()</strong> function can be printed as follows:</p>
<pre class="r"><code>print(res.fanny)</code></pre>
<pre><code>## Fuzzy Clustering object of class 'fanny' :
## m.ship.expon.        2
## objective     6.052789
## tolerance        1e-15
## iterations         215
## converged            1
## maxit              500
## n                   20
## Membership coefficients (in %, rounded):
##              [,1] [,2] [,3] [,4]
## Iowa           75   11    7    7
## Rhode Island   26   32   21   21
## Maryland        8   19   37   37
## Tennessee      10   24   33   33
## Utah           23   36   20   20
## Arizona        10   23   34   34
## Mississippi    16   25   29   29
## Wisconsin      65   15   10   10
## Virginia       17   37   23   23
## Maine          63   15   11   11
## Texas           8   25   33   33
## Louisiana       9   22   35   35
## Montana        41   26   17   17
## Michigan        8   20   36   36
## Arkansas       19   30   25   25
## New York        9   24   34   34
## Florida        10   21   35   35
## Alaska         15   24   31   31
## Hawaii         27   34   20   20
## New Jersey     16   37   23   23
## Fuzzyness coefficients:
## dunn_coeff normalized 
## 0.31337355 0.08449807 
## Closest hard clustering:
##         Iowa Rhode Island     Maryland    Tennessee         Utah 
##            1            2            3            4            2 
##      Arizona  Mississippi    Wisconsin     Virginia        Maine 
##            3            4            1            2            1 
##        Texas    Louisiana      Montana     Michigan     Arkansas 
##            3            4            1            3            2 
##     New York      Florida       Alaska       Hawaii   New Jersey 
##            3            3            4            2            2 
## 
## Available components:
##  [1] "membership"  "coeff"       "memb.exp"    "clustering"  "k.crisp"    
##  [6] "objective"   "convergence" "diss"        "call"        "silinfo"    
## [11] "data"</code></pre>
<p>The different components can be extracted using the code below:</p>
<pre class="r"><code># Membership coefficient
res.fanny$membership</code></pre>
<pre><code>##                    [,1]      [,2]       [,3]       [,4]
## Iowa         0.75234997 0.1056742 0.07098791 0.07098791
## Rhode Island 0.26129280 0.3198982 0.20940449 0.20940449
## Maryland     0.07559096 0.1906031 0.36690296 0.36690296
## Tennessee    0.10351700 0.2444743 0.32600436 0.32600436
## Utah         0.23177048 0.3631831 0.20252321 0.20252321
## Arizona      0.09505979 0.2329621 0.33598906 0.33598906
## Mississippi  0.15957721 0.2511123 0.29465525 0.29465525
## Wisconsin    0.65274007 0.1530047 0.09712764 0.09712764
## Virginia     0.16856415 0.3654879 0.23297397 0.23297397
## Maine        0.62818484 0.1532966 0.10925930 0.10925930
## Texas        0.08407125 0.2465250 0.33470188 0.33470188
## Louisiana    0.09152177 0.2159634 0.34625741 0.34625741
## Montana      0.40788012 0.2556886 0.16821562 0.16821562
## Michigan     0.07811792 0.1957270 0.36307753 0.36307753
## Arkansas     0.19473888 0.2992279 0.25301662 0.25301662
## New York     0.08723572 0.2392572 0.33675356 0.33675356
## Florida      0.09725070 0.2073927 0.34767830 0.34767830
## Alaska       0.14688036 0.2428630 0.30512830 0.30512830
## Hawaii       0.26945561 0.3356724 0.19743602 0.19743602
## New Jersey   0.16160093 0.3720897 0.23315470 0.23315470</code></pre>
<pre class="r"><code># Visualize using corrplot
library(corrplot)
corrplot(res.fanny$membership, is.corr = FALSE)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/fuzzy-clustering-fuzzy-clustering-membership-1.png" title="" alt="" width="518.4" /></p>
<pre class="r"><code># Dunn's partition coefficient
res.fanny$coeff</code></pre>
<pre><code>## dunn_coeff normalized 
## 0.31337355 0.08449807</code></pre>
<pre class="r"><code># Observation groups
res.fanny$clustering</code></pre>
<pre><code>##         Iowa Rhode Island     Maryland    Tennessee         Utah 
##            1            2            3            4            2 
##      Arizona  Mississippi    Wisconsin     Virginia        Maine 
##            3            4            1            2            1 
##        Texas    Louisiana      Montana     Michigan     Arkansas 
##            3            4            1            3            2 
##     New York      Florida       Alaska       Hawaii   New Jersey 
##            3            3            4            2            2</code></pre>
</div>
<div id="cmeans" class="section level3">
<h3><span class="header-section-number">3.1.2</span> cmeans()</h3>
<p>It’s also possible to use the function <strong>cmeans()</strong> [in <strong>e1071</strong> package] to compute fuzzy clustering. A simplified format is:</p>
<pre class="r"><code>cmeans(x, centers, iter.max = 100, dist = "euclidean", m = 2)</code></pre>
<br/>
<div class="block">
<ul>
<li><strong>x</strong>: a data matrix where columns are variables and rows are observations</li>
<li><strong>centers</strong>: Number of clusters or initial values for cluster centers</li>
<li><strong>iter.max</strong>: Maximum number of iterations</li>
<li><strong>dist</strong>: Possible values are “euclidean” or “manhattan”</li>
<li><strong>m</strong>: A number greater than 1 giving the degree of fuzzification.</li>
</ul>
</div>
<p><br/></p>
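<p>To give a feel for the role of <strong>m</strong>, the sketch below applies the textbook fuzzy c-means membership formula to a single observation: the membership in cluster j is proportional to d_j^(-2/(m - 1)), where d_j is the distance to the center of cluster j. This is an assumption about the standard algorithm, not a transcription of the <strong>cmeans()</strong> internals, and <em>fcm_membership()</em> is a hypothetical helper written for this illustration.</p>
<pre class="r"><code># Hypothetical helper: memberships of one observation given its (strictly
# positive) distances d to each cluster center and the fuzzifier m
fcm_membership <- function(d, m = 2) {
  stopifnot(m > 1, all(d > 0))
  w <- d^(-2 / (m - 1))  # closer centers get larger weights
  w / sum(w)             # normalize so that memberships sum to 1
}

# One observation at distance 1 from center 1 and distance 2 from center 2
d <- c(1, 2)
fcm_membership(d, m = 1.1)  # close to a crisp assignment
fcm_membership(d, m = 2)    # the default fuzzifier
fcm_membership(d, m = 5)    # much fuzzier</code></pre>
<p>With m = 2 the memberships are 0.8 and 0.2. Values of m close to 1 push them toward 1 and 0, whereas larger values pull them toward equal shares.</p>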
<p>The function <strong>cmeans()</strong> returns an object of class <strong>fclust</strong> which is a list containing the following components:</p>
<ul>
<li><strong>centers</strong>: the final cluster centers</li>
<li><strong>size</strong>: the number of data points in each cluster of the closest hard clustering</li>
<li><strong>cluster</strong>: an integer vector giving, for each data point, the index of the cluster it is assigned to in the closest hard clustering, obtained by assigning each point to the (first) class with maximal membership.</li>
<li><strong>iter</strong>: the number of iterations performed</li>
<li><strong>membership</strong>: a matrix with the membership values of the data points to the clusters</li>
<li><strong>withinerror</strong>: the value of the objective function</li>
</ul>
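<p>The relation between <strong>membership</strong>, <strong>cluster</strong> and <strong>size</strong> described above can be sketched in base R. The membership matrix below is made up for illustration (hypothetical values and row names), and the code only mimics the rule "first class with maximal membership"; it is not the package’s own code.</p>
<pre class="r"><code># Hypothetical membership matrix: 3 observations, 3 clusters
u <- matrix(c(0.7, 0.2, 0.1,
              0.1, 0.3, 0.6,
              0.4, 0.4, 0.2), ncol = 3, byrow = TRUE,
            dimnames = list(c("a", "b", "c"), c("1", "2", "3")))

# which.max() returns the position of the first maximum, so a tie
# (observation "c") goes to the lowest-numbered cluster
hard <- apply(u, 1, which.max)
hard

# Number of observations per cluster in this hard clustering
table(hard)</code></pre>
<p>Observations "a" and "c" are assigned to cluster 1 and "b" to cluster 3, so cluster 2 receives no observation in this toy hard clustering.</p>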
<pre class="r"><code>set.seed(123)
library(e1071)
cm <- cmeans(df, 4)
cm</code></pre>
<pre><code>## Fuzzy c-means clustering with 4 clusters
## 
## Cluster centers:
##       Murder    Assault   UrbanPop       Rape
## 1  0.6290005  0.9705484  0.5006389  0.8647698
## 2  0.8560350  0.3375298 -0.7294688  0.2002994
## 3 -1.2101485 -1.2476750 -0.7277747 -1.1534135
## 4 -0.7314218 -0.6647441  1.0032068 -0.3335272
## 
## Memberships:
##                        1           2          3          4
## Iowa         0.005939255 0.009155372 0.96585947 0.01904590
## Rhode Island 0.104616576 0.098854401 0.20500209 0.59152694
## Maryland     0.697459281 0.227720539 0.02731256 0.04750762
## Tennessee    0.078024194 0.872296030 0.02111342 0.02856636
## Utah         0.049301432 0.044484100 0.08442894 0.82178552
## Arizona      0.740498081 0.118781050 0.03988867 0.10083220
## Mississippi  0.179555100 0.624367937 0.10296383 0.09311313
## Wisconsin    0.024017906 0.033630983 0.83136508 0.11098604
## Virginia     0.155690387 0.395730684 0.19167059 0.25690834
## Maine        0.021165990 0.034336946 0.89152511 0.05297195
## Texas        0.545608753 0.240753676 0.05410235 0.15953522
## Louisiana    0.275003950 0.617629141 0.04197257 0.06539434
## Montana      0.062161310 0.135620851 0.66557661 0.13664123
## Michigan     0.848927329 0.096168273 0.01784963 0.03705477
## Arkansas     0.131803310 0.565593614 0.18039386 0.12220922
## New York     0.694179984 0.131927283 0.04157413 0.13231860
## Florida      0.711655719 0.173670792 0.03979837 0.07487512
## Alaska       0.369474028 0.381553979 0.11356564 0.13540635
## Hawaii       0.064103932 0.066647766 0.14874490 0.72050340
## New Jersey   0.082015921 0.059546923 0.05743425 0.80100291
## 
## Closest hard clustering:
##         Iowa Rhode Island     Maryland    Tennessee         Utah 
##            3            4            1            2            4 
##      Arizona  Mississippi    Wisconsin     Virginia        Maine 
##            1            2            3            2            3 
##        Texas    Louisiana      Montana     Michigan     Arkansas 
##            1            2            3            1            2 
##     New York      Florida       Alaska       Hawaii   New Jersey 
##            1            1            2            4            4 
## 
## Available components:
## [1] "centers"     "size"        "cluster"     "membership"  "iter"       
## [6] "withinerror" "call"</code></pre>
<pre class="r"><code>fviz_cluster(list(data = df, cluster=cm$cluster), frame.type = "norm",
             frame.level = 0.68)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/fuzzy-clustering-cmeans-1.png" title="" alt="" width="518.4" /></p>
</div>
</div>
</div>
<div id="infos" class="section level1">
<h1><span class="header-section-number">4</span> Infos</h1>
<p><span class="warning">This analysis has been performed using <strong>R software</strong> (ver. 3.2.4)</span></p>
<ul>
<li>J. C. Dunn (1973): A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters. Journal of Cybernetics 3: 32-57</li>
<li>J. C. Bezdek (1981): Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York</li>
<li>Tariq Rashid: “Clustering”</li>
</ul>
</div>

<script>jQuery(document).ready(function () {
    jQuery('h1').addClass('wiki_paragraph1');
    jQuery('h2').addClass('wiki_paragraph2');
    jQuery('h3').addClass('wiki_paragraph3');
    jQuery('h4').addClass('wiki_paragraph4');
    });//add phpboost class to header</script>
<style>.content{padding:0px;}</style>
</div><!--end rdoc-->
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
  (function () {
    var script = document.createElement("script");
    script.type = "text/javascript";
    script.src  = "https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
    document.getElementsByTagName("head")[0].appendChild(script);
  })();
</script>
<!--====================== stop here when you copy to sthda================-->



<!-- END HTML -->]]></description>
			<pubDate>Wed, 27 Apr 2016 22:34:53 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Static and Interactive Heatmap in R - Unsupervised Machine Learning]]></title>
			<link>https://www.sthda.com/english/wiki/static-and-interactive-heatmap-in-r-unsupervised-machine-learning</link>
			<guid>https://www.sthda.com/english/wiki/static-and-interactive-heatmap-in-r-unsupervised-machine-learning</guid>
			<description><![CDATA[<!-- START HTML -->

            
  <!--====================== start from here when you copy to sthda================-->  
  <div id="rdoc">


<div id="TOC">
<ul>
<li><a href="#data"><span class="toc-section-number">1</span> Data</a></li>
<li><a href="#draw-a-heat-map-using-r-base-function"><span class="toc-section-number">2</span> Draw a heat map using R base function</a></li>
<li><a href="#enhanced-heat-map"><span class="toc-section-number">3</span> Enhanced heat map</a></li>
<li><a href="#interactive-heatmap"><span class="toc-section-number">4</span> Interactive heatmap</a></li>
<li><a href="#enhancing-heatmaps-using-dendextend"><span class="toc-section-number">5</span> Enhancing heatmaps using dendextend</a></li>
<li><a href="#complex-heatmap"><span class="toc-section-number">6</span> Complex heatmap</a><ul>
<li><a href="#install-and-load-complexheatmap-package"><span class="toc-section-number">6.1</span> Install and load ComplexHeatmap package</a></li>
<li><a href="#main-function-heatmap"><span class="toc-section-number">6.2</span> Main function: Heatmap()</a></li>
<li><a href="#single-heatmap"><span class="toc-section-number">6.3</span> Single heatmap</a><ul>
<li><a href="#colors"><span class="toc-section-number">6.3.1</span> Colors</a></li>
<li><a href="#titles"><span class="toc-section-number">6.3.2</span> Titles</a></li>
<li><a href="#row-and-column-names"><span class="toc-section-number">6.3.3</span> Row and column names</a></li>
<li><a href="#clustering"><span class="toc-section-number">6.3.4</span> Clustering</a></li>
<li><a href="#split-heatmap-by-rows"><span class="toc-section-number">6.3.5</span> Split heatmap by rows</a></li>
</ul></li>
<li><a href="#heatmap-annotation"><span class="toc-section-number">6.4</span> Heatmap annotation</a><ul>
<li><a href="#prepare-the-data"><span class="toc-section-number">6.4.1</span> Prepare the data</a></li>
<li><a href="#simple-annotation"><span class="toc-section-number">6.4.2</span> Simple annotation</a></li>
<li><a href="#complex-annotation"><span class="toc-section-number">6.4.3</span> Complex annotation</a></li>
</ul></li>
<li><a href="#combine-multiple-heatmaps"><span class="toc-section-number">6.5</span> Combine multiple heatmaps</a></li>
<li><a href="#real-application"><span class="toc-section-number">6.6</span> Real application</a></li>
<li><a href="#gene-expression-matrix"><span class="toc-section-number">6.7</span> Gene expression matrix</a></li>
<li><a href="#visualize-distribution-of-column-in-matrix"><span class="toc-section-number">6.8</span> Visualize distribution of column in matrix</a></li>
</ul></li>
<li><a href="#infos"><span class="toc-section-number">7</span> Infos</a></li>
</ul>
</div>

<p>In this article, we’ll describe how to draw static and <strong>interactive</strong> <strong>heatmaps</strong> in R. The following R packages and functions will be used:</p>
<ul>
<li><strong>heatmap</strong>(): an R base function for drawing a simple heatmap</li>
<li><strong>heatmap.2()</strong> [in <strong>gplots</strong> package]: a function for drawing an enhanced heatmap</li>
<li><strong>d3heatmap</strong>: an R package for drawing <strong>interactive</strong> heatmaps</li>
<li><strong>ComplexHeatmap</strong>: an R/Bioconductor package for drawing, annotating and arranging complex heatmaps (very useful for genomic data analysis)</li>
</ul>
<div id="data" class="section level1">
<h1><span class="header-section-number">1</span> Data</h1>
<p>The built-in <strong>mtcars</strong> R data is used:</p>
<pre class="r"><code>df <- as.matrix(scale(mtcars))</code></pre>
</div>
<div id="draw-a-heat-map-using-r-base-function" class="section level1">
<h1><span class="header-section-number">2</span> Draw a heat map using R base function</h1>
<p>The built-in R <strong>heatmap</strong> function [in <strong>stats</strong> package] can be used.</p>
<p>A simplified format is:</p>
<pre class="r"><code>heatmap(x, scale = "row")</code></pre>
<ul>
<li><strong>x</strong>: a numeric matrix</li>
<li><strong>scale</strong>: a character indicating if the values should be centered and scaled in either the row direction or the column direction, or none. Allowed values are in c(“row”, “column”, “none”). Default is “row”.</li>
</ul>
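<p>To make the meaning of <strong>scale = "row"</strong> concrete, the sketch below centers and scales each row of a matrix by hand, using only base R. It is meant as an illustration of the transformation described above, under the assumption that "centered and scaled" means subtracting the row mean and dividing by the row standard deviation; the names <em>x</em> and <em>x_row</em> are arbitrary.</p>
<pre class="r"><code>x <- as.matrix(mtcars)

# scale() standardizes columns, so transpose, scale, then transpose back
x_row <- t(scale(t(x)))

# Each row now has mean 0 and standard deviation 1
round(rowMeans(x_row)[1:3], 10)
apply(x_row, 1, sd)[1:3]</code></pre>
<p>For data such as <em>mtcars</em>, where the variables are in columns, scaling by column (as done with <strong>scale()</strong> in the Data section) is usually the more meaningful choice.</p>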
<pre class="r"><code># Default plot
heatmap(df, scale = "none")</code></pre>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/heatmap-r-base-heatmap-1.png" alt="Heatmap - R data visualization" width="518.4" />
<p class="caption">
Heatmap - R data visualization
</p>
</div>
<pre class="r"><code># Use custom colors
col <- colorRampPalette(c("red", "white", "blue"))(256)
heatmap(scale(as.matrix(mtcars)), scale = "none",
        col =  col)</code></pre>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/heatmap-r-base-heatmap-2.png" alt="Heatmap - R data visualization" width="518.4" />
<p class="caption">
Heatmap - R data visualization
</p>
</div>
<p>The R code below customizes the heatmap as follows:</p>
<ol style="list-style-type: decimal">
<li>An <a href="https://www.sthda.com/english/english/wiki/colors-in-r"><strong>RColorBrewer</strong></a> color palette name is used to change the appearance</li>
<li>The arguments <strong>RowSideColors</strong> and <strong>ColSideColors</strong> are used to annotate rows and columns, respectively. Each expects a vector of color names, one per row/column, specifying the class of that row/column.</li>
</ol>
<pre class="r"><code># Use RColorBrewer color palette names
library("RColorBrewer")
col <- colorRampPalette(brewer.pal(10, "RdYlBu"))(256)
heatmap(df, scale = "none", col =  col, 
        RowSideColors = rep(c("blue", "pink"), each = 16),
        ColSideColors = c(rep("purple", 5), rep("orange", 6)))</code></pre>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/heatmap-unnamed-chunk-3-1.png" alt="Heatmap - R data visualization" width="518.4" />
<p class="caption">
Heatmap - R data visualization
</p>
</div>
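<p>In the example above the side colors are assigned by position. In practice they are more often derived from a grouping variable. The sketch below, which assumes we want to annotate each car by its number of cylinders, builds such a vector; the palette and the names <em>cyl_colors</em> and <em>row_cols</em> are arbitrary choices made for this illustration.</p>
<pre class="r"><code>df <- as.matrix(scale(mtcars))

# One color per number of cylinders (4, 6 or 8)
cyl_colors <- c("4" = "blue", "6" = "pink", "8" = "purple")

# Look up the color of each row; the result has one entry per row of df
row_cols <- unname(cyl_colors[as.character(mtcars$cyl)])

heatmap(df, scale = "none", RowSideColors = row_cols)</code></pre>
<p>Because the lookup is done on the original row order, the colors follow their rows when the heatmap reorders them according to the dendrogram.</p>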
</div>
<div id="enhanced-heat-map" class="section level1">
<h1><span class="header-section-number">3</span> Enhanced heat map</h1>
<p>The function <strong>heatmap.2()</strong> [in <strong>gplots</strong> package] provides many extensions to the standard R <strong>heatmap()</strong> function presented in the previous section.</p>
<pre class="r"><code># install.packages("gplots")
library("gplots")
heatmap.2(df, scale = "none", col = bluered(100), 
          trace = "none", density.info = "none")</code></pre>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/heatmap-unnamed-chunk-4-1.png" alt="Heatmap - R data visualization" width="518.4" />
<p class="caption">
Heatmap - R data visualization
</p>
</div>
<p>Other arguments can be used including:</p>
<ul>
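<li><strong>labRow, labCol</strong>: character vectors of row and column labels</li>
<li><strong>margins</strong>: numeric vector of length 2 giving the margins for column and row names, respectively</li>
<li><strong>hclustfun</strong>: the function used to compute the hierarchical clustering, e.g. hclustfun = function(x) hclust(x, method = "ward.D") (the method formerly named "ward")</li>
<li><strong>keysize</strong>: numeric value giving the size of the color key</li>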
</ul>
<p>In the R code above, <strong>bluered()</strong> function [in <strong>gplots</strong> package] is used to generate a smoothly varying set of colors. You can also use the following color generator functions:</p>
<ul>
<li>colorpanel(n, low, mid, high)
<ul>
<li>n: Desired number of color elements to be generated</li>
<li>low, mid, high: Colors to use for the lowest, middle, and highest values. mid may be omitted.</li>
</ul></li>
<li>redgreen(n)</li>
<li>greenred(n)</li>
<li>bluered(n)</li>
<li>redblue(n)</li>
</ul>
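<p>The <strong>hclustfun</strong> argument listed above takes any function that turns a distance object into a hierarchical clustering. The base <strong>heatmap()</strong> function has an argument of the same name, so the idea can be sketched without extra packages. <em>hclust_ward()</em> is a hypothetical helper name used only for this example.</p>
<pre class="r"><code>df <- as.matrix(scale(mtcars))

# Hypothetical helper: Ward clustering of the distance object it receives.
# "ward.D" is the current name of the method formerly called "ward".
hclust_ward <- function(d) hclust(d, method = "ward.D")

# The same function is applied to both rows and columns
heatmap(df, scale = "none", hclustfun = hclust_ward)</code></pre>
<p>Defining the helper once keeps the call readable and makes it easy to compare several linkage methods by changing a single line.</p>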
</div>
<div id="interactive-heatmap" class="section level1">
<h1><span class="header-section-number">4</span> Interactive heatmap</h1>
<p>The package <strong>d3heatmap</strong> can be used to produce an <strong>interactive heatmap</strong>.</p>
<p>It can be installed as follows:</p>
<pre class="r"><code>if (!require("devtools")) install.packages("devtools")
devtools::install_github("rstudio/d3heatmap")</code></pre>
<p>The function <strong>d3heatmap()</strong> is used to create the <strong>interactive heatmap</strong>, which provides the following interactions:</p>
<ul>
<li>Hover over a heatmap cell of interest to view the row and column names as well as the corresponding value.</li>
<li><strong>Select an area</strong> to zoom in. After zooming, click on the heatmap again to go back to the previous display.</li>
</ul>
<pre class="r"><code>library("d3heatmap")
d3heatmap(scale(mtcars), colors = "RdBu",
          k_row = 4, k_col = 2)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/images/interactive-heatmap.png" alt="interactive heatmap" /></p>
<ul>
<li><strong>colors</strong>: Either an <strong>RColorBrewer</strong> color palette name (e.g. “YlOrRd” or “Blues”), or a vector of colors to interpolate in hexadecimal “#RRGGBB” format, or a color interpolation function like colorRamp. Read this: <a href="https://www.sthda.com/english/english/wiki/colors-in-r">available colors in R</a></li>
<li><strong>k_row</strong>, <strong>k_col</strong>: an integer specifying the desired number of groups by which to color the dendrogram’s branches in row and column, respectively.</li>
</ul>
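<p>The groups requested with <strong>k_row</strong> and <strong>k_col</strong> correspond to cutting a dendrogram into k branches. The sketch below shows the idea for the rows with base R only. It assumes Euclidean distances and the default <strong>hclust()</strong> linkage, which may or may not match the defaults used by <strong>d3heatmap()</strong>; <em>row_hc</em> and <em>row_groups</em> are names chosen for this illustration.</p>
<pre class="r"><code># Cluster the rows of the scaled data
row_hc <- hclust(dist(scale(mtcars)))

# Cut the tree into 4 groups, as k_row = 4 would request
row_groups <- cutree(row_hc, k = 4)

head(row_groups)   # group label of the first cars
table(row_groups)  # number of cars per group</code></pre>
<p>Each car receives a label between 1 and 4; branches of the row dendrogram that share a label are the ones drawn in the same color.</p>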
<p>To customize the heatmap further, see <strong>?d3heatmap</strong>.</p>
</div>
<div id="enhancing-heatmaps-using-dendextend" class="section level1">
<h1><span class="header-section-number">5</span> Enhancing heatmaps using dendextend</h1>
<p>The package <strong>dendextend</strong> can be used to enhance functions from other packages. The <em>mtcars</em> data is used in the following sections. We’ll start by defining the order and the appearance of rows and columns using dendextend. These results are then used in other functions from other packages.</p>
<p>The order and the appearance of rows and columns can be defined as follows:</p>
<pre class="r"><code>library(dendextend)
# order for rows
Rowv  <- mtcars %>% scale %>% dist %>% hclust %>% as.dendrogram %>%
   set("branches_k_color", k = 3) %>% set("branches_lwd", 1.2) %>%
   ladderize

# Order for columns
# We must transpose the data
Colv  <- mtcars %>% scale %>% t %>% dist %>% hclust %>% as.dendrogram %>%
   set("branches_k_color", k = 2, value = c("orange", "blue")) %>%
   set("branches_lwd", 1.2) %>%
   ladderize</code></pre>
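<p>For readers less familiar with the pipe operator, the clustering part of the two chains can be written as nested base R calls. This is only a sketch of the steps up to <strong>as.dendrogram</strong>: the coloring with <strong>set()</strong> and the reordering with <strong>ladderize</strong> come from dendextend and are left out. <em>Rowv_base</em> and <em>Colv_base</em> are hypothetical names, kept distinct from <strong>Rowv</strong> and <strong>Colv</strong>, which remain the objects passed to the heatmap functions.</p>
<pre class="r"><code># Rows: scale, compute distances, cluster, convert to a dendrogram
Rowv_base <- as.dendrogram(hclust(dist(scale(mtcars))))

# Columns: same steps on the transposed scaled data
Colv_base <- as.dendrogram(hclust(dist(t(scale(mtcars)))))

# Row labels in the order in which the dendrogram arranges them
rownames(mtcars)[order.dendrogram(Rowv_base)][1:5]</code></pre>
<p>Since <strong>ladderize</strong> rotates branches, the leaf order of <em>Rowv_base</em> may differ from that of <strong>Rowv</strong> even though both describe the same clustering.</p>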
<p>The <strong>Rowv</strong> and <strong>Colv</strong> dendrograms defined above can be passed to the functions below:</p>
<ol style="list-style-type: decimal">
<li>The standard <strong>heatmap()</strong> function [in <strong>stats</strong> package]:</li>
</ol>
<pre class="r"><code>heatmap(scale(mtcars), Rowv = Rowv, Colv = Colv,
        scale = "none")</code></pre>
<ol start="2" style="list-style-type: decimal">
<li>The enhanced <strong>heatmap.2()</strong> function [in <strong>gplots</strong> package]:</li>
</ol>
<pre class="r"><code>library(gplots)
heatmap.2(scale(mtcars), scale = "none", col = bluered(100), 
          Rowv = Rowv, Colv = Colv,
          trace = "none", density.info = "none")</code></pre>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/heatmap-heatmap-2-dendextend-1.png" alt="Heatmap - R data visualization" width="518.4" />
<p class="caption">
Heatmap - R data visualization
</p>
</div>
<ol start="3" style="list-style-type: decimal">
<li>The interactive heatmap generator <strong>d3heatmap()</strong> function [in <strong>d3heatmap</strong> package]:</li>
</ol>
<pre class="r"><code>library("d3heatmap")
d3heatmap(scale(mtcars), colors = "RdBu",
          Rowv = Rowv, Colv = Colv)</code></pre>
</div>
<div id="complex-heatmap" class="section level1">
<h1><span class="header-section-number">6</span> Complex heatmap</h1>
<p><strong>ComplexHeatmap</strong> is an R/Bioconductor package, developed by Zuguang Gu, which provides a flexible solution to arrange and annotate multiple heatmaps. It also makes it possible to visualize the association between data from different sources.</p>
<div id="install-and-load-complexheatmap-package" class="section level2">
<h2><span class="header-section-number">6.1</span> Install and load ComplexHeatmap package</h2>
<p>The latest version can be installed as follows:</p>
<pre class="r"><code>if (!require("devtools")) install.packages("devtools")
devtools::install_github("jokergoo/ComplexHeatmap")</code></pre>
<p>Loading:</p>
<pre class="r"><code>library("ComplexHeatmap")</code></pre>
</div>
<div id="main-function-heatmap" class="section level2">
<h2><span class="header-section-number">6.2</span> Main function: Heatmap()</h2>
<p>The main function from <strong>ComplexHeatmap</strong> package is <strong>Heatmap()</strong>. A simplified format is:</p>
<pre class="r"><code>Heatmap(matrix, col, name)</code></pre>
<br/>
<div class="block">
<ul>
<li><strong>matrix</strong>: a numeric or character matrix</li>
<li><strong>col</strong>: a vector of colors (discrete color mapping) or a color mapping function (if the matrix contains continuous values)</li>
<li><strong>name</strong>: the name of the <strong>heatmap</strong></li>
</ul>
</div>
<p><br/></p>
</div>
<div id="single-heatmap" class="section level2">
<h2><span class="header-section-number">6.3</span> Single heatmap</h2>
<p>A <strong>single heatmap</strong> can be used to visualize a data set containing <strong>continuous</strong> or <strong>discrete</strong> values.</p>
<p>In the example below we’ll visualize the built-in <em>mtcars</em> data set.</p>
<p>Recall that the <em>mtcars</em> data comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).</p>
<pre class="r"><code>data(mtcars)
head(mtcars[, 1:6])</code></pre>
<pre><code>##                    mpg cyl disp  hp drat    wt
## Mazda RX4         21.0   6  160 110 3.90 2.620
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875
## Datsun 710        22.8   4  108  93 3.85 2.320
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215
## Hornet Sportabout 18.7   8  360 175 3.15 3.440
## Valiant           18.1   6  225 105 2.76 3.460</code></pre>
<p>Before drawing the heatmap, the data is first <strong>scaled</strong> using the R base <strong>scale()</strong> function.</p>
<pre class="r"><code>df <- scale(mtcars)
Heatmap(df, name = "mtcars")</code></pre>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/heatmap-single-heatmap-1.png" alt="Heatmap - R data visualization" width="518.4" />
<p class="caption">
Heatmap - R data visualization
</p>
</div>
<div id="colors" class="section level3">
<h3><span class="header-section-number">6.3.1</span> Colors</h3>
<p>The argument <strong>col</strong> is used to specify colors. As our data matrix contains continuous values, the option <strong>col</strong> should be a <strong>color mapping function</strong>. In this case, the <strong>colorRamp2()</strong> function [in <strong>circlize</strong>] can be used as follows:</p>
<pre class="r"><code>library("circlize")
Heatmap(df, name = "mtcars",
        col = colorRamp2(c(-2, 0, 2), c("green", "white", "red")))</code></pre>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/heatmap-colors-1.png" alt="Heatmap - R data visualization" width="518.4" />
<p class="caption">
Heatmap - R data visualization
</p>
</div>
<p><span class="warning">The two arguments of <strong>colorRamp2()</strong> are a vector of break values and a vector of the corresponding colors.</span></p>
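<p>To illustrate what a color mapping function does with break values and colors, the sketch below builds a simplified stand-in with base R. <em>map_colors()</em> is a hypothetical helper, not part of circlize: it assumes equally spaced breaks, interpolates linearly in RGB, and clamps values outside the breaks, whereas <strong>colorRamp2()</strong> may interpolate in a different color space and therefore return slightly different intermediate colors.</p>
<pre class="r"><code># Hypothetical helper: map numeric values to colors between equally spaced breaks
map_colors <- function(x, breaks = c(-2, 0, 2),
                       colors = c("green", "white", "red")) {
  ramp <- colorRamp(colors)  # maps positions in [0, 1] to RGB values
  # Rescale x so that the first break is 0 and the last break is 1
  pos <- (x - min(breaks)) / (max(breaks) - min(breaks))
  pos <- pmin(pmax(pos, 0), 1)  # values beyond the breaks take the end colors
  rgb(ramp(pos), maxColorValue = 255)
}

map_colors(c(-3, -2, 0, 2))
# "#00FF00" "#00FF00" "#FFFFFF" "#FF0000"</code></pre>
<p>A scaled value of 0 is drawn in white, values at or below -2 in green and values at or above 2 in red, which is why the breaks are usually chosen symmetrically around 0 for scaled data.</p>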
<p>It’s also possible to use <a href="https://www.sthda.com/english/english/wiki/colors-in-r"><strong>RColorBrewer</strong></a> color palettes:</p>
<pre class="r"><code>library("RColorBrewer")
Heatmap(df, name = "mtcars",
        col = colorRamp2(c(-2, 0, 2), brewer.pal(n=3, name="RdBu")))</code></pre>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/heatmap-colors-rcolorbrewer-1.png" alt="Heatmap - R data visualization" width="518.4" />
<p class="caption">
Heatmap - R data visualization
</p>
</div>
<br/>
<div class="warning">
<p>In the next sections, we’ll use the following custom color palette:</p>
<pre class="r"><code>mycol <- colorRamp2(c(-2, 0, 2), c("blue", "white", "red"))</code></pre>
</div>
<p><br/></p>
</div>
<div id="titles" class="section level3">
<h3><span class="header-section-number">6.3.2</span> Titles</h3>
<p>The heatmap <strong>name</strong>, <strong>column title</strong> and <strong>row title</strong> can be changed as follows:</p>
<pre class="r"><code>Heatmap(df, name = "mtcars", col = mycol,
        column_title = "Column title",
        row_title = "Row title")</code></pre>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/heatmap-row-column-titles-1.png" alt="Heatmap - R data visualization" width="518.4" />
<p class="caption">
Heatmap - R data visualization
</p>
</div>
<br/>
<div class="warning">
<p>Note that the default side for the <strong>row title</strong> is “left” and the default side for the <strong>column title</strong> is “top”. This can be changed using the options below:</p>
<ul>
<li><strong>row_title_side</strong>: Allowed values are “left” or “right” (e.g.: <strong>row_title_side = “right”</strong> )</li>
<li><strong>column_title_side</strong>: Allowed values are “top” or “bottom” (e.g.: <strong>column_title_side = “bottom”</strong> )</li>
</ul>
</div>
<p><br/></p>
<p>It’s also possible to modify the font size and face of titles using the options:</p>
<br/>
<div class="block">
<ul>
<li><strong>row_title_gp</strong>: graphic parameters for drawing the <strong>row title</strong> text</li>
<li><strong>column_title_gp</strong>: graphic parameters for drawing the <strong>column title</strong> text</li>
</ul>
</div>
<p><br/></p>
<p>For instance,</p>
<pre class="r"><code>Heatmap(df, name = "mtcars", col = mycol,
        column_title = "Column title",
        column_title_gp = gpar(fontsize = 14, fontface = "bold"),
        row_title = "Row title",
        row_title_gp = gpar(fontsize = 14, fontface = "bold"))</code></pre>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/heatmap-row-column-titles-font-size-1.png" alt="Heatmap - R data visualization" width="518.4" />
<p class="caption">
Heatmap - R data visualization
</p>
</div>
<p><span class="warning"> In the R code above, the possible values for <strong>fontface</strong> can be an integer or string: 1 = plain, 2 = bold, 3 = italic and 4 = bold italic. If a string, then valid values are: “plain”, “bold”, “italic”, “oblique”, and “bold.italic”.</span></p>
</div>
<div id="row-and-column-names" class="section level3">
<h3><span class="header-section-number">6.3.3</span> Row and column names</h3>
<ol style="list-style-type: decimal">
<li><strong>Show row/column names</strong>:
<ul>
<li><strong>show_row_names</strong>: whether to show row names. Default value is <strong>TRUE</strong></li>
<li><strong>show_column_names</strong>: whether to show column names. Default value is <strong>TRUE</strong></li>
</ul></li>
</ol>
<pre class="r"><code>Heatmap(df, name = "mtcars", show_row_names = FALSE)</code></pre>
<ol start="2" style="list-style-type: decimal">
<li><strong>Change font size and face</strong>:
<ul>
<li><strong>row_names_gp</strong>: graphical parameters for drawing row names</li>
<li><strong>column_names_gp</strong>: graphical parameters for drawing column names</li>
</ul></li>
</ol>
<pre class="r"><code>Heatmap(df, name = "mtcars", 
        row_names_gp = gpar(fontsize = 14, fontface = "bold",
                            col = c("blue", "red")))</code></pre>
</div>
<div id="clustering" class="section level3">
<h3><span class="header-section-number">6.3.4</span> Clustering</h3>
<div id="change-the-appearance-of-clustering" class="section level4">
<h4><span class="header-section-number">6.3.4.1</span> Change the appearance of clustering</h4>
<p>By default, rows and columns are clustered. Clustering can be disabled using the following arguments:</p>
<br/>
<div class="block">
<ul>
<li><strong>cluster_rows = FALSE</strong>: disables clustering on rows (if TRUE, rows are clustered)</li>
<li><strong>cluster_columns = FALSE</strong>: disables clustering on columns (if TRUE, columns are clustered)</li>
</ul>
</div>
<p><br/></p>
<p>Clustering on rows is disabled using the R code below:</p>
<pre class="r"><code># Disable clustering on rows
Heatmap(df, name = "mtcars", col = mycol, cluster_rows = FALSE)</code></pre>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/heatmap-clustering-1.png" alt="Heatmap - R data visualization" width="518.4" />
<p class="caption">
Heatmap - R data visualization
</p>
</div>
<p>In some cases, we want to cluster rows/columns without showing the dendrogram on the final image. In this case, use the options:</p>
<br/>
<div class="block">
<ul>
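<li><strong>show_row_hclust</strong>: logical value; whether to show row clusters (named <strong>show_row_dend</strong> in recent versions of ComplexHeatmap)</li>
<li><strong>show_column_hclust</strong>: logical value; whether to show column clusters (named <strong>show_column_dend</strong> in recent versions of ComplexHeatmap)</li>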
</ul>
</div>
<p><br/></p>
<p>It’s also possible to change the <strong>side</strong> of row and column clusters using the arguments:</p>
<br/>
<div class="block">
<ul>
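<li><strong>row_hclust_side</strong>: The allowed values are “left” or “right” (named <strong>row_dend_side</strong> in recent versions of ComplexHeatmap)</li>
<li><strong>column_hclust_side</strong>: The allowed values are “top” or “bottom” (named <strong>column_dend_side</strong> in recent versions of ComplexHeatmap)</li>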
</ul>
</div>
<p><br/></p>
<p>If you want to change the height of column clusters or the width of row clusters, you can use the options <strong>column_dend_height</strong> and <strong>row_dend_width</strong> as follows:</p>
<pre class="r"><code>Heatmap(df, name = "mtcars", col = mycol,
        column_dend_height = unit(2, "cm"),
        row_dend_width = unit(2, "cm") )</code></pre>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/heatmap-cluster-height-width-1.png" alt="Heatmap - R data visualization" width="518.4" />
<p class="caption">
Heatmap - R data visualization
</p>
</div>
<p>We can also customize the appearance of the <strong>dendrogram</strong> using the function <strong>color_branches()</strong> [in <strong>dendextend</strong> package]:</p>
<pre class="r"><code># install.packages("dendextend")
library(dendextend)
row_dend = hclust(dist(df)) # row clustering
col_dend = hclust(dist(t(df))) # column clustering
Heatmap(df, name = "mtcars", col = mycol,
        cluster_rows = color_branches(row_dend, k = 4),
        cluster_columns = color_branches(col_dend, k = 2))</code></pre>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/heatmap-dendogram-appearance-1.png" alt="Heatmap - R data visualization" width="518.4" />
<p class="caption">
Heatmap - R data visualization
</p>
</div>
</div>
<div id="metric-for-clustering" class="section level4">
<h4><span class="header-section-number">6.3.4.2</span> Metric for clustering</h4>
<p>The arguments <strong>clustering_distance_rows</strong> and <strong>clustering_distance_columns</strong> are used to specify the metric for row and column clustering, respectively. Default values are <strong>“euclidean”</strong>.</p>
<p>Allowed values are:</p>
<ol style="list-style-type: decimal">
<li>A <strong>pre-defined character string</strong>, one of “euclidean”, “maximum”, “manhattan”, “canberra”, “binary”, “minkowski”, “pearson”, “spearman” or “kendall”:</li>
</ol>
<pre class="r"><code>Heatmap(df, name = "mtcars", clustering_distance_rows = "pearson",
        clustering_distance_columns = "pearson")</code></pre>
<ol start="2" style="list-style-type: decimal">
<li>A <strong>Pre-defined function</strong>, such as <strong>dist()</strong>, to calculate distance from matrix (m):</li>
</ol>
<pre class="r"><code>Heatmap(df, name = "mtcars", 
        clustering_distance_rows = function(m) dist(m))</code></pre>
<ol start="3" style="list-style-type: decimal">
<li>A <strong>Self-defined function</strong> which calculates distance from two vectors:</li>
</ol>
<pre class="r"><code>Heatmap(df, name = "mtcars", 
        clustering_distance_rows = function(x, y) 1 - cor(x, y))</code></pre>
<p><span class="warning">Note that the R code above only shows the argument <strong>clustering_distance_rows</strong>, which specifies the metric for row clustering. I recommend using the same metric for the argument <strong>clustering_distance_columns</strong> (metric for column clustering). </span></p>
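<p>The relation between the matrix form and the two-vector form can be checked on a small example. The sketch below uses only base R and a hypothetical toy matrix <strong>m</strong>; the helper names <strong>pearson_dist</strong> and <strong>pair_dist</strong> are illustrative and are not part of ComplexHeatmap. It assumes that a two-vector function is applied to pairs of rows, and verifies that both forms give the same distance for one pair.</p>
<pre class="r"><code># Hypothetical toy matrix: 4 observations (rows), 5 variables (columns)
set.seed(1)
m <- matrix(rnorm(20), nrow = 4,
            dimnames = list(paste0("r", 1:4), NULL))

# Matrix form: takes the whole matrix, returns a "dist" object
pearson_dist <- function(m) as.dist(1 - cor(t(m)))

# Two-vector form: takes two rows, returns a single distance
pair_dist <- function(x, y) 1 - cor(x, y)

d_matrix <- pearson_dist(m)
d_pair <- pair_dist(m["r1", ], m["r2", ])

# Both forms agree on the distance between rows r1 and r2
all.equal(as.matrix(d_matrix)["r1", "r2"], d_pair)</code></pre>
<p>The matrix form computes all distances in one call, which is usually faster; the two-vector form is convenient when the distance is easier to express for a single pair.</p>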
<p>As an illustration, the R code below applies a self-defined function for <strong>clustering</strong> which is <strong>robust to outliers</strong> based on the pair-wise distance:</p>
<pre class="r"><code># Clustering metric function
robust_dist = function(x, y) {
    qx = quantile(x, c(0.1, 0.9))
    qy = quantile(y, c(0.1, 0.9))
    l = x > qx[1] &amp; x < qx[2] &amp; y > qy[1] &amp; y < qy[2]
    x = x[l]
    y = y[l]
    sqrt(sum((x - y)^2))
}
# Heatmap
Heatmap(df, name = "mtcars", 
    clustering_distance_rows = robust_dist,
    clustering_distance_columns = robust_dist,
    col = colorRamp2(c(-2, 0, 2), c("purple", "white", "orange")))</code></pre>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/heatmap-clustering-metric-1.png" alt="Heatmap - R data visualization" width="518.4" />
<p class="caption">
Heatmap - R data visualization
</p>
</div>
</div>
<div id="clustering-methods" class="section level4">
<h4><span class="header-section-number">6.3.4.3</span> Clustering methods</h4>
<p>The arguments <strong>clustering_method_rows</strong> and <strong>clustering_method_columns</strong> can be used to specify the method for making <strong>hierarchical clustering</strong>. Allowed values are those supported by <strong>hclust()</strong> function including “ward.D”, “ward.D2”, “single”, “complete”, “average”, … (see <strong>?hclust</strong>).</p>
<p>As an example:</p>
<pre class="r"><code>Heatmap(df, name = "mtcars", clustering_method_rows = "ward.D",
        clustering_method_columns = "ward.D")</code></pre>
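<p>The effect of the linkage method can also be inspected outside of the heatmap, directly with <strong>hclust()</strong>. The following sketch assumes that <strong>df</strong> is the standardized <em>mtcars</em> data used in this chapter (it is recreated here so that the code is self-contained) and cross-tabulates the groups obtained with two linkage methods.</p>
<pre class="r"><code># Assumption: df is the standardized mtcars data, as elsewhere in this chapter
df <- scale(mtcars)
d <- dist(df)  # Euclidean distance between rows

# Same distances, two different linkage methods
hc_ward <- hclust(d, method = "ward.D")
hc_complete <- hclust(d, method = "complete")

# Cut both trees into 2 groups and compare the memberships
grp_ward <- cutree(hc_ward, k = 2)
grp_complete <- cutree(hc_complete, k = 2)
table(ward = grp_ward, complete = grp_complete)</code></pre>
<p>Off-diagonal counts in the table correspond to observations that change group when the linkage method changes.</p>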
</div>
</div>
<div id="split-heatmap-by-rows" class="section level3">
<h3><span class="header-section-number">6.3.5</span> Split heatmap by rows</h3>
<p>There are many ways to split the <strong>heatmap</strong>. One solution is to apply <strong>k-means</strong> using the argument <strong>km</strong>.</p>
<p><span class="notice">It’s important to use the <strong>set.seed()</strong> function when performing <strong>k-means</strong> so that the results obtained can be reproduced precisely at a later time.</span></p>
<pre class="r"><code>set.seed(2)
# split into 2 groups
Heatmap(df, name = "mtcars", col = mycol, km = 2)</code></pre>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/heatmap-k-means-1.png" alt="Heatmap - R data visualization" width="518.4" />
<p class="caption">
Heatmap - R data visualization
</p>
</div>
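<p>The role of <strong>set.seed()</strong> can be illustrated with the base R function <strong>kmeans()</strong>. This is a minimal sketch, assuming that <strong>df</strong> is the standardized <em>mtcars</em> data and that the k-means step of the heatmap starts from random centers in a comparable way.</p>
<pre class="r"><code># Assumption: df is the standardized mtcars data, as elsewhere in this chapter
df <- scale(mtcars)

# Same seed, same random starting centers, same partition
set.seed(2)
km1 <- kmeans(df, centers = 2)
set.seed(2)
km2 <- kmeans(df, centers = 2)
identical(km1$cluster, km2$cluster)</code></pre>
<p>Without resetting the seed, two runs may start from different centers, so the group labels (and sometimes the groups themselves) can differ.</p>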
<p>It’s also possible to use the <strong>split</strong> argument to specify row classes as a vector. In the following example we’ll use the levels of the factor variable <strong>cyl</strong> [in <em>mtcars</em>] to split the heatmap by rows. Recall that <strong>cyl</strong> corresponds to the number of cylinders.</p>
<pre class="r"><code># split by a vector specifying row classes
Heatmap(df, name = "mtcars", col = mycol, 
        split = mtcars$cyl )</code></pre>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/heatmap-split-heatmap-1.png" alt="Heatmap - R data visualization" width="518.4" />
<p class="caption">
Heatmap - R data visualization
</p>
</div>
<p><span class="warning">Note that <strong>split</strong> can also be a data frame in which different combinations of levels split the rows of the heatmap. </span></p>
<pre class="r"><code># Split by combining multiple variables
Heatmap(df, name ="mtcars", col = mycol,
        split = data.frame(cyl = mtcars$cyl, am = mtcars$am))</code></pre>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/heatmap-split-heatmap-multiple-variables-1.png" alt="Heatmap - R data visualization" width="518.4" />
<p class="caption">
Heatmap - R data visualization
</p>
</div>
<pre class="r"><code># Combine km and split
Heatmap(df, name ="mtcars", col = mycol,
        km = 2, split =  mtcars$cyl)</code></pre>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/heatmap-split-heatmap-multiple-variables-2.png" alt="Heatmap - R data visualization" width="518.4" />
<p class="caption">
Heatmap - R data visualization
</p>
</div>
<p>If you want to use another partitioning method, rather than k-means, you can do so by assigning the partitioning vector to <strong>split</strong>. In the R code below, we’ll use the <strong>pam()</strong> function [in <strong>cluster</strong> package]. <strong>pam()</strong> stands for partitioning (clustering) of the data into k clusters “around medoids”, a more robust version of k-means.</p>
<pre class="r"><code># install.packages("cluster")
library("cluster")
set.seed(2)
pa = pam(df, k = 3)
Heatmap(df, name = "mtcars", col = mycol,
        split = paste0("pam", pa$clustering))</code></pre>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/heatmap-split-heatmap-pam-1.png" alt="Heatmap - R data visualization" width="518.4" />
<p class="caption">
Heatmap - R data visualization
</p>
</div>
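<p>Before passing a partitioning vector to <strong>split</strong>, it can be useful to inspect it. The sketch below, which assumes that <strong>df</strong> is the standardized <em>mtcars</em> data, shows the size of each group and the observations chosen as medoids; the name <strong>split_labels</strong> is only illustrative.</p>
<pre class="r"><code>library(cluster)
# Assumption: df is the standardized mtcars data, as elsewhere in this chapter
df <- scale(mtcars)

pa <- pam(df, k = 3)
# One label per row of df, in the same order as the rows
split_labels <- paste0("pam", pa$clustering)

table(split_labels)   # number of rows in each group
rownames(pa$medoids)  # observations used as cluster medoids</code></pre>
<p>The vector must contain one value per row of the matrix, in the same order as the rows, which is why it is built directly from <strong>pa$clustering</strong>.</p>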
<p>It’s also possible to combine user defined dendrograms and split. In this case, split can be specified as a single number:</p>
<pre class="r"><code>library(dendextend)
row_dend = hclust(dist(df)) # row clustering
row_dend = color_branches(row_dend, k = 4)
Heatmap(df, name = "mtcars", col = mycol,
        cluster_rows = row_dend, split = 2)</code></pre>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/heatmap-combine-split-dendextend-1.png" alt="Heatmap - R data visualization" width="518.4" />
<p class="caption">
Heatmap - R data visualization
</p>
</div>
</div>
</div>
<div id="heatmap-annotation" class="section level2">
<h2><span class="header-section-number">6.4</span> Heatmap annotation</h2>
<p>The <strong>HeatmapAnnotation</strong> class is used to define annotation on row or column. A simplified format is:</p>
<pre class="r"><code>HeatmapAnnotation(df, name, col, show_legend)</code></pre>
<ul>
<li><strong>df</strong>: a data.frame with column names</li>
<li><strong>name</strong>: the name of the heatmap annotation</li>
<li><strong>col</strong>: a list of colors which contains color mapping to columns in df</li>
</ul>
<p>For the example below, we’ll transpose our data to have the observations in columns and the variables in rows.</p>
<div id="prepare-the-data" class="section level3">
<h3><span class="header-section-number">6.4.1</span> Prepare the data</h3>
<pre class="r"><code># Transpose
df <- t(df)
# Heatmap of the transposed data
Heatmap(df, name ="mtcars", col = mycol)</code></pre>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/heatmap-heatmap-transposed-data-1.png" alt="Heatmap - R data visualization" width="528" />
<p class="caption">
Heatmap - R data visualization
</p>
</div>
</div>
<div id="simple-annotation" class="section level3">
<h3><span class="header-section-number">6.4.2</span> Simple annotation</h3>
<p>In <strong>simple annotation</strong> a vector, containing discrete or continuous values, is used to annotate rows or columns.</p>
<p>We’ll use the qualitative variables <em>cyl</em> (levels = “4”, “6” and “8”) and <em>am</em> (levels = “0” and “1”), and the continuous variable <em>mpg</em> to annotate columns.</p>
<p>For each of these 3 variables, custom colors are defined as follows:</p>
<pre class="r"><code># Annotation data frame
annot_df <- data.frame(cyl = mtcars$cyl, am = mtcars$am,  mpg = mtcars$mpg)

# Define colors for each levels of qualitative variables
# Define gradient color for continuous variable (mpg)
col = list(cyl = c("4" = "green", "6" = "gray", "8" = "darkred"),
            am = c("0" = "yellow", "1" = "orange"),
            mpg = colorRamp2(c(17, 25), c("lightblue", "purple"))
            )

# Create the heatmap annotation
ha <- HeatmapAnnotation(df = annot_df, col = col)

# Combine the heatmap and the annotation
Heatmap(df, name = "mtcars", col = mycol,
        top_annotation = ha)</code></pre>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/heatmap-heatmap-annotation-1.png" alt="Heatmap - R data visualization" width="518.4" />
<p class="caption">
Heatmap - R data visualization
</p>
</div>
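<p>Each level of a qualitative variable needs an entry in the corresponding named color vector. The following base R sketch checks this for <em>cyl</em> and <em>am</em>; the object names (<strong>cyl_col</strong>, <strong>missing_cyl</strong>, …) are illustrative only.</p>
<pre class="r"><code># Named color vectors for the two qualitative variables
cyl_col <- c("4" = "green", "6" = "gray", "8" = "darkred")
am_col <- c("0" = "yellow", "1" = "orange")

# Levels present in the data but missing from the color mapping
missing_cyl <- setdiff(as.character(unique(mtcars$cyl)), names(cyl_col))
missing_am <- setdiff(as.character(unique(mtcars$am)), names(am_col))

length(missing_cyl) == 0  # TRUE when every cyl level has a color
length(missing_am) == 0   # TRUE when every am level has a color</code></pre>
<p>A check of this kind is a cheap way to catch a mistyped level name before drawing the annotation.</p>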
<br/>
<div class="block">
<p>It’s possible to hide the annotation legend using the argument <strong>show_legend = FALSE</strong> as follows:</p>
<pre class="r"><code>ha <- HeatmapAnnotation(df = annot_df, col = col, show_legend = FALSE)
Heatmap(df, name = "mtcars", col = mycol, top_annotation = ha)</code></pre>
</div>
<p><br/></p>
<p><strong>Annotation names</strong> can be added using the R code hereafter. The function <strong>qq()</strong> [in <strong>GetoptLong</strong> package], for simple variable interpolation in texts, is required.</p>
<pre class="r"><code>library("GetoptLong")
# Combine Heatmap and annotation
ha <- HeatmapAnnotation(df = annot_df, col = col, show_legend = FALSE)
Heatmap(df, name = "mtcars", col = mycol, top_annotation = ha)
# Add annotation names on the right
for(an in colnames(annot_df)) {
    seekViewport(qq("annotation_@{an}"))
    grid.text(an, unit(1, "npc") + unit(2, "mm"), 0.5,
              default.units = "npc", just = "left")
}</code></pre>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/heatmap-unnamed-chunk-21-1.png" alt="Heatmap - R data visualization" width="518.4" />
<p class="caption">
Heatmap - R data visualization
</p>
</div>
<p>To add annotation names on the left, use the code below:</p>
<pre class="r"><code># Annotation names on the left
for(an in colnames(annot_df)) {
    seekViewport(qq("annotation_@{an}"))
    grid.text(an, unit(0, "npc") - unit(2, "mm"), 0.5,
              default.units = "npc", just = "right")
}</code></pre>
</div>
<div id="complex-annotation" class="section level3">
<h3><span class="header-section-number">6.4.3</span> Complex annotation</h3>
<p>In this section we’ll see how to combine heatmap and some basic graphs to show the data distribution. For simple annotation graphics, the following functions can be used: anno_points(), anno_barplot(), anno_boxplot(), anno_density() and anno_histogram().</p>
<p>An example is shown below:</p>
<pre class="r"><code># Define some graphics to display the distribution of columns
.hist = anno_histogram(df, gp = gpar(fill = "lightblue"))
.density = anno_density(df, type = "line", gp = gpar(col = "blue"))
ha_mix_top = HeatmapAnnotation(hist = .hist, density = .density)

# Define some graphics to display the distribution of rows
.violin = anno_density(df, type = "violin", 
                       gp = gpar(fill = "lightblue"), which = "row")
.boxplot = anno_boxplot(df, which = "row")
ha_mix_right = HeatmapAnnotation(violin = .violin, bxplt = .boxplot,
                              which = "row", width = unit(4, "cm"))

# Combine annotation with heatmap
Heatmap(df, name = "mtcars", col = mycol,
        column_names_gp = gpar(fontsize = 8),
        top_annotation = ha_mix_top, 
        top_annotation_height = unit(4, "cm")) + ha_mix_right</code></pre>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/heatmap-complex-heatmap-annotation-1.png" alt="Heatmap - R data visualization" width="518.4" />
<p class="caption">
Heatmap - R data visualization
</p>
</div>
<p><span class="warning">Note that, it’s also possible to use the argument <strong>bottom_annotation</strong>.</span></p>
</div>
</div>
<div id="combine-multiple-heatmaps" class="section level2">
<h2><span class="header-section-number">6.5</span> Combine multiple heatmaps</h2>
<p>Multiple heatmaps can be arranged as follows:</p>
<pre class="r"><code># Heatmap 1
ht1 = Heatmap(df, name = "ht1", col = mycol, km = 2,
              column_names_gp = gpar(fontsize = 9))
# Heatmap 2
ht2 = Heatmap(df, name = "ht2", 
        col = colorRamp2(c(-2, 0, 2), c("green", "white", "red")),
        column_names_gp = gpar(fontsize = 9))
# Combine the two heatmaps
ht1 + ht2</code></pre>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/heatmap-combine-multiple-heatmaps-1.png" alt="Heatmap - R data visualization" width="518.4" />
<p class="caption">
Heatmap - R data visualization
</p>
</div>
<p><span class="notice">You can use the option <strong>width = unit(3, "cm")</strong> to control the size of the heatmaps.</span></p>
<p><span class="warning"> Note that when combining multiple heatmaps, the first heatmap is considered as the main heatmap. Some settings of the remaining heatmaps are auto-adjusted according to the setting of the main heatmap. These include: removing row clusters and titles, and adding splitting.</span></p>
<p>The <strong>draw()</strong> function can be used to customize the appearance of the final image:</p>
<pre class="r"><code>draw(ht1 + ht2, 
    # Titles
    row_title = "Two heatmaps, row title", 
    row_title_gp = gpar(col = "red"),
    column_title = "Two heatmaps, column title", 
    column_title_side = "bottom",
    # Gap between heatmaps
    gap = unit(0.5, "cm"))</code></pre>
<p><span class="notice">Legends can be removed using the arguments show_heatmap_legend = FALSE, show_annotation_legend = FALSE.</span></p>
</div>
<div id="real-application" class="section level2">
<h2><span class="header-section-number">6.6</span> Real application</h2>
</div>
<div id="gene-expression-matrix" class="section level2">
<h2><span class="header-section-number">6.7</span> Gene expression matrix</h2>
<p>In gene expression data, rows are genes and columns are samples. More information about genes can be attached after the expression heatmap such as gene length and type of genes.</p>
<pre class="r"><code>expr = readRDS(paste0(system.file(package = "ComplexHeatmap"),
                      "/extdata/gene_expression.rds"))
mat = as.matrix(expr[, grep("cell", colnames(expr))])

type = gsub("s\\d+_", "", colnames(mat))
ha = HeatmapAnnotation(df = data.frame(type = type))

Heatmap(mat, name = "expression", km = 5, top_annotation = ha, 
    top_annotation_height = unit(4, "mm"), 
    show_row_names = FALSE, show_column_names = FALSE) +
Heatmap(expr$length, name = "length", width = unit(5, "mm"),
        col = colorRamp2(c(0, 100000), c("white", "orange"))) +
Heatmap(expr$type, name = "type", width = unit(5, "mm")) +
Heatmap(expr$chr, name = "chr", width = unit(5, "mm"),
        col = rand_color(length(unique(expr$chr))))</code></pre>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/heatmap-gene-expression-data-1.png" alt="Heatmap - R data visualization" width="518.4" />
<p class="caption">
Heatmap - R data visualization
</p>
</div>
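<p>The call to <strong>gsub()</strong> in the code above derives the sample type from the column names. Its behavior can be sketched on a few hypothetical column names (the real names in <em>gene_expression.rds</em> are not reproduced here):</p>
<pre class="r"><code># Hypothetical column names following a "sample_celltype" pattern
sample_names <- c("s1_cell01", "s2_cell01", "s10_cell02")

# "s\\d+_" matches the letter s, one or more digits, then an underscore
type <- gsub("s\\d+_", "", sample_names)
type</code></pre>
<p>Only the sample prefix is removed, so samples of the same type end up with the same label and can share one annotation color.</p>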
<p><span class="notice">It’s also possible to visualize genomic alterations and to integrate different molecular levels (gene expression, DNA methylation, …). Read the <a href="http://bioconductor.org/packages/devel/bioc/vignettes/ComplexHeatmap/inst/doc/ComplexHeatmap.html">vignette</a> for further examples.</span></p>
</div>
<div id="visualize-distribution-of-column-in-matrix" class="section level2">
<h2><span class="header-section-number">6.8</span> Visualize distribution of column in matrix</h2>
<p>The function <strong>densityHeatmap()</strong> is used.</p>
<pre class="r"><code>densityHeatmap(scale(mtcars))</code></pre>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/heatmap-column-distribution-matrix-1.png" alt="Heatmap - R data visualization" width="518.4" />
<p class="caption">
Heatmap - R data visualization
</p>
</div>
<p>The dashed lines on the heatmap correspond to the five quantile levels. The text labels for the five quantile levels are added on the right of the heatmap.</p>
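<p>The quantiles of each column can also be computed numerically. The sketch below assumes that the five levels are the default ones returned by <strong>quantile()</strong> (minimum, 25%, median, 75% and maximum); this is an assumption about the plot and is not taken from the package documentation.</p>
<pre class="r"><code># Five quantiles (0%, 25%, 50%, 75%, 100%) for each standardized column
q <- apply(scale(mtcars), 2, quantile)

# Show the first three variables, rounded for readability
round(q[, 1:3], 2)</code></pre>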
</div>
</div>
<div id="infos" class="section level1">
<h1><span class="header-section-number">7</span> Infos</h1>
<p><span class="warning">This analysis has been performed using <strong>R software</strong> (ver. 3.2.3) and <strong>ComplexHeatmap</strong> (ver. )</span></p>
</div>

<script>jQuery(document).ready(function () {
    jQuery('h1').addClass('wiki_paragraph1');
    jQuery('h2').addClass('wiki_paragraph2');
    jQuery('h3').addClass('wiki_paragraph3');
    jQuery('h4').addClass('wiki_paragraph4');
    });//add phpboost class to header</script>
<style>.content{padding:0px;}</style>
</div><!--end rdoc-->
<!--====================== stop here when you copy to sthda================-->


<!-- END HTML -->]]></description>
			<pubDate>Wed, 06 Apr 2016 22:55:51 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[The Guide for Clustering Analysis on a Real Data: 4 steps you should know - Unsupervised Machine Learning]]></title>
			<link>https://www.sthda.com/english/wiki/the-guide-for-clustering-analysis-on-a-real-data-4-steps-you-should-know-unsupervised-machine-learning</link>
			<guid>https://www.sthda.com/english/wiki/the-guide-for-clustering-analysis-on-a-real-data-4-steps-you-should-know-unsupervised-machine-learning</guid>
			<description><![CDATA[<!-- START HTML -->

  <!--====================== start from here when you copy to sthda================-->  
  <div id="rdoc">

<div id="TOC">
<ul>
<li><a href="#required-packages"><span class="toc-section-number">1</span> Required packages</a></li>
<li><a href="#data-preparation"><span class="toc-section-number">2</span> Data preparation</a></li>
<li><a href="#assessing-the-clusterability"><span class="toc-section-number">3</span> Assessing the clusterability</a></li>
<li><a href="#estimate-the-number-of-clusters-in-the-data"><span class="toc-section-number">4</span> Estimate the number of clusters in the data</a></li>
<li><a href="#compute-k-means-clustering"><span class="toc-section-number">5</span> Compute k-means clustering</a></li>
<li><a href="#cluster-validation-statistics-inspect-cluster-silhouette-plot"><span class="toc-section-number">6</span> Cluster validation statistics: Inspect cluster silhouette plot</a></li>
<li><a href="#eclust-enhanced-clustering-analysis"><span class="toc-section-number">7</span> eclust(): Enhanced clustering analysis</a><ul>
<li><a href="#k-means-clustering-using-eclust"><span class="toc-section-number">7.1</span> K-means clustering using eclust()</a></li>
<li><a href="#hierachical-clustering-using-eclust"><span class="toc-section-number">7.2</span> Hierachical clustering using eclust()</a></li>
</ul></li>
<li><a href="#infos"><span class="toc-section-number">8</span> Infos</a></li>
</ul>
</div>

<p><br/> The large amounts of data collected every day from different fields (bio-medical, security, marketing, web search, geo-spatial or other automatic equipment) exceed human abilities to analyze them. Consequently, <strong>unsupervised machine learning techniques</strong>, such as <strong>clustering</strong>, are used for discovering knowledge from <strong>big data</strong>.</p>
<p><strong>Clustering</strong> approaches classify samples into groups (i.e., clusters) containing objects with similar profiles. In our previous post, we <a href="https://www.sthda.com/english/english/wiki/clarifying-distance-measures-unsupervised-machine-learning">clarified distance measures</a> for assessing <strong>similarity</strong> between observations.</p>
<p>In this chapter we’ll describe the different steps to follow for computing <strong>clustering</strong> on real data using <strong>k-means clustering</strong>:</p>
<br/>
<div class="block">
<ol style="list-style-type: decimal">
<li>Data preparation</li>
<li><a href="https://www.sthda.com/english/english/wiki/assessing-clustering-tendency-a-vital-issue-unsupervised-machine-learning">Assessing clustering tendency (i.e., the clusterability of the data)</a></li>
<li><a href="https://www.sthda.com/english/english/wiki/determining-the-optimal-number-of-clusters-3-must-known-methods-unsupervised-machine-learning">Defining the optimal number of clusters</a></li>
<li><a href="https://www.sthda.com/english/english/wiki/determining-the-optimal-number-of-clusters-3-must-known-methods-unsupervised-machine-learning">Computing partitioning cluster analyses (e.g.: k-means, pam)</a> or <a href="https://www.sthda.com/english/english/wiki/hierarchical-clustering-essentials-unsupervised-machine-learning">hierarchical clustering analyses</a></li>
<li><a href="https://www.sthda.com/english/english/wiki/clustering-validation-statistics-4-vital-things-everyone-should-know-unsupervised-machine-learning">Validating clustering analyses: silhouette plot</a></li>
</ol>
</div>
<p><br/></p>
<div id="required-packages" class="section level1">
<h1><span class="header-section-number">1</span> Required packages</h1>
<p>The following packages will be used:</p>
<ul>
<li><strong>cluster</strong> for clustering analyses</li>
<li><strong>factoextra</strong> for visualizing clusters using <strong>ggplot2</strong> plotting system</li>
</ul>
<p>Install the <strong>factoextra</strong> package as follows:</p>
<pre class="r"><code>if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/factoextra")</code></pre>
<p>The <strong>cluster</strong> package can be installed using the code below:</p>
<pre class="r"><code>install.packages("cluster")</code></pre>
<p>Load packages:</p>
<pre class="r"><code>library(cluster)
library(factoextra)</code></pre>
</div>
<div id="data-preparation" class="section level1">
<h1><span class="header-section-number">2</span> Data preparation</h1>
<p>We’ll use the built-in R data set <strong>USArrests</strong>, which can be loaded and prepared as follows:</p>
<pre class="r"><code># Load the data set
data(USArrests)

# Remove any missing values (i.e., NA values, for "not available")
# that might be present in the data
USArrests <- na.omit(USArrests)

# View the first 6 rows of the data
head(USArrests, n = 6)</code></pre>
<pre><code>##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7</code></pre>
<p><span class="notice"> In this data set, columns are variables and rows are observations (i.e., samples).</span></p>
<p>To inspect the data before the <strong>K-means clustering</strong> we’ll compute some descriptive statistics such as the mean and the standard deviation of the variables.</p>
<p>The <strong>apply()</strong> function is used to apply a given function (e.g., min(), max(), mean(), …) on the data set. The second argument can take the value of:</p>
<ul>
<li><strong>1</strong>: for applying the function on the rows</li>
<li><strong>2</strong>: for applying the function on the columns</li>
</ul>
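<p>The difference between the two values can be seen on a toy matrix (a hypothetical example, unrelated to <strong>USArrests</strong>):</p>
<pre class="r"><code># Toy matrix with 2 rows and 3 columns:
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
m <- matrix(1:6, nrow = 2)

row_sums <- apply(m, 1, sum)  # one value per row: 9 12
col_sums <- apply(m, 2, sum)  # one value per column: 3 7 11</code></pre>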
<pre class="r"><code>desc_stats <- data.frame(
  Min = apply(USArrests, 2, min), # minimum
  Med = apply(USArrests, 2, median), # median
  Mean = apply(USArrests, 2, mean), # mean
  SD = apply(USArrests, 2, sd), # Standard deviation
  Max = apply(USArrests, 2, max) # Maximum
  )
desc_stats <- round(desc_stats, 1)
head(desc_stats)</code></pre>
<pre><code>##           Min   Med  Mean   SD   Max
## Murder    0.8   7.2   7.8  4.4  17.4
## Assault  45.0 159.0 170.8 83.3 337.0
## UrbanPop 32.0  66.0  65.5 14.5  91.0
## Rape      7.3  20.1  21.2  9.4  46.0</code></pre>
<p><span class="warning">Note that the variables have very different means and variances. They must be standardized to make them comparable.</span></p>
<p><strong>Standardization</strong> consists of transforming the variables such that they have mean zero and standard deviation one. The <strong>scale()</strong> function can be used as follows:</p>
<pre class="r"><code>df <- scale(USArrests)</code></pre>
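<p>As a quick sanity check, the column means and standard deviations of the standardized data can be inspected. This is a minimal sketch using base R only:</p>
<pre class="r"><code>df <- scale(USArrests)

# After standardization, each column should have mean 0 (up to rounding)
round(colMeans(df), 10)

# ... and standard deviation 1
apply(df, 2, sd)</code></pre>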
</div>
<div id="assessing-the-clusterability" class="section level1">
<h1><span class="header-section-number">3</span> Assessing the clusterability</h1>
<p>The function <strong>get_clust_tendency()</strong> [in <strong>factoextra</strong>] can be used. It computes <a href="https://www.sthda.com/english/english/wiki/assessing-clustering-tendency-a-vital-issue-unsupervised-machine-learning#hopkins-statistic"><strong>Hopkins statistic</strong></a> and provides a visual approach.</p>
<pre class="r"><code>library("factoextra")
res <- get_clust_tendency(df, 40, graph = FALSE)
# Hopkins statistic
res$hopkins_stat</code></pre>
<pre><code>## [1] 0.3440875</code></pre>
<pre class="r"><code># Visualize the dissimilarity matrix (NULL here because graph = FALSE)
res$plot</code></pre>
<pre><code>## NULL</code></pre>
<p><span class="success">The value of <a href="https://www.sthda.com/english/english/wiki/assessing-clustering-tendency-a-vital-issue-unsupervised-machine-learning#hopkins-statistic"><strong>Hopkins statistic</strong></a> is significantly < 0.5, indicating that the data is highly clusterable. Additionally, the ordered dissimilarity image (drawn when <strong>graph = TRUE</strong>) contains patterns (i.e., clusters).</span></p>
</div>
<div id="estimate-the-number-of-clusters-in-the-data" class="section level1">
<h1><span class="header-section-number">4</span> Estimate the number of clusters in the data</h1>
<p>As <strong>k-means clustering</strong> requires specifying the number of clusters to generate, we’ll use the function <strong>clusGap()</strong> [in <strong>cluster</strong>] to compute <a href="https://www.sthda.com/english/english/wiki/determining-the-optimal-number-of-clusters-3-must-known-methods-unsupervised-machine-learning#gap-statistic-method"><strong>gap statistics</strong></a> for estimating the optimal number of clusters. The function <strong>fviz_gap_stat()</strong> [in <strong>factoextra</strong>] is used to visualize the gap statistic plot.</p>
<pre class="r"><code>library("cluster")
set.seed(123)
# Compute the gap statistic
gap_stat <- clusGap(df, FUN = kmeans, nstart = 25, 
                    K.max = 10, B = 500) 
# Plot the result
library(factoextra)
fviz_gap_stat(gap_stat)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/guide-for-clustering-number-of-clusters-gap-statistic-1.png" title="Step by step guide for partitioning clustering - Unsupervised Machine Learning" alt="Step by step guide for partitioning clustering - Unsupervised Machine Learning" width="518.4" /></p>
<p><span class="success">The <a href="https://www.sthda.com/english/english/wiki/determining-the-optimal-number-of-clusters-3-must-known-methods-unsupervised-machine-learning#gap-statistic-method"><strong>gap statistic</strong></a> suggests a <strong>4-cluster solution</strong>.</span></p>
<p><span class="notice">It’s also possible to use the function <a href="https://www.sthda.com/english/english/wiki/determining-the-optimal-number-of-clusters-3-must-known-methods-unsupervised-machine-learning#nbclust-a-package-providing-30-indices-for-determining-the-best-number-of-clusters"><strong>NbClust()</strong></a> [in <strong>NbClust</strong> package].</span></p>
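<p>The number of clusters can also be read from the gap statistic table without a plot, using the function <strong>maxSE()</strong> [in <strong>cluster</strong>]. The sketch below is only an illustration: the number of bootstrap samples is reduced to <strong>B = 50</strong> to keep it fast, the name <strong>k_best</strong> is illustrative, and the “firstSEmax” rule is one possible selection rule among those offered by <strong>maxSE()</strong>. The value obtained may therefore differ from the one suggested by the plot above.</p>
<pre class="r"><code>library(cluster)
df <- scale(USArrests)

set.seed(123)
# B is reduced to 50 here only to keep the sketch fast
gap_stat <- clusGap(df, FUN = kmeans, nstart = 25, K.max = 10, B = 50)

# Gap values and their simulation standard errors, one row per k
tab <- gap_stat$Tab

# Smallest k whose gap is within one SE of the first local maximum
k_best <- maxSE(tab[, "gap"], tab[, "SE.sim"], method = "firstSEmax")
k_best</code></pre>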
</div>
<div id="compute-k-means-clustering" class="section level1">
<h1><span class="header-section-number">5</span> Compute k-means clustering</h1>
<p><a href="https://www.sthda.com/english/english/wiki/partitioning-cluster-analysis-quick-start-guide-unsupervised-machine-learning#k-means-clustering"><strong>K-means clustering</strong></a> with k = 4:</p>
<pre class="r"><code># Compute k-means
set.seed(123)
km.res <- kmeans(df, 4, nstart = 25)
head(km.res$cluster, 20)</code></pre>
<pre><code>##     Alabama      Alaska     Arizona    Arkansas  California    Colorado 
##           4           3           3           4           3           3 
## Connecticut    Delaware     Florida     Georgia      Hawaii       Idaho 
##           2           2           3           4           2           1 
##    Illinois     Indiana        Iowa      Kansas    Kentucky   Louisiana 
##           3           2           1           2           1           4 
##       Maine    Maryland 
##           1           3</code></pre>
<pre class="r"><code># Visualize clusters using factoextra
fviz_cluster(km.res, USArrests)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/guide-for-clustering-k-means-factoextra-1.png" title="Step by step guide for partitioning clustering - Unsupervised Machine Learning" alt="Step by step guide for partitioning clustering - Unsupervised Machine Learning" width="518.4" /></p>
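<p>To interpret the clusters, it can help to look at their sizes and at the average of each variable per cluster. The following is a minimal sketch using base R only; it assumes, as above, that <strong>df</strong> is the standardized USArrests data.</p>
<pre class="r"><code># Assumption: df is the standardized USArrests data
df <- scale(USArrests)
set.seed(123)
km.res <- kmeans(df, 4, nstart = 25)
# Number of observations in each cluster
km.res$size
# Cluster centers (on the standardized scale)
km.res$centers
# Mean of the original variables in each cluster
aggregate(USArrests, by = list(cluster = km.res$cluster), mean)</code></pre>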
</div>
<div id="cluster-validation-statistics-inspect-cluster-silhouette-plot" class="section level1">
<h1><span class="header-section-number">6</span> Cluster validation statistics: Inspect cluster silhouette plot</h1>
<p>Recall that the <a href="https://www.sthda.com/english/english/wiki/clustering-validation-statistics-4-vital-things-everyone-should-know-unsupervised-machine-learning#silhouette-analysis">silhouette width</a> (<span class="math">\(S_i\)</span>) measures how similar an object <span class="math">\(i\)</span> is to the other objects in its own cluster versus those in the neighbor cluster. <span class="math">\(S_i\)</span> values range from -1 to 1:</p>
<ul>
<li>A value of <span class="math">\(S_i\)</span> close to 1 indicates that the object is well clustered. In other words, the object <span class="math">\(i\)</span> is similar to the other objects in its group.</li>
<li>A value of <span class="math">\(S_i\)</span> close to -1 indicates that the object is poorly clustered, and that assignment to some other cluster would probably improve the overall results.</li>
</ul>
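<p>The usual definition of the silhouette width is <span class="math">\(S_i = (b_i - a_i) / \max(a_i, b_i)\)</span>, where <span class="math">\(a_i\)</span> is the mean distance from object <span class="math">\(i\)</span> to the other members of its own cluster and <span class="math">\(b_i\)</span> is the smallest mean distance from <span class="math">\(i\)</span> to the members of any other cluster (the neighbor cluster). The sketch below computes this by hand for one observation and compares it with the value returned by <strong>silhouette()</strong>. It assumes that <strong>df</strong> is the standardized USArrests data; the helper names (a, b, s_manual) are illustrative.</p>
<pre class="r"><code>library("cluster")
# Assumption: df is the standardized USArrests data
df <- scale(USArrests)
set.seed(123)
cl <- kmeans(df, 4, nstart = 25)$cluster
d <- as.matrix(dist(df))

i <- 1            # first observation (Alabama)
own <- cl[i]
# a: mean distance to the other members of the same cluster
a <- mean(d[i, setdiff(which(cl == own), i)])
# b: smallest mean distance to the members of another cluster
b <- min(sapply(setdiff(unique(cl), own),
                function(k) mean(d[i, cl == k])))
s_manual <- (b - a) / max(a, b)

# Value computed by the cluster package
s_pkg <- silhouette(cl, dist(df))[i, "sil_width"]
all.equal(s_manual, unname(s_pkg))</code></pre>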
<pre class="r"><code>sil <- silhouette(km.res$cluster, dist(df))
rownames(sil) <- rownames(USArrests)
head(sil[, 1:3])</code></pre>
<pre><code>##            cluster neighbor  sil_width
## Alabama          4        3 0.48577530
## Alaska           3        4 0.05825209
## Arizona          3        2 0.41548326
## Arkansas         4        2 0.11870947
## California       3        2 0.43555885
## Colorado         3        2 0.32654235</code></pre>
<pre class="r"><code>fviz_silhouette(sil)</code></pre>
<pre><code>##   cluster size ave.sil.width
## 1       1   13          0.37
## 2       2   16          0.34
## 3       3   13          0.27
## 4       4    8          0.39</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/guide-for-clustering-silhouette-plot-1.png" title="Step by step guide for partitioning clustering - Unsupervised Machine Learning" alt="Step by step guide for partitioning clustering - Unsupervised Machine Learning" width="518.4" /></p>
<p>Some samples have negative silhouette values. Two natural questions are:</p>
<p><span class="question">Which samples are these? To what cluster are they closer?</span></p>
<p>This can be determined from the output of the function <strong>silhouette()</strong> as follows:</p>
<pre class="r"><code>neg_sil_index <- which(sil[, "sil_width"] < 0)
sil[neg_sil_index, , drop = FALSE]</code></pre>
<pre><code>##          cluster neighbor   sil_width
## Missouri       3        2 -0.07318144</code></pre>
</div>
<div id="eclust-enhanced-clustering-analysis" class="section level1">
<h1><span class="header-section-number">7</span> eclust(): Enhanced clustering analysis</h1>
<p>The function <a href="https://www.sthda.com/english/english/wiki/visual-enhancement-of-clustering-analysis-unsupervised-machine-learning#enhanced-clustering-analysis"><strong>eclust()</strong></a> [in <strong>factoextra</strong>] provides several advantages compared to the standard packages used for clustering analysis:</p>
<ul>
<li>It simplifies the workflow of clustering analysis</li>
<li>It can be used to compute <strong>hierarchical clustering</strong> and partitioning clustering in a single-line function call</li>
<li>It automatically computes the <strong>gap statistic</strong> for estimating the right number of clusters</li>
<li>It automatically provides <strong>silhouette information</strong></li>
<li>It draws <strong>beautiful graphs</strong> using ggplot2</li>
</ul>
<div id="k-means-clustering-using-eclust" class="section level2">
<h2><span class="header-section-number">7.1</span> K-means clustering using eclust()</h2>
<pre class="r"><code># Compute k-means
res.km <- eclust(df, "kmeans")</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/guide-for-clustering-eclust-k-means-1.png" title="Step by step guide for partitioning clustering - Unsupervised Machine Learning" alt="Step by step guide for partitioning clustering - Unsupervised Machine Learning" width="518.4" /></p>
<pre class="r"><code># Gap statistic plot
fviz_gap_stat(res.km$gap_stat)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/guide-for-clustering-eclust-k-means-2.png" title="Step by step guide for partitioning clustering - Unsupervised Machine Learning" alt="Step by step guide for partitioning clustering - Unsupervised Machine Learning" width="518.4" /></p>
<pre class="r"><code># Silhouette plot
fviz_silhouette(res.km)</code></pre>
<pre><code>##   cluster size ave.sil.width
## 1       1   13          0.27
## 2       2   13          0.37
## 3       3    8          0.39
## 4       4   16          0.34</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/guide-for-clustering-eclust-k-means-3.png" title="Step by step guide for partitioning clustering - Unsupervised Machine Learning" alt="Step by step guide for partitioning clustering - Unsupervised Machine Learning" width="518.4" /></p>
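<p>The object returned by eclust() can also be inspected directly. The sketch below is a hedged example: it assumes that <strong>df</strong> is the standardized USArrests data, fixes k = 4 so that the gap statistic is not recomputed, and sets graph = FALSE to suppress the plot. The components nbclust and silinfo are those documented for eclust() in factoextra; the object name res.km4 is illustrative.</p>
<pre class="r"><code>library("factoextra")
# Assumption: df is the standardized USArrests data
df <- scale(USArrests)
# k-means with a fixed number of clusters and no plot
res.km4 <- eclust(df, "kmeans", k = 4, nstart = 25, graph = FALSE)
# Number of clusters
res.km4$nbclust
# Average silhouette width over all observations
res.km4$silinfo$avg.width
# Silhouette width of each observation
head(res.km4$silinfo$widths)</code></pre>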
</div>
<div id="hierachical-clustering-using-eclust" class="section level2">
<h2><span class="header-section-number">7.2</span> Hierarchical clustering using eclust()</h2>
<pre class="r"><code># Enhanced hierarchical clustering
res.hc <- eclust(df, "hclust") # compute hclust
fviz_dend(res.hc, rect = TRUE) # dendrogram</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/guide-for-clustering-eclust-hierarchical-clustering-1.png" title="Step by step guide for partitioning clustering - Unsupervised Machine Learning" alt="Step by step guide for partitioning clustering - Unsupervised Machine Learning" width="518.4" /></p>
<p>The R code below generates the silhouette plot and the scatter plot for hierarchical clustering.</p>
<pre class="r"><code>fviz_silhouette(res.hc) # silhouette plot
fviz_cluster(res.hc) # scatter plot</code></pre>
</div>
</div>
<div id="infos" class="section level1">
<h1><span class="header-section-number">8</span> Infos</h1>
<p><span class="warning">This analysis has been performed using <strong>R software</strong> (ver. 3.2.3)</span></p>
</div>

<script>jQuery(document).ready(function () {
    jQuery('h1').addClass('wiki_paragraph1');
    jQuery('h2').addClass('wiki_paragraph2');
    jQuery('h3').addClass('wiki_paragraph3');
    jQuery('h4').addClass('wiki_paragraph4');
    });//add phpboost class to header</script>
<style>.content{padding:0px;}</style>

<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
  (function () {
    var script = document.createElement("script");
    script.type = "text/javascript";
    script.src  = "https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
    document.getElementsByTagName("head")[0].appendChild(script);
  })();
</script>
</div><!--end rdoc-->
<!--====================== stop here when you copy to sthda================-->


<!-- END HTML -->]]></description>
			<pubDate>Mon, 21 Dec 2015 11:45:58 +0100</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Model-Based Clustering - Unsupervised Machine Learning]]></title>
			<link>https://www.sthda.com/english/wiki/model-based-clustering-unsupervised-machine-learning</link>
			<guid>https://www.sthda.com/english/wiki/model-based-clustering-unsupervised-machine-learning</guid>
			<description><![CDATA[<!-- START HTML -->

  <!--====================== start from here when you copy to sthda================-->  
  <div id="rdoc">


<div id="TOC">
<ul>
<li><a href="#concept"><span class="toc-section-number">1</span> Concept</a></li>
<li><a href="#model-parameters"><span class="toc-section-number">2</span> Model parameters</a></li>
<li><a href="#advantage-of-model-based-clustering"><span class="toc-section-number">3</span> Advantage of model-based clustering</a></li>
<li><a href="#example-of-data"><span class="toc-section-number">4</span> Example of data</a></li>
<li><a href="#mclust-r-function-for-computing-model-based-clustering"><span class="toc-section-number">5</span> Mclust(): R function for computing model-based clustering</a></li>
<li><a href="#example-of-cluster-analysis-using-mclust"><span class="toc-section-number">6</span> Example of cluster analysis using Mclust()</a></li>
<li><a href="#infos"><span class="toc-section-number">7</span> Infos</a></li>
</ul>
</div>

<p><br/></p>
<div id="concept" class="section level1">
<h1><span class="header-section-number">1</span> Concept</h1>
<p>The traditional clustering methods such as <a href="https://www.sthda.com/english/english/wiki/hierarchical-clustering-essentials-unsupervised-machine-learning"><strong>hierarchical clustering</strong></a> and <a href="https://www.sthda.com/english/english/wiki/partitioning-cluster-analysis-quick-start-guide-unsupervised-machine-learning"><strong>partitioning algorithms</strong></a> (k-means and others) are heuristic and are not based on formal models.</p>
<p>An alternative is <strong>model-based clustering</strong>, in which the data are considered as coming from a distribution that is a mixture of two or more components (i.e. <strong>clusters</strong>) (Chris Fraley and Adrian E. Raftery, 2002 and 2012).</p>
<p>Each component k (i.e. group or cluster) is modeled by the normal or Gaussian distribution which is characterized by the parameters:</p>
<ul>
<li><span class="math">\(\mu_k\)</span>: mean vector,</li>
<li><span class="math">\(\Sigma_k\)</span>: covariance matrix,</li>
<li>An associated probability in the mixture. Each point has a probability of belonging to each cluster.</li>
</ul>
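<p>In other words, the density of the data is modeled as a weighted sum of Gaussian densities, <span class="math">\(f(x) = \sum_k \pi_k \, \phi(x \mid \mu_k, \Sigma_k)\)</span>, where <span class="math">\(\pi_k\)</span> is the mixing probability of component k. The probability that a point belongs to component k is its weighted component density divided by the mixture density. The sketch below illustrates this with base R for a univariate mixture of two components; all parameter values and object names are hypothetical and chosen only for illustration.</p>
<pre class="r"><code># Hypothetical two-component univariate mixture (illustrative values)
pro   <- c(0.4, 0.6)    # mixing probabilities, sum to 1
mu    <- c(2.0, 4.5)    # component means
sigma <- c(0.3, 0.5)    # component standard deviations

x <- c(1.8, 3.0, 4.6)   # points to evaluate

# One column per component: pro_k * N(x | mu_k, sigma_k)
dens <- cbind(pro[1] * dnorm(x, mu[1], sigma[1]),
              pro[2] * dnorm(x, mu[2], sigma[2]))
# Mixture density at each point
mixture <- rowSums(dens)
# Probability of belonging to each component (each row sums to 1)
posterior <- dens / mixture
round(posterior, 4)</code></pre>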
</div>
<div id="model-parameters" class="section level1">
<h1><span class="header-section-number">2</span> Model parameters</h1>
<p>The model parameters can be estimated using the <strong>EM</strong> (<strong>Expectation-Maximization</strong>) algorithm, initialized by hierarchical model-based clustering. Each cluster k is centered at its mean <span class="math">\(\mu_k\)</span>, with increased density for points near the mean.</p>
<p>Geometric features (<strong>shape</strong>, <strong>volume</strong>, <strong>orientation</strong>) of each cluster are determined by the covariance matrix <span class="math">\(\Sigma_k\)</span>.</p>
<p>Different possible parameterizations of <span class="math">\(\Sigma_k\)</span> are available in the R package <strong>mclust</strong> (see <em>?mclustModelNames</em>).</p>
<p>The model options available in the <strong>mclust</strong> package are represented by identifiers including: EII, VII, EEI, VEI, EVI, VVI, EEE, EEV, VEV and VVV.</p>
<p>The first letter of the identifier refers to volume, the second to shape and the third to orientation. E stands for “equal”, V for “variable” and I for “coordinate axes”.</p>
<p>For example:</p>
<ul>
<li>EVI denotes a model in which the volumes of all clusters are equal (E), the shapes of the clusters may vary (V), and the orientation is the identity (I) or “coordinate axes”.
</li>
<li>EEE means that the clusters have the same volume, shape and orientation in p-dimensional space.</li>
<li>VEI means that the clusters have variable volume, the same shape and orientation equal to coordinate axes.</li>
</ul>
<p><span class="success"> The <strong>mclust</strong> package uses maximum likelihood to fit all these models, with different covariance matrix parameterizations, for a range of k components. The “best model” is selected using the Bayesian Information Criterion or <strong>BIC</strong>. With the sign convention used by mclust, a larger BIC score indicates stronger evidence for the corresponding model. </span></p>
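<p>The set of candidate models can be restricted through the <strong>modelNames</strong> argument of Mclust(). The sketch below is a minimal example that compares three parameterizations on the built-in <strong>faithful</strong> data for 1 to 4 components; the choice of models and of the range of G is arbitrary and only meant as an illustration.</p>
<pre class="r"><code>library("mclust")
# Fit only three covariance parameterizations, for 1 to 4 components
fit <- Mclust(faithful, G = 1:4,
              modelNames = c("EII", "EEE", "VVV"))
# Model and number of components with the highest BIC
fit$modelName
fit$G
# BIC values: one row per number of components, one column per model
fit$BIC</code></pre>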
</div>
<div id="advantage-of-model-based-clustering" class="section level1">
<h1><span class="header-section-number">3</span> Advantage of model-based clustering</h1>
<p>The key advantage of the model-based approach, compared to the standard clustering methods (k-means, hierarchical clustering, …), is that it suggests both the number of clusters and an appropriate model.</p>
</div>
<div id="example-of-data" class="section level1">
<h1><span class="header-section-number">4</span> Example of data</h1>
<p>We’ll use the bivariate <strong>faithful</strong> data set which contains the waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park (Wyoming, USA).</p>
<pre class="r"><code># Load the data
data("faithful")
head(faithful)</code></pre>
<pre><code>##   eruptions waiting
## 1     3.600      79
## 2     1.800      54
## 3     3.333      74
## 4     2.283      62
## 5     4.533      85
## 6     2.883      55</code></pre>
<p>The data can be visualized using the <strong>ggplot2</strong> package as follows:</p>
<pre class="r"><code>library("ggplot2")
ggplot(faithful, aes(x=eruptions, y=waiting)) +
  geom_point() +  # Scatter plot
  geom_density2d() # Add 2d density estimation</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/model-based-clustering-scatter-plot-faithful-1.png" title="Model-Based Clustering - Unsupervised Machine Learning" alt="Model-Based Clustering - Unsupervised Machine Learning" width="518.4" /></p>
</div>
<div id="mclust-r-function-for-computing-model-based-clustering" class="section level1">
<h1><span class="header-section-number">5</span> Mclust(): R function for computing model-based clustering</h1>
<p>The function <strong>Mclust()</strong> [in <strong>mclust</strong> package] can be used to compute <strong>model-based clustering</strong>.</p>
<p>Install and load the package as follows:</p>
<pre class="r"><code># Install
install.packages("mclust")

# Load
library("mclust")</code></pre>
<p>The function <strong>Mclust()</strong> provides the optimal mixture model estimation according to BIC. A simplified format is:</p>
<pre class="r"><code>Mclust(data, G = NULL)</code></pre>
<br/>
<div class="block">
<ul>
<li><strong>data</strong>: A numeric vector, matrix or data frame. Categorical variables are not allowed. If a matrix or data frame, rows correspond to observations and columns correspond to variables.</li>
<li><strong>G</strong>: An integer vector specifying the numbers of mixture components (clusters) for which the BIC is to be calculated. The default is G=1:9.</li>
</ul>
</div>
<p><br/></p>
<p>The function Mclust() returns an object of class ‘Mclust’ containing the following elements:</p>
<ul>
<li><strong>modelName</strong>: A character string denoting the model at which the optimal BIC occurs.</li>
<li><strong>G</strong>: The optimal number of mixture components (i.e: number of clusters)</li>
<li><strong>BIC</strong>: All BIC values</li>
<li><strong>bic</strong>: Optimal BIC value</li>
<li><strong>loglik</strong>: The loglikelihood corresponding to the optimal BIC</li>
<li><strong>df</strong>: The number of estimated parameters</li>
<li><strong>z</strong>: A matrix whose <span class="math">\([i,k]^{th}\)</span> entry is the probability that observation <span class="math">\(i\)</span> in the data belongs to the <span class="math">\(k^{th}\)</span> class. Columns correspond to clusters and rows to observations</li>
<li><strong>classification</strong>: The cluster number of each observation, i.e. map(z)</li>
<li><strong>uncertainty</strong>: The uncertainty associated with the classification</li>
</ul>
</div>
<div id="example-of-cluster-analysis-using-mclust" class="section level1">
<h1><span class="header-section-number">6</span> Example of cluster analysis using Mclust()</h1>
<pre class="r"><code>library(mclust)
# Model-based-clustering
mc <- Mclust(faithful)
# Print a summary
summary(mc)</code></pre>
<pre><code>## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm 
## ----------------------------------------------------
## 
## Mclust EEE (ellipsoidal, equal volume, shape and orientation) model with 3 components:
## 
##  log.likelihood   n df       BIC       ICL
##       -1126.361 272 11 -2314.386 -2360.865
## 
## Clustering table:
##   1   2   3 
## 130  97  45</code></pre>
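<p>The BIC reported in the summary can be checked by hand. With the sign convention used by mclust, BIC = 2 × loglik − df × log(n), so that larger values are better. The short sketch below plugs in the rounded values printed above (the variable names are illustrative); because the log-likelihood is rounded, the result agrees with the printed BIC only up to rounding.</p>
<pre class="r"><code># Values copied from the summary above
loglik <- -1126.361   # log-likelihood
n      <- 272         # number of observations
n_par  <- 11          # number of estimated parameters (df)

# BIC with the mclust sign convention (larger is better)
bic <- 2 * loglik - n_par * log(n)
bic</code></pre>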
<pre class="r"><code># Values returned by Mclust()
names(mc)</code></pre>
<pre><code>##  [1] "call"           "data"           "modelName"      "n"             
##  [5] "d"              "G"              "BIC"            "bic"           
##  [9] "loglik"         "df"             "hypvol"         "parameters"    
## [13] "z"              "classification" "uncertainty"</code></pre>
<pre class="r"><code># Optimal selected model
mc$modelName</code></pre>
<pre><code>## [1] "EEE"</code></pre>
<pre class="r"><code># Optimal number of clusters
mc$G</code></pre>
<pre><code>## [1] 3</code></pre>
<pre class="r"><code># Probability for an observation to be in a given cluster
head(mc$z)</code></pre>
<pre><code>##           [,1]         [,2]         [,3]
## 1 2.181744e-02 1.130837e-08 9.781825e-01
## 2 2.475031e-21 1.000000e+00 3.320864e-13
## 3 2.521625e-03 2.051823e-05 9.974579e-01
## 4 6.553336e-14 9.999998e-01 1.664978e-07
## 5 9.838967e-01 7.642900e-20 1.610327e-02
## 6 2.104355e-07 9.975388e-01 2.461029e-03</code></pre>
<pre class="r"><code># Cluster assignment of each observation
head(mc$classification, 10)</code></pre>
<pre><code>##  1  2  3  4  5  6  7  8  9 10 
##  3  2  3  2  1  2  1  3  2  1</code></pre>
<pre class="r"><code># Uncertainty associated with the classification
head(mc$uncertainty)</code></pre>
<pre><code>##            1            2            3            4            5 
## 2.181745e-02 3.321787e-13 2.542143e-03 1.664978e-07 1.610327e-02 
##            6 
## 2.461239e-03</code></pre>
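<p>The classification and the uncertainty are both derived from the matrix <strong>z</strong>: each observation is assigned to the cluster with the highest membership probability, and its uncertainty is one minus that probability. For the first observation above, 1 − 0.9781825 ≈ 0.0218. The sketch below checks this relationship on the fitted model; the names class_manual and unc_manual are illustrative.</p>
<pre class="r"><code>library("mclust")
mc <- Mclust(faithful)
# Cluster with the highest membership probability for each observation
class_manual <- apply(mc$z, 1, which.max)
# Uncertainty: one minus the highest membership probability
unc_manual <- 1 - apply(mc$z, 1, max)

all(class_manual == mc$classification)
all.equal(unname(unc_manual), unname(mc$uncertainty))</code></pre>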
<p>Model-based clustering results can be drawn using the function <strong>plot.Mclust()</strong>:</p>
<pre class="r"><code>plot(x, what = c("BIC", "classification", "uncertainty", "density"),
     xlab = NULL, ylab = NULL, addEllipses = TRUE, main = TRUE, ...)</code></pre>
<pre class="r"><code># BIC values used for choosing the number of clusters
plot(mc, "BIC")</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/model-based-clustering-visualization-1.png" title="Model-Based Clustering - Unsupervised Machine Learning" alt="Model-Based Clustering - Unsupervised Machine Learning" width="518.4" /></p>
<pre class="r"><code># Classification: plot showing the clustering
plot(mc, "classification")</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/model-based-clustering-visualization-2.png" title="Model-Based Clustering - Unsupervised Machine Learning" alt="Model-Based Clustering - Unsupervised Machine Learning" width="518.4" /></p>
<pre class="r"><code># Classification uncertainty
plot(mc, "uncertainty")</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/model-based-clustering-visualization-3.png" title="Model-Based Clustering - Unsupervised Machine Learning" alt="Model-Based Clustering - Unsupervised Machine Learning" width="518.4" /></p>
<pre class="r"><code># Estimated density. Contour plot
plot(mc, "density")</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/model-based-clustering-visualization-4.png" title="Model-Based Clustering - Unsupervised Machine Learning" alt="Model-Based Clustering - Unsupervised Machine Learning" width="518.4" /></p>
<p>Clusters generated by <strong>Mclust()</strong> can be drawn using the function <strong>fviz_cluster()</strong> [in <strong>factoextra</strong> package]. Read more about <a href="https://www.sthda.com/english/english/wiki/factoextra-r-package-quick-multivariate-data-analysis-pca-ca-mca-and-visualization-r-software-and-data-mining">factoextra</a>.</p>
<pre class="r"><code>library(factoextra)
# In recent versions of factoextra, ellipse.type replaces the deprecated frame.type
fviz_cluster(mc, ellipse.type = "norm", geom = "point")</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/model-based-clustering-mclust-plot-ggplot2-factoextra-1.png" title="Model-Based Clustering - Unsupervised Machine Learning" alt="Model-Based Clustering - Unsupervised Machine Learning" width="518.4" /></p>
</div>
<div id="infos" class="section level1">
<h1><span class="header-section-number">7</span> Infos</h1>
<p><span class="warning">This analysis has been performed using <strong>R software</strong> (ver. 3.2.3)</span></p>
<ul>
<li>Chris Fraley, A. E. Raftery, T. B. Murphy and L. Scrucca (2012). mclust Version 4 for R: Normal Mixture Modeling for Model-Based Clustering, Classification, and Density Estimation. Technical Report No. 597, Department of Statistics, University of Washington. <a href="https://www.stat.washington.edu/research/reports/2012/tr597.pdf">pdf</a></li>
<li>Chris Fraley and A. E. Raftery (2002). Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association 97:611-631.</li>
</ul>
</div>

<script>jQuery(document).ready(function () {
    jQuery('h1').addClass('wiki_paragraph1');
    jQuery('h2').addClass('wiki_paragraph2');
    jQuery('h3').addClass('wiki_paragraph3');
    jQuery('h4').addClass('wiki_paragraph4');
    });//add phpboost class to header</script>
<style>.content{padding:0px;}</style>

<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
  (function () {
    var script = document.createElement("script");
    script.type = "text/javascript";
    script.src  = "https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
    document.getElementsByTagName("head")[0].appendChild(script);
  })();
</script>
</div><!--end rdoc-->
<!--====================== stop here when you copy to sthda================-->

<!-- END HTML -->]]></description>
			<pubDate>Sun, 20 Dec 2015 17:12:40 +0100</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Beautiful dendrogram visualizations in R: 5+ must known methods - Unsupervised Machine Learning]]></title>
			<link>https://www.sthda.com/english/wiki/beautiful-dendrogram-visualizations-in-r-5-must-known-methods-unsupervised-machine-learning</link>
			<guid>https://www.sthda.com/english/wiki/beautiful-dendrogram-visualizations-in-r-5-must-known-methods-unsupervised-machine-learning</guid>
			<description><![CDATA[<!-- START HTML -->

  <!--====================== start from here when you copy to sthda================-->  
  <div id="rdoc">

<div id="TOC">
<ul>
<li><a href="#plot.hclust-r-base-function"><span class="toc-section-number">1</span> plot.hclust(): R base function</a></li>
<li><a href="#plot.dendrogram-function"><span class="toc-section-number">2</span> plot.dendrogram() function</a></li>
<li><a href="#phylogenetic-trees"><span class="toc-section-number">3</span> Phylogenetic trees</a></li>
<li><a href="#ggdendro-package-ggplot2-and-dendrogram"><span class="toc-section-number">4</span> ggdendro package : ggplot2 and dendrogram</a><ul>
<li><a href="#installation-and-loading"><span class="toc-section-number">4.1</span> Installation and loading</a></li>
<li><a href="#visualize-dendrogram-using-ggdendrogram-function"><span class="toc-section-number">4.2</span> Visualize dendrogram using ggdendrogram() function</a></li>
<li><a href="#extract-dendrogram-plot-data"><span class="toc-section-number">4.3</span> Extract dendrogram plot data</a></li>
</ul></li>
<li><a href="#dendextend-package-extending-rs-dendrogram-functionality"><span class="toc-section-number">5</span> dendextend package: Extending R’s dendrogram functionality</a><ul>
<li><a href="#chaining"><span class="toc-section-number">5.1</span> Chaining</a></li>
<li><a href="#installation-and-loading-1"><span class="toc-section-number">5.2</span> Installation and loading</a></li>
<li><a href="#how-to-change-a-dendrogram"><span class="toc-section-number">5.3</span> How to change a dendrogram</a></li>
<li><a href="#create-a-simple-dendrogram"><span class="toc-section-number">5.4</span> Create a simple dendrogram</a></li>
<li><a href="#change-labels"><span class="toc-section-number">5.5</span> Change labels</a></li>
<li><a href="#change-the-points-of-a-dendrogram-nodesleaves"><span class="toc-section-number">5.6</span> Change the points of a dendrogram nodes/leaves</a></li>
<li><a href="#change-the-color-of-branches"><span class="toc-section-number">5.7</span> Change the color of branches</a></li>
<li><a href="#adding-colored-rectangles"><span class="toc-section-number">5.8</span> Adding colored rectangles</a></li>
<li><a href="#adding-colored-bars"><span class="toc-section-number">5.9</span> Adding colored bars</a></li>
<li><a href="#ggplot2-integration"><span class="toc-section-number">5.10</span> ggplot2 integration</a></li>
<li><a href="#pvclust-and-dendextend"><span class="toc-section-number">5.11</span> pvclust and dendextend</a></li>
</ul></li>
<li><a href="#infos"><span class="toc-section-number">6</span> Infos</a></li>
</ul>
</div>

<p><br/></p>
<p>A variety of functions exist in <strong>R</strong> for visualizing and customizing <strong>dendrograms</strong>. The aim of this article is to describe 5+ methods for <strong>drawing</strong> a beautiful dendrogram using <strong>R software</strong>.</p>
<p>We start by computing <a href="https://www.sthda.com/english/english/wiki/hierarchical-clustering-essentials-unsupervised-machine-learning"><strong>hierarchical clustering</strong></a> using the data set USArrests:</p>
<pre class="r"><code># Load data
data(USArrests)

# Compute distances and hierarchical clustering
dd <- dist(scale(USArrests), method = "euclidean")
hc <- hclust(dd, method = "ward.D2")</code></pre>
<div id="plot.hclust-r-base-function" class="section level1">
<h1><span class="header-section-number">1</span> plot.hclust(): R base function</h1>
<p>As you already know, the standard R function <strong>plot.hclust()</strong> can be used to draw a dendrogram from the results of hierarchical clustering analyses (computed using <strong>hclust()</strong> function).</p>
<p>A simplified format is:</p>
<pre class="r"><code>plot(x, labels = NULL, hang = 0.1, 
     main = "Cluster dendrogram", sub = NULL,
     xlab = NULL, ylab = "Height", ...)</code></pre>
<br/>
<div class="block">
<ul>
<li><strong>x</strong>: an object of the type produced by hclust()</li>
<li><strong>labels</strong>: A character vector of labels for the leaves of the tree. The default value is row names. If <strong>labels = FALSE</strong>, no labels are drawn.</li>
<li><strong>hang</strong>: The fraction of the plot height by which labels should hang below the rest of the plot. A negative value will cause the labels to hang down from 0.</li>
<li><strong>main, sub, xlab, ylab</strong>: character strings for title.</li>
</ul>
</div>
<p><br/></p>
<pre class="r"><code># Default plot
plot(hc)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/dendrogram-visualization-plot-hierarchical-clustoring-1.png" title="dendrogram visualization - Unsupervised Machine Learning" alt="dendrogram visualization - Unsupervised Machine Learning" width="518.4" /></p>
<pre class="r"><code># Put the labels at the same height: hang = -1
plot(hc, hang = -1, cex = 0.6)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/dendrogram-visualization-plot-hierarchical-clustoring-2.png" title="dendrogram visualization - Unsupervised Machine Learning" alt="dendrogram visualization - Unsupervised Machine Learning" width="518.4" /></p>
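<p>A dendrogram drawn with plot.hclust() can also be annotated with the groups obtained by cutting the tree. The sketch below is a minimal example using the base functions <strong>cutree()</strong> and <strong>rect.hclust()</strong> [in <strong>stats</strong>]; the choice of k = 4 groups is an assumption made only for illustration.</p>
<pre class="r"><code># Recompute the hierarchical clustering so that the example is self-contained
data(USArrests)
hc <- hclust(dist(scale(USArrests)), method = "ward.D2")

# Cut the tree into 4 groups (illustrative choice)
grp <- cutree(hc, k = 4)
table(grp)

# Draw the tree without leaf labels and frame the 4 groups
plot(hc, labels = FALSE, hang = -1)
rect.hclust(hc, k = 4, border = 2:5)</code></pre>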
</div>
<div id="plot.dendrogram-function" class="section level1">
<h1><span class="header-section-number">2</span> plot.dendrogram() function</h1>
<p>To visualize the result of a hierarchical clustering analysis using the function <strong>plot.dendrogram()</strong>, we must first convert it to a dendrogram.</p>
<p>The format of the function <strong>plot.dendrogram()</strong> is:</p>
<pre class="r"><code>plot(x, type = c("rectangle", "triangle"), horiz = FALSE)</code></pre>
<br/>
<div class="block">
<ul>
<li><strong>x</strong>: an object of class dendrogram</li>
<li><strong>type</strong>: type of plot. Possible values are “rectangle” or “triangle”</li>
<li><strong>horiz</strong>: logical indicating whether the dendrogram should be drawn horizontally or not</li>
</ul>
</div>
<p><br/></p>
<pre class="r"><code># Convert hclust into a dendrogram and plot
hcd <- as.dendrogram(hc)
# Default plot
plot(hcd, type = "rectangle", ylab = "Height")</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/dendrogram-visualization-plot-dendrogram-1.png" title="dendrogram visualization - Unsupervised Machine Learning" alt="dendrogram visualization - Unsupervised Machine Learning" width="518.4" /></p>
<pre class="r"><code># Triangle plot
plot(hcd, type = "triangle", ylab = "Height")</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/dendrogram-visualization-plot-dendrogram-2.png" title="dendrogram visualization - Unsupervised Machine Learning" alt="dendrogram visualization - Unsupervised Machine Learning" width="518.4" /></p>
<pre class="r"><code># Zoom in to the first dendrogram
plot(hcd, xlim = c(1, 20), ylim = c(1,8))</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/dendrogram-visualization-plot-dendrogram-3.png" title="dendrogram visualization - Unsupervised Machine Learning" alt="dendrogram visualization - Unsupervised Machine Learning" width="518.4" /></p>
<p>The above dendrogram can be customized using the arguments:</p>
<ul>
<li><strong>nodePar</strong>: a list of plotting parameters to use for the nodes (see <strong>?points</strong>). Default value is NULL. The list may contain components named pch, cex, col, xpd, and/or bg each of which can have length two for specifying separate attributes for inner nodes and leaves.</li>
<li><strong>edgePar</strong>: a list of plotting parameters to use for the edge segments (see <strong>?segments</strong>). The list may contain components named col, lty and lwd (for the segments). As with nodePar, each can have length two for differentiating leaves and inner nodes.</li>
<li><strong>leaflab</strong>: a string specifying how leaves are labeled. The default “perpendicular” writes text vertically; “textlike” writes text horizontally (in a rectangle), and “none” suppresses leaf labels.</li>
</ul>
<pre class="r"><code># Define nodePar
nodePar <- list(lab.cex = 0.6, pch = c(NA, 19), 
                cex = 0.7, col = "blue")
# Customized plot; remove labels
plot(hcd, ylab = "Height", nodePar = nodePar, leaflab = "none")</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/dendrogram-visualization-prolt-dendrogram-customize-1.png" title="dendrogram visualization - Unsupervised Machine Learning" alt="dendrogram visualization - Unsupervised Machine Learning" width="518.4" /></p>
<pre class="r"><code># Horizontal plot
plot(hcd,  xlab = "Height",
     nodePar = nodePar, horiz = TRUE)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/dendrogram-visualization-prolt-dendrogram-customize-2.png" title="dendrogram visualization - Unsupervised Machine Learning" alt="dendrogram visualization - Unsupervised Machine Learning" width="518.4" /></p>
<pre class="r"><code># Change edge color
plot(hcd,  ylab = "Height", nodePar = nodePar, 
     edgePar = list(col = 2:3, lwd = 2:1))</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/dendrogram-visualization-prolt-dendrogram-customize-3.png" title="dendrogram visualization - Unsupervised Machine Learning" alt="dendrogram visualization - Unsupervised Machine Learning" width="518.4" /></p>
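<p>As a further illustration of these arguments, the sketch below combines <strong>leaflab = "textlike"</strong> with a two-element <strong>edgePar</strong> and zooms in on the first ten leaves. It rebuilds <strong>hc</strong> from the scaled USArrests data with Ward linkage, which is an assumption made here so that the example runs on its own; reuse your own <strong>hc</strong> object if it was built differently.</p>
<pre class="r"><code># Assumption: hc is rebuilt here; your hc may use other data or settings
hc <- hclust(dist(scale(USArrests)), method = "ward.D2")
hcd <- as.dendrogram(hc)
# Two values per component: first for inner nodes, second for leaves
nodePar <- list(lab.cex = 0.6, pch = c(NA, 19), cex = 0.7, col = "blue")
# Horizontal boxed labels; edge color and width differ for inner nodes and leaves
plot(hcd, ylab = "Height", nodePar = nodePar, leaflab = "textlike",
     edgePar = list(col = c("grey40", "steelblue"), lwd = 2:1),
     xlim = c(1, 10))</code></pre>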
</div>
<div id="phylogenetic-trees" class="section level1">
<h1><span class="header-section-number">3</span> Phylogenetic trees</h1>
<p>The package <strong>ape</strong> (Analyses of Phylogenetics and Evolution) can be used to produce a more sophisticated dendrogram.</p>
<p>The function <strong>plot.phylo()</strong> can be used for plotting a dendrogram. A simplified format is:</p>
<pre class="r"><code>plot(x, type = "phylogram", show.tip.label = TRUE,
     edge.color = "black", edge.width = 1, edge.lty = 1,
     tip.color = "black")</code></pre>
<br/>
<div class="block">
<ul>
<li><strong>x</strong>: an object of class “phylo”</li>
<li><strong>type</strong>: the type of phylogeny to be drawn. Possible values are: “phylogram” (the default), “cladogram”, “fan”, “unrooted” and “radial”</li>
<li><strong>show.tip.label</strong>: if TRUE, the tip labels are shown</li>
<li><strong>edge.color, edge.width, edge.lty</strong>: line color, width and type to be used for edge</li>
<li><strong>tip.color</strong>: color used for labels</li>
</ul>
</div>
<p><br/></p>
<pre class="r"><code># install.packages("ape")
library("ape")
# Default plot
plot(as.phylo(hc), cex = 0.6, label.offset = 0.5)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/dendrogram-visualization-phylogenetic-trees-1.png" title="dendrogram visualization - Unsupervised Machine Learning" alt="dendrogram visualization - Unsupervised Machine Learning" width="518.4" /></p>
<pre class="r"><code># Cladogram
plot(as.phylo(hc), type = "cladogram", cex = 0.6, 
     label.offset = 0.5)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/dendrogram-visualization-phylogenetic-trees-2.png" title="dendrogram visualization - Unsupervised Machine Learning" alt="dendrogram visualization - Unsupervised Machine Learning" width="518.4" /></p>
<pre class="r"><code># Unrooted
plot(as.phylo(hc), type = "unrooted", cex = 0.6,
     no.margin = TRUE)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/dendrogram-visualization-phylogenetic-trees-3.png" title="dendrogram visualization - Unsupervised Machine Learning" alt="dendrogram visualization - Unsupervised Machine Learning" width="518.4" /></p>
<pre class="r"><code># Fan
plot(as.phylo(hc), type = "fan")</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/dendrogram-visualization-phylogenetic-trees-4.png" title="dendrogram visualization - Unsupervised Machine Learning" alt="dendrogram visualization - Unsupervised Machine Learning" width="518.4" /></p>
<pre class="r"><code># Radial
plot(as.phylo(hc), type = "radial")</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/dendrogram-visualization-phylogenetic-trees-5.png" title="dendrogram visualization - Unsupervised Machine Learning" alt="dendrogram visualization - Unsupervised Machine Learning" width="518.4" /></p>
<pre class="r"><code># Cut the dendrogram into 4 clusters
colors <- c("red", "blue", "green", "black")
clus4 <- cutree(hc, 4)
plot(as.phylo(hc), type = "fan", tip.color = colors[clus4],
     label.offset = 1, cex = 0.7)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/dendrogram-visualization-phylogenetic-trees-6.png" title="dendrogram visualization - Unsupervised Machine Learning" alt="dendrogram visualization - Unsupervised Machine Learning" width="518.4" /></p>
<pre class="r"><code># Change the appearance
# change edge and label (tip)
plot(as.phylo(hc), type = "cladogram", cex = 0.6,
     edge.color = "steelblue", edge.width = 2, edge.lty = 2,
     tip.color = "steelblue")</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/dendrogram-visualization-phylogenetic-trees-7.png" title="dendrogram visualization - Unsupervised Machine Learning" alt="dendrogram visualization - Unsupervised Machine Learning" width="518.4" /></p>
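<p>The <strong>tip.color</strong> example above relies on plain vector indexing: <strong>cutree()</strong> returns one cluster number per observation, and indexing a color vector by those numbers yields one color per tip. The sketch below spells this out. It assumes that <strong>hc</strong> comes from the scaled USArrests data (rebuilt here so the example is self-contained) and only draws the tree if <strong>ape</strong> is installed.</p>
<pre class="r"><code># Assumption: hc is rebuilt here so that the example runs on its own
hc <- hclust(dist(scale(USArrests)), method = "ward.D2")
colors <- c("red", "blue", "green", "black")
clus4 <- cutree(hc, k = 4)   # named integer vector, in data (row) order
tip_cols <- colors[clus4]    # one color per observation
table(clus4)                 # cluster sizes
head(data.frame(state = names(clus4), cluster = clus4, color = tip_cols))
# Draw the colored tree only if the ape package is available
if (requireNamespace("ape", quietly = TRUE)) {
  plot(ape::as.phylo(hc), type = "fan", tip.color = tip_cols, cex = 0.7)
}</code></pre>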
</div>
<div id="ggdendro-package-ggplot2-and-dendrogram" class="section level1">
<h1><span class="header-section-number">4</span> ggdendro package : ggplot2 and dendrogram</h1>
<p>The R package <strong>ggdendro</strong> can be used to extract the plot data from a dendrogram and to draw the dendrogram using <strong>ggplot2</strong>.</p>
<div id="installation-and-loading" class="section level2">
<h2><span class="header-section-number">4.1</span> Installation and loading</h2>
<p><strong>ggdendro</strong> can be installed as follows:</p>
<pre class="r"><code>install.packages("ggdendro")</code></pre>
<p><span class="warning"> <strong>ggdendro</strong> requires the package <strong>ggplot2</strong>. Make sure that <strong>ggplot2</strong> is installed and loaded before using <strong>ggdendro</strong>.</span></p>
<p>Load <strong>ggdendro</strong> as follows:</p>
<pre class="r"><code>library("ggplot2")
library("ggdendro")</code></pre>
</div>
<div id="visualize-dendrogram-using-ggdendrogram-function" class="section level2">
<h2><span class="header-section-number">4.2</span> Visualize dendrogram using ggdendrogram() function</h2>
<p>The function <strong>ggdendrogram()</strong> creates a dendrogram plot using <strong>ggplot2</strong>.</p>
<pre class="r"><code># Visualization using the default theme named theme_dendro()
ggdendrogram(hc)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/dendrogram-visualization-ggdendrogram-1.png" title="dendrogram visualization - Unsupervised Machine Learning" alt="dendrogram visualization - Unsupervised Machine Learning" width="518.4" /></p>
<pre class="r"><code># Rotate the plot and remove default theme
ggdendrogram(hc, rotate = TRUE, theme_dendro = FALSE)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/dendrogram-visualization-ggdendrogram-2.png" title="dendrogram visualization - Unsupervised Machine Learning" alt="dendrogram visualization - Unsupervised Machine Learning" width="518.4" /></p>
</div>
<div id="extract-dendrogram-plot-data" class="section level2">
<h2><span class="header-section-number">4.3</span> Extract dendrogram plot data</h2>
<p>The function <strong>dendro_data()</strong> can be used for extracting the data. It returns a list of data frames which can be extracted using the functions below:</p>
<ul>
<li><strong>segment()</strong>: To extract the data for dendrogram line segments</li>
<li><strong>label()</strong>: To extract the labels</li>
</ul>
<pre class="r"><code># Build dendrogram object from hclust results
dend <- as.dendrogram(hc)

# Extract the data (for rectangular lines)
# Type can be "rectangle" or "triangle"
dend_data <- dendro_data(dend, type = "rectangle")
# What dend_data contains
names(dend_data)</code></pre>
<pre><code>## [1] "segments"    "labels"      "leaf_labels" "class"</code></pre>
<pre class="r"><code># Extract data for line segments
head(dend_data$segments)</code></pre>
<pre><code>##           x         y     xend      yend
## 1 19.771484 13.516242 8.867188 13.516242
## 2  8.867188 13.516242 8.867188  6.461866
## 3  8.867188  6.461866 4.125000  6.461866
## 4  4.125000  6.461866 4.125000  2.714554
## 5  4.125000  2.714554 2.500000  2.714554
## 6  2.500000  2.714554 2.500000  1.091092</code></pre>
<pre class="r"><code># Extract data for labels
head(dend_data$labels)</code></pre>
<pre><code>##   x y          label
## 1 1 0        Alabama
## 2 2 0      Louisiana
## 3 3 0        Georgia
## 4 4 0      Tennessee
## 5 5 0 North Carolina
## 6 6 0    Mississippi</code></pre>
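<p>The accessor functions <strong>segment()</strong> and <strong>label()</strong> listed above give the same data frames as the list elements. A minimal check, assuming <strong>hc</strong> is built from the scaled USArrests data (rebuilt here only to keep the example self-contained):</p>
<pre class="r"><code>library(ggdendro)
# Assumption: hc is rebuilt here; reuse your own hc if it differs
hc <- hclust(dist(scale(USArrests)), method = "ward.D2")
dend_data <- dendro_data(as.dendrogram(hc), type = "rectangle")
# The accessors return the same data frames as the list elements
same_segments <- identical(segment(dend_data), dend_data$segments)
same_labels <- identical(label(dend_data), dend_data$labels)
c(segments = same_segments, labels = same_labels)</code></pre>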
<p><strong>dend_data</strong> can be used to draw a customized dendrogram using ggplot2:</p>
<pre class="r"><code># Plot line segments and add labels
p <- ggplot(dend_data$segments) + 
  geom_segment(aes(x = x, y = y, xend = xend, yend = yend))+
  geom_text(data = dend_data$labels, aes(x, y, label = label),
            hjust = 1, angle = 90, size = 3)+
  ylim(-3, 15)
print(p)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/dendrogram-visualization-ggplot2-dendrogram-1.png" title="dendrogram visualization - Unsupervised Machine Learning" alt="dendrogram visualization - Unsupervised Machine Learning" width="518.4" /></p>
</div>
</div>
<div id="dendextend-package-extending-rs-dendrogram-functionality" class="section level1">
<h1><span class="header-section-number">5</span> dendextend package: Extending R’s dendrogram functionality</h1>
<p>The package <strong>dendextend</strong> contains many functions for changing the appearance of a dendrogram and for comparing dendrograms.</p>
<p>In this section we’ll use the <strong>chaining operator</strong> (<strong>%>%</strong>) to simplify our code.</p>
<div id="chaining" class="section level2">
<h2><span class="header-section-number">5.1</span> Chaining</h2>
<p>The <strong>chaining</strong> operator (<strong>%>%</strong>) turns <strong>x %>% f(y)</strong> into f(x, y), so a sequence of operations can be written and read from left to right and top to bottom. For instance, the two pieces of R code below, applied to the first five rows of USArrests, give the same dendrogram.</p>
<p><strong>Standard R code for creating a dendrogram</strong>:</p>
<pre class="r"><code>data <- scale(USArrests[1:5,])
dist.res <- dist(data)
hc <- hclust(dist.res, method = "ward.D2")
dend <- as.dendrogram(hc)
plot(dend)</code></pre>
<p><strong>R code for creating a dendrogram using chaining operator</strong>:</p>
<pre class="r"><code>dend <- USArrests[1:5,] %>% # data
        scale %>% # Scale the data
        dist %>% # calculate a distance matrix, 
        hclust(method = "ward.D2") %>% # Hierarchical clustering 
        as.dendrogram # Turn the object into a dendrogram.
plot(dend)</code></pre>
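<p>To check that both styles really give the same result, the sketch below compares the merge history and the heights of the two trees. It loads <strong>%>%</strong> from the <strong>magrittr</strong> package, which is an assumption of this example; the operator is also available once <strong>dendextend</strong> is loaded.</p>
<pre class="r"><code>library(magrittr)  # assumption: %>% is taken from magrittr here
x <- USArrests[1:5, ]
# Nested calls, read from the inside out
hc_nested <- hclust(dist(scale(x)), method = "ward.D2")
# The same steps chained from left to right
hc_piped <- x %>% scale %>% dist %>% hclust(method = "ward.D2")
# The pipe only changes how the code is written, not what it computes
same_merge <- identical(hc_nested$merge, hc_piped$merge)
same_height <- identical(hc_nested$height, hc_piped$height)
c(merge = same_merge, height = same_height)</code></pre>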
</div>
<div id="installation-and-loading-1" class="section level2">
<h2><span class="header-section-number">5.2</span> Installation and loading</h2>
<p>Install the stable version as follows:</p>
<pre class="r"><code>install.packages("dendextend")</code></pre>
<p>Loading:</p>
<pre class="r"><code>library(dendextend)</code></pre>
</div>
<div id="how-to-change-a-dendrogram" class="section level2">
<h2><span class="header-section-number">5.3</span> How to change a dendrogram</h2>
<p>The function <strong>set()</strong> can be used to change the parameters with dendextend.</p>
<p>The format is:</p>
<pre class="r"><code>set(object, what, value)</code></pre>
<br/>
<div class="block">
<ol style="list-style-type: decimal">
<li><strong>object</strong>: a dendrogram object</li>
<li><strong>what</strong>: a character string indicating which property of the tree should be set or updated</li>
<li><strong>value</strong>: a vector with the value to set in the tree (the type of the value depends on the “what”).</li>
</ol>
</div>
<p><br/></p>
<p>Possible values for the argument <strong>what</strong> include:</p>
<table>
<thead>
<tr class="header">
<th align="left">Value for the argument <strong>what</strong></th>
<th align="left">Description</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left"><strong>labels</strong></td>
<td align="left">set the labels</td>
</tr>
<tr class="even">
<td align="left"><strong>labels_colors</strong> and <strong>labels_cex</strong></td>
<td align="left">Set the color and the size of labels, respectively</td>
</tr>
<tr class="odd">
<td align="left"><strong>leaves_pch</strong>, <strong>leaves_cex</strong> and <strong>leaves_col</strong></td>
<td align="left">set the point type, size and color for leaves, respectively</td>
</tr>
<tr class="even">
<td align="left"><strong>nodes_pch</strong>, <strong>nodes_cex</strong> and <strong>nodes_col</strong></td>
<td align="left">set the point type, size and color for nodes, respectively</td>
</tr>
<tr class="odd">
<td align="left"><strong>hang_leaves</strong></td>
<td align="left">hang the leaves</td>
</tr>
<tr class="even">
<td align="left"><strong>branches_k_color</strong></td>
<td align="left">color the branches</td>
</tr>
<tr class="odd">
<td align="left"><strong>branches_col</strong>, <strong>branches_lwd</strong> and <strong>branches_lty</strong></td>
<td align="left">Set the color, the line width and the line type of branches, respectively</td>
</tr>
<tr class="even">
<td align="left"><strong>by_labels_branches_col</strong>, <strong>by_labels_branches_lwd</strong> and <strong>by_labels_branches_lty</strong></td>
<td align="left">Set the color, the line width and the line type of branches with specific labels, respectively</td>
</tr>
<tr class="odd">
<td align="left"><strong>clear_branches</strong> and <strong>clear_leaves</strong></td>
<td align="left">Clear branches and leaves, respectively</td>
</tr>
</tbody>
</table>
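<p>Several of these properties can be set in a single chain. The sketch below is only an illustration: the five-row subset of USArrests and the chosen values are assumptions of this example, not recommended settings.</p>
<pre class="r"><code>library(dendextend)
# Small example tree (first five rows of USArrests, as an illustration)
dend <- USArrests[1:5, ] %>% scale %>% dist %>% hclust %>% as.dendrogram
dend2 <- dend %>%
  set("labels_colors", "blue") %>%  # label color
  set("labels_cex", 0.8) %>%        # label size
  set("branches_lwd", 2) %>%        # branch line width
  set("branches_lty", 2)            # dashed branches
plot(dend2, main = "Several properties \nset in one chain")
# Inspect the label colors stored in the modified tree
labels_colors(dend2)</code></pre>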
</div>
<div id="create-a-simple-dendrogram" class="section level2">
<h2><span class="header-section-number">5.4</span> Create a simple dendrogram</h2>
<pre class="r"><code># Create a dendrogram and plot it
dend <- USArrests[1:5,] %>%  scale %>% 
        dist %>% hclust %>% as.dendrogram

dend %>% plot</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/dendrogram-visualization-simple-dendrogram-dendextend-1.png" title="dendrogram visualization - Unsupervised Machine Learning" alt="dendrogram visualization - Unsupervised Machine Learning" width="240" /></p>
<pre class="r"><code># Get the labels of the tree
labels(dend)</code></pre>
<pre><code>## [1] "Alaska"     "Arizona"    "California" "Alabama"    "Arkansas"</code></pre>
</div>
<div id="change-labels" class="section level2">
<h2><span class="header-section-number">5.5</span> Change labels</h2>
<p>This section describes how to change label names as well as the color and the size for labels.</p>
<pre class="r"><code># Change the labels, and then plot:
dend %>% set("labels", c("a", "b", "c", "d", "e")) %>% plot</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/dendrogram-visualization-change-labels-dendrogram-dendextend-1.png" title="dendrogram visualization - Unsupervised Machine Learning" alt="dendrogram visualization - Unsupervised Machine Learning" width="240" /></p>
<pre class="r"><code># Change color and size for labels
dend %>% set("labels_col", c("green", "blue")) %>% # change color
  set("labels_cex", 2) %>% # Change size
  plot(main = "Change the color \nand size") # plot</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/dendrogram-visualization-change-labels-dendrogram-dendextend-2.png" title="dendrogram visualization - Unsupervised Machine Learning" alt="dendrogram visualization - Unsupervised Machine Learning" width="240" /></p>
<pre class="r"><code># Color labels by specifying the number of cluster (k)
dend %>% set("labels_col", value = c("green", "blue"), k=2) %>% 
          plot(main = "Color labels \nper cluster")
abline(h = 2, lty = 2)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/dendrogram-visualization-change-labels-dendrogram-dendextend-3.png" title="dendrogram visualization - Unsupervised Machine Learning" alt="dendrogram visualization - Unsupervised Machine Learning" width="240" /></p>
<p><span class="warning">When the color vector is shorter than the number of labels, as in the first <strong>labels_col</strong> example above (two colors for five labels), its values are recycled.</span></p>
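<p>Recycling can be pictured with base R alone. The sketch below repeats a two-color vector over the five leaves, assuming that the colors are applied in the order returned by <strong>labels(dend)</strong>; it illustrates the idea and is not the internal code of dendextend.</p>
<pre class="r"><code># Same five-row example tree as above, built with base R only
dend <- as.dendrogram(hclust(dist(scale(USArrests[1:5, ]))))
cols <- c("green", "blue")
# Repeat the two colors until every leaf has one
recycled <- setNames(rep_len(cols, length(labels(dend))), labels(dend))
recycled</code></pre>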
</div>
<div id="change-the-points-of-a-dendrogram-nodesleaves" class="section level2">
<h2><span class="header-section-number">5.6</span> Change the points of a dendrogram nodes/leaves</h2>
<pre class="r"><code># Change the type, the color and the size of node points
# +++++++++++++++++++++++++++++
dend %>% set("nodes_pch", 19) %>%  # node point type
  set("nodes_cex", 2) %>%  # node point size
  set("nodes_col", "blue") %>% # node point color
  plot(main = "Node points")</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/dendrogram-visualization-change-dendrogram-nodes-leaves-1.png" title="dendrogram visualization - Unsupervised Machine Learning" alt="dendrogram visualization - Unsupervised Machine Learning" width="288" /></p>
<pre class="r"><code># Change the type, the color and the size of leaf points
# +++++++++++++++++++++++++++++
dend %>% set("leaves_pch", 19) %>%  # leaf point type
  set("leaves_cex", 2) %>%  # leaf point size
  set("leaves_col", "blue") %>% # leaf point color
  plot(main = "Leaves points")</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/dendrogram-visualization-change-dendrogram-nodes-leaves-2.png" title="dendrogram visualization - Unsupervised Machine Learning" alt="dendrogram visualization - Unsupervised Machine Learning" width="288" /></p>
<pre class="r"><code># Specify different point types and colors for each leaf
dend %>% set("leaves_pch", c(17, 18, 19)) %>%  # leaf point type
  set("leaves_cex", 2) %>%  # leaf point size
  set("leaves_col", c("blue", "red", "green")) %>% # leaf point color
  plot(main = "Leaves points")</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/dendrogram-visualization-change-dendrogram-nodes-leaves-3.png" title="dendrogram visualization - Unsupervised Machine Learning" alt="dendrogram visualization - Unsupervised Machine Learning" width="288" /></p>
</div>
<div id="change-the-color-of-branches" class="section level2">
<h2><span class="header-section-number">5.7</span> Change the color of branches</h2>
<p>Branches can be colored by cluster with the option <strong>branches_k_color</strong>: the tree is cut into <strong>k</strong> groups and the branches of each group receive their own color:</p>
<pre class="r"><code># Default colors
dend %>% set("branches_k_color", k = 2) %>% 
  plot(main = "Default colors")

# Customized colors
dend %>% set("branches_k_color", 
             value = c("red", "blue"), k = 2) %>% 
   plot(main = "Customized colors")</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/dendrogram-visualization-color-branches-1.png" title="dendrogram visualization - Unsupervised Machine Learning" alt="dendrogram visualization - Unsupervised Machine Learning" width="240" /><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/dendrogram-visualization-color-branches-2.png" title="dendrogram visualization - Unsupervised Machine Learning" alt="dendrogram visualization - Unsupervised Machine Learning" width="240" /></p>
<p><span class="notice">It’s also possible to use the function <strong>color_branches()</strong>.</span></p>
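<p>A minimal sketch of <strong>color_branches()</strong>, using the same five-row example tree; the two colors are arbitrary choices for this illustration:</p>
<pre class="r"><code>library(dendextend)
dend <- USArrests[1:5, ] %>% scale %>% dist %>% hclust %>% as.dendrogram
# Cut the tree into k = 2 clusters and color the branches of each one
dend_col <- color_branches(dend, k = 2, col = c("red", "blue"))
plot(dend_col, main = "color_branches()")</code></pre>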
</div>
<div id="adding-colored-rectangles" class="section level2">
<h2><span class="header-section-number">5.8</span> Adding colored rectangles</h2>
<p>Clusters can be highlighted by adding colored rectangles. This is done using the <strong>rect.dendrogram()</strong> function (modeled on the <strong>rect.hclust()</strong> function). One advantage of rect.dendrogram over rect.hclust is that it also works on horizontally plotted trees:</p>
<pre class="r"><code># Vertical plot
dend %>% set("branches_k_color", k = 3) %>% plot
dend %>% rect.dendrogram(k=3, border = 8, lty = 5, lwd = 2)

# Horizontal plot
dend %>% set("branches_k_color", k = 3) %>% plot(horiz = TRUE)
dend %>% rect.dendrogram(k = 3, horiz = TRUE, border = 8, lty = 5, lwd = 2)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/dendrogram-visualization-dendrogram-add-rectangle-1.png" title="dendrogram visualization - Unsupervised Machine Learning" alt="dendrogram visualization - Unsupervised Machine Learning" width="259.2" /><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/dendrogram-visualization-dendrogram-add-rectangle-2.png" title="dendrogram visualization - Unsupervised Machine Learning" alt="dendrogram visualization - Unsupervised Machine Learning" width="259.2" /></p>
</div>
<div id="adding-colored-bars" class="section level2">
<h2><span class="header-section-number">5.9</span> Adding colored bars</h2>
<p>Colored bars drawn below the dendrogram are useful for annotating the items in the clusters:</p>
<pre class="r"><code>grp <- c(1,1,1, 2,2)
k_3 <- cutree(dend,k = 3, order_clusters_as_data = FALSE) 
# The FALSE above makes sure we get the clusters in the order of the
# dendrogram, and not in that of the original data. It is like:
# cutree(dend, k = 3)[order.dendrogram(dend)]

the_bars <- cbind(grp, k_3)

dend %>% set("labels", "") %>% plot
colored_bars(colors = the_bars, dend = dend)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/dendrogram-visualization-annotating-items-1.png" title="dendrogram visualization - Unsupervised Machine Learning" alt="dendrogram visualization - Unsupervised Machine Learning" width="288" /></p>
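<p>The reordering mentioned in the code comments above can be reproduced with base R. The sketch below cuts the <strong>hclust</strong> object, which returns clusters in data order, and then rearranges them to follow the leaves of the dendrogram. It only shows the reordering; the cluster numbers themselves may differ from those returned by dendextend.</p>
<pre class="r"><code># Same five-row example tree, built with base R only
hc <- hclust(dist(scale(USArrests[1:5, ])))
dend <- as.dendrogram(hc)
k_data <- stats::cutree(hc, k = 3)        # clusters in data (row) order
k_dend <- k_data[order.dendrogram(dend)]  # clusters in leaf (left-to-right) order
# Compare the two orderings of the observations
rbind(data_order = names(k_data), dendrogram_order = names(k_dend))</code></pre>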
</div>
<div id="ggplot2-integration" class="section level2">
<h2><span class="header-section-number">5.10</span> ggplot2 integration</h2>
<p>The following 2 steps are used:</p>
<ol style="list-style-type: decimal">
<li>Transform a dendrogram into a <strong>ggdend object</strong> using <strong>as.ggdend()</strong> function</li>
<li>Make the plot using the function <strong>ggplot()</strong></li>
</ol>
<pre class="r"><code>dend <- iris[1:30,-5] %>% scale %>% dist %>% 
   hclust %>% as.dendrogram %>%
   set("branches_k_color", k=3) %>% set("branches_lwd", 1.2) %>%
   set("labels_colors") %>% set("labels_cex", c(.9,1.2)) %>% 
   set("leaves_pch", 19) %>% set("leaves_col", c("blue", "red"))
# plot the dend in usual "base" plotting engine:
plot(dend)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/dendrogram-visualization-ggplot2-dendrogram-2-1.png" title="dendrogram visualization - Unsupervised Machine Learning" alt="dendrogram visualization - Unsupervised Machine Learning" width="518.4" /></p>
<p>Produce the same plot with <strong>ggplot2</strong> using the function <strong>as.ggdend()</strong>:</p>
<pre class="r"><code>library(ggplot2)
# Rectangle dendrogram using ggplot2
ggd1 <- as.ggdend(dend)
ggplot(ggd1) </code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/dendrogram-visualization-dendrogram-ggplot2-1.png" title="dendrogram visualization - Unsupervised Machine Learning" alt="dendrogram visualization - Unsupervised Machine Learning" width="518.4" /></p>
<pre class="r"><code># Horizontal plot with the default ggplot2 theme
ggplot(ggd1, horiz = TRUE, theme = NULL) </code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/dendrogram-visualization-dendrogram-ggplot2-2.png" title="dendrogram visualization - Unsupervised Machine Learning" alt="dendrogram visualization - Unsupervised Machine Learning" width="518.4" /></p>
<pre class="r"><code># Theme minimal
ggplot(ggd1, theme = theme_minimal()) </code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/dendrogram-visualization-dendrogram-ggplot2-3.png" title="dendrogram visualization - Unsupervised Machine Learning" alt="dendrogram visualization - Unsupervised Machine Learning" width="518.4" /></p>
<pre class="r"><code># Create a radial plot and remove labels
ggplot(ggd1, labels = FALSE) + 
  scale_y_reverse(expand = c(0.2, 0)) +
  coord_polar(theta="x")</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/dendrogram-visualization-dendrogram-ggplot2-4.png" title="dendrogram visualization - Unsupervised Machine Learning" alt="dendrogram visualization - Unsupervised Machine Learning" width="518.4" /></p>
</div>
<div id="pvclust-and-dendextend" class="section level2">
<h2><span class="header-section-number">5.11</span> pvclust and dendextend</h2>
<p>The package <strong>dendextend</strong> can be used to enhance many packages including <a href="https://www.sthda.com/english/english/wiki/clustering-validation-statistics-4-vital-things-everyone-should-know-unsupervised-machine-learning"><strong>pvclust</strong></a>. Recall that pvclust calculates <strong>p-values</strong> for <strong>hierarchical clustering</strong>.</p>
<p><strong>pvclust</strong> can be used as follows:</p>
<pre class="r"><code>library(pvclust)
data(lung) # 916 genes for 73 subjects
set.seed(1234)
result <- pvclust(lung[1:100, 1:10], method.dist="cor", 
                  method.hclust="average", nboot=10)</code></pre>
<pre><code>## Bootstrap (r = 0.5)... Done.
## Bootstrap (r = 0.6)... Done.
## Bootstrap (r = 0.7)... Done.
## Bootstrap (r = 0.8)... Done.
## Bootstrap (r = 0.9)... Done.
## Bootstrap (r = 1.0)... Done.
## Bootstrap (r = 1.1)... Done.
## Bootstrap (r = 1.2)... Done.
## Bootstrap (r = 1.3)... Done.
## Bootstrap (r = 1.4)... Done.</code></pre>
<pre class="r"><code># Default plot of the result
plot(result)
pvrect(result)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/dendrogram-visualization-pvclust-1.png" title="dendrogram visualization - Unsupervised Machine Learning" alt="dendrogram visualization - Unsupervised Machine Learning" width="518.4" /></p>
<pre class="r"><code># pvclust and dendextend
result %>% as.dendrogram %>% 
  set("branches_k_color", k = 2, value = c("purple", "orange")) %>%
  plot
result %>% text
result %>% pvrect</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/dendrogram-visualization-pvclust-2.png" title="dendrogram visualization - Unsupervised Machine Learning" alt="dendrogram visualization - Unsupervised Machine Learning" width="518.4" /></p>
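<p>The clusters highlighted by <strong>pvrect()</strong> can also be listed with <strong>pvpick()</strong>. The sketch below reruns the small example above and keeps the clusters whose AU p-value is at least 0.95. The threshold and the very small number of bootstrap replicates (nboot = 10) are choices made for a quick illustration; a real analysis would need many more replicates.</p>
<pre class="r"><code>library(pvclust)
data(lung)
set.seed(1234)
# Same small example as above: nboot = 10 is only for speed
result <- pvclust(lung[1:100, 1:10], method.dist = "cor",
                  method.hclust = "average", nboot = 10)
# List the clusters whose AU p-value is at least 0.95
picked <- pvpick(result, alpha = 0.95)
picked$clusters</code></pre>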
</div>
</div>
<div id="infos" class="section level1">
<h1><span class="header-section-number">6</span> Infos</h1>
<p><span class="warning">This analysis has been performed using <strong>R software</strong> (ver. 3.2.1)</span></p>
</div>

<script>jQuery(document).ready(function () {
    jQuery('h1').addClass('wiki_paragraph1');
    jQuery('h2').addClass('wiki_paragraph2');
    jQuery('h3').addClass('wiki_paragraph3');
    jQuery('h4').addClass('wiki_paragraph4');
    });//add phpboost class to header</script>
<style>.content{padding:0px;}</style>
</div><!--end rdoc-->
<!--====================== stop here when you copy to sthda================-->


<!-- END HTML -->]]></description>
			<pubDate>Sun, 06 Dec 2015 12:47:59 +0100</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Determining the optimal number of clusters: 3 must known methods - Unsupervised Machine Learning]]></title>
			<link>https://www.sthda.com/english/wiki/determining-the-optimal-number-of-clusters-3-must-known-methods-unsupervised-machine-learning</link>
			<guid>https://www.sthda.com/english/wiki/determining-the-optimal-number-of-clusters-3-must-known-methods-unsupervised-machine-learning</guid>
			<description><![CDATA[<!-- START HTML -->

            
  <!--====================== start from here when you copy to sthda================-->  
  <div id="rdoc">

<div id="TOC">
<ul>
<li><a href="#required-packages"><span class="toc-section-number">1</span> Required packages</a></li>
<li><a href="#data-preparation"><span class="toc-section-number">2</span> Data preparation</a></li>
<li><a href="#example-of-partitioning-method-results"><span class="toc-section-number">3</span> Example of partitioning method results</a></li>
<li><a href="#example-of-hierarchical-clustering-results"><span class="toc-section-number">4</span> Example of hierarchical clustering results</a></li>
<li><a href="#three-popular-methods-for-determining-the-optimal-number-of-clusters"><span class="toc-section-number">5</span> Three popular methods for determining the optimal number of clusters</a><ul>
<li><a href="#elbow-method"><span class="toc-section-number">5.1</span> Elbow method</a><ul>
<li><a href="#concept"><span class="toc-section-number">5.1.1</span> Concept</a></li>
<li><a href="#algorithm"><span class="toc-section-number">5.1.2</span> Algorithm</a></li>
<li><a href="#r-codes"><span class="toc-section-number">5.1.3</span> R codes</a></li>
</ul></li>
<li><a href="#average-silhouette-method"><span class="toc-section-number">5.2</span> Average silhouette method</a><ul>
<li><a href="#concept-1"><span class="toc-section-number">5.2.1</span> Concept</a></li>
<li><a href="#algorithm-1"><span class="toc-section-number">5.2.2</span> Algorithm</a></li>
<li><a href="#r-codes-1"><span class="toc-section-number">5.2.3</span> R codes</a></li>
</ul></li>
<li><a href="#conclusions-about-elbow-and-silhouette-methods"><span class="toc-section-number">5.3</span> Conclusions about elbow and silhouette methods</a></li>
<li><a href="#gap-statistic-method"><span class="toc-section-number">5.4</span> Gap statistic method</a><ul>
<li><a href="#concept-2"><span class="toc-section-number">5.4.1</span> Concept</a></li>
<li><a href="#algorithm-2"><span class="toc-section-number">5.4.2</span> Algorithm</a></li>
<li><a href="#r-codes-2"><span class="toc-section-number">5.4.3</span> R codes</a></li>
</ul></li>
</ul></li>
<li><a href="#nbclust-a-package-providing-30-indices-for-determining-the-best-number-of-clusters"><span class="toc-section-number">6</span> NbClust: A Package providing 30 indices for determining the best number of clusters</a><ul>
<li><a href="#overview-of-nbclust-package"><span class="toc-section-number">6.1</span> Overview of NbClust package</a></li>
<li><a href="#nbclust-r-function"><span class="toc-section-number">6.2</span> NbClust R function</a></li>
<li><a href="#examples-of-usage"><span class="toc-section-number">6.3</span> Examples of usage</a><ul>
<li><a href="#compute-only-an-index-of-interest"><span class="toc-section-number">6.3.1</span> Compute only an index of interest</a></li>
<li><a href="#compute-all-the-30-indices"><span class="toc-section-number">6.3.2</span> Compute all the 30 indices</a></li>
</ul></li>
</ul></li>
<li><a href="#infos"><span class="toc-section-number">7</span> Infos</a></li>
</ul>
</div>

<p><br/> The first step in <strong>clustering analysis</strong> is to assess whether the dataset is clusterable. This has been described in a chapter entitled: <a href="https://www.sthda.com/english/english/wiki/assessing-clustering-tendency-a-vital-issue-unsupervised-machine-learning">Assessing Clustering Tendency</a>.</p>
<p><a href="https://www.sthda.com/english/english/wiki/partitioning-cluster-analysis-quick-start-guide-unsupervised-machine-learning"><strong>Partitioning methods</strong></a>, such as <strong>k-means clustering</strong>, also require the user to specify the number of clusters to be generated.</p>
<p><span class="question">One fundamental question is: if the data is clusterable, how do we choose the right number of expected clusters (k)?</span></p>
<p>Unfortunately, there is no definitive answer to this question. The <strong>optimal clustering</strong> is somewhat subjective and depends on the method used for measuring similarities and the parameters used for partitioning.</p>
<p>A simple and popular solution consists of inspecting the dendrogram produced using <strong>hierarchical clustering</strong> to see if it suggests a particular number of clusters. Unfortunately this approach is, again, subjective.</p>
<p>In this article, we’ll describe different methods for <strong>determining the optimal number of clusters</strong> for <strong>k-means</strong>, <strong>PAM</strong> and <strong>hierarchical</strong> clustering. These methods include <strong>direct methods</strong> and <strong>statistical testing methods</strong>.</p>
<br/>
<div class="block">
<ul>
<li><strong>Direct methods</strong> consist of optimizing a criterion, such as the <strong>within-cluster sum of squares</strong> or the <strong>average silhouette</strong>. The corresponding methods are named <em>elbow</em> and <em>silhouette</em> methods, respectively.</li>
<li><strong>Testing methods</strong> consist of comparing evidence against a null hypothesis. An example is the <strong>gap statistic</strong>.</li>
</ul>
</div>
<p><br/></p>
<p>In addition to <strong>elbow</strong>, <strong>silhouette</strong> and <strong>gap statistic</strong> methods, there are more than thirty other indices and methods that have been published for identifying the <strong>optimal number of clusters</strong>. We’ll provide <strong>R codes</strong> for computing all these 30 indices in order to decide the best number of clusters using the “majority rule”.</p>
<p>For each of these methods:</p>
<ul>
<li>We’ll describe the basic idea, the algorithm and the key mathematical concept</li>
<li>We’ll provide easy-to-use <strong>R codes</strong> with many examples for determining the optimal number of clusters and visualizing the output</li>
</ul>
<div id="required-packages" class="section level1">
<h1><span class="header-section-number">1</span> Required packages</h1>
<p>The following packages will be used:</p>
<ul>
<li><strong>cluster</strong> for computing <strong>pam</strong> and for analyzing cluster silhouettes</li>
<li><strong>factoextra</strong> for visualizing clusters using <strong>ggplot2</strong> plotting system</li>
<li><strong>NbClust</strong> for finding the optimal number of clusters</li>
</ul>
<p>Install the <strong>factoextra</strong> package as follows:</p>
<pre class="r"><code>if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/factoextra")</code></pre>
<p>The remaining packages can be installed using the code below:</p>
<pre class="r"><code>pkgs <- c("cluster",  "NbClust")
install.packages(pkgs)</code></pre>
<p>Load packages:</p>
<pre class="r"><code>library(factoextra)
library(cluster)
library(NbClust)</code></pre>
</div>
<div id="data-preparation" class="section level1">
<h1><span class="header-section-number">2</span> Data preparation</h1>
<p>The data set <em>iris</em> is used. We start by excluding the species column and scaling the data using the function <strong>scale()</strong>:</p>
<pre class="r"><code># Load the data
data(iris)
head(iris)</code></pre>
<pre><code>##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa</code></pre>
<pre class="r"><code># Remove species column (5) and scale the data
iris.scaled <- scale(iris[, -5])</code></pre>
<p><span class="notice">This iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.</span></p>
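<p>As a quick sanity check, the sketch below verifies what <strong>scale()</strong> does with its default arguments: each column is centered to mean zero and divided by its standard deviation, so that no single variable dominates the distance computations. The object name <em>x</em> is only used for this illustration.</p>
<pre class="r"><code># Illustrative check of the scaling step (x is a hypothetical name)
x <- scale(iris[, -5])
round(colMeans(x), 10)   # column means are numerically zero
apply(x, 2, sd)          # column standard deviations are one</code></pre>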
</div>
<div id="example-of-partitioning-method-results" class="section level1">
<h1><span class="header-section-number">3</span> Example of partitioning method results</h1>
<p>The functions <strong>kmeans()</strong> [in <strong>stats</strong> package] and <strong>pam()</strong> [in <strong>cluster</strong> package] are described in this section. We’ll split the data into 3 clusters as follows:</p>
<pre class="r"><code># K-means clustering
set.seed(123)
km.res <- kmeans(iris.scaled, 3, nstart = 25)
# k-means group number of each observation
km.res$cluster</code></pre>
<pre><code>##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 3 3 3 2 3 3 3 3 3 3 3 3 2 3 3 3 3
##  [71] 2 3 3 3 3 2 2 2 3 3 3 3 3 3 3 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2
## [106] 2 3 2 2 2 2 2 2 3 3 2 2 2 2 3 2 3 2 3 2 2 3 2 2 2 2 2 2 3 3 2 2 2 3 2
## [141] 2 2 3 2 2 2 3 2 2 3</code></pre>
<pre class="r"><code># Visualize k-means clusters
fviz_cluster(km.res, data = iris.scaled, geom = "point",
             stand = FALSE, frame.type = "norm")</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/determining-the-number-of-clusters-k-means-pam-clusterings-visualization-1.png" title="Optimal number of clusters - R data visualization" alt="Optimal number of clusters - R data visualization" width="518.4" /></p>
<pre class="r"><code># PAM clustering
library("cluster")
pam.res <- pam(iris.scaled, 3)
pam.res$cluster</code></pre>
<pre><code>##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 3 3 3 2 3 3 3 3 3 3 3 3 2 3 3 3 3
##  [71] 3 3 3 3 3 2 2 2 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2
## [106] 2 3 2 2 2 2 2 2 3 2 2 2 2 2 3 2 3 2 3 2 2 3 3 2 2 2 2 2 3 3 2 2 2 3 2
## [141] 2 2 3 2 2 2 3 2 2 3</code></pre>
<pre class="r"><code># Visualize pam clusters
fviz_cluster(pam.res, stand = FALSE, geom = "point",
             frame.type = "norm")</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/determining-the-number-of-clusters-k-means-pam-clusterings-visualization-2.png" title="Optimal number of clusters - R data visualization" alt="Optimal number of clusters - R data visualization" width="518.4" /></p>
<p>Read more about partitioning methods: <a href="https://www.sthda.com/english/english/wiki/partitioning-cluster-analysis-quick-start-guide-unsupervised-machine-learning">Partitioning clustering</a></p>
</div>
<div id="example-of-hierarchical-clustering-results" class="section level1">
<h1><span class="header-section-number">4</span> Example of hierarchical clustering results</h1>
<p>The built-in R function <strong>hclust()</strong> is used:</p>
<pre class="r"><code># Compute pairwise distance matrix
dist.res <- dist(iris.scaled, method = "euclidean")
# Hierarchical clustering results
hc <- hclust(dist.res, method = "complete")
# Visualization of hclust
plot(hc, labels = FALSE, hang = -1)
# Add rectangle around 3 groups
rect.hclust(hc, k = 3, border = 2:4) </code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/determining-the-number-of-clusters-hierarchical-clustering-1.png" title="Optimal number of clusters - R data visualization" alt="Optimal number of clusters - R data visualization" width="518.4" /></p>
<pre class="r"><code># Cut into 3 groups
hc.cut <- cutree(hc, k = 3)
head(hc.cut, 20)</code></pre>
<pre><code>##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1</code></pre>
<p>Read more about hierarchical clustering: <a href="https://www.sthda.com/english/english/wiki/hierarchical-clustering-essentials-unsupervised-machine-learning">Hierarchical clustering</a></p>
</div>
<div id="three-popular-methods-for-determining-the-optimal-number-of-clusters" class="section level1">
<h1><span class="header-section-number">5</span> Three popular methods for determining the optimal number of clusters</h1>
<p>In this section we describe the three most popular methods including: i) Elbow method, ii) silhouette method and iii) gap statistic.</p>
<div id="elbow-method" class="section level2">
<h2><span class="header-section-number">5.1</span> Elbow method</h2>
<div id="concept" class="section level3">
<h3><span class="header-section-number">5.1.1</span> Concept</h3>
<p>Recall that the basic idea behind partitioning methods, such as <a href="https://www.sthda.com/english/english/wiki/partitioning-cluster-analysis-quick-start-guide-unsupervised-machine-learning"><strong>k-means clustering</strong></a>, is to define clusters such that the <strong>total intra-cluster variation</strong> (also known as <strong>total within-cluster variation</strong> or <strong>total within-cluster sum of squares</strong>) is minimized:</p>
<p><span class="math">\(minimize\left(\sum\limits_{k=1}^{K} W(C_k)\right)\)</span>,</p>
<p>where <span class="math">\(K\)</span> is the number of clusters, <span class="math">\(C_k\)</span> is the <span class="math">\(k^{th}\)</span> cluster and <span class="math">\(W(C_k)\)</span> is the <strong>within-cluster variation</strong>.</p>
<p><span class="success">The <strong>total within-cluster sum of squares (wss)</strong> measures the compactness of the clustering and we want it to be as small as possible.</span></p>
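<p>To make the quantity concrete, here is a minimal sketch that recomputes the total within-cluster sum of squares by hand and compares it with the value reported by <strong>kmeans()</strong>. It assumes that <span class="math">\(W(C_k)\)</span> is the sum of squared Euclidean distances between the points of cluster k and the cluster mean, which is the criterion <strong>kmeans()</strong> reports in <em>tot.withinss</em>; the names <em>x</em>, <em>km</em> and <em>wss_by_cluster</em> are illustrative.</p>
<pre class="r"><code># Sketch: total within-cluster sum of squares computed by hand
x <- scale(iris[, -5])
set.seed(123)
km <- kmeans(x, centers = 3, nstart = 25)

# W(C_k): squared distances of the points in cluster k to the cluster mean
wss_by_cluster <- sapply(sort(unique(km$cluster)), function(k) {
  xk <- x[km$cluster == k, , drop = FALSE]
  sum(sweep(xk, 2, colMeans(xk))^2)
})

total_wss <- sum(wss_by_cluster)
# Should agree with the value stored by kmeans()
all.equal(total_wss, km$tot.withinss)</code></pre>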
</div>
<div id="algorithm" class="section level3">
<h3><span class="header-section-number">5.1.2</span> Algorithm</h3>
<p>The optimal number of clusters can be defined as follows:</p>
<br/>
<div class="block">
<ol style="list-style-type: decimal">
<li>Compute clustering algorithm (e.g., k-means clustering) for different values of k. For instance, by varying k from 1 to 10 clusters</li>
<li>For each k, calculate the total within-cluster sum of squares (wss)</li>
<li>Plot the curve of <strong>wss</strong> according to the number of clusters k.</li>
<li>The location of a bend (knee) in the plot is generally considered as an indicator of the appropriate number of clusters.</li>
</ol>
</div>
<p><br/></p>
</div>
<div id="r-codes" class="section level3">
<h3><span class="header-section-number">5.1.3</span> R codes</h3>
<div id="elbow-method-for-k-means-clustering" class="section level4">
<h4><span class="header-section-number">5.1.3.1</span> Elbow method for k-means clustering</h4>
<pre class="r"><code>set.seed(123)
# Compute and plot wss for k = 1 to k = 15
k.max <- 15 # Maximal number of clusters
data <- iris.scaled
wss <- sapply(1:k.max, 
        function(k){kmeans(data, k, nstart=10 )$tot.withinss})

plot(1:k.max, wss,
       type="b", pch = 19, frame = FALSE, 
       xlab="Number of clusters K",
       ylab="Total within-clusters sum of squares")
abline(v = 3, lty =2)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/determining-the-number-of-clusters-optimal-number-of-cluster-elbow-1.png" title="Optimal number of clusters - R data visualization" alt="Optimal number of clusters - R data visualization" width="518.4" /></p>
<p><span class="success">The elbow method suggests a 3-cluster solution.</span></p>
<p>The elbow method is implemented in the <strong>factoextra</strong> package and can be easily computed using the function <strong>fviz_nbclust()</strong>, whose format is:</p>
<pre class="r"><code>fviz_nbclust(x, FUNcluster, method = c("silhouette", "wss"))</code></pre>
<br/>
<div class="block">
<ul>
<li><strong>x</strong>: numeric matrix or data frame</li>
<li><strong>FUNcluster</strong>: a partitioning function such as kmeans, pam, clara etc</li>
<li><strong>method</strong>: the method to be used for determining the optimal number of clusters.</li>
</ul>
</div>
<p><br/></p>
<p>The R code below computes the elbow method for kmeans():</p>
<pre class="r"><code>fviz_nbclust(iris.scaled, kmeans, method = "wss") +
    geom_vline(xintercept = 3, linetype = 2)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/determining-the-number-of-clusters-k-means-optimal-clusters-wss-1.png" title="Optimal number of clusters - R data visualization" alt="Optimal number of clusters - R data visualization" width="518.4" /></p>
<p><span class="success">Three clusters are suggested.</span></p>
</div>
<div id="elbow-method-for-pam-clustering" class="section level4">
<h4><span class="header-section-number">5.1.3.2</span> Elbow method for PAM clustering</h4>
<p>It’s possible to use the function <strong>fviz_nbclust()</strong> as follows:</p>
<pre class="r"><code>fviz_nbclust(iris.scaled, pam, method = "wss") +
  geom_vline(xintercept = 3, linetype = 2)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/determining-the-number-of-clusters-pam-optimal-clusters-wss-1.png" title="Optimal number of clusters - R data visualization" alt="Optimal number of clusters - R data visualization" width="518.4" /></p>
<p><span class="success">Three clusters are suggested.</span></p>
</div>
<div id="elbow-method-for-hierarchical-clustering" class="section level4">
<h4><span class="header-section-number">5.1.3.3</span> Elbow method for hierarchical clustering</h4>
<p>We’ll use the helper function <strong>hcut()</strong> [in <strong>factoextra</strong> package], which computes hierarchical clustering (HC) and cuts the dendrogram into k clusters:</p>
<pre class="r"><code>fviz_nbclust(iris.scaled, hcut, method = "wss") +
  geom_vline(xintercept = 3, linetype = 2)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/determining-the-number-of-clusters-hierarchical-clustering-optimal-clusters-wss-1.png" title="Optimal number of clusters - R data visualization" alt="Optimal number of clusters - R data visualization" width="518.4" /></p>
<p><span class="success">Three clusters are suggested.</span></p>
<p><span class="warning">Note that the <strong>elbow</strong> method is sometimes ambiguous. An alternative is the average silhouette method (Kaufman and Rousseeuw [1990]), which can also be used with any clustering approach.</span></p>
</div>
</div>
</div>
<div id="average-silhouette-method" class="section level2">
<h2><span class="header-section-number">5.2</span> Average silhouette method</h2>
<div id="concept-1" class="section level3">
<h3><span class="header-section-number">5.2.1</span> Concept</h3>
<p>The <strong>average silhouette approach</strong> will be described comprehensively in the chapter <strong>cluster validation statistics</strong>. Briefly, it measures the quality of a clustering. That is, it determines how well each object lies within its cluster. A high average silhouette width indicates a good clustering.</p>
<p>The average silhouette method computes the average silhouette of observations for different values of k. The optimal number of clusters k is the one that maximizes the average silhouette over a range of possible values for k (Kaufman and Rousseeuw [1990]).</p>
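<p>As an illustration of what a silhouette width measures, the sketch below recomputes it by hand for a single observation, using the standard definition <span class="math">\(s(i) = (b(i) - a(i)) / max(a(i), b(i))\)</span>, where <span class="math">\(a(i)\)</span> is the mean distance from observation i to the other members of its own cluster and <span class="math">\(b(i)\)</span> is the smallest mean distance from i to the members of any other cluster. The variable names and the choice of observation 1 are assumptions made for this example only.</p>
<pre class="r"><code># Sketch: silhouette width of one observation, computed by hand
library(cluster)
x <- scale(iris[, -5])
set.seed(123)
cl <- kmeans(x, centers = 3, nstart = 25)$cluster
d <- as.matrix(dist(x))

i <- 1            # observation to inspect (arbitrary choice)
own <- cl[i]      # cluster of observation i
# a(i): mean distance to the other members of its own cluster
a <- mean(d[i, cl == own & seq_along(cl) != i])
# b(i): smallest mean distance to the members of any other cluster
b <- min(sapply(setdiff(unique(cl), own),
                function(k) mean(d[i, cl == k])))
s_manual <- (b - a) / max(a, b)

# Compare with the value returned by silhouette()
s_pkg <- silhouette(cl, dist(x))[i, "sil_width"]
all.equal(s_manual, unname(s_pkg))</code></pre>
<p>Values close to 1 indicate an observation that sits well inside its cluster, while values near 0 or below indicate an observation lying between clusters or possibly assigned to the wrong one.</p>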
</div>
<div id="algorithm-1" class="section level3">
<h3><span class="header-section-number">5.2.2</span> Algorithm</h3>
<p>The algorithm is similar to the elbow method and can be computed as follows:</p>
<br/>
<div class="block">
<ol style="list-style-type: decimal">
<li>Compute clustering algorithm (e.g., k-means clustering) for different values of k. For instance, by varying k from 2 to 10 clusters (the silhouette width is not defined for a single cluster)</li>
<li>For each k, calculate the average silhouette of observations (<strong>avg.sil</strong>)</li>
<li>Plot the curve of <strong>avg.sil</strong> according to the number of clusters k.</li>
<li>The location of the maximum is considered as the appropriate number of clusters.</li>
</ol>
</div>
<p><br/></p>
</div>
<div id="r-codes-1" class="section level3">
<h3><span class="header-section-number">5.2.3</span> R codes</h3>
<p>The function <strong>silhouette()</strong> [in <strong>cluster</strong> package] is used to compute the average silhouette width.</p>
<div id="average-silhouette-method-for-k-means-clustering" class="section level4">
<h4><span class="header-section-number">5.2.3.1</span> Average silhouette method for k-means clustering</h4>
<p>The R code below determines the optimal number of clusters k for k-means clustering:</p>
<pre class="r"><code>library(cluster)
k.max <- 15
data <- iris.scaled
sil <- rep(0, k.max)

# Compute the average silhouette width for 
# k = 2 to k = 15
for(i in 2:k.max){
  km.res <- kmeans(data, centers = i, nstart = 25)
  ss <- silhouette(km.res$cluster, dist(data))
  sil[i] <- mean(ss[, 3])
}

# Plot the  average silhouette width
plot(1:k.max, sil, type = "b", pch = 19, 
     frame = FALSE, xlab = "Number of clusters k")
abline(v = which.max(sil), lty = 2)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/determining-the-number-of-clusters-k-means-optimal-number-of-clusters-average-silhouette-1.png" title="Optimal number of clusters - R data visualization" alt="Optimal number of clusters - R data visualization" width="518.4" /></p>
<p>The function <strong>fviz_nbclust()</strong> [in <strong>factoextra</strong> package] can also be used. It just requires the <strong>cluster</strong> package to be installed:</p>
<pre class="r"><code>require(cluster)
fviz_nbclust(iris.scaled, kmeans, method = "silhouette")</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/determining-the-number-of-clusters-k-means-optimal-number-of-clusters-average-silhouette-ggplot-1.png" title="Optimal number of clusters - R data visualization" alt="Optimal number of clusters - R data visualization" width="518.4" /></p>
<p><span class="success">Two clusters are suggested.</span></p>
</div>
<div id="average-silhouette-method-for-pam-clustering" class="section level4">
<h4><span class="header-section-number">5.2.3.2</span> Average silhouette method for PAM clustering</h4>
<pre class="r"><code>require(cluster)
fviz_nbclust(iris.scaled, pam, method = "silhouette")</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/determining-the-number-of-clusters-pam-optimal-number-of-clusters-average-silhouette-ggplot-1.png" title="Optimal number of clusters - R data visualization" alt="Optimal number of clusters - R data visualization" width="518.4" /></p>
<p><span class="success">Two clusters are suggested.</span></p>
</div>
<div id="average-silhouette-method-for-hierarchical-clustering" class="section level4">
<h4><span class="header-section-number">5.2.3.3</span> Average silhouette method for hierarchical clustering</h4>
<pre class="r"><code>require(cluster)
fviz_nbclust(iris.scaled, hcut, method = "silhouette",
             hc_method = "complete")</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/determining-the-number-of-clusters-hierarchical-clustering-optimal-number-of-clusters-average-silhouette-ggplot-1.png" title="Optimal number of clusters - R data visualization" alt="Optimal number of clusters - R data visualization" width="518.4" /></p>
<p><span class="success">Three clusters are suggested.</span></p>
</div>
</div>
</div>
<div id="conclusions-about-elbow-and-silhouette-methods" class="section level2">
<h2><span class="header-section-number">5.3</span> Conclusions about elbow and silhouette methods</h2>
<ul>
<li>A three-cluster solution is suggested by <strong>k-means</strong>, <strong>PAM</strong> and <strong>hierarchical</strong> clustering in combination with the <strong>elbow method</strong>.</li>
<li>The average silhouette method gives a two-cluster solution with the <strong>k-means</strong> and <strong>PAM</strong> algorithms. Combining hierarchical clustering with the silhouette method returns 3 clusters.</li>
</ul>
<p><span class="success">According to these observations, it’s possible to define k = 3 as the optimal number of clusters in the data.</span></p>
<p><span class="warning">The disadvantage of the elbow and average silhouette methods is that they measure a global clustering characteristic only. A more sophisticated method is to use the <strong>gap statistic</strong>, which provides a statistical procedure to formalize the elbow/silhouette heuristic in order to estimate the optimal number of clusters.</span></p>
</div>
<div id="gap-statistic-method" class="section level2">
<h2><span class="header-section-number">5.4</span> Gap statistic method</h2>
<div id="concept-2" class="section level3">
<h3><span class="header-section-number">5.4.1</span> Concept</h3>
<p>The <strong>gap statistic</strong> was published by <a href="http://web.stanford.edu/~hastie/Papers/gap.pdf">R. Tibshirani, G. Walther, and T. Hastie (Stanford University, 2001)</a>. The approach can be applied to any clustering method (<a href="https://www.sthda.com/english/english/wiki/partitioning-cluster-analysis-quick-start-guide-unsupervised-machine-learning">K-means clustering</a>, <a href="https://www.sthda.com/english/english/wiki/hierarchical-clustering-essentials-unsupervised-machine-learning">hierarchical clustering</a>, …).</p>
<p>The gap statistic compares the total intra-cluster variation for different values of k with its expected value under a null reference distribution of the data, i.e. a distribution with no obvious clustering.</p>
<p><span class="notice">Recall that the total intra-cluster variation for a given number of clusters k is the total within-cluster sum of squares (<span class="math">\(W_k\)</span>).</span></p>
<p>The reference dataset is generated using Monte Carlo simulations of the sampling process. That is, for each variable (<span class="math">\(x_i\)</span>) in the data set we compute its range [<span class="math">\(min(x_i), max(x_i)\)</span>] and generate values for the n points uniformly from the interval min to max.</p>
<p><span class="notice">Note that the function <strong>runif(n, min, max)</strong> can be used to generate random values from a uniform distribution.</span></p>
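<p>A minimal sketch of this sampling step is shown below: one reference data set is drawn by sampling each variable uniformly over its observed range. The names <em>x</em> and <em>ref</em> are illustrative, and the sketch follows the simple per-variable scheme described in the text rather than any particular package implementation.</p>
<pre class="r"><code># Sketch: one reference data set with no cluster structure
x <- scale(iris[, -5])
set.seed(123)
# Each column is drawn uniformly between its observed min and max
ref <- apply(x, 2, function(col) runif(length(col), min(col), max(col)))
dim(ref)         # same dimensions as the observed data
apply(ref, 2, range)</code></pre>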
<p>For the observed data and the reference data, the total intra-cluster variation is computed using different values of k. The <strong>gap statistic</strong> for a given k is defined as follows:</p>
<p><span class="math">\[
Gap_n(k) = E_n^*\{log(W_k)\} - log(W_k)
\]</span></p>
<p>Where <span class="math">\(E_n^*\)</span> denotes the expectation under a sample of size <span class="math">\(n\)</span> from the reference distribution. <span class="math">\(E_n^*\)</span> is estimated via bootstrapping, by generating B copies of the reference dataset and computing the average <span class="math">\(log(W_k^*)\)</span>.</p>
<p><span class="notice">Note that the logarithm of the <span class="math">\(W_k\)</span> values is used, as they can be quite large.</span></p>
<p>The gap statistic measures the deviation of the observed <span class="math">\(W_k\)</span> value from its expected value under the null hypothesis.</p>
<p><span class="success">The estimate of the optimal number of clusters <span class="math">\(\hat{k}\)</span> is the value that maximizes <span class="math">\(Gap_n(k)\)</span> (i.e., that yields the largest gap statistic). This means that the clustering structure is far away from the uniform distribution of points.</span></p>
<p><span class="notice">Note that using <strong>B = 500</strong> gives quite precise results, so that the gap plot is basically unchanged after another run.</span></p>
<p>The standard deviation (<span class="math">\(sd_k\)</span>) of <span class="math">\(log(W_k^*)\)</span> is also computed in order to define the standard error (<span class="math">\(s_k\)</span>) of the simulation as follow:</p>
<p><span class="math">\[
s_k = sd_k \times \sqrt{1 + 1/B} 
\]</span></p>
<br/>
<div class="block">
<p>Finally, a more robust approach is to choose the optimal number of clusters K as the smallest k such that:</p>
<p><span class="math">\[Gap(k) \geq Gap(k+1) - s_{k+1}\]</span></p>
That is, we choose the smallest value of k such that the gap statistic is within one standard error (<span class="math">\(s_{k+1}\)</span>) of the gap at k+1.
</div>
<p><br/></p>
</div>
<div id="algorithm-2" class="section level3">
<h3><span class="header-section-number">5.4.2</span> Algorithm</h3>
<p>The algorithm involves the following steps (<a href="http://web.stanford.edu/~hastie/Papers/gap.pdf">Read the original paper of the gap statistic</a>):</p>
<br/>
<div class="block">
<ol style="list-style-type: decimal">
<li>Cluster the observed data, varying the number of clusters from k = 1, …, <span class="math">\(k_{max}\)</span>, and compute the corresponding <span class="math">\(W_k\)</span>.</li>
<li>Generate B reference data sets and cluster each of them with varying number of clusters k = 1, …, <span class="math">\(k_{max}\)</span>. Compute the estimated gap statistic <span class="math">\(Gap(k) = \frac{1}{B} \sum\limits_{b=1}^B log(W_{kb}^*) - log(W_k)\)</span>.</li>
<li>Let <span class="math">\(\bar{w} = (1/B) \sum_b log(W^*_{kb})\)</span>, compute the standard deviation <span class="math">\(sd_k = \sqrt{(1/B) \sum_b (log(W^*_{kb}) - \bar{w})^2}\)</span> and define <span class="math">\(s_k = sd_k \times \sqrt{1 + 1/B}\)</span>.</li>
<li>Choose the number of clusters as the smallest k such that <span class="math">\(Gap(k) \geq Gap(k+1) - s_{k+1}\)</span>.</li>
</ol>
</div>
<p><br/></p>
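<p>The four steps above can be sketched in a few lines of base R. This is an illustration only, not the implementation used by any package: it assumes k-means as the clustering method, takes <span class="math">\(W_k\)</span> to be the total within-cluster sum of squares, draws the reference data uniformly over the range of each variable, and uses small values of <em>k.max</em> and <em>B</em> to keep the run short. The helper <em>log_wk()</em> and all variable names are hypothetical, and the numbers are not expected to match those of a package implementation.</p>
<pre class="r"><code># Sketch of the gap statistic algorithm (illustrative, not a package implementation)
x <- scale(iris[, -5])
k.max <- 5    # kept small so that the sketch runs quickly
B <- 20       # number of reference data sets

# log(W_k) for k = 1..k.max, using the k-means total within-cluster sum of squares
log_wk <- function(data, k.max) {
  sapply(1:k.max, function(k)
    log(kmeans(data, centers = k, nstart = 10)$tot.withinss))
}

set.seed(123)
# Step 1: cluster the observed data
logW <- log_wk(x, k.max)
# Step 2: cluster B uniform reference data sets (result: k.max x B matrix)
logW_ref <- replicate(B, {
  ref <- apply(x, 2, function(col) runif(length(col), min(col), max(col)))
  log_wk(ref, k.max)
})
gap <- rowMeans(logW_ref) - logW
# Step 3: standard deviation (divisor B) and simulation standard error
sd_k <- apply(logW_ref, 1, function(v) sqrt(mean((v - mean(v))^2)))
s_k <- sd_k * sqrt(1 + 1 / B)
# Step 4: smallest k such that Gap(k) >= Gap(k+1) - s_{k+1}
ok <- gap[-k.max] >= gap[-1] - s_k[-1]
k_hat <- if (any(ok)) which(ok)[1] else k.max

data.frame(k = 1:k.max, logW = logW, gap = gap, s_k = s_k)
k_hat</code></pre>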
</div>
<div id="r-codes-2" class="section level3">
<h3><span class="header-section-number">5.4.3</span> R codes</h3>
<div id="r-function-for-computing-the-gap-statistic" class="section level4">
<h4><span class="header-section-number">5.4.3.1</span> R function for computing the gap statistic</h4>
<p>The R function <strong>clusGap()</strong> [in <strong>cluster</strong> package] can be used to estimate the number of clusters in the data by applying the <strong>gap statistic</strong>.</p>
<p>A simplified format is:</p>
<pre class="r"><code>clusGap(x, FUNcluster, K.max, B = 100, verbose = TRUE, ...)</code></pre>
<br/>
<div class="block">
<ul>
<li><strong>x</strong>: numeric matrix or data frame</li>
<li><strong>FUNcluster</strong>: a function (e.g.: kmeans, pam, …) which accepts i) a data matrix like <em>x</em> as first argument; ii) the number of clusters desired (k ≥ 2) as a second argument; and returns a list containing a component named <strong>cluster</strong>, which is a vector of length <span class="math">\(n = nrow(x)\)</span> of integers in 1:k determining the clustering or grouping of the n observations.</li>
<li><strong>K.max</strong>: the maximum number of clusters to consider, must be at least two.</li>
<li><strong>B</strong>: the number of Monte Carlo (“bootstrap”) samples.</li>
<li><strong>verbose</strong>: if TRUE, the computing progression is shown.</li>
<li><strong>…</strong>: Further arguments for FUNcluster(), see kmeans example below.</li>
</ul>
</div>
<p><br/></p>
<p><span class="success"> <strong>clusGap()</strong> function returns an object of class “clusGap” whose main component is <strong>Tab</strong>, with <strong>K.max</strong> rows and 4 columns named “logW”, “E.logW”, “gap” and “SE.sim”. Recall that <span class="math">\(gap = E.logW - logW\)</span> and SE.sim is the standard error of gap.</span></p>
</div>
<div id="gap-statistic-for-k-means-clustering" class="section level4">
<h4><span class="header-section-number">5.4.3.2</span> Gap statistic for k-means clustering</h4>
<p>The R code below shows an example using the <strong>clusGap()</strong> function.</p>
<p><span class="notice">We’ll use B = 50 to keep the function speedy. Note that, it’s recommended to use B = 500 for your analysis.</span></p>
<p>The output of <strong>clusGap()</strong> function can be visualized using the function <strong>fviz_gap_stat()</strong> [in <strong>factoextra</strong>].</p>
<pre class="r"><code># Compute gap statistic
library(cluster)
set.seed(123)
gap_stat <- clusGap(iris.scaled, FUN = kmeans, nstart = 25,
                    K.max = 10, B = 50)

# Print the result
print(gap_stat, method = "firstmax")</code></pre>
<pre><code>## Clustering Gap statistic ["clusGap"].
## B=50 simulated reference sets, k = 1..10
##  --> Number of clusters (method 'firstmax'): 3
##           logW   E.logW       gap     SE.sim
##  [1,] 4.534565 4.754595 0.2200304 0.02504585
##  [2,] 4.021316 4.489687 0.4683711 0.02742112
##  [3,] 3.806577 4.295715 0.4891381 0.02384746
##  [4,] 3.699263 4.143675 0.4444115 0.02093871
##  [5,] 3.589284 4.052262 0.4629781 0.02036366
##  [6,] 3.519726 3.972254 0.4525278 0.02049566
##  [7,] 3.448288 3.905945 0.4576568 0.02106987
##  [8,] 3.398210 3.850807 0.4525967 0.01969193
##  [9,] 3.334279 3.802315 0.4680368 0.01905974
## [10,] 3.250246 3.759661 0.5094149 0.01928183</code></pre>
<pre class="r"><code># Base plot of gap statistic
plot(gap_stat, frame = FALSE, xlab = "Number of clusters k")
abline(v = 3, lty = 2)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/determining-the-number-of-clusters-gap-statistic-k-means-1.png" title="Optimal number of clusters - R data visualization" alt="Optimal number of clusters - R data visualization" width="518.4" /></p>
<pre class="r"><code># Use factoextra
fviz_gap_stat(gap_stat)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/determining-the-number-of-clusters-gap-statistic-k-means-2.png" title="Optimal number of clusters - R data visualization" alt="Optimal number of clusters - R data visualization" width="518.4" /></p>
<p><span class="success">In our example, the algorithm suggests k = 3.</span></p>
<p>The optimal number of clusters, k, is computed using the “firstmax” method (see <strong>?cluster::maxSE</strong>). The criterion proposed by Tibshirani et al. (2001) can be used as follows:</p>
<pre class="r"><code># Print
print(gap_stat, method = "Tibs2001SEmax")
# Plot
fviz_gap_stat(gap_stat, 
              maxSE = list(method = "Tibs2001SEmax"))
# Relax the gap test to be within two standard errors
fviz_gap_stat(gap_stat, 
          maxSE = list(method = "Tibs2001SEmax", SE.factor = 2))</code></pre>
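<p>To see how the selection rules differ, the sketch below applies <strong>maxSE()</strong> [in <strong>cluster</strong> package] directly to the “gap” and “SE.sim” columns of the table printed above. The values are copied by hand from that output; the vector names <em>gap</em> and <em>se</em> are illustrative.</p>
<pre class="r"><code># Sketch: compare selection rules on the gap table printed above
library(cluster)
gap <- c(0.2200304, 0.4683711, 0.4891381, 0.4444115, 0.4629781,
         0.4525278, 0.4576568, 0.4525967, 0.4680368, 0.5094149)
se  <- c(0.02504585, 0.02742112, 0.02384746, 0.02093871, 0.02036366,
         0.02049566, 0.02106987, 0.01969193, 0.01905974, 0.01928183)

# First local maximum of the gap curve
maxSE(gap, se, method = "firstmax")
# Smallest k such that Gap(k) >= Gap(k+1) - s_{k+1}
maxSE(gap, se, method = "Tibs2001SEmax")
# Global maximum of the gap curve
maxSE(gap, se, method = "globalmax")</code></pre>
<p>With these values, “firstmax” returns 3, whereas “Tibs2001SEmax” returns 2 because Gap(2) = 0.468 already exceeds Gap(3) - s<sub>3</sub> = 0.489 - 0.024 = 0.465, and “globalmax” returns 10. The suggested number of clusters therefore depends on the rule, so it is worth stating which one was used.</p>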
</div>
<div id="gap-statistic-for-pam-clustering" class="section level4">
<h4><span class="header-section-number">5.4.3.3</span> Gap statistic for PAM clustering</h4>
<p><span class="notice">We don’t need the argument “nstart” which is specific to kmeans() function.</span></p>
<pre class="r"><code># Compute gap statistic
set.seed(123)
gap_stat <- clusGap(iris.scaled, FUN = pam, K.max = 10, B = 50)
# Plot gap statistic
fviz_gap_stat(gap_stat)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/determining-the-number-of-clusters-gap-statistic-pam-1.png" title="Optimal number of clusters - R data visualization" alt="Optimal number of clusters - R data visualization" width="518.4" /></p>
<p><span class="success">A three-cluster solution is suggested.</span></p>
</div>
<div id="gap-statistic-for-hierarchical-clustering" class="section level4">
<h4><span class="header-section-number">5.4.3.4</span> Gap statistic for hierarchical clustering</h4>
<pre class="r"><code># Compute gap statistic
set.seed(123)
gap_stat <- clusGap(iris.scaled, FUN = hcut, K.max = 10, B = 50)
# Plot gap statistic
fviz_gap_stat(gap_stat)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/determining-the-number-of-clusters-gap-statistic-hierarchical-clustering-1.png" title="Optimal number of clusters - R data visualization" alt="Optimal number of clusters - R data visualization" width="518.4" /></p>
<p><span class="success">A three-cluster solution is suggested.</span></p>
</div>
</div>
</div>
</div>
<div id="nbclust-a-package-providing-30-indices-for-determining-the-best-number-of-clusters" class="section level1">
<h1><span class="header-section-number">6</span> NbClust: A Package providing 30 indices for determining the best number of clusters</h1>
<div id="overview-of-nbclust-package" class="section level2">
<h2><span class="header-section-number">6.1</span> Overview of NbClust package</h2>
<p>As mentioned in the introduction of this article, many indices have been proposed in the literature for determining the optimal number of clusters in a partitioning of a data set during the clustering process.</p>
<p>The <strong>NbClust</strong> package, published by <a href="http://www.jstatsoft.org/v61/i06/paper">Charrad et al., 2014</a>, provides 30 indices for determining the relevant number of clusters. It proposes the best clustering scheme from the results obtained by varying all combinations of number of clusters, distance measures, and clustering methods.</p>
<p>An important advantage of NbClust is that the user can compute multiple indices simultaneously and determine the number of clusters in a single function call.</p>
<p>The indices provided in the <strong>NbClust</strong> package include the gap statistic, the silhouette method and 28 other indices, described comprehensively in the original paper by <a href="http://www.jstatsoft.org/v61/i06/paper">Charrad et al., 2014</a>.</p>
</div>
<div id="nbclust-r-function" class="section level2">
<h2><span class="header-section-number">6.2</span> NbClust R function</h2>
<p>The simplified format of the function <strong>NbClust()</strong> is:</p>
<pre class="r"><code>NbClust(data = NULL, diss = NULL, distance = "euclidean",
        min.nc = 2, max.nc = 15, method = NULL, index = "all")</code></pre>
<br/>
<div class="block">
<ul>
<li><strong>data</strong>: matrix or data set</li>
<li><strong>diss</strong>: dissimilarity matrix to be used. By default, diss = NULL. If a dissimilarity matrix is supplied, distance should be “NULL”</li>
<li><strong>distance</strong>: the distance measure to be used to compute the dissimilarity matrix. Possible values include “euclidean”, “manhattan” or “NULL”.</li>
<li><strong>min.nc, max.nc</strong>: minimal and maximal number of clusters, respectively</li>
<li><strong>method</strong>: the cluster analysis method to be used, including “ward.D”, “ward.D2”, “single”, “complete”, “average” and more</li>
<li><strong>index</strong>: the index to be calculated including “silhouette”, “gap” and more.</li>
</ul>
</div>
<p><br/></p>
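<p>The <strong>diss</strong> argument is useful when the dissimilarity of interest is not one that NbClust() computes itself. The sketch below passes a correlation-based dissimilarity and sets <strong>distance = NULL</strong>, as described above. The object names, the choice of dissimilarity, the “average” linkage and the “silhouette” index are assumptions made for illustration only.</p>
<pre class="r"><code>library(NbClust)

# Assumption: iris.scaled is the standardized iris data (species column removed)
iris.scaled <- scale(iris[, -5])

# Correlation-based dissimilarity between observations (rows)
d.cor <- as.dist(1 - cor(t(iris.scaled)))

# Supply the dissimilarity matrix and switch off the internal distance
res.diss <- NbClust(iris.scaled, diss = d.cor, distance = NULL,
                    min.nc = 2, max.nc = 6,
                    method = "average", index = "silhouette")
res.diss$Best.nc</code></pre>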
<p>The value of <strong>NbClust()</strong> function includes the following elements:</p>
<ul>
<li><strong>All.index</strong>: Values of indices for each partition of the dataset obtained with a number of clusters between min.nc and max.nc</li>
<li><strong>All.CriticalValues</strong>: Critical values of some indices for each partition obtained with a number of clusters between min.nc and max.nc</li>
<li><strong>Best.nc</strong>: Best number of clusters proposed by each index and the corresponding index value</li>
<li><strong>Best.partition</strong>: Partition that corresponds to the best number of clusters</li>
</ul>
</div>
<div id="examples-of-usage" class="section level2">
<h2><span class="header-section-number">6.3</span> Examples of usage</h2>
<p>Note that the user can request indices one by one by setting the argument <em>index</em> to the name of the <strong>index</strong> of interest, for example <strong>index = “gap”</strong>.</p>
<p>In this case, the <strong>NbClust()</strong> function returns:</p>
<ul>
<li>the gap statistic values of the partitions obtained with number of clusters varying from <strong>min.nc</strong> to <strong>max.nc</strong> (<strong>$All.index</strong>)</li>
<li>the optimal number of clusters (<strong>$Best.nc</strong>)</li>
<li>and the partition corresponding to the best number of clusters (<strong>$Best.partition</strong>)</li>
</ul>
<div id="compute-only-an-index-of-interest" class="section level3">
<h3><span class="header-section-number">6.3.1</span> Compute only an index of interest</h3>
<p>The following example determines the number of clusters using the <strong>gap</strong> statistic:</p>
<pre class="r"><code>library("NbClust")
set.seed(123)
res.nb <- NbClust(iris.scaled, distance = "euclidean",
                  min.nc = 2, max.nc = 10, 
                  method = "complete", index ="gap") 
res.nb # print the results</code></pre>
<pre><code>## $All.index
##       2       3       4       5       6       7       8       9      10 
## -0.2899 -0.2303 -0.6915 -0.8606 -1.0506 -1.3223 -1.3303 -1.4759 -1.5551 
## 
## $All.CriticalValues
##       2       3       4       5       6       7       8       9      10 
## -0.0539  0.4694  0.1787  0.2009  0.2848  0.0230  0.1631  0.0988  0.1708 
## 
## $Best.nc
## Number_clusters     Value_Index 
##          3.0000         -0.2303 
## 
## $Best.partition
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 3 3 3 2 3 2 3 2 3 2 2 3 2 3 3 3 3 2 2 2
##  [71] 3 3 3 3 3 3 3 3 3 2 2 2 2 3 3 3 3 2 3 2 2 3 2 2 2 3 3 3 2 2 3 3 3 3 3
## [106] 3 2 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [141] 3 3 3 3 3 3 3 3 3 3</code></pre>
<p>The elements returned by the function <strong>NbClust()</strong> are accessible using the R code below:</p>
<pre class="r"><code># All gap statistic values
res.nb$All.index

# Best number of clusters
res.nb$Best.nc

# Best partition
res.nb$Best.partition</code></pre>
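<p>The printed output also helps explain why three clusters are selected even though all gap values are negative. As far as we understand from the NbClust documentation, for the gap index the selected number of clusters is the smallest one whose critical value is non-negative. The sketch below re-applies that rule to the critical values printed above. The vector is typed in by hand for illustration, and the names <em>crit</em> and <em>best_k</em> are hypothetical.</p>
<pre class="r"><code># Critical values copied from $All.CriticalValues above (k = 2, ..., 10)
crit <- c(-0.0539, 0.4694, 0.1787, 0.2009, 0.2848,
          0.0230, 0.1631, 0.0988, 0.1708)
names(crit) <- 2:10

# Smallest number of clusters with a non-negative critical value
best_k <- as.integer(names(crit)[which(crit >= 0)[1]])
best_k</code></pre>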
</div>
<div id="compute-all-the-30-indices" class="section level3">
<h3><span class="header-section-number">6.3.2</span> Compute all the 30 indices</h3>
<p><strong>NbClust()</strong> can compute <strong>all</strong> the 30 indices in a single function call, determine the number of clusters and suggest the best clustering scheme to the user. The description of the indices is available in the NbClust documentation (see <strong>?NbClust</strong>).</p>
<p>To compute multiple indices simultaneously, the possible values for the argument <strong>index</strong> are i) <strong>“alllong”</strong> or ii) <strong>“all”</strong>. The option <strong>“alllong”</strong> requires more time, because some indices, such as <em>Gamma, Tau, Gap and Gplus</em>, are computationally very expensive. The user can avoid computing these four indices by setting the argument index to <strong>“all”</strong>. In this case, only 26 indices are calculated.</p>
<p>With the <strong>“alllong”</strong> option, the output of the <strong>NbClust</strong> function contains:</p>
<br/>
<div class="block">
<ul>
<li>all validation indices</li>
<li>critical values for Duda, Gap, PseudoT2 and Beale indices</li>
<li>the number of clusters corresponding to the optimal score for each index</li>
<li>the best number of clusters proposed by NbClust according to the majority rule</li>
<li>the best partition</li>
</ul>
</div>
<p><br/></p>
<p>The R code below computes <strong>NbClust()</strong> with <strong>index = “all”</strong>:</p>
<pre class="r"><code>nb <- NbClust(iris.scaled, distance = "euclidean", min.nc = 2,
        max.nc = 10, method = "complete", index ="all")</code></pre>
<pre class="r"><code># Print the result
nb</code></pre>
<p>It’s possible to visualize the result using the function <strong>fviz_nbclust()</strong> [in <strong>factoextra</strong>], as follows:</p>
<pre class="r"><code>fviz_nbclust(nb) + theme_minimal()</code></pre>
<pre><code>## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 1 proposed  1 as the best number of clusters
## * 2 proposed  2 as the best number of clusters
## * 18 proposed  3 as the best number of clusters
## * 3 proposed  10 as the best number of clusters
## 
## Conclusion
## =========================
## * Accoridng to the majority rule, the best number of clusters is  3 .</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/clustering/determining-the-number-of-clusters-nbclust-ggplot2-1.png" title="Optimal number of clusters - R data visualization" alt="Optimal number of clusters - R data visualization" width="518.4" /></p>
<br/>
<div class="success">
<ul>
<li>….</li>
<li>2 proposed 2 as the best number of clusters</li>
<li>18 indices proposed 3 as the best number of clusters.</li>
<li>3 proposed 10 as the best number of clusters</li>
</ul>
<strong>According to the majority rule, the best number of clusters is 3</strong>
</div>
<p><br/></p>
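<p>The majority rule itself is a simple vote count. The sketch below reproduces it from the summary above, with the votes typed in by hand. The names <em>votes</em>, <em>tally</em> and <em>best_k</em> are illustrative. With a real result, the votes should be available as the “Number_clusters” row of <strong>nb$Best.nc</strong>.</p>
<pre class="r"><code># Number of clusters proposed by each of the 26 indices,
# copied from the summary above (2 x 0, 1 x 1, 2 x 2, 18 x 3, 3 x 10).
# With a real NbClust result (assumption): votes <- nb$Best.nc["Number_clusters", ]
votes <- rep(c(0, 1, 2, 3, 10), times = c(2, 1, 2, 18, 3))

# Count the votes for each candidate number of clusters
tally <- table(votes)
tally

# The most frequently proposed number of clusters
best_k <- as.integer(names(tally)[which.max(tally)])
best_k</code></pre>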
</div>
</div>
</div>
<div id="infos" class="section level1">
<h1><span class="header-section-number">7</span> Infos</h1>
<p><span class="warning">This analysis has been performed using <strong>R software</strong> (ver. 3.2.1)</span></p>
<ul>
<li>Charrad M., Ghazzali N., Boiteau V., Niknafs A. (2014). NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set. Journal of Statistical Software, 61(6), 1-36.</li>
<li>Kaufman, L. and Rousseeuw, P.J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York.</li>
<li>Tibshirani, R., Walther, G. and Hastie, T. (2001). Estimating the number of data clusters via the Gap statistic. Journal of the Royal Statistical Society B, 63, 411–423. <a href="http://web.stanford.edu/~hastie/Papers/gap.pdf">PDF</a></li>
</ul>
</div>

<script>jQuery(document).ready(function () {
    jQuery('h1').addClass('wiki_paragraph1');
    jQuery('h2').addClass('wiki_paragraph2');
    jQuery('h3').addClass('wiki_paragraph3');
    jQuery('h4').addClass('wiki_paragraph4');
    });//add phpboost class to header</script>
<style>.content{padding:0px;}</style>
</div><!--end rdoc-->

<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
  (function () {
    var script = document.createElement("script");
    script.type = "text/javascript";
    script.src  = "https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
    document.getElementsByTagName("head")[0].appendChild(script);
  })();
</script>
  
<!--====================== stop here when you copy to sthda================-->


<!-- END HTML -->]]></description>
			<pubDate>Sun, 22 Nov 2015 04:37:09 +0100</pubDate>
			
		</item>
		
	</channel>
</rss>
