<?xml version="1.0" encoding="UTF-8" ?>
<!-- RSS generated by PHPBoost on Sat, 09 May 2026 15:33:15 +0200 -->

<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
	<channel>
		<title><![CDATA[Last articles - STHDA : Cluster Validation Essentials]]></title>
		<atom:link href="https://www.sthda.com/english/syndication/rss/articles/29" rel="self" type="application/rss+xml"/>
		<link>https://www.sthda.com</link>
		<description><![CDATA[Last articles - STHDA : Cluster Validation Essentials]]></description>
		<copyright>(C) 2005-2026 PHPBoost</copyright>
		<language>en</language>
		<generator>PHPBoost</generator>
		
		
		<item>
			<title><![CDATA[Computing P-value for Hierarchical Clustering]]></title>
			<link>https://www.sthda.com/english/articles/29-cluster-validation-essentials/99-computing-p-value-for-hierarchical-clustering/</link>
			<guid>https://www.sthda.com/english/articles/29-cluster-validation-essentials/99-computing-p-value-for-hierarchical-clustering/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">
<p>Clusters can appear in a data set purely by chance, due to noise or sampling error. This article describes the R package <strong>pvclust</strong> <span class="citation">(Suzuki and Shimodaira 2015)</span>, which uses bootstrap resampling to <strong>compute a p-value</strong> for each cluster in a <strong>hierarchical clustering</strong>.</p>
<br/>
<p>Contents:</p>
<div id="TOC">
<ul>
<li><a href="#algorithm">Algorithm</a></li>
<li><a href="#required-packages">Required packages</a></li>
<li><a href="#data-preparation">Data preparation</a></li>
<li><a href="#compute-p-value-for-hierarchical-clustering">Compute p-value for hierarchical clustering</a><ul>
<li><a href="#description-of-pvclust-function">Description of pvclust() function</a></li>
<li><a href="#usage-of-pvclust-function">Usage of pvclust() function</a></li>
</ul></li>
<li><a href="#references">References</a></li>
</ul>
</div>
<br/>
<p>Related Book:</p>
<div class = "small-block content-privileged-friends cluster-book">
    <center>
        <a href = "https://www.sthda.com/english/web/5-bookadvisor/17-practical-guide-to-cluster-analysis-in-r/">
          <img src = "https://www.sthda.com/english/sthda-upload/images/cluster-analysis/clustering-book-cover.png" /><br/>
      Practical Guide to Cluster Analysis in R
      </a>
      </center>
</div>
<div class="spacer"></div>
<div id="algorithm" class="section level2">
<h2>Algorithm</h2>
<ol style="list-style-type: decimal">
<li>Generate thousands of bootstrap samples by randomly sampling elements of the data</li>
<li>Compute hierarchical clustering on each bootstrap copy</li>
<li>For each cluster:
<ul>
<li>Compute the <em>bootstrap probability</em> (<em>BP</em>) value, which corresponds to the frequency with which the cluster is identified across the bootstrap copies.</li>
<li>Compute the <em>approximately unbiased</em> (AU) probability values (p-values) by multiscale bootstrap resampling</li>
</ul></li>
</ol>
<div class="success">
<p>
Clusters with AU >= 95% are considered to be strongly supported by data.
</p>
</div>
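<p>To make the bootstrap probability idea concrete, here is a minimal illustration (a sketch only, not the <strong>pvclust</strong> implementation) that resamples the rows of a small built-in data set, re-runs <em>hclust</em>() on the columns each time, and counts how often a hypothetical candidate cluster reappears:</p>
<pre class="r"><code># Illustration of the bootstrap probability (BP) idea -- not pvclust itself
set.seed(123)
x <- scale(USArrests)             # demo data; the 4 columns will be clustered
target <- c("Murder", "Assault")  # hypothetical candidate cluster
B <- 200                          # number of bootstrap copies
hits <- 0
for (b in seq_len(B)) {
  xb <- x[sample(nrow(x), replace = TRUE), ]    # bootstrap the rows
  hc <- hclust(dist(t(xb)), method = "average") # cluster the columns
  # column groups obtained by cutting the tree into 2, ..., (p - 1) clusters
  groups <- unlist(lapply(2:(ncol(xb) - 1), function(k)
    split(colnames(xb), cutree(hc, k))), recursive = FALSE)
  if (any(vapply(groups, function(g) setequal(g, target), logical(1))))
    hits <- hits + 1
}
hits / B  # crude BP estimate for the candidate cluster</code></pre>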
</div>
<div id="required-packages" class="section level2">
<h2>Required packages</h2>
<ol style="list-style-type: decimal">
<li>Install <strong>pvclust</strong>:</li>
</ol>
<pre class="r"><code>install.packages("pvclust")</code></pre>
<ol start="2" style="list-style-type: decimal">
<li>Load <strong>pvclust</strong>:</li>
</ol>
<pre class="r"><code>library(pvclust)</code></pre>
</div>
<div id="data-preparation" class="section level2">
<h2>Data preparation</h2>
<p>We’ll use the <em>lung</em> data set [in the <em>pvclust</em> package]. It contains the expression profiles of 916 genes across 73 lung tissue samples, including 67 tumors. Columns are samples and rows are genes.</p>
<pre class="r"><code>library(pvclust)
# Load the data
data("lung")
head(lung[, 1:4])</code></pre>
<pre><code>##               fetal_lung 232-97_SCC 232-97_node 68-96_Adeno
## IMAGE:196992       -0.40       4.28        3.68       -1.35
## IMAGE:587847       -2.22       5.21        4.75       -0.91
## IMAGE:1049185      -1.35      -0.84       -2.88        3.35
## IMAGE:135221        0.68       0.56       -0.45       -0.20
## IMAGE:298560          NA       4.14        3.58       -0.40
## IMAGE:119882       -3.23      -2.84       -2.72       -0.83</code></pre>
<pre class="r"><code># Dimension of the data
dim(lung)</code></pre>
<pre><code>## [1] 916  73</code></pre>
<p>We’ll use only a subset of the data set for the clustering analysis. The R function <em>sample</em>() can be used to extract a random subset of 30 samples:</p>
<pre class="r"><code>set.seed(123)
ss <- sample(1:73, 30) # extract 30 of the 73 samples
df <- lung[, ss]</code></pre>
</div>
<div id="compute-p-value-for-hierarchical-clustering" class="section level2">
<h2>Compute p-value for hierarchical clustering</h2>
<div id="description-of-pvclust-function" class="section level3">
<h3>Description of pvclust() function</h3>
<p>The function <em>pvclust</em>() can be used as follows:</p>
<pre class="r"><code>pvclust(data, method.hclust = "average",
        method.dist = "correlation", nboot = 1000)</code></pre>
<p>Note that the computation time can be sharply reduced by using the parallel version, <em>parPvclust</em>(). (Read ?parPvclust for more information.)</p>
<pre class="r"><code>parPvclust(cl=NULL, data, method.hclust = "average",
           method.dist = "correlation", nboot = 1000,
           iseed = NULL)</code></pre>
<div class="block">
<ul>
<li>
<strong>data</strong>: numeric data matrix or data frame.
</li>
<li>
<strong>method.hclust</strong>: the agglomerative method used in hierarchical clustering. Possible values are one of “average”, “ward”, “single”, “complete”, “mcquitty”, “median” or “centroid”. The default is “average”. See method argument in <strong>?hclust</strong>.
</li>
<li>
<strong>method.dist</strong>: the distance measure to be used. Possible values are one of “correlation”, “uncentered”, “abscor” or those which are allowed for the <strong>method</strong> argument of the <strong>dist()</strong> function, such as “euclidean” and “manhattan”.
</li>
<li>
<strong>nboot</strong>: the number of bootstrap replications. The default is 1000.
</li>
<li>
<strong>iseed</strong>: an integer for the random seed. Use the iseed argument to obtain reproducible results.
</li>
</ul>
</div>
<p>The function <em>pvclust</em>() returns an object of class <em>pvclust</em> containing many elements, including <em>hclust</em>, which holds the hierarchical clustering result computed on the original data by the function <em>hclust</em>().</p>
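<p>As a quick sketch of this structure (assuming a fitted object named <em>res.pv</em>, created in the next section), the stored tree and per-cluster values can be inspected directly:</p>
<pre class="r"><code># res.pv is a fitted pvclust object (see the next section)
res.pv$hclust         # the hclust result computed on the original data
plot(res.pv$hclust)   # plain dendrogram, without p-values
res.pv$edges          # per-cluster AU/BP values and their standard errors</code></pre>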
</div>
<div id="usage-of-pvclust-function" class="section level3">
<h3>Usage of pvclust() function</h3>
<p><em>pvclust</em>() performs clustering on the columns of the data set, which correspond to samples in our case. If you want to perform the clustering on the variables (here, genes) you have to transpose the data set using the function <em>t</em>().</p>
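<p>For example, a hedged sketch of clustering the genes rather than the samples (not run here, since bootstrapping all 916 genes would be slow):</p>
<pre class="r"><code># Cluster the rows (genes) instead of the columns (samples):
# transpose so that genes become columns before calling pvclust()
res.genes <- pvclust(t(df), method.dist = "cor",
                     method.hclust = "average", nboot = 10)</code></pre>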
<p>The R code below computes <em>pvclust</em>() using 10 as the number of bootstrap replications (for speed):</p>
<pre class="r"><code>library(pvclust)
set.seed(123)
res.pv <- pvclust(df, method.dist="cor", 
                  method.hclust="average", nboot = 10)</code></pre>
<pre class="r"><code># Default plot
plot(res.pv, hang = -1, cex = 0.5)
pvrect(res.pv)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/018-p-value-for-hierarchical-clustering-pvclust-p-value-hierarchical-clustering-1.png" width="518.4" /></p>
<div class="success">
<p>
Values on the dendrogram are <em>AU p-values</em> (red, left), <em>BP values</em> (green, right), and <em>cluster labels</em> (grey, bottom). Clusters with AU ≥ 95% are indicated by the rectangles and are considered to be strongly supported by the data.
</p>
</div>
<p>To extract the objects from the significant clusters, use the function <em>pvpick</em>():</p>
<pre class="r"><code>clusters <- pvpick(res.pv)
clusters</code></pre>
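<p>The returned object is a list; a brief sketch of pulling out the members of the first significant cluster (assuming at least one cluster passes the threshold):</p>
<pre class="r"><code># pvpick() returns a list with components "clusters" and "edges"
length(clusters$clusters)   # number of significant clusters
clusters$clusters[[1]]      # sample names in the first significant cluster</code></pre>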
<p>Parallel computation can be applied as follows:</p>
<pre class="r"><code># Create a parallel socket cluster
library(parallel)
cl <- makeCluster(2, type = "PSOCK")
# parallel version of pvclust
res.pv <- parPvclust(cl, df, nboot=1000)
stopCluster(cl)</code></pre>
</div>
</div>
<div id="references" class="section level2 unnumbered">
<h2>References</h2>
<div id="refs" class="references">
<div id="ref-suzuki2015">
<p>Suzuki, Ryota, and Hidetoshi Shimodaira. 2015. <em>Pvclust: Hierarchical Clustering with P-Values via Multiscale Bootstrap Resampling</em>. <a href="https://CRAN.R-project.org/package=pvclust" class="uri">https://CRAN.R-project.org/package=pvclust</a>.</p>
</div>
</div>
</div>
</div><!--end rdoc-->

<!-- END HTML -->]]></description>
			<pubDate>Thu, 07 Sep 2017 11:52:00 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Choosing the Best Clustering Algorithms]]></title>
			<link>https://www.sthda.com/english/articles/29-cluster-validation-essentials/98-choosing-the-best-clustering-algorithms/</link>
			<guid>https://www.sthda.com/english/articles/29-cluster-validation-essentials/98-choosing-the-best-clustering-algorithms/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">
<p><strong>Choosing the best clustering method</strong> for a given data set can be a hard task for the analyst. This article describes the R package <strong>clValid</strong> <span class="citation">(Brock et al. 2008)</span>, which can be used to compare multiple clustering algorithms simultaneously, in a single function call, in order to identify the best clustering approach and the optimal number of clusters.</p>
<div class="block">
<p>
We’ll start by describing the different measures in the clValid package for comparing clustering algorithms. Next, we’ll present the function <em>clValid</em>(). Finally, we’ll provide R scripts for validating clustering results and comparing clustering algorithms.
</p>
</div>
<br/>
<p>Contents: </p>
<div id="TOC">
<ul>
<li><a href="#measures-for-comparing-clustering-algorithms">Measures for comparing clustering algorithms</a></li>
<li><a href="#compare-clustering-algorithms-in-r">Compare clustering algorithms in R</a></li>
<li><a href="#summary">Summary</a></li>
<li><a href="#references">References</a></li>
</ul>
</div>
<br/>
<p>Related Book:</p>
<div class = "small-block content-privileged-friends cluster-book">
    <center>
        <a href = "https://www.sthda.com/english/web/5-bookadvisor/17-practical-guide-to-cluster-analysis-in-r/">
          <img src = "https://www.sthda.com/english/sthda-upload/images/cluster-analysis/clustering-book-cover.png" /><br/>
      Practical Guide to Cluster Analysis in R
      </a>
      </center>
</div>
<div class="spacer"></div>
<div id="measures-for-comparing-clustering-algorithms" class="section level2">
<h2>Measures for comparing clustering algorithms</h2>
<p>The clValid package compares clustering algorithms using two cluster validation measures:</p>
<ol style="list-style-type: decimal">
<li><p><em>Internal measures</em>, which use intrinsic information in the data to assess the quality of the clustering. Internal measures include the connectivity, the silhouette coefficient and the Dunn index, as described in Chapter @ref(cluster-validation-statistics) (Cluster Validation Statistics).</p></li>
<li><p><em>Stability measures</em>, a special version of internal measures, which evaluate the consistency of a clustering result by comparing it with the clusters obtained after each column is removed, one at a time.</p></li>
</ol>
<p>Cluster stability measures include:</p>
<ul>
<li>The average proportion of non-overlap (APN)</li>
<li>The average distance (AD)</li>
<li>The average distance between means (ADM)</li>
<li>The figure of merit (FOM)</li>
</ul>
<p>The APN, AD, and ADM are all based on the cross-classification table of the original clustering on the full data with the clustering based on the removal of one column.</p>
<ul>
<li><p>The APN measures the average proportion of observations not placed in the same cluster by clustering based on the full data and clustering based on the data with a single column removed.</p></li>
<li><p>The AD measures the average distance between observations placed in the same cluster under both cases (full data set and removal of one column).</p></li>
<li><p>The ADM measures the average distance between cluster centers for observations placed in the same cluster under both cases.</p></li>
<li><p>The FOM measures the average intra-cluster variance of the deleted column, where the clustering is based on the remaining (undeleted) columns.</p></li>
</ul>
<div class="warning">
<p>
The values of APN, ADM and FOM range from 0 to 1, with smaller values corresponding to more consistent clustering results. AD takes values between 0 and infinity, and smaller values are also preferred.
</p>
</div>
<div class="notice">
<p>
Note that the clValid package also provides biological validation measures, which evaluate the ability of a clustering algorithm to produce biologically meaningful clusters. A typical application is microarray or RNA-seq data, where observations correspond to genes.
</p>
</div>
</div>
<div id="compare-clustering-algorithms-in-r" class="section level2">
<h2>Compare clustering algorithms in R</h2>
<p>We’ll use the function <strong>clValid</strong>() [in the <em>clValid</em> package], whose simplified format is as follows:</p>
<pre class="r"><code>clValid(obj, nClust, clMethods = "hierarchical", 
        validation = "stability", maxitems = 600,
        metric = "euclidean", method = "average")</code></pre>
<div class="block">
<ul>
<li>
<strong>obj</strong>: A numeric matrix or data frame. Rows are the items to be clustered and columns are samples.
</li>
<li>
<strong>nClust</strong>: A numeric vector specifying the numbers of clusters to be evaluated. e.g., 2:10
</li>
<li>
<strong>clMethods</strong>: The clustering method to be used. Available options are “hierarchical”, “kmeans”, “diana”, “fanny”, “som”, “model”, “sota”, “pam”, “clara”, and “agnes”, with multiple choices allowed.
</li>
<li>
<strong>validation</strong>: The type of validation measures to be used. Allowed values are “internal”, “stability”, and “biological”, with multiple choices allowed.
</li>
<li>
<strong>maxitems</strong>: The maximum number of items (rows in matrix) which can be clustered.
</li>
<li>
<strong>metric</strong>: The metric used to determine the distance matrix. Possible choices are “euclidean”, “correlation”, and “manhattan”.
</li>
<li>
<strong>method</strong>: For hierarchical clustering (hclust and agnes), the agglomeration method to be used. Available choices are “ward”, “single”, “complete” and “average”.
</li>
</ul>
</div>
<p>As an example, consider the iris data set; the <em>clValid</em>() function can be used as follows.</p>
<p>We start with internal cluster validation measures, which include the connectivity, the silhouette width and the Dunn index. These internal measures can be computed simultaneously for multiple clustering algorithms, in combination with a range of cluster numbers.</p>
<pre class="r"><code>library(clValid)
# Iris data set:
# - Remove Species column and scale
df <- scale(iris[, -5])
# Compute clValid
clmethods <- c("hierarchical","kmeans","pam")
intern <- clValid(df, nClust = 2:6, 
              clMethods = clmethods, validation = "internal")
# Summary
summary(intern)</code></pre>
<pre><code>##  Length   Class    Mode 
##       1 clValid      S4</code></pre>
<div class="success">
<p>
It can be seen that hierarchical clustering with two clusters performs the best in each case (i.e., for connectivity, Dunn and Silhouette measures). Regardless of the clustering algorithm, the optimal number of clusters seems to be two using the three measures.
</p>
</div>
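<p>The internal measures can also be inspected across the tested numbers of clusters; a brief sketch using the plot method for clValid objects (one panel per measure and clustering method):</p>
<pre class="r"><code># Plot connectivity, Dunn index and silhouette width
# against the number of clusters, for each clustering method
plot(intern)
# Display only the optimal scores
optimalScores(intern)</code></pre>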
<p>The stability measures can be computed as follows:</p>
<pre class="r"><code># Stability measures
clmethods <- c("hierarchical","kmeans","pam")
stab <- clValid(df, nClust = 2:6, clMethods = clmethods, 
                validation = "stability")
# Display only optimal Scores
optimalScores(stab)</code></pre>
<pre><code>##       Score       Method Clusters
## APN 0.00327 hierarchical        2
## AD  1.00429          pam        6
## ADM 0.01609 hierarchical        2
## FOM 0.45575          pam        6</code></pre>
<div class="success">
<p>
For the APN and ADM measures, hierarchical clustering with two clusters again gives the best score. For the other measures, PAM with six clusters has the best score.
</p>
</div>
</div>
<div id="summary" class="section level2">
<h2>Summary</h2>
<p>Here, we described how to compare clustering algorithms using the <em>clValid</em> R package.</p>
</div>
<div id="references" class="section level2 unnumbered">
<h2>References</h2>
<div id="refs" class="references">
<div id="ref-brock2008">
<p>Brock, Guy, Vasyl Pihur, Susmita Datta, and Somnath Datta. 2008. “ClValid: An R Package for Cluster Validation.” <em>Journal of Statistical Software</em> 25 (4): 1–22. <a href="https://www.jstatsoft.org/v025/i04" class="uri">https://www.jstatsoft.org/v025/i04</a>.</p>
</div>
</div>
</div>
</div><!--end rdoc-->

<!-- END HTML -->]]></description>
			<pubDate>Thu, 07 Sep 2017 11:33:00 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Cluster Validation Statistics: Must Know Methods]]></title>
			<link>https://www.sthda.com/english/articles/29-cluster-validation-essentials/97-cluster-validation-statistics-must-know-methods/</link>
			<guid>https://www.sthda.com/english/articles/29-cluster-validation-essentials/97-cluster-validation-statistics-must-know-methods/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">
<p>The term <strong>cluster validation</strong> is used to describe the procedure of evaluating the goodness of clustering algorithm results. This is important to avoid finding patterns in random data, as well as in situations where you want to compare two clustering algorithms.</p>
<p>Generally, clustering validation statistics can be categorized into 3 classes <span class="citation">(Charrad et al. 2014,<span class="citation">Brock et al. (2008)</span>, <span class="citation">Theodoridis and Koutroumbas (2008)</span>)</span>:</p>
<ol style="list-style-type: decimal">
<li><p><strong>Internal cluster validation</strong>, which uses the internal information of the clustering process to evaluate the goodness of a clustering structure without reference to external information. It can be also used for estimating the number of clusters and the appropriate clustering algorithm without any external data.</p></li>
<li><p><strong>External cluster validation</strong>, which consists in comparing the results of a cluster analysis to an externally known result, such as externally provided class labels. It measures the extent to which cluster labels match externally supplied class labels. Since we know the “true” cluster number in advance, this approach is mainly used for selecting the right clustering algorithm for a specific data set.</p></li>
<li><p><strong>Relative cluster validation</strong>, which evaluates the clustering structure by varying different parameter values for the same algorithm (e.g., varying the number of clusters k). It’s generally used for determining the optimal number of clusters.</p></li>
</ol>
<div class="block">
<p>
In this chapter, we start by describing the different methods for clustering validation. Next, we’ll demonstrate how to compare the quality of clustering results obtained with different clustering algorithms. Finally, we’ll provide R scripts for validating clustering results.
</p>
</div>
<div class="notice">
<p>
In all the examples presented here, we’ll apply k-means, PAM and hierarchical clustering. Note that, the functions used in this article can be applied to evaluate the validity of any other clustering methods.
</p>
</div>
<br/>
<p>Contents:</p>
<div id="TOC">
<ul>
<li><a href="#internal-measures-for-cluster-validation">Internal measures for cluster validation</a><ul>
<li><a href="#silhouette-coefficient">Silhouette coefficient</a></li>
<li><a href="#dunn-index">Dunn index</a></li>
</ul></li>
<li><a href="#external-measures-for-clustering-validation">External measures for clustering validation</a></li>
<li><a href="#computing-cluster-validation-statistics-in-r">Computing cluster validation statistics in R</a><ul>
<li><a href="#required-r-packages">Required R packages</a></li>
<li><a href="#data-preparation">Data preparation</a></li>
<li><a href="#clustering-analysis">Clustering analysis</a></li>
<li><a href="#cluster-validation">Cluster validation</a></li>
<li><a href="#external-clustering-validation">External clustering validation</a></li>
</ul></li>
<li><a href="#summary">Summary</a></li>
<li><a href="#references">References</a></li>
</ul>
</div>
<br/>
<p>Related Book:</p>
<div class = "small-block content-privileged-friends cluster-book">
    <center>
        <a href = "https://www.sthda.com/english/web/5-bookadvisor/17-practical-guide-to-cluster-analysis-in-r/">
          <img src = "https://www.sthda.com/english/sthda-upload/images/cluster-analysis/clustering-book-cover.png" /><br/>
      Practical Guide to Cluster Analysis in R
      </a>
      </center>
</div>
<div class="spacer"></div>
<div id="internal-measures-for-cluster-validation" class="section level2">
<h2>Internal measures for cluster validation</h2>
<p>In this section, we describe the most widely used clustering validation indices. Recall that the goal of partitioning clustering algorithms (Part @ref(partitioning-clustering)) is to split the data set into clusters of objects, such that:</p>
<ul>
<li>the objects in the same cluster are as similar as possible,</li>
<li>and the objects in different clusters are highly distinct</li>
</ul>
<div class="success">
<p>
That is, we want the average distance within cluster to be as small as possible; and the average distance between clusters to be as large as possible.
</p>
</div>
<p>Internal validation measures often reflect the <strong>compactness</strong>, the <strong>connectedness</strong> and the <strong>separation</strong> of the cluster partitions.</p>
<div class="block">
<ol style="list-style-type: decimal">
<li>
<p>
<strong>Compactness</strong> or cluster cohesion: Measures how close the objects within the same cluster are. A lower <strong>within-cluster variation</strong> is an indicator of good compactness (i.e., a good clustering). The different indices for evaluating the compactness of clusters are based on distance measures, such as the cluster-wise average/median distances between observations.
</p>
</li>
<li>
<strong>Separation</strong>: Measures how well-separated a cluster is from other clusters. The indices used as separation measures include:
<ul>
<li>
distances between cluster centers
</li>
<li>
the pairwise minimum distances between objects in different clusters
</li>
</ul>
</li>
<li>
<p>
<strong>Connectivity</strong>: measures the extent to which items are placed in the same cluster as their nearest neighbors in the data space. The connectivity takes values between 0 and infinity and should be minimized.
</p>
</li>
</ol>
</div>
<p>Generally, most of the indices used for internal clustering validation combine compactness and separation measures as follows:</p>
<p><span class="math display">\[
Index = \frac{(\alpha \times Separation)}{(\beta \times Compactness)}
\]</span></p>
<p>Where <span class="math inline">\(\alpha\)</span> and <span class="math inline">\(\beta\)</span> are weights.</p>
<div class="success">
<p>
In this section, we’ll describe the two commonly used indices for assessing the goodness of clustering: the <strong>silhouette width</strong> and the <strong>Dunn index</strong>. These internal measures can also be used to determine the optimal number of clusters in the data.
</p>
</div>
<div id="silhouette-coefficient" class="section level3">
<h3>Silhouette coefficient</h3>
<p>The silhouette analysis measures how well an observation is clustered and it estimates the <strong>average distance between clusters</strong>. The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters.</p>
<p>For each observation <span class="math inline">\(i\)</span>, the silhouette width <span class="math inline">\(s_i\)</span> is calculated as follows:</p>
<ol style="list-style-type: decimal">
<li><p>For each observation <span class="math inline">\(i\)</span>, calculate the average dissimilarity <span class="math inline">\(a_i\)</span> between <span class="math inline">\(i\)</span> and all other points of the cluster to which i belongs.</p></li>
<li><p>For all other clusters <span class="math inline">\(C\)</span>, to which i does not belong, calculate the average dissimilarity <span class="math inline">\(d(i, C)\)</span> of <span class="math inline">\(i\)</span> to all observations of C. The smallest of these <span class="math inline">\(d(i,C)\)</span> is defined as <span class="math inline">\(b_i= \min_C d(i,C)\)</span>. The value of <span class="math inline">\(b_i\)</span> can be seen as the dissimilarity between <span class="math inline">\(i\)</span> and its “neighbor” cluster, i.e., the nearest one to which it does not belong.</p></li>
<li><p>Finally the silhouette width of the observation <span class="math inline">\(i\)</span> is defined by the formula: <span class="math inline">\(S_i = (b_i - a_i)/max(a_i, b_i)\)</span>.</p></li>
</ol>
<p>Silhouette width can be interpreted as follows:</p>
<div class="block">
<ul>
<li>
<p>
Observations with a large <span class="math inline"><em>S</em><sub><em>i</em></sub></span> (almost 1) are very well clustered.
</p>
</li>
<li>
<p>
A small <span class="math inline"><em>S</em><sub><em>i</em></sub></span> (around 0) means that the observation lies between two clusters.
</p>
</li>
<li>
<p>
Observations with a negative <span class="math inline"><em>S</em><sub><em>i</em></sub></span> are probably placed in the wrong cluster.
</p>
</li>
</ul>
</div>
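<p>As a small illustration of these definitions, silhouette widths can be computed directly with the <em>cluster</em> package; a sketch on the scaled iris data used later in this chapter:</p>
<pre class="r"><code>library(cluster)
set.seed(123)
# k-means with 3 clusters on the scaled iris data (Species column removed)
df <- scale(iris[, -5])
km <- kmeans(df, centers = 3, nstart = 25)
# One row per observation: cluster, neighbor cluster and silhouette width
sil <- silhouette(km$cluster, dist(df))
head(sil[, 1:3])
mean(sil[, "sil_width"])  # average silhouette width</code></pre>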
</div>
<div id="dunn-index" class="section level3">
<h3>Dunn index</h3>
<p>The <strong>Dunn index</strong> is another internal clustering validation measure, which can be computed as follows:</p>
<ol style="list-style-type: decimal">
<li>For each cluster, compute the distance between each of the objects in the cluster and the objects in the other clusters</li>
<li><p>Use the minimum of this pairwise distance as the inter-cluster separation (<em>min.separation</em>)</p></li>
<li>For each cluster, compute the distance between the objects in the same cluster.</li>
<li><p>Use the maximal intra-cluster distance (i.e., the maximum diameter) as the intra-cluster compactness</p></li>
<li><p>Calculate the <em>Dunn index</em> (D) as follows:</p></li>
</ol>
<p><span class="math display">\[
D = \frac{min.separation}{max.diameter}
\]</span></p>
<div class="success">
<p>
If the data set contains compact and well-separated clusters, the diameter of the clusters is expected to be small and the distance between the clusters is expected to be large. Thus, Dunn index should be maximized.
</p>
</div>
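<p>A short hedged sketch of computing the Dunn index directly, using the <em>dunn</em>() function from the <em>clValid</em> package on a k-means partition of the scaled iris data:</p>
<pre class="r"><code>library(clValid)
set.seed(123)
df <- scale(iris[, -5])
km <- kmeans(df, centers = 3, nstart = 25)
# dunn() takes a distance matrix and an integer vector of cluster memberships
dunn(as.matrix(dist(df)), km$cluster)</code></pre>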
</div>
</div>
<div id="external-measures-for-clustering-validation" class="section level2">
<h2>External measures for clustering validation</h2>
<p>The aim is to compare the identified clusters (by k-means, pam or hierarchical clustering) to an external reference.</p>
<p>It’s possible to quantify the agreement between the partition and the external reference using either the corrected <em>Rand index</em> or <em>Meila’s variation of information (VI) index</em>, both of which are implemented in the R function <em>cluster.stats</em>() [<em>fpc</em> package].</p>
<p>The corrected <em>Rand index</em> varies from -1 (no agreement) to 1 (perfect agreement).</p>
<div class="success">
<p>
External clustering validation can be used to select a suitable clustering algorithm for a given data set.
</p>
</div>
</div>
<div id="computing-cluster-validation-statistics-in-r" class="section level2">
<h2>Computing cluster validation statistics in R</h2>
<div id="required-r-packages" class="section level3">
<h3>Required R packages</h3>
<p>The following R packages are required in this chapter:</p>
<ul>
<li><em>factoextra</em> for data visualization</li>
<li><em>fpc</em> for computing clustering validation statistics</li>
<li><p><em>NbClust</em> for determining the optimal number of clusters in the data set.</p></li>
<li><p>Install the packages:</p></li>
</ul>
<pre class="r"><code>install.packages(c("factoextra", "fpc", "NbClust"))</code></pre>
<ul>
<li>Load the packages:</li>
</ul>
<pre class="r"><code>library(factoextra)
library(fpc)
library(NbClust)</code></pre>
</div>
<div id="data-preparation" class="section level3">
<h3>Data preparation</h3>
<p>We’ll use the built-in R data set iris:</p>
<pre class="r"><code># Excluding the column "Species" at position 5
df <- iris[, -5]
# Standardize
df <- scale(df)</code></pre>
</div>
<div id="clustering-analysis" class="section level3">
<h3>Clustering analysis</h3>
<p>We’ll use the function <em>eclust</em>() [enhanced clustering, in <em>factoextra</em>] which provides several advantages:</p>
<ul>
<li>It simplifies the workflow of clustering analysis</li>
<li>It can be used to compute hierarchical clustering and partitioning clustering in a single line function call</li>
<li>Compared to the standard partitioning functions (kmeans, pam, clara and fanny), which require the user to specify the number of clusters, the function <em>eclust</em>() automatically computes the gap statistic for estimating the right number of clusters.</li>
<li>It provides silhouette information for all partitioning methods and hierarchical clustering</li>
<li>It draws beautiful graphs using ggplot2</li>
</ul>
<p>The simplified format of the <em>eclust</em>() function is as follows:</p>
<pre class="r"><code>eclust(x, FUNcluster = "kmeans", hc_metric = "euclidean", ...)</code></pre>
<div class="block">
<ul>
<li>
<strong>x</strong>: numeric vector, data matrix or data frame
</li>
<li>
<strong>FUNcluster</strong>: a clustering function including “kmeans”, “pam”, “clara”, “fanny”, “hclust”, “agnes” and “diana”. Abbreviation is allowed.
</li>
<li>
<strong>hc_metric</strong>: character string specifying the metric to be used for calculating dissimilarities between observations. Allowed values are those accepted by the function <em>dist</em>() [including “euclidean”, “manhattan”, “maximum”, “canberra”, “binary”, “minkowski”] and correlation based distance measures [“pearson”, “spearman” or “kendall”]. Used only when FUNcluster is a hierarchical clustering function such as one of “hclust”, “agnes” or “diana”.
</li>
<li>
<strong>…</strong>: other arguments to be passed to FUNcluster.
</li>
</ul>
</div>
<p>The function <strong>eclust()</strong> returns an object of class <strong>eclust</strong> containing the result of the standard function used (e.g., kmeans, pam, hclust, agnes, diana, etc.).</p>
<p>It includes also:</p>
<ul>
<li><strong>cluster</strong>: the cluster assignment of observations after cutting the tree</li>
<li><strong>nbclust</strong>: the number of clusters</li>
<li><strong>silinfo</strong>: the silhouette information of observations</li>
<li><strong>size</strong>: the size of clusters</li>
<li><strong>data</strong>: a matrix containing the original or the standardized data (if stand = TRUE)</li>
<li><strong>gap_stat</strong>: containing gap statistics</li>
</ul>
<p>To compute a partitioning clustering, such as k-means clustering with k = 3, type this:</p>
<pre class="r"><code># K-means clustering
km.res <- eclust(df, "kmeans", k = 3, nstart = 25, graph = FALSE)
# Visualize k-means clusters
fviz_cluster(km.res, geom = "point", ellipse.type = "norm",
             palette = "jco", ggtheme = theme_minimal())</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/016-cluster-validation-statistics-k-means-clustering-1.png" width="384" /></p>
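<p>The components listed above can be pulled straight out of the result; a brief sketch:</p>
<pre class="r"><code># A few of the components returned by eclust()
km.res$nbclust         # number of clusters
km.res$size            # size of each cluster
head(km.res$cluster)   # cluster assignment of the first observations</code></pre>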
<p>To compute a hierarchical clustering, use this:</p>
<pre class="r"><code># Hierarchical clustering
hc.res <- eclust(df, "hclust", k = 3, hc_metric = "euclidean", 
                 hc_method = "ward.D2", graph = FALSE)
# Visualize dendrograms
fviz_dend(hc.res, show_labels = FALSE,
         palette = "jco", as.ggplot = TRUE)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/016-cluster-validation-statistics-hierarchical-clustering-1.png" width="518.4" /></p>
</div>
<div id="cluster-validation" class="section level3">
<h3>Cluster validation</h3>
<div id="silhouette-plot" class="section level4">
<h4>Silhouette plot</h4>
<p>Recall that the silhouette coefficient (<span class="math inline">\(S_i\)</span>) measures how similar an object <span class="math inline">\(i\)</span> is to the other objects in its own cluster versus those in the neighbor cluster. <span class="math inline">\(S_i\)</span> values range from 1 to -1:</p>
<ul>
<li>A value of <span class="math inline">\(S_i\)</span> close to 1 indicates that the object is well clustered. In other words, the object <span class="math inline">\(i\)</span> is similar to the other objects in its group.</li>
<li>A value of <span class="math inline">\(S_i\)</span> close to -1 indicates that the object is poorly clustered, and that assignment to some other cluster would probably improve the overall results.</li>
</ul>
<p>It’s possible to draw silhouette coefficients of observations using the function <em>fviz_silhouette</em>() [<em>factoextra</em> package], which will also print a summary of the silhouette analysis output. To avoid this, you can use the option <em>print.summary = FALSE</em>.</p>
<pre class="r"><code>fviz_silhouette(km.res, palette = "jco", 
                ggtheme = theme_classic())</code></pre>
<pre><code>##   cluster size ave.sil.width
## 1       1   50          0.64
## 2       2   47          0.35
## 3       3   53          0.39</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/016-cluster-validation-statistics-silhouette-coefficient-1.png" width="518.4" /></p>
<p>Silhouette information can be extracted as follow:</p>
<pre class="r"><code># Silhouette information
silinfo <- km.res$silinfo
names(silinfo)
# Silhouette widths of each observation
head(silinfo$widths[, 1:3], 10)
# Average silhouette width of each cluster
silinfo$clus.avg.widths
# The total average (mean of all individual silhouette widths)
silinfo$avg.width
# The size of each clusters
km.res$size</code></pre>
<p>It can be seen that several samples in cluster 2 have a negative silhouette coefficient. This means that they are not in the right cluster. We can find the names of these samples and determine the clusters they are closer to (their neighbor clusters), as follows:</p>
<pre class="r"><code># Silhouette width of observation
sil <- km.res$silinfo$widths[, 1:3]
# Objects with negative silhouette
neg_sil_index <- which(sil[, 'sil_width'] < 0)
sil[neg_sil_index, , drop = FALSE]</code></pre>
<pre><code>##     cluster neighbor sil_width
## 112       2        3   -0.0106
## 128       2        3   -0.0249</code></pre>
</div>
<div id="computing-dunn-index-and-other-cluster-validation-statistics" class="section level4">
<h4>Computing Dunn index and other cluster validation statistics</h4>
<p>The function <em>cluster.stats</em>() [<em>fpc</em> package] and the function <em>NbClust()</em> [in <em>NbClust</em> package] can be used to compute <em>Dunn index</em> and many other indices.</p>
<p>The simplified format is:</p>
<pre class="r"><code>cluster.stats(d = NULL, clustering, al.clustering = NULL)</code></pre>
<div class="block">
<ul>
<li>
<strong>d</strong>: a distance object between cases as generated by the <strong>dist()</strong> function
</li>
<li>
<strong>clustering</strong>: vector containing the cluster number of each observation
</li>
<li>
<strong>alt.clustering</strong>: vector such as for clustering, indicating an alternative clustering
</li>
</ul>
</div>
<p>The function <em>cluster.stats</em>() returns a list containing many components useful for analyzing the intrinsic characteristics of a clustering:</p>
<ul>
<li><strong>cluster.number</strong>: number of clusters</li>
<li><strong>cluster.size</strong>: vector containing the number of points in each cluster</li>
<li><strong>average.distance</strong>, <strong>median.distance</strong>: vector containing the cluster-wise within average/median distances</li>
<li><strong>average.between</strong>: average distance between clusters. We want it to be as large as possible</li>
<li><strong>average.within</strong>: average distance within clusters. We want it to be as small as possible</li>
<li><strong>clus.avg.silwidths</strong>: vector of cluster average silhouette widths. Recall that the <strong>silhouette width</strong> is also an estimate of the average distance between clusters. Its value lies between -1 and 1, with a value of 1 indicating a very good cluster.</li>
<li><strong>within.cluster.ss</strong>: a generalization of the within clusters sum of squares (k-means objective function), which is obtained if d is a Euclidean distance matrix.</li>
<li><strong>dunn, dunn2</strong>: Dunn index</li>
<li><strong>corrected.rand, vi</strong>: two indices to assess the similarity of two clusterings: the corrected Rand index and Meila’s VI</li>
</ul>
<p>All the above elements can be used to evaluate the internal quality of clustering.</p>
<p>In the following sections, we’ll compute the clustering quality statistics for k-means. Look at the <strong>within.cluster.ss</strong> (within clusters sum of squares), the <strong>average.within</strong> (average distance within clusters) and <strong>clus.avg.silwidths</strong> (vector of cluster average silhouette widths).</p>
<pre class="r"><code>library(fpc)
# Statistics for k-means clustering
km_stats <- cluster.stats(dist(df),  km.res$cluster)
# Dunn index
km_stats$dunn</code></pre>
<pre><code>## [1] 0.0265</code></pre>
<p>To display all statistics, type this:</p>
<pre class="r"><code>km_stats</code></pre>
<div class="notice">
<p>
Read the documentation of <em>cluster.stats</em>() for details about all the available indices.
</p>
</div>
</div>
</div>
<div id="external-clustering-validation" class="section level3">
<h3>External clustering validation</h3>
<p>Among the values returned by the function <strong>cluster.stats</strong>(), there are two indices for assessing the similarity of two clusterings, namely the corrected Rand index and Meila’s VI.</p>
<p>We know that the iris data contains exactly 3 groups of species.</p>
<p><span class="question">Does the K-means clustering matches with the true structure of the data?</span></p>
<p>We can use the function <strong>cluster.stats</strong>() to answer this question.</p>
<p>Let’s start by computing a cross-tabulation between the k-means clusters and the reference Species labels:</p>
<pre class="r"><code>table(iris$Species, km.res$cluster)</code></pre>
<pre><code>##             
##               1  2  3
##   setosa     50  0  0
##   versicolor  0 11 39
##   virginica   0 36 14</code></pre>
<p>It can be seen that:</p>
<ul>
<li>All setosa observations (n = 50) have been classified in cluster 1</li>
<li>A large number of versicolor observations (n = 39) have been classified in cluster 3; some of them (n = 11) have been classified in cluster 2.</li>
<li>A large number of virginica observations (n = 36) have been classified in cluster 2; some of them (n = 14) have been classified in cluster 3.</li>
</ul>
<p>It’s possible to quantify the agreement between Species and the k-means clusters using either the corrected Rand index or Meila’s VI, as follows:</p>
<pre class="r"><code>library("fpc")
# Compute cluster stats
species <- as.numeric(iris$Species)
clust_stats <- cluster.stats(d = dist(df), 
                             species, km.res$cluster)
# Corrected Rand index
clust_stats$corrected.rand</code></pre>
<pre><code>## [1] 0.62</code></pre>
<pre class="r"><code># VI
clust_stats$vi</code></pre>
<pre><code>## [1] 0.748</code></pre>
<div class="success">
<p>
The corrected <strong>Rand index</strong> provides a measure for assessing the similarity between two partitions, adjusted for chance. Its range is -1 (no agreement) to 1 (perfect agreement). Agreement between the species labels and the cluster solution is 0.62 using the corrected <strong>Rand index</strong> and 0.748 using Meila’s VI.
</p>
</div>
<p>The same analysis can be computed for both PAM and hierarchical clustering:</p>
<pre class="r"><code># Agreement between species and pam clusters
pam.res <- eclust(df, "pam", k = 3, graph = FALSE)
table(iris$Species, pam.res$cluster)
cluster.stats(d = dist(df), 
              species, pam.res$cluster)$vi
# Agreement between species and HC clusters
res.hc <- eclust(df, "hclust", k = 3, graph = FALSE)
table(iris$Species, res.hc$cluster)
cluster.stats(d = dist(df), 
              species, res.hc$cluster)$vi</code></pre>
<div class="success">
<p>
External clustering validation, can be used to select suitable clustering algorithm for a given data set.
</p>
</div>
</div>
</div>
<div id="summary" class="section level2">
<h2>Summary</h2>
<p>We described how to validate clustering results using the silhouette method and the Dunn index. This task is facilitated by the combination of two R functions: <em>eclust</em>() and <em>fviz_silhouette</em>() [in the factoextra package]. We also demonstrated how to assess the agreement between a clustering result and an external reference.
In the next chapters, we’ll show how to i) choose the appropriate clustering algorithm for your data; and ii) compute p-values for hierarchical clustering.</p>
</div>
<div id="references" class="section level2 unnumbered">
<h2>References</h2>
<div id="refs" class="references">
<div id="ref-brock2008">
<p>Brock, Guy, Vasyl Pihur, Susmita Datta, and Somnath Datta. 2008. “ClValid: An R Package for Cluster Validation.” <em>Journal of Statistical Software</em> 25 (4): 1–22. <a href="https://www.jstatsoft.org/v025/i04" class="uri">https://www.jstatsoft.org/v025/i04</a>.</p>
</div>
<div id="ref-charrad2014">
<p>Charrad, Malika, Nadia Ghazzali, Véronique Boiteau, and Azam Niknafs. 2014. “NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set.” <em>Journal of Statistical Software</em> 61: 1–36. <a href="http://www.jstatsoft.org/v61/i06/paper" class="uri">http://www.jstatsoft.org/v61/i06/paper</a>.</p>
</div>
<div id="ref-theodoridis2008">
<p>Theodoridis, Sergios, and Konstantinos Koutroumbas. 2008. <em>Pattern Recognition</em>. 2nd ed. Academic Press.</p>
</div>
</div>
</div>
</div><!--end rdoc-->
 
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
  (function () {
    var script = document.createElement("script");
    script.type = "text/javascript";
    script.src  = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
    document.getElementsByTagName("head")[0].appendChild(script);
  })();
</script>

<!-- END HTML -->]]></description>
			<pubDate>Thu, 07 Sep 2017 11:07:00 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Determining The Optimal Number Of Clusters: 3 Must Know Methods]]></title>
			<link>https://www.sthda.com/english/articles/29-cluster-validation-essentials/96-determining-the-optimal-number-of-clusters-3-must-know-methods/</link>
			<guid>https://www.sthda.com/english/articles/29-cluster-validation-essentials/96-determining-the-optimal-number-of-clusters-3-must-know-methods/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">
<p>Determining the <strong>optimal number of clusters</strong> in a data set is a fundamental issue in partitioning clustering, such as k-means clustering (Chapter @ref(kmeans-clustering)), which requires the user to specify the number of clusters k to be generated.</p>
<p>Unfortunately, there is no definitive answer to this question. The optimal number of clusters is somewhat subjective and depends on the method used for measuring similarities and the parameters used for partitioning.
A simple and popular solution consists of inspecting the dendrogram produced using hierarchical clustering (Chapter @ref(agglomerative-clustering)) to see if it suggests a particular number of clusters. Unfortunately, this approach is also subjective.</p>
<div class="block">
<p>
In this chapter, we’ll describe different methods for determining the optimal number of clusters for k-means, k-medoids (PAM) and hierarchical clustering.
</p>
</div>
<p>These methods include direct methods and statistical testing methods:</p>
<ol style="list-style-type: decimal">
<li><p>Direct methods: consist of optimizing a criterion, such as the within-cluster sums of squares or the average silhouette. The corresponding methods are named the <em>elbow</em> and <em>silhouette</em> methods, respectively.</p></li>
<li><p>Statistical testing methods: consist of comparing evidence against a null hypothesis. An example is the <em>gap statistic</em>.</p></li>
</ol>
<p>In addition to the <em>elbow</em>, <em>silhouette</em> and <em>gap statistic</em> methods, there are more than thirty other indices and methods that have been published for identifying the optimal number of clusters. We’ll provide R code for computing all of these indices in order to decide the best number of clusters using the “majority rule”.</p>
<p>For each of these methods:</p>
<ul>
<li>We’ll describe the basic idea and the algorithm</li>
<li>We’ll provide easy-to-use R code with many examples for determining the optimal number of clusters and visualizing the output.</li>
</ul>
<br/>
<p>Contents:</p>
<div id="TOC">
<ul>
<li><a href="#elbow-method">Elbow method</a></li>
<li><a href="#average-silhouette-method">Average silhouette method</a></li>
<li><a href="#gap-statistic-method">Gap statistic method</a></li>
<li><a href="#computing-the-number-of-clusters-using-r">Computing the number of clusters using R</a><ul>
<li><a href="#required-r-packages">Required R packages</a></li>
<li><a href="#data-preparation">Data preparation</a></li>
<li><a href="#fviz_nbclust-function-elbow-silhouhette-and-gap-statistic-methods">fviz_nbclust() function: Elbow, Silhouhette and Gap statistic methods</a></li>
<li><a href="#nbclust-function-30-indices-for-choosing-the-best-number-of-clusters">NbClust() function: 30 indices for choosing the best number of clusters</a></li>
</ul></li>
<li><a href="#summary">Summary</a></li>
<li><a href="#references">References</a></li>
</ul>
</div>
<br/>
<p>Related Book:</p>
<div class = "small-block content-privileged-friends cluster-book">
    <center>
        <a href = "https://www.sthda.com/english/web/5-bookadvisor/17-practical-guide-to-cluster-analysis-in-r/">
          <img src = "https://www.sthda.com/english/sthda-upload/images/cluster-analysis/clustering-book-cover.png" /><br/>
      Practical Guide to Cluster Analysis in R
      </a>
      </center>
</div>
<div class="spacer"></div>
<div id="elbow-method" class="section level2">
<h2>Elbow method</h2>
<p>Recall that, the basic idea behind partitioning methods, such as k-means clustering (Chapter @ref(kmeans-clustering)), is to define clusters such that the total intra-cluster variation [or total within-cluster sum of square (WSS)] is minimized. The total WSS measures the compactness of the clustering and we want it to be as small as possible.</p>
<p>The Elbow method looks at the total WSS as a function of the number of clusters: one should choose a number of clusters such that adding another cluster doesn’t substantially improve the total WSS.</p>
<p>The optimal number of clusters can be defined as follows (a bare-bones R sketch is given after the list):</p>
<div class="block">
<ol style="list-style-type: decimal">
<li>
<p>
Compute clustering algorithm (e.g., k-means clustering) for different values of k. For instance, by varying k from 1 to 10 clusters.
</p>
</li>
<li>
<p>
For each k, calculate the total within-cluster sum of square (wss).
</p>
</li>
<li>
<p>
Plot the curve of wss according to the number of clusters k.
</p>
</li>
<li>
<p>
The location of a bend (knee) in the plot is generally considered as an indicator of the appropriate number of clusters.
</p>
</li>
</ol>
</div>
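<p>A bare-bones version of these steps, before using the dedicated functions described later in this chapter (a sketch on the scaled USArrests data used below):</p>
<pre class="r"><code>df <- scale(USArrests)
set.seed(123)
# Total within-cluster sum of squares (WSS) for k = 1 to 10
wss <- sapply(1:10, function(k) {
  kmeans(df, centers = k, nstart = 25)$tot.withinss
})
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")</code></pre>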
<div class="success">
<p>
Note that the elbow method is sometimes ambiguous. An alternative is the average silhouette method (Kaufman and Rousseeuw [1990]), which can also be used with any clustering approach.
</p>
</div>
</div>
<div id="average-silhouette-method" class="section level2">
<h2>Average silhouette method</h2>
<p>The average silhouette approach is described comprehensively in the chapter on cluster validation statistics (Chapter @ref(cluster-validation-statistics)). Briefly, it measures the quality of a clustering; that is, it determines how well each object lies within its cluster. A high average silhouette width indicates a good clustering.</p>
<p>The average silhouette method computes the average silhouette of observations for different values of k. The optimal number of clusters k is the one that maximizes the average silhouette over a range of possible values for k <span class="citation">(Kaufman and Rousseeuw 1990)</span>.</p>
<p>The algorithm is similar to the elbow method and can be computed as follows (a sketch is given after the list):</p>
<div class="block">
<ol style="list-style-type: decimal">
<li>
<p>
Compute clustering algorithm (e.g., k-means clustering) for different values of k. For instance, by varying k from 1 to 10 clusters.
</p>
</li>
<li>
<p>
For each k, calculate the average silhouette of observations (<em>avg.sil</em>).
</p>
</li>
<li>
<p>
Plot the curve of <em>avg.sil</em> according to the number of clusters k.
</p>
</li>
<li>
<p>
The location of the maximum is considered as the appropriate number of clusters.
</p>
</li>
</ol>
</div>
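<p>A corresponding sketch, computing the average silhouette width with the <em>cluster</em> package for k = 2 to 10 (the silhouette is undefined for k = 1):</p>
<pre class="r"><code>library(cluster)
df <- scale(USArrests)
set.seed(123)
# Average silhouette width for k = 2 to 10
avg.sil <- sapply(2:10, function(k) {
  km <- kmeans(df, centers = k, nstart = 25)
  mean(silhouette(km$cluster, dist(df))[, "sil_width"])
})
plot(2:10, avg.sil, type = "b",
     xlab = "Number of clusters k", ylab = "Average silhouette width")</code></pre>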
</div>
<div id="gap-statistic-method" class="section level2">
<h2>Gap statistic method</h2>
<p>The <em>gap statistic</em> was published by <a href="http://web.stanford.edu/~hastie/Papers/gap.pdf">R. Tibshirani, G. Walther, and T. Hastie (Stanford University, 2001)</a>. The approach can be applied to any clustering method.</p>
<p>The gap statistic compares the total within-cluster variation for different values of k with their expected values under a null reference distribution of the data. The estimate of the optimal number of clusters will be the value that maximizes the gap statistic (i.e., that yields the largest gap statistic). This means that the clustering structure is far away from a random uniform distribution of points.</p>
<p>The algorithm works as follows (a sketch using the clusGap() function is given after the list):</p>
<div class="block">
<ol style="list-style-type: decimal">
<li>
<p>
Cluster the observed data, varying the number of clusters from k = 1, …, <span class="math inline">\(k_{max}\)</span>, and compute the corresponding total within-cluster variation <span class="math inline">\(W_k\)</span>.
</p>
</li>
<li>
<p>
Generate B reference data sets with a random uniform distribution. Cluster each of these reference data sets with a varying number of clusters k = 1, …, <span class="math inline">\(k_{max}\)</span>, and compute the corresponding total within-cluster variation <span class="math inline">\(W_{kb}\)</span>.
</p>
</li>
<li>
<p>
Compute the estimated gap statistic as the deviation of the observed <span class="math inline">\(W_k\)</span> value from its expected value under the null hypothesis: <span class="math inline">\(Gap(k) = \frac{1}{B} \sum\limits_{b=1}^B log(W_{kb}^*) - log(W_k)\)</span>. Compute also the standard deviation of the statistic.
</p>
</li>
<li>
<p>
Choose the number of clusters as the smallest value of k such that the gap statistic is within one standard deviation of the gap at k+1: <span class="math inline">\(Gap(k) \geq Gap(k+1) - s_{k+1}\)</span>.
</p>
</li>
</ol>
</div>
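<p>A short sketch of these steps using the <em>clusGap</em>() function [in the <em>cluster</em> package], on the scaled USArrests data used below:</p>
<pre class="r"><code>library(cluster)
df <- scale(USArrests)
set.seed(123)
# B = 50 reference sets to keep this quick; larger values (e.g., 500) are more precise
gap_stat <- clusGap(df, FUN = kmeans, nstart = 25, K.max = 10, B = 50)
# Suggested k according to the rule described in step 4 above
print(gap_stat, method = "Tibs2001SEmax")</code></pre>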
<div class="warning">
<p>
Note that using B = 500 gives quite precise results, so that the gap plot is basically unchanged after another run.
</p>
</div>
</div>
<div id="computing-the-number-of-clusters-using-r" class="section level2">
<h2>Computing the number of clusters using R</h2>
<p>In this section, we’ll describe two functions for determining the optimal number of clusters:</p>
<ol style="list-style-type: decimal">
<li><p><em>fviz_nbclust</em>() function [in <em>factoextra</em> R package]: It can be used to compute the three different methods [elbow, silhouette and gap statistic] for any partitioning clustering method [K-means, K-medoids (PAM), CLARA, HCUT]. Note that the <em>hcut</em>() function is available only in the factoextra package. It computes hierarchical clustering and cuts the tree into k pre-specified clusters.</p></li>
<li><p><em>NbClust</em>() function [in <em>NbClust</em> R package] <span class="citation">(Charrad et al. 2014)</span>: It provides 30 indices for determining the relevant number of clusters and proposes to users the best clustering scheme from the different results obtained by varying all combinations of number of clusters, distance measures, and clustering methods. It can simultaneously compute all the indices and determine the number of clusters in a single function call.</p></li>
</ol>
<div id="required-r-packages" class="section level3">
<h3>Required R packages</h3>
<p>We’ll use the following R packages:</p>
<ul>
<li><em>factoextra</em> to determine the optimal number of clusters for a given clustering method and for data visualization.</li>
<li><em>NbClust</em> for computing about 30 methods at once, in order to find the optimal number of clusters.</li>
</ul>
<p>To install the packages, type this:</p>
<pre class="r"><code>pkgs <- c("factoextra",  "NbClust")
install.packages(pkgs)</code></pre>
<p>Load the packages as follow:</p>
<pre class="r"><code>library(factoextra)
library(NbClust)</code></pre>
</div>
<div id="data-preparation" class="section level3">
<h3>Data preparation</h3>
<p>We’ll use the USArrests data as a demo data set. We start by standardizing the data to make variables comparable.</p>
<pre class="r"><code># Standardize the data
df <- scale(USArrests)
head(df)</code></pre>
<pre><code>##            Murder Assault UrbanPop     Rape
## Alabama    1.2426   0.783   -0.521 -0.00342
## Alaska     0.5079   1.107   -1.212  2.48420
## Arizona    0.0716   1.479    0.999  1.04288
## Arkansas   0.2323   0.231   -1.074 -0.18492
## California 0.2783   1.263    1.759  2.06782
## Colorado   0.0257   0.399    0.861  1.86497</code></pre>
</div>
<div id="fviz_nbclust-function-elbow-silhouhette-and-gap-statistic-methods" class="section level3">
<h3>fviz_nbclust() function: Elbow, Silhouhette and Gap statistic methods</h3>
<p>The simplified format is as follow:</p>
<pre class="r"><code>fviz_nbclust(x, FUNcluster, method = c("silhouette", "wss", "gap_stat"))</code></pre>
<div class="block">
<ul>
<li>
<strong>x</strong>: numeric matrix or data frame
</li>
<li>
<strong>FUNcluster</strong>: a partitioning function. Allowed values include kmeans, pam, clara and hcut (for hierarchical clustering).
</li>
<li>
<strong>method</strong>: the method to be used for determining the optimal number of clusters.
</li>
</ul>
</div>
<p>The R code below determines the optimal number of clusters for k-means clustering:</p>
<pre class="r"><code># Elbow method
fviz_nbclust(df, kmeans, method = "wss") +
    geom_vline(xintercept = 4, linetype = 2)+
  labs(subtitle = "Elbow method")
# Silhouette method
fviz_nbclust(df, kmeans, method = "silhouette")+
  labs(subtitle = "Silhouette method")
# Gap statistic
# nboot = 50 to keep the function speedy. 
# recommended value: nboot= 500 for your analysis.
# Use verbose = FALSE to hide computing progression.
set.seed(123)
fviz_nbclust(df, kmeans, nstart = 25,  method = "gap_stat", nboot = 50)+
  labs(subtitle = "Gap statistic method")</code></pre>
<pre><code>## Clustering k = 1,2,..., K.max (= 10): .. done
## Bootstrapping, b = 1,2,..., B (= 50)  [one "." per sample]:
## .................................................. 50</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/015-determining-the-optimal-number-of-clusters-k-means-optimal-clusters-wss-silhouette-1.png" width="288" /><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/015-determining-the-optimal-number-of-clusters-k-means-optimal-clusters-wss-silhouette-2.png" width="288" /><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/015-determining-the-optimal-number-of-clusters-k-means-optimal-clusters-wss-silhouette-3.png" width="288" /></p>
<div class="success">
<ul>
<li>
Elbow method: 4-cluster solution suggested
</li>
<li>
Silhouette method: 2-cluster solution suggested
</li>
<li>
Gap statistic method: 4-cluster solution suggested
</li>
</ul>
</div>
<p>According to these observations, it’s possible to define k = 4 as the optimal number of clusters in the data.</p>
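<p>As a quick check, one could run k-means with the suggested k = 4 and visualize the partition. Below is a minimal sketch; the seed and nstart = 25 are illustrative choices:</p>
<pre class="r"><code># K-means with the suggested number of clusters (illustrative sketch)
set.seed(123)
km.res <- kmeans(df, centers = 4, nstart = 25)
# Visualize the partition on the first two principal components
fviz_cluster(km.res, data = df, palette = "jco",
             geom = "point", ggtheme = theme_classic())</code></pre>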
<div class="warning">
<p>
The disadvantage of the elbow and average silhouette methods is that they measure only a global clustering characteristic. A more sophisticated approach is the gap statistic, which provides a statistical procedure to formalize the elbow/silhouette heuristic in order to estimate the optimal number of clusters.
</p>
</div>
</div>
<div id="nbclust-function-30-indices-for-choosing-the-best-number-of-clusters" class="section level3">
<h3>NbClust() function: 30 indices for choosing the best number of clusters</h3>
<p>The simplified format of the function <em>NbClust</em>() is:</p>
<pre class="r"><code>NbClust(data = NULL, diss = NULL, distance = "euclidean",
        min.nc = 2, max.nc = 15, method = NULL)</code></pre>
<div class="block">
<ul>
<li>
<strong>data</strong>: matrix
</li>
<li>
<strong>diss</strong>: dissimilarity matrix to be used. By default, diss=NULL, but if it is replaced by a dissimilarity matrix, distance should be “NULL”
</li>
<li>
<strong>distance</strong>: the distance measure to be used to compute the dissimilarity matrix. Possible values include “euclidean”, “manhattan” or “NULL”.
</li>
<li>
<strong>min.nc, max.nc</strong>: minimal and maximal number of clusters, respectively
</li>
<li>
<strong>method</strong>: The cluster analysis method to be used including “ward.D”, “ward.D2”, “single”, “complete”, “average”, “kmeans” and more.
</li>
</ul>
</div>
<ul>
<li>To compute <em>NbClust</em>() for kmeans, use method = “kmeans”.</li>
<li>To compute <em>NbClust</em>() for hierarchical clustering, method should be one of c(“ward.D”, “ward.D2”, “single”, “complete”, “average”).</li>
</ul>
<p>The R code below computes <em>NbClust</em>() for k-means:</p>
<pre class="r"><code>library("NbClust")
nb <- NbClust(df, distance = "euclidean", min.nc = 2,
        max.nc = 10, method = "kmeans")</code></pre>
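<p>The same call also works for hierarchical clustering by changing the method argument. For example, a brief sketch with Ward linkage (the object name nb_hc is arbitrary, and the index results will generally differ from the k-means run above):</p>
<pre class="r"><code># NbClust for hierarchical clustering (Ward linkage)
nb_hc <- NbClust(df, distance = "euclidean", min.nc = 2,
                 max.nc = 10, method = "ward.D2")</code></pre>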
<p>The result of <em>NbClust</em>() (here, the k-means run) can be visualized using the function <em>fviz_nbclust</em>() [in <em>factoextra</em>], as follows:</p>
<pre class="r"><code>library("factoextra")
fviz_nbclust(nb)</code></pre>
<pre><code>## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 10 proposed  2 as the best number of clusters
## * 2 proposed  3 as the best number of clusters
## * 8 proposed  4 as the best number of clusters
## * 1 proposed  5 as the best number of clusters
## * 1 proposed  8 as the best number of clusters
## * 2 proposed  10 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  2 .</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/015-determining-the-optimal-number-of-clusters-nbclust-1.png" width="384" /></p>
<div class="success">
<ul>
<li>
2 indices proposed 0 as the best number of clusters.
</li>
<li>
10 indices proposed 2 as the best number of clusters.
</li>
<li>
2 indices proposed 3 as the best number of clusters.
</li>
<li>
8 indices proposed 4 as the best number of clusters.
</li>
</ul>
<p>
According to the majority rule, the best number of clusters is 2.
</p>
</div>
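<p>Beyond the plot, the number of clusters proposed by each index and the corresponding partition can be extracted from the NbClust result. A brief sketch, assuming the standard NbClust output elements Best.nc and Best.partition:</p>
<pre class="r"><code># Number of clusters proposed by each index
nb$Best.nc
# Cluster assignment for the majority-rule solution
head(nb$Best.partition)</code></pre>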
</div>
</div>
<div id="summary" class="section level2">
<h2>Summary</h2>
<p>In this article, we described different methods for choosing the optimal number of clusters in a data set. These methods include the elbow, the silhouette and the gap statistic methods.</p>
<p>We demonstrated how to compute these methods using the R function <em>fviz_nbclust</em>() [in <em>factoextra</em> R package]. Additionally, we described the <em>NbClust</em>() function [in <em>NbClust</em> R package], which can be used to simultaneously compute many indices and methods for determining the number of clusters.</p>
<p>After choosing the number of clusters k, the next step is to perform partitioning clustering, as described in the k-means clustering chapter (Chapter @ref(kmeans-clustering)).</p>
</div>
<div id="references" class="section level2 unnumbered">
<h2>References</h2>
<div id="refs" class="references">
<div id="ref-charrad2014">
<p>Charrad, Malika, Nadia Ghazzali, Véronique Boiteau, and Azam Niknafs. 2014. “NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set.” <em>Journal of Statistical Software</em> 61: 1–36. <a href="http://www.jstatsoft.org/v61/i06/paper" class="uri">http://www.jstatsoft.org/v61/i06/paper</a>.</p>
</div>
<div id="ref-kaufman1990">
<p>Kaufman, Leonard, and Peter Rousseeuw. 1990. <em>Finding Groups in Data: An Introduction to Cluster Analysis</em>.</p>
</div>
</div>
</div>
</div><!--end rdoc-->
 
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
  (function () {
    var script = document.createElement("script");
    script.type = "text/javascript";
    script.src  = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
    document.getElementsByTagName("head")[0].appendChild(script);
  })();
</script>

<!-- END HTML -->]]></description>
			<pubDate>Thu, 07 Sep 2017 09:14:00 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Assessing Clustering Tendency: Essentials]]></title>
			<link>https://www.sthda.com/english/articles/29-cluster-validation-essentials/95-assessing-clustering-tendency-essentials/</link>
			<guid>https://www.sthda.com/english/articles/29-cluster-validation-essentials/95-assessing-clustering-tendency-essentials/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">

<p>Before applying any clustering method on your data, it’s important to evaluate whether the data set contains meaningful clusters (i.e., non-random structures) or not. If it does, the next question is how many clusters there are. This process is defined as assessing the clustering tendency, or the feasibility of the clustering analysis.</p>
<p>A big issue in cluster analysis is that clustering methods will return clusters even if the data does not contain any. In other words, if you blindly apply a clustering method to a data set, it will divide the data into clusters because that is what it is supposed to do.</p>
<div class="block">
<p>
In this chapter, we start by describing why we should evaluate the clustering tendency before applying any clustering method to a data set. Next, we provide statistical and visual methods for assessing the clustering tendency.
</p>
</div>
<p>Contents:</p>
<div id="TOC">
<ul>
<li><a href="#required-r-packages">Required R packages</a></li>
<li><a href="#data-preparation">Data preparation</a></li>
<li><a href="#visual-inspection-of-the-data">Visual inspection of the data</a></li>
<li><a href="#why-assessing-clustering-tendency">Why assessing clustering tendency?</a></li>
<li><a href="#methods-for-assessing-clustering-tendency">Methods for assessing clustering tendency</a><ul>
<li><a href="#statistical-methods">Statistical methods</a></li>
<li><a href="#visual-methods">Visual methods</a></li>
</ul></li>
<li><a href="#summary">Summary</a></li>
<li><a href="#references">References</a></li>
</ul>
</div><br/>
<p>Related Book:</p>
<div class = "small-block content-privileged-friends cluster-book">
    <center>
        <a href = "https://www.sthda.com/english/web/5-bookadvisor/17-practical-guide-to-cluster-analysis-in-r/">
          <img src = "https://www.sthda.com/english/sthda-upload/images/cluster-analysis/clustering-book-cover.png" /><br/>
      Practical Guide to Cluster Analysis in R
      </a>
      </center>
</div>
<div class="spacer"></div>


<div id="required-r-packages" class="section level2">
<h2>Required R packages</h2>
<ul>
<li><em>factoextra</em> for data visualization</li>
<li><em>clustertend</em> for statistical assessment of clustering tendency</li>
</ul>
<p>To install the two packages, type this:</p>
<pre class="r"><code>install.packages(c("factoextra", "clustertend"))</code></pre>
</div>
<div id="data-preparation" class="section level2">
<h2>Data preparation</h2>
<p>We’ll use two data sets:</p>
<ul>
<li>the built-in R data set iris.</li>
<li>and a random data set generated from the iris data set.</li>
</ul>
<p>The iris data set looks like this:</p>
<pre class="r"><code>head(iris, 3)</code></pre>
<pre><code>##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa</code></pre>
<p>We start by excluding the column “Species” at position 5:</p>
<pre class="r"><code># Iris data set
df <- iris[, -5]
# Random data generated from the iris data set
random_df <- apply(df, 2, 
                function(x){runif(length(x), min(x), (max(x)))})
random_df <- as.data.frame(random_df)
# Standardize the data sets
df <- iris.scaled <- scale(df)
random_df <- scale(random_df)</code></pre>
</div>
<div id="visual-inspection-of-the-data" class="section level2">
<h2>Visual inspection of the data</h2>
<p>We start by visualizing the data to assess whether they contain any meaningful clusters.</p>
<p>As the data contain more than two variables, we need to reduce the dimensionality in order to draw a scatter plot. This can be done using the principal component analysis (PCA) algorithm (R function: <em>prcomp</em>()). After performing PCA, we use the function <em>fviz_pca_ind</em>() [<em>factoextra</em> R package] to visualize the output.</p>
<p>The iris and the random data sets can be illustrated as follows:</p>
<pre class="r"><code>library("factoextra")
# Plot the iris data set
fviz_pca_ind(prcomp(df), title = "PCA - Iris data", 
             habillage = iris$Species,  palette = "jco",
             geom = "point", ggtheme = theme_classic(),
             legend = "bottom")

# Plot the random df
fviz_pca_ind(prcomp(random_df), title = "PCA - Random data", 
             geom = "point", ggtheme = theme_classic())</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/014-assessing-clustering-tendency-principal-component-analysis-1.png" width="288" /><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/014-assessing-clustering-tendency-principal-component-analysis-2.png" width="288" /></p>
<div class="success">
<p>
It can be seen that the iris data set contains 3 real clusters. However, the randomly generated data set doesn’t contain any meaningful clusters.
</p>
</div>
</div>
<div id="why-assessing-clustering-tendency" class="section level2">
<h2>Why assess clustering tendency?</h2>
<p>In order to illustrate why it’s important to assess cluster tendency, we start by computing k-means clustering (Chapter @ref(kmeans-clustering)) and hierarchical clustering (Chapter @ref(agglomerative-clustering)) on the two data sets (the real and the random data). The functions <em>fviz_cluster</em>() and <em>fviz_dend</em>() [in <em>factoextra</em> R package] will be used to visualize the results.</p>
<pre class="r"><code>library(factoextra)
set.seed(123)
# K-means on iris dataset
km.res1 <- kmeans(df, 3)
fviz_cluster(list(data = df, cluster = km.res1$cluster),
             ellipse.type = "norm", geom = "point", stand = FALSE,
             palette = "jco", ggtheme = theme_classic())</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/014-assessing-clustering-tendency-k-means-real-data-1.png" width="288" /></p>
<pre class="r"><code># K-means on the random dataset
km.res2 <- kmeans(random_df, 3)
fviz_cluster(list(data = random_df, cluster = km.res2$cluster),
             ellipse.type = "norm", geom = "point", stand = FALSE,
             palette = "jco", ggtheme = theme_classic())

# Hierarchical clustering on the random dataset
fviz_dend(hclust(dist(random_df)), k = 3, k_colors = "jco",  
          as.ggplot = TRUE, show_labels = FALSE)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/014-assessing-clustering-tendency-k-means-random-data-1.png" width="288" /><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/014-assessing-clustering-tendency-k-means-random-data-2.png" width="288" /></p>
<div class="warning">
<p>
It can be seen that the k-means algorithm and hierarchical clustering impose a classification on the random, uniformly distributed data set even though it contains no meaningful clusters. This is why clustering tendency assessment methods should be used to evaluate the validity of a clustering analysis, that is, whether a given data set contains meaningful clusters.
</p>
</div>
</div>
<div id="methods-for-assessing-clustering-tendency" class="section level2">
<h2>Methods for assessing clustering tendency</h2>
<p>In this section, we’ll describe two methods for evaluating the clustering tendency: i) a statistical method (the <em>Hopkins statistic</em>) and ii) a visual method (the <em>Visual Assessment of cluster Tendency</em> (VAT) algorithm).</p>
<div id="statistical-methods" class="section level3">
<h3>Statistical methods</h3>
<p>The <em>Hopkins statistic</em> <span class="citation">(Lawson and Jurs 1990)</span> is used to assess the clustering tendency of a data set by measuring the probability that a given data set is generated by a uniform data distribution. In other words, it tests the spatial randomness of the data.</p>
<p>For example, let D be a real data set. The Hopkins statistic can be calculated as follows:</p>
<ol style="list-style-type: decimal">
<li><p>Sample uniformly <span class="math inline">\(n\)</span> points (<span class="math inline">\(p_1\)</span>,…, <span class="math inline">\(p_n\)</span>) from D.</p></li>
<li><p>Compute the distance, <span class="math inline">\(x_i\)</span>, from each real point to its nearest neighbor: For each point <span class="math inline">\(p_i \in D\)</span>, find its nearest neighbor <span class="math inline">\(p_j\)</span>; then compute the distance between <span class="math inline">\(p_i\)</span> and <span class="math inline">\(p_j\)</span> and denote it as <span class="math inline">\(x_i = dist(p_i, p_j)\)</span>.</p></li>
<li><p>Generate a simulated data set (<span class="math inline">\(random_D\)</span>) drawn from a random uniform distribution with <span class="math inline">\(n\)</span> points (<span class="math inline">\(q_1\)</span>,…, <span class="math inline">\(q_n\)</span>) and the same variation as the original real data set D.</p></li>
<li><p>Compute the distance, <span class="math inline">\(y_i\)</span>, from each artificial point to its nearest real data point: For each point <span class="math inline">\(q_i \in random_D\)</span>, find its nearest neighbor <span class="math inline">\(q_j\)</span> in D; then compute the distance between <span class="math inline">\(q_i\)</span> and <span class="math inline">\(q_j\)</span> and denote it <span class="math inline">\(y_i = dist(q_i, q_j)\)</span>.</p></li>
<li><p>Calculate the Hopkins statistic (H) as the mean nearest neighbor distance in the simulated data set divided by the sum of the mean nearest neighbor distances in the real and the simulated data sets.</p></li>
</ol>
<p>The formula is defined as follows:</p>
<p><span class="math display">\[H = \frac{\sum\limits_{i=1}^ny_i}{\sum\limits_{i=1}^nx_i + \sum\limits_{i=1}^ny_i}\]</span></p>
<p>How to interpret the Hopkins statistics? If <span class="math inline">\(D\)</span> were uniformly distributed, then <span class="math inline">\(\sum\limits_{i=1}^ny_i\)</span> and <span class="math inline">\(\sum\limits_{i=1}^nx_i\)</span> would be close to each other, and thus <span class="math inline">\(H\)</span> would be about 0.5. However, if clusters are present in D, then the distances for artificial points (<span class="math inline">\(\sum\limits_{i=1}^ny_i\)</span>) would be substantially larger than for the real ones (<span class="math inline">\(\sum\limits_{i=1}^nx_i\)</span>) in expectation, and thus the value of <span class="math inline">\(H\)</span> will increase <span class="citation">(Han, Kamber, and Pei 2012)</span>.</p>
<p>A value for <span class="math inline">\(H\)</span> higher than 0.75 indicates a clustering tendency at the 90% confidence level.</p>
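<p>To make the steps above concrete, here is a minimal hand-rolled sketch of the computation in R (for illustration only; the function name hopkins_sketch is arbitrary). The packages described below provide ready-made implementations:</p>
<pre class="r"><code># Illustrative sketch of the Hopkins statistic (not a package implementation)
hopkins_sketch <- function(data, n = 10, seed = 123) {
  set.seed(seed)
  data <- as.matrix(data)
  # Distance from a point to its nearest (other) neighbor in a reference set
  nn_dist <- function(point, ref) {
    d <- sqrt(rowSums(sweep(ref, 2, point)^2))
    min(d[d > 0])
  }
  # 1-2. Sample n real points and compute x_i, their nearest neighbor distances in D
  idx <- sample(nrow(data), n)
  x <- apply(data[idx, , drop = FALSE], 1, nn_dist, ref = data)
  # 3. Simulate n uniform points within the range of each variable
  q <- sapply(seq_len(ncol(data)),
              function(j) runif(n, min(data[, j]), max(data[, j])))
  # 4. y_i: distance from each simulated point to its nearest real point
  y <- apply(q, 1, nn_dist, ref = data)
  # 5. H = sum(y) / (sum(x) + sum(y))
  sum(y) / (sum(x) + sum(y))
}
hopkins_sketch(df)  # df: the scaled iris data prepared above</code></pre>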
<p>The null and the alternative hypotheses are defined as follows:</p>
<ul>
<li><strong>Null hypothesis</strong>: the data set D is uniformly distributed (i.e., no meaningful clusters)</li>
<li><strong>Alternative hypothesis</strong>: the data set D is not uniformly distributed (i.e., contains meaningful clusters)</li>
</ul>
<div class="success">
<p>
We can conduct the Hopkins Statistic test iteratively, using 0.5 as the threshold to reject the alternative hypothesis. That is, if H < 0.5, then it is unlikely that D has statistically significant clusters.
</p>
<p>
Put in other words, if the value of the Hopkins statistic is close to 1, then we can reject the null hypothesis and conclude that the data set D contains statistically significant clusters.
</p>
</div>
<p>Here, we present two R functions / packages to statistically evaluate clustering tendency by computing the Hopkins statistics:</p>
<ol style="list-style-type: decimal">
<li><code>get_clust_tendency()</code> function [in factoextra package]. It returns the Hopkins statistics as defined in the formula above. The result is a list containing two elements:
<ul>
<li>hopkins_stat</li>
<li>and plot</li>
</ul></li>
<li><code>hopkins()</code> function [in clustertend package]. It implements 1 - H, the complement of the definition of H provided here.</li>
</ol>
<p>In the R code below, we’ll use the factoextra R package. Make sure that you have the latest version (or install: <code>devtools::install_github("kassambara/factoextra")</code>).</p>
<pre class="r"><code>library(factoextra)
# Compute Hopkins statistic for iris dataset
res <- get_clust_tendency(df, n = nrow(df)-1, graph = FALSE)
res$hopkins_stat</code></pre>
<pre><code>## [1] 0.818</code></pre>
<pre class="r"><code># Compute Hopkins statistic for a random dataset
res <- get_clust_tendency(random_df, n = nrow(random_df)-1,
                          graph = FALSE)
res$hopkins_stat</code></pre>
<pre><code>## [1] 0.466</code></pre>
<div class="success">
<p>
It can be seen that the iris data set is highly clusterable (the <strong>H</strong> value = 0.82, which is far above the threshold of 0.5). However, the random_df data set is not clusterable (<span class="math inline"><em>H</em> = 0.47</span>).
</p>
</div>
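<p>For comparison, the clustertend package can be used as well. A brief sketch, assuming its hopkins() function takes the data and the number of points to sample; as noted above, it returns 1 - H, so values close to 0 suggest clusterable data:</p>
<pre class="r"><code>library(clustertend)
set.seed(123)
# Returns 1 - H (values close to 0 suggest a clusterable data set)
hopkins(df, n = nrow(df) - 1)</code></pre>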
</div>
<div id="visual-methods" class="section level3">
<h3>Visual methods</h3>
<p>The algorithm of the visual assessment of cluster tendency (VAT) approach (Bezdek and Hathaway, 2002) is as follows:</p>
<div class="block">
<ol style="list-style-type: decimal">
<li>
<p>
Compute the dissimilarity (DM) matrix between the objects in the data set using the Euclidean distance measure
</p>
</li>
<li>
<p>
Reorder the DM so that similar objects are close to one another. This process creates an ordered dissimilarity matrix (ODM)
</p>
</li>
<li>
<p>
The ODM is displayed as an ordered dissimilarity image (ODI), which is the visual output of VAT
</p>
</li>
</ol>
</div>
<p>For the visual assessment of clustering tendency, we start by computing the dissimilarity matrix between observations using the function <em>dist</em>(). Next the function <em>fviz_dist</em>() [factoextra package] is used to display the dissimilarity matrix.</p>
<pre class="r"><code>fviz_dist(dist(df), show_labels = FALSE)+
  labs(title = "Iris data")

fviz_dist(dist(random_df), show_labels = FALSE)+
  labs(title = "Random data")</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/014-assessing-clustering-tendency-dissimilarity-matrix-1.png" width="288" /><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/014-assessing-clustering-tendency-dissimilarity-matrix-2.png" width="288" /></p>
<ul>
<li>Red: high similarity (i.e., low dissimilarity) | Blue: low similarity</li>
</ul>
<p>The color level is proportional to the value of the dissimilarity between observations: pure red if <span class="math inline">\(dist(x_i, x_j) = 0\)</span> and pure blue if <span class="math inline">\(dist(x_i, x_j) = 1\)</span>. Objects belonging to the same cluster are displayed in consecutive order.</p>
<div class="success">
<p>
The dissimilarity matrix image confirms that there is a cluster structure in the iris data set but not in the random one.
</p>
</div>
<p>The VAT detects the clustering tendency in a visual form by counting the number of square shaped dark blocks along the diagonal in a VAT image.</p>
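<p>Note that, as described earlier, get_clust_tendency() also returns a plot element, so the Hopkins statistic and an ordered dissimilarity image can be obtained in one call. A brief sketch:</p>
<pre class="r"><code># Hopkins statistic and ordered dissimilarity image in one call
res <- get_clust_tendency(df, n = nrow(df) - 1, graph = TRUE)
res$hopkins_stat
res$plot</code></pre>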
</div>
</div>
<div id="summary" class="section level2">
<h2>Summary</h2>
<p>In this article, we described how to assess clustering tendency using the Hopkins statistic and a visual method. After showing that a data set is clusterable, the next step is to determine the optimal number of clusters in the data. This will be described in the next chapter.</p>
</div>
<div id="references" class="section level2 unnumbered">
<h2>References</h2>
<div id="refs" class="references">
<div id="ref-tagkey2012">
<p>Han, Jiawei, Micheline Kamber, and Jian Pei. 2012. <em>Data Mining: Concepts and Techniques</em>. 3rd ed. Boston: Morgan Kaufmann. <a href="https://doi.org/10.1016/B978-0-12-381479-1.00016-2" class="uri">https://doi.org/10.1016/B978-0-12-381479-1.00016-2</a>.</p>
</div>
<div id="ref-lawson1990">
<p>Lawson, Richard G., and Peter C. Jurs. 1990. “New Index for Clustering Tendency and Its Application to Chemical Problems.” <em>Journal of Chemical Information and Computer Sciences</em> 30 (1): 36–41. <a href="http://pubs.acs.org/doi/abs/10.1021/ci00065a010" class="uri">http://pubs.acs.org/doi/abs/10.1021/ci00065a010</a>.</p>
</div>
</div>
</div>


</div><!--end rdoc-->

 
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
  (function () {
    var script = document.createElement("script");
    script.type = "text/javascript";
    script.src  = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
    document.getElementsByTagName("head")[0].appendChild(script);
  })();
</script>

<!-- END HTML -->]]></description>
			<pubDate>Thu, 07 Sep 2017 08:45:00 +0200</pubDate>
			
		</item>
		
	</channel>
</rss>
