<?xml version="1.0" encoding="UTF-8" ?>
<!-- RSS generated by PHPBoost on Sat, 27 Jun 2026 10:40:58 +0200 -->

<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
	<channel>
		<title><![CDATA[Last articles - STHDA : Partitioning Clustering Essentials]]></title>
		<atom:link href="https://www.sthda.com/english/syndication/rss/articles/27" rel="self" type="application/rss+xml"/>
		<link>https://www.sthda.com</link>
		<description><![CDATA[Last articles - STHDA : Partitioning Clustering Essentials]]></description>
		<copyright>(C) 2005-2026 PHPBoost</copyright>
		<language>en</language>
		<generator>PHPBoost</generator>
		
		
		<item>
			<title><![CDATA[CLARA - Clustering Large Applications]]></title>
			<link>https://www.sthda.com/english/articles/27-partitioning-clustering-essentials/89-clara-clustering-large-applications/</link>
			<guid>https://www.sthda.com/english/articles/27-partitioning-clustering-essentials/89-clara-clustering-large-applications/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">
<p><strong>CLARA</strong> (Clustering Large Applications, <span class="citation">(Kaufman and Rousseeuw 1990)</span>) is an extension of the k-medoids method (Chapter @ref(k-medoids)) designed for data containing a large number of objects (more than several thousand observations). It reduces computing time and RAM storage requirements by using a sampling approach.</p>
<br/>
<p>Contents: </p>
<div id="TOC">
<ul>
<li><a href="#clara-concept">CLARA concept</a></li>
<li><a href="#clara-algorithm">CLARA Algorithm</a></li>
<li><a href="#computing-clara-in-r">Computing CLARA in R</a><ul>
<li><a href="#data-format-and-preparation">Data format and preparation</a></li>
<li><a href="#required-r-packages-and-functions">Required R packages and functions</a></li>
<li><a href="#estimating-the-optimal-number-of-clusters">Estimating the optimal number of clusters</a></li>
<li><a href="#computing-clara">Computing CLARA</a></li>
<li><a href="#visualizing-clara-clusters">Visualizing CLARA clusters</a></li>
</ul></li>
<li><a href="#summary">Summary</a></li>
<li><a href="#references">References</a></li>
</ul>
</div>
<br/>
<p>Related Books:</p>
<div class = "small-block content-privileged-friends cluster-book">
    <center>
        <a href = "https://www.sthda.com/english/web/5-bookadvisor/17-practical-guide-to-cluster-analysis-in-r/">
          <img src = "https://www.sthda.com/english/sthda-upload/images/cluster-analysis/clustering-book-cover.png" /><br/>
      Practical Guide to Cluster Analysis in R
      </a>
      </center>
</div>
<div class="spacer"></div>
<div id="clara-concept" class="section level2">
<h2>CLARA concept</h2>
<p>Instead of finding medoids for the entire data set, CLARA considers a small sample of the data with fixed size (<em>sampsize</em>) and applies the PAM algorithm (Chapter @ref(k-medoids)) to generate an optimal set of medoids for the sample. The quality of resulting medoids is measured by the average dissimilarity between every object in the entire data set and the medoid of its cluster, defined as the cost function.</p>
<p>CLARA repeats the sampling and clustering processes a pre-specified number of times in order to minimize the sampling bias. The final clustering results correspond to the set of medoids with the minimal cost. The CLARA algorithm is summarized in the next section.</p>
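<p>As a minimal sketch of this cost function (not the internal code of any package), the snippet below computes the average dissimilarity between every object and its nearest medoid, assuming Euclidean distance. The helper name <em>clara_cost</em> and the toy data are illustrative assumptions.</p>
<pre class="r"><code># Hypothetical helper: average Euclidean distance from each row of x
# to its nearest medoid (medoids are given as row indices of x).
clara_cost <- function(x, medoid_idx) {
  x <- as.matrix(x)
  med <- x[medoid_idx, , drop = FALSE]
  # Distances between every object (rows) and every medoid (columns)
  d <- sapply(seq_len(nrow(med)), function(j) {
    sqrt(rowSums(sweep(x, 2, med[j, ])^2))
  })
  d <- matrix(d, nrow = nrow(x))
  mean(apply(d, 1, min))
}

# Toy data: two tight groups on a line
toy <- matrix(c(0, 1, 2, 10, 11, 12), ncol = 1)
clara_cost(toy, c(2, 5))  # medoids at 1 and 11: cost 2/3
clara_cost(toy, c(1, 4))  # medoids at 0 and 10: cost 1</code></pre>
<p>The set of medoids with the lower cost (here, the centrally located points 1 and 11) is the one CLARA would retain.</p>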
</div>
<div id="clara-algorithm" class="section level2">
<h2>CLARA Algorithm</h2>
<p>The algorithm is as follows:</p>
<div class="block">
<ol style="list-style-type: decimal">
<li>
<p>
Randomly draw multiple subsets of fixed size (sampsize) from the data set.
</p>
</li>
<li>
<p>
Run the PAM algorithm on each subset and choose the corresponding k representative objects (medoids). Assign each observation of the entire data set to the closest medoid.
</p>
</li>
<li>
<p>
Calculate the mean (or the sum) of the dissimilarities of the observations to their closest medoid. This is used as a measure of the goodness of the clustering.
</p>
</li>
<li>
<p>
Retain the sub-dataset for which the mean (or sum) is minimal. A further analysis is carried out on the final partition.
</p>
</li>
</ol>
</div>
<p>Note that each sub-data set is forced to contain the medoids obtained from the best sub-data set found so far. Randomly drawn observations are added to this set until sampsize has been reached.</p>
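<p>The steps above can be sketched in a few lines of R. This is a simplified illustration under stated assumptions, not the code used by <em>clara</em>(): the function name <em>clara_sketch</em> is hypothetical, Euclidean distance is assumed, and the sample size is set to 40 + 2k, which matches the 44 objects listed in the best sample of the k = 2 example later in this chapter.</p>
<pre class="r"><code>library(cluster)  # provides pam()

# Simplified CLARA-style loop (illustrative sketch only)
clara_sketch <- function(x, k, samples = 5, sampsize = 40 + 2 * k) {
  x <- as.matrix(x)
  n <- nrow(x)
  best <- list(cost = Inf, medoids = NULL)
  for (s in seq_len(samples)) {
    # Step 1: keep the best medoids found so far, then add randomly
    # drawn observations until sampsize is reached
    keep <- best$medoids
    idx <- c(keep, sample(setdiff(seq_len(n), keep), sampsize - length(keep)))
    # Step 2: run PAM on the sample; map its medoids back to rows of x
    med <- idx[pam(x[idx, , drop = FALSE], k)$id.med]
    # Distances from every observation of the full data set to each medoid
    d <- sapply(med, function(m) sqrt(rowSums(sweep(x, 2, x[m, ])^2)))
    # Step 3: mean dissimilarity to the closest medoid
    cost <- mean(apply(d, 1, min))
    # Step 4: retain the sample with the minimal cost
    if (cost < best$cost) {
      best <- list(cost = cost, medoids = med,
                   clustering = apply(d, 1, which.min))
    }
  }
  best
}

# Usage on simulated data with two well-separated groups
set.seed(123)
x <- rbind(matrix(rnorm(200, 0, 1), ncol = 2),
           matrix(rnorm(200, 10, 1), ncol = 2))
res <- clara_sketch(x, k = 2, samples = 10)
res$medoids            # row indices of the retained medoids
table(res$clustering)  # cluster sizes</code></pre>
<p>Only the distances between the observations and the k candidate medoids are computed for the full data set, which is what keeps the memory footprint small compared with running PAM on all observations.</p>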
</div>
<div id="computing-clara-in-r" class="section level2">
<h2>Computing CLARA in R</h2>
<div id="data-format-and-preparation" class="section level3">
<h3>Data format and preparation</h3>
<p>To compute the CLARA algorithm in R, the data should be prepared as indicated in Chapter @ref(data-preparation-and-r-packages).</p>
<p>Here, we’ll generate a random data set. To make the result reproducible, we start by using the function <em>set.seed</em>().</p>
<pre class="r"><code>set.seed(1234)
# Generate 500 objects, divided into 2 clusters.
df <- rbind(cbind(rnorm(200,0,8), rnorm(200,0,8)),
           cbind(rnorm(300,50,8), rnorm(300,50,8)))
# Specify column and row names
colnames(df) <- c("x", "y")
rownames(df) <- paste0("S", 1:nrow(df))
# Previewing the data
head(df, n = 6)</code></pre>
<pre><code>##         x    y
## S1  -9.66 3.88
## S2   2.22 5.57
## S3   8.68 1.48
## S4 -18.77 5.61
## S5   3.43 2.49
## S6   4.05 6.08</code></pre>
</div>
<div id="required-r-packages-and-functions" class="section level3">
<h3>Required R packages and functions</h3>
<p>The function <em>clara</em>() [<em>cluster</em> package] can be used to compute <em>CLARA</em>. The simplified format is as follows:</p>
<pre class="r"><code>clara(x, k, metric = "euclidean", stand = FALSE, 
      samples = 5, pamLike = FALSE)</code></pre>
<div class="block">
<ul>
<li>
<strong>x</strong>: a numeric data matrix or data frame, each row corresponds to an observation, and each column corresponds to a variable. Missing values (NAs) are allowed.
</li>
<li>
<strong>k</strong>: the number of clusters.
</li>
<li>
<strong>metric</strong>: the distance metric to be used. Available options are “euclidean” and “manhattan”. Euclidean distances are root sum-of-squares of differences, and Manhattan distances are the sum of absolute differences. Read more in the chapter on distance measures. Note that Manhattan distance is less sensitive to outliers.
</li>
<li>
<strong>stand</strong>: logical value; if true, the variables (columns) in x are standardized before calculating the dissimilarities. Note that, it’s recommended to standardize variables before clustering.
</li>
<li>
<strong>samples</strong>: number of samples to be drawn from the data set. The default value is 5, but a much larger value is recommended.
</li>
<li>
<strong>pamLike</strong>: logical indicating whether the same algorithm as in the <strong>pam</strong>() function should be used. This should always be set to TRUE.
</li>
</ul>
</div>
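<p>As a hedged illustration of these arguments, the call below standardizes the variables, uses the Manhattan metric and draws 50 samples. The toy data frame and its column names are assumptions made up for this example.</p>
<pre class="r"><code>library(cluster)

set.seed(42)
# Toy data (illustrative): two groups measured on very different scales
toy <- data.frame(
  height_cm = c(rnorm(50, 160, 5), rnorm(50, 185, 5)),
  income    = c(rnorm(50, 20000, 2000), rnorm(50, 60000, 2000))
)

# Standardize the columns, use Manhattan distance, and draw 50 samples
fit <- clara(toy, k = 2, metric = "manhattan", stand = TRUE,
             samples = 50, pamLike = TRUE)
fit$i.med              # row indices of the medoids
table(fit$clustering)  # cluster sizes</code></pre>
<p>Without <em>stand = TRUE</em>, the income column would dominate the dissimilarities simply because its values are numerically larger.</p>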
<p>To create a beautiful graph of the clusters generated with the <em>clara</em>() function, we’ll use the <em>factoextra</em> package.</p>
<ol style="list-style-type: decimal">
<li>Installing required packages:</li>
</ol>
<pre class="r"><code>install.packages(c("cluster", "factoextra"))</code></pre>
<ol start="2" style="list-style-type: decimal">
<li>Loading the packages:</li>
</ol>
<pre class="r"><code>library(cluster)
library(factoextra)</code></pre>
</div>
<div id="estimating-the-optimal-number-of-clusters" class="section level3">
<h3>Estimating the optimal number of clusters</h3>
<p>To estimate the optimal number of clusters in your data, it’s possible to use the average silhouette method as described in PAM clustering chapter (Chapter @ref(k-medoids)). The R function <em>fviz_nbclust</em>() [<em>factoextra</em> package] provides a solution to facilitate this step.</p>
<pre class="r"><code>library(cluster)
library(factoextra)
fviz_nbclust(df, clara, method = "silhouette")+
  theme_classic()</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/008-clara-clustering-large-application-clara-optimal-clusters-wss-1.png" width="518.4" /></p>
<div class="success">
<p>
From the plot, the suggested number of clusters is 2. In the next section, we’ll classify the observations into 2 clusters.
</p>
</div>
</div>
<div id="computing-clara" class="section level3">
<h3>Computing CLARA</h3>
<p>The R code below computes the CLARA algorithm with k = 2:</p>
<pre class="r"><code># Compute CLARA
clara.res <- clara(df, 2, samples = 50, pamLike = TRUE)
# Print components of clara.res
print(clara.res)</code></pre>
<pre><code>## Call:     clara(x = df, k = 2, samples = 50, pamLike = TRUE) 
## Medoids:
##          x     y
## S121 -1.53  1.15
## S455 48.36 50.23
## Objective function:   9.88
## Clustering vector:    Named int [1:500] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "names")= chr [1:500] "S1" "S2" "S3" "S4" "S5" "S6" "S7" ...
## Cluster sizes:            200 300 
## Best sample:
##  [1] S37  S49  S54  S63  S68  S71  S76  S80  S82  S101 S103 S108 S109 S118
## [15] S121 S128 S132 S138 S144 S162 S203 S210 S216 S231 S234 S249 S260 S261
## [29] S286 S299 S304 S305 S312 S315 S322 S350 S403 S450 S454 S455 S456 S465
## [43] S488 S497
## 
## Available components:
##  [1] "sample"     "medoids"    "i.med"      "clustering" "objective" 
##  [6] "clusinfo"   "diss"       "call"       "silinfo"    "data"</code></pre>
<p>The output of the function <em>clara</em>() includes the following components:</p>
<ul>
<li><strong>medoids</strong>: Objects that represent clusters</li>
<li><strong>clustering</strong>: a vector containing the cluster number of each object</li>
<li><strong>sample</strong>: labels or case numbers of the observations in the best sample, that is, the sample used by the clara algorithm for the final partition.</li>
</ul>
<p>If you want to add the point classifications to the original data, use this:</p>
<pre class="r"><code>dd <- cbind(df, cluster = clara.res$clustering)
head(dd, n = 4)</code></pre>
<pre><code>##         x    y cluster
## S1  -9.66 3.88       1
## S2   2.22 5.57       1
## S3   8.68 1.48       1
## S4 -18.77 5.61       1</code></pre>
<p>You can access the results returned by <em>clara</em>() as follows:</p>
<pre class="r"><code># Medoids
clara.res$medoids</code></pre>
<pre><code>##          x     y
## S121 -1.53  1.15
## S455 48.36 50.23</code></pre>
<pre class="r"><code># Clustering
head(clara.res$clustering, 10)</code></pre>
<pre><code>##  S1  S2  S3  S4  S5  S6  S7  S8  S9 S10 
##   1   1   1   1   1   1   1   1   1   1</code></pre>
<p><span class="success">The <strong>medoids</strong> are S121, S455</span></p>
</div>
<div id="visualizing-clara-clusters" class="section level3">
<h3>Visualizing CLARA clusters</h3>
<p>To visualize the partitioning results, we’ll use the function <em>fviz_cluster</em>() [<em>factoextra</em> package]. It draws a scatter plot of data points colored by cluster numbers.</p>
<pre class="r"><code>fviz_cluster(clara.res, 
             palette = c("#00AFBB", "#FC4E07"), # color palette
             ellipse.type = "t", # Concentration ellipse
             geom = "point", pointsize = 1,
             ggtheme = theme_classic()
             )</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/008-clara-clustering-large-application-clara-k-medoids-clustering-large-data-sets-plot-1.png" width="432" /></p>
</div>
</div>
<div id="summary" class="section level2">
<h2>Summary</h2>
<p>The CLARA (Clustering Large Applications) algorithm is an extension of the PAM (Partitioning Around Medoids) clustering method for large data sets. It is intended to reduce the computation time for large data sets.</p>
<p>Like almost all partitioning algorithms, it requires the user to specify the appropriate number of clusters to be produced. This can be estimated using the function <em>fviz_nbclust</em> [in <em>factoextra</em> R package].</p>
<p>The R function <em>clara</em>() [<em>cluster</em> package] can be used to compute the CLARA algorithm. The simplified format is clara(x, k, pamLike = TRUE), where “x” is the data and k is the number of clusters to be generated.</p>
<p>After computing CLARA, the R function <em>fviz_cluster</em>() [<em>factoextra</em> package] can be used to visualize the results. The format is fviz_cluster(clara.res), where clara.res is the CLARA result.</p>
</div>
<div id="references" class="section level2 unnumbered">
<h2>References</h2>
<div id="refs" class="references">
<div id="ref-kaufman1990">
<p>Kaufman, Leonard, and Peter Rousseeuw. 1990. <em>Finding Groups in Data: An Introduction to Cluster Analysis</em>.</p>
</div>
</div>
</div>
</div><!--end rdoc-->

<!-- END HTML -->]]></description>
			<pubDate>Mon, 04 Sep 2017 22:23:00 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[K-Medoids Essentials]]></title>
			<link>https://www.sthda.com/english/articles/27-partitioning-clustering-essentials/88-k-medoids-essentials/</link>
			<guid>https://www.sthda.com/english/articles/27-partitioning-clustering-essentials/88-k-medoids-essentials/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">
<p>The <strong>k-medoids algorithm</strong> is a clustering approach related to k-means clustering (chapter @ref(kmeans-clustering)) for partitioning a data set into k groups or clusters. In k-medoids clustering, each cluster is represented by one of the data points in the cluster. These points are named cluster medoids.</p>
<p>The term medoid refers to an object within a cluster for which the average dissimilarity between it and all the other members of the cluster is minimal. It corresponds to the most centrally located point in the cluster. These objects (one per cluster) can be considered as representative examples of the members of that cluster, which may be useful in some situations. Recall that, in k-means clustering, the center of a given cluster is calculated as the mean value of all the data points in the cluster.</p>
<p>K-medoids is a robust alternative to k-means clustering. This means that the algorithm is less sensitive to noise and outliers than k-means, because it uses medoids as cluster centers instead of means (used in k-means).</p>
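<p>A small sketch may help to see why a medoid is less affected by outliers than a mean. The helper name <em>find_medoid</em> and the toy values are assumptions for illustration; Euclidean distance is assumed.</p>
<pre class="r"><code># Hypothetical helper: index of the medoid of a set of points, i.e. the
# object with the smallest average dissimilarity to all the others.
find_medoid <- function(x) {
  d <- as.matrix(dist(x))  # pairwise Euclidean distances
  which.min(rowMeans(d))
}

pts <- c(1, 2, 3, 4, 100)  # 100 is an outlier
mean(pts)                  # 22: pulled toward the outlier
pts[find_medoid(pts)]      # 3: an actual data point in the bulk of the data</code></pre>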
<p>The k-medoids algorithm requires the user to specify k, the number of clusters to be generated (like in k-means clustering). A useful approach to determine the optimal number of clusters is the <strong>silhouette</strong> method, described in the next sections.</p>
<p>The most common k-medoids clustering method is the <strong>PAM</strong> algorithm (<strong>Partitioning Around Medoids</strong>, <span class="citation">(Kaufman and Rousseeuw 1990)</span>).</p>
<div class="block">
<p>
In this article, we’ll describe the PAM algorithm and provide practical examples using <strong>R</strong> software. In the next chapter, we’ll also discuss a variant of PAM named <strong>CLARA</strong> (Clustering Large Applications) which is used for analyzing large data sets.
</p>
</div>
<br/>
<p>Contents:</p>
<div id="TOC">
<ul>
<li><a href="#pam-concept">PAM concept</a></li>
<li><a href="#pam-algorithm">PAM algorithm</a></li>
<li><a href="#computing-pam-in-r">Computing PAM in R</a><ul>
<li><a href="#data">Data</a></li>
<li><a href="#required-r-packages-and-functions">Required R packages and functions</a></li>
<li><a href="#estimating-the-optimal-number-of-clusters">Estimating the optimal number of clusters</a></li>
<li><a href="#computing-pam-clustering">Computing PAM clustering</a></li>
<li><a href="#accessing-to-the-results-of-the-pam-function">Accessing to the results of the pam() function</a></li>
<li><a href="#visualizing-pam-clusters">Visualizing PAM clusters</a></li>
</ul></li>
<li><a href="#summary">Summary</a></li>
<li><a href="#references">References</a></li>
</ul>
</div>
<br/>
<p>Related Books:</p>
<div class = "small-block content-privileged-friends cluster-book">
    <center>
        <a href = "https://www.sthda.com/english/web/5-bookadvisor/17-practical-guide-to-cluster-analysis-in-r/">
          <img src = "https://www.sthda.com/english/sthda-upload/images/cluster-analysis/clustering-book-cover.png" /><br/>
      Practical Guide to Cluster Analysis in R
      </a>
      </center>
</div>
<div class="spacer"></div>
<div id="pam-concept" class="section level2">
<h2>PAM concept</h2>
<p>The use of means implies that k-means clustering is highly sensitive to outliers. This can severely affect the assignment of observations to clusters. A more robust algorithm is provided by the <strong>PAM</strong> algorithm.</p>
</div>
<div id="pam-algorithm" class="section level2">
<h2>PAM algorithm</h2>
<p>The PAM algorithm is based on the search for k representative objects or medoids among the observations of the data set.</p>
<p>After finding a set of k medoids, clusters are constructed by assigning each observation to the nearest medoid.
Next, each selected medoid m and each non-medoid data point are swapped and the objective function is computed. The objective function corresponds to the sum of the dissimilarities of all objects to their nearest medoid.</p>
<p>The SWAP step attempts to improve the quality of the clustering by exchanging selected objects (medoids) and non-selected objects. If the objective function can be reduced by interchanging a selected object with an unselected object, then the swap is carried out. This is continued until the objective function can no longer be decreased. The goal is to find k representative objects which minimize the sum of the dissimilarities of the observations to their closest representative object.</p>
<p>In summary, the PAM algorithm proceeds in two phases as follows:</p>
<div class="block">
<p>
<strong>Build phase</strong>:
</p>
<ol style="list-style-type: decimal">
<li>
Select k objects to become the medoids, or in case these objects were provided use them as the medoids;
</li>
<li>
Calculate the dissimilarity matrix if it was not provided;
</li>
<li>
Assign every object to its closest medoid;
</li>
</ol>
<p>
<strong>Swap phase</strong>:
</p>
<ol start="4" style="list-style-type: decimal">
<li>
For each cluster, search whether any object of the cluster decreases the average dissimilarity coefficient; if so, select the object that decreases this coefficient the most as the medoid for this cluster;
</li>
<li>
If at least one medoid has changed, go to (3); otherwise, end the algorithm.
</li>
</ol>
</div>
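<p>The two phases can be sketched as follows. This is a naive illustration that works on a dissimilarity matrix, not the optimized code of <em>pam</em>(): the function name <em>pam_sketch</em> is hypothetical, and the build phase is assumed to be greedy (medoids are added one at a time).</p>
<pre class="r"><code># Naive PAM sketch: greedy BUILD followed by SWAP on a dissimilarity matrix
pam_sketch <- function(d, k) {
  d <- as.matrix(d)
  n <- nrow(d)
  # Objective: sum of dissimilarities of all objects to their nearest medoid
  cost <- function(med) sum(apply(d[, med, drop = FALSE], 1, min))
  # BUILD: add, one at a time, the object that lowers the objective the most
  med <- integer(0)
  for (j in seq_len(k)) {
    cand <- setdiff(seq_len(n), med)
    med <- c(med, cand[which.min(sapply(cand, function(i) cost(c(med, i))))])
  }
  # SWAP: exchange a medoid with a non-medoid while the objective decreases
  repeat {
    best <- cost(med)
    best_med <- med
    for (m in seq_along(med)) {
      for (i in setdiff(seq_len(n), med)) {
        trial <- med
        trial[m] <- i
        if (cost(trial) < best) {
          best <- cost(trial)
          best_med <- trial
        }
      }
    }
    if (identical(best_med, med)) break  # no swap improves the objective
    med <- best_med
  }
  list(medoids = med,
       clustering = apply(d[, med, drop = FALSE], 1, which.min),
       objective = cost(med))
}

# Usage on a toy example with two obvious groups
toy <- c(0, 1, 2, 10, 11, 12)
res <- pam_sketch(dist(toy), k = 2)
toy[res$medoids]  # 1 and 11
res$objective     # 4</code></pre>
<p>In this toy example the build phase first selects the values 2 and 11 (objective 5); a single swap then replaces 2 with 1, which lowers the objective to 4, after which no further swap helps.</p>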
<p>As mentioned above, the PAM algorithm works with a dissimilarity matrix. To compute this matrix, the algorithm can use two metrics:</p>
<ol style="list-style-type: decimal">
<li>The Euclidean distance, which is the root sum-of-squares of differences;</li>
<li>The Manhattan distance, which is the sum of absolute differences.</li>
</ol>
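<p>As a short worked example of the two metrics, consider two illustrative points a = (1, 2) and b = (4, 6), chosen here only for the arithmetic:</p>
<pre class="r"><code>a <- c(1, 2)
b <- c(4, 6)

sqrt(sum((a - b)^2))  # Euclidean: sqrt(3^2 + 4^2) = 5
sum(abs(a - b))       # Manhattan: 3 + 4 = 7

# The same values obtained with dist()
dist(rbind(a, b), method = "euclidean")
dist(rbind(a, b), method = "manhattan")</code></pre>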
<div class="warning">
<p>
Note that, in practice, you should get similar results most of the time, using either euclidean or Manhattan distance. If your data contains outliers, Manhattan distance should give more robust results, whereas euclidean would be influenced by unusual values.
</p>
</div>
<p>Read more on distance measures in Chapter @ref(clustering-distance-measures).</p>
</div>
<div id="computing-pam-in-r" class="section level2">
<h2>Computing PAM in R</h2>
<div id="data" class="section level3">
<h3>Data</h3>
<p>We’ll use the demo data set “USArrests”, which we start by scaling (Chapter @ref(data-preparation-and-r-packages)) using the R function <em>scale()</em> as follows:</p>
<pre class="r"><code>data("USArrests")      # Load the data set
df <- scale(USArrests) # Scale the data
head(df, n = 3)        # View the first 3 rows of the data</code></pre>
<pre><code>##         Murder Assault UrbanPop     Rape
## Alabama 1.2426   0.783   -0.521 -0.00342
## Alaska  0.5079   1.107   -1.212  2.48420
## Arizona 0.0716   1.479    0.999  1.04288</code></pre>
</div>
<div id="required-r-packages-and-functions" class="section level3">
<h3>Required R packages and functions</h3>
<p>The functions <em>pam</em>() [<em>cluster</em> package] and <em>pamk()</em> [<em>fpc</em> package] can be used to compute <strong>PAM</strong>.</p>
<p><span class="notice">The function <em>pamk</em>() does not require a user to decide the number of clusters K.</span></p>
<p>In the following examples, we’ll describe only the function <em>pam</em>(), whose simplified format is:</p>
<pre class="r"><code>pam(x, k, metric = "euclidean", stand = FALSE)</code></pre>
<div class="block">
<ul>
<li>
<strong>x</strong>: possible values include:
<ul>
<li>
Numeric data matrix or numeric data frame: each row corresponds to an observation, and each column corresponds to a variable.
</li>
<li>
Dissimilarity matrix: in this case x is typically the output of <strong>daisy()</strong> or <strong>dist()</strong>
</li>
</ul>
</li>
<li>
<strong>k</strong>: The number of clusters
</li>
<li>
<strong>metric</strong>: the distance metric to be used. Available options are “euclidean” and “manhattan”.
</li>
<li>
<strong>stand</strong>: logical value; if true, the variables (columns) in x are standardized before calculating the dissimilarities. Ignored when x is a dissimilarity matrix.
</li>
</ul>
</div>
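<p>As a brief sketch of the second form of <strong>x</strong>, the code below passes a dissimilarity matrix computed with <em>dist</em>() instead of the raw data. The choice of the Manhattan metric here is only an example.</p>
<pre class="r"><code>library(cluster)

df <- scale(USArrests)
d <- dist(df, method = "manhattan")  # dissimilarity matrix
fit <- pam(d, k = 2)                 # metric and stand play no role here
fit$medoids                          # labels of the medoid observations
head(fit$clustering)</code></pre>
<p>Because only dissimilarities are supplied, the medoids are reported as observation labels rather than as rows of coordinates.</p>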
<p>To create a beautiful graph of the clusters generated with the <em>pam</em>() function, we’ll use the <em>factoextra</em> package.</p>
<ol style="list-style-type: decimal">
<li>Installing required packages:</li>
</ol>
<pre class="r"><code>install.packages(c("cluster", "factoextra"))</code></pre>
<ol start="2" style="list-style-type: decimal">
<li>Loading the packages:</li>
</ol>
<pre class="r"><code>library(cluster)
library(factoextra)</code></pre>
</div>
<div id="estimating-the-optimal-number-of-clusters" class="section level3">
<h3>Estimating the optimal number of clusters</h3>
<p>To estimate the optimal number of clusters, we’ll use the average silhouette method. The idea is to compute the PAM algorithm using different numbers of clusters k. Next, the average cluster silhouette is plotted against the number of clusters. The average silhouette measures the quality of a clustering. A high average silhouette width indicates a good clustering. The optimal number of clusters k is the one that maximizes the average silhouette over a range of possible values for k <span class="citation">(Kaufman and Rousseeuw 1990)</span>.</p>
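<p>The idea can be sketched by hand with a simple loop over k, reading the average silhouette width stored in the <em>silinfo</em> component of each PAM result. The range 2 to 10 is an assumption chosen for illustration.</p>
<pre class="r"><code>library(cluster)

df <- scale(USArrests)
ks <- 2:10
# Average silhouette width of the PAM solution for each candidate k
avg_sil <- sapply(ks, function(k) pam(df, k)$silinfo$avg.width)
names(avg_sil) <- ks
round(avg_sil, 3)
ks[which.max(avg_sil)]  # k with the highest average silhouette width</code></pre>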
<p>The R function <em>fviz_nbclust</em>() [<em>factoextra</em> package] provides a convenient solution to estimate the optimal number of clusters.</p>
<pre class="r"><code>library(cluster)
library(factoextra)
fviz_nbclust(df, pam, method = "silhouette")+
  theme_classic()</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/007-k-medoids-pam-optimal-clusters-wss-1.png" width="518.4" /></p>
<div class="success">
<p>
From the plot, the suggested number of clusters is 2. In the next section, we’ll classify the observations into 2 clusters.
</p>
</div>
</div>
<div id="computing-pam-clustering" class="section level3">
<h3>Computing PAM clustering</h3>
<p>The R code below computes the PAM algorithm with k = 2:</p>
<pre class="r"><code>pam.res <- pam(df, 2)
print(pam.res)</code></pre>
<pre><code>## Medoids:
##            ID Murder Assault UrbanPop   Rape
## New Mexico 31  0.829   1.371    0.308  1.160
## Nebraska   27 -0.801  -0.825   -0.245 -0.505
## Clustering vector:
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              1              1              2              1 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              1              2              2              1              1 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              2              2              1              2              2 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              2              2              1              2              1 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              2              1              2              1              1 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              2              2              1              2              2 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              1              1              1              2              2 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              2              2              2              2              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              2              1              1              2              2 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              2              2              2              2              2 
## Objective function:
## build  swap 
##  1.44  1.37 
## 
## Available components:
##  [1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 
##  [6] "clusinfo"   "silinfo"    "diss"       "call"       "data"</code></pre>
<div class="success">
<p>
The printed output shows:
</p>
<ul>
<li>
the cluster medoids: a matrix whose rows are the medoids and whose columns are the variables
</li>
<li>
the clustering vector: A vector of integers (from 1:k) indicating the cluster to which each point is allocated
</li>
</ul>
</div>
<p>If you want to add the point classifications to the original data, use this:</p>
<pre class="r"><code>dd <- cbind(USArrests, cluster = pam.res$clustering)
head(dd, n = 3)</code></pre>
<pre><code>##         Murder Assault UrbanPop Rape cluster
## Alabama   13.2     236       58 21.2       1
## Alaska    10.0     263       48 44.5       1
## Arizona    8.1     294       80 31.0       1</code></pre>
</div>
<div id="accessing-to-the-results-of-the-pam-function" class="section level3">
<h3>Accessing to the results of the pam() function</h3>
<p>The function <em>pam</em>() returns an object of class <em>pam</em> whose components include:</p>
<ul>
<li><strong>medoids</strong>: Objects that represent clusters</li>
<li><strong>clustering</strong>: a vector containing the cluster number of each object</li>
</ul>
<p>These components can be accessed as follows:</p>
<pre class="r"><code># Cluster medoids: New Mexico, Nebraska
pam.res$medoids</code></pre>
<pre><code>##            Murder Assault UrbanPop   Rape
## New Mexico  0.829   1.371    0.308  1.160
## Nebraska   -0.801  -0.825   -0.245 -0.505</code></pre>
<pre class="r"><code># Cluster numbers
head(pam.res$clustering)</code></pre>
<pre><code>##    Alabama     Alaska    Arizona   Arkansas California   Colorado 
##          1          1          1          2          1          1</code></pre>
</div>
<div id="visualizing-pam-clusters" class="section level3">
<h3>Visualizing PAM clusters</h3>
<p>To visualize the partitioning results, we’ll use the function <em>fviz_cluster</em>() [<em>factoextra</em> package]. It draws a scatter plot of data points colored by cluster numbers. If the data contains more than 2 variables, the <a href="factominer-and-factoextra-principal-component-analysis-visualization-r-software-and-data-mining"><em>Principal Component Analysis (PCA)</em></a> algorithm is used to reduce the dimensionality of the data. In this case, the first two principal dimensions are used to plot the data.</p>
<pre class="r"><code>fviz_cluster(pam.res, 
             palette = c("#00AFBB", "#FC4E07"), # color palette
             ellipse.type = "t", # Concentration ellipse
             repel = TRUE, # Avoid label overplotting (slow)
             ggtheme = theme_classic()
             )</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/007-k-medoids-pam-k-medoids-clustering-plot-1.png" width="480" /></p>
</div>
</div>
<div id="summary" class="section level2">
<h2>Summary</h2>
<p>The K-medoids algorithm, PAM, is a robust alternative to k-means for partitioning a data set into clusters of observations.</p>
<p>In the k-medoids method, each cluster is represented by a selected object within the cluster. The selected objects are named medoids and correspond to the most centrally located points within the cluster.</p>
<p>The PAM algorithm requires the user to know the data and to indicate the appropriate number of clusters to be produced. This can be estimated using the function <em>fviz_nbclust</em> [in <em>factoextra</em> R package].</p>
<p>The R function <em>pam</em>() [<em>cluster</em> package] can be used to compute the PAM algorithm. The simplified format is pam(x, k), where “x” is the data and k is the number of clusters to be generated.</p>
<p>After performing PAM clustering, the R function <em>fviz_cluster</em>() [<strong>factoextra</strong> package] can be used to visualize the results. The format is fviz_cluster(pam.res), where pam.res is the PAM result.</p>
<div class="warning">
<p>
Note that, for large data sets, <em>pam</em>() may need too much memory or too much computation time. In this case, the function <em>clara</em>() is preferable. This should not be a problem for modern computers.
</p>
</div>
</div>
<div id="references" class="section level2 unnumbered">
<h2>References</h2>
<div id="refs" class="references">
<div id="ref-kaufman1990">
<p>Kaufman, Leonard, and Peter Rousseeuw. 1990. <em>Finding Groups in Data: An Introduction to Cluster Analysis</em>.</p>
</div>
</div>
</div>
</div><!--end rdoc-->
 
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
  (function () {
    var script = document.createElement("script");
    script.type = "text/javascript";
    script.src  = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
    document.getElementsByTagName("head")[0].appendChild(script);
  })();
</script>

<!-- END HTML -->]]></description>
			<pubDate>Mon, 04 Sep 2017 22:09:00 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[K-Means Clustering Essentials]]></title>
			<link>https://www.sthda.com/english/articles/27-partitioning-clustering-essentials/87-k-means-clustering-essentials/</link>
			<guid>https://www.sthda.com/english/articles/27-partitioning-clustering-essentials/87-k-means-clustering-essentials/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">
<p><strong>K-means clustering</strong> <span class="citation">(MacQueen 1967)</span> is the most commonly used unsupervised machine learning algorithm for partitioning a given data set into a set of k groups (i.e. <em>k clusters</em>), where k represents the number of groups pre-specified by the analyst. It classifies objects into multiple groups (i.e., clusters), such that objects within the same cluster are as similar as possible (i.e., high <em>intra-class similarity</em>), whereas objects from different clusters are as dissimilar as possible (i.e., low <em>inter-class similarity</em>). In k-means clustering, each cluster is represented by its center (i.e., <em>centroid</em>) which corresponds to the mean of points assigned to the cluster.</p>
<div class="block">
<p>
In this article, we’ll describe the <strong>k-means algorithm</strong> and provide practical examples using <strong>R</strong> software.
</p>
</div>
<br/>
<p>Contents:</p>
<div id="TOC">
<ul>
<li><a href="#k-means-basic-ideas">K-means basic ideas</a></li>
<li><a href="#k-means-algorithm">K-means algorithm</a></li>
<li><a href="#computing-k-means-clustering-in-r">Computing k-means clustering in R</a><ul>
<li><a href="#data">Data</a></li>
<li><a href="#required-r-packages-and-functions">Required R packages and functions</a></li>
<li><a href="#estimating-the-optimal-number-of-clusters">Estimating the optimal number of clusters</a></li>
<li><a href="#computing-k-means-clustering">Computing k-means clustering</a></li>
<li><a href="#accessing-to-the-results-of-kmeans-function">Accessing to the results of kmeans() function</a></li>
<li><a href="#visualizing-k-means-clusters">Visualizing k-means clusters</a></li>
</ul></li>
<li><a href="#k-means-clustering-advantages-and-disadvantages">K-means clustering advantages and disadvantages</a></li>
<li><a href="#alternative-to-k-means-clustering">Alternative to k-means clustering</a></li>
<li><a href="#summary">Summary</a></li>
<li><a href="#references">References</a></li>
</ul>
</div>
<br/>
<p>Related Books:</p>
<div class = "small-block content-privileged-friends cluster-book">
    <center>
        <a href = "https://www.sthda.com/english/web/5-bookadvisor/17-practical-guide-to-cluster-analysis-in-r/">
          <img src = "https://www.sthda.com/english/sthda-upload/images/cluster-analysis/clustering-book-cover.png" /><br/>
      Practical Guide to Cluster Analysis in R
      </a>
      </center>
</div>
<div class="spacer"></div>
<div id="k-means-basic-ideas" class="section level2">
<h2>K-means basic ideas</h2>
<p>The basic idea behind k-means clustering consists of defining clusters so that the total intra-cluster variation (known as total within-cluster variation) is minimized.</p>
<p>There are several k-means algorithms available. The standard algorithm is the Hartigan-Wong algorithm <span class="citation">(Hartigan and Wong 1979)</span>, which defines the total within-cluster variation as the sum of squared Euclidean distances between items and the corresponding centroid:</p>
<p><span class="math display">\[
W(C_k) = \sum\limits_{x_i \in C_k} (x_i - \mu_k)^2
\]</span></p>
<ul>
<li><span class="math inline">\(x_i\)</span> denotes a data point belonging to the cluster <span class="math inline">\(C_k\)</span></li>
<li><span class="math inline">\(\mu_k\)</span> is the mean value of the points assigned to the cluster <span class="math inline">\(C_k\)</span></li>
</ul>
<p>Each observation (<span class="math inline">\(x_i\)</span>) is assigned to a cluster such that the sum of squared (SS) distances between the observations and their assigned cluster centers <span class="math inline">\(\mu_k\)</span> is minimized.</p>
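<p>We define the total within-cluster variation as follows, where <span class="math inline">\(K\)</span> denotes the total number of clusters (written k in the text):</p>
<p><span class="math display">\[
tot.withinss = \sum\limits_{k=1}^{K} W(C_k) = \sum\limits_{k=1}^{K} \sum\limits_{x_i \in C_k} (x_i - \mu_k)^2
\]</span></p>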
<p><span class="success">The <em>total within-cluster sum of squares</em> measures the compactness (i.e., <em>goodness</em>) of the clustering and we want it to be as small as possible.</span></p>
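<p>As a minimal sketch of the two formulas above, the within-cluster sums of squares can be recomputed by hand in base R and compared with the values returned by <em>kmeans</em>(). The object names (<em>df_demo</em>, <em>km_demo</em>, <em>wss_manual</em>) and the choice of 3 clusters are illustrative assumptions, not part of the analysis presented in this article.</p>
<pre class="r"><code># Illustrative sketch: recompute W(C_k) and tot.withinss by hand
df_demo <- scale(USArrests)     # standardized demo data
set.seed(123)
km_demo <- kmeans(df_demo, centers = 3, nstart = 25)

# W(C_k): sum of squared Euclidean distances between the points
# of cluster k and the cluster mean
wss_manual <- sapply(1:3, function(k) {
  pts <- df_demo[km_demo$cluster == k, , drop = FALSE]
  mu  <- colMeans(pts)
  sum(sweep(pts, 2, mu)^2)
})

# Both comparisons are expected to return TRUE
all.equal(wss_manual, km_demo$withinss)
all.equal(sum(wss_manual), km_demo$tot.withinss)</code></pre>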
</div>
<div id="k-means-algorithm" class="section level2">
<h2>K-means algorithm</h2>
<p>The first step when using k-means clustering is to indicate the number of clusters (k) that will be generated in the final solution.</p>
<p>The algorithm starts by randomly selecting k objects from the data set to serve as the initial centers for the clusters. The selected objects are also known as cluster means or centroids.</p>
<p>Next, each of the remaining objects is assigned to its closest centroid, where closest is defined using the Euclidean distance (Chapter @ref(clustering-distance-measures)) between the object and the cluster mean. This step is called the “cluster assignment step”. Note that, to use correlation distance, the data are input as z-scores.</p>
<p>After the assignment step, the algorithm computes the new mean value of each cluster. This step is called the “centroid update” step. Now that the centers have been recalculated, every observation is checked again to see if it might be closer to a different cluster. All the objects are reassigned using the updated cluster means.</p>
<p>The cluster assignment and centroid update steps are repeated until the cluster assignments stop changing (i.e., until <em>convergence</em> is achieved). That is, the clusters formed in the current iteration are the same as those obtained in the previous iteration.</p>
<p>The k-means algorithm can be summarized as follows:</p>
<div class="block">
<ol style="list-style-type: decimal">
<li>
Specify the number of clusters (K) to be created (by the analyst)
</li>
<li>
Randomly select k objects from the data set as the initial cluster centers or means
</li>
<li>
Assign each observation to its closest centroid, based on the Euclidean distance between the object and the centroid
</li>
<li>
For each of the k clusters, update the <em>cluster centroid</em> by calculating the new mean values of all the data points in the cluster. The centroid of the <span class="math inline"><em>k</em><sub><em>t</em><em>h</em></sub></span> cluster is a vector of length <span class="math inline"><em>p</em></span> containing the means of all variables for the observations in the <span class="math inline"><em>k</em><sub><em>t</em><em>h</em></sub></span> cluster; <em>p</em> is the number of variables.
</li>
<li>
Iteratively minimize the total within-cluster sum of squares. That is, iterate steps 3 and 4 until the cluster assignments stop changing or the maximum number of iterations is reached. The <strong>R</strong> software uses 10 as the default value for the maximum number of iterations.
</li>
</ol>
</div>
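<p>The five steps above can be sketched as a small base R function. This is a simplified, Lloyd-style illustration of the summary, not the Hartigan-Wong algorithm used by default in R; the function name <em>simple_kmeans</em> and its arguments are hypothetical and introduced only for this example.</p>
<pre class="r"><code># Illustrative sketch of steps 1 to 5 (Lloyd-style iteration)
simple_kmeans <- function(x, k, iter.max = 10) {
  x <- as.matrix(x)
  # Step 2: pick k distinct rows as the initial centers
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (iter in seq_len(iter.max)) {
    # Step 3: squared Euclidean distance of every row to every center,
    # then assign each row to the closest center
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster <- max.col(-d2, ties.method = "first")
    # Step 5: stop when the assignments no longer change
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Step 4: update each centroid as the mean of its points
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(cluster = cluster, centers = centers)
}

set.seed(123)
res_demo <- simple_kmeans(scale(USArrests), k = 4)
table(res_demo$cluster)   # cluster sizes</code></pre>
<p>Because this sketch uses a single random start and a different update rule, its partition may differ from the one returned by <em>kmeans</em>().</p>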
</div>
<div id="computing-k-means-clustering-in-r" class="section level2">
<h2>Computing k-means clustering in R</h2>
<div id="data" class="section level3">
<h3>Data</h3>
<p>We’ll use the demo data set “USArrests”. The data should be prepared as described in chapter @ref(data-preparation-and-r-packages). The data must contain only continuous variables, as the k-means algorithm uses variable means. As we don’t want the k-means algorithm to depend on an arbitrary variable unit, we start by scaling the data using the R function <em>scale()</em> as follows:</p>
<pre class="r"><code>data("USArrests")      # Loading the data set
df <- scale(USArrests) # Scaling the data
# View the first 3 rows of the data
head(df, n = 3)</code></pre>
<pre><code>##         Murder Assault UrbanPop     Rape
## Alabama 1.2426   0.783   -0.521 -0.00342
## Alaska  0.5079   1.107   -1.212  2.48420
## Arizona 0.0716   1.479    0.999  1.04288</code></pre>
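<p>As a quick, illustrative check of what <em>scale()</em> does, the column means and standard deviations of the scaled data can be inspected. The object name <em>df_check</em> is an assumption used only for this example.</p>
<pre class="r"><code># Illustrative check: after scaling, every variable has mean 0 and sd 1
df_check <- scale(USArrests)
round(colMeans(df_check), 10)   # column means
apply(df_check, 2, sd)          # column standard deviations</code></pre>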
</div>
<div id="required-r-packages-and-functions" class="section level3">
<h3>Required R packages and functions</h3>
<p>The standard R function for k-means clustering is <em>kmeans</em>() [<em>stats</em> package], whose simplified format is as follows:</p>
<pre class="r"><code>kmeans(x, centers, iter.max = 10, nstart = 1)</code></pre>
<div class="block">
<ul>
<li>
<strong>x</strong>: numeric matrix, numeric data frame or a numeric vector
</li>
<li>
<strong>centers</strong>: Possible values are the number of clusters (k) or a set of initial (distinct) cluster centers. If a number, a random set of (distinct) rows in x is chosen as the initial centers.
</li>
<li>
<strong>iter.max</strong>: The maximum number of iterations allowed. Default value is 10.
</li>
<li>
<strong>nstart</strong>: The number of random starting partitions when centers is a number. Trying nstart > 1 is often recommended.
</li>
</ul>
</div>
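<p>The two possible forms of the <em>centers</em> argument can be illustrated as follows. This is only a sketch: the object names and the four states used as starting centers are arbitrary choices made for the example.</p>
<pre class="r"><code># Illustrative sketch: two ways of specifying 'centers'
df_demo <- scale(USArrests)

# centers as a number: k rows of x are drawn at random as starting centers
set.seed(123)
km_num <- kmeans(df_demo, centers = 4, iter.max = 10, nstart = 1)

# centers as a matrix: the supplied (distinct) rows are the starting centers
init_centers <- df_demo[c("Alabama", "Alaska", "Iowa", "Ohio"), ]
km_mat <- kmeans(df_demo, centers = init_centers, iter.max = 10)

km_num$size   # cluster sizes from the random start
km_mat$size   # cluster sizes from the user-supplied start</code></pre>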
<p>To create a beautiful graph of the clusters generated with the <em>kmeans</em>() function, we’ll use the <em>factoextra</em> package.</p>
<ul>
<li>Installing <em>factoextra</em> package as:</li>
</ul>
<pre class="r"><code>install.packages("factoextra")</code></pre>
<ul>
<li>Loading <em>factoextra</em>:</li>
</ul>
<pre class="r"><code>library(factoextra)</code></pre>
</div>
<div id="estimating-the-optimal-number-of-clusters" class="section level3">
<h3>Estimating the optimal number of clusters</h3>
<p>The k-means clustering requires the users to specify the number of clusters to be generated.</p>
<p><span class="question">One fundamental question is: How to choose the right number of expected clusters (k)?</span></p>
<p>Different methods will be presented in the chapter “cluster evaluation and validation statistics”.</p>
<p>Here, we provide a simple solution. The idea is to compute k-means clustering using different numbers of clusters k. Next, the wss (within-cluster sum of squares) is plotted against the number of clusters. The location of a bend (knee) in the plot is generally considered as an indicator of the appropriate number of clusters.</p>
<p>The R function <em>fviz_nbclust</em>() [in <em>factoextra</em> package] provides a convenient solution to estimate the optimal number of clusters.</p>
<pre class="r"><code>library(factoextra)
fviz_nbclust(df, kmeans, method = "wss") +
    geom_vline(xintercept = 4, linetype = 2)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/006b-kmeans-clustering-k-means-optimal-clusters-wss-1.png" width="518.4" /></p>
<div class="success">
<p>
The plot above represents the total within-cluster sum of squares as a function of the number of clusters. It decreases as k increases, but a bend (or “elbow”) can be seen at k = 4. This bend indicates that additional clusters beyond the fourth have little value. In the next section, we’ll classify the observations into 4 clusters.
</p>
</div>
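<p>For readers who want to see what lies behind such an elbow plot, a base R sketch is given below. It is an illustration under our own assumptions (the object names, the range 1 to 10 and <em>nstart = 25</em>), not the internal code of <em>fviz_nbclust</em>().</p>
<pre class="r"><code># Illustrative sketch: compute and plot the wss for k = 1 to 10
df_demo <- scale(USArrests)
set.seed(123)
k_values <- 1:10
wss <- sapply(k_values, function(k) {
  kmeans(df_demo, centers = k, nstart = 25)$tot.withinss
})

plot(k_values, wss, type = "b", pch = 19,
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
abline(v = 4, lty = 2)   # mark the bend discussed above</code></pre>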
</div>
<div id="computing-k-means-clustering" class="section level3">
<h3>Computing k-means clustering</h3>
<p>As the k-means clustering algorithm starts with k randomly selected centroids, it’s always recommended to use the <em>set.seed()</em> function in order to set a seed for <em>R’s random number generator</em>. The aim is to make the results reproducible, so that the reader of this article will obtain exactly the same results as those shown below.</p>
<p>The R code below performs <em>k-means clustering</em> with k = 4:</p>
<pre class="r"><code># Compute k-means with k = 4
set.seed(123)
km.res <- kmeans(df, 4, nstart = 25)</code></pre>
<div class="warning">
<p>
As the final result of k-means clustering is sensitive to the random starting assignments, we specify <em>nstart = 25</em>. This means that R will try 25 different random starting assignments and then select the best result, i.e., the one with the lowest total within-cluster variation. The default value of <em>nstart</em> in R is 1. However, it’s strongly recommended to compute <em>k-means clustering</em> with a large value of <em>nstart</em>, such as 25 or 50, in order to obtain a more stable result.
</p>
</div>
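<p>The effect of <em>nstart</em> can be explored with a small comparison. This is a hedged sketch with hypothetical object names (<em>km_1</em>, <em>km_25</em>): with several starts, the retained total within-cluster sum of squares should generally not be larger than with a single start, but the exact values depend on the random seed.</p>
<pre class="r"><code># Illustrative sketch: one random start versus 25 random starts
df_demo <- scale(USArrests)

set.seed(123)
km_1 <- kmeans(df_demo, centers = 4, nstart = 1)

set.seed(123)
km_25 <- kmeans(df_demo, centers = 4, nstart = 25)

# Total within-cluster sum of squares retained in each case
c(nstart_1 = km_1$tot.withinss, nstart_25 = km_25$tot.withinss)</code></pre>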
<pre class="r"><code># Print the results
print(km.res)</code></pre>
<pre><code>## K-means clustering with 4 clusters of sizes 13, 16, 13, 8
## 
## Cluster means:
##   Murder Assault UrbanPop    Rape
## 1 -0.962  -1.107   -0.930 -0.9668
## 2 -0.489  -0.383    0.576 -0.2617
## 3  0.695   1.039    0.723  1.2769
## 4  1.412   0.874   -0.815  0.0193
## 
## Clustering vector:
##        Alabama         Alaska        Arizona       Arkansas     California 
##              4              3              3              4              3 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              3              2              2              3              4 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              2              1              3              2              1 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              2              1              4              1              3 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              2              3              1              4              3 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              1              1              3              1              2 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              3              3              4              1              2 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              2              2              2              2              4 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              1              4              3              2              1 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              2              2              1              1              2 
## 
## Within cluster sum of squares by cluster:
## [1] 11.95 16.21 19.92  8.32
##  (between_SS / total_SS =  71.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"</code></pre>
<div class="success">
<p>
The printed output displays:
</p>
<ul>
<li>
the cluster means or centers: a matrix whose rows are cluster numbers (1 to 4) and whose columns are variables
</li>
<li>
the clustering vector: A vector of integers (from 1:k) indicating the cluster to which each point is allocated
</li>
</ul>
</div>
<p>It’s possible to compute the mean of each variable by cluster using the original data:</p>
<pre class="r"><code>aggregate(USArrests, by=list(cluster=km.res$cluster), mean)</code></pre>
<pre><code>##   cluster Murder Assault UrbanPop Rape
## 1       1   3.60    78.5     52.1 12.2
## 2       2   5.66   138.9     73.9 18.8
## 3       3  10.82   257.4     76.0 33.2
## 4       4  13.94   243.6     53.8 21.4</code></pre>
<p>If you want to add the point classifications to the original data, use this:</p>
<pre class="r"><code>dd <- cbind(USArrests, cluster = km.res$cluster)
head(dd)</code></pre>
<pre><code>##            Murder Assault UrbanPop Rape cluster
## Alabama      13.2     236       58 21.2       4
## Alaska       10.0     263       48 44.5       3
## Arizona       8.1     294       80 31.0       3
## Arkansas      8.8     190       50 19.5       4
## California    9.0     276       91 40.6       3
## Colorado      7.9     204       78 38.7       3</code></pre>
</div>
<div id="accessing-to-the-results-of-kmeans-function" class="section level3">
<h3>Accessing the results of the kmeans() function</h3>
<p><strong>kmeans()</strong> function returns a list of components, including:</p>
<ul>
<li><strong>cluster</strong>: A vector of integers (from 1:k) indicating the cluster to which each point is allocated</li>
<li><strong>centers</strong>: A matrix of cluster centers (cluster means)</li>
<li><strong>totss</strong>: The total sum of squares (TSS), i.e., <span class="math inline">\(\sum{(x_i - \bar{x})^2}\)</span>. TSS measures the total variance in the data.</li>
<li><strong>withinss</strong>: Vector of within-cluster sum of squares, one component per cluster</li>
<li><strong>tot.withinss</strong>: Total within-cluster sum of squares, i.e. <span class="math inline">\(sum(withinss)\)</span></li>
<li><strong>betweenss</strong>: The between-cluster sum of squares, i.e. <span class="math inline">\(totss - tot.withinss\)</span></li>
<li><strong>size</strong>: The number of observations in each cluster</li>
</ul>
<p>These components can be accessed as follows:</p>
<pre class="r"><code># Cluster number for each of the observations
km.res$cluster</code></pre>
<pre class="r"><code>head(km.res$cluster, 4)</code></pre>
<pre><code>##  Alabama   Alaska  Arizona Arkansas 
##        4        3        3        4</code></pre>
<p>…..</p>
<pre class="r"><code># Cluster size
km.res$size</code></pre>
<pre><code>## [1] 13 16 13  8</code></pre>
<pre class="r"><code># Cluster means
km.res$centers</code></pre>
<pre><code>##   Murder Assault UrbanPop    Rape
## 1 -0.962  -1.107   -0.930 -0.9668
## 2 -0.489  -0.383    0.576 -0.2617
## 3  0.695   1.039    0.723  1.2769
## 4  1.412   0.874   -0.815  0.0193</code></pre>
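<p>The relationships between these components can be checked directly. The sketch below refits the model under the same settings as above so that it is self-contained; the object names <em>df_demo</em> and <em>km_demo</em> are assumptions made for the example.</p>
<pre class="r"><code># Illustrative sketch: how the components of a kmeans object relate
df_demo <- scale(USArrests)
set.seed(123)
km_demo <- kmeans(df_demo, 4, nstart = 25)

# totss = tot.withinss + betweenss
all.equal(km_demo$totss, km_demo$tot.withinss + km_demo$betweenss)
# tot.withinss = sum(withinss)
all.equal(km_demo$tot.withinss, sum(km_demo$withinss))
# the cluster sizes add up to the number of observations
sum(km_demo$size) == nrow(df_demo)

# Ratio reported as between_SS / total_SS in the printed output
km_demo$betweenss / km_demo$totss</code></pre>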
</div>
<div id="visualizing-k-means-clusters" class="section level3">
<h3>Visualizing k-means clusters</h3>
<p>It is a good idea to plot the cluster results. The plots can be used to assess the choice of the number of clusters as well as to compare two different cluster analyses.</p>
<p>Now, we want to visualize the data in a scatter plot, coloring each data point according to its cluster assignment.</p>
<p>The problem is that the data contains more than 2 variables and the question is what variables to choose for the xy scatter plot.</p>
<p>A solution is to reduce the number of dimensions by applying a dimensionality reduction algorithm, such as <a href="https://www.sthda.com/english/wiki/factominer-and-factoextra-principal-component-analysis-visualization-r-software-and-data-mining"><strong>Principal Component Analysis (PCA)</strong></a>, that operates on the four variables and outputs two new variables (that represent the original variables) that you can use to do the plot.</p>
<div class="success">
<p>
In other words, if we have a multi-dimensional data set, a solution is to perform Principal Component Analysis (PCA) and to plot data points according to the first two principal components coordinates.
</p>
<p>
The function <em>fviz_cluster</em>() [<em>factoextra</em> package] can be used to easily visualize k-means clusters. It takes k-means results and the original data as arguments. In the resulting plot, observations are represented by points, using principal components if the number of variables is greater than 2. It’s also possible to draw a concentration ellipse around each cluster.
</p>
</div>
<pre class="r"><code>fviz_cluster(km.res, data = df,
             palette = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"), 
             ellipse.type = "euclid", # Concentration ellipse
             star.plot = TRUE, # Add segments from centroids to items
             repel = TRUE, # Avoid label overplotting (slow)
             ggtheme = theme_minimal()
             )</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/cluster-analysis/006b-kmeans-clustering-k-means-plot-ggplot2-factoextra-1.png" width="672" /></p>
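<p>The idea of plotting clusters on the first two principal components can also be sketched with base R only. This is a rough illustration using <em>prcomp</em>() and <em>plot</em>(), with hypothetical object names; it is not a reproduction of what <em>fviz_cluster</em>() does internally.</p>
<pre class="r"><code># Illustrative sketch: clusters on the first two principal components
df_demo <- scale(USArrests)
set.seed(123)
km_demo <- kmeans(df_demo, 4, nstart = 25)

pca_demo <- prcomp(df_demo)    # PCA on the already scaled data
coords <- pca_demo$x[, 1:2]    # coordinates on PC1 and PC2

plot(coords, col = km_demo$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2")
text(coords, labels = rownames(df_demo), pos = 3, cex = 0.6)</code></pre>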
</div>
</div>
<div id="k-means-clustering-advantages-and-disadvantages" class="section level2">
<h2>K-means clustering advantages and disadvantages</h2>
<p>K-means clustering is a very simple and fast algorithm. It can efficiently deal with very large data sets. However, there are some weaknesses, including:</p>
<div class="warning">
<ol style="list-style-type: decimal">
<li>
<p>
It assumes prior knowledge of the data and requires the analyst to choose the appropriate number of clusters (k) in advance
</p>
</li>
<li>
The final results are sensitive to the initial random selection of cluster centers. Why is this a problem? Because, for every run of the algorithm on the same data set, a different set of initial centers may be chosen. This may lead to different clustering results on different runs of the algorithm.
</li>
<li>
<p>
It’s sensitive to outliers.
</p>
</li>
<li>
<p>
If you rearrange your data, it’s very possible that you’ll get a different solution every time you change the ordering of your data.
</p>
</li>
</ol>
</div>
<p>Possible solutions to these weaknesses include:</p>
<div class="success">
<ol style="list-style-type: decimal">
<li>
<p>
Solution to issue 1: Compute k-means for a range of k values, for example by varying k between 2 and 10. Then, choose the best k by comparing the clustering results obtained for the different k values.
</p>
</li>
<li>
<p>
Solution to issue 2: Compute the k-means algorithm several times with different initial cluster centers. The run with the lowest total within-cluster sum of squares is selected as the final clustering solution.
</p>
</li>
<li>
<p>
Solution to issue 3: To avoid distortions caused by excessive outliers, it’s possible to use the PAM algorithm, which is less sensitive to outliers.
</p>
</li>
</ol>
</div>
</div>
<div id="alternative-to-k-means-clustering" class="section level2">
<h2>Alternative to k-means clustering</h2>
<p>A robust alternative to k-means is PAM, which is based on medoids. As discussed in the next chapter, PAM clustering can be computed using the function <em>pam</em>() [<em>cluster</em> package]. The function <em>pamk</em>() [<em>fpc</em> package] is a wrapper for PAM that also prints the suggested number of clusters based on optimum average silhouette width.</p>
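<p>A minimal sketch of a PAM call is shown below, assuming that the <em>cluster</em> package is installed. The choice of k = 4 simply mirrors the k-means example and the object names are illustrative.</p>
<pre class="r"><code># Illustrative sketch: PAM with 4 clusters on the scaled data
library(cluster)
df_demo <- scale(USArrests)
pam_demo <- pam(df_demo, k = 4)

pam_demo$medoids            # the observations chosen as cluster medoids
table(pam_demo$clustering)  # cluster sizes</code></pre>
<p>Unlike k-means centroids, each medoid is an actual observation of the data set.</p>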
</div>
<div id="summary" class="section level2">
<h2>Summary</h2>
<p>K-means clustering can be used to classify observations into k groups, based on their similarity. Each group is represented by the mean value of points in the group, known as the cluster centroid.</p>
<p>The k-means algorithm requires users to specify the number of clusters to generate. The R function <em>kmeans</em>() [<em>stats</em> package] can be used to compute the k-means algorithm. The simplified format is kmeans(x, centers), where “x” is the data and centers is the number of clusters to be produced.</p>
<p>After computing k-means clustering, the R function <em>fviz_cluster</em>() [<em>factoextra</em> package] can be used to visualize the results. The format is fviz_cluster(km.res, data), where km.res is the k-means result and data corresponds to the original data set.</p>
</div>
<div id="references" class="section level2 unnumbered">
<h2>References</h2>
<div id="refs" class="references">
<div id="ref-hartigan1979">
<p>Hartigan, JA, and MA Wong. 1979. “Algorithm AS 136: A K-means clustering algorithm.” <em>Applied Statistics</em>. Royal Statistical Society, 100–108.</p>
</div>
<div id="ref-macqueen1967">
<p>MacQueen, J. 1967. “Some Methods for Classification and Analysis of Multivariate Observations.” In <em>Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics</em>, 281–97. Berkeley, Calif.: University of California Press. <a href="http://projecteuclid.org:443/euclid.bsmsp/1200512992" class="uri">http://projecteuclid.org:443/euclid.bsmsp/1200512992</a>.</p>
</div>
</div>
</div>
</div><!--end rdoc-->
 
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
  (function () {
    var script = document.createElement("script");
    script.type = "text/javascript";
    script.src  = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
    document.getElementsByTagName("head")[0].appendChild(script);
  })();
</script>

<!-- END HTML -->]]></description>
			<pubDate>Mon, 04 Sep 2017 21:17:00 +0200</pubDate>
			
		</item>
		
	</channel>
</rss>
