Having shown how to compute hierarchical clustering (Chapter @ref(agglomerative-clustering)), we now describe how to compare two dendrograms using the dendextend R package.
The dendextend package provides several functions for comparing dendrograms. Here, we’ll focus on two functions:
- tanglegram() for visual comparison of two dendrograms
- and cor.dendlist() for computing a correlation matrix between dendrograms.
We’ll use the USArrests data set from base R, and we start by standardizing the variables with the function scale(), as follows:
df <- scale(USArrests)
To keep the plots generated in the next sections readable, we’ll work with a small random subset of the data set. We use the function sample() to randomly select 10 of the 50 observations it contains:
# Subset containing 10 rows
set.seed(123)
ss <- sample(1:50, 10)
df <- df[ss, ]
We start by computing hierarchical clustering (HC) with two different linkage methods (“average” and “ward.D2”). Next, we convert the results to dendrograms and create a list to hold them.
library(dendextend)
# Compute distance matrix
res.dist <- dist(df, method = "euclidean")
# Compute 2 hierarchical clusterings
hc1 <- hclust(res.dist, method = "average")
hc2 <- hclust(res.dist, method = "ward.D2")
# Create two dendrograms
dend1 <- as.dendrogram(hc1)
dend2 <- as.dendrogram(hc2)
# Create a list to hold dendrograms
dend_list <- dendlist(dend1, dend2)
Visual comparison of two dendrograms
To visually compare two dendrograms, we’ll use the tanglegram() function [dendextend package], which plots the two dendrograms, side by side, with their labels connected by lines.
The quality of the alignment of the two trees can be measured using the function entanglement(). Entanglement is a measure between 1 (full entanglement) and 0 (no entanglement). A lower entanglement coefficient corresponds to a good alignment.
- Draw a tanglegram:
tanglegram(dend1, dend2)
- Customize the tanglegram using additional options, as follows:
tanglegram(dend1, dend2,
  highlight_distinct_edges = FALSE, # Turn off dashed lines
  common_subtrees_color_lines = FALSE, # Turn off line colors
  common_subtrees_color_branches = TRUE, # Color common branches
  main = paste("entanglement =", round(entanglement(dend_list), 2))
)
Note that “unique” nodes, with a combination of labels/items not present in the other tree, are highlighted with dashed lines.
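The entanglement value described above can also be computed on its own, without drawing anything. The sketch below rebuilds the two dendrograms so that it runs standalone, then compares the entanglement before and after rotating branches with untangle() from dendextend. The choice of the "step2side" method is an assumption made for illustration; it is one of several heuristics the function offers, and none of them changes the clustering itself.

```r
library(dendextend)

# Rebuild the two dendrograms used in this section
df <- scale(USArrests)
set.seed(123)
df <- df[sample(1:50, 10), ]
res.dist <- dist(df, method = "euclidean")
dend1 <- as.dendrogram(hclust(res.dist, method = "average"))
dend2 <- as.dendrogram(hclust(res.dist, method = "ward.D2"))

# Entanglement of the two trees with their current leaf order
ent_before <- entanglement(dend1, dend2)

# Assumption: "step2side" is picked here only as an example heuristic.
# untangle() rotates branches of the trees and returns a list of two dendrograms.
untangled <- untangle(dend1, dend2, method = "step2side")
ent_after <- entanglement(untangled[[1]], untangled[[2]])

# Show both values side by side
round(c(before = ent_before, after = ent_after), 2)
```

The two rotated trees, untangled[[1]] and untangled[[2]], can then be passed to tanglegram() in the same way as dend1 and dend2.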
Correlation matrix between a list of dendrograms
The function cor.dendlist() computes a “Baker” or “Cophenetic” correlation matrix for a list of trees. Values range from -1 to 1, with values near 0 meaning that the two trees are not statistically similar.
# Cophenetic correlation matrix
cor.dendlist(dend_list, method = "cophenetic")
##       [,1]  [,2]
## [1,] 1.000 0.965
## [2,] 0.965 1.000
# Baker correlation matrix
cor.dendlist(dend_list, method = "baker")
##       [,1]  [,2]
## [1,] 1.000 0.962
## [2,] 0.962 1.000
The correlation between two trees can also be computed as follows:
# Cophenetic correlation coefficient
cor_cophenetic(dend1, dend2)
## [1] 0.965
# Baker correlation coefficient
cor_bakers_gamma(dend1, dend2)
## [1] 0.962
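To make the cophenetic coefficient less of a black box, the following sketch reproduces it by hand: it extracts the cophenetic distances of each tree with cophenetic() from base R and takes their Pearson correlation. The variable names manual and packaged are introduced here for illustration only, and the data are rebuilt so that the block runs standalone.

```r
library(dendextend)

# Rebuild the two hierarchical clusterings used above
df <- scale(USArrests)
set.seed(123)
df <- df[sample(1:50, 10), ]
res.dist <- dist(df, method = "euclidean")
hc1 <- hclust(res.dist, method = "average")
hc2 <- hclust(res.dist, method = "ward.D2")

# cophenetic() on hclust objects returns distances in the original
# row order, so the two vectors refer to the same pairs of observations
coph1 <- as.vector(cophenetic(hc1))
coph2 <- as.vector(cophenetic(hc2))

# Pearson correlation between the two sets of cophenetic distances
manual <- cor(coph1, coph2)

# Same quantity computed by dendextend on the dendrograms
packaged <- cor_cophenetic(as.dendrogram(hc1), as.dendrogram(hc2))

# Compare the two values up to numerical tolerance
all.equal(manual, packaged)
```

Working from the hclust objects keeps the observations in the same order for both trees, which avoids having to reorder the distance matrices by label before correlating them.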
It’s also possible to compare multiple dendrograms simultaneously. The chaining operator %>% passes the result of one function to the next, which simplifies the code:
# Create multiple dendrograms by chaining
dend1 <- df %>% dist %>% hclust("complete") %>% as.dendrogram
dend2 <- df %>% dist %>% hclust("single") %>% as.dendrogram
dend3 <- df %>% dist %>% hclust("average") %>% as.dendrogram
dend4 <- df %>% dist %>% hclust("centroid") %>% as.dendrogram
# Compute correlation matrix
dend_list <- dendlist("Complete" = dend1, "Single" = dend2,
                      "Average" = dend3, "Centroid" = dend4)
cors <- cor.dendlist(dend_list)
# Print correlation matrix
round(cors, 2)
##          Complete Single Average Centroid
## Complete     1.00   0.76    0.99     0.75
## Single       0.76   1.00    0.80     0.84
## Average      0.99   0.80    1.00     0.74
## Centroid     0.75   0.84    0.74     1.00
# Visualize the correlation matrix using corrplot package
library(corrplot)
corrplot(cors, method = "pie", type = "lower")
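For readers less familiar with the chaining used above, the sketch below checks that a piped call gives the same clustering as the equivalent nested call. It assumes that %>% is made available by loading dendextend, as in the chained code in this section; the names hc_piped and hc_nested are introduced here for illustration only.

```r
library(dendextend)  # assumption: provides %>% as in the chained code above

# Rebuild the scaled 10-row subset
df <- scale(USArrests)
set.seed(123)
df <- df[sample(1:50, 10), ]

# Piped form: each result becomes the first argument of the next call
hc_piped <- df %>% dist %>% hclust("complete")

# Equivalent nested form
hc_nested <- hclust(dist(df), "complete")

# Both forms produce the same merge sequence and merge heights
identical(hc_piped$merge, hc_nested$merge)
all.equal(hc_piped$height, hc_nested$height)
```

The piped form reads left to right in the order the steps are applied, which is why it is convenient when the same sequence is repeated for several linkage methods.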