Advanced Clustering

This course presents advanced clustering techniques, including hierarchical k-means clustering, fuzzy clustering, model-based clustering, and density-based clustering.

Related Book

Practical Guide to Cluster Analysis in R



Lessons

  1. Fuzzy clustering is also known as soft clustering. Standard clustering approaches (K-means, PAM) produce partitions in which each observation belongs to exactly one cluster; this is known as hard clustering. In fuzzy clustering, items can belong to more than one cluster, and each item has a set of membership coefficients corresponding to its degree of belonging to each cluster. In this article, we describe how to compute fuzzy clustering using the R software; a minimal fanny() sketch follows this list.
  2. In model-based clustering, the data are viewed as coming from a distribution that is a mixture of two or more clusters. The method finds the best fit of a model to the data and estimates the number of clusters. In this chapter, we illustrate model-based clustering using the R package mclust; a minimal Mclust() sketch follows this list.
  3. Density-based clustering (DBSCAN) is a partitioning method introduced in Ester et al. (1996). It can identify clusters of different shapes and sizes in data containing noise and outliers. In this chapter, we describe the DBSCAN algorithm and demonstrate how to compute DBSCAN using the fpc R package; a minimal sketch also follows this list.
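
A minimal fuzzy clustering sketch using the fanny() function [in the cluster package]; the USArrests data and k = 2 clusters are illustrative choices, not part of the lesson itself:

  library(cluster)

  # Standardize the data so that variables are comparable
  df <- scale(USArrests)

  # Fuzzy clustering into k = 2 clusters (illustrative choice)
  res.fanny <- fanny(df, k = 2)

  # Membership coefficients: degree of belonging to each cluster
  head(res.fanny$membership, 3)

  # Hard assignment: the cluster with the highest membership
  head(res.fanny$clustering, 3)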
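
A minimal model-based clustering sketch using the mclust package (again with USArrests purely for illustration):

  library(mclust)

  df <- scale(USArrests)

  # Fit Gaussian mixture models; the number of clusters and the
  # covariance structure are selected automatically via BIC
  mc <- Mclust(df)

  # Best model, number of clusters and cluster assignments
  summary(mc)
  head(mc$classification, 3)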
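
A minimal DBSCAN sketch using the fpc package; the multishapes data [from factoextra] and the eps/MinPts values are illustrative and should be tuned for real data:

  library(fpc)
  library(factoextra)

  # Example data with non-spherical clusters and noise points
  data("multishapes", package = "factoextra")
  df <- multishapes[, 1:2]

  # Run DBSCAN; eps and MinPts are illustrative values
  db <- fpc::dbscan(df, eps = 0.15, MinPts = 5)

  # Cluster 0 contains the points flagged as noise/outliers
  table(db$cluster)
  fviz_cluster(db, data = df, geom = "point")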

Comments (10)

  • Connie

    Thank you so much for the very clear and excellent teaching on cluster analysis! I am wondering: if I want to cluster observations based on three ordered categorical variables and one continuous variable in panel data, which method should I use? I would appreciate it if you could answer my question.

    • Kassambara

      For mixed data, you can first compute a distance matrix between observations using the daisy() R function [in the cluster package].

      Next, you can apply hierarchical clustering on the computed distance matrix.

      For example:

      library(cluster)
      library(factoextra)
      
      # Load data
      data(flower)
      head(flower, 3)
      
      # Compute the gower distance matrix and visualize
      gower.dist <- daisy(flower, metric = "gower")
      fviz_dist(gower.dist)
      
      # Perform agglomerative hierarchical clustering on the distance matrix
      hc.clust <- agnes(gower.dist)
      fviz_dend(hc.clust)
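
      If you then need hard cluster assignments, one possibility (just a sketch; the number of groups is an arbitrary choice) is to cut the resulting tree:

      # Cut the dendrogram into, e.g., 3 groups (arbitrary choice)
      grp <- cutree(as.hclust(hc.clust), k = 3)
      table(grp)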
      
      • Connie

        Kassambara, thank you for the quick reply. Your explanation is always clear and straightforward. Since I have 20,000 observations, my first thought was to use CLARA, but I will adopt hierarchical clustering as you suggested. As a beginner to cluster analysis, may I ask why hierarchical clustering is better than CLARA in my case? That is the question I need to answer when I write the methods section of the paper.

        When I read Fraley’s paper (2002), I liked the idea of ‘soft’ clustering, which, however, has some limitations, such as handling large datasets. I don’t want to make things too complicated in the first place. But in the future, after running the basic method, would it be possible to apply ‘soft’ clustering in my case? Which soft clustering method would you recommend? Thank you!

        • Kassambara

          Hi Connie,

          My previous comment shows just an example of how to perform clustering on mixed data. Note that the CLARA algorithm doesn’t take a distance matrix as input, so you can’t apply it to the Gower distance.

          For soft clustering, I would suggest the fuzzy clustering method using the fanny() R function [in the cluster R package]. It accepts a distance matrix as input.

          You might also be interested in Hierarchical Clustering on Principal Components (HCPC), which can be used for clustering mixed data as well.
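
          As a rough sketch (the number of clusters, k = 3, is an arbitrary choice), fanny() can be applied directly to the Gower distance computed in my previous example:

          # Fuzzy clustering on the precomputed Gower distance
          fanny.res <- fanny(gower.dist, k = 3, diss = TRUE)

          # Membership coefficients for the first observations
          head(fanny.res$membership, 3)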

          • Noven

            Hi, Kassambara. Please make a post/tutorial about k-prototypes clustering for mixed attributes and how to assess cluster accuracy.

  • Poorwa_kunwar

    I am working on a very large dataset (over 9 million observations), and it has both categorical and continuous variables. I tried using Gower distance and PAM, but it simply fails to work because the dataset is too large. I’m thinking of using the k-prototypes algorithm in the clustMixType package. Do you have any suggestions? Thanks.

  • Yulin

    Excellent course! Many thanks for sharing the knowledge!

  • Hema Latha Krishna Nair

    Hi,
    It would be helpful if anyone could explain how I can use K-means clustering in a situation where I have more than two dimensions/variables for evaluation. I would like to cluster the observations into 5 clusters (k = 5), but I am afraid basic K-means only takes two dimensions for the distance measure. Any best practice?
