Discovering knowledge from big multivariate data, recorded every day, requires specialized machine learning techniques.
This book presents an easy-to-use, practical guide in R to the most popular machine learning methods for exploring data sets and for building predictive models.
The main parts of the book include:
Unsupervised learning methods, to explore and discover knowledge from a large multivariate data set using clustering and principal component methods. You will learn hierarchical clustering, k-means, principal component analysis and correspondence analysis methods.
Regression analysis, to predict a quantitative outcome value using linear regression and non-linear regression strategies.
Classification techniques, to predict a qualitative outcome value using logistic regression, discriminant analysis, the naive Bayes classifier and support vector machines.
Advanced machine learning methods, to build robust regression and classification models using k-nearest neighbors methods, decision tree models and ensemble methods (bagging, random forest and boosting).
Model selection methods, to automatically select the best combination of predictor variables for building an optimal predictive model. These include best subsets selection, stepwise regression and penalized regression (ridge, lasso and elastic net regression models). We also present principal component-based regression methods, which are useful when the data contain multiple correlated predictor variables.
Model validation and evaluation techniques for measuring the performance of a predictive model.
Model diagnostics for detecting and fixing potential problems in a predictive model.
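As a small illustration of the regression and validation steps listed above, here is a minimal sketch in base R. The data set (mtcars), the predictors (wt and hp) and the 80/20 split are assumptions chosen for the example and are not taken from the book.

```r
# Minimal sketch: fit a linear regression and evaluate it on held-out data.
# The data set, predictors and split ratio are illustrative assumptions.
set.seed(123)
data(mtcars)

# Randomly assign about 80% of the rows to the training set
n <- nrow(mtcars)
train.idx <- sample(n, size = round(0.8 * n))
train <- mtcars[train.idx, ]
test <- mtcars[-train.idx, ]

# Predict a quantitative outcome (mpg) from two predictors
model <- lm(mpg ~ wt + hp, data = train)

# Measure performance on data the model has not seen
pred <- predict(model, newdata = test)
rmse <- sqrt(mean((test$mpg - pred)^2))  # root mean squared error
r2 <- cor(test$mpg, pred)^2              # squared correlation, observed vs predicted
print(c(RMSE = rmse, R2 = r2))
```

A lower RMSE on the test set indicates better predictive accuracy. With only a handful of test rows, as here, the estimate is noisy, which is one reason resampling approaches such as cross-validation are often preferred.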
The book presents the basic principles of these tasks and provides many examples in R. This book offers solid guidance in data mining for students and researchers.
Key features:
At the end of each chapter, we present R lab sections in which we systematically work through applications of the various methods discussed in that chapter.
Where to find the book:
Although there are several good books on principal component methods (PCMs) and related topics, we felt that many of them are either too theoretical or too advanced.
This book provides solid practical guidance on how to summarize, visualize and interpret the most important information in a large multivariate data set, using principal component methods in R.
Where to find the book:
The following figure illustrates the type of analysis to be performed depending on the type of variables contained in the data set.
There are a number of R packages implementing principal component methods. These packages include: FactoMineR, ade4, stats, ca, MASS and ExPosition.
However, the results are presented differently depending on the package used.
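The point about differing output formats can be seen even within the stats package alone. The sketch below, which uses the built-in USArrests data as an assumed example, runs the same PCA with prcomp() and princomp() and compares how each function names its results.

```r
# Minimal sketch: the same PCA computed with two functions from stats.
# USArrests is an illustrative choice of data set.
pca1 <- prcomp(USArrests, scale. = TRUE)  # based on singular value decomposition
pca2 <- princomp(USArrests, cor = TRUE)   # based on the correlation matrix

# The result objects use different component names
print(names(pca1))  # variable loadings are in $rotation, coordinates in $x
print(names(pca2))  # variable loadings are in $loadings, coordinates in $scores

# The proportion of variance explained by each component agrees
var1 <- pca1$sdev^2 / sum(pca1$sdev^2)
var2 <- pca2$sdev^2 / sum(pca2$sdev^2)
print(all.equal(unname(var1), unname(var2)))
```

The two functions agree on the variance explained, but code written against one result object does not carry over to the other without renaming.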
To help in the interpretation and in the visualization of multivariate analysis - such as cluster analysis and principal component methods - we developed an easy-to-use R package named factoextra (official online documentation: http://www.sthda.com/english/rpkgs/factoextra).
No matter which package you decide to use for computing principal component methods, the factoextra R package can help you easily extract, in a human-readable data format, the analysis results from the different packages mentioned above. factoextra also provides convenient solutions for creating elegant ggplot2-based graphs.
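As a rough sketch of what this extraction looks like, the example below computes a PCA with prcomp() and then, only if factoextra is installed, extracts the eigenvalues with get_eigenvalue() and builds a scree plot with fviz_eig(). The data set is an assumption made for the example, and the base R line is included only for comparison.

```r
# Minimal sketch, assuming the USArrests data as an example.
res.pca <- prcomp(USArrests, scale. = TRUE)

# Base R: eigenvalues are the squared standard deviations of the components
eig.base <- res.pca$sdev^2
print(eig.base)

# With factoextra (skipped when the package is not installed)
if (requireNamespace("factoextra", quietly = TRUE)) {
  # Eigenvalues and percentages of variance as a data frame
  eig <- factoextra::get_eigenvalue(res.pca)
  print(eig)

  # ggplot2-based scree plot; print(scree) would draw it
  scree <- factoextra::fviz_eig(res.pca)
}
```

The same two factoextra calls are meant to work on results from the other packages listed earlier, which is the convenience the text describes.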
Methods whose outputs can be visualized using the factoextra package are shown in the figure below:
In this book, we’ll mainly use:
The other packages - ade4, ExPosition, etc. - will also be presented briefly.
This book contains 4 parts.
Part I provides a quick introduction to R and presents the key features of FactoMineR and factoextra.
Part II describes classical principal component methods to analyze data sets containing, predominantly, either continuous or categorical variables. These methods include:
In Part III, you’ll learn advanced methods for analyzing a data set containing a mix of variables (continuous and categorical), which may or may not be structured into groups:
Part IV covers hierarchical clustering on principal components (HCPC), which is useful for performing clustering with a data set containing only categorical variables or with a mixed data set of categorical and continuous variables.
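The general idea of clustering on principal components can be sketched in base R as follows. This is a simplified analogue on continuous data, not the HCPC procedure itself: the data set, the number of retained components (2) and the number of clusters (4) are all assumptions made for illustration. For categorical data, the principal components would instead come from a method suited to that data type.

```r
# Simplified sketch: reduce the data with PCA, then cluster the PCA coordinates.
# Data set, number of components and number of clusters are illustrative.
df <- scale(USArrests)

# Step 1: principal component analysis
pca <- prcomp(df)
scores <- pca$x[, 1:2]  # keep the coordinates on the first two components

# Step 2: hierarchical clustering (Ward's method) on the retained coordinates
hc <- hclust(dist(scores), method = "ward.D2")

# Step 3: cut the tree into groups
groups <- cutree(hc, k = 4)
print(table(groups))
```

Keeping only the first components before clustering discards the directions with the least variance, which can make the resulting clusters more stable.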
This book presents the basic principles of the different methods and provides many examples in R. This book offers solid guidance in data mining for students and researchers.
Key features:
At the end of each chapter, we present R lab sections in which we systematically work through applications of the various methods discussed in that chapter. Additionally, we provide links to other resources and to our hand-curated list of videos on principal component methods for further learning.
Some examples of plots generated in this book are shown hereafter. You’ll learn how to create, customize and interpret these plots.
Download the preview of the book at: Principal Component Methods in R (Book preview)
Large amounts of data are collected every day from satellite images, bio-medical, security, marketing, web search, geo-spatial or other automatic equipment. Mining knowledge from these big data far exceeds human abilities.
Clustering is one of the important data mining methods for discovering knowledge in multidimensional data. The goal of clustering is to identify patterns or groups of similar objects within a data set of interest.
In the literature, it is referred to as “pattern recognition” or “unsupervised machine learning” - “unsupervised” because we are not guided by a priori ideas of which variables or samples belong in which clusters, and “learning” because the machine algorithm “learns” how to cluster.
Cluster analysis is popular in many fields, including:
In cancer research, for classifying patients into subgroups according to their gene expression profile. This can be useful for identifying the molecular profile of patients with a good or bad prognosis, as well as for understanding the disease.
In marketing for market segmentation by identifying subgroups of customers with similar profiles and who might be receptive to a particular form of advertising.
In city planning, for identifying groups of houses according to their type, value and location.
Where to find the book:
Although there are several good books on unsupervised machine learning/clustering and related topics, we felt that many of them are too high-level, too theoretical or too advanced. Our goal was to write a practical guide to cluster analysis, elegant visualization and interpretation.
The main parts of the book include:
The book presents the basic principles of these tasks and provides many examples in R. This book offers solid guidance in data mining for students and researchers.
Key features:
This book contains 5 parts. Part I (Chapters 1-3) provides a quick introduction to R (Chapter 1) and presents the required R packages and data format (Chapter 2) for cluster analysis and visualization.
Classifying objects into clusters requires a method for measuring the distance or (dis)similarity between the objects. Chapter 3 covers the common distance measures used for assessing similarity between observations.
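To make the idea of a distance measure concrete, the sketch below computes three commonly used ones in base R. The choice of data (the first five rows of USArrests) and of these three measures is an assumption for the example, not a summary of Chapter 3.

```r
# Minimal sketch: three distance measures between observations (rows).
# The data subset is an illustrative assumption.
df <- scale(USArrests[1:5, ])  # standardize so that no variable dominates

# Euclidean distance: straight-line distance between two observations
d.euc <- dist(df, method = "euclidean")

# Manhattan distance: sum of absolute differences
d.man <- dist(df, method = "manhattan")

# Correlation-based distance: 1 minus the Pearson correlation between rows
d.cor <- as.dist(1 - cor(t(df)))

print(round(as.matrix(d.euc), 2))
```

Correlation-based distance treats two observations as close when their profiles across variables move together, even if their magnitudes differ, whereas Euclidean and Manhattan distances compare magnitudes directly.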
Part II starts with partitioning clustering methods, which include:
Partitioning clustering approaches subdivide the data set into a set of k groups, where k is the number of groups pre-specified by the analyst.
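A minimal k-means sketch in base R shows what pre-specifying k looks like in practice. The data set and the value k = 4 are assumptions chosen for illustration.

```r
# Minimal sketch: k-means with a number of groups chosen in advance.
# The data set and k = 4 are illustrative assumptions.
set.seed(123)
df <- scale(USArrests)

# nstart = 25 tries 25 random starting configurations and keeps the best one
km <- kmeans(df, centers = 4, nstart = 25)

print(km$size)           # number of observations in each cluster
print(km$centers)        # cluster means on the standardized variables
print(head(km$cluster))  # cluster assigned to the first observations
```

Because the algorithm starts from random centers, setting a seed and using several starts makes the result reproducible and less dependent on an unlucky initialization.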
In Part III, we consider the agglomerative hierarchical clustering method, which is an alternative approach to partitioning clustering for identifying groups in a data set. It does not require the number of clusters to be pre-specified. The result of hierarchical clustering is a tree-based representation of the objects, also known as a dendrogram (see the figure below).
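The following base R sketch builds such a tree and draws the dendrogram. The data set, the Euclidean distance, Ward's linkage and the cut into 4 groups are assumptions made for the example.

```r
# Minimal sketch: agglomerative hierarchical clustering and its dendrogram.
# Data set, distance, linkage and number of groups are illustrative.
df <- scale(USArrests)

# Pairwise dissimilarities between observations
d <- dist(df, method = "euclidean")

# Build the tree; no number of clusters is needed at this stage
hc <- hclust(d, method = "ward.D2")

# Draw the dendrogram
plot(hc, cex = 0.6, hang = -1)

# Groups are obtained afterwards by cutting the tree
groups <- cutree(hc, k = 4)
print(table(groups))
```

The number of groups only enters when the tree is cut, so the same tree can be cut at several levels without recomputing the clustering.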
In this part, we describe how to compute, visualize, interpret and compare dendrograms:
In this section, you will learn how to generate and interpret the following plots.
Part IV describes clustering validation and evaluation strategies, which consist of measuring the goodness of clustering results. Before applying any clustering algorithm to a data set, the first thing to do is to assess the clustering tendency, that is, whether applying clustering is suitable for the data. If it is, the next question is how many clusters there are. Next, you can perform hierarchical clustering or partitioning clustering (with a pre-specified number of clusters). Finally, you can use a number of measures, described in this chapter, to evaluate the goodness of the clustering results.
The different chapters included in Part IV are organized as follows:
Assessing clustering tendency (Chapter 11)
Determining the optimal number of clusters (Chapter 12)
Cluster validation statistics (Chapter 13)
Choosing the best clustering algorithms (Chapter 14)
Computing p-value for hierarchical clustering (Chapter 15)
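One simple way to approach the question of how many clusters to use is the elbow method, sketched below in base R. It is only one of several possible approaches, and the data set and the range k = 1 to 10 are assumptions made for the example.

```r
# Minimal sketch of the elbow method for choosing the number of clusters.
# The data set and the range of k are illustrative assumptions.
set.seed(123)
df <- scale(USArrests)

# Total within-cluster sum of squares for k = 1, ..., 10
k.values <- 1:10
wss <- sapply(k.values, function(k) {
  kmeans(df, centers = k, nstart = 25)$tot.withinss
})

# Plot the curve and look for the bend ("elbow")
plot(k.values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The within-cluster sum of squares shrinks as k grows, so the aim is not to minimize it but to find the point beyond which adding another cluster brings little improvement. That judgment is visual and somewhat subjective.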
In this section, you’ll learn how to create and interpret the plots hereafter.
Part V presents advanced clustering methods, including:
Hierarchical k-means clustering is a hybrid approach for improving k-means results.
In fuzzy clustering, items can be a member of more than one cluster. Each item has a set of membership coefficients corresponding to the degree of being in a given cluster.
In model-based clustering, the data are viewed as coming from a distribution that is a mixture of two or more clusters. It finds the best fit of models to the data and estimates the number of clusters.
Density-based clustering (DBSCAN) is a partitioning method introduced in Ester et al. (1996). It can find clusters of different shapes and sizes in data containing noise and outliers.
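The hybrid hierarchical k-means idea mentioned above can be sketched in base R: a hierarchical clustering provides the starting centers, and k-means then refines them. This is an illustrative sketch under assumed choices (data set, Ward's linkage, k = 4), not the book's implementation.

```r
# Minimal sketch of hierarchical k-means.
# Data set, linkage method and k are illustrative assumptions.
df <- scale(USArrests)
k <- 4

# Step 1: hierarchical clustering, cut into k groups
hc <- hclust(dist(df), method = "ward.D2")
init <- cutree(hc, k = k)

# Step 2: the mean of each group gives one starting center (k rows, one per group)
centers <- apply(df, 2, function(col) tapply(col, init, mean))

# Step 3: run k-means from these centers instead of random ones
km <- kmeans(df, centers = centers)

# Compare the initial and the refined partitions
print(table(hierarchical = init, kmeans = km$cluster))
```

Starting from the hierarchical solution removes the dependence on a random initialization, so the result is the same on every run.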
Download the preview of the book at: Practical Guide to Cluster Analysis in R (Book preview)