Discovering knowledge from big multivariate data, recorded every day, requires specialized machine learning techniques.
This book presents an easy-to-use practical guide in R to compute the most popular machine learning methods for exploring data sets, as well as for building predictive models.
The main parts of the book include:
Unsupervised learning methods, to explore and discover knowledge from a large multivariate data set using clustering and principal component methods. You will learn hierarchical clustering, k-means, principal component analysis and correspondence analysis methods.
Regression analysis, to predict a quantitative outcome value using linear regression and non-linear regression strategies.
Classification techniques, to predict a qualitative outcome value using logistic regression, discriminant analysis, the naive Bayes classifier and support vector machines.
Advanced machine learning methods, to build robust regression and classification models using k-nearest neighbors methods, decision tree models and ensemble methods (bagging, random forest and boosting).
Model selection methods, to automatically select the best combination of predictor variables for building an optimal predictive model. These include best subset selection methods, stepwise regression and penalized regression (ridge, lasso and elastic net regression models). We also present principal component-based regression methods, which are useful when the data contain multiple correlated predictor variables.
Model validation and evaluation techniques for measuring the performance of a predictive model.
Model diagnostics for detecting and fixing potential problems in a predictive model.
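To give a flavor of the two kinds of tasks listed above (exploring a data set and building a predictive model), here is a minimal sketch in base R. It is not taken from the book: the choice of the built-in `USArrests` and `mtcars` data sets, the number of clusters, the predictors and the 24/8 train/test split are all assumptions made for illustration.

```r
# Explore: k-means clustering on the built-in USArrests data (4 numeric variables)
set.seed(123)                                  # k-means and sampling are random
arrests_scaled <- scale(USArrests)             # standardize each column
km <- kmeans(arrests_scaled, centers = 4, nstart = 25)
print(table(km$cluster))                       # number of states per cluster

# Predict: linear regression on the built-in mtcars data with a hold-out test set
train_idx <- sample(nrow(mtcars), size = 24)   # 24 of the 32 rows for training
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]
fit   <- lm(mpg ~ wt + hp, data = train)       # quantitative outcome: mpg
pred  <- predict(fit, newdata = test)
rmse  <- sqrt(mean((test$mpg - pred)^2))       # prediction error on unseen rows
print(rmse)
```

Standardizing before clustering keeps variables measured on large scales from dominating the distance computation, and evaluating on held-out rows gives a more honest error estimate than the training error.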
The book presents the basic principles of these tasks and provides many examples in R. This book offers solid guidance in data mining for students and researchers.
Key features:
At the end of each chapter, we present R lab sections in which we systematically work through applications of the various methods discussed in that chapter.
Although there are several good books on principal component methods (PCMs) and related topics, we felt that many of them are either too theoretical or too advanced.
This book provides solid practical guidance for summarizing, visualizing and interpreting the most important information in large multivariate data sets, using principal component methods in R.
The following figure illustrates the type of analysis to be performed depending on the type of variables contained in the data set.
There are a number of R packages implementing principal component methods. These packages include: FactoMineR, ade4, stats, ca, MASS and ExPosition.
However, the results are presented differently depending on the package used.
To help in the interpretation and in the visualization of multivariate analysis - such as cluster analysis and principal component methods - we developed an easy-to-use R package named factoextra (official online documentation: http://www.sthda.com/english/rpkgs/factoextra).
No matter which package you decide to use for computing principal component methods, the factoextra R package can help you easily extract, in a human-readable data format, the analysis results from the different packages mentioned above. factoextra also provides convenient solutions for creating elegant ggplot2-based graphs.
The methods whose outputs can be visualized using the factoextra package are shown in the figure below:
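As a rough illustration of what "extracting the results in a human-readable format" means, the sketch below runs a PCA with `prcomp()` from the stats package and builds an eigenvalue table by hand. The `USArrests` data set and the column names of the table are assumptions chosen for illustration. The final lines call `get_eigenvalue()` from factoextra only if that package is installed.

```r
# PCA with the stats package on the built-in USArrests data (standardized)
pca <- prcomp(USArrests, scale. = TRUE)

# Eigenvalues are the squared standard deviations of the principal components
eig <- pca$sdev^2
eig_table <- data.frame(
  eigenvalue                  = eig,
  variance.percent            = 100 * eig / sum(eig),
  cumulative.variance.percent = 100 * cumsum(eig) / sum(eig),
  row.names                   = paste0("Dim.", seq_along(eig))
)
print(round(eig_table, 2))

# Optional: the same summary via factoextra, if it is available
if (requireNamespace("factoextra", quietly = TRUE)) {
  print(factoextra::get_eigenvalue(pca))
}
```

Other packages store the eigenvalues under different names and in different shapes, which is the inconvenience a common extraction layer is meant to hide.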
In this book, we’ll mainly use:
The other packages - ade4, ExPosition, etc. - will also be presented briefly.
This book contains 4 parts.
Part I provides a quick introduction to R and presents the key features of FactoMineR and factoextra.
Part II describes classical principal component methods to analyze data sets containing, predominantly, either continuous or categorical variables. These methods include:
In Part III, you’ll learn advanced methods for analyzing a data set containing a mix of variables (continuous and categorical), whether or not they are structured into groups:
Part IV covers hierarchical clustering on principal components (HCPC), which is useful for performing clustering with a data set containing only categorical variables or with mixed data of categorical and continuous variables.
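The idea behind clustering on principal components can be sketched in base R as follows. This is a simplified stand-in, not the HCPC procedure itself: it uses continuous variables only (the built-in `USArrests` data), because base R has no principal component method for categorical data, and the choices of two components, Ward linkage and three clusters are assumptions made for illustration.

```r
# Step 1: reduce the data to a few principal components
pca    <- prcomp(USArrests, scale. = TRUE)
scores <- pca$x[, 1:2]                      # coordinates on the first two components

# Step 2: hierarchical clustering on those coordinates (Ward's criterion)
tree   <- hclust(dist(scores), method = "ward.D2")

# Step 3: cut the tree into a chosen number of clusters
groups <- cutree(tree, k = 3)
print(table(groups))                        # cluster sizes

# Describe each cluster by the mean of the original variables
print(aggregate(USArrests, by = list(cluster = groups), FUN = mean))
```

Clustering on the leading components rather than the raw variables discards the low-variance dimensions, which often behave like noise. With categorical or mixed data, the same steps would start from a principal component method suited to those variable types.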
This book presents the basic principles of the different methods and provides many examples in R. This book offers solid guidance in data mining for students and researchers.
Key features:
At the end of each chapter, we present R lab sections in which we systematically work through applications of the various methods discussed in that chapter. Additionally, we provide links to other resources and to our hand-curated list of videos on principal component methods for further learning.
Some examples of plots generated in this book are shown hereafter. You’ll learn how to create, customize and interpret these plots.
Download the preview of the book at: Principal Component Methods in R (Book preview)