Regression analysis consists of a set of machine learning methods that allow us to predict a continuous outcome variable (y) based on the value of one or multiple predictor variables (x).
Briefly, the goal of a regression model is to build a mathematical equation that defines y as a function of the x variables. This equation can then be used to predict the outcome (y) on the basis of new values of the predictor variables (x).
Linear regression is the simplest and most popular technique for predicting a continuous variable. It assumes a linear relationship between the outcome and the predictor variables.
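The linear regression equation can be written as y = b0 + b*x + e, where b0 is the intercept, b is the regression weight (coefficient) associated with the predictor variable x, and e is the residual error.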
Technically, the linear regression coefficients are determined so that the sum of squared errors in predicting the outcome value is minimized. This method of computing the beta coefficients is called the Ordinary Least Squares method.
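As a minimal sketch of the idea, the R code below fits a simple linear regression with lm() and checks that the estimates match the closed-form least squares solution. The data are simulated and the variable names (x, y, model, b_ols) are illustrative assumptions, not taken from the original material.

```r
# Simulated (hypothetical) data: y depends linearly on x plus random noise
set.seed(123)
x <- runif(100, min = 0, max = 10)
y <- 2 + 0.5 * x + rnorm(100, sd = 1)

# Fit y = b0 + b*x + e by ordinary least squares
model <- lm(y ~ x)
coef(model)  # estimated intercept (b0) and slope (b)

# Same estimates from the normal equations (X'X) b = X'y
X <- cbind(1, x)  # design matrix with an intercept column
b_ols <- solve(t(X) %*% X, t(X) %*% y)

# Predict the outcome for new values of the predictor
predict(model, newdata = data.frame(x = c(1, 5)))
```

In practice lm() is the usual choice; the explicit matrix solution is shown only to make the least squares criterion concrete.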
When you have multiple predictor variables, say x1 and x2, the regression equation can be written as y = b0 + b1*x1 + b2*x2 + e. In some situations, there might be an interaction effect between some predictors; for example, increasing the value of a predictor variable x1 may increase the effectiveness of the predictor x2 in explaining the variation in the outcome variable.
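The following sketch compares an additive model with one that includes an interaction term, written x1:x2 in R's formula syntax. The data are simulated, and the object names (dat, additive, with_inter) are hypothetical.

```r
# Simulated (hypothetical) data in which the effect of x2 depends on x1
set.seed(123)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 + 3 * x2 + 1.5 * x1 * x2 + rnorm(n, sd = 0.5)
dat <- data.frame(y, x1, x2)

# Additive model: y = b0 + b1*x1 + b2*x2 + e
additive <- lm(y ~ x1 + x2, data = dat)

# Model with an interaction effect: adds the term b3*(x1*x2)
with_inter <- lm(y ~ x1 + x2 + x1:x2, data = dat)
coef(with_inter)

# F-test comparing the two nested models
anova(additive, with_inter)
```

A small p-value in the anova() table suggests that the interaction term improves the fit for this simulated data set.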
Note also that linear regression models can incorporate both continuous and categorical predictor variables.
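As an illustration, the sketch below mixes a continuous and a categorical predictor using the built-in iris data set (our choice of example data, not the original author's). R encodes the factor Species as dummy variables automatically.

```r
# Continuous predictor (Petal.Length) plus categorical predictor (Species)
model <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)

# One coefficient per non-reference level of the factor
coef(model)

# The dummy (0/1) columns that R created for Species
head(model.matrix(model))
```

The first factor level (setosa) serves as the reference, so the Species coefficients are differences relative to that level.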
When you build a linear regression model, you need to diagnose whether a linear model is suitable for your data.
In some cases, the relationship between the outcome and the predictor variables is not linear. In these situations, you need to build a non-linear regression model, such as a polynomial or spline regression.
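A rough sketch of both alternatives, assuming simulated data with a curved relationship: poly() builds a polynomial basis and bs() from the splines package (shipped with standard R installations) builds a B-spline basis. The degree and degrees of freedom used here are arbitrary illustrative choices.

```r
library(splines)

# Simulated (hypothetical) data with a clearly non-linear relationship
set.seed(123)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.3)
dat <- data.frame(x, y)

linear_fit <- lm(y ~ x, data = dat)
poly_fit   <- lm(y ~ poly(x, 3), data = dat)     # cubic polynomial
spline_fit <- lm(y ~ bs(x, df = 6), data = dat)  # cubic B-spline, 6 df

# Training RMSE of each model (helper name is illustrative)
rmse <- function(model) sqrt(mean(residuals(model)^2))
c(linear = rmse(linear_fit),
  polynomial = rmse(poly_fit),
  spline = rmse(spline_fit))
```

These are training errors only; a fair comparison would use held-out data.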
When you have multiple predictors in the regression model, you might want to select the best combination of predictor variables to build an optimal predictive model. This process, called model selection, consists of comparing multiple models containing different sets of predictors in order to select the best performing model, that is, the one that minimizes the prediction error. Linear model selection approaches include best subsets regression and stepwise regression.
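As one possible illustration of stepwise regression, the sketch below uses the base R function step(), which adds and removes predictors according to the AIC. The mtcars data set and the starting model are assumptions made for the example; best subsets regression is not shown because it needs an add-on package.

```r
# Start from the full model with all available predictors
full <- lm(mpg ~ ., data = mtcars)

# Stepwise selection in both directions, guided by AIC
selected <- step(full, direction = "both", trace = 0)

formula(selected)                            # predictors that were kept
c(full = AIC(full), selected = AIC(selected))
```

Note that step() optimizes an in-sample criterion, so the chosen model should still be checked on data not used for fitting.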
In some situations, such as in genomic fields, you might have a large multivariate data set containing some correlated predictors. In this case, the information in the original data set can be summarized into a few new variables (called principal components) that are linear combinations of the original variables. These few principal components can be used to build a linear model, which might perform better on your data. Methods following this approach are known as principal component-based methods and include principal component regression and partial least squares regression.
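The idea behind principal component regression can be sketched in base R by combining prcomp() and lm(). The choice of mtcars, of these five predictors, and of two components is purely illustrative; dedicated packages such as pls are what one would more likely use in practice.

```r
# Correlated predictors (illustrative selection), standardized
X <- scale(mtcars[, c("disp", "hp", "wt", "drat", "qsec")])

# Principal components: linear combinations of the original variables
pca <- prcomp(X)
summary(pca)$importance["Cumulative Proportion", 1:2]

# Regress the outcome on the first two principal components only
pcr_data <- data.frame(mpg = mtcars$mpg, pca$x[, 1:2])
pcr_fit  <- lm(mpg ~ PC1 + PC2, data = pcr_data)
coef(pcr_fit)
```

Because the components are uncorrelated by construction, the fitted coefficients are not affected by the collinearity present among the original predictors.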
An alternative method to simplify a large multivariate model is to use penalized regression, which penalizes the model for having too many variables. The best-known penalized regression methods include ridge regression and lasso regression.
You can apply all these different regression models to your data, compare them, and finally select the approach that best explains your data. To do so, you need some statistical metrics to compare the performance of the different models in explaining your data and in predicting the outcome of new test data.
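To make the penalty concrete, here is a minimal base R sketch of ridge regression using its closed-form solution, (X'X + lambda*I)^(-1) X'y, on standardized predictors. The helper ridge_coef() and the value lambda = 10 are hypothetical; the lasso has no closed form and is usually fitted with a dedicated package such as glmnet.

```r
# Standardized predictors and centered outcome (so no intercept is needed)
X <- scale(as.matrix(mtcars[, c("disp", "hp", "wt", "drat", "qsec")]))
y <- mtcars$mpg - mean(mtcars$mpg)

# Hypothetical helper: ridge coefficients for a given penalty lambda
ridge_coef <- function(X, y, lambda) {
  p <- ncol(X)
  drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
}

b_ols   <- ridge_coef(X, y, lambda = 0)   # no penalty: ordinary least squares
b_ridge <- ridge_coef(X, y, lambda = 10)  # penalized: coefficients shrink

round(cbind(ols = b_ols, ridge = b_ridge), 3)
```

The overall size of the ridge coefficient vector is smaller than that of the unpenalized one, which is the shrinkage effect of the penalty. In practice lambda would be chosen by cross-validation rather than fixed by hand.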
The best model is defined as the model that has the lowest prediction error. The most popular metrics for comparing regression models include the Root Mean Squared Error (RMSE), computed as RMSE = mean((observeds - predicteds)^2) %>% sqrt(). The lower the RMSE, the better the model.
Note that the above-mentioned metrics should be computed on new test data that has not been used to train (i.e. build) the model. If you have a large data set with many records, you can randomly split the data into a training set (80%, for building the predictive model) and a test set or validation set (20%, for evaluating the model performance).
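A minimal sketch of the 80/20 split and the test-set RMSE, using base R only. The mtcars data set and the model formula are illustrative assumptions; with so few rows the estimate is noisy, and the example is meant only to show the mechanics.

```r
set.seed(123)

# Randomly assign 80% of the rows to the training set
n <- nrow(mtcars)
train_idx <- sample(n, size = round(0.8 * n))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

# Build the model on the training set only
model <- lm(mpg ~ wt + hp, data = train)

# Evaluate on the held-out test set
predicteds <- predict(model, newdata = test)
observeds  <- test$mpg
rmse <- sqrt(mean((observeds - predicteds)^2))
rmse
```

The expression sqrt(mean(...)) is the base R equivalent of the piped form, which requires a package providing the %>% operator.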
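One of the most robust and popular approaches for estimating model performance is k-fold cross-validation. It can be applied even to a small data set. k-fold cross-validation works as follows: the data set is randomly split into k subsets (folds); each fold in turn is held out as test data while the model is trained on the remaining k-1 folds; the prediction error is recorded for each held-out fold; and the average of the k recorded errors, the cross-validation error, serves as the performance metric for the model.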
Taken together, the best model is the model that has the lowest cross-validation error, RMSE.
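The k-fold procedure can be sketched in a few lines of base R: assign each row to one of k folds, hold each fold out in turn, and average the held-out RMSE values. The value k = 5, the mtcars data set, and the model formula are assumptions chosen for illustration.

```r
set.seed(123)
k <- 5
n <- nrow(mtcars)

# Randomly assign each observation to one of k folds
folds <- sample(rep(1:k, length.out = n))

# For each fold: train on the other k-1 folds, test on the held-out fold
fold_rmse <- sapply(1:k, function(i) {
  train <- mtcars[folds != i, ]
  test  <- mtcars[folds == i, ]
  fit   <- lm(mpg ~ wt + hp, data = train)
  pred  <- predict(fit, newdata = test)
  sqrt(mean((test$mpg - pred)^2))
})

# Cross-validation error: average RMSE over the k folds
cv_rmse <- mean(fold_rmse)
cv_rmse
```

Repeating the same loop for each candidate model, with the same fold assignment, gives cross-validation errors that can be compared directly.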
In this part, you will learn different methods for regression analysis, and we’ll provide practical examples in R.
The content is organized as follows:
The ggpubr R package facilitates the creation of beautiful ggplot2-based graphs for researchers with non-advanced programming backgrounds.
The current material presents a collection of articles for simply creating and customizing publication-ready plots using ggpubr. To see some examples of plots created with ggpubr click the following link: ggpubr examples.
ggpubr Key features:
Official online documentation: http://www.sthda.com/english/rpkgs/ggpubr.
In this first volume of simplyR, we are excited to share our Practical Guides to Partitioning Clustering.
The course materials contain 3 chapters organized as follows:
K-Medoids Essentials: PAM clustering
CLARA - Clustering Large Applications
Although there are several good books on principal component methods (PCMs) and related topics, we felt that many of them are either too theoretical or too advanced.
This book provides solid practical guidance for summarizing, visualizing and interpreting the most important information in large multivariate data sets, using principal component methods in R.
Where to find the book:
The following figure illustrates the type of analysis to be performed depending on the type of variables contained in the data set.
There are a number of R packages implementing principal component methods. These packages include: FactoMineR, ade4, stats, ca, MASS and ExPosition.
However, the results are presented differently depending on the package used.
To help in the interpretation and in the visualization of multivariate analysis - such as cluster analysis and principal component methods - we developed an easy-to-use R package named factoextra (official online documentation: http://www.sthda.com/english/rpkgs/factoextra).
No matter which package you decide to use for computing principal component methods, the factoextra R package can help you easily extract, in a human-readable data format, the analysis results from the different packages mentioned above. factoextra also provides convenient solutions for creating beautiful ggplot2-based graphs.
The methods whose outputs can be visualized using the factoextra package are shown in the figure below:
In this book, we’ll mainly use the FactoMineR and factoextra packages.
The other packages - ade4, ExPosition, etc. - will also be presented briefly.
This book contains 4 parts.
Part I provides a quick introduction to R and presents the key features of FactoMineR and factoextra.
Part II describes classical principal component methods to analyze data sets containing, predominantly, either continuous or categorical variables. These methods include:
In Part III, you’ll learn advanced methods for analyzing a data set containing a mix of variables (continuous and categorical) structured or not into groups:
Part IV covers hierarchical clustering on principal components (HCPC), which is useful for performing clustering on a data set containing only categorical variables or a mix of categorical and continuous variables.
This book presents the basic principles of the different methods and provides many examples in R. It offers solid guidance in data mining for students and researchers.
Key features:
At the end of each chapter, we present R lab sections in which we systematically work through applications of the various methods discussed in that chapter. Additionally, we provide links to other resources and to our hand-curated list of videos on principal component methods for further learning.
Some examples of plots generated in this book are shown hereafter. You’ll learn how to create, customize and interpret these plots.
Download the preview of the book at: Principal Component Methods in R (Book preview)
simplyR is a web space where we’ll be posting practical and easy guides for solving important real-world problems using the R programming language.
As we aren’t fans of unnecessary complications, we’ll keep the content of our tutorials and R code as simple as possible.
Many tutorials are coming soon.
Topics we love include:
Samples of our recent publications, on R & Data Science, are:
If you want to contribute, read this: http://www.sthda.com/english/pages/contribute-to-sthda