Principal Component and Partial Least Squares Regression Essentials

kassambara | 11/03/2018 | Model Selection Essentials in R

This chapter presents regression methods based on dimension reduction techniques, which can be very useful when you have a large data set with multiple correlated predictor variables.

Generally, all dimension reduction methods work by first summarizing the original predictors into a few new variables called principal components (PCs), which are then used as predictors to fit the linear regression model. These methods avoid multicollinearity between predictors, which is a big issue in the regression setting (see Chapter @ref(multicollinearity)).

When using dimension reduction methods, it’s generally recommended to standardize each predictor to make them comparable. Standardization here consists of dividing each predictor by its standard deviation (the predictors are also mean-centered before the components are computed).

Here, we describe two well-known regression methods based on dimension reduction: Principal Component Regression (PCR) and Partial Least Squares (PLS) regression. We also provide practical examples in R.

Contents:

Principal component regression
Partial least squares regression
Loading required R packages
Preparing the data
Computation
- Computing principal component regression
- Computing partial least squares
Discussion

Principal component regression

Principal component regression (PCR) first applies principal component analysis (PCA) to the data set to summarize the original predictor variables into a few new variables, known as principal components (PCs), each of which is a linear combination of the original predictors.

These PCs are then used to build the linear regression model. The number of principal components to incorporate in the model is chosen by cross-validation (cv). Note that PCR is suitable when the data set contains highly correlated predictors.
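
To make the two steps concrete, here is a minimal sketch of PCR done by hand with base R only: prcomp() for the PCA step and lm() for the regression step. It uses the Boston data from the MASS package, and the choice of k = 5 components is only an assumption for illustration (in practice it would be chosen by cross-validation).

```r
# PCR by hand: PCA on the standardized predictors, then least squares on the scores
data("Boston", package = "MASS")
x <- Boston[, setdiff(names(Boston), "medv")]

# Step 1: PCA on the centered and scaled predictors
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Keep the first k principal components (k = 5 is an illustrative choice)
k <- 5
scores <- as.data.frame(pca$x[, 1:k])

# The PCs are mutually uncorrelated, which removes multicollinearity
round(cor(scores), 3)

# Step 2: ordinary linear regression of the outcome on the PC scores
scores$medv <- Boston$medv
pcr_fit <- lm(medv ~ ., data = scores)
coef(pcr_fit)
```

The correlation matrix of the scores is the identity (up to rounding), so the regression on the PCs does not suffer from the collinearity present in the original predictors.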

Partial least squares regression

A possible drawback of PCR is that we have no guarantee that the selected principal components are associated with the outcome. Here, the selection of the principal components to incorporate in the model is not supervised by the outcome variable.

An alternative to PCR is Partial Least Squares (PLS) regression, which identifies new components that not only summarize the original predictors but are also related to the outcome. These components are then used to fit the regression model. So, compared to PCR, PLS uses a dimension reduction strategy that is supervised by the outcome.

Like PCR, PLS is convenient for data with highly correlated predictors. The number of components used in PLS is generally chosen by cross-validation. The predictors and the outcome variable should generally be standardized to make the variables comparable.
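
The difference between the two strategies can be sketched with base R. For a single outcome, the first PLS direction is proportional to X'y (with X and y centered), so it is the unit-length direction whose scores have maximal covariance with the outcome. The first PC direction ignores y entirely. The code below is a hand-rolled illustration of that first component only, not the algorithm used by the pls package.

```r
data("Boston", package = "MASS")
X <- scale(as.matrix(Boston[, setdiff(names(Boston), "medv")]))  # centered, scaled
y <- Boston$medv - mean(Boston$medv)                             # centered outcome

# First PLS weight vector: proportional to X'y, normalized to unit length
w_pls <- drop(crossprod(X, y))
w_pls <- w_pls / sqrt(sum(w_pls^2))
t_pls <- drop(X %*% w_pls)   # scores on the first PLS component

# First PCA direction (unit length), computed without looking at y
w_pca <- prcomp(X)$rotation[, 1]
t_pca <- drop(X %*% w_pca)   # scores on the first principal component

# Absolute covariance of each component with the outcome
c(pls = abs(cov(t_pls, y)), pca = abs(cov(t_pca, y)))
```

The PLS value is never smaller than the PCA value, since the PLS direction is chosen to maximize exactly this quantity.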

Loading required R packages

tidyverse for easy data manipulation and visualization
caret for easy machine learning workflow
pls for computing PCR and PLS

library(tidyverse)
library(caret)
library(pls)

Preparing the data

We’ll use the Boston data set [in MASS package], introduced in Chapter @ref(regression-analysis), for predicting the median house value (medv) in Boston suburbs, based on multiple predictor variables.

We’ll randomly split the data into a training set (80%, for building a predictive model) and a test set (20%, for evaluating the model). Make sure to set the seed for reproducibility.

# Load the data
data("Boston", package = "MASS")
# Split the data into training and test set
set.seed(123)
training.samples <- Boston$medv %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data  <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]
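
Since PCR and PLS are motivated by correlated predictors, it can help to check how correlated the Boston predictors actually are. The following optional sketch uses base R only and lists the most strongly correlated predictor pairs. The variable names pairs and cor_mat are introduced here for illustration.

```r
data("Boston", package = "MASS")
predictors <- Boston[, setdiff(names(Boston), "medv")]

# Pairwise correlations; keep each pair once (lower triangle only)
cor_mat <- cor(predictors)
cor_mat[upper.tri(cor_mat, diag = TRUE)] <- NA

# Reshape to a long table and sort by absolute correlation
pairs <- as.data.frame(as.table(cor_mat))
names(pairs) <- c("var1", "var2", "correlation")
pairs <- pairs[!is.na(pairs$correlation), ]
pairs <- pairs[order(-abs(pairs$correlation)), ]
head(pairs, 5)
```

Pairs near the top of this table (for instance rad and tax) are the kind of redundancy that dimension reduction is meant to absorb.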

Computation

The R function train() [caret package] provides an easy workflow to compute PCR and PLS by invoking the pls package. It has an option named method, which can take the value "pcr" or "pls".

An additional argument is scale = TRUE for standardizing the variables to make them comparable.

caret uses cross-validation to automatically identify the optimal number of principal components (ncomp) to be incorporated in the model.

Here, we’ll test 10 different values of the tuning parameter ncomp. This is specified using the option tuneLength. The optimal number of principal components is selected so that the cross-validation error (RMSE) is minimized.
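
To see what this selection involves, here is a hand-rolled sketch of 10-fold cross-validation for PCR in base R. It is not caret's implementation: the fold assignment, the helper names (fold_id, ncomp_grid, cv_rmse) and the use of prcomp() plus lm() are assumptions made for illustration, and it runs on the full Boston data so that it is self-contained. The numbers it produces will therefore not match the caret output shown in this chapter.

```r
data("Boston", package = "MASS")
set.seed(123)

k_folds    <- 10
ncomp_grid <- 1:10   # candidate numbers of components
pred_names <- setdiff(names(Boston), "medv")

# Randomly assign each row to one of the folds
fold_id <- sample(rep(seq_len(k_folds), length.out = nrow(Boston)))

cv_rmse <- sapply(ncomp_grid, function(ncomp) {
  fold_rmse <- sapply(seq_len(k_folds), function(f) {
    train <- Boston[fold_id != f, ]
    valid <- Boston[fold_id == f, ]

    # PCA (with standardization) is fitted on the training fold only
    pca <- prcomp(train[, pred_names], center = TRUE, scale. = TRUE)
    train_scores <- data.frame(medv = train$medv,
                               pca$x[, 1:ncomp, drop = FALSE])
    fit <- lm(medv ~ ., data = train_scores)

    # Project the held-out fold onto the same components and predict
    valid_scores <- as.data.frame(
      predict(pca, newdata = valid[, pred_names])[, 1:ncomp, drop = FALSE]
    )
    pred <- predict(fit, newdata = valid_scores)
    sqrt(mean((valid$medv - pred)^2))
  })
  mean(fold_rmse)   # average RMSE over the folds
})

data.frame(ncomp = ncomp_grid, RMSE = round(cv_rmse, 3))
best_ncomp <- ncomp_grid[which.min(cv_rmse)]
best_ncomp
```

Fitting the PCA inside each fold, rather than once on all rows, keeps the held-out fold from influencing the components it is evaluated on.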

Computing principal component regression

# Build the model on training set
set.seed(123)
model <- train(
  medv~., data = train.data, method = "pcr",
  scale = TRUE,
  trControl = trainControl("cv", number = 10),
  tuneLength = 10
  )
# Plot model RMSE vs different values of components
plot(model)
# Print the best tuning parameter ncomp, i.e. the value that
# minimizes the cross-validation error (RMSE)
model$bestTune

##   ncomp
## 5     5

# Summarize the final model
summary(model$finalModel)

## Data:    X dimension: 407 13 
##  Y dimension: 407 1
## Fit method: svdpc
## Number of components considered: 5
## TRAINING: % variance explained
##           1 comps  2 comps  3 comps  4 comps  5 comps
## X           47.48    58.40    68.00    74.75    80.94
## .outcome    38.10    51.02    64.43    65.24    71.17

# Make predictions
predictions <- model %>% predict(test.data)
# Model performance metrics
data.frame(
  RMSE = caret::RMSE(predictions, test.data$medv),
  Rsquare = caret::R2(predictions, test.data$medv)
)

##   RMSE Rsquare
## 1 5.18   0.645

The plot shows the prediction error (RMSE, Chapter @ref(regression-model-accuracy-metrics)) made by the model according to the number of principal components incorporated in the model.

Our analysis shows that choosing five principal components (ncomp = 5) gives the smallest cross-validated prediction error (RMSE).

The summary() function also provides the percentage of variance explained in the predictors (x) and in the outcome (medv) using different numbers of components.

For example, 80.94% of the variation (or information) contained in the predictors is captured by 5 principal components (ncomp = 5). With ncomp = 5, the model also explains 71.17% of the variation in the outcome variable (medv) on the training set.

Taken together, cross-validation identifies ncomp = 5 as the number of PCs that minimizes the prediction error (RMSE), while still explaining a substantial share of the variation in the predictors and in the outcome.
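
The two rows of that table can be reproduced by hand, which clarifies what they measure. The sketch below assumes the X row is the cumulative share of predictor variance carried by the first k components, and the outcome row is the training R-squared of the regression on those k components. It runs on the full Boston data rather than the 407-row training set, so the figures will differ somewhat from the output above.

```r
data("Boston", package = "MASS")
x <- Boston[, setdiff(names(Boston), "medv")]
y <- Boston$medv

pca <- prcomp(x, center = TRUE, scale. = TRUE)

# X row: cumulative % of total predictor variance captured by the first k PCs
x_explained <- 100 * cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Outcome row: % of outcome variance explained (R-squared) using the first k PCs
y_explained <- sapply(1:5, function(k) {
  100 * summary(lm(y ~ pca$x[, 1:k, drop = FALSE]))$r.squared
})

round(rbind(X = x_explained[1:5], medv = y_explained), 2)
```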

Computing partial least squares

The R code is just like that of the PCR method.

# Build the model on training set
set.seed(123)
model <- train(
  medv~., data = train.data, method = "pls",
  scale = TRUE,
  trControl = trainControl("cv", number = 10),
  tuneLength = 10
  )
# Plot model RMSE vs different values of components
plot(model)
# Print the best tuning parameter ncomp, i.e. the value that
# minimizes the cross-validation error (RMSE)
model$bestTune

##   ncomp
## 9     9

# Summarize the final model
summary(model$finalModel)

## Data:    X dimension: 407 13 
##  Y dimension: 407 1
## Fit method: oscorespls
## Number of components considered: 9
## TRAINING: % variance explained
##           1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X           46.19    57.32    64.15    69.76    75.63    78.66    82.85
## .outcome    50.90    71.84    73.71    74.71    75.18    75.35    75.42
##           8 comps  9 comps
## X           85.92    90.36
## .outcome    75.48    75.49

# Make predictions
predictions <- model %>% predict(test.data)
# Model performance metrics
data.frame(
  RMSE = caret::RMSE(predictions, test.data$medv),
  Rsquare = caret::R2(predictions, test.data$medv)
)

##   RMSE Rsquare
## 1 4.99   0.671

The optimal number of components included in the PLS model is 9. This captures about 90% of the variation in the predictors and about 75% of the variation in the outcome variable (medv) on the training set.

In our example, the test-set RMSE obtained with the PLS model (4.99) is lower than the RMSE obtained with the PCR method (5.18), and the R-squared is higher (0.671 vs 0.645). So, on this test set, the PLS model predicts the outcome better than the PCR model.
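
For reference, the two metrics used in this comparison can be written out in base R. The sketch assumes RMSE is the square root of the mean squared prediction error and that R-squared is the squared correlation between predictions and observations (caret's default definition, as far as we know). A simple one-predictor lm() on the full Boston data is used as a stand-in model so that the example is self-contained. It is not one of the models fitted above.

```r
data("Boston", package = "MASS")

# Stand-in model, only used to obtain some predictions
fit  <- lm(medv ~ lstat, data = Boston)
pred <- predict(fit, newdata = Boston)
obs  <- Boston$medv

# RMSE: typical size of the prediction error, in the units of medv
rmse <- sqrt(mean((pred - obs)^2))

# R-squared: squared correlation between predicted and observed values
r2 <- cor(pred, obs)^2

data.frame(RMSE = rmse, Rsquare = r2)
```

A lower RMSE and a higher R-squared both point to better predictions, which is how the PCR and PLS results above are compared.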

Discussion

This chapter describes principal component based regression methods, including principal component regression (PCR) and partial least squares regression (PLS). These methods are very useful for multivariate data containing correlated predictors.

The presence of correlation in the data makes it possible to summarize the data into a few non-redundant components that can be used in the regression model.

Compared to ridge regression and lasso (Chapter @ref(penalized-regression)), the final PCR and PLS models are more difficult to interpret, because they do not perform any kind of variable selection, and the regression coefficients are estimated for the components rather than directly for the original predictors.
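
Coefficients for the original (standardized) predictors can still be recovered indirectly. Because each PC is a linear combination of the predictors, multiplying the loadings by the component coefficients gives one coefficient per predictor. The sketch below shows this for a hand-built PCR with k = 5 components (an illustrative choice) using base R. The names gamma and beta are introduced here for illustration.

```r
data("Boston", package = "MASS")
X <- scale(as.matrix(Boston[, setdiff(names(Boston), "medv")]))  # standardized predictors
y <- Boston$medv

k   <- 5
pca <- prcomp(X)            # X is already centered and scaled
Z   <- pca$x[, 1:k]         # scores: Z = X %*% loadings
fit <- lm(y ~ Z)

# Coefficients on the components
gamma <- coef(fit)[-1]

# Implied coefficients on the standardized predictors: loadings %*% gamma
beta <- drop(pca$rotation[, 1:k] %*% gamma)
round(beta, 3)

# Both parameterizations give the same fitted values
fitted_from_beta <- drop(coef(fit)[1] + X %*% beta)
all.equal(unname(fitted_from_beta), unname(fitted(fit)))
```

Every predictor receives a non-zero coefficient, which illustrates why these methods do not perform variable selection.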
