**Best subsets regression** is a model selection approach that consists of testing all possible combinations of the predictor variables, and then selecting the best model according to some statistical criteria.

In this chapter, we’ll describe how to compute best subsets regression using R.

We'll use the following R packages:

- `tidyverse` for easy data manipulation and visualization
- `caret` for easy machine learning workflow
- `leaps` for computing best subsets regression

```
library(tidyverse)
library(caret)
library(leaps)
```

We'll use the built-in R `swiss` data, introduced in Chapter @ref(regression-analysis), for predicting fertility score on the basis of socio-economic indicators.

```
# Load the data
data("swiss")
# Inspect the data
sample_n(swiss, 3)
```

The R function `regsubsets()` [`leaps` package] can be used to identify different best models of different sizes. You need to specify the option `nvmax`, which represents the maximum number of predictors to incorporate in the model. For example, if `nvmax = 5`, the function returns up to the best five-variable model; that is, it returns the best 1-variable model, the best 2-variable model, …, the best 5-variable model.

In our example, we have only 5 predictor variables in the data, so we'll use `nvmax = 5`.

```
models <- regsubsets(Fertility~., data = swiss, nvmax = 5)
summary(models)
```

```
## Subset selection object
## Call: regsubsets.formula(Fertility ~ ., data = swiss, nvmax = 5)
## 5 Variables (and intercept)
## Forced in Forced out
## Agriculture FALSE FALSE
## Examination FALSE FALSE
## Education FALSE FALSE
## Catholic FALSE FALSE
## Infant.Mortality FALSE FALSE
## 1 subsets of each size up to 5
## Selection Algorithm: exhaustive
## Agriculture Examination Education Catholic Infant.Mortality
## 1 ( 1 ) " " " " "*" " " " "
## 2 ( 1 ) " " " " "*" "*" " "
## 3 ( 1 ) " " " " "*" "*" "*"
## 4 ( 1 ) "*" " " "*" "*" "*"
## 5 ( 1 ) "*" "*" "*" "*" "*"
```

The function `summary()` reports the best set of variables for each model size. In the output above, an asterisk indicates that a given variable is included in the corresponding model.

For example, it can be seen that the best 2-variable model contains only the Education and Catholic variables (`Fertility ~ Education + Catholic`). The best 3-variable model is `Fertility ~ Education + Catholic + Infant.Mortality`, and so forth.

A natural question is: which of these best models should we finally choose for our predictive analytics?

To answer this question, you need statistical metrics or strategies to compare the overall performance of the models and to choose the best one. In other words, you need to estimate the prediction error of each model and select the one with the lowest prediction error.

The `summary()` function returns several metrics (Adjusted R2, Cp and BIC; see Chapter @ref(regression-model-accuracy-metrics)) that allow us to identify the best overall model, where best is defined as the model that maximizes the adjusted R2 and minimizes the prediction error (RSS, Cp and BIC).

The adjusted R2 represents the proportion of variation in the outcome that is explained by the variation in the predictor values. The higher the adjusted R2, the better the model.
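For a model with p predictors fitted on n observations, the adjusted R2 is obtained from the ordinary R2 as:

\[ R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1} \]

Unlike R2, which never decreases when a predictor is added, the adjusted R2 increases only if the new predictor improves the fit more than would be expected by chance.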

The best model, according to each of these metrics, can be extracted as follows:

```
res.sum <- summary(models)
data.frame(
  Adj.R2 = which.max(res.sum$adjr2),
  CP = which.min(res.sum$cp),
  BIC = which.min(res.sum$bic)
)
```

```
##   Adj.R2 CP BIC
## 1      5  4   4
```

There is no single correct solution to model selection; each of these criteria will lead to slightly different models. Remember that, as George Box put it, "All models are wrong, but some are useful."

Here, adjusted R2 tells us that the best model is the one with all the 5 predictor variables. However, using the BIC and Cp criteria, we should go for the model with 4 variables.

So, we have different “best” models depending on which metrics we consider. We need additional strategies.

Note also that the adjusted R2, BIC and Cp are calculated on the training data that have been used to fit the model. This means that model selection using these metrics is possibly subject to overfitting, and the selected model may not perform as well when applied to new data.

A more rigorous approach is to select a model based on the prediction error computed on new test data using k-fold cross-validation techniques (Chapter @ref(cross-validation)).

**k-fold cross-validation** consists of first dividing the data into k subsets, also known as folds, where k is generally set to 5 or 10. Each subset serves successively as the test data set, while the remaining k - 1 subsets serve as training data (with k = 10, each test fold holds 10% of the data and the training set the other 90%). The average of the k prediction errors is reported as the cross-validation error of the model.
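To make the procedure concrete, here is a minimal sketch of manual k-fold cross-validation in base R, using the built-in `swiss` data and k = 5 (in practice we delegate this bookkeeping to `caret`):

```r
# Minimal sketch of k-fold cross-validation in base R (k = 5),
# using the built-in swiss data
set.seed(1)
k <- 5
# Randomly assign each observation to one of the k folds
folds <- sample(rep(1:k, length.out = nrow(swiss)))
cv.rmse <- sapply(1:k, function(i) {
  train <- swiss[folds != i, ]           # k - 1 folds for training
  test  <- swiss[folds == i, ]           # fold i held out for testing
  fit   <- lm(Fertility ~ ., data = train)
  pred  <- predict(fit, newdata = test)
  sqrt(mean((test$Fertility - pred)^2))  # RMSE on the held-out fold
})
mean(cv.rmse)  # average cross-validation error
```

Each observation is used exactly once for testing, so the averaged RMSE is an estimate of the model's out-of-sample prediction error.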

The k-fold cross-validation error can be easily computed using the function `train()` [`caret` package] (Chapter @ref(cross-validation)).

Here, we’ll follow the procedure below:

- Extract the different model formulas from the `models` object
- Train a linear model on each formula using k-fold cross-validation (with k = 5) and compute the prediction error of each model

We start by defining two helper functions:

`get_model_formula()`, allowing easy access to the formula of the models returned by the function `regsubsets()`. Copy and paste the following code into your R console:

```
# id: model id
# object: regsubsets object
# data: data used to fit regsubsets
# outcome: outcome variable
get_model_formula <- function(id, object, outcome){
  # Get the variables included in model number `id`
  models <- summary(object)$which[id, -1]
  # Get model predictors
  predictors <- names(which(models == TRUE))
  predictors <- paste(predictors, collapse = "+")
  # Build model formula
  as.formula(paste0(outcome, "~", predictors))
}
```

For example, to get the formula of the best 3-variable model, type this:

`get_model_formula(3, models, "Fertility")`

```
## Fertility ~ Education + Catholic + Infant.Mortality
##
```

`get_cv_error()`, for computing the cross-validation (CV) error of a given model:

```
get_cv_error <- function(model.formula, data){
  set.seed(1)
  train.control <- trainControl(method = "cv", number = 5)
  cv <- train(model.formula, data = data, method = "lm",
              trControl = train.control)
  cv$results$RMSE
}
```

Finally, use the helper functions defined above to compute the prediction error of the different best models returned by the `regsubsets()` function:

```
# Compute cross-validation error
model.ids <- 1:5
cv.errors <- map(model.ids, get_model_formula, models, "Fertility") %>%
  map(get_cv_error, data = swiss) %>%
  unlist()
cv.errors
```

`## [1] 9.42 8.45 7.93 7.68 7.92`

```
# Select the model that minimizes the CV error
which.min(cv.errors)
```

`## [1] 4`

It can be seen that the model with 4 variables is the best model: it has the lowest prediction error. The regression coefficients of this model can be extracted as follows:

`coef(models, 4)`

```
## (Intercept) Agriculture Education Catholic
## 62.101 -0.155 -0.980 0.125
## Infant.Mortality
## 1.078
```

This chapter describes the best subsets regression approach for choosing the best linear regression model that explains our data.

Note that this method is computationally expensive and becomes infeasible for large data sets with many variables. A better alternative is provided by the **stepwise regression** method. See Chapter @ref(stepwise-regression).

**Stepwise regression** (or stepwise selection) consists of iteratively adding and removing predictors in the predictive model in order to find the subset of variables resulting in the best performing model, that is, a model that lowers the prediction error.

There are three strategies of stepwise regression (James et al. 2014; Bruce and Bruce 2017):

- **Forward selection**, which starts with no predictors in the model, iteratively adds the most contributive predictors, and stops when the improvement is no longer statistically significant.
- **Backward selection** (or **backward elimination**), which starts with all predictors in the model (full model), iteratively removes the least contributive predictors, and stops when you have a model where all predictors are statistically significant.
- **Stepwise selection** (or sequential replacement), which is a combination of forward and backward selection. You start with no predictors, then sequentially add the most contributive predictors (like forward selection). After adding each new variable, remove any variables that no longer provide an improvement in the model fit (like backward selection).

Note that:

- Forward selection and stepwise selection can be applied in the high-dimensional setting, where the number of samples n is smaller than the number of predictors p, such as in genomic fields.
- Backward selection requires that the number of samples n be larger than the number of variables p, so that the full model can be fit.

In this chapter, you’ll learn how to compute the stepwise regression methods in R.

We'll use the following R packages:

- `tidyverse` for easy data manipulation and visualization
- `caret` for easy machine learning workflow
- `leaps` for computing stepwise regression

```
library(tidyverse)
library(caret)
library(leaps)
```

There are many functions and R packages for computing stepwise regression. These include:

`stepAIC()` [MASS package], which chooses the best model by AIC. It has an option named `direction`, which can take the following values: i) “both” (for stepwise regression, both forward and backward selection); ii) “backward” (for backward selection); and iii) “forward” (for forward selection). It returns the best final model.

```
library(MASS)
# Fit the full model
full.model <- lm(Fertility ~., data = swiss)
# Stepwise regression model
step.model <- stepAIC(full.model, direction = "both",
                      trace = FALSE)
summary(step.model)
```

`regsubsets()` [leaps package], which has the tuning parameter `nvmax` specifying the maximal number of predictors to incorporate in the model (see Chapter @ref(best-subsets-regression)). It returns multiple models of different sizes, up to nvmax. You need to compare the performance of the different models to choose the best one. `regsubsets()` has the option `method`, which can take the values “backward”, “forward” and “seqrep” (seqrep = sequential replacement, a combination of forward and backward selection).

```
models <- regsubsets(Fertility~., data = swiss, nvmax = 5,
                     method = "seqrep")
summary(models)
```

Note that the `train()` function [caret package] provides an easy workflow to perform stepwise selection using the `leaps` and the MASS packages. It has an option named `method`, which can take the following values:

- `"leapBackward"`, to fit linear regression with **backward selection**
- `"leapForward"`, to fit linear regression with **forward selection**
- `"leapSeq"`, to fit linear regression with **stepwise selection**

You also need to specify the tuning parameter `nvmax`, which corresponds to the maximum number of predictors to be incorporated in the model.

For example, you can vary `nvmax` from 1 to 5. In this case, the function searches for the best models of different sizes, up to the best 5-variable model: the best 1-variable model, the best 2-variable model, …, the best 5-variable model.

The following example performs backward selection (`method = "leapBackward"`), using the `swiss` data set, to identify the best model for predicting Fertility on the basis of socio-economic indicators.

As the data set contains only 5 predictors, we'll vary `nvmax` from 1 to 5, resulting in the identification of the 5 best models of different sizes: the best 1-variable model, the best 2-variable model, …, the best 5-variable model.

We'll use 10-fold cross-validation to estimate the average prediction error (RMSE) of each of the 5 models (see Chapter @ref(cross-validation)). The RMSE statistical metric is used to compare the 5 models and to automatically choose the best one, where best is defined as the model that minimizes the RMSE.

```
# Set seed for reproducibility
set.seed(123)
# Set up 10-fold cross-validation
train.control <- trainControl(method = "cv", number = 10)
# Train the model
step.model <- train(Fertility ~., data = swiss,
                    method = "leapBackward",
                    tuneGrid = data.frame(nvmax = 1:5),
                    trControl = train.control
                    )
step.model$results
```

```
##   nvmax RMSE Rsquared  MAE RMSESD RsquaredSD MAESD
## 1     1 9.30    0.408 7.91   1.53      0.390  1.65
## 2     2 9.08    0.515 7.75   1.66      0.247  1.40
## 3     3 8.07    0.659 6.55   1.84      0.216  1.57
## 4     4 7.27    0.732 5.93   2.14      0.236  1.67
## 5     5 7.38    0.751 6.03   2.23      0.239  1.64
```

The output above shows different metrics and their standard deviations for comparing the accuracy of the 5 best models. The columns are:

- `nvmax`: the number of variables in the model. For example, nvmax = 2 specifies the best 2-variable model.
- `RMSE` and `MAE`: two different metrics measuring the prediction error of each model. The lower the RMSE and MAE, the better the model.
- `Rsquared`: indicates the correlation between the observed outcome values and the values predicted by the model. The higher the R squared, the better the model.
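For reference, RMSE and MAE are computed from the observed values \(y_i\) and the model predictions \(\hat{y}_i\) as:

\[ RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}, \qquad MAE = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i \rvert \]

Because RMSE squares the errors before averaging, it penalizes large errors more heavily than MAE.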

In our example, it can be seen that the model with 4 variables (nvmax = 4) is the one that has the lowest RMSE. You can display the best tuning value (nvmax), automatically selected by the `train()` function, as follows:

`step.model$bestTune`

```
## nvmax
## 4 4
```

This indicates that the best model is the one with nvmax = 4 variables. The function `summary()` reports the best set of variables for each model size, up to the best 4-variable model.

`summary(step.model$finalModel)`

```
## Subset selection object
## 5 Variables (and intercept)
## Forced in Forced out
## Agriculture FALSE FALSE
## Examination FALSE FALSE
## Education FALSE FALSE
## Catholic FALSE FALSE
## Infant.Mortality FALSE FALSE
## 1 subsets of each size up to 4
## Selection Algorithm: backward
## Agriculture Examination Education Catholic Infant.Mortality
## 1 ( 1 ) " " " " "*" " " " "
## 2 ( 1 ) " " " " "*" "*" " "
## 3 ( 1 ) " " " " "*" "*" "*"
## 4 ( 1 ) "*" " " "*" "*" "*"
```

An asterisk specifies that a given variable is included in the corresponding model. For example, it can be seen that the best 4-variable model contains Agriculture, Education, Catholic and Infant.Mortality (`Fertility ~ Agriculture + Education + Catholic + Infant.Mortality`).

The regression coefficients of the final model (id = 4) can be accessed as follows:

`coef(step.model$finalModel, 4)`

Or, by computing the linear model using only the selected predictors:

```
lm(Fertility ~ Agriculture + Education + Catholic + Infant.Mortality,
   data = swiss)
```

```
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic +
## Infant.Mortality, data = swiss)
##
## Coefficients:
## (Intercept) Agriculture Education Catholic
## 62.101 -0.155 -0.980 0.125
## Infant.Mortality
## 1.078
```

This chapter describes stepwise regression methods for choosing an optimal, simpler model without compromising model accuracy.

We have demonstrated how to use the `leaps` R package for computing stepwise regression. An alternative is the function `stepAIC()` available in the MASS package. It has an option called `direction`, which can take the following values: “both”, “forward”, “backward”.

```
library(MASS)
res.lm <- lm(Fertility ~., data = swiss)
step <- stepAIC(res.lm, direction = "both", trace = FALSE)
step
```

Additionally, the caret package provides a method for computing stepwise regression using the MASS package (`method = "lmStepAIC"`):

```
# Train the model
step.model <- train(Fertility ~., data = swiss,
                    method = "lmStepAIC",
                    trControl = train.control,
                    trace = FALSE
                    )
# Model accuracy
step.model$results
# Final model coefficients
step.model$finalModel
# Summary of the model
summary(step.model$finalModel)
```

Stepwise regression is very useful for high-dimensional data containing multiple predictor variables. Other alternatives are the penalized regression (ridge and lasso regression) (Chapter @ref(penalized-regression)) and the principal components-based regression methods (PCR and PLS) (Chapter @ref(pcr-and-pls-regression)).

Bruce, Peter, and Andrew Bruce. 2017. *Practical Statistics for Data Scientists*. O’Reilly Media.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014. *An Introduction to Statistical Learning: With Applications in R*. Springer Publishing Company, Incorporated.

The standard linear model (or the ordinary least squares method) performs poorly in situations where you have a large multivariate data set with more variables than samples.

A better alternative is **penalized regression**, which creates a linear regression model that is penalized for having too many variables by adding a constraint to the equation (James et al. 2014; Bruce and Bruce 2017). These are also known as **shrinkage** or **regularization** methods.

The consequence of imposing this penalty is to reduce (i.e., shrink) the coefficient values towards zero, which allows the less contributive variables to have coefficients close or equal to zero.

Note that the shrinkage requires the selection of a tuning parameter (lambda) that determines the amount of shrinkage.

In this chapter we’ll describe the most commonly used penalized regression methods, including **ridge regression**, **lasso regression** and **elastic net regression**. We’ll also provide practical examples in R.


Ridge regression shrinks the regression coefficients, so that variables, with minor contribution to the outcome, have their coefficients close to zero.

The shrinkage of the coefficients is achieved by penalizing the regression model with a penalty term called **L2-norm**, which is the sum of the squared coefficients.

The amount of the penalty can be fine-tuned using a constant called lambda (\(\lambda\)). Selecting a good value for \(\lambda\) is critical.

When \(\lambda = 0\), the penalty term has no effect, and ridge regression produces the classical least squares coefficients. However, as \(\lambda\) increases toward infinity, the impact of the shrinkage penalty grows, and the ridge regression coefficients get close to zero.
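Formally, ridge regression estimates the coefficients by minimizing the usual residual sum of squares plus the L2 penalty:

\[ \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda \sum_{j=1}^{p}\beta_j^2 \]

The first term is the ordinary least squares criterion; the second term shrinks the coefficients, with \(\lambda\) controlling the trade-off between fit and shrinkage.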

Note that, in contrast to ordinary least squares regression, ridge regression is highly affected by the scale of the predictors. Therefore, it is better to standardize (i.e., scale) the predictors before applying ridge regression (James et al. 2014), so that all the predictors are on the same scale.

The standardization of a predictor `x` can be achieved using the formula `x' = x / sd(x)`, where sd(x) is the standard deviation of x. The consequence is that all standardized predictors have a standard deviation of one, so the final fit does not depend on the scale on which the predictors are measured.
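A quick check in base R (using the built-in `swiss` data as an example) confirms that dividing a predictor by its standard deviation yields a standardized variable with a standard deviation of one:

```r
# Standardize a predictor by its standard deviation: x' = x / sd(x)
x <- swiss$Agriculture
x.std <- x / sd(x)
sd(x.std)  # equals 1
```

Note that the `glmnet()` function used later in this chapter performs this standardization automatically by default.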

One important advantage of ridge regression is that it still performs well, compared to the ordinary least squares method (Chapter @ref(linear-regression)), in situations where you have large multivariate data with the number of predictors (p) larger than the number of observations (n).

One disadvantage of ridge regression is that it will include all the predictors in the final model, unlike the stepwise regression methods (Chapter @ref(stepwise-regression)), which generally select models involving a reduced set of variables.

Ridge regression shrinks the coefficients towards zero, but it will not set any of them exactly to zero. The lasso regression is an alternative that overcomes this drawback.

Lasso stands for Least Absolute Shrinkage and Selection Operator. It shrinks the regression coefficients toward zero by penalizing the regression model with a penalty term called **L1-norm**, which is the sum of the absolute coefficients.
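The lasso thus minimizes the residual sum of squares subject to the L1 penalty:

\[ \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda \sum_{j=1}^{p}\lvert\beta_j\rvert \]

The only change from the ridge criterion is the form of the penalty term: absolute values instead of squares.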

In the case of lasso regression, the penalty has the effect of forcing some of the coefficient estimates, with a minor contribution to the model, to be exactly equal to zero. This means that lasso can also be seen as an alternative to the subset selection methods for performing variable selection in order to reduce model complexity.

As in ridge regression, selecting a good value of \(\lambda\) for the lasso is critical.

One obvious advantage of lasso regression over ridge regression, is that it produces simpler and more interpretable models that incorporate only a reduced set of the predictors. However, neither ridge regression nor the lasso will universally dominate the other.

Generally, lasso might perform better in a situation where some of the predictors have large coefficients, and the remaining predictors have very small coefficients.

Ridge regression will perform better when the outcome is a function of many predictors, all with coefficients of roughly equal size (James et al. 2014).

Cross-validation methods can be used for identifying which of these two techniques is better on a particular data set.

Elastic Net produces a regression model that is penalized with both the **L1-norm** and **L2-norm**. The consequence of this is to effectively shrink coefficients (like in ridge regression) and to set some coefficients to zero (as in LASSO).
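In the parameterization used by the `glmnet` package, the elastic net penalty is a mixture of the two norms controlled by the mixing parameter \(\alpha\):

\[ \lambda \sum_{j=1}^{p}\Big(\alpha\,\lvert\beta_j\rvert + \frac{1-\alpha}{2}\,\beta_j^2\Big) \]

so that \(\alpha = 1\) recovers the lasso and \(\alpha = 0\) recovers ridge regression.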

We'll use the following R packages:

- `tidyverse` for easy data manipulation and visualization
- `caret` for easy machine learning workflow
- `glmnet` for computing penalized regression

```
library(tidyverse)
library(caret)
library(glmnet)
```

We'll use the `Boston` data set [in the `MASS` package], introduced in Chapter @ref(regression-analysis), for predicting the median house value (`medv`) in Boston suburbs, based on multiple predictor variables.

We’ll randomly split the data into training set (80% for building a predictive model) and test set (20% for evaluating the model). Make sure to set seed for reproducibility.

```
# Load the data
data("Boston", package = "MASS")
# Split the data into training and test set
set.seed(123)
training.samples <- Boston$medv %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]
```

You need to create two objects:

- `y` for storing the outcome variable
- `x` for holding the predictor variables. This should be created using the function `model.matrix()`, which automatically transforms any qualitative variables (if any) into dummy variables (Chapter @ref(regression-with-categorical-variables)). This is important because `glmnet()` can only take numerical, quantitative inputs. After creating the model matrix, we remove the intercept component at index = 1.

```
# Predictor variables
x <- model.matrix(medv~., train.data)[,-1]
# Outcome variable
y <- train.data$medv
```

We'll use the R function `glmnet()` [glmnet package] for computing penalized linear regression models.

The simplified format is as follows:

`glmnet(x, y, alpha = 1, lambda = NULL)`

- `x`: matrix of predictor variables
- `y`: the response or outcome variable; for linear regression, this is a numeric (quantitative) variable
- `alpha`: the elasticnet mixing parameter. Allowed values include:
  - “1”: for lasso regression
  - “0”: for ridge regression
  - a value between 0 and 1 (say 0.3) for elastic net regression.
- `lambda`: a numeric value defining the amount of shrinkage, to be specified by the analyst

In penalized regression, you need to specify a constant `lambda` to adjust the amount of coefficient shrinkage. The best `lambda` for your data can be defined as the `lambda` that minimizes the cross-validation prediction error. This can be determined automatically using the function `cv.glmnet()`.

In the following sections, we start by computing ridge, lasso and elastic net regression models. Next, we’ll compare the different models in order to choose the best one for our data.

The best model is defined as the model that has the lowest prediction error, RMSE (Chapter @ref(regression-model-accuracy-metrics)).

```
# Find the best lambda using cross-validation
set.seed(123)
cv <- cv.glmnet(x, y, alpha = 0)
# Display the best lambda value
cv$lambda.min
```

`## [1] 0.758`

```
# Fit the final model on the training data
model <- glmnet(x, y, alpha = 0, lambda = cv$lambda.min)
# Display regression coefficients
coef(model)
```

```
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 28.69633
## crim -0.07285
## zn 0.03417
## indus -0.05745
## chas 2.49123
## nox -11.09232
## rm 3.98132
## age -0.00314
## dis -1.19296
## rad 0.14068
## tax -0.00610
## ptratio -0.86400
## black 0.00937
## lstat -0.47914
```

```
# Make predictions on the test data
x.test <- model.matrix(medv ~., test.data)[,-1]
predictions <- model %>% predict(x.test) %>% as.vector()
# Model performance metrics
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  Rsquare = R2(predictions, test.data$medv)
)
```

```
## RMSE Rsquare
## 1 4.98 0.671
```

Note that by default, the function glmnet() standardizes variables so that their scales are comparable. However, the coefficients are always returned on the original scale.

The only difference from the R code used for ridge regression is that, for lasso regression, you need to specify the argument `alpha = 1` instead of `alpha = 0`.

```
# Find the best lambda using cross-validation
set.seed(123)
cv <- cv.glmnet(x, y, alpha = 1)
# Display the best lambda value
cv$lambda.min
```

`## [1] 0.00852`

```
# Fit the final model on the training data
model <- glmnet(x, y, alpha = 1, lambda = cv$lambda.min)
# Display regression coefficients
coef(model)
```

```
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 36.90539
## crim -0.09222
## zn 0.04842
## indus -0.00841
## chas 2.28624
## nox -16.79651
## rm 3.81186
## age .
## dis -1.59603
## rad 0.28546
## tax -0.01240
## ptratio -0.95041
## black 0.00965
## lstat -0.52880
```

```
# Make predictions on the test data
x.test <- model.matrix(medv ~., test.data)[,-1]
predictions <- model %>% predict(x.test) %>% as.vector()
# Model performance metrics
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  Rsquare = R2(predictions, test.data$medv)
)
```

```
## RMSE Rsquare
## 1 4.99 0.671
```

The elastic net regression can be easily computed using the `caret` workflow, which invokes the `glmnet` package.

We use `caret` to automatically select the best tuning parameters `alpha` and `lambda`. The `caret` package tests a range of possible `alpha` and `lambda` values, then selects the best combination, resulting in a final model that is an elastic net model.

Here, we'll test the combination of 10 different values for `alpha` and `lambda`. This is specified using the option `tuneLength`.

The best `alpha` and `lambda` values are those that minimize the cross-validation error (Chapter @ref(cross-validation)).

```
# Build the model using the training set
set.seed(123)
model <- train(
  medv ~., data = train.data, method = "glmnet",
  trControl = trainControl("cv", number = 10),
  tuneLength = 10
)
# Best tuning parameter
model$bestTune
```

```
## alpha lambda
## 6 0.1 0.21
```

```
# Coefficient of the final model. You need
# to specify the best lambda
coef(model$finalModel, model$bestTune$lambda)
```

```
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 33.04083
## crim -0.07898
## zn 0.04136
## indus -0.03093
## chas 2.34443
## nox -14.30442
## rm 3.90863
## age .
## dis -1.41783
## rad 0.20564
## tax -0.00879
## ptratio -0.91214
## black 0.00946
## lstat -0.51770
```

```
# Make predictions on the test data
x.test <- model.matrix(medv ~., test.data)[,-1]
predictions <- model %>% predict(x.test)
# Model performance metrics
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  Rsquare = R2(predictions, test.data$medv)
)
```

```
## RMSE Rsquare
## 1 4.98 0.672
```

The performance metrics of the different models are comparable. Using lasso or elastic net regression sets the coefficient of the predictor variable `age` to zero, leading to a simpler model than ridge regression, which includes all predictor variables.

All things being equal, we should go for the simpler model. In our example, we can choose the lasso or the elastic net regression model.

Note that we can easily compute and compare ridge, lasso and elastic net regression using the `caret` workflow.

`caret` will automatically choose the best tuning parameter values, compute the final model and evaluate the model performance using cross-validation techniques.

**Setup a grid range of lambda values**:

`lambda <- 10^seq(-3, 3, length = 100)`

**Compute ridge regression**:

```
# Build the model
set.seed(123)
ridge <- train(
  medv ~., data = train.data, method = "glmnet",
  trControl = trainControl("cv", number = 10),
  tuneGrid = expand.grid(alpha = 0, lambda = lambda)
)
# Model coefficients
coef(ridge$finalModel, ridge$bestTune$lambda)
# Make predictions
predictions <- ridge %>% predict(test.data)
# Model prediction performance
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  Rsquare = R2(predictions, test.data$medv)
)
```

**Compute lasso regression**:

```
# Build the model
set.seed(123)
lasso <- train(
  medv ~., data = train.data, method = "glmnet",
  trControl = trainControl("cv", number = 10),
  tuneGrid = expand.grid(alpha = 1, lambda = lambda)
)
# Model coefficients
coef(lasso$finalModel, lasso$bestTune$lambda)
# Make predictions
predictions <- lasso %>% predict(test.data)
# Model prediction performance
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  Rsquare = R2(predictions, test.data$medv)
)
```

**Elastic net regression**:

```
# Build the model
set.seed(123)
elastic <- train(
  medv ~., data = train.data, method = "glmnet",
  trControl = trainControl("cv", number = 10),
  tuneLength = 10
)
# Model coefficients
coef(elastic$finalModel, elastic$bestTune$lambda)
# Make predictions
predictions <- elastic %>% predict(test.data)
# Model prediction performance
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  Rsquare = R2(predictions, test.data$medv)
)
```

**Comparing models performance**:

The performance of the different models - ridge, lasso and elastic net - can be easily compared using `caret`

. The best model is defined as the one that minimizes the prediction error.

```
models <- list(ridge = ridge, lasso = lasso, elastic = elastic)
resamples(models) %>% summary(metric = "RMSE")
```

```
##
## Call:
## summary.resamples(object = ., metric = "RMSE")
##
## Models: ridge, lasso, elastic
## Number of resamples: 10
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## ridge 3.10 3.96 4.38 4.73 5.52 7.43 0
## lasso 3.16 4.03 4.39 4.73 5.51 7.27 0
## elastic 3.13 4.00 4.37 4.72 5.52 7.32 0
```

It can be seen that the elastic net model has the lowest median RMSE.

In this chapter we described the most commonly used penalized regression methods, including ridge regression, lasso regression and elastic net regression. These methods are very useful when you have a large multivariate data set.

Bruce, Peter, and Andrew Bruce. 2017. *Practical Statistics for Data Scientists*. O’Reilly Media.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014. *An Introduction to Statistical Learning: With Applications in R*. Springer Publishing Company, Incorporated.

This chapter presents regression methods based on dimension reduction techniques, which can be very useful when you have a large data set with multiple correlated predictor variables.

Generally, all dimension reduction methods work by first summarizing the original predictors into a few new variables called principal components (PCs), which are then used as predictors to fit the linear regression model. These methods avoid multicollinearity between predictors, which is a big issue in the regression setting (see Chapter @ref(multicollinearity)).

When using the dimension reduction methods, it’s generally recommended to standardize each predictor to make them comparable. Standardization consists of dividing the predictor by its standard deviation.

Here, we describe two well-known regression methods based on dimension reduction: **Principal Component Regression** (**PCR**) and **Partial Least Squares** (**PLS**) regression. We also provide practical examples in R.

Contents:

Principal component regression (PCR) first applies principal component analysis to summarize the original predictor variables into a few new variables, known as principal components (PCs), which are linear combinations of the original predictors.

These PCs are then used to build the linear regression model. The number of principal components to incorporate in the model is chosen by cross-validation (CV). Note that PCR is suitable when the data set contains highly correlated predictors.

A possible drawback of PCR is that there is no guarantee that the selected principal components are associated with the outcome: the selection of the principal components to incorporate in the model is not supervised by the outcome variable.

An alternative to PCR is **Partial Least Squares** (PLS) regression, which identifies new components that not only summarize the original predictors but are also related to the outcome. These components are then used to fit the regression model. So, compared to PCR, PLS uses a dimension reduction strategy that is supervised by the outcome.

Like PCR, PLS is convenient for data with highly correlated predictors. The number of components used in PLS is generally chosen by cross-validation. The predictors and the outcome variable should generally be standardized to make the variables comparable.
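To make the PCR idea concrete, here is a minimal by-hand sketch using only base R (`prcomp()` followed by `lm()`); the `pls` package used below does this, and more, internally, so the sketch is for intuition rather than production use:

```r
# By-hand PCR: compute the principal components of the predictors,
# then regress the outcome on the first few components
data("Boston", package = "MASS")
pca <- prcomp(Boston[, setdiff(names(Boston), "medv")], scale. = TRUE)
pcs <- as.data.frame(pca$x[, 1:5])   # keep the first 5 components
pcs$medv <- Boston$medv
pcr.fit <- lm(medv ~ ., data = pcs)  # ordinary least squares on the PCs
summary(pcr.fit)$r.squared           # variance in medv explained by 5 PCs
```

Note that the choice of 5 components here is fixed for illustration; in the workflow below, cross-validation picks this number automatically.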

- `tidyverse` for easy data manipulation and visualization
- `caret` for easy machine learning workflow
- `pls`, for computing PCR and PLS

```
library(tidyverse)
library(caret)
library(pls)
```

We’ll use the `Boston` data set [in the `MASS` package], introduced in Chapter @ref(regression-analysis), for predicting the median house value (`medv`), in Boston Suburbs, based on multiple predictor variables.

We’ll randomly split the data into training set (80% for building a predictive model) and test set (20% for evaluating the model). Make sure to set seed for reproducibility.

```
# Load the data
data("Boston", package = "MASS")
# Split the data into training and test set
set.seed(123)
training.samples <- Boston$medv %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]
```

The R function `train()` [`caret` package] provides an easy workflow to compute PCR and PLS by invoking the `pls` package. It has an option named `method`, which can take the value `pcr` or `pls`.

An additional argument is `scale = TRUE` for standardizing the variables to make them comparable.

`caret` uses cross-validation to automatically identify the optimal number of principal components (`ncomp`) to be incorporated in the model.

Here, we’ll test 10 different values of the tuning parameter `ncomp`. This is specified using the option `tuneLength`. The optimal number of principal components is selected so that the cross-validation error (RMSE) is minimized.

```
# Build the model on training set
set.seed(123)
model <- train(
medv~., data = train.data, method = "pcr",
scale = TRUE,
trControl = trainControl("cv", number = 10),
tuneLength = 10
)
# Plot model RMSE vs different values of components
plot(model)
# Print the best tuning parameter ncomp that
# minimize the cross-validation error, RMSE
model$bestTune
```

```
## ncomp
## 5 5
```

```
# Summarize the final model
summary(model$finalModel)
```

```
## Data: X dimension: 407 13
## Y dimension: 407 1
## Fit method: svdpc
## Number of components considered: 5
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps
## X 47.48 58.40 68.00 74.75 80.94
## .outcome 38.10 51.02 64.43 65.24 71.17
```

```
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance metrics
data.frame(
RMSE = caret::RMSE(predictions, test.data$medv),
Rsquare = caret::R2(predictions, test.data$medv)
)
```

```
## RMSE Rsquare
## 1 5.18 0.645
```

The plot shows the prediction error (RMSE, Chapter @ref(regression-model-accuracy-metrics)) made by the model according to the number of principal components incorporated in the model.

Our analysis shows that choosing five principal components (ncomp = 5) gives the smallest prediction error, RMSE.

The `summary()` function also provides the percentage of variance explained in the predictors (x) and in the outcome (`medv`) using different numbers of components.

For example, 80.94% of the variation (or information) contained in the predictors is captured by 5 principal components (`ncomp = 5`). Additionally, setting ncomp = 5 captures 71% of the information in the outcome variable (`medv`), which is good.

Taken together, cross-validation identifies ncomp = 5 as the optimal number of PCs, which minimizes the prediction error (RMSE) and explains enough variation in the predictors and in the outcome.
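The variance-explained figures can also be cross-checked with the `pls` package directly, using its `pcr()` and `explvar()` functions. The sketch below fits on the full `Boston` data rather than the training split, so the numbers will differ slightly from the output above:

```r
# Cross-check the % X variance explained per component with pls::pcr()
library(pls)
data("Boston", package = "MASS")
set.seed(123)
pcr.fit <- pcr(medv ~ ., data = Boston, scale = TRUE, validation = "CV")
# Cumulative % of predictor (X) variance explained by the first 5 components
cumsum(explvar(pcr.fit))[1:5]
```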

The R code is similar to that of the PCR method; only the `method` argument changes from `pcr` to `pls`.

```
# Build the model on training set
set.seed(123)
model <- train(
medv~., data = train.data, method = "pls",
scale = TRUE,
trControl = trainControl("cv", number = 10),
tuneLength = 10
)
# Plot model RMSE vs different values of components
plot(model)
# Print the best tuning parameter ncomp that
# minimize the cross-validation error, RMSE
model$bestTune
```

```
## ncomp
## 9 9
```

```
# Summarize the final model
summary(model$finalModel)
```

```
## Data: X dimension: 407 13
## Y dimension: 407 1
## Fit method: oscorespls
## Number of components considered: 9
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 46.19 57.32 64.15 69.76 75.63 78.66 82.85
## .outcome 50.90 71.84 73.71 74.71 75.18 75.35 75.42
## 8 comps 9 comps
## X 85.92 90.36
## .outcome 75.48 75.49
```

```
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance metrics
data.frame(
RMSE = caret::RMSE(predictions, test.data$medv),
Rsquare = caret::R2(predictions, test.data$medv)
)
```

```
## RMSE Rsquare
## 1 4.99 0.671
```

The optimal number of principal components included in the PLS model is 9. This captures 90% of the variation in the predictors and 75% of the variation in the outcome variable (`medv`).

In our example, the prediction error (RMSE) obtained on the test data with the PLS model (4.99) is lower than the RMSE obtained using the PCR method (5.18). So the PLS model is the better model for explaining our data, compared to the PCR model.

This chapter describes principal component based regression methods, including principal component regression (PCR) and partial least squares regression (PLS). These methods are very useful for multivariate data containing correlated predictors.

The presence of correlation in the data makes it possible to summarize the data into a few non-redundant components that can be used in the regression model.
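This non-redundancy can be verified directly: principal components are mutually uncorrelated by construction, which is exactly why they sidestep multicollinearity. A quick base-R check:

```r
# Principal components are uncorrelated by construction, so regressing
# on them avoids the multicollinearity present among the raw predictors
data("Boston", package = "MASS")
pca <- prcomp(Boston[, setdiff(names(Boston), "medv")], scale. = TRUE)
round(cor(pca$x[, 1:4]), 3)  # identity matrix: off-diagonals are 0
```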

Compared to ridge regression and the lasso (Chapter @ref(penalized-regression)), the final PCR and PLS models are more difficult to interpret, because they do not perform any kind of variable selection, nor do they directly produce regression coefficient estimates for the original predictors.