Penalized Logistic Regression Essentials in R: Ridge, Lasso and Elastic Net

kassambara | 11/03/2018 | 183598 | Comments (6) | Classification Methods Essentials

When you have multiple variables in your logistic regression model, it might be useful to find a reduced set of variables resulting to an optimal performing model (see Chapter @ref(penalized-regression)).

Penalized logistic regression imposes a penalty to the logistic model for having too many variables. This results in shrinking the coefficients of the less contributive variables toward zero. This is also known as regularization.

The most commonly used penalized regression include:

ridge regression: variables with minor contribution have their coefficients close to zero. However, all the variables are incorporated in the model. This is useful when all variables need to be incorporated in the model according to domain knowledge.
lasso regression: the coefficients of some less contributive variables are forced to be exactly zero. Only the most significant variables are kept in the final model.
elastic net regression: the combination of ridge and lasso regression. It shrinks some coefficients toward zero (like ridge regression) and set some coefficients to exactly zero (like lasso regression)

This chapter describes how to compute penalized logistic regression, such as lasso regression, for automatically selecting an optimal model containing the most contributive predictor variables.

Contents:

Loading required R packages
Preparing the data
Computing penalized logistic regression
Discussion

The Book:

Machine Learning Essentials: Practical Guide in R

Loading required R packages

tidyverse for easy data manipulation and visualization
caret for easy machine learning workflow
glmnet, for computing penalized regression

library(tidyverse)
library(caret)
library(glmnet)

Preparing the data

Data set: PimaIndiansDiabetes2 [in mlbench package], introduced in Chapter @ref(classification-in-r), for predicting the probability of being diabetes positive based on multiple clinical variables.

We’ll randomly split the data into training set (80% for building a predictive model) and test set (20% for evaluating the model). Make sure to set seed for reproductibility.

# Load the data and remove NAs
data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
# Inspect the data
sample_n(PimaIndiansDiabetes2, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>% 
  createDataPartition(p = 0.8, list = FALSE)
train.data  <- PimaIndiansDiabetes2[training.samples, ]
test.data <- PimaIndiansDiabetes2[-training.samples, ]

Computing penalized logistic regression

Additionnal data preparation

The R function model.matrix() helps to create the matrix of predictors and also automatically converts categorical predictors to appropriate dummy variables, which is required for the glmnet() function.

# Dumy code categorical predictor variables
x <- model.matrix(diabetes~., train.data)[,-1]
# Convert the outcome (class) to a numerical variable
y <- ifelse(train.data$diabetes == "pos", 1, 0)

R functions

We’ll use the R function glmnet() [glmnet package] for computing penalized logistic regression.

The simplified format is as follow:

glmnet(x, y, family = "binomial", alpha = 1, lambda = NULL)

x: matrix of predictor variables
y: the response or outcome variable, which is a binary variable.
family: the response type. Use “binomial” for a binary outcome variable
alpha: the elasticnet mixing parameter. Allowed values include:
- “1”: for lasso regression
- “0”: for ridge regression
- a value between 0 and 1 (say 0.3) for elastic net regression.
lamba: a numeric value defining the amount of shrinkage. Should be specify by analyst.

In penalized regression, you need to specify a constant lambda to adjust the amount of the coefficient shrinkage. The best lambda for your data, can be defined as the lambda that minimize the cross-validation prediction error rate. This can be determined automatically using the function cv.glmnet().

In the following R code, we’ll show how to compute lasso regression by specifying the option alpha = 1. You can also try the ridge regression, using alpha = 0, to see which is better for your data.

Quick start R code

Fit the lasso penalized regression model:

library(glmnet)
# Find the best lambda using cross-validation
set.seed(123) 
cv.lasso <- cv.glmnet(x, y, alpha = 1, family = "binomial")
# Fit the final model on the training data
model <- glmnet(x, y, alpha = 1, family = "binomial",
                lambda = cv.lasso$lambda.min)
# Display regression coefficients
coef(model)
# Make predictions on the test data
x.test <- model.matrix(diabetes ~., test.data)[,-1]
probabilities <- model %>% predict(newx = x.test)
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Model accuracy
observed.classes <- test.data$diabetes
mean(predicted.classes == observed.classes)

Compute lasso regression

Find the optimal value of lambda that minimizes the cross-validation error:

library(glmnet)
set.seed(123)
cv.lasso <- cv.glmnet(x, y, alpha = 1, family = "binomial")
plot(cv.lasso)

The plot displays the cross-validation error according to the log of lambda. The left dashed vertical line indicates that the log of the optimal value of lambda is approximately -5, which is the one that minimizes the prediction error. This lambda value will give the most accurate model. The exact value of lambda can be viewed as follow:

cv.lasso$lambda.min

## [1] 0.00871

Generally, the purpose of regularization is to balance accuracy and simplicity. This means, a model with the smallest number of predictors that also gives a good accuracy. To this end, the function cv.glmnet() finds also the value of lambda that gives the simplest model but also lies within one standard error of the optimal value of lambda. This value is called lambda.1se.

cv.lasso$lambda.1se

## [1] 0.0674

Using lambda.min as the best lambda, gives the following regression coefficients:

coef(cv.lasso, cv.lasso$lambda.min)

## 9 x 1 sparse Matrix of class "dgCMatrix"
##                     1
## (Intercept) -8.615615
## pregnant     0.035076
## glucose      0.036916
## pressure     .       
## triceps      0.016484
## insulin     -0.000392
## mass         0.030485
## pedigree     0.785506
## age          0.036265

From the output above, only the viable triceps has a coefficient exactly equal to zero.

Using lambda.1se as the best lambda, gives the following regression coefficients:

coef(cv.lasso, cv.lasso$lambda.1se)

## 9 x 1 sparse Matrix of class "dgCMatrix"
##                    1
## (Intercept) -4.65750
## pregnant     .      
## glucose      0.02628
## pressure     .      
## triceps      0.00191
## insulin      .      
## mass         .      
## pedigree     .      
## age          0.01734

Using lambda.1se, only 5 variables have non-zero coefficients. The coefficients of all other variables have been set to zero by the lasso algorithm, reducing the complexity of the model.

Setting lambda = lambda.1se produces a simpler model compared to lambda.min, but the model might be a little bit less accurate than the one obtained with lambda.min.

In the next sections, we’ll compute the final model using lambda.min and then assess the model accuracy against the test data. We’ll also discuss the results obtained by fitting the model using lambda = lambda.1se.

Compute the final lasso model:

Compute the final model using lambda.min:

# Final model with lambda.min
lasso.model <- glmnet(x, y, alpha = 1, family = "binomial",
                      lambda = cv.lasso$lambda.min)
# Make prediction on test data
x.test <- model.matrix(diabetes ~., test.data)[,-1]
probabilities <- lasso.model %>% predict(newx = x.test)
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Model accuracy
observed.classes <- test.data$diabetes
mean(predicted.classes == observed.classes)

## [1] 0.769

Compute the final model using lambda.1se:

# Final model with lambda.1se
lasso.model <- glmnet(x, y, alpha = 1, family = "binomial",
                      lambda = cv.lasso$lambda.1se)
# Make prediction on test data
x.test <- model.matrix(diabetes ~., test.data)[,-1]
probabilities <- lasso.model %>% predict(newx = x.test)
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Model accuracy rate
observed.classes <- test.data$diabetes
mean(predicted.classes == observed.classes)

## [1] 0.705

In the next sections, we’ll compare the accuracy obtained with lasso regression against the one obtained using the full logistic regression model (including all predictors).

Compute the full logistic model

# Fit the model
full.model <- glm(diabetes ~., data = train.data, family = binomial)
# Make predictions
probabilities <- full.model %>% predict(test.data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Model accuracy
observed.classes <- test.data$diabetes
mean(predicted.classes == observed.classes)

## [1] 0.808

Discussion

This chapter described how to compute penalized logistic regression model in R. Here, we focused on lasso model, but you can also fit the ridge regression by using alpha = 0 in the glmnet() function. For elastic net regression, you need to choose a value of alpha somewhere between 0 and 1. This can be done automatically using the caret package. See Chapter @ref(penalized-regression).

Our analysis demonstrated that the lasso regression, using lambda.min as the best lambda, results to simpler model without compromising much the model performance on the test data when compared to the full logistic model.

The model accuracy that we have obtained with lambda.1se is a bit less than what we got with the more complex model using all predictor variables (n = 8) or using lambda.min in the lasso regression. Even with lambda.1se, the obtained accuracy remains good enough in addition to the resulting model simplicity.

This means that the simpler model obtained with lasso regression does at least as good a job fitting the information in the data as the more complicated one. According to the bias-variance trade-off, all things equal, simpler model should be always preferred because it is less likely to overfit the training data.

For variable selection, an alternative to the penalized logistic regression techniques is the stepwise logistic regression described in the Chapter @ref(stepwise-logistic-regression).

1 Note

Enjoyed this article? Give us 5 stars (just above this text block)! Reader needs to be STHDA member for voting. I’d be very grateful if you’d help it spread by emailing it to a friend, or sharing it on Twitter, Facebook or Linked In.

Show me some love with the like buttons below... Thank you and please don't forget to share and comment below!!

Avez vous aimé cet article? Donnez nous 5 étoiles (juste au dessus de ce block)! Vous devez être membre pour voter. Je vous serais très reconnaissant si vous aidiez à sa diffusion en l'envoyant par courriel à un ami ou en le partageant sur Twitter, Facebook ou Linked In.

Montrez-moi un peu d'amour avec les like ci-dessous ... Merci et n'oubliez pas, s'il vous plaît, de partager et de commenter ci-dessous!

Recommended for You!

Machine Learning Essentials: Practical Guide in R

Practical Guide to Cluster Analysis in R

Practical Guide to Principal Component Methods in R

R Graphics Essentials for Great Data Visualization

Network Analysis and Visualization in R

More books on R and data science

Recommended for you

This section contains the best data science and self-development resources to help you on your path.

Books - Data Science

Our Books

Practical Guide to Cluster Analysis in R by A. Kassambara (Datanovia)
Practical Guide To Principal Component Methods in R by A. Kassambara (Datanovia)
Machine Learning Essentials: Practical Guide in R by A. Kassambara (Datanovia)
R Graphics Essentials for Great Data Visualization by A. Kassambara (Datanovia)
GGPlot2 Essentials for Great Data Visualization in R by A. Kassambara (Datanovia)
Network Analysis and Visualization in R by A. Kassambara (Datanovia)
Practical Statistics in R for Comparing Groups: Numerical Variables by A. Kassambara (Datanovia)
Inter-Rater Reliability Essentials: Practical Guide in R by A. Kassambara (Datanovia)

Others

R for Data Science: Import, Tidy, Transform, Visualize, and Model Data by Hadley Wickham & Garrett Grolemund
Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems by Aurelien Géron
Practical Statistics for Data Scientists: 50 Essential Concepts by Peter Bruce & Andrew Bruce
Hands-On Programming with R: Write Your Own Functions And Simulations by Garrett Grolemund & Hadley Wickham
An Introduction to Statistical Learning: with Applications in R by Gareth James et al.
Deep Learning with R by François Chollet & J.J. Allaire
Deep Learning with Python by François Chollet

Comments

You are not authorized to post a comment

Comment

mani3

Member

#897 09/09/2020 at 15h37

probabilities <- full.model %>% predict(test.data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")

I believe predict() would give the logit of the probabilities. So you need to apply the expit transformation on those before you predict the classes? Please correct me if I'm wrong.

Comment

Visitor

#771 04/28/2019 at 15h04

Very clear .. many thanks

Comment

Francis

Visitor

#683 02/03/2019 at 12h03

Hello, this is great. Do you have a full view of this particular example ( PimaIndiansDiabetes2) demonstrating all the 3 regularisation procedures (ridge regression, lasso regression, elastic net regression). Where can I find it?

thanks

Comment

Don

Visitor

#638 10/31/2018 at 21h22

An extremely helpful tutorial! Thank you!

However, I have one question: shouldn't we standartise prediction/design matrix x before applying LASSO? If some prediction variables have a very different range of values the penalty will not be equally proportional for each of the predictors.

Comment

kassambara

Administrator

#406 03/29/2018 at 20h46

Thank you for your feedback!

Comment

Visitor

#405 03/29/2018 at 14h38

Very well explained

STAY UPDATED

Articles - Classification Methods Essentials