Stepwise logistic regression consists of automatically selecting a reduced set of predictor variables that builds the best performing logistic regression model. Read more in Chapter @ref(stepwise-regression).
This chapter describes how to compute stepwise logistic regression in R.
Loading required R packages
tidyverse for easy data manipulation and visualization
caret for easy machine learning workflow
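Both packages can be loaded in the usual way, assuming they are already installed:

library(tidyverse)
library(caret)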
Preparing the data
We’ll use the PimaIndiansDiabetes2 data set [in the mlbench package], introduced in Chapter @ref(classification-in-r), for predicting the probability of being diabetes positive based on multiple clinical variables.
We’ll randomly split the data into a training set (80%, for building a predictive model) and a test set (20%, for evaluating the model). Make sure to set the seed for reproducibility.
# Load the data and remove NAs
data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
# Inspect the data
sample_n(PimaIndiansDiabetes2, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- PimaIndiansDiabetes2[training.samples, ]
test.data <- PimaIndiansDiabetes2[-training.samples, ]
Computing stepwise logistic regression
Stepwise logistic regression can be easily computed using the R function stepAIC(), available in the MASS package. It performs model selection by AIC. It has an option called direction, which can take the following values: “both”, “forward” or “backward” (see Chapter @ref(stepwise-regression)).
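For example, an explicit backward selection, which starts from the full model and drops predictors one at a time by AIC, can be requested as follows (a sketch, assuming the train.data set created above):

library(MASS)
# Fit the full model, then perform backward elimination by AIC
full.model <- glm(diabetes ~ ., data = train.data, family = binomial)
step.model <- stepAIC(full.model, direction = "backward", trace = FALSE)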
Quick start R code
library(MASS)
# Fit the model
model <- glm(diabetes ~ ., data = train.data, family = binomial) %>%
  stepAIC(trace = FALSE)
# Summarize the final selected model
summary(model)
# Make predictions
probabilities <- model %>% predict(test.data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Model accuracy
mean(predicted.classes == test.data$diabetes)
Full logistic regression model
Full model incorporating all predictors:
full.model <- glm(diabetes ~ ., data = train.data, family = binomial)
coef(full.model)
## (Intercept)    pregnant     glucose    pressure     triceps     insulin
##    -9.50372     0.04571     0.04230    -0.00700     0.01858    -0.00159
##        mass    pedigree         age
##     0.04502     0.96845     0.04256
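These coefficients are on the log-odds scale. To read them as odds ratios, a common quick check is to exponentiate them:

# Convert log-odds coefficients to odds ratios
exp(coef(full.model))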
Perform stepwise variable selection
Select the most contributive variables:
library(MASS)
step.model <- full.model %>% stepAIC(trace = FALSE)
coef(step.model)
## (Intercept)     glucose        mass    pedigree         age
##     -9.5612      0.0379      0.0523      0.9697      0.0529
The function chose a final model in which four variables have been removed from the original full model. The dropped predictors are: pregnant, pressure, triceps and insulin.
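You can verify which predictors were dropped by comparing the coefficient names of the two models; given the outputs above, this should return the four predictors listed:

# Predictors kept in the full model but removed by stepwise selection
setdiff(names(coef(full.model)), names(coef(step.model)))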
Compare the full and the stepwise models
Here, we’ll compare the performance of the full and the stepwise logistic regression models. The best model is defined as the one that has the lowest classification error rate when predicting the class of new test data:
Prediction accuracy of the full logistic regression model:
# Make predictions
probabilities <- full.model %>% predict(test.data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Prediction accuracy
observed.classes <- test.data$diabetes
mean(predicted.classes == observed.classes)
##  0.808
Prediction accuracy of the stepwise logistic regression model:
# Make predictions
probabilities <- predict(step.model, test.data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Prediction accuracy
observed.classes <- test.data$diabetes
mean(predicted.classes == observed.classes)
##  0.795
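Since stepAIC() selects by AIC, it is also informative to compare the two models on that criterion (lower is better); the model with the lower AIC achieves a better trade-off between goodness of fit and complexity:

# Compare the two models by AIC
AIC(full.model, step.model)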
This chapter describes how to perform stepwise logistic regression in R. In our example, the stepwise regression selected a reduced number of predictor variables, resulting in a final model whose performance was similar to that of the full model.
So, the stepwise selection reduced the complexity of the model without compromising its accuracy. Note that, all other things being equal, we should always choose the simpler model; here, that is the final model returned by the stepwise regression.
An alternative to the stepwise method for model selection is the penalized regression approach (Chapter @ref(penalized-logistic-regression)), which penalizes the model for having too many variables.
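As a minimal sketch of that approach, assuming the glmnet package is installed, a lasso-penalized logistic model could be fitted as follows; note that glmnet requires a numeric predictor matrix rather than a formula:

library(glmnet)
# Build the predictor matrix (drop the intercept column)
x <- model.matrix(diabetes ~ ., train.data)[, -1]
y <- train.data$diabetes
# Cross-validation to choose the penalty strength lambda (alpha = 1 is the lasso)
cv.lasso <- cv.glmnet(x, y, family = "binomial", alpha = 1)
# Coefficients at the selected lambda; variables shrunk to zero are dropped
coef(cv.lasso, s = "lambda.min")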