## Stepwise Logistic Regression Essentials in R

**Stepwise logistic regression** consists of automatically selecting a reduced number of predictor variables for building the best performing logistic regression model. Read more at Chapter @ref(stepwise-regression).

This chapter describes how to compute the stepwise logistic regression in **R**.

Contents:

## Loading required R packages

`tidyverse`

for easy data manipulation and visualization`caret`

for easy machine learning workflow

```
library(tidyverse)
library(caret)
```

## Preparing the data

Data set: `PimaIndiansDiabetes2`

[in `mlbench`

package], introduced in Chapter @ref(classification-in-r), for predicting the probability of being diabetes positive based on multiple clinical variables.

We’ll randomly split the data into training set (80% for building a predictive model) and test set (20% for evaluating the model). Make sure to set seed for reproductibility.

```
# Load the data and remove NAs
data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
# Inspect the data
sample_n(PimaIndiansDiabetes2, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- PimaIndiansDiabetes2[training.samples, ]
test.data <- PimaIndiansDiabetes2[-training.samples, ]
```

## Computing stepwise logistique regression

The stepwise logistic regression can be easily computed using the R function `stepAIC()`

available in the MASS package. It performs model selection by AIC. It has an option called `direction`

, which can have the following values: “both”, “forward”, “backward” (see Chapter @ref(stepwise-regression)).

### Quick start R code

```
library(MASS)
# Fit the model
model <- glm(diabetes ~., data = train.data, family = binomial) %>%
stepAIC(trace = FALSE)
# Summarize the final selected model
summary(model)
# Make predictions
probabilities <- model %>% predict(test.data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Model accuracy
mean(predicted.classes==test.data$diabetes)
```

### Full logistic regression model

Full model incorporating all predictors:

```
full.model <- glm(diabetes ~., data = train.data, family = binomial)
coef(full.model)
```

```
## (Intercept) pregnant glucose pressure triceps insulin
## -9.50372 0.04571 0.04230 -0.00700 0.01858 -0.00159
## mass pedigree age
## 0.04502 0.96845 0.04256
```

### Perform stepwise variable selection

Select the most contributive variables:

```
library(MASS)
step.model <- full.model %>% stepAIC(trace = FALSE)
coef(step.model)
```

```
## (Intercept) glucose mass pedigree age
## -9.5612 0.0379 0.0523 0.9697 0.0529
```

The function chose a final model in which one variable has been removed from the original full model. Dropped predictor is: `triceps`

.

### Compare the full and the stepwise models

Here, we’ll compare the performance of the full and the stepwise logistic models. The best model is defined as the model that has the lowest classification error rate in predicting the class of new test data:

Prediction accuracy of the full logistic regression model:

```
# Make predictions
probabilities <- full.model %>% predict(test.data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Prediction accuracy
observed.classes <- test.data$diabetes
mean(predicted.classes == observed.classes)
```

`## [1] 0.808`

Prediction accuracy of the stepwise logistic regression model:

```
# Make predictions
probabilities <- predict(step.model, test.data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Prediction accuracy
observed.classes <- test.data$diabetes
mean(predicted.classes == observed.classes)
```

`## [1] 0.795`

## Discussion

This chapter describes how to perform stepwise logistic regression in R. In our example, the stepwise regression have selected a reduced number of predictor variables resulting to a final model, which performance was similar to the one of the full model.

So, the stepwise selection reduced the complexity of the model without compromising its accuracy. Note that, all things equal, we should always choose the simpler model, here the final model returned by the stepwise regression.

Another alternative to the stepwise method, for model selection, is the penalized regression approach (Chapter @ref(penalized-logistic-regression)), which penalizes the model for having two many variables.