# Articles - Regression Model Diagnostics

## Multicollinearity Essentials and VIF in R

In multiple regression (Chapter @ref(linear-regression)), two or more predictor variables might be correlated with each other. This situation is referred to as collinearity.

There is an extreme situation, called multicollinearity, where collinearity exists between three or more variables even if no pair of variables has a particularly high correlation. This means that there is redundancy between predictor variables.
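A small simulation can make this concrete. The sketch below uses made-up data (the names `x1` to `x5` are illustrative and are not part of the Boston data used later). `x5` is nearly the sum of four independent variables, so no single pairwise correlation is large, yet `x5` is almost perfectly predictable from the other four taken together.

```r
set.seed(123)
n <- 500

# Four independent predictors
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n); x4 <- rnorm(n)
# A fifth predictor that is almost an exact linear combination of the others
x5 <- x1 + x2 + x3 + x4 + rnorm(n, sd = 0.1)
sim <- data.frame(x1, x2, x3, x4, x5)

# Largest pairwise correlation (off-diagonal entries only)
cor_mat <- cor(sim)
max_pairwise <- max(abs(cor_mat[upper.tri(cor_mat)]))

# R-squared from regressing x5 on the other four predictors
r2_x5 <- summary(lm(x5 ~ x1 + x2 + x3 + x4, data = sim))$r.squared

max_pairwise  # moderate pairwise correlation
r2_x5         # very close to 1
```

Inspecting the correlation matrix alone would therefore miss this kind of redundancy.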

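In the presence of multicollinearity, the estimated regression coefficients become unstable: small changes in the data can produce large changes in the estimates, and their standard errors are inflated.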
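One way to see this instability is to compare the standard error of a coefficient with and without a nearly duplicate predictor in the model. The following is a minimal sketch on simulated data; the variables `x1`, `x2`, and `y` are assumptions made for illustration only.

```r
set.seed(123)
n <- 200

x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # x2 is almost a copy of x1
y  <- x1 + x2 + rnorm(n)

# Model with both collinear predictors vs. model with x1 only
fit_both <- lm(y ~ x1 + x2)
fit_one  <- lm(y ~ x1)

# Standard error of the x1 coefficient in each model
se_both <- coef(summary(fit_both))["x1", "Std. Error"]
se_one  <- coef(summary(fit_one))["x1", "Std. Error"]

c(se_both = se_both, se_one = se_one)
```

The standard error of `x1` should be much larger when `x2` is also in the model, which is the variance inflation that the VIF quantifies.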

For a given predictor (p), multicollinearity can be assessed by computing a score called the variance inflation factor (or VIF), which measures how much the variance of a regression coefficient is inflated due to multicollinearity in the model.

The smallest possible value of VIF is one (absence of multicollinearity). As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity (James et al. 2014).
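The VIF of a predictor can be computed by hand as `1 / (1 - R²)`, where `R²` comes from regressing that predictor on all the other predictors. The sketch below does this for the `tax` variable of the `Boston` data. It assumes the full data set rather than a training subset, so the value will differ slightly from one computed on a training split.

```r
data("Boston", package = "MASS")

# Regress the predictor of interest (tax) on all other predictors;
# the response medv is excluded because VIF depends on the predictors only
aux_fit <- lm(tax ~ . - medv, data = Boston)
r2_tax  <- summary(aux_fit)$r.squared

# VIF = 1 / (1 - R^2)
vif_tax <- 1 / (1 - r2_tax)
vif_tax
```

An `R²` of 0 gives a VIF of exactly 1, which is why 1 is the smallest possible value.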

When faced with multicollinearity, the variables concerned should be removed, since the presence of multicollinearity implies that the information these variables provide about the response is redundant in the presence of the other variables (James et al. 2014; Bruce and Bruce 2017).

This chapter describes how to detect multicollinearity in a regression model using R.

## Loading required R packages

• `tidyverse` for easy data manipulation and visualization
• `caret` for easy machine learning workflow

```r
library(tidyverse)
library(caret)
```

## Preparing the data

We’ll use the `Boston` data set [in `MASS` package], introduced in Chapter @ref(regression-analysis), for predicting the median house value (`medv`) in Boston suburbs, based on multiple predictor variables.

We’ll randomly split the data into a training set (80% for building a predictive model) and a test set (20% for evaluating the model). Make sure to set a seed for reproducibility.

```r
# Load the data
data("Boston", package = "MASS")
# Split the data into training and test set
set.seed(123)
training.samples <- Boston$medv %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]
```

## Building a regression model

The following regression model includes all predictor variables:

```r
# Build the model
model1 <- lm(medv ~ ., data = train.data)
# Make predictions
predictions <- model1 %>% predict(test.data)
# Model performance
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  R2 = R2(predictions, test.data$medv)
)
```

```
##   RMSE   R2
## 1 4.99 0.67
```

## Detecting multicollinearity

The R function `vif()` [car package] can be used to detect multicollinearity in a regression model:

```r
car::vif(model1)
```

```
##    crim      zn   indus    chas     nox      rm     age     dis     rad
##    1.87    2.36    3.90    1.06    4.47    2.01    3.02    3.96    7.80
##     tax ptratio   black   lstat
##    9.16    1.91    1.31    2.97
```

In our example, the VIF score for the predictor variable `tax` is the highest (VIF = 9.16), well above 5 and close to 10. The `rad` variable (VIF = 7.80) also exceeds 5. This might be problematic.
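When a model has many predictors, it can help to sort the VIF values and flag those above a chosen cutoff. The sketch below uses only base R and `MASS`: it relies on the fact that, for a linear model with an intercept, the VIFs equal the diagonal of the inverse of the predictors' correlation matrix. It is fitted on the full `Boston` data (an assumption made to keep the example self-contained), and the cutoff of 5 is one of the two rule-of-thumb values mentioned earlier.

```r
data("Boston", package = "MASS")

# Predictor columns (everything except the response medv)
X <- Boston[, setdiff(names(Boston), "medv")]

# The diagonal of the inverse correlation matrix gives each predictor's VIF
vif_values <- diag(solve(cor(X)))
names(vif_values) <- colnames(X)

# Sort from highest to lowest and flag values above the chosen threshold
threshold  <- 5
vif_sorted <- sort(vif_values, decreasing = TRUE)
high_vif   <- vif_sorted[vif_sorted > threshold]

round(vif_sorted, 2)
names(high_vif)
```

These values should agree with `car::vif()` applied to a model fitted on the same data with the same predictors.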

## Dealing with multicollinearity

In this section, we’ll update our model by removing the predictor variable with the highest VIF value (`tax`):

```r
# Build a model excluding the tax variable
model2 <- lm(medv ~ . - tax, data = train.data)
# Make predictions
predictions <- model2 %>% predict(test.data)
# Model performance
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  R2 = R2(predictions, test.data$medv)
)
```

```
##   RMSE    R2
## 1 5.01 0.671
```

It can be seen that removing the `tax` variable has very little effect on the model performance metrics.
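Removing one variable changes the VIF of the remaining ones, so a common heuristic is to drop the highest-VIF predictor, recompute, and repeat until every VIF falls below the cutoff. The sketch below illustrates that loop. It is not the procedure used above, which drops `tax` only. The helpers `compute_vif()` and `drop_high_vif()` are hypothetical names introduced here, and the example uses the full `Boston` data with a cutoff of 5.

```r
data("Boston", package = "MASS")

# Hypothetical helper: VIF of every column in a data frame of numeric predictors
compute_vif <- function(X) {
  v <- diag(solve(cor(X)))
  names(v) <- colnames(X)
  v
}

# Hypothetical helper: drop the highest-VIF predictor until all are <= threshold
drop_high_vif <- function(X, threshold = 5) {
  while (ncol(X) > 1) {
    v <- compute_vif(X)
    if (max(v) <= threshold) break
    X <- X[, names(v) != names(which.max(v)), drop = FALSE]
  }
  names(X)
}

predictors <- Boston[, setdiff(names(Boston), "medv")]
kept <- drop_high_vif(predictors, threshold = 5)

# Refit the model on the retained predictors
model_reduced <- lm(reformulate(kept, response = "medv"), data = Boston)
kept
```

As with the manual removal of `tax`, it is worth checking the test-set RMSE and R2 of the reduced model before adopting it.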

## Discussion

This chapter describes how to detect and deal with multicollinearity in regression models. Multicollinearity problems consist of including, in the model, different variables that have a similar predictive relationship with the outcome. This can be assessed for each predictor by computing the VIF value.

Any variable with a high VIF value (above 5 or 10) should be removed from the model. This leads to a simpler model without compromising the model accuracy.

Note that, in a large data set with many correlated predictor variables, you can instead use principal component regression or partial least squares regression. See Chapter @ref(pcr-and-pls-regression).

## References

Bruce, Peter, and Andrew Bruce. 2017. Practical Statistics for Data Scientists. O’Reilly Media.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014. An Introduction to Statistical Learning: With Applications in R. Springer Publishing Company, Incorporated.