**Linear regression** (or **linear model**) is used to predict a quantitative outcome variable (y) on the basis of one or multiple predictor variables (x) (James et al. 2014; Bruce and Bruce 2017).

The goal is to build a mathematical formula that defines y as a function of the x variables. Once we have built a statistically significant model, it's possible to use it to predict future outcomes on the basis of new x values.

When you build a regression model, you need to assess the performance of the predictive model. In other words, you need to evaluate how well the model predicts the outcome of new test data that have not been used to build the model.

Two important metrics are commonly used to assess the performance of the predictive regression model:

- **Root Mean Squared Error** (RMSE), which measures the model prediction error. It corresponds to the average difference between the observed known values of the outcome and the values predicted by the model. RMSE is computed as `RMSE = mean((observeds - predicteds)^2) %>% sqrt()`. The lower the RMSE, the better the model.
- **R-square** (R2), representing the squared correlation between the observed known outcome values and the values predicted by the model. The higher the R2, the better the model.
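For reference, both metrics can be computed from scratch in base R (the `RMSE()` and `R2()` helpers used later in this chapter come from the caret package); a minimal sketch with made-up vectors:

```r
# Toy observed and predicted values (made-up numbers for illustration)
observed  <- c(10, 12, 15, 17, 20)
predicted <- c(11, 11, 16, 16, 21)

# Root Mean Squared Error: square root of the mean squared prediction error
rmse <- sqrt(mean((observed - predicted)^2))

# R-squared: squared Pearson correlation between observed and predicted
r2 <- cor(observed, predicted)^2
```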

A simple workflow to build a predictive regression model is as follows:

- Randomly split your data into training set (80%) and test set (20%)
- Build the regression model using the training set
- Make predictions using the test set and compute the model accuracy metrics
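This workflow can also be sketched in base R without caret (which the chapter uses below); the data here are simulated purely for illustration:

```r
# A base-R version of the workflow; caret's createDataPartition() is used
# later in the chapter, plain sample() is shown here for illustration
set.seed(123)
df <- data.frame(x = runif(100))
df$y <- 2 + 3 * df$x + rnorm(100, sd = 0.5)   # made-up linear data

# 1. Randomly split into training (80%) and test (20%) sets
idx   <- sample(nrow(df), size = 0.8 * nrow(df))
train <- df[idx, ]
test  <- df[-idx, ]

# 2. Build the regression model using the training set
fit <- lm(y ~ x, data = train)

# 3. Make predictions on the test set and compute accuracy metrics
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$y - pred)^2))
r2   <- cor(test$y, pred)^2
```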

In this chapter, you will learn:

- the basics and the formula of linear regression,
- how to compute simple and multiple regression models in R,
- how to make predictions of the outcome of new data,
- how to assess the performance of the model


The mathematical formula of the linear regression can be written as follows:

`y = b0 + b1*x + e`

We read this as “y is modeled as beta1 (`b1`) times `x`, plus a constant beta0 (`b0`), plus an error term `e`.”

When you have multiple predictor variables, the equation can be written as `y = b0 + b1*x1 + b2*x2 + ... + bn*xn`, where:

- b0 is the intercept,
- b1, b2, …, bn are the regression weights or coefficients associated with the predictors x1, x2, …, xn.
- `e` is the *error term* (also known as the *residual error*), the part of y that cannot be explained by the regression model

Note that, b0, b1, b2, … and bn are known as the regression beta coefficients or parameters.

The figure below illustrates a simple linear regression model, where:

- the best-fit regression line is in blue
- the intercept (b0) and the slope (b1) are shown in green
- the error terms (e) are represented by vertical red lines

From the scatter plot above, it can be seen that not all the data points fall exactly on the fitted regression line. Some of the points are above the blue line and some are below it; overall, the residual errors (e) have approximately zero mean.

The sum of the squares of the residual errors is called the **Residual Sum of Squares** or **RSS**.

The average variation of points around the fitted regression line is called the **Residual Standard Error** (**RSE**). This is one of the metrics used to evaluate the overall quality of the fitted regression model. The lower the RSE, the better the model fits the data.
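These quantities can be computed directly from a fitted model. A minimal base-R sketch with made-up data (`sigma()` returns the RSE that `lm()` reports):

```r
# RSS and RSE computed by hand on a toy model, then checked against lm()
set.seed(42)
x <- 1:20
y <- 5 + 2 * x + rnorm(20)          # made-up data
fit <- lm(y ~ x)

rss <- sum(residuals(fit)^2)        # Residual Sum of Squares
rse <- sqrt(rss / df.residual(fit)) # RSE = sqrt(RSS / residual degrees of freedom)

# rse agrees with sigma(fit), the residual standard error reported by lm()
```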

Since the mean error term is zero, the outcome variable y can be approximately estimated as follows:

`y ~ b0 + b1*x`

Mathematically, the beta coefficients (b0 and b1) are determined so that the RSS is as minimal as possible. This method of determining the beta coefficients is technically called **least squares** regression or **ordinary least squares** (OLS) regression.
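For simple regression, the least-squares solution has a closed form, which can be checked against `lm()` in base R (toy data, for illustration only):

```r
# For simple regression, the least-squares coefficients have a closed form:
#   b1 = cov(x, y) / var(x),  b0 = mean(y) - b1 * mean(x)
set.seed(1)
x <- runif(50, 0, 10)
y <- 3 + 1.5 * x + rnorm(50)   # made-up data

b1 <- cov(x, y) / var(x)
b0 <- mean(y) - b1 * mean(x)

fit <- lm(y ~ x)
# b0 and b1 agree with the intercept and slope returned by coef(fit)
</imports>
```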

Once the beta coefficients are calculated, a t-test is performed to check whether or not these coefficients are significantly different from zero. A non-zero beta coefficient means that there is a significant relationship between the predictors (x) and the outcome variable (y).

We'll use two R packages:

- `tidyverse` for easy data manipulation and visualization
- `caret` for easy machine learning workflows

```
library(tidyverse)
library(caret)
theme_set(theme_bw())
```

We'll use the `marketing` data set, introduced in Chapter @ref(regression-analysis), for predicting sales units on the basis of the amount of money spent in three advertising media (youtube, facebook and newspaper).

We'll randomly split the data into a training set (80%, for building a predictive model) and a test set (20%, for evaluating the model). Make sure to set the seed for reproducibility.

```
# Load the data
data("marketing", package = "datarium")
# Inspect the data
sample_n(marketing, 3)
```

```
## youtube facebook newspaper sales
## 58 163.4 23.0 19.9 15.8
## 157 112.7 52.2 60.6 18.4
## 81 91.7 32.0 26.8 14.2
```

```
# Split the data into training and test set
set.seed(123)
training.samples <- marketing$sales %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- marketing[training.samples, ]
test.data <- marketing[-training.samples, ]
```

The R function `lm()` is used to compute a linear regression model.

```
# Build the model
model <- lm(sales ~., data = train.data)
# Summarize the model
summary(model)
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
# (a) Prediction error, RMSE
RMSE(predictions, test.data$sales)
# (b) R-square
R2(predictions, test.data$sales)
```

The **simple linear regression** is used to predict a continuous outcome variable (y) based on one single predictor variable (x).

In the following example, we'll build a simple linear model to predict sales units based on the advertising budget spent on youtube. The regression equation can be written as `sales = b0 + b1*youtube`.

The R function `lm()` can be used to determine the beta coefficients of the linear model, as follows:

```
model <- lm(sales ~ youtube, data = train.data)
summary(model)$coef
```

```
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.3839 0.62442 13.4 5.22e-28
## youtube 0.0468 0.00301 15.6 7.84e-34
```

The output above shows the estimates of the regression beta coefficients (column `Estimate`) and their significance levels (column `Pr(>|t|)`). The intercept (`b0`) is 8.38 and the coefficient of the youtube variable is 0.046.

The estimated regression equation can therefore be written as `sales = 8.38 + 0.046*youtube`. Using this formula, for each new youtube advertising budget, you can predict the number of sales units.

For example:

- For a youtube advertising budget equal to zero, we can expect sales of 8.38 units.
- For a youtube advertising budget equal to 1000, we can expect sales of 8.38 + 0.046*1000 = 55 units.

Predictions can be easily made using the R function `predict()`. In the following example, we predict sales units for two youtube advertising budgets: 0 and 1000.

```
newdata <- data.frame(youtube = c(0, 1000))
model %>% predict(newdata)
```

```
## 1 2
## 8.38 55.19
```

**Multiple linear regression** is an extension of simple linear regression for predicting an outcome variable (y) on the basis of multiple distinct predictor variables (x).

For example, with three predictor variables (x), the prediction of y is expressed by the following equation: `y = b0 + b1*x1 + b2*x2 + b3*x3`

The regression beta coefficients measure the association between each predictor variable and the outcome. “b_j” can be interpreted as the average effect on y of a one unit increase in “x_j”, holding all other predictors fixed.

In this section, we'll build a multiple regression model to predict sales based on the budget invested in three advertising media: youtube, facebook and newspaper. The formula is as follows: `sales = b0 + b1*youtube + b2*facebook + b3*newspaper`

You can compute the multiple regression model coefficients in R as follows:

```
model <- lm(sales ~ youtube + facebook + newspaper,
            data = train.data)
summary(model)$coef
```

Note that, if you have many predictor variables in your data, you can simply include all the available variables in the model using `~.`:

```
model <- lm(sales ~., data = train.data)
summary(model)$coef
```

```
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.39188 0.44062 7.698 1.41e-12
## youtube 0.04557 0.00159 28.630 2.03e-64
## facebook 0.18694 0.00989 18.905 2.07e-42
## newspaper 0.00179 0.00677 0.264 7.92e-01
```

From the output above, the coefficients table shows the beta coefficient estimates and their significance levels. Columns are:

- `Estimate`: the intercept (b0) and the beta coefficient estimates associated with each predictor variable
- `Std. Error`: the standard error of the coefficient estimates. This represents the accuracy of the coefficients. The larger the standard error, the less confident we are about the estimate.
- `t value`: the t-statistic, which is the coefficient estimate (column 2) divided by the standard error of the estimate (column 3)
- `Pr(>|t|)`: the p-value corresponding to the t-statistic. The smaller the p-value, the more significant the estimate.
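The relationship between these columns can be verified directly; a small sketch with hypothetical data (not the marketing set):

```r
# The t value column is just Estimate / Std. Error; quick check on a toy model
set.seed(7)
d <- data.frame(x = rnorm(30))
d$y <- 1 + 2 * d$x + rnorm(30)

ctab <- summary(lm(y ~ x, data = d))$coef
# ctab has columns "Estimate", "Std. Error", "t value", "Pr(>|t|)"
ratio <- ctab[, "Estimate"] / ctab[, "Std. Error"]
# ratio agrees with ctab[, "t value"] for every row
```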

As previously described, you can easily make predictions using the R function `predict()`:

```
# New advertising budgets
newdata <- data.frame(
  youtube = 2000, facebook = 1000,
  newspaper = 1000
)
)
# Predict sales values
model %>% predict(newdata)
```

```
## 1
## 283
```

Before using a model for predictions, you need to assess the statistical significance of the model. This can be easily checked by displaying the statistical summary of the model.

Display the statistical summary of the model as follows:

`summary(model)`

```
##
## Call:
## lm(formula = sales ~ ., data = train.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.412 -1.110 0.348 1.422 3.499
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.39188 0.44062 7.70 1.4e-12 ***
## youtube 0.04557 0.00159 28.63 < 2e-16 ***
## facebook 0.18694 0.00989 18.90 < 2e-16 ***
## newspaper 0.00179 0.00677 0.26 0.79
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.12 on 158 degrees of freedom
## Multiple R-squared: 0.89, Adjusted R-squared: 0.888
## F-statistic: 427 on 3 and 158 DF, p-value: <2e-16
```

The summary output shows 6 components, including:

- **Call**. Shows the function call used to compute the regression model.
- **Residuals**. Provides a quick view of the distribution of the residuals, which by definition have a mean of zero. Therefore, the median should not be far from zero, and the minimum and maximum should be roughly equal in absolute value.
- **Coefficients**. Shows the regression beta coefficients and their statistical significance. Predictor variables that are significantly associated with the outcome variable are marked by stars.
- **Residual standard error** (RSE), **R-squared** (R2) and the **F-statistic**, which are metrics used to check how well the model fits our data.

The first step in interpreting the multiple regression analysis is to examine the F-statistic and the associated p-value, at the bottom of model summary.

In our example, it can be seen that p-value of the F-statistic is < 2.2e-16, which is highly significant. This means that, at least, one of the predictor variables is significantly related to the outcome variable.

To see which predictor variables are significant, you can examine the coefficients table, which shows the estimate of regression beta coefficients and the associated t-statistic p-values.

`summary(model)$coef`

```
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.39188 0.44062 7.698 1.41e-12
## youtube 0.04557 0.00159 28.630 2.03e-64
## facebook 0.18694 0.00989 18.905 2.07e-42
## newspaper 0.00179 0.00677 0.264 7.92e-01
```

For a given predictor, the t-statistic evaluates whether or not there is a significant association between the predictor and the outcome variable, that is, whether the beta coefficient of the predictor is significantly different from zero.

It can be seen that changes in the youtube and facebook advertising budgets are significantly associated with changes in sales, while changes in the newspaper budget are not.

For a given predictor variable, the coefficient (b) can be interpreted as the average effect on y of a one unit increase in predictor, holding all other predictors fixed.

For example, for a fixed amount of youtube and newspaper advertising budget, spending an additional 1 000 dollars on facebook advertising leads to an increase in sales by approximately 0.187*1000 = 187 sales units, on average.

The youtube coefficient suggests that for every 1 000 dollars increase in youtube advertising budget, holding all other predictors constant, we can expect an increase of 0.045*1000 = 45 sales units, on average.

We found that newspaper is not significant in the multiple regression model. This means that, for a fixed amount of youtube and facebook advertising budget, changes in the newspaper advertising budget will not significantly affect sales units.

As the newspaper variable is not significant, it is possible to remove it from the model:

```
model <- lm(sales ~ youtube + facebook, data = train.data)
summary(model)
```

```
##
## Call:
## lm(formula = sales ~ youtube + facebook, data = train.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.481 -1.104 0.349 1.423 3.486
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.43446 0.40877 8.4 2.3e-14 ***
## youtube 0.04558 0.00159 28.7 < 2e-16 ***
## facebook 0.18788 0.00920 20.4 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.11 on 159 degrees of freedom
## Multiple R-squared: 0.89, Adjusted R-squared: 0.889
## F-statistic: 644 on 2 and 159 DF, p-value: <2e-16
```

Finally, our model equation can be written as follows: `sales = 3.43 + 0.045*youtube + 0.187*facebook`.

Once you have identified that at least one predictor variable is significantly associated with the outcome, you should continue the diagnostics by checking how well the model fits the data. This process is also referred to as the *goodness-of-fit*.

The overall quality of the linear regression fit can be assessed using the following three quantities, displayed in the model summary:

- Residual Standard Error (RSE),
- R-squared (R2) and adjusted R2,
- F-statistic, which has been already described in the previous section
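The goodness-of-fit table shown below can be assembled from the components of `summary()`; a hedged sketch on toy data (the chapter's own table comes from the marketing model, whose exact call is not shown):

```r
# Assemble rse, r.squared, f.statistic and p.value from summary(lm(...))
set.seed(3)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
d$y <- 1 + 2 * d$x1 + 0.5 * d$x2 + rnorm(50)   # made-up data
s <- summary(lm(y ~ x1 + x2, data = d))

fs <- s$fstatistic   # named vector: value, numdf, dendf
gof <- data.frame(
  rse         = s$sigma,
  r.squared   = s$r.squared,
  f.statistic = fs[["value"]],
  p.value     = pf(fs[["value"]], fs[["numdf"]], fs[["dendf"]],
                   lower.tail = FALSE)
)
```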

```
## rse r.squared f.statistic p.value
## 1 2.11 0.89 644 5.64e-77
```

**Residual standard error** (RSE).

The RSE (or model *sigma*), corresponding to the prediction error, represents roughly the average difference between the observed outcome values and the values predicted by the model. The lower the RSE, the better the model fits our data.

Dividing the RSE by the average value of the outcome variable will give you the prediction error rate, which should be as small as possible.
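For example, in base R (toy data; `sigma()` returns the fitted model's RSE):

```r
# Prediction error rate = RSE / mean(outcome); sketch on a toy model
set.seed(11)
x <- runif(60, 0, 100)
y <- 10 + 0.05 * x + rnorm(60)   # made-up data
fit <- lm(y ~ x)

error.rate <- sigma(fit) / mean(y)   # the smaller, the better
```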

In our example, using only the youtube and facebook predictor variables, the RSE = 2.11, meaning that the observed sales values deviate from the predicted values by approximately 2.11 units on average.

This corresponds to an error rate of 2.11/mean(train.data$sales) = 2.11/16.77 = 13%, which is low.

**R-squared and Adjusted R-squared**:

The R-squared (R2) ranges from 0 to 1 and represents the proportion of variation in the outcome variable that can be explained by the model predictor variables.

For a simple linear regression, R2 is the square of the Pearson correlation coefficient between the outcome and the predictor variable. In multiple linear regression, R2 represents the squared correlation between the observed outcome values and the values predicted by the model.

The R2 measures how well the model fits the data. The higher the R2, the better the model. However, a problem with R2 is that it will always increase when more variables are added to the model, even if those variables are only weakly associated with the outcome (James et al. 2014). A solution is to adjust the R2 by taking into account the number of predictor variables.

The adjustment in the “Adjusted R Square” value in the summary output is a correction for the number of x variables included in the predictive model.

So, you should mainly consider the adjusted R-squared, which is a penalized R2 for a higher number of predictors.

- An (adjusted) R2 that is close to 1 indicates that a large proportion of the variability in the outcome has been explained by the regression model.
- A number near 0 indicates that the regression model did not explain much of the variability in the outcome.

In our example, the adjusted R2 is 0.889, which is good.
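The adjustment can be written out explicitly and checked against `summary()`; toy data for illustration:

```r
# Adjusted R2 penalizes R2 for the number of predictors p:
#   adj.R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)
set.seed(9)
d <- data.frame(x1 = rnorm(40), x2 = rnorm(40))
d$y <- 1 + d$x1 + rnorm(40)   # made-up data; x2 is pure noise
s <- summary(lm(y ~ x1 + x2, data = d))

n <- 40; p <- 2
adj <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)
# adj agrees with s$adj.r.squared, and is smaller than the raw R2
```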

**F-Statistic**:

Recall that the F-statistic gives the overall significance of the model. It assesses whether at least one predictor variable has a non-zero coefficient.

In a simple linear regression, this test is not really interesting since it just duplicates the information given by the t-test, available in the coefficient table.

The F-statistic becomes more important once we start using multiple predictors as in multiple linear regression.

A large F-statistic corresponds to a statistically significant p-value (p < 0.05). In our example, the F-statistic equals 644, producing a p-value of 5.64e-77, which is highly significant.

We’ll make predictions using the test data in order to evaluate the performance of our regression model.

The procedure is as follows:

- Predict the sales values based on new advertising budgets in the test data
- Assess the model performance by computing:
  - The prediction error RMSE (Root Mean Squared Error), representing the average difference between the observed known outcome values in the test data and the outcome values predicted by the model. The lower the RMSE, the better the model.
  - The R-square (R2), representing the correlation between the observed and predicted outcome values. The higher the R2, the better the model.

```
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
# (a) Compute the prediction error, RMSE
RMSE(predictions, test.data$sales)
```

`## [1] 1.58`

```
# (b) Compute R-square
R2(predictions, test.data$sales)
```

`## [1] 0.938`

From the output above, the R2 is 0.93, meaning that the observed and the predicted outcome values are highly correlated, which is very good.

The prediction error RMSE is 1.58, representing an error rate of 1.58/mean(test.data$sales) = 1.58/17 = 9.2%, which is good.

This chapter describes the basics of linear regression and provides practical examples in R for computing simple and multiple linear regression models. We also described how to assess the performance of the model for predictions.

Note that, linear regression assumes a linear relationship between the outcome and the predictor variables. This can be easily checked by creating a scatter plot of the outcome variable vs the predictor variable.

For example, the following R code displays sales units versus youtube advertising budget. We’ll also add a smoothed line:

```
ggplot(marketing, aes(x = youtube, y = sales)) +
geom_point() +
stat_smooth()
```

The graph above shows a linearly increasing relationship between the `sales` and the `youtube` variables, which is a good thing.

In addition to the linearity assumptions, the linear regression method makes many other assumptions about your data (see Chapter @ref(regression-assumptions-and-diagnostics)). You should make sure that these assumptions hold true for your data.

Potential problems include: the presence of influential observations in the data (Chapter @ref(regression-assumptions-and-diagnostics)), non-linearity between the outcome and some predictor variables (Chapter @ref(polynomial-and-spline-regression)), and the presence of strong correlation between predictor variables (Chapter @ref(multicollinearity)).

Bruce, Peter, and Andrew Bruce. 2017. *Practical Statistics for Data Scientists*. O’Reilly Media.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014. *An Introduction to Statistical Learning: With Applications in R*. Springer Publishing Company, Incorporated.

This chapter describes how to compute multiple linear regression with **interaction effects**.

Previously, we have described how to build a multiple linear regression model (Chapter @ref(linear-regression)) for predicting a continuous outcome variable (y) based on multiple predictor variables (x).

For example, to predict sales based on advertising budgets spent on youtube and facebook, the model equation is `sales = b0 + b1*youtube + b2*facebook`, where b0 is the intercept; b1 and b2 are the regression coefficients associated respectively with the predictor variables youtube and facebook.

The above equation, also known as the *additive model*, investigates only the main effects of the predictors. It assumes that the relationship between a given predictor variable and the outcome is independent of the other predictor variables (James et al. 2014; Bruce and Bruce 2017).

Considering our example, the additive model assumes that the effect on sales of youtube advertising is independent of the effect of facebook advertising.

This assumption might not be true. For example, spending money on facebook advertising may increase the effectiveness of youtube advertising on sales. In marketing, this is known as a synergy effect, and in statistics it is referred to as an interaction effect (James et al. 2014).

In this chapter, you’ll learn:

- the equation of multiple linear regression with interaction
- R codes for computing the regression coefficients associated with the main effects and the interaction effects
- how to interpret the interaction effect


The multiple linear regression equation, with interaction effects between two predictors (x1 and x2), can be written as follows:

`y = b0 + b1*x1 + b2*x2 + b3*(x1*x2)`

Considering our example, it becomes:

`sales = b0 + b1*youtube + b2*facebook + b3*(youtube*facebook)`

This can be also written as:

`sales = b0 + (b1 + b3*facebook)*youtube + b2*facebook`

or as:

`sales = b0 + b1*youtube + (b2 +b3*youtube)*facebook`

`b3` can be interpreted as the increase in the effectiveness of youtube advertising for a one unit increase in facebook advertising (or vice versa).

In the following sections, you will learn how to compute the regression coefficients in R.

We'll use two R packages:

- `tidyverse` for easy data manipulation and visualization
- `caret` for easy machine learning workflows

```
library(tidyverse)
library(caret)
```

We'll use the `marketing` data set, introduced in Chapter @ref(regression-analysis), for predicting sales units on the basis of the amount of money spent in three advertising media (youtube, facebook and newspaper).

We'll randomly split the data into a training set (80%, for building a predictive model) and a test set (20%, for evaluating the model).

```
# Load the data
data("marketing", package = "datarium")
# Inspect the data
sample_n(marketing, 3)
```

```
## youtube facebook newspaper sales
## 58 163.4 23.0 19.9 15.8
## 157 112.7 52.2 60.6 18.4
## 81 91.7 32.0 26.8 14.2
```

```
# Split the data into training and test set
set.seed(123)
training.samples <- marketing$sales %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- marketing[training.samples, ]
test.data <- marketing[-training.samples, ]
```

The standard linear regression model can be computed as follows:

```
# Build the model
model1 <- lm(sales ~ youtube + facebook, data = train.data)
# Summarize the model
summary(model1)
```

```
##
## Call:
## lm(formula = sales ~ youtube + facebook, data = train.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.481 -1.104 0.349 1.423 3.486
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.43446 0.40877 8.4 2.3e-14 ***
## youtube 0.04558 0.00159 28.7 < 2e-16 ***
## facebook 0.18788 0.00920 20.4 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.11 on 159 degrees of freedom
## Multiple R-squared: 0.89, Adjusted R-squared: 0.889
## F-statistic: 644 on 2 and 159 DF, p-value: <2e-16
```

```
# Make predictions
predictions <- model1 %>% predict(test.data)
# Model performance
# (a) Prediction error, RMSE
RMSE(predictions, test.data$sales)
```

`## [1] 1.58`

```
# (b) R-square
R2(predictions, test.data$sales)
```

`## [1] 0.938`

In R, you include interactions between variables using the `*` operator:

```
# Build the model
# Use this:
model2 <- lm(sales ~ youtube + facebook + youtube:facebook,
             data = train.data)
# Or simply, use this:
model2 <- lm(sales ~ youtube*facebook, data = train.data)
# Summarize the model
summary(model2)
```

```
##
## Call:
## lm(formula = sales ~ youtube * facebook, data = train.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.438 -0.482 0.231 0.748 1.860
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.90e+00 3.28e-01 24.06 <2e-16 ***
## youtube 1.95e-02 1.64e-03 11.90 <2e-16 ***
## facebook 2.96e-02 9.83e-03 3.01 0.003 **
## youtube:facebook 9.12e-04 4.84e-05 18.86 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.18 on 158 degrees of freedom
## Multiple R-squared: 0.966, Adjusted R-squared: 0.966
## F-statistic: 1.51e+03 on 3 and 158 DF, p-value: <2e-16
```

```
# Make predictions
predictions <- model2 %>% predict(test.data)
# Model performance
# (a) Prediction error, RMSE
RMSE(predictions, test.data$sales)
```

`## [1] 0.963`

```
# (b) R-square
R2(predictions, test.data$sales)
```

`## [1] 0.982`

It can be seen that all the coefficients, including the interaction term coefficient, are statistically significant, suggesting that there is an interaction relationship between the two predictor variables (youtube and facebook advertising).

Our model equation looks like this:

`sales = 7.89 + 0.019*youtube + 0.029*facebook + 0.0009*youtube*facebook`

We can interpret this as follows: an increase in youtube advertising of 1000 dollars is associated with an increase in sales of `(b1 + b3*facebook)*1000 = 19 + 0.9*facebook` units, and an increase in facebook advertising of 1000 dollars is associated with an increase in sales of `(b2 + b3*youtube)*1000 = 29 + 0.9*youtube` units.
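This arithmetic can be sketched directly from the rounded coefficients printed above (a plain calculation, not a re-fit of the model; the facebook budget of 50 is an arbitrary example value):

```r
# Marginal effect of a 1000-dollar increase in youtube advertising, at a
# given facebook budget, using the rounded coefficients from the summary
b1 <- 0.019    # youtube main effect
b3 <- 0.0009   # youtube:facebook interaction
facebook <- 50 # hypothetical facebook budget

effect <- (b1 + b3 * facebook) * 1000
effect   # 19 + 0.9 * 50 = 64 extra sales units
```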

Note that sometimes the interaction term is significant but not the main effects. The hierarchical principle states that, if we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant (James et al. 2014).

The prediction error RMSE of the interaction model is 0.963, which is lower than the prediction error of the additive model (1.58).

Additionally, the R-square (R2) value of the interaction model is 98% compared to only 93% for the additive model.

These results suggest that the model with the interaction term is better than the model that contains only main effects. So, for this specific data, we should go for the model with the interaction term.

This chapter describes how to compute multiple linear regression with interaction effects. Interaction terms should be included in the model if they are significant.

Bruce, Peter, and Andrew Bruce. 2017. *Practical Statistics for Data Scientists*. O’Reilly Media.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014. *An Introduction to Statistical Learning: With Applications in R*. Springer Publishing Company, Incorporated.

This chapter describes how to compute **regression with categorical variables**.

**Categorical variables** (also known as *factor* or *qualitative variables*) are variables that classify observations into groups. They have a limited number of different values, called levels. For example, the gender of individuals is a categorical variable that can take two levels: Male or Female.

Regression analysis requires numerical variables. So, when a researcher wishes to include a categorical variable in a regression model, supplementary steps are required to make the results interpretable.

In these steps, the categorical variables are recoded into a set of separate binary variables. This recoding is called “dummy coding” and leads to the creation of a table called *contrast matrix*. This is done automatically by statistical software, such as R.

Here, you’ll learn how to build and interpret a linear regression model with categorical predictor variables. We’ll also provide practical examples in R.


We'll use the R package `tidyverse` for easy data manipulation and visualization.

`library(tidyverse)`

We'll use the `Salaries` data set [`car` package], which contains the 2008-09 nine-month academic salaries of Assistant Professors, Associate Professors and Professors in a college in the U.S.

The data were collected as part of the on-going effort of the college’s administration to monitor salary differences between male and female faculty members.

```
# Load the data
data("Salaries", package = "car")
# Inspect the data
sample_n(Salaries, 3)
```

```
## rank discipline yrs.since.phd yrs.service sex salary
## 115 Prof A 12 0 Female 105000
## 313 Prof A 29 19 Male 94350
## 162 Prof B 26 19 Male 176500
```

Recall that the regression equation, for predicting an outcome variable (y) on the basis of a predictor variable (x), can be simply written as `y = b0 + b1*x`. `b0` and `b1` are the regression beta coefficients, representing the intercept and the slope, respectively.

Suppose that we wish to investigate differences in salaries between males and females.

Based on the gender variable, we can create a new dummy variable that takes the value:

- `1` if a person is male
- `0` if a person is female

and use this variable as a predictor in the regression equation, leading to the following model:

- `b0 + b1` if the person is male
- `b0` if the person is female

The coefficients can be interpreted as follows:

- `b0` is the average salary among females,
- `b0 + b1` is the average salary among males,
- and `b1` is the average difference in salary between males and females.
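This interpretation can be verified on a toy example in base R (the salaries below are made up; they are not from the `Salaries` data):

```r
# With a 0/1 dummy, the intercept is exactly the baseline group mean and
# intercept + coefficient is the other group's mean
salary <- c(90, 100, 110, 120, 130, 140)   # made-up salaries
sex    <- factor(c("Female", "Female", "Female", "Male", "Male", "Male"))

fit <- lm(salary ~ sex)
b   <- coef(fit)

b[["(Intercept)"]]                    # mean female salary: 100
b[["(Intercept)"]] + b[["sexMale"]]   # mean male salary: 130
```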

For simple demonstration purposes, the following example models the salary difference between males and females by computing a simple linear regression model on the `Salaries` data set [`car` package]. R creates dummy variables automatically:

```
# Compute the model
model <- lm(salary ~ sex, data = Salaries)
summary(model)$coef
```

```
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 101002 4809 21.00 2.68e-66
## sexMale 14088 5065 2.78 5.67e-03
```

From the output above, the average salary for females is estimated to be 101002, whereas the average for males is estimated to be 101002 + 14088 = 115090. The p-value for the dummy variable `sexMale` is very significant, suggesting that there is statistical evidence of a difference in average salary between the genders.

The `contrasts()` function returns the coding that R has used to create the dummy variables:

`contrasts(Salaries$sex)`

```
## Male
## Female 0
## Male 1
```

R has created a sexMale dummy variable that takes on a value of 1 if the sex is Male, and 0 otherwise. The decision to code males as 1 and females as 0 (baseline) is arbitrary, and has no effect on the regression computation, but does alter the interpretation of the coefficients.

You can use the function `relevel()` to set the baseline category to males as follows:

```
Salaries <- Salaries %>%
  mutate(sex = relevel(sex, ref = "Male"))
```

The output of the regression fit becomes:

```
model <- lm(salary ~ sex, data = Salaries)
summary(model)$coef
```

```
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 115090 1587 72.50 2.46e-230
## sexFemale -14088 5065 -2.78 5.67e-03
```

The fact that the coefficient for `sexFemale` in the regression output is negative indicates that being a Female is associated with a decrease in salary (relative to Males).

Now the estimates for `b0` and `b1` are 115090 and -14088, respectively, leading once again to a prediction of an average salary of 115090 for males and a prediction of 115090 - 14088 = 101002 for females.

Alternatively, instead of a 0/1 coding scheme, we could create a dummy variable coded as -1 (male) / 1 (female). This results in the model:

- `b0 - b1` if the person is male
- `b0 + b1` if the person is female

So, if the categorical variable is coded as -1 and 1, then a positive regression coefficient is subtracted from the group coded as -1 and added to the group coded as 1. If the regression coefficient is negative, the addition and subtraction are reversed.
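As a sketch of this alternative coding (not shown in the original output), you can assign a custom contrast matrix to a factor in R. The toy `sex`/`salary` data and the column label `F1` below are made up for illustration:

```r
# Sketch: a -1/+1 coding on a small synthetic factor, mirroring the text
sex <- factor(c("Male", "Male", "Female", "Female"))
salary <- c(100, 120, 90, 94)
# Build a one-column contrast matrix: Male = -1, Female = +1
m <- matrix(ifelse(levels(sex) == "Female", 1, -1), ncol = 1,
            dimnames = list(levels(sex), "F1"))
contrasts(sex) <- m
b <- coef(lm(salary ~ sex))
# b[1] is the average of the two group means; b[2] is half the
# female-minus-male difference, so males are predicted b0 - b1
# and females b0 + b1, as described above.
```

With this coding, the group predictions are unchanged; only the meaning of the intercept and slope changes.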

Generally, a categorical variable with n levels will be transformed into n-1 variables, each with two levels. These n-1 new variables contain the same information as the single variable. This recoding creates a table called a **contrast matrix**.

For example, `rank` in the `Salaries` data has three levels: “AsstProf”, “AssocProf” and “Prof”. This variable could be dummy coded into two variables, one called AssocProf and one called Prof:

- If rank = AssocProf, then the column AssocProf would be coded with a 1 and Prof with a 0.
- If rank = Prof, then the column AssocProf would be coded with a 0 and Prof would be coded with a 1.
- If rank = AsstProf, then both columns “AssocProf” and “Prof” would be coded with a 0.

This dummy coding is performed automatically by R. For demonstration purposes, you can use the function `model.matrix()` to create the contrast matrix for a factor variable:

```
res <- model.matrix(~rank, data = Salaries)
head(res[, -1])
```

```
## rankAssocProf rankProf
## 1 0 1
## 2 0 1
## 3 0 0
## 4 0 1
## 5 0 1
## 6 1 0
```

When building a linear model, there are different ways to encode categorical variables, known as contrast coding systems. The default option in R is to use the first level of the factor as a reference and to interpret the remaining levels relative to this level.
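You can inspect (and change) the coding system R applies by default via the `contrasts` option; a quick base-R sketch:

```r
# Base R defaults: "contr.treatment" (first level = reference) for
# unordered factors and "contr.poly" for ordered factors
getOption("contrasts")
# To switch unordered factors to sum (deviation) coding instead, you
# could run: options(contrasts = c("contr.sum", "contr.poly"))
```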

Note that ANOVA (analysis of variance) is just a special case of the linear model, where the predictors are categorical variables. And, because R understands that ANOVA and regression are both examples of linear models, it lets you extract the classic ANOVA table from your regression model using the base R `anova()` function or the `Anova()` function [`car` package]. We generally recommend the `Anova()` function because it automatically takes care of unbalanced designs.

The results of predicting salary using a multiple regression procedure are presented below.

```
library(car)
model2 <- lm(salary ~ yrs.service + rank + discipline + sex,
             data = Salaries)
Anova(model2)
```

```
## Anova Table (Type II tests)
##
## Response: salary
## Sum Sq Df F value Pr(>F)
## yrs.service 3.24e+08 1 0.63 0.43
## rank 1.03e+11 2 100.26 < 2e-16 ***
## discipline 1.74e+10 1 33.86 1.2e-08 ***
## sex 7.77e+08 1 1.51 0.22
## Residuals 2.01e+11 391
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Taking the other variables (yrs.service, rank and discipline) into account, it can be seen that the categorical variable sex is no longer significantly associated with the variation in salary between individuals. The significant variables are rank and discipline.

If you want to interpret the contrasts of the categorical variable, type this:

`summary(model2)`

```
##
## Call:
## lm(formula = salary ~ yrs.service + rank + discipline + sex,
## data = Salaries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -64202 -14255 -1533 10571 99163
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 73122.9 3245.3 22.53 < 2e-16 ***
## yrs.service -88.8 111.6 -0.80 0.42696
## rankAssocProf 14560.4 4098.3 3.55 0.00043 ***
## rankProf 49159.6 3834.5 12.82 < 2e-16 ***
## disciplineB 13473.4 2315.5 5.82 1.2e-08 ***
## sexFemale -4771.2 3878.0 -1.23 0.21931
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22700 on 391 degrees of freedom
## Multiple R-squared: 0.448, Adjusted R-squared: 0.441
## F-statistic: 63.4 on 5 and 391 DF, p-value: <2e-16
```

For example, it can be seen that being from discipline B (applied departments) is significantly associated with an average increase of 13473.38 in salary compared to discipline A (theoretical departments).

In this chapter we described how categorical variables are included in a linear regression model. As regression requires numerical inputs, categorical variables need to be recoded into sets of binary variables.

We provide practical examples for the situations where you have categorical variables containing two or more levels.

Note that, for categorical variables with a large number of levels it might be useful to group together some of the levels.

Some categorical variables have levels that are ordered. They can be converted to numerical values and used as is. For example, if the professor ranks (“AsstProf”, “AssocProf” and “Prof”) have an inherent ordering, you can convert them into numerical values, ordered from low to high, corresponding to higher-ranked professors.
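As a minimal sketch of this conversion, an ordered factor's level codes can be extracted directly with `as.numeric()`:

```r
# Convert ordered professor ranks into numeric scores (1 = lowest rank)
ranks <- factor(c("Prof", "AsstProf", "AssocProf", "Prof"),
                levels = c("AsstProf", "AssocProf", "Prof"),
                ordered = TRUE)
scores <- as.numeric(ranks)   # AsstProf = 1, AssocProf = 2, Prof = 3
```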

In some cases, the true relationship between the outcome and a predictor variable might not be linear.

There are different solutions extending the linear regression model (Chapter @ref(linear-regression)) for capturing these nonlinear effects, including:

- **Polynomial regression**. This is the simplest approach to model non-linear relationships. It adds polynomial terms (squares, cubes, etc.) to a regression.
- **Spline regression**. Fits a smooth curve with a series of polynomial segments. The values delimiting the spline segments are called **knots**.
- **Generalized additive models** (GAM). Fits spline models with automated selection of knots.

In this chapter, you’ll learn how to compute non-linear regression models and how to compare the different models in order to choose the one that best fits your data.

The RMSE and the R2 metrics will be used to compare the different models (see Chapter @ref(linear-regression)).

Recall that the RMSE represents the model prediction error, that is, the average difference between the observed outcome values and the predicted outcome values. The R2 represents the squared correlation between the observed and predicted outcome values. The best model is the model with the lowest RMSE and the highest R2.
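Both metrics can be computed by hand in base R (the `caret` functions `RMSE()` and `R2()` used later in this chapter do the same computation); a toy sketch with made-up vectors:

```r
# RMSE and R2 by hand on toy vectors
observed  <- c(3, 5, 7, 9)
predicted <- c(2.8, 5.3, 6.9, 9.4)
rmse <- sqrt(mean((observed - predicted)^2))   # model prediction error
r2   <- cor(observed, predicted)^2             # squared correlation
```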

Contents:

- `tidyverse` for easy data manipulation and visualization
- `caret` for easy machine learning workflow

```
library(tidyverse)
library(caret)
theme_set(theme_classic())
```

We’ll use the `Boston` data set [in the `MASS` package], introduced in Chapter @ref(regression-analysis), for predicting the median house value (`medv`) in Boston suburbs, based on the predictor variable `lstat` (percentage of lower status of the population).

We’ll randomly split the data into a training set (80%, for building a predictive model) and a test set (20%, for evaluating the model). Make sure to set the seed for reproducibility.

```
# Load the data
data("Boston", package = "MASS")
# Split the data into training and test set
set.seed(123)
training.samples <- Boston$medv %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]
```

First, visualize the scatter plot of `medv` vs `lstat`:

```
ggplot(train.data, aes(lstat, medv)) +
  geom_point() +
  stat_smooth()
```

The above scatter plot suggests a non-linear relationship between the two variables.

In the following sections, we start by computing linear and non-linear regression models. Next, we’ll compare the different models in order to choose the best one for our data.

The standard linear regression model equation can be written as `medv = b0 + b1*lstat`

.

Compute linear regression model:

```
# Build the model
model <- lm(medv ~ lstat, data = train.data)
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  R2 = R2(predictions, test.data$medv)
)
```

```
## RMSE R2
## 1 6.07 0.535
```

Visualize the data:

```
ggplot(train.data, aes(lstat, medv)) +
  geom_point() +
  stat_smooth(method = lm, formula = y ~ x)
```

The polynomial regression adds polynomial or quadratic terms to the regression equation as follows:

\[medv = b0 + b1*lstat + b2*lstat^2\]

In R, to create a predictor x^2 you should use the function `I()`, as follows: `I(x^2)`. This raises x to the power of 2.

The polynomial regression can be computed in R as follows:

`lm(medv ~ lstat + I(lstat^2), data = train.data)`

An alternative simple solution is to use this:

`lm(medv ~ poly(lstat, 2, raw = TRUE), data = train.data)`

```
##
## Call:
## lm(formula = medv ~ poly(lstat, 2, raw = TRUE), data = train.data)
##
## Coefficients:
## (Intercept) poly(lstat, 2, raw = TRUE)1
## 43.351 -2.340
## poly(lstat, 2, raw = TRUE)2
## 0.043
```

The output contains two coefficients associated with `lstat`: one for the linear term (lstat^1) and one for the quadratic term (lstat^2).

The following example computes a sixth-order polynomial fit:

```
lm(medv ~ poly(lstat, 6, raw = TRUE), data = train.data) %>%
  summary()
```

```
##
## Call:
## lm(formula = medv ~ poly(lstat, 6, raw = TRUE), data = train.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.23 -3.24 -0.74 2.02 26.50
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.14e+01 6.00e+00 11.90 < 2e-16 ***
## poly(lstat, 6, raw = TRUE)1 -1.45e+01 3.22e+00 -4.48 9.6e-06 ***
## poly(lstat, 6, raw = TRUE)2 1.87e+00 6.26e-01 2.98 0.003 **
## poly(lstat, 6, raw = TRUE)3 -1.32e-01 5.73e-02 -2.30 0.022 *
## poly(lstat, 6, raw = TRUE)4 4.98e-03 2.66e-03 1.87 0.062 .
## poly(lstat, 6, raw = TRUE)5 -9.56e-05 6.03e-05 -1.58 0.114
## poly(lstat, 6, raw = TRUE)6 7.29e-07 5.30e-07 1.38 0.170
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.28 on 400 degrees of freedom
## Multiple R-squared: 0.684, Adjusted R-squared: 0.679
## F-statistic: 144 on 6 and 400 DF, p-value: <2e-16
```

From the output above, it can be seen that the polynomial terms beyond the fifth order are not significant. So, just create a fifth-order polynomial regression model as follows:

```
# Build the model
model <- lm(medv ~ poly(lstat, 5, raw = TRUE), data = train.data)
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  R2 = R2(predictions, test.data$medv)
)
```

```
## RMSE R2
## 1 4.96 0.689
```

Visualize the fifth-order polynomial regression line as follows:

```
ggplot(train.data, aes(lstat, medv)) +
  geom_point() +
  stat_smooth(method = lm, formula = y ~ poly(x, 5, raw = TRUE))
```

When you have a non-linear relationship, you can also try a logarithm transformation of the predictor variables:

```
# Build the model
model <- lm(medv ~ log(lstat), data = train.data)
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  R2 = R2(predictions, test.data$medv)
)
```

```
## RMSE R2
## 1 5.24 0.657
```

Visualize the data:

```
ggplot(train.data, aes(lstat, medv)) +
  geom_point() +
  stat_smooth(method = lm, formula = y ~ log(x))
```

Polynomial regression only captures a certain amount of curvature in a nonlinear relationship. An alternative, and often superior, approach to modeling nonlinear relationships is to use splines (P. Bruce and Bruce 2017).

Splines provide a way to smoothly interpolate between fixed points, called knots. Polynomial regression is computed between knots. In other words, splines are a series of polynomial segments strung together, joining at knots (P. Bruce and Bruce 2017).

The R package `splines` includes the function `bs()` for creating a b-spline term in a regression model.

You need to specify two parameters: the degree of the polynomial and the location of the knots. In our example, we’ll place the knots at the lower quartile, the median, and the upper quartile:

`knots <- quantile(train.data$lstat, p = c(0.25, 0.5, 0.75))`

We’ll create a model using a cubic spline (degree = 3):

```
library(splines)
# Build the model
knots <- quantile(train.data$lstat, p = c(0.25, 0.5, 0.75))
model <- lm (medv ~ bs(lstat, knots = knots), data = train.data)
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  R2 = R2(predictions, test.data$medv)
)
```

```
## RMSE R2
## 1 4.97 0.688
```

Note that the coefficients for a spline term are not interpretable.

Visualize the cubic spline as follows:

```
ggplot(train.data, aes(lstat, medv)) +
  geom_point() +
  stat_smooth(method = lm, formula = y ~ splines::bs(x, df = 3))
```

Once you have detected a non-linear relationship in your data, polynomial terms may not be flexible enough to capture it, and spline terms require specifying the knots.

Generalized additive models, or GAMs, are a technique for automatically fitting a spline regression. This can be done using the `mgcv` R package:

```
library(mgcv)
# Build the model
model <- gam(medv ~ s(lstat), data = train.data)
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  R2 = R2(predictions, test.data$medv)
)
```

```
## RMSE R2
## 1 5.02 0.684
```

The term `s(lstat)` tells the `gam()` function to find the “best” knots for the spline term.

Visualize the data:

```
ggplot(train.data, aes(lstat, medv)) +
  geom_point() +
  stat_smooth(method = gam, formula = y ~ s(x))
```

From analyzing the RMSE and the R2 metrics of the different models, it can be seen that the polynomial regression, the spline regression and the generalized additive models outperform the linear regression model and the log transformation approaches.

This chapter describes how to compute non-linear regression models using R.

Bruce, Peter, and Andrew Bruce. 2017. *Practical Statistics for Data Scientists*. O’Reilly Media.

Linear regression (Chapter @ref(linear-regression)) makes several assumptions about the data at hand. This chapter describes **regression assumptions** and provides built-in plots for **regression diagnostics** in R programming language.

After performing a regression analysis, you should always check if the model works well for the data at hand.

A first step of this regression diagnostic is to inspect the significance of the regression beta coefficients, as well as the R2, which tells us how well the linear regression model fits the data. This has been described in Chapters @ref(linear-regression) and @ref(cross-validation).

In this current chapter, you will learn additional steps to evaluate how well the model fits the data.

For example, the linear regression model makes the assumption that the relationship between the predictors (x) and the outcome variable is linear. This might not be true. The relationship could be polynomial or logarithmic.

Additionally, the data might contain some influential observations, such as outliers (or extreme values), that can affect the result of the regression.

Therefore, you should closely diagnose the regression model that you built in order to detect potential problems and to check whether the assumptions made by the linear regression model are met or not.

To do so, we generally examine the distribution of **residual errors**, which can tell you more about your data.

In this chapter,

- we start by explaining **residual errors** and **fitted values**,
- next, we present linear **regression assumptions**, as well as potential problems you can face when performing regression analysis,
- finally, we describe some built-in **diagnostic plots** in R for testing the assumptions underlying the linear regression model.

Contents:

- Loading Required R packages
- Example of data
- Building a regression model
- Fitted values and residuals
- Regression assumptions
- Regression diagnostics
- Linearity of the data
- Homogeneity of variance
- Normality of residuals
- Outliers and high leverage points
- Influential values
- Discussion
- References

- `tidyverse` for easy data manipulation and visualization
- `broom` for creating a tidy data frame from statistical test results

```
library(tidyverse)
library(broom)
theme_set(theme_classic())
```

We’ll use the `marketing` data set [datarium package], introduced in Chapter @ref(regression-analysis).

```
# Load the data
data("marketing", package = "datarium")
# Inspect the data
sample_n(marketing, 3)
```

```
## youtube facebook newspaper sales
## 58 163.4 23.0 19.9 15.8
## 157 112.7 52.2 60.6 18.4
## 81 91.7 32.0 26.8 14.2
```

We build a model to predict sales on the basis of the advertising budget spent on youtube media.

```
model <- lm(sales ~ youtube, data = marketing)
model
```

```
##
## Call:
## lm(formula = sales ~ youtube, data = marketing)
##
## Coefficients:
## (Intercept) youtube
## 8.4391 0.0475
```

Our regression equation is `y = 8.44 + 0.047*x`, that is, `sales = 8.44 + 0.047*youtube`.

Before, describing regression assumptions and regression diagnostics, we start by explaining two key concepts in regression analysis: Fitted values and residuals errors. These are important for understanding the diagnostic plots presented hereafter.

The **fitted** (or **predicted**) values are the y-values that you would expect for the given x-values according to the built regression model (or visually, the best-fitting straight regression line).

In our example, for a given youtube advertising budget, the fitted (predicted) sales value would be `sales = 8.44 + 0.047*youtube`.

From the scatter plot below, it can be seen that not all the data points fall exactly on the estimated regression line. This means that, for a given youtube advertising budget, the observed (or measured) sale values can be different from the predicted sale values. The difference is called the **residual error**, represented by the vertical red lines.
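The relation residual = observed - fitted holds for every observation of any fitted model; a small base-R sketch with made-up numbers:

```r
# Fitted values and residuals on a toy data set
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
fit <- lm(y ~ x)
# Each residual is the observed value minus the fitted value
resid_check <- all.equal(unname(residuals(fit)),
                         y - unname(fitted(fit)))
```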

In R, you can easily augment your data to add fitted values and residuals by using the function `augment()` [broom package]. Let’s call the output `model.diag.metrics` because it contains several metrics useful for regression diagnostics. We’ll describe them later.

```
model.diag.metrics <- augment(model)
head(model.diag.metrics)
```

```
## sales youtube .fitted .se.fit .resid .hat .sigma .cooksd .std.resid
## 1 26.52 276.1 21.56 0.385 4.955 0.00970 3.90 7.94e-03 1.2733
## 2 12.48 53.4 10.98 0.431 1.502 0.01217 3.92 9.20e-04 0.3866
## 3 11.16 20.6 9.42 0.502 1.740 0.01649 3.92 1.69e-03 0.4486
## 4 22.20 181.8 17.08 0.277 5.119 0.00501 3.90 4.34e-03 1.3123
## 5 15.48 217.0 18.75 0.297 -3.273 0.00578 3.91 2.05e-03 -0.8393
## 6 8.64 10.4 8.94 0.525 -0.295 0.01805 3.92 5.34e-05 -0.0762
```

Among the table columns, there are:

- `youtube`: the invested youtube advertising budget
- `sales`: the observed sale values
- `.fitted`: the fitted sale values
- `.resid`: the residual errors
- …

The following R code plots the residual errors (in red) between the observed values and the fitted regression line. Each vertical red segment represents the residual error between an observed sale value and the corresponding predicted (i.e., fitted) value.

```
ggplot(model.diag.metrics, aes(youtube, sales)) +
  geom_point() +
  stat_smooth(method = lm, se = FALSE) +
  geom_segment(aes(xend = youtube, yend = .fitted), color = "red", size = 0.3)
```

In order to check regression assumptions, we’ll examine the distribution of residuals.

Linear regression makes several assumptions about the data, such as:

- **Linearity of the data**. The relationship between the predictor (x) and the outcome (y) is assumed to be linear.
- **Normality of residuals**. The residual errors are assumed to be normally distributed.
- **Homogeneity of residuals variance**. The residuals are assumed to have a constant variance (**homoscedasticity**).
- **Independence of residual error terms**.

You should check whether or not these assumptions hold true. Potential problems include:

- **Non-linearity** of the outcome-predictor relationships
- **Heteroscedasticity**: non-constant variance of the error terms.
- **Presence of influential values** in the data, which can be:
  - Outliers: extreme values in the outcome (y) variable
  - High-leverage points: extreme values in the predictor (x) variables

All these assumptions and potential problems can be checked by producing some diagnostic plots visualizing the residual errors.

Regression diagnostic plots can be created using the base R function `plot()` or the `autoplot()` function [ggfortify package], which creates ggplot2-based graphics.

- Create the diagnostic plots with the R base function:

```
par(mfrow = c(2, 2))
plot(model)
```

- Create the diagnostic plots using ggfortify:

```
library(ggfortify)
autoplot(model)
```

The diagnostic plots show residuals in four different ways:

- **Residuals vs Fitted**. Used to check the linear relationship assumption. A horizontal line, without distinct patterns, is an indication of a linear relationship, which is good.
- **Normal Q-Q**. Used to examine whether the residuals are normally distributed. It’s good if the residual points follow the straight dashed line.
- **Scale-Location** (or Spread-Location). Used to check the homogeneity of variance of the residuals (homoscedasticity). A horizontal line with equally spread points is a good indication of homoscedasticity. This is not the case in our example, where we have a heteroscedasticity problem.
- **Residuals vs Leverage**. Used to identify influential cases, that is, extreme values that might influence the regression results when included in or excluded from the analysis. This plot will be described further in the next sections.

The four plots show the top 3 most extreme data points, labeled with their row numbers in the data set. They might be potentially problematic. You might want to take a close look at them individually to check whether there is anything special about the subject or whether they could simply be data entry errors. We’ll discuss this in the following sections.

The metrics used to create the above plots are available in the `model.diag.metrics`

data, described in the previous section.

```
# Add observations indices and
# drop some columns (.se.fit, .sigma) for simplification
model.diag.metrics <- model.diag.metrics %>%
  mutate(index = 1:nrow(model.diag.metrics)) %>%
  select(index, everything(), -.se.fit, -.sigma)
# Inspect the data
head(model.diag.metrics, 4)
```

```
## index sales youtube .fitted .resid .hat .cooksd .std.resid
## 1 1 26.5 276.1 21.56 4.96 0.00970 0.00794 1.273
## 2 2 12.5 53.4 10.98 1.50 0.01217 0.00092 0.387
## 3 3 11.2 20.6 9.42 1.74 0.01649 0.00169 0.449
## 4 4 22.2 181.8 17.08 5.12 0.00501 0.00434 1.312
```

We’ll use mainly the following columns:

- `.fitted`: the fitted values
- `.resid`: the residual errors
- `.hat`: hat values, used to detect high-leverage points (or extreme values in the predictor x variables)
- `.std.resid`: standardized residuals, that is, the residuals divided by their standard errors; used to detect outliers (or extreme values in the outcome y variable)
- `.cooksd`: Cook’s distance, used to detect influential values, which can be outliers or high-leverage points

In the following sections, we’ll describe, in detail, how to use these graphs and metrics to check the regression assumptions and to diagnose potential problems in the model.

The linearity assumption can be checked by inspecting the **Residuals vs Fitted** plot (1st plot):

`plot(model, 1)`

Ideally, the residual plot will show no fitted pattern. That is, the red line should be approximately horizontal at zero. The presence of a pattern may indicate a problem with some aspect of the linear model.

In our example, there is no pattern in the residual plot. This suggests that we can assume a linear relationship between the predictors and the outcome variable.

Note that, if the residual plot indicates a non-linear relationship in the data, then a simple approach is to use non-linear transformations of the predictors, such as log(x), sqrt(x) and x^2, in the regression model.
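A quick simulated sketch of this idea: when the true relationship is logarithmic, fitting `y ~ log(x)` recovers it (the data below are made up):

```r
# Simulate a logarithmic relationship and fit it with a log(x) term
set.seed(1)
df <- data.frame(x = runif(100, min = 1, max = 10))
df$y <- 2 * log(df$x) + rnorm(100, sd = 0.1)
fit <- lm(y ~ log(x), data = df)
coef(fit)   # the log(x) slope should be close to the true value 2
```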

This assumption can be checked by examining the *scale-location plot*, also known as the *spread-location plot*.

`plot(model, 3)`

This plot shows if residuals are spread equally along the ranges of predictors. It’s good if you see a horizontal line with equally spread points. In our example, this is not the case.

It can be seen that the variability (variance) of the residual points increases with the value of the fitted outcome variable, suggesting non-constant variance in the residual errors (or *heteroscedasticity*).

A possible solution to reduce the heteroscedasticity problem is to use a log or square root transformation of the outcome variable (y).

```
model2 <- lm(log(sales) ~ youtube, data = marketing)
plot(model2, 3)
```

The QQ plot of residuals can be used to visually check the normality assumption. The normal probability plot of residuals should approximately follow a straight line.

In our example, all the points fall approximately along this reference line, so we can assume normality.

`plot(model, 2)`

**Outliers**:

An outlier is a point that has an extreme outcome variable value. The presence of outliers may affect the interpretation of the model, because it increases the RSE.

Outliers can be identified by examining the *standardized residual* (or *studentized residual*), which is the residual divided by its estimated standard error. Standardized residuals can be interpreted as the number of standard errors away from the regression line.

Observations whose standardized residuals are greater than 3 in absolute value are possible outliers (James et al. 2014).

**High leverage points**:

A data point has high leverage if it has extreme predictor x values. This can be detected by examining the leverage statistic or the *hat-value*. A value of this statistic above `2(p + 1)/n` indicates an observation with high leverage (P. Bruce and Bruce 2017), where `p` is the number of predictors and `n` is the number of observations.
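These two rules of thumb can be applied directly with the base R functions `rstandard()` and `hatvalues()`; a sketch on the built-in `mtcars` data (not the marketing example used in this chapter):

```r
# Flag possible outliers (|standardized residual| > 3) and
# high-leverage points (hat value > 2(p + 1)/n) for a simple model
fit <- lm(mpg ~ wt, data = mtcars)
n <- nrow(mtcars)
p <- 1  # one predictor
outliers      <- which(abs(rstandard(fit)) > 3)
high_leverage <- which(hatvalues(fit) > 2 * (p + 1) / n)
```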

Outliers and high leverage points can be identified by inspecting the *Residuals vs Leverage* plot:

`plot(model, 5)`

The plot above highlights the top 3 most extreme points (#26, #36 and #179), with standardized residuals below -2. However, there are no outliers that exceed 3 standard deviations, which is good.

Additionally, there are no high-leverage points in the data. That is, all data points have a leverage statistic below 2(p + 1)/n = 4/200 = 0.02.

An influential value is a value whose inclusion or exclusion can alter the results of the regression analysis. Such a value is associated with a large residual.

Not all outliers (or extreme data points) are influential in linear regression analysis.

Statisticians have developed a metric called *Cook’s distance* to determine the influence of a value. This metric defines influence as a combination of leverage and residual size.

A rule of thumb is that an observation has high influence if its Cook’s distance exceeds `4/(n - p - 1)` (P. Bruce and Bruce 2017), where `n` is the number of observations and `p` the number of predictor variables.
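In base R, Cook's distances are returned by `cooks.distance()`; a sketch of this rule of thumb on the built-in `mtcars` data:

```r
# Flag influential observations with the 4/(n - p - 1) rule of thumb
fit <- lm(mpg ~ wt, data = mtcars)
n <- nrow(mtcars)
p <- 1  # one predictor
influential <- which(cooks.distance(fit) > 4 / (n - p - 1))
names(influential)   # car models flagged as influential
```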

The *Residuals vs Leverage* plot can help us to find influential observations if any. On this plot, outlying values are generally located at the upper right corner or at the lower right corner. Those spots are the places where data points can be influential against a regression line.

The following plots illustrate the Cook’s distance and the leverage of our model:

```
# Cook's distance
plot(model, 4)
# Residuals vs Leverage
plot(model, 5)
```

By default, the top 3 most extreme values are labelled on the Cook’s distance plot. If you want to label the top 5 extreme values, specify the option `id.n` as follows:

`plot(model, 4, id.n = 5)`

If you want to look at the top 3 observations with the highest Cook’s distance, in order to assess them further, type this R code:

```
model.diag.metrics %>%
  top_n(3, wt = .cooksd)
```

```
## index sales youtube .fitted .resid .hat .cooksd .std.resid
## 1 26 14.4 315 23.4 -9.04 0.0142 0.0389 -2.33
## 2 36 15.4 349 25.0 -9.66 0.0191 0.0605 -2.49
## 3 179 14.2 332 24.2 -10.06 0.0165 0.0563 -2.59
```

When data points have high Cook’s distance scores and lie in the upper or lower right of the leverage plot, they have leverage, meaning they are influential to the regression results. The regression results would be altered if we excluded those cases.

In our example, the data don’t present any influential points: the Cook’s distance lines (red dashed lines) do not even appear on the Residuals vs Leverage plot because all points are well inside of them.

Let’s now consider another example, where the data contain two extreme values with potential influence on the regression results:

```
df2 <- data.frame(
  x = c(marketing$youtube, 500, 600),
  y = c(marketing$sales, 80, 100)
)
model2 <- lm(y ~ x, df2)
```

Create the Cook’s distance and *Residuals vs Leverage* plots of this new model:

```
# Cook's distance
plot(model2, 4)
# Residuals vs Leverage
plot(model2, 5)
```

On the Residuals vs Leverage plot, look for data points outside of the dashed lines (Cook’s distance). Points outside these lines have high Cook’s distance scores and are influential to the regression results, which would be altered if we excluded those cases.

In this second example, two data points are far beyond the Cook’s distance lines, while the other residuals appear clustered on the left. The plot identifies the influential observations as #201 and #202. If you exclude these points from the analysis, the slope coefficient changes from 0.06 to 0.04 and R2 from 0.5 to 0.6. Pretty big impact!

This chapter describes linear regression assumptions and shows how to diagnose potential problems in the model.

The diagnosis is essentially performed by visualizing the residuals. Finding patterns in the residuals is not a stop signal; it is a sign that your current regression model might not be the best way to understand your data.

Potential problems might be:

A non-linear relationship between the outcome and the predictor variables. When facing this problem, one solution is to include non-linear terms, such as polynomial terms or a log transformation. See Chapter @ref(polynomial-and-spline-regression).

Existence of important variables that you left out from your model. Other variables you didn’t include (e.g., age or gender) may play an important role in your model and data. See Chapter @ref(confounding-variables).

Presence of outliers. If you believe that an outlier has occurred due to an error in data collection and entry, then one solution is to simply remove the concerned observation.

Bruce, Peter, and Andrew Bruce. 2017. *Practical Statistics for Data Scientists*. O’Reilly Media.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014. *An Introduction to Statistical Learning: With Applications in R*. Springer Publishing Company, Incorporated.

In multiple regression (Chapter @ref(linear-regression)), two or more predictor variables might be correlated with each other. This situation is referred to as *collinearity*.

There is an extreme situation, called **multicollinearity**, where collinearity exists between three or more variables even if no pair of variables has a particularly high correlation. This means that there is redundancy between predictor variables.

In the presence of multicollinearity, the solution of the regression model becomes unstable.

For a given predictor (p), multicollinearity can be assessed by computing a score called the **variance inflation factor** (or **VIF**), which measures how much the variance of a regression coefficient is inflated due to multicollinearity in the model.

The smallest possible value of VIF is one (absence of multicollinearity). As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity (James et al. 2014).
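The VIF definition can be made concrete in base R: regress each predictor on all the others and compute 1/(1 - R2). The sketch below uses the built-in `mtcars` data (for a linear model, `car::vif()` should agree with this hand computation):

```r
# VIF of one predictor by hand: regress it on the remaining predictors
# (here, predictors of a hypothetical mpg ~ disp + hp + wt model)
fit_disp <- lm(disp ~ hp + wt, data = mtcars)
vif_disp <- 1 / (1 - summary(fit_disp)$r.squared)
vif_disp   # a value above 5 or 10 would indicate problematic collinearity
```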

When faced with multicollinearity, the concerned variables should be removed, since the presence of multicollinearity implies that the information these variables provide about the response is redundant in the presence of the other variables (James et al. 2014, P. Bruce and Bruce (2017)).

This chapter describes how to detect multicollinearity in a regression model using R.

Contents:

- `tidyverse` for easy data manipulation and visualization
- `caret` for easy machine learning workflow

```
library(tidyverse)
library(caret)
```

We’ll use the `Boston` data set [`MASS` package], introduced in Chapter @ref(regression-analysis), for predicting the median house value (`medv`) in Boston suburbs, based on multiple predictor variables.

We’ll randomly split the data into a training set (80%, for building a predictive model) and a test set (20%, for evaluating the model). Make sure to set the seed for reproducibility.

```
# Load the data
data("Boston", package = "MASS")
# Split the data into training and test set
set.seed(123)
training.samples <- Boston$medv %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]
```

The following regression model includes all predictor variables:

```
# Build the model
model1 <- lm(medv ~., data = train.data)
# Make predictions
predictions <- model1 %>% predict(test.data)
# Model performance
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  R2 = R2(predictions, test.data$medv)
)
```

```
## RMSE R2
## 1 4.99 0.67
```

The R function `vif()` [car package] can be used to detect multicollinearity in a regression model:

`car::vif(model1)`

```
## crim zn indus chas nox rm age dis rad
## 1.87 2.36 3.90 1.06 4.47 2.01 3.02 3.96 7.80
## tax ptratio black lstat
## 9.16 1.91 1.31 2.97
```

In our example, the VIF score for the predictor variable `tax` is very high (VIF = 9.16). This might be problematic.

In this section, we’ll update our model by removing the predictor variable with a high VIF value:

```
# Build a model excluding the tax variable
model2 <- lm(medv ~. -tax, data = train.data)
# Make predictions
predictions <- model2 %>% predict(test.data)
# Model performance
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  R2 = R2(predictions, test.data$medv)
)
```

```
## RMSE R2
## 1 5.01 0.671
```

It can be seen that removing the `tax` variable does not affect the model performance metrics very much.

This chapter describes how to detect and deal with multicollinearity in regression models. Multicollinearity problems arise when the model includes multiple variables that have a similar predictive relationship with the outcome. This can be assessed for each predictor by computing the VIF value.

Any variable with a high VIF value (above 5 or 10) should be removed from the model. This leads to a simpler model without compromising the model accuracy, which is good.

Note that, in a large data set presenting multiple correlated predictor variables, you can perform principal component regression and partial least squares regression strategies. See Chapter @ref(pcr-and-pls-regression).

Bruce, Peter, and Andrew Bruce. 2017. *Practical Statistics for Data Scientists*. O’Reilly Media.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014. *An Introduction to Statistical Learning: With Applications in R*. Springer Publishing Company, Incorporated.

A **confounding variable** is an important variable that should be included in the predictive model but that you have omitted. Naive interpretation of such models can lead to invalid conclusions.

For example, consider that we want to model life expectancy in different countries based on GDP per capita, using the `gapminder` data set:

```
library(gapminder)
lm(lifeExp ~ gdpPercap, data = gapminder)
```

In this example, it is clear that the continent is an important variable: countries in Europe are estimated to have a higher life expectancy compared to countries in Africa. Therefore, continent is a confounding variable that should be included in the model:

`lm(lifeExp ~ gdpPercap + continent, data = gapminder)`

In this chapter we’ll describe different statistical regression **metrics** for measuring the performance of a regression model (Chapter @ref(linear-regression)).

Next, we’ll provide practical examples in R for comparing the performance of two models in order to select the best one for our data.

Contents:

In regression models, the most commonly known evaluation metrics include:

- **R-squared** (R2), which is the proportion of variation in the outcome that is explained by the predictor variables. In multiple regression models, R2 corresponds to the squared correlation between the observed outcome values and the values predicted by the model. The higher the R-squared, the better the model.
- **Root Mean Squared Error** (RMSE), which measures the average error made by the model in predicting the outcome for an observation. Mathematically, the RMSE is the square root of the *mean squared error (MSE)*, which is the average squared difference between the observed actual outcome values and the values predicted by the model. So, `MSE = mean((observeds - predicteds)^2)` and `RMSE = sqrt(MSE)`. The lower the RMSE, the better the model.
- **Residual Standard Error** (RSE), also known as the *model sigma*, which is a variant of the RMSE adjusted for the number of predictors in the model. The lower the RSE, the better the model. In practice, the difference between RMSE and RSE is very small, particularly for large multivariate data.
- **Mean Absolute Error** (MAE), which, like the RMSE, measures the prediction error. Mathematically, it is the average absolute difference between observed and predicted outcomes: `MAE = mean(abs(observeds - predicteds))`. MAE is less sensitive to outliers than the RMSE.

The problem with the above metrics is that they are sensitive to the inclusion of additional variables in the model, even if those variables don’t contribute significantly to explaining the outcome. In other words, including additional variables in the model will always increase the R2 and reduce the RMSE on the training data. So, we need a more robust metric to guide the model choice.

Concerning R2, there is an adjusted version, called **Adjusted R-squared**, which adjusts the R2 for having too many variables in the model.
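The adjustment uses the standard formula `adj.R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)`, where n is the number of observations and p the number of predictors. A quick check in base R on the built-in `swiss` data:

```
# Adjusted R2 penalizes R2 for the number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

fit <- lm(Fertility ~ ., data = swiss)  # n = 47 observations, p = 5 predictors
adj_r2(summary(fit)$r.squared, n = nrow(swiss), p = 5)
# agrees with summary(fit)$adj.r.squared
```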

Additionally, there are four other important metrics - **AIC**, **AICc**, **BIC** and **Mallows Cp** - that are commonly used for model evaluation and selection. These are unbiased estimates of the model prediction error MSE. The lower these metrics, the better the model.

- **AIC** stands for *Akaike’s Information Criterion*, a metric developed by the Japanese statistician Hirotugu Akaike in the early 1970s. The basic idea of AIC is to penalize the inclusion of additional variables in a model. It adds a penalty that increases the error when additional terms are included. The lower the AIC, the better the model.
- **AICc** is a version of AIC corrected for small sample sizes.
- **BIC** (or *Bayesian Information Criterion*) is a variant of AIC with a stronger penalty for including additional variables in the model.
- **Mallows Cp**: a variant of AIC developed by Colin Mallows.
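For a linear model, R computes AIC and BIC from the log-likelihood as `-2*logLik + penalty*k`, where k is the number of estimated parameters and the penalty is 2 for AIC and log(n) for BIC. A quick check in base R on the built-in `swiss` data:

```
fit <- lm(Fertility ~ ., data = swiss)
ll <- as.numeric(logLik(fit))
k <- attr(logLik(fit), "df")  # parameters: 5 slopes + intercept + sigma = 7
c(AIC = -2 * ll + 2 * k,
  BIC = -2 * ll + log(nrow(swiss)) * k)
# agrees with AIC(fit) and BIC(fit)
```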

Generally, the most commonly used metrics, for measuring regression model quality and for comparing models, are: Adjusted R2, AIC, BIC and Cp.

In the following sections, we’ll show you how to compute these above-mentioned metrics.

- `tidyverse` for data manipulation and visualization
- `modelr` provides helper functions for computing regression model performance metrics
- `broom` easily creates a tidy data frame containing the model’s statistical metrics

```
library(tidyverse)
library(modelr)
library(broom)
```

We’ll use the built-in R `swiss` data set, introduced in Chapter @ref(regression-analysis), for predicting fertility score on the basis of socio-economic indicators.

```
# Load the data
data("swiss")
# Inspect the data
sample_n(swiss, 3)
```

We start by creating two models:

- Model 1, including all predictors
- Model 2, including all predictors except the variable Examination

```
model1 <- lm(Fertility ~., data = swiss)
model2 <- lm(Fertility ~. -Examination, data = swiss)
```

There are many R functions and packages for assessing model quality, including:

- `summary()` [stats package], which returns the R-squared, adjusted R-squared and the RSE
- `AIC()` and `BIC()` [stats package], which compute the AIC and the BIC, respectively

```
summary(model1)
AIC(model1)
BIC(model1)
```

- `rsquare()`, `rmse()` and `mae()` [modelr package], which compute, respectively, the R2, the RMSE and the MAE.

```
library(modelr)
data.frame(
  R2 = rsquare(model1, data = swiss),
  RMSE = rmse(model1, data = swiss),
  MAE = mae(model1, data = swiss)
)
```

- `R2()`, `RMSE()` and `MAE()` [caret package], which compute, respectively, the R2, the RMSE and the MAE.

```
library(caret)
predictions <- model1 %>% predict(swiss)
data.frame(
  R2 = R2(predictions, swiss$Fertility),
  RMSE = RMSE(predictions, swiss$Fertility),
  MAE = MAE(predictions, swiss$Fertility)
)
```

- `glance()` [broom package], which computes the R2, adjusted R2, sigma (RSE), AIC and BIC.

```
library(broom)
glance(model1)
```

- Manual computation of R2, RMSE and MAE:

```
# Make predictions and compute the
# R2, RMSE and MAE
swiss %>%
  add_predictions(model1) %>%
  summarise(
    R2 = cor(Fertility, pred)^2,
    MSE = mean((Fertility - pred)^2),
    RMSE = sqrt(MSE),
    MAE = mean(abs(Fertility - pred))
  )
```

Here, we’ll use the function `glance()` to simply compare the overall quality of our two models:

```
# Metrics for model 1
glance(model1) %>%
  dplyr::select(adj.r.squared, sigma, AIC, BIC, p.value)
```

```
## adj.r.squared sigma AIC BIC p.value
## 1 0.671 7.17 326 339 5.59e-10
```

```
# Metrics for model 2
glance(model2) %>%
  dplyr::select(adj.r.squared, sigma, AIC, BIC, p.value)
```

```
## adj.r.squared sigma AIC BIC p.value
## 1 0.671 7.17 325 336 1.72e-10
```

From the output above, it can be seen that:

- The two models have exactly the same **adjusted R2** (0.67), meaning that they are equivalent in explaining the outcome, here fertility score. Additionally, they have the same amount of **residual standard error** (RSE or sigma = 7.17). However, model 2 is simpler than model 1 because it incorporates fewer variables. All things being equal, the simpler model is always preferred in statistics.
- The AIC and the BIC of model 2 are lower than those of model 1. In model comparison strategies, the model with the lowest AIC and BIC scores is preferred.

Finally, the F-statistic p-value of model 2 is lower than that of model 1. This means that model 2 is statistically more significant than model 1, which is consistent with the conclusion above.

Note that, the RMSE and the RSE are measured in the same scale as the outcome variable. Dividing the RSE by the average value of the outcome variable will give you the prediction error rate, which should be as small as possible:

`sigma(model1)/mean(swiss$Fertility)`

`## [1] 0.102`

In our example the average prediction error rate is 10%.

This chapter describes several metrics for assessing the overall performance of a regression model.

The most important metrics are the Adjusted R-square, RMSE, AIC and the BIC. These metrics are also used as the basis of model comparison and optimal model selection.

Note that these regression metrics are all internal measures; that is, they have been computed on the same data that was used to build the regression model. They tell you how well the model fits the data at hand, called the training data set.

In general, we do not really care how well the method works on the training data. Rather, we are interested in the accuracy of the predictions that we obtain when we apply our method to previously unseen test data.

However, the test data is not always available, making the test error difficult to estimate. In this situation, methods such as **cross-validation** (Chapter @ref(cross-validation)) and the **bootstrap** (Chapter @ref(bootstrap-resampling)) are applied to estimate the test error (or the prediction error rate) using the training data.

**Cross-validation** refers to a set of methods for measuring the performance of a given predictive model on new test data sets.

The basic idea behind cross-validation techniques consists of dividing the data into two sets:

- The training set, used to train (i.e. build) the model;
- and the testing set (or validation set), used to test (i.e. validate) the model by estimating the prediction error.

Cross-validation is also known as a *resampling method* because it involves fitting the same statistical method multiple times using different subsets of the data.

In this chapter, you’ll learn:

- The most commonly used statistical metrics (Chapter @ref(regression-model-accuracy-metrics)) for measuring the performance of a regression model in predicting the outcome of new test data.
- The different cross-validation methods for assessing model performance. We cover the following approaches:
  - Validation set approach (or data split)
  - Leave One Out Cross Validation
  - k-fold Cross Validation
  - Repeated k-fold Cross Validation
- Practical examples of R code for computing cross-validation methods.

Each of these methods has its advantages and drawbacks. Use the method that best suits your problem. Generally, (repeated) k-fold cross-validation is recommended.

Contents:

- `tidyverse` for easy data manipulation and visualization
- `caret` for easily computing cross-validation methods

```
library(tidyverse)
library(caret)
```

We’ll use the built-in R `swiss` data set, introduced in Chapter @ref(regression-analysis), for predicting fertility score on the basis of socio-economic indicators.

```
# Load the data
data("swiss")
# Inspect the data
sample_n(swiss, 3)
```

After building a model, we are interested in determining the accuracy of this model in predicting the outcome for new, unseen observations not used to build the model. In other words, we want to estimate the prediction error.

To do so, the basic strategy is to:

- Build the model on a training data set
- Apply the model on a new test data set to make predictions
- Compute the prediction errors

In Chapter @ref(regression-model-accuracy-metrics), we described several statistical metrics for quantifying the overall quality of regression models. These include:

- **R-squared** (R2), representing the squared correlation between the observed outcome values and the values predicted by the model. The higher the R2, the better the model.
- **Root Mean Squared Error** (RMSE), which measures the average prediction error made by the model in predicting the outcome for an observation. That is, the average difference between the observed known outcome values and the values predicted by the model. The lower the RMSE, the better the model.
- **Mean Absolute Error** (MAE), an alternative to the RMSE that is less sensitive to outliers. It corresponds to the average absolute difference between observed and predicted outcomes. The lower the MAE, the better the model.

In the classification setting, the prediction error rate is estimated as the proportion of misclassified observations.

R2, RMSE and MAE are used to measure the regression model performance during **cross-validation**.

In the following section, we’ll explain the basics of cross-validation, and we’ll provide practical examples using mainly the `caret` R package.

Briefly, cross-validation algorithms can be summarized as follows:

- Reserve a small sample of the data set
- Build (or train) the model using the remaining part of the data set
- Test the effectiveness of the model on the reserved sample of the data set. If the model works well on the test data set, then it’s good.

The following sections describe the different cross-validation techniques.

The validation set approach consists of randomly splitting the data into two sets: one set is used to train the model and the other set is used to test the model.

The process works as follows:

- Build (train) the model on the training data set
- Apply the model to the test data set to predict the outcome of new unseen observations
- Quantify the prediction error as the mean squared difference between the observed and the predicted outcome values.
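The split itself can be sketched in plain base R. Note that `createDataPartition()` from caret, used in the example that follows, additionally stratifies the split on the outcome, whereas plain `sample()` is purely random:

```
# A plain base-R 80/20 split of the built-in swiss data (a sketch;
# caret's createDataPartition() also stratifies on the outcome)
set.seed(123)
n <- nrow(swiss)
train_idx <- sample(n, size = floor(0.8 * n))  # random row indices
train.data <- swiss[train_idx, ]
test.data <- swiss[-train_idx, ]
c(train = nrow(train.data), test = nrow(test.data))
```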

The example below splits the `swiss` data set so that 80% is used for training a linear regression model and 20% is used to evaluate the model performance.

```
# Split the data into training and test set
set.seed(123)
training.samples <- swiss$Fertility %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- swiss[training.samples, ]
test.data <- swiss[-training.samples, ]
# Build the model
model <- lm(Fertility ~., data = train.data)
# Make predictions and compute the R2, RMSE and MAE
predictions <- model %>% predict(test.data)
data.frame(
  R2 = R2(predictions, test.data$Fertility),
  RMSE = RMSE(predictions, test.data$Fertility),
  MAE = MAE(predictions, test.data$Fertility)
)
```

```
## R2 RMSE MAE
## 1 0.39 9.11 7.48
```

When comparing two models, the one that produces the lowest test sample RMSE is the preferred model.

Note that the RMSE and the MAE are measured in the same scale as the outcome variable. Dividing the RMSE by the average value of the outcome variable will give you the prediction error rate, which should be as small as possible:

`RMSE(predictions, test.data$Fertility)/mean(test.data$Fertility)`

`## [1] 0.128`

Note that the validation set method is only useful when you have a large data set that can be partitioned. A disadvantage is that we build a model on only a fraction of the data set, possibly leaving out some interesting information about the data and leading to higher bias. Therefore, the test error rate can be highly variable, depending on which observations are included in the training set and which are included in the validation set.

The leave-one-out cross-validation (LOOCV) method works as follows:

- Leave out one data point and build the model on the rest of the data set
- Test the model against the data point that is left out at step 1 and record the test error associated with the prediction
- Repeat the process for all data points
- Compute the overall prediction error by taking the average of all these test error estimates recorded at step 2.
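The steps above can be written directly in base R as a sketch, before letting caret automate them:

```
# Manual LOOCV for a linear model on the built-in swiss data (a sketch;
# caret's trainControl(method = "LOOCV") automates this)
errors <- sapply(seq_len(nrow(swiss)), function(i) {
  fit <- lm(Fertility ~ ., data = swiss[-i, ])             # leave point i out
  swiss$Fertility[i] - predict(fit, newdata = swiss[i, ])  # held-out error
})
sqrt(mean(errors^2))  # LOOCV RMSE
```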

A practical example in R using the `caret` package:

```
# Define training control
train.control <- trainControl(method = "LOOCV")
# Train the model
model <- train(Fertility ~., data = swiss, method = "lm",
               trControl = train.control)
# Summarize the results
print(model)
```

```
## Linear Regression
##
## 47 samples
## 5 predictor
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 46, 46, 46, 46, 46, 46, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 7.74 0.613 6.12
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
```

The advantage of the LOOCV method is that we make use of all data points, reducing potential bias.

However, the process is repeated as many times as there are data points, resulting in a longer execution time when n is extremely large.

Additionally, we test the model performance against one data point at each iteration. This might result in higher variation in the prediction error if some data points are outliers. So, we need a good ratio of testing data points, a solution provided by the **k-fold cross-validation method**.

The k-fold cross-validation method evaluates the model performance on different subsets of the training data and then calculates the average prediction error rate. The algorithm is as follows:

- Randomly split the data set into k subsets (or k folds) (for example, 5 subsets)
- Reserve one subset and train the model on all the other subsets
- Test the model on the reserved subset and record the prediction error
- Repeat this process until each of the k subsets has served as the test set
- Compute the average of the k recorded errors. This is called the cross-validation error, and it serves as the performance metric for the model.
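As with LOOCV, the algorithm can be sketched in a few lines of base R (caret's `trainControl(method = "cv")` used further below automates this, so fold assignments and results will differ slightly):

```
# Manual 5-fold cross-validation sketch on the built-in swiss data
set.seed(123)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(swiss)))  # random fold labels
fold_rmse <- sapply(1:k, function(f) {
  fit <- lm(Fertility ~ ., data = swiss[folds != f, ])  # train on k-1 folds
  pred <- predict(fit, newdata = swiss[folds == f, ])   # test on held-out fold
  sqrt(mean((swiss$Fertility[folds == f] - pred)^2))
})
mean(fold_rmse)  # the cross-validation error
```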

K-fold cross-validation (CV) is a robust method for estimating the accuracy of a model.

The most obvious advantage of k-fold CV compared to LOOCV is computational. A less obvious but potentially more important advantage of k-fold CV is that it often gives more accurate estimates of the test error rate than does LOOCV (James et al. 2014).

A typical question is: how do you choose the right value of k?

A lower value of k is more biased and hence undesirable. On the other hand, a higher value of k is less biased but can suffer from large variability. It is not hard to see that a smaller value of k (say k = 2) takes us towards the validation set approach, whereas a higher value of k (say k = the number of data points) leads us to the LOOCV approach.

In practice, one typically performs k-fold cross-validation using k = 5 or k = 10, as these values have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.

The following example uses 10-fold cross-validation to estimate the prediction error. Make sure to set the seed for reproducibility.

```
# Define training control
set.seed(123)
train.control <- trainControl(method = "cv", number = 10)
# Train the model
model <- train(Fertility ~., data = swiss, method = "lm",
               trControl = train.control)
# Summarize the results
print(model)
```

```
## Linear Regression
##
## 47 samples
## 5 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 43, 42, 42, 41, 43, 41, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 7.38 0.751 6.03
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
```

The process of splitting the data into k folds can be repeated a number of times; this is called repeated k-fold cross-validation.

The final model error is taken as the mean error from the number of repeats.

The following example uses 10-fold cross validation with 3 repeats:

```
# Define training control
set.seed(123)
train.control <- trainControl(method = "repeatedcv",
                              number = 10, repeats = 3)
# Train the model
model <- train(Fertility ~., data = swiss, method = "lm",
               trControl = train.control)
# Summarize the results
print(model)
```

In this chapter, we described 4 different methods for assessing the performance of a model on unseen test data.

These methods include: validation set approach, leave-one-out cross-validation, k-fold cross-validation and repeated k-fold cross-validation.

We generally recommend the (repeated) k-fold cross-validation to estimate the prediction error rate. It can be used in regression and classification settings.

Another alternative to cross-validation is the bootstrap resampling method (Chapter @ref(bootstrap-resampling)), which consists of repeatedly and randomly selecting a sample of n observations from the original data set, and evaluating the model performance on each copy.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014. *An Introduction to Statistical Learning: With Applications in R*. Springer Publishing Company, Incorporated.

Similarly to cross-validation techniques (Chapter @ref(cross-validation)), the **bootstrap resampling** method can be used to measure the accuracy of a predictive model. Additionally, it can be used to measure the uncertainty associated with any statistical estimator.

Bootstrap resampling consists of repeatedly selecting a sample of n observations from the original data set and evaluating the model on each copy. An average standard error is then calculated, and the results provide an indication of the overall variance of the model performance.

This chapter describes the basics of bootstrapping and provides practical examples in R for computing a model’s prediction error. Additionally, we’ll show you how to quantify an estimator’s uncertainty using bootstrap techniques.

Contents:

- `tidyverse` for easy data manipulation and visualization
- `caret` for easily computing cross-validation methods

```
library(tidyverse)
library(caret)
```

We’ll use the built-in R `swiss` data set, introduced in Chapter @ref(regression-analysis), for predicting fertility score on the basis of socio-economic indicators.

```
# Load the data
data("swiss")
# Inspect the data
sample_n(swiss, 3)
```

The bootstrap method is used to quantify the uncertainty associated with a given statistical estimator or with a predictive model.

It consists of randomly selecting a sample of n observations from the original data set. This subset, called the bootstrap data set, is then used to evaluate the model.

This procedure is repeated a large number of times, and the standard error of the bootstrap estimate is then calculated. The results provide an indication of the variance of the model’s performance.

Note that, the sampling is performed with replacement, which means that the same observation can occur more than once in the bootstrap data set.
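A minimal base-R sketch of one such resample makes the "with replacement" part concrete:

```
# One bootstrap resample of the built-in swiss data: draw n row indices
# with replacement, so some rows repeat and others are left out
set.seed(123)
n <- nrow(swiss)
idx <- sample(n, size = n, replace = TRUE)
boot.data <- swiss[idx, ]
table(table(idx))  # how many rows appear once, twice, three times, ...
```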

The following example uses a bootstrap with 100 resamples to test a linear regression model:

```
# Define training control
train.control <- trainControl(method = "boot", number = 100)
# Train the model
model <- train(Fertility ~., data = swiss, method = "lm",
               trControl = train.control)
# Summarize the results
print(model)
```

```
## Linear Regression
##
## 47 samples
## 5 predictor
##
## No pre-processing
## Resampling: Bootstrapped (100 reps)
## Summary of sample sizes: 47, 47, 47, 47, 47, 47, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 8.4 0.597 6.76
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
```

The output shows the average model performance across the 100 resamples.

RMSE (Root Mean Squared Error) and MAE (Mean Absolute Error) represent two different measures of the model prediction error. The lower the RMSE and the MAE, the better the model. The R-squared represents the proportion of variation in the outcome explained by the predictor variables included in the model. The higher the R-squared, the better the model. Read more about these metrics in Chapter @ref(regression-model-accuracy-metrics).

The bootstrap approach can be used to quantify the uncertainty (or standard error) associated with any given statistical estimator.
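As a warm-up before bootstrapping regression coefficients, here is a minimal base-R sketch of the bootstrap standard error of a simple estimator, the sample mean of `Fertility` (the `boot` package generalizes this same pattern):

```
# Bootstrap SE of the mean: resample with replacement 500 times,
# recompute the mean, and take the SD of the 500 estimates
set.seed(123)
boot_means <- replicate(500, mean(sample(swiss$Fertility, replace = TRUE)))
sd(boot_means)  # close to the classical SE, sd(x)/sqrt(n)
```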

For example, you might want to estimate the accuracy of the linear regression beta coefficients using the bootstrap method.

The different steps are as follows:

- Create a simple function, `model_coef()`, that takes the `swiss` data set as well as the indices for the observations, and returns the regression coefficients.
- Apply the function `model_coef()` to the full data set of 47 observations in order to compute the coefficients.

We start by creating a function that returns the regression model coefficients:

```
model_coef <- function(data, index){
  coef(lm(Fertility ~., data = data, subset = index))
}
model_coef(swiss, 1:47)
```

```
## (Intercept) Agriculture Examination Education
## 66.915 -0.172 -0.258 -0.871
## Catholic Infant.Mortality
## 0.104 1.077
```

Next, we use the `boot()` function [boot package] to compute the standard errors of 500 bootstrap estimates for the coefficients:

```
library(boot)
boot(swiss, model_coef, 500)
```

```
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = swiss, statistic = model_coef, R = 500)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 66.915 -2.04e-01 10.9174
## t2* -0.172 -5.62e-03 0.0639
## t3* -0.258 -2.27e-02 0.2524
## t4* -0.871 3.89e-05 0.2203
## t5* 0.104 -7.77e-04 0.0319
## t6* 1.077 4.45e-02 0.4478
```

In the output above:

- The `original` column corresponds to the regression coefficients, and the associated standard errors are given in the `std. error` column.
- t1 corresponds to the intercept, t2 corresponds to `Agriculture`, and so on.

For example, it can be seen that the standard error (SE) of the regression coefficient associated with `Agriculture` is 0.06.

Note that the standard errors measure the variability/accuracy of the beta coefficients. They can be used to compute the confidence intervals of the coefficients.

For example, the 95% confidence interval for a given coefficient b is defined as `b +/- 2*SE(b)`, where:

- the lower limit of b = `b - 2*SE(b) = -0.172 - (2*0.0639) = -0.300` (for the Agriculture variable)
- the upper limit of b = `b + 2*SE(b) = -0.172 + (2*0.0639) = -0.044` (for the Agriculture variable)

That is, there is approximately a 95% chance that the interval [-0.300, -0.044] will contain the true value of the coefficient.

Using the standard `lm()` function gives slightly different standard errors, because the linear model makes some assumptions about the data:

`summary(lm(Fertility ~., data = swiss))$coef`

```
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.915 10.7060 6.25 1.91e-07
## Agriculture -0.172 0.0703 -2.45 1.87e-02
## Examination -0.258 0.2539 -1.02 3.15e-01
## Education -0.871 0.1830 -4.76 2.43e-05
## Catholic 0.104 0.0353 2.95 5.19e-03
## Infant.Mortality 1.077 0.3817 2.82 7.34e-03
```

The bootstrap approach does not rely on any of these assumptions made by the linear model, so it is likely to give a more accurate estimate of the coefficients’ standard errors than the `summary()` function.

This chapter describes the bootstrap resampling method for evaluating a predictive model’s accuracy, as well as for measuring the uncertainty associated with a given statistical estimator.

An alternative approach to bootstrapping for evaluating a predictive model’s performance is cross-validation (Chapter @ref(cross-validation)).