In this chapter we’ll describe different statistical regression **metrics** for measuring the performance of a regression model (Chapter @ref(linear-regression)).

Next, we’ll provide practical examples in R for comparing the performance of two models in order to select the best one for our data.

In regression models, the most commonly used evaluation metrics include:

- **R-squared** (R2), which is the proportion of variation in the outcome that is explained by the predictor variables. In multiple regression models, R2 corresponds to the squared correlation between the observed outcome values and the values predicted by the model. The higher the R-squared, the better the model.
- **Root Mean Squared Error** (RMSE), which measures the average error made by the model in predicting the outcome for an observation. Mathematically, the RMSE is the square root of the *mean squared error (MSE)*, which is the average squared difference between the observed outcome values and the values predicted by the model. So, `MSE = mean((observeds - predicteds)^2)` and `RMSE = sqrt(MSE)`. The lower the RMSE, the better the model.
- **Residual Standard Error** (RSE), also known as the *model sigma*, which is a variant of the RMSE adjusted for the number of predictors in the model. The lower the RSE, the better the model. In practice, the difference between RMSE and RSE is very small, particularly for large multivariate data.
- **Mean Absolute Error** (MAE), which, like the RMSE, measures the prediction error. Mathematically, it is the average absolute difference between observed and predicted outcomes: `MAE = mean(abs(observeds - predicteds))`. The MAE is less sensitive to outliers than the RMSE.
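As a quick illustration of these formulas, here is a minimal sketch using made-up observed and predicted values (not output from any fitted model):

```
# Hypothetical observed and predicted values, just to illustrate the formulas
observeds  <- c(10, 12, 9, 15)
predicteds <- c(11, 13, 7, 14)

MSE  <- mean((observeds - predicteds)^2)   # average squared difference
RMSE <- sqrt(MSE)                          # square root of the MSE
MAE  <- mean(abs(observeds - predicteds))  # average absolute difference
c(MSE = MSE, RMSE = RMSE, MAE = MAE)
```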

The problem with the above metrics is that they are sensitive to the inclusion of additional variables in the model, even if those variables do not contribute significantly to explaining the outcome. Put another way, including additional variables in the model will always increase the R2 and reduce the RMSE. So we need a more robust metric to guide the model choice.

Concerning R2, there is an adjusted version, called **Adjusted R-squared**, which penalizes the R2 for the number of variables in the model.
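A minimal sketch of the adjustment, using the standard formula; the R2, n and p values below are illustrative assumptions, not output from a fitted model:

```
# Adjusted R2 penalizes R2 for the number of predictors p,
# given the sample size n:
#   adj.R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Illustrative values: R2 = 0.707, n = 47 observations, p = 5 predictors
adj_r2(0.707, n = 47, p = 5)   # about 0.671
```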

Additionally, there are four other important metrics - **AIC**, **AICc**, **BIC** and **Mallows Cp** - that are commonly used for model evaluation and selection. They estimate the model prediction error (MSE). The lower these metrics, the better the model.

- **AIC** stands for *Akaike's Information Criterion*, a metric developed by the Japanese statistician Hirotugu Akaike in 1970. The basic idea of AIC is to penalize the inclusion of additional variables in a model: it adds a penalty that increases the error when additional terms are included. The lower the AIC, the better the model.
- **AICc** is a version of AIC corrected for small sample sizes.
- **BIC** (or *Bayesian Information Criterion*) is a variant of AIC with a stronger penalty for including additional variables in the model.
- **Mallows Cp**: a variant of AIC developed by Colin Mallows.
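As a sanity check on these definitions, AIC and BIC can be recomputed by hand from the model log-likelihood and compared with R's built-in `AIC()` and `BIC()`; a sketch using the `swiss` data:

```
# AIC and BIC from their definitions, checked against R's built-ins:
#   AIC = 2*k - 2*logLik,  BIC = log(n)*k - 2*logLik
# where k is the number of estimated parameters (coefficients + error variance)
data("swiss")
fit <- lm(Fertility ~ ., data = swiss)
ll  <- as.numeric(logLik(fit))
k   <- attr(logLik(fit), "df")   # 7 here: 6 coefficients + sigma
n   <- nobs(fit)                 # 47

manual_AIC <- 2 * k - 2 * ll
manual_BIC <- log(n) * k - 2 * ll

all.equal(manual_AIC, AIC(fit))   # TRUE
all.equal(manual_BIC, BIC(fit))   # TRUE
```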

Generally, the most commonly used metrics for measuring regression model quality and for comparing models are: adjusted R2, AIC, BIC and Cp.

In the following sections, we’ll show you how to compute these metrics.

We’ll use the following R packages:

- `tidyverse` for data manipulation and visualization
- `modelr`, which provides helper functions for computing regression model performance metrics
- `broom`, which easily creates a tidy data frame containing the model’s statistical metrics

```
library(tidyverse)
library(modelr)
library(broom)
```

We’ll use the built-in R `swiss` data set, introduced in Chapter @ref(regression-analysis), for predicting fertility scores on the basis of socio-economic indicators.

```
# Load the data
data("swiss")
# Inspect the data
sample_n(swiss, 3)
```

We start by creating two models:

- Model 1, including all predictors
- Model 2, including all predictors except the variable Examination

```
model1 <- lm(Fertility ~., data = swiss)
model2 <- lm(Fertility ~. -Examination, data = swiss)
```

There are many R functions and packages for assessing model quality, including:

- `summary()` [stats package], returns the R-squared, adjusted R-squared and the RSE
- `AIC()` and `BIC()` [stats package], compute the AIC and the BIC, respectively

```
summary(model1)
AIC(model1)
BIC(model1)
```

- `rsquare()`, `rmse()` and `mae()` [modelr package], compute, respectively, the R2, the RMSE and the MAE

```
library(modelr)
data.frame(
R2 = rsquare(model1, data = swiss),
RMSE = rmse(model1, data = swiss),
MAE = mae(model1, data = swiss)
)
```

- `R2()`, `RMSE()` and `MAE()` [caret package], compute, respectively, the R2, the RMSE and the MAE

```
library(caret)
predictions <- model1 %>% predict(swiss)
data.frame(
R2 = R2(predictions, swiss$Fertility),
RMSE = RMSE(predictions, swiss$Fertility),
MAE = MAE(predictions, swiss$Fertility)
)
```

- `glance()` [broom package], computes the R2, the adjusted R2, sigma (RSE), AIC and BIC

```
library(broom)
glance(model1)
```

- Manual computation of R2, RMSE and MAE:

```
# Make predictions and compute the
# R2, RMSE and MAE
swiss %>%
add_predictions(model1) %>%
summarise(
R2 = cor(Fertility, pred)^2,
MSE = mean((Fertility - pred)^2),
RMSE = sqrt(MSE),
MAE = mean(abs(Fertility - pred))
)
```

Here, we’ll use the function `glance()` to compare the overall quality of our two models:

```
# Metrics for model 1
glance(model1) %>%
dplyr::select(adj.r.squared, sigma, AIC, BIC, p.value)
```

```
## adj.r.squared sigma AIC BIC p.value
## 1 0.671 7.17 326 339 5.59e-10
```

```
# Metrics for model 2
glance(model2) %>%
dplyr::select(adj.r.squared, sigma, AIC, BIC, p.value)
```

```
## adj.r.squared sigma AIC BIC p.value
## 1 0.671 7.17 325 336 1.72e-10
```

From the output above, it can be seen that:

- The two models have exactly the same **adjusted R2** (0.67), meaning that they are equivalent in explaining the outcome, here the fertility score. Additionally, they have the same **residual standard error** (RSE, or sigma = 7.17). However, model 2 is simpler than model 1 because it incorporates fewer variables. All else being equal, the simpler model is preferable in statistics.
- The AIC and the BIC of model 2 are lower than those of model 1. In model comparison strategies, the model with the lowest AIC and BIC is preferred.

- Finally, the F-statistic p-value of model 2 is lower than that of model 1. This means that model 2 is statistically more significant than model 1, which is consistent with the conclusion above.

Note that the RMSE and the RSE are measured in the same scale as the outcome variable. Dividing the RSE by the average value of the outcome variable will give you the prediction error rate, which should be as small as possible:

`sigma(model1)/mean(swiss$Fertility)`

`## [1] 0.102`

In our example the average prediction error rate is 10%.

This chapter describes several metrics for assessing the overall performance of a regression model.

The most important metrics are the Adjusted R-square, RMSE, AIC and the BIC. These metrics are also used as the basis of model comparison and optimal model selection.

Note that these regression metrics are all internal measures; that is, they are computed on the same data that was used to build the regression model. They tell you how well the model fits the data in hand, called the training data set.

In general, we do not really care how well the method works on the training data. Rather, we are interested in the accuracy of the predictions that we obtain when we apply our method to previously unseen test data.

However, the test data is not always available making the test error very difficult to estimate. In this situation, methods such as **cross-validation** (Chapter @ref(cross-validation)) and **bootstrap** (Chapter @ref(bootstrap-resampling)) are applied for estimating the test error (or the prediction error rate) using training data.

**Cross-validation** refers to a set of methods for measuring the performance of a given predictive model on new test data sets.

The basic idea behind cross-validation techniques consists of dividing the data into two sets:

- The training set, used to train (i.e. build) the model;
- and the testing set (or validation set), used to test (i.e. validate) the model by estimating the prediction error.

Cross-validation is also known as a *resampling method* because it involves fitting the same statistical method multiple times using different subsets of the data.

In this chapter, you’ll learn:

- The most commonly used statistical metrics (Chapter @ref(regression-model-accuracy-metrics)) for measuring the performance of a regression model in predicting the outcome of new test data.
- The different cross-validation methods for assessing model performance. We cover the following approaches:
    - Validation set approach (or data split)
    - Leave One Out Cross Validation
    - k-fold Cross Validation
    - Repeated k-fold Cross Validation

Each of these methods has its advantages and drawbacks. Use the method that best suits your problem. Generally, the (repeated) k-fold cross-validation is recommended.

- Practical examples of R codes for computing cross-validation methods.

We’ll use the following R packages:

- `tidyverse` for easy data manipulation and visualization
- `caret` for easily computing cross-validation methods

```
library(tidyverse)
library(caret)
```

We’ll use the built-in R `swiss` data set, introduced in Chapter @ref(regression-analysis), for predicting fertility scores on the basis of socio-economic indicators.

```
# Load the data
data("swiss")
# Inspect the data
sample_n(swiss, 3)
```

After building a model, we are interested in determining the accuracy of this model in predicting the outcome for new unseen observations not used to build the model. In other words, we want to estimate the prediction error.

To do so, the basic strategy is to:

- Build the model on a training data set
- Apply the model on a new test data set to make predictions
- Compute the prediction errors

In Chapter @ref(regression-model-accuracy-metrics), we described several statistical metrics for quantifying the overall quality of regression models. These include:

- **R-squared** (R2), representing the squared correlation between the observed outcome values and the values predicted by the model. The higher the R2, the better the model.
- **Root Mean Squared Error** (RMSE), which measures the average prediction error made by the model in predicting the outcome for an observation. That is, the average difference between the observed known outcome values and the values predicted by the model. The lower the RMSE, the better the model.
- **Mean Absolute Error** (MAE), an alternative to the RMSE that is less sensitive to outliers. It corresponds to the average absolute difference between observed and predicted outcomes. The lower the MAE, the better the model.

In a classification setting, the prediction error rate is estimated as the proportion of misclassified observations.

R2, RMSE and MAE are used to measure the regression model performance during **cross-validation**.

In the following section, we’ll explain the basics of cross-validation, and we’ll provide practical examples using mainly the `caret` R package.

Briefly, cross-validation algorithms can be summarized as follows:

- Reserve a small sample of the data set
- Build (or train) the model using the remaining part of the data set
- Test the effectiveness of the model on the reserved sample of the data set. If the model works well on the test data set, then it’s good.

The following sections describe the different cross-validation techniques.

The validation set approach consists of randomly splitting the data into two sets: one set is used to train the model and the other set is used to test the model.

The process works as follows:

- Build (train) the model on the training data set
- Apply the model to the test data set to predict the outcome of new unseen observations
- Quantify the prediction error as the mean squared difference between the observed and the predicted outcome values.

The example below splits the `swiss` data set so that 80% is used for training a linear regression model and 20% is used to evaluate the model performance.

```
# Split the data into training and test set
set.seed(123)
training.samples <- swiss$Fertility %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- swiss[training.samples, ]
test.data <- swiss[-training.samples, ]
# Build the model
model <- lm(Fertility ~., data = train.data)
# Make predictions and compute the R2, RMSE and MAE
predictions <- model %>% predict(test.data)
data.frame( R2 = R2(predictions, test.data$Fertility),
RMSE = RMSE(predictions, test.data$Fertility),
MAE = MAE(predictions, test.data$Fertility))
```

```
## R2 RMSE MAE
## 1 0.39 9.11 7.48
```

When comparing two models, the one that produces the lowest test sample RMSE is the preferred model.

Note that the RMSE and the MAE are measured in the same scale as the outcome variable. Dividing the RMSE by the average value of the outcome variable will give you the prediction error rate, which should be as small as possible:

`RMSE(predictions, test.data$Fertility)/mean(test.data$Fertility)`

`## [1] 0.128`

Note that the validation set method is only useful when you have a large data set that can be partitioned. A disadvantage is that we build a model on only a fraction of the data set, possibly leaving out some interesting information about the data, leading to higher bias. Therefore, the test error rate can be highly variable, depending on which observations are included in the training set and which observations are included in the validation set.

This method, known as leave-one-out cross-validation (LOOCV), works as follows:

- Leave out one data point and build the model on the rest of the data set
- Test the model against the data point that is left out at step 1 and record the test error associated with the prediction
- Repeat the process for all data points
- Compute the overall prediction error by taking the average of all these test error estimates recorded at step 2.
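The steps above can be sketched in base R (without caret), using the `swiss` linear regression model as an example:

```
# A base-R sketch of LOOCV for the swiss linear regression model
data("swiss")
errors <- sapply(seq_len(nrow(swiss)), function(i) {
  fit  <- lm(Fertility ~ ., data = swiss[-i, ])        # leave observation i out
  pred <- predict(fit, newdata = swiss[i, , drop = FALSE])
  swiss$Fertility[i] - pred                            # test error for point i
})
sqrt(mean(errors^2))   # overall LOOCV RMSE
```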

Practical example in R using the `caret` package:

```
# Define training control
train.control <- trainControl(method = "LOOCV")
# Train the model
model <- train(Fertility ~., data = swiss, method = "lm",
trControl = train.control)
# Summarize the results
print(model)
```

```
## Linear Regression
##
## 47 samples
## 5 predictor
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 46, 46, 46, 46, 46, 46, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 7.74 0.613 6.12
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
```

The advantage of the LOOCV method is that we make use of all data points, reducing potential bias.

However, the process is repeated as many times as there are data points, resulting in a high execution time when n is large.

Additionally, we test the model performance against one data point at each iteration. This might result in higher variation in the prediction error if some data points are outliers. So, we need a good ratio of testing data points, a problem addressed by the **k-fold cross-validation method**.

The k-fold cross-validation method evaluates the model performance on different subsets of the training data and then calculates the average prediction error rate. The algorithm is as follows:

- Randomly split the data set into k subsets (also called folds), for example 5 subsets
- Reserve one subset and train the model on all other subsets
- Test the model on the reserved subset and record the prediction error
- Repeat this process until each of the k subsets has served as the test set.
- Compute the average of the k recorded errors. This is called the cross-validation error serving as the performance metric for the model.
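The algorithm above can be sketched in base R (without caret), here with k = 5 folds on the `swiss` data:

```
# A base-R sketch of k-fold cross-validation, with k = 5 folds
set.seed(123)
data("swiss")
k <- 5
folds <- sample(rep(1:k, length.out = nrow(swiss)))       # random fold labels
fold_rmse <- sapply(1:k, function(f) {
  fit  <- lm(Fertility ~ ., data = swiss[folds != f, ])   # train on k - 1 folds
  pred <- predict(fit, newdata = swiss[folds == f, ])     # test on fold f
  sqrt(mean((swiss$Fertility[folds == f] - pred)^2))      # RMSE for fold f
})
mean(fold_rmse)   # cross-validation error: average of the k fold RMSEs
```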

K-fold cross-validation (CV) is a robust method for estimating the accuracy of a model.

The most obvious advantage of k-fold CV compared to LOOCV is computational. A less obvious but potentially more important advantage of k-fold CV is that it often gives more accurate estimates of the test error rate than does LOOCV (James et al. 2014).

A typical question is: how do we choose the right value of k?

A lower value of k gives a more biased estimate and is hence undesirable. On the other hand, a higher value of k is less biased, but can suffer from large variability. It is not hard to see that a smaller value of k (say k = 2) takes us towards the validation set approach, whereas a higher value of k (say k = the number of data points) leads us to the LOOCV approach.

In practice, one typically performs k-fold cross-validation using k = 5 or k = 10, as these values have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.

The following example uses 10-fold cross-validation to estimate the prediction error. Make sure to set the seed for reproducibility.

```
# Define training control
set.seed(123)
train.control <- trainControl(method = "cv", number = 10)
# Train the model
model <- train(Fertility ~., data = swiss, method = "lm",
trControl = train.control)
# Summarize the results
print(model)
```

```
## Linear Regression
##
## 47 samples
## 5 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 43, 42, 42, 41, 43, 41, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 7.38 0.751 6.03
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
```

The process of splitting the data into k folds can be repeated a number of times; this is called repeated k-fold cross-validation.

The final model error is taken as the mean error from the number of repeats.

The following example uses 10-fold cross validation with 3 repeats:

```
# Define training control
set.seed(123)
train.control <- trainControl(method = "repeatedcv",
number = 10, repeats = 3)
# Train the model
model <- train(Fertility ~., data = swiss, method = "lm",
trControl = train.control)
# Summarize the results
print(model)
```

In this chapter, we described 4 different methods for assessing the performance of a model on unseen test data.

These methods include: validation set approach, leave-one-out cross-validation, k-fold cross-validation and repeated k-fold cross-validation.

We generally recommend the (repeated) k-fold cross-validation to estimate the prediction error rate. It can be used in regression and classification settings.

Another alternative to cross-validation is the bootstrap resampling method (Chapter @ref(bootstrap-resampling)), which consists of repeatedly and randomly selecting a sample of n observations from the original data set, and evaluating the model performance on each copy.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014. *An Introduction to Statistical Learning: With Applications in R*. Springer Publishing Company, Incorporated.

Similarly to cross-validation techniques (Chapter @ref(cross-validation)), the **bootstrap resampling** method can be used to measure the accuracy of a predictive model. Additionally, it can be used to measure the uncertainty associated with any statistical estimator.

Bootstrap resampling consists of repeatedly selecting a sample of n observations from the original data set and evaluating the model on each copy. An average standard error is then calculated and the results provide an indication of the overall variance of the model performance.

This chapter describes the basics of bootstrapping and provides practical examples in R for computing a model prediction error. Additionally, we’ll show you how to compute an estimator uncertainty using bootstrap techniques.

We’ll use the following R packages:

- `tidyverse` for easy data manipulation and visualization
- `caret` for easily computing cross-validation methods

```
library(tidyverse)
library(caret)
```

We’ll use the built-in R `swiss` data set, introduced in Chapter @ref(regression-analysis), for predicting fertility scores on the basis of socio-economic indicators.

```
# Load the data
data("swiss")
# Inspect the data
sample_n(swiss, 3)
```

The bootstrap method is used to quantify the uncertainty associated with a given statistical estimator or with a predictive model.

It consists of randomly selecting a sample of n observations from the original data set. This subset, called the bootstrap data set, is then used to evaluate the model.

This procedure is repeated a large number of times and the standard error of the bootstrap estimate is then calculated. The results provide an indication of the variance of the model’s performance.

Note that the sampling is performed with replacement, which means that the same observation can occur more than once in the bootstrap data set.
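To see what sampling with replacement looks like in practice, here is a minimal sketch drawing one bootstrap sample of row indices for the `swiss` data:

```
# One bootstrap sample of row indices:
# drawing n rows with replacement, so duplicates are expected
set.seed(123)
data("swiss")
n   <- nrow(swiss)                  # 47 observations
idx <- sample(n, size = n, replace = TRUE)

any(duplicated(idx))                # TRUE: some rows occur more than once
mean(seq_len(n) %in% idx)           # fraction of original rows that appear
                                    # (about 63% on average for large n)
```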

The following example uses a bootstrap with 100 resamples to test a linear regression model:

```
# Define training control
train.control <- trainControl(method = "boot", number = 100)
# Train the model
model <- train(Fertility ~., data = swiss, method = "lm",
trControl = train.control)
# Summarize the results
print(model)
```

```
## Linear Regression
##
## 47 samples
## 5 predictor
##
## No pre-processing
## Resampling: Bootstrapped (100 reps)
## Summary of sample sizes: 47, 47, 47, 47, 47, 47, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 8.4 0.597 6.76
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
```

The output shows the average model performance across the 100 resamples.

RMSE (Root Mean Squared Error) and MAE (Mean Absolute Error) represent two different measures of the model prediction error. The lower the RMSE and the MAE, the better the model. The R-squared represents the proportion of variation in the outcome explained by the predictor variables included in the model. The higher the R-squared, the better the model. Read more on these metrics in Chapter @ref(regression-model-accuracy-metrics).

The bootstrap approach can be used to quantify the uncertainty (or standard error) associated with any given statistical estimator.

For example, you might want to estimate the accuracy of the linear regression beta coefficients using bootstrap method.

The different steps are as follows:

- Create a simple function, `model_coef()`, that takes the `swiss` data set as well as the indices for the observations, and returns the regression coefficients.
- Apply the function `model_coef()` to the full data set of 47 observations in order to compute the coefficients.

We start by creating a function that returns the regression model coefficients:

```
model_coef <- function(data, index){
coef(lm(Fertility ~., data = data, subset = index))
}
model_coef(swiss, 1:47)
```

```
## (Intercept) Agriculture Examination Education
## 66.915 -0.172 -0.258 -0.871
## Catholic Infant.Mortality
## 0.104 1.077
```

Next, we use the `boot()` function [boot package] to compute the standard errors of 500 bootstrap estimates for the coefficients:

```
library(boot)
boot(swiss, model_coef, 500)
```

```
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = swiss, statistic = model_coef, R = 500)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 66.915 -2.04e-01 10.9174
## t2* -0.172 -5.62e-03 0.0639
## t3* -0.258 -2.27e-02 0.2524
## t4* -0.871 3.89e-05 0.2203
## t5* 0.104 -7.77e-04 0.0319
## t6* 1.077 4.45e-02 0.4478
```

In the output above:

- The `original` column corresponds to the regression coefficients, and the associated standard errors are given in the `std. error` column.
- t1 corresponds to the intercept, t2 corresponds to `Agriculture`, and so on.

For example, it can be seen that the standard error (SE) of the regression coefficient associated with `Agriculture` is about 0.06.

Note that the standard errors measure the variability/accuracy of the beta coefficients. They can be used to compute the confidence intervals of the coefficients.

For example, the 95% confidence interval for a given coefficient b is defined as `b +/- 2*SE(b)`, where:

- the lower limit of b = `b - 2*SE(b) = -0.172 - (2*0.0680) = -0.308` (for the Agriculture variable)
- the upper limit of b = `b + 2*SE(b) = -0.172 + (2*0.0680) = -0.036` (for the Agriculture variable)

That is, there is approximately a 95% chance that the interval [-0.308, -0.036] will contain the true value of the coefficient.
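A minimal sketch of this interval computation, plugging in the coefficient estimate and an assumed bootstrap SE (your own bootstrap SE will vary slightly from run to run):

```
# Approximate 95% CI for the Agriculture coefficient from the bootstrap SE
b  <- -0.172    # original coefficient estimate for Agriculture
se <- 0.0680    # bootstrap standard error (assumed value)
c(lower = b - 2 * se, upper = b + 2 * se)   # c(-0.308, -0.036)
```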

Using the standard `lm()` function gives slightly different standard errors, because the linear model makes some assumptions about the data:

`summary(lm(Fertility ~., data = swiss))$coef`

```
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.915 10.7060 6.25 1.91e-07
## Agriculture -0.172 0.0703 -2.45 1.87e-02
## Examination -0.258 0.2539 -1.02 3.15e-01
## Education -0.871 0.1830 -4.76 2.43e-05
## Catholic 0.104 0.0353 2.95 5.19e-03
## Infant.Mortality 1.077 0.3817 2.82 7.34e-03
```

The bootstrap approach does not rely on any of these assumptions made by the linear model, so it likely gives a more accurate estimate of the coefficients’ standard errors than the `summary()` function does.

This chapter describes the bootstrap resampling method for evaluating a predictive model’s accuracy, as well as for measuring the uncertainty associated with a given statistical estimator.

An alternative approach to bootstrapping, for evaluating a predictive model performance, is cross-validation techniques (Chapter @ref(cross-validation)).