Regression Model Accuracy Metrics: R-square, AIC, BIC, Cp and more

In this chapter we’ll describe different statistical regression metrics for measuring the performance of a regression model (Chapter @ref(linear-regression)).

Next, we’ll provide practical examples in R for comparing the performance of two models in order to select the best one for our data.

Contents:

  • Model performance metrics
  • Loading required R packages
  • Example of data
  • Building regression models
  • Assessing model quality
  • Comparing regression models performance
  • Discussion

Model performance metrics

In regression models, the most commonly known evaluation metrics include:

  1. R-squared (R2), which is the proportion of variation in the outcome that is explained by the predictor variables. In multiple regression models, R2 corresponds to the squared correlation between the observed outcome values and the values predicted by the model. The higher the R-squared, the better the model.

  2. Root Mean Squared Error (RMSE), which measures the average prediction error made by the model when predicting the outcome for an observation. Mathematically, the RMSE is the square root of the mean squared error (MSE), which is the average squared difference between the observed outcome values and the values predicted by the model. So, MSE = mean((observeds - predicteds)^2) and RMSE = sqrt(MSE). The lower the RMSE, the better the model.

  3. Residual Standard Error (RSE), also known as the model sigma, is a variant of the RMSE adjusted for the number of predictors in the model. The lower the RSE, the better the model. In practice, the difference between RMSE and RSE is very small, particularly for large multivariate data.

  4. Mean Absolute Error (MAE), which, like the RMSE, measures the prediction error. Mathematically, it is the average absolute difference between observed and predicted outcomes: MAE = mean(abs(observeds - predicteds)). The MAE is less sensitive to outliers than the RMSE.

The problem with the above metrics is that they are sensitive to the inclusion of additional variables in the model, even if those variables do not contribute significantly to explaining the outcome. Put another way, adding variables to the model will always increase the R2 and reduce the RMSE. So, we need more robust metrics to guide the model choice.

Concerning R2, there is an adjusted version, called the Adjusted R-squared, which penalizes the R2 for the number of variables included in the model.

Additionally, there are four other important metrics - AIC, AICc, BIC and Mallows Cp - that are commonly used for model evaluation and selection. They estimate the model prediction error (MSE) while penalizing model complexity; their standard formulas are recalled just after the list below. The lower these metrics, the better the model.

  1. AIC stands for Akaike’s Information Criterion, a metric developed by the Japanese statistician Hirotugu Akaike in the early 1970s. The basic idea of AIC is to penalize the inclusion of additional variables in a model. It adds a penalty that increases the error when additional terms are included. The lower the AIC, the better the model.
  2. AICc is a version of the AIC corrected for small sample sizes.
  3. BIC (or Bayesian Information Criterion) is a variant of the AIC with a stronger penalty for including additional variables in the model.
  4. Mallows Cp: a variant of the AIC developed by Colin Mallows.
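
For reference, the standard textbook forms of these quantities are recalled below, where n is the number of observations, p the number of predictors, k the number of estimated parameters and L the maximized likelihood of the model. The exact counting of k differs slightly between implementations, so these should be read as the usual definitions rather than as the formulas of any particular R function:

Adjusted R2 = 1 - (1 - R2)*(n - 1)/(n - p - 1)
AIC = 2*k - 2*log(L)
AICc = AIC + (2*k*(k + 1))/(n - k - 1)
BIC = log(n)*k - 2*log(L)
Mallows Cp = SSE_model/MSE_full - n + 2*k_model, where SSE_model is the residual sum of squares of the candidate model, MSE_full the mean squared error of the full model and k_model the number of coefficients (including the intercept) of the candidate model.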

Generally, the most commonly used metrics for measuring regression model quality and for comparing models are: the Adjusted R2, the AIC, the BIC and the Cp.

In the following sections, we’ll show you how to compute the above-mentioned metrics.

Loading required R packages

  • tidyverse for data manipulation and visualization
  • modelr provides helper functions for computing regression model performance metrics
  • broom easily creates a tidy data frame containing the model’s statistical metrics
library(tidyverse)
library(modelr)
library(broom)
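
If any of these packages (or the caret package, which is loaded further below for an alternative computation) is not already installed, it can be obtained from CRAN first. This is a one-time step:

# One-time installation from CRAN (only if needed)
install.packages(c("tidyverse", "modelr", "broom", "caret"))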

Example of data

We’ll use the built-in R swiss data set, introduced in Chapter @ref(regression-analysis), for predicting the fertility score on the basis of socio-economic indicators.

# Load the data
data("swiss")
# Inspect the data
sample_n(swiss, 3)

Building regression models

We start by creating two models:

  1. Model 1, including all predictors
  2. Model 2, including all predictors except the variable Examination
model1 <- lm(Fertility ~., data = swiss)
model2 <- lm(Fertility ~. -Examination, data = swiss)
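
As a side note, model2 can equivalently be built from model1 with update(), which avoids retyping the full formula; the object name model2_alt below is used only for illustration:

# Equivalent to model2: drop Examination from model1's formula
model2_alt <- update(model1, . ~ . - Examination)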

Assessing model quality

There are many R functions and packages for assessing model quality, including:

  • summary() [stats package], returns the R-squared, adjusted R-squared and the RSE
  • AIC() and BIC() [stats package], compute the AIC and the BIC, respectively
summary(model1)
AIC(model1)
BIC(model1)
  • rsquare(), rmse() and mae() [modelr package], compute, respectively, the R2, the RMSE and the MAE.
library(modelr)
data.frame(
  R2 = rsquare(model1, data = swiss),
  RMSE = rmse(model1, data = swiss),
  MAE = mae(model1, data = swiss)
)
  • R2(), RMSE() and MAE() [caret package], compute, respectively, the R2, the RMSE and the MAE.
library(caret)
predictions <- model1 %>% predict(swiss)
data.frame(
  R2 = R2(predictions, swiss$Fertility),
  RMSE = RMSE(predictions, swiss$Fertility),
  MAE = MAE(predictions, swiss$Fertility)
)
  • glance() [broom package], computes the R2, adjusted R2, sigma (RSE), AIC, BIC.
library(broom)
glance(model1)
  • Manual computation of R2, RMSE and MAE:
# Make predictions and compute the
# R2, RMSE and MAE
swiss %>%
  add_predictions(model1) %>%
  summarise(
    R2 = cor(Fertility, pred)^2,
    MSE = mean((Fertility - pred)^2),
    RMSE = sqrt(MSE),
    MAE = mean(abs(Fertility - pred))
  )
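
Similarly, the AIC and the BIC can be reproduced by hand from the model’s log-likelihood, using the definitions AIC = 2*k - 2*log(L) and BIC = log(n)*k - 2*log(L). This is only a sketch to show where the numbers come from; in practice AIC() and BIC() should be preferred. Note that, for lm models, the parameter count returned by logLik() includes the residual variance in addition to the regression coefficients:

# Reproduce AIC() and BIC() from the log-likelihood
ll <- logLik(model1)
k <- attr(ll, "df")   # number of estimated parameters
n <- nobs(model1)     # number of observations
c(AIC = 2*k - 2*as.numeric(ll),
  BIC = log(n)*k - 2*as.numeric(ll))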

Comparing regression models performance

Here, we’ll use the function glance() to simply compare the overall quality of our two models:

# Metrics for model 1
glance(model1) %>%
  dplyr::select(adj.r.squared, sigma, AIC, BIC, p.value)
##   adj.r.squared sigma AIC BIC  p.value
## 1         0.671  7.17 326 339 5.59e-10
# Metrics for model 2
glance(model2) %>%
  dplyr::select(adj.r.squared, sigma, AIC, BIC, p.value)
##   adj.r.squared sigma AIC BIC  p.value
## 1         0.671  7.17 325 336 1.72e-10

From the output above, it can be seen that:

  1. The two models have exactly the same adjusted R2 (0.67), meaning that they are equivalent in explaining the outcome, here the fertility score. Additionally, they have the same residual standard error (RSE or sigma = 7.17). However, model 2 is simpler than model 1 because it incorporates fewer variables. All else being equal, the simpler model is preferred in statistics.

  2. The AIC and the BIC of model 2 are lower than those of model 1. In model comparison strategies, the model with the lowest AIC and BIC scores is preferred.

  3. Finally, the F-statistic p.value of model 2 is lower than that of model 1. This means that model 2 is statistically more significant than model 1, which is consistent with the conclusions above.
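
As a convenience, the metrics of the two models can also be stacked into a single table instead of being read from two separate outputs; this is just one possible way of doing it with dplyr::bind_rows():

# Put the metrics of both models side by side
dplyr::bind_rows(
  model1 = glance(model1),
  model2 = glance(model2),
  .id = "model"
) %>%
  dplyr::select(model, adj.r.squared, sigma, AIC, BIC, p.value)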

Note that the RMSE and the RSE are measured on the same scale as the outcome variable. Dividing the RSE by the average value of the outcome variable gives the prediction error rate, which should be as small as possible:

sigma(model1)/mean(swiss$Fertility)
## [1] 0.102

In our example, the average prediction error rate is about 10%.
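
Finally, the AICc and the Mallows Cp mentioned earlier are not part of the glance() output, but they can be computed from quantities we already have. The sketch below uses the standard formulas, with model1 taken as the full model for the Cp of model2; the helper function aicc() is defined here only for illustration:

# AICc: small-sample corrected AIC
aicc <- function(fit) {
  k <- attr(logLik(fit), "df")  # number of estimated parameters
  n <- nobs(fit)                # number of observations
  AIC(fit) + 2*k*(k + 1)/(n - k - 1)
}
c(model1 = aicc(model1), model2 = aicc(model2))

# Mallows Cp of model2, using model1 as the full model:
# Cp = SSE_model/MSE_full - n + 2*(number of coefficients)
sse2 <- sum(residuals(model2)^2)
mse1 <- sigma(model1)^2
sse2/mse1 - nobs(model2) + 2*length(coef(model2))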

Discussion

This chapter describes several metrics for assessing the overall performance of a regression model.

The most important metrics are the Adjusted R-squared, the RMSE, the AIC and the BIC. These metrics are also used as the basis for model comparison and optimal model selection.

Note that these regression metrics are all internal measures, that is, they are computed on the same data that was used to build the regression model. They tell you how well the model fits the data at hand, called the training data set.

In general, we do not really care how well the method works on the training data. Rather, we are interested in the accuracy of the predictions that we obtain when we apply our method to previously unseen test data.

However, test data is not always available, which makes the test error difficult to estimate. In this situation, methods such as cross-validation (Chapter @ref(cross-validation)) and the bootstrap (Chapter @ref(bootstrap-resampling)) are used to estimate the test error (or the prediction error rate) from the training data.