Predict in R: Model Predictions and Confidence Intervals

kassambara | 10/03/2018 | 464965 | Comments (6) | Regression Analysis

The main goal of linear regression is to predict an outcome value on the basis of one or multiple predictor variables.

In this chapter, we’ll describe how to predict the outcome for new observations using R. You will also learn how to display the confidence intervals and the prediction intervals.

Contents:

Build a linear regression
Prediction for new data set
Confidence interval
Prediction interval
Prediction interval or confidence interval?
Discussion
References

Build a linear regression

We start by building a simple linear regression model that predicts the stopping distances of cars on the basis of the speed.

# Load the data
data("cars", package = "datasets")
# Build the model
model <- lm(dist ~ speed, data = cars)
model

## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Coefficients:
## (Intercept)        speed  
##      -17.58         3.93

The linear model equation can be written as follows: dist = -17.579 + 3.932*speed.

Note that the variables speed and dist are measured in mph and ft, respectively.
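
The rounded coefficients in the equation above can be read directly from the fitted model. The following is a minimal sketch, assuming the same cars model, that extracts them with coef() and applies the equation by hand for a speed of 19 mph. The names b and manual are illustrative only.

```r
data("cars", package = "datasets")
model <- lm(dist ~ speed, data = cars)

# Extract the estimated intercept and slope at full precision
b <- coef(model)
b

# Apply the equation by hand for a speed of 19 mph
manual <- unname(b["(Intercept)"] + b["speed"] * 19)
manual
```

The hand-computed value should agree with what predict() returns for the same speed.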

Prediction for new data set

Using the above model, we can predict the stopping distance for a new speed value.

Start by creating a new data frame containing, for example, three new speed values:

new.speeds <- data.frame(
  speed = c(12, 19, 24)
)

You can predict the corresponding stopping distances using the R function predict() as follows:

predict(model, newdata = new.speeds)

##    1    2    3 
## 29.6 57.1 76.8
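
Note that the column in the new data frame must carry the same name as the predictor used in the model formula (here, speed). As a small sketch, the predictions can also be stored next to the speeds they belong to. The names result and dist.pred are illustrative, not part of the original example.

```r
data("cars", package = "datasets")
model <- lm(dist ~ speed, data = cars)
new.speeds <- data.frame(speed = c(12, 19, 24))

# Attach the predictions as a new column (dist.pred is an illustrative name)
result <- cbind(new.speeds, dist.pred = predict(model, newdata = new.speeds))
result
```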

Confidence interval

The confidence interval reflects the uncertainty around the mean predictions. To display the 95% confidence intervals around the mean predictions, specify the option interval = "confidence":

predict(model, newdata = new.speeds, interval = "confidence")

##    fit  lwr  upr
## 1 29.6 24.4 34.8
## 2 57.1 51.8 62.4
## 3 76.8 68.4 85.2

The output contains the following columns:

fit: the predicted stopping distances for the three new speed values
lwr and upr: the lower and the upper confidence limits for the expected values, respectively. By default, the function produces 95% confidence limits.

For example, the 95% confidence interval associated with a speed of 19 is (51.83, 62.44). This means that, according to our model, cars travelling at 19 mph have, on average, a stopping distance between 51.83 and 62.44 ft.
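
To see where these limits come from, the sketch below rebuilds the confidence interval by hand from the standard errors that predict() returns with se.fit = TRUE. It assumes the standard t-based formula for a linear model. The names pr, t.crit and ci.manual are illustrative.

```r
data("cars", package = "datasets")
model <- lm(dist ~ speed, data = cars)
new.speeds <- data.frame(speed = c(12, 19, 24))

# Ask predict() for the standard error of each fitted mean
pr <- predict(model, newdata = new.speeds, se.fit = TRUE)

# Two-sided 95% critical value from the t distribution
t.crit <- qt(0.975, df = pr$df)

# fit +/- critical value * standard error of the mean prediction
ci.manual <- cbind(
  fit = pr$fit,
  lwr = pr$fit - t.crit * pr$se.fit,
  upr = pr$fit + t.crit * pr$se.fit
)
ci.manual

# A different level, e.g. 99%, can be requested with the level argument
ci.99 <- predict(model, newdata = new.speeds,
                 interval = "confidence", level = 0.99)
ci.99
```

The 99% limits are wider than the 95% ones, since a higher confidence level requires a larger critical value.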

Prediction interval

The prediction interval gives the uncertainty around a single value. In the same way as the confidence intervals, the prediction intervals can be computed as follows:

predict(model, newdata = new.speeds, interval = "prediction")

##    fit   lwr   upr
## 1 29.6 -1.75  61.0
## 2 57.1 25.76  88.5
## 3 76.8 44.75 108.8

The 95% prediction interval associated with a speed of 19 is (25.76, 88.51). This means that, according to our model, about 95% of the cars with a speed of 19 mph are expected to have a stopping distance between 25.76 and 88.51 ft.
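
The prediction interval is wider because it adds the residual variance to the uncertainty of the fitted mean. A minimal sketch of that calculation, assuming the standard formula for an unweighted linear model (the names se.pred and pi.manual are illustrative):

```r
data("cars", package = "datasets")
model <- lm(dist ~ speed, data = cars)
new.speeds <- data.frame(speed = c(12, 19, 24))

pr <- predict(model, newdata = new.speeds, se.fit = TRUE)
t.crit <- qt(0.975, df = pr$df)

# Standard error for a single new observation:
# uncertainty of the fitted mean plus the residual variance
se.pred <- sqrt(pr$se.fit^2 + pr$residual.scale^2)

pi.manual <- cbind(
  fit = pr$fit,
  lwr = pr$fit - t.crit * se.pred,
  upr = pr$fit + t.crit * se.pred
)
pi.manual
```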

Note that the prediction interval relies strongly on the assumption that the residual errors are normally distributed with a constant variance. So, you should only use such intervals if you believe that this assumption is approximately met for the data at hand.
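
One possible way to probe these assumptions is sketched below. It is only a rough check, not a complete diagnostic procedure: it runs a Shapiro-Wilk test on the residuals and uses a rank correlation between the absolute residuals and the fitted values as a crude, illustrative indicator of non-constant variance. The names res, sw and het are illustrative.

```r
data("cars", package = "datasets")
model <- lm(dist ~ speed, data = cars)
res <- residuals(model)

# Shapiro-Wilk test: a small p-value suggests a departure from normality
sw <- shapiro.test(res)
sw

# Crude check for non-constant variance: do the absolute residuals
# trend with the fitted values? (rank correlation, illustrative only)
het <- cor.test(abs(res), fitted(model), method = "spearman", exact = FALSE)
het$estimate
```

In practice, the diagnostic plots produced by plot(model, which = 1:2) (residuals vs fitted values and a normal Q-Q plot) are a common visual complement to such tests.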

Prediction interval or confidence interval?

A prediction interval reflects the uncertainty around a single value, while a confidence interval reflects the uncertainty around the mean prediction values. Thus, a prediction interval will generally be much wider than a confidence interval for the same value.
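
The difference in width can be checked directly on the three speeds used earlier. The sketch below simply subtracts the lower limit from the upper limit for each interval type. The names conf.int, pred.int and widths are illustrative.

```r
data("cars", package = "datasets")
model <- lm(dist ~ speed, data = cars)
new.speeds <- data.frame(speed = c(12, 19, 24))

conf.int <- predict(model, newdata = new.speeds, interval = "confidence")
pred.int <- predict(model, newdata = new.speeds, interval = "prediction")

# Width of each interval, side by side
widths <- data.frame(
  speed    = new.speeds$speed,
  ci.width = conf.int[, "upr"] - conf.int[, "lwr"],
  pi.width = pred.int[, "upr"] - pred.int[, "lwr"]
)
widths
```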

Which one should we use? The answer to this question depends on the context and the purpose of the analysis. Generally, we are interested in specific individual predictions, so a prediction interval would be more appropriate. Using a confidence interval when you should be using a prediction interval will greatly underestimate the uncertainty in a given predicted value (Bruce and Bruce 2017).

The R code below creates a scatter plot with:

The regression line in blue
The confidence band in gray
The prediction band in red

# 0. Build linear model 
data("cars", package = "datasets")
model <- lm(dist ~ speed, data = cars)
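# 1. Add predictions
# Note: without newdata, predict() warns that predictions on the
# current data refer to future responses; this is expected here
pred.int <- predict(model, interval = "prediction")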
mydata <- cbind(cars, pred.int)
# 2. Regression line + confidence intervals
library("ggplot2")
p <- ggplot(mydata, aes(speed, dist)) +
  geom_point() +
  stat_smooth(method = lm)
# 3. Add prediction intervals
p + geom_line(aes(y = lwr), color = "red", linetype = "dashed")+
    geom_line(aes(y = upr), color = "red", linetype = "dashed")

Discussion

In this chapter, we have described how to use the R function predict() to predict the outcome for new data, and how to obtain the corresponding confidence and prediction intervals.

References

Bruce, Peter, and Andrew Bruce. 2017. Practical Statistics for Data Scientists. O’Reilly Media.

Last update : 24/07/2018
