The main goal of linear regression is to predict an outcome value on the basis of one or multiple predictor variables.
In this chapter, we'll describe how to predict the outcome for new observations using R. You will also learn how to display confidence intervals and prediction intervals.
Build a linear regression
We start by building a simple linear regression model that predicts the stopping distances of cars on the basis of the speed.
# Load the data
data("cars", package = "datasets")
# Build the model
model <- lm(dist ~ speed, data = cars)
model
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Coefficients:
## (Intercept)        speed
##      -17.58         3.93
The linear model equation can be written as follows:
dist = -17.579 + 3.932*speed.
Note that the units of the variables speed and dist are mph and ft, respectively.
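As a quick check of the equation, the following minimal sketch plugs a speed value into the fitted coefficients by hand. The names b, speed.value and manual.dist are illustrative and are not part of the original tutorial.

```r
# Rebuild the model so the example is self-contained
data("cars", package = "datasets")
model <- lm(dist ~ speed, data = cars)

# Extract the fitted intercept and slope
b <- coef(model)

# Apply dist = intercept + slope * speed for a speed of 19 mph
speed.value <- 19
manual.dist <- unname(b["(Intercept)"] + b["speed"] * speed.value)
manual.dist
```

The result should be about 57.1 ft, which is the value that predict() returns for the same speed.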
Prediction for new data set
Using the above model, we can predict the stopping distance for a new speed value.
Start by creating a new data frame containing, for example, three new speed values:
new.speeds <- data.frame( speed = c(12, 19, 24) )
You can predict the corresponding stopping distances using the R function predict() as follows:
predict(model, newdata = new.speeds)
##    1    2    3
## 29.6 57.1 76.8
The confidence interval reflects the uncertainty around the mean predictions. To display the 95% confidence intervals around the mean predictions, specify the option interval = "confidence":
predict(model, newdata = new.speeds, interval = "confidence")
##    fit  lwr  upr
## 1 29.6 24.4 34.8
## 2 57.1 51.8 62.4
## 3 76.8 68.4 85.2
The output contains the following columns:
- fit: the predicted stopping distances for the three new speed values
- lwr and upr: the lower and the upper confidence limits for the expected values, respectively. By default, the function produces 95% confidence limits.
For example, the 95% confidence interval associated with a speed of 19 is (51.83, 62.44). This means that, according to our model, a car with a speed of 19 mph has, on average, a stopping distance ranging between 51.83 and 62.44 ft.
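To make the origin of these limits concrete, here is a minimal sketch that rebuilds the confidence interval by hand as fit ± t × standard error, using the se.fit = TRUE option of predict(). The names pred, t.crit and manual.ci are illustrative, not part of the original tutorial.

```r
# Rebuild the model and the new data so the example is self-contained
data("cars", package = "datasets")
model <- lm(dist ~ speed, data = cars)
new.speeds <- data.frame(speed = c(12, 19, 24))

# Ask predict() for the standard error of each mean prediction
pred <- predict(model, newdata = new.speeds, se.fit = TRUE)

# Critical value of the t distribution for a 95% interval
t.crit <- qt(0.975, df = pred$df)

# fit +/- t * standard error of the mean prediction
manual.ci <- cbind(
  fit = pred$fit,
  lwr = pred$fit - t.crit * pred$se.fit,
  upr = pred$fit + t.crit * pred$se.fit
)
manual.ci
```

The resulting matrix should agree with the output of predict() called with interval = "confidence".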
The prediction interval gives the uncertainty around a single predicted value. In the same way as the confidence intervals, the prediction intervals can be computed as follows:
predict(model, newdata = new.speeds, interval = "prediction")
##    fit   lwr   upr
## 1 29.6 -1.75  61.0
## 2 57.1 25.76  88.5
## 3 76.8 44.75 108.8
The 95% prediction interval associated with a speed of 19 is (25.76, 88.51). This means that, according to our model, 95% of the cars with a speed of 19 mph have a stopping distance between 25.76 and 88.51 ft.
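A prediction interval combines two sources of uncertainty: the uncertainty in the estimated mean and the residual variability of individual observations. The sketch below, which assumes an unweighted model such as the one used here, reconstructs the interval by hand from the se.fit and residual.scale components returned by predict(). The names pred, se.pred, t.crit and manual.pi are illustrative.

```r
# Rebuild the model and the new data so the example is self-contained
data("cars", package = "datasets")
model <- lm(dist ~ speed, data = cars)
new.speeds <- data.frame(speed = c(12, 19, 24))

# Standard error of the mean prediction and residual standard deviation
pred <- predict(model, newdata = new.speeds, se.fit = TRUE)

# Standard error for a single new observation:
# uncertainty of the mean plus residual variance
se.pred <- sqrt(pred$se.fit^2 + pred$residual.scale^2)

# Critical value of the t distribution for a 95% interval
t.crit <- qt(0.975, df = pred$df)

manual.pi <- cbind(
  fit = pred$fit,
  lwr = pred$fit - t.crit * se.pred,
  upr = pred$fit + t.crit * se.pred
)
manual.pi
```

The resulting matrix should agree with the output of predict() called with interval = "prediction".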
Note that prediction intervals rely strongly on the assumption that the residual errors are normally distributed with a constant variance. So, you should only use such intervals if you believe that this assumption is approximately met for the data at hand.
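One possible way to examine this assumption is sketched below, using a Shapiro-Wilk test and two base R diagnostic plots. This is only one of several reasonable checks and is not part of the original tutorial; the names res and sw are illustrative.

```r
# Rebuild the model so the example is self-contained
data("cars", package = "datasets")
model <- lm(dist ~ speed, data = cars)

# Residuals of the fitted model
res <- residuals(model)

# Shapiro-Wilk test: a small p-value suggests a departure from normality
sw <- shapiro.test(res)
sw$p.value

# Residuals vs fitted values: a funnel shape would suggest non-constant variance
plot(fitted(model), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points far from the line suggest non-normal residuals
qqnorm(res)
qqline(res)
```

Formal tests and visual checks can disagree, so it is usually worth looking at both before relying on the prediction intervals.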
Prediction interval or confidence interval?
A prediction interval reflects the uncertainty around a single value, while a confidence interval reflects the uncertainty around the mean predicted value. Because a prediction interval also accounts for the residual variability of individual observations, it will generally be much wider than a confidence interval for the same value.
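The difference in width can be checked directly. The following minimal sketch computes both intervals for the same speed values and compares their widths; the names conf.int, pred.int and widths are illustrative.

```r
# Rebuild the model and the new data so the example is self-contained
data("cars", package = "datasets")
model <- lm(dist ~ speed, data = cars)
new.speeds <- data.frame(speed = c(12, 19, 24))

# Both interval types for the same new observations
conf.int <- predict(model, newdata = new.speeds, interval = "confidence")
pred.int <- predict(model, newdata = new.speeds, interval = "prediction")

# Width of each interval (upper limit minus lower limit)
widths <- data.frame(
  speed = new.speeds$speed,
  conf.width = conf.int[, "upr"] - conf.int[, "lwr"],
  pred.width = pred.int[, "upr"] - pred.int[, "lwr"]
)
widths
```

For every speed value, pred.width should be larger than conf.width.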
Which one should we use? The answer to this question depends on the context and the purpose of the analysis. Generally, we are interested in specific individual predictions, so a prediction interval would be more appropriate. Using a confidence interval when you should be using a prediction interval will greatly underestimate the uncertainty in a given predicted value (P. Bruce and Bruce 2017).
The R code below creates a scatter plot with:
- The regression line in blue
- The confidence band in gray
- The prediction band in red
# 0. Build linear model
data("cars", package = "datasets")
model <- lm(dist ~ speed, data = cars)
# 1. Add predictions
pred.int <- predict(model, interval = "prediction")
mydata <- cbind(cars, pred.int)
# 2. Regression line + confidence intervals
library("ggplot2")
p <- ggplot(mydata, aes(speed, dist)) +
  geom_point() +
  stat_smooth(method = lm)
# 3. Add prediction intervals
p + geom_line(aes(y = lwr), color = "red", linetype = "dashed") +
  geom_line(aes(y = upr), color = "red", linetype = "dashed")
In this chapter, we have described how to use the R function predict() to predict the outcome for new data.
Bruce, Peter, and Andrew Bruce. 2017. Practical Statistics for Data Scientists. O’Reilly Media.