
## Nonlinear Regression Essentials in R: Polynomial and Spline Regression Models


In some cases, the true relationship between the outcome and a predictor variable might not be linear.

There are different solutions extending the linear regression model (Chapter @ref(linear-regression)) for capturing these nonlinear effects, including:

• Polynomial regression. This is the simplest approach to modeling non-linear relationships. It adds polynomial terms (squares, cubes, etc.) to a regression.

• Spline regression. Fits a smooth curve with a series of polynomial segments. The values delimiting the spline segments are called knots.

• Generalized additive models (GAM). Fits spline models with automated selection of knots.
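As a minimal sketch of the first approach, polynomial terms can be added to an `lm()` formula either with `I()` or with `poly()`. The quadratic degree is an assumption chosen for illustration, and the models are fit to the full Boston data from the MASS package rather than to a training split.

```r
# Sketch: adding a squared term to a linear regression
# (degree 2 is an illustrative choice, not a recommendation)
data("Boston", package = "MASS")

# Straight-line fit, for reference
linear_fit <- lm(medv ~ lstat, data = Boston)

# Two equivalent ways to write a quadratic model
quad_fit_I    <- lm(medv ~ lstat + I(lstat^2), data = Boston)
quad_fit_poly <- lm(medv ~ poly(lstat, 2, raw = TRUE), data = Boston)

# Both parameterisations give the same fitted values
all.equal(unname(fitted(quad_fit_I)), unname(fitted(quad_fit_poly)))

# R-squared of the straight line versus the quadratic model
c(linear    = summary(linear_fit)$r.squared,
  quadratic = summary(quad_fit_poly)$r.squared)
```

Because the linear model is nested in the quadratic one, the in-sample R-squared cannot decrease when the squared term is added; whether the extra term helps on new data is what the test-set comparison is for.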

In this chapter, you’ll learn how to compute non-linear regression models and how to compare the different models in order to choose the one that best fits your data.

The RMSE and R2 metrics will be used to compare the different models (see Chapter @ref(linear-regression)).

Recall that the RMSE represents the model prediction error, that is, the typical difference between the observed outcome values and the predicted outcome values (computed as the square root of the mean squared difference). The R2 represents the squared correlation between the observed and predicted outcome values. The best model is the one with the lowest RMSE and the highest R2.
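To make the two definitions concrete, they can be written out in base R. The helper names `rmse()` and `r2()` and the toy vectors are hypothetical, introduced only for illustration.

```r
# Hypothetical helpers mirroring the definitions in the text
rmse <- function(pred, obs) sqrt(mean((obs - pred)^2))  # root mean squared error
r2   <- function(pred, obs) cor(pred, obs)^2            # squared correlation

# Toy values, made up for illustration
observed  <- c(10, 20, 30, 40)
predicted <- c(12, 18, 33, 39)

rmse(predicted, observed)
r2(predicted, observed)
```

Since the errors are squared before averaging, a few large misses raise the RMSE more than many small ones do.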

## Loading required R packages

• tidyverse for easy data manipulation and visualization
• caret for easy machine learning workflow
library(tidyverse)
library(caret)
theme_set(theme_classic())

## Preparing the data

We’ll use the Boston data set (in the MASS package), introduced in Chapter @ref(regression-analysis), for predicting the median house value (medv) in Boston suburbs, based on the predictor variable lstat (percentage of lower status of the population).

We’ll randomly split the data into a training set (80%, for building a predictive model) and a test set (20%, for evaluating the model). Make sure to set the seed for reproducibility.

# Load the data
data("Boston", package = "MASS")
# Split the data into training and test set
set.seed(123)
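One way to finish the 80/20 split is sketched below with base R's `sample()`. This is an assumption about the mechanics rather than a reproduction of the original code, which may rely on a different helper, so metrics computed on this split will not exactly match the values reported in the text. The names `train.data` and `test.data` follow their use later in the chapter; `train.idx` is hypothetical.

```r
# Sketch of a random 80/20 split using base R (the sampling method is an assumption)
data("Boston", package = "MASS")
set.seed(123)

n <- nrow(Boston)
train.idx  <- sample(n, size = floor(0.8 * n))  # row numbers for the training set
train.data <- Boston[train.idx, ]
test.data  <- Boston[-train.idx, ]

# Check the sizes of the two sets
c(train = nrow(train.data), test = nrow(test.data))
```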
R2 = R2(predictions, test.data$medv)
)
##   RMSE    R2
## 1 5.24 0.657

Visualize the data:

ggplot(train.data, aes(lstat, medv) ) +
  geom_point() +
  stat_smooth(method = lm, formula = y ~ log(x))

## Spline regression

Polynomial regression only captures a certain amount of curvature in a nonlinear relationship. An alternative, and often superior, approach to modeling nonlinear relationships is to use splines (P. Bruce and Bruce 2017).

Splines provide a way to smoothly interpolate between fixed points, called knots. Polynomial regression is computed between knots. In other words, splines are a series of polynomial segments strung together, joining at knots (P. Bruce and Bruce 2017).

The R package splines includes the function bs for creating a b-spline term in a regression model.

You need to specify two parameters: the degree of the polynomial and the location of the knots. In our example, we’ll place the knots at the lower quartile, the median, and the upper quartile:

knots <- quantile(train.data$lstat, probs = c(0.25, 0.5, 0.75))
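To see what `bs()` produces from those two parameters, the sketch below builds the basis matrix directly. It uses the full Boston data rather than the training split (an assumption made to keep the example self-contained), and the object names `basis` and `fit` are hypothetical.

```r
library(splines)
data("Boston", package = "MASS")

# Knots at the three quartiles of lstat
knots <- quantile(Boston$lstat, probs = c(0.25, 0.5, 0.75))

# Cubic B-spline basis: one column per basis function
basis <- bs(Boston$lstat, knots = knots, degree = 3)
dim(basis)  # rows = observations, columns = length(knots) + degree

# lm() estimates one coefficient per basis column, plus an intercept
fit <- lm(medv ~ bs(lstat, knots = knots, degree = 3), data = Boston)
length(coef(fit))
```

Adding a knot adds one column to the basis, so the number of knots directly controls how many coefficients the model has to estimate.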

We’ll create a model using a cubic spline (degree = 3):

library(splines)
# Build the model
knots <- quantile(train.data$lstat, probs = c(0.25, 0.5, 0.75))
model <- lm(medv ~ bs(lstat, knots = knots), data = train.data)
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  R2 = R2(predictions, test.data$medv)
)
##   RMSE    R2
## 1 4.97 0.688

Note that the coefficients of a spline term are not directly interpretable.

Visualize the cubic spline as follows (df = 6 gives a cubic B-spline with three interior knots placed at the quartiles of x, matching the model above):

ggplot(train.data, aes(lstat, medv) ) +
  geom_point() +
  stat_smooth(method = lm, formula = y ~ splines::bs(x, df = 6))

## Generalized additive models

Once you have detected a non-linear relationship in your data, the polynomial terms may not be flexible enough to capture the relationship, and spline terms require specifying the knots.

Generalized additive models, or GAM, are a technique to automatically fit a spline regression. This can be done using the mgcv R package:

library(mgcv)
# Build the model
model <- gam(medv ~ s(lstat), data = train.data)
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  R2 = R2(predictions, test.data$medv)
)
##   RMSE    R2
## 1 5.02 0.684

The term s(lstat) tells the gam() function to fit a smooth spline term for lstat and to choose its flexibility automatically, so you do not have to specify the knots yourself.
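How much flexibility `gam()` settled on can be inspected after fitting. The sketch below fits the same kind of model to the full Boston data (an assumption, to keep it self-contained) and reads the effective degrees of freedom of the smooth from `summary()`. A value near 1 would indicate an almost straight line, and larger values a wigglier curve. The object names and the value `k = 5` are illustrative.

```r
library(mgcv)
data("Boston", package = "MASS")

# Fit a GAM with a single smooth term for lstat
gam_fit <- gam(medv ~ s(lstat), data = Boston)

# Effective degrees of freedom chosen for s(lstat)
smooth_edf <- summary(gam_fit)$edf
smooth_edf

# The basis size can be capped with k (k = 5 is an arbitrary illustrative value)
gam_small <- gam(medv ~ s(lstat, k = 5), data = Boston)
summary(gam_small)$edf
```

Lowering `k` limits how wiggly the smooth can become, which can be useful when the default fit looks too flexible for the amount of data.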

Visualize the data:

ggplot(train.data, aes(lstat, medv) ) +
geom_point() +
stat_smooth(method = gam, formula = y ~ s(x))

## Comparing the models

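Comparing the RMSE and R2 values of the different models shows that the polynomial regression, the spline regression and the generalized additive model outperform the linear regression model and the log transformation approach. Among the results reported above, for example, the spline regression (RMSE 4.97, R2 0.688) and the GAM (RMSE 5.02, R2 0.684) both improve on the log transformation (RMSE 5.24, R2 0.657).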
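A comparison like this is easier to read when every model is scored in one table. The following sketch does so with base R. The split via `sample()`, the quadratic degree, the boundary knots, and the helper name `score_model()` are assumptions made for illustration, so the numbers it prints will not match those reported in this chapter.

```r
library(splines)
library(mgcv)
data("Boston", package = "MASS")

# Illustrative 80/20 split (not necessarily the split used in the text)
set.seed(123)
train.idx  <- sample(nrow(Boston), size = floor(0.8 * nrow(Boston)))
train.data <- Boston[train.idx, ]
test.data  <- Boston[-train.idx, ]

knots  <- quantile(train.data$lstat, probs = c(0.25, 0.5, 0.75))
bounds <- range(Boston$lstat)  # assumption: avoids boundary warnings at prediction time

# Candidate models, all fit on the training set
models <- list(
  linear     = lm(medv ~ lstat, data = train.data),
  log        = lm(medv ~ log(lstat), data = train.data),
  polynomial = lm(medv ~ poly(lstat, 2), data = train.data),  # degree 2 is an assumption
  spline     = lm(medv ~ bs(lstat, knots = knots, Boundary.knots = bounds),
                  data = train.data),
  gam        = gam(medv ~ s(lstat), data = train.data)
)

# Hypothetical helper: test-set RMSE and R2 for one fitted model
score_model <- function(model) {
  pred <- as.numeric(predict(model, newdata = test.data))
  c(RMSE = sqrt(mean((test.data$medv - pred)^2)),
    R2   = cor(pred, test.data$medv)^2)
}

# One row per model, sorted from lowest to highest RMSE
results <- as.data.frame(t(sapply(models, score_model)))
results[order(results$RMSE), ]
```

Scoring all models on the same held-out rows keeps the comparison fair; with a single random split the ranking of close competitors can still change from one seed to another.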

## Discussion

This chapter describes how to compute non-linear regression models using R.

## References

Bruce, Peter, and Andrew Bruce. 2017. Practical Statistics for Data Scientists. O’Reilly Media.