**Linear regression** (or **linear model**) is used to predict a quantitative outcome variable (y) on the basis of one or multiple predictor variables (x) (James et al. 2014; Bruce and Bruce 2017).

The goal is to build a mathematical formula that defines y as a function of the x variables. Once we have built a statistically significant model, it can be used to predict future outcomes on the basis of new x values.

When you build a regression model, you need to assess the performance of the predictive model. In other words, you need to evaluate how well the model performs in predicting the outcome for new test data that have not been used to build the model.

Two important metrics are commonly used to assess the performance of the predictive regression model:

- **Root Mean Squared Error** (RMSE), which measures the model prediction error. It corresponds to the average difference between the observed known values of the outcome and the values predicted by the model. RMSE is computed as `RMSE = mean((observeds - predicteds)^2) %>% sqrt()`. The lower the RMSE, the better the model.
- **R-square** (R2), representing the squared correlation between the observed known outcome values and the values predicted by the model. The higher the R2, the better the model.
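Both metrics can be computed by hand from any pair of observed and predicted vectors; a minimal self-contained sketch in base R (the toy numbers are illustrative, not from the marketing data):

```r
# Toy observed and predicted values (illustration only)
observeds  <- c(3.1, 4.8, 6.2, 7.9)
predicteds <- c(3.0, 5.0, 6.0, 8.1)

# RMSE: square root of the mean squared difference
rmse <- sqrt(mean((observeds - predicteds)^2))

# R-square: squared correlation between observed and predicted
r2 <- cor(observeds, predicteds)^2
```

The `caret` functions `RMSE()` and `R2()` used later in this chapter compute the same quantities.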

A simple workflow to build a predictive regression model is as follows:

- Randomly split your data into training set (80%) and test set (20%)
- Build the regression model using the training set
- Make predictions using the test set and compute the model accuracy metrics

In this chapter, you will learn:

- the basics and the formula of linear regression,
- how to compute simple and multiple regression models in R,
- how to make predictions of the outcome of new data,
- how to assess the performance of the model


The mathematical formula of the linear regression can be written as follows:

`y = b0 + b1*x + e`

We read this as “y is modeled as beta1 (`b1`) times `x`, plus a constant beta0 (`b0`), plus an error term `e`.”

When you have multiple predictor variables, the equation can be written as `y = b0 + b1*x1 + b2*x2 + ... + bn*xn`, where:

- b0 is the intercept;
- b1, b2, …, bn are the regression weights or coefficients associated with the predictors x1, x2, …, xn;
- `e` is the *error term* (also known as the *residual error*), the part of y that cannot be explained by the regression model.

Note that b0, b1, b2, … and bn are known as the regression beta coefficients or parameters.

The figure below illustrates a simple linear regression model, where:

- the best-fit regression line is in blue
- the intercept (b0) and the slope (b1) are shown in green
- the error terms (e) are represented by vertical red lines

From the scatter plot above, it can be seen that not all the data points fall exactly on the fitted regression line. Some of the points are above the blue line and some are below it; overall, the residual errors (e) have a mean of approximately zero.

The sum of the squares of the residual errors is called the **Residual Sum of Squares** or **RSS**.

The average variation of points around the fitted regression line is called the **Residual Standard Error** (**RSE**). This is one of the metrics used to evaluate the overall quality of the fitted regression model. The lower the RSE, the better.

Since the mean error term is zero, the outcome variable y can be approximately estimated as follows:

`y ~ b0 + b1*x`

Mathematically, the beta coefficients (b0 and b1) are determined so that the RSS is as minimal as possible. This method of determining the beta coefficients is technically called **least squares** regression or **ordinary least squares** (OLS) regression.
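As a sketch of what least squares does under the hood, the OLS estimates can also be obtained from the closed-form normal equations, `b = (X'X)^-1 X'y`, and compared with `lm()` on simulated data (the variable names and simulated coefficients are illustrative):

```r
# Simulate data from y = 2 + 0.5*x + e
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50, sd = 0.3)

# Closed-form OLS solution, with an intercept column in the design matrix X
X <- cbind(1, x)
b <- solve(t(X) %*% X, t(X) %*% y)

# lm() returns the same estimates
coef(lm(y ~ x))
```

Both approaches minimize the same RSS, so the estimates agree up to numerical precision.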

Once the beta coefficients are calculated, a t-test is performed to check whether or not these coefficients are significantly different from zero. A non-zero beta coefficient means that there is a significant relationship between the predictor (x) and the outcome variable (y).

- `tidyverse` for easy data manipulation and visualization
- `caret` for easy machine learning workflow

```
library(tidyverse)
library(caret)
theme_set(theme_bw())
```

We’ll use the `marketing` data set, introduced in Chapter @ref(regression-analysis), for predicting sales units on the basis of the amount of money spent on three advertising media (youtube, facebook and newspaper).

We’ll randomly split the data into a training set (80%, for building a predictive model) and a test set (20%, for evaluating the model). Make sure to set the seed for reproducibility.

```
# Load the data
data("marketing", package = "datarium")
# Inspect the data
sample_n(marketing, 3)
```

```
## youtube facebook newspaper sales
## 58 163.4 23.0 19.9 15.8
## 157 112.7 52.2 60.6 18.4
## 81 91.7 32.0 26.8 14.2
```

```
# Split the data into training and test set
set.seed(123)
training.samples <- marketing$sales %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- marketing[training.samples, ]
test.data <- marketing[-training.samples, ]
```

The R function `lm()` is used to compute a linear regression model.

```
# Build the model
model <- lm(sales ~., data = train.data)
# Summarize the model
summary(model)
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
# (a) Prediction error, RMSE
RMSE(predictions, test.data$sales)
# (b) R-square
R2(predictions, test.data$sales)
```

The **simple linear regression** is used to predict a continuous outcome variable (y) based on one single predictor variable (x).

In the following example, we’ll build a simple linear model to predict sales units based on the advertising budget spent on youtube. The regression equation can be written as `sales = b0 + b1*youtube`.

The R function `lm()` can be used to determine the beta coefficients of the linear model, as follows:

```
model <- lm(sales ~ youtube, data = train.data)
summary(model)$coef
```

```
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.3839 0.62442 13.4 5.22e-28
## youtube 0.0468 0.00301 15.6 7.84e-34
```

The output above shows the estimates of the regression beta coefficients (column `Estimate`) and their significance levels (column `Pr(>|t|)`). The intercept (`b0`) is 8.38 and the coefficient of the youtube variable is 0.046.

The estimated regression equation can be written as follows: `sales = 8.38 + 0.046*youtube`. Using this formula, for each new youtube advertising budget, you can predict the number of sale units.

For example:

- For a youtube advertising budget equal to zero, we can expect sales of 8.38 units.
- For a youtube advertising budget equal to 1000, we can expect sales of 8.38 + 0.046*1000 = 55 units.

Predictions can be easily made using the R function `predict()`. In the following example, we predict sales units for two youtube advertising budgets: 0 and 1000.

```
newdata <- data.frame(youtube = c(0, 1000))
model %>% predict(newdata)
```

```
## 1 2
## 8.38 55.19
```

**Multiple linear regression** is an extension of simple linear regression for predicting an outcome variable (y) on the basis of multiple distinct predictor variables (x).

For example, with three predictor variables (x), the prediction of y is expressed by the following equation: `y = b0 + b1*x1 + b2*x2 + b3*x3`

The regression beta coefficients measure the association between each predictor variable and the outcome. `b_j` can be interpreted as the average effect on y of a one unit increase in `x_j`, holding all other predictors fixed.

In this section, we’ll build a multiple regression model to predict sales based on the budget invested in three advertising media: youtube, facebook and newspaper. The formula is as follows: `sales = b0 + b1*youtube + b2*facebook + b3*newspaper`

You can compute the multiple regression model coefficients in R as follows:

```
model <- lm(sales ~ youtube + facebook + newspaper,
            data = train.data)
summary(model)$coef
```

Note that, if you have many predictor variables in your data, you can simply include all the available variables in the model using `~.`:

```
model <- lm(sales ~., data = train.data)
summary(model)$coef
```

```
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.39188 0.44062 7.698 1.41e-12
## youtube 0.04557 0.00159 28.630 2.03e-64
## facebook 0.18694 0.00989 18.905 2.07e-42
## newspaper 0.00179 0.00677 0.264 7.92e-01
```

From the output above, the coefficients table shows the beta coefficient estimates and their significance levels. Columns are:

- `Estimate`: the intercept (b0) and the beta coefficient estimates associated with each predictor variable
- `Std. Error`: the standard error of the coefficient estimates. This represents the accuracy of the coefficients. The larger the standard error, the less confident we are about the estimate.
- `t value`: the t-statistic, which is the coefficient estimate (column 2) divided by the standard error of the estimate (column 3)
- `Pr(>|t|)`: the p-value corresponding to the t-statistic. The smaller the p-value, the more significant the estimate.

As previously described, you can easily make predictions using the R function `predict()`:

```
# New advertising budgets
newdata <- data.frame(
  youtube = 2000, facebook = 1000,
  newspaper = 1000
)
# Predict sales values
model %>% predict(newdata)
```

```
## 1
## 283
```

Before using a model for predictions, you need to assess the statistical significance of the model. This can be easily checked by displaying the statistical summary of the model.

Display the statistical summary of the model as follows:

`summary(model)`

```
##
## Call:
## lm(formula = sales ~ ., data = train.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.412 -1.110 0.348 1.422 3.499
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.39188 0.44062 7.70 1.4e-12 ***
## youtube 0.04557 0.00159 28.63 < 2e-16 ***
## facebook 0.18694 0.00989 18.90 < 2e-16 ***
## newspaper 0.00179 0.00677 0.26 0.79
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.12 on 158 degrees of freedom
## Multiple R-squared: 0.89, Adjusted R-squared: 0.888
## F-statistic: 427 on 3 and 158 DF, p-value: <2e-16
```

The summary output shows 6 components, including:

- **Call**. Shows the function call used to compute the regression model.
- **Residuals**. Provides a quick view of the distribution of the residuals, which by definition have a mean of zero. Therefore, the median should not be far from zero, and the minimum and maximum should be roughly equal in absolute value.
- **Coefficients**. Shows the regression beta coefficients and their statistical significance. Predictor variables that are significantly associated with the outcome variable are marked by stars.
- **Residual standard error** (RSE), **R-squared** (R2) and the **F-statistic**: metrics that are used to check how well the model fits our data.

The first step in interpreting the multiple regression analysis is to examine the F-statistic and the associated p-value, at the bottom of model summary.

In our example, it can be seen that the p-value of the F-statistic is < 2.2e-16, which is highly significant. This means that at least one of the predictor variables is significantly related to the outcome variable.

To see which predictor variables are significant, you can examine the coefficients table, which shows the estimate of regression beta coefficients and the associated t-statistic p-values.

`summary(model)$coef`

```
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.39188 0.44062 7.698 1.41e-12
## youtube 0.04557 0.00159 28.630 2.03e-64
## facebook 0.18694 0.00989 18.905 2.07e-42
## newspaper 0.00179 0.00677 0.264 7.92e-01
```

For a given predictor, the t-statistic evaluates whether or not there is a significant association between the predictor and the outcome variable, that is, whether the beta coefficient of the predictor is significantly different from zero.

It can be seen that changes in the youtube and facebook advertising budgets are significantly associated with changes in sales, while changes in the newspaper budget are not significantly associated with sales.

For a given predictor variable, the coefficient (b) can be interpreted as the average effect on y of a one unit increase in predictor, holding all other predictors fixed.

For example, for a fixed amount of youtube and newspaper advertising budget, spending an additional 1 000 dollars on facebook advertising leads to an increase in sales by approximately 0.187*1000 = 187 sale units, on average.

The youtube coefficient suggests that for every 1 000 dollars increase in youtube advertising budget, holding all other predictors constant, we can expect an increase of 0.045*1000 = 45 sales units, on average.
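These interpretations can be read directly off the fitted coefficients; a quick sketch (assuming `model` is the three-predictor fit from above):

```r
# Average effect on sales of an extra 1000 dollars on each channel,
# holding the other predictors fixed
coef(model)[c("youtube", "facebook")] * 1000
```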

We found that newspaper is not significant in the multiple regression model. This means that, for a fixed amount of youtube and facebook advertising budget, changes in the newspaper advertising budget will not significantly affect sales units.

As the newspaper variable is not significant, it is possible to remove it from the model:

```
model <- lm(sales ~ youtube + facebook, data = train.data)
summary(model)
```

```
##
## Call:
## lm(formula = sales ~ youtube + facebook, data = train.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.481 -1.104 0.349 1.423 3.486
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.43446 0.40877 8.4 2.3e-14 ***
## youtube 0.04558 0.00159 28.7 < 2e-16 ***
## facebook 0.18788 0.00920 20.4 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.11 on 159 degrees of freedom
## Multiple R-squared: 0.89, Adjusted R-squared: 0.889
## F-statistic: 644 on 2 and 159 DF, p-value: <2e-16
```

Finally, our model equation can be written as follows: `sales = 3.43 + 0.045*youtube + 0.187*facebook`.

Once you have identified that at least one predictor variable is significantly associated with the outcome, you should continue the diagnostic by checking how well the model fits the data. This process is also referred to as the *goodness-of-fit*.

The overall quality of the linear regression fit can be assessed using the following three quantities, displayed in the model summary:

- Residual Standard Error (RSE),
- R-squared (R2) and adjusted R2,
- F-statistic, which has already been described in the previous section

```
## rse r.squared f.statistic p.value
## 1 2.11 0.89 644 5.64e-77
```
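The table above can be reproduced from the fitted model object; one way to do so (assuming `model` is the two-predictor fit from above) is:

```r
# Extract RSE, R2 and the F-statistic (with its p-value) from a fitted lm()
s <- summary(model)
f <- s$fstatistic   # named vector: value, numdf, dendf
data.frame(
  rse         = s$sigma,
  r.squared   = s$r.squared,
  f.statistic = f["value"],
  p.value     = pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
)
```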

**Residual standard error** (RSE):

The RSE (or model *sigma*), corresponding to the prediction error, represents roughly the average difference between the observed outcome values and the values predicted by the model. The lower the RSE, the better the model fits our data.

Dividing the RSE by the average value of the outcome variable will give you the prediction error rate, which should be as small as possible.

In our example, using only the youtube and facebook predictor variables, the RSE = 2.11, meaning that the observed sales values deviate from the predicted values by approximately 2.11 units on average.

This corresponds to an error rate of 2.11/mean(train.data$sales) = 2.11/16.77 = 13%, which is low.
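In R, this error rate can be computed directly (assuming `model` and `train.data` from above):

```r
# Prediction error rate: RSE divided by the mean outcome value
sigma(model) / mean(train.data$sales)
```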

**R-squared and Adjusted R-squared**:

The R-squared (R2) ranges from 0 to 1 and represents the proportion of variation in the outcome variable that can be explained by the model predictor variables.

For a simple linear regression, R2 is the square of the Pearson correlation coefficient between the outcome and the predictor variable. In multiple linear regression, the R2 represents the squared correlation between the observed outcome values and the predicted values.

The R2 measures how well the model fits the data. The higher the R2, the better the model. However, a problem with the R2 is that it will always increase when more variables are added to the model, even if those variables are only weakly associated with the outcome (James et al. 2014). A solution is to adjust the R2 by taking into account the number of predictor variables.

The adjustment in the “Adjusted R Square” value in the summary output is a correction for the number of x variables included in the predictive model.

So, you should mainly consider the adjusted R-squared, which is a penalized R2 for a higher number of predictors.

- An (adjusted) R2 that is close to 1 indicates that a large proportion of the variability in the outcome has been explained by the regression model.
- A number near 0 indicates that the regression model did not explain much of the variability in the outcome.

In our example, the adjusted R2 is 0.88, which is good.
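The penalty can be written explicitly as adjusted R2 = 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. A minimal check against the summary output above (assuming n = 162 and p = 2 for the two-predictor model):

```r
# Adjusted R2 penalizes R2 for the number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)
adj_r2(0.89, n = 162, p = 2)   # close to the 0.889 reported by summary()
```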

**F-Statistic**:

Recall that the F-statistic gives the overall significance of the model. It assesses whether at least one predictor variable has a non-zero coefficient.

In a simple linear regression, this test is not really interesting since it just duplicates the information given by the t-test, available in the coefficient table.

The F-statistic becomes more important once we start using multiple predictors as in multiple linear regression.

A large F-statistic corresponds to a statistically significant p-value (p < 0.05). In our example, the F-statistic equals 644, producing a p-value of 5.64e-77, which is highly significant.

We’ll make predictions using the test data in order to evaluate the performance of our regression model.

The procedure is as follows:

- Predict the sales values based on new advertising budgets in the test data
- Assess the model performance by computing:
  - The prediction error RMSE (Root Mean Squared Error), representing the average difference between the observed known outcome values in the test data and the outcome values predicted by the model. The lower the RMSE, the better the model.
  - The R-square (R2), representing the correlation between the observed outcome values and the predicted outcome values. The higher the R2, the better the model.

```
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
# (a) Compute the prediction error, RMSE
RMSE(predictions, test.data$sales)
```

`## [1] 1.58`

```
# (b) Compute R-square
R2(predictions, test.data$sales)
```

`## [1] 0.938`

From the output above, the R2 is 0.93, meaning that the observed and the predicted outcome values are highly correlated, which is very good.

The prediction error RMSE is 1.58, representing an error rate of 1.58/mean(test.data$sales) = 1.58/17 = 9.2%, which is good.

This chapter describes the basics of linear regression and provides practical examples in R for computing simple and multiple linear regression models. We also described how to assess the performance of the model for predictions.

Note that, linear regression assumes a linear relationship between the outcome and the predictor variables. This can be easily checked by creating a scatter plot of the outcome variable vs the predictor variable.

For example, the following R code displays sales units versus youtube advertising budget. We’ll also add a smoothed line:

```
ggplot(marketing, aes(x = youtube, y = sales)) +
geom_point() +
stat_smooth()
```

The graph above shows a linearly increasing relationship between the `sales` and the `youtube` variables, which is a good thing.

In addition to the linearity assumptions, the linear regression method makes many other assumptions about your data (see Chapter @ref(regression-assumptions-and-diagnostics)). You should make sure that these assumptions hold true for your data.

Potential problems include: (a) the presence of influential observations in the data (Chapter @ref(regression-assumptions-and-diagnostics)), (b) non-linearity between the outcome and some predictor variables (Chapter @ref(polynomial-and-spline-regression)) and (c) the presence of strong correlation between predictor variables (Chapter @ref(multicollinearity)).

Bruce, Peter, and Andrew Bruce. 2017. *Practical Statistics for Data Scientists*. O’Reilly Media.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014. *An Introduction to Statistical Learning: With Applications in R*. Springer Publishing Company, Incorporated.

This chapter describes how to compute multiple linear regression with **interaction effects**.

Previously, we have described how to build a multiple linear regression model (Chapter @ref(linear-regression)) for predicting a continuous outcome variable (y) based on multiple predictor variables (x).

For example, to predict sales based on advertising budgets spent on youtube and facebook, the model equation is `sales = b0 + b1*youtube + b2*facebook`, where b0 is the intercept; b1 and b2 are the regression coefficients associated with the predictor variables youtube and facebook, respectively.

The above equation, also known as the *additive model*, investigates only the main effects of the predictors. It assumes that the relationship between a given predictor variable and the outcome is independent of the other predictor variables (James et al. 2014; Bruce and Bruce 2017).

Considering our example, the additive model assumes that the effect on sales of youtube advertising is independent of the effect of facebook advertising.

This assumption might not be true. For example, spending money on facebook advertising may increase the effectiveness of youtube advertising on sales. In marketing, this is known as a synergy effect, and in statistics it is referred to as an interaction effect (James et al. 2014).

In this chapter, you’ll learn:

- the equation of multiple linear regression with interaction
- R codes for computing the regression coefficients associated with the main effects and the interaction effects
- how to interpret the interaction effect


The multiple linear regression equation, with interaction effects between two predictors (x1 and x2), can be written as follows:

`y = b0 + b1*x1 + b2*x2 + b3*(x1*x2)`

Considering our example, it becomes:

`sales = b0 + b1*youtube + b2*facebook + b3*(youtube*facebook)`

This can be also written as:

`sales = b0 + (b1 + b3*facebook)*youtube + b2*facebook`

or as:

`sales = b0 + b1*youtube + (b2 + b3*youtube)*facebook`

`b3` can be interpreted as the increase in the effectiveness of youtube advertising for a one unit increase in facebook advertising (or vice versa).

In the following sections, you will learn how to compute the regression coefficients in R.

- `tidyverse` for easy data manipulation and visualization
- `caret` for easy machine learning workflow

```
library(tidyverse)
library(caret)
```

We’ll use the `marketing` data set, introduced in Chapter @ref(regression-analysis), for predicting sales units on the basis of the amount of money spent on three advertising media (youtube, facebook and newspaper).

We’ll randomly split the data into a training set (80%, for building a predictive model) and a test set (20%, for evaluating the model).

```
# Load the data
data("marketing", package = "datarium")
# Inspect the data
sample_n(marketing, 3)
```

```
## youtube facebook newspaper sales
## 58 163.4 23.0 19.9 15.8
## 157 112.7 52.2 60.6 18.4
## 81 91.7 32.0 26.8 14.2
```

```
# Split the data into training and test set
set.seed(123)
training.samples <- marketing$sales %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- marketing[training.samples, ]
test.data <- marketing[-training.samples, ]
```

The standard linear regression model can be computed as follows:

```
# Build the model
model1 <- lm(sales ~ youtube + facebook, data = train.data)
# Summarize the model
summary(model1)
```

```
##
## Call:
## lm(formula = sales ~ youtube + facebook, data = train.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.481 -1.104 0.349 1.423 3.486
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.43446 0.40877 8.4 2.3e-14 ***
## youtube 0.04558 0.00159 28.7 < 2e-16 ***
## facebook 0.18788 0.00920 20.4 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.11 on 159 degrees of freedom
## Multiple R-squared: 0.89, Adjusted R-squared: 0.889
## F-statistic: 644 on 2 and 159 DF, p-value: <2e-16
```

```
# Make predictions
predictions <- model1 %>% predict(test.data)
# Model performance
# (a) Prediction error, RMSE
RMSE(predictions, test.data$sales)
```

`## [1] 1.58`

```
# (b) R-square
R2(predictions, test.data$sales)
```

`## [1] 0.938`

In R, you include interactions between variables using the `*` operator:

```
# Build the model
# Use this:
model2 <- lm(sales ~ youtube + facebook + youtube:facebook,
             data = train.data)
# Or simply, use this:
model2 <- lm(sales ~ youtube*facebook, data = train.data)
# Summarize the model
summary(model2)
```

```
##
## Call:
## lm(formula = sales ~ youtube * facebook, data = train.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.438 -0.482 0.231 0.748 1.860
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.90e+00 3.28e-01 24.06 <2e-16 ***
## youtube 1.95e-02 1.64e-03 11.90 <2e-16 ***
## facebook 2.96e-02 9.83e-03 3.01 0.003 **
## youtube:facebook 9.12e-04 4.84e-05 18.86 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.18 on 158 degrees of freedom
## Multiple R-squared: 0.966, Adjusted R-squared: 0.966
## F-statistic: 1.51e+03 on 3 and 158 DF, p-value: <2e-16
```

```
# Make predictions
predictions <- model2 %>% predict(test.data)
# Model performance
# (a) Prediction error, RMSE
RMSE(predictions, test.data$sales)
```

`## [1] 0.963`

```
# (b) R-square
R2(predictions, test.data$sales)
```

`## [1] 0.982`

It can be seen that all the coefficients, including the interaction term coefficient, are statistically significant, suggesting that there is an interaction relationship between the two predictor variables (youtube and facebook advertising).

Our model equation looks like this:

`sales = 7.89 + 0.019*youtube + 0.029*facebook + 0.0009*youtube*facebook`

We can interpret this as follows: an increase in youtube advertising of 1000 dollars is associated with an increase in sales of `(b1 + b3*facebook)*1000 = 19 + 0.9*facebook` units, and an increase in facebook advertising of 1000 dollars is associated with an increase in sales of `(b2 + b3*youtube)*1000 = 29 + 0.9*youtube` units.
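These marginal effects can be computed from the fitted coefficients (assuming `model2` from above; the facebook budget of 50 is just an illustrative value):

```r
# Effect on sales of +1000 dollars of youtube advertising,
# evaluated at a facebook budget of 50
b <- coef(model2)
(b["youtube"] + b["youtube:facebook"] * 50) * 1000
```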

Note that, sometimes, it is the case that the interaction term is significant but not the main effects. The hierarchical principle states that, if we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant (James et al. 2014).

The prediction error RMSE of the interaction model is 0.963, which is lower than the prediction error of the additive model (1.58).

Additionally, the R-square (R2) value of the interaction model is 98% compared to only 93% for the additive model.

These results suggest that the model with the interaction term is better than the model that contains only main effects. So, for this specific data, we should go for the model with the interaction term.

This chapter describes how to compute multiple linear regression with interaction effects. Interaction terms should be included in the model if they are significant.

Bruce, Peter, and Andrew Bruce. 2017. *Practical Statistics for Data Scientists*. O’Reilly Media.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014. *An Introduction to Statistical Learning: With Applications in R*. Springer Publishing Company, Incorporated.

This chapter describes how to compute **regression with categorical variables**.

**Categorical variables** (also known as *factor* or *qualitative variables*) are variables that classify observations into groups. They have a limited number of different values, called levels. For example, the gender of individuals is a categorical variable that can take two levels: Male or Female.

Regression analysis requires numerical variables. So, when a researcher wishes to include a categorical variable in a regression model, supplementary steps are required to make the results interpretable.

In these steps, the categorical variables are recoded into a set of separate binary variables. This recoding is called “dummy coding” and leads to the creation of a table called a *contrast matrix*. This is done automatically by statistical software, such as R.

Here, you’ll learn how to build and interpret a linear regression model with categorical predictor variables. We’ll also provide practical examples in R.


- `tidyverse` for easy data manipulation and visualization

`library(tidyverse)`

We’ll use the `Salaries` data set [`car` package], which contains the 2008-09 nine-month academic salaries of Assistant Professors, Associate Professors and Professors in a college in the U.S.

The data were collected as part of the ongoing effort of the college’s administration to monitor salary differences between male and female faculty members.

```
# Load the data
data("Salaries", package = "car")
# Inspect the data
sample_n(Salaries, 3)
```

```
## rank discipline yrs.since.phd yrs.service sex salary
## 115 Prof A 12 0 Female 105000
## 313 Prof A 29 19 Male 94350
## 162 Prof B 26 19 Male 176500
```

Recall that the regression equation, for predicting an outcome variable (y) on the basis of a predictor variable (x), can be simply written as `y = b0 + b1*x`. `b0` and `b1` are the regression beta coefficients, representing the intercept and the slope, respectively.

Suppose that we wish to investigate differences in salaries between males and females.

Based on the gender variable, we can create a new dummy variable that takes the value:

- `1` if a person is male
- `0` if a person is female

and use this variable as a predictor in the regression equation, leading to the following model:

- `b0 + b1` if person is male
- `b0` if person is female

The coefficients can be interpreted as follows:

- `b0` is the average salary among females,
- `b0 + b1` is the average salary among males,
- `b1` is the average difference in salary between males and females.

For simple demonstration purposes, the following example models the salary difference between males and females by computing a simple linear regression model on the `Salaries` data set [`car` package]. R creates dummy variables automatically:

```
# Compute the model
model <- lm(salary ~ sex, data = Salaries)
summary(model)$coef
```

```
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 101002 4809 21.00 2.68e-66
## sexMale 14088 5065 2.78 5.67e-03
```

From the output above, the average salary for females is estimated to be 101002, whereas males are estimated a total of 101002 + 14088 = 115090. The p-value for the dummy variable `sexMale` is very significant, suggesting that there is statistical evidence of a difference in average salary between the genders.

The `contrasts()` function returns the coding that R has used to create the dummy variables:

`contrasts(Salaries$sex)`

```
## Male
## Female 0
## Male 1
```

R has created a sexMale dummy variable that takes on a value of 1 if the sex is Male, and 0 otherwise. The decision to code males as 1 and females as 0 (baseline) is arbitrary, and has no effect on the regression computation, but does alter the interpretation of the coefficients.

You can use the function `relevel()` to set the baseline category to males as follows:

```
Salaries <- Salaries %>%
  mutate(sex = relevel(sex, ref = "Male"))
```

The output of the regression fit becomes:

```
model <- lm(salary ~ sex, data = Salaries)
summary(model)$coef
```

```
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 115090 1587 72.50 2.46e-230
## sexFemale -14088 5065 -2.78 5.67e-03
```

The fact that the coefficient for `sexFemale` in the regression output is negative indicates that being Female is associated with a decrease in salary (relative to Males).

Now the estimates for `b0` and `b1` are 115090 and -14088, respectively, leading once again to a prediction of an average salary of 115090 for males and a prediction of 115090 - 14088 = 101002 for females.

Alternatively, instead of a 0/1 coding scheme, we could create a dummy variable coded as -1 (male) / 1 (female). This results in the model:

- `b0 - b1` if the person is male
- `b0 + b1` if the person is female

So, if the categorical variable is coded as -1 and 1, then a positive regression coefficient is subtracted from the group coded as -1 and added to the group coded as 1. If the regression coefficient is negative, the addition and subtraction are reversed.
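A sketch of this -1/1 (effect) coding on the same kind of hypothetical toy data, using base R's `contr.sum()`. With alphabetical levels, Female is coded 1 and Male -1, so the intercept becomes the mean of the two group means and the coefficient half their difference:

```r
df <- data.frame(
  sex = factor(c("Female", "Female", "Male", "Male")),  # hypothetical toy data
  pay = c(90, 110, 120, 140)
)
contrasts(df$sex) <- contr.sum(2)  # Female -> 1, Male -> -1
fit <- lm(pay ~ sex, data = df)
# Intercept = (100 + 130) / 2 = 115; sex1 = (100 - 130) / 2 = -15
coef(fit)
```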

Generally, a categorical variable with n levels will be transformed into n-1 variables, each with two levels. These n-1 new variables contain the same information as the single original variable. This recoding creates a table called a **contrast matrix**.

For example, `rank` in the `Salaries` data has three levels: “AsstProf”, “AssocProf” and “Prof”. This variable could be dummy coded into two variables, one called AssocProf and one called Prof:

- If rank = AssocProf, then the column AssocProf would be coded with a 1 and Prof with a 0.
- If rank = Prof, then the column AssocProf would be coded with a 0 and Prof would be coded with a 1.
- If rank = AsstProf, then both columns “AssocProf” and “Prof” would be coded with a 0.

This dummy coding is automatically performed by R. For demonstration purposes, you can use the function `model.matrix()` to create a contrast matrix for a factor variable:

```
res <- model.matrix(~rank, data = Salaries)
head(res[, -1])
```

```
## rankAssocProf rankProf
## 1 0 1
## 2 0 1
## 3 0 0
## 4 0 1
## 5 0 1
## 6 1 0
```

When building a linear model, there are different ways to encode categorical variables, known as contrast coding systems. The default option in R is to use the first level of the factor as a reference and interpret the remaining levels relative to this level.

Note that ANOVA (analysis of variance) is just a special case of the linear model where the predictors are categorical variables. And, because R understands that ANOVA and regression are both examples of linear models, it lets you extract the classic ANOVA table from your regression model using the base R `anova()` function or the `Anova()` function [in the `car` package]. We generally recommend the `Anova()` function because it automatically takes care of unbalanced designs.

The results of predicting salary using a multiple regression procedure are presented below.

```
library(car)
model2 <- lm(salary ~ yrs.service + rank + discipline + sex,
data = Salaries)
Anova(model2)
```

```
## Anova Table (Type II tests)
##
## Response: salary
## Sum Sq Df F value Pr(>F)
## yrs.service 3.24e+08 1 0.63 0.43
## rank 1.03e+11 2 100.26 < 2e-16 ***
## discipline 1.74e+10 1 33.86 1.2e-08 ***
## sex 7.77e+08 1 1.51 0.22
## Residuals 2.01e+11 391
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Taking the other variables (yrs.service, rank and discipline) into account, it can be seen that the categorical variable sex is no longer significantly associated with the variation in salary between individuals. The significant variables are rank and discipline.

If you want to interpret the contrasts of the categorical variable, type this:

`summary(model2)`

```
##
## Call:
## lm(formula = salary ~ yrs.service + rank + discipline + sex,
## data = Salaries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -64202 -14255 -1533 10571 99163
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 73122.9 3245.3 22.53 < 2e-16 ***
## yrs.service -88.8 111.6 -0.80 0.42696
## rankAssocProf 14560.4 4098.3 3.55 0.00043 ***
## rankProf 49159.6 3834.5 12.82 < 2e-16 ***
## disciplineB 13473.4 2315.5 5.82 1.2e-08 ***
## sexFemale -4771.2 3878.0 -1.23 0.21931
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22700 on 391 degrees of freedom
## Multiple R-squared: 0.448, Adjusted R-squared: 0.441
## F-statistic: 63.4 on 5 and 391 DF, p-value: <2e-16
```

For example, it can be seen that being from discipline B (applied departments) is significantly associated with an average increase of 13473.38 in salary compared to discipline A (theoretical departments).

In this chapter we described how categorical variables are included in a linear regression model. As regression requires numerical inputs, categorical variables need to be recoded into a set of binary variables.

We provide practical examples for the situations where you have categorical variables containing two or more levels.

Note that, for categorical variables with a large number of levels it might be useful to group together some of the levels.

Some categorical variables have levels that are ordered. They can be converted to numerical values and used as is. For example, if the professor ranks (“AsstProf”, “AssocProf” and “Prof”) have a natural ordering, you can convert them into numerical values, ordered from low to high, with higher values corresponding to higher-grade professors.
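A quick sketch of such a conversion, using a hypothetical vector of ranks and base R's `factor()` with explicitly ordered levels:

```r
# Hypothetical ranks; levels are listed from lowest to highest grade
grades <- c("AsstProf", "Prof", "AssocProf", "Prof")
scores <- as.numeric(factor(grades, levels = c("AsstProf", "AssocProf", "Prof")))
scores  # 1 3 2 3
```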

In some cases, the true relationship between the outcome and a predictor variable might not be linear.

There are different solutions extending the linear regression model (Chapter @ref(linear-regression)) for capturing these nonlinear effects, including:

- **Polynomial regression**. This is the simplest approach to model non-linear relationships. It adds polynomial terms (squares, cubes, etc.) to a regression.
- **Spline regression**. Fits a smooth curve with a series of polynomial segments. The values delimiting the spline segments are called **knots**.
- **Generalized additive models** (GAM). Fits spline models with automated selection of knots.

In this chapter, you’ll learn how to compute non-linear regression models and how to compare the different models in order to choose the one that best fits your data.

The RMSE and the R2 metrics will be used to compare the different models (see Chapter @ref(linear-regression)).

Recall that the RMSE represents the model prediction error, that is, the average difference between the observed outcome values and the predicted outcome values. The R2 represents the squared correlation between the observed and predicted outcome values. The best model is the model with the lowest RMSE and the highest R2.
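Both metrics can be computed directly in base R; a minimal sketch with made-up observed and predicted vectors:

```r
observed  <- c(3, 5, 7, 9)          # made-up values, for illustration only
predicted <- c(2.8, 5.1, 7.3, 8.6)
rmse <- sqrt(mean((observed - predicted)^2))  # prediction error
r2   <- cor(observed, predicted)^2            # squared correlation
c(RMSE = rmse, R2 = r2)
```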

Contents:

Load the required R packages:

- `tidyverse` for easy data manipulation and visualization
- `caret` for easy machine learning workflow

```
library(tidyverse)
library(caret)
theme_set(theme_classic())
```

We’ll use the `Boston` data set [in the `MASS` package], introduced in Chapter @ref(regression-analysis), for predicting the median house value (`medv`) in Boston suburbs, based on the predictor variable `lstat` (percentage of lower status of the population).

We’ll randomly split the data into a training set (80%, for building a predictive model) and a test set (20%, for evaluating the model). Make sure to set the seed for reproducibility.

```
# Load the data
data("Boston", package = "MASS")
# Split the data into training and test set
set.seed(123)
training.samples <- Boston$medv %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]
```

First, visualize the scatter plot of the `medv` vs `lstat` variables as follows:

```
ggplot(train.data, aes(lstat, medv) ) +
geom_point() +
stat_smooth()
```

The above scatter plot suggests a non-linear relationship between the two variables.

In the following sections, we start by computing linear and non-linear regression models. Next, we’ll compare the different models in order to choose the best one for our data.

The standard linear regression model equation can be written as `medv = b0 + b1*lstat`.

Compute linear regression model:

```
# Build the model
model <- lm(medv ~ lstat, data = train.data)
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
data.frame(
RMSE = RMSE(predictions, test.data$medv),
R2 = R2(predictions, test.data$medv)
)
```

```
## RMSE R2
## 1 6.07 0.535
```

Visualize the data:

```
ggplot(train.data, aes(lstat, medv) ) +
geom_point() +
stat_smooth(method = lm, formula = y ~ x)
```

The polynomial regression adds polynomial or quadratic terms to the regression equation as follows:

\[medv = b0 + b1*lstat + b2*lstat^2\]

In R, to create a predictor x^2 you should use the function `I()`, as follows: `I(x^2)`. This raises x to the power 2.

The polynomial regression can be computed in R as follows:

`lm(medv ~ lstat + I(lstat^2), data = train.data)`

An alternative simple solution is to use this:

`lm(medv ~ poly(lstat, 2, raw = TRUE), data = train.data)`

```
##
## Call:
## lm(formula = medv ~ poly(lstat, 2, raw = TRUE), data = train.data)
##
## Coefficients:
## (Intercept) poly(lstat, 2, raw = TRUE)1
## 43.351 -2.340
## poly(lstat, 2, raw = TRUE)2
## 0.043
```

The output contains two coefficients associated with lstat : one for the linear term (lstat^1) and one for the quadratic term (lstat^2).

The following example computes a sixth-order polynomial fit:

```
lm(medv ~ poly(lstat, 6, raw = TRUE), data = train.data) %>%
summary()
```

```
##
## Call:
## lm(formula = medv ~ poly(lstat, 6, raw = TRUE), data = train.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.23 -3.24 -0.74 2.02 26.50
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.14e+01 6.00e+00 11.90 < 2e-16 ***
## poly(lstat, 6, raw = TRUE)1 -1.45e+01 3.22e+00 -4.48 9.6e-06 ***
## poly(lstat, 6, raw = TRUE)2 1.87e+00 6.26e-01 2.98 0.003 **
## poly(lstat, 6, raw = TRUE)3 -1.32e-01 5.73e-02 -2.30 0.022 *
## poly(lstat, 6, raw = TRUE)4 4.98e-03 2.66e-03 1.87 0.062 .
## poly(lstat, 6, raw = TRUE)5 -9.56e-05 6.03e-05 -1.58 0.114
## poly(lstat, 6, raw = TRUE)6 7.29e-07 5.30e-07 1.38 0.170
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.28 on 400 degrees of freedom
## Multiple R-squared: 0.684, Adjusted R-squared: 0.679
## F-statistic: 144 on 6 and 400 DF, p-value: <2e-16
```

From the output above, it can be seen that polynomial terms beyond the fifth order are not significant. So, just create a fifth-order polynomial regression model as follows:

```
# Build the model
model <- lm(medv ~ poly(lstat, 5, raw = TRUE), data = train.data)
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
data.frame(
RMSE = RMSE(predictions, test.data$medv),
R2 = R2(predictions, test.data$medv)
)
```

```
## RMSE R2
## 1 4.96 0.689
```

Visualize the fifth-order polynomial regression line as follows:

```
ggplot(train.data, aes(lstat, medv) ) +
geom_point() +
stat_smooth(method = lm, formula = y ~ poly(x, 5, raw = TRUE))
```

When you have a non-linear relationship, you can also try a logarithm transformation of the predictor variables:

```
# Build the model
model <- lm(medv ~ log(lstat), data = train.data)
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
data.frame(
RMSE = RMSE(predictions, test.data$medv),
R2 = R2(predictions, test.data$medv)
)
```

```
## RMSE R2
## 1 5.24 0.657
```

Visualize the data:

```
ggplot(train.data, aes(lstat, medv) ) +
geom_point() +
stat_smooth(method = lm, formula = y ~ log(x))
```

Polynomial regression only captures a certain amount of curvature in a nonlinear relationship. An alternative, and often superior, approach to modeling nonlinear relationships is to use splines (P. Bruce and Bruce 2017).

Splines provide a way to smoothly interpolate between fixed points, called knots. Polynomial regression is computed between knots. In other words, splines are series of polynomial segments strung together, joining at knots (P. Bruce and Bruce 2017).

The R package `splines` includes the function `bs()` for creating a b-spline term in a regression model.

You need to specify two parameters: the degree of the polynomial and the location of the knots. In our example, we’ll place the knots at the lower quartile, the median, and the upper quartile:

`knots <- quantile(train.data$lstat, p = c(0.25, 0.5, 0.75))`

We’ll create a model using a cubic spline (degree = 3):

```
library(splines)
# Build the model
knots <- quantile(train.data$lstat, p = c(0.25, 0.5, 0.75))
model <- lm (medv ~ bs(lstat, knots = knots), data = train.data)
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
data.frame(
RMSE = RMSE(predictions, test.data$medv),
R2 = R2(predictions, test.data$medv)
)
```

```
## RMSE R2
## 1 4.97 0.688
```

Note that, the coefficients for a spline term are not interpretable.

Visualize the cubic spline as follow:

```
ggplot(train.data, aes(lstat, medv) ) +
geom_point() +
stat_smooth(method = lm, formula = y ~ splines::bs(x, df = 3))
```

Once you have detected a non-linear relationship in your data, the polynomial terms may not be flexible enough to capture the relationship, and spline terms require specifying the knots.

Generalized additive models (GAM) are a technique to automatically fit a spline regression. This can be done using the `mgcv` R package:

```
library(mgcv)
# Build the model
model <- gam(medv ~ s(lstat), data = train.data)
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
data.frame(
RMSE = RMSE(predictions, test.data$medv),
R2 = R2(predictions, test.data$medv)
)
```

```
## RMSE R2
## 1 5.02 0.684
```

The term `s(lstat)` tells the `gam()` function to find the “best” knots for a spline term.

Visualize the data:

```
ggplot(train.data, aes(lstat, medv) ) +
geom_point() +
stat_smooth(method = gam, formula = y ~ s(x))
```

From analyzing the RMSE and the R2 metrics of the different models, it can be seen that the polynomial regression, the spline regression and the generalized additive models outperform the linear regression model and the log transformation approaches.

This chapter describes how to compute non-linear regression models using R.

Bruce, Peter, and Andrew Bruce. 2017. *Practical Statistics for Data Scientists*. O’Reilly Media.

The **simple linear regression** is used to predict a quantitative outcome `y` on the basis of one single predictor variable `x`. The goal is to build a mathematical model (or formula) that defines y as a function of the x variable.

Once we have built a statistically significant model, it’s possible to use it for predicting future outcomes on the basis of new x values.

Consider that we want to evaluate the impact of advertising budgets of three media (youtube, facebook and newspaper) on future sales. This type of problem can be modeled with linear regression.

Contents:

The mathematical formula of the linear regression can be written as `y = b0 + b1*x + e`, where:

- `b0` and `b1` are known as the regression *beta coefficients* or *parameters*:
  - `b0` is the *intercept* of the regression line, that is, the predicted value when `x = 0`.
  - `b1` is the *slope* of the regression line.
- `e` is the *error term* (also known as the *residual errors*), the part of y that cannot be explained by the regression model.

The figure below illustrates the linear regression model, where:

- the best-fit regression line is in blue
- the intercept (b0) and the slope (b1) are shown in green
- the error terms (e) are represented by vertical red lines

From the scatter plot above, it can be seen that not all the data points fall exactly on the fitted regression line. Some of the points are above the blue curve and some are below it; overall, the residual errors (e) have approximately mean zero.

The sum of the squares of the residual errors are called the **Residual Sum of Squares** or **RSS**.

The average variation of points around the fitted regression line is called the **Residual Standard Error** (**RSE**). This is one of the metrics used to evaluate the overall quality of the fitted regression model. The lower the RSE, the better.

Since the mean error term is zero, the outcome variable y can be approximately estimated as follow:

`y ~ b0 + b1*x`

Mathematically, the beta coefficients (b0 and b1) are determined so that the RSS is as minimal as possible. This method of determining the beta coefficients is technically called **least squares** regression or **ordinary least squares** (OLS) regression.
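For the simple (one-predictor) case, the least-squares solution has a closed form: `b1 = cov(x, y)/var(x)` and `b0 = mean(y) - b1*mean(x)`. A sketch on made-up data, checked against `lm()`:

```r
x <- c(1, 2, 3, 4, 5)             # made-up predictor
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)  # made-up outcome
b1 <- cov(x, y) / var(x)          # slope minimizing the RSS
b0 <- mean(y) - b1 * mean(x)      # intercept through the mean point
fit <- lm(y ~ x)
# b0 and b1 match coef(fit) up to floating-point error
c(b0 = b0, b1 = b1)
```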

Once the beta coefficients are calculated, a t-test is performed to check whether or not these coefficients are significantly different from zero. A non-zero beta coefficient means that there is a significant relationship between the predictor (x) and the outcome variable (y).

Load required packages:

- `tidyverse` for data manipulation and visualization
- `ggpubr`: easily creates a publication-ready plot

```
library(tidyverse)
library(ggpubr)
theme_set(theme_pubr())
```

We’ll use the `marketing` data set [datarium package]. It contains the impact of three advertising media (youtube, facebook and newspaper) on sales. The data are the advertising budgets in thousands of dollars along with the sales. The advertising experiment has been repeated 200 times with different budgets and the observed sales have been recorded.

First, install the `datarium` package using `devtools::install_github("kassambara/datarium")`, then load and inspect the `marketing` data as follows:

```
# Load the package
data("marketing", package = "datarium")
head(marketing, 4)
```

```
## youtube facebook newspaper sales
## 1 276.1 45.4 83.0 26.5
## 2 53.4 47.2 54.1 12.5
## 3 20.6 55.1 83.2 11.2
## 4 181.8 49.6 70.2 22.2
```

We want to predict future sales on the basis of advertising budget spent on youtube.

- Create a scatter plot displaying the sales units versus youtube advertising budget.
- Add a smoothed line

```
ggplot(marketing, aes(x = youtube, y = sales)) +
geom_point() +
stat_smooth()
```

The graph above suggests a linearly increasing relationship between the `sales` and `youtube` variables. This is a good thing, because one important assumption of the linear regression is that the relationship between the outcome and predictor variables is linear and additive.

It’s also possible to compute the correlation coefficient between the two variables using the R function `cor()`:

`cor(marketing$sales, marketing$youtube)`

`## [1] 0.782`

The correlation coefficient measures the level of the association between two variables x and y. Its value ranges between -1 (perfect negative correlation: when x increases, y decreases) and +1 (perfect positive correlation: when x increases, y increases).

A value closer to 0 suggests a weak relationship between the variables. A low correlation (-0.2 < x < 0.2) probably suggests that much of the variation in the outcome variable (y) is not explained by the predictor (x). In such a case, we should probably look for better predictor variables.

In our example, the correlation coefficient is large enough, so we can continue by building a linear model of y as a function of x.

The simple linear regression tries to find the best line to predict sales on the basis of youtube advertising budget.

The linear model equation can be written as follows: `sales = b0 + b1 * youtube`

The R function `lm()` can be used to determine the beta coefficients of the linear model:

```
model <- lm(sales ~ youtube, data = marketing)
model
```

```
##
## Call:
## lm(formula = sales ~ youtube, data = marketing)
##
## Coefficients:
## (Intercept) youtube
## 8.4391 0.0475
```

The results show the intercept and the beta coefficient for the youtube variable.

From the output above:

- the estimated regression line equation can be written as follows: `sales = 8.44 + 0.048*youtube`
- the intercept (`b0`) is 8.44. It can be interpreted as the predicted sales units for a zero youtube advertising budget. Recall that we are operating in units of thousands of dollars. This means that, for a youtube advertising budget of zero, we can expect sales of 8.44 * 1000 = 8440 dollars.
- the regression beta coefficient for the variable youtube (`b1`), also known as the slope, is 0.048. This means that, for a youtube advertising budget equal to 1000 dollars, we can expect an increase of 48 units (0.048 * 1000) in sales. That is, `sales = 8.44 + 0.048*1000 = 56.44 units`. As we are operating in units of thousands of dollars, this represents sales of 56440 dollars.

To add the regression line onto the scatter plot, you can use the function `stat_smooth()` [ggplot2]. By default, the fitted line is presented with a confidence interval around it. The confidence bands reflect the uncertainty about the line. If you don’t want to display them, specify the option `se = FALSE` in the function `stat_smooth()`.

```
ggplot(marketing, aes(youtube, sales)) +
geom_point() +
stat_smooth(method = lm)
```

In the previous section, we built a linear model of sales as a function of youtube advertising budget: `sales = 8.44 + 0.048*youtube`.

Before using this formula to predict future sales, you should make sure that this model is statistically significant, that is:

- there is a statistically significant relationship between the predictor and the outcome variables
- the model that we built fits the data at hand well.

In this section, we’ll describe how to check the quality of a linear regression model.

We start by displaying the statistical summary of the model using the R function `summary()`:

`summary(model)`

```
##
## Call:
## lm(formula = sales ~ youtube, data = marketing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.06 -2.35 -0.23 2.48 8.65
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.43911 0.54941 15.4 <2e-16 ***
## youtube 0.04754 0.00269 17.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.91 on 198 degrees of freedom
## Multiple R-squared: 0.612, Adjusted R-squared: 0.61
## F-statistic: 312 on 1 and 198 DF, p-value: <2e-16
```

The summary output shows 6 components, including:

- **Call**. Shows the function call used to compute the regression model.
- **Residuals**. Provides a quick view of the distribution of the residuals, which by definition have a mean of zero. Therefore, the median should not be far from zero, and the minimum and maximum should be roughly equal in absolute value.
- **Coefficients**. Shows the regression beta coefficients and their statistical significance. Predictor variables that are significantly associated with the outcome variable are marked by stars.
- **Residual standard error** (RSE), **R-squared** (R2) and the **F-statistic** are metrics that are used to check how well the model fits our data.

The coefficients table, in the model statistical summary, shows:

- the estimates of the **beta coefficients**
- the **standard errors** (SE), which measure the accuracy of the beta coefficients. For a given beta coefficient, the SE reflects how the coefficient varies under repeated sampling. It can be used to compute the confidence intervals and the t-statistic.
- the **t-statistic** and the associated **p-value**, which define the statistical significance of the beta coefficients.

```
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.4391 0.54941 15.4 1.41e-35
## youtube 0.0475 0.00269 17.7 1.47e-42
```

**t-statistic and p-values**:

For a given predictor, the t-statistic (and its associated p-value) tests whether or not there is a statistically significant relationship between a given predictor and the outcome variable, that is whether or not the beta coefficient of the predictor is significantly different from zero.

The statistical hypotheses are as follow:

- Null hypothesis (H0): the coefficients are equal to zero (i.e., no relationship between x and y)
- Alternative Hypothesis (Ha): the coefficients are not equal to zero (i.e., there is some relationship between x and y)

Mathematically, for a given beta coefficient (b), the t-test is computed as `t = (b - 0)/SE(b)`, where SE(b) is the standard error of the coefficient b. The t-statistic measures the number of standard deviations that b is away from 0. Thus, a large t-statistic will produce a small p-value.

The higher the t-statistic (and the lower the p-value), the more significant the predictor. The symbols to the right visually specify the level of significance. The line below the table shows the definition of these symbols; one star means 0.01 < p < 0.05. The more stars beside the variable’s p-value, the more significant the variable.

A statistically significant coefficient indicates that there is an association between the predictor (x) and the outcome (y) variable.

In our example, both the p-values for the intercept and the predictor variable are highly significant, so we can reject the null hypothesis and accept the alternative hypothesis, which means that there is a significant association between the predictor and the outcome variables.

The t-statistic is a very useful guide for whether or not to include a predictor in a model. High t-statistics (which go with low p-values near 0) indicate that a predictor should be retained in a model, while very low t-statistics indicate a predictor could be dropped (P. Bruce and Bruce 2017).

**Standard errors and confidence intervals**:

The standard error measures the variability/accuracy of the beta coefficients. It can be used to compute the confidence intervals of the coefficients.

For example, the 95% confidence interval for the coefficient b1 is defined as `b1 +/- 2*SE(b1)`, where:

- the lower limit of b1 = `b1 - 2*SE(b1) = 0.047 - 2*0.00269 = 0.042`
- the upper limit of b1 = `b1 + 2*SE(b1) = 0.047 + 2*0.00269 = 0.052`

That is, there is approximately a 95% chance that the interval [0.042, 0.052] will contain the true value of b1. Similarly, the 95% confidence interval for b0 can be computed as `b0 +/- 2*SE(b0)`.

To get this information, simply type:

`confint(model)`

```
## 2.5 % 97.5 %
## (Intercept) 7.3557 9.5226
## youtube 0.0422 0.0528
```

Once you have identified that at least one predictor variable is significantly associated with the outcome, you should continue the diagnostic by checking how well the model fits the data. This process is also referred to as the *goodness-of-fit*.

The overall quality of the linear regression fit can be assessed using the following three quantities, displayed in the model summary:

- The Residual Standard Error (RSE).
- The R-squared (R2)
- F-statistic

```
## rse r.squared f.statistic p.value
## 1 3.91 0.612 312 1.47e-42
```

**Residual standard error** (RSE):

The RSE (also known as the model sigma) is the residual variation, representing the average variation of the observation points around the fitted regression line. This is the standard deviation of the residual errors.

The RSE provides an absolute measure of the patterns in the data that can’t be explained by the model. When comparing two models, the model with the smaller RSE fits the data better.

Dividing the RSE by the average value of the outcome variable will give you the prediction error rate, which should be as small as possible.

In our example, RSE = 3.91, meaning that the observed sales values deviate from the true regression line by approximately 3.9 units on average.

Whether or not an RSE of 3.9 units is an acceptable prediction error is subjective and depends on the problem context. However, we can calculate the percentage error. In our data set, the mean value of sales is 16.827, and so the percentage error is 3.9/16.827 = 23%.

`sigma(model)*100/mean(marketing$sales)`

`## [1] 23.2`

**R-squared and Adjusted R-squared**:

The R-squared (R2) ranges from 0 to 1 and represents the proportion of information (i.e. variation) in the data that can be explained by the model. The adjusted R-squared adjusts for the degrees of freedom.

The R2 measures, how well the model fits the data. For a simple linear regression, R2 is the square of the Pearson correlation coefficient.

A high value of R2 is a good indication. However, as the value of R2 tends to increase when more predictors are added in the model, such as in multiple linear regression model, you should mainly consider the adjusted R-squared, which is a penalized R2 for a higher number of predictors.

- An (adjusted) R2 that is close to 1 indicates that a large proportion of the variability in the outcome has been explained by the regression model.
- A number near 0 indicates that the regression model did not explain much of the variability in the outcome.

**F-Statistic**:

The F-statistic gives the overall significance of the model. It assesses whether at least one predictor variable has a non-zero coefficient.

In a simple linear regression, this test is not really interesting since it just duplicates the information given by the t-test, available in the coefficients table. In fact, the F-statistic is identical to the square of the t-statistic: 312.1 = (17.67)^2. This is true in any model with 1 degree of freedom.
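This identity is easy to check on any one-predictor fit; a sketch on made-up data:

```r
x <- c(1, 2, 3, 4, 5, 6)             # made-up data
y <- c(1.2, 2.1, 2.8, 4.3, 4.9, 6.2)
s <- summary(lm(y ~ x))
f  <- unname(s$fstatistic["value"])   # overall F statistic
t2 <- s$coefficients["x", "t value"]^2
# f equals t2 up to floating-point error
c(F = f, t_squared = t2)
```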

The F-statistic becomes more important once we start using multiple predictors as in multiple linear regression.

A large F-statistic corresponds to a statistically significant p-value (p < 0.05). In our example, the F-statistic equals 312.14, producing a p-value of 1.46e-42, which is highly significant.

After computing a regression model, a first step is to check whether at least one predictor is significantly associated with the outcome variable.

If one or more predictors are significant, the second step is to assess how well the model fits the data by inspecting the Residual Standard Error (RSE), the R2 value and the F-statistic. These metrics give the overall quality of the model.

- RSE: the closer to zero, the better
- R-squared: the higher, the better
- F-statistic: the higher, the better

- Introduction to statistical learning (James et al. 2014)
- Practical Statistics for Data Scientists (P. Bruce and Bruce 2017)

Bruce, Peter, and Andrew Bruce. 2017. *Practical Statistics for Data Scientists*. O’Reilly Media.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014. *An Introduction to Statistical Learning: With Applications in R*. Springer Publishing Company, Incorporated.

**Multiple linear regression** is an extension of simple linear regression used to predict an outcome variable (y) on the basis of multiple distinct predictor variables (x).

With three predictor variables (x), the prediction of y is expressed by the following equation:

`y = b0 + b1*x1 + b2*x2 + b3*x3`

The “b” values are called the regression weights (or *beta coefficients*). They measure the association between the predictor variable and the outcome. “b_j” can be interpreted as the average effect on y of a one unit increase in “x_j”, holding all other predictors fixed.

In this chapter, you will learn how to:

- Build and interpret a multiple linear regression model in R
- Check the overall quality of the model

Make sure you have read our previous article: [simple linear regression model](http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/).

Contents:

The following R packages are required for this chapter:

- `tidyverse` for data manipulation and visualization

`library(tidyverse)`

We’ll use the `marketing` data set [datarium package], which contains the impact of the amount of money spent on three advertising media (youtube, facebook and newspaper) on sales.

First, install the `datarium` package using `devtools::install_github("kassambara/datarium")`, then load and inspect the `marketing` data as follows:

```
data("marketing", package = "datarium")
head(marketing, 4)
```

```
## youtube facebook newspaper sales
## 1 276.1 45.4 83.0 26.5
## 2 53.4 47.2 54.1 12.5
## 3 20.6 55.1 83.2 11.2
## 4 181.8 49.6 70.2 22.2
```

We want to build a model for estimating sales based on the advertising budget invested in youtube, facebook and newspaper, as follow:

`sales = b0 + b1*youtube + b2*facebook + b3*newspaper`

You can compute the model coefficients in R as follow:

```
model <- lm(sales ~ youtube + facebook + newspaper, data = marketing)
summary(model)
```

```
##
## Call:
## lm(formula = sales ~ youtube + facebook + newspaper, data = marketing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.59 -1.07 0.29 1.43 3.40
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.52667 0.37429 9.42 <2e-16 ***
## youtube 0.04576 0.00139 32.81 <2e-16 ***
## facebook 0.18853 0.00861 21.89 <2e-16 ***
## newspaper -0.00104 0.00587 -0.18 0.86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.02 on 196 degrees of freedom
## Multiple R-squared: 0.897, Adjusted R-squared: 0.896
## F-statistic: 570 on 3 and 196 DF, p-value: <2e-16
```

The first step in interpreting the multiple regression analysis is to examine the F-statistic and the associated p-value, at the bottom of the model summary.

In our example, the p-value of the F-statistic is < 2.2e-16, which is highly significant. This means that at least one of the predictor variables is significantly related to the outcome variable.
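As a sketch (assuming the `model` object fitted above), the F-statistic and its p-value can be recovered directly from the summary object:

```r
# Recompute the overall F-test p-value from the stored F-statistic.
# Assumes `model` is the fit lm(sales ~ youtube + facebook + newspaper, data = marketing).
fs <- summary(model)$fstatistic            # named vector: value, numdf, dendf
pf(fs["value"], fs["numdf"], fs["dendf"], lower.tail = FALSE)
```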

To see which predictor variables are significant, you can examine the coefficients table, which shows the estimates of the regression beta coefficients and the associated t-statistic p-values:

`summary(model)$coefficient`

```
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.52667 0.37429 9.422 1.27e-17
## youtube 0.04576 0.00139 32.809 1.51e-81
## facebook 0.18853 0.00861 21.893 1.51e-54
## newspaper -0.00104 0.00587 -0.177 8.60e-01
```

For a given predictor, the t-statistic evaluates whether or not there is a significant association between the predictor and the outcome variable, that is, whether the beta coefficient of the predictor is significantly different from zero.

It can be seen that changes in the youtube and facebook advertising budgets are significantly associated with changes in sales, while changes in the newspaper budget are not.

For a given predictor variable, the coefficient (b) can be interpreted as the average effect on y of a one unit increase in predictor, holding all other predictors fixed.

For example, for a fixed amount of youtube and newspaper advertising budget, spending an additional 1 000 dollars on facebook advertising leads to an increase in sales of approximately 0.1885*1000 = 189 sales units, on average.

The youtube coefficient suggests that for every 1 000 dollars increase in youtube advertising budget, holding all other predictors constant, we can expect an increase of 0.045*1000 = 45 sales units, on average.
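These effects can be read directly off the coefficient vector (a minimal sketch, assuming the three-predictor `model` above, where budgets are expressed in thousands of dollars):

```r
# Expected change in sales for a 1,000-dollar (one-unit) increase in each budget,
# holding the other predictors fixed. Assumes `model` was fitted on `marketing`.
coef(model)[c("youtube", "facebook")] * 1000
```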

We found that newspaper is not significant in the multiple regression model. This means that, for a fixed amount of youtube and facebook advertising budget, changes in the newspaper advertising budget will not significantly affect sales units.

As the newspaper variable is not significant, it is possible to remove it from the model:

```
model <- lm(sales ~ youtube + facebook, data = marketing)
summary(model)
```

```
##
## Call:
## lm(formula = sales ~ youtube + facebook, data = marketing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.557 -1.050 0.291 1.405 3.399
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.50532 0.35339 9.92 <2e-16 ***
## youtube 0.04575 0.00139 32.91 <2e-16 ***
## facebook 0.18799 0.00804 23.38 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.02 on 197 degrees of freedom
## Multiple R-squared: 0.897, Adjusted R-squared: 0.896
## F-statistic: 860 on 2 and 197 DF, p-value: <2e-16
```

Finally, our model equation can be written as follows: `sales = 3.5 + 0.045*youtube + 0.187*facebook`.

The confidence intervals of the model coefficients can be extracted as follows:

`confint(model)`

```
## 2.5 % 97.5 %
## (Intercept) 2.808 4.2022
## youtube 0.043 0.0485
## facebook 0.172 0.2038
```
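As a sketch of what `confint()` computes (assuming the two-predictor `model` above), the limits are the estimates plus or minus a t-quantile times the standard errors:

```r
# 95% confidence limits by hand: estimate +/- qt(0.975, df) * standard error.
# Assumes `model` is the fit lm(sales ~ youtube + facebook, data = marketing).
ctab <- coef(summary(model))
t_crit <- qt(0.975, df = df.residual(model))
cbind(lower = ctab[, "Estimate"] - t_crit * ctab[, "Std. Error"],
      upper = ctab[, "Estimate"] + t_crit * ctab[, "Std. Error"])
```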

As we have seen in simple linear regression, the overall quality of the model can be assessed by examining the R-squared (R2) and Residual Standard Error (RSE).

**R-squared**:

In multiple linear regression, the R2 represents the squared correlation between the observed values of the outcome variable (y) and the fitted (i.e., predicted) values of y. For this reason, R2 is always positive and ranges from zero to one.

R2 represents the proportion of variance in the outcome variable y that can be predicted by knowing the values of the x variables. An R2 value close to 1 indicates that the model explains a large portion of the variance in the outcome variable.
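This definition can be checked directly (a sketch, assuming the fitted `model` and the `marketing` data are in the workspace):

```r
# R2 equals the squared correlation between the observed and fitted values.
# Assumes `model` was fitted on the `marketing` data.
r2 <- cor(marketing$sales, fitted(model))^2
all.equal(r2, summary(model)$r.squared)
```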

A problem with the R2, is that, it will always increase when more variables are added to the model, even if those variables are only weakly associated with the response (James et al. 2014). A solution is to adjust the R2 by taking into account the number of predictor variables.

The “Adjusted R Square” value in the summary output corrects the R2 for the number of x variables included in the prediction model.

In our example, with youtube and facebook predictor variables, the adjusted R2 = 0.89, meaning that “89% of the variance in the measure of sales can be predicted by youtube and facebook advertising budgets”.
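The adjustment uses the standard formula Adjusted R2 = 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. A minimal sketch for the two-predictor model:

```r
# Adjusted R2 from the plain R2. Assumes `model` is the two-predictor fit
# on `marketing` (n = 200 observations, p = 2 predictors).
n <- nrow(marketing)
p <- 2
r2 <- summary(model)$r.squared
1 - (1 - r2) * (n - 1) / (n - p - 1)
```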

This model is better than the simple linear model with only youtube (Chapter simple-linear-regression), which had an adjusted R2 of 0.61.

**Residual Standard Error** (RSE), or sigma:

The RSE estimate gives a measure of error of prediction. The lower the RSE, the more accurate the model (on the data in hand).

The error rate can be estimated by dividing the RSE by the mean of the outcome variable:

`sigma(model)/mean(marketing$sales)`

`## [1] 0.12`

In our multiple regression example, the RSE is 2.023, corresponding to a 12% error rate.

Again, this is better than the simple model, with only youtube variable, where the RSE was 3.9 (~23% error rate) (Chapter simple-linear-regression).

This chapter described the multiple linear regression model.

Note that, if you have many predictor variables in your data, you don’t necessarily need to type their names when computing the model.

To compute multiple regression using all of the predictors in the data set, simply type this:

`model <- lm(sales ~ ., data = marketing)`

If you want to perform the regression using all of the variables except one, say newspaper, type this:

`model <- lm(sales ~ . - newspaper, data = marketing)`

Alternatively, you can use the `update()` function:

`model1 <- update(model, ~. -newspaper)`

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014. *An Introduction to Statistical Learning: With Applications in R*. Springer Publishing Company, Incorporated.

The main goal of **linear regression** is to **predict** an outcome value on the basis of one or multiple predictor variables.

In this chapter, we’ll describe how to predict the outcome for new observations using R. You will also learn how to display the confidence intervals and the prediction intervals.

Contents:

We start by building a simple linear regression model that predicts the stopping distances of cars on the basis of the speed.

```
# Load the data
data("cars", package = "datasets")
# Build the model
model <- lm(dist ~ speed, data = cars)
model
```

```
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Coefficients:
## (Intercept) speed
## -17.58 3.93
```

The linear model equation can be written as follows: `dist = -17.579 + 3.932*speed`.

Note that the units of the variables `speed` and `dist` are, respectively, `mph` and `ft`.

Using the above model, we can predict the stopping distance for a new speed value.

Start by creating a new data frame containing, for example, three new speed values:

```
new.speeds <- data.frame(
  speed = c(12, 19, 24)
)
```

You can predict the corresponding stopping distances using the R function `predict()` as follows:

`predict(model, newdata = new.speeds)`

```
## 1 2 3
## 29.6 57.1 76.8
```
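For a linear model, `predict()` simply applies the fitted equation to the new data; a self-contained sketch verifying this by hand:

```r
# predict() on a linear model is the fitted equation applied to new x values
data("cars", package = "datasets")
model <- lm(dist ~ speed, data = cars)
b <- coef(model)                     # (Intercept), speed
b[[1]] + b[[2]] * c(12, 19, 24)      # same values as predict(model, new.speeds)
```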

The confidence interval reflects the uncertainty around the mean predictions. To display the 95% confidence intervals around the mean predictions, specify the option `interval = "confidence"`:

`predict(model, newdata = new.speeds, interval = "confidence")`

```
## fit lwr upr
## 1 29.6 24.4 34.8
## 2 57.1 51.8 62.4
## 3 76.8 68.4 85.2
```

The output contains the following columns:

- `fit`: the predicted stopping distances for the three new speed values
- `lwr` and `upr`: the lower and the upper confidence limits for the expected values, respectively. By default, the function produces the 95% confidence limits.

For example, the 95% confidence interval associated with a speed of 19 is (51.83, 62.44). This means that, according to our model, a car with a speed of 19 mph has, on average, a stopping distance ranging between 51.83 and 62.44 ft.

The prediction interval gives uncertainty around a single value. In the same way as the confidence intervals, the prediction intervals can be computed as follows:

`predict(model, newdata = new.speeds, interval = "prediction")`

```
## fit lwr upr
## 1 29.6 -1.75 61.0
## 2 57.1 25.76 88.5
## 3 76.8 44.75 108.8
```

The 95% prediction interval associated with a speed of 19 is (25.76, 88.51). This means that, according to our model, 95% of cars with a speed of 19 mph have a stopping distance between 25.76 and 88.51 ft.

Note that the prediction interval relies strongly on the assumption that the residual errors are normally distributed with constant variance. So, you should only use such intervals if you believe that the assumption is approximately met for the data at hand.

A prediction interval reflects the uncertainty around a single value, while a confidence interval reflects the uncertainty around the mean prediction value. Thus, a prediction interval will generally be much wider than a confidence interval for the same value.
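This width difference can be verified directly; a self-contained sketch using the `cars` model:

```r
# At a given speed, the 95% prediction interval is wider than the
# 95% confidence interval for the mean prediction
data("cars", package = "datasets")
model <- lm(dist ~ speed, data = cars)
new.data <- data.frame(speed = 19)
ci <- predict(model, newdata = new.data, interval = "confidence")
pi <- predict(model, newdata = new.data, interval = "prediction")
(pi[, "upr"] - pi[, "lwr"]) > (ci[, "upr"] - ci[, "lwr"])  # TRUE
```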

Which one should we use? The answer to this question depends on the context and the purpose of the analysis. Generally, we are interested in specific individual predictions, so a prediction interval would be more appropriate. Using a confidence interval when you should be using a prediction interval will greatly underestimate the uncertainty in a given predicted value (P. Bruce and Bruce 2017).

The R code below creates a scatter plot with:

- The regression line in blue
- The confidence band in gray
- The prediction band in red

```
# 0. Build linear model
data("cars", package = "datasets")
model <- lm(dist ~ speed, data = cars)
# 1. Add predictions
pred.int <- predict(model, interval = "prediction")
mydata <- cbind(cars, pred.int)
# 2. Regression line + confidence intervals
library("ggplot2")
p <- ggplot(mydata, aes(speed, dist)) +
  geom_point() +
  stat_smooth(method = lm)
# 3. Add prediction intervals
p + geom_line(aes(y = lwr), color = "red", linetype = "dashed") +
  geom_line(aes(y = upr), color = "red", linetype = "dashed")
```

In this chapter, we have described how to use the R function `predict()` for predicting the outcome for new data.

Bruce, Peter, and Andrew Bruce. 2017. *Practical Statistics for Data Scientists*. O’Reilly Media.