<?xml version="1.0" encoding="UTF-8" ?>
<!-- RSS generated by PHPBoost on Fri, 01 May 2026 00:28:20 +0200 -->

<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
	<channel>
		<title><![CDATA[Last articles - STHDA : Machine Learning]]></title>
		<atom:link href="https://www.sthda.com/english/syndication/rss/articles/11" rel="self" type="application/rss+xml"/>
		<link>https://www.sthda.com</link>
		<description><![CDATA[Last articles - STHDA : Machine Learning]]></description>
		<copyright>(C) 2005-2026 PHPBoost</copyright>
		<language>en</language>
		<generator>PHPBoost</generator>
		
		
		<item>
			<title><![CDATA[Linear Regression Essentials in R]]></title>
			<link>https://www.sthda.com/english/articles/40-regression-analysis/165-linear-regression-essentials-in-r/</link>
			<guid>https://www.sthda.com/english/articles/40-regression-analysis/165-linear-regression-essentials-in-r/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">





<p><strong>Linear regression</strong> (or <strong>linear model</strong>) is used to predict a quantitative outcome variable (y) on the basis of one or multiple predictor variables (x) <span class="citation">(James et al. 2014,<span class="citation">P. Bruce and Bruce (2017)</span>)</span>.</p>
<p>The goal is to build a mathematical formula that defines y as a function of the x variables. Once we have built a statistically significant model, it can be used to predict future outcomes on the basis of new x values.</p>
<p>When you build a regression model, you need to assess its predictive performance. In other words, you need to evaluate how well the model predicts the outcome of new test data that have not been used to build it.</p>
<p>Two important metrics are commonly used to assess the performance of the predictive regression model:</p>
<ul>
<li><strong>Root Mean Squared Error</strong>, which measures the model prediction error. It corresponds to the square root of the average squared difference between the observed known values of the outcome and the values predicted by the model. In R it can be computed as <code>RMSE = sqrt(mean((observed - predicted)^2))</code>. The lower the RMSE, the better the model.</li>
<li><strong>R-square</strong>, representing the squared correlation between the observed known outcome values and the values predicted by the model. The higher the R2, the better the model.</li>
</ul>
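<p>As a rough sketch (using small hypothetical vectors purely for illustration), the two metrics can be computed by hand in R:</p>
<pre class="r"><code># Hypothetical observed and predicted values, for illustration only
observed  <- c(10, 12, 15, 18)
predicted <- c(9.5, 12.8, 14.2, 18.6)
# RMSE: square root of the mean squared error
sqrt(mean((observed - predicted)^2))
# R2: squared correlation between observed and predicted
cor(observed, predicted)^2</code></pre>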
<p>A simple workflow to build a predictive regression model is as follows:</p>
<ol style="list-style-type: decimal">
<li>Randomly split your data into training set (80%) and test set (20%)</li>
<li>Build the regression model using the training set</li>
<li>Make predictions using the test set and compute the model accuracy metrics</li>
</ol>
<p>In this chapter, you will learn:</p>
<ul>
<li>the basics and the formula of linear regression,</li>
<li>how to compute simple and multiple regression models in R,</li>
<li>how to make predictions of the outcome of new data,</li>
<li>how to assess the performance of the model</li>
</ul>
<p>Contents:</p>
<div id="TOC">
<ul>
<li><a href="#formula">Formula</a></li>
<li><a href="#loading-required-r-packages">Loading Required R packages</a></li>
<li><a href="#preparing-the-data">Preparing the data</a></li>
<li><a href="#computing-linear-regression">Computing linear regression</a><ul>
<li><a href="#quick-start-r-code">Quick start R code</a></li>
<li><a href="#simple-linear-regression">Simple linear regression</a></li>
<li><a href="#multiple-linear-regression">Multiple linear regression</a></li>
</ul></li>
<li><a href="#interpretation">Interpretation</a><ul>
<li><a href="#model-summary">Model summary</a></li>
<li><a href="#coefficients-significance">Coefficients significance</a></li>
<li><a href="#model-accuracy">Model accuracy</a></li>
</ul></li>
<li><a href="#making-predictions">Making predictions</a></li>
<li><a href="#discussion">Discussion</a></li>
<li><a href="#references">References</a></li>
</ul>
</div>
<br/>
<div class = "small-block content-privileged-friends navr-book">
  <p>The Book:</p>
        <a href = "https://www.sthda.com/english/web/5-bookadvisor/54-machine-learning-essentials/">
          <img src = "https://www.sthda.com/english/upload/machine-learning-essentials-frontcover-200px.png" /><br/>
     Machine Learning Essentials: Practical Guide in R
      </a>
</div>
<div class="spacer"></div>


<div id="formula" class="section level2">
<h2>Formula</h2>
<p>The mathematical formula of the linear regression can be written as follows:</p>
<p><code>y = b0 + b1*x + e</code></p>
<p>We read this as “y is modeled as beta1 (<code>b1</code>) times <code>x</code>, plus a constant beta0 (<code>b0</code>), plus an error term <code>e</code>.”</p>
<p>When you have multiple predictor variables, the equation can be written as <code>y = b0 + b1*x1 + b2*x2 + ... + bn*xn</code>, where:</p>
<ul>
<li>b0 is the intercept,</li>
<li>b1, b2, …, bn are the regression weights or coefficients associated with the predictors x1, x2, …, xn.</li>
<li><code>e</code> is the <em>error term</em> (also known as the <em>residual error</em>), the part of y that cannot be explained by the regression model</li>
</ul>
<div class="notice">
<p>
Note that, b0, b1, b2, … and bn are known as the regression beta coefficients or parameters.
</p>
</div>
<p>The figure below illustrates a simple linear regression model, where:</p>
<ul>
<li>the best-fit regression line is in blue</li>
<li>the intercept (b0) and the slope (b1) are shown in green</li>
<li>the error terms (e) are represented by vertical red lines</li>
</ul>
<p><img src="https://www.sthda.com/english/sthda-upload/images/machine-learning-essentials/linear-regression.png" alt="Linear regression" /></p>
<p>From the scatter plot above, it can be seen that not all the data points fall exactly on the fitted regression line. Some of the points are above the blue line and some are below it; overall, the residual errors (e) have approximately mean zero.</p>
<p>The sum of the squares of the residual errors is called the <strong>Residual Sum of Squares</strong> or <strong>RSS</strong>.</p>
<p>The average variation of points around the fitted regression line is called the <strong>Residual Standard Error</strong> (<strong>RSE</strong>). This is one of the metrics used to evaluate the overall quality of the fitted regression model. The lower the RSE, the better.</p>
<p>Since the mean error term is zero, the outcome variable y can be approximately estimated as follows:</p>
<p><code>y ~ b0 + b1*x</code></p>
<p>Mathematically, the beta coefficients (b0 and b1) are determined so that the RSS is as minimal as possible. This method of determining the beta coefficients is technically called <strong>least squares</strong> regression or <strong>ordinary least squares</strong> (OLS) regression.</p>
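<p>For simple linear regression, the least-squares estimates have a simple closed form. A minimal sketch in R (using hypothetical predictor and outcome vectors <code>x</code> and <code>y</code>, for illustration only):</p>
<pre class="r"><code># Hypothetical data for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
# Closed-form OLS estimates: b1 = cov(x, y)/var(x); b0 = mean(y) - b1*mean(x)
b1 <- cov(x, y) / var(x)
b0 <- mean(y) - b1 * mean(x)
# These match the coefficients returned by lm()
coef(lm(y ~ x))</code></pre>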
<p>Once the beta coefficients are calculated, a t-test is performed to check whether or not these coefficients are significantly different from zero. A non-zero beta coefficient means that there is a significant relationship between the predictor (x) and the outcome variable (y).</p>
</div>
<div id="loading-required-r-packages" class="section level2">
<h2>Loading Required R packages</h2>
<ul>
<li><code>tidyverse</code> for easy data manipulation and visualization</li>
<li><code>caret</code> for easy machine learning workflow</li>
</ul>
<pre class="r"><code>library(tidyverse)
library(caret)
theme_set(theme_bw())</code></pre>
</div>
<div id="preparing-the-data" class="section level2">
<h2>Preparing the data</h2>
<p>We’ll use the <code>marketing</code> data set, introduced in Chapter @ref(regression-analysis), for predicting sales units on the basis of the amount of money spent on three advertising media (youtube, facebook and newspaper).</p>
<p>We’ll randomly split the data into a training set (80%, for building a predictive model) and a test set (20%, for evaluating the model). Make sure to set the seed for reproducibility.</p>
<pre class="r"><code># Load the data
data("marketing", package = "datarium")
# Inspect the data
sample_n(marketing, 3)</code></pre>
<pre><code>##     youtube facebook newspaper sales
## 58    163.4     23.0      19.9  15.8
## 157   112.7     52.2      60.6  18.4
## 81     91.7     32.0      26.8  14.2</code></pre>
<pre class="r"><code># Split the data into training and test set
set.seed(123)
training.samples <- marketing$sales %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data  <- marketing[training.samples, ]
test.data <- marketing[-training.samples, ]</code></pre>
</div>
<div id="computing-linear-regression" class="section level2">
<h2>Computing linear regression</h2>
<p>The R function <code>lm()</code> is used to compute a linear regression model.</p>
<div id="quick-start-r-code" class="section level3">
<h3>Quick start R code</h3>
<pre class="r"><code># Build the model
model <- lm(sales ~., data = train.data)
# Summarize the model
summary(model)
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
# (a) Prediction error, RMSE
RMSE(predictions, test.data$sales)
# (b) R-square
R2(predictions, test.data$sales)</code></pre>
</div>
<div id="simple-linear-regression" class="section level3">
<h3>Simple linear regression</h3>
<p>The <strong>simple linear regression</strong> is used to predict a continuous outcome variable (y) based on one single predictor variable (x).</p>
<p>In the following example, we’ll build a simple linear model to predict sales units based on the advertising budget spent on youtube. The regression equation can be written as <code>sales = b0 + b1*youtube</code>.</p>
<p>The R function <code>lm()</code> can be used to determine the beta coefficients of the linear model, as follows:</p>
<pre class="r"><code>model <- lm(sales ~ youtube, data = train.data)
summary(model)$coef</code></pre>
<pre><code>##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   8.3839    0.62442    13.4 5.22e-28
## youtube       0.0468    0.00301    15.6 7.84e-34</code></pre>
<p>The output above shows the estimates of the regression beta coefficients (column <code>Estimate</code>) and their significance levels (column <code>Pr(>|t|)</code>). The intercept (<code>b0</code>) is 8.38 and the coefficient of the youtube variable is 0.0468.</p>
<p>The estimated regression equation can therefore be written as follows: <code>sales = 8.38 + 0.0468*youtube</code>. Using this formula, for each new youtube advertising budget, you can predict the number of sale units.</p>
<p>For example:</p>
<ul>
<li>For a youtube advertising budget equal to zero, we can expect sales of 8.38 units.</li>
<li>For a youtube advertising budget equal to 1000, we can expect sales of 8.38 + 0.0468*1000 ≈ 55 units.</li>
</ul>
<p>Predictions can be easily made using the R function <code>predict()</code>. In the following example, we predict sales units for two youtube advertising budgets: 0 and 1000.</p>
<pre class="r"><code>newdata <- data.frame(youtube = c(0,  1000))
model %>% predict(newdata)</code></pre>
<pre><code>##     1     2 
##  8.38 55.19</code></pre>
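<p>Note that <code>predict()</code> can also return interval estimates around these point predictions; for example, the following sketch (using the same <code>newdata</code>) adds 95% confidence intervals:</p>
<pre class="r"><code># Point predictions with 95% confidence intervals
model %>% predict(newdata, interval = "confidence")</code></pre>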
</div>
<div id="multiple-linear-regression" class="section level3">
<h3>Multiple linear regression</h3>
<p><strong>Multiple linear regression</strong> is an extension of simple linear regression for predicting an outcome variable (y) on the basis of multiple distinct predictor variables (x).</p>
<p>For example, with three predictor variables (x), the prediction of y is expressed by the following equation: <code>y = b0 + b1*x1 + b2*x2 + b3*x3</code></p>
<p>The regression beta coefficients measure the association between each predictor variable and the outcome. <code>b_j</code> can be interpreted as the average effect on y of a one-unit increase in <code>x_j</code>, holding all other predictors fixed.</p>
<p>In this section, we’ll build a multiple regression model to predict sales based on the budget invested in three advertising media: youtube, facebook and newspaper. The formula is as follows: <code>sales = b0 + b1*youtube + b2*facebook + b3*newspaper</code></p>
<p>You can compute the multiple regression model coefficients in R as follows:</p>
<pre class="r"><code>model <- lm(sales ~ youtube + facebook + newspaper, 
            data = train.data)
summary(model)$coef</code></pre>
<p>Note that if you have many predictor variables in your data, you can simply include all the available variables in the model using <code>~.</code>:</p>
<pre class="r"><code>model <- lm(sales ~., data = train.data)
summary(model)$coef</code></pre>
<pre><code>##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  3.39188    0.44062   7.698 1.41e-12
## youtube      0.04557    0.00159  28.630 2.03e-64
## facebook     0.18694    0.00989  18.905 2.07e-42
## newspaper    0.00179    0.00677   0.264 7.92e-01</code></pre>
<p>From the output above, the coefficients table shows the beta coefficient estimates and their significance levels. Columns are:</p>
<ul>
<li><code>Estimate</code>: the intercept (b0) and the beta coefficient estimates associated with each predictor variable</li>
<li><code>Std.Error</code>: the standard error of the coefficient estimates. This represents the accuracy of the coefficients. The larger the standard error, the less confident we are about the estimate.</li>
<li><code>t value</code>: the t-statistic, which is the coefficient estimate divided by its standard error</li>
<li><code>Pr(>|t|)</code>: The p-value corresponding to the t-statistic. The smaller the p-value, the more significant the estimate is.</li>
</ul>
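<p>The t-statistics and p-values in this table can be recomputed by hand, which is a useful sanity check. A sketch, assuming <code>model</code> is the fitted <code>lm</code> object from above:</p>
<pre class="r"><code>coefs <- summary(model)$coefficients
# t value = Estimate / Std. Error
coefs[, "Estimate"] / coefs[, "Std. Error"]
# Two-sided p-value from the t distribution
2 * pt(abs(coefs[, "t value"]), df = df.residual(model), lower.tail = FALSE)</code></pre>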
<p>As previously described, you can easily make predictions using the R function <code>predict()</code>:</p>
<pre class="r"><code># New advertising budgets
newdata <- data.frame(
  youtube = 2000, facebook = 1000,
  newspaper = 1000
)
# Predict sales values
model %>% predict(newdata)</code></pre>
<pre><code>##   1 
## 283</code></pre>
</div>
</div>
<div id="interpretation" class="section level2">
<h2>Interpretation</h2>
<p>Before using a model for predictions, you need to assess the statistical significance of the model. This can be easily checked by displaying the statistical summary of the model.</p>
<div id="model-summary" class="section level3">
<h3>Model summary</h3>
<p>Display the statistical summary of the model as follow:</p>
<pre class="r"><code>summary(model)</code></pre>
<pre><code>## 
## Call:
## lm(formula = sales ~ ., data = train.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.412  -1.110   0.348   1.422   3.499 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.39188    0.44062    7.70  1.4e-12 ***
## youtube      0.04557    0.00159   28.63  < 2e-16 ***
## facebook     0.18694    0.00989   18.90  < 2e-16 ***
## newspaper    0.00179    0.00677    0.26     0.79    
## ---
## Signif. codes:  0 &#39;***&#39; 0.001 &#39;**&#39; 0.01 &#39;*&#39; 0.05 &#39;.&#39; 0.1 &#39; &#39; 1
## 
## Residual standard error: 2.12 on 158 degrees of freedom
## Multiple R-squared:  0.89,   Adjusted R-squared:  0.888 
## F-statistic:  427 on 3 and 158 DF,  p-value: <2e-16</code></pre>
<p>The summary output shows several components, including:</p>
<ul>
<li><strong>Call</strong>. Shows the function call used to compute the regression model.</li>
<li><strong>Residuals</strong>. Provides a quick view of the distribution of the residuals, which by definition have mean zero. Therefore, the median should not be far from zero, and the minimum and maximum should be roughly equal in absolute value.</li>
<li><strong>Coefficients</strong>. Shows the regression beta coefficients and their statistical significance. Predictor variables that are significantly associated with the outcome variable are marked by stars.</li>
<li><strong>Residual standard error</strong> (RSE), <strong>R-squared</strong> (R2) and the <strong>F-statistic</strong>, metrics used to check how well the model fits our data.</li>
</ul>
<p>The first step in interpreting the multiple regression analysis is to examine the F-statistic and the associated p-value, at the bottom of model summary.</p>
<div class="success">
<p>
In our example, it can be seen that the p-value of the F-statistic is < 2.2e-16, which is highly significant. This means that at least one of the predictor variables is significantly related to the outcome variable.
</p>
</div>
</div>
<div id="coefficients-significance" class="section level3">
<h3>Coefficients significance</h3>
<p>To see which predictor variables are significant, you can examine the coefficients table, which shows the estimate of regression beta coefficients and the associated t-statistic p-values.</p>
<pre class="r"><code>summary(model)$coef</code></pre>
<pre><code>##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  3.39188    0.44062   7.698 1.41e-12
## youtube      0.04557    0.00159  28.630 2.03e-64
## facebook     0.18694    0.00989  18.905 2.07e-42
## newspaper    0.00179    0.00677   0.264 7.92e-01</code></pre>
<p>For a given predictor, the t-statistic evaluates whether or not there is a significant association between the predictor and the outcome variable, that is, whether the beta coefficient of the predictor is significantly different from zero.</p>
<div class="success">
<p>
It can be seen that changes in the youtube and facebook advertising budgets are significantly associated with changes in sales, while changes in the newspaper budget are not significantly associated with sales.
</p>
</div>
<p>For a given predictor variable, the coefficient (b) can be interpreted as the average effect on y of a one-unit increase in that predictor, holding all other predictors fixed.</p>
<p>For example, for a fixed amount of youtube and newspaper advertising budget, spending an additional 1 000 dollars on facebook advertising leads to an increase in sales by approximately 0.187*1000 = 187 sale units, on average.</p>
<p>The youtube coefficient suggests that for every 1 000 dollars increase in the youtube advertising budget, holding all other predictors constant, we can expect an increase of approximately 0.0456*1000 ≈ 46 sale units, on average.</p>
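<p>These back-of-the-envelope calculations can be reproduced directly from the fitted coefficients. A sketch, assuming <code>model</code> is the multiple regression model above:</p>
<pre class="r"><code>b <- coef(model)
# Expected average change in sales for a 1 000-dollar increase
# in each budget, holding the other predictors fixed
b["youtube"] * 1000
b["facebook"] * 1000</code></pre>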
<p>We found that newspaper is not significant in the multiple regression model. This means that, for a fixed amount of youtube and facebook advertising budget, changes in the newspaper advertising budget will not significantly affect sales units.</p>
<p>As the newspaper variable is not significant, it is possible to remove it from the model:</p>
<pre class="r"><code>model <- lm(sales ~ youtube + facebook, data = train.data)
summary(model)</code></pre>
<pre><code>## 
## Call:
## lm(formula = sales ~ youtube + facebook, data = train.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.481  -1.104   0.349   1.423   3.486 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.43446    0.40877     8.4  2.3e-14 ***
## youtube      0.04558    0.00159    28.7  < 2e-16 ***
## facebook     0.18788    0.00920    20.4  < 2e-16 ***
## ---
## Signif. codes:  0 &#39;***&#39; 0.001 &#39;**&#39; 0.01 &#39;*&#39; 0.05 &#39;.&#39; 0.1 &#39; &#39; 1
## 
## Residual standard error: 2.11 on 159 degrees of freedom
## Multiple R-squared:  0.89,   Adjusted R-squared:  0.889 
## F-statistic:  644 on 2 and 159 DF,  p-value: <2e-16</code></pre>
<div class="success">
<p>
Finally, our model equation can be written as follows: <code>sales = 3.43 + 0.046*youtube + 0.188*facebook</code>.
</p>
</div>
</div>
<div id="model-accuracy" class="section level3">
<h3>Model accuracy</h3>
<p>Once you have identified that at least one predictor variable is significantly associated with the outcome, you should continue the diagnostic by checking how well the model fits the data. This process is also referred to as the <em>goodness-of-fit</em>.</p>
<p>The overall quality of the linear regression fit can be assessed using the following three quantities, displayed in the model summary:</p>
<ol style="list-style-type: decimal">
<li>Residual Standard Error (RSE),</li>
<li>R-squared (R2) and adjusted R2,</li>
<li>F-statistic, which has already been described in the previous section</li>
</ol>
<pre><code>##    rse r.squared f.statistic  p.value
## 1 2.11      0.89         644 5.64e-77</code></pre>
<ol style="list-style-type: decimal">
<li><strong>Residual standard error</strong> (RSE). </li>
</ol>
<p>The RSE (or model <em>sigma</em>), corresponding to the prediction error, represents roughly the average difference between the observed outcome values and the values predicted by the model. The lower the RSE, the better the model fits our data.</p>
<p>Dividing the RSE by the average value of the outcome variable will give you the prediction error rate, which should be as small as possible.</p>
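<p>In R, the RSE of a fitted model can be extracted with <code>sigma()</code>. A sketch of the error-rate calculation, assuming <code>model</code> was fitted on <code>train.data</code>:</p>
<pre class="r"><code># Residual standard error (RSE)
sigma(model)
# Prediction error rate: RSE divided by the mean outcome value
sigma(model) / mean(train.data$sales)</code></pre>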
<div class="success">
<p>
In our example, using only the youtube and facebook predictor variables, the RSE = 2.11, meaning that the observed sales values deviate from the predicted values by approximately 2.11 units on average.
</p>
<p>
This corresponds to an error rate of 2.11/mean(train.data$sales) = 2.11/16.77 ≈ 13%, which is low.
</p>
</div>
<ol start="2" style="list-style-type: decimal">
<li><strong>R-squared and Adjusted R-squared</strong>:</li>
</ol>
<p>The R-squared (R2) ranges from 0 to 1 and represents the proportion of variation in the outcome variable that can be explained by the model predictor variables.</p>
<p>For a simple linear regression, R2 is the square of the Pearson correlation coefficient between the outcome and the predictor variable. In multiple linear regression, R2 is the squared correlation between the observed outcome values and the values predicted by the model.</p>
<p>The R2 measures how well the model fits the data. The higher the R2, the better the model. However, a problem with R2 is that it will always increase when more variables are added to the model, even if those variables are only weakly associated with the outcome <span class="citation">(James et al. 2014)</span>. A solution is to adjust R2 by taking into account the number of predictor variables.</p>
<p>The “Adjusted R Square” value in the summary output is R2 corrected for the number of x variables included in the predictive model.</p>
<p>So, you should mainly consider the adjusted R-squared, which penalizes R2 for a higher number of predictors.</p>
<ul>
<li>An (adjusted) R2 that is close to 1 indicates that a large proportion of the variability in the outcome has been explained by the regression model.</li>
<li>A number near 0 indicates that the regression model did not explain much of the variability in the outcome.</li>
</ul>
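<p>Both quantities can be extracted directly from the model summary; a minimal sketch:</p>
<pre class="r"><code>summary(model)$r.squared      # R2
summary(model)$adj.r.squared  # adjusted R2</code></pre>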
<div class="success">
<p>
In our example, the adjusted R2 is 0.89, which is good.
</p>
</div>
<ol start="3" style="list-style-type: decimal">
<li><strong>F-Statistic</strong>: </li>
</ol>
<p>Recall that the F-statistic gives the overall significance of the model. It assesses whether at least one predictor variable has a non-zero coefficient.</p>
<p>In a simple linear regression, this test is not really interesting since it just duplicates the information given by the t-test, available in the coefficient table.</p>
<p>The F-statistic becomes more important once we start using multiple predictors as in multiple linear regression.</p>
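<p>The F-statistic and its exact p-value can be recovered from the model summary; a sketch:</p>
<pre class="r"><code>f <- summary(model)$fstatistic
f["value"]  # the F-statistic
# p-value from the F distribution
pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)</code></pre>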
<div class="success">
<p>
A large F-statistic corresponds to a statistically significant p-value (p < 0.05). In our example, the F-statistic equals 644, producing a p-value of 5.64e-77, which is highly significant.
</p>
</div>
</div>
</div>
<div id="making-predictions" class="section level2">
<h2>Making predictions</h2>
<p>We’ll make predictions using the test data in order to evaluate the performance of our regression model.</p>
<p>The procedure is as follows:</p>
<ol style="list-style-type: decimal">
<li>Predict the sales values based on new advertising budgets in the test data</li>
<li>Assess the model performance by computing:
<ul>
<li>The prediction error RMSE (Root Mean Squared Error), representing the square root of the average squared difference between the observed known outcome values in the test data and the values predicted by the model. The lower the RMSE, the better the model.</li>
<li>The R-square (R2), representing the squared correlation between the observed outcome values and the predicted outcome values. The higher the R2, the better the model.</li>
</ul></li>
</ol>
<pre class="r"><code># Make predictions
predictions <- model %>% predict(test.data)
# Model performance
# (a) Compute the prediction error, RMSE
RMSE(predictions, test.data$sales)</code></pre>
<pre><code>## [1] 1.58</code></pre>
<pre class="r"><code># (b) Compute R-square
R2(predictions, test.data$sales)</code></pre>
<pre><code>## [1] 0.938</code></pre>
<div class="success">
<p>
From the output above, the R2 is 0.93, meaning that the observed and the predicted outcome values are highly correlated, which is very good.
</p>
<p>
The prediction error RMSE is 1.58, representing an error rate of 1.58/mean(test.data$sales) = 1.58/17 ≈ 9.3%, which is good.
</p>
</div>
</div>
<div id="discussion" class="section level2">
<h2>Discussion</h2>
<p>This chapter describes the basics of linear regression and provides practical examples in R for computing simple and multiple linear regression models. We also described how to assess the performance of the model for predictions.</p>
<p>Note that, linear regression assumes a linear relationship between the outcome and the predictor variables. This can be easily checked by creating a scatter plot of the outcome variable vs the predictor variable.</p>
<p>For example, the following R code displays sales units versus youtube advertising budget. We’ll also add a smoothed line:</p>
<pre class="r"><code>ggplot(marketing, aes(x = youtube, y = sales)) +
  geom_point() +
  stat_smooth()</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/machine-learning-essentials/006-linear-regression-scatter-plot-1.png" width="384" /></p>
<p>The graph above shows a linearly increasing relationship between the <code>sales</code> and the <code>youtube</code> variables, which is a good thing.</p>
<p>In addition to the linearity assumptions, the linear regression method makes many other assumptions about your data (see Chapter @ref(regression-assumptions-and-diagnostics)). You should make sure that these assumptions hold true for your data.</p>
<p>Potential problems include: (a) the presence of influential observations in the data (Chapter @ref(regression-assumptions-and-diagnostics)), (b) non-linearity between the outcome and some predictor variables (Chapter @ref(polynomial-and-spline-regression)) and (c) the presence of strong correlation between predictor variables (Chapter @ref(multicollinearity)).</p>
</div>
<div id="references" class="section level2 unnumbered">
<h2>References</h2>
<div id="refs" class="references">
<div id="ref-bruce2017">
<p>Bruce, Peter, and Andrew Bruce. 2017. <em>Practical Statistics for Data Scientists</em>. O’Reilly Media.</p>
</div>
<div id="ref-james2014">
<p>James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014. <em>An Introduction to Statistical Learning: With Applications in R</em>. Springer Publishing Company, Incorporated.</p>
</div>
</div>
</div>


</div><!--end rdoc-->


<!-- END HTML -->]]></description>
			<pubDate>Sun, 11 Mar 2018 12:47:00 +0100</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Interaction Effect in Multiple Regression: Essentials ]]></title>
			<link>https://www.sthda.com/english/articles/40-regression-analysis/164-interaction-effect-in-multiple-regression-essentials/</link>
			<guid>https://www.sthda.com/english/articles/40-regression-analysis/164-interaction-effect-in-multiple-regression-essentials/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">





<p>This chapter describes how to compute multiple linear regression with <strong>interaction effects</strong>.</p>
<p>Previously, we have described how to build a multiple linear regression model (Chapter @ref(linear-regression)) for predicting a continuous outcome variable (y) based on multiple predictor variables (x).</p>
<p>For example, to predict sales based on the advertising budgets spent on youtube and facebook, the model equation is <code>sales = b0 + b1*youtube + b2*facebook</code>, where b0 is the intercept, and b1 and b2 are the regression coefficients associated respectively with the predictor variables youtube and facebook.</p>
<p>The above equation, also known as the <em>additive model</em>, investigates only the main effects of the predictors. It assumes that the relationship between a given predictor variable and the outcome is independent of the other predictor variables <span class="citation">(James et al. 2014,<span class="citation">P. Bruce and Bruce (2017)</span>)</span>.</p>
<p>Considering our example, the additive model assumes that the effect on sales of youtube advertising is independent of the effect of facebook advertising.</p>
<p>This assumption might not be true. For example, spending money on facebook advertising may increase the effectiveness of youtube advertising on sales. In marketing, this is known as a synergy effect, and in statistics it is referred to as an interaction effect <span class="citation">(James et al. 2014)</span>.</p>
<p>In this chapter, you’ll learn:</p>
<ul>
<li>the equation of multiple linear regression with interaction</li>
<li>R codes for computing the regression coefficients associated with the main effects and the interaction effects</li>
<li>how to interpret the interaction effect</li>
</ul>
<p>Contents:</p>
<div id="TOC">
<ul>
<li><a href="#equation">Equation</a></li>
<li><a href="#loading-required-r-packages">Loading Required R packages</a></li>
<li><a href="#preparing-the-data">Preparing the data</a></li>
<li><a href="#computation">Computation</a><ul>
<li><a href="#additive-model">Additive model</a></li>
<li><a href="#interaction-effects">Interaction effects</a></li>
</ul></li>
<li><a href="#interpretation">Interpretation</a></li>
<li><a href="#comparing-the-additive-and-the-interaction-models">Comparing the additive and the interaction models</a></li>
<li><a href="#discussion">Discussion</a></li>
<li><a href="#references">References</a></li>
</ul>
</div>
<br/>
<div class = "small-block content-privileged-friends navr-book">
  <p>The Book:</p>
        <a href = "https://www.sthda.com/english/web/5-bookadvisor/54-machine-learning-essentials/">
          <img src = "https://www.sthda.com/english/upload/machine-learning-essentials-frontcover-200px.png" /><br/>
     Machine Learning Essentials: Practical Guide in R
      </a>
</div>
<div class="spacer"></div>


<div id="equation" class="section level2">
<h2>Equation</h2>
<p>The multiple linear regression equation, with interaction effects between two predictors (x1 and x2), can be written as follows:</p>
<p><code>y = b0 + b1*x1 + b2*x2 + b3*(x1*x2)</code></p>
<p>Considering our example, it becomes:</p>
<p><code>sales = b0 + b1*youtube + b2*facebook + b3*(youtube*facebook)</code></p>
<p>This can also be written as:</p>
<p><code>sales = b0 + (b1 + b3*facebook)*youtube + b2*facebook</code></p>
<p>or as:</p>
<p><code>sales = b0 + b1*youtube + (b2 + b3*youtube)*facebook</code></p>
<div class="success">
<p>
<code>b3</code> can be interpreted as the increase in the effectiveness of youtube advertising for a one unit increase in facebook advertising (or vice-versa).
</p>
</div>
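<p>In R’s formula syntax, <code>x1*x2</code> expands to the main effects plus their interaction, so this interaction model can be specified compactly. A sketch, assuming <code>train.data</code> is prepared as in the previous chapters:</p>
<pre class="r"><code># Equivalent specifications of the interaction model
model2 <- lm(sales ~ youtube * facebook, data = train.data)
# same as: lm(sales ~ youtube + facebook + youtube:facebook, data = train.data)</code></pre>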
<p>In the following sections, you will learn how to compute the regression coefficients in R.</p>
</div>
<div id="loading-required-r-packages" class="section level2">
<h2>Loading Required R packages</h2>
<ul>
<li><code>tidyverse</code> for easy data manipulation and visualization</li>
<li><code>caret</code> for easy machine learning workflow</li>
</ul>
<pre class="r"><code>library(tidyverse)
library(caret)</code></pre>
</div>
<div id="preparing-the-data" class="section level2">
<h2>Preparing the data</h2>
<p>We’ll use the <code>marketing</code> data set, introduced in Chapter @ref(regression-analysis), for predicting sales units on the basis of the amount of money spent on three advertising media (youtube, facebook and newspaper).</p>
<p>We’ll randomly split the data into a training set (80%, for building a predictive model) and a test set (20%, for evaluating the model).</p>
<pre class="r"><code># Load the data
data("marketing", package = "datarium")
# Inspect the data
sample_n(marketing, 3)</code></pre>
<pre><code>##     youtube facebook newspaper sales
## 58    163.4     23.0      19.9  15.8
## 157   112.7     52.2      60.6  18.4
## 81     91.7     32.0      26.8  14.2</code></pre>
<pre class="r"><code># Split the data into training and test set
set.seed(123)
training.samples <- marketing$sales %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data  <- marketing[training.samples, ]
test.data <- marketing[-training.samples, ]</code></pre>
</div>
<div id="computation" class="section level2">
<h2>Computation</h2>
<div id="additive-model" class="section level3">
<h3>Additive model</h3>
<p>The standard linear regression model can be computed as follows:</p>
<pre class="r"><code># Build the model
model1 <- lm(sales ~ youtube + facebook, data = train.data)
# Summarize the model
summary(model1)</code></pre>
<pre><code>## 
## Call:
## lm(formula = sales ~ youtube + facebook, data = train.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.481  -1.104   0.349   1.423   3.486 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.43446    0.40877     8.4  2.3e-14 ***
## youtube      0.04558    0.00159    28.7  < 2e-16 ***
## facebook     0.18788    0.00920    20.4  < 2e-16 ***
## ---
## Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1
## 
## Residual standard error: 2.11 on 159 degrees of freedom
## Multiple R-squared:  0.89,   Adjusted R-squared:  0.889 
## F-statistic:  644 on 2 and 159 DF,  p-value: <2e-16</code></pre>
<pre class="r"><code># Make predictions
predictions <- model1 %>% predict(test.data)
# Model performance
# (a) Prediction error, RMSE
RMSE(predictions, test.data$sales)</code></pre>
<pre><code>## [1] 1.58</code></pre>
<pre class="r"><code># (b) R-square
R2(predictions, test.data$sales)</code></pre>
<pre><code>## [1] 0.938</code></pre>
</div>
<div id="interaction-effects" class="section level3">
<h3>Interaction effects</h3>
<p>In R, you include interactions between variables using the <code>*</code> operator:</p>
<pre class="r"><code># Build the model
# Use this: 
model2 <- lm(sales ~ youtube + facebook + youtube:facebook,
             data = train.data)
# Or simply, use this: 
model2 <- lm(sales ~ youtube*facebook, data = train.data)

# Summarize the model
summary(model2)</code></pre>
<pre><code>## 
## Call:
## lm(formula = sales ~ youtube * facebook, data = train.data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.438 -0.482  0.231  0.748  1.860 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      7.90e+00   3.28e-01   24.06   <2e-16 ***
## youtube          1.95e-02   1.64e-03   11.90   <2e-16 ***
## facebook         2.96e-02   9.83e-03    3.01    0.003 ** 
## youtube:facebook 9.12e-04   4.84e-05   18.86   <2e-16 ***
## ---
## Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1
## 
## Residual standard error: 1.18 on 158 degrees of freedom
## Multiple R-squared:  0.966,  Adjusted R-squared:  0.966 
## F-statistic: 1.51e+03 on 3 and 158 DF,  p-value: <2e-16</code></pre>
<pre class="r"><code># Make predictions
predictions <- model2 %>% predict(test.data)
# Model performance
# (a) Prediction error, RMSE
RMSE(predictions, test.data$sales)</code></pre>
<pre><code>## [1] 0.963</code></pre>
<pre class="r"><code># (b) R-square
R2(predictions, test.data$sales)</code></pre>
<pre><code>## [1] 0.982</code></pre>
</div>
</div>
<div id="interpretation" class="section level2">
<h2>Interpretation</h2>
<p>It can be seen that all the coefficients, including the interaction term coefficient, are statistically significant, suggesting that there is an interaction relationship between the two predictor variables (youtube and facebook advertising).</p>
<p>Our model equation looks like this:</p>
<p><code>sales = 7.89 + 0.019*youtube + 0.029*facebook + 0.0009*youtube*facebook</code></p>
<p>We can interpret this as follows: an increase in youtube advertising of 1000 dollars is associated with an increase in sales of <code>(b1 + b3*facebook)*1000 = 19 + 0.9*facebook units</code>, and an increase in facebook advertising of 1000 dollars is associated with an increase in sales of <code>(b2 + b3*youtube)*1000 = 29 + 0.9*youtube units</code>.</p>
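<p>As a quick check of the arithmetic above, the marginal effects can be computed directly from the (rounded) coefficient estimates of <code>model2</code>:</p>
<pre class="r"><code># Rounded coefficient estimates from the interaction model above
b1 <- 0.0195  # youtube
b2 <- 0.0296  # facebook
b3 <- 0.0009  # youtube:facebook
# Expected increase in sales for a 1000-dollar increase in youtube
# advertising, at a given level of facebook advertising
youtube.effect <- function(facebook) (b1 + b3*facebook)*1000
youtube.effect(0)   # 19.5 units
youtube.effect(50)  # 19.5 + 0.9*50 = 64.5 units</code></pre>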
<div class="warning">
<p>
Note that, sometimes, it is the case that the interaction term is significant but not the main effects. The hierarchical principle states that, if we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant <span class="citation"><span class="citation">(James et al. 2014)</span></span>.
</p>
</div>
</div>
<div id="comparing-the-additive-and-the-interaction-models" class="section level2">
<h2>Comparing the additive and the interaction models</h2>
<p>The prediction error RMSE of the interaction model is 0.963, which is lower than the prediction error of the additive model (1.58).</p>
<p>Additionally, the R-square (R2) value of the interaction model is 98% compared to only 93% for the additive model.</p>
<p>These results suggest that the model with the interaction term is better than the model that contains only main effects. So, for this specific data, we should go for the interaction model.</p>
</div>
<div id="discussion" class="section level2">
<h2>Discussion</h2>
<p>This chapter describes how to compute multiple linear regression with interaction effects. Interaction terms should be included in the model if they are significant.</p>
</div>
<div id="references" class="section level2 unnumbered">
<h2>References</h2>
<div id="refs" class="references">
<div id="ref-bruce2017">
<p>Bruce, Peter, and Andrew Bruce. 2017. <em>Practical Statistics for Data Scientists</em>. O’Reilly Media.</p>
</div>
<div id="ref-james2014">
<p>James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014. <em>An Introduction to Statistical Learning: With Applications in R</em>. Springer Publishing Company, Incorporated.</p>
</div>
</div>
</div>


</div><!--end rdoc-->


<!-- END HTML -->]]></description>
			<pubDate>Sun, 11 Mar 2018 12:44:00 +0100</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Regression with Categorical Variables: Dummy Coding Essentials in R]]></title>
			<link>https://www.sthda.com/english/articles/40-regression-analysis/163-regression-with-categorical-variables-dummy-coding-essentials-in-r/</link>
			<guid>https://www.sthda.com/english/articles/40-regression-analysis/163-regression-with-categorical-variables-dummy-coding-essentials-in-r/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">





<p>This chapter describes how to compute <strong>regression with categorical variables</strong>.</p>
<p><strong>Categorical variables</strong> (also known as <em>factor</em> or <em>qualitative variables</em>) are variables that classify observations into groups. They have a limited number of different values, called levels. For example, the gender of individuals is a categorical variable that can take two levels: Male or Female.</p>
<p>Regression analysis requires numerical variables. So, when a researcher wishes to include a categorical variable in a regression model, supplementary steps are required to make the results interpretable.</p>
<p>In these steps, the categorical variables are recoded into a set of separate binary variables. This recoding is called “dummy coding” and leads to the creation of a table called <em>contrast matrix</em>. This is done automatically by statistical software, such as R.</p>
<p>Here, you’ll learn how to build and interpret a linear regression model with categorical predictor variables. We’ll also provide practical examples in R.</p>
<p>Contents:</p>
<div id="TOC">
<ul>
<li><a href="#loading-required-r-packages">Loading Required R packages</a></li>
<li><a href="#example-of-data-set">Example of data set</a></li>
<li><a href="#categorical-variables-with-two-levels">Categorical variables with two levels</a></li>
<li><a href="#categorical-variables-with-more-than-two-levels">Categorical variables with more than two levels</a></li>
<li><a href="#discussion">Discussion</a></li>
</ul>
</div>
<br/>
<div class = "small-block content-privileged-friends navr-book">
  <p>The Book:</p>
        <a href = "https://www.sthda.com/english/web/5-bookadvisor/54-machine-learning-essentials/">
          <img src = "https://www.sthda.com/english/upload/machine-learning-essentials-frontcover-200px.png" /><br/>
     Machine Learning Essentials: Practical Guide in R
      </a>
</div>
<div class="spacer"></div>


<div id="loading-required-r-packages" class="section level2">
<h2>Loading Required R packages</h2>
<ul>
<li><code>tidyverse</code> for easy data manipulation and visualization</li>
</ul>
<pre class="r"><code>library(tidyverse)</code></pre>
</div>
<div id="example-of-data-set" class="section level2">
<h2>Example of data set</h2>
<p>We’ll use the <code>Salaries</code> data set [<code>car</code> package], which contains the 2008-09 nine-month academic salaries of Assistant Professors, Associate Professors and Professors in a college in the U.S.</p>
<p>The data were collected as part of the ongoing effort of the college’s administration to monitor salary differences between male and female faculty members.</p>
<pre class="r"><code># Load the data
data("Salaries", package = "car")
# Inspect the data
sample_n(Salaries, 3)</code></pre>
<pre><code>##     rank discipline yrs.since.phd yrs.service    sex salary
## 115 Prof          A            12           0 Female 105000
## 313 Prof          A            29          19   Male  94350
## 162 Prof          B            26          19   Male 176500</code></pre>
</div>
<div id="categorical-variables-with-two-levels" class="section level2">
<h2>Categorical variables with two levels</h2>
<p>Recall that the regression equation, for predicting an outcome variable (y) on the basis of a predictor variable (x), can be simply written as <code>y = b0 + b1*x</code>. <code>b0</code> and <code>b1</code> are the regression beta coefficients, representing the intercept and the slope, respectively.</p>
<p>Suppose that we wish to investigate differences in salaries between males and females.</p>
<p>Based on the gender variable, we can create a new dummy variable that takes the value:</p>
<ul>
<li><code>1</code> if a person is male</li>
<li><code>0</code> if a person is female</li>
</ul>
<p>and use this variable as a predictor in the regression equation, leading to the following model:</p>
<ul>
<li><code>b0 + b1</code> if the person is male</li>
<li><code>b0</code> if the person is female</li>
</ul>
<p>The coefficients can be interpreted as follows:</p>
<ol style="list-style-type: decimal">
<li><code>b0</code> is the average salary among females,</li>
<li><code>b0 + b1</code> is the average salary among males,</li>
<li>and <code>b1</code> is the average difference in salary between males and females.</li>
</ol>
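<p>Before turning to the real data, these interpretations can be verified on simulated data (hypothetical salaries, not the <code>Salaries</code> data set), where the group means are recovered exactly:</p>
<pre class="r"><code># Simulate salaries for 50 females and 50 males
set.seed(42)
salary <- c(rnorm(50, mean = 101000, sd = 9000),  # females
            rnorm(50, mean = 115000, sd = 9000))  # males
sex <- factor(rep(c("Female", "Male"), each = 50))  # Female = baseline
# Fit the model; R dummy-codes sex automatically
fit <- lm(salary ~ sex)
b0 <- coef(fit)[["(Intercept)"]]
b1 <- coef(fit)[["sexMale"]]
# b0 is the female mean; b0 + b1 is the male mean
all.equal(b0, mean(salary[sex == "Female"]))     # TRUE
all.equal(b0 + b1, mean(salary[sex == "Male"]))  # TRUE</code></pre>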
<p>For simple demonstration purposes, the following example models the salary difference between males and females by computing a simple linear regression model on the <code>Salaries</code> data set [<code>car</code> package]. R creates the dummy variables automatically:</p>
<pre class="r"><code># Compute the model
model <- lm(salary ~ sex, data = Salaries)
summary(model)$coef</code></pre>
<pre><code>##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   101002       4809   21.00 2.68e-66
## sexMale        14088       5065    2.78 5.67e-03</code></pre>
<p>From the output above, the average salary for females is estimated to be 101002, whereas that for males is estimated to be 101002 + 14088 = 115090. The p-value for the dummy variable <code>sexMale</code> is highly significant, suggesting that there is statistical evidence of a difference in average salary between the genders.</p>
<p>The <code>contrasts()</code> function returns the coding that R has used to create the dummy variables:</p>
<pre class="r"><code>contrasts(Salaries$sex)</code></pre>
<pre><code>##        Male
## Female    0
## Male      1</code></pre>
<p>R has created a sexMale dummy variable that takes on a value of 1 if the sex is Male, and 0 otherwise. The decision to code males as 1 and females as 0 (the baseline) is arbitrary and has no effect on the regression computation, but it does alter the interpretation of the coefficients.</p>
<p>You can use the function <code>relevel()</code> to set the baseline category to males as follows:</p>
<pre class="r"><code>Salaries <- Salaries %>%
  mutate(sex = relevel(sex, ref = "Male"))</code></pre>
<p>The output of the regression fit becomes:</p>
<pre class="r"><code>model <- lm(salary ~ sex, data = Salaries)
summary(model)$coef</code></pre>
<pre><code>##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept)   115090       1587   72.50 2.46e-230
## sexFemale     -14088       5065   -2.78  5.67e-03</code></pre>
<p>The fact that the coefficient for <code>sexFemale</code> in the regression output is negative indicates that being a Female is associated with a decrease in salary (relative to Males).</p>
<p>Now the estimates for <code>b0</code> and <code>b1</code> are 115090 and -14088, respectively, leading once again to a prediction of an average salary of 115090 for males and 115090 - 14088 = 101002 for females.</p>
<p>Alternatively, instead of a 0/1 coding scheme, we could create a dummy variable coded -1 (male) / 1 (female). This results in the model:</p>
<ul>
<li><code>b0 - b1</code> if the person is male</li>
<li><code>b0 + b1</code> if the person is female</li>
</ul>
<p>So, if the categorical variable is coded as -1 and 1, and the regression coefficient is positive, it is subtracted from the group coded as -1 and added to the group coded as 1. If the regression coefficient is negative, the addition and subtraction are reversed.</p>
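<p>A minimal sketch of this -1/1 coding, using base R’s <code>contr.sum</code> on simulated data (with the factor levels ordered as Female, Male, <code>contr.sum</code> codes Female as 1 and Male as -1):</p>
<pre class="r"><code># Simulate salaries for two groups
set.seed(1)
salary <- c(rnorm(50, mean = 115000, sd = 9000),  # males
            rnorm(50, mean = 101000, sd = 9000))  # females
sex <- factor(rep(c("Male", "Female"), each = 50))
contrasts(sex) <- contr.sum(2)  # Female = 1, Male = -1
fit <- lm(salary ~ sex)
b0 <- coef(fit)[["(Intercept)"]]
b1 <- coef(fit)[["sex1"]]
m.f <- mean(salary[sex == "Female"])
m.m <- mean(salary[sex == "Male"])
# b0 is the midpoint of the two group means;
# b1 is added for the group coded 1 and subtracted for the group coded -1
all.equal(b0, (m.f + m.m)/2)  # TRUE
all.equal(b0 + b1, m.f)       # TRUE
all.equal(b0 - b1, m.m)       # TRUE</code></pre>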
</div>
<div id="categorical-variables-with-more-than-two-levels" class="section level2">
<h2>Categorical variables with more than two levels</h2>
<p>Generally, a categorical variable with n levels will be transformed into n-1 variables, each with two levels. These n-1 new variables contain the same information as the single variable. This recoding creates a table called the <strong>contrast matrix</strong>.</p>
<p>For example <code>rank</code> in the <code>Salaries</code> data has three levels: “AsstProf”, “AssocProf” and “Prof”. This variable could be dummy coded into two variables, one called AssocProf and one Prof:</p>
<ul>
<li>If rank = AssocProf, then the column AssocProf would be coded with a 1 and Prof with a 0.</li>
<li>If rank = Prof, then the column AssocProf would be coded with a 0 and Prof would be coded with a 1.</li>
<li>If rank = AsstProf, then both columns “AssocProf” and “Prof” would be coded with a 0.</li>
</ul>
<p>This dummy coding is automatically performed by R. For demonstration purposes, you can use the function <code>model.matrix()</code> to create a contrast matrix for a factor variable:</p>
<pre class="r"><code>res <- model.matrix(~rank, data = Salaries)
head(res[, -1])</code></pre>
<pre><code>##   rankAssocProf rankProf
## 1             0        1
## 2             0        1
## 3             0        0
## 4             0        1
## 5             0        1
## 6             1        0</code></pre>
<p>When building a linear model, there are different ways to encode categorical variables, known as contrast coding systems. The default option in R is to use the first level of the factor as a reference and to interpret the remaining levels relative to this level.</p>
<p>Note that ANOVA (analysis of variance) is just a special case of the linear model where the predictors are categorical variables. And, because R understands that ANOVA and regression are both examples of linear models, it lets you extract the classic ANOVA table from your regression model using the base R <code>anova()</code> function or the <code>Anova()</code> function [in the <code>car</code> package]. We generally recommend the <code>Anova()</code> function because it automatically takes care of unbalanced designs.</p>
<p>The results of predicting salary from the other variables, using a multiple regression procedure, are presented below.</p>
<pre class="r"><code>library(car)
model2 <- lm(salary ~ yrs.service + rank + discipline + sex,
             data = Salaries)
Anova(model2)</code></pre>
<pre><code>## Anova Table (Type II tests)
## 
## Response: salary
##               Sum Sq  Df F value  Pr(>F)    
## yrs.service 3.24e+08   1    0.63    0.43    
## rank        1.03e+11   2  100.26 < 2e-16 ***
## discipline  1.74e+10   1   33.86 1.2e-08 ***
## sex         7.77e+08   1    1.51    0.22    
## Residuals   2.01e+11 391                    
## ---
## Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1</code></pre>
<p>Taking the other variables (yrs.service, rank and discipline) into account, it can be seen that the categorical variable sex is no longer significantly associated with the variation in salary between individuals. The significant variables are rank and discipline.</p>
<p>If you want to interpret the contrasts of the categorical variable, type this:</p>
<pre class="r"><code>summary(model2)</code></pre>
<pre><code>## 
## Call:
## lm(formula = salary ~ yrs.service + rank + discipline + sex, 
##     data = Salaries)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -64202 -14255  -1533  10571  99163 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    73122.9     3245.3   22.53  < 2e-16 ***
## yrs.service      -88.8      111.6   -0.80  0.42696    
## rankAssocProf  14560.4     4098.3    3.55  0.00043 ***
## rankProf       49159.6     3834.5   12.82  < 2e-16 ***
## disciplineB    13473.4     2315.5    5.82  1.2e-08 ***
## sexFemale      -4771.2     3878.0   -1.23  0.21931    
## ---
## Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1
## 
## Residual standard error: 22700 on 391 degrees of freedom
## Multiple R-squared:  0.448,  Adjusted R-squared:  0.441 
## F-statistic: 63.4 on 5 and 391 DF,  p-value: <2e-16</code></pre>
<p>For example, it can be seen that being from discipline B (applied departments) is significantly associated with an average increase of 13473.38 in salary compared to discipline A (theoretical departments).</p>
</div>
<div id="discussion" class="section level2">
<h2>Discussion</h2>
<p>In this chapter we described how categorical variables are included in a linear regression model. As regression requires numerical inputs, categorical variables need to be recoded into sets of binary variables.</p>
<p>We provided practical examples for situations where you have categorical variables containing two or more levels.</p>
<p>Note that, for categorical variables with a large number of levels, it might be useful to group together some of the levels.</p>
<p>Some categorical variables have levels that are ordered. These can be converted to numerical values and used as is. For example, the professor ranks (“AsstProf”, “AssocProf” and “Prof”) have a natural ordering, so you can convert them into numerical values ordered from low to high.</p>
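<p>A minimal sketch of this conversion, using hypothetical rank values:</p>
<pre class="r"><code># Create an ordered factor: AsstProf < AssocProf < Prof
prof.rank <- factor(c("AsstProf", "Prof", "AssocProf", "Prof"),
                    levels = c("AsstProf", "AssocProf", "Prof"),
                    ordered = TRUE)
# Convert to numerical scores, ordered from low to high
rank.score <- as.numeric(prof.rank)
rank.score  # 1 3 2 3</code></pre>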
</div>


</div><!--end rdoc-->


<!-- END HTML -->]]></description>
			<pubDate>Sun, 11 Mar 2018 12:38:00 +0100</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Nonlinear Regression Essentials in R: Polynomial and Spline Regression Models]]></title>
			<link>https://www.sthda.com/english/articles/40-regression-analysis/162-nonlinear-regression-essentials-in-r-polynomial-and-spline-regression-models/</link>
			<guid>https://www.sthda.com/english/articles/40-regression-analysis/162-nonlinear-regression-essentials-in-r-polynomial-and-spline-regression-models/</guid>
			<description><![CDATA[<!-- START HTML -->


  <div id="rdoc">





<p>In some cases, the true relationship between the outcome and a predictor variable might not be linear.</p>
<p>There are different solutions extending the linear regression model (Chapter @ref(linear-regression)) for capturing these nonlinear effects, including:</p>
<ul>
<li><p><strong>Polynomial regression</strong>. This is the simplest approach to modeling non-linear relationships. It adds polynomial terms (squares, cubes, etc.) to a regression.</p></li>
<li><p><strong>Spline regression</strong>. Fits a smooth curve with a series of polynomial segments. The values delimiting the spline segments are called <strong>Knots</strong>.</p></li>
<li><p><strong>Generalized additive models</strong> (GAM). Fits spline models with automated selection of knots.</p></li>
</ul>
<p>In this chapter, you’ll learn how to compute non-linear regression models and how to compare the different models in order to choose the one that best fits your data.</p>
<p>The RMSE and the R2 metrics will be used to compare the different models (see Chapter @ref(linear-regression)).</p>
<p>Recall that the RMSE represents the model prediction error, that is, the average difference between the observed outcome values and the predicted outcome values. The R2 represents the squared correlation between the observed and predicted outcome values. The best model is the one with the lowest RMSE and the highest R2.</p>
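<p>For reference, the two metrics can be sketched in base R as follows (the examples in this chapter use the <code>RMSE()</code> and <code>R2()</code> helpers from <code>caret</code>):</p>
<pre class="r"><code># Model prediction error
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))
# Squared correlation between observed and predicted values
r2 <- function(observed, predicted) cor(observed, predicted)^2
# A small worked example with made-up values
obs  <- c(3, 5, 7, 9)
pred <- c(2.8, 5.1, 7.3, 8.6)
rmse(obs, pred)  # sqrt(0.075), about 0.274
r2(obs, pred)</code></pre>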
<p>Contents:</p>
<div id="TOC">
<ul>
<li><a href="#loading-required-r-packages">Loading Required R packages</a></li>
<li><a href="#preparing-the-data">Preparing the data</a></li>
<li><a href="#linear-regression-linear-reg">Linear regression</a></li>
<li><a href="#polynomial-regression">Polynomial regression</a></li>
<li><a href="#log-transformation">Log transformation</a></li>
<li><a href="#spline-regression">Spline regression</a></li>
<li><a href="#generalized-additive-models">Generalized additive models</a></li>
<li><a href="#comparing-the-models">Comparing the models</a></li>
<li><a href="#discussion">Discussion</a></li>
<li><a href="#references">References</a></li>
</ul>
</div>
<br/>
<div class = "small-block content-privileged-friends navr-book">
  <p>The Book:</p>
        <a href = "https://www.sthda.com/english/web/5-bookadvisor/54-machine-learning-essentials/">
          <img src = "https://www.sthda.com/english/upload/machine-learning-essentials-frontcover-200px.png" /><br/>
     Machine Learning Essentials: Practical Guide in R
      </a>
</div>
<div class="spacer"></div>


<div id="loading-required-r-packages" class="section level2">
<h2>Loading Required R packages</h2>
<ul>
<li><code>tidyverse</code> for easy data manipulation and visualization</li>
<li><code>caret</code> for easy machine learning workflow</li>
</ul>
<pre class="r"><code>library(tidyverse)
library(caret)
theme_set(theme_classic())</code></pre>
</div>
<div id="preparing-the-data" class="section level2">
<h2>Preparing the data</h2>
<p>We’ll use the <code>Boston</code> data set [in <code>MASS</code> package], introduced in Chapter @ref(regression-analysis), for predicting the median house value (<code>medv</code>), in Boston Suburbs, based on the predictor variable <code>lstat</code> (percentage of lower status of the population).</p>
<p>We’ll randomly split the data into a training set (80%, for building a predictive model) and a test set (20%, for evaluating the model). Make sure to set the seed for reproducibility.</p>
<pre class="r"><code># Load the data
data("Boston", package = "MASS")
# Split the data into training and test set
set.seed(123)
training.samples <- Boston$medv %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data  <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]</code></pre>
<p>First, visualize the scatter plot of the <code>medv</code> vs <code>lstat</code> variables as follows:</p>
<pre class="r"><code>ggplot(train.data, aes(lstat, medv) ) +
  geom_point() +
  stat_smooth()</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/machine-learning-essentials/009-polynomial-and-spline-regression-scatter-plot-1.png" width="384" /></p>
<div class="success">
<p>
The above scatter plot suggests a non-linear relationship between the two variables.
</p>
</div>
<p>In the following sections, we start by computing linear and non-linear regression models. Next, we’ll compare the different models in order to choose the best one for our data.</p>
</div>
<div id="linear-regression-linear-reg" class="section level2">
<h2>Linear regression</h2>
<p>The standard linear regression model equation can be written as <code>medv = b0 + b1*lstat</code>.</p>
<p>Compute linear regression model:</p>
<pre class="r"><code># Build the model
model <- lm(medv ~ lstat, data = train.data)
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  R2 = R2(predictions, test.data$medv)
)</code></pre>
<pre><code>##   RMSE    R2
## 1 6.07 0.535</code></pre>
<p>Visualize the data:</p>
<pre class="r"><code>ggplot(train.data, aes(lstat, medv) ) +
  geom_point() +
  stat_smooth(method = lm, formula = y ~ x)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/machine-learning-essentials/009-polynomial-and-spline-regression-linear-regression-1.png" width="384" /></p>
</div>
<div id="polynomial-regression" class="section level2">
<h2>Polynomial regression</h2>
<p>Polynomial regression adds polynomial terms (squares, cubes, etc.) to the regression equation, as follows:</p>
<p><span class="math display">\[medv = b0 + b1*lstat + b2*lstat^2\]</span></p>
<p>In R, to create a predictor x^2 you should use the function <code>I()</code>, as follows: <code>I(x^2)</code>. This raises x to the power of 2.</p>
<p>The polynomial regression can be computed in R as follows:</p>
<pre class="r"><code>lm(medv ~ lstat + I(lstat^2), data = train.data)</code></pre>
<p>An alternative simple solution is to use this:</p>
<pre class="r"><code>lm(medv ~ poly(lstat, 2, raw = TRUE), data = train.data)</code></pre>
<pre><code>## 
## Call:
## lm(formula = medv ~ poly(lstat, 2, raw = TRUE), data = train.data)
## 
## Coefficients:
##                 (Intercept)  poly(lstat, 2, raw = TRUE)1  
##                      43.351                       -2.340  
## poly(lstat, 2, raw = TRUE)2  
##                       0.043</code></pre>
<p>The output contains two coefficients associated with lstat: one for the linear term (lstat^1) and one for the quadratic term (lstat^2).</p>
<p>The following example computes a sixth-order polynomial fit:</p>
<pre class="r"><code>lm(medv ~ poly(lstat, 6, raw = TRUE), data = train.data) %>%
  summary()</code></pre>
<pre><code>## 
## Call:
## lm(formula = medv ~ poly(lstat, 6, raw = TRUE), data = train.data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -14.23  -3.24  -0.74   2.02  26.50 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  7.14e+01   6.00e+00   11.90  < 2e-16 ***
## poly(lstat, 6, raw = TRUE)1 -1.45e+01   3.22e+00   -4.48  9.6e-06 ***
## poly(lstat, 6, raw = TRUE)2  1.87e+00   6.26e-01    2.98    0.003 ** 
## poly(lstat, 6, raw = TRUE)3 -1.32e-01   5.73e-02   -2.30    0.022 *  
## poly(lstat, 6, raw = TRUE)4  4.98e-03   2.66e-03    1.87    0.062 .  
## poly(lstat, 6, raw = TRUE)5 -9.56e-05   6.03e-05   -1.58    0.114    
## poly(lstat, 6, raw = TRUE)6  7.29e-07   5.30e-07    1.38    0.170    
## ---
## Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1
## 
## Residual standard error: 5.28 on 400 degrees of freedom
## Multiple R-squared:  0.684,  Adjusted R-squared:  0.679 
## F-statistic:  144 on 6 and 400 DF,  p-value: <2e-16</code></pre>
<p>From the output above, it can be seen that polynomial terms beyond the fifth order are not significant. So, just create a fifth-order polynomial regression model as follows:</p>
<pre class="r"><code># Build the model
model <- lm(medv ~ poly(lstat, 5, raw = TRUE), data = train.data)
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  R2 = R2(predictions, test.data$medv)
)</code></pre>
<pre><code>##   RMSE    R2
## 1 4.96 0.689</code></pre>
<p>Visualize the fifth-order polynomial regression line as follows:</p>
<pre class="r"><code>ggplot(train.data, aes(lstat, medv) ) +
  geom_point() +
  stat_smooth(method = lm, formula = y ~ poly(x, 5, raw = TRUE))</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/machine-learning-essentials/009-polynomial-and-spline-regression-polynomial-regression-1.png" width="384" /></p>
</div>
<div id="log-transformation" class="section level2">
<h2>Log transformation</h2>
<p>When you have a non-linear relationship, you can also try a logarithmic transformation of the predictor variables:</p>
<pre class="r"><code># Build the model
model <- lm(medv ~ log(lstat), data = train.data)
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  R2 = R2(predictions, test.data$medv)
)</code></pre>
<pre><code>##   RMSE    R2
## 1 5.24 0.657</code></pre>
<p>Visualize the data:</p>
<pre class="r"><code>ggplot(train.data, aes(lstat, medv) ) +
  geom_point() +
  stat_smooth(method = lm, formula = y ~ log(x))</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/machine-learning-essentials/009-polynomial-and-spline-regression-log-transformation-1.png" width="384" /></p>
</div>
<div id="spline-regression" class="section level2">
<h2>Spline regression</h2>
<p>Polynomial regression only captures a certain amount of curvature in a nonlinear relationship. An alternative, and often superior, approach to modeling nonlinear relationships is to use splines <span class="citation">(P. Bruce and Bruce 2017)</span>.</p>
<p>Splines provide a way to smoothly interpolate between fixed points, called knots. Polynomial regression is computed between knots. In other words, splines are series of polynomial segments strung together, joining at knots <span class="citation">(P. Bruce and Bruce 2017)</span>.</p>
<p>The R package <code>splines</code> includes the function <code>bs</code> for creating a b-spline term in a regression model.</p>
<p>You need to specify two parameters: the degree of the polynomial and the location of the knots. In our example, we’ll place the knots at the lower quartile, the median and the upper quartile:</p>
<pre class="r"><code>knots <- quantile(train.data$lstat, p = c(0.25, 0.5, 0.75))</code></pre>
<p>We’ll create a model using a cubic spline (degree = 3):</p>
<pre class="r"><code>library(splines)
# Build the model
knots <- quantile(train.data$lstat, p = c(0.25, 0.5, 0.75))
model <- lm (medv ~ bs(lstat, knots = knots), data = train.data)
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  R2 = R2(predictions, test.data$medv)
)</code></pre>
<pre><code>##   RMSE    R2
## 1 4.97 0.688</code></pre>
<p>Note that the coefficients for a spline term are not interpretable.</p>
<p>Visualize the cubic spline as follow:</p>
<pre class="r"><code>ggplot(train.data, aes(lstat, medv) ) +
  geom_point() +
  stat_smooth(method = lm, formula = y ~ splines::bs(x, df = 3))</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/machine-learning-essentials/009-polynomial-and-spline-regression-cubic-spline-1.png" width="384" /></p>
</div>
<div id="generalized-additive-models" class="section level2">
<h2>Generalized additive models</h2>
<p>Once you have detected a non-linear relationship in your data, polynomial terms may not be flexible enough to capture it, and spline terms require you to specify the knots.</p>
<p>Generalized additive models, or GAM, are a technique to automatically fit a spline regression. This can be done using the <code>mgcv</code> R package:</p>
<pre class="r"><code>library(mgcv)
# Build the model
model <- gam(medv ~ s(lstat), data = train.data)
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  R2 = R2(predictions, test.data$medv)
)</code></pre>
<pre><code>##   RMSE    R2
## 1 5.02 0.684</code></pre>
<p>The term <code>s(lstat)</code> tells the <code>gam()</code> function to find the “best” knots for a spline term.</p>
<p>Visualize the data:</p>
<pre class="r"><code>ggplot(train.data, aes(lstat, medv) ) +
  geom_point() +
  stat_smooth(method = gam, formula = y ~ s(x))</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/machine-learning-essentials/009-polynomial-and-spline-regression-generalized-additive-model-1.png" width="384" /></p>
</div>
<div id="comparing-the-models" class="section level2">
<h2>Comparing the models</h2>
<p>Comparing the RMSE and the R2 metrics of the different models shows that the polynomial regression, the spline regression and the generalized additive model all outperform the linear regression model and the log transformation approach.</p>
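<p>The comparison above can be reproduced as a short sketch, refitting each model on the same training data and computing the test metrics side by side. This assumes the <code>train.data</code>/<code>test.data</code> split and the <code>caret</code> functions <code>RMSE()</code> and <code>R2()</code> used earlier in the chapter; the polynomial degree shown here is illustrative.</p>
<pre class="r"><code>library(splines)
library(mgcv)
# Fit the candidate models on the training data
models <- list(
  linear = lm(medv ~ lstat, data = train.data),
  poly   = lm(medv ~ poly(lstat, 5, raw = TRUE), data = train.data),
  spline = lm(medv ~ bs(lstat, df = 3), data = train.data),
  gam    = gam(medv ~ s(lstat), data = train.data)
)
# Test-set RMSE and R2 for each model
sapply(models, function(m) {
  pred <- as.numeric(predict(m, test.data))
  c(RMSE = RMSE(pred, test.data$medv), R2 = R2(pred, test.data$medv))
})</code></pre>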
</div>
<div id="discussion" class="section level2">
<h2>Discussion</h2>
<p>This chapter describes how to compute non-linear regression models using R.</p>
</div>
<div id="references" class="section level2 unnumbered">
<h2>References</h2>
<div id="refs" class="references">
<div id="ref-bruce2017">
<p>Bruce, Peter, and Andrew Bruce. 2017. <em>Practical Statistics for Data Scientists</em>. O’Reilly Media.</p>
</div>
</div>
</div>


</div><!--end rdoc-->

 
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
  (function () {
    var script = document.createElement("script");
    script.type = "text/javascript";
    script.src  = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
    document.getElementsByTagName("head")[0].appendChild(script);
  })();
</script>


<!-- END HTML -->]]></description>
			<pubDate>Sun, 11 Mar 2018 12:33:00 +0100</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Linear Regression Assumptions and Diagnostics in R: Essentials]]></title>
			<link>https://www.sthda.com/english/articles/39-regression-model-diagnostics/161-linear-regression-assumptions-and-diagnostics-in-r-essentials/</link>
			<guid>https://www.sthda.com/english/articles/39-regression-model-diagnostics/161-linear-regression-assumptions-and-diagnostics-in-r-essentials/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">





<p>Linear regression (Chapter @ref(linear-regression)) makes several assumptions about the data at hand. This chapter describes <strong>regression assumptions</strong> and provides built-in plots for <strong>regression diagnostics</strong> in R programming language.</p>
<p>After performing a regression analysis, you should always check if the model works well for the data at hand.</p>
<p>A first step of this regression diagnostic is to inspect the significance of the regression beta coefficients, as well as the R2, which tells us how well the linear regression model fits the data. This has been described in Chapters @ref(linear-regression) and @ref(cross-validation).</p>
<p>In this current chapter, you will learn additional steps to evaluate how well the model fits the data.</p>
<p>For example, the linear regression model makes the assumption that the relationship between the predictors (x) and the outcome variable is linear. This might not be true. The relationship could be polynomial or logarithmic.</p>
<p>Additionally, the data might contain some influential observations, such as outliers (or extreme values), that can affect the result of the regression.</p>
<p>Therefore, you should closely diagnose the regression model that you built in order to detect potential problems and to check whether the assumptions made by the linear regression model are met.</p>
<p>To do so, we generally examine the distribution of <strong>residual errors</strong>, which can tell you more about your data.</p>
<p>In this chapter,</p>
<ul>
<li>we start by explaining <strong>residual errors</strong> and <strong>fitted values</strong>.</li>
<li>next, we present linear <strong>regression assumptions</strong>, as well as potential problems you can face when performing regression analysis.</li>
<li>finally, we describe some built-in <strong>diagnostic plots</strong> in R for testing the assumptions underlying the linear regression model.</li>
</ul>
<p>Contents:</p>
<div id="TOC">
<ul>
<li><a href="#loading-required-r-packages">Loading Required R packages</a></li>
<li><a href="#example-of-data">Example of data</a></li>
<li><a href="#building-a-regression-model">Building a regression model</a></li>
<li><a href="#fitted-values-and-residuals">Fitted values and residuals</a></li>
<li><a href="#regression-assumptions">Regression assumptions</a></li>
<li><a href="#regression-diagnostics-reg-diag">Regression diagnostics</a><ul>
<li><a href="#diagnostic-plots">Diagnostic plots</a></li>
</ul></li>
<li><a href="#linearity-of-the-data">Linearity of the data</a></li>
<li><a href="#homogeneity-of-variance">Homogeneity of variance</a></li>
<li><a href="#normality-of-residuals">Normality of residuals</a></li>
<li><a href="#outliers-and-high-levarage-points">Outliers and high levarage points</a></li>
<li><a href="#influential-values">Influential values</a></li>
<li><a href="#discussion">Discussion</a></li>
<li><a href="#references">References</a></li>
</ul>
</div>
<br/>
<div class = "small-block content-privileged-friends navr-book">
  <p>The Book:</p>
        <a href = "https://www.sthda.com/english/web/5-bookadvisor/54-machine-learning-essentials/">
          <img src = "https://www.sthda.com/english/upload/machine-learning-essentials-frontcover-200px.png" /><br/>
     Machine Learning Essentials: Practical Guide in R
      </a>
</div>
<div class="spacer"></div>


<div id="loading-required-r-packages" class="section level2">
<h2>Loading Required R packages</h2>
<ul>
<li><code>tidyverse</code> for easy data manipulation and visualization</li>
<li><code>broom</code>: creates a tidy data frame from statistical test results</li>
</ul>
<pre class="r"><code>library(tidyverse)
library(broom)
theme_set(theme_classic())</code></pre>
</div>
<div id="example-of-data" class="section level2">
<h2>Example of data</h2>
<p>We’ll use the data set <code>marketing</code> [datarium package], introduced in Chapter @ref(regression-analysis).</p>
<pre class="r"><code># Load the data
data("marketing", package = "datarium")
# Inspect the data
sample_n(marketing, 3)</code></pre>
<pre><code>##     youtube facebook newspaper sales
## 58    163.4     23.0      19.9  15.8
## 157   112.7     52.2      60.6  18.4
## 81     91.7     32.0      26.8  14.2</code></pre>
</div>
<div id="building-a-regression-model" class="section level2">
<h2>Building a regression model</h2>
<p>We build a model to predict sales on the basis of the advertising budget spent on youtube media.</p>
<pre class="r"><code>model <- lm(sales ~ youtube, data = marketing)
model</code></pre>
<pre><code>## 
## Call:
## lm(formula = sales ~ youtube, data = marketing)
## 
## Coefficients:
## (Intercept)      youtube  
##      8.4391       0.0475</code></pre>
<p>Our regression equation is: <code>y = 8.44 + 0.0475*x</code>, that is, <code>sales = 8.44 + 0.0475*youtube</code>.</p>
<p>Before describing regression assumptions and regression diagnostics, we start by explaining two key concepts in regression analysis: fitted values and residual errors. These are important for understanding the diagnostic plots presented hereafter.</p>
</div>
<div id="fitted-values-and-residuals" class="section level2">
<h2>Fitted values and residuals</h2>
<p>The <strong>fitted</strong> (or <strong>predicted</strong>) values are the y-values that you would expect for the given x-values according to the built regression model (or visually, the best-fitting straight regression line).</p>
<p>In our example, for a given youtube advertising budget, the fitted (predicted) sales value would be <code>sales = 8.44 + 0.0475*youtube</code>.</p>
<p>From the scatter plot below, it can be seen that not all the data points fall exactly on the estimated regression line. This means that, for a given youtube advertising budget, the observed (or measured) sale values can be different from the predicted sale values. The difference is called the <strong>residual error</strong>, represented by vertical red lines.</p>
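<p>As a quick check, the fitted value for any given budget can be computed with the base function <code>predict()</code>, using the <code>model</code> object built above:</p>
<pre class="r"><code># Predicted sales for a youtube advertising budget of 100
predict(model, newdata = data.frame(youtube = 100))
# about 8.44 + 0.0475*100 = 13.19</code></pre>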
<p><img src="https://www.sthda.com/english/sthda-upload/images/machine-learning-essentials/linear-regression.png" alt="Linear regression" /></p>
<p>In R, you can easily augment your data to add fitted values and residuals by using the function <code>augment()</code> [broom package]. Let’s call the output <code>model.diag.metrics</code> because it contains several metrics useful for regression diagnostics. We’ll describe them later.</p>
<pre class="r"><code>model.diag.metrics <- augment(model)
head(model.diag.metrics)</code></pre>
<pre><code>##   sales youtube .fitted .se.fit .resid    .hat .sigma  .cooksd .std.resid
## 1 26.52   276.1   21.56   0.385  4.955 0.00970   3.90 7.94e-03     1.2733
## 2 12.48    53.4   10.98   0.431  1.502 0.01217   3.92 9.20e-04     0.3866
## 3 11.16    20.6    9.42   0.502  1.740 0.01649   3.92 1.69e-03     0.4486
## 4 22.20   181.8   17.08   0.277  5.119 0.00501   3.90 4.34e-03     1.3123
## 5 15.48   217.0   18.75   0.297 -3.273 0.00578   3.91 2.05e-03    -0.8393
## 6  8.64    10.4    8.94   0.525 -0.295 0.01805   3.92 5.34e-05    -0.0762</code></pre>
<p>Among the table columns, there are:</p>
<ul>
<li><code>youtube</code>: the invested youtube advertising budget</li>
<li><code>sales</code>: the observed sale values</li>
<li><code>.fitted</code>: the fitted sale values</li>
<li><code>.resid</code>: the residual errors</li>
<li>…</li>
</ul>
<p>The following R code plots the residual errors (in red) between the observed values and the fitted regression line. Each vertical red segment represents the residual error between an observed sale value and the corresponding predicted (i.e. fitted) value.</p>
<pre class="r"><code>ggplot(model.diag.metrics, aes(youtube, sales)) +
  geom_point() +
  stat_smooth(method = lm, se = FALSE) +
  geom_segment(aes(xend = youtube, yend = .fitted), color = "red", size = 0.3)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/machine-learning-essentials/011-regression-assumptions-and-diagnostics-residual-error-1.png" width="384" /></p>
<div class="block">
<p>
In order to check regression assumptions, we’ll examine the distribution of residuals.
</p>
</div>
</div>
<div id="regression-assumptions" class="section level2">
<h2>Regression assumptions</h2>
<p>Linear regression makes several assumptions about the data, such as :</p>
<ol style="list-style-type: decimal">
<li><strong>Linearity of the data</strong>. The relationship between the predictor (x) and the outcome (y) is assumed to be linear.</li>
<li><strong>Normality of residuals</strong>. The residual errors are assumed to be normally distributed.</li>
<li><strong>Homogeneity of residuals variance</strong>. The residuals are assumed to have a constant variance (<strong>homoscedasticity</strong>)</li>
<li><strong>Independence of residuals error terms</strong>.</li>
</ol>
<p>You should check whether or not these assumptions hold true. Potential problems include:</p>
<ol style="list-style-type: decimal">
<li><strong>Non-linearity</strong> of the outcome - predictor relationships</li>
<li><strong>Heteroscedasticity</strong>: Non-constant variance of error terms.</li>
<li><strong>Presence of influential values</strong> in the data that can be:
<ul>
<li>Outliers: extreme values in the outcome (y) variable</li>
<li>High-leverage points: extreme values in the predictors (x) variable</li>
</ul></li>
</ol>
<p>All these assumptions and potential problems can be checked by producing some diagnostic plots visualizing the residual errors.</p>
</div>
<div id="regression-diagnostics-reg-diag" class="section level2">
<h2>Regression diagnostics</h2>
<div id="diagnostic-plots" class="section level3">
<h3>Diagnostic plots</h3>
<p>Regression diagnostics plots can be created using the R base function <code>plot()</code> or the <code>autoplot()</code> function [ggfortify package], which creates a ggplot2-based graphics.</p>
<ul>
<li>Create the diagnostic plots with the R base function:</li>
</ul>
<pre class="r"><code>par(mfrow = c(2, 2))
plot(model)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/machine-learning-essentials/011-regression-assumptions-and-diagnostics-diagnostic-plots-1.png" width="672" /></p>
<ul>
<li>Create the diagnostic plots using ggfortify:</li>
</ul>
<pre class="r"><code>library(ggfortify)
autoplot(model)</code></pre>
<p>The diagnostic plots show residuals in four different ways:</p>
<ol style="list-style-type: decimal">
<li><p><strong>Residuals vs Fitted</strong>. Used to check the linear relationship assumption. A horizontal line without distinct patterns is an indication of a linear relationship, which is good.</p></li>
<li><p><strong>Normal Q-Q</strong>. Used to examine whether the residuals are normally distributed. It’s good if the residual points follow the straight dashed line.</p></li>
<li><p><strong>Scale-Location</strong> (or Spread-Location). Used to check the homogeneity of variance of the residuals (homoscedasticity). Horizontal line with equally spread points is a good indication of homoscedasticity. This is not the case in our example, where we have a heteroscedasticity problem.</p></li>
<li><p><strong>Residuals vs Leverage</strong>. Used to identify influential cases, that is extreme values that might influence the regression results when included or excluded from the analysis. This plot will be described further in the next sections.</p></li>
</ol>
<p>The four plots show the top 3 most extreme data points, labeled with the row numbers of the data in the data set. They might be potentially problematic. You might want to take a close look at them individually to check if there is anything special about the subject or if it could simply be a data entry error. We’ll discuss this in the following sections.</p>
<p>The metrics used to create the above plots are available in the <code>model.diag.metrics</code> data, described in the previous section.</p>
<pre class="r"><code># Add observations indices and
# drop some columns (.se.fit, .sigma) for simplification
model.diag.metrics <- model.diag.metrics %>%
  mutate(index = 1:nrow(model.diag.metrics)) %>%
  select(index, everything(), -.se.fit, -.sigma)
# Inspect the data
head(model.diag.metrics, 4)</code></pre>
<pre><code>##   index sales youtube .fitted .resid    .hat .cooksd .std.resid
## 1     1  26.5   276.1   21.56   4.96 0.00970 0.00794      1.273
## 2     2  12.5    53.4   10.98   1.50 0.01217 0.00092      0.387
## 3     3  11.2    20.6    9.42   1.74 0.01649 0.00169      0.449
## 4     4  22.2   181.8   17.08   5.12 0.00501 0.00434      1.312</code></pre>
<p>We’ll use mainly the following columns:</p>
<ul>
<li><code>.fitted</code>: fitted values</li>
<li><code>.resid</code>: residual errors</li>
<li><code>.hat</code>: hat values, used to detect high-leverage points (or extreme values in the predictors x variables)</li>
<li><code>.std.resid</code>: standardized residuals, which is the residuals divided by their standard errors. Used to detect outliers (or extreme values in the outcome y variable)</li>
<li><code>.cooksd</code>: Cook’s distance, used to detect influential values, which can be an outlier or a high leverage point</li>
</ul>
<p>In the following sections, we’ll describe in detail how to use these graphs and metrics to check the regression assumptions and to diagnose potential problems in the model.</p>
</div>
</div>
<div id="linearity-of-the-data" class="section level2">
<h2>Linearity of the data</h2>
<p>The linearity assumption can be checked by inspecting the <strong>Residuals vs Fitted</strong> plot (1st plot):</p>
<pre class="r"><code>plot(model, 1)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/machine-learning-essentials/011-regression-assumptions-and-diagnostics-residual-vs-fitted-plot-1.png" width="432" /></p>
<p>Ideally, the residual plot will show no fitted pattern. That is, the red line should be approximately horizontal at zero. The presence of a pattern may indicate a problem with some aspect of the linear model.</p>
<div class="success">
<p>
In our example, there is no pattern in the residual plot. This suggests that we can assume linear relationship between the predictors and the outcome variables.
</p>
</div>
<div class="warning">
<p>
Note that, if the residual plot indicates a non-linear relationship in the data, then a simple approach is to use non-linear transformations of the predictors, such as log(x), sqrt(x) and x^2, in the regression model.
</p>
</div>
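<p>As an illustrative sketch, such a transformation is applied directly in the model formula; it is not needed for our data, where the relationship is already linear:</p>
<pre class="r"><code># Example: log-transform the predictor (illustrative only)
model.log <- lm(sales ~ log(youtube), data = marketing)
# Residuals vs Fitted plot of the transformed model
plot(model.log, 1)</code></pre>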
</div>
<div id="homogeneity-of-variance" class="section level2">
<h2>Homogeneity of variance</h2>
<p>This assumption can be checked by examining the <em>scale-location plot</em>, also known as the <em>spread-location plot</em>.</p>
<pre class="r"><code>plot(model, 3)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/machine-learning-essentials/011-regression-assumptions-and-diagnostics-scale-location-plot-1.png" width="384" /></p>
<p>This plot shows if residuals are spread equally along the ranges of predictors. It’s good if you see a horizontal line with equally spread points. In our example, this is not the case.</p>
<p>It can be seen that the variability (variances) of the residual points increases with the value of the fitted outcome variable, suggesting non-constant variances in the residuals errors (or <em>heteroscedasticity</em>).</p>
<p>A possible solution to reduce the heteroscedasticity problem is to use a log or square root transformation of the outcome variable (y).</p>
<pre class="r"><code>model2 <- lm(log(sales) ~ youtube, data = marketing)
plot(model2, 3)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/machine-learning-essentials/011-regression-assumptions-and-diagnostics-heteroscedasticity-1.png" width="384" /></p>
</div>
<div id="normality-of-residuals" class="section level2">
<h2>Normality of residuals</h2>
<p>The QQ plot of residuals can be used to visually check the normality assumption. The normal probability plot of residuals should approximately follow a straight line.</p>
<p>In our example, all the points fall approximately along this reference line, so we can assume normality.</p>
<pre class="r"><code>plot(model, 2)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/machine-learning-essentials/011-regression-assumptions-and-diagnostics-normality-of-residuals-1.png" width="384" /></p>
</div>
<div id="outliers-and-high-levarage-points" class="section level2">
<h2>Outliers and high levarage points</h2>
<p><strong>Outliers</strong>:</p>
<p>An outlier is a point that has an extreme outcome variable value. The presence of outliers may affect the interpretation of the model, because it increases the RSE.</p>
<p>Outliers can be identified by examining the <em>standardized residual</em> (or <em>studentized residual</em>), which is the residual divided by its estimated standard error. Standardized residuals can be interpreted as the number of standard errors away from the regression line.</p>
<p>Observations whose standardized residuals are greater than 3 in absolute value are possible outliers <span class="citation">(James et al. 2014)</span>.</p>
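<p>This rule can be checked directly with the base function <code>rstandard()</code>, using the <code>model</code> built above:</p>
<pre class="r"><code># Standardized residuals of the model
std.resid <- rstandard(model)
# Indices of possible outliers (absolute standardized residual above 3)
which(abs(std.resid) > 3)</code></pre>
<p>In our example this returns no observations: no standardized residual exceeds 3 in absolute value.</p>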
<p><strong>High leverage points</strong>:</p>
<p>A data point has high leverage, if it has extreme predictor x values. This can be detected by examining the leverage statistic or the <em>hat-value</em>. A value of this statistic above <code>2(p + 1)/n</code> indicates an observation with high leverage <span class="citation">(P. Bruce and Bruce 2017)</span>; where, <code>p</code> is the number of predictors and <code>n</code> is the number of observations.</p>
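<p>Similarly, the hat values can be extracted with the base function <code>hatvalues()</code> and compared to the threshold; here p = 1 predictor and n = 200 observations, as in the <code>model</code> built above:</p>
<pre class="r"><code># Leverage statistics (hat values)
hat <- hatvalues(model)
# Threshold: 2(p + 1)/n
threshold <- 2 * (1 + 1) / nrow(marketing)   # = 0.02
# Indices of high-leverage points, if any
which(hat > threshold)</code></pre>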
<p>Outliers and high leverage points can be identified by inspecting the <em>Residuals vs Leverage</em> plot:</p>
<pre class="r"><code>plot(model, 5)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/machine-learning-essentials/011-regression-assumptions-and-diagnostics-residuals-vs-leverage-points-1.png" width="384" /></p>
<div class="success">
<p>
The plot above highlights the top 3 most extreme points (#26, #36 and #179), with standardized residuals below -2. However, no outliers exceed 3 standard deviations in absolute value, which is good.
</p>
<p>
Additionally, there are no high leverage points in the data. That is, all data points have a leverage statistic below 2(p + 1)/n = 4/200 = 0.02.
</p>
</div>
</div>
<div id="influential-values" class="section level2">
<h2>Influential values</h2>
<p>An influential value is a value whose inclusion or exclusion can alter the results of the regression analysis. Such a value is associated with a large residual.</p>
<p>Not all outliers (or extreme data points) are influential in linear regression analysis.</p>
<p>Statisticians have developed a metric called <em>Cook’s distance</em> to determine the influence of a value. This metric defines influence as a combination of leverage and residual size.</p>
<p>A rule of thumb is that an observation has high influence if Cook’s distance exceeds <code>4/(n - p - 1)</code><span class="citation">(P. Bruce and Bruce 2017)</span>, where <code>n</code> is the number of observations and <code>p</code> the number of predictor variables.</p>
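<p>Using the <code>model</code> built above, observations exceeding this cutoff can be listed with the base function <code>cooks.distance()</code>:</p>
<pre class="r"><code># Cook distance for each observation
cooksd <- cooks.distance(model)
# Rule-of-thumb cutoff: 4/(n - p - 1), with p = 1 predictor
cutoff <- 4 / (nrow(marketing) - 1 - 1)
# Indices of potentially influential observations
which(cooksd > cutoff)</code></pre>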
<p>The <em>Residuals vs Leverage</em> plot can help us to find influential observations if any. On this plot, outlying values are generally located at the upper right corner or at the lower right corner. Those spots are the places where data points can be influential against a regression line.</p>
<p>The following plots illustrate the Cook’s distance and the leverage of our model:</p>
<pre class="r"><code># Cook&amp;#39;s distance
plot(model, 4)
# Residuals vs Leverage
plot(model, 5)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/machine-learning-essentials/011-regression-assumptions-and-diagnostics-cooks-distance-1.png" width="720" /></p>
<p>By default, the top 3 most extreme values are labelled on the Cook’s distance plot. If you want to label the top 5 extreme values, specify the option <code>id.n</code> as follow:</p>
<pre class="r"><code>plot(model, 4, id.n = 5)</code></pre>
<p>If you want to look at the top 3 observations with the highest Cook’s distance, in case you want to assess them further, type this R code:</p>
<pre class="r"><code>model.diag.metrics %>%
  top_n(3, wt = .cooksd)</code></pre>
<pre><code>##   index sales youtube .fitted .resid   .hat .cooksd .std.resid
## 1    26  14.4     315    23.4  -9.04 0.0142  0.0389      -2.33
## 2    36  15.4     349    25.0  -9.66 0.0191  0.0605      -2.49
## 3   179  14.2     332    24.2 -10.06 0.0165  0.0563      -2.59</code></pre>
<div class="warning">
<p>
When data points have high Cook’s distance scores and lie at the upper or lower right of the leverage plot, they have leverage, meaning they are influential to the regression results. The regression results will be altered if we exclude those cases.
</p>
</div>
<div class="success">
<p>
In our example, the data don’t present any influential points. Cook’s distance lines (red dashed lines) are not shown on the Residuals vs Leverage plot because all points are well inside of them.
</p>
</div>
<p>Let’s show now another example, where the data contain two extremes values with potential influence on the regression results:</p>
<pre class="r"><code>df2 <- data.frame(
  x = c(marketing$youtube, 500, 600),
  y = c(marketing$sales, 80, 100)
)
model2 <- lm(y ~ x, df2)</code></pre>
<p>Create the <em>Residuals vs Leverage</em> plot of the two models:</p>
<pre class="r"><code># Cook&amp;#39;s distance
plot(model2, 4)
# Residuals vs Leverage
plot(model2, 5)</code></pre>
<p><img src="https://www.sthda.com/english/sthda-upload/figures/machine-learning-essentials/011-regression-assumptions-and-diagnostics-influential-observations-1.png" width="720" /></p>
<p>On the Residuals vs Leverage plot, look for data points outside of the dashed Cook’s distance lines. Points that fall outside these lines have high Cook’s distance scores, meaning they are influential to the regression results. The regression results will be altered if we exclude those cases.</p>
<p>In the above example 2, two data points are far beyond the Cook’s distance lines, while the other residuals appear clustered on the left. The plot identifies the influential observations as #201 and #202. If you exclude these points from the analysis, the slope coefficient changes from 0.06 to 0.04 and the R2 from 0.5 to 0.6. Pretty big impact!</p>
</div>
<div id="discussion" class="section level2">
<h2>Discussion</h2>
<p>This chapter describes linear regression assumptions and shows how to diagnose potential problems in the model.</p>
<p>The diagnosis is essentially performed by visualizing the residuals. Patterns in the residuals are not a stop signal, but they suggest that your current regression model might not be the best way to understand your data.</p>
<p>Potential problems might be:</p>
<ul>
<li><p>A non-linear relationship between the outcome and the predictor variables. When facing this problem, one solution is to include non-linear terms in the model, such as polynomial terms or a log transformation. See Chapter @ref(polynomial-and-spline-regression).</p></li>
<li><p>Existence of important variables that you left out from your model. Other variables you didn’t include (e.g., age or gender) may play an important role in your model and data. See Chapter @ref(confounding-variables).</p></li>
<li><p>Presence of outliers. If you believe that an outlier has occurred due to an error in data collection and entry, then one solution is to simply remove the concerned observation.</p></li>
</ul>
</div>
<div id="references" class="section level2 unnumbered">
<h2>References</h2>
<div id="refs" class="references">
<div id="ref-bruce2017">
<p>Bruce, Peter, and Andrew Bruce. 2017. <em>Practical Statistics for Data Scientists</em>. O’Reilly Media.</p>
</div>
<div id="ref-james2014">
<p>James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014. <em>An Introduction to Statistical Learning: With Applications in R</em>. Springer Publishing Company, Incorporated.</p>
</div>
</div>
</div>


</div><!--end rdoc-->


<!-- END HTML -->]]></description>
			<pubDate>Sun, 11 Mar 2018 12:20:00 +0100</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Multicollinearity Essentials and VIF in R]]></title>
			<link>https://www.sthda.com/english/articles/39-regression-model-diagnostics/160-multicollinearity-essentials-and-vif-in-r/</link>
			<guid>https://www.sthda.com/english/articles/39-regression-model-diagnostics/160-multicollinearity-essentials-and-vif-in-r/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">





<p>In multiple regression (Chapter @ref(linear-regression)), two or more predictor variables might be correlated with each other. This situation is referred to as <em>collinearity</em>.</p>
<p>There is an extreme situation, called <strong>multicollinearity</strong>, where collinearity exists between three or more variables even if no pair of variables has a particularly high correlation. This means that there is redundancy between predictor variables.</p>
<p>In the presence of multicollinearity, the solution of the regression model becomes unstable.</p>
<p>For a given predictor (p), multicollinearity can be assessed by computing a score called the <strong>variance inflation factor</strong> (or <strong>VIF</strong>), which measures how much the variance of a regression coefficient is inflated due to multicollinearity in the model.</p>
<p>The smallest possible value of VIF is one (absence of multicollinearity). As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity <span class="citation">(James et al. 2014)</span>.</p>
<p>When faced with multicollinearity, the concerned variables should be removed, since the presence of multicollinearity implies that the information such a variable provides about the response is redundant in the presence of the other variables <span class="citation">(James et al. 2014,<span class="citation">P. Bruce and Bruce (2017)</span>)</span>.</p>
<p>This chapter describes how to detect multicollinearity in a regression model using R.</p>
<p>Contents:</p>
<div id="TOC">
<ul>
<li><a href="#loading-required-r-packages">Loading Required R packages</a></li>
<li><a href="#preparing-the-data">Preparing the data</a></li>
<li><a href="#building-a-regression-model">Building a regression model</a></li>
<li><a href="#detecting-multicollinearity">Detecting multicollinearity</a></li>
<li><a href="#dealing-with-multicollinearity">Dealing with multicollinearity</a></li>
<li><a href="#discussion">Discussion</a></li>
<li><a href="#references">References</a></li>
</ul>
</div>
<br/>
<div class = "small-block content-privileged-friends navr-book">
  <p>The Book:</p>
        <a href = "https://www.sthda.com/english/web/5-bookadvisor/54-machine-learning-essentials/">
          <img src = "https://www.sthda.com/english/upload/machine-learning-essentials-frontcover-200px.png" /><br/>
     Machine Learning Essentials: Practical Guide in R
      </a>
</div>
<div class="spacer"></div>


<div id="loading-required-r-packages" class="section level2">
<h2>Loading Required R packages</h2>
<ul>
<li><code>tidyverse</code> for easy data manipulation and visualization</li>
<li><code>caret</code> for easy machine learning workflow</li>
</ul>
<pre class="r"><code>library(tidyverse)
library(caret)</code></pre>
</div>
<div id="preparing-the-data" class="section level2">
<h2>Preparing the data</h2>
<p>We’ll use the <code>Boston</code> data set [in <code>MASS</code> package], introduced in Chapter @ref(regression-analysis), for predicting the median house value (<code>medv</code>), in Boston Suburbs, based on multiple predictor variables.</p>
<p>We’ll randomly split the data into a training set (80%, for building a predictive model) and a test set (20%, for evaluating the model). Make sure to set the seed for reproducibility.</p>
<pre class="r"><code># Load the data
data("Boston", package = "MASS")
# Split the data into training and test set
set.seed(123)
training.samples <- Boston$medv %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data  <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]</code></pre>
</div>
<div id="building-a-regression-model" class="section level2">
<h2>Building a regression model</h2>
<p>The following regression model includes all predictor variables:</p>
<pre class="r"><code># Build the model
model1 <- lm(medv ~., data = train.data)
# Make predictions
predictions <- model1 %>% predict(test.data)
# Model performance
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  R2 = R2(predictions, test.data$medv)
)</code></pre>
<pre><code>##   RMSE   R2
## 1 4.99 0.67</code></pre>
</div>
<div id="detecting-multicollinearity" class="section level2">
<h2>Detecting multicollinearity</h2>
<p>The R function <code>vif()</code> [car package] can be used to detect multicollinearity in a regression model:</p>
<pre class="r"><code>car::vif(model1)</code></pre>
<pre><code>##    crim      zn   indus    chas     nox      rm     age     dis     rad 
##    1.87    2.36    3.90    1.06    4.47    2.01    3.02    3.96    7.80 
##     tax ptratio   black   lstat 
##    9.16    1.91    1.31    2.97</code></pre>
<p>In our example, the VIF score for the predictor variable <code>tax</code> is very high (VIF = 9.16). This might be problematic.</p>
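<p>The VIF can also be computed by hand: for a given predictor, it is 1 / (1 - R2), where R2 is obtained by regressing that predictor on all the other predictors. A minimal sketch for the <code>tax</code> variable (for simplicity we use the full <code>Boston</code> data here, so the value differs slightly from the training-set result above):</p>

```r
# VIF by hand: regress the predictor of interest on all other predictors,
# then compute 1 / (1 - R2). Illustrated for the `tax` variable.
data("Boston", package = "MASS")
r2.tax <- summary(lm(tax ~ . - medv, data = Boston))$r.squared
vif.tax <- 1 / (1 - r2.tax)
vif.tax  # a value above 5-10 suggests problematic multicollinearity
```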
</div>
<div id="dealing-with-multicollinearity" class="section level2">
<h2>Dealing with multicollinearity</h2>
<p>In this section, we’ll update our model by removing the predictor variable with the highest VIF value (<code>tax</code>):</p>
<pre class="r"><code># Build a model excluding the tax variable
model2 <- lm(medv ~. -tax, data = train.data)
# Make predictions
predictions <- model2 %>% predict(test.data)
# Model performance
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  R2 = R2(predictions, test.data$medv)
)</code></pre>
<pre><code>##   RMSE    R2
## 1 5.01 0.671</code></pre>
<p>It can be seen that removing the <code>tax</code> variable does not substantially affect the model performance metrics.</p>
</div>
<div id="discussion" class="section level2">
<h2>Discussion</h2>
<p>This chapter describes how to detect and deal with multicollinearity in regression models. Multicollinearity problems consist of including, in the model, different variables that have a similar predictive relationship with the outcome. This can be assessed for each predictor by computing the VIF value.</p>
<p>Any variable with a high VIF value (above 5 or 10) should be removed from the model. This leads to a simpler model without compromising the model accuracy.</p>
<p>Note that, in a large data set containing multiple correlated predictor variables, you can use principal component regression or partial least squares regression strategies. See Chapter @ref(pcr-and-pls-regression).</p>
</div>
<div id="references" class="section level2 unnumbered">
<h2>References</h2>
<div id="refs" class="references">
<div id="ref-bruce2017">
<p>Bruce, Peter, and Andrew Bruce. 2017. <em>Practical Statistics for Data Scientists</em>. O’Reilly Media.</p>
</div>
<div id="ref-james2014">
<p>James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014. <em>An Introduction to Statistical Learning: With Applications in R</em>. Springer Publishing Company, Incorporated.</p>
</div>
</div>
</div>


</div><!--end rdoc-->


<!-- END HTML -->]]></description>
			<pubDate>Sun, 11 Mar 2018 12:13:00 +0100</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Confounding Variable Essentials]]></title>
			<link>https://www.sthda.com/english/articles/39-regression-model-diagnostics/159-confounding-variable-essentials/</link>
			<guid>https://www.sthda.com/english/articles/39-regression-model-diagnostics/159-confounding-variable-essentials/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">




<p>A <strong>confounding variable</strong> is an important variable that should be included in the predictive model but has been omitted. Naive interpretation of such models can lead to invalid conclusions.</p>
<p>For example, suppose we want to model life expectancy in different countries on the basis of GDP per capita, using the <code>gapminder</code> data set:</p>
<pre class="r"><code>library(gapminder)
lm(lifeExp ~ gdpPercap, data = gapminder)</code></pre>
<p>In this example, it is clear that the continent is an important variable: countries in Europe are estimated to have a higher life expectancy compared to countries in Africa. Therefore, continent is a confounding variable that should be included in the model:</p>
<pre class="r"><code>lm(lifeExp ~ gdpPercap + continent, data = gapminder)</code></pre>
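<p>The effect of omitting a confounder can be made concrete with simulated data (hypothetical variables, but the same pattern as the gapminder example above): the naive slope absorbs part of the confounder’s effect.</p>

```r
# Simulated confounding (hypothetical data): the predictor x and the
# outcome y both depend on a group variable (the confounder).
set.seed(42)
n <- 500
group <- rep(c(0, 1), each = n / 2)       # confounder (plays the role of continent)
x <- rnorm(n, mean = 5 + 5 * group)       # predictor depends on the confounder
y <- 2 * x + 10 * group + rnorm(n)        # true slope of x is 2
naive    <- coef(lm(y ~ x))["x"]          # confounder omitted: slope inflated
adjusted <- coef(lm(y ~ x + group))["x"]  # confounder included: slope near 2
c(naive = naive, adjusted = adjusted)
```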


<br/>
<div class = "small-block content-privileged-friends navr-book">
  <p>The Book:</p>
        <a href = "https://www.sthda.com/english/web/5-bookadvisor/54-machine-learning-essentials/">
          <img src = "https://www.sthda.com/english/upload/machine-learning-essentials-frontcover-200px.png" /><br/>
     Machine Learning Essentials: Practical Guide in R
      </a>
</div>
<div class="spacer"></div>


</div><!--end rdoc-->


<!-- END HTML -->]]></description>
			<pubDate>Sun, 11 Mar 2018 12:07:00 +0100</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Regression Model Accuracy Metrics: R-square, AIC, BIC, Cp and more]]></title>
			<link>https://www.sthda.com/english/articles/38-regression-model-validation/158-regression-model-accuracy-metrics-r-square-aic-bic-cp-and-more/</link>
			<guid>https://www.sthda.com/english/articles/38-regression-model-validation/158-regression-model-accuracy-metrics-r-square-aic-bic-cp-and-more/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">





<p>In this chapter we’ll describe different statistical regression <strong>metrics</strong> for measuring the performance of a regression model (Chapter @ref(linear-regression)).</p>
<p>Next, we’ll provide practical examples in R for comparing the performance of two models in order to select the best one for our data.</p>
<p>Contents:</p>
<div id="TOC">
<ul>
<li><a href="#model-performance-metrics">Model performance metrics</a></li>
<li><a href="#loading-required-r-packages">Loading required R packages</a></li>
<li><a href="#example-of-data">Example of data</a></li>
<li><a href="#building-regression-models">Building regression models</a></li>
<li><a href="#assessing-model-quality">Assessing model quality</a></li>
<li><a href="#comparing-regression-models-performance">Comparing regression models performance</a></li>
<li><a href="#discussion">Discussion</a></li>
</ul>
</div>
<br/>
<div class = "small-block content-privileged-friends navr-book">
  <p>The Book:</p>
        <a href = "https://www.sthda.com/english/web/5-bookadvisor/54-machine-learning-essentials/">
          <img src = "https://www.sthda.com/english/upload/machine-learning-essentials-frontcover-200px.png" /><br/>
     Machine Learning Essentials: Practical Guide in R
      </a>
</div>
<div class="spacer"></div>


<div id="model-performance-metrics" class="section level2">
<h2>Model performance metrics</h2>
<p>For regression models, the most commonly used evaluation metrics include:</p>
<ol style="list-style-type: decimal">
<li><p><strong>R-squared</strong> (R2), which is the proportion of variation in the outcome that is explained by the predictor variables. In multiple regression models, R2 corresponds to the squared correlation between the observed outcome values and the values predicted by the model. The higher the R-squared, the better the model.</p></li>
<li><p><strong>Root Mean Squared Error</strong> (RMSE), which measures the average error made by the model in predicting the outcome for an observation. Mathematically, the RMSE is the square root of the <em>mean squared error (MSE)</em>, which is the average squared difference between the observed outcome values and the values predicted by the model. So, <code>MSE = mean((observeds - predicteds)^2)</code> and <code>RMSE = sqrt(MSE)</code>. The lower the RMSE, the better the model.</p></li>
<li><p><strong>Residual Standard Error</strong> (RSE), also known as the <em>model sigma</em>, is a variant of the RMSE adjusted for the number of predictors in the model. The lower the RSE, the better the model. In practice, the difference between RMSE and RSE is very small, particularly for large multivariate data.</p></li>
<li><p><strong>Mean Absolute Error</strong> (MAE), like the RMSE, the MAE measures the prediction error. Mathematically, it is the average absolute difference between observed and predicted outcomes, <code>MAE = mean(abs(observeds - predicteds))</code>. MAE is less sensitive to outliers compared to RMSE.</p></li>
</ol>
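<p>The metrics above are easy to compute by hand. A minimal sketch using the built-in <code>swiss</code> data (used later in this chapter):</p>

```r
# RMSE, MAE and RSE by hand for a linear model on the built-in swiss data
model <- lm(Fertility ~ ., data = swiss)
obs <- swiss$Fertility
pred <- predict(model)
rmse <- sqrt(mean((obs - pred)^2))
mae <- mean(abs(obs - pred))
# The RSE divides by the residual degrees of freedom (n - p - 1) instead of n
rse <- sqrt(sum((obs - pred)^2) / df.residual(model))
c(RMSE = rmse, MAE = mae, RSE = rse)
```

<p>Note that the RSE computed this way is exactly the <code>sigma</code> value reported by <code>summary()</code>, and it is always slightly larger than the RMSE.</p>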
<p>The problem with the above metrics is that they are sensitive to the inclusion of additional variables in the model, even if those variables do not contribute significantly to explaining the outcome. In other words, including additional variables in the model will always increase the R2 and reduce the RMSE. So, we need a more robust metric to guide the model choice.</p>
<p>Concerning R2, there is an adjusted version, called the <strong>Adjusted R-squared</strong>, which penalizes the R2 for the number of variables in the model.</p>
<p>Additionally, there are four other important metrics - <strong>AIC</strong>, <strong>AICc</strong>, <strong>BIC</strong> and <strong>Mallows Cp</strong> - that are commonly used for model evaluation and selection. They estimate the model prediction error (MSE). The lower these metrics, the better the model.</p>
<ol style="list-style-type: decimal">
<li><strong>AIC</strong> stands for <em>Akaike’s Information Criterion</em>, a metric developed by the Japanese statistician Hirotugu Akaike in 1970. The basic idea of AIC is to penalize the inclusion of additional variables in a model. It adds a penalty that increases the error when additional terms are included. The lower the AIC, the better the model.</li>
<li><strong>AICc</strong> is a version of AIC corrected for small sample sizes.</li>
<li><strong>BIC</strong> (or <em>Bayesian Information Criterion</em>) is a variant of AIC with a stronger penalty for including additional variables in the model.</li>
<li><strong>Mallows Cp</strong>: A variant of AIC developed by Colin Mallows.</li>
</ol>
<div class="block">
<p>
Generally, the most commonly used metrics, for measuring regression model quality and for comparing models, are: Adjusted R2, AIC, BIC and Cp.
</p>
</div>
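<p>For models fitted with <code>lm()</code>, AIC and BIC can be reproduced directly from the log-likelihood; a sketch using the built-in <code>swiss</code> data:</p>

```r
# AIC = -2*logLik + 2*k and BIC = -2*logLik + log(n)*k, where k counts
# the estimated parameters (coefficients plus the residual variance)
model <- lm(Fertility ~ ., data = swiss)
k <- length(coef(model)) + 1   # +1 for the residual variance
ll <- as.numeric(logLik(model))
aic <- -2 * ll + 2 * k         # matches AIC(model)
bic <- -2 * ll + log(nrow(swiss)) * k  # matches BIC(model)
c(AIC = aic, BIC = bic)
```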
<p>In the following sections, we’ll show you how to compute the metrics mentioned above.</p>
</div>
<div id="loading-required-r-packages" class="section level2">
<h2>Loading required R packages</h2>
<ul>
<li><code>tidyverse</code> for data manipulation and visualization</li>
<li><code>modelr</code> provides helper functions for computing regression model performance metrics</li>
<li><code>broom</code> creates easily a tidy data frame containing the model statistical metrics</li>
</ul>
<pre class="r"><code>library(tidyverse)
library(modelr)
library(broom)</code></pre>
</div>
<div id="example-of-data" class="section level2">
<h2>Example of data</h2>
<p>We’ll use the built-in R <code>swiss</code> data, introduced in Chapter @ref(regression-analysis), for predicting fertility score on the basis of socio-economic indicators.</p>
<pre class="r"><code># Load the data
data("swiss")
# Inspect the data
sample_n(swiss, 3)</code></pre>
</div>
<div id="building-regression-models" class="section level2">
<h2>Building regression models</h2>
<p>We start by creating two models:</p>
<ol style="list-style-type: decimal">
<li>Model 1, including all predictors</li>
<li>Model 2, including all predictors except the variable Examination</li>
</ol>
<pre class="r"><code>model1 <- lm(Fertility ~., data = swiss)

model2 <- lm(Fertility ~. -Examination, data = swiss)</code></pre>
</div>
<div id="assessing-model-quality" class="section level2">
<h2>Assessing model quality</h2>
<p>There are many R functions and packages for assessing model quality, including:</p>
<ul>
<li><code>summary()</code> [stats package], returns the R-squared, adjusted R-squared and the RSE</li>
<li><code>AIC()</code> and <code>BIC()</code> [stats package], compute the AIC and the BIC, respectively</li>
</ul>
<pre class="r"><code>summary(model1)
AIC(model1)
BIC(model1)</code></pre>
<ul>
<li><code>rsquare()</code>, <code>rmse()</code> and <code>mae()</code> [modelr package], compute, respectively, the R2, the RMSE and the MAE.</li>
</ul>
<pre class="r"><code>library(modelr)
data.frame(
  R2 = rsquare(model1, data = swiss),
  RMSE = rmse(model1, data = swiss),
  MAE = mae(model1, data = swiss)
)</code></pre>
<ul>
<li><code>R2()</code>, <code>RMSE()</code> and <code>MAE()</code> [caret package], compute, respectively, the R2, the RMSE and the MAE.</li>
</ul>
<pre class="r"><code>library(caret)
predictions <- model1 %>% predict(swiss)
data.frame(
  R2 = R2(predictions, swiss$Fertility),
  RMSE = RMSE(predictions, swiss$Fertility),
  MAE = MAE(predictions, swiss$Fertility)
)</code></pre>
<ul>
<li><code>glance()</code> [broom package], computes the R2, adjusted R2, sigma (RSE), AIC, BIC.</li>
</ul>
<pre class="r"><code>library(broom)
glance(model1)</code></pre>
<ul>
<li>Manual computation of R2, RMSE and MAE:</li>
</ul>
<pre class="r"><code># Make predictions and compute the
# R2, RMSE and MAE
swiss %>%
  add_predictions(model1) %>%
  summarise(
    R2 = cor(Fertility, pred)^2,
    MSE = mean((Fertility - pred)^2),
    RMSE = sqrt(MSE),
    MAE = mean(abs(Fertility - pred))
  )</code></pre>
</div>
<div id="comparing-regression-models-performance" class="section level2">
<h2>Comparing regression models performance</h2>
<p>Here, we’ll use the function <code>glance()</code> to simply compare the overall quality of our two models:</p>
<pre class="r"><code># Metrics for model 1
glance(model1) %>%
  dplyr::select(adj.r.squared, sigma, AIC, BIC, p.value)</code></pre>
<pre><code>##   adj.r.squared sigma AIC BIC  p.value
## 1         0.671  7.17 326 339 5.59e-10</code></pre>
<pre class="r"><code># Metrics for model 2
glance(model2) %>%
  dplyr::select(adj.r.squared, sigma, AIC, BIC, p.value)</code></pre>
<pre><code>##   adj.r.squared sigma AIC BIC  p.value
## 1         0.671  7.17 325 336 1.72e-10</code></pre>
<p>From the output above, it can be seen that:</p>
<ol style="list-style-type: decimal">
<li><p>The two models have exactly the same <strong>adjusted R2</strong> (0.67), meaning that they are equivalent in explaining the outcome, here the fertility score. They also have the same <strong>residual standard error</strong> (RSE or sigma = 7.17). However, model 2 is simpler than model 1 because it incorporates fewer variables. All else being equal, the simpler model is preferred.</p></li>
<li><p>The AIC and the BIC of model 2 are lower than those of model 1. In model comparison strategies, the model with the lowest AIC and BIC scores is preferred.</p></li>
<li><p>Finally, the F-statistic p-value of model 2 is lower than that of model 1. This means that model 2 is statistically more significant than model 1, which is consistent with the above conclusion.</p></li>
</ol>
<p>Note that the RMSE and the RSE are measured on the same scale as the outcome variable. Dividing the RSE by the average value of the outcome variable gives the prediction error rate, which should be as small as possible:</p>
<pre class="r"><code>sigma(model1)/mean(swiss$Fertility)</code></pre>
<pre><code>## [1] 0.102</code></pre>
<p>In our example the average prediction error rate is 10%.</p>
</div>
<div id="discussion" class="section level2">
<h2>Discussion</h2>
<p>This chapter describes several metrics for assessing the overall performance of a regression model.</p>
<p>The most important metrics are the Adjusted R-square, RMSE, AIC and the BIC. These metrics are also used as the basis of model comparison and optimal model selection.</p>
<p>Note that these regression metrics are all internal measures; that is, they have been computed on the same data that were used to build the regression model. They tell you how well the model fits the data in hand, called the training data set.</p>
<p>In general, we do not really care how well the method works on the training data. Rather, we are interested in the accuracy of the predictions that we obtain when we apply our method to previously unseen test data.</p>
<p>However, the test data are not always available, making the test error difficult to estimate. In this situation, methods such as <strong>cross-validation</strong> (Chapter @ref(cross-validation)) and the <strong>bootstrap</strong> (Chapter @ref(bootstrap-resampling)) are applied to estimate the test error (or the prediction error rate) using the training data.</p>
</div>


</div><!--end rdoc-->


<!-- END HTML -->]]></description>
			<pubDate>Sun, 11 Mar 2018 10:28:00 +0100</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Cross-Validation Essentials in R]]></title>
			<link>https://www.sthda.com/english/articles/38-regression-model-validation/157-cross-validation-essentials-in-r/</link>
			<guid>https://www.sthda.com/english/articles/38-regression-model-validation/157-cross-validation-essentials-in-r/</guid>
			<description><![CDATA[<!-- START HTML -->


  <div id="rdoc">





<p><strong>Cross-validation</strong> refers to a set of methods for measuring the performance of a given predictive model on new test data sets.</p>
<p>The basic idea behind cross-validation techniques consists of dividing the data into two sets:</p>
<ol style="list-style-type: decimal">
<li>The training set, used to train (i.e. build) the model;</li>
<li>and the testing set (or validation set), used to test (i.e. validate) the model by estimating the prediction error.</li>
</ol>
<p>Cross-validation is also known as a <em>resampling method</em> because it involves fitting the same statistical method multiple times using different subsets of the data.</p>
<p>In this chapter, you’ll learn:</p>
<ol style="list-style-type: decimal">
<li><p>the most commonly used statistical metrics (Chapter @ref(regression-model-accuracy-metrics)) for measuring the performance of a regression model in predicting the outcome of new test data.</p></li>
<li>The different cross-validation methods for assessing model performance. We cover the following approaches:
<ul>
<li>Validation set approach (or data split)</li>
<li>Leave One Out Cross Validation</li>
<li>k-fold Cross Validation</li>
<li>Repeated k-fold Cross Validation</li>
</ul></li>
</ol>
<p>Each of these methods has its advantages and drawbacks. Use the method that best suits your problem. Generally, (repeated) k-fold cross-validation is recommended.</p>
<ol start="3" style="list-style-type: decimal">
<li>Practical examples of R codes for computing cross-validation methods.</li>
</ol>
<p>Contents:</p>
<div id="TOC">
<ul>
<li><a href="#loading-required-r-packages">Loading required R packages</a></li>
<li><a href="#example-of-data">Example of data</a></li>
<li><a href="#model-performance-metrics">Model performance metrics</a></li>
<li><a href="#cross-validation-methods">Cross-validation methods</a><ul>
<li><a href="#the-validation-set-approach">The Validation set Approach</a></li>
<li><a href="#leave-one-out-cross-validation---loocv">Leave one out cross validation - LOOCV</a></li>
<li><a href="#k-fold-cross-validation">K-fold cross-validation</a></li>
<li><a href="#repeated-k-fold-cross-validation">Repeated K-fold cross-validation</a></li>
</ul></li>
<li><a href="#discussion">Discussion</a></li>
<li><a href="#references">References</a></li>
</ul>
</div>
<br/>
<div class = "small-block content-privileged-friends navr-book">
  <p>The Book:</p>
        <a href = "https://www.sthda.com/english/web/5-bookadvisor/54-machine-learning-essentials/">
          <img src = "https://www.sthda.com/english/upload/machine-learning-essentials-frontcover-200px.png" /><br/>
     Machine Learning Essentials: Practical Guide in R
      </a>
</div>
<div class="spacer"></div>


<div id="loading-required-r-packages" class="section level2">
<h2>Loading required R packages</h2>
<ul>
<li><code>tidyverse</code> for easy data manipulation and visualization</li>
<li><code>caret</code> for easily computing cross-validation methods</li>
</ul>
<pre class="r"><code>library(tidyverse)
library(caret)</code></pre>
</div>
<div id="example-of-data" class="section level2">
<h2>Example of data</h2>
<p>We’ll use the built-in R <code>swiss</code> data, introduced in Chapter @ref(regression-analysis), for predicting fertility score on the basis of socio-economic indicators.</p>
<pre class="r"><code># Load the data
data("swiss")
# Inspect the data
sample_n(swiss, 3)</code></pre>
</div>
<div id="model-performance-metrics" class="section level2">
<h2>Model performance metrics</h2>
<p>After building a model, we are interested in determining the accuracy of the model in predicting the outcome for new, unseen observations not used to build the model. In other words, we want to estimate the prediction error.</p>
<p>To do so, the basic strategy is to:</p>
<ol style="list-style-type: decimal">
<li>Build the model on a training data set</li>
<li>Apply the model on a new test data set to make predictions</li>
<li>Compute the prediction errors</li>
</ol>
<p>In Chapter @ref(regression-model-accuracy-metrics), we described several statistical metrics for quantifying the overall quality of regression models. These include:</p>
<ul>
<li><strong>R-squared</strong> (R2), representing the squared correlation between the observed outcome values and the predicted values by the model. The higher the adjusted R2, the better the model.</li>
<li><strong>Root Mean Squared Error</strong> (RMSE), which measures the average prediction error made by the model in predicting the outcome for an observation. That is, the average difference between the observed known outcome values and the values predicted by the model. The lower the RMSE, the better the model.</li>
<li><strong>Mean Absolute Error</strong> (MAE), an alternative to the RMSE that is less sensitive to outliers. It corresponds to the average absolute difference between observed and predicted outcomes. The lower the MAE, the better the model</li>
</ul>
<p>In classification setting, the prediction error rate is estimated as the proportion of misclassified observations.</p>
<div class="block">
<p>
R2, RMSE and MAE are used to measure the regression model performance during <strong>cross-validation</strong>.
</p>
</div>
<p>In the following section, we’ll explain the basics of cross-validation, and we’ll provide practical example using mainly the <code>caret</code> R package.</p>
</div>
<div id="cross-validation-methods" class="section level2">
<h2>Cross-validation methods</h2>
<p>Briefly, cross-validation algorithms can be summarized as follows:</p>
<ol style="list-style-type: decimal">
<li>Reserve a small sample of the data set</li>
<li>Build (or train) the model using the remaining part of the data set</li>
<li>Test the effectiveness of the model on the reserved sample of the data set. If the model works well on the test data set, then it’s good.</li>
</ol>
<p>The following sections describe the different cross-validation techniques.</p>
<div id="the-validation-set-approach" class="section level3">
<h3>The Validation set Approach</h3>
<p>The validation set approach consists of randomly splitting the data into two sets: one set is used to train the model and the other set is used to test the model.</p>
<p>The process works as follows:</p>
<ol style="list-style-type: decimal">
<li>Build (train) the model on the training data set</li>
<li>Apply the model to the test data set to predict the outcome of new unseen observations</li>
<li>Quantify the prediction error as the mean squared difference between the observed and the predicted outcome values.</li>
</ol>
<p>The example below splits the <code>swiss</code> data set so that 80% is used for training a linear regression model and 20% is used to evaluate the model performance.</p>
<pre class="r"><code># Split the data into training and test set
set.seed(123)
training.samples <- swiss$Fertility %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data  <- swiss[training.samples, ]
test.data <- swiss[-training.samples, ]
# Build the model
model <- lm(Fertility ~., data = train.data)
# Make predictions and compute the R2, RMSE and MAE
predictions <- model %>% predict(test.data)
data.frame( R2 = R2(predictions, test.data$Fertility),
            RMSE = RMSE(predictions, test.data$Fertility),
            MAE = MAE(predictions, test.data$Fertility))</code></pre>
<pre><code>##     R2 RMSE  MAE
## 1 0.39 9.11 7.48</code></pre>
<p>When comparing two models, the one that produces the lowest test sample RMSE is the preferred model.</p>
<p>The RMSE and the MAE are measured on the same scale as the outcome variable. Dividing the RMSE by the average value of the outcome variable gives the prediction error rate, which should be as small as possible:</p>
<pre class="r"><code>RMSE(predictions, test.data$Fertility)/mean(test.data$Fertility)</code></pre>
<pre><code>## [1] 0.128</code></pre>
<p>Note that the validation set approach is only useful when you have a large data set that can be partitioned. A disadvantage is that the model is built on only a fraction of the data, possibly leaving out interesting information about the data and leading to higher bias. The test error rate can therefore be highly variable, depending on which observations end up in the training set and which end up in the validation set.</p>
</div>
<div id="leave-one-out-cross-validation---loocv" class="section level3">
<h3>Leave one out cross validation - LOOCV</h3>
<p>This method works as follows:</p>
<ol style="list-style-type: decimal">
<li>Leave out one data point and build the model on the rest of the data set</li>
<li>Test the model against the data point that is left out at step 1 and record the test error associated with the prediction</li>
<li>Repeat the process for all data points</li>
<li>Compute the overall prediction error by taking the average of all these test error estimates recorded at step 2.</li>
</ol>
<p>Practical example in R using the <code>caret</code> package:</p>
<pre class="r"><code># Define training control
train.control <- trainControl(method = "LOOCV")
# Train the model
model <- train(Fertility ~., data = swiss, method = "lm",
               trControl = train.control)
# Summarize the results
print(model)</code></pre>
<pre><code>## Linear Regression 
## 
## 47 samples
##  5 predictor
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 46, 46, 46, 46, 46, 46, ... 
## Resampling results:
## 
##   RMSE  Rsquared  MAE 
##   7.74  0.613     6.12
## 
## Tuning parameter &#39;intercept&#39; was held constant at a value of TRUE</code></pre>
<p>The advantage of the LOOCV method is that we make use of all data points, reducing potential bias.</p>
<p>However, the process is repeated as many times as there are data points, resulting in a long execution time when n is large.</p>
<p>Additionally, we test the model performance against a single data point at each iteration. This might result in higher variation in the prediction error if some data points are outliers. What we need is a good proportion of test data points at each iteration, a solution provided by the <strong>k-fold cross-validation</strong> method.</p>
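<p>For intuition, LOOCV can also be written as an explicit loop in base R; a sketch equivalent in spirit to the caret call above:</p>

```r
# Manual LOOCV: hold out each observation in turn, refit, and predict it
errors <- sapply(seq_len(nrow(swiss)), function(i) {
  fit <- lm(Fertility ~ ., data = swiss[-i, ])
  swiss$Fertility[i] - predict(fit, newdata = swiss[i, ])
})
loocv.rmse <- sqrt(mean(errors^2))  # LOOCV estimate of the RMSE
loocv.rmse
```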
</div>
<div id="k-fold-cross-validation" class="section level3">
<h3>K-fold cross-validation</h3>
<p>The k-fold cross-validation method evaluates the model performance on different subsets of the training data and then calculates the average prediction error rate. The algorithm is as follows:</p>
<ol style="list-style-type: decimal">
<li>Randomly split the data set into k subsets (folds), for example 5 subsets</li>
<li>Reserve one subset and train the model on all other subsets</li>
<li>Test the model on the reserved subset and record the prediction error</li>
<li>Repeat this process until each of the k subsets has served as the test set.</li>
<li>Compute the average of the k recorded errors. This is called the cross-validation error and serves as the performance metric for the model.</li>
</ol>
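<p>The steps above can be sketched directly in base R (k = 5; the fold assignment here is a simplified version of what caret does internally):</p>

```r
# Manual 5-fold cross-validation on the swiss data
set.seed(123)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(swiss)))  # step 1: random folds
cv.rmse <- sapply(1:k, function(i) {
  fit <- lm(Fertility ~ ., data = swiss[folds != i, ])  # step 2: train
  pred <- predict(fit, newdata = swiss[folds == i, ])   # step 3: test
  sqrt(mean((swiss$Fertility[folds == i] - pred)^2))    # record the error
})
mean(cv.rmse)  # step 5: the cross-validation error estimate
```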
<p>K-fold cross-validation (CV) is a robust method for estimating the accuracy of a model.</p>
<p>The most obvious advantage of k-fold CV compared to LOOCV is computational. A less obvious but potentially more important advantage of k-fold CV is that it often gives more accurate estimates of the test error rate than does LOOCV <span class="citation">(James et al. 2014)</span>.</p>
<p>A typical question is: how do we choose the right value of k?</p>
<p>A lower value of k gives more biased estimates, which is undesirable. A higher value of k gives less biased estimates but can suffer from large variability. It is not hard to see that a small value of k (say k = 2) takes us towards the validation set approach, whereas a large value (say k = number of data points) leads us to the LOOCV approach.</p>
<div class="block">
<p>
In practice, one typically performs k-fold cross-validation using k = 5 or k = 10, as these values have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.
</p>
</div>
<p>The following example uses 10-fold cross-validation to estimate the prediction error. Make sure to set the seed for reproducibility.</p>
<pre class="r"><code># Define training control
set.seed(123) 
train.control <- trainControl(method = "cv", number = 10)
# Train the model
model <- train(Fertility ~., data = swiss, method = "lm",
               trControl = train.control)
# Summarize the results
print(model)</code></pre>
<pre><code>## Linear Regression 
## 
## 47 samples
##  5 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 43, 42, 42, 41, 43, 41, ... 
## Resampling results:
## 
##   RMSE  Rsquared  MAE 
##   7.38  0.751     6.03
## 
## Tuning parameter &#39;intercept&#39; was held constant at a value of TRUE</code></pre>
</div>
<div id="repeated-k-fold-cross-validation" class="section level3">
<h3>Repeated K-fold cross-validation</h3>
<p>The process of splitting the data into k folds can be repeated a number of times; this is called repeated k-fold cross-validation.</p>
<p>The final model error is taken as the mean error from the number of repeats.</p>
<p>The following example uses 10-fold cross validation with 3 repeats:</p>
<pre class="r"><code># Define training control
set.seed(123)
train.control <- trainControl(method = "repeatedcv", 
                              number = 10, repeats = 3)
# Train the model
model <- train(Fertility ~., data = swiss, method = "lm",
               trControl = train.control)
# Summarize the results
print(model)</code></pre>
</div>
</div>
<div id="discussion" class="section level2">
<h2>Discussion</h2>
<p>In this chapter, we described 4 different methods for assessing the performance of a model on unseen test data.</p>
<p>These methods include: validation set approach, leave-one-out cross-validation, k-fold cross-validation and repeated k-fold cross-validation.</p>
<p>We generally recommend the (repeated) k-fold cross-validation to estimate the prediction error rate. It can be used in regression and classification settings.</p>
<p>Another alternative to cross-validation is the bootstrap resampling method (Chapter @ref(bootstrap-resampling)), which consists of repeatedly and randomly selecting a sample of n observations from the original data set, and evaluating the model performance on each copy.</p>
</div>
<div id="references" class="section level2 unnumbered">
<h2>References</h2>
<div id="refs" class="references">
<div id="ref-james2014">
<p>James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014. <em>An Introduction to Statistical Learning: With Applications in R</em>. Springer Publishing Company, Incorporated.</p>
</div>
</div>
</div>


</div><!--end rdoc-->



<!-- END HTML -->]]></description>
			<pubDate>Sun, 11 Mar 2018 10:22:00 +0100</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Bootstrap Resampling Essentials in R]]></title>
			<link>https://www.sthda.com/english/articles/38-regression-model-validation/156-bootstrap-resampling-essentials-in-r/</link>
			<guid>https://www.sthda.com/english/articles/38-regression-model-validation/156-bootstrap-resampling-essentials-in-r/</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">





<p>Similarly to cross-validation techniques (Chapter @ref(cross-validation)), the <strong>bootstrap resampling</strong> method can be used to measure the accuracy of a predictive model. Additionally, it can be used to measure the uncertainty associated with any statistical estimator.</p>
<p>Bootstrap resampling consists of repeatedly selecting a sample of n observations from the original data set and evaluating the model on each copy. An average standard error is then calculated, and the results provide an indication of the overall variance of the model&#8217;s performance.</p>
<p>This chapter describes the basics of bootstrapping and provides practical examples in R for computing a model prediction error. Additionally, we’ll show you how to compute an estimator uncertainty using bootstrap techniques.</p>
<p>Contents:</p>
<div id="TOC">
<ul>
<li><a href="#loading-required-r-packages">Loading required R packages</a></li>
<li><a href="#example-of-data">Example of data</a></li>
<li><a href="#bootstrap-procedure">Bootstrap procedure</a></li>
<li><a href="#evaluating-a-predictive-model-performance">Evaluating a predictive model performance</a></li>
<li><a href="#quantifying-an-estimator-uncertainty-and-confidence-intervals">Quantifying an estimator uncertainty and confidence intervals</a></li>
<li><a href="#discussion">Discussion</a></li>
</ul>
</div>
<br/>
<div class = "small-block content-privileged-friends navr-book">
  <p>The Book:</p>
        <a href = "https://www.sthda.com/english/web/5-bookadvisor/54-machine-learning-essentials/">
          <img src = "https://www.sthda.com/english/upload/machine-learning-essentials-frontcover-200px.png" /><br/>
     Machine Learning Essentials: Practical Guide in R
      </a>
</div>
<div class="spacer"></div>


<div id="loading-required-r-packages" class="section level2">
<h2>Loading required R packages</h2>
<ul>
<li><code>tidyverse</code> for easy data manipulation and visualization</li>
<li><code>caret</code> for easily computing cross-validation methods</li>
</ul>
<pre class="r"><code>library(tidyverse)
library(caret)</code></pre>
</div>
<div id="example-of-data" class="section level2">
<h2>Example of data</h2>
<p>We’ll use the built-in R <code>swiss</code> data, introduced in the Chapter @ref(regression-analysis), for predicting fertility score on the basis of socio-economic indicators.</p>
<pre class="r"><code># Load the data
data("swiss")
# Inspect the data
sample_n(swiss, 3)</code></pre>
</div>
<div id="bootstrap-procedure" class="section level2">
<h2>Bootstrap procedure</h2>
<p>The bootstrap method is used to quantify the uncertainty associated with a given statistical estimator or with a predictive model.</p>
<p>It consists of randomly selecting a sample of n observations from the original data set. This subset, called the bootstrap data set, is then used to evaluate the model.</p>
<p>This procedure is repeated a large number of times, and the standard error of the bootstrap estimate is then calculated. The results provide an indication of the variance of the model&#8217;s performance.</p>
<p>Note that the sampling is performed with replacement, which means that the same observation can occur more than once in the bootstrap data set.</p>
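<p>The procedure above can be sketched by hand in base R. This is a minimal illustration (for simplicity, each resampled model is evaluated on the full original data; <code>caret</code>&#8217;s <code>trainControl(method = "boot")</code>, used below, instead evaluates on the out-of-bag observations):</p>
<pre class="r"><code># Manual bootstrap sketch: estimate model RMSE from B resamples
set.seed(123)
data("swiss")
B <- 100
n <- nrow(swiss)
rmse <- numeric(B)
for (b in 1:B) {
  idx  <- sample(n, replace = TRUE)      # draw n rows with replacement
  fit  <- lm(Fertility ~ ., data = swiss[idx, ])
  pred <- predict(fit, newdata = swiss)  # evaluate on the original data
  rmse[b] <- sqrt(mean((swiss$Fertility - pred)^2))
}
mean(rmse)  # average bootstrap estimate of the prediction error</code></pre>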
</div>
<div id="evaluating-a-predictive-model-performance" class="section level2">
<h2>Evaluating a predictive model performance</h2>
<p>The following example uses a bootstrap with 100 resamples to test a linear regression model:</p>
<pre class="r"><code># Define training control
train.control <- trainControl(method = "boot", number = 100)
# Train the model
model <- train(Fertility ~., data = swiss, method = "lm",
               trControl = train.control)
# Summarize the results
print(model)</code></pre>
<pre><code>## Linear Regression 
## 
## 47 samples
##  5 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (100 reps) 
## Summary of sample sizes: 47, 47, 47, 47, 47, 47, ... 
## Resampling results:
## 
##   RMSE  Rsquared  MAE 
##   8.4   0.597     6.76
## 
## Tuning parameter &#39;intercept&#39; was held constant at a value of TRUE</code></pre>
<p>The output shows the average model performance across the 100 resamples.</p>
<p>RMSE (Root Mean Squared Error) and MAE (Mean Absolute Error) represent two different measures of the model prediction error. The lower the RMSE and the MAE, the better the model. The R-squared represents the proportion of variation in the outcome explained by the predictor variables included in the model. The higher the R-squared, the better the model. Read more on these metrics in Chapter @ref(regression-model-accuracy-metrics).</p>
</div>
<div id="quantifying-an-estimator-uncertainty-and-confidence-intervals" class="section level2">
<h2>Quantifying an estimator uncertainty and confidence intervals</h2>
<p>The bootstrap approach can be used to quantify the uncertainty (or standard error) associated with any given statistical estimator.</p>
<p>For example, you might want to estimate the accuracy of the linear regression beta coefficients using the bootstrap method.</p>
<p>The different steps are as follows:</p>
<ol style="list-style-type: decimal">
<li>Create a simple function, <code>model_coef()</code>, that takes the <code>swiss</code> data set as well as the indices of the observations, and returns the regression coefficients.</li>
<li>Apply the function <code>model_coef()</code> to the full data set of 47 observations in order to compute the coefficients.</li>
</ol>
<p>We start by creating a function that returns the regression model coefficients:</p>
<pre class="r"><code>model_coef <- function(data, index){
  coef(lm(Fertility ~., data = data, subset = index))
}

model_coef(swiss, 1:47)</code></pre>
<pre><code>##      (Intercept)      Agriculture      Examination        Education 
##           66.915           -0.172           -0.258           -0.871 
##         Catholic Infant.Mortality 
##            0.104            1.077</code></pre>
<p>Next, we use the <code>boot()</code> function [boot package] to compute the standard errors of 500 bootstrap estimates for the coefficients:</p>
<pre class="r"><code>library(boot)
boot(swiss, model_coef, 500)</code></pre>
<pre><code>## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = swiss, statistic = model_coef, R = 500)
## 
## 
## Bootstrap Statistics :
##     original    bias    std. error
## t1*   66.915 -2.04e-01     10.9174
## t2*   -0.172 -5.62e-03      0.0639
## t3*   -0.258 -2.27e-02      0.2524
## t4*   -0.871  3.89e-05      0.2203
## t5*    0.104 -7.77e-04      0.0319
## t6*    1.077  4.45e-02      0.4478</code></pre>
<p>In the output above,</p>
<ul>
<li><code>original</code> column corresponds to the regression coefficients. The associated standard errors are given in the column <code>std.error</code>.</li>
<li>t1 corresponds to the intercept, t2 corresponds to <code>Agriculture</code> and so on…</li>
</ul>
<p>For example, it can be seen that, the standard error (SE) of the regression coefficient associated with <code>Agriculture</code> is 0.06.</p>
<p>Note that the standard errors measure the variability/accuracy of the beta coefficients. They can be used to compute the confidence intervals of the coefficients.</p>
<p>For example, the 95% confidence interval for a given coefficient b is defined as <code>b +/- 2*SE(b)</code>, where:</p>
<ul>
<li>the lower limit of b = <code>b - 2*SE(b) = -0.172 - (2*0.0639) = -0.300</code> (for the Agriculture variable)</li>
<li>the upper limit of b = <code>b + 2*SE(b) = -0.172 + (2*0.0639) = -0.044</code> (for the Agriculture variable)</li>
</ul>
<p>That is, there is approximately a 95% chance that the interval [-0.300, -0.044] will contain the true value of the coefficient.</p>
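<p>The <code>boot</code> package can also compute such intervals directly with <code>boot.ci()</code>. A minimal sketch (re-using the <code>model_coef()</code> function defined above; <code>index = 2</code> selects the Agriculture coefficient, and the exact bounds will vary with the random resamples):</p>
<pre class="r"><code>library(boot)

# Coefficient-extractor function, as defined above
model_coef <- function(data, index){
  coef(lm(Fertility ~ ., data = data, subset = index))
}

set.seed(123)
res <- boot(swiss, model_coef, R = 500)
# Normal-approximation 95% CI for the Agriculture coefficient
boot.ci(res, type = "norm", index = 2)</code></pre>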
<p>Using the standard <code>lm()</code> function gives slightly different standard errors, because the linear model makes some assumptions about the data:</p>
<pre class="r"><code>summary(lm(Fertility ~., data = swiss))$coef</code></pre>
<pre><code>##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)        66.915    10.7060    6.25 1.91e-07
## Agriculture        -0.172     0.0703   -2.45 1.87e-02
## Examination        -0.258     0.2539   -1.02 3.15e-01
## Education          -0.871     0.1830   -4.76 2.43e-05
## Catholic            0.104     0.0353    2.95 5.19e-03
## Infant.Mortality    1.077     0.3817    2.82 7.34e-03</code></pre>
<p>The bootstrap approach does not rely on any of these assumptions made by the linear model, and so it likely gives a more accurate estimate of the coefficients&#8217; standard errors than the <code>summary()</code> function does.</p>
</div>
<div id="discussion" class="section level2">
<h2>Discussion</h2>
<p>This chapter describes the bootstrap resampling method for evaluating a predictive model&#8217;s accuracy, as well as for measuring the uncertainty associated with a given statistical estimator.</p>
<p>An alternative to bootstrapping for evaluating predictive model performance is cross-validation (Chapter @ref(cross-validation)).</p>
</div>


</div><!--end rdoc-->


<!-- END HTML -->]]></description>
			<pubDate>Sun, 11 Mar 2018 10:19:00 +0100</pubDate>
			
		</item>
		
	</channel>
</rss>
