The **k-nearest neighbors** (**KNN**) algorithm is a simple *machine learning* method used for both classification and regression. The KNN algorithm predicts the outcome of a new observation by comparing it to the k most similar cases in the training data set, where k is defined by the analyst.

In this chapter, we start by describing the basics of the KNN algorithm for both classification and regression settings. Next, we provide a practical example in *R* for preparing the data and computing a KNN model.

Additionally, you’ll learn how to make predictions and to assess the performance of the built model in predicting the outcome of new test observations.

**KNN algorithm for classification**:

To classify a given new observation (new_obs), the k-nearest neighbors method starts by identifying the k most similar training observations (i.e. neighbors) to our new_obs, and then assigns new_obs to the class containing the majority of its neighbors.

**KNN algorithm for regression**:

Similarly, to predict a continuous outcome value for given new observation (new_obs), the KNN algorithm computes the average outcome value of the k training observations that are the most similar to new_obs, and returns this value as new_obs predicted outcome value.
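Both prediction rules can be sketched in a few lines of base R. The toy training set below is entirely hypothetical, with a single predictor for simplicity:

```
# Hypothetical toy training set: one predictor x, a class label,
# and a continuous outcome y
x     <- c(1.0, 1.2, 3.5, 4.0, 4.2)
label <- c("a", "a", "b", "b", "b")
y     <- c(10, 12, 30, 33, 35)

new_obs <- 3.8
k <- 3

# Find the k training observations closest to new_obs (Euclidean distance)
nearest <- order(abs(x - new_obs))[1:k]

# Classification: majority vote among the k neighbors
pred.class <- names(which.max(table(label[nearest])))

# Regression: average outcome value of the k neighbors
pred.value <- mean(y[nearest])

pred.class # "b"
pred.value # 32.67, the mean of 30, 33 and 35
```

In practice you would not code this by hand; the `caret` workflow shown below handles distance computation, tuning of k and prediction for you.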

**Similarity measures**:

Note that the (dis)similarity between observations is generally determined using the Euclidean distance measure, which is very sensitive to the scale on which the predictor variables are measured. So, it’s generally recommended to standardize (i.e., normalize) the predictor variables to make their scales comparable.
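A quick base-R illustration (with made-up numbers) of how an unscaled variable can dominate the Euclidean distance:

```
# Hypothetical observations: glucose is on a much larger scale than pedigree
obs1 <- c(glucose = 150, pedigree = 0.2)
obs2 <- c(glucose = 152, pedigree = 0.9)
obs3 <- c(glucose = 120, pedigree = 0.2)

# Raw distances: glucose dominates, so obs2 appears much closer to obs1
sqrt(sum((obs1 - obs2)^2)) # ~2.12
sqrt(sum((obs1 - obs3)^2)) # 30

# After centering and scaling each column, both predictors contribute comparably
m <- rbind(obs1, obs2, obs3)
dist(scale(m)) # distances computed on comparable scales
```

With `caret`, the same standardization is done automatically via `preProcess = c("center", "scale")`, as shown later in this chapter.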

The following sections show how to build k-nearest neighbor predictive models in classification and regression settings.

- `tidyverse` for easy data manipulation and visualization
- `caret` for easy machine learning workflow

```
library(tidyverse)
library(caret)
```

Data set: `PimaIndiansDiabetes2` [in the `mlbench` package], introduced in Chapter @ref(classification-in-r), for predicting the probability of being diabetes positive based on multiple clinical variables.

We’ll randomly split the data into training set (80% for building a predictive model) and test set (20% for evaluating the model). Make sure to set seed for reproducibility.

```
# Load the data and remove NAs
data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
# Inspect the data
sample_n(PimaIndiansDiabetes2, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- PimaIndiansDiabetes2[training.samples, ]
test.data <- PimaIndiansDiabetes2[-training.samples, ]
```

We’ll use the `caret` package, which automatically tests different possible values of k, chooses the optimal k that minimizes the cross-validation (“cv”) error, and fits the final KNN model that best fits our data.

Additionally, `caret` can automatically preprocess the data in order to normalize the predictor variables.

We’ll use the following arguments in the function `train()`:

- `trControl`, to set up 10-fold cross validation
- `preProcess`, to normalize the data
- `tuneLength`, to specify the number of possible k values to evaluate

```
# Fit the model on the training set
set.seed(123)
model <- train(
  diabetes ~ ., data = train.data, method = "knn",
  trControl = trainControl("cv", number = 10),
  preProcess = c("center", "scale"),
  tuneLength = 20
)
# Plot model accuracy vs different values of k
plot(model)
```

```
# Print the best tuning parameter k that
# maximizes model accuracy
model$bestTune
```

```
## k
## 5 13
```

```
# Make predictions on the test data
predicted.classes <- model %>% predict(test.data)
head(predicted.classes)
```

```
## [1] neg pos neg pos pos neg
## Levels: neg pos
```

```
# Compute model accuracy rate
mean(predicted.classes == test.data$diabetes)
```

`## [1] 0.769`

The overall prediction accuracy of our model is 76.9%, which is good (see Chapter @ref(classification-model-evaluation) for learning key metrics used to evaluate a classification model performance).

In this section, we’ll describe how to predict a continuous variable using KNN.

We’ll use the `Boston` data set [in the `MASS` package], introduced in Chapter @ref(regression-analysis), for predicting the median house value (`medv`) in Boston suburbs, using different predictor variables.

**Randomly split the data** into a training set (80%, for building a predictive model) and a test set (20%, for evaluating the model). Make sure to set a seed for reproducibility.

```
# Load the data
data("Boston", package = "MASS")
# Inspect the data
sample_n(Boston, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- Boston$medv %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]
```

**Compute KNN using caret**. The best k is the one that minimizes the prediction error RMSE (root mean squared error).

The RMSE corresponds to the square root of the average squared difference between the observed outcome values and the predicted values: `RMSE = mean((observeds - predicteds)^2) %>% sqrt()`. The lower the RMSE, the better the model.
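The RMSE formula can be checked directly in base R with a few made-up numbers:

```
# Toy example of the RMSE computation
observeds  <- c(20, 25, 31)
predicteds <- c(22, 24, 28)
sqrt(mean((observeds - predicteds)^2)) # sqrt((4 + 1 + 9) / 3) ~ 2.16
```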

```
# Fit the model on the training set
set.seed(123)
model <- train(
  medv ~ ., data = train.data, method = "knn",
  trControl = trainControl("cv", number = 10),
  preProcess = c("center", "scale"),
  tuneLength = 10
)
# Plot model error RMSE vs different values of k
plot(model)
# Best tuning parameter k that minimizes the RMSE
model$bestTune
# Make predictions on the test data
predictions <- model %>% predict(test.data)
head(predictions)
# Compute the prediction error RMSE
RMSE(predictions, test.data$medv)
```

This chapter describes the basics of KNN (k-nearest neighbors) modeling, which is, conceptually, one of the simplest machine learning methods.

It’s recommended to standardize the data when performing a KNN analysis. We provided R code to easily compute a KNN predictive model and to assess its performance on test data.

When fitting the KNN algorithm, the analyst needs to specify the number of neighbors (k) to be considered for predicting the outcome of an observation. The choice of k considerably impacts the output of KNN: k = 1 corresponds to a highly flexible method resulting in a training error rate of 0 (overfitting), but the test error rate may be quite high.

You need to test multiple k values to decide an optimal value for your data. This can be done automatically using the `caret` package, which chooses the value of k that minimizes the cross-validation error (Chapter @ref(cross-validation)).

The **decision tree** method is a powerful and popular predictive machine learning technique that is used for both *classification* and *regression*. It is also known as **Classification and Regression Trees** (**CART**).

Note that the R implementation of the CART algorithm is called *RPART* (Recursive Partitioning And Regression Trees) available in a package of the same name.

In this chapter, we’ll describe the basics of tree models and provide R code to compute classification and regression trees.

- `tidyverse` for easy data manipulation and visualization
- `caret` for easy machine learning workflow
- `rpart` for computing decision tree models

```
library(tidyverse)
library(caret)
library(rpart)
```

The algorithm of decision tree models works by repeatedly partitioning the data into multiple sub-spaces, so that the outcomes in each final sub-space are as homogeneous as possible. This approach is technically called *recursive partitioning*.

The produced result consists of a set of rules used for predicting the outcome variable, which can be either:

- a continuous variable, for regression trees
- a categorical variable, for classification trees

The decision rules generated by the CART predictive model are generally visualized as a binary tree.

The following example represents a tree model predicting the species of iris flower based on the length and width (in cm) of the sepal and petal.

```
library(rpart)
model <- rpart(Species ~., data = iris)
par(xpd = NA) # otherwise on some devices the text is clipped
plot(model)
text(model, digits = 3)
```

The plot shows the different splitting rules that can be used to effectively predict the outcome (here, iris species). For example, the top split assigns observations having `Petal.Length < 2.45` to the left branch, where the predicted species is `setosa`.

The different rules in the tree can be printed as follows:

`print(model, digits = 2)`

```
## n= 150
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 150 100 setosa (0.333 0.333 0.333)
## 2) Petal.Length< 2.5 50 0 setosa (1.000 0.000 0.000) *
## 3) Petal.Length>=2.5 100 50 versicolor (0.000 0.500 0.500)
## 6) Petal.Width< 1.8 54 5 versicolor (0.000 0.907 0.093) *
## 7) Petal.Width>=1.8 46 1 virginica (0.000 0.022 0.978) *
```

These rules are produced by repeatedly splitting the predictor variables, starting with the variable that has the highest association with the response variable. The process continues until some predetermined stopping criteria are met.

The resulting tree is composed of *decision nodes*, *branches* and *leaf nodes*. The tree is drawn from top to bottom, so the *root* is at the top and the leaves, indicating the outcomes, are at the bottom.

Each decision node corresponds to a single input predictor variable and a split cutoff on that variable. The leaf nodes carry the outcome values used to make predictions.

The tree grows from the top (root); at each node, the algorithm chooses the split cutoff that results in the greatest purity (or homogeneity) in each subpartition.

The tree stops growing when one of the following criteria is met (Zhang 2016):

- all leaf nodes are pure, containing a single class;
- no split can assign at least the pre-specified minimum number of training observations to each resulting leaf node;
- the number of observations in a leaf node reaches the pre-specified minimum.

A fully grown tree will overfit the training data and the resulting model might not be performant for predicting the outcome of new test data. Techniques, such as **pruning**, are used to control this problem.

Technically, **for regression modeling**, the split cutoff is defined so that the residual sum of squares (RSS) is minimized across the training samples that fall within the subpartition.

Recall that the RSS is the sum of the squared differences between the observed outcome values and the predicted ones: `RSS = sum((Observeds - Predicteds)^2)`. See Chapter @ref(linear-regression).
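The split search can be sketched in base R on a toy data set (hypothetical numbers): for each candidate cutoff, predict each side by its own mean and keep the cutoff with the smallest total RSS.

```
# Toy data: a single predictor with an obvious break between two groups
x <- c(1, 2, 3, 10, 11, 12)
y <- c(5, 6, 5, 20, 21, 19)

rss_for_cutoff <- function(cutoff) {
  left  <- y[x < cutoff]
  right <- y[x >= cutoff]
  # Each side is predicted by its own mean; total RSS sums both sides
  sum((left - mean(left))^2) + sum((right - mean(right))^2)
}

cutoffs <- c(2.5, 6.5, 10.5)
rss <- sapply(cutoffs, rss_for_cutoff)
cutoffs[which.min(rss)] # 6.5, which cleanly separates the two groups
```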

**In classification settings**, the split point is defined so that the populations in the subpartitions are as pure as possible. Two measures of purity are generally used: the *Gini index* and the *entropy* (or *information gain*).

For a given subpartition, `Gini = sum(p(1-p))` and `entropy = -1*sum(p*log(p))`, where p is the proportion of observations in the subpartition that belong to each class.

The sum is computed across the different classes of the outcome variable. Both measures equal 0 for a perfectly pure node and increase with impurity (for a two-class problem, the maximum is 0.5 for the Gini index and 1 for the entropy with a base-2 logarithm).
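These purity measures can be written as one-liners in base R, taking p as the vector of class proportions within a subpartition (here the entropy uses a base-2 logarithm, so the two-class maximum is 1):

```
# p is a vector of class proportions within a node (sums to 1)
gini    <- function(p) sum(p * (1 - p))
entropy <- function(p) -sum(ifelse(p > 0, p * log2(p), 0))

gini(c(1, 0))        # 0: a pure node
gini(c(0.5, 0.5))    # 0.5: maximal two-class impurity
entropy(c(0.5, 0.5)) # 1: maximal two-class impurity
```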

The different rule sets established in the tree are used to predict the outcome of a new test data.

The following R code predicts the species of a newly collected iris flower:

```
newdata <- data.frame(
  Sepal.Length = 6.5, Sepal.Width = 3.0,
  Petal.Length = 5.2, Petal.Width = 2.0
)
model %>% predict(newdata, "class")
```

```
## 1
## virginica
## Levels: setosa versicolor virginica
```

The new data is predicted to be virginica.

Data set: `PimaIndiansDiabetes2` [in the `mlbench` package], introduced in Chapter @ref(classification-in-r), for predicting the probability of being diabetes positive based on multiple clinical variables.

We’ll randomly split the data into training set (80% for building a predictive model) and test set (20% for evaluating the model). Make sure to set seed for reproducibility.

```
# Load the data and remove NAs
data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
# Inspect the data
sample_n(PimaIndiansDiabetes2, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- PimaIndiansDiabetes2[training.samples, ]
test.data <- PimaIndiansDiabetes2[-training.samples, ]
```

Here, we’ll create a fully grown tree showing all predictor variables in the data set.

```
# Build the model
set.seed(123)
model1 <- rpart(diabetes ~., data = train.data, method = "class")
# Plot the trees
par(xpd = NA) # Avoid clipping the text in some device
plot(model1)
text(model1, digits = 3)
```

```
# Make predictions on the test data
predicted.classes <- model1 %>%
  predict(test.data, type = "class")
head(predicted.classes)
```

```
## 21 25 28 29 32 36
## neg pos neg pos pos neg
## Levels: neg pos
```

```
# Compute model accuracy rate on test data
mean(predicted.classes == test.data$diabetes)
```

`## [1] 0.782`

The overall accuracy of our tree model is 78%, which is not so bad.

However, this full tree, including all predictors, appears very complex and can be difficult to interpret when you have large data sets with multiple predictors.

Additionally, it is easy to see that a fully grown tree will overfit the training data and might lead to poor test set performance.

A strategy to limit this overfitting is to prune back the tree, resulting in a simpler tree with fewer splits and better interpretation, at the cost of a little bias (James et al. 2014; P. Bruce and Bruce 2017).

Briefly, our goal here is to see if a smaller subtree can give us comparable results to the fully grown tree. If yes, we should go for the simpler tree because it reduces the likelihood of overfitting.

One possible robust strategy for pruning the tree (or stopping it from growing) consists of avoiding splitting a partition if the split does not significantly improve the overall quality of the model.

In the `rpart` package, this is controlled by the *complexity parameter* (`cp`), which imposes a penalty on the tree for having too many splits. The default value is 0.01. The higher the `cp`, the smaller the tree.

A too-small `cp` value leads to overfitting, while a too-large value results in a tree that is too small. Both cases decrease the predictive performance of the model.
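The effect of `cp` can be illustrated with a small hypothetical computation: roughly speaking, `rpart` keeps a split only if it reduces the tree's overall relative error by at least `cp`.

```
cp <- 0.01
error_before <- 0.400 # relative error before a candidate split (made-up)
error_after  <- 0.397 # relative error after it (made-up)

# The improvement (0.003) is below cp (0.01), so this split would be pruned
keep_split <- (error_before - error_after) >= cp
keep_split # FALSE
```

With a smaller `cp` (say 0.001) the same split would be kept, which is why small `cp` values grow larger trees.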

An optimal `cp` value can be estimated by testing different values and using cross-validation to determine the corresponding prediction accuracy of the model. The best `cp` is then defined as the one that maximizes the cross-validation accuracy (Chapter @ref(cross-validation)).

Pruning can be easily performed in the `caret` package workflow, which invokes the `rpart` method, automatically tests different possible values of `cp`, chooses the optimal `cp` that maximizes the cross-validation (“cv”) accuracy, and fits the final CART model that best fits our data.

You can use the following arguments in the function `train()` [from the caret package]:

- `trControl`, to set up 10-fold cross validation
- `tuneLength`, to specify the number of possible `cp` values to evaluate. The default value is 3; here we’ll use 10.

```
# Fit the model on the training set
set.seed(123)
model2 <- train(
  diabetes ~ ., data = train.data, method = "rpart",
  trControl = trainControl("cv", number = 10),
  tuneLength = 10
)
# Plot model accuracy vs different values of
# cp (complexity parameter)
plot(model2)
```

```
# Print the best tuning parameter cp that
# maximizes the model accuracy
model2$bestTune
```

```
## cp
## 2 0.0321
```

```
# Plot the final tree model
par(xpd = NA) # Avoid clipping the text in some device
plot(model2$finalModel)
text(model2$finalModel, digits = 3)
```

```
# Decision rules in the model
model2$finalModel
```

```
## n= 314
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 314 104 neg (0.6688 0.3312)
## 2) glucose< 128 188 26 neg (0.8617 0.1383) *
## 3) glucose>=128 126 48 pos (0.3810 0.6190)
## 6) glucose< 166 88 44 neg (0.5000 0.5000)
## 12) age< 23.5 16 1 neg (0.9375 0.0625) *
## 13) age>=23.5 72 29 pos (0.4028 0.5972) *
## 7) glucose>=166 38 4 pos (0.1053 0.8947) *
```

```
# Make predictions on the test data
predicted.classes <- model2 %>% predict(test.data)
# Compute model accuracy rate on test data
mean(predicted.classes == test.data$diabetes)
```

`## [1] 0.795`

From the output above, it can be seen that the best value for the complexity parameter (cp) is 0.032, allowing a simpler tree that is easy to interpret, with an overall accuracy of 79%, comparable to the accuracy (78%) obtained with the full tree. The prediction accuracy of the pruned tree is even slightly better than that of the full tree.

Taken together, we should go for this simpler model.

Previously, we described how to build a classification tree for predicting the group (i.e. class) of observations. In this section, we’ll describe how to build a tree for predicting a continuous variable, a method called regression analysis (Chapter @ref(regression-analysis)).

The R code is identical to what we have seen in the previous sections. Pruning should also be applied here to limit overfitting.

Similarly to classification trees, the following R code uses the `caret` package to build regression trees and to predict the output of a new test data set.

Data set: we’ll use the `Boston` data set [in the `MASS` package], introduced in Chapter @ref(regression-analysis), for predicting the median house value (`medv`) in Boston suburbs, using different predictor variables.

We’ll randomly split the data into training set (80% for building a predictive model) and test set (20% for evaluating the model). Make sure to set seed for reproducibility.

```
# Load the data
data("Boston", package = "MASS")
# Inspect the data
sample_n(Boston, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- Boston$medv %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]
```

Here, the best `cp` value is the one that minimizes the prediction error RMSE (root mean squared error).

The prediction error is measured by the RMSE, which corresponds to the square root of the average squared difference between the observed outcome values and the values predicted by the model: `RMSE = mean((observeds - predicteds)^2) %>% sqrt()`. The lower the RMSE, the better the model.

Choose the best cp value:

```
# Fit the model on the training set
set.seed(123)
model <- train(
  medv ~ ., data = train.data, method = "rpart",
  trControl = trainControl("cv", number = 10),
  tuneLength = 10
)
# Plot model error vs different values of
# cp (complexity parameter)
plot(model)
# Print the best tuning parameter cp that
# minimizes the model RMSE
model$bestTune
```

Plot the final tree model:

```
# Plot the final tree model
par(xpd = NA) # Avoid clipping the text in some device
plot(model$finalModel)
text(model$finalModel, digits = 3)
```

```
# Decision rules in the model
model$finalModel
# Make predictions on the test data
predictions <- model %>% predict(test.data)
head(predictions)
# Compute the prediction error RMSE
RMSE(predictions, test.data$medv)
```

The conditional inference tree (ctree) uses significance tests to select and recursively split on the predictor variables most related to the outcome. This can limit overfitting compared to the classical rpart algorithm.

At each splitting step, the algorithm stops if there is no significant dependence between the predictor variables and the outcome variable. Otherwise, the variable most associated with the outcome is selected for splitting.
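The idea can be sketched in base R (this is only an illustration with simulated data, not the actual permutation-test machinery of `ctree()`): among candidate predictors, pick the one with the smallest association-test p-value.

```
set.seed(123)
n  <- 100
x1 <- rnorm(n)      # truly related to the outcome
x2 <- rnorm(n)      # pure noise
y  <- 2 * x1 + rnorm(n)

# Test the association of each predictor with the outcome
pvals <- c(x1 = cor.test(x1, y)$p.value,
           x2 = cor.test(x2, y)$p.value)

names(which.min(pvals)) # "x1" would be selected for the first split
```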

The conditional tree can be easily computed using the `caret` workflow, which will invoke the function `ctree()` available in the `party` package.

- Demo data: `PimaIndiansDiabetes2`. First, split the data into training (80%) and test (20%) sets.

```
# Load the data
data("PimaIndiansDiabetes2", package = "mlbench")
pima.data <- na.omit(PimaIndiansDiabetes2)
# Split the data into training and test set
set.seed(123)
training.samples <- pima.data$diabetes %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- pima.data[training.samples, ]
test.data <- pima.data[-training.samples, ]
```

- Build conditional trees using the tuning parameters `maxdepth` and `mincriterion` for controlling the tree size. The `caret` package automatically selects the optimal tuning values for your data, but here we’ll specify `maxdepth` and `mincriterion`.

The following example creates a classification tree:

```
library(party)
set.seed(123)
model <- train(
  diabetes ~ ., data = train.data, method = "ctree2",
  trControl = trainControl("cv", number = 10),
  tuneGrid = expand.grid(maxdepth = 3, mincriterion = 0.95)
)
plot(model$finalModel)
```

```
# Make predictions on the test data
predicted.classes <- model %>% predict(test.data)
# Compute model accuracy rate on test data
mean(predicted.classes == test.data$diabetes)
```

`## [1] 0.744`

The p-value indicates the association between a given predictor variable and the outcome variable. For example, the first decision node at the top shows that `glucose` is the variable most strongly associated with `diabetes` (p < 0.001), and it is thus selected as the first node.

This chapter describes how to build classification and regression trees in R. Trees provide a visual tool that is very easy to interpret and to explain to people.

Tree models can outperform the linear regression model (Chapter @ref(linear-regression)) when there is a highly non-linear and complex relationship between the outcome variable and the predictors.

However, building only one single tree from a training data set might result in a less performant predictive model. A single tree is unstable, and its structure might be altered by small changes in the training data.

For example, the exact split point of a given predictor variable and the predictor selected at each step of the algorithm depend strongly on the training data set. Using slightly different training data may change the first variable to split on, and the structure of the tree can be completely modified.

Other machine learning algorithms - including *bagging*, *random forest* and *boosting* - can be used to build multiple different trees from one single data set, leading to better predictive performance. But with these methods, the interpretability of a single tree is lost. Note that all the above-mentioned strategies are based on the CART algorithm. See Chapters @ref(bagging-and-random-forest) and @ref(boosting).

Bruce, Peter, and Andrew Bruce. 2017. *Practical Statistics for Data Scientists*. O’Reilly Media.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014. *An Introduction to Statistical Learning: With Applications in R*. Springer Publishing Company, Incorporated.

Zhang, Zhongheng. 2016. “Decision Tree Modeling Using R.” *Annals of Translational Medicine* 4 (15).

In Chapter @ref(decision-tree-models), we described how to build decision trees for predictive modeling.

The standard decision tree model, CART (classification and regression trees), builds only one single tree, which is then used to predict the outcome of new observations. The output of this strategy is very unstable, and the tree structure might be severely affected by a small change in the training data set.

There are different powerful alternatives to the classical CART algorithm, including **bagging**, **Random Forest** and **boosting**.

**Bagging** stands for *bootstrap aggregating*. It consists of building multiple decision tree models from a single training data set by repeatedly drawing bootstrapped subsets of the data and averaging the resulting models. Here, each tree is built independently of the others. Read more on bootstrapping in Chapter @ref(bootstrap-resampling).

The **Random Forest** algorithm is one of the most commonly used and most powerful machine learning techniques. It is a special type of bagging applied to decision trees.

Compared to the standard CART model (Chapter @ref(decision-tree-models)), the random forest provides a strong improvement, which consists of applying bagging to the data and, additionally, randomly sampling the predictor variables at each split (James et al. 2014; P. Bruce and Bruce 2017). This means that at each splitting step of the tree algorithm, a random sample of m predictors is chosen as split candidates from the full set of p predictors.

Random forest can be used for both classification (predicting a categorical variable) and regression (predicting a continuous variable).

In this chapter, we’ll describe how to compute the random forest algorithm in R for building a powerful predictive model. Additionally, you’ll learn how to rank the predictor variables according to their importance in contributing to the model accuracy.

- `tidyverse` for easy data manipulation and visualization
- `caret` for easy machine learning workflow
- `randomForest` for computing the random forest algorithm

```
library(tidyverse)
library(caret)
library(randomForest)
```

Data set: `PimaIndiansDiabetes2` [in the `mlbench` package], introduced in Chapter @ref(classification-in-r), for predicting the probability of being diabetes positive based on multiple clinical variables.

Randomly split the data into training set (80% for building a predictive model) and test set (20% for evaluating the model). Make sure to set seed for reproducibility.

```
# Load the data and remove NAs
data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
# Inspect the data
sample_n(PimaIndiansDiabetes2, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- PimaIndiansDiabetes2[training.samples, ]
test.data <- PimaIndiansDiabetes2[-training.samples, ]
```

We’ll use the `caret` workflow, which invokes the `randomForest()` function [randomForest package], to automatically select the optimal number (`mtry`) of predictor variables randomly sampled as candidates at each split, and to fit the final random forest model that best fits our data.

We’ll use the following argument in the function `train()`:

- `trControl`, to set up 10-fold cross validation

```
# Fit the model on the training set
set.seed(123)
model <- train(
  diabetes ~ ., data = train.data, method = "rf",
  trControl = trainControl("cv", number = 10),
  importance = TRUE
)
# Best tuning parameter
model$bestTune
```

```
## mtry
## 3 8
```

```
# Final model
model$finalModel
```

```
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 8
##
## OOB estimate of error rate: 22%
## Confusion matrix:
## neg pos class.error
## neg 185 25 0.119
## pos 44 60 0.423
```

```
# Make predictions on the test data
predicted.classes <- model %>% predict(test.data)
head(predicted.classes)
```

```
## [1] neg pos neg neg pos neg
## Levels: neg pos
```

```
# Compute model accuracy rate
mean(predicted.classes == test.data$diabetes)
```

`## [1] 0.808`

By default, 500 trees are trained. The optimal number of variables sampled at each split is 8.

Each bagged tree makes use of around two-thirds of the observations. The remaining one-third of the observations not used to fit a given bagged tree are referred to as the out-of-bag (OOB) observations (James et al. 2014).

For a given tree, the out-of-bag (OOB) error is the model error in predicting the data left out of the training set for that tree (P. Bruce and Bruce 2017). OOB is a very straightforward way to estimate the test error of a bagged model, without the need to perform cross-validation or the validation set approach.
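The "about two-thirds" figure follows from bootstrap sampling: the expected fraction of distinct observations in a bootstrap sample of size n is 1 - (1 - 1/n)^n, which approaches 1 - 1/e ≈ 0.632. A quick base-R simulation confirms it:

```
set.seed(123)
n <- 1000
# Fraction of distinct (in-bag) observations across 200 bootstrap samples
in_bag <- replicate(200, length(unique(sample(n, replace = TRUE))) / n)
mean(in_bag) # ~0.632; the remaining ~1/3 of observations are out-of-bag
```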

In our example, the OOB estimate of the error rate is 22%.

The prediction accuracy on the new test data is 81%, which is good.

The importance of each variable can be printed using the function `importance()` [randomForest package]:

`importance(model$finalModel)`

```
## neg pos MeanDecreaseAccuracy MeanDecreaseGini
## pregnant 11.57 0.318 10.36 8.86
## glucose 38.93 28.437 46.17 53.30
## pressure -1.94 0.846 -1.06 8.09
## triceps 6.19 3.249 6.85 9.92
## insulin 8.65 -2.037 6.01 12.43
## mass 7.71 2.299 7.57 14.58
## pedigree 6.57 1.083 5.66 14.50
## age 9.51 12.310 15.75 16.76
```

The result shows:

- `MeanDecreaseAccuracy`, which is the average decrease of model accuracy in predicting the outcome of the out-of-bag samples when a specific variable is excluded from the model.
- `MeanDecreaseGini`, which is the average decrease in node impurity that results from splits over that variable. The Gini impurity index is only used for classification problems; in regression, node impurity is measured by the training set RSS. These measures, calculated using the training set, are less reliable than measures calculated on out-of-bag data. See Chapter @ref(decision-tree-models) for the node impurity measures (Gini index and RSS).

Note that, by default (argument `importance = FALSE`), randomForest only calculates the Gini impurity index. Computing the accuracy-based importance (argument `importance = TRUE`) requires supplementary computations, which might be time consuming when thousands of trees are being fitted.

Variable importance measures can be plotted using the function `varImpPlot()` [randomForest package]:

```
# Plot MeanDecreaseAccuracy
varImpPlot(model$finalModel, type = 1)
# Plot MeanDecreaseGini
varImpPlot(model$finalModel, type = 2)
```

The results show that across all of the trees considered in the random forest, the glucose and age variables are the two most important variables.

The function `varImp()` [in caret] displays the importance of variables in percentage:

`varImp(model)`

```
## rf variable importance
##
## Importance
## glucose 100.0
## age 33.5
## pregnant 19.0
## mass 16.2
## triceps 15.4
## pedigree 12.8
## insulin 11.2
## pressure 0.0
```

Similarly, you can build a random forest model to perform regression, that is, to predict a continuous variable.

We’ll use the `Boston` data set [in the `MASS` package], introduced in Chapter @ref(regression-analysis), for predicting the median house value (`medv`) in Boston suburbs, using different predictor variables.

Randomly split the data into training set (80% for building a predictive model) and test set (20% for evaluating the model).

```
# Load the data
data("Boston", package = "MASS")
# Inspect the data
sample_n(Boston, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- Boston$medv %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]
```

Here, the prediction error is measured by the RMSE, which corresponds to the square root of the average squared difference between the observed outcome values and the values predicted by the model: `RMSE = mean((observeds - predicteds)^2) %>% sqrt()`. The lower the RMSE, the better the model.

```
# Fit the model on the training set
set.seed(123)
model <- train(
  medv ~ ., data = train.data, method = "rf",
  trControl = trainControl("cv", number = 10)
)
# Best tuning parameter mtry
model$bestTune
# Make predictions on the test data
predictions <- model %>% predict(test.data)
head(predictions)
# Compute the average prediction error RMSE
RMSE(predictions, test.data$medv)
```

Note that, the random forest algorithm has a set of hyperparameters that should be tuned using cross-validation to avoid overfitting.

These include:

- `nodesize`: Minimum size of terminal nodes. The default is 1 for classification and 5 for regression.
- `maxnodes`: Maximum number of terminal nodes that trees in the forest can have. If not given, trees are grown to the maximum possible size (subject to the limit set by `nodesize`).

Ignoring these parameters might lead to overfitting on noisy data sets (P. Bruce and Bruce 2017). Cross-validation can be used to test different values in order to select the optimal one.

Hyperparameters can be tuned manually using the `caret` package. For a given parameter, the approach consists of fitting many models with different parameter values and then comparing the models.

The following example tests different values of `nodesize` using the `PimaIndiansDiabetes2` data set for classification:

(This will take 1-2 minutes execution time)

```
data("PimaIndiansDiabetes2", package = "mlbench")
models <- list()
for (nodesize in c(1, 2, 4, 8)) {
  set.seed(123)
  model <- train(
    diabetes ~ ., data = na.omit(PimaIndiansDiabetes2), method = "rf",
    trControl = trainControl(method = "cv", number = 10),
    metric = "Accuracy",
    nodesize = nodesize
  )
  model.name <- toString(nodesize)
  models[[model.name]] <- model
}
# Compare results
resamples(models) %>% summary(metric = "Accuracy")
```

```
##
## Call:
## summary.resamples(object = ., metric = "Accuracy")
##
## Models: 1, 2, 4, 8
## Number of resamples: 10
##
## Accuracy
##    Min. 1st Qu. Median  Mean 3rd Qu.  Max. NA's
## 1 0.692   0.750  0.785 0.793   0.840 0.897    0
## 2 0.692   0.744  0.808 0.788   0.841 0.850    0
## 4 0.692   0.744  0.795 0.786   0.825 0.846    0
## 8 0.692   0.750  0.808 0.796   0.841 0.897    0
```

It can be seen that a `nodesize` value of 2 or 8 leads to the highest median accuracy.
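Once a value is chosen, the final model can be refit with that setting. A sketch, here picking `nodesize = 8`:

```
# Refit the random forest with the selected node size
set.seed(123)
final.model <- train(
  diabetes ~ ., data = na.omit(PimaIndiansDiabetes2), method = "rf",
  trControl = trainControl(method = "cv", number = 10),
  nodesize = 8
)
```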

This chapter describes the basics of bagging and random forest machine learning algorithms. We also provide practical examples in R for classification and regression analyses.

Another alternative to bagging and random forest is boosting (Chapter @ref(boosting)).

Bruce, Peter, and Andrew Bruce. 2017. *Practical Statistics for Data Scientists*. O’Reilly Media.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014. *An Introduction to Statistical Learning: With Applications in R*. Springer Publishing Company, Incorporated.

Previously, we have described bagging and random forest machine learning algorithms for building a powerful predictive model (Chapter @ref(bagging-and-random-forest)).

Recall that bagging consists of taking multiple subsets of the training data set, building an independent decision tree model on each subset, and then averaging their predictions, which yields a much more performant predictive model than the classical CART model (Chapter @ref(decision-tree-models)).

This chapter describes an alternative method called **boosting**, which is similar to the bagging method, except that the trees are grown sequentially: each successive tree is grown using information from previously grown trees, with the aim to minimize the error of the previous models (James et al. 2014).

For example, given a current regression tree model, the procedure is as follows:

- Fit a decision tree using the model residual errors as the outcome variable.
- Add this new decision tree, adjusted by a shrinkage parameter `lambda`, into the fitted function in order to update the residuals. `lambda` is a small positive value, typically between 0.001 and 0.01 (James et al. 2014).

This approach slowly and successively improves the fit, resulting in a very performant model. Boosting has different tuning parameters, including:

- The number of trees B
- The shrinkage parameter lambda
- The number of splits in each tree.
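To make the sequential idea concrete, here is a minimal sketch of boosting for regression using small `rpart` trees. It is illustrative only, not the actual xgboost implementation; `B`, `lambda`, and `depth` correspond to the three tuning parameters above:

```
library(rpart)

# Boost B small regression trees, each fit to the current residuals
boost_trees <- function(data, outcome, B = 100, lambda = 0.01, depth = 2) {
  residuals <- data[[outcome]]
  fitted <- rep(0, nrow(data))
  trees <- vector("list", B)
  for (b in seq_len(B)) {
    df <- data
    df[[outcome]] <- residuals
    # Fit a shallow tree to the residuals of the current model
    fit <- rpart(as.formula(paste(outcome, "~ .")), data = df,
                 control = rpart.control(maxdepth = depth))
    pred <- predict(fit, data)
    fitted <- fitted + lambda * pred        # update the fitted function
    residuals <- residuals - lambda * pred  # shrink the residuals
    trees[[b]] <- fit
  }
  list(trees = trees, fitted = fitted)
}
```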

There are different variants of boosting, including *Adaboost*, *gradient boosting* and *stochastic gradient boosting*.

Stochastic gradient boosting, implemented in the R package *xgboost*, is the most commonly used boosting technique; it involves resampling of observations and columns in each round and often offers the best performance. xgboost stands for extreme gradient boosting.

Boosting can be used for both classification and regression problems.

In this chapter we’ll describe how to compute boosting in R.

Contents:

- `tidyverse` for easy data manipulation and visualization
- `caret` for easy machine learning workflow
- `xgboost` for computing the boosting algorithm

```
library(tidyverse)
library(caret)
library(xgboost)
```

Data set: `PimaIndiansDiabetes2` [in the `mlbench` package], introduced in Chapter @ref(classification-in-r), for predicting the probability of being diabetes positive based on multiple clinical variables.

Randomly split the data into training set (80% for building a predictive model) and test set (20% for evaluating the model). Make sure to set seed for reproducibility.

```
# Load the data and remove NAs
data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
# Inspect the data
sample_n(PimaIndiansDiabetes2, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- PimaIndiansDiabetes2[training.samples, ]
test.data <- PimaIndiansDiabetes2[-training.samples, ]
```

We’ll use the `caret` workflow, which invokes the `xgboost` package, to automatically adjust the model parameter values and fit the final boosted tree that best explains our data.

We’ll use the following arguments in the function `train()`:

- `trControl`, to set up 10-fold cross-validation

```
# Fit the model on the training set
set.seed(123)
model <- train(
  diabetes ~ ., data = train.data, method = "xgbTree",
  trControl = trainControl("cv", number = 10)
)
# Best tuning parameter
model$bestTune
```

```
##    nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 18     150         1 0.3     0              0.8                1         1
```
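By default, caret searches over a built-in grid of these parameters. You can also supply your own grid via the `tuneGrid` argument; a sketch, with illustrative values:

```
# Custom tuning grid for method = "xgbTree"
tune.grid <- expand.grid(
  nrounds = c(100, 150, 200), max_depth = c(1, 2, 3), eta = 0.3,
  gamma = 0, colsample_bytree = 0.8, min_child_weight = 1, subsample = 1
)
set.seed(123)
model2 <- train(
  diabetes ~ ., data = train.data, method = "xgbTree",
  trControl = trainControl("cv", number = 10),
  tuneGrid = tune.grid
)
model2$bestTune
```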

```
# Make predictions on the test data
predicted.classes <- model %>% predict(test.data)
head(predicted.classes)
```

```
## [1] neg pos neg neg pos neg
## Levels: neg pos
```

```
# Compute model prediction accuracy rate
mean(predicted.classes == test.data$diabetes)
```

`## [1] 0.744`

The prediction accuracy on new test data is 74%, which is good.
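For a fuller picture than raw accuracy, caret’s `confusionMatrix()` reports sensitivity, specificity, and kappa alongside the cross-tabulation of predicted and observed classes:

```
# Cross-tabulate predictions against the observed classes;
# `positive` names the class treated as the event of interest
confusionMatrix(predicted.classes, test.data$diabetes, positive = "pos")
```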

For more explanation about the boosting tuning parameters, type `?xgboost` in R to see the documentation.

The function `varImp()` [in `caret`] displays the importance of variables in percentage:

`varImp(model)`

```
## xgbTree variable importance
##
##          Overall
## glucose   100.00
## mass       20.23
## pregnant   15.83
## insulin    13.15
## pressure    9.51
## triceps     8.18
## pedigree    0.00
## age         0.00
```

Similarly, you can build a boosting model to perform regression, that is, to predict a continuous variable.

We’ll use the `Boston` data set [in the `MASS` package], introduced in Chapter @ref(regression-analysis), for predicting the median house value (`medv`) in Boston suburbs using different predictor variables.

Randomly split the data into training set (80% for building a predictive model) and test set (20% for evaluating the model).

```
# Load the data
data("Boston", package = "MASS")
# Inspect the data
sample_n(Boston, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- Boston$medv %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]
```

Here the prediction error is measured by the RMSE, the square root of the average squared difference between the observed outcome values and the values predicted by the model.

```
# Fit the model on the training set
set.seed(123)
model <- train(
  medv ~ ., data = train.data, method = "xgbTree",
  trControl = trainControl("cv", number = 10)
)
# Best tuning parameters
model$bestTune
# Make predictions on the test data
predictions <- model %>% predict(test.data)
head(predictions)
# Compute the average prediction error RMSE
RMSE(predictions, test.data$medv)
```
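In addition to the RMSE, caret’s `R2()` function reports the proportion of variance in the observed values explained by the predictions, which can be useful for comparing this model with the random forest from the previous chapter:

```
# Compute R-squared on the test set (higher is better)
R2(predictions, test.data$medv)
```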

This chapter describes the boosting machine learning technique and provides examples in R for building a predictive model. See also the bagging and random forest methods in Chapter @ref(bagging-and-random-forest).

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014. *An Introduction to Statistical Learning: With Applications in R*. Springer Publishing Company, Incorporated.