Last articles - STHDA : Statistical Machine Learning Essentials

KNN: K-Nearest Neighbors Essentials

Sun, 11 Mar 2018 01:09:00 +0100

The k-nearest neighbors (KNN) algorithm is a simple machine learning method used for both classification and regression. The kNN algorithm predicts the outcome of a new observation by comparing it to k similar cases in the training data set, where k is defined by the analyst.

In this chapter, we start by describing the basics of the KNN algorithm for both classification and regression settings. Next, we provide practical example in R for preparing the data and computing KNN model.

Additionally, you’ll learn how to make predictions and to assess the performance of the built model in the predicting the outcome of new test observations.

Contents:

KNN algorithm
Loading required R packages
Classification
- Example of data set
- Computing KNN classifier
KNN for regression
Discussion

The Book:

Machine Learning Essentials: Practical Guide in R

KNN algorithm

KNN algorithm for classification:

To classify a given new observation (new_obs), the k-nearest neighbors method starts by identifying the k most similar training observations (i.e. neighbors) to our new_obs, and then assigns new_obs to the class containing the majority of its neighbors.

KNN algorithm for regression:

Similarly, to predict a continuous outcome value for given new observation (new_obs), the KNN algorithm computes the average outcome value of the k training observations that are the most similar to new_obs, and returns this value as new_obs predicted outcome value.

Similarity measures:

Note that, the (dis)similarity between observations is generally determined using Euclidean distance measure, which is very sensitive to the scale on which predictor variable measurements are made. So, it’s generally recommended to standardize (i.e., normalize) the predictor variables for making their scales comparable.

The following sections shows how to build a k-nearest neighbor predictive model for classification and regression settings.

Loading required R packages

tidyverse for easy data manipulation and visualization
caret for easy machine learning workflow

library(tidyverse)
library(caret)

Classification

Example of data set

Data set: PimaIndiansDiabetes2 [in mlbench package], introduced in Chapter @ref(classification-in-r), for predicting the probability of being diabetes positive based on multiple clinical variables.

We’ll randomly split the data into training set (80% for building a predictive model) and test set (20% for evaluating the model). Make sure to set seed for reproducibility.

# Load the data and remove NAs
data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
# Inspect the data
sample_n(PimaIndiansDiabetes2, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>% 
  createDataPartition(p = 0.8, list = FALSE)
train.data  <- PimaIndiansDiabetes2[training.samples, ]
test.data <- PimaIndiansDiabetes2[-training.samples, ]

Computing KNN classifier

We’ll use the caret package, which automatically tests different possible values of k, then chooses the optimal k that minimizes the cross-validation (“cv”) error, and fits the final best KNN model that explains the best our data.

Additionally caret can automatically preprocess the data in order to normalize the predictor variables.

We’ll use the following arguments in the function train():

trControl, to set up 10-fold cross validation
preProcess, to normalize the data
tuneLength, to specify the number of possible k values to evaluate

# Fit the model on the training set
set.seed(123)
model <- train(
  diabetes ~., data = train.data, method = "knn",
  trControl = trainControl("cv", number = 10),
  preProcess = c("center","scale"),
  tuneLength = 20
  )
# Plot model accuracy vs different values of k
plot(model)

# Print the best tuning parameter k that
# maximizes model accuracy
model$bestTune

##    k
## 5 13

# Make predictions on the test data
predicted.classes <- model %>% predict(test.data)
head(predicted.classes)

## [1] neg pos neg pos pos neg
## Levels: neg pos

# Compute model accuracy rate
mean(predicted.classes == test.data$diabetes)

## [1] 0.769

The overall prediction accuracy of our model is 76.9%, which is good (see Chapter @ref(classification-model-evaluation) for learning key metrics used to evaluate a classification model performance).

KNN for regression

In this section, we’ll describe how to predict a continuous variable using KNN.

We’ll use the Boston data set [in MASS package], introduced in Chapter @ref(regression-analysis), for predicting the median house value (mdev), in Boston Suburbs, using different predictor variables.

Randomly split the data into training set (80% for building a predictive model) and test set (20% for evaluating the model). Make sure to set seed for reproducibility.

# Load the data
data("Boston", package = "MASS")
# Inspect the data
sample_n(Boston, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- Boston$medv %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data  <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]

Compute KNN using caret.

The best k is the one that minimize the prediction error RMSE (root mean squared error).

The RMSE corresponds to the square root of the average difference between the observed known outcome values and the predicted values, RMSE = mean((observeds - predicteds)^2) %>% sqrt(). The lower the RMSE, the better the model.

# Fit the model on the training set
set.seed(123)
model <- train(
  medv~., data = train.data, method = "knn",
  trControl = trainControl("cv", number = 10),
  preProcess = c("center","scale"),
  tuneLength = 10
  )
# Plot model error RMSE vs different values of k
plot(model)
# Best tuning parameter k that minimize the RMSE
model$bestTune
# Make predictions on the test data
predictions <- model %>% predict(test.data)
head(predictions)
# Compute the prediction error RMSE
RMSE(predictions, test.data$medv)

Discussion

This chapter describes the basics of KNN (k-nearest neighbors) modeling, which is conceptually, one of the simpler machine learning method.

It’s recommended to standardize the data when performing the KNN analysis. We provided R codes to easily compute KNN predictive model and to assess the model performance on test data.

When fitting the KNN algorithm, the analyst needs to specify the number of neighbors (k) to be considered in the KNN algorithm for predicting the outcome of an observation. The choice of k considerably impacts the output of KNN. k = 1 corresponds to a highly flexible method resulting to a training error rate of 0 (overfitting), but the test error rate may be quite high.

You need to test multiple k-values to decide an optimal value for your data. This can be done automatically using the caret package, which chooses a value of k that minimize the cross-validation error @ref(cross-validation).

CART Model: Decision Tree Essentials

Sun, 11 Mar 2018 00:45:00 +0100

The decision tree method is a powerful and popular predictive machine learning technique that is used for both classification and regression. So, it is also known as Classification and Regression Trees (CART).

Note that the R implementation of the CART algorithm is called RPART (Recursive Partitioning And Regression Trees) available in a package of the same name.

In this chapter we’ll describe the basics of tree models and provide R codes to compute classification and regression trees.

Contents:

Loading required R packages
Decision tree algorithm
Classification trees
Regression trees
- Example of data set
- Create the regression tree
Conditionnal inference tree
Discussion
References

The Book:

Machine Learning Essentials: Practical Guide in R

Loading required R packages

tidyverse for easy data manipulation and visualization
caret for easy machine learning workflow
rpart for computing decision tree models

library(tidyverse)
library(caret)
library(rpart)

Decision tree algorithm

Basics and visual representation

The algorithm of decision tree models works by repeatedly partitioning the data into multiple sub-spaces, so that the outcomes in each final sub-space is as homogeneous as possible. This approach is technically called recursive partitioning.

The produced result consists of a set of rules used for predicting the outcome variable, which can be either:

a continuous variable, for regression trees
a categorical variable, for classification trees

The decision rules generated by the CART predictive model are generally visualized as a binary tree.

The following example represents a tree model predicting the species of iris flower based on the length (in cm) and width of sepal and petal.

library(rpart)
model <- rpart(Species ~., data = iris)
par(xpd = NA) # otherwise on some devices the text is clipped
plot(model)
text(model, digits = 3)

The plot shows the different possible splitting rules that can be used to effectively predict the type of outcome (here, iris species). For example, the top split assigns observations having Petal.length < 2.45 to the left branch, where the predicted species are setosa.

The different rules in tree can be printed as follow:

print(model, digits = 2)

## n= 150 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 150 100 setosa (0.333 0.333 0.333)  
##   2) Petal.Length< 2.5 50   0 setosa (1.000 0.000 0.000) *
##   3) Petal.Length>=2.5 100  50 versicolor (0.000 0.500 0.500)  
##     6) Petal.Width< 1.8 54   5 versicolor (0.000 0.907 0.093) *
##     7) Petal.Width>=1.8 46   1 virginica (0.000 0.022 0.978) *

These rules are produced by repeatedly splitting the predictor variables, starting with the variable that has the highest association with the response variable. The process continues until some predetermined stopping criteria are met.

The resulting tree is composed of decision nodes, branches and leaf nodes. The tree is placed from upside to down, so the root is at the top and leaves indicating the outcome is put at the bottom.

Each decision node corresponds to a single input predictor variable and a split cutoff on that variable. The leaf nodes of the tree are the outcome variable which is used to make predictions.

The tree grows from the top (root), at each node the algorithm decides the best split cutoff that results to the greatest purity (or homogeneity) in each subpartition.

The tree will stop growing by the following three criteria (Zhang 2016):

all leaf nodes are pure with a single class;
a pre-specified minimum number of training observations that cannot be assigned to each leaf nodes with any splitting methods;
The number of observations in the leaf node reaches the pre-specified minimum one.

A fully grown tree will overfit the training data and the resulting model might not be performant for predicting the outcome of new test data. Techniques, such as pruning, are used to control this problem.

Choosing the trees split points

Technically, for regression modeling, the split cutoff is defined so that the residual sum of squared error (RSS) is minimized across the training samples that fall within the subpartition.

Recall that, the RSS is the sum of the squared difference between the observed outcome values and the predicted ones, RSS = sum((Observeds - Predicteds)^2). See Chapter @ref(linear-regression)

In classification settings, the split point is defined so that the population in subpartitions are pure as much as possible. Two measures of purity are generally used, including the Gini index and the entropy (or information gain).

For a given subpartition, Gini = sum(p(1-p)) and entropy = -1*sum(p*log(p)), where p is the proportion of misclassified observations within the subpartition.

The sum is computed across the different categories or classes in the outcome variable. The Gini index and the entropy varie from 0 (greatest purity) to 1 (maximum degree of impurity)

Making predictions

The different rule sets established in the tree are used to predict the outcome of a new test data.

The following R code predict the species of a new collected iris flower:

newdata <- data.frame(
  Sepal.Length = 6.5, Sepal.Width = 3.0,
  Petal.Length = 5.2, Petal.Width = 2.0
)
model %>% predict(newdata, "class")

##         1 
## virginica 
## Levels: setosa versicolor virginica

The new data is predicted to be virginica.

Classification trees

Example of data set

We’ll randomly split the data into training set (80% for building a predictive model) and test set (20% for evaluating the model). Make sure to set seed for reproducibility.

# Load the data and remove NAs
data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
# Inspect the data
sample_n(PimaIndiansDiabetes2, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>% 
  createDataPartition(p = 0.8, list = FALSE)
train.data  <- PimaIndiansDiabetes2[training.samples, ]
test.data <- PimaIndiansDiabetes2[-training.samples, ]

Fully grown trees

Here, we’ll create a fully grown tree showing all predictor variables in the data set.

# Build the model
set.seed(123)
model1 <- rpart(diabetes ~., data = train.data, method = "class")
# Plot the trees
par(xpd = NA) # Avoid clipping the text in some device
plot(model1)
text(model1, digits = 3)

# Make predictions on the test data
predicted.classes <- model1 %>% 
  predict(test.data, type = "class")
head(predicted.classes)

##  21  25  28  29  32  36 
## neg pos neg pos pos neg 
## Levels: neg pos

# Compute model accuracy rate on test data
mean(predicted.classes == test.data$diabetes)

## [1] 0.782

The overall accuracy of our tree model is 78%, which is not so bad.

However, this full tree including all predictor appears to be very complex and can be difficult to interpret in the situation where you have a large data sets with multiple predictors.

Additionally, it is easy to see that, a fully grown tree will overfit the training data and might lead to poor test set performance.

A strategy to limit this overfitting is to prune back the tree resulting to a simpler tree with fewer splits and better interpretation at the cost of a little bias (James et al. 2014, P. Bruce and Bruce (2017)).

Pruning the tree

Briefly, our goal here is to see if a smaller subtree can give us comparable results to the fully grown tree. If yes, we should go for the simpler tree because it reduces the likelihood of overfitting.

One possible robust strategy of pruning the tree (or stopping the tree to grow) consists of avoiding splitting a partition if the split does not significantly improves the overall quality of the model.

In rpart package, this is controlled by the complexity parameter (cp), which imposes a penalty to the tree for having two many splits. The default value is 0.01. The higher the cp, the smaller the tree.

A too small value of cp leads to overfitting and a too large cp value will result to a too small tree. Both cases decrease the predictive performance of the model.

An optimal cp value can be estimated by testing different cp values and using cross-validation approaches to determine the corresponding prediction accuracy of the model. The best cp is then defined as the one that maximize the cross-validation accuracy (Chapter @ref(cross-validation)).

Pruning can be easily performed in the caret package workflow, which invokes the rpart method for automatically testing different possible values of cp, then choose the optimal cp that maximize the cross-validation (“cv”) accuracy, and fit the final best CART model that explains the best our data.

You can use the following arguments in the function train() [from caret package]:

trControl, to set up 10-fold cross validation
tuneLength, to specify the number of possible cp values to evaluate. Default value is 3, here we’ll use 10.

# Fit the model on the training set
set.seed(123)
model2 <- train(
  diabetes ~., data = train.data, method = "rpart",
  trControl = trainControl("cv", number = 10),
  tuneLength = 10
  )
# Plot model accuracy vs different values of
# cp (complexity parameter)
plot(model2)

# Print the best tuning parameter cp that
# maximizes the model accuracy
model2$bestTune

##       cp
## 2 0.0321

# Plot the final tree model
par(xpd = NA) # Avoid clipping the text in some device
plot(model2$finalModel)
text(model2$finalModel,  digits = 3)

# Decision rules in the model
model2$finalModel

## n= 314 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 314 104 neg (0.6688 0.3312)  
##    2) glucose< 128 188  26 neg (0.8617 0.1383) *
##    3) glucose>=128 126  48 pos (0.3810 0.6190)  
##      6) glucose< 166 88  44 neg (0.5000 0.5000)  
##       12) age< 23.5 16   1 neg (0.9375 0.0625) *
##       13) age>=23.5 72  29 pos (0.4028 0.5972) *
##      7) glucose>=166 38   4 pos (0.1053 0.8947) *

# Make predictions on the test data
predicted.classes <- model2 %>% predict(test.data)
# Compute model accuracy rate on test data
mean(predicted.classes == test.data$diabetes)

## [1] 0.795

From the output above, it can be seen that the best value for the complexity parameter (cp) is 0.032, allowing a simpler tree, easy to interpret, with an overall accuracy of 79%, which is comparable to the accuracy (78%) that we have obtained with the full tree. The prediction accuracy of the pruned tree is even better compared to the full tree.

Taken together, we should go for this simpler model.

Regression trees

Previously, we described how to build a classification tree for predicting the group (i.e. class) of observations. In this section, we’ll describe how to build a tree for predicting a continuous variable, a method called regression analysis (Chapter @ref(regression-analysis)).

The R code is identical to what we have seen in previous sections. Pruning should be also applied here to limit overfiting.

Similarly to classification trees, the following R code uses the caret package to build regression trees and to predict the output of a new test data set.

Example of data set

Data set: We’ll use the Boston data set [in MASS package], introduced in Chapter @ref(regression-analysis), for predicting the median house value (mdev), in Boston Suburbs, using different predictor variables.

We’ll randomly split the data into training set (80% for building a predictive model) and test set (20% for evaluating the model). Make sure to set seed for reproducibility.

# Load the data
data("Boston", package = "MASS")
# Inspect the data
sample_n(Boston, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- Boston$medv %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data  <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]

Create the regression tree

Here, the best cp value is the one that minimize the prediction error RMSE (root mean squared error).

The prediction error is measured by the RMSE, which corresponds to the average difference between the observed known values of the outcome and the predicted value by the model. RMSE is computed as RMSE = mean((observeds - predicteds)^2) %>% sqrt(). The lower the RMSE, the better the model.

Choose the best cp value:

# Fit the model on the training set
set.seed(123)
model <- train(
  medv ~., data = train.data, method = "rpart",
  trControl = trainControl("cv", number = 10),
  tuneLength = 10
  )
# Plot model error vs different values of
# cp (complexity parameter)
plot(model)
# Print the best tuning parameter cp that
# minimize the model RMSE
model$bestTune

Plot the final tree model:

# Plot the final tree model
par(xpd = NA) # Avoid clipping the text in some device
plot(model$finalModel)
text(model$finalModel, digits = 3)

# Decision rules in the model
model$finalModel
# Make predictions on the test data
predictions <- model %>% predict(test.data)
head(predictions)
# Compute the prediction error RMSE
RMSE(predictions, test.data$medv)

Conditionnal inference tree

The conditional inference tree (ctree) uses significance test methods to select and split recursively the most related predictor variables to the outcome. This can limit overfitting compared to the classical rpart algorithm.

At each splitting step, the algorithm stops if there is no dependence between predictor variables and the outcome variable. Otherwise the variable that is the most associated to the outcome is selected for splitting.

The conditional tree can be easily computed using the caret workflow, which will invoke the function ctree() available in the party package.

Demo data: PimaIndiansDiabetes2. First split the data into training (80%) and test set (20%)

# Load the data
data("PimaIndiansDiabetes2", package = "mlbench")
pima.data <- na.omit(PimaIndiansDiabetes2)
# Split the data into training and test set
set.seed(123)
training.samples <- pima.data$diabetes %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data  <- pima.data[training.samples, ]
test.data <- pima.data[-training.samples, ]

Build conditional trees using the tuning parameters maxdepth and mincriterion for controlling the tree size. caret package selects automatically the optimal tuning values for your data, but here we’ll specify maxdepth and mincriterion.

The following example create a classification tree:

library(party)
set.seed(123)
model <- train(
  diabetes ~., data = train.data, method = "ctree2",
  trControl = trainControl("cv", number = 10),
  tuneGrid = expand.grid(maxdepth = 3, mincriterion = 0.95 )
  )
plot(model$finalModel)

# Make predictions on the test data
predicted.classes <- model %>% predict(test.data)
# Compute model accuracy rate on test data
mean(predicted.classes == test.data$diabetes)

## [1] 0.744

The p-value indicates the association between a given predictor variable and the outcome variable. For example, the first decision node at the top shows that glucose is the variable that is most strongly associated with diabetes with a p value < 0.001, and thus is selected as the first node.

Discussion

This chapter describes how to build classification and regression tree in R. Trees provide a visual tool that are very easy to interpret and to explain to people.

Tree models might be very performant compared to the linear regression model (Chapter @ref(linear-regression)), when there is a highly non-linear and complex relationships between the outcome variable and the predictors.

However, building only one single tree from a training data set might results to a less performant predictive model. A single tree is unstable and the structure might be altered by small changes in the training data.

For example, the exact split point of a given predictor variable and the predictor to be selected at each step of the algorithm are strongly dependent on the training data set. Using a slightly different training data may alter the first variable to split in, and the structure of the tree can be completely modified.

Other machine learning algorithms - including bagging, random forest and boosting - can be used to build multiple different trees from one single data set leading to a better predictive performance. But, with these methods the interpretability observed for a single tree is lost. Note that all these above mentioned strategies are based on the CART algorithm. See Chapter @ref(bagging-and-random-forest) and @ref(boosting).

References

Bruce, Peter, and Andrew Bruce. 2017. Practical Statistics for Data Scientists. O’Reilly Media.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014. An Introduction to Statistical Learning: With Applications in R. Springer Publishing Company, Incorporated.

Zhang, Zhongheng. 2016. “Decision Tree Modeling Using R.” Annals of Translational Medicine 4 (15).

Bagging and Random Forest Essentials

Sun, 11 Mar 2018 00:32:00 +0100

In the Chapter @ref(decision-tree-models), we have described how to build decision trees for predictive modeling.

The standard decision tree model, CART for classification and regression trees, build only one single tree, which is then used to predict the outcome of new observations. The output of this strategy is very unstable and the tree structure might be severally affected by a small change in the training data set.

There are different powerful alternatives to the classical CART algorithm, including bagging, Random Forest and boosting.

Bagging stands for bootstrap aggregating. It consists of building multiple different decision tree models from a single training data set by repeatedly using multiple bootstrapped subsets of the data and averaging the models. Here, each tree is build independently to the others. Read more on bootstrapping in the Chapter @ref(bootstrap-resampling).

Random Forest algorithm, is one of the most commonly used and the most powerful machine learning techniques. It is a special type of bagging applied to decision trees.

Compared to the standard CART model (Chapter @ref(decision-tree-models)), the random forest provides a strong improvement, which consists of applying bagging to the data and bootstrap sampling to the predictor variables at each split (James et al. 2014, P. Bruce and Bruce (2017)). This means that at each splitting step of the tree algorithm, a random sample of n predictors is chosen as split candidates from the full set of the predictors.

Random forest can be used for both classification (predicting a categorical variable) and regression (predicting a continuous variable).

In this chapter, we’ll describe how to compute random forest algorithm in R for building a powerful predictive model. Additionally, you’ll learn how to rank the predictor variable according to their importance in contributing to the model accuracy.

Contents:

Loading required R packages
Classification
Regression
- Example of data set
- Computing random forest regression trees
Hyperparameters
Discussion
References

The Book:

Machine Learning Essentials: Practical Guide in R

Loading required R packages

tidyverse for easy data manipulation and visualization
caret for easy machine learning workflow
randomForest for computing random forest algorithm

library(tidyverse)
library(caret)
library(randomForest)

Classification

Example of data set

Randomly split the data into training set (80% for building a predictive model) and test set (20% for evaluating the model). Make sure to set seed for reproducibility.

# Load the data and remove NAs
data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
# Inspect the data
sample_n(PimaIndiansDiabetes2, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>% 
  createDataPartition(p = 0.8, list = FALSE)
train.data  <- PimaIndiansDiabetes2[training.samples, ]
test.data <- PimaIndiansDiabetes2[-training.samples, ]

Computing random forest classifier

We’ll use the caret workflow, which invokes the randomforest() function [randomForest package], to automatically select the optimal number (mtry) of predictor variables randomly sampled as candidates at each split, and fit the final best random forest model that explains the best our data.

We’ll use the following arguments in the function train():

trControl, to set up 10-fold cross validation

# Fit the model on the training set
set.seed(123)
model <- train(
  diabetes ~., data = train.data, method = "rf",
  trControl = trainControl("cv", number = 10),
  importance = TRUE
  )
# Best tuning parameter
model$bestTune

##   mtry
## 3    8

# Final model
model$finalModel

## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 8
## 
##         OOB estimate of  error rate: 22%
## Confusion matrix:
##     neg pos class.error
## neg 185  25       0.119
## pos  44  60       0.423

# Make predictions on the test data
predicted.classes <- model %>% predict(test.data)
head(predicted.classes)

## [1] neg pos neg neg pos neg
## Levels: neg pos

# Compute model accuracy rate
mean(predicted.classes == test.data$diabetes)

## [1] 0.808

By default, 500 trees are trained. The optimal number of variables sampled at each split is 8.

Each bagged tree makes use of around two-thirds of the observations. The remaining one-third of the observations not used to fit a given bagged tree are referred to as the out-of-bag (OOB) observations (James et al. 2014).

For a given tree, the out-of-bag (OOB) error is the model error in predicting the data left out of the training set for that tree (P. Bruce and Bruce 2017). OOB is a very straightforward way to estimate the test error of a bagged model, without the need to perform cross-validation or the validation set approach.

In our example, the OBB estimate of error rate is 24%.

The prediction accuracy on new test data is 79%, which is good.

Variable importance

The importance of each variable can be printed using the function importance() [randomForest package]:

importance(model$finalModel)

##            neg    pos MeanDecreaseAccuracy MeanDecreaseGini
## pregnant 11.57  0.318                10.36             8.86
## glucose  38.93 28.437                46.17            53.30
## pressure -1.94  0.846                -1.06             8.09
## triceps   6.19  3.249                 6.85             9.92
## insulin   8.65 -2.037                 6.01            12.43
## mass      7.71  2.299                 7.57            14.58
## pedigree  6.57  1.083                 5.66            14.50
## age       9.51 12.310                15.75            16.76

The result shows:

MeanDecreaseAccuracy, which is the average decrease of model accuracy in predicting the outcome of the out-of-bag samples when a specific variable is excluded from the model.
MeanDecreaseGini, which is the average decrease in node impurity that results from splits over that variable. The Gini impurity index is only used for classification problem. In the regression the node impurity is measured by training set RSS. These measures, calculated using the training set, are less reliable than a measure calculated on out-of-bag data. See Chapter @ref(decision-tree-models) for node impurity measures (Gini index and RSS).

Note that, by default (argument importance = FALSE), randomForest only calculates the Gini impurity index. However, computing the model accuracy by variable (argument importance = TRUE) requires supplementary computations which might be time consuming in the situations, where thousands of models (trees) are being fitted.

Variables importance measures can be plotted using the function varImpPlot() [randomForest package]:

# Plot MeanDecreaseAccuracy
varImpPlot(model$finalModel, type = 1)

# Plot MeanDecreaseGini
varImpPlot(model$finalModel, type = 2)

The results show that across all of the trees considered in the random forest, the glucose and age variables are the two most important variables.

The function varImp() [in caret] displays the importance of variables in percentage:

varImp(model)

## rf variable importance
## 
##          Importance
## glucose       100.0
## age            33.5
## pregnant       19.0
## mass           16.2
## triceps        15.4
## pedigree       12.8
## insulin        11.2
## pressure        0.0

Regression

Similarly, you can build a random forest model to perform regression, that is to predict a continuous variable.

Example of data set

Randomly split the data into training set (80% for building a predictive model) and test set (20% for evaluating the model).

# Load the data
data("Boston", package = "MASS")
# Inspect the data
sample_n(Boston, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- Boston$medv %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data  <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]

Computing random forest regression trees

Here the prediction error is measured by the RMSE, which corresponds to the average difference between the observed known values of the outcome and the predicted value by the model. RMSE is computed as RMSE = mean((observeds - predicteds)^2) %>% sqrt(). The lower the RMSE, the better the model.

# Fit the model on the training set
set.seed(123)
model <- train(
  medv ~., data = train.data, method = "rf",
  trControl = trainControl("cv", number = 10)
  )
# Best tuning parameter mtry
model$bestTune
# Make predictions on the test data
predictions <- model %>% predict(test.data)
head(predictions)
# Compute the average prediction error RMSE
RMSE(predictions, test.data$medv)

Hyperparameters

Note that, the random forest algorithm has a set of hyperparameters that should be tuned using cross-validation to avoid overfitting.

These include:

nodesize: Minimum size of terminal nodes. Default value for classification is 1 and default for regression is 5.
maxnodes: Maximum number of terminal nodes trees in the forest can have. If not given, trees are grown to the maximum possible (subject to limits by nodesize).

Ignoring these parameters might lead to overfitting on noisy data set (P. Bruce and Bruce 2017). Cross-validation can be used to test different values, in order to select the optimal value.

Hyperparameters can be tuned manually using the caret package. For a given parameter, the approach consists of fitting many models with different values of the parameters and then comparing the models.

The following example tests different values of nodesize using the PimaIndiansDiabetes2 data set for classification:

(This will take 1-2 minutes execution time)

data("PimaIndiansDiabetes2", package = "mlbench")
models <- list()
for (nodesize in c(1, 2, 4, 8)) {
    set.seed(123)
    model <- train(
      diabetes~., data = na.omit(PimaIndiansDiabetes2), method="rf", 
      trControl = trainControl(method="cv", number=10),
      metric = "Accuracy",
      nodesize = nodesize
      )
    model.name <- toString(nodesize)
    models[[model.name]] <- model
}
# Compare results
resamples(models) %>% summary(metric = "Accuracy")

## 
## Call:
## summary.resamples(object = ., metric = "Accuracy")
## 
## Models: 1, 2, 4, 8 
## Number of resamples: 10 
## 
## Accuracy 
##    Min. 1st Qu. Median  Mean 3rd Qu.  Max. NA's
## 1 0.692   0.750  0.785 0.793   0.840 0.897    0
## 2 0.692   0.744  0.808 0.788   0.841 0.850    0
## 4 0.692   0.744  0.795 0.786   0.825 0.846    0
## 8 0.692   0.750  0.808 0.796   0.841 0.897    0

It can be seen that, using a nodesize value of 2 or 8 leads to the most median accuracy value.

Discussion

This chapter describes the basics of bagging and random forest machine learning algorithms. We also provide practical examples in R for classification and regression analyses.

Another alternative to bagging and random forest is boosting (Chapter @ref(boosting)).

References

Bruce, Peter, and Andrew Bruce. 2017. Practical Statistics for Data Scientists. O’Reilly Media.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014. An Introduction to Statistical Learning: With Applications in R. Springer Publishing Company, Incorporated.

Gradient Boosting Essentials in R Using XGBOOST

Sun, 11 Mar 2018 00:22:00 +0100

Previously, we have described bagging and random forest machine learning algorithms for building a powerful predictive model (Chapter @ref(bagging-and-random-forest)).

Recall that bagging consists of taking multiple subsets of the training data set, then building multiple independent decision tree models, and then average the models allowing to create a very performant predictive model compared to the classical CART model (Chapter @ref(decision-tree-models)).

This chapter describes an alternative method called boosting, which is similar to the bagging method, except that the trees are grown sequentially: each successive tree is grown using information from previously grown trees, with the aim to minimize the error of the previous models (James et al. 2014).

For example, given a current regression tree model, the procedure is as follow:

Fit a decision tree using the model residual errors as the outcome variable.
Add this new decision tree, adjusted by a shrinkage parameter lambda, into the fitted function in order to update the residuals. lambda is a small positive value, typically comprised between 0.01 and 0.001 (James et al. 2014).

This approach results in slowly and successively improving the fitted the model resulting a very performant model. Boosting has different tuning parameters including:

The number of trees B
The shrinkage parameter lambda
The number of splits in each tree.

There are different variants of boosting, including Adaboost, gradient boosting and stochastic gradient boosting.

Stochastic gradient boosting, implemented in the R package xgboost, is the most commonly used boosting technique, which involves resampling of observations and columns in each round. It offers the best performance. xgboost stands for extremely gradient boosting.

Boosting can be used for both classification and regression problems.

In this chapter we’ll describe how to compute boosting in R.

Contents:

Loading required R packages
Classification
Regression
- Example of data set
- Boosted regression trees
Discussion
References

The Book:

Machine Learning Essentials: Practical Guide in R

Loading required R packages

tidyverse for easy data manipulation and visualization
caret for easy machine learning workflow
xgboost for computing boosting algorithm

library(tidyverse)
library(caret)
library(xgboost)

Classification

Example of data set

Randomly split the data into training set (80% for building a predictive model) and test set (20% for evaluating the model). Make sure to set seed for reproducibility.

# Load the data and remove NAs
data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
# Inspect the data
sample_n(PimaIndiansDiabetes2, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>% 
  createDataPartition(p = 0.8, list = FALSE)
train.data  <- PimaIndiansDiabetes2[training.samples, ]
test.data <- PimaIndiansDiabetes2[-training.samples, ]

Boosted classification trees

We’ll use the caret workflow, which invokes the xgboost package, to automatically adjust the model parameter values, and fit the final best boosted tree that explains the best our data.

We’ll use the following arguments in the function train():

trControl, to set up 10-fold cross validation

# Fit the model on the training set
set.seed(123)
model <- train(
  diabetes ~., data = train.data, method = "xgbTree",
  trControl = trainControl("cv", number = 10)
  )
# Best tuning parameter
model$bestTune

##    nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 18     150         1 0.3     0              0.8                1         1

# Make predictions on the test data
predicted.classes <- model %>% predict(test.data)
head(predicted.classes)

## [1] neg pos neg neg pos neg
## Levels: neg pos

# Compute model prediction accuracy rate
mean(predicted.classes == test.data$diabetes)

## [1] 0.744

The prediction accuracy on new test data is 74%, which is good.

For more explanation about the boosting tuning parameters, type ?xgboost in R to see the documentation.

Variable importance

The function varImp() [in caret] displays the importance of variables in percentage:

varImp(model)

## xgbTree variable importance
## 
##          Overall
## glucose   100.00
## mass       20.23
## pregnant   15.83
## insulin    13.15
## pressure    9.51
## triceps     8.18
## pedigree    0.00
## age         0.00

Regression

Similarly, you can build a random forest model to perform regression, that is to predict a continuous variable.

Example of data set

Randomly split the data into training set (80% for building a predictive model) and test set (20% for evaluating the model).

# Load the data
data("Boston", package = "MASS")
# Inspect the data
sample_n(Boston, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- Boston$medv %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data  <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]

Boosted regression trees

Here the prediction error is measured by the RMSE, which corresponds to the average difference between the observed known values of the outcome and the predicted value by the model.

# Fit the model on the training set
set.seed(123)
model <- train(
  medv ~., data = train.data, method = "xgbTree",
  trControl = trainControl("cv", number = 10)
  )
# Best tuning parameter mtry
model$bestTune
# Make predictions on the test data
predictions <- model %>% predict(test.data)
head(predictions)
# Compute the average prediction error RMSE
RMSE(predictions, test.data$medv)

Discussion

This chapter describes the boosting machine learning techniques and provide examples in R for building a predictive model. See also bagging and random forest methods in Chapter @ref(bagging-and-random-forest).

References

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014. An Introduction to Statistical Learning: With Applications in R. Springer Publishing Company, Incorporated.