**Logistic regression** is used to predict the class (or category) of individuals based on one or multiple predictor variables (x). It is used to model a binary outcome, that is a variable, which can have only two possible values: 0 or 1, yes or no, diseased or non-diseased.

Logistic regression belongs to a family, named *Generalized Linear Model* (*GLM*), developed for extending the linear regression model (Chapter @ref(linear-regression)) to other situations. Other synonyms are *binary logistic regression*, *binomial logistic regression* and *logit model*.

Logistic regression does not directly return the class of observations. It allows us to estimate the probability (p) of class membership. The probability will range between 0 and 1. You need to decide the threshold probability at which the category flips from one to the other. By default this is set to `p = 0.5`, but in reality it should be chosen based on the analysis purpose.

In this chapter you’ll learn how to:

- Define the logistic regression equation and key terms such as log-odds and logit
- Perform logistic regression in **R** and interpret the results
- Make predictions on new test data and evaluate the model accuracy


The standard logistic regression function, for predicting the outcome of an observation given a predictor variable (x), is an s-shaped curve defined as `p = exp(y) / [1 + exp(y)]` (James et al. 2014). This can also be written more simply as `p = 1/[1 + exp(-y)]`, where `y = b0 + b1*x`, `exp()` is the exponential function and `p` is the probability that the event occurs (1) given `x`. Mathematically, this is written as `p(event=1|x)` and abbreviated as `p(x)`, so:

`p(x) = 1/[1 + exp(-(b0 + b1*x))]`

With a bit of algebraic manipulation, it can be shown that `p/(1-p) = exp(b0 + b1*x)`. Taking the logarithm of both sides turns the formula into a linear combination of predictors: `log[p/(1-p)] = b0 + b1*x`.

When you have multiple predictor variables, the logistic function looks like: `log[p/(1-p)] = b0 + b1*x1 + b2*x2 + ... + bn*xn`

`b0` and `b1` are the regression beta coefficients. A positive `b1` indicates that increasing `x` is associated with increasing `p`; conversely, a negative `b1` indicates that increasing `x` is associated with decreasing `p`.

The quantity `log[p/(1-p)]` is called the logarithm of the odds, also known as the **log-odds** or **logit**.

The **odds** reflect the likelihood that the event will occur, and can be seen as the ratio of “successes” to “non-successes”. Technically, the odds are the probability of an event divided by the probability that the event will not take place (P. Bruce and Bruce 2017). For example, if the probability of being diabetes-positive is 0.5, the probability of not being so is 1 - 0.5 = 0.5, and the odds are 1.0.

Note that the probability can be calculated from the odds as `p = Odds/(1 + Odds)`.
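These relationships are easy to verify numerically. The following base-R sketch (using an illustrative probability of 0.8) converts a probability to odds and to the log-odds (logit), then recovers the probability both ways:

```r
# Illustrative probability of the event occurring
p <- 0.8

# Odds: probability of the event divided by the probability of no event
odds <- p / (1 - p)               # 0.8 / 0.2 = 4

# Log-odds (logit): the quantity that is linear in the predictors
logit <- log(odds)

# Recover the probability from the odds: p = Odds / (1 + Odds)
p.from.odds <- odds / (1 + odds)

# Recover the probability from the logit: p = 1 / (1 + exp(-logit))
p.from.logit <- 1 / (1 + exp(-logit))

c(odds = odds, logit = logit,
  p.from.odds = p.from.odds, p.from.logit = p.from.logit)
```

Both recovered values equal the original probability of 0.8.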

- `tidyverse` for easy data manipulation and visualization
- `caret` for an easy machine learning workflow

```
library(tidyverse)
library(caret)
theme_set(theme_bw())
```

Logistic regression works with data that contain continuous and/or categorical predictor variables.

Performing the following steps might improve the accuracy of your model:

- Remove potential outliers
- Make sure that the predictor variables are normally distributed. If not, you can use a log, square-root or Box-Cox transformation.
- Remove highly correlated predictors to minimize overfitting. The presence of highly correlated predictors might lead to an unstable model solution.

Here, we’ll use the `PimaIndiansDiabetes2` data set [in the `mlbench` package], introduced in Chapter @ref(classification-in-r), for predicting the probability of being diabetes-positive based on multiple clinical variables.

We’ll randomly split the data into a training set (80%, for building a predictive model) and a test set (20%, for evaluating the model). Make sure to set a seed for reproducibility.

```
# Load the data and remove NAs
data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
# Inspect the data
sample_n(PimaIndiansDiabetes2, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- PimaIndiansDiabetes2[training.samples, ]
test.data <- PimaIndiansDiabetes2[-training.samples, ]
```

The R function `glm()`, for generalized linear models, can be used to compute logistic regression. You need to specify the option `family = binomial`, which tells R to fit a logistic regression model.

```
# Fit the model
model <- glm( diabetes ~., data = train.data, family = binomial)
# Summarize the model
summary(model)
# Make predictions
probabilities <- model %>% predict(test.data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Model accuracy
mean(predicted.classes == test.data$diabetes)
```

Simple logistic regression is used to predict the probability of class membership based on a single predictor variable.

The following R code builds a model to predict the probability of being diabetes-positive based on the plasma glucose concentration:

```
model <- glm( diabetes ~ glucose, data = train.data, family = binomial)
summary(model)$coef
```

```
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.3267 0.7241 -8.74 2.39e-18
## glucose 0.0437 0.0054 8.09 6.01e-16
```

The output above shows the estimates of the regression beta coefficients and their significance levels. The intercept (`b0`) is -6.32 and the coefficient of the glucose variable is 0.043.

The logistic equation can then be written as `p = exp(-6.32 + 0.043*glucose)/ [1 + exp(-6.32 + 0.043*glucose)]`. Using this formula, for each new plasma glucose concentration value you can predict the probability that an individual is diabetes-positive.

Predictions can be easily made using the function `predict()`. Use the option `type = "response"` to directly obtain the probabilities:

```
newdata <- data.frame(glucose = c(20, 180))
probabilities <- model %>% predict(newdata, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
predicted.classes
```
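The same predictions can be reproduced by hand with the logistic formula. The sketch below is a minimal illustration using the rounded estimates reported above (b0 = -6.32, b1 = 0.043); base R’s `plogis()` computes the same logistic transformation:

```r
# Rounded coefficient estimates from the simple model above
b0 <- -6.32    # intercept
b1 <- 0.043    # glucose coefficient

glucose <- c(20, 180)

# Linear predictor (the log-odds), then the logistic transformation
y <- b0 + b1 * glucose
p <- exp(y) / (1 + exp(y))

# plogis() is base R's logistic function and gives the same values
round(p, 4)
round(plogis(y), 4)
```

With these estimates, a glucose value of 180 gives a probability above 0.5 (classified “pos”), while 20 gives a probability near zero (“neg”).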

The logistic function gives an s-shaped probability curve, illustrated as follows:

```
train.data %>%
  mutate(prob = ifelse(diabetes == "pos", 1, 0)) %>%
  ggplot(aes(glucose, prob)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "glm", method.args = list(family = "binomial")) +
  labs(
    title = "Logistic Regression Model",
    x = "Plasma Glucose Concentration",
    y = "Probability of being diabetes-pos"
  )
```

Multiple logistic regression is used to predict the probability of class membership based on multiple predictor variables, as follows:

```
model <- glm( diabetes ~ glucose + mass + pregnant,
              data = train.data, family = binomial)
summary(model)$coef
```

Here, we want to include all the predictor variables available in the data set. This is done using `~.`:

```
model <- glm( diabetes ~., data = train.data, family = binomial)
summary(model)$coef
```

```
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.50372 1.31719 -7.215 5.39e-13
## pregnant 0.04571 0.06218 0.735 4.62e-01
## glucose 0.04230 0.00657 6.439 1.20e-10
## pressure -0.00700 0.01291 -0.542 5.87e-01
## triceps 0.01858 0.01861 0.998 3.18e-01
## insulin -0.00159 0.00139 -1.144 2.52e-01
## mass 0.04502 0.02887 1.559 1.19e-01
## pedigree 0.96845 0.46020 2.104 3.53e-02
## age 0.04256 0.02158 1.972 4.86e-02
```

From the output above, the coefficients table shows the beta coefficient estimates and their significance levels. The columns are:

- `Estimate`: the intercept (b0) and the beta coefficient estimates associated with each predictor variable
- `Std. Error`: the standard error of the coefficient estimates. This represents the precision of the coefficients: the larger the standard error, the less confident we are in the estimate.
- `z value`: the z-statistic, which is the coefficient estimate (column 2) divided by the standard error of the estimate (column 3)
- `Pr(>|z|)`: the p-value corresponding to the z-statistic. The smaller the p-value, the more significant the estimate.
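As a quick sanity check, the relationship between these columns can be reproduced by hand. The sketch below uses the glucose row of the table above (estimate 0.0423, standard error 0.00657):

```r
# Glucose row of the coefficient table above
estimate <- 0.04230
std.error <- 0.00657

# z value = estimate / standard error
z <- estimate / std.error

# Two-sided p-value from the standard normal distribution
p.value <- 2 * pnorm(-abs(z))

round(z, 2)            # close to the 6.439 reported above
signif(p.value, 3)     # close to the 1.20e-10 reported above
```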

Note that the functions `coef()` and `summary()` can be used to extract the coefficients, as follows:

```
coef(model)
summary(model)$coef
```

It can be seen that only three of the eight predictors are significantly associated with the outcome at the 5% level: glucose, pedigree and age.

The coefficient estimate of the variable `glucose` is b = 0.042, which is positive. This means that an increase in glucose is associated with an increased probability of being diabetes-positive. Conversely, the coefficient of the variable `pressure` is b = -0.007, which is negative: an increase in blood pressure is associated with a decreased probability of being diabetes-positive.

An important concept for interpreting the logistic beta coefficients is the **odds ratio**. An odds ratio measures the association between a predictor variable (x) and the outcome variable (y). It represents the ratio of the odds that the event occurs (`event = 1`) given the presence of the predictor (`x = 1`), compared to the odds of the event occurring in its absence (`x = 0`).

For a given predictor (say x1), the associated beta coefficient (b1) in the logistic regression function corresponds to the log of the odds ratio for that predictor.

If the odds ratio is 2, then the odds that the event occurs (`event = 1`) are two times higher when the predictor is present (`x = 1`) than when it is absent (`x = 0`).

For example, the regression coefficient for glucose is 0.042. This indicates that a one-unit increase in the glucose concentration multiplies the odds of being diabetes-positive by exp(0.042) = 1.04.
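In R, the odds ratio is simply the exponential of the coefficient. A minimal sketch using the glucose estimate from the output above (to exponentiate all fitted coefficients at once you would use `exp(coef(model))`):

```r
# Glucose coefficient from the multiple logistic regression output above
b.glucose <- 0.0423

# Odds ratio associated with a one-unit increase in glucose
odds.ratio <- exp(b.glucose)
round(odds.ratio, 2)   # 1.04
```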

From the logistic regression results, it can be noticed that some variables, such as triceps and insulin, are not statistically significant. Keeping them in the model may contribute to overfitting, so they should be eliminated. This can be done automatically using statistical techniques, including **stepwise regression** and **penalized regression** methods. These methods are described in the next sections. Briefly, they consist of selecting an optimal model with a reduced set of variables, without compromising the model accuracy.

Here, as we have a small number of predictors (n = 8), we can manually select a subset of them:

```
model <- glm( diabetes ~ pregnant + glucose + pressure + mass + pedigree,
              data = train.data, family = binomial)
```

We’ll make predictions using the test data in order to evaluate the performance of our logistic regression model.

The procedure is as follows:

- Predict the class membership probabilities of the observations based on the predictor variables
- Assign each observation to the class with the highest probability score (i.e. above 0.5)

The R function `predict()` can be used to predict the probability of being diabetes-positive, given the predictor values.

**Predict the probabilities** of being diabetes-positive:

```
probabilities <- model %>% predict(test.data, type = "response")
head(probabilities)
```

```
## 21 25 28 29 32 36
## 0.3914 0.6706 0.0501 0.5735 0.6444 0.1494
```

Which classes do these probabilities refer to? In our example, the output is the probability that the diabetes test is positive. We know that these values correspond to the probability of the test being positive, rather than negative, because the `contrasts()` function indicates that R has created a dummy variable with 1 for “pos” and 0 for “neg”. The probabilities always refer to the class dummy-coded as “1”.

Check the dummy coding:

`contrasts(test.data$diabetes)`

```
## pos
## neg 0
## pos 1
```

**Predict the class of individuals**:

The following R code categorizes individuals into two groups based on their predicted probabilities (p) of being diabetes-positive: individuals with p above the 0.5 threshold are considered diabetes-positive.

```
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
head(predicted.classes)
```

```
## 21 25 28 29 32 36
## "neg" "pos" "neg" "pos" "pos" "neg"
```

The model accuracy is measured as the proportion of observations that have been correctly classified. Conversely, the classification error is the proportion of observations that have been misclassified.

Proportion of correctly classified observations:

`mean(predicted.classes == test.data$diabetes)`

`## [1] 0.756`

The classification prediction accuracy is about 76%, which is good. The misclassification error rate is 24%.
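Beyond the raw accuracy, a confusion matrix cross-tabulating predicted against observed classes shows where the errors occur. A minimal sketch with small illustrative vectors (in the analysis above you would pass `predicted.classes` and `test.data$diabetes` instead):

```r
# Illustrative predicted and observed classes
predicted <- c("neg", "pos", "neg", "pos", "pos", "neg", "neg", "pos")
observed  <- c("neg", "pos", "neg", "neg", "pos", "pos", "neg", "pos")

# Confusion matrix: rows are predictions, columns are observed classes
conf.mat <- table(predicted, observed)
conf.mat

# Accuracy is the proportion of counts on the diagonal
accuracy <- sum(diag(conf.mat)) / sum(conf.mat)
accuracy               # 0.75 for these illustrative vectors
```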

Note that, there are several metrics for evaluating the performance of a classification model (Chapter @ref(classification-model-evaluation)).

In this chapter, we have described how logistic regression works and provided R code to compute it. Additionally, we demonstrated how to make predictions and assess model accuracy. The output of a logistic regression model is very easy to interpret compared to other classification methods. Additionally, because of its simplicity it is less prone to overfitting than flexible methods such as decision trees.

Note that many concepts from linear regression also hold for logistic regression modeling. For example, you need to perform some diagnostics (Chapter @ref(logistic-regression-assumptions-and-diagnostics)) to make sure that the assumptions made by the model are met for your data.

Furthermore, you need to measure how good the model is at predicting the outcome of new test observations. Here, we described how to compute the raw classification accuracy, but note that other important performance metrics exist (Chapter @ref(classification-model-evaluation)).

In a situation where you have many predictors, you can select a minimal list of predictor variables that contribute the most to the model, without compromising the prediction accuracy, using stepwise regression (Chapter @ref(stepwise-logistic-regression)) and lasso regression techniques (Chapter @ref(penalized-logistic-regression)).

Additionally, you can add interaction terms in the model, or include spline terms.

The same problems concerning confounding and correlated variables apply to logistic regression (see Chapters @ref(confounding-variables) and @ref(multicollinearity)).

You can also fit *generalized additive models* (Chapter @ref(polynomial-and-spline-regression)) when linearity of the predictors cannot be assumed. This can be done using the `mgcv` package:

```
library("mgcv")
# Fit the model
gam.model <- gam(diabetes ~ s(glucose) + mass + pregnant,
                 data = train.data, family = "binomial")
# Summarize model
summary(gam.model)
# Make predictions
probabilities <- gam.model %>% predict(test.data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Model Accuracy
mean(predicted.classes == test.data$diabetes)
```

Logistic regression is limited to two-class classification problems. There is an extension, called *multinomial logistic regression*, for multiclass classification problems (Chapter @ref(multinomial-logistic-regression)).

Note that the most popular method for multiclass tasks is *Linear Discriminant Analysis* (Chapter @ref(discriminant-analysis)).

Bruce, Peter, and Andrew Bruce. 2017. *Practical Statistics for Data Scientists*. O’Reilly Media.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014. *An Introduction to Statistical Learning: With Applications in R*. Springer Publishing Company, Incorporated.

**Stepwise logistic regression** consists of automatically selecting a reduced number of predictor variables for building the best performing logistic regression model. Read more at Chapter @ref(stepwise-regression).

This chapter describes how to compute the stepwise logistic regression in **R**.


- `tidyverse` for easy data manipulation and visualization
- `caret` for an easy machine learning workflow

```
library(tidyverse)
library(caret)
```

Data set: `PimaIndiansDiabetes2` [in the `mlbench` package], introduced in Chapter @ref(classification-in-r), for predicting the probability of being diabetes-positive based on multiple clinical variables.

We’ll randomly split the data into a training set (80%, for building a predictive model) and a test set (20%, for evaluating the model). Make sure to set a seed for reproducibility.

```
# Load the data and remove NAs
data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
# Inspect the data
sample_n(PimaIndiansDiabetes2, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- PimaIndiansDiabetes2[training.samples, ]
test.data <- PimaIndiansDiabetes2[-training.samples, ]
```

Stepwise logistic regression can be easily computed using the R function `stepAIC()` available in the `MASS` package. It performs model selection by AIC. It has an option called `direction`, which can take the values “both”, “forward” or “backward” (see Chapter @ref(stepwise-regression)).

```
library(MASS)
# Fit the model
model <- glm(diabetes ~., data = train.data, family = binomial) %>%
  stepAIC(trace = FALSE)
# Summarize the final selected model
summary(model)
# Make predictions
probabilities <- model %>% predict(test.data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Model accuracy
mean(predicted.classes==test.data$diabetes)
```

Full model incorporating all predictors:

```
full.model <- glm(diabetes ~., data = train.data, family = binomial)
coef(full.model)
```

```
## (Intercept) pregnant glucose pressure triceps insulin
## -9.50372 0.04571 0.04230 -0.00700 0.01858 -0.00159
## mass pedigree age
## 0.04502 0.96845 0.04256
```

Select the most contributive variables:

```
library(MASS)
step.model <- full.model %>% stepAIC(trace = FALSE)
coef(step.model)
```

```
## (Intercept) glucose mass pedigree age
## -9.5612 0.0379 0.0523 0.9697 0.0529
```

The function chose a final model in which four variables have been removed from the original full model. The dropped predictors are: `pregnant`, `pressure`, `triceps` and `insulin`.

Here, we’ll compare the performance of the full and the stepwise logistic models. The best model is defined as the model that has the lowest classification error rate in predicting the class of new test data:

Prediction accuracy of the full logistic regression model:

```
# Make predictions
probabilities <- full.model %>% predict(test.data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Prediction accuracy
observed.classes <- test.data$diabetes
mean(predicted.classes == observed.classes)
```

`## [1] 0.808`

Prediction accuracy of the stepwise logistic regression model:

```
# Make predictions
probabilities <- predict(step.model, test.data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Prediction accuracy
observed.classes <- test.data$diabetes
mean(predicted.classes == observed.classes)
```

`## [1] 0.795`

This chapter describes how to perform stepwise logistic regression in R. In our example, the stepwise procedure selected a reduced number of predictor variables, resulting in a final model whose performance was similar to that of the full model.

So, the stepwise selection reduced the complexity of the model without compromising its accuracy. Note that, all things being equal, we should always choose the simpler model, here the final model returned by the stepwise procedure.

Another alternative to the stepwise method for model selection is the penalized regression approach (Chapter @ref(penalized-logistic-regression)), which penalizes the model for having too many variables.

When you have multiple variables in your logistic regression model, it might be useful to find a reduced set of variables resulting in an optimally performing model (see Chapter @ref(penalized-regression)).

**Penalized logistic regression** imposes a penalty to the logistic model for having too many variables. This results in shrinking the coefficients of the less contributive variables toward zero. This is also known as **regularization**.

The most commonly used penalized regression methods include:

- **ridge regression**: variables with minor contribution have their coefficients shrunk close to zero, but all the variables are kept in the model. This is useful when all variables need to be incorporated in the model according to domain knowledge.
- **lasso regression**: the coefficients of some less contributive variables are forced to be exactly zero. Only the most significant variables are kept in the final model.
- **elastic net regression**: a combination of ridge and lasso regression. It shrinks some coefficients toward zero (like ridge regression) and sets some coefficients exactly to zero (like lasso regression).

This chapter describes how to compute penalized logistic regression, such as lasso regression, for automatically selecting an optimal model containing the most contributive predictor variables.


- `tidyverse` for easy data manipulation and visualization
- `caret` for an easy machine learning workflow
- `glmnet` for computing penalized regression

```
library(tidyverse)
library(caret)
library(glmnet)
```

Data set: `PimaIndiansDiabetes2` [in the `mlbench` package], introduced in Chapter @ref(classification-in-r), for predicting the probability of being diabetes-positive based on multiple clinical variables.

We’ll randomly split the data into a training set (80%, for building a predictive model) and a test set (20%, for evaluating the model). Make sure to set a seed for reproducibility.

```
# Load the data and remove NAs
data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
# Inspect the data
sample_n(PimaIndiansDiabetes2, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- PimaIndiansDiabetes2[training.samples, ]
test.data <- PimaIndiansDiabetes2[-training.samples, ]
```

The R function `model.matrix()` helps to create the matrix of predictors and automatically converts categorical predictors to the appropriate dummy variables, which is required by the `glmnet()` function.

```
# Dummy-code the categorical predictor variables
x <- model.matrix(diabetes~., train.data)[,-1]
# Convert the outcome (class) to a numerical variable
y <- ifelse(train.data$diabetes == "pos", 1, 0)
```

We’ll use the R function `glmnet()` [glmnet package] for computing penalized logistic regression.

The simplified format is as follows:

`glmnet(x, y, family = "binomial", alpha = 1, lambda = NULL)`

- `x`: matrix of predictor variables
- `y`: the response or outcome variable, which is a binary variable
- `family`: the response type. Use “binomial” for a binary outcome variable
- `alpha`: the elastic net mixing parameter. Allowed values include: “1” for lasso regression, “0” for ridge regression, and a value between 0 and 1 (say 0.3) for elastic net regression
- `lambda`: a numeric value defining the amount of shrinkage, to be specified by the analyst

In penalized regression, you need to specify a constant `lambda` to adjust the amount of coefficient shrinkage. The best `lambda` for your data can be defined as the `lambda` that minimizes the cross-validation prediction error rate. This can be determined automatically using the function `cv.glmnet()`.

In the following R code, we’ll show how to compute lasso regression by specifying the option `alpha = 1`. You can also try ridge regression, using `alpha = 0`, to see which works better for your data.

Fit the lasso penalized regression model:

```
library(glmnet)
# Find the best lambda using cross-validation
set.seed(123)
cv.lasso <- cv.glmnet(x, y, alpha = 1, family = "binomial")
# Fit the final model on the training data
model <- glmnet(x, y, alpha = 1, family = "binomial",
                lambda = cv.lasso$lambda.min)
# Display regression coefficients
coef(model)
# Make predictions on the test data
x.test <- model.matrix(diabetes ~., test.data)[,-1]
probabilities <- model %>% predict(newx = x.test, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Model accuracy
observed.classes <- test.data$diabetes
mean(predicted.classes == observed.classes)
```

**Find the optimal value of lambda that minimizes the cross-validation error**:

```
library(glmnet)
set.seed(123)
cv.lasso <- cv.glmnet(x, y, alpha = 1, family = "binomial")
plot(cv.lasso)
```

The plot displays the cross-validation error according to the log of lambda. The left dashed vertical line indicates that the log of the optimal value of lambda is approximately -5, which is the value that minimizes the prediction error; this lambda value will give the most accurate model. The exact value of `lambda` can be viewed as follows:

`cv.lasso$lambda.min`

`## [1] 0.00871`

Generally, the purpose of regularization is to balance accuracy and simplicity, that is, to obtain a model with the smallest number of predictors that still gives good accuracy. To this end, the function `cv.glmnet()` also finds the value of `lambda` that gives the simplest model while lying within one standard error of the optimal value of `lambda`. This value is called `lambda.1se`.

`cv.lasso$lambda.1se`

`## [1] 0.0674`

Using `lambda.min` as the best lambda gives the following regression coefficients:

`coef(cv.lasso, cv.lasso$lambda.min)`

```
## 9 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -8.615615
## pregnant 0.035076
## glucose 0.036916
## pressure .
## triceps 0.016484
## insulin -0.000392
## mass 0.030485
## pedigree 0.785506
## age 0.036265
```

From the output above, only the variable `pressure` has a coefficient exactly equal to zero.

Using `lambda.1se` as the best lambda gives the following regression coefficients:

`coef(cv.lasso, cv.lasso$lambda.1se)`

```
## 9 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -4.65750
## pregnant .
## glucose 0.02628
## pressure .
## triceps 0.00191
## insulin .
## mass .
## pedigree .
## age 0.01734
```

Using `lambda.1se`, only three variables (glucose, triceps and age) have non-zero coefficients. The coefficients of all the other variables have been set to zero by the lasso algorithm, reducing the complexity of the model.

Setting `lambda = lambda.1se` produces a simpler model compared to `lambda.min`, but the model might be a little less accurate than the one obtained with `lambda.min`.

In the next sections, we’ll compute the final model using `lambda.min` and then assess its accuracy against the test data. We’ll also discuss the results obtained by fitting the model using `lambda = lambda.1se`.

**Compute the final lasso model**:

- Compute the final model using `lambda.min`:

```
# Final model with lambda.min
lasso.model <- glmnet(x, y, alpha = 1, family = "binomial",
                      lambda = cv.lasso$lambda.min)
# Make prediction on test data
x.test <- model.matrix(diabetes ~., test.data)[,-1]
probabilities <- lasso.model %>% predict(newx = x.test, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Model accuracy
observed.classes <- test.data$diabetes
mean(predicted.classes == observed.classes)
```

`## [1] 0.769`

- Compute the final model using `lambda.1se`:

```
# Final model with lambda.1se
lasso.model <- glmnet(x, y, alpha = 1, family = "binomial",
                      lambda = cv.lasso$lambda.1se)
# Make prediction on test data
x.test <- model.matrix(diabetes ~., test.data)[,-1]
probabilities <- lasso.model %>% predict(newx = x.test, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Model accuracy rate
observed.classes <- test.data$diabetes
mean(predicted.classes == observed.classes)
```

`## [1] 0.705`

Next, we’ll compare the accuracy obtained with lasso regression against that obtained using the full logistic regression model (including all predictors).

```
# Fit the model
full.model <- glm(diabetes ~., data = train.data, family = binomial)
# Make predictions
probabilities <- full.model %>% predict(test.data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Model accuracy
observed.classes <- test.data$diabetes
mean(predicted.classes == observed.classes)
```

`## [1] 0.808`

This chapter described how to compute penalized logistic regression models in R. Here, we focused on the lasso model, but you can also fit ridge regression by using `alpha = 0` in the `glmnet()` function. For elastic net regression, you need to choose a value of alpha somewhere between 0 and 1. This can be done automatically using the `caret` package; see Chapter @ref(penalized-regression).

Our analysis demonstrated that lasso regression, using `lambda.min` as the best lambda, results in a simpler model without compromising much of the model’s performance on the test data, compared to the full logistic model.

The model accuracy obtained with `lambda.1se` is a bit lower than what we got with the more complex model using all predictor variables (n = 8) or using `lambda.min` in the lasso regression. Even with `lambda.1se`, the accuracy remains good enough, in addition to the resulting model’s simplicity.

This means that the simpler model obtained with lasso regression does at least as good a job fitting the information in the data as the more complicated one. According to the bias-variance trade-off, all things being equal, a simpler model should always be preferred because it is less likely to overfit the training data.

For variable selection, an alternative to the penalized logistic regression techniques is the stepwise logistic regression described in the Chapter @ref(stepwise-logistic-regression).

The **logistic regression** model makes several **assumptions** about the data.

This chapter describes the major assumptions and provides a practical guide, in R, for checking whether they hold for your data, which is essential for building a good model.

Make sure you have read the logistic regression essentials in Chapter @ref(logistic-regression).


The logistic regression method assumes that:

- The outcome is a binary or dichotomous variable, like yes vs no, positive vs negative, 1 vs 0.
- There is a linear relationship between the logit of the outcome and each predictor variable. Recall that the logit function is `logit(p) = log(p/(1-p))`, where p is the probability of the outcome (see Chapter @ref(logistic-regression)).
- There are no influential values (extreme values or outliers) in the continuous predictors.
- There are no high intercorrelations (i.e. multicollinearity) among the predictors.
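The multicollinearity assumption is often checked with the variance inflation factor (VIF): for each predictor, regress it on all the others and compute 1/(1 - R^2); values above 5 or 10 are commonly taken as problematic. A minimal base-R sketch on synthetic data (the variable names `x1`, `x2`, `x3` and the helper `vif.one` are illustrative):

```r
set.seed(123)
# Synthetic numeric predictors; x2 is deliberately almost collinear with x1
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)
x3 <- rnorm(100)
dat <- data.frame(x1, x2, x3)

# VIF of one predictor: 1 / (1 - R^2) from regressing it on the others
vif.one <- function(v, data) {
  r2 <- summary(lm(reformulate(setdiff(names(data), v), v), data = data))$r.squared
  1 / (1 - r2)
}

# x1 and x2 get large VIFs; x3, uncorrelated with the rest, stays near 1
sapply(names(dat), vif.one, data = dat)
```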

To improve the accuracy of your model, you should make sure that these assumptions hold for your data. In the following sections, we’ll describe how to diagnose potential problems in the data.

- `tidyverse` for easy data manipulation and visualization
- `broom`: creates a tidy data frame from statistical test results

```
library(tidyverse)
library(broom)
theme_set(theme_classic())
```

We start by computing an example logistic regression model using the `PimaIndiansDiabetes2` data set [mlbench package], introduced in Chapter @ref(classification-in-r), for predicting the probability of diabetes test positivity based on clinical variables.

```
# Load the data
data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
# Fit the logistic regression model
model <- glm(diabetes ~., data = PimaIndiansDiabetes2,
             family = binomial)
# Predict the probability (p) of diabetes positivity
probabilities <- predict(model, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
head(predicted.classes)
```

```
## 4 5 7 9 14 15
## "neg" "pos" "neg" "pos" "pos" "pos"
```

Here, we’ll check the linear relationship between continuous predictor variables and the logit of the outcome. This can be done by visually inspecting the scatter plot between each predictor and the logit values.

- Remove qualitative variables from the original data frame and bind the logit values to the data:

```
# Select only numeric predictors
mydata <- PimaIndiansDiabetes2 %>%
dplyr::select_if(is.numeric)
predictors <- colnames(mydata)
# Bind the logit and tidy the data for plotting
mydata <- mydata %>%
mutate(logit = log(probabilities/(1-probabilities))) %>%
gather(key = "predictors", value = "predictor.value", -logit)
```

- Create the scatter plots:

```
ggplot(mydata, aes(logit, predictor.value))+
geom_point(size = 0.5, alpha = 0.5) +
geom_smooth(method = "loess") +
theme_bw() +
facet_wrap(~predictors, scales = "free_y")
```

The smoothed scatter plots show that the variables glucose, mass, pregnant, pressure and triceps are all fairly linearly associated with the diabetes outcome on the logit scale.

The variables age and pedigree are not linear and might need transformation. If a scatter plot shows non-linearity, you need other methods to build the model, such as including power terms (squared or cubed), fractional polynomials or spline functions (Chapter @ref(polynomial-and-spline-regression)).
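As a minimal sketch (illustrative, not prescriptive), a quadratic term for age, one of the variables flagged as non-linear above, can be added with `poly()`, reusing the data loaded earlier:

```r
# Sketch: refit the logistic model with a quadratic term for age;
# the choice of age here follows the diagnostic plot discussed above
model2 <- glm(diabetes ~ . - age + poly(age, 2),
              data = PimaIndiansDiabetes2, family = binomial)
summary(model2)
```

Whether the extra term helps can then be judged by comparing the two fits, for example with `anova(model, model2, test = "Chisq")`.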

Influential values are extreme individual data points that can alter the quality of the logistic regression model.

The most extreme values in the data can be examined by visualizing the Cook’s distance values. Here we label the top 3 largest values:

`plot(model, which = 4, id.n = 3)`

Note that not all outliers are influential observations. To check whether the data contain potential influential observations, the standardized residual error can be inspected. Data points with an absolute standardized residual above 3 represent possible outliers and may deserve closer attention.

The following R code computes the standardized residuals (`.std.resid`

) and the Cook’s distance (`.cooksd`

) using the R function `augment()`

[broom package].

```
# Extract model results
model.data <- augment(model) %>%
mutate(index = 1:n())
```

The data for the top 3 largest values, according to the Cook’s distance, can be displayed as follows:

`model.data %>% top_n(3, .cooksd)`

Plot the standardized residuals:

```
ggplot(model.data, aes(index, .std.resid)) +
geom_point(aes(color = diabetes), alpha = .5) +
theme_bw()
```

Filter potential influential data points with `abs(.std.res) > 3`

:

```
model.data %>%
filter(abs(.std.resid) > 3)
```

There are no influential observations in our data.

When you have outliers in a continuous predictor, potential solutions include:

- Removing the concerned records
- Transforming the data to a log scale
- Using non-parametric methods
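For example, a log transformation of a skewed predictor can be sketched as follows (the choice of `insulin` here is purely illustrative):

```r
# Sketch: log-transform a right-skewed predictor, then refit the model
library(dplyr)
pima.log <- PimaIndiansDiabetes2 %>%
  mutate(insulin = log(insulin))
model.log <- glm(diabetes ~ ., data = pima.log, family = binomial)
```

The residual diagnostics above can then be re-run on `model.log` to check whether the transformation removed the extreme points.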

Multicollinearity corresponds to a situation where the data contain highly correlated predictor variables. Read more in Chapter @ref(multicollinearity).

Multicollinearity is an important issue in regression analysis and should be fixed by removing the concerned variables. It can be assessed using the R function `vif()`

[car package], which computes the variance inflation factors:

`car::vif(model)`

```
## pregnant glucose pressure triceps insulin mass pedigree age
## 1.89 1.38 1.19 1.64 1.38 1.83 1.03 1.97
```

As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity. In our example, there is no collinearity: all variables have a value of VIF well below 5.
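Had any variable exceeded the threshold, the usual fix is to drop it and refit; the sketch below drops `triceps` purely for illustration, since no VIF is actually problematic in this data:

```r
# Illustration only: drop a (hypothetically) collinear predictor and
# re-check the variance inflation factors on the reduced model
model.reduced <- glm(diabetes ~ . - triceps,
                     data = PimaIndiansDiabetes2, family = binomial)
car::vif(model.reduced)
```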

This chapter describes the main assumptions of the logistic regression model and provides examples of R code for diagnosing potential problems in the data, including non-linearity between the predictor variables and the logit of the outcome, the presence of influential observations, and multicollinearity among predictors.

Fixing these potential problems might considerably improve the goodness of fit of the model. Additional performance metrics for checking the validity of your model are described in Chapter @ref(classification-model-evaluation).

**Multinomial logistic regression** is an extension of logistic regression (Chapter @ref(logistic-regression)) for multiclass classification tasks. It is used when the outcome involves more than two classes.

In this chapter, we’ll show you how to compute multinomial logistic regression in R.

Contents:

- `tidyverse` for easy data manipulation
- `caret` for easy predictive modeling
- `nnet` for computing multinomial logistic regression

```
library(tidyverse)
library(caret)
library(nnet)
```

We’ll use the `iris` data set, introduced in Chapter @ref(classification-in-r), for predicting iris species based on the predictor variables Sepal.Length, Sepal.Width, Petal.Length and Petal.Width.

We start by randomly splitting the data into training set (80% for building a predictive model) and test set (20% for evaluating the model). Make sure to set seed for reproducibility.

```
# Load the data
data("iris")
# Inspect the data
sample_n(iris, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- iris$Species %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- iris[training.samples, ]
test.data <- iris[-training.samples, ]
```

```
# Fit the model
model <- nnet::multinom(Species ~., data = train.data)
# Summarize the model
summary(model)
# Make predictions
predicted.classes <- model %>% predict(test.data)
head(predicted.classes)
# Model accuracy
mean(predicted.classes == test.data$Species)
```

Model accuracy:

`mean(predicted.classes == test.data$Species)`

`## [1] 0.967`

Our model performs very well, predicting the different categories with an overall accuracy of 97%.

This chapter describes how to compute multinomial logistic regression in R. This method is used for multiclass problems. In practice, it is not used very often. Discriminant analysis (Chapter @ref(discriminant-analysis)) is more popular for multiple-class classification.

**Discriminant analysis** is used to predict the probability of belonging to a given class (or category) based on one or multiple predictor variables. It works with continuous and/or categorical predictor variables.

Previously, we have described the logistic regression for two-class classification problems, that is when the outcome variable has two possible values (0/1, no/yes, negative/positive).

Compared to logistic regression, the discriminant analysis is more suitable for predicting the category of an observation in the situation where the outcome variable contains more than two classes. Additionally, it’s more stable than the logistic regression for multi-class classification problems.

Note that, both logistic regression and discriminant analysis can be used for binary classification tasks.

In this chapter, you’ll learn the most widely used discriminant analysis techniques and extensions. Additionally, we’ll provide R code to perform the different types of analysis.

The following discriminant analysis methods will be described:

- **Linear discriminant analysis** (**LDA**): Uses linear combinations of predictors to predict the class of a given observation. Assumes that the predictor variables (p) are normally distributed and the classes have identical variances (for univariate analysis, p = 1) or identical covariance matrices (for multivariate analysis, p > 1).
- **Quadratic discriminant analysis** (**QDA**): More flexible than LDA. Here, there is no assumption that the covariance matrix of classes is the same.
- **Mixture discriminant analysis** (**MDA**): Each class is assumed to be a Gaussian mixture of subclasses.
- **Flexible discriminant analysis** (**FDA**): Non-linear combinations of predictors, such as splines, are used.
- **Regularized discriminant analysis** (**RDA**): Regularization (or shrinkage) improves the estimate of the covariance matrices in situations where the number of predictors is larger than the number of samples in the training data, leading to an improvement of the discriminant analysis.

Contents:

- `tidyverse` for easy data manipulation and visualization
- `caret` for easy machine learning workflow

```
library(tidyverse)
library(caret)
theme_set(theme_classic())
```

We’ll use the `iris` data set, introduced in Chapter @ref(classification-in-r), for predicting iris species based on the predictor variables Sepal.Length, Sepal.Width, Petal.Length and Petal.Width.

Discriminant analysis can be affected by the scale/unit in which predictor variables are measured. It’s generally recommended to standardize/normalize continuous predictors before the analysis.

- Split the data into training and test set:

```
# Load the data
data("iris")
# Split the data into training (80%) and test set (20%)
set.seed(123)
training.samples <- iris$Species %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- iris[training.samples, ]
test.data <- iris[-training.samples, ]
```

- Normalize the data. Categorical variables are automatically ignored.

```
# Estimate preprocessing parameters
preproc.param <- train.data %>%
preProcess(method = c("center", "scale"))
# Transform the data using the estimated parameters
train.transformed <- preproc.param %>% predict(train.data)
test.transformed <- preproc.param %>% predict(test.data)
```

The LDA algorithm starts by finding directions that maximize the separation between classes, then uses these directions to predict the class of each individual. These directions, called linear discriminants, are linear combinations of the predictor variables.

LDA assumes that predictors are normally distributed (Gaussian distribution) and that the different classes have class-specific means and equal variance/covariance.

Before performing LDA, consider:

- Inspecting the univariate distributions of each variable and making sure that they are normally distributed. If not, you can transform them using log and root transformations for exponential distributions and Box-Cox for skewed distributions.
- Removing outliers from your data and standardizing the variables to make their scales comparable.

The linear discriminant analysis can be easily computed using the function `lda()`

[MASS package].

**Quick start R code**:

```
library(MASS)
# Fit the model
model <- lda(Species~., data = train.transformed)
# Make predictions
predictions <- model %>% predict(test.transformed)
# Model accuracy
mean(predictions$class==test.transformed$Species)
```

**Compute LDA**:

```
library(MASS)
model <- lda(Species~., data = train.transformed)
model
```

```
## Call:
## lda(Species ~ ., data = train.transformed)
##
## Prior probabilities of groups:
## setosa versicolor virginica
## 0.333 0.333 0.333
##
## Group means:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa -1.012 0.787 -1.293 -1.250
## versicolor 0.117 -0.648 0.272 0.154
## virginica 0.895 -0.139 1.020 1.095
##
## Coefficients of linear discriminants:
## LD1 LD2
## Sepal.Length 0.911 0.0318
## Sepal.Width 0.648 0.8985
## Petal.Length -4.082 -2.2272
## Petal.Width -2.313 2.6544
##
## Proportion of trace:
## LD1 LD2
## 0.9905 0.0095
```

LDA determines group means and computes, for each individual, the probability of belonging to the different groups. The individual is then assigned to the group with the highest probability score.

The `lda()` output contains the following elements:

- *Prior probabilities of groups*: the proportion of training observations in each group. For example, 33% of the training observations are in the setosa group.
- *Group means*: group center of gravity; shows the mean of each variable in each group.
- *Coefficients of linear discriminants*: shows the linear combination of predictor variables used to form the LDA decision rule. For example, `LD1 = 0.91*Sepal.Length + 0.65*Sepal.Width - 4.08*Petal.Length - 2.31*Petal.Width`. Similarly, `LD2 = 0.03*Sepal.Length + 0.90*Sepal.Width - 2.23*Petal.Length + 2.65*Petal.Width`.

Using the function `plot()` produces plots of the linear discriminants, obtained by computing LD1 and LD2 for each of the training observations.

`plot(model)`

**Make predictions**:

```
predictions <- model %>% predict(test.transformed)
names(predictions)
```

`## [1] "class" "posterior" "x"`

The `predict()` function returns the following elements:

- *class*: the predicted classes of observations.
- *posterior*: a matrix whose columns are the groups, rows are the individuals and values are the posterior probabilities that the corresponding observation belongs to each group.
- *x*: the linear discriminants, described above.

Inspect the results:

```
# Predicted classes
head(predictions$class, 6)
# Predicted probabilities of class membership
head(predictions$posterior, 6)
# Linear discriminants
head(predictions$x, 3)
```

Note that you can create the LDA plot using ggplot2 as follows:

```
lda.data <- cbind(train.transformed, predict(model)$x)
ggplot(lda.data, aes(LD1, LD2)) +
geom_point(aes(color = Species))
```

**Model accuracy**:

You can compute the model accuracy as follows:

`mean(predictions$class==test.transformed$Species)`

`## [1] 1`

It can be seen that our model correctly classified 100% of the test observations, which is excellent.

Note that, by default, the probability cutoff used to decide group-membership is 0.5 (random guessing). For example, the number of observations in the setosa group can be re-calculated using:

`sum(predictions$posterior[ ,1] >=.5)`

`## [1] 10`

In some situations, you might want to increase the precision of the model. In this case you can fine-tune the model by adjusting the posterior probability cutoff. For example, you can increase or lower the cutoff.
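As a sketch (the 0.7 threshold is arbitrary, chosen only for illustration), a stricter cutoff can be applied directly to the posterior probabilities returned by `predict()`:

```r
# Count the observations classified as setosa under a stricter 0.7
# cutoff instead of the default 0.5 (threshold is illustrative)
sum(predictions$posterior[, "setosa"] >= 0.7)
```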

**Variable selection**:

Note that, if the predictor variables are standardized before computing LDA, the discriminator weights can be used as measures of variable importance for feature selection.
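A minimal sketch of this idea, ranking predictors by the absolute value of their first-discriminant weights (assuming the standardized `lda()` fit above, whose coefficients are stored in `model$scaling`):

```r
# Rank predictors by the magnitude of their LD1 weight as a rough
# importance measure (valid only because predictors were standardized)
importance <- sort(abs(model$scaling[, "LD1"]), decreasing = TRUE)
importance
```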

QDA is a little bit more flexible than LDA, in the sense that it does not assume equality of variance/covariance. In other words, for QDA the covariance matrix can be different for each class.

LDA tends to perform better than QDA when you have a small training set.

In contrast, QDA is recommended if the training set is very large, so that the variance of the classifier is not a major issue, or if the assumption of a common covariance matrix for the K classes is clearly untenable (James et al. 2014).

QDA can be computed using the R function `qda()`

[MASS package]

```
library(MASS)
# Fit the model
model <- qda(Species~., data = train.transformed)
model
# Make predictions
predictions <- model %>% predict(test.transformed)
# Model accuracy
mean(predictions$class == test.transformed$Species)
```

The LDA classifier assumes that each class comes from a single normal (or Gaussian) distribution. This is too restrictive.

For MDA, each class is assumed to be a Gaussian mixture of subclasses, where each data point has a probability of belonging to each class. Equality of the covariance matrix among classes is still assumed.

```
library(mda)
# Fit the model
model <- mda(Species~., data = train.transformed)
model
# Make predictions
predicted.classes <- model %>% predict(test.transformed)
# Model accuracy
mean(predicted.classes == test.transformed$Species)
```

MDA might outperform LDA and QDA in some situations, as illustrated below. In this example data, there are 3 main groups of individuals, each having 3 non-adjacent subgroups. The solid black lines on the plot represent the decision boundaries of LDA, QDA and MDA. It can be seen that the MDA classifier correctly identified the subclasses, while LDA and QDA performed poorly in modeling this data.

The code for generating the above plots is from John Ramey

FDA is a flexible extension of LDA that uses non-linear combinations of predictors such as splines. FDA is useful to model multivariate non-normality or non-linear relationships among variables within each group, allowing for a more accurate classification.

```
library(mda)
# Fit the model
model <- fda(Species~., data = train.transformed)
# Make predictions
predicted.classes <- model %>% predict(test.transformed)
# Model accuracy
mean(predicted.classes == test.transformed$Species)
```

RDA builds a classification rule by regularizing the group covariance matrices (Friedman 1989) allowing a more robust model against multicollinearity in the data. This might be very useful for a large multivariate data set containing highly correlated predictors.

Regularized discriminant analysis is a kind of a trade-off between LDA and QDA. Recall that, in LDA we assume equality of covariance matrix for all of the classes. QDA assumes different covariance matrices for all the classes. Regularized discriminant analysis is an intermediate between LDA and QDA.

RDA shrinks the separate covariances of QDA toward a common covariance as in LDA. This improves the estimate of the covariance matrices in situations where the number of predictors is larger than the number of samples in the training data, potentially leading to an improvement of the model accuracy.

```
library(klaR)
# Fit the model
model <- rda(Species~., data = train.transformed)
# Make predictions
predictions <- model %>% predict(test.transformed)
# Model accuracy
mean(predictions$class == test.transformed$Species)
```

We have described linear discriminant analysis (LDA) and extensions for predicting the class of an observations based on multiple predictor variables. Discriminant analysis is more suitable to multiclass classification problems compared to the logistic regression (Chapter @ref(logistic-regression)).

LDA assumes that the different classes have the same variance or covariance matrix. We have described many extensions of LDA in this chapter. The most popular extension of LDA is quadratic discriminant analysis (QDA), which is more flexible than LDA in the sense that it does not assume the equality of group covariance matrices.

LDA tends to be better than QDA for small data sets, while QDA is recommended for large training data sets.

Friedman, Jerome H. 1989. “Regularized Discriminant Analysis.” *Journal of the American Statistical Association* 84 (405). Taylor & Francis: 165–75. doi:10.1080/01621459.1989.10478752.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014. *An Introduction to Statistical Learning: With Applications in R*. Springer Publishing Company, Incorporated.

The **Naive Bayes classifier** is a simple and powerful method that can be used for binary and multiclass classification problems.

The Naive Bayes classifier predicts class membership probabilities using Bayes’ theorem, which is based on conditional probability: the probability that an event will occur given that another event has already occurred.

Observations are assigned to the class with the largest probability score.
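A toy illustration of Bayes’ theorem with made-up numbers (the prevalence, sensitivity and false positive rate below are all assumptions for this sketch, not values from the data):

```r
# P(pos | evidence) = P(evidence | pos) * P(pos) / P(evidence)
prior.pos <- 0.30   # assumed prevalence of the positive class
p.e.pos   <- 0.90   # assumed P(evidence | pos)
p.e.neg   <- 0.10   # assumed P(evidence | neg)
p.e       <- p.e.pos * prior.pos + p.e.neg * (1 - prior.pos)
posterior <- p.e.pos * prior.pos / p.e
round(posterior, 3)
## 0.794
```

The "naive" part of Naive Bayes is that, with multiple predictors, the likelihood `P(evidence | class)` is computed as a product of per-predictor terms, i.e. the predictors are assumed conditionally independent given the class.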

In this chapter, you’ll learn how to perform naive Bayes classification in R using the `klaR` and `caret` packages.

Contents:

- `tidyverse` for easy data manipulation and visualization
- `caret` for easy machine learning workflow

```
library(tidyverse)
library(caret)
```

The input predictor variables can be categorical and/or numeric variables.

Here, we’ll use the `PimaIndiansDiabetes2` data set [in the `mlbench` package], introduced in Chapter @ref(classification-in-r), for predicting the probability of being diabetes-positive based on multiple clinical variables.

We’ll randomly split the data into training set (80% for building a predictive model) and test set (20% for evaluating the model). Make sure to set seed for reproducibility.

```
# Load the data and remove NAs
data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
# Inspect the data
sample_n(PimaIndiansDiabetes2, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- PimaIndiansDiabetes2[training.samples, ]
test.data <- PimaIndiansDiabetes2[-training.samples, ]
```

```
library("klaR")
# Fit the model
model <- NaiveBayes(diabetes ~., data = train.data)
# Make predictions
predictions <- model %>% predict(test.data)
# Model accuracy
mean(predictions$class == test.data$diabetes)
```

`## [1] 0.821`

The `caret` R package can automatically train the model and assess the model accuracy using k-fold cross-validation (Chapter @ref(cross-validation)).

```
library(klaR)
# Build the model
set.seed(123)
model <- train(diabetes ~., data = train.data, method = "nb",
trControl = trainControl("cv", number = 10))
# Make predictions
predicted.classes <- model %>% predict(test.data)
# Model accuracy
mean(predicted.classes == test.data$diabetes)
```

This chapter introduces the basics of Naive Bayes classification and provides practical examples in R using the `klaR` and `caret` packages.

**Support Vector Machine** (or **SVM**) is a machine learning technique used for *classification* tasks. Briefly, SVM works by identifying the optimal decision boundary that separates data points from different groups (or classes), and then predicts the class of new observations based on this separation boundary.

Depending on the situation, the different groups might be separable by a straight line or by a non-linear boundary.

Support vector machine methods can handle both linear and non-linear class boundaries, and can be used for both two-class and multi-class classification problems.

In real-life data, the separation boundary is generally non-linear. Technically, the SVM algorithm performs non-linear classification using what is called the kernel trick. The most commonly used kernel transformations are the *polynomial kernel* and the *radial kernel*.

Note that, there is also an extension of the SVM for regression, called support vector regression.

In this chapter, we’ll describe how to build SVM classifier using the *caret* R package.

Contents:

- `tidyverse` for easy data manipulation and visualization
- `caret` for easy machine learning workflow

```
library(tidyverse)
library(caret)
```

Data set: `PimaIndiansDiabetes2` [in the `mlbench` package], introduced in Chapter @ref(classification-in-r), for predicting the probability of being diabetes-positive based on multiple clinical variables.

We’ll randomly split the data into training set (80% for building a predictive model) and test set (20% for evaluating the model). Make sure to set seed for reproducibility.

```
# Load the data
data("PimaIndiansDiabetes2", package = "mlbench")
pima.data <- na.omit(PimaIndiansDiabetes2)
# Inspect the data
sample_n(pima.data, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- pima.data$diabetes %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- pima.data[training.samples, ]
test.data <- pima.data[-training.samples, ]
```

In the following example, variables are normalized to make their scales comparable. This is done automatically before building the SVM classifier by setting the option `preProcess = c("center","scale")`.

```
# Fit the model on the training set
set.seed(123)
model <- train(
diabetes ~., data = train.data, method = "svmLinear",
trControl = trainControl("cv", number = 10),
preProcess = c("center","scale")
)
# Make predictions on the test data
predicted.classes <- model %>% predict(test.data)
head(predicted.classes)
```

```
## [1] neg pos neg pos pos neg
## Levels: neg pos
```

```
# Compute model accuracy rate
mean(predicted.classes == test.data$diabetes)
```

`## [1] 0.782`

Note that there is a tuning parameter `C`, also known as *Cost*, that controls the tolerance for misclassifications. It essentially imposes a penalty on the model for making an error: the higher the value of C, the less likely it is that the SVM algorithm will misclassify a point.

By default `caret`

builds the SVM linear classifier using `C = 1`

. You can check this by typing `model`

in R console.

It’s possible to automatically compute SVM for different values of `C` and to choose the optimal one that maximizes the model cross-validation accuracy.

The following R code computes SVM for a grid of values of `C` and automatically chooses the final model for predictions:

```
# Fit the model on the training set
set.seed(123)
model <- train(
diabetes ~., data = train.data, method = "svmLinear",
trControl = trainControl("cv", number = 10),
tuneGrid = expand.grid(C = seq(0, 2, length = 20)),
preProcess = c("center","scale")
)
# Plot model accuracy vs different values of Cost
plot(model)
```

```
# Print the best tuning parameter C that
# maximizes model accuracy
model$bestTune
```

```
## C
## 12 1.16
```

```
# Make predictions on the test data
predicted.classes <- model %>% predict(test.data)
# Compute model accuracy rate
mean(predicted.classes == test.data$diabetes)
```

`## [1] 0.782`

To build a non-linear SVM classifier, we can use either a polynomial kernel or a radial kernel function. Again, the `caret` package can be used to easily compute the polynomial and radial non-linear SVM models.

The package automatically chooses the optimal values for the model tuning parameters, where optimal is defined as the values that maximize the model accuracy.

**Computing SVM using radial basis kernel**:

```
# Fit the model on the training set
set.seed(123)
model <- train(
diabetes ~., data = train.data, method = "svmRadial",
trControl = trainControl("cv", number = 10),
preProcess = c("center","scale"),
tuneLength = 10
)
# Print the best tuning parameter sigma and C that
# maximizes model accuracy
model$bestTune
```

```
## sigma C
## 1 0.136 0.25
```

```
# Make predictions on the test data
predicted.classes <- model %>% predict(test.data)
# Compute model accuracy rate
mean(predicted.classes == test.data$diabetes)
```

`## [1] 0.795`

**Computing SVM using polynomial basis kernel**:

```
# Fit the model on the training set
set.seed(123)
model <- train(
diabetes ~., data = train.data, method = "svmPoly",
trControl = trainControl("cv", number = 10),
preProcess = c("center","scale"),
tuneLength = 4
)
# Print the best tuning parameter sigma and C that
# maximizes model accuracy
model$bestTune
```

```
## degree scale C
## 8 1 0.01 2
```

```
# Make predictions on the test data
predicted.classes <- model %>% predict(test.data)
# Compute model accuracy rate
mean(predicted.classes == test.data$diabetes)
```

`## [1] 0.795`

In our examples, it can be seen that the SVM classifier using non-linear kernel gives a better result compared to the linear model.

This chapter describes how to use support vector machine for classification tasks. Other alternatives exist, such as logistic regression (Chapter @ref(logistic-regression)).

You need to assess the performance of different methods on your data in order to choose the best one.

After building a predictive classification model, you need to **evaluate the performance of the model**, that is, how well the model predicts the outcome of new test observations that were not used to train the model.

In other words, you need to estimate the model prediction *accuracy* and prediction errors using a new test data set. Because we know the actual outcome of observations in the test data set, the performance of the predictive model can be assessed by comparing the predicted outcome values against the known outcome values.

This chapter describes the commonly used metrics and methods for assessing the performance of predictive classification models, including:

- **Average classification accuracy**, representing the proportion of correctly classified observations.
- **Confusion matrix**, a 2x2 table showing four parameters: the number of true positives, true negatives, false negatives and false positives.
- **Precision, Recall and Specificity**, three major performance metrics describing a predictive classification model.
- **ROC curve**, a graphical summary of the overall performance of the model, showing the proportion of true positives and false positives at all possible probability cutoffs. The **Area Under the Curve** (**AUC**) summarizes the overall performance of the classifier.

We’ll provide practical examples in R to compute these metrics, as well as how to create the ROC plot.

Contents:

- `tidyverse` for easy data manipulation and visualization
- `caret` for easy machine learning workflow

```
library(tidyverse)
library(caret)
```

To keep things simple, we’ll perform a binary classification, where the outcome variable can have only two possible values: negative vs positive.

We’ll compute an example linear discriminant analysis model using the `PimaIndiansDiabetes2` data set [mlbench package], introduced in Chapter @ref(classification-in-r), for predicting the probability of diabetes test positivity based on clinical variables.

- Split the data into training (80%, used to build the model) and test set (20%, used to evaluate the model performance):

```
# Load the data
data("PimaIndiansDiabetes2", package = "mlbench")
pima.data <- na.omit(PimaIndiansDiabetes2)
# Inspect the data
sample_n(pima.data, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- pima.data$diabetes %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- pima.data[training.samples, ]
test.data <- pima.data[-training.samples, ]
```

- Fit the LDA model on the training set and make predictions on the test data:

```
library(MASS)
# Fit LDA
fit <- lda(diabetes ~., data = train.data)
# Make predictions on the test data
predictions <- predict(fit, test.data)
prediction.probabilities <- predictions$posterior[,2]
predicted.classes <- predictions$class
observed.classes <- test.data$diabetes
```

The overall **classification accuracy** rate corresponds to the proportion of observations that have been correctly classified. Determining the raw classification accuracy is the first step in assessing the performance of a model.

Inversely, the **classification error rate** is defined as the proportion of observations that have been misclassified. `Error rate = 1 - accuracy`

The raw classification accuracy and error can be easily computed by comparing the observed classes in the test data against the predicted classes by the model:

```
accuracy <- mean(observed.classes == predicted.classes)
accuracy
```

`## [1] 0.808`

```
error <- mean(observed.classes != predicted.classes)
error
```

`## [1] 0.192`

From the output above, the linear discriminant analysis correctly predicted the individual outcome in 81% of cases, which is far better than random guessing. The misclassification error rate is therefore 100% - 81% = 19%.

In our example, a binary classifier can make two types of errors:

- it can incorrectly assign an individual who is diabetes-positive to the diabetes-negative category
- it can incorrectly assign an individual who is diabetes-negative to the diabetes-positive category.

The proportion of these two types of errors can be determined by creating a **confusion matrix**, which compares the predicted outcome values against the known outcome values.

The R function `table()`

can be used to produce a **confusion matrix** in order to determine how many observations were correctly or incorrectly classified. It compares the observed and the predicted outcome values and shows the number of correct and incorrect predictions categorized by type of outcome.

```
# Confusion matrix, number of cases
table(observed.classes, predicted.classes)
```

```
## predicted.classes
## observed.classes neg pos
## neg 48 4
## pos 11 15
```

```
# Confusion matrix, proportion of cases
table(observed.classes, predicted.classes) %>%
prop.table() %>% round(digits = 3)
```

```
## predicted.classes
## observed.classes neg pos
## neg 0.615 0.051
## pos 0.141 0.192
```

The diagonal elements of the confusion matrix indicate correct predictions, while the off-diagonals represent incorrect predictions. So, the correct classification rate is the sum of the number on the diagonal divided by the sample size in the test data. In our example, that is (48 + 15)/78 = 81%.

Each cell of the table has an important meaning:

- **True positives** (d): these are cases in which we predicted the individuals would be diabetes-positive and they were.
- **True negatives** (a): we predicted diabetes-negative, and the individuals were diabetes-negative.
- **False positives** (b): we predicted diabetes-positive, but the individuals didn't actually have diabetes. (Also known as a *Type I error*.)
- **False negatives** (c): we predicted diabetes-negative, but they did have diabetes. (Also known as a *Type II error*.)

Technically, the raw prediction accuracy of the model is defined as `(TruePositives + TrueNegatives)/SampleSize`.

In addition to the raw classification accuracy, there are many other metrics that are widely used to examine the performance of a classification model, including:

**Precision**, which is the proportion of true positives among all the individuals that have been predicted to be diabetes-positive by the model. This represents the accuracy of a predicted positive outcome: `Precision = TruePositives/(TruePositives + FalsePositives)`.

**Sensitivity** (or **Recall**), which is the **True Positive Rate** (TPR), that is the proportion of identified positives among the diabetes-positive population (class = 1): `Sensitivity = TruePositives/(TruePositives + FalseNegatives)`.

**Specificity**, which measures the **True Negative Rate** (TNR), that is the proportion of identified negatives among the diabetes-negative population (class = 0): `Specificity = TrueNegatives/(TrueNegatives + FalsePositives)`.

**False Positive Rate** (FPR), which represents the proportion of individuals wrongly identified as positive among the healthy (i.e. diabetes-negative) individuals. This can be seen as a false-alarm rate, and it can also be calculated as `1 - Specificity`. When positives are rare, even a small FPR can produce more false positives than true positives, so a predicted positive may well be a negative.
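As a sketch, these metrics can be computed by hand from the confusion-matrix counts shown earlier (TN = 48, FP = 4, FN = 11, TP = 15); the counts are hard-coded here for illustration:

```r
# Confusion-matrix counts from the output above
TP <- 15; TN <- 48; FP <- 4; FN <- 11

precision   <- TP / (TP + FP)  # 0.789: accuracy of predicted positives
sensitivity <- TP / (TP + FN)  # 0.577: true positive rate (recall)
specificity <- TN / (TN + FP)  # 0.923: true negative rate
fpr         <- FP / (FP + TN)  # 0.077: equals 1 - specificity
```

These hand-computed values match the `confusionMatrix()` output shown later in the section.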

**Sensitivity** and **Specificity** are commonly used to measure the performance of a predictive model.

All of the above metrics can be easily computed using the function `confusionMatrix()` [caret package].

In a two-class setting, you might need to specify the optional argument `positive`, which is a character string for the factor level that corresponds to a “positive” result (if that makes sense for your data). If there are only two factor levels, the default is to use the first level as the “positive” result.

```
confusionMatrix(predicted.classes, observed.classes,
                positive = "pos")
```

```
## Confusion Matrix and Statistics
##
## Reference
## Prediction neg pos
## neg 48 11
## pos 4 15
##
## Accuracy : 0.808
## 95% CI : (0.703, 0.888)
## No Information Rate : 0.667
## P-Value [Acc > NIR] : 0.00439
##
## Kappa : 0.536
## Mcnemar's Test P-Value : 0.12134
##
## Sensitivity : 0.577
## Specificity : 0.923
## Pos Pred Value : 0.789
## Neg Pred Value : 0.814
## Prevalence : 0.333
## Detection Rate : 0.192
## Detection Prevalence : 0.244
## Balanced Accuracy : 0.750
##
## 'Positive' Class : pos
##
```

The above results show different statistical metrics among which the most important include:

- the cross-tabulation between prediction and reference known outcome
- the model accuracy, 81%
- the kappa (54%), which is the accuracy corrected for chance.

In our example, the sensitivity is ~58%, that is the proportion of diabetes-positive individuals that were correctly identified by the model as diabetes-positive.

The specificity of the model is ~92%, that is the proportion of diabetes-negative individuals that were correctly identified by the model as diabetes-negative.

The model precision, or positive predictive value, is 79%.

In medical science, sensitivity and specificity are two important metrics that characterize the performance of a classifier or screening test. Their relative importance depends on the context: generally, one of the two is prioritized.

In medical diagnostics, such as in our example, suppose we are mainly concerned with avoiding wrong positive diagnoses; in that case, a high specificity is what matters. Here, the model specificity is 92%, which is very good.

In other situations, we may be more concerned with tuning a model so that the sensitivity or the precision is improved. To this end, you can test different probability cutoffs for deciding which individuals are classified as positive and which as negative.

Note that here we have used `p > 0.5` as the probability threshold above which we declare an individual diabetes-positive. However, if we are concerned about missing individuals who are truly diabetes-positive (false negatives), we can consider lowering this threshold, for example to `p > 0.2`.
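A minimal sketch of re-classifying with a lower cutoff (this assumes the `prediction.probabilities` and `observed.classes` vectors used earlier in the chapter):

```r
# Lowering the cutoff from 0.5 to 0.2 trades specificity for sensitivity:
# more individuals are flagged as positive, catching more true positives
# but also raising more false alarms
predicted.classes.02 <- ifelse(prediction.probabilities > 0.2, "pos", "neg")
table(observed.classes, predicted.classes.02)
```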

The **ROC curve** (or **receiver operating characteristic curve**) is a popular graphical measure for assessing the performance of a classifier, beyond the raw accuracy (the total proportion of correctly classified observations).

For example, the accuracy of a medical diagnostic test can be assessed by considering the two possible types of errors: false positives and false negatives. From a classification point of view, the test is declared positive when the corresponding predicted probability, returned by the classifier algorithm, is above a fixed threshold, generally set to 0.5 (i.e., 50%).

So, in reference to our diabetes data example, for a given fixed probability cutoff:

- the **true positive rate** (or fraction) is the proportion of identified positives among the diabetes-positive population. Recall that this is also known as the **sensitivity** of the predictive classifier model.
- the **false positive rate** is the proportion of individuals wrongly identified as positive among the healthy (i.e. diabetes-negative) individuals. This is also defined as `1 - specificity`, where **specificity** measures the **true negative rate**, that is the proportion of identified negatives among the diabetes-negative population.

Since we don’t usually know the probability cutoff in advance, the ROC curve is typically used to plot the true positive rate (sensitivity, on the y-axis) against the false positive rate (1 - specificity, on the x-axis) at all possible probability cutoffs. This shows the trade-off between the rate of correct positive predictions and the rate of false alarms. Another common representation of the ROC plot displays the sensitivity against the specificity.
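To make the construction concrete, the curve can be traced by hand over a grid of cutoffs; an illustrative sketch, assuming the `observed.classes` and `prediction.probabilities` vectors used earlier in this chapter:

```r
# For each cutoff, compute the false positive rate and true positive rate
cutoffs <- seq(0, 1, by = 0.01)
roc.points <- t(sapply(cutoffs, function(cut) {
  pred <- ifelse(prediction.probabilities > cut, "pos", "neg")
  c(fpr = mean(pred[observed.classes == "neg"] == "pos"),
    tpr = mean(pred[observed.classes == "pos"] == "pos"))
}))
# Each row of roc.points is one point of the ROC curve
plot(roc.points, type = "l",
     xlab = "False positive rate (1 - specificity)",
     ylab = "True positive rate (sensitivity)")
abline(0, 1, lty = 2)  # random-guess diagonal
```

In practice this is done by a dedicated package such as pROC (used below), which handles ties and threshold placement properly.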

The **Area Under the Curve** (**AUC**) summarizes the overall performance of the classifier, over all possible probability cutoffs. It represents the ability of a classification algorithm to distinguish 1s from 0s (i.e., events from non-events, or positives from negatives).

For a good model, the ROC curve should rise steeply, indicating that the true positive rate (y-axis) increases faster than the false positive rate (x-axis) as the probability threshold decreases.

So, the “ideal point” is the top left corner of the graph, that is a false positive rate of zero, and a true positive rate of one. This is not very realistic, but it does mean that the larger the AUC the better the classifier.

The AUC metric varies between 0.50 (random classifier) and 1.00 (perfect classifier). Values above 0.80 are generally considered an indication of a good classifier.

In this section, we’ll show you how to compute and plot ROC curve in R for two-class and multiclass classification tasks. We’ll use the linear discriminant analysis to classify individuals into groups.

The ROC analysis can be easily performed using the R package `pROC`.

```
library(pROC)
# Compute roc
res.roc <- roc(observed.classes, prediction.probabilities)
plot.roc(res.roc, print.auc = TRUE)
```

The gray diagonal line represents a classifier no better than random chance.

A highly performant classifier will have an ROC curve that rises steeply toward the top-left corner; that is, it will correctly identify many positives without misclassifying many negatives as positives.

In our example, the AUC is 0.85, which is close to the maximum (max = 1). So our classifier can be considered very good. A classifier that performs no better than chance is expected to have an AUC of 0.5 when evaluated on an independent test set not used to train the model.
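The AUC value itself can be extracted from the `roc` object, for example to compare several candidate models; a small sketch, assuming the `res.roc` object computed above:

```r
# Numeric AUC value from a pROC "roc" object
auc(res.roc)               # prints "Area under the curve: ..."
as.numeric(res.roc$auc)    # plain numeric, convenient for comparisons
```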

If we want a classifier model with a specificity of at least 60%, then the sensitivity is about 88%. The corresponding probability threshold can be extracted as follows:

```
# Extract some interesting results
# (tibble() replaces the deprecated data_frame())
roc.data <- tibble(
  thresholds = res.roc$thresholds,
  sensitivity = res.roc$sensitivities,
  specificity = res.roc$specificities
)
# Get the probability threshold for specificity >= 0.6
roc.data %>% filter(specificity >= 0.6)
```

```
## # A tibble: 44 x 3
##   thresholds sensitivity specificity
##        <dbl>       <dbl>       <dbl>
## 1      0.111       0.885       0.615
## 2      0.114       0.885       0.635
## 3      0.114       0.885       0.654
## 4      0.115       0.885       0.673
## 5      0.119       0.885       0.692
## 6      0.131       0.885       0.712
## # ... with 38 more rows
```

The best threshold, that is the one maximizing the sum sensitivity + specificity, can be printed as follows. Note that there might be more than one such threshold.

`plot.roc(res.roc, print.auc = TRUE, print.thres = "best")`

Here, the best probability cutoff is 0.335, resulting in a predictive classifier with a specificity of 0.84 and a sensitivity of 0.66.
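The same “best” threshold can also be extracted numerically with pROC’s `coords()` function (a sketch, assuming the `res.roc` object computed above; by default `"best"` maximizes Youden’s J, i.e. sensitivity + specificity):

```r
# Extract the best threshold and the operating point it yields
coords(res.roc, "best",
       ret = c("threshold", "specificity", "sensitivity"))
```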

Note that `print.thres` can also be a numeric vector containing a direct definition of the thresholds to display:

`plot.roc(res.roc, print.thres = c(0.3, 0.5, 0.7))`

If you have grouping variables in your data, you might wish to create multiple ROC curves on the same plot. This can be done using ggplot2.

```
# Create some grouping variables
glucose <- ifelse(test.data$glucose < 127.5, "glu.low", "glu.high")
age <- ifelse(test.data$age < 28.5, "young", "old")
roc.data <- roc.data %>%
  filter(thresholds != -Inf) %>%
  mutate(glucose = glucose, age = age)
# Create ROC curve
ggplot(roc.data, aes(specificity, sensitivity)) +
  geom_path(aes(color = age)) +
  scale_x_reverse(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  geom_abline(intercept = 1, slope = 1, linetype = "dashed") +
  theme_bw()
```

We start by building a linear discriminant model using the `iris` data set, which contains the length and width of sepals and petals for three iris species. We want to predict the species based on the sepal and petal parameters using LDA.

```
# Load required packages (dplyr for %>%, caret for createDataPartition())
library(dplyr)
library(caret)
# Load the data
data("iris")
# Split the data into training (80%) and test set (20%)
set.seed(123)
training.samples <- iris$Species %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- iris[training.samples, ]
test.data <- iris[-training.samples, ]
# Build the model on the training set
library(MASS)
model <- lda(Species ~ ., data = train.data)
```

Performance metrics (sensitivity, specificity, …) of the predictive model can be calculated, separately for each class, comparing each factor level to the remaining levels (i.e. a “one versus all” approach).

```
# Make predictions on the test data
predictions <- model %>% predict(test.data)
# Model accuracy
confusionMatrix(predictions$class, test.data$Species)
```

```
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 10 0
## virginica 0 0 10
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.884, 1)
## No Information Rate : 0.333
## P-Value [Acc > NIR] : 4.86e-15
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.000 1.000 1.000
## Specificity 1.000 1.000 1.000
## Pos Pred Value 1.000 1.000 1.000
## Neg Pred Value 1.000 1.000 1.000
## Prevalence 0.333 0.333 0.333
## Detection Rate 0.333 0.333 0.333
## Detection Prevalence 0.333 0.333 0.333
## Balanced Accuracy 1.000 1.000 1.000
```
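As a sanity check, the one-versus-all logic behind the per-class rows can be sketched manually for a single class (assumes the `predictions` object and `test.data` from the code above):

```r
# One-versus-all sensitivity for the "setosa" class:
# among the truly-setosa observations, how many were predicted setosa?
obs.setosa  <- test.data$Species == "setosa"
pred.setosa <- predictions$class == "setosa"
sensitivity.setosa <- sum(pred.setosa & obs.setosa) / sum(obs.setosa)
sensitivity.setosa
# [1] 1
```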

Note that ROC curves are typically used for binary classification, not for multiclass classification problems.
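That said, pROC does offer a multiclass extension of the AUC via `multiclass.roc()`; a sketch, assuming the `predictions` and `test.data` objects from the code above, and a recent pROC version that accepts a per-class probability matrix:

```r
# Multiclass AUC computed from the LDA posterior probabilities
library(pROC)
res.multi <- multiclass.roc(test.data$Species, predictions$posterior)
auc(res.multi)
```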

This chapter described different metrics for evaluating the performance of classification models. These metrics include:

- classification accuracy,
- confusion matrix,
- Precision, Recall and Specificity,
- and ROC curve

To evaluate the performance of regression models, read the Chapter @ref(regression-model-accuracy-metrics).