Logistic Regression Essentials in R

Previously, we have described the regression model (Chapter @ref(regression-analysis)), which is used to predict a quantitative or continuous outcome variable based on one or multiple predictor variables.
In classification, the outcome variable is qualitative (or categorical). Classification refers to a set of machine learning methods for predicting the class (or category) of individuals on the basis of one or multiple predictor variables.
Most classification algorithms compute the probability that an observation belongs to a given class. Observations are then assigned to the class that has the highest probability score.
Generally, you need to decide on a probability cutoff above which you consider an observation as belonging to a given class.
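The assignment rule described above can be sketched in a few lines of base R. The probabilities below are made up for illustration, and the 0.5 cutoff is an assumption rather than a recommended value; in practice the probabilities would come from a fitted model.

```r
# Binary case: hypothetical predicted probabilities of the "pos" class
probabilities <- c(0.12, 0.47, 0.50, 0.81, 0.95)
cutoff <- 0.5  # assumed threshold, chosen only for illustration

# Observations strictly above the cutoff are labeled "pos", the rest "neg"
predicted_classes <- ifelse(probabilities > cutoff, "pos", "neg")
predicted_classes

# Multiclass case: hypothetical class probabilities, one row per observation
class_probs <- matrix(
  c(0.7, 0.2, 0.1,
    0.1, 0.3, 0.6),
  nrow = 2, byrow = TRUE,
  dimnames = list(NULL, c("setosa", "versicolor", "virginica"))
)

# Assign each observation to the class with the highest probability
assigned <- colnames(class_probs)[apply(class_probs, 1, which.max)]
assigned
```

Raising the cutoff makes the "pos" label harder to obtain, so fewer observations are classified as positive; lowering it has the opposite effect.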
The Pima Indian Diabetes data set is available in the mlbench package. It will be used for binary classification.
# Load the data set
data("PimaIndiansDiabetes2", package = "mlbench")
# Inspect the data
head(PimaIndiansDiabetes2, 4)
## pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 1 6 148 72 35 NA 33.6 0.627 50 pos
## 2 1 85 66 29 NA 26.6 0.351 31 neg
## 3 8 183 64 NA NA 23.3 0.672 32 pos
## 4 1 89 66 23 94 28.1 0.167 21 neg
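The data contain 768 individuals (female) and 9 variables: eight clinical measurements (the columns pregnant through age in the output above) and the outcome variable diabetes, which records whether an individual is diabetes-positive (pos) or negative (neg).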
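As a quick check of this description, the following sketch inspects the size of the data set, the distribution of the outcome, and the number of missing values per column (the NA entries visible in the output above). It assumes that the mlbench package is installed.

```r
# Assumes mlbench is installed: install.packages("mlbench")
data("PimaIndiansDiabetes2", package = "mlbench")

# Number of rows (individuals) and columns (variables)
dim(PimaIndiansDiabetes2)

# Number of individuals in each outcome class
table(PimaIndiansDiabetes2$diabetes)

# Number of missing values in each column
colSums(is.na(PimaIndiansDiabetes2))
```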
The iris data set will be used for multiclass classification tasks. It contains the length and width of sepals and petals for three iris species. We want to predict the species based on the sepal and petal parameters.
# Load the data
data("iris")
# Inspect the data
head(iris, 4)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
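Before modeling, it can help to confirm the structure of the outcome variable. This short sketch uses only base R and the built-in iris data; it lists the three species and counts the observations in each.

```r
# iris ships with base R, so no extra package is needed
data("iris")

# The outcome is a factor with one level per species
levels(iris$Species)

# Number of observations per species
species_counts <- table(iris$Species)
species_counts
```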