Classification Methods Essentials
Previously, we described the regression model (Chapter @ref(regression-analysis)), which is used to predict a quantitative (continuous) outcome variable from one or more predictor variables.
In classification, the outcome variable is qualitative (or categorical). Classification refers to a set of machine learning methods for predicting the class (or category) of individuals on the basis of one or more predictor variables.
In this part, we’ll cover the following topics:
- Logistic regression, for binary classification tasks (Chapter @ref(logistic-regression))
- Stepwise and penalized logistic regression, for variable selection (Chapters @ref(stepwise-logistic-regression) and @ref(penalized-logistic-regression))
- Logistic regression assumptions and diagnostics (Chapter @ref(logistic-regression-assumptions-and-diagnostics))
- Multinomial logistic regression, an extension of logistic regression for multiclass classification tasks (Chapter @ref(multinomial-logistic-regression))
- Discriminant analysis, for binary and multiclass classification problems (Chapter @ref(discriminant-analysis))
- Naive Bayes classifier (Chapter @ref(naive-bayes-classifier))
- Support vector machines (Chapter @ref(support-vector-machine))
- Classification model evaluation (Chapter @ref(classification-model-evaluation))
Most classification algorithms compute the probability that an observation belongs to each class. The observation is then assigned to the class with the highest probability score.
For binary classification, you generally need to choose a probability cutoff above which an observation is considered to belong to a given class.
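As a minimal sketch of this assignment step, the R code below uses made-up probabilities rather than the output of a fitted model. The object names (`prob_pos`, `prob_multi`) and the cutoff of 0.5 are assumptions chosen for illustration only.

```r
# Hypothetical predicted probabilities of the positive class for five observations
prob_pos <- c(0.91, 0.35, 0.50, 0.08, 0.67)

# Binary case: assign "pos" when the probability exceeds the chosen cutoff
cutoff <- 0.5  # assumed value; adjust it to the problem at hand
pred_binary <- ifelse(prob_pos > cutoff, "pos", "neg")
pred_binary

# Multiclass case: one probability column per class (made-up values)
prob_multi <- matrix(
  c(0.7, 0.2, 0.1,
    0.1, 0.3, 0.6,
    0.2, 0.5, 0.3),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("setosa", "versicolor", "virginica"))
)

# Assign each observation (row) to the class with the highest probability
pred_multi <- colnames(prob_multi)[max.col(prob_multi, ties.method = "first")]
pred_multi
```

Note that an observation whose probability equals the cutoff exactly (0.50 here) is assigned to the negative class, because the comparison is strict.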
Examples of data sets
PimaIndiansDiabetes2 data set
The Pima Indian Diabetes data set is available in the mlbench package. It will be used for binary classification.
# Load the data set
data("PimaIndiansDiabetes2", package = "mlbench")
# Inspect the data
head(PimaIndiansDiabetes2, 4)
##   pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 1        6     148       72      35      NA 33.6    0.627  50      pos
## 2        1      85       66      29      NA 26.6    0.351  31      neg
## 3        8     183       64      NA      NA 23.3    0.672  32      pos
## 4        1      89       66      23      94 28.1    0.167  21      neg
The data contain 768 individuals (all female) and 9 variables: 8 clinical predictors, used to predict the probability of an individual being diabetes-positive or diabetes-negative, and the class variable itself:
- pregnant: number of times pregnant
- glucose: plasma glucose concentration
- pressure: diastolic blood pressure (mm Hg)
- triceps: triceps skin fold thickness (mm)
- insulin: 2-Hour serum insulin (mu U/ml)
- mass: body mass index (weight in kg/(height in m)^2)
- pedigree: diabetes pedigree function
- age: age (years)
- diabetes: class variable
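The first rows shown above contain missing values (NA) in triceps and insulin, so it can be worth checking the amount of missing data and the class balance before modeling. The sketch below does this with a small helper, `describe_outcome()`, which is a hypothetical function written for this illustration and not part of mlbench. The code assumes the mlbench package is installed and skips the data step otherwise.

```r
# Hypothetical helper: summarize size, missing values and class counts
describe_outcome <- function(df, outcome) {
  list(
    dimensions = dim(df),                     # rows, columns
    missing_per_column = colSums(is.na(df)),  # number of NA values per variable
    class_counts = table(df[[outcome]])       # observations per class
  )
}

# Run on the Pima data only if mlbench is available
if (requireNamespace("mlbench", quietly = TRUE)) {
  data("PimaIndiansDiabetes2", package = "mlbench")
  print(describe_outcome(PimaIndiansDiabetes2, "diabetes"))
}
```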
Iris data set
The iris data set will be used for multiclass classification tasks. It contains the length and width of sepals and petals for three iris species. We want to predict the species based on the sepal and petal measurements.
# Load the data
data("iris")
# Inspect the data
head(iris, 4)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
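Before fitting a multiclass model, a quick look at the class counts and the per-species averages can indicate how well the measurements separate the species. The following is a minimal base-R sketch; the object names `species_counts` and `species_means` are introduced here for illustration.

```r
data("iris")

# Number of observations per species (the outcome classes)
species_counts <- table(iris$Species)
species_counts

# Mean of each sepal and petal measurement within each species
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
species_means
```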