Classification Methods Essentials

Previously, we described the regression model (Chapter @ref(regression-analysis)), which is used to predict a quantitative or continuous outcome variable from one or more predictor variables.

In classification, the outcome variable is qualitative (or categorical). Classification refers to a set of machine learning methods for predicting the class (or category) of individuals on the basis of one or more predictor variables.

In this part, we’ll cover the following topics:

  • Logistic regression, for binary classification tasks (Chapter @ref(logistic-regression))
  • Stepwise and penalized logistic regression for variable selection (Chapters @ref(stepwise-logistic-regression) and @ref(penalized-logistic-regression))
  • Logistic regression assumptions and diagnostics (Chapter @ref(logistic-regression-assumptions-and-diagnostics))
  • Multinomial logistic regression, an extension of logistic regression for multiclass classification tasks (Chapter @ref(multinomial-logistic-regression))
  • Discriminant analysis, for binary and multiclass classification problems (Chapter @ref(discriminant-analysis))
  • Naive Bayes classifier (Chapter @ref(naive-bayes-classifier))
  • Support vector machines (Chapter @ref(support-vector-machine))
  • Classification model evaluation (Chapter @ref(classification-model-evaluation))

Most classification algorithms compute the probability of belonging to a given class. Observations are then assigned to the class with the highest probability score.

Generally, you also need to decide on a probability cutoff above which an observation is considered to belong to a given class.
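
For example, with a cutoff of 0.5 for a binary outcome, class assignment might look like the following sketch (the probability scores here are made up purely for illustration):

# Illustrative sketch: made-up probability scores for four observations
probabilities <- c(0.10, 0.80, 0.45, 0.95)
# Assign the class "pos" when the score exceeds the 0.5 cutoff
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
predicted.classes
## [1] "neg" "pos" "neg" "pos"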

Examples of data sets

PimaIndiansDiabetes2 data set

The Pima Indian Diabetes data set is available in the mlbench package. It will be used for binary classification.

# Load the data set
data("PimaIndiansDiabetes2", package = "mlbench")
# Inspect the data
head(PimaIndiansDiabetes2, 4)
##   pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 1        6     148       72      35      NA 33.6    0.627  50      pos
## 2        1      85       66      29      NA 26.6    0.351  31      neg
## 3        8     183       64      NA      NA 23.3    0.672  32      pos
## 4        1      89       66      23      94 28.1    0.167  21      neg

The data contains 768 individuals (all female) and 9 variables, listed below, used to predict the probability of being diabetes-positive or diabetes-negative:

  • pregnant: number of times pregnant
  • glucose: plasma glucose concentration
  • pressure: diastolic blood pressure (mm Hg)
  • triceps: triceps skin fold thickness (mm)
  • insulin: 2-Hour serum insulin (mu U/ml)
  • mass: body mass index (weight in kg/(height in m)^2)
  • pedigree: diabetes pedigree function
  • age: age (years)
  • diabetes: class variable (outcome), either pos (diabetes-positive) or neg (diabetes-negative)
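
As a quick preview of the binary classification workflow, the sketch below fits a simple logistic regression of diabetes status on glucose alone (a minimal illustration; logistic regression is covered in detail in Chapter @ref(logistic-regression)):

# Minimal sketch: logistic regression of diabetes status on glucose alone
# (rows with missing glucose values are dropped automatically by glm)
model <- glm(diabetes ~ glucose, data = PimaIndiansDiabetes2,
             family = binomial)
# Predicted probabilities of being diabetes-positive for the first observations
head(predict(model, type = "response"))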

Iris data set

The iris data set will be used for multiclass classification tasks. It contains the sepal and petal length and width for three iris species. We want to predict the species based on these four measurements.

# Load the data
data("iris")
# Inspect the data
head(iris, 4)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
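
As a quick multiclass preview, the sketch below uses linear discriminant analysis from the MASS package (one possible choice, described in Chapter @ref(discriminant-analysis)) to predict Species from the four measurements:

# Minimal sketch: linear discriminant analysis on the four iris measurements
library(MASS)
model <- lda(Species ~ ., data = iris)
predictions <- predict(model, iris)
# Proportion of correctly classified observations (on the training data)
mean(predictions$class == iris$Species)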