Articles - Statistical Machine Learning Essentials

Statistical machine learning refers to a set of powerful automated algorithms that are used to predict an outcome variable based on multiple predictor variables. The algorithms automatically improve their performance through “learning” from the data, that is they are data-driven and do not seek to impose linear or other overall structure on the data (P. Bruce and Bruce 2017). This means that they are non-parametric.

The different machine learning methods can be used for both:

  • classification, where the outcome variable is a categorical variable, for example positive vs negative
  • and regression , where the outcome variable is a continuous variable.

In this part, we’ll cover the following methods:

  • K-Nearest Neighbors, which predict the outcome of a new observation x as the average outcome of the k most similar observations to x (Chapter @ref(knn-k-nearest-neighbors)).
  • Decision trees, which build a set of decision rules describing the relationship between predictors and the outcome. These rules are used to predict the outcome of a new observations (Chapter @ref(decision-tree-models)).
  • Ensemble learning, including bagging, random forest and boosting. These machine learning algorithm are based on decision trees. They produce many tree models from the training data sets, and use the average as the predictive model. These results to the top-performing predictive modeling techniques. See Chapter @ref(bagging-and-random-forest) and @ref(boosting).


Bruce, Peter, and Andrew Bruce. 2017. Practical Statistics for Data Scientists. O’Reilly Media.

Sort by