Model Selection Essentials in R
When a predictive model has many candidate predictor variables, model selection methods let you automatically choose the combination of predictors that yields the best predictive model.
Removing irrelevant variables leads to a simpler and more interpretable model. For equal performance, a simpler model should always be preferred over a more complex one.
Additionally, model selection is critical when you have a large multivariate data set with many predictor variables. This is often the case in genomics, where a substantial challenge comes from the fact that the number of genomic variables (p) is usually much larger than the number of individuals (n) (i.e., p >> n) (Bovelstad et al. 2007).
It’s well known that, when p >> n, it is easy to find predictors that perform excellently on the fitted data, but fail in external validation, leading to poor prediction rules. Furthermore, there can be a lot of variability in the least squares fit, resulting in overfitting and consequently poor predictions on future observations not used in model training (James et al. 2014).
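As a rough illustration of this point, the following base R sketch simulates an outcome that is unrelated to every predictor, then picks the predictor that looks best on the training data. The sample sizes, the seed, and all object names are arbitrary choices made for this example; they do not come from the cited studies.

```r
set.seed(123)

n <- 20     # number of individuals
p <- 1000   # number of predictors, so p >> n

# Training and test data: pure noise, the outcome depends on no predictor
x_train <- matrix(rnorm(n * p), nrow = n)
y_train <- rnorm(n)
x_test  <- matrix(rnorm(n * p), nrow = n)
y_test  <- rnorm(n)

# Absolute correlation of each predictor with the outcome (training data)
train_cor <- abs(cor(x_train, y_train))

# Predictor that looks best on the fitted data
best <- which.max(train_cor)

# Apparent strength on the training data versus strength on new data
train_cor[best]
abs(cor(x_test[, best], y_test))
```

Because every predictor is noise, the high training correlation comes only from searching over many candidates. The selected predictor has no reason to correlate with the outcome in the test data.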
One possible strategy consists of testing all possible combinations of the predictors and then selecting the best model. This method, called best subsets regression (Chapter @ref(best-subsets-regression)), is computationally expensive: with p predictors there are 2^p candidate models, so it becomes infeasible for data sets with many variables.
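To make the cost concrete, here is a minimal base R sketch of an exhaustive search. It assumes the built-in `mtcars` data, four arbitrarily chosen predictors of `mpg`, and adjusted R-squared as the selection criterion; none of these choices come from the text above.

```r
# The number of candidate models doubles with each additional predictor
p <- c(5, 10, 20, 30)
data.frame(p = p, n_models = 2^p)

# Exhaustive search over all non-empty subsets of four predictors
predictors <- c("wt", "hp", "disp", "qsec")
subsets <- unlist(
  lapply(seq_along(predictors),
         function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit one linear model per subset and record its adjusted R-squared
adj_r2 <- sapply(subsets, function(vars) {
  fit <- lm(reformulate(vars, response = "mpg"), data = mtcars)
  summary(fit)$adj.r.squared
})

length(subsets)                # 2^4 - 1 = 15 non-empty subsets
subsets[[which.max(adj_r2)]]   # subset with the highest adjusted R-squared
```

With four predictors the loop fits 15 models. With 30 predictors it would have to fit more than a billion.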
A cheaper alternative to best subsets regression is stepwise regression (Chapter @ref(stepwise-regression)), which iteratively adds and removes predictors in order to find the best-performing model with a reduced set of variables.
Other methods for high-dimensional data with many predictor variables include penalized regression (ridge and lasso regression, Chapter @ref(penalized-regression)) and principal component-based regression methods (PCR and PLS, Chapter @ref(pcr-and-pls-regression)).
In this part, we’ll cover four categories of approaches for selecting an optimal linear model for large multivariate data sets. These include:
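The idea can be sketched with the `step()` function from base R's stats package. This is an illustration only: it assumes the built-in `mtcars` data with `mpg` as the outcome, and `step()` compares models by AIC, a criterion the text above does not specify.

```r
# Smallest and largest models that the search may consider
null_model <- lm(mpg ~ 1, data = mtcars)   # intercept only
full_model <- lm(mpg ~ ., data = mtcars)   # all available predictors

# Start from the null model; at each step add or drop the single term
# that lowers AIC the most, and stop when no change improves it
stepwise_model <- step(
  null_model,
  scope = list(lower = formula(null_model), upper = formula(full_model)),
  direction = "both",
  trace = 0   # suppress the step-by-step log
)

formula(stepwise_model)   # predictors kept by the search
```

Setting `direction` to `"forward"` or `"backward"` restricts the search to only adding or only removing predictors.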
- Best subsets selection (Chapter @ref(best-subsets-regression))
- Stepwise selection (Chapter @ref(stepwise-regression))
- Penalized regression (or shrinkage methods) (Chapter @ref(penalized-regression))
- Dimension reduction methods (Chapter @ref(pcr-and-pls-regression))
References
Bovelstad, H.M., S. Nygård, H.L. Storvold, M. Aldrin, O. Borgan, A. Frigessi, and O.C. Lingjoerde. 2007. “Predicting Survival from Microarray Data—a Comparative Study.” Bioinformatics 23 (16): 2080–7. doi:10.1093/bioinformatics/btm305.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014. An Introduction to Statistical Learning: With Applications in R. Springer Publishing Company, Incorporated.