Select your variables

Evaluate the importance of your independent variables and select an optimal subset for your prediction model.

Learning objectives

At the end of this session you should be able to

  • discuss the importance of feature selection strategies in multiple variable models,
  • decide which feature selection strategy to use in some standard cases, and
  • implement some basic feature selection strategies.

Basic idea of variable selection

Use only those explanatory variables that best explain the dependent variable without overfitting the model to the sample.

Increasing and then decreasing R squared with increasing number of variables.
Influence of variables used for two regression models on model performance.

The graphic above shows how the cross-validation performance of two regression models typically changes as the number of independent variables used by the model increases. At first, more independent variables lead to better performance (increasing R squared), but the maximum is reached quite quickly. Beyond that point, additional independent variables still yield the same or better “internal” model results, as long as the performance measure does not account for the number of variables. In the cross-validation, however, performance decreases because the model is increasingly overfitted to the data sample.
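The pattern described above can be reproduced with a small simulation. The following Python sketch is only an illustration under stated assumptions: the data are synthetic (40 samples, 20 candidate variables, of which only the first three influence the response), the model is ordinary least squares, and all helper names (`solve`, `fit`, `predict`, `r_squared`, `cv_r_squared`) are made up for this example. It uses only the standard library, so the linear algebra is written out by hand. The script compares the in-sample (“internal”) R squared with a 5-fold cross-validated R squared while variables are added one at a time.

```python
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit(rows, y):
    """Ordinary least squares with intercept, via the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


def predict(coef, rows):
    return [coef[0] + sum(c * v for c, v in zip(coef[1:], r)) for r in rows]


def r_squared(y, pred):
    mean = sum(y) / len(y)
    ss_res = sum((t - p) ** 2 for t, p in zip(y, pred))
    ss_tot = sum((t - mean) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot


def cv_r_squared(rows, y, folds=5):
    """R squared of the pooled out-of-fold predictions."""
    n = len(y)
    pred = [0.0] * n
    for f in range(folds):
        test = [i for i in range(n) if i % folds == f]
        train = [i for i in range(n) if i % folds != f]
        coef = fit([rows[i] for i in train], [y[i] for i in train])
        for i, p in zip(test, predict(coef, [rows[i] for i in test])):
            pred[i] = p
    return r_squared(y, pred)


# Synthetic data (an assumption of this sketch): only variables 0, 1 and 2
# carry signal, the remaining 17 are pure noise.
random.seed(1)
n_samples, n_vars = 40, 20
X = [[random.gauss(0, 1) for _ in range(n_vars)] for _ in range(n_samples)]
y = [2.0 * r[0] - 1.5 * r[1] + 1.0 * r[2] + random.gauss(0, 1) for r in X]

train_r2, cv_r2 = [], []
for k in range(1, n_vars + 1):
    subset = [r[:k] for r in X]  # use the first k variables
    train_r2.append(r_squared(y, predict(fit(subset, y), subset)))
    cv_r2.append(cv_r_squared(subset, y))
    print(f"{k:2d} variables: in-sample R2 = {train_r2[-1]:.3f}, "
          f"cross-validated R2 = {cv_r2[-1]:.3f}")
```

For nested least-squares models the in-sample R squared can only stay equal or increase as variables are added, whereas the cross-validated value typically peaks once the informative variables are included and declines afterwards.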

For a deeper look into variable selection, see Meyer et al. 2016.
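One basic selection strategy is greedy forward selection guided by a cross-validated error. The sketch below is a generic illustration, not the procedure of any particular paper or package. Its assumptions are: synthetic data in which only variables 0 and 1 carry signal, a simple k-nearest-neighbour regressor as the model, leave-one-out cross-validation as the error estimate, and the hypothetical helper names `loo_rmse` and `forward_selection`.

```python
import random


def loo_rmse(X, y, features, k=5):
    """Leave-one-out RMSE of a k-nearest-neighbour regressor that only
    uses the variables listed in `features`."""
    n = len(y)
    sq_err = 0.0
    for i in range(n):
        # squared Euclidean distance to every other sample
        dist = sorted(
            (sum((X[i][f] - X[j][f]) ** 2 for f in features), j)
            for j in range(n) if j != i
        )
        pred = sum(y[j] for _, j in dist[:k]) / k
        sq_err += (y[i] - pred) ** 2
    return (sq_err / n) ** 0.5


def forward_selection(X, y, n_features):
    """Repeatedly add the variable that lowers the cross-validated error
    the most; stop as soon as no remaining variable improves it."""
    selected, best = [], float("inf")
    remaining = list(range(n_features))
    while remaining:
        scores = {f: loo_rmse(X, y, selected + [f]) for f in remaining}
        candidate = min(scores, key=scores.get)
        if scores[candidate] >= best:
            break  # no improvement: keep the current subset
        selected.append(candidate)
        remaining.remove(candidate)
        best = scores[candidate]
    return selected, best


# Synthetic data (an assumption of this sketch): 8 candidate variables,
# only variables 0 and 1 influence the response.
random.seed(42)
n_samples, n_features = 80, 8
X = [[random.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
y = [3.0 * r[0] + 2.0 * r[1] + random.gauss(0, 0.5) for r in X]

selected, rmse = forward_selection(X, y, n_features)
all_rmse = loo_rmse(X, y, list(range(n_features)))
print("selected variables:", selected)
print(f"LOO RMSE with selected subset: {rmse:.3f}, with all variables: {all_rmse:.3f}")
```

Because every candidate subset is scored on held-out samples, a variable is only kept if it improves prediction for data the model has not seen. The greedy search does not test all possible subsets, so the result is a good subset rather than a guaranteed optimum.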

Comic illustrating the selection of a seat in a plane.
CC-BY by xkcd.com
