Select your variables
Evaluate the importance of your independent variables and select an optimal subset for your prediction model.
Learning objectives
At the end of this session you should be able to
- discuss the importance of feature selection strategies in multiple variable models,
- decide which feature selection strategy to use in some standard cases, and
- implement some basic feature selection strategies.
Basic idea of variable selection
Use only those explanatory variables that best explain the dependent variable without overfitting the model to the sample.
The graphic above shows how the cross-validation performance of two regression models typically changes as the number of independent variables used by the model increases. At first, additional independent variables improve performance (increasing R squared), but the maximum is reached quite quickly. Beyond that point, further variables still yield the same or better “internal” model results, as long as the performance measure does not account for the number of variables. However, the growing overfitting of the model to the data sample leads to decreasing performance in the cross-validation.
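The following minimal sketch reproduces this behaviour on synthetic data. It is an illustration only, not the code behind the graphic. It assumes an ordinary least squares model fitted with NumPy, a simple 5-fold cross-validation, and a made-up data set in which only the first three of 30 variables influence the dependent variable. All function and variable names are hypothetical.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination (R squared)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


def fit_predict(X_train, y_train, X_test):
    """Fit a linear model with intercept by least squares and predict X_test."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return A_test @ coef


def internal_r2(X, y):
    """R squared on the same data that was used for fitting."""
    return r_squared(y, fit_predict(X, y, X))


def cv_r2(X, y, n_folds=5):
    """R squared of pooled out-of-fold predictions from k-fold cross-validation."""
    folds = np.array_split(np.arange(len(y)), n_folds)
    preds = np.empty_like(y)
    for test_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        preds[test_idx] = fit_predict(X[train_mask], y[train_mask], X[test_idx])
    return r_squared(y, preds)


# Synthetic sample (assumption): 60 observations, 30 candidate variables,
# of which only the first three actually affect y.
rng = np.random.default_rng(42)
n_samples, n_vars = 60, 30
X = rng.normal(size=(n_samples, n_vars))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 1.0 * X[:, 2] + rng.normal(size=n_samples)

# Add the variables one at a time and record both performance measures.
internal_scores = [internal_r2(X[:, :k], y) for k in range(1, n_vars + 1)]
cv_scores = [cv_r2(X[:, :k], y) for k in range(1, n_vars + 1)]

# Number of variables with the best cross-validation performance.
best_k = int(np.argmax(cv_scores)) + 1

for k in (1, 3, 10, 20, 30):
    print(f"{k:2d} variables: internal R2 = {internal_scores[k - 1]:.3f}, "
          f"CV R2 = {cv_scores[k - 1]:.3f}")
print("Best number of variables according to cross-validation:", best_k)
```

The internal R squared can never decrease when a variable is added to a least squares model, so it gives no signal for when to stop. The cross-validated R squared does, which is why it is the more useful measure for choosing a subset.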
For a deeper look into variable selection, see Meyer et al. 2016.