Feature selection is the second class of dimension reduction methods. These methods reduce the number of predictors used by a model by selecting the best d predictors among the original p predictors.
This allows for smaller, faster-scoring, and more meaningful models, such as Generalized Linear Models (GLMs).
Feature selection techniques are often used in domains where there are many features and comparatively few samples (or data points).
Feature selection is also useful as part of the data analysis process, as it shows which features are important for prediction, and how these features are related.
Which variables are important to include in the model?
When we have a small number of features, the model becomes more interpretable.
Feature selection is a way of choosing among features to find the ones that are most informative.
We'd like to fit a model that has all the good (signal) variables and leaves out the noise variables.
Feature selection procedures generate candidate models and use model selection methods to find, among the p predictors, the ones most related to the response.
The central assumption when using a feature selection technique is that the data contains many redundant or irrelevant features.
We have access to p predictors, but we want a simpler model that involves only a subset of them. This model selection is made in two steps:
- model generation: for each number of predictors k, generate candidate models and keep the best one;
- model selection: choose among the best models, one for each k, based on some criterion that balances training error against model size.
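The two-step procedure can be sketched for best subset selection with plain numpy. Everything here is illustrative: the synthetic data, the helper names, and the choice of BIC as the balancing criterion (adjusted R², AIC, or cross-validation would work equally well).

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
# Only the first two predictors carry signal; the rest are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

def rss(cols):
    """Residual sum of squares of a least-squares fit on the given columns."""
    A = X[:, list(cols)]
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ coef
    return float(r @ r)

# Step 1: for each size k, keep the subset of predictors with the lowest RSS.
best_per_k = {
    k: min(itertools.combinations(range(p), k), key=rss)
    for k in range(1, p + 1)
}

# Step 2: choose among the size-k winners with a criterion that balances
# training error against model size (BIC here, as one common choice).
def bic(cols):
    return n * np.log(rss(cols) / n) + len(cols) * np.log(n)

best = min(best_per_k.values(), key=bic)
print(best)
```

Note that step 1 only needs the RSS (all size-k models have the same complexity), while step 2 must penalize size, since RSS alone always favors the full model.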
The subset selection methods below take a subset of the predictors and use least squares to fit the model.
Subset selection offers two approaches for generating the models with k predictors:
- Best subset selection, where we look among all possible combinations of features to find the ones that are the most predictive.
- Forward and backward stepwise methods, which perform a greedy, and therefore much cheaper, search through the space of models.
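As a sketch of the greedy alternative, forward stepwise selection adds one predictor at a time instead of scanning all 2^p subsets; the data and names below are again illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 120, 6
X = rng.normal(size=(n, p))
# Predictors 0 and 3 are the signal variables; the others are noise.
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.5, size=n)

def rss(cols):
    A = X[:, cols]
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ coef
    return float(r @ r)

selected, remaining, path = [], list(range(p)), []
while remaining:
    # Greedily add the single predictor that most reduces the RSS.
    j = min(remaining, key=lambda c: rss(selected + [c]))
    selected.append(j)
    remaining.remove(j)
    path.append((tuple(selected), rss(selected)))
print(path[1][0])
```

Backward stepwise is the mirror image: start from all p predictors and greedily remove the one whose deletion increases the RSS the least.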
The shrinkage methods take all of the predictors, but use a shrinkage approach to fit the model instead of plain least squares.
The shrinkage approach fits the coefficients by minimizing the RSS (as in least squares) plus a penalty term. The regression coefficients then shrink, typically towards 0.
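A minimal example of shrinkage is ridge regression, which adds an L2 penalty to the RSS and has a closed-form solution; the synthetic data and the penalty value below are illustrative, and the intercept is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 80, 10
X = rng.normal(size=(n, p))
y = X[:, 0] + rng.normal(scale=0.5, size=n)

def ridge(lam):
    # Closed-form minimizer of RSS + lam * ||beta||^2.
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

b_ols = ridge(0.0)       # lam = 0 recovers ordinary least squares
b_shrunk = ridge(100.0)  # a larger penalty pulls the coefficients toward 0
print(np.linalg.norm(b_ols), np.linalg.norm(b_shrunk))
```

With an L1 penalty instead (the lasso), some coefficients shrink exactly to 0, so shrinkage then doubles as feature selection.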
Finally, the dimension reduction (projection) methods use least squares not on the original predictors but on new predictors, which are linear combinations of the original predictors.
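One standard instance of this idea is principal components regression: project the predictors onto their top principal directions, then run least squares on those linear combinations. The sketch below uses the SVD directly; the dimensions and data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, m = 100, 8, 3  # keep m = 3 linear combinations of the p = 8 predictors
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=0.5, size=n)

Xc = X - X.mean(axis=0)
# The SVD gives the principal directions; the top m rows of Vt
# define the linear combinations used as new predictors.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:m].T
coef, *_ = np.linalg.lstsq(Z, y - y.mean(), rcond=None)
print(Z.shape, coef.shape)
```

The regression is thus fitted with only m coefficients instead of p, which reduces variance at the cost of some bias when the dropped directions carry signal.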