Statistics - Best Subset Selection Regression

About

The most direct approach in order to generate a set of model for the feature selection approach is called all subsets or best subsets regression. We compute the least squares t for all possible subsets in order to choose them.

Articles Related

Procedure

The natural way to do it is to consider every possible subset of p predictors and to choose the best model out of every single model the just contains some of the predictors.

Calculate the null model noted as <math>M_0</math> . 0 for 0 parameters.
Select the best model with one predictor called <math>M_1</math> . In order to get <math>M_1</math> , we fit all p models that contain exactly one predictor <math>(^p_1)</math> .
Fit all model <math>(^p_2)</math> and select the best model with exactly 2 predictor called <math>M_2</math>
….
Fit all model <math>(^p_k)</math> and select the best model with p predictors <math>M_p</math> .

This procedure produces a path of models.

The best model is defined by type of model:

the smallest residual sum of squares, or equivalently, the largest r squared for linear regression
the deviance (negative two times the maximized log-likelihood) plays the role of RSS for a broader class of models.

This notation <math>(^p_k)</math> is pronounced “choose” and means p choose k. It is the number of possible models that I can get that contain exactly k predictors out of p predictors total. <MATH> \begin{array}{rrc} \left(^p_k \right) & = & \frac{p!}{k!(p-k)!} \end{array} </MATH>

Best subset Issues

Best subset selection suffers from:

Computational Performance

If I have access to p predictors, and I want to consider every possible sub model, I'm actually looking at 2 to the p subsets. That is an absolutely huge number. When p is large, we're really not going to be able to do best subset selection. 40 predictors isn't large. Actually, there's easily tens or hundreds of thousands of predictors. So, Best subset selection just does not scale for many of the types of problems that people are interested in today. The limit for most R packages or function that do subset selection is about 30 or 40.

Statistical problems: Generalization, Overfitting

If I have 40 predictors, and I'm considering a trillion models (A Gosh), I'm going to over fit the data because a I'm just looking at so many models, I'm going to choose a model that looks really, really great on the training data but it's not going to look great on an independent test. It's a reasonable thing to do with a small number of variables, but it gets really hard when the number of variables gets large. Each predictor can either be included or not included in the model. That means that for each of the p variables there are two options. Number of Subset: <math>2^p</math> . 2 to the p grows exponentially with the number of variables.

For these two reasons– computational and statistical– best subset selection isn't really great unless p is extremely small. Best Subset Selection is rarely used in practice for say p=10 or larger.