Statistics - Forward and Backward Stepwise (Selection|Regression)


In statistics, stepwise regression includes regression models in which the choice of predictive variables is carried out by an automatic procedure.

Stepwise methods have the same ideas as best subset selection but they look at a more restrictive set of models.

Between backward and forward stepwise selection, there is just one fundamental difference: whether you start with the null model and add predictors one at a time (forward), or start with the full model and remove them one at a time (backward).

At each step:

  • we do not look at every possible model containing k predictors, as in best subset selection; we only look at the models that contain the k minus 1 predictors that we already chose in the previous step.
  • among those candidates, we choose the variable that gives the biggest improvement over the model from the previous step.



Forward Selection chooses a subset of the predictor variables for the final model.

We can perform forward stepwise selection in the context of linear regression whether n is less than p or n is greater than p.

Forward selection is a very attractive approach, because it's both tractable and it gives a good sequence of models.

  • Start with a null model. The null model has no predictors, just one intercept (The mean over Y).
  • Fit p simple linear regression models, each containing the intercept and one of the variables. In other words, search through all the single-variable models and pick the best one (the one that results in the lowest residual sum of squares). Fix this variable in the model.
  • Now search through the remaining p minus 1 variables and find which one, when added to the current model, best improves the residual sum of squares.
  • Continue until some stopping rule is satisfied, for example when all remaining variables have a p-value above some threshold.
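The steps above can be sketched in a few lines of numpy. This is a minimal illustration, not a library implementation: the function name `forward_stepwise` and the simple `max_vars` stopping rule are assumptions for the sketch, which uses RSS (rather than p-values) as the improvement criterion.

```python
import numpy as np

def forward_stepwise(X, y, max_vars=None):
    """Greedily add, at each step, the predictor that most reduces RSS.
    Returns the selected column indices in order of inclusion."""
    n, p = X.shape
    max_vars = p if max_vars is None else max_vars
    selected, remaining = [], list(range(p))
    while remaining and len(selected) < max_vars:
        best_rss, best_j = None, None
        for j in remaining:
            # candidate design matrix: intercept plus current set plus variable j
            A = np.column_stack([np.ones(n), X[:, selected + [j]]])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = np.sum((y - A @ beta) ** 2)
            if best_rss is None or rss < best_rss:
                best_rss, best_j = rss, j
        selected.append(best_j)  # fix the winning variable in the model
        remaining.remove(best_j)
    return selected
```

Note that each step only fits the models nested above the current one, which is what keeps the search tractable.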


Unlike forward stepwise selection, backward stepwise selection begins with the full least squares model containing all p predictors, and then iteratively removes the least useful predictor, one at a time.

In order to perform backward selection, we need to be in a situation with more observations than variables, because least squares regression is only defined when n is greater than p. If p is greater than n, the full least squares model cannot even be fit.

  • Start with all variables in the model.
  • Remove the variable with the largest p-value, that is, the variable that is the least statistically significant.
  • Fit the new (p - 1)-variable model, and again remove the variable with the largest p-value.
  • Continue until a stopping rule is reached. For instance, we may stop when all remaining variables have a p-value below some significance threshold.
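The backward procedure admits a similar numpy sketch. As above, this is an assumed illustration (the name `backward_stepwise` and the `min_vars` stopping rule are mine), and it drops the variable whose removal hurts RSS least, rather than testing p-values.

```python
import numpy as np

def backward_stepwise(X, y, min_vars=0):
    """Start from all p predictors; at each step remove the predictor
    whose removal increases RSS the least. Requires n > p so that the
    full least squares fit is defined. Returns the removal order
    (least useful variable first)."""
    n, p = X.shape
    assert n > p, "backward selection needs more observations than variables"

    def rss(cols):
        # intercept plus the given predictor columns (works for cols == [])
        A = np.column_stack([np.ones(n), X[:, cols]])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        return np.sum((y - A @ beta) ** 2)

    selected, removed = list(range(p)), []
    while len(selected) > min_vars:
        # the least useful variable is the one whose removal costs least RSS
        worst = min(selected, key=lambda j: rss([c for c in selected if c != j]))
        selected.remove(worst)
        removed.append(worst)
    return removed
```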



Forward and backward stepwise selection are not guaranteed to give us the best model containing a particular subset of the p predictors, but that is the price to pay in order to avoid overfitting. Even when p is small enough (say, less than 40) that looking at all possible models is computationally feasible, a full search is not always best, because we pay a price in variance (and thus in test error). Just because best subset selection finds a better model on the training data does not mean it is really going to be a better model on test data, which is what we really care about.

Correlation and RSS

Just like in best subset selection, we get p plus 1 models from M0 to Mp. The difference is that these models are nested.

  • M1 contains the predictors in M0 (i.e., none), plus one more.
  • M2 contains M1 plus one more predictor.
  • M3 contains the predictors in M2 plus one more,
  • and so on.

These models are nested in a way that wasn't the case for best subset selection.

Because backward and forward stepwise do not search among all possible models, for a given model size they will typically have an RSS above that of best subset selection. This happens only when there is correlation between the features: it is easy to show that if the variables were uncorrelated, the variables chosen by the methods would be exactly the same. It is the correlation between the features that creates a discrepancy between best subset and forward stepwise.

They still have the same RSS at two points:

  • the null model because they each contain it
  • the model with all p predictors because they are each considering the model with all p predictors.

But in between there is a gap.


Instead of looking at 2 to the p models, the backward and forward selection approaches search through only on the order of p squared models:

<MATH> \text{models searched} = 1 + p + (p-1) + (p-2) + \dots + 1 = 1 + \frac{p(p + 1)}{2} \approx \frac{p^2}{2} </MATH>
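The count is quick to verify numerically: one null model, then p candidate fits at the first step, p - 1 at the second, and so on. The helper name `stepwise_fit_count` below is just for illustration.

```python
def stepwise_fit_count(p):
    """Number of least squares fits a stepwise search performs:
    1 (null model) + p + (p-1) + ... + 1 = 1 + p*(p+1)/2."""
    return 1 + sum(p - k for k in range(p))

# e.g. with p = 10 predictors, stepwise fits 56 models,
# versus 2**10 = 1024 models for best subset selection.
```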
