Statistics - Forward and Backward Stepwise (Selection|Regression)

About

In statistics, stepwise regression includes regression models in which the choice of predictive variables is carried out by an automatic procedure.

Stepwise methods follow the same idea as best subset selection, but they explore a more restricted set of models.

Between backward and forward stepwise selection, there is just one fundamental difference: whether you start from the null model containing no predictors and add variables, or start from the full model containing all p predictors and remove them.

At each step, the procedure either adds the variable that gives the greatest improvement to the fit (forward) or drops the variable whose removal hurts the fit the least (backward).

Selection

Forward

Forward selection chooses a subset of the predictor variables for the final model by starting from the null model and adding one predictor at a time.

We can do forward stepwise in the context of linear regression whether n is less than p or n is greater than p, since each step fits only a small model on the variables selected so far.

Forward selection is a very attractive approach because it is computationally tractable and it produces a good sequence of models.
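As a concrete illustration, here is a minimal sketch of forward stepwise selection for linear regression using NumPy. The function name forward_stepwise and the RSS-based scoring are illustrative choices, not a fixed API; in practice, the resulting sequence of models would then be compared with a criterion such as cross-validation, Cp, AIC, or BIC.

<code python>
import numpy as np

def forward_stepwise(X, y):
    """Greedy forward selection: start from the null model and, at each
    step, add the predictor that most reduces the residual sum of squares.
    Returns the nested sequence of selected-variable lists M_0, ..., M_p."""
    n, p = X.shape

    def rss(cols):
        # Least squares fit (with intercept) on the candidate subset
        A = np.column_stack([np.ones(n), X[:, cols]])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        return np.sum((y - A @ coef) ** 2)

    selected, remaining = [], set(range(p))
    path = [list(selected)]  # M_0: the null model
    while remaining:
        best = min(remaining, key=lambda j: rss(selected + [j]))
        selected.append(best)
        remaining.remove(best)
        path.append(list(selected))
    return path
</code>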

Backward

Unlike forward stepwise selection, backward stepwise selection begins with the full least squares model containing all p predictors, and then iteratively removes the least useful predictor, one at a time.

In order to perform backward selection, we need to be in a situation where we have more observations than variables, because we can only do least squares regression when n is greater than p. If p is greater than n, we cannot fit the full least squares model; it is not even defined.
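A matching sketch for backward elimination, under the same assumptions as the forward example above (NumPy, RSS as the fitting criterion, illustrative function names). The n > p check reflects the requirement just described.

<code python>
import numpy as np

def backward_stepwise(X, y):
    """Greedy backward elimination: start from the full least squares model
    (requires n > p) and, at each step, drop the predictor whose removal
    increases the residual sum of squares the least."""
    n, p = X.shape
    assert n > p, "backward selection needs more observations than predictors"

    def rss(cols):
        A = np.column_stack([np.ones(n), X[:, cols]])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        return np.sum((y - A @ coef) ** 2)

    selected = list(range(p))
    path = [list(selected)]  # M_p: the full model
    while selected:
        # the variable whose removal hurts the fit the least
        worst = min(selected, key=lambda j: rss([k for k in selected if k != j]))
        selected.remove(worst)
        path.append(list(selected))
    return path
</code>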

Characteristics

Overfitting

Forward and backward stepwise selection are not guaranteed to find the best model containing a particular subset of the p predictors, but that is the price to pay in order to avoid overfitting. Even when p is small enough (say, less than 40) that examining all possible models is feasible, a full search is not always best: we pay a price in variance, and thus in test error. Just because best subset selection finds a model with lower training RSS does not mean it will really be a better model on test data, which is what we actually care about.

Correlation and RSS

Just like in best subset selection, we get p + 1 models, from M0 to Mp. The difference is that these models are nested: each model contains all the variables of the previous one, which was not the case for best subset selection.

Because backward and forward stepwise do not search among all possible models, for a given model size they will typically have an RSS above that of best subset selection. This gap appears only when there is correlation between the features: it is fairly easy to show that if the variables were uncorrelated, the variables chosen by the two methods would be exactly the same. It is the correlation between the features that can create a discrepancy between best subset and forward stepwise.

They still have the same RSS at two points: the null model M0 and the full model Mp. In between, there can be a gap.
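The gap can be checked numerically. The following sketch (illustrative, not from the original text) builds a correlated design through a shared latent factor and prints, for each model size, the best subset RSS next to the forward stepwise RSS; with this kind of correlation the two can differ at intermediate sizes, while the endpoints always agree.

<code python>
import numpy as np
from itertools import combinations

def rss(X, y, cols):
    A = np.column_stack([np.ones(len(y)), X[:, list(cols)]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sum((y - A @ coef) ** 2))

rng = np.random.default_rng(0)
n, p = 200, 6
latent = rng.normal(size=(n, 1))            # shared factor -> correlated columns
X = latent + 0.5 * rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

selected = []
for k in range(1, p + 1):
    # best subset of size k: exhaustive search over all C(p, k) subsets
    best = min(combinations(range(p), k), key=lambda c: rss(X, y, c))
    # forward stepwise: greedily extend the previous selection by one variable
    j = min(set(range(p)) - set(selected), key=lambda i: rss(X, y, selected + [i]))
    selected.append(j)
    print(k, rss(X, y, best), rss(X, y, selected))  # a gap may appear for 1 < k < p
</code>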

Computation

Instead of looking at all 2^p models, the backward and forward selection approaches search through only on the order of p^2 models (about p^2/2):

<MATH> \begin{array}{rcl} \text{Model space} & = & 1 + p + (p-1) + (p-2) + \dots + 1 \\ & = & 1 + \frac{p(p + 1)}{2} \\ & \approx & \frac{p^2}{2} \\ \end{array} </MATH>
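To make the savings concrete, a quick count for p = 20 (an arbitrary illustrative value):

<code python>
# Number of models fit by best subset vs. forward/backward stepwise, p = 20
p = 20
best_subset = 2 ** p                # every possible subset of the p predictors
stepwise = 1 + p * (p + 1) // 2     # 1 + p + (p-1) + ... + 1
print(best_subset, stepwise)        # 1048576 vs 211
</code>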
