Statistics - Model Selection

About

Model selection is the task of selecting a statistical model from a set of candidate models using one or more criteria.

Dimension reduction procedures generate and return a sequence of possible models <math>M_0, M_1, \dots, M_i</math> indexed by a tuning parameter.

The final step is to select the best model out of this set of models (<math>M_0</math> through <math>M_i</math>), i.e. along the model path.

In other words, model selection helps choose the best model complexity:

the number of variables, with feature selection
or the polynomial degree, see this R procedure

There are three approaches:

RSS and R2 (not to be used)
Training Error Adjustment (usable, but not the best option)
Hold out set (validation set or cross-validation)

Criteria

RSS and R2

RSS and R2 for Model Size selection

There are <math>2^{10}</math> dots in this picture, each representing a model. Some of them lie on top of each other.
The red line tracks the best model for a given number of predictors. It visualizes the model path: it never decreases for R2 and never increases for RSS.

Even if the data set contains N predictors, the x-axis can go beyond N if some variables are categorical, because each of them must be encoded as several dummy variables.

In the context of linear regression, we can't use RSS or R squared to choose among these <math>p+1 </math> models because:

these quantities measure training error, and we want a model with low test error, because it must do well on future observations that we haven't seen. Unfortunately, choosing the model with the best training error does not give a model with low test error. In general, training error is a really bad predictor of test error.
the models all have different sizes. When you compare a model with four predictors to a model with eight predictors, you can't just look at which one has the smaller residual sum of squares, because the best model with eight predictors always has a residual sum of squares at least as small. As the models get bigger, the residual sum of squares decreases and R squared increases, because adding a variable cannot make the fit worse. If you have the best subset of size three, for example, and you look for the best subset of size four, at the very worst you could set the coefficient of the fourth variable to 0 and get the same error as the three-variable model. So the best-model curve can be flat, but it can never get worse when a predictor is added.

Therefore, to choose the “optimal” or best member in the last step, criteria other than RSS and R2 must be used.
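The argument above can be checked numerically. The following is a minimal sketch, assuming simulated data and a nested sequence of least squares models (the first k columns of a hypothetical design matrix `X`), not the best-subset search shown in the picture. The helper name `rss_of_least_squares` and the data-generating coefficients are illustrative assumptions.

```python
import numpy as np

def rss_of_least_squares(X, y):
    """Residual sum of squares of the least squares fit of y on X plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    return float(residuals @ residuals)

# Simulated data (assumption): only the first two of eight predictors matter.
rng = np.random.default_rng(0)
n, p = 50, 8
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)

# Nested models: intercept only (k = 0), then the first k predictors.
rss_path = [rss_of_least_squares(X[:, :k], y) for k in range(p + 1)]
tss = float(((y - y.mean()) ** 2).sum())
r2_path = [1.0 - rss / tss for rss in rss_path]

for k, (rss, r2) in enumerate(zip(rss_path, r2_path)):
    print(f"k={k}  RSS={rss:8.3f}  R2={r2:.3f}")
```

The training RSS keeps shrinking even after the six irrelevant predictors are added, which is why it cannot be used to pick the model size.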

Training Error Adjustment

The techniques below adjust the training error for the model size in order to estimate the test error, and can be used to select among a set of models with different numbers of variables. They balance training error against model size.

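BIC, Cp, and AIC are closely related. They differ only in how heavily they penalize model size: for least squares models Cp and AIC are proportional to each other, while BIC puts a heavier penalty on large models. We want to minimize them. They all require an estimate of the error variance <math>\hat{\sigma}^2</math>, which is only available if n is greater than p.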
Adjusted R2

We want Cp, AIC and BIC to be as small as possible and adjusted R squared as large as possible.
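As an illustration, here is a minimal sketch of these criteria for a least squares model with d predictors, using one common textbook form of each formula (an assumption, since this page does not state them). AIC is left out because in this setting it is proportional to Cp and selects the same model. The RSS values, `tss`, and `sigma2_hat` below are made-up numbers, not results from this page.

```python
import math

def cp(rss, n, d, sigma2_hat):
    """Mallows' Cp: training RSS plus a penalty of 2 * d * sigma2_hat, divided by n."""
    return (rss + 2 * d * sigma2_hat) / n

def bic(rss, n, d, sigma2_hat):
    """BIC: same shape as Cp, but the penalty factor is log(n) instead of 2."""
    return (rss + math.log(n) * d * sigma2_hat) / n

def adjusted_r2(rss, tss, n, d):
    """Adjusted R2: penalizes RSS through the degrees of freedom n - d - 1."""
    return 1.0 - (rss / (n - d - 1)) / (tss / (n - 1))

# Hypothetical inputs: RSS of the best model of each size d = 1..5.
n, tss, sigma2_hat = 100, 500.0, 2.0
rss_by_size = {1: 300.0, 2: 220.0, 3: 200.0, 4: 198.0, 5: 197.0}

cp_by_size = {d: cp(rss, n, d, sigma2_hat) for d, rss in rss_by_size.items()}
bic_by_size = {d: bic(rss, n, d, sigma2_hat) for d, rss in rss_by_size.items()}
adj_by_size = {d: adjusted_r2(rss, tss, n, d) for d, rss in rss_by_size.items()}

best_cp = min(cp_by_size, key=cp_by_size.get)      # smallest Cp
best_bic = min(bic_by_size, key=bic_by_size.get)   # smallest BIC
best_adj = max(adj_by_size, key=adj_by_size.get)   # largest adjusted R2
print(best_cp, best_bic, best_adj)
```

With these made-up numbers all three criteria agree on the same size, although the RSS itself keeps decreasing up to d = 5.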

In this case, there's no compelling evidence that a four-parameter model is really better than three or five. As simpler is better, three predictors, or at most four, would probably be the best model.

Hold out set

The major advantage of these methods over the adjustment methods above is that you don't need to know:

d: the number of parameters (i.e. the model size). For instance, for shrinkage methods (like ridge regression and the lasso), it's not at all clear what d is.
<math>\hat{\sigma}^2</math>: the estimate of the error variance, so n does not need to be bigger than p. If p is bigger than n, the full model saturates the data and gives a training error of 0, so it cannot provide a variance estimate.

d and sigma squared are both challenges, and the hold out set methods relieve the worry of having to come up with good estimates of them. Cross-validation or a validation set can always be used, no matter how wacky the model is. Rather than making an adjustment, they estimate the test error directly.

We compute the validation set error or the cross-validation error for each model <math>M_k</math> under consideration, and then select the k for which the resulting estimated test error is smallest. The validation error as a function of k will be what we use to estimate prediction error and to choose the model size.
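The selection step described above can be sketched as follows. This is a minimal example under stated assumptions: the data are simulated, the candidate models <math>M_k</math> are taken to be least squares fits on the first k columns of a hypothetical matrix `X`, and the helper names `fit_predict` and `cv_errors` are illustrative, not part of any library.

```python
import numpy as np

def fit_predict(X_train, y_train, X_test):
    """Fit least squares with an intercept on the training part, predict the test part."""
    A = np.column_stack([np.ones(len(y_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    B = np.column_stack([np.ones(len(X_test)), X_test])
    return B @ coef

def cv_errors(X, y, sizes, n_folds=5, seed=0):
    """Per-fold test MSE, shape (len(sizes), n_folds); model k uses the first k columns."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = np.empty((len(sizes), n_folds))
    for j, test_idx in enumerate(folds):
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False  # hold this fold out
        for i, k in enumerate(sizes):
            pred = fit_predict(X[train_mask, :k], y[train_mask], X[test_idx, :k])
            errors[i, j] = np.mean((y[test_idx] - pred) ** 2)
    return errors

# Simulated data (assumption): only the first three of eight predictors matter.
rng = np.random.default_rng(1)
n, p = 200, 8
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(size=n)

sizes = list(range(p + 1))
errors = cv_errors(X, y, sizes)
cv_mean = errors.mean(axis=1)            # estimated test error for each k
best_size = sizes[int(np.argmin(cv_mean))]
print(best_size)
```

Nothing here needs d or an estimate of the error variance: each model is only asked to predict observations it was not fitted on.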

Statistics - Resampling through Random Percentage Split
Cross-validation (CV): estimate of the prediction error. Cross-validation is a favorite method.

The one standard error rule does not choose the minimum: it takes the simplest (smallest) model for which the estimated test error is within one standard error of the lowest point on the curve.

As the cross-validation error is just an average over the k folds, it has a standard error. This acknowledges the fact that the curves have variation, because the error estimates are random variables just like the data are. If the models are within one standard error of each other, their errors are almost the same and we'd rather have the simpler model.
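The one standard error rule can be sketched in a few lines. This is a minimal example, assuming a hypothetical matrix `fold_errors` with one row per candidate model and one column per fold, and assuming that a smaller entry in `sizes` means a simpler model. The numbers are made up for illustration.

```python
import numpy as np

def one_standard_error_choice(fold_errors, sizes):
    """Smallest model whose mean CV error is within one SE of the minimum mean CV error."""
    fold_errors = np.asarray(fold_errors, dtype=float)
    n_folds = fold_errors.shape[1]
    mean = fold_errors.mean(axis=1)
    # Standard error of the mean over the folds, for each model.
    se = fold_errors.std(axis=1, ddof=1) / np.sqrt(n_folds)
    best = int(np.argmin(mean))
    threshold = mean[best] + se[best]
    candidates = [s for s, m in zip(sizes, mean) if m <= threshold]
    return min(candidates)

# Hypothetical per-fold errors for models of size 1 to 5 (5 folds each).
sizes = [1, 2, 3, 4, 5]
fold_errors = [
    [5.0, 5.4, 4.6, 5.2, 4.8],
    [2.2, 2.6, 2.0, 2.4, 2.3],
    [2.1, 2.5, 1.9, 2.3, 2.2],
    [2.0, 2.5, 1.8, 2.3, 2.15],
    [2.1, 2.6, 1.9, 2.3, 2.2],
]

size_at_minimum = sizes[int(np.argmin(np.mean(fold_errors, axis=1)))]
size_one_se = one_standard_error_choice(fold_errors, sizes)
print(size_at_minimum, size_one_se)
```

In this made-up example the minimum of the curve is at size 4, but size 3 is within one standard error of it, so the rule prefers the smaller model.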

Cross-Validation is similar to doing a Validation set multiple times and then averaging the answers. As such, we expect it to have lower variance than the Validation set method. This is why Cross-Validation is appealing (especially for small n).