
Statistics - Model Selection

About

Model selection is the task of selecting a statistical model from a set of candidate models through the use of criteria.

Dimension reduction procedures generate and return a sequence of possible models <math>M_k</math> indexed by a tuning parameter <math>k</math>.

The final step is to select the best model out of this set of models (<math>M_0</math> through <math>M_p</math>), i.e. the model path.

In other words, model selection helps choose the best model complexity:

There are in fact three ways to do this: adjusting the training error (Cp, BIC, adjusted R2), using a validation (hold-out) set, and using cross-validation.

Criteria

RSS and R2

RSS and R2 for Model Size selection

Subset Selection Model Path

Even if the data set contains N predictors, the x-axis can go beyond N if there are categorical variables, because dummy variables must then be created for them.
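As a minimal illustration of this point (the data frame and column names below are hypothetical, and pandas' get_dummies is just one possible way to build the indicator columns):

```python
import pandas as pd

# Hypothetical data set: 2 predictors, one of them categorical with 4 levels
df = pd.DataFrame({
    "income": [50, 60, 70, 80],
    "region": ["north", "south", "east", "west"],
})

# One-hot encoding turns the single "region" column into 3 dummy columns
# (4 levels minus 1 reference level), so the design matrix has 4 columns
# even though the data set only had 2 predictors.
X = pd.get_dummies(df, columns=["region"], drop_first=True)
print(X.columns.tolist())
# ['income', 'region_north', 'region_south', 'region_west']
```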

In the context of linear regression, we can't use RSS or R squared to choose among these <math>p+1</math> models because they always improve as variables are added: RSS decreases and R squared increases monotonically with model size, so both would simply select the largest model.

Therefore, in order to choose the “optimal” or best member in this last step, criteria other than RSS and R2 must be used.
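As a rough sketch of why this is (using simulated data and plain least squares rather than any particular subset-selection routine), the best RSS achievable at each model size decreases monotonically as predictors are added, so RSS alone would always favour the largest model:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
# Only the first two predictors actually matter in this simulated data
y = 2 * X[:, 0] - X[:, 1] + rng.normal(size=n)

def rss(X_sub, y):
    """Residual sum of squares of a least-squares fit with an intercept."""
    design = np.column_stack([np.ones(len(y)), X_sub])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)

# Best subset selection: for each size k, keep the subset with the lowest RSS
best_rss = {0: rss(np.empty((n, 0)), y)}
for k in range(1, p + 1):
    best_rss[k] = min(rss(X[:, list(combo)], y)
                      for combo in itertools.combinations(range(p), k))

# The best RSS decreases monotonically with model size,
# so RSS alone would always pick the full model with all p predictors.
for k, value in best_rss.items():
    print(f"size {k}: best RSS = {value:.1f}")
```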

Training Error Adjustment

The techniques below adjust the training error for the model size in order to give us an estimate of the test error, and can be used to select among a set of models with different numbers of variables. They balance training error with model size.

We want Cp and BIC to be as small as possible and adjusted R squared to be as large as possible.
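A minimal sketch of how these criteria could be computed from a fitted model's RSS, following the usual linear-regression formulas; the inputs (n, d, TSS and the sigma-squared estimate) are placeholders that would come from the actual fit:

```python
import numpy as np

def model_selection_criteria(rss, n, d, tss, sigma2_hat):
    """Adjusted training-error criteria for a linear model.

    rss        : residual sum of squares of the fitted model
    n          : number of observations
    d          : number of predictors in the model
    tss        : total sum of squares, sum((y - mean(y))**2)
    sigma2_hat : estimate of the error variance (e.g. from the full model)
    """
    cp = (rss + 2 * d * sigma2_hat) / n                   # smaller is better
    bic = (rss + np.log(n) * d * sigma2_hat) / n          # smaller is better
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))    # larger is better
    return cp, bic, adj_r2
```

These three values would then be computed for each candidate model <math>M_0</math> through <math>M_p</math> and compared.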

Adjusted Training Error: Cp, BIC, Adjusted R2

In this case, there's no compelling evidence that a four-parameter model is really better than three, or better than five. Since simpler is better, three predictors, or at most four, would probably be the best model.

Hold out set

The major advantage of these methods over the above adjustment methods is that you don't need to know d (the number of parameters) or <math>\sigma^2</math> (the error variance).

Estimating d and sigma squared can be challenging, and the hold-out set method relieves the worry of having to come up with good estimates of them. Cross-validation or a validation set can always be performed, no matter how wacky the model is. Rather than making an adjustment, these methods estimate the test error more directly.

We compute the validation set error or the cross-validation error for each model <math>M_k</math> under consideration, and then select the k for which the resulting estimated test error is smallest. The validation error as a function of k will be what we use to estimate prediction error and to choose the model size.
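A rough sketch of this selection step with scikit-learn's K-fold cross-validation; the candidate models <math>M_k</math> are assumed to be nested, i.e. <math>M_k</math> uses the first k columns of a design matrix whose columns were already ordered (for example by forward stepwise selection):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def cv_errors_by_size(X_path, y, n_folds=10):
    """Cross-validation error for each model M_k along a nested model path.

    X_path is assumed to have its columns ordered so that M_k uses the
    first k columns. Returns the mean CV MSE and a rough standard error
    of that mean for k = 1..p.
    """
    means, ses = [], []
    for k in range(1, X_path.shape[1] + 1):
        scores = -cross_val_score(LinearRegression(), X_path[:, :k], y,
                                  cv=n_folds, scoring="neg_mean_squared_error")
        means.append(scores.mean())
        ses.append(scores.std(ddof=1) / np.sqrt(n_folds))
    return np.array(means), np.array(ses)
```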

The one standard error rule will not choose the minimum, but will take the simplest (smallest) model for which the estimated test error is within one standard error of the lowest point on the curve.

Since the cross-validation error is just an average over the K folds, it has a standard error. In effect, we acknowledge that the curves have variation because the chosen variables are random variables, just like the data. If the models are within one standard error of each other, the error is essentially the same and we'd rather have the simpler model.
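Continuing the sketch above, the one standard error rule could then be applied to the returned mean errors and standard errors (model sizes are assumed to be 1-indexed along the nested path):

```python
import numpy as np

def one_standard_error_rule(cv_means, cv_ses):
    """Smallest model size whose CV error is within one SE of the minimum."""
    best = int(np.argmin(cv_means))
    threshold = cv_means[best] + cv_ses[best]
    # cv_means[k-1] is the error of the model with k predictors, so the first
    # index meeting the threshold corresponds to the simplest such model.
    eligible = np.where(cv_means <= threshold)[0]
    return int(eligible[0]) + 1

# Example usage with the sketch above:
# chosen_k = one_standard_error_rule(*cv_errors_by_size(X_path, y))
```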

Cross-Validation is similar to doing a Validation set multiple times and then averaging the answers. As such, we expect it to have lower variance than the Validation set method. This is why Cross-Validation is appealing (especially for small n).