Statistics - Model Selection


Model selection is the task of selecting a statistical model from a set of candidate models by means of a selection criterion.

Dimension reduction procedures generate and return a sequence of possible models <math>M_0</math>, ..., <math>M_i</math> indexed by a tuning parameter.

The final step is to select the best model out of this set of models (<math>M_0</math> through <math>M_i</math>), ie the Model Path.

In other words, model selection helps choose the best model complexity.

There are three ways to do it:

  • RSS and R2 (not to be done)
  • Training Error Adjustment (not the best ones)
  • Hold out set (ie cross-validation)


RSS and R2

Figure: RSS and R2 against model size for subset selection (the Model Path)

  • There are <math>2^{10}</math> dots in this picture, each representing a model. Some of them lie on top of each other.
  • The red line tracks the best model for a given number of predictors. It visualizes the Model Path and is monotonically non-decreasing for R2 and non-increasing for RSS.

Even if the data set contains N predictors, the x-axis can go beyond N if there are categorical variables, because each of them has to be encoded as several dummy variables.
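A minimal sketch of why the number of columns can exceed the number of predictors. The data set, the column names and the `dummy_encode` helper are hypothetical and only serve the illustration; the first level of the categorical variable is assumed to be the baseline.

```python
def dummy_encode(rows, column):
    """Replace a categorical column by 0/1 dummy columns (first level = baseline)."""
    levels = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Keep every column except the categorical one
        new_row = {key: value for key, value in row.items() if key != column}
        # One dummy column per level, except the baseline level
        for level in levels[1:]:
            new_row[f"{column}_{level}"] = 1 if row[column] == level else 0
        encoded.append(new_row)
    return encoded


# Hypothetical data: 2 predictors, one of them categorical with 3 levels
rows = [
    {"age": 31, "region": "north"},
    {"age": 45, "region": "south"},
    {"age": 27, "region": "west"},
]
encoded = dummy_encode(rows, "region")

n_predictors = len(rows[0])   # number of original predictors
n_columns = len(encoded[0])   # number of columns the model actually sees
print(n_predictors, n_columns)
```

With a three-level categorical variable, two predictors become three model columns, so the x-axis of a model size plot counts columns rather than original predictors.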

In the context of linear regression, we can't use RSS or R squared to choose among these <math>p+1 </math> models because:

  • these quantities are tied to the training error, whereas we want a model with a low test error, because it has to do well on future observations that we haven't seen. Unfortunately, choosing the model with the best training error does not give a model with a low test error. In general, training error is a really bad predictor of test error.
  • the models have different sizes. When comparing a model with four predictors to a model with eight predictors, you can't just look at which one has the smaller residual sum of squares, because the best model with eight predictors never has a larger one. As the models get bigger, the residual sum of squares decreases and the R squared increases, because adding variables cannot make the training fit worse. If you have the best subset of size three, for example, and you look for the best subset of size four, at the very worst you could set the coefficient of the fourth variable to 0 and get the same error as for the three-variable model. So the best model curve can never get worse. It can be flat, but we can't do any worse by adding a predictor.
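The second point can be checked numerically. The sketch below is an illustration on simulated data, not a general proof: it compares the training RSS of an intercept-only model with the training RSS of a simple linear regression on a predictor that is pure noise. The function names and the simulated data are assumptions made for the example.

```python
import random


def rss_intercept_only(ys):
    """Training RSS of the model that always predicts the mean of y."""
    mean_y = sum(ys) / len(ys)
    return sum((y - mean_y) ** 2 for y in ys)


def rss_one_predictor(xs, ys):
    """Training RSS of the least squares line y = intercept + slope * x."""
    n = len(ys)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))


rng = random.Random(0)
ys = [rng.gauss(0, 1) for _ in range(50)]
noise_x = [rng.gauss(0, 1) for _ in range(50)]  # generated independently of ys

rss0 = rss_intercept_only(ys)
rss1 = rss_one_predictor(noise_x, ys)
print(rss0, rss1)  # rss1 is not larger than rss0, although noise_x is useless
```

Even a useless predictor does not increase the training RSS, which is why RSS and R2 always favour the biggest model.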

Therefore, in order to choose the “optimal” or best member in the last step, criteria other than RSS and R2 must be used.

Training Error Adjustment

The techniques below adjust the training error for the model size in order to give an estimate of the test error. They can be used to select among a set of models with different numbers of variables. They balance training error against model size.

We want Cp, BIC to be as small as possible and adjusted R squared as large as possible.
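The page does not spell out the formulas. The sketch below assumes the common textbook forms for least squares models, with d predictors, n observations and an estimate `sigma2_hat` of the error variance: Cp = (RSS + 2 d sigma2_hat) / n, BIC = (RSS + log(n) d sigma2_hat) / n and adjusted R2 = 1 - (RSS / (n - d - 1)) / (TSS / (n - 1)). All numbers are hypothetical.

```python
import math


def mallows_cp(rss, n, d, sigma2_hat):
    """Cp: training RSS plus a penalty of 2 * d * sigma2_hat, divided by n."""
    return (rss + 2 * d * sigma2_hat) / n


def bic(rss, n, d, sigma2_hat):
    """BIC: same idea as Cp, with the factor 2 replaced by log(n)."""
    return (rss + math.log(n) * d * sigma2_hat) / n


def adjusted_r2(rss, tss, n, d):
    """Adjusted R2: R2 with RSS and TSS divided by their degrees of freedom."""
    return 1 - (rss / (n - d - 1)) / (tss / (n - 1))


# Hypothetical values: training RSS of the best model of each size
n, tss, sigma2_hat = 100, 500.0, 1.0
rss_by_size = {1: 200.0, 2: 150.0, 3: 120.0, 4: 119.0, 5: 118.5}

cp_by_size = {d: mallows_cp(rss, n, d, sigma2_hat) for d, rss in rss_by_size.items()}
bic_by_size = {d: bic(rss, n, d, sigma2_hat) for d, rss in rss_by_size.items()}
adj_by_size = {d: adjusted_r2(rss, tss, n, d) for d, rss in rss_by_size.items()}

best_cp = min(cp_by_size, key=cp_by_size.get)      # smallest Cp
best_bic = min(bic_by_size, key=bic_by_size.get)   # smallest BIC
best_adj = max(adj_by_size, key=adj_by_size.get)   # largest adjusted R2
print(best_cp, best_bic, best_adj)
```

In this made-up example the RSS keeps decreasing up to size five, but the penalties make all three criteria stop at three predictors.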

Figure: Cp, BIC and adjusted R2 against model size

In this case, there's no compelling evidence that a four-parameter model is really better than a three- or five-parameter one. As simpler is better, three predictors, four at most, would probably be the best model.

Hold out set

The major advantage of these methods over the above adjustment methods is that you don't need to know:

  • d: the number of parameters (ie model size). For instance, for shrinkage methods (like ridge regression and Lasso), it's not at all clear what d is.
  • <math>\hat{\sigma}^2</math>: the estimate of the error variance. It is usually obtained from the full model, which requires n to be bigger than p. If p is bigger than n, we can't fit a full model in a useful way, because it totally saturates and gives a training error of 0. With a hold out set, n does not need to be bigger than p.

Both d and sigma squared are hard to pin down, and the hold out set methods relieve the worry of having to come up with good estimates of them. Cross-validation or a validation set can always be used, no matter how wacky the model is. Rather than adjusting the training error, they estimate the test error directly.

We compute the validation set error or the cross-validation error for each model <math>M_k</math> under consideration, and then select the k for which the resulting estimated test error is smallest. The validation error as a function of k will be what we use to estimate prediction error and to choose the model size.
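The procedure can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: only two candidate models are compared (<math>M_0</math>, intercept only, and <math>M_1</math>, one predictor), the data are simulated, and the helper names `fit_mean`, `fit_line` and `cv_error` are made up for the example.

```python
import random


def fit_mean(xs, ys):
    """M_0: intercept-only model, predicts the training mean of y."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """M_1: simple linear regression fitted by least squares."""
    n = len(ys)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def cv_error(fit, xs, ys, n_folds=5):
    """Mean squared prediction error, averaged over the held-out folds."""
    fold_errors = []
    for fold in range(n_folds):
        # Every n_folds-th observation, starting at `fold`, is held out
        test_idx = set(range(fold, len(xs), n_folds))
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        predict = fit(train_x, train_y)  # fit on the training part only
        errors = [(ys[i] - predict(xs[i])) ** 2 for i in test_idx]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / n_folds


# Simulated data with a strong linear signal
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(60)]
ys = [3 * x + rng.gauss(0, 1) for x in xs]

models = {0: fit_mean, 1: fit_line}
cv_errors = {k: cv_error(fit, xs, ys) for k, fit in models.items()}
best_k = min(cv_errors, key=cv_errors.get)  # k with the smallest estimated test error
print(cv_errors, best_k)
```

Neither d nor an estimate of sigma squared appears anywhere: each model only has to be able to fit on the training part and predict on the held-out part.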

The one standard error rule does not choose the minimum but takes the (simplest|smallest) model for which the estimated test error is within one standard error of the lowest point on the curve.

As the cross-validation error is just an average over the k folds, it has a standard error. In fact, we acknowledge that the curves have variation because the estimated errors are random variables, just like the data are. If the models are within one standard error of each other, their errors are almost the same and we'd rather have the simpler model.
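A minimal sketch of the one standard error rule, assuming the per-fold errors of every model size are already available. The fold errors and the function name `one_standard_error_choice` are hypothetical; the standard error is taken as the sample standard deviation of the fold errors divided by the square root of the number of folds.

```python
import statistics


def one_standard_error_choice(fold_errors_by_size):
    """Return (size chosen by the one standard error rule, size with minimum CV error)."""
    summary = {}
    for size, errors in fold_errors_by_size.items():
        mean = statistics.mean(errors)
        std_error = statistics.stdev(errors) / len(errors) ** 0.5
        summary[size] = (mean, std_error)

    # Model with the lowest cross-validation error
    best_size = min(summary, key=lambda size: summary[size][0])
    threshold = summary[best_size][0] + summary[best_size][1]

    # Smallest model whose error is within one standard error of the minimum
    chosen = min(size for size, (mean, _) in summary.items() if mean <= threshold)
    return chosen, best_size


# Hypothetical per-fold errors (5 folds) for model sizes 1 to 5
fold_errors_by_size = {
    1: [5.0, 5.4, 4.8, 5.2, 5.1],
    2: [2.1, 2.5, 1.9, 2.3, 2.2],
    3: [1.25, 1.65, 1.05, 1.45, 1.35],
    4: [1.2, 1.6, 1.0, 1.4, 1.3],
    5: [1.22, 1.62, 1.02, 1.42, 1.32],
}
chosen, best = one_standard_error_choice(fold_errors_by_size)
print(chosen, best)
```

Here the minimum is reached with four predictors, but the three-predictor model is within one standard error of it, so the rule picks the smaller model.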

Cross-Validation is similar to repeating the Validation set approach multiple times and then averaging the answers. As such, we expect it to have lower variance than the Validation set method. This is why Cross-Validation is appealing (especially for small n).
