About
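In a feature selection process, once you have generated all the candidate models, you have to select the best one. This article covers the indirect methods, which adjust the training error with a penalty for model size (Cp, BIC, adjusted R squared) instead of estimating the test error directly.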
Articles Related
Model Selection
Adjustment formula
We will select the model using Cp but, as you can see below, the summary of the regsubsets object created in the model generation step also contains the adjusted R squared and the BIC:
names(myPathOfModel.summary)
[1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
For each model size, the summary of the best subset models holds the following statistics:
- the adjusted r squared,
- the Cp statistic,
- the BIC statistic.
You can use these values to compare the models, for example by plotting them.
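The three statistics do not always agree on the best model size, so it can help to look at them side by side. The following is a minimal sketch, assuming the leaps package is installed; it uses the built-in mtcars data and a hypothetical object named fit as a stand-in for the model generation step, not the data behind myPathOfModel.

```r
library(leaps)  # assumption: the leaps package is installed

# Hypothetical stand-in for the model generation step (built-in mtcars data)
fit <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)
fit.summary <- summary(fit)

# Model size preferred by each criterion:
# adjusted R squared is maximised, Cp and BIC are minimised
best <- c(adjr2 = which.max(fit.summary$adjr2),
          cp    = which.min(fit.summary$cp),
          bic   = which.min(fit.summary$bic))
print(best)
```

BIC penalises model size more heavily than Cp, so it tends to pick a model that is the same size or smaller.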
Function
The idea here is to pick the model with the lowest Cp. You can identify it either by reading the Cp plot or with the following function:
which.min(myPathOfModel.summary$cp)
[1] 10
In this case, the model with 10 variables has the smallest Cp.
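Once the size is known, the variables and coefficients of the selected model can be read with coef. This is a small sketch under the same assumptions as before (leaps installed, built-in mtcars data, hypothetical object names), so the selected size will differ from the 10 found above.

```r
library(leaps)  # assumption: the leaps package is installed

# Hypothetical stand-in for the model generation step (built-in mtcars data)
fit <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)

# Size of the model with the lowest Cp
best.size <- which.min(summary(fit)$cp)

# Coefficients (intercept included) of the best model of that size
best.coef <- coef(fit, best.size)
print(best.coef)
```

With the object from this article, the equivalent call would be coef(myPathOfModel, 10).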
Plot
Cp
Cp is an estimate of prediction error.
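For reference, the usual textbook forms of Cp for a least squares model with d predictors, fitted on n observations, are given below. They are stated here as background rather than taken from the leaps source: RSS is the residual sum of squares of the model and \hat{\sigma}^2 is an estimate of the error variance, typically taken from the full model. The values returned by regsubsets appear to be on the scale of the second (Mallows) form.

```latex
% Penalised training error form
C_p = \frac{1}{n}\left(\mathrm{RSS} + 2\,d\,\hat{\sigma}^2\right)

% Mallows' form
C_p' = \frac{\mathrm{RSS}}{\hat{\sigma}^2} + 2d - n

% Relationship between the two
C_p = \frac{\hat{\sigma}^2}{n}\left(C_p' + n\right)
```

Because the two forms are related by an increasing linear transformation, they are minimised by the same model.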
plot(myPathOfModel.summary$cp,xlab="Number of Variables",ylab="Cp")
You can also mark the best point (the minimum found above) in red:
points(10,myPathOfModel.summary$cp[10],pch=20,col="red")
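The index 10 is hard-coded in the call above. A variant that looks up the minimum itself could be sketched as follows, again assuming the leaps package is installed and using the built-in mtcars data with hypothetical object names.

```r
library(leaps)  # assumption: the leaps package is installed

# Hypothetical stand-in for the model generation step (built-in mtcars data)
fit <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)
cp <- summary(fit)$cp

# Position of the minimum, computed instead of hard-coded
best.size <- which.min(cp)

# Cp against the number of variables, with the minimum marked in red
plot(cp, xlab = "Number of Variables", ylab = "Cp", type = "b")
points(best.size, cp[best.size], pch = 20, col = "red")
```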
Cp Model by variables
plot(myPathOfModel,scale="Cp")
This plot gives a quick summary of all the models by variables, as opposed to just seeing the Cp statistics.
- each row is one model, labelled on the y axis with its Cp value; the rows are sorted from the best model at the top (smallest Cp, small is good) to the worst at the bottom
- the variables are on the x axis
- a black square indicates that the variable is in the model and a white square indicates that it is out