R - Feature Selection - Indirect Model Selection

About

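In a feature selection process, once you have generated all possible models, you have to select the best one. This article covers the indirect methods, which adjust the training error with a penalty for model size (Cp, BIC, adjusted R squared) instead of estimating the test error directly.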

Model Selection

Adjustment formula

We will select the model using Cp, but as you can see below, the summary of the regsubsets object created in the model generation step also contains the adjusted R squared and the BIC:

names(myPathOfModel.summary)
[1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"   
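For readers who do not have the object from the model generation step at hand, here is a minimal sketch of how a comparable summary could be built. It assumes the leaps package is installed and uses the built-in mtcars data set as a stand-in for the article's data; only the variable names myPathOfModel and myPathOfModel.summary come from the article.

```r
# Assumption: the 'leaps' package is installed; mtcars is a stand-in data set.
library(leaps)

# Best subset selection of mpg against the 10 other columns.
# nvmax = 10 keeps the best model of every size from 1 to 10 variables.
myPathOfModel <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)

# The summary holds one value of each statistic per model size.
myPathOfModel.summary <- summary(myPathOfModel)
names(myPathOfModel.summary)
```

With a different data set, the number of models and the values of each statistic will differ from the outputs shown in this article.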

For each model size, the summary stores these statistics for the best subset model of that size.

You use this data to plot the models.

Function

The idea here is to pick the model with the lowest Cp. You can identify it on the Cp plot or with the following function:

which.min(myPathOfModel.summary$cp)
[1] 10

In this case, the model with 10 variables has the smallest Cp.
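The same lookup can be applied to the other two statistics of the summary. The sketch below assumes the leaps package and the built-in mtcars data set (neither is the article's data): Cp and BIC are minimized, while adjusted R squared is maximized.

```r
# Assumption: 'leaps' is installed; mtcars is a stand-in data set.
library(leaps)

myPathOfModel <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)
myPathOfModel.summary <- summary(myPathOfModel)

# Model size preferred by each criterion.
best <- c(
  cp    = which.min(myPathOfModel.summary$cp),    # smaller is better
  bic   = which.min(myPathOfModel.summary$bic),   # smaller is better
  adjr2 = which.max(myPathOfModel.summary$adjr2)  # larger is better
)
best
```

The three criteria do not have to agree, because they penalize model size differently.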

Plot

Cp

Cp is an estimate of prediction error.
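To make the statistic concrete, here is a sketch that computes the classical Mallows' Cp by hand with base R, as Cp = RSS / sigma2 - n + 2p, where sigma2 is the residual variance of the full model and p counts the coefficients, intercept included. The helper mallows_cp, the mtcars data set and the chosen predictors are illustrative assumptions, and the values are not guaranteed to match those reported by a given package.

```r
# Hypothetical helper: classical Mallows' Cp of a sub-model,
# using the full model to estimate the error variance.
mallows_cp <- function(sub_model, full_model) {
  n      <- nobs(full_model)
  p      <- length(coef(sub_model))  # coefficients, intercept included
  sigma2 <- sum(residuals(full_model)^2) / df.residual(full_model)
  rss    <- sum(residuals(sub_model)^2)
  rss / sigma2 - n + 2 * p
}

# mtcars is a stand-in data set; wt and hp are arbitrary predictors.
full  <- lm(mpg ~ ., data = mtcars)
small <- lm(mpg ~ wt + hp, data = mtcars)

cp_small <- mallows_cp(small, full)
cp_full  <- mallows_cp(full, full)  # by construction, equals the coefficient count
```

A sub-model whose Cp is close to or below its number of coefficients fits about as well as the full model while using fewer variables.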

plot(myPathOfModel.summary$cp,xlab="Number of Variables",ylab="Cp")

You can also plot the best point:

points(10,myPathOfModel.summary$cp[10],pch=20,col="red")


Cp Model by variables

plot(myPathOfModel,scale="Cp")


This plot gives a quick summary of all the models by variables, as opposed to just seeing the Cp statistics.

  • each row is one model, labeled on the y axis with its Cp value; the best model (smallest Cp) is in the top row and Cp gets worse (larger) toward the bottom
  • the variables are on the x axis
  • a black square indicates that the variable is in the model and a white square indicates that it is out
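The information drawn in this plot can also be read as data. The sketch below assumes the leaps package and the built-in mtcars data set (not the article's data): the which element of the summary is a logical matrix with one row per model and one column per variable, and coef returns the fitted coefficients of a given model.

```r
# Assumption: 'leaps' is installed; mtcars is a stand-in data set.
library(leaps)

myPathOfModel <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)
myPathOfModel.summary <- summary(myPathOfModel)

# Size of the model with the smallest Cp.
best <- which.min(myPathOfModel.summary$cp)

# TRUE where a variable is in that model (a black square in the plot),
# FALSE where it is out (a white square). The intercept is always in.
myPathOfModel.summary$which[best, ]

# Coefficients of the selected model.
coef(myPathOfModel, id = best)
```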
