About
Feature selection in R through model generation and selection with a direct approach using:
- a validation set
- and cross-validation
Method
Validation Set
We pick a subset of the observations, set them aside as a validation set, and use the remaining observations as the training set.
- Let's make a training set and a validation set
# for reproducibility, we'll set the random number seed.
set.seed(1)
sampleSize=dim(myDataFrame)[1]
trainingSetSize=floor(2/3*sampleSize) # floor() because sample() needs an integer size
# seq(sampleSize) creates the sequence 1, 2, 3, ... up to sampleSize
# trainingSetSize observations are chosen at random from that sequence.
trainingIndex=sample(seq(sampleSize),trainingSetSize,replace=FALSE)
- Generate all the models (here with the forward stepwise method)
library(leaps)
numberOfModel=dim(myDataFrame)[2]-1 # All variables -1 to exclude the response variable
trainingData=myDataFrame[trainingIndex,]
models=regsubsets(Response~.,data=trainingData,nvmax=numberOfModel,method="forward")
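- To check which variable enters the model at each step (a quick inspection, not required for the workflow below), the summary of the regsubsets object can be examined:
# summary() on a regsubsets object returns, among other things, a logical matrix
# with one row per model size and one column per variable (TRUE = variable included)
modelsSummary=summary(models)
modelsSummary$which
# training RSS for each model size
modelsSummary$rss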
- Calculate the test error for each model on the test set
# Initialize a vector to contain all test errors
testErrors=rep(NA,numberOfModel)
# Validation (test) data set, used to compute the predictions for each model by hand
testData=myDataFrame[-trainingIndex,] # all observations that are not in the training set (negative indexing)
# model.matrix builds the design matrix from the same formula that was used to fit the models
# (Response ~ ., i.e. the response modeled as a function of all the other variables in the data frame)
testDataMatrix=model.matrix(Response~.,data=testData)
# For each model, calculate the prediction and the test errors
for(i in 1:numberOfModel){
# Get the coefficients of model i
coefi=coef(models,id=i)
# Calculate the prediction
# To get the right columns of testDataMatrix, we index them by the names in the coefficient vector.
# That is the subset of columns of testDataMatrix that corresponds to the variables in the coefficient vector.
pred=testDataMatrix[,names(coefi)]%*%coefi
# Calculate the mean squared error
testErrors[i]=mean((myDataFrame$Response[-trainingIndex]-pred)^2)
}
- Plot the errors
# Plot the (validation|test) errors: the root mean squared error
plot(sqrt(testErrors),ylab="Root MSE",ylim=c(300,400),pch=19,type="b")
# Plot the training root MSE (computed from the RSS)
points(sqrt(models$rss[-1]/nrow(trainingData)),col="blue",pch=19,type="b") # [-1] to skip the null model
# Legend
legend("topright",legend=c("Training","Validation"),col=c("blue","black"),pch=19)
- The validation error curve is a little noisy (slightly jumpy) because it is estimated on a relatively small validation set.
- The training root mean squared error (computed from the RSS) decreases monotonically, as it must: at each step, forward stepwise adds the variable that improves the fit the most, so the residual sum of squares on the training data can only go down.
- The plotting character pch=19 gives a solid dot, which is easy to see on screen.
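- As a follow-up sketch (not part of the original walk-through), the validation errors can be used to pick a model size, and the chosen model can then be refitted on the full data set with the same forward stepwise call to get its coefficients:
# Model size with the smallest validation error
bestSize=which.min(testErrors)
# Refit forward stepwise on all the data and extract the coefficients of the best size
bestModels=regsubsets(Response~.,data=myDataFrame,nvmax=numberOfModel,method="forward")
coef(bestModels,id=bestSize)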
Cross-Validation
With K=10 (10-fold cross-validation):
- Create a vector of fold numbers from 1 to K, one per row of the data: each observation is randomly assigned to a fold.
set.seed(1)
K=10
folds=sample(rep(1:K,length=nrow(myDataFrame)))
[1] 3 5 9 7 6 9 5 1 3 3 2 4 1 7 5 2 3 5 3 3 7 6 7 4 6 9 1 3 3 3
[31] 6 7 8 6 9 1 3 3 10 8 7 8 3 3 8 6 6 4 3 5 6 1 10 7 7 1 9 4 5 9
[61] 5 6 1 8 3 9 1 1 2 6 6 7 7 8 2 2 5 10 2 6 9 1 9 3 8 5 10 2 1 2
[91] 5 7 1 8 8 2 4 3 2 6 8 2 3 8 8 5 1 2 8 4 5 7 6 9 9 4 5 2 7 4
[121] 6 5 6 6 9 9 7 6 10 1 3 2 8 4 1 9 5 4 9 8 2 6 10 10 7 2 9 4 5 1
[151] 3 4 8 3 10 1 9 2 1 8 10 1 7 7 9 4 5 10 4 2 1 3 10 4 4 9 10 7 4 4
[181] 10 3 3 3 5 5 2 10 10 4 9 9 3 2 10 3 8 6 5 2 5 2 2 9 8 6 8 4 6 6
[211] 7 5 2 4 6 6 10 8 8 1 7 4 10 8 10 9 4 10 8 1 10 10 1 4 7 6 1 5 5 10
[241] 5 7 10 9 1 10 2 2 8 7 1 9 7 7 1 4 8 7 9 2 4 10 5
- The table function gives the number of observations in each fold. The assignment is fairly balanced: each fold contains either 26 or 27 observations.
table(folds)
1 2 3 4 5 6 7 8 9 10
27 27 27 26 26 26 26 26 26 26
numberOfVariables=ncol(myDataFrame)-1 # minus 1 to exclude the response variable
# errorsMatrix to save the errors for each fold and for each model size
errorsMatrix=matrix(NA,K,numberOfVariables)
# Loop through the folds
for(k in 1:K){
# Print the progression
cat("Fold",k," Forward Stepwise Generation \n")
# The training index (the observations that will be used to train the models)
trainingIndex=(folds!=k)
# Generate the models with the forward method (one for each size up to numberOfVariables)
subsetModels=regsubsets(Response~.,data=myDataFrame[trainingIndex,],nvmax=numberOfVariables,method="forward")
# For each subset model (numberOfVariables of them), calculate the predictions and the errors
for(i in 1:numberOfVariables){
testIndex=(folds==k) # the observations of fold k, i.e. !trainingIndex
# Predictions with the function in the annex (see below)
pred=predict(subsetModels,myDataFrame[testIndex,],id=i)
# compute and save the mean squared error of the predictions
errorsMatrix[k,i]=mean((myDataFrame$Response[testIndex]-pred)^2)
}
}
- Plot
# root mean squared error
# apply the mean function to the columns (2 indicates columns) and take the square root.
rmse=sqrt(apply(errorsMatrix,2,mean))
# show the tenfold cross validation curve.
plot(rmse,pch=19,type="b")
- The cross-validation curve is a little smoother than the validation-set curve because each point averages the errors over the ten folds, so it is effectively computed on the full data set.
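- A possible sketch of how to use the cross-validation curve (reusing the objects defined above): pick the model size with the smallest cross-validated error, optionally look at its standard error across folds, and refit on the full data set.
# Model size with the smallest cross-validated RMSE
bestSize=which.min(rmse)
# Rough standard error of the RMSE at each size, computed from the per-fold values
se=apply(sqrt(errorsMatrix),2,sd)/sqrt(K)
# Refit on the full data set and extract the coefficients of the chosen size
finalModels=regsubsets(Response~.,data=myDataFrame,nvmax=numberOfVariables,method="forward")
coef(finalModels,id=bestSize)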
Annex
Predict Method for regsubsets
# object = regsubsets object
# newdata = data frame
# id = model id
predict.regsubsets=function(object,newdata,id,...){
# Every regsubsets object has a call component that stores how the object was created; its second element is the formula.
form=as.formula(object$call[[2]])
mat=model.matrix(form,newdata)
coefi=coef(object,id=id)
mat[,names(coefi)]%*%coefi
}
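With this function defined, a plain predict call on a regsubsets object dispatches to it through R's S3 mechanism. For example (reusing the objects from the cross-validation loop above):
# Predictions of the 5-variable model on the held-out fold
predict(subsetModels,myDataFrame[testIndex,],id=5)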