## About

Unlike best-subset and forward-stepwise regression, which control the complexity of a model by restricting the number of variables, ridge regression keeps all the variables in and shrinks the coefficients towards zero.
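In standard notation (not spelled out in the original text), ridge regression solves the penalized least-squares problem below. The tuning parameter lambda >= 0 controls the amount of shrinkage: lambda = 0 gives ordinary least squares, and large lambda drives all coefficients towards zero.

```latex
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 \;+\; \lambda \sum_{j=1}^{p}\beta_j^2
```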

## Steps

### Missing values

There are some missing values here, so before we proceed we will remove them.

```
library(ISLR)                     # contains the Hitters data set
Hitters=na.omit(Hitters)          # drop the rows with missing values
with(Hitters,sum(is.na(Salary)))  # check: now returns 0
```

The ISLR package contains the data set.

### Formula parameters

As the glmnet package does not use the model-formula language, we are required to give it:

- a matrix x of predictors
- a response vector y

```
library(glmnet)
x=model.matrix(Salary~.-1,data=Hitters)  # predictor matrix, factors expanded to dummy columns
y=Hitters$Salary                         # response vector
```

### Ridge

Ridge regression performs:

- shrinkage
- but not variable selection.

#### Fitting

The ridge-regression model is fitted by calling the glmnet function with alpha=0 (when alpha equals 1, you fit a lasso model). For alpha between 0 and 1, you get what are called elastic-net models, which are in between ridge and lasso.
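In glmnet's notation, alpha mixes the two penalties. This is the objective from the glmnet documentation, written out here for reference: alpha=0 leaves only the squared (ridge) penalty, and alpha=1 leaves only the absolute-value (lasso) penalty.

```latex
\min_{\beta_0,\beta}\; \frac{1}{2n}\sum_{i=1}^{n}\big(y_i - \beta_0 - x_i^{\top}\beta\big)^2 \;+\; \lambda\Big[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 \;+\; \alpha\,\lVert\beta\rVert_1\Big]
```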

```
fit.ridge=glmnet(x,y,alpha=0)
plot(fit.ridge,xvar="lambda",label=TRUE)
```

It plots the coefficients as a function of log lambda. glmnet fits a whole path of models on a grid of values of lambda (about 100 values of lambda).

When log of lambda is 12, all the coefficients are essentially zero. Then, as we relax lambda, the coefficients grow away from zero in a nice smooth way, and the sum of squares of the coefficients gets bigger and bigger until we reach a point where lambda is effectively zero and the coefficients are unregularized.
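This shrinkage behaviour can be seen directly from the closed-form ridge solution. The sketch below uses simulated data (all names and values are illustrative, not part of the Hitters example): as lambda grows, the sum of squares of the coefficients shrinks.

```r
# Sketch: ridge coefficients via the closed form (X'X + lambda*I)^{-1} X'y
# on simulated data -- illustrative only, not the Hitters example.
set.seed(42)
n <- 50; p <- 5
X <- matrix(rnorm(n * p), n, p)
y.sim <- drop(X %*% c(3, -2, 1, 0.5, 0) + rnorm(n))

ridge_coef <- function(X, y, lambda) {
  solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)
}

b_small <- ridge_coef(X, y.sim, lambda = 0.01)  # nearly unregularized
b_large <- ridge_coef(X, y.sim, lambda = 1000)  # heavily shrunk
sum(b_small^2)  # large: close to the least-squares fit
sum(b_large^2)  # much smaller: coefficients pulled towards zero
```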

#### Model Selection

Ridge regression gives a whole path of models, and we need to pick one. glmnet has a built-in function, cv.glmnet, that will do cross-validation.

```
cv.ridge=cv.glmnet(x,y,alpha=0)
plot(cv.ridge)
```

In the beginning, the mean squared error is very high because the coefficients are restricted to be too small; then at some point it levels off. This seems to indicate that the full model is doing a good job.

There are two vertical lines:

- one is at the minimum of the cross-validation error,
- the other is within one standard error of the minimum. The second line marks a slightly more restricted model that does almost as well as the minimum, and sometimes we'll go for that.

At the top of the plot, you can see how many non-zero coefficients are in the model. Here all 20 variables are in the model (the 19 predictors expand to 20 model-matrix columns because the League factor contributes two dummy columns) and no coefficient is zero.

### Lasso

The lasso performs:

- shrinkage
- and variable selection.
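The difference is easiest to see in the orthonormal-design case, where both estimators have closed forms: ridge scales each least-squares coefficient by 1/(1 + lambda), while the lasso soft-thresholds it. The numbers below are made up for illustration.

```r
# Sketch: ridge rescales, lasso soft-thresholds (orthonormal-design case).
b_ols <- c(3, -2, 0.4, -0.1)  # hypothetical least-squares coefficients
lam   <- 0.5

ridge_b <- b_ols / (1 + lam)                        # shrunk, but never exactly zero
lasso_b <- sign(b_ols) * pmax(abs(b_ols) - lam, 0)  # small ones set exactly to zero

ridge_b  # all four coefficients still non-zero
lasso_b  # the two small coefficients are exactly zero: variable selection
```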

#### Fit

To fit the lasso model, specify alpha=1 in the fitting function (or omit it, since it is the default):

```
fit.lasso=glmnet(x,y,alpha=1)
plot(fit.lasso,xvar="lambda",label=TRUE)
```

#### Plot

The plot method has various options (see the help file). The dev option shows the percentage of deviance explained (the equivalent of R-squared in the case of regression):

`plot(fit.lasso,xvar="dev",label=TRUE)`

A lot of the R-squared was explained for quite heavily shrunk coefficients. Towards the end, for a relatively small increase in R-squared from between 0.4 and 0.5, the coefficients grow very large. This may be an indication that the end of the path is overfitting.

There are different ways of plotting the coefficients, and they give us different information about the coefficients and the nature of the path.

#### Model Selection

This section shows how to do model selection:

- with the built-in cross validation procedure
- or with the help of two sets (training/validation)

##### Cross validation

```
cv.lasso=cv.glmnet(x,y)
plot(cv.lasso)
```
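Under the hood, cv.glmnet is doing something like the following sketch: split the data into K folds, fit on K-1 folds for each lambda on a grid, and average the held-out error. The sketch uses simulated data and the closed-form ridge solution, so every name and value in it is illustrative.

```r
# Sketch of K-fold cross validation over a lambda grid (simulated data,
# closed-form ridge fits -- illustrative of what cv.glmnet automates).
set.seed(1)
n <- 60; p <- 4; K <- 5
X <- matrix(rnorm(n * p), n, p)
y.sim <- drop(X %*% c(2, -1, 0.5, 0) + rnorm(n))
lambdas <- c(0.01, 0.1, 1, 10, 100)
folds <- sample(rep(1:K, length.out = n))      # random fold assignment

cv_mse <- sapply(lambdas, function(lam) {
  mean(sapply(1:K, function(k) {
    tr <- folds != k                           # train on K-1 folds
    b <- solve(t(X[tr, ]) %*% X[tr, ] + lam * diag(p),
               t(X[tr, ]) %*% y.sim[tr])
    mean((y.sim[!tr] - X[!tr, ] %*% b)^2)      # error on the held-out fold
  }))
})
best.lambda <- lambdas[which.min(cv_mse)]      # the "minimum" vertical line
```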

The fit.lasso object contains the whole path of coefficients: roughly 100 coefficient vectors, indexed by different values of lambda.

**Coefficient Extraction**

There is a coefficient extractor function that works on a cross-validation object and picks the coefficient vector corresponding to the best model:

`coef(cv.lasso)`

The output below has 6 non-zero coefficients, which shows that the function has chosen the second vertical line on the cross-validation plot (the one within one standard error of the minimum), because the cross-validation error is measured with some variance.

```
21 x 1 sparse Matrix of class "dgCMatrix"
                       1
(Intercept) 127.95694754
AtBat         .
Hits          1.42342566
HmRun         .
Runs          .
RBI           .
Walks         1.58214111
Years         .
CAtBat        .
CHits         .
CHmRun        .
CRuns         0.16027975
CRBI          0.33667715
CWalks        .
LeagueA       .
LeagueN       .
DivisionW    -8.06171262
PutOuts       0.08393604
Assists       .
Errors        .
NewLeagueN    .
```

##### Train/validation set

```
set.seed(101)
train=sample(seq(nrow(x)),180,replace=FALSE)  # training indices: 180 of the 263 complete cases
lasso.tr=glmnet(x[train,],y[train])
lasso.tr
pred=predict(lasso.tr,x[-train,])             # predictions on the validation set, one column per lambda
dim(pred)
rmse=sqrt(apply((y[-train]-pred)^2,2,mean))   # validation RMSE for each lambda
plot(log(lasso.tr$lambda),rmse,type="b",xlab="Log(lambda)")
lam.best=lasso.tr$lambda[order(rmse)[1]]      # lambda with the smallest validation RMSE
lam.best
coef(lasso.tr,s=lam.best)
```