About
Unlike subset selection and forward stepwise regression, which control the complexity of a model by restricting the number of variables, ridge regression keeps all the variables in and shrinks the coefficients towards zero.
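For reference, ridge regression solves the following penalized least-squares problem (standard definition, not shown in the original article):

$$\hat{\beta}^{\text{ridge}} = \underset{\beta_0,\ \beta}{\arg\min}\ \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$

The larger the tuning parameter lambda, the more the coefficients are shrunk towards zero, but they are never set exactly to zero.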
Steps
Missing values
There are some missing values here, so before we proceed we will remove them.
library(ISLR)
# remove the rows with missing values
Hitters=na.omit(Hitters)
# check that no missing Salary values remain (should return 0)
with(Hitters,sum(is.na(Salary)))
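A quick sanity check of what is left after removing the missing values (not in the original session; Hitters has 322 rows, of which 263 have a Salary recorded):
# number of remaining observations and variables (should be 263 x 20)
dim(Hitters)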
The ISLR package contains the Hitters data set.
Formula parameters
As the glmnet package does not use the model formula language, we are required to give it:
- a matrix x of predictors
- and a response vector y
library(glmnet)
# model.matrix builds the predictor matrix and expands factors into dummy variables; -1 drops the intercept column
x=model.matrix(Salary~.-1,data=Hitters)
y=Hitters$Salary
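A quick look at the dimensions of the predictor matrix may help here (this check is an addition; the factor variables are expanded into dummy columns, giving 20 predictors):
# 263 observations by 20 predictor columns
dim(x)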
Ridge
Ridge regression does:
- shrinkage
- but not variable selection.
Fitting
The ridge-regression model is fitted by calling the glmnet function with alpha=0 (when alpha equals 1, you fit a lasso model). For values of alpha between 0 and 1, you get what is called elastic net models, which are in between ridge and lasso (a short sketch follows the ridge plot below).
# alpha=0 fits ridge regression over a grid of lambda values
fit.ridge=glmnet(x,y,alpha=0)
# coefficient paths plotted against log(lambda)
plot(fit.ridge,xvar="lambda",label=TRUE)
The plot shows the coefficients as a function of log(lambda). glmnet fits a whole path of models on a grid of values of lambda (about 100 values of lambda).
When log of lambda is 12, all the coefficients are essentially zero. Then as we relax lambda, the coefficients grow away from zero in a nice smooth way, and the sum of squares of the coefficients is getting bigger and bigger until we reach a point where lambda is effectively zero, and the coefficients are unregularized.
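As a side note on the elastic net models mentioned above, a minimal sketch (alpha=0.5 is an arbitrary illustrative mixing value, not part of the original session):
# elastic net: the penalty is a mix of the ridge and lasso penalties
fit.enet=glmnet(x,y,alpha=0.5)
plot(fit.enet,xvar="lambda",label=TRUE)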
Model Selection
Ridge regression gives a whole path of models and we need to pick one. glmnet has a built-in function, called cv.glmnet, that will do cross-validation.
# 10-fold cross-validation (the default) for the ridge fit
cv.ridge=cv.glmnet(x,y,alpha=0)
plot(cv.ridge)
In the beginning, the mean squared error is very high because the coefficients are restricted to be too small; at some point it levels off. This seems to indicate that the full model is doing a good job.
There are two vertical lines:
- one is at the minimum of the cross-validation error,
- the other is within one standard error of the minimum. This second line marks a slightly more restricted model that does almost as well as the minimum, and sometimes we'll go for that.
At the top of the plot, you can see how many non-zero coefficients are in the model. All 20 variables are in the model (the 19 predictors, with the League factor expanded into two dummy variables) and no coefficient is zero.
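The two lambdas marked by the vertical lines are stored on the cross-validation object; a short sketch of how to inspect them (standard cv.glmnet components, not shown in the original session):
# lambda giving the minimum cross-validated mean squared error
cv.ridge$lambda.min
# largest lambda within one standard error of that minimum
cv.ridge$lambda.1se
# coefficients at the one-standard-error choice
coef(cv.ridge,s="lambda.1se")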
Lasso
The lasso does:
- shrinkage
- and variable selection.
Fit
To fit the lasso model, you can specify alpha=1 to the fitting function (or omit it, since it is the default).
# alpha=1 (the default) fits the lasso
fit.lasso=glmnet(x,y,alpha=1)
plot(fit.lasso,xvar="lambda",label=TRUE)
Plot
The plot function has various choices for the x-axis (see the help file). With xvar="dev", the x-axis shows the percentage of deviance explained (the equivalent of R-squared in the case of regression).
plot(fit.lasso,xvar="dev",label=TRUE)
A lot of the R-squared is explained while the coefficients are still quite heavily shrunk. Towards the end of the path, the coefficients grow very large for a relatively small increase in R-squared (from about 0.4 to 0.5). This may be an indication that the end of the path is overfitting.
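The same information is available in tabular form: printing the fit shows, for each lambda in the path, the number of non-zero coefficients (Df) and the fraction of deviance explained:
# one row per lambda: Df, %Dev, Lambda
print(fit.lasso)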
There are different ways of plotting the coefficients, and they give us different information about the coefficients and about the nature of the path.
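A third option (not shown in the original session) is to plot against the L1 norm of the coefficient vector:
# coefficient paths against the L1 norm of the coefficients
plot(fit.lasso,xvar="norm",label=TRUE)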
Model Selection
This section shows how to do model selection:
- with the built-in cross validation procedure
- or with the help of two sets (training/validation)
Cross validation
# cross-validation for the lasso (alpha defaults to 1)
cv.lasso=cv.glmnet(x,y)
plot(cv.lasso)
The fit.lasso object contains the whole path of coefficients: roughly 100 coefficient vectors, one for each value of lambda.
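To see the size of that path, a quick check (the exact number of columns depends on where glmnet stops its lambda sequence):
# one row per coefficient (including the intercept), one column per lambda value
dim(coef(fit.lasso))
# number of lambda values actually fitted
length(fit.lasso$lambda)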
Coefficient Extraction
There's a coefficient extractor function (coef) that works on the cross-validation object and picks the coefficient vector corresponding to the best model.
# by default, coef on a cv.glmnet object returns the coefficients at lambda.1se
coef(cv.lasso)
The output below has 6 non-zero coefficients, which shows that the function has chosen the model at the second vertical line on the cross-validation plot (within one standard error of the minimum), since the cross-validation error is measured with some variance.
21 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 127.95694754
AtBat .
Hits 1.42342566
HmRun .
Runs .
RBI .
Walks 1.58214111
Years .
CAtBat .
CHits .
CHmRun .
CRuns 0.16027975
CRBI 0.33667715
CWalks .
LeagueA .
LeagueN .
DivisionW -8.06171262
PutOuts 0.08393604
Assists .
Errors .
NewLeagueN .
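If you prefer the model at the minimum of the cross-validation curve rather than the one-standard-error choice, you can ask for it explicitly (this call is not part of the original session):
# coefficients at the lambda that minimizes the cross-validated error
coef(cv.lasso,s="lambda.min")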
Train/validation set
Alternatively, we can select lambda with a separate training/validation split: fit the lasso path on a training set and pick the lambda that gives the lowest error on the left-out observations.
# create a training index, e.g. 180 of the 263 observations (the exact split is arbitrary)
set.seed(1)
train=sample(seq(nrow(Hitters)),180,replace=FALSE)
# fit the lasso path on the training data (alpha defaults to 1)
lasso.tr=glmnet(x[train,],y[train])
lasso.tr
# predict on the validation observations for every lambda in the path
pred=predict(lasso.tr,x[-train,])
dim(pred)
# root mean squared error on the validation set, one value per lambda
rmse= sqrt(apply((y[-train]-pred)^2,2,mean))
plot(log(lasso.tr$lambda),rmse,type="b",xlab="Log(lambda)")
# lambda with the smallest validation error
lam.best=lasso.tr$lambda[order(rmse)[1]]
lam.best
# coefficients of the selected model
coef(lasso.tr,s=lam.best)