About
Logistic regression in R
Steps
Model
We make a call to glm() where we give:
- the response (e.g. the direction),
- the predictors,
- and family = binomial. This argument tells glm() to fit a logistic regression model instead of one of the many other models that glm() can fit.
logisticRegressionModel = glm(response ~ variable1 + variable2 + ... + variableN, data = dataframe, family = binomial)
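As a minimal runnable sketch of this template (the data frame, variable names, and the coefficient 0.1 below are made up for illustration):
# Hypothetical simulated data, only to make the template runnable
set.seed(1)
n = 1250
dataframe = data.frame(variable1 = rnorm(n), variable2 = rnorm(n))
dataframe$response = rbinom(n, 1, plogis(0.1 * dataframe$variable1))
# Fit the logistic regression: family = binomial selects logistic regression
logisticRegressionModel = glm(response ~ variable1 + variable2, data = dataframe, family = binomial)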
Summary
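The output below is produced by calling summary() on the fitted model:
summary(logisticRegressionModel)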
Call:
glm(formula = Response ~ Variable1 + Variable2, family = binomial,
    data = dataframe)

Deviance Residuals:
   Min     1Q Median     3Q    Max
-1.446 -1.203  1.065  1.145  1.326

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.126000   0.240736  -0.523    0.601
Variable1   -0.073074   0.050167  -1.457    0.145
Variable2   -0.042301   0.050086  -0.845    0.398

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1731.2  on 1249  degrees of freedom
Residual deviance: 1727.6  on 1243  degrees of freedom
AIC: 1741.6

Number of Fisher Scoring iterations: 3
where:
- the null deviance is the deviance of the null model, i.e. the model that uses just the mean (an intercept only). It is derived from the log-likelihood of that model.
- the residual deviance is the deviance of the model with all the predictors included.
In this case there is only a very modest drop in deviance (from 1731.2 to 1727.6), as tested below.
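To check whether that drop matters, the difference of deviances can be compared to a chi-square distribution (a likelihood-ratio test). A minimal sketch, assuming the model object fitted above:
# Drop in deviance from the null model to the fitted model
devianceDrop = logisticRegressionModel$null.deviance - logisticRegressionModel$deviance
dfDrop = logisticRegressionModel$df.null - logisticRegressionModel$df.residual
# Upper-tail chi-square probability: a large p-value means no significant improvement
pchisq(devianceDrop, df = dfDrop, lower.tail = FALSE)
# With the numbers above: 1731.2 - 1727.6 = 3.6 on 1249 - 1243 = 6 df, p ≈ 0.73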
Predicted Probabilities
# predict with type="response" gives a vector of fitted probabilities.
probabilities = predict(logisticRegressionModel, type = "response")
# Look at the first five
probabilities[1:5]
        1         2         3         4         5
0.5070841 0.4814679 0.4811388 0.5152224 0.5107812
In this case, the fitted probabilities are all very close to 50% (far from 0% or 100%), which indicates that the model is not making strong predictions: the predictors have little relation to the response.
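A quick way to see how spread out the fitted probabilities are (assuming the probabilities vector from above):
# Five-number summary and a histogram of the fitted probabilities
summary(probabilities)
hist(probabilities, main = "Fitted probabilities", xlab = "probability")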
Classification
- We can turn these probabilities into classifications by thresholding at 0.5 with the ifelse() function.
# probabilities > 0.5 returns a vector of TRUEs and FALSEs,
# which ifelse() turns into the labels "True" and "False".
estimatedResponses = ifelse(probabilities > 0.5, "True", "False")
Accuracy
Confusion matrix
The table (confusion matrix) of the estimated responses (estimatedResponses) against the true responses can be made with table():
table(estimatedResponses, trueResponses)
                  trueResponses
estimatedResponses False True
             False   145  141
             True    457  507
- The diagonal is where we classify correctly.
- The off-diagonal is where we make mistakes.
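The accuracy can also be read off the confusion matrix directly, by summing the diagonal. A small sketch using the table above:
confusionMatrix = table(estimatedResponses, trueResponses)
# Correct classifications (the diagonal) divided by the total count
sum(diag(confusionMatrix)) / sum(confusionMatrix)
# (145 + 507) / 1250 = 0.5216, matching the mean computed below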
Mean classification performance
mean(estimatedResponses == trueResponses)
[1] 0.5216
We do slightly better than chance.
Data Set Splits
As we may have overfit, we split the data set into a training part and a test part.
- For all observations for which the variable's value is less than 2005, we get TRUE; otherwise, we get FALSE.
train = dataframe$variable < 2005
- New Fit with only this subset
logisticTrainRegressionModel = glm(response ~ variable1 + variable2 + ... + variableN, data = dataframe, family = binomial, subset = train)
- Prediction
# predict on the held-out observations (!train) gives a vector of fitted probabilities.
trainProbabilities = predict(logisticTrainRegressionModel, type = "response", newdata = dataframe[!train, ])
- Classification
trainEstimatedResponse = ifelse(trainProbabilities > 0.5, "True", "False")
- Accuracy: confusion matrix
table(trainEstimatedResponse, trueResponses[!train])
- Mean
mean(trainEstimatedResponse == trueResponses[!train])
[1] 0.4801587
We're doing worse than the null rate, which is 50%.
Since we do worse on the test data than on the training data, we might be overfitting.
That doesn't necessarily mean the model can't make any reasonable predictions. It may just be that the variables are highly correlated. To investigate, we can fit a smaller model (i.e. one with fewer variables in the regression), as sketched below.
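As a sketch of that last step (reusing the placeholder names from above; which variables to drop is a modeling choice, not prescribed here):
# Check how correlated the predictors are with each other
cor(dataframe[, c("variable1", "variable2")])
# Refit a smaller model on the training subset, keeping fewer predictors
smallerModel = glm(response ~ variable1, data = dataframe, family = binomial, subset = train)
smallerProbabilities = predict(smallerModel, type = "response", newdata = dataframe[!train, ])
smallerEstimatedResponse = ifelse(smallerProbabilities > 0.5, "True", "False")
mean(smallerEstimatedResponse == trueResponses[!train])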