Logistic regression in R
We fit a logistic regression with a call to glm(), specifying family = binomial:
logisticRegressionModel = glm(response ~ variable1 + variable2 + ... + variableN, data = dataframe, family = binomial)
Call:
glm(formula = Response ~ Variable1 + Variable2, family = binomial, data = dataframe)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
 -1.446  -1.203   1.065   1.145   1.326

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.126000   0.240736  -0.523    0.601
Variable1   -0.073074   0.050167  -1.457    0.145
Variable2   -0.042301   0.050086  -0.845    0.398

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1731.2  on 1249  degrees of freedom
Residual deviance: 1727.6  on 1243  degrees of freedom
AIC: 1741.6

Number of Fisher Scoring iterations: 3
Note that the drop from the null deviance (1731.2) to the residual deviance (1727.6) is very modest, which suggests these variables add little explanatory power.
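How modest that drop is can be made precise with a likelihood-ratio test; a minimal sketch using only the deviances and degrees of freedom printed in the summary above:

```r
# Likelihood-ratio (chi-squared) test of the fitted model against the
# null model, using the values reported in the summary output.
nullDeviance     <- 1731.2   # on 1249 degrees of freedom
residualDeviance <- 1727.6   # on 1243 degrees of freedom
devianceDrop <- nullDeviance - residualDeviance   # 3.6
dfDrop       <- 1249 - 1243                       # 6
pValue <- pchisq(devianceDrop, df = dfDrop, lower.tail = FALSE)
pValue   # about 0.73: the predictors do not significantly improve the fit
```

A p-value this large means the reduction in deviance is entirely consistent with noise.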
# predict() gives a vector of fitted probabilities.
probabilities = predict(logisticRegressionModel, type = "response")
# Look at the first five
probabilities[1:5]
        1         2         3         4         5
0.5070841 0.4814679 0.4811388 0.5152224 0.5107812
In this case the fitted probabilities are all close to 0.5 (rather than near 0 or 1), which indicates the model makes no strong predictions.
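This clustering near 0.5 is easy to see directly. A sketch with simulated data standing in for the original dataframe (the names and coefficients below are illustrative, not from the original data): when the predictors carry almost no signal, the fitted probabilities occupy a narrow band around 0.5.

```r
# Simulated stand-in for the original data: two very weak predictors.
set.seed(1)
n  <- 1250
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-0.07 * x1 - 0.04 * x2))  # weak signal
fit <- glm(y ~ x1 + x2, family = binomial)
p   <- predict(fit, type = "response")
summary(p)   # the fitted probabilities stay near 0.5
</code>
```

Inspecting range(p) or hist(p) is a quick sanity check before trusting any classification built on the model.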
# probabilities > 0.5 returns a vector of TRUEs and FALSEs.
estimatedResponses = ifelse(probabilities > 0.5, "True", "False")
A table (confusion matrix) of the estimated responses (estimatedResponses) against the true responses (trueResponses) can then be made:
table(estimatedResponses,trueResponses)
                  trueResponses
estimatedResponses False True
             False   145  141
             True    457  507
mean(estimatedResponses==trueResponses)
[1] 0.5216
We do slightly better than chance.
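"Chance" here can be made precise by comparing against the null rate, i.e. always predicting the majority class. A sketch rebuilt from the confusion matrix above (the matrix values are taken from the table; the variable names are illustrative):

```r
# Rebuild the confusion matrix printed above and compare the model's
# accuracy with the majority-class (null) rate.
confusion <- matrix(c(145, 457, 141, 507), nrow = 2,
                    dimnames = list(estimated = c("False", "True"),
                                    true      = c("False", "True")))
accuracy <- sum(diag(confusion)) / sum(confusion)    # correct / total
nullRate <- max(colSums(confusion)) / sum(confusion) # always guess majority
accuracy   # 0.5216
nullRate   # 0.5184
```

The model's 52.16% accuracy only barely beats the 51.84% obtained by always predicting the more common class.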
Since we may have overfit, we split the data set into training and test sets.
train = variable < 2005
logisticTrainRegressionModel = glm(response ~ variable1 + variable2 + ... + variableN, data = dataframe, family = binomial, subset = train)
# predict() on the held-out rows gives fitted probabilities for the test set.
testProbabilities = predict(logisticTrainRegressionModel, type = "response", newdata = dataframe[!train, ])
testEstimatedResponses = ifelse(testProbabilities > 0.5, "True", "False")
table(testEstimatedResponses, trueResponses[!train])
mean(testEstimatedResponses == trueResponses[!train])
[1] 0.4801587
We're doing worse than the null rate, which is 50%.
The drop in accuracy on held-out data suggests we were overfitting. That does not necessarily mean the model cannot make reasonable predictions; it may simply be that the variables are highly correlated. To investigate, we can fit a smaller model (i.e., one with fewer variables in the regression).
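A sketch of that step, with simulated data standing in for the original dataframe (all names and coefficients below are illustrative): two highly correlated predictors are fit on a training split, then a smaller model dropping one of them is compared on the held-out rows.

```r
# Simulated stand-in: x2 nearly duplicates x1, so the full model
# carries a redundant, collinear predictor.
set.seed(2)
n  <- 1250
x1 <- rnorm(n)
x2 <- 0.9 * x1 + 0.1 * rnorm(n)
y  <- rbinom(n, 1, plogis(0.2 * x1))
df <- data.frame(x1, x2, y)
train <- seq_len(n) <= 1000          # simple train/test split

fullModel  <- glm(y ~ x1 + x2, data = df, family = binomial, subset = train)
smallModel <- glm(y ~ x1,      data = df, family = binomial, subset = train)

# Test-set accuracy for either model (hypothetical helper).
testAccuracy <- function(model) {
  p <- predict(model, newdata = df[!train, ], type = "response")
  mean((p > 0.5) == (df$y[!train] == 1))
}
testAccuracy(fullModel)
testAccuracy(smallModel)
```

Checking cor(df$x1, df$x2) confirms the collinearity that motivates dropping a variable; if the smaller model's test accuracy holds up, the extra variable was not earning its keep.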