About
Logistic regression in R
Steps
Model
We make a call to glm() where we give:
- the response (e.g. the direction),
- the predictors,
- and family = binomial. This argument tells glm() to fit a logistic regression model instead of one of the many other models that glm() can fit.
logisticRegressionModel = glm(response ~ variable1 + variable2 + ... + variableN, data = dataframe, family = binomial)
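As a minimal runnable sketch of this template (the data frame, variable names, and the coefficient 0.1 below are made up for illustration):
# Hypothetical simulated data, only to make the template runnable
set.seed(1)
n = 1250
dataframe = data.frame(variable1 = rnorm(n), variable2 = rnorm(n))
dataframe$response = rbinom(n, 1, plogis(0.1 * dataframe$variable1))
# Fit the logistic regression: family = binomial selects logistic regression
logisticRegressionModel = glm(response ~ variable1 + variable2, data = dataframe, family = binomial)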
Summary
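The output below is produced by calling summary() on the fitted model:
summary(logisticRegressionModel)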
Call:
glm(formula = Response ~ Variable1 + Variable2, family = binomial,
    data = dataframe)

Deviance Residuals:
   Min     1Q Median     3Q    Max
-1.446 -1.203  1.065  1.145  1.326

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.126000   0.240736  -0.523    0.601
Variable1   -0.073074   0.050167  -1.457    0.145
Variable2   -0.042301   0.050086  -0.845    0.398

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1731.2  on 1249  degrees of freedom
Residual deviance: 1727.6  on 1243  degrees of freedom
AIC: 1741.6

Number of Fisher Scoring iterations: 3
where:
- the null deviance is the deviance of the null model, i.e. the model that uses just the mean (an intercept only). It is derived from the log-likelihood of that model.
- the residual deviance is the deviance of the model with all the predictors included.
In this case there is only a very modest drop in deviance (from 1731.2 to 1727.6), as tested below.
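To check whether that drop matters, the difference of deviances can be compared to a chi-square distribution (a likelihood-ratio test). A minimal sketch, assuming the model object fitted above:
# Drop in deviance from the null model to the fitted model
devianceDrop = logisticRegressionModel$null.deviance - logisticRegressionModel$deviance
dfDrop = logisticRegressionModel$df.null - logisticRegressionModel$df.residual
# Upper-tail chi-square probability: a large p-value means no significant improvement
pchisq(devianceDrop, df = dfDrop, lower.tail = FALSE)
# With the numbers above: 1731.2 - 1727.6 = 3.6 on 1249 - 1243 = 6 df, p ≈ 0.73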
Predicted Probabilities
# predict with type="response" gives a vector of fitted probabilities.
probabilities = predict(logisticRegressionModel, type = "response")
# Look at the first five
probabilities[1:5]
        1         2         3         4         5
0.5070841 0.4814679 0.4811388 0.5152224 0.5107812
In this case, the fitted probabilities are all very close to 50% (far from 0% or 100%), which indicates that the model is not making strong predictions: the predictors have little relation to the response.
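A quick way to see how spread out the fitted probabilities are (assuming the probabilities vector from above):
# Five-number summary and a histogram of the fitted probabilities
summary(probabilities)
hist(probabilities, main = "Fitted probabilities", xlab = "probability")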
Classification
- We can turn these probabilities into classifications by thresholding at 0.5 with the ifelse() function.
# probabilities > 0.5 returns a vector of TRUEs and FALSEs,
# which ifelse() turns into the labels "True" and "False".
estimatedResponses = ifelse(probabilities > 0.5, "True", "False")
Accuracy
Confusion matrix
The table (confusion matrix) of the estimated responses (estimatedResponses) against the true responses can be made with table():
table(estimatedResponses, trueResponses)
                  trueResponses
estimatedResponses False True
             False   145  141
             True    457  507
- The diagonal is where we classify correctly.
- The off-diagonal is where we make mistakes.
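The accuracy can also be read off the confusion matrix directly, by summing the diagonal. A small sketch using the table above:
confusionMatrix = table(estimatedResponses, trueResponses)
# Correct classifications (the diagonal) divided by the total count
sum(diag(confusionMatrix)) / sum(confusionMatrix)
# (145 + 507) / 1250 = 0.5216, matching the mean computed below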
Mean classification performance
mean(estimatedResponses == trueResponses)
[1] 0.5216
We do slightly better than chance.
Data Set Splits
As we may have overfit, we split the data set into a training part and a test part.
- For all observations for which the variable's value is less than 2005, we get TRUE; otherwise, we get FALSE.
train = dataframe$variable < 2005
- New Fit with only this subset
logisticTrainRegressionModel = glm(response ~ variable1 + variable2 + ... + variableN, data = dataframe, family = binomial, subset = train)
- Prediction
# predict on the held-out observations (!train) gives a vector of fitted probabilities.
trainProbabilities = predict(logisticTrainRegressionModel, type = "response", newdata = dataframe[!train, ])
- Classification
trainEstimatedResponse = ifelse(trainProbabilities > 0.5, "True", "False")
- Accuracy: confusion matrix
table(trainEstimatedResponse, trueResponses[!train])
- Mean
mean(trainEstimatedResponse == trueResponses[!train])
[1] 0.4801587
We're doing worse than the null rate, which is 50%.
Since we do worse on the test data than on the training data, we might be overfitting.
That doesn't necessarily mean the model can't make any reasonable predictions. It may just be that the variables are highly correlated. To investigate, we can fit a smaller model (i.e. one with fewer variables in the regression), as sketched below.
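As a sketch of that last step (reusing the placeholder names from above; which variables to drop is a modeling choice, not prescribed here):
# Check how correlated the predictors are with each other
cor(dataframe[, c("variable1", "variable2")])
# Refit a smaller model on the training subset, keeping fewer predictors
smallerModel = glm(response ~ variable1, data = dataframe, family = binomial, subset = train)
smallerProbabilities = predict(smallerModel, type = "response", newdata = dataframe[!train, ])
smallerEstimatedResponse = ifelse(smallerProbabilities > 0.5, "True", "False")
mean(smallerEstimatedResponse == trueResponses[!train])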