Statistics Learning - Discriminant analysis

Thomas Bayes


Discriminant analysis is a classification method.

In discriminant analysis, the idea is to:

  1. model the distribution of X in each of the classes separately.
  2. use Bayes' theorem to flip things around and obtain the probability of Y given X: <math>Pr(Y|X)</math>

Bayes' theorem is the basis of discriminant analysis.

Bayes' theorem for classification

Bayes' theorem applied to classification:

<MATH> \begin{array}{rrl} Pr(Y = k|X = x) & = & \frac{\displaystyle Pr(X = x|Y = k)  Pr(Y = k)}{\displaystyle Pr(X = x)} \\ & = & \frac{\displaystyle \pi_k f_k(x)}{\displaystyle \sum_{l=1}^K \pi_l f_l (x)} \end{array} </MATH>


  • <math>\pi_k</math> is the prior probability for class k.
  • The marginal <math>\displaystyle \sum_{l=1}^K \pi_l f_l (x)</math> is obtained by summing over all the classes.
  • This formula is quite general: we can plug in any probability densities. <math>f_k(x) = Pr(X = x|Y = k)</math> is the density of X in class k.

This approach is quite general, and other distributions or densities can be used, including non-parametric approaches. By altering the forms for <math>f_k(x)</math>, we get different classifiers (i.e. classification rules).
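As a sketch, Bayes' theorem for classification can be computed directly by plugging in Gaussian densities for <math>f_k(x)</math> (the class means, sigmas, and priors below are hypothetical):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density f_k(x) of a normal distribution N(mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def posterior(x, priors, mus, sigmas):
    """Pr(Y = k | X = x) = pi_k f_k(x) / sum_l pi_l f_l(x)."""
    numerators = [pi * gaussian_pdf(x, mu, s)
                  for pi, mu, s in zip(priors, mus, sigmas)]
    marginal = sum(numerators)  # the denominator: sum over all classes
    return [n / marginal for n in numerators]

# Two hypothetical classes, N(0, 1) and N(2, 1), with equal priors.
# At the midpoint x = 1 the two posteriors are equal (0.5 each).
probs = posterior(1.0, priors=[0.5, 0.5], mus=[0.0, 2.0], sigmas=[1.0, 1.0])
```

Any other density (or a non-parametric estimate) could be substituted for `gaussian_pdf` without changing the rest of the computation.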


When the <math>f_k(x)</math> are Gaussian densities with the same covariance matrix in each class, this leads to linear discriminant analysis.

The two popular forms of discriminant analysis are:

  • linear: the same covariance for all classes
  • quadratic: a different covariance in each class
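A minimal sketch of the two discriminant scores for a single variable (p = 1), with made-up means, sigmas, and priors: under the shared-sigma assumption the <math>x^2</math> term is the same for every class and cancels, which is what makes the linear rule linear.

```python
import math

def lda_score(x, mu, sigma, prior):
    """Linear discriminant: shared sigma across classes, so the
    x^2 term cancels and the score is linear in x."""
    return x * mu / sigma**2 - mu**2 / (2 * sigma**2) + math.log(prior)

def qda_score(x, mu, sigma_k, prior):
    """Quadratic discriminant: class-specific sigma_k keeps the
    quadratic term (constants common to all classes are dropped)."""
    return (-0.5 * ((x - mu) / sigma_k) ** 2
            - math.log(sigma_k) + math.log(prior))

# Hypothetical two-class setup: equal priors, means -1 and +1, sigma 1.
# The LDA boundary falls at the midpoint x = 0; each side of it,
# the class whose mean is nearer scores higher.
left  = [lda_score(-0.5, mu, 1.0, 0.5) for mu in (-1.0, 1.0)]
right = [lda_score( 0.5, mu, 1.0, 0.5) for mu in (-1.0, 1.0)]
```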

Naive Bayes

When you have a large number of features (like 4,000), you really wouldn't want to estimate the large covariance matrices.

You will then assume that in each class the density factors into a product of densities. <MATH> f_k(x) = \prod^p_{j=1} f_{jk}(x_j) </MATH> where:

  • k is the class
  • p is the number of features (variables)
  • <math>f_{jk}(x_j)</math> is the density of feature j in class k

i.e. the variables are assumed to be conditionally independent within each class.

If we plug it into the above Bayes formula, we get something known as the naive Bayes classifier.
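The factored density can be sketched with one Gaussian per feature (the per-feature means and sigmas below are made up for illustration):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """1-D normal density for a single feature."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def naive_bayes_density(x, mus, sigmas):
    """f_k(x) = prod_j f_jk(x_j): one 1-D density per feature,
    i.e. features treated as conditionally independent in the class."""
    density = 1.0
    for xj, mu, sigma in zip(x, mus, sigmas):
        density *= gaussian_pdf(xj, mu, sigma)
    return density

# Hypothetical class-k parameters for p = 2 features
x = [0.5, 1.2]
f_k = naive_bayes_density(x, mus=[0.0, 1.0], sigmas=[1.0, 1.0])
```

Plugging `f_k` (one per class) into the Bayes formula, together with the priors, gives the naive Bayes posterior.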

For linear discriminant analysis, this means that the covariance matrices <math>\Sigma_k</math> are diagonal. With p variables, a full covariance matrix has on the order of <math>p^2</math> parameters to estimate; the diagonal assumption reduces this to p per class.

Although the assumption seems very crude, or even wrong, the naive Bayes classifier is actually very useful in high-dimensional problems. We may end up with biased estimates of the probabilities.

In classification, we're mainly concerned about which class has the highest probability and not whether we got the probabilities exactly right.

In terms of classification, since we just need to classify to the largest probability, we can tolerate quite a lot of bias and still get good classification performance. What we get in return is much reduced variance, from having to estimate far fewer parameters.

Classify to the highest density

(Figure: discriminant analysis with normal densities and the resulting decision boundary)

We classify a new point according to which density is highest.

On the right, where the priors are different and favor the pink class, the decision boundary shifts to the left.
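For two Gaussian classes with a shared variance, the size of this shift can be derived by setting <math>\pi_1 f_1(x) = \pi_2 f_2(x)</math> and solving for x. A sketch with made-up parameters:

```python
import math

def boundary(mu1, mu2, sigma, pi1, pi2):
    """x where pi_1 f_1(x) = pi_2 f_2(x) for two Gaussians with a
    shared sigma: the midpoint of the means plus a prior-driven shift."""
    return (mu1 + mu2) / 2 + sigma**2 * math.log(pi1 / pi2) / (mu2 - mu1)

# Equal priors: the boundary sits at the midpoint of the two means.
b_equal = boundary(-1.0, 1.0, 1.0, 0.5, 0.5)
# Raising the prior of class 2 moves the boundary toward class 1,
# i.e. to the left, enlarging the region classified as class 2.
b_skewed = boundary(-1.0, 1.0, 1.0, 0.2, 0.8)
```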

Advantages / Disadvantages

  • When the classes are well-separated, the parameter estimates for the logistic regression model are surprisingly unstable. Linear discriminant analysis does not suffer from this problem.
  • If the sample size N is small and the distribution of the predictors X is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic regression model.
  • Linear discriminant analysis is popular when we have more than two response classes, because it also provides low-dimensional views of the data.
  • If you have the right population model, Bayes rule is the best you can possibly do.


Suppose that in Ad Clicks (a problem where you try to model whether a user will click on a particular ad), it is well known that the majority of the time an ad is shown it will not be clicked. What is another way of saying this?

  • Ad Clicks have a low Prior Probability (Status: correct)
  • Ad Clicks have a high Prior Probability.
  • Ad Clicks have a low Density.
  • Ad Clicks have a high Density

Whether or not an ad gets clicked is a Qualitative Variable. Thus, it does not have a density. The Prior Probability of Ad Clicks is low because most ads are not clicked.

