Discriminant analysis is a classification method.
In discriminant analysis, the idea is to model the distribution of X separately in each of the classes, and then use Bayes' theorem to flip things around and obtain Pr(Y = k|X = x).
Bayes' theorem is therefore the basis for discriminant analysis.
Bayes' theorem for classification
<MATH> \begin{array}{rrl} Pr(Y = k|X = x) & = & \frac{\displaystyle Pr(X = x|Y = k) Pr(Y = k)}{\displaystyle Pr(X = x)} \\ & = & \frac{\displaystyle \pi_k f_k(x)}{\displaystyle \sum_{l=1}^K \pi_l f_l (x)} \end{array} </MATH>
where <math>f_k(x) = Pr(X = x|Y = k)</math> is the density of X in class k, <math>\pi_k = Pr(Y = k)</math> is the prior probability of class k, and K is the number of classes.
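As a small numeric sketch of this formula (the two classes, priors, means, and variances below are made-up values, and scipy's Gaussian density stands in for <math>f_k(x)</math>):

<code python>
# Minimal sketch of Bayes' theorem for classification (illustrative numbers only).
import numpy as np
from scipy.stats import norm

# Assumed example: K = 2 classes with priors pi_k and Gaussian class densities f_k(x).
priors = np.array([0.7, 0.3])                 # pi_1, pi_2
means, sds = np.array([0.0, 2.0]), np.array([1.0, 1.0])

x = 1.2                                        # the point to classify
densities = norm.pdf(x, loc=means, scale=sds)  # f_k(x)
posteriors = priors * densities / np.sum(priors * densities)  # Pr(Y = k | X = x)

print(posteriors)           # posterior probability of each class
print(posteriors.argmax())  # classify to the class with the largest posterior
</code>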
This approach is quite general: other densities can be used, including non-parametric ones. By altering the form of <math>f_k(x)</math>, we get different classifiers (i.e. different classification rules).
When <math>f_k(x)</math> are Gaussian densities, with the same covariance matrix in each class, this leads to linear discriminant analysis.
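A minimal sketch of that Gaussian case, assuming the usual plug-in estimates (empirical priors, per-class means, pooled covariance); the helper names lda_fit and lda_predict are just for illustration. With a shared covariance matrix, the log of <math>\pi_k f_k(x)</math> reduces, up to a constant, to a discriminant function that is linear in x:

<code python>
import numpy as np

def lda_fit(X, y):
    """Plug-in estimates for LDA: empirical priors, per-class means,
    and a pooled (shared) covariance matrix."""
    classes = np.unique(y)
    n, p = X.shape
    priors = np.array([np.mean(y == k) for k in classes])          # pi_k
    means = np.array([X[y == k].mean(axis=0) for k in classes])    # mu_k
    # pooled within-class covariance: the same Sigma is assumed in every class
    Sigma = sum(np.cov(X[y == k], rowvar=False) * (np.sum(y == k) - 1)
                for k in classes) / (n - len(classes))
    return classes, priors, means, Sigma

def lda_predict(X, classes, priors, means, Sigma):
    """delta_k(x) = x' Sigma^-1 mu_k - 0.5 mu_k' Sigma^-1 mu_k + log pi_k,
    which is linear in x; classify to the largest delta_k."""
    Sigma_inv = np.linalg.inv(Sigma)
    deltas = (X @ Sigma_inv @ means.T
              - 0.5 * np.sum(means @ Sigma_inv * means, axis=1)
              + np.log(priors))
    return classes[np.argmax(deltas, axis=1)]
</code>

For example, lda_predict(X, *lda_fit(X, y)) would return the predicted class for each row of X.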
The two popular forms of linear discriminant analysis are:
When you have a large number of features (say p = 4,000), you really would not want to estimate a full covariance matrix: it has on the order of <math>p^2</math> parameters (millions of them when p = 4,000).
You then assume that within each class the density factors into a product of univariate densities: <MATH> f_k(x) = \prod^p_{j=1} f_{jk}(x_j) </MATH> where <math>f_{jk}(x_j)</math> is the density of the j-th variable <math>x_j</math> in class k.
i.e. the variables are assumed to be conditionally independent within each class.
If we plug this factorized density into the Bayes formula above, we get what is known as the naive Bayes classifier.
In the Gaussian case (as in linear discriminant analysis), this means that the covariance matrices <math>\Sigma_k</math> are diagonal. Instead of the order of <math>p^2</math> parameters needed for a full covariance matrix over p variables, we only have to estimate the p diagonal variances in each class.
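A sketch of this Gaussian naive Bayes case, assuming per-class, per-feature means and variances (the diagonal-covariance assumption); the function names are illustrative:

<code python>
import numpy as np

def gaussian_naive_bayes_fit(X, y):
    """Per class: prior pi_k, plus a mean and a variance for each feature separately
    (the diagonal-covariance assumption)."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == k) for k in classes])
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    variances = np.array([X[y == k].var(axis=0) for k in classes])
    return classes, priors, means, variances

def gaussian_naive_bayes_predict(X, classes, priors, means, variances):
    """log Pr(Y=k|X=x) is, up to a constant, log pi_k + sum_j log f_jk(x_j),
    because the class density factorizes over the features."""
    log_f = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                    + (X[:, None, :] - means[None, :, :]) ** 2 / variances[None, :, :])
    log_post = np.log(priors)[None, :] + log_f.sum(axis=2)   # unnormalized log posterior
    return classes[np.argmax(log_post, axis=1)]
</code>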
Although the assumption seems very crude, or even wrong, the naive Bayes classifier is actually very useful in high-dimensional problems. We may end up with somewhat biased estimates of the class probabilities.
In classification, we're mainly concerned about which class has the highest probability and not whether we got the probabilities exactly right.
In terms of classification, since we just need to classify to the largest probability, we can tolerate quite a lot of bias and still get good classification performance. What we get in return is a much reduced variance, because far fewer parameters have to be estimated.
When the priors are equal, we classify a new point according to whichever density is highest.
On the right, when the priors are different, we favor the pink class, and the decision boundary shifts to the left.
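A tiny one-dimensional sketch of that effect, with made-up means and a shared variance: setting <math>\pi_1 f_1(x) = \pi_2 f_2(x)</math> and solving gives a closed-form boundary, which sits at the midpoint of the two means for equal priors and moves toward the other class as one prior grows:

<code python>
import numpy as np

# Two 1-D Gaussian class densities with a shared variance (illustrative values).
mu1, mu2, sigma2 = 0.0, 2.0, 1.0

def boundary(pi1):
    """x where pi1*f1(x) = pi2*f2(x); taking logs and solving gives a closed form."""
    pi2 = 1.0 - pi1
    return (mu1 + mu2) / 2 + sigma2 * np.log(pi1 / pi2) / (mu2 - mu1)

print(boundary(0.5))   # 1.0   -> equal priors: boundary at the midpoint of the means
print(boundary(0.7))   # ~1.42 -> favoring class 1 pushes the boundary toward class 2
</code>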
Suppose that in Ad Clicks (a problem where you try to model whether a user will click on a particular ad) it is well known that the majority of the time an ad is shown it will not be clicked. What is another way of saying that?
Whether or not an ad gets clicked is a Qualitative Variable. Thus, it does not have a density. The Prior Probability of Ad Clicks is low because most ads are not clicked.