
Statistical Learning - Simple Linear Discriminant Analysis (LDA)

About

Linear Discriminant Analysis with only one variable (p = 1).

For a generalization to multiple variables (multivariate Gaussian), see Statistics - Fisher (Multiple Linear Discriminant Analysis).

Assumption

The variance <math>\sigma_k^2</math> of the distribution of <math>X_i</math> when <math>Y_i = k</math> is the same in each of the classes k, ie <math>\sigma_k = \sigma</math> for all k.

This is an important convenience: this assumption determines whether discriminant analysis gives us linear discriminant functions or quadratic ones.

Model Construction

Gaussian density

The Gaussian density has the form:

<MATH> f_k(x) = \frac{1}{\sqrt{2\pi}\sigma_k} e^{\displaystyle -\frac{1}{2} \left (\frac{x-\mu_k}{\sigma_k} \right )^2} </MATH>

where <math>\mu_k</math> is the mean and <math>\sigma_k^2</math> the variance of X in class k.
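As a quick illustration, here is a minimal sketch of this density in plain Python (the function name gaussian_density and the example values are ours, not part of LDA itself):

<code python>
import math

def gaussian_density(x, mu_k, sigma_k):
    # One-dimensional Gaussian density f_k(x) with mean mu_k
    # and standard deviation sigma_k
    return math.exp(-0.5 * ((x - mu_k) / sigma_k) ** 2) / (math.sqrt(2 * math.pi) * sigma_k)

# Example: class k with mean 1.0 and standard deviation 0.5, evaluated at x = 1.2
print(gaussian_density(1.2, mu_k=1.0, sigma_k=0.5))  # ~0.7365
</code>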

Bayes Formula

Full Expression

Plugging the Gaussian density into the Bayes formula, we get a rather complex expression.

<MATH> \begin{array}{rrl} Pr(Y = k|X = x) & = & \frac{\displaystyle Pr(X = x|Y = k)  Pr(Y = k)}{\displaystyle Pr(X = x)} \\ p_k(x) & = & \frac {\displaystyle \pi_k \frac{1}{\sqrt{2\pi}\sigma} e^{\displaystyle -\frac{1}{2} \left (\frac{x-\mu_k}{\sigma} \right )^2}} {\displaystyle \sum^K_{l=1} \pi_l \frac{1}{\sqrt{2\pi}\sigma} e^{\displaystyle -\frac{1}{2} \left (\frac{x-\mu_l}{\sigma} \right )^2}} \end{array} </MATH>
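A minimal sketch of this computation in Python, assuming the common standard deviation sigma from the assumption above (the parameter values are illustrative):

<code python>
import math

def posterior(x, pis, mus, sigma):
    # p_k(x) for every class k: prior-weighted Gaussian densities,
    # normalized so that the posteriors sum to one (Bayes formula)
    densities = [math.exp(-0.5 * ((x - mu_k) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)
                 for mu_k in mus]
    numerators = [pi_k * f_k for pi_k, f_k in zip(pis, densities)]
    total = sum(numerators)
    return [num / total for num in numerators]

# Two classes with equal priors, means -1 and 1, common sigma = 1
print(posterior(0.3, pis=[0.5, 0.5], mus=[-1.0, 1.0], sigma=1.0))  # ~[0.354, 0.646]
</code>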

Simplification

Luckily, thanks to the equal-variance assumption, there are some simplifications and cancellations.

To classify an observation, we don't need to evaluate the probabilities themselves. We just need to see which one is the largest.

Whenever you see exponentials, the first thing to do is take logs. Discarding terms that do not depend on k then cancels most of the expression.
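Concretely, taking the log of the numerator of <math>p_k(x)</math> (the denominator is the same for every k and can be ignored) gives:

<MATH> \begin{array}{rrl} log \left ( \pi_k f_k(x) \right ) & = & log(\pi_k) - log(\sqrt{2\pi}\sigma) - \frac{(x-\mu_k)^2}{2\sigma^2} \\ & = & log(\pi_k) - log(\sqrt{2\pi}\sigma) - \frac{x^2}{2\sigma^2} + x.\frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} \end{array} </MATH>

The terms <math>-log(\sqrt{2\pi}\sigma)</math> and <math>-\frac{x^2}{2\sigma^2}</math> do not depend on k and can be dropped.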

This is equivalent to assigning the observation to the class with the largest discriminant score:

<MATH> \delta_k(x) = x.\frac{\mu_k}{\sigma^2}-\frac{\mu_k^2}{2\sigma^2}+log(\pi_k) </MATH>

It involves the parameters <math>\pi_k</math>, <math>\mu_k</math> and <math>\sigma^2</math>.

And importantly, <math>\delta_k(x)</math> is a linear function of x.

For each of the K classes, we get one such discriminant function (see the sketch below).
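A minimal sketch of the score and the resulting classification rule in Python (the function names are ours; the toy parameters match the example above):

<code python>
import math

def discriminant_score(x, mu_k, sigma2, pi_k):
    # Linear discriminant score delta_k(x): linear in x
    return x * mu_k / sigma2 - mu_k ** 2 / (2 * sigma2) + math.log(pi_k)

def classify(x, pis, mus, sigma2):
    # Assign x to the class with the largest discriminant score
    scores = [discriminant_score(x, mu_k, sigma2, pi_k) for pi_k, mu_k in zip(pis, mus)]
    return scores.index(max(scores))

# Equal priors, means -1 and 1, common variance 1: x = 0.3 goes to the class with mean 1
print(classify(0.3, pis=[0.5, 0.5], mus=[-1.0, 1.0], sigma2=1.0))  # 1
</code>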

Binary

If <math>K = 2</math> and <math>\pi_1 = \pi_2 = 0.5</math>, you can simplify even further and see that the decision boundary is at

<MATH> x = \frac{\mu_1 + \mu_2}{2} </MATH>
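To see this, set <math>\delta_1(x) = \delta_2(x)</math>; since <math>\pi_1 = \pi_2</math>, the <math>log(\pi_k)</math> terms cancel and:

<MATH> \begin{array}{rrl} x.\frac{\mu_1}{\sigma^2} - \frac{\mu_1^2}{2\sigma^2} & = & x.\frac{\mu_2}{\sigma^2} - \frac{\mu_2^2}{2\sigma^2} \\ x(\mu_1 - \mu_2) & = & \frac{\mu_1^2 - \mu_2^2}{2} \\ x & = & \frac{\mu_1 + \mu_2}{2} \end{array} </MATH>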

Parameter Estimation

<MATH> \hat{\pi_k} = \frac{\displaystyle n_k}{n} </MATH>

where <math>n_k</math> is the number of observations in the kth class and n the total number of observations.

<MATH> \hat{\mu_k} = \frac{1}{N_k}\sum_{i:y_i=k}x_i </MATH>

The notation <math>\displaystyle \sum_{i:y_i=k}</math> just sums over the <math>x_i</math>'s that are in class k.

Because the variance is assumed to be the same in each of the classes, this formula is called a pooled variance estimate:

<MATH> \hat{\sigma}^2 = \frac{1}{n-K}\sum_{k=1}^K\sum_{i:y_i=k}(x_i-\hat{\mu}_k)^2 </MATH>

An equivalent way to write this is as a weighted average of the per-class variance estimates: <MATH> \hat{\sigma}^2 = \sum_{k=1}^K \frac{n_k-1}{n-K}.\hat{\sigma}^2_k </MATH> where <math>\hat{\sigma}^2_k</math> is the usual formula for the estimated variance in the kth class, i.e.: <MATH> \hat{\sigma}^2_k = \frac{1}{n_k-1} \sum_{i:y_i=k} (x_i-\hat{\mu_k})^2 </MATH>
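A minimal sketch of the three estimators in Python, assuming the data comes as plain lists of observations and class labels (the function name is ours):

<code python>
from collections import defaultdict

def estimate_parameters(xs, ys):
    # Estimate pi_k, mu_k and the pooled variance sigma^2 from 1-D data
    n = len(xs)
    by_class = defaultdict(list)
    for x_i, y_i in zip(xs, ys):
        by_class[y_i].append(x_i)
    K = len(by_class)
    pi_hat = {k: len(v) / n for k, v in by_class.items()}       # prior proportions
    mu_hat = {k: sum(v) / len(v) for k, v in by_class.items()}  # class means
    # Pooled variance: squared deviations around each class mean, divided by n - K
    sigma2_hat = sum((x_i - mu_hat[k]) ** 2
                     for k, v in by_class.items() for x_i in v) / (n - K)
    return pi_hat, mu_hat, sigma2_hat

xs = [1.2, 0.8, 1.1, -0.9, -1.3, -1.0]
ys = [1, 1, 1, 2, 2, 2]
print(estimate_parameters(xs, ys))
</code>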