Information theory was founded by Claude Shannon. It quantifies entropy, a key measure of information, usually expressed as the average number of bits needed to store or communicate one symbol in a message.
Information theory measures information in bits.
<math> entropy(p_1,p_2,\dots,p_n)=-{p_1}\log_2(p_1)-{p_2}\log_2(p_2)-\dots-{p_n}\log_2(p_n) </math>
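As an illustration, the entropy formula above can be written as a small Python function (a sketch; the function name is arbitrary):

```python
from math import log2

def entropy(probabilities):
    """Entropy in bits: -sum(p * log2(p)), skipping zero probabilities."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

# A fair coin carries exactly one bit of information.
print(entropy([0.5, 0.5]))  # 1.0
```

A pure distribution (a single outcome with probability 1) has entropy 0; the more uniform the distribution, the higher the entropy.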
Information gain is the amount of information gained by knowing the value of the attribute
<math> \text{Information gain} = \text{(entropy of distribution before the split)} - \text{(entropy of distribution after it)} </math>
The entropy after the split is the weighted average of the entropies of the subsets, weighted by subset size. The largest information gain therefore corresponds to the smallest entropy after the split.
A highly branching attribute, such as an ID attribute (the extreme case, with a different ID per record), will give the maximal information gain but will not generalize at all, and will therefore lead to an algorithm that overfits.
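A minimal Python sketch of information gain (helper names are illustrative; a record is modeled here as an attribute value paired with a class label), including why an ID-like attribute maximizes the gain:

```python
from collections import Counter
from math import log2

def entropy_of_labels(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy before the split minus the weighted entropy after
    splitting on the attribute whose values are given."""
    n = len(labels)
    after = 0.0
    for v in set(values):
        subset = [label for val, label in zip(values, labels) if val == v]
        after += len(subset) / n * entropy_of_labels(subset)
    return entropy_of_labels(labels) - after

# An ID-like attribute (every value distinct) yields one-record subsets,
# each with entropy 0, so its gain equals the full entropy: maximal,
# yet useless for generalization.
labels = ["yes", "yes", "no", "no"]
ids = [1, 2, 3, 4]
print(information_gain(ids, labels))  # 1.0
```

A constant attribute, by contrast, leaves the distribution unchanged and has gain 0.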
With the Weather data set: 14 records, 9 are “yes”. The entropy of this distribution is:
<math> -( {9/14} \log_2 {9/14} + {5/14} \log_2 {5/14} ) \approx 0.94 </math>
The outlook attribute contains 3 distinct values:
Distinct value | Yes records | Entropy |
---|---|---|
overcast | 4 records, 4 are “yes” | 0 |
sunny | 5 records, 2 are “yes” | 0.97 |
rainy | 5 records, 3 are “yes” | 0.97 |
Expected new entropy:
<math> {4/14} \times 0 + {5/14} \times 0.97 + {5/14} \times 0.97 \approx 0.69 </math>
The temperature attribute contains 3 distinct values:
Distinct value | Yes records | Entropy |
---|---|---|
cool | 4 records, 3 are “yes” | 0.81 |
hot | 4 records, 2 are “yes” | 1.0 |
mild | 6 records, 4 are “yes” | 0.92 |
Expected new entropy:
<math> {4/14} \times 0.81 + {4/14} \times 1.0 + {6/14} \times 0.92 \approx 0.91 </math>
The humidity attribute contains 2 distinct values:
Distinct value | Yes records | Entropy |
---|---|---|
normal | 7 records, 6 are “yes” | 0.59 |
high | 7 records, 2 are “yes” | 0.86 |
Expected new entropy:
<math> {7/14} \times 0.59 + {7/14} \times 0.86 \approx 0.72 </math>
The windy attribute is binary. Consider every possible binary partition and choose the partition with the highest gain:
Distinct value | Yes records | Entropy |
---|---|---|
TRUE | 8 records, 6 are “yes” | 0.81 |
FALSE | 6 records, 3 are “yes” | 1.0 |
Expected new entropy:
<math> {8/14} \times 0.81 + {6/14} \times 1.0 \approx 0.89 </math>
Attribute | Information Gain |
---|---|
outlook | 0.94 - 0.69 = 0.25 |
temperature | 0.94 - 0.91 = 0.03 |
humidity | 0.94 - 0.72 = 0.22 |
windy | 0.94 - 0.89 = 0.05 |
The highest information gain is with the outlook attribute.
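As a check, the gains can be recomputed from the per-value counts (a sketch; the counts follow the worked example, with the windy FALSE branch taken as 6 records, 3 of them “yes”, so that the totals sum to 14):

```python
from math import log2

def entropy2(yes, total):
    """Binary entropy of a subset with `yes` positives out of `total`."""
    return -sum(p * log2(p) for p in (yes / total, 1 - yes / total) if p > 0)

# (records, yes-records) per distinct value of each attribute.
splits = {
    "outlook":     [(4, 4), (5, 2), (5, 3)],  # overcast, sunny, rainy
    "temperature": [(4, 3), (4, 2), (6, 4)],  # cool, hot, mild
    "humidity":    [(7, 6), (7, 2)],          # normal, high
    "windy":       [(8, 6), (6, 3)],          # TRUE, FALSE
}

before = entropy2(9, 14)  # entropy before any split, ~0.94
for attribute, counts in splits.items():
    after = sum(n / 14 * entropy2(yes, n) for n, yes in counts)
    print(attribute, round(before - after, 2))
# outlook has the highest gain (~0.25)
```

Note that rounding intermediate entropies, as the tables above do, can shift a gain by about 0.01 compared to computing with full precision.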