About
Entropy measures the degree to which a system has no pattern. A high-entropy source is completely chaotic and unpredictable; it is called true randomness.
Entropy is built from a function, "Information", that satisfies:
<MATH> I_{1,2}(p_1 p_2) = I_1(p_1) + I_2(p_2) </MATH>
where:
- p_1 p_2 is the joint probability of event 1 and event 2
- p_1 is the probability of event 1
- p_2 is the probability of event 2
The logarithm function (log) satisfies this property. The information of an outcome and the entropy of a distribution are then defined as:
<MATH> \begin{array}{rrl} I(x) & = & - log_2 (p_x) \\ Entropy = H(X) & = & E(I(X)) = \sum_{x}{ p_x I(x)} = - \sum_{x}{ p_x log_2 p_x} \end{array} </MATH>
where:
- H stands for entropy
- E stands for expectation (entropy is the expected value of the information)
The entropy of a distribution with finite domain is maximized when all points have equal probability.
The higher the entropy, the less predictable the events being measured.
- 100% predictability corresponds to an entropy of 0.
- A fifty-fifty outcome (two equally likely classes) corresponds to an entropy of 1.
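These values, together with the equal-probability maximum stated above, can be checked numerically. The following is a minimal Python sketch; the `entropy` helper is an illustrative name, not code from the source:
<code python>
import math

def entropy(probabilities):
    # Shannon entropy in bits: H = sum(p * log2(1/p)) = -sum(p * log2 p),
    # skipping zero-probability outcomes.
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)

print(entropy([1.0]))                 # 100% predictable          -> 0.0
print(entropy([0.5, 0.5]))            # fifty-fifty               -> 1.0
print(entropy([0.25] * 4))            # uniform over 4 outcomes   -> 2.0 (the maximum)
print(entropy([0.7, 0.1, 0.1, 0.1]))  # skewed over 4 outcomes    -> ~1.36, below the maximum
</code>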
Why is picking the attribute with the most information gain beneficial? It reduces entropy, which increases predictability. Information gain is positive when choosing a classifier/representation decreases entropy, and a decrease in entropy means a decrease in unpredictability, i.e. an increase in predictability.
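To make the attribute-selection argument concrete, here is a minimal Python sketch of information gain for a binary split; the function names and the toy labels are assumptions for the example, not from the source:
<code python>
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy in bits of a list of class labels.
    n = len(labels)
    return sum(c / n * math.log2(n / c) for c in Counter(labels).values())

def information_gain(parent, subsets):
    # Parent entropy minus the size-weighted entropy of the subsets after the split.
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in subsets)

parent = ["survived"] * 5 + ["died"] * 5        # entropy = 1.0
clean_split = [["survived"] * 5, ["died"] * 5]  # each subset is pure
mixed_split = [["survived", "died", "survived", "died", "survived"],
               ["survived", "died", "died", "survived", "died"]]
print(information_gain(parent, clean_split))    # 1.0: entropy drops to 0
print(information_gain(parent, mixed_split))    # ~0.03: the split barely helps
</code>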
Example
Flipping a Coin
A two class problem.
<MATH> \begin{array}{rrl} Entropy & = & - \sum_{x}{ p_x log_2 p_x} \\ & = & - (0.5 log_2 0.5 + 0.5 log_2 0.5) \\ & = & - 2 \times (0.5 log_2 0.5) \\ & = & 1 \end{array} </MATH>
Rolling a die
A six class problem.
<MATH> \begin{array}{rrl} Entropy & = & - \sum_{x}{ p_x log_2( p_x)} \\ & = & - 6 * (\frac{1}{6} log_2 (\frac{1}{6})) \\ & \approx & 2.58 \end{array} </MATH>
Rolling a weighted die
<MATH> p_1 = 0.1, p_2 = 0.1, p_3 = 0.1, p_4 = 0.1, p_5 = 0.1, p_6 = 0.5 </MATH>
<MATH> \begin{array}{rrl} Entropy & = & - \sum_{x}{ p_x log_2 p_x} \\ & = & - 5 \times (0.1 log_2 0.1) - (0.5 log_2 0.5) \\ & \approx & 2.16 \end{array} </MATH>
The weighted die is more predictable than a fair die.
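Both die calculations can be reproduced numerically; a minimal sketch, with the `entropy` helper again an illustrative name:
<code python>
import math

def entropy(probabilities):
    # Shannon entropy in bits, skipping zero-probability outcomes.
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)

print(entropy([1 / 6] * 6))        # fair die     -> ~2.58
print(entropy([0.1] * 5 + [0.5]))  # weighted die -> ~2.16
</code>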
How unpredictable is your data?
Titanic
Titanic training set with a two class problem: survived or died.
342 survivors
case: 342 survivors out of a total of 891 passengers
<MATH> Entropy = - ( \frac{342}{891} log_2 \frac{342}{891} + \frac{549}{891} log_2 \frac{549}{891} ) \approx 0.96 </MATH>
50 survivors
case: 50 survivors out of a total of 891 passengers
<MATH> Entropy = - ( \frac{50}{891} log_2 \frac{50}{891} + \frac{841}{891} log_2 \frac{841}{891} ) \approx 0.31 </MATH>
This second case is a more predictable data set.
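The two Titanic cases can be checked numerically as well; a minimal sketch using the class counts above (`entropy_from_counts` is an illustrative helper, not from the source):
<code python>
import math

def entropy_from_counts(counts):
    # Shannon entropy in bits of a class distribution given as raw counts.
    total = sum(counts)
    return sum(c / total * math.log2(total / c) for c in counts if c > 0)

print(entropy_from_counts([342, 549]))  # 342 survivors vs 549 deaths -> ~0.96
print(entropy_from_counts([50, 841]))   # 50 survivors vs 841 deaths  -> ~0.31
</code>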
Document / Reference
- Bill Howe (UW)