Data Mining - Entropy (Information Gain)


Entropy is the degree to which a system has no pattern. A high-entropy source is chaotic and unpredictable; a source that is completely unpredictable is called truly random.

Entropy is built on a function “Information” that is additive over independent events:

Information_{1&2}(p_1 p_2) = Information_1(p_1) + Information_2(p_2)


  • p_1 p_2 is the probability that event 1 and event 2 both occur (the two events being independent)
  • p_1 is the probability of event 1
  • p_2 is the probability of event 2

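The logarithm function (log) satisfies this property. It is taken with a minus sign so that the information of an event is never negative (the log of a probability is at most zero):

Information_{x}(p_x) = - log_2(p_x)

I(x) = - log_2(p_x)

<MATH> Entropy = H(X) = E(I(X)) = \sum_{x}{ p_x I(x)} = - \sum_{x}{ p_x log_2( p_x)} </MATH>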


  • H stands for entropy
  • E stands for the expected value (expectation)
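The definition above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `entropy` and the convention that zero-probability outcomes contribute nothing are assumptions made here.

```python
import math

def entropy(probabilities):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    if not math.isclose(sum(probabilities), 1.0):
        raise ValueError("probabilities must sum to 1")
    # Assumption: outcomes with p == 0 are skipped, since p * log2(p) tends to 0.
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))            # 1.0
print(round(entropy([0.9, 0.1]), 2))  # 0.47
```

The second call shows that a lopsided two-outcome distribution has a lower entropy than the fifty-fifty one.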

The entropy of a distribution with finite domain is maximized when all points have equal probability.
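As a short worked derivation of that maximum, assume n equally likely outcomes (the symbol n is introduced here only for illustration), so that every p_x equals 1/n:

```latex
H(X) = -\sum_{x=1}^{n} \frac{1}{n} \log_2 \frac{1}{n}
     = -\log_2 \frac{1}{n}
     = \log_2 n
```

With n = 2 this gives 1, and with n = 6 it gives about 2.58.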

The higher the entropy, the less predictable the events being measured.

100% predictability = 0 entropy

For an event with two outcomes, fifty-fifty is an entropy of one.

Why is picking the attribute with the most information gain beneficial? It reduces entropy, which increases predictability. Information gain is positive when choosing a classifier or representation decreases the entropy. A decrease in entropy signifies a decrease in unpredictability, that is, an increase in predictability.
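One common way to compute this, sketched below, is to take the entropy of the class labels and subtract the weighted entropy of the groups obtained by splitting on an attribute. The function names and the toy data are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy, in bits, of the class distribution observed in a list of labels."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total)
               for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """Entropy of the labels minus the weighted entropy after splitting."""
    total = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    # Each group's entropy is weighted by the fraction of rows it holds.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy data: four rows, two classes.
labels = ["yes", "yes", "no", "no"]
separating = ["a", "a", "b", "b"]   # splits the classes perfectly
unrelated = ["a", "b", "a", "b"]    # tells nothing about the class

print(information_gain(labels, separating))  # 1.0
print(information_gain(labels, unrelated))   # 0.0
```

The attribute that separates the classes removes all the entropy, while the unrelated one removes none, so the first is the better choice.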


Flipping a Coin

A two class problem.

<MATH> \begin{array}{rrl} Entropy & = & - \sum_{x}{ p_x log_2( p_x)} \\ & = & - (0.5 log_2 (0.5) + 0.5 log_2 (0.5)) \\ & = & - 2 * (0.5 log_2 (0.5)) \\ & = & 1 \end{array} </MATH>

Rolling a die

A six class problem.

<MATH> \begin{array}{rrl} Entropy & = & - \sum_{x}{ p_x log_2( p_x)} \\ & = & - 6 * (\frac{1}{6} log_2 (\frac{1}{6})) \\ & \approx & 2.58 \end{array} </MATH>

Rolling a weighted die

p_1 = 0.1, p_2 = 0.1, p_3 = 0.1, p_4 = 0.1, p_5 = 0.1, p_6 = 0.5

<MATH> \begin{array}{rrl} Entropy & = & - \sum_{x}{ p_x log_2( p_x)} \\ & = & - 5 * (0.1 log_2 (0.1)) - (0.5 log_2 (0.5)) \\ & \approx & 2.16 \end{array} </MATH>

The weighted die is more predictable than a fair die.
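The two die results can be checked numerically. The sketch below assumes a small helper named `entropy` (not part of the original text) that applies the entropy formula to a list of probabilities.

```python
import math

def entropy(probabilities):
    """Shannon entropy, in bits; zero-probability outcomes are skipped."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

fair_die = [1 / 6] * 6
weighted_die = [0.1] * 5 + [0.5]   # p_1..p_5 = 0.1, p_6 = 0.5

print(round(entropy(fair_die), 2))      # 2.58
print(round(entropy(weighted_die), 2))  # 2.16
```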

How unpredictable is your data?


Titanic training set with a two class problem: survived or died

342 survivors

Case: 342 survivors out of a total of 891 passengers

- ( (342/891) log_2 (342/891) + (549/891) log_2 (549/891) ) approx 0.96

50 survivors

Case: 50 survivors out of a total of 891 passengers

- ( (50/891) log_2 (50/891) + (841/891) log_2 (841/891) ) approx 0.31

With only 50 survivors, it is a more predictable data set.
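When the data comes as class counts rather than probabilities, the same calculation can be sketched as follows. The helper name `entropy_from_counts` is hypothetical; the counts are the ones used in the two cases above.

```python
import math

def entropy_from_counts(counts):
    """Entropy, in bits, of a class distribution given as raw counts."""
    total = sum(counts)
    # Empty classes are skipped because they contribute nothing.
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

# [survived, died] out of 891 passengers
print(round(entropy_from_counts([342, 549]), 2))  # 0.96
print(round(entropy_from_counts([50, 841]), 2))   # 0.31
```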
