natural_language:tf-idf

About

tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.

It is often used as a weighting factor in information retrieval and text mining.

The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to control for the fact that some words are generally more common than others.

Articles Related

Type of Weight

TF

TF rewards tokens that appear many times in the same document. If a word occurs often in a document, then it is more important to the meaning of the document.

TF is computed as the frequency of a token in a document (bag of words)

Example:

a document d contains 100 tokens
the token t appears in d 5 times,

<MATH> \text{TF weight (of the token in the document)} = \frac{\displaystyle \text{Number of times the token appears in the document}}{\text{Total Number of Tokens in the document}} = \frac{5}{100} = \frac{1}{20} </MATH>

IDF

IDF rewards tokens that are rare overall in a dataset. If a rare word occurs in two documents, then it is more important to the meaning of each document.

IDF weight for a token, t, in a set of documents, U, is computed as follows:

<MATH> IDF(t) = \frac{\text{Total number of documents}}{\text{Number of documents in U that contain t}} = \frac{N}{n(t)} </MATH>

Note that <math>\frac{n(t)}{N}</math> is the frequency of t in U, and <math>\frac{N}{n(t)}</math> is the inverse frequency.

TF-IDF

Finally, to bring it all together, the total TF-IDF weight for a token in a document is the product of its tf and idf weights.

<MATH> \text{TF-IDF} = TF \times IDF </MATH>

Scope of Weight

Local

Sometimes token weights depend on the document the token belongs to, that is, the same token may have a different weight when it's found in different documents. We call these weights local weights. TF is an example of a local weight, because it depends on the length of the source.

Global

On the other hand, some token weights only depend on the token, and are the same everywhere that token is found. We call these weights global, and IDF is one such weight.

Calculation

IDF

Python with Spark:

rdd = sc.parallelize([('Doc1', ['Hello','nico','MonPoussin']), ('Doc2', ['Hello','Toi']),  ('Doc3', ['Hello','Pfff','nico'])])
N = rdd.count()
(rdd.flatMap(lambda (x,y):list(set(y)))
    .map(lambda x: (x,1.0))
    .reduceByKey(lambda a,b:a+b)
    .map(lambda (x,y):(x,N/y))
.collect())

[('MonPoussin', 3.0),
 ('Toi', 3.0),
 ('nico', 1.5),
 ('Pfff', 3.0),
 ('Hello', 1.0)]

Table of Contents

Text Mining - term frequency – inverse document frequency (tf-idf)