Text Mining - Bag of (words|tokens)



The idea is to treat strings (documents) as unordered collections of words, or tokens, i.e., as bags of words.

Bag-of-words techniques apply to any sort of token, so a “bag of words” is really a “bag of tokens”.

Stopwords add noise to bag-of-words comparisons, so they are usually excluded.
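As a minimal sketch of stopword exclusion: the tokenizer below (lowercase, alphabetic runs only) and the tiny `STOPWORDS` set are illustrative assumptions, not a standard list. Real stopword lists are much longer and language-specific.

```python
import re

# Illustrative stopword set (an assumption; real lists are far longer).
STOPWORDS = {"the", "a", "an", "of", "and", "is", "to", "in"}

def tokenize(text):
    """Lowercase the text and split it into runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [t for t in tokens if t not in stopwords]

tokens = remove_stopwords(tokenize("The cat is in the garden"))
print(tokens)
```

Only the content-bearing tokens remain for comparison; the filtering happens before any counting or weighting.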

Words (i.e., tokens) are the atomic units of text comparison. To compare two documents, we count how many tokens they have in common.
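The comparison can be sketched as follows. This version counts distinct shared tokens using sets, which is one possible reading of "how many tokens"; counting repeated occurrences would be an equally valid choice. The function names and the simple tokenizer are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())

def shared_token_count(doc_a, doc_b):
    """Number of distinct tokens that appear in both documents."""
    return len(set(tokenize(doc_a)) & set(tokenize(doc_b)))

print(shared_token_count("Hello, world!", "Goodbye, world!"))
```

Here the two strings share only the token `world`, so the count is 1.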


Bag-of-words comparisons are not very good when all tokens are treated the same: some tokens are more important than others. Weights give a way to specify which tokens to favor. With weights, when we compare documents, instead of counting common tokens, we sum up the weights of common tokens.
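A weighted comparison might look like the sketch below. The weight values are invented for illustration, and the choice to give unknown tokens a default weight of 0 is an assumption.

```python
def weighted_overlap(tokens_a, tokens_b, weights, default=0.0):
    """Sum the weights of the distinct tokens that both documents contain."""
    common = set(tokens_a) & set(tokens_b)
    return sum(weights.get(token, default) for token in common)

# Hypothetical weights: the frequent word "the" counts for less.
weights = {"the": 0.25, "tokenizer": 2.0, "corpus": 1.5}

score = weighted_overlap(
    ["the", "tokenizer", "corpus"],
    ["the", "corpus", "index"],
    weights,
)
print(score)
```

The documents share `the` and `corpus`, so the score is 0.25 + 1.5 = 1.75 rather than a plain count of 2.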

A good heuristic for assigning weights is term frequency-inverse document frequency (tf-idf).
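Several tf-idf variants exist. The sketch below assumes one common form, with term frequency as the token's share of the document and inverse document frequency as the natural log of (number of documents / number of documents containing the token). The function name and the corpus are hypothetical.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """corpus: list of token lists. Returns one {token: weight} dict per document."""
    n_docs = len(corpus)
    # Document frequency: in how many documents does each token occur?
    df = Counter(token for doc in corpus for token in set(doc))
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        weights.append({
            token: (count / len(doc)) * math.log(n_docs / df[token])
            for token, count in counts.items()
        })
    return weights

corpus = [["hello", "world"], ["goodbye", "world"]]
for doc_weights in tf_idf(corpus):
    print(doc_weights)
```

With this variant, a token that occurs in every document (`world` here) gets a weight of 0, while tokens specific to one document get a positive weight.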


A bag of words can represent a document as a vector where:

  • Dimensions: each unique token
  • Magnitudes: token weights

Example: With the count as weights, the string “Hello, world! Goodbye, world!” would be represented by this vector:

<math> \begin{array}{rcl} \text{hello} & \mapsto & 1 \\ \text{goodbye} & \mapsto & 1 \\ \text{world} & \mapsto & 2 \\ & \vdots & \\ \text{other word} & \mapsto & 0 \\ \end{array} </math>
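The count vector for this example can be reproduced with a sparse mapping. This sketch assumes a simple lowercase, letters-only tokenizer; `bag_of_words` is a hypothetical helper name. Python's `collections.Counter` returns 0 for any token it has not seen, which matches the "other word" row.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Map each token of the text to its number of occurrences."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

vector = bag_of_words("Hello, world! Goodbye, world!")
print(vector["hello"], vector["goodbye"], vector["world"], vector["other"])
```

Storing only the non-zero dimensions keeps the representation small even when the vocabulary is large.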
