About
A bag of words is a representation model of a piece of text.
The idea is to treat strings (documents) as unordered collections of words, or tokens, i.e., as bags of words.
Bag-of-words techniques apply to any sort of token, so a “bag-of-words” is really a “bag-of-tokens”.
Stopwords add noise to bag-of-words comparisons, so they are usually excluded.
Words (i.e., tokens) are the atomic unit of text comparison. To compare two documents, we count how many tokens they have in common.
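A minimal sketch of this comparison in Python (the tokenizer here, a lowercase split on letter runs, is an assumption for illustration):

```python
import re

def tokenize(document):
    # Naive tokenizer (an assumption): lowercase, keep runs of letters
    return re.findall(r"[a-z]+", document.lower())

def shared_token_count(doc_a, doc_b):
    # Count how many distinct tokens the two documents share
    return len(set(tokenize(doc_a)) & set(tokenize(doc_b)))

print(shared_token_count("Hello, world! Goodbye, world!",
                         "Hello again, wide world"))  # -> 2 ("hello", "world")
```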
Comparison (Weights)
Bag-of-words comparisons are not very good when all tokens are treated the same: some tokens are more important than others. Weights give a way to specify which tokens to favor. With weights, when we compare documents, instead of counting common tokens, we sum up the weights of common tokens.
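A weighted overlap can be sketched as follows (the weights dict here is a made-up example, not a prescribed scheme):

```python
def weighted_overlap(tokens_a, tokens_b, weights):
    # Sum the weights of the tokens common to both documents;
    # tokens without an assigned weight default to 0.
    return sum(weights.get(t, 0) for t in set(tokens_a) & set(tokens_b))

weights = {"world": 2.0, "hello": 0.5}  # hypothetical weights
print(weighted_overlap(["hello", "world"], ["world", "peace"], weights))  # -> 2.0
```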
A good heuristic for assigning weights is called Term-Frequency/Inverse-Document-Frequency (tf-idf).
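In one common formulation (several variants exist), the weight of a term t in a document d drawn from a corpus D of N documents is:
<math> \text{tf-idf}(t, d, D) = \text{tf}(t, d) \times \log \frac{N}{|\{d' \in D : t \in d'\}|} </math>
where tf(t, d) is the number of occurrences of t in d, and the denominator counts the documents containing t. Terms that appear in few documents thus receive higher weights than terms that appear everywhere.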
Vector
A bag of words can represent a document as a vector where:
- Dimensions: one per unique token
- Magnitudes: the token weights
Example: with counts as weights, the string “Hello, world! Goodbye, world!” would be represented by this vector:
<math> \begin{array}{rcl} \text{hello} & \mapsto & 1 \\ \text{goodbye} & \mapsto & 1 \\ \text{world} & \mapsto & 2 \\ \vdots & & \vdots \\ \text{any other token} & \mapsto & 0 \\ \end{array} </math>
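Building this count vector is straightforward, e.g. with Python's collections.Counter (reusing the naive tokenizer sketched above):

```python
from collections import Counter
import re

def count_vector(document):
    # Map each token to its occurrence count; absent tokens are implicitly 0
    return Counter(re.findall(r"[a-z]+", document.lower()))

print(count_vector("Hello, world! Goodbye, world!"))
# -> Counter({'world': 2, 'hello': 1, 'goodbye': 1})
```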
Problem
With a large collection of documents, most documents will have a count of 0 for most terms (i.e., high sparsity).
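A quick illustration of the scale of this sparsity (the vocabulary size here is an assumed figure, not from the text):

```python
# With a hypothetical vocabulary of 100,000 terms, a short document
# touches only a handful of dimensions; all others are 0.
vocabulary_size = 100_000  # assumed corpus vocabulary size
doc_counts = {"hello": 1, "goodbye": 1, "world": 2}

non_zero = len(doc_counts)
print(f"{non_zero} non-zero entries out of {vocabulary_size} dimensions "
      f"({non_zero / vocabulary_size:.3%} dense)")
# -> 3 non-zero entries out of 100000 dimensions (0.003% dense)
```

This is why such vectors are, in practice, usually stored sparsely, keeping only the non-zero entries.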