What is a bag of words model? Also known as a bag of tokens in NLP

About

A bag of words is a representation model of a piece of text.

The idea is to treat strings (documents) as unordered collections of words, or tokens, i.e., as bags of words.

Bag-of-words techniques apply to any sort of token, so a “bag-of-words” is really a “bag-of-tokens”.

Stopwords add noise to bag-of-words comparisons, so they are usually excluded.

Words (i.e., tokens) are the atomic unit of text comparison. To compare two documents, we count how many tokens they have in common.
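As a minimal sketch of this comparison, the Python snippet below tokenizes two strings, drops stop words, and counts the tokens they share. The tokenizer regex, the tiny stop-word list, and the function names are illustrative assumptions, not part of any library. It counts distinct shared tokens; counting with multiplicity would be an equally reasonable reading.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def distinct_tokens(text):
    """Return the set of distinct non-stop-word tokens in the text."""
    return {t for t in tokenize(text) if t not in STOPWORDS}

def shared_token_count(doc_a, doc_b):
    """Count the distinct tokens that both documents contain."""
    return len(distinct_tokens(doc_a) & distinct_tokens(doc_b))

# "cat" and "on" are shared; "the", "a" and "is" are filtered out as stop words.
print(shared_token_count("The cat sat on the mat", "A cat is on the sofa"))  # 2
```

Word order plays no role here: shuffling the words of either sentence gives the same count.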

(Comparison|Weights)

Bag-of-words comparisons perform poorly when all tokens are treated equally, because some tokens are more important than others. Weights give a way to specify which tokens to favor. With weights, when we compare documents, instead of counting common tokens, we sum up the weights of the common tokens.
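A rough sketch of the weighted comparison follows. The weight values, the default weight for unknown tokens, and the function name `weighted_overlap` are assumptions chosen for illustration.

```python
def weighted_overlap(tokens_a, tokens_b, weights, default=1.0):
    """Sum the weights of the distinct tokens that both documents share."""
    common = set(tokens_a) & set(tokens_b)
    # Tokens without an explicit weight fall back to the assumed default.
    return sum(weights.get(token, default) for token in common)

# Hypothetical weights: informative tokens score higher than common ones.
weights = {"tokenizer": 3.0, "model": 1.5, "new": 0.25}

doc_a = ["new", "tokenizer", "model"]
doc_b = ["new", "model", "release"]

# The shared tokens are "new" and "model": 0.25 + 1.5
print(weighted_overlap(doc_a, doc_b, weights))  # 1.75
```

Setting every weight to 1 reduces this to the plain count of common tokens.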

A good heuristic for assigning weights is Term-Frequency/Inverse-Document-Frequency (tf-idf), which favors tokens that are frequent in a document but rare across the collection of documents.
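tf-idf exists in several variants. The sketch below assumes one simple form: the term frequency is the raw count of the token in the document, and the inverse document frequency is log(N / df), where N is the number of documents and df is the number of documents that contain the token. The function name `tf_idf` and the sample corpus are illustrative.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {token: weight} dict per tokenized document.

    Assumed variant: tf = raw count, idf = log(N / df).
    """
    n_docs = len(corpus)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(token for doc in corpus for token in set(doc))
    weighted = []
    for doc in corpus:
        counts = Counter(doc)
        weighted.append({
            token: count * math.log(n_docs / doc_freq[token])
            for token, count in counts.items()
        })
    return weighted

corpus = [
    ["hello", "world"],
    ["goodbye", "world"],
    ["world", "again"],
]
for weights in tf_idf(corpus):
    print(weights)
```

Under this variant, a token that appears in every document (here "world") gets a weight of 0, since log(N / N) = 0. Libraries often apply smoothing so that such tokens are not zeroed out entirely.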

Vector

A bag of words can represent a document as a vector where:

  • Dimensions: one per unique token
  • Magnitudes: token weights

Example: With the count as weights, the string “Hello, world! Goodbye, world!” would be represented by this vector:

<math> \begin{array}{rrr} \text{hello} & \mapsto & 1 \\ \text{goodbye} & \mapsto & 1 \\ \text{world} & \mapsto & 2 \\ \vdots & & \vdots \\ \text{other word} & \mapsto & 0 \\ \end{array} </math>
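The same count vector can be sketched in Python with `collections.Counter`, which returns 0 for any token it has not seen. The tokenizer regex and the function name `count_vector` are assumptions made for this example.

```python
import re
from collections import Counter

def count_vector(text):
    """Map each token in the text to its number of occurrences."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

vector = count_vector("Hello, world! Goodbye, world!")

# Missing tokens such as "other" read as 0 without being stored.
print(vector["hello"], vector["goodbye"], vector["world"], vector["other"])  # 1 1 2 0
```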

Problem

With many documents, most documents will have a count of 0 for most terms (i.e., high sparsity).
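To make the sparsity concrete, this sketch builds a dense document-term matrix for a tiny, made-up corpus and measures the fraction of zero entries. The function name `sparsity` and the corpus are illustrative assumptions, and the function expects a non-empty corpus.

```python
from collections import Counter

def sparsity(corpus):
    """Return the fraction of zero entries in the dense document-term matrix."""
    vocabulary = sorted({token for doc in corpus for token in doc})
    matrix = []
    for doc in corpus:
        counts = Counter(doc)
        # One dense row per document, one column per vocabulary token.
        matrix.append([counts[token] for token in vocabulary])
    total = len(corpus) * len(vocabulary)
    zeros = sum(row.count(0) for row in matrix)
    return zeros / total

corpus = [
    ["hello", "world"],
    ["goodbye", "world"],
    ["text", "mining"],
]
# 3 documents x 5 distinct tokens = 15 cells, 9 of which are zero.
print(sparsity(corpus))  # 0.6
```

Even with three short documents, more than half of the cells are zero. For this reason, practical implementations usually store only the nonzero counts, for example one dictionary per document or a sparse matrix.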
