About
This page covers model creation for natural language text, i.e. how to store and represent text.
Example: Similarity Search with a Simple Bag of Words
Let's say that you want to search a list of documents for documents that are similar on two dimensions, i.e. on two terms (words).
You would:
- count the occurrences of each word in each document
- calculate the Euclidean distance between the count vectors of each pair of documents
<MATH> distance = \sqrt{(\text{firstWordCount}_1 - \text{firstWordCount}_2)^2 + (\text{secondWordCount}_1 - \text{secondWordCount}_2)^2} </MATH>
- show the list of documents ordered by increasing distance (smallest distance = most similar)
With the words foo and bar, this raw-count distance puts the document (foo: 10, bar: 1) much closer to (foo: 1, bar: 10) than to, say, (foo: 200, bar: 1): the first pair is √((10−1)² + (1−10)²) ≈ 12.7 apart, while the second pair is √((10−200)² + (1−1)²) = 190 apart, even though the third document has the same foo/bar proportions as the first and differs only in length.
To correct this, normalize each vector by dividing its word counts by the vector's length (its Euclidean norm), then compare the normalized vectors; this yields the cosine distance. It is equivalent to projecting the points onto a unit circle and measuring the distance between them along the arc.
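Here is a minimal Python sketch of this bag-of-words search, assuming naive whitespace tokenization; the document names (doc1, doc2, doc3) and their foo/bar contents are made up to mirror the example above.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Count word occurrences (naive whitespace tokenization)."""
    return Counter(text.lower().split())

def euclidean_distance(a, b):
    """Euclidean distance between two count vectors (word -> count dicts)."""
    words = set(a) | set(b)
    return math.sqrt(sum((a.get(w, 0) - b.get(w, 0)) ** 2 for w in words))

def cosine_similarity(a, b):
    """Cosine similarity: each vector is normalized by its own length,
    which removes the document-length bias of raw counts."""
    dot = sum(count * b.get(w, 0) for w, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

docs = {
    "doc1": "foo " * 10 + "bar",    # (foo: 10,  bar: 1)
    "doc2": "foo " + "bar " * 10,   # (foo: 1,   bar: 10)
    "doc3": "foo " * 200 + "bar",   # (foo: 200, bar: 1)
}
query = bag_of_words(docs["doc1"])
by_euclid = sorted(docs, key=lambda d: euclidean_distance(query, bag_of_words(docs[d])))
by_cosine = sorted(docs, key=lambda d: cosine_similarity(query, bag_of_words(docs[d])), reverse=True)
print(by_euclid)  # ['doc1', 'doc2', 'doc3']: raw counts rank the long doc3 last
print(by_cosine)  # ['doc1', 'doc3', 'doc2']: normalization recovers doc3
```

By raw Euclidean distance the long foo-heavy doc3 ranks last, while by cosine similarity it ranks right behind doc1 itself, since normalization keeps only the word proportions.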
List
List of models, from simplest to most complex:
- bag of words
- N-gram language model
- embeddings
Usage
Full text search (i.e. similarities)
For a full-text search, modeling free text in a database (text engine) is a simple matter of:
- building an inverted file relation with tuples of the form (word, documentID, position),
- building a B+-tree index over the word column,
- adding metadata to aid in rank-ordering search results,
- and applying some linguistic canonicalization of words, as sketched below.
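As a rough illustration of these steps (not how a real text engine is implemented), here is a minimal Python sketch: an in-memory list stands in for the database relation, a sort stands in for the B+-tree index over the word column, and lowercasing stands in for linguistic canonicalization.

```python
def build_inverted_file(docs):
    """Build the inverted file as a flat relation:
    one (word, documentID, position) tuple per word occurrence.
    Lowercasing stands in for real linguistic canonicalization
    (stemming, accent folding, etc.)."""
    relation = []
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            relation.append((word, doc_id, position))
    # Sorting by word plays the role of the B+-tree index over the
    # word column: all tuples for a given word become contiguous.
    relation.sort()
    return relation

docs = {1: "The quick brown fox", 2: "the lazy brown dog"}
for row in build_inverted_file(docs):
    print(row)
# ('brown', 1, 2)
# ('brown', 2, 2)
# ('dog', 2, 3)
# ...
```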
Performance optimization:
- denormalizing the schema so that each word appears only once, followed by a list of its occurrences, i.e. (word, list<(documentID, position)>). This allows aggressive delta-compression of the list (typically called a postings list), which is critical given the characteristically skewed (Zipfian) distribution of words in documents; see the sketch below.
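The delta-compression idea can be sketched in a few lines: with postings sorted by documentID, storing gaps instead of absolute IDs keeps the numbers small and therefore cheap to encode. The postings values below are made up for illustration.

```python
def delta_encode(doc_ids):
    """Turn a sorted list of documentIDs into a list of gaps."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps

def delta_decode(gaps):
    """Recover the absolute documentIDs from the gaps."""
    doc_ids, running = [], 0
    for gap in gaps:
        running += gap
        doc_ids.append(running)
    return doc_ids

postings = [3, 7, 8, 15, 100]        # documents containing some word
print(delta_encode(postings))        # [3, 4, 1, 7, 85]
assert delta_decode(delta_encode(postings)) == postings
```

A real engine would then pack the small gaps with a variable-length code (e.g. variable-byte or Golomb coding); the point here is only the gap transformation.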
Detect and correct spelling errors
Language models can be used to detect and correct spelling errors. The N-gram language model is the most widely used language modeling approach.
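As a rough sketch of the idea, here is a toy bigram (N = 2) model with add-one smoothing used to choose between two spelling candidates; the corpus and the their/there candidates are invented, and a real system would train on much more text and sum log-probabilities to avoid underflow.

```python
from collections import Counter

# Toy corpus; a real model would be trained on a large text collection.
corpus = "we went to the store . they went to their house .".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
vocab_size = len(unigrams)

def bigram_prob(prev, word, alpha=1.0):
    """P(word | prev) with add-alpha (Laplace) smoothing, so unseen
    bigrams get a small non-zero probability instead of zero."""
    return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab_size)

def sentence_score(words):
    """Product of bigram probabilities over the sentence."""
    score = 1.0
    for prev, word in zip(words, words[1:]):
        score *= bigram_prob(prev, word)
    return score

# The model prefers the candidate whose word sequences it has seen before,
# which is how real-word spelling errors ("there" for "their") get caught.
candidates = ["they went to their house".split(),
              "they went to there house".split()]
print(" ".join(max(candidates, key=sentence_score)))  # they went to their house
```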