About
This page covers model creation for natural language text, i.e. how to store and represent text.
Example: Similarity Search with a Simple Bag of Words
Let's say that you want to search a list of documents for documents that are similar on two dimensions, i.e. on two terms (words).
You would:
- count the occurrences of each word in each document
- calculate the Euclidean distance between the count vectors of each pair of documents
<MATH> distance = \sqrt{(\text{firstWordCount}_1 - \text{firstWordCount}_2)^2 + (\text{secondWordCount}_1 - \text{secondWordCount}_2)^2} </MATH>
- show the list of documents ordered by increasing distance (smallest distance = most similar)
With the words foo and bar, this raw-count distance puts the document (foo: 10, bar: 1) much closer to (foo: 1, bar: 10) than to, say, (foo: 200, bar: 1): the first pair is √((10−1)² + (1−10)²) ≈ 12.7 apart, while the second pair is √((10−200)² + (1−1)²) = 190 apart, even though the third document has the same foo/bar proportions as the first and differs only in length.
To correct this, normalize each vector by dividing its word counts by the vector's length (its Euclidean norm), then compare the normalized vectors; this yields the cosine distance. It is equivalent to projecting the points onto a unit circle and measuring the distance between them along the arc.
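Here is a minimal Python sketch of this bag-of-words search, assuming naive whitespace tokenization; the document names (doc1, doc2, doc3) and their foo/bar contents are made up to mirror the example above.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Count word occurrences (naive whitespace tokenization)."""
    return Counter(text.lower().split())

def euclidean_distance(a, b):
    """Euclidean distance between two count vectors (word -> count dicts)."""
    words = set(a) | set(b)
    return math.sqrt(sum((a.get(w, 0) - b.get(w, 0)) ** 2 for w in words))

def cosine_similarity(a, b):
    """Cosine similarity: each vector is normalized by its own length,
    which removes the document-length bias of raw counts."""
    dot = sum(count * b.get(w, 0) for w, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

docs = {
    "doc1": "foo " * 10 + "bar",    # (foo: 10,  bar: 1)
    "doc2": "foo " + "bar " * 10,   # (foo: 1,   bar: 10)
    "doc3": "foo " * 200 + "bar",   # (foo: 200, bar: 1)
}
query = bag_of_words(docs["doc1"])
by_euclid = sorted(docs, key=lambda d: euclidean_distance(query, bag_of_words(docs[d])))
by_cosine = sorted(docs, key=lambda d: cosine_similarity(query, bag_of_words(docs[d])), reverse=True)
print(by_euclid)  # ['doc1', 'doc2', 'doc3']: raw counts rank the long doc3 last
print(by_cosine)  # ['doc1', 'doc3', 'doc2']: normalization recovers doc3
```

By raw Euclidean distance the long foo-heavy doc3 ranks last, while by cosine similarity it ranks right behind doc1 itself, since normalization keeps only the word proportions.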
List
List of models, from simplest to most complex:
- bag of words
- N-gram language model
- embeddings
Usage
Full text search (i.e. similarities)
For a full-text search, modeling free text in a database (text engine) is a simple matter of:
- building an inverted file relation with tuples of the form (word, documentID, position),
- building a B+-tree index over the word column,
- adding metadata to aid in rank-ordering search results,
- and applying some linguistic canonicalization of words, as sketched below.
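As a rough illustration of these steps (not how a real text engine is implemented), here is a minimal Python sketch: an in-memory list stands in for the database relation, a sort stands in for the B+-tree index over the word column, and lowercasing stands in for linguistic canonicalization.

```python
def build_inverted_file(docs):
    """Build the inverted file as a flat relation:
    one (word, documentID, position) tuple per word occurrence.
    Lowercasing stands in for real linguistic canonicalization
    (stemming, accent folding, etc.)."""
    relation = []
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            relation.append((word, doc_id, position))
    # Sorting by word plays the role of the B+-tree index over the
    # word column: all tuples for a given word become contiguous.
    relation.sort()
    return relation

docs = {1: "The quick brown fox", 2: "the lazy brown dog"}
for row in build_inverted_file(docs):
    print(row)
# ('brown', 1, 2)
# ('brown', 2, 2)
# ('dog', 2, 3)
# ...
```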
Performance optimization:
- denormalizing the schema so that each word appears only once, followed by a list of its occurrences, i.e. (word, list<(documentID, position)>). This allows aggressive delta-compression of the list (typically called a postings list), which is critical given the characteristically skewed (Zipfian) distribution of words in documents; see the sketch below.
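The delta-compression idea can be sketched in a few lines: with postings sorted by documentID, storing gaps instead of absolute IDs keeps the numbers small and therefore cheap to encode. The postings values below are made up for illustration.

```python
def delta_encode(doc_ids):
    """Turn a sorted list of documentIDs into a list of gaps."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps

def delta_decode(gaps):
    """Recover the absolute documentIDs from the gaps."""
    doc_ids, running = [], 0
    for gap in gaps:
        running += gap
        doc_ids.append(running)
    return doc_ids

postings = [3, 7, 8, 15, 100]        # documents containing some word
print(delta_encode(postings))        # [3, 4, 1, 7, 85]
assert delta_decode(delta_encode(postings)) == postings
```

A real engine would then pack the small gaps with a variable-length code (e.g. variable-byte or Golomb coding); the point here is only the gap transformation.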
Detect and correct spelling errors
Language models can be used to detect and correct spelling errors. The N-gram language model is the most widely used language modeling approach.
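As a rough sketch of the idea, here is a toy bigram (N = 2) model with add-one smoothing used to choose between two spelling candidates; the corpus and the their/there candidates are invented, and a real system would train on much more text and sum log-probabilities to avoid underflow.

```python
from collections import Counter

# Toy corpus; a real model would be trained on a large text collection.
corpus = "we went to the store . they went to their house .".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
vocab_size = len(unigrams)

def bigram_prob(prev, word, alpha=1.0):
    """P(word | prev) with add-alpha (Laplace) smoothing, so unseen
    bigrams get a small non-zero probability instead of zero."""
    return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab_size)

def sentence_score(words):
    """Product of bigram probabilities over the sentence."""
    score = 1.0
    for prev, word in zip(words, words[1:]):
        score *= bigram_prob(prev, word)
    return score

# The model prefers the candidate whose word sequences it has seen before,
# which is how real-word spelling errors ("there" for "their") get caught.
candidates = ["they went to their house".split(),
              "they went to there house".split()]
print(" ".join(max(candidates, key=sentence_score)))  # they went to their house
```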