This page is about building models of natural language text, i.e. how to store and represent text.
Suppose you want to search a list of documents for those that are similar on two dimensions, i.e. on two terms (words).
You would compute the Euclidean distance between the word counts of two documents:
<MATH> distance = \sqrt{(firstWordCount_1 - firstWordCount_2)^2 + (secondWordCount_1 - secondWordCount_2)^2} </MATH>
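With the words foo and bar, this distance calculation puts the document (foo: 10, bar: 1) much closer to (foo: 1, bar: 10) than to, say, (foo: 200, bar: 1), even though the first and the last documents use the two words in almost the same proportion.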
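The foo/bar comparison can be checked with a minimal sketch. The function name `euclidean_distance` and the variables `doc_a`, `doc_b`, `doc_c` are illustrative assumptions, not part of any library; each document is reduced to a pair of counts (foo, bar).

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length word-count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Word counts as (foo, bar); the names are hypothetical
doc_a = (10, 1)
doc_b = (1, 10)
doc_c = (200, 1)

d_ab = euclidean_distance(doc_a, doc_b)
d_ac = euclidean_distance(doc_a, doc_c)

print(round(d_ab, 2))  # 12.73
print(round(d_ac, 2))  # 190.0
```

On raw counts, `doc_a` is far closer to `doc_b` than to `doc_c`, although `doc_a` and `doc_c` are both dominated by foo.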
To correct this, you can normalize each vector by dividing its word counts by the length of the vector (its Euclidean norm), which leads to the cosine distance. This is equivalent to projecting the points onto the unit circle and measuring the distance along the arc.
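A minimal sketch of this normalization, assuming each vector is divided by its own Euclidean norm so that it lands on the unit circle. The helper names `normalize`, `cosine_similarity` and `arc_distance` are illustrative, not a library API.

```python
import math

def normalize(v):
    """Scale a vector to length 1, i.e. project it onto the unit circle."""
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v)

def cosine_similarity(a, b):
    """Dot product of the two normalized vectors (1.0 = same direction)."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def arc_distance(a, b):
    """Angle in radians between the vectors: the distance along the unit arc."""
    cos = max(-1.0, min(1.0, cosine_similarity(a, b)))  # guard against rounding
    return math.acos(cos)

# Word counts as (foo, bar); the names are hypothetical
doc_a = (10, 1)
doc_b = (1, 10)
doc_c = (200, 1)

sim_ab = cosine_similarity(doc_a, doc_b)  # about 0.198
sim_ac = cosine_similarity(doc_a, doc_c)  # about 0.996
print(arc_distance(doc_a, doc_c) < arc_distance(doc_a, doc_b))  # True
```

After normalization the ordering flips: the two documents dominated by foo are now the closest pair, regardless of how long each document is.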
List of models by simplicity:
For a full text search, modeling free-text in a database (text engine) is a simple matter of:
Performance optimization;
The models can be used to detect and correct spelling errors.
The n-gram language model is one of the most widely used language modeling approaches.
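As an illustration, here is a minimal sketch of a bigram (2-gram) model that estimates the probability of a word given the previous word from raw counts. The function names and the toy sentence are assumptions made for this example, and no smoothing is applied, so unseen pairs get probability 0.

```python
from collections import Counter, defaultdict

def train_bigram_model(tokens):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def bigram_probability(counts, prev, nxt):
    """Maximum-likelihood estimate of P(nxt | prev); 0.0 if prev was never seen."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

# Toy corpus, whitespace-tokenized (hypothetical data)
tokens = "the cat sat on the mat the cat ran".split()
model = train_bigram_model(tokens)

p = bigram_probability(model, "the", "cat")
print(round(p, 3))  # 0.667
```

In the toy corpus "the" is followed by "cat" twice and by "mat" once, hence 2/3. A real model would typically add smoothing and sentence boundary markers.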