Table of Contents

What are models of text in NLP? (Natural Language, Text Modeling)

About

This page talks about model creation for natural language text.

ie how to store and represent text ?

Example: Similarity Search with a Simple Bag of words

Let's say that you want to search in a list of documents, documents that are similar on 2 dimensions, ie on 2 terms, ie on 2 words

You would:

<MATH> distance = \sqrt{(firstWordCount_1 - firstWordCount_2)² + (secondWordCount_1 - secondWordCount_2)² } </MATH>

With the words foo and bar, this distance calculation puts the document (foo10, bar1) much closer to a (foo1, bar10) than, say (foo200, bar1).

To correct it, you can normalize your vectors by dividing the number of word mentions by the total words of mentions on all documents to get the cosine distance. This is equivalent to projecting our points onto a unit circle and measuring the distances along the arc

List

List of models by simplicity:

Usage

Full text search (Ie Similarities)

For a full text search, modeling free-text in a database (text engine) is a simple matter of:

Performance optimization;

Detect and correct spelling errors

The models can be used to detect and correct spelling errors.

The N-gram language model is the most widely used language modeling approach.