What are models of text in NLP? (Natural Language, Text Modeling)

Text Mining

About

This page talks about model creation for natural language text.

ie how to store and represent text ?

Example: Similarity Search with a Simple Bag of words

Let's say that you want to search in a list of documents, documents that are similar on 2 dimensions, ie on 2 terms, ie on 2 words

You would:

  • count up the instances of the words for each document
  • Calculate the euclidean distance for each document

<MATH> distance = \sqrt{(firstWordCount_1 - firstWordCount_2)² + (secondWordCount_1 - secondWordCount_2)² } </MATH>

  • show the lists of documents ordered by the smaller distance

With the words foo and bar, this distance calculation puts the document (foo10, bar1) much closer to a (foo1, bar10) than, say (foo200, bar1).

To correct it, you can normalize your vectors by dividing the number of word mentions by the total words of mentions on all documents to get the cosine distance. This is equivalent to projecting our points onto a unit circle and measuring the distances along the arc

List

List of models by simplicity:

Usage

Full text search (Ie Similarities)

For a full text search, modeling free-text in a database (text engine) is a simple matter of:

  • building an inverted file relation with tuples of the form word, documentID, position,
  • building a B+-tree index over the word column.
  • adding metadata to aid in rank-ordering search results
  • and applying some linguistic canonicalization of words

Performance optimization;

  • denormalizing the schema to have each word appear only once with a list of occurrences per word, i.e. word, list <documentID, position>. It allows for aggressive delta-compression of the list (typically called a postings list), which is critical given the characteristically skewed (Zipfian) distribution of words in documents.

Detect and correct spelling errors

The models can be used to detect and correct spelling errors.

The N-gram language model is the most widely used language modeling approach.





Discover More
Text Mining
(Natural|Human) Language - Text (Mining|Analytics)

See Tweet Web site comments Weblogs Forum comment ... A tweet is analyzed differently than a long blog post and a blog comment is analyzed differently than a tweet. If you want to use any...
Data System Architecture
Logical Data Modeling

A data model in software engineering is a graph of entity that try to represent the reality and describes how data are represented and accessed. the real world consists of entities and relationships....
Lucene

Lucene Lucene is a text search engine library. The following application are Lucene application (ie build on it): * Solr * Elastic Search * New Relic Logs * ... The text data model of...
Data System Architecture
Text - Modelling

See
What is a Full Text Search Engine ?

Search Engine (Full Text Search) Full-text search is a battle between: * precision—returning as few irrelevant documents as possible * and recall—returning as many relevant documents as possible....
Text Mining
What is a bag of words model? known also as a bag of tokens in NLP

A bag of words is a representation model of a piece of text. The idea is to treat strings (documents), as unordered collections of words, or tokens, i.e., as bags of words. Bag of words techniques all...



Share this page:
Follow us:
Task Runner