Natural Language - Text Modeling


Logical Data Modeling in Text.

How to store and represent text?


For full-text search, modeling free text in a database (a text engine) comes down to:

  • building an inverted file relation with tuples of the form (word, documentID, position),
  • building a B+-tree index over the word column,
  • adding metadata to aid in rank-ordering search results,
  • and applying some linguistic canonicalization of words.
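
The first two steps and a crude form of canonicalization can be sketched as follows. This is a minimal illustration, not a production text engine: it assumes SQLite as the database (SQLite stores its indexes as B-trees, which stand in here for the B+-tree index), and the names inverted_file, idx_word, tokenize, build_inverted_file and search are hypothetical. Ranking metadata is left out.

```python
import re
import sqlite3


def tokenize(text):
    """Lowercase and split on non-alphanumerics: a crude stand-in for canonicalization."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_file(docs):
    """Build an in-memory inverted file relation from {document_id: text}."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE inverted_file (word TEXT, document_id INTEGER, position INTEGER)"
    )
    rows = [
        (word, doc_id, pos)
        for doc_id, text in docs.items()
        for pos, word in enumerate(tokenize(text))
    ]
    conn.executemany("INSERT INTO inverted_file VALUES (?, ?, ?)", rows)
    # Index over the word column so lookups do not scan the whole relation.
    conn.execute("CREATE INDEX idx_word ON inverted_file (word)")
    return conn


def search(conn, word):
    """Return (document_id, position) pairs for a word, ordered by document, then position."""
    cur = conn.execute(
        "SELECT document_id, position FROM inverted_file "
        "WHERE word = ? ORDER BY document_id, position",
        (tokenize(word)[0],),
    )
    return cur.fetchall()


docs = {1: "The quick brown fox", 2: "The lazy dog and the quick cat"}
conn = build_inverted_file(docs)
print(search(conn, "Quick"))  # [(1, 1), (2, 5)]
```

The query term goes through the same tokenize function as the documents, so "Quick" and "quick" match the same tuples.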

Performance optimization:

  • denormalizing the schema so that each word appears only once, with a list of occurrences per word, i.e. (word, list<documentID, position>). This allows aggressive delta compression of the list (typically called a postings list), which is critical given the characteristically skewed (Zipfian) distribution of words in documents: the most frequent words have very long lists.
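
A minimal sketch of the denormalized form and of delta compression, under stated assumptions: only the document IDs of a postings list are compressed, the gaps are stored with a variable-byte encoding (one common choice, not the only one), and all function names are hypothetical.

```python
from collections import defaultdict


def build_postings(tuples):
    """Group (word, document_id, position) tuples into word -> sorted [(document_id, position)]."""
    postings = defaultdict(list)
    for word, doc_id, position in tuples:
        postings[word].append((doc_id, position))
    return {word: sorted(occurrences) for word, occurrences in postings.items()}


def varint(n):
    """Encode a non-negative integer in 7-bit groups; the high bit marks 'more bytes follow'."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def compress_doc_ids(doc_ids):
    """Delta-encode sorted document IDs, then varint-encode each gap."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        out += varint(doc_id - prev)
        prev = doc_id
    return bytes(out)


def decompress_doc_ids(data):
    """Invert compress_doc_ids: decode the gaps and sum them back into document IDs."""
    doc_ids, current, shift, prev = [], 0, 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += current
            doc_ids.append(prev)
            current, shift = 0, 0
    return doc_ids


doc_ids = [1000, 1003, 1004, 1100]
packed = compress_doc_ids(doc_ids)
print(len(packed))  # 5
print(decompress_doc_ids(packed) == doc_ids)  # True
```

The gaps here are 1000, 3, 1 and 96, so only the first needs two bytes. Frequent words occur in many documents, which makes their gaps small and their long lists the ones that compress best.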


Language models can be used to:

  • detect and correct spelling errors.

The N-gram language model is the most widely used language modeling approach.
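
As an illustration, here is a minimal bigram (N = 2) model with add-one smoothing, used to choose between spelling candidates given the preceding word. The toy corpus, the class name BigramModel and its methods are assumptions made for this sketch; a real spelling corrector would also need a way to generate the candidates.

```python
from collections import Counter


class BigramModel:
    """Bigram language model with add-one (Laplace) smoothing."""

    def __init__(self, sentences):
        self.context_counts = Counter()  # how often each word is followed by another
        self.bigrams = Counter()         # counts of (previous word, word) pairs
        self.vocab = set()
        for sentence in sentences:
            tokens = ["<s>"] + sentence.lower().split()
            self.vocab.update(tokens)
            self.context_counts.update(tokens[:-1])
            self.bigrams.update(zip(tokens, tokens[1:]))

    def prob(self, prev, word):
        """Estimate P(word | prev); unseen pairs still get a small non-zero probability."""
        return (self.bigrams[(prev, word)] + 1) / (
            self.context_counts[prev] + len(self.vocab)
        )

    def best_candidate(self, prev, candidates):
        """Pick the candidate that the model finds most likely after prev."""
        return max(candidates, key=lambda w: self.prob(prev, w))


corpus = ["the cat sat on the mat", "the dog sat on the rug", "a cat ate the fish"]
model = BigramModel(corpus)
print(model.best_candidate("the", ["cat", "cut"]))  # cat
```

The pair ("the", "cat") occurs in the corpus while ("the", "cut") does not, so the model prefers "cat" as the correction.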
