Natural Language Processing - (Tokenization|Parser|Text Segmentation|Word Break rules|Text Analysis)

About

Tokenization is the process of breaking input text into small indexing elements – tokens.
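
As a minimal sketch of the idea (not any particular library's implementation), tokenization can be as simple as a regular expression that extracts runs of word characters:

```python
import re

def tokenize(text: str) -> list[str]:
    # Break the input text into small indexing elements (tokens).
    # \w+ matches runs of letters, digits, and underscores; real
    # tokenizers use far richer rules (see Unicode UAX #29 below).
    return re.findall(r"\w+", text)

print(tokenize("Tokenization breaks input text into tokens."))
# ['Tokenization', 'breaks', 'input', 'text', 'into', 'tokens']
```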

Parsing and tokenization are together often called Text Analysis, or simply Analysis, in NLP.

The tokens (or terms) are used either to build the index (indexing) or to search against it (querying).

Compilers also tokenize code. A lexer (also known as a tokenizer or scanner; see Lexical Analysis) does this, but with the help of a structure called the grammar.
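
A toy, hypothetical lexer makes the difference visible: unlike the plain-text tokenizer above, each token is classified against a small set of grammar-defined token types:

```python
import re

# Hypothetical token types for a toy expression grammar.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=]"),
    ("SKIP",   r"\s+"),
]
LEXER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def lex(code: str):
    # Yield (type, text) pairs, skipping whitespace.
    for match in LEXER.finditer(code):
        if match.lastgroup != "SKIP":
            yield match.lastgroup, match.group()

print(list(lex("x = 40 + 2")))
# [('IDENT', 'x'), ('OP', '='), ('NUMBER', '40'), ('OP', '+'), ('NUMBER', '2')]
```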

Tokenization

Pre

Pre-tokenization analysis can include, but is not limited to, stripping HTML markup and transforming or removing text that matches arbitrary patterns or sets of fixed strings. An HTML-stripping sketch is shown below.
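
As an illustration, a rough sketch of an HTML-stripping pre-tokenization step (a regular expression is used for brevity; production character filters such as Lucene's HTMLStripCharFilter are more robust):

```python
import re

def strip_html(text: str) -> str:
    # Pre-tokenization (character-level) cleanup: remove HTML tags
    # before the tokenizer ever sees the text.
    return re.sub(r"<[^>]+>", " ", text)

print(strip_html("<p>Hello <b>world</b></p>"))
# ' Hello  world  '
```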

During

Sentence beginnings and endings can be identified to provide more accurate phrase and proximity searches.
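
A deliberately naive sentence-boundary sketch (real segmenters must also handle abbreviations, quotations, and so on):

```python
import re

def split_sentences(text: str) -> list[str]:
    # Naive rule: a sentence ends at ., ! or ? followed by whitespace
    # and an uppercase letter. Abbreviations such as "Dr." break this.
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)

print(split_sentences("Tokens are small. Sentences are bigger! Right?"))
# ['Tokens are small.', 'Sentences are bigger!', 'Right?']
```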

Post (Analysis Process)

There are many post-tokenization steps that can be performed, including (but not limited to) lowercasing, stop-word filtering, stemming, and synonym expansion, as sketched below.
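
A sketch of such a filter chain; the stop-word list and the suffix-stripping "stemmer" below are illustrative stand-ins, not a real stemming algorithm such as Porter's:

```python
STOP_WORDS = {"the", "a", "an", "of", "and"}  # illustrative subset

def lowercase(tokens):
    return (t.lower() for t in tokens)

def filter_stop_words(tokens):
    return (t for t in tokens if t not in STOP_WORDS)

def crude_stem(tokens):
    # Toy stemmer: strip a trailing "s". Real stemmers (e.g. Porter)
    # apply many ordered rewrite rules.
    return (t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens)

tokens = ["The", "Parsers", "of", "Tokens"]
print(list(crude_stem(filter_stop_words(lowercase(tokens)))))
# ['parser', 'token']
```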

Operations

This process is used for both indexing and querying, but the same operations are not always needed on both sides. For instance, synonym expansion is often applied only at query time (or only at index time), not both.
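
A sketch of that asymmetry, using a hypothetical synonym map that is expanded only at query time:

```python
SYNONYMS = {"nlp": ["nlp", "natural language processing"]}  # hypothetical map

def analyze_for_index(text: str) -> list[str]:
    # Index time: tokenize and lowercase only.
    return text.lower().split()

def analyze_for_query(text: str) -> list[str]:
    # Query time: additionally expand synonyms, so a query for "nlp"
    # also matches documents indexed as "natural language processing".
    terms = []
    for token in analyze_for_index(text):
        terms.extend(SYNONYMS.get(token, [token]))
    return terms

print(analyze_for_index("NLP tokenization"))
# ['nlp', 'tokenization']
print(analyze_for_query("NLP tokenization"))
# ['nlp', 'natural language processing', 'tokenization']
```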

Query

Algorithm

Unicode

Lucene's default analysis (the StandardAnalyzer) tokenizes text with the word break rules of the Unicode Text Segmentation algorithm (Annex #29), converts the tokens to lowercase, and can then filter out stop words.
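
Lucene implements this pipeline in Java (a StandardTokenizer followed by a lowercase filter and an optional stop filter). A rough approximation of the UAX #29 word-break step, assuming the PyICU binding to ICU is installed:

```python
from icu import BreakIterator, Locale

def uax29_tokens(text: str, stop_words=frozenset({"the", "of"})) -> list[str]:
    # Segment with the Unicode Text Segmentation (UAX #29) word break
    # rules, then lowercase and drop stop words, mimicking what
    # Lucene's StandardAnalyzer does in Java.
    bi = BreakIterator.createWordInstance(Locale("en_US"))
    bi.setText(text)
    tokens, start = [], bi.first()
    for end in bi:
        word = text[start:end].lower()
        start = end
        if any(c.isalnum() for c in word) and word not in stop_words:
            tokens.append(word)
    return tokens

print(uax29_tokens("The Unicode algorithm segments text."))
# ['unicode', 'algorithm', 'segments', 'text']
```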

Documentation / Reference

Unicode Standard Annex #29: Unicode Text Segmentation - https://unicode.org/reports/tr29/