Tokenization is the process of breaking input text into small indexing elements – tokens.
Parsing and tokenization are often called Text Analysis, or simply Analysis, in NLP.
The tokens (or terms) are used either to build the index during indexing, or to match against indexed terms during querying.
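The dual role of tokens can be sketched with a tiny inverted index. This is a minimal illustration, not a real search engine: the tokenizer is a naive regex split, and the documents are made up for the example.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Naive tokenizer for illustration: lowercase, then split on word characters.
    return re.findall(r"\w+", text.lower())

# Build a tiny inverted index: token -> set of document ids.
docs = {
    1: "Tokenization breaks text into tokens",
    2: "Tokens are indexed for search",
}
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in tokenize(text):
        index[token].add(doc_id)

# The same tokenizer is applied to the query so query terms match indexed terms.
query_tokens = tokenize("Tokens")
hits = set.union(*(index.get(t, set()) for t in query_tokens))
print(sorted(hits))  # → [1, 2]
```

Because the query "Tokens" is analyzed with the same tokenizer used at index time, it matches both documents despite differences in case.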
Compilers also tokenize code, but with the help of a structure called the grammar. This step is known as lexical analysis, and the component that performs it is called a lexer (also known as a tokenizer or scanner).
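A lexer can be sketched in a few lines: the token types below are defined by regular expressions, which play the role of the lexical part of a grammar. The token names and the toy expression are invented for this example.

```python
import re

# Each token type is a named regular expression (the lexical "grammar").
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=]"),
    ("SKIP",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def lex(code):
    tokens = []
    for match in MASTER.finditer(code):
        kind = match.lastgroup
        if kind != "SKIP":  # whitespace carries no meaning here
            tokens.append((kind, match.group()))
    return tokens

print(lex("x = 42 + y"))
# → [('IDENT', 'x'), ('OP', '='), ('NUMBER', '42'), ('OP', '+'), ('IDENT', 'y')]
```

Real lexers add line/column tracking and error handling, but the core idea, classifying character runs into typed tokens, is the same.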
Pre-tokenization analysis can include (but is not limited to) stripping HTML markup, and transforming or removing text that matches arbitrary patterns or sets of fixed strings.
Sentence beginnings and endings can be identified to provide for more accurate phrase and proximity searches.
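Sentence boundary detection can be sketched with a simple rule before tokenization proper. The boundary rule below (sentence-final punctuation followed by whitespace) is deliberately naive; production analyzers use far more robust rules.

```python
import re

def split_sentences(text):
    # Split after '.', '!', or '?' when followed by whitespace.
    # Naive: abbreviations like "Dr." would be mis-split.
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

sentences = split_sentences("Search is fun. Tokens matter! Right?")
print(sentences)  # → ['Search is fun.', 'Tokens matter!', 'Right?']
```

Once boundaries are known, a phrase or proximity query can be prevented from matching across a sentence break, for example by inserting a position gap between sentences.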
There are many post-tokenization steps that can be performed, including (but not limited to) stemming, stop-word filtering, text normalization (such as lowercasing), and synonym expansion.
These processes are used for both indexing and querying, but the same operations are not always needed for both. For instance, synonym expansion is often applied only at query time, which keeps the index smaller.
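The asymmetry between the two analysis chains can be sketched as follows. The synonym table and the chains themselves are hypothetical, chosen only to show query-time-only synonym expansion.

```python
import re

SYNONYMS = {"tv": {"tv", "television"}}  # hypothetical synonym table

def tokenize(text):
    # Shared base step: lowercase word tokens.
    return re.findall(r"\w+", text.lower())

def analyze_for_index(text):
    # Index-time chain: tokenize + lowercase only (no synonym expansion).
    return tokenize(text)

def analyze_for_query(text):
    # Query-time chain adds synonym expansion, so the index stays small.
    expanded = []
    for token in tokenize(text):
        expanded.extend(sorted(SYNONYMS.get(token, {token})))
    return expanded

print(analyze_for_index("Television repair"))  # → ['television', 'repair']
print(analyze_for_query("TV repair"))          # → ['television', 'tv', 'repair']
```

A document containing "Television" is indexed as-is, yet a query for "TV" still finds it, because the query chain expands "tv" into both variants.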
Lucene's default analyzer (StandardAnalyzer) tokenizes text according to the word-break rules of the Unicode Text Segmentation algorithm (UAX #29), converts the tokens to lowercase, and then filters out stop words. Note that UAX #29 itself only defines where token boundaries fall; the lowercasing and stop-word filtering are separate steps in the analyzer chain.
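A StandardAnalyzer-style chain can be approximated in a few lines. This is a hedged sketch: the `\w+` regex only approximates the full UAX #29 word-break rules, and the stop set below is illustrative, not Lucene's.

```python
import re

STOP_WORDS = {"the", "and", "of", "a", "an", "to"}  # illustrative stop set

def analyze(text):
    # 1. Tokenize on word boundaries (rough stand-in for UAX #29 rules).
    tokens = re.findall(r"\w+", text)
    # 2. Lowercase each token.
    tokens = [t.lower() for t in tokens]
    # 3. Drop stop words.
    return [t for t in tokens if t not in STOP_WORDS]

print(analyze("The Art of Tokenization"))  # → ['art', 'tokenization']
```

The three stages mirror the tokenizer-then-filters structure of the default chain: each stage consumes the previous stage's token stream.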