Natural Language Processing - Tokenization (Parser, Text Segmentation, Word Break Rules, Text Analysis)
About
Tokenization is the process of breaking input text into small indexing elements – tokens.
Parsing and tokenization are often called Text Analysis (or simply Analysis) in NLP.
The tokens (or terms) are used either:
to build the index of those terms when a new document is added,
or, at query time, to identify which documents contain the terms you are searching for.
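As a minimal, library-agnostic sketch of this breaking step (the class name and the splitting rule are only illustrative), the code below splits a string into tokens at runs of non-letter, non-digit characters; real tokenizers such as Lucene's StandardTokenizer use the Unicode word break rules described later on this page.

```java
import java.util.ArrayList;
import java.util.List;

public class SimpleTokenizer {

    // Breaks the input into tokens by splitting on any run of non-letter,
    // non-digit characters. Real analyzers use much richer rules.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String candidate : text.split("[^\\p{L}\\p{N}]+")) {
            if (!candidate.isEmpty()) {
                tokens.add(candidate);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints: [The, quick, brown, fox, jumps]
        System.out.println(tokenize("The quick, brown fox jumps!"));
    }
}
```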
Tokenization
Pre
Pre-tokenization analysis can include, but is not limited to, stripping markup (for example HTML tags) and transforming or replacing characters before the tokenizer sees the text.
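A minimal plain-Java sketch of this step, assuming a toy pre-processor (the class name and rules are illustrative; in Lucene this role is played by char filters such as HTMLStripCharFilter):

```java
public class PreTokenization {

    // Removes HTML tags and maps typographic quotes to plain ASCII quotes
    // so that the tokenizer only ever sees cleaned-up text.
    static String preProcess(String raw) {
        String noMarkup = raw.replaceAll("<[^>]*>", " ");  // strip tags
        return noMarkup.replace('\u2019', '\'')            // right single quote -> '
                       .replace('\u201C', '"')             // left double quote  -> "
                       .replace('\u201D', '"');            // right double quote -> "
    }

    public static void main(String[] args) {
        // Prints (modulo extra spaces where the tags were): John's bike
        System.out.println(preProcess("<p>John\u2019s <b>bike</b></p>"));
    }
}
```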
During
Sentence beginnings and endings can be identified to provide for more accurate phrase and proximity searches.
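Sentence boundaries can, for instance, be found with the JDK's java.text.BreakIterator; the sketch below only prints the sentences, whereas a search library would typically record the boundaries as token attributes or position gaps.

```java
import java.text.BreakIterator;
import java.util.Locale;

public class SentenceBoundaries {
    public static void main(String[] args) {
        String text = "Tokenization breaks text into tokens. Sentence boundaries help phrase search.";
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        // Walk from boundary to boundary and print each sentence.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            System.out.println(text.substring(start, end).trim());
        }
    }
}
```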
Post (Analysis Process)
There are many post-tokenization steps that can be done, including (but not limited to) the following; a small sketch covering these steps follows the list:
Stemming – Replacing/mapping words with their stems. For instance, with English stemming, “bikes” is replaced with “bike”; now a query for “bike” can find both documents containing “bike” and those containing “bikes”.
Stop Words Filtering – Common words like “the”, “and” and “a” rarely add any value to a search. Removing them shrinks the index size and increases performance. It may also reduce some “noise” and actually improve search quality.
Text Normalization – Stripping accents and other character markings can make for better searching.
Synonym Expansion – Adding in synonyms at the same token position as the current word can mean better matching when users search with words in the synonym set.
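Below is a compact, self-contained sketch of the four steps above, using deliberately toy rules (a one-rule stemmer, a three-word stop list and a one-entry synonym map) rather than a real analyzer; it is meant only to show the order in which the filters are applied.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class PostTokenizationPipeline {

    static final Set<String> STOP_WORDS = Set.of("the", "and", "a");
    static final Map<String, String> SYNONYMS = Map.of("bike", "bicycle");

    // Toy stemmer: strips a plural "s"; real analyzers use Porter/Snowball stemming.
    static String stem(String token) {
        return token.endsWith("s") ? token.substring(0, token.length() - 1) : token;
    }

    // Text normalization: lowercase and strip accents/diacritics.
    static String normalize(String token) {
        String lower = token.toLowerCase(Locale.ROOT);
        return Normalizer.normalize(lower, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
    }

    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) continue;
            String term = normalize(token);                 // text normalization
            if (STOP_WORDS.contains(term)) continue;        // stop word filtering
            term = stem(term);                               // stemming
            terms.add(term);
            String synonym = SYNONYMS.get(term);
            if (synonym != null) terms.add(synonym);         // synonym expansion (same position in a real index)
        }
        return terms;
    }

    public static void main(String[] args) {
        // Prints: [bike, bicycle, resume] – stop words dropped, "bikes" stemmed,
        // the synonym "bicycle" added, and the accents of "résumé" stripped.
        System.out.println(analyze("The bikes and the résumé"));
    }
}
```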
Operations
These processes are used for both indexing and querying, but the same operations are not always needed for both. For instance (see the sketch after this list):
for indexing, a text normalization operation will increase recall because, for example, “ram”, “Ram” and “RAM” would all match a query for “ram”;
for querying, to increase query-time precision, an operation can narrow the matches by, for example, ignoring all-cap acronyms if the searcher is interested in male sheep, but not Random Access Memory.
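A hedged sketch of this recall/precision trade-off with a single toy normalization step (the class and method names are illustrative); note that the precision-oriented mode only has an effect if it is applied at both index time and query time.

```java
import java.util.Locale;

public class RecallVsPrecision {

    // One toy normalization step with two modes:
    //  - preserveAcronyms = false: recall-oriented, everything is lowercased
    //  - preserveAcronyms = true:  precision-oriented, all-caps acronyms keep their case
    static String normalize(String token, boolean preserveAcronyms) {
        boolean isAcronym = token.length() > 1
                && token.equals(token.toUpperCase(Locale.ROOT))
                && !token.equals(token.toLowerCase(Locale.ROOT));
        if (preserveAcronyms && isAcronym) {
            return token;                       // "RAM" stays "RAM"
        }
        return token.toLowerCase(Locale.ROOT);  // "Ram" and "ram" become "ram"
    }

    public static void main(String[] args) {
        // Recall-oriented: all three collapse to the same term.
        System.out.println(normalize("ram", false) + " " + normalize("Ram", false) + " " + normalize("RAM", false));
        // Precision-oriented: the acronym stays distinct from the male sheep.
        System.out.println(normalize("ram", true) + " " + normalize("Ram", true) + " " + normalize("RAM", true));
    }
}
```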
Query
Stemming, so that words like “Run” and “Running” are considered equivalent terms.
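For example, assuming Lucene's EnglishAnalyzer (which stems with the Porter algorithm), running the query text through the same analysis as the indexed text reduces both “Run” and “Running” to the term “run”, so they match each other.

```java
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class QueryTimeStemming {

    // Runs the given text through the analyzer and prints the resulting terms.
    static void printTerms(Analyzer analyzer, String text) throws IOException {
        try (TokenStream stream = analyzer.tokenStream("body", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.print(term.toString() + " ");
            }
            stream.end();
            System.out.println();
        }
    }

    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new EnglishAnalyzer();
        printTerms(analyzer, "Run");      // run
        printTerms(analyzer, "Running");  // run
    }
}
```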
Algorithm
Unicode
The Unicode Text Segmentation standard (UAX #29) defines word break rules that determine where tokens start and end. Lucene's default analyzer, StandardAnalyzer, tokenizes with these rules, then converts the tokens to lowercase and can filter out stop words.
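A sketch of that default pipeline (the exact defaults, in particular whether stop words are removed out of the box, vary by Lucene version, so a stop word set is passed explicitly here):

```java
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class StandardAnalyzerDemo {
    public static void main(String[] args) throws IOException {
        // StandardTokenizer (UAX #29 word break rules) + lowercasing + stop word removal.
        Analyzer analyzer = new StandardAnalyzer(EnglishAnalyzer.ENGLISH_STOP_WORDS_SET);
        try (TokenStream stream = analyzer.tokenStream("body", "The Quick Brown Fox and the RAM")) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.println(term.toString());
            }
            stream.end();
        }
        // Prints: quick, brown, fox, ram ("The", "and", "the" are stop words)
    }
}
```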