About
Tokenization is the process of breaking input text into small indexing elements – tokens.
Parsing and Tokenization are often called Text Analysis (or simply Analysis) in NLP.
The tokens (or terms) are used either:
- to build the index of those terms when a new document is added,
- or to query in order to identify which documents contain the terms you are querying for.
Compilers also tokenize code, with the help of a structure called a grammar. That step is performed by a lexer (also known as a tokenizer or scanner) during lexical analysis.
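As a minimal sketch (Python, with made-up documents and a deliberately naive tokenizer), the same tokenization feeds both sides: it produces the terms that go into the inverted index at indexing time and the terms to look up at query time.

```python
import re
from collections import defaultdict

def tokenize(text):
    """A deliberately naive tokenizer: lowercase, then split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Indexing side: map each term to the ids of the documents that contain it.
docs = {1: "The quick brown fox", 2: "A quick brown dog"}
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in tokenize(text):
        index[term].add(doc_id)

# Query side: the same tokenizer produces the terms to look up.
print(index["quick"])           # {1, 2}
print(tokenize("Quick fox!"))   # ['quick', 'fox']
```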
Tokenization
Pre
Pre-tokenization analysis can include but is not limited to:
- stripping HTML markup,
- transforming or removing text matching arbitrary patterns or sets of fixed strings.
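A rough sketch of such clean-up, assuming regular expressions are adequate for the input at hand (they stand in here for a real HTML parser or character filter):

```python
import re

def pre_tokenize(raw):
    """Clean raw input before tokenization (illustrative patterns only)."""
    text = re.sub(r"<[^>]+>", " ", raw)        # strip HTML markup
    text = re.sub(r"\bDRAFT\b", " ", text)     # remove a fixed string (example pattern)
    return re.sub(r"\s+", " ", text).strip()   # collapse leftover whitespace

print(pre_tokenize("<p>DRAFT The <b>quick</b> fox</p>"))  # "The quick fox"
```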
During
Sentence beginnings and endings can be identified to provide for more accurate phrase and proximity searches.
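One common way to exploit sentence boundaries is to leave a large gap in the token position counter at each boundary, so that phrase and proximity matches cannot span two sentences. A sketch, with an arbitrary gap value and a naive sentence splitter:

```python
import re

def positions(text, sentence_gap=100):
    """Assign token positions, jumping the counter at each sentence boundary
    so phrase/proximity matches stay within one sentence (a sketch)."""
    pos, out = 0, []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        for token in re.findall(r"\w+", sentence.lower()):
            out.append((token, pos))
            pos += 1
        pos += sentence_gap  # boundary: positions in the next sentence are far away
    return out

print(positions("Rams eat grass. RAM is memory."))
# [('rams', 0), ('eat', 1), ('grass', 2), ('ram', 103), ('is', 104), ('memory', 105)]
```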
Post (Analysis Process)
There are many post-tokenization steps that can be done, including (but not limited to):
- Stemming – Replacing/mapping words with their stems. For instance, with English stemming, “bikes” is replaced with (mapped to) “bike”; now a query for “bike” can find both documents containing “bike” and those containing “bikes”.
- Stop Words Filtering – Common words like “the”, “and” and “a” rarely add any value to a search. Removing them shrinks the index size and increases performance. It may also reduce some “noise” and actually improve search quality.
- Text Normalization – Stripping accents and other character markings can make for better searching.
- Synonym Expansion – Adding in synonyms at the same token position as the current word can mean better matching when users search with words in the synonym set.
- Tagging – Tags added to the tokens, such as part-of-speech tags or a dependency parse of the sentence.
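A minimal sketch of such a post-tokenization chain, using toy word lists in place of a real stemming algorithm, stop-word list and synonym dictionary. Note that each synonym is emitted at the same token position as the term it expands, which is what keeps phrase queries working:

```python
STEMS = {"bikes": "bike", "running": "run"}   # toy stemming table
STOP_WORDS = {"the", "and", "a"}              # toy stop-word list
SYNONYMS = {"bike": ["bicycle"]}              # toy synonym set

def analyze(tokens):
    """Post-tokenization steps: case folding, stop-word filtering, stemming,
    and synonym expansion at the same token position (illustrative lists only)."""
    out, position = [], 0
    for token in tokens:
        token = token.lower()                  # text normalization (case folding)
        if token in STOP_WORDS:                # stop-word filtering
            continue
        token = STEMS.get(token, token)        # stemming via a lookup table
        out.append((token, position))
        for synonym in SYNONYMS.get(token, []):
            out.append((synonym, position))    # synonym at the same position
        position += 1
    return out

print(analyze(["The", "bikes", "and", "running"]))
# [('bike', 0), ('bicycle', 0), ('run', 1)]
```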
Operations
This process is used for both indexing and querying, but the same operations are not always needed for both. For instance:
- for indexing, a text normalization operation will increase recall because, for example, “ram”, “Ram” and “RAM” would all match a query for “ram”
- for querying, to increase query-time precision, an operation can narrow the matches by, for example, ignoring all-cap acronyms if the searcher is interested in male sheep, but not Random Access Memory.
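A sketch of that recall/precision trade-off with hypothetical documents and naive tokenization: normalizing case makes a query for “ram” match every document, while skipping the normalization keeps the acronym distinct and narrows the matches. (The query-time variant would apply or skip the same normalization on the query terms.)

```python
import re
from collections import defaultdict

DOCS = {1: "RAM upgrade guide", 2: "The ram is a male sheep", 3: "Ram trucks"}

def build_index(lowercase):
    """Build a tiny inverted index, optionally normalizing case (a sketch)."""
    index = defaultdict(set)
    for doc_id, text in DOCS.items():
        for term in re.findall(r"\w+", text):
            index[term.lower() if lowercase else term].add(doc_id)
    return index

# With normalization: high recall – "ram" finds all three documents.
print(build_index(lowercase=True)["ram"])    # {1, 2, 3}

# Without normalization: higher precision – "RAM" finds only the acronym document.
print(build_index(lowercase=False)["RAM"])   # {1}
```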
Query
Typical query-time analysis includes:
- Case insensitivity, so “Analyzer” and “analyzer” match.
- Stemming, so words like “Run” and “Running” are considered equivalent terms.
- Stop Word Pruning, so small words like “an” and “my” don't affect the query.
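A compact sketch of these query-time steps (toy word lists again); the point is that the query goes through the same kind of chain, so the terms it produces line up with the terms stored in the index:

```python
STEMS = {"running": "run"}          # toy stemming table
STOP_WORDS = {"an", "my", "the"}    # toy stop-word list

def analyze_query(query):
    """Query-time analysis: case folding, stop-word pruning, stemming."""
    terms = []
    for token in query.split():
        token = token.lower()                  # case insensitivity
        if token in STOP_WORDS:                # stop-word pruning
            continue
        terms.append(STEMS.get(token, token))  # stemming
    return terms

print(analyze_query("An Analyzer Running"))    # ['analyzer', 'run']
```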
Algorithm
Unicode
Lucene's default analysis chain segments text into tokens using the word-boundary rules of the Unicode Text Segmentation algorithm (Unicode Standard Annex #29), converts the tokens to lowercase, and then filters out stop words.
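A rough Python approximation of the chain described above, for illustration only: the regex word split stands in for the real UAX #29 segmentation, and the stop-word list is a small sample, not Lucene's actual set.

```python
import re

STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to"}  # small sample list

def standard_like_analyze(text):
    """Approximation of the default chain: word segmentation, lowercasing,
    then stop-word filtering.  The regex below is only a stand-in for the
    UAX #29 word-boundary rules."""
    tokens = re.findall(r"\w+", text, re.UNICODE)
    return [t.lower() for t in tokens if t.lower() not in STOP_WORDS]

print(standard_like_analyze("The Analyzer is tokenizing the text"))
# ['analyzer', 'tokenizing', 'text']
```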