Lexical Analysis - (Token|Lexical unit|Lexeme|Symbol|Word)
1 - About
A token is a symbol of the vocabulary of a language.
Each token is a single atomic unit of the language.
The token syntax is typically a regular language, so a finite state automaton constructed from a regular expression can be used to recognize it.
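As a minimal sketch of this equivalence, the block below recognizes one token class in two ways: with a regular expression and with a hand-written two-state automaton. The identifier syntax (a letter or underscore followed by letters, digits, or underscores) and the names `IDENTIFIER`, `is_identifier_regex`, and `is_identifier_dfa` are assumptions made for illustration, not part of any particular language definition.

```python
import re

# Assumed token syntax (illustrative only): a letter or underscore,
# followed by any number of letters, digits, or underscores.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_identifier_regex(text: str) -> bool:
    """Recognize an identifier with the regular expression."""
    return IDENTIFIER.fullmatch(text) is not None


def is_identifier_dfa(text: str) -> bool:
    """Recognize the same language with an explicit finite state automaton."""
    state = "start"
    for ch in text:
        is_letter = ch.isascii() and (ch.isalpha() or ch == "_")
        is_digit = ch.isascii() and ch.isdigit()
        if state == "start" and is_letter:
            state = "in_identifier"
        elif state == "in_identifier" and (is_letter or is_digit):
            state = "in_identifier"
        else:
            # No transition is defined for this character: reject.
            return False
    # "in_identifier" is the only accepting state.
    return state == "in_identifier"
```

Both functions should accept `sum` and `_x1` and reject `3sum` and the empty string. In practice a lexer generator would derive the automaton from the regular expression automatically.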
A token is:
- a string of characters (the lexeme),
- categorized by a lexeme type.
The process of finding and categorizing tokens from an input stream is called “tokenizing” and is performed by a Lexer (Lexical analyzer).
A token is the result of breaking a document down into the atomic elements of its language.
2 - Articles Related
3 - Lexeme Type
A token might be:
- a literal (number, …)
- an operator (assignment, addition, …)
- a comment
- a delimiter or an end-of-statement marker (simple and compound symbols)
- an identifier
- a keyword (reserved word)
Example:
Consider the following programming statement:
sum = 3 + 2;
Its tokens are listed in the following table:
Lexeme | Lexeme type |
---|---|
sum | Identifier |
= | Assignment operator |
3 | Integer literal |
+ | Addition operator |
2 | Integer literal |
; | End of statement |
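The table above can be reproduced with a small regular-expression lexer. This is only a sketch: the `Token` type, the `TOKEN_SPEC` patterns, and the `tokenize` function are hypothetical names chosen for this example, and the specification covers just the lexeme types needed for the statement shown.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    lexeme: str
    lexeme_type: str


# Assumed token specification: (lexeme type, compiled pattern).
# At each position the first pattern that matches is used.
TOKEN_SPEC = [
    ("Integer literal", re.compile(r"\d+")),
    ("Identifier", re.compile(r"[A-Za-z_]\w*")),
    ("Assignment operator", re.compile(r"=")),
    ("Addition operator", re.compile(r"\+")),
    ("End of statement", re.compile(r";")),
    ("Whitespace", re.compile(r"\s+")),
]


def tokenize(source: str) -> Iterator[Token]:
    """Yield the tokens of `source`, skipping whitespace."""
    position = 0
    while position < len(source):
        for lexeme_type, pattern in TOKEN_SPEC:
            match = pattern.match(source, position)
            if match:
                if lexeme_type != "Whitespace":
                    yield Token(match.group(), lexeme_type)
                position = match.end()
                break
        else:
            # No pattern matched at this position.
            raise SyntaxError(
                f"Unexpected character {source[position]!r} at offset {position}"
            )


for token in tokenize("sum = 3 + 2;"):
    print(f"{token.lexeme} | {token.lexeme_type}")
```

Whitespace is matched so that the lexer can advance past it, but it is not emitted as a token. A character that fits no pattern raises an error rather than being silently dropped.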
4 - Properties
4.1 - Terminal / Non terminal
4.2 - Identifier
A token that has a name is called an identifier.
5 - Symbol Table
A symbol table is a table of all tokens that have a name (i.e. identifiers).
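A minimal sketch of this idea, assuming tokens arrive as (lexeme, lexeme type) pairs: the hypothetical `build_symbol_table` function below keeps only the identifiers and records the token positions where each name occurs. Real symbol tables usually store more (type, scope, and so on); that is left out here.

```python
from typing import Dict, List, Tuple


def build_symbol_table(tokens: List[Tuple[str, str]]) -> Dict[str, List[int]]:
    """Map each identifier name to the token indexes where it appears."""
    table: Dict[str, List[int]] = {}
    for index, (lexeme, lexeme_type) in enumerate(tokens):
        if lexeme_type == "Identifier":
            table.setdefault(lexeme, []).append(index)
    return table


# Illustrative input: the tokens of the statement `sum = sum + 2;`
tokens = [
    ("sum", "Identifier"),
    ("=", "Assignment operator"),
    ("sum", "Identifier"),
    ("+", "Addition operator"),
    ("2", "Integer literal"),
    (";", "End of statement"),
]

symbol_table = build_symbol_table(tokens)
print(symbol_table)  # {'sum': [0, 2]}
```

Operators, literals, and delimiters carry no name, so they never enter the table.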