Lexical Analysis - (Token|Lexical unit|Lexeme|Symbol|Word)
About
A token is a symbol of the vocabulary of a language.
Each token is a single atomic unit of the language.
The token syntax is typically a regular language, so a finite state automaton constructed from a regular expression can be used to recognize it.
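As a minimal sketch of this idea, the hand-written automaton below recognizes two token classes. The regular expressions it corresponds to (`[A-Za-z_][A-Za-z0-9_]*` for identifiers and `[0-9]+` for integer literals) are assumptions chosen for illustration, as are all names in the code (`char_class`, `TRANSITIONS`, `ACCEPTING`, `classify`).

```python
# Hand-written finite state automaton for two assumed token classes.
# States: "start", "ident", "int". A missing transition means rejection.

def char_class(ch):
    """Map a character to the input class used by the transition table."""
    if ch.isascii() and (ch.isalpha() or ch == "_"):
        return "letter"
    if ch.isascii() and ch.isdigit():
        return "digit"
    return "other"

# (current state, input class) -> next state
TRANSITIONS = {
    ("start", "letter"): "ident",
    ("start", "digit"): "int",
    ("ident", "letter"): "ident",
    ("ident", "digit"): "ident",
    ("int", "digit"): "int",
}

# Accepting states and the lexeme type each one stands for.
ACCEPTING = {"ident": "Identifier", "int": "Integer literal"}

def classify(text):
    """Run the automaton over the whole string; return its lexeme type or None."""
    state = "start"
    for ch in text:
        state = TRANSITIONS.get((state, char_class(ch)))
        if state is None:
            return None  # no transition: the string is not a token
    return ACCEPTING.get(state)

print(classify("sum"))  # Identifier
print(classify("3"))    # Integer literal
print(classify("3x"))   # None
```

In practice a lexer generator usually derives such a table from the regular expressions automatically; writing it by hand here only makes the states visible.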
A token is:
- a string of characters (the lexeme),
- categorized by a lexeme type.
The process of finding and categorizing tokens from an input stream is called “tokenizing” and is performed by a Lexer (Lexical analyzer).
A token is generally the result of parsing a document down to the atomic elements of a language.
See also Natural Language - Token (Word|Term)
Lexeme Type
A token might be:
- a literal (number, …)
- an operator (assignment, addition, …)
- a comment
- a delimiter or an end of statement (simple and compound symbols)
- a keyword or an identifier (reserved words fall in this group)
Example:
Consider the following programming statement:
sum = 3 + 2;
It is tokenized as shown in the following table:
Lexeme | Lexeme type |
---|---|
sum | Identifier |
= | Assignment operator |
3 | Integer literal |
+ | Addition operator |
2 | Integer literal |
; | End of statement |
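The table above can be reproduced with a small regular-expression tokenizer. This is a sketch under stated assumptions: the patterns in `TOKEN_SPEC` and the function name `tokenize` are hypothetical and cover only the lexeme types of this example, and the rules are tried in list order (first match wins) rather than by longest match.

```python
import re

# Assumed token specification: (lexeme type, compiled pattern).
# Only the lexeme types of the example statement are covered.
TOKEN_SPEC = [
    ("Integer literal", re.compile(r"\d+")),
    ("Identifier", re.compile(r"[A-Za-z_]\w*")),
    ("Assignment operator", re.compile(r"=")),
    ("Addition operator", re.compile(r"\+")),
    ("End of statement", re.compile(r";")),
    ("Whitespace", re.compile(r"\s+")),
]

def tokenize(source):
    """Return a list of (lexeme, lexeme type) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        for lexeme_type, pattern in TOKEN_SPEC:
            match = pattern.match(source, pos)
            if match:
                if lexeme_type != "Whitespace":
                    tokens.append((match.group(), lexeme_type))
                pos = match.end()
                break
        else:
            # No rule matched at this position.
            raise ValueError(f"unexpected character {source[pos]!r} at position {pos}")
    return tokens

for lexeme, lexeme_type in tokenize("sum = 3 + 2;"):
    print(f"{lexeme} | {lexeme_type}")
```

Whitespace is matched so that the scan can advance, but it is not emitted as a token, which mirrors the table.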
Properties
Terminal / Non terminal
Identifier
A token that has a name, such as the variable sum, is called an identifier.
Symbol Table
A symbol table is a table of all tokens that have a name (i.e. identifiers).
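A minimal sketch of building such a table from a token list follows. The token list is written out by hand, and the choice to record the positions at which each identifier occurs is an assumption made for illustration; real symbol tables typically store further attributes such as type or scope. The names `build_symbol_table`, `tokens`, and `symbols` are hypothetical.

```python
def build_symbol_table(tokens):
    """Collect each distinct identifier with the token indexes where it occurs."""
    table = {}
    for index, (lexeme, lexeme_type) in enumerate(tokens):
        if lexeme_type == "Identifier":
            table.setdefault(lexeme, []).append(index)
    return table

# Tokens for the assumed statement: total = sum + 2;
tokens = [
    ("total", "Identifier"),
    ("=", "Assignment operator"),
    ("sum", "Identifier"),
    ("+", "Addition operator"),
    ("2", "Integer literal"),
    (";", "End of statement"),
]

symbols = build_symbol_table(tokens)
print(symbols)  # {'total': [0], 'sum': [2]}
```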