A lexer is also known as a tokenizer, a scanner, or a lexical analyzer.
Consider the following expression:
sum = 3 + 2;
A lexer will tokenize it into the following table of tokens:
Lexeme | Token type |
---|---|
sum | Identifier |
= | Assignment operator |
3 | Integer literal |
+ | Addition operator |
2 | Integer literal |
; | End of statement |
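A minimal hand-written sketch of this tokenization in Python (the function name and token-type strings below are illustrative, not taken from any particular tool):

```python
def tokenize(source):
    """Single-pass, character-by-character tokenizer for the example above."""
    single_char = {"=": "Assignment operator", "+": "Addition operator", ";": "End of statement"}
    tokens = []
    i = 0
    while i < len(source):
        c = source[i]
        if c.isspace():                          # whitespace separates lexemes, emits nothing
            i += 1
        elif c.isalpha() or c == "_":            # identifier
            start = i
            while i < len(source) and (source[i].isalnum() or source[i] == "_"):
                i += 1
            tokens.append((source[start:i], "Identifier"))
        elif c.isdigit():                        # integer literal
            start = i
            while i < len(source) and source[i].isdigit():
                i += 1
            tokens.append((source[start:i], "Integer literal"))
        elif c in single_char:                   # one-character tokens
            tokens.append((c, single_char[c]))
            i += 1
        else:
            raise ValueError(f"Unexpected character {c!r} at position {i}")
    return tokens

print(tokenize("sum = 3 + 2;"))
# [('sum', 'Identifier'), ('=', 'Assignment operator'), ('3', 'Integer literal'),
#  ('+', 'Addition operator'), ('2', 'Integer literal'), (';', 'End of statement')]
```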
A lexer defines how the contents of a file are broken into tokens.
A lexer is ultimately implemented as a finite automaton.
A lexer reads an input character or byte stream (i.e. text, binary data, etc.) and divides it into tokens using a set of token definitions, typically written as regular expressions.
This process of transforming an input text into discrete components called tokens is known as lexical analysis (also called tokenization or lexing).
A lexer is a stateful stream generator (i.e. the position in the source file is saved). Every time it is advanced, it returns the next token in the source. Normally, the final token emitted by the lexer is an EOF token, and it will keep returning that same EOF token on every subsequent call.
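A small sketch of that stream interface, assuming an illustrative `Lexer` class with a `next_token()` method (neither name comes from a specific library):

```python
class Lexer:
    def __init__(self, source):
        self.source = source
        self.pos = 0                          # the saved state: position in the source

    def next_token(self):
        # Skip whitespace between tokens.
        while self.pos < len(self.source) and self.source[self.pos].isspace():
            self.pos += 1
        # Once the input is exhausted, every call returns the same EOF token.
        if self.pos >= len(self.source):
            return ("EOF", "")
        c = self.source[self.pos]
        if c.isalnum():                       # crude "word" token: identifiers and numbers
            start = self.pos
            while self.pos < len(self.source) and self.source[self.pos].isalnum():
                self.pos += 1
            return ("WORD", self.source[start:self.pos])
        self.pos += 1                         # anything else is a one-character symbol
        return ("SYMBOL", c)

lexer = Lexer("sum = 3 + 2;")
print([lexer.next_token() for _ in range(9)])
# ..., ('SYMBOL', ';'), ('EOF', ''), ('EOF', ''), ('EOF', '')  <- EOF repeats
```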
Lexical analysis is the first step of a compiler. In the second step, the tokens are processed by a parser.
Lexers are generally quite simple and do nothing with the tokens themselves. Most of the complexity is deferred to later stages such as the parser.
For example, a typical lexical analyzer recognizes parentheses as tokens, but does nothing to ensure that each “(” is matched with a “)”. This syntax analysis is left to the parser.
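A sketch of that division of responsibility (function names are illustrative): the lexer below happily emits an unmatched ")" token, and a separate syntax-level check of the kind a parser performs reports the imbalance:

```python
def lex_parens(source):
    # Lexical analysis only: classify each parenthesis, no structural checks.
    return [("LPAREN" if c == "(" else "RPAREN", c) for c in source if c in "()"]

def parens_balanced(tokens):
    # Syntax-level check, as a parser would do it.
    depth = 0
    for kind, _ in tokens:
        depth += 1 if kind == "LPAREN" else -1
        if depth < 0:                 # a ")" appeared before its matching "("
            return False
    return depth == 0

tokens = lex_parens("(a + b))")
print(tokens)                   # the lexer reports both ")" tokens without complaint
print(parens_balanced(tokens))  # False: the mismatch is only caught at the syntax level
```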
Tokenizing is generally done in a single pass.
Lexers can be generated by automated tools called compiler-compilers (for example Lex, Flex, or ANTLR).
These tools take a lexical grammar (the token definitions, usually written as regular expressions) as input and generate the source code of a lexer.
You can also write your own lexer manually.
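Whichever route is taken, the heart of the lexer is usually a table of token definitions. A minimal table-driven sketch in Python (the token names and patterns are illustrative, not a real grammar):

```python
import re

TOKEN_SPECS = [
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("INTEGER",    r"[0-9]+"),
    ("ASSIGN",     r"="),
    ("PLUS",       r"\+"),
    ("SEMICOLON",  r";"),
    ("SKIP",       r"\s+"),               # whitespace, matched but discarded
]
MASTER_PATTERN = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPECS))

def scan(source):
    for match in MASTER_PATTERN.finditer(source):
        if match.lastgroup != "SKIP":
            yield (match.lastgroup, match.group())

print(list(scan("sum = 3 + 2;")))
# [('IDENTIFIER', 'sum'), ('ASSIGN', '='), ('INTEGER', '3'),
#  ('PLUS', '+'), ('INTEGER', '2'), ('SEMICOLON', ';')]
```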
Ultimately, they all rely on a finite automaton. For example, recognizing the keyword `then` can be modeled by an automaton that advances one state per matched character (`t`, `h`, `e`, `n`) and accepts after the last one, as sketched below.
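A simple sketch of such an automaton as a hard-coded transition table in Python (the state numbering is illustrative):

```python
# DFA that accepts exactly the keyword "then": one state per matched character.
TRANSITIONS = {
    (0, "t"): 1,
    (1, "h"): 2,
    (2, "e"): 3,
    (3, "n"): 4,   # state 4 is the accepting state
}

def accepts_then(text):
    state = 0
    for char in text:
        state = TRANSITIONS.get((state, char))
        if state is None:              # no transition defined: reject
            return False
    return state == 4

print(accepts_then("then"))   # True
print(accepts_then("them"))   # False (no transition on 'm')
print(accepts_then("the"))    # False (input ends before the accepting state)
```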
Why not simply use a regular expression?