What is a Lexer ? known also as Tokenizer or Scanner - Lexical Analysis

Compiler

What is a Lexer ? known also as Tokenizer or Scanner - Lexical Analysis

About

Lexer are also known as:

  • Lexical analyzer
  • Lexical Tokenizer
  • Lexical Scanner

Example

Consider the following expression:

sum = 3 + 2;

A lexer will tokenized it in the following symbol table:

Token
Lexeme Token type
sum Identifier
= Assignment operator
3 Integer literal
+ Addition operator
2 Integer literal
; End of statement

Definition

A lexer defines how the contents of a file is broken into tokens.

A lexer is ultimately implemented as finite automata.

A lexer reads an input character or byte stream (i.e. characters, binary data, etc.), divides it into tokens using:

This process of transforming an input text into discrete components called tokens is known as:

  • lexical analysis,
  • lexing
  • tokenizing
  • or simply scanning.

A Lexer is a stateful stream generator (ie the position in the source file is saved). Every time it is advanced, it returns the next token in the Source. Normally, the final Token emitted by the lexer is a EOF and it will repeatedly return the same EOF token whenever called.

Lexical analysis is the first step of a compiler. In the second step, the tokens can then then processed by a parser.

Lexers are generally quite simple and does nothing with the tokens. Most of the complexity is deferred to the following steps:

  • or semantic analysis 1) (seeing if there is a problem, a possible optimization, …)

For example, a typical lexical analyzer recognizes parentheses as tokens, but does nothing to ensure that each “(” is matched with a “)”. This syntax analysis is left to the parser.

Steps

Tokenizing is generally done in a single pass.

  1. Groups the characters into meaningful sequences called lexemes that adheres to the grammar. Tokens are frequently in the rule defined by regular expressions.
  2. Categorize each lexeme (sequence of characters) into tokens (symbols of the vocabulary of the language). If the lexer finds an invalid token, it will report an error.

Implementation

Compiler-Compiler

Lexers can be generated by automated tools called compiler-compiler.

They are working:

  • from a grammar file that defines the lexer rules
  • then generates the lexer as code

Manually

You can also write your own lexer manually.

They all use ultimately a finite automaton modeling the recognition of the keyword then

Word Recognition Automaton

where:

  • the state (circle) represents a different position in the word then that has been reached so far that ranges from:
    • the empty string (ie nothing)
    • to the complete word

Example:

Why not using simply a regular expression (Lexer vs Regexp)

Why not using simply a regular expression

  • lack of recursion: You cannot have a regular expression inside another one, therefore you can't define any tree. Some regular expression engine allows recursion but this is the exception not the norm and the expressiveness is limited.
  • no lexical mode: The lexical mode (or context) is important when you want your parsing rules to be applied conditionally. The most obvious being for instance a html pre where you don't want any parsing inside.

2) 3)





Discover More
Card Puncher Data Processing
Antlr (ANother Tool for Language Recognition)

ANTLR is lexer generator. It translates: a grammar to a lexer and parser. ANTLR is implemented in Java and generates lexer and parser in the following languages: Java, Ruby, Python, C,...
Card Puncher Data Processing
Antlr - Lexer

in Antlr The lexer is a generated class from the grammar.
Card Puncher Data Processing
Automata - Deterministic finite-state automata (DFA)

A Deterministic finite-state automata (DFA) is a finite automaton that cannot be in more than one state at any one time. The term deterministic refers to the fact that on each input there is one and only...
Word Recognition Automaton
Automata - Finite Automata

A finite automaton is an automaton that has a set of states and its control moves from state to state in response to external inputs. It has a start and an end state and there are only a finite number...
CSS - Grammar

CSS grammar and lexical scanner Grammar
Card Puncher Data Processing
Code - Grammar / Syntax (Lexical)

This section regroups the entity of a computer language from a lexical point of view. It's the same as Parts of the speech for a natural language. Grammars are useful models when designing software...
Code Analysis Variable Unused Name Error
Code Quality - Code analysis

Code analysis is an application that scans code and reports : problems or improvement The application performing code analysis is a linter (a parser/lexer program) Bad name writing is a problem...
Compiler
Compiler - Lexer rule (Token names, Lexical rule, Token name)

lexer rule are rules written specifically for the lexer as they defines tokens The lexer creates token from the input text that match this rules. A lexer rule is also known as: lexical rule token...
Compiler
Computer Language - (Compiler|Interpreter) - Language translator

Computer Language are written in plain text. However, computers interpret only particular sequence of instructions. This transformation from a plain text language to instructions is called compilation...
Java Conceptuel Diagram
Java - IO - Character Stream

Character streams are build above byte stream. They decodes bytes into characters (or the other way around) using a specified charset. The Java platform stores character values using Unicode conventions....



Share this page:
Follow us:
Task Runner