Automata - Deterministic finite-state automata (DFA)

About

A Deterministic finite-state automata (DFA) is a finite automaton that cannot be in more than one state at any one time.

The term deterministic refers to the fact that on each input there is one and only one state to which the automaton can transition from its current state.

The inverse of Deterministic finite-state automata (DFA) is Nondeterministic finite-state automata (NFA) that may be in several states at once.

They are fast, without expensive backtracking.

Parsing Example

A Finite State Machine (FSM) will:

start with a default state (such as parsing)
then will parse a string character by character:
until it find a open token,
then, change your state
where you track and store all the characters
until you reach the closing token,
which then switches state back to the default one “parsing”.

A FSM versus a regular expression approach can come handy if you need to deal with matched nested parentheses or with deeper parsing level (different types of parentheses, etc.)

Definition

A DFA is denoted in the Five tuple notation as being a set of the following element.

<MATH> A = (Q, \sigma, \delta, q0, F) </MATH> where:

A is the name of the DFA
Q is its set of states
<math>\sigma</math> is its input symbols
<math>\delta</math> is its transition function
q its start state
F its set of accepting states

Finite

A finite automaton has the following function composition property

For the deterministic finite automata q, if the input string <math>s = y+z</math> , then processing a string s composed of two strings y and z is the same as processing first y then processing z.

<MATH>\delta(q,s) = \delta(\delta(q,y),z)</MATH>

where <math>\delta</math> is the state of an automaton q after processing a string (here s, y or z)

Regular Expression

Most regexp library are based on nondeterministic automata (NFA)

Lexer

A lexer ¹⁾ defines all token via several regular expression. To efficiently match the input against the possible tokens, all regexes are compiled together into a Deterministic Finite State Machine. The DFA is then walked character by character over the input string and fires a “match” event whenever one of the expressions is matched.

Advantages: it runs fast (each character is compared only once) and does not get any slower if you add more expressions.
Disadvantages: it requires a huge data table for the automaton, and there are many types of regular expressions that are not supported (for instance, back-references).

Library

Documentation / Reference

Chapter 2.2.1 - page 61 - Book - Hopcroft, Motwani, Ullman, Automata Theory, Languages, and Computation 3rd Edition. pdf

¹⁾

https://stackoverflow.com/questions/192957/efficiently-querying-one-string-against-multiple-regexes

²⁾

https://github.com/cs-au-dk/dk.brics.automaton