Multilingual Regular Expression Syntax (Pattern)
About
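This section covers regular expression syntax.
Regular expressions are the notation used to describe regular languages. Regular languages are a strict subset of the languages that context-free grammars can describe, so a regular expression is less expressive than a context-free grammar.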
Stephen Kleene introduced regular expressions and showed that they describe exactly the languages that finite automata recognize. Most regexp packages transform a regular expression into an automaton, typically based on a nondeterministic finite automaton (NFA). The automaton then does the matching work.
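To make the automaton idea concrete, here is a minimal sketch of an NFA simulation that tracks the set of active states. The NFA is hand-built for the pattern `(a|b)*ab` and is not the output of any real regexp package. The state numbers and the `nfa_accepts` helper are hypothetical names chosen for illustration.

```python
import re

# Hand-built NFA for (a|b)*ab (an illustrative assumption, not library output).
# State 0 loops on 'a' and 'b'; on 'a' it may also guess that the final "ab"
# has started (state 1); state 2 is the accepting state.
TRANSITIONS = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
}
START, ACCEPTING = 0, {2}


def nfa_accepts(text: str) -> bool:
    """Return True if the whole text is accepted by the NFA above."""
    current = {START}
    for ch in text:
        # Follow every transition available from every active state.
        current = set().union(*(TRANSITIONS.get((s, ch), set()) for s in current))
        if not current:
            return False  # no active state left, the input cannot match
    return bool(current & ACCEPTING)


# The simulation agrees with Python's own engine on a few sample inputs.
for sample in ["ab", "babab", "aba", "", "abc"]:
    assert nfa_accepts(sample) == (re.fullmatch(r"(a|b)*ab", sample) is not None)
```

Tracking a set of states is what makes the simulation nondeterministic: after reading an `a`, both "still looping" and "the final ab has started" stay alive until the input rules one of them out.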
The Glob language behaves a lot like a regular expression but is not one. The difference lies mostly in the meaning of the star. Globs are used a lot in the shell to match file names. See Glob - Globbing (Wildcard expression).
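The difference in the meaning of the star can be checked with a short sketch. It uses Python's standard `fnmatch` module as a stand-in for shell globbing, which is an assumption: a real shell applies further rules, for example for hidden files.

```python
import fnmatch
import re

# Glob: '*' on its own stands for "any run of characters".
glob_txt = fnmatch.fnmatchcase("report.txt", "*.txt")

# Regex: '*' is a quantifier on the preceding element, so "any run of
# characters" is spelled '.*', and the literal dot must be escaped.
regex_txt = re.fullmatch(r".*\.txt", "report.txt") is not None

# The same text "a*" means different things in the two languages.
glob_hit = fnmatch.fnmatchcase("abc", "a*")         # 'a' followed by anything
regex_hit = re.fullmatch(r"a*", "abc") is not None  # zero or more 'a', nothing else

print(glob_txt, regex_txt)  # True True
print(glob_hit, regex_hit)  # True False
```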
Example
A Java regular expression used to parse log messages at Twitter circa 2010:
^(\\w+\\s+\\d+\\s+\\d+:\\d+:\\d+)\\s+
([^@]+?)@(\\s+)\\s+(\\S+):\\s+(\\S+)\\s+(\\S+)
\\s+((?:\\S+?,\\s+)*(?:\\S+?))\\s+(\\S+)\\s+(\\S+)
\\s+\\[([^\\]]+)\\]\\s+\"(\\w+)\\s+([^\"\\\\]*
(?:\\\\.[^\"\\\\]*)*)\\s+(\\S+)\"\\s+(\\S+)\\s+
(\\S+)\\s+\"([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*)
\"\\s+\"([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*)\"\\s*
(\\d*-[\\d-]*)?\\s*(\\d+)?\\s*(\\d*\\.[\\d\\.]*)?
(\\s+[-\\w]+)?.*$
where:
- ^ is a boundary meta character matching the beginning of a line
- + is a quantifier matching one or more occurrences of the preceding element
- \ is the escape character (doubled as \\ above because the pattern is written as a Java string literal)
- \d is a shorthand character class matching digits
- \w is a shorthand character class matching word characters (letters, digits, and underscores)
- \s is a shorthand character class matching whitespace (spaces, tabs, and line breaks)
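The symbols listed above can be tried on the first fragment of the pattern. The sketch below uses Python's `re` module, whose syntax for these tokens matches Java's. The log line is a hypothetical sample invented for the example, not real Twitter data.

```python
import re

# First fragment of the pattern, written with Java-style doubled backslashes...
java_style = "^(\\w+\\s+\\d+\\s+\\d+:\\d+:\\d+)\\s+"
# ...and the same pattern as a Python raw string, where no doubling is needed.
raw_style = r"^(\w+\s+\d+\s+\d+:\d+:\d+)\s+"

# Hypothetical log line, invented for this example.
line = "Jan  5 12:34:56 host42 something happened"

match = re.match(raw_style, line)
timestamp = match.group(1) if match else None

print(java_style == raw_style)  # True
print(timestamp)                # Jan  5 12:34:56
```

Here `^` anchors the match at the start, `\w+` consumes the month name, `\s+` the run of spaces, and the `\d+` parts the day and the time fields. The parentheses capture the whole timestamp as group 1.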
Usage
Regular expressions are used in many systems.
- File search. E.g., the UNIX pattern a.*b.
- Data structure definition. Example: DTDs describe XML tags with an RE-like format such as person (name, addr, child*).
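Both usages can be sketched in a few lines. The file names and the flattened encoding of child elements (tag names joined by spaces) are assumptions made for illustration. A real DTD validator works on parsed elements, not on strings.

```python
import re

# File search: keep names that contain an 'a' followed, somewhere later, by a 'b'.
names = ["alphabet.txt", "bravo.txt", "cab.log", "zebra.md"]
hits = [n for n in names if re.search(r"a.*b", n)]
print(hits)  # ['alphabet.txt', 'cab.log']

# DTD-like content model person (name, addr, child*), written as a regex
# over the space-joined sequence of child tag names (toy encoding).
PERSON_MODEL = re.compile(r"name addr( child)*")


def is_valid_person(children: list[str]) -> bool:
    """Check a list of child tag names against the toy content model."""
    return PERSON_MODEL.fullmatch(" ".join(children)) is not None


print(is_valid_person(["name", "addr", "child", "child"]))  # True
print(is_valid_person(["name", "child"]))                   # False
```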
Syntax
/regularExpression/modifier
where:
- a regular expression is a string composed of characters and character classes that are:
  - grouped (or not) to:
    - capture (extract)
    - or make assertions (look-around)
  - quantified
  - and combined in a logical way (for example, with alternation).
- a flag, also called a modifier, changes the behavior of the matching (for example, whether the match is case-sensitive).
See meta for an overview of the most important regexp symbols (tokens) that have a special meaning in a regular expression.
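The building blocks above (flag, capture group, quantifier, look-around assertion, alternation) can be sketched as follows. Python has no `/regularExpression/modifier` literal, so the flag is passed as an argument instead. The sample text is invented for the example.

```python
import re

# Hypothetical sample text, invented for this example.
text = "Error: disk FULL on node42; error: retry in 30s"

# Flag (modifier): re.IGNORECASE plays the role of a trailing /i.
errors = re.findall(r"error", text, flags=re.IGNORECASE)

# Capture group plus quantifier: extract the digits that follow "node".
node_id = re.search(r"node(\d+)", text).group(1)

# Look-ahead assertion: digits only when followed by 's'; the 's' is not consumed.
seconds = re.search(r"\d+(?=s)", text).group()

# Alternation (logical OR) inside a non-capturing group.
devices = re.findall(r"(?:disk|node)\w*", text)

print(errors)   # ['Error', 'error']
print(node_id)  # 42
print(seconds)  # 30
print(devices)  # ['disk', 'node42']
```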
Grammar
The syntax can differ from one implementation to another, but implementations tend to follow the same principles.
Visualisation
Regular expressions can be visualized as railroad diagrams.