Language - (Grammar | Syntax | Lexicon)


Grammar in the context of a compiler. Ie how to define it.

See the grammar section if you are interessed on the structure.

In processing computer languages, semantic processing generally comes after syntactic processing, but in some cases semantic processing is necessary for complete syntactic analysis, and these are done together or concurrently.

Different Grammar Type

The grammar language (ie type) is dependent of the parser generator

Below is a non-exhaustive list:

See also Grammar Conversion Tool


Computer language syntax is generally distinguished into three levels:

  • Words – the lexical level, determining how characters form tokens. In a compiler, this is the lexer works to discover Words.
  • Phrases – the grammar level, narrowly speaking, determining how tokens form phrases. In a compiler, this is the parser works to discover Words.
  • Context – determining what objects or variables names refer to, if types are valid, etc. In a compiler, this is the work of the semantic analysis to add context.


How you work to define the grammar depends on your taste and knowledge of the syntax.

There is basically two ways to define the rules tree.


A bottom-up approach will:

  • start by defining the tokens
  • then create the tree by adding hierarchical relation between the token
  • until the top of the syntax is met


The Top-down approach will:

  • start with the top level
  • slice every level
  • to get to the token (the leaf of the tree)


To Code



For example, a correct sentence always consists of a subject followed by a predicate.

This rule can be expressed with formula's that are called:

  • syntax rules,
  • syntax productions,
  • or syntactic equations.

This above sentence can be described by the following formula:

sentence = subject predicate.


  • Subject and predicate are syntactic classes

If we add to this formula the two further formulas

subject   = "John" | "Mary".
predicate = "eats" | "talks".


  • the symbol | is “or”

We define herewith exactly four possible sentences, namely

John eats         Mary eats
John talks        Mary talks

Shorthand notation

S = AB.
A = "a" | "b".
B = "c" | "d".
L = {ac, ad, bc, bd}


  • The set L of sentences which can be generated is called the language. It consists of only four sentences.

Nesting is a property particularly important in the definition of structured languages.

Graphical representation


A language is defined by the following symbol type:

Symbol Type Also Called Cardinality Definition Example
term set
These symbols are said to be terminal,
because they cannot be substituted by any other symbols.
John, Mary, eats, talks.
non-terminal identifier set They denote syntactic (classes|category) and can be substituted. Sentence, Subject and predicate
syntactic equations productions set These define the possible substitutions of non-terminal symbols.
An equation is specified for each non-terminal symbol.
start 1 This symbol is a non-terminal symbol. Sentence

Syntactic equations - Backus Naur Form (BNF), Extended BNF (EBNF) and ABNF

A language is, therefore, the set of sequences of terminal symbols which, starting with the start symbol, can be generated by repeated application of syntactic equations, that is, substitutions.

The idea of defining languages and their grammar with mathematical precision goes back to N. Chomsky. It became clear, however, that the presented, simple scheme of substitution rules was insufficient to represent the complexity of spoken languages. This remained true even after the formalisms were considerably expanded. In contrast, this work proved extremely fruitful for the theory of programming languages and mathematical formalisms. In passing, we emphasize that this rigour applied to the syntax only, not to the semantics.

The use of the Chomsky formalism is also responsible for the term programming language, because programming languages seemed to exhibit a structure similar to spoken languages. This term is rather unfortunate, because a programming language is not spoken, and therefore is not a language in the true sense of the word. Formalism or formal notation would have been more appropriate terms.

ambiguous language

The structure of a (well formed) sentence is relevant, because it is instrumental in establishing the sentence's meaning. Owing to the syntactic structure, the individual parts of the sentence and their meaning can be recognized independently, and together they yield the meaning of the whole.

expression = number | expression "+" expression.
number = "1" | "2" | "3" | "4" .

“4 + 2 + 1” is a well-formed expression. This expression may even be derived in several ways, each corresponding to a different structural trees, as shown below:

The two differing structures may also be expressed with appropriate parentheses, respectively :

  • (4 + 2) + 1 = 7
  • 4 + (2 + 1) = 7

Fortunately, thanks to the associativity of addition both yield the same value 7.

But this need not always be the case. The mere use of subtraction in place of addition yields a counter example which shows that the two differing structures also yield a different interpretation and result:

  • (4 - 2) - 1 = 1
  • 4 - (2 - 1) = 3.

The example illustrates two facts:

  • Interpretation of sentences always rests on the recognition of their syntactic structure.
  • Every sentence must have a single structure in order to be unambiguous.

We call a syntactic class ambiguous if it can be attributed several structures. A language is ambiguous if it contains at least one ambiguous syntactic class (construct).

context-free languages


Powered by ComboStrap