EBNF-Grammars

Module ebnf provides an EBNF-parser-generator that compiles an EBNF-Grammar into Python-code that can be executed to parse source text conforming to this grammar into concrete syntax trees.

Writing Grammar-Definitions

Specifying Grammars with EBNF

With DHParser, grammars can be specified either directly in Python-code (see parse) or in one of several EBNF-dialects. (Yes, DHParser supports several different variants of EBNF! This makes it easy to create a parser directly from grammars found in external sources.) “EBNF” stands for “Extended Backus-Naur Form”, a common formalism for specifying grammars of context-free languages. (see EBNF in Wikipedia)

The recommended way of compiling grammars with DHParser is either to write the EBNF-specification for the grammar into a text-file and then compile the EBNF-source into an executable as well as importable Python-module with the help of the “dhparser”-script, or, for bigger projects, to create a new domain-specific-language-project with the DHParser-script as described in the step-by-step-guide.

However, here we will show how to compile an EBNF-specified grammar from within Python-code and how to execute the parser that was generated by compiling the grammar.

As an example, we will realize a json-parser. Let’s start with creating some test-data:

>>> testobj = {'array': [1, 2.0, "a string"], 'number': -1.3e+25, 'bool': False}
>>> import json
>>> testdata = json.dumps(testobj)
>>> testdata
'{"array": [1, 2.0, "a string"], "number": -1.3e+25, "bool": false}'

We define the json-Grammar in a top-down manner in EBNF, using a regular-expression look-alike syntax. EBNF, as you may recall, consists of a sequence of symbol definitions. The definiens of each definition is either a string literal, a regular expression, another symbol, or a combination of these built with four different kinds of operators: 1. sequences, 2. alternatives, 3. options and 4. repetitions. Here is how these elements are denoted in classical and in regex-like EBNF-syntax:

element                     classical EBNF      regex-like
--------------------------  ------------------  ------------------
insignificant whitespace    ~                   ~
string literal              “…” or `…`          “…” or `…`
regular expr.               /…/                 /…/
sequences                   A B C               A B C
alternatives                A | B | C           A | B | C
options                     [ … ]               …?
repetitions                 { … }               …*
one or more                                     …+
grouping                    (…)                 (…)
positive lookahead          & …                 & …
negative lookahead          ! …                 ! …

“Insignificant whitespace” is a specialty of DHParser. Denoting insignificant whitespace with a particular sign, the tilde ~, makes it possible to eliminate it already during the parsing process without burdening later syntax-tree-processing stages with this common task. DHParser offers several more facilities to restrain the verbosity of the concrete syntax tree, so that the outcome of the parsing stage already comes close (or at least closer) to the intended abstract syntax tree.
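
To make the two notations concrete, here is the same toy definition written once in classical and once in regex-like style. This is a purely illustrative sketch with placeholder symbols, not part of the JSON-example below, and a real grammar would of course define each symbol only once:

sentence = word { ~ word } [ "." ]      # classical: { } repetition, [ ] option
sentence = word ( ~ word )* "."?        # regex-like: * repetition, ? option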

DHParser also contains a few experimental extensions to the common EBNF and PEG (Parsing-Expression-Grammar) formalisms:

additional operators         syntax
---------------------------  ----------
interleave                   A ° B
repetition range             … (i, k)

lookbehind operators         syntax
---------------------------  ----------
positive lookbehind          <-&
negative lookbehind          <-!

context sensitive parsers    syntax
---------------------------  ----------
pop and match                :: …
retrieve and match           : …
pop and always match         :? …
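
For illustration, here are two hypothetical definitions using the interleave-operator and a repetition range. They use placeholder symbols, are not part of the JSON-example that follows, and the notation is simply the one given in the table above; the exact semantics (e.g. whether each interleaved element must occur exactly once) should be checked against the documentation of the parse-module:

style  = bold ° italic ° underline    # interleave: the three elements may appear in any order
digits = /[0-9]/ (3, 6)               # repetition range: between three and six digits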

JSON consists of two complex data types: 1) associative arrays, called “object”, and 2) sequences of heterogeneous data, called “array”; and of four simple data types: 1) string, 2) number, 3) bool and 4) null. The structure of a JSON file can easily be described in EBNF:

>>> grammar = '''
...     json     = ~ _element _EOF
...     _EOF     = /$/
...     _element = object | array | string | number | bool | null
...     object   = "{" ~ member ( "," ~ §member )* "}" ~
...     member   = string ":" ~ _element
...     array    = "[" ~ ( _element ( "," ~ _element )* )? "]" ~
...     string   = `"` _CHARS `"` ~
...     _CHARS   = /[^"\\\\]+/ | /\\[\\/bnrt\\]/
...     number   = _INT _FRAC? _EXP? ~
...     _INT     = `-`? ( /[1-9][0-9]+/ | /[0-9]/ )
...     _FRAC    = `.` /[0-9]+/
...     _EXP     = (`E`|`e`) [`+`|`-`] /[0-9]+/
...     bool     = "true" ~ | "false" ~
...     null     = "null" ~                                   '''

This is a rather common EBNF-grammar. A few peculiarities are noteworthy, though: First of all, you might notice that some components of the grammar (or “production rules”, as they are commonly called) have names with a leading underscore _. It is a convention to mark those elements in which we are not interested on their own account with an underscore _. When moving from the concrete syntax-tree to a more abstract syntax-tree, these elements can be substituted by their content to simplify the tree.

Secondly, some production rules carry a name written in capital letters. This is also a convention, namely to mark those symbols that, with other parser-generators, would represent tokens delivered by a lexical scanner. DHParser is a “scanner-less” parser, which means that the breaking down of the string into meaningful tokens is done in place with regular expressions (like in the definition of _EOF) or simple combinations of regular expressions (see the definition of _INT above). There is no sharp distinction between tokens and other symbols in DHParser, but we keep it as a loose convention. Regular expressions are enclosed in forward slashes and follow the standard syntax of Perl-style regular expressions that is also used by the “re”-module of the Python standard library. (Don’t worry about the number of backslashes in the line defining _CHARS for now!)

Finally, it is another helpful convention to indent the definitions of symbols that have only been introduced to simplify an otherwise unnecessarily complicated definition (e.g. the definition of number, above) or to make it more understandable by giving names to its components (like _EOF).

Let’s try this grammar on our test-string. In order to compile this grammar into executable Python-code, we use the high-level-function create_parser() from the dsl-module.

>>> from DHParser.dsl import create_parser
>>> parser = create_parser(grammar, branding="JSON")
>>> syntax_tree = parser(testdata)
>>> syntax_tree.content
'{"array": [1, 2.0, "a string"], "number": -1.3e+25, "bool": false}'

As expected, serializing the content of the resulting syntax-tree yields exactly the input-string of the parsing process. What we cannot see here is that the parser has structured the string into the individual elements described in the grammar. Since the concrete syntax-tree that the parser yields is rather verbose, it would not make sense to print it out in full. We’ll just look at a small part of it to see what it looks like. Let’s pick the sub-tree that captures the first json-array within the syntax-tree:

>>> print(syntax_tree.pick('array').as_sxpr())
(array
  (:Text "[")
  (_element
    (number
      (_INT "1")))
  (:Text ",")
  (:Whitespace " ")
  (_element
    (number
      (_INT "2")
      (_FRAC
        (:Text ".")
        (:RegExp "0"))))
  (:Text ",")
  (:Whitespace " ")
  (_element
    (string
      (:Text '"')
      (_CHARS "a string")
      (:Text '"')))
  (:Text "]"))

The nodes of the syntax-tree carry the names of the production rules by which they have been generated. Nodes that have been created by components of a production receive the name of the parser-type that created the node (see parse), prefixed with a colon “:”. In DHParser, these nodes are called “anonymous”, because they lack the name of a proper grammatical component.

Chaining Grammars with @include

While it is always advisable to keep a grammar definition short and concise, so that it neatly fits within one file, the need might arise to reuse parts of a grammar definition in different grammars. While this can be done with copy & paste, copy & paste has the disadvantage that if the copied part is changed later (if only because an error has been detected and needs to be corrected), it must be copied again to all places where it is used.

In order to avoid such complications, DHParser supports a modest kind of modularization of grammar-definitions via a simple include mechanism. Simply add an @include = "FILENAME"-directive at the place where you’d like to include another grammar file.

Say, we’d like to make the grammar-definition of floating-point numbers in our JSON-grammar reusable. In order to do so, we can write it to a dedicated file:

>>> number_ebnf = '''
...     number   = _INT _FRAC? _EXP? ~
...     _INT     = `-`? ( /[1-9][0-9]+/ | /[0-9]/ )
...     _FRAC    = `.` /[0-9]+/
...     _EXP     = (`E`|`e`) [`+`|`-`] /[0-9]+/   '''
>>> with open('number_include.ebnf', 'w', encoding='utf-8') as f:  _ = f.write(number_ebnf)

Now, we can simply include this file in any grammar:

>>> json_grammar_with_include = '''
...     json     = ~ _element _EOF
...     _EOF     = /$/
...     _element = object | array | string | number | bool | null
...     object   = "{" ~ member ( "," ~ §member )* "}" ~
...     member   = string ":" ~ _element
...     array    = "[" ~ ( _element ( "," ~ _element )* )? "]" ~
...     string   = `"` _CHARS `"` ~
...     _CHARS   = /[^"\\\\]+/ | /\\[\\/bnrt\\]/
...     @include = "number_include.ebnf"
...     bool     = "true" ~ | "false" ~
...     null     = "null" ~                                   '''
>>> parser = create_parser(json_grammar_with_include, branding="JSON")
>>> syntax_tree = parser(testdata)
>>> syntax_tree.content
'{"array": [1, 2.0, "a string"], "number": -1.3e+25, "bool": false}'

The following command merely removes the grammar-file from the hard-disk, so that our doctest-run does not leave any traces:

>>> import os;  os.remove('number_include.ebnf')

It should be noted that DHParser’s include-mechanism does not (yet) support any kind of namespaces, so you have to make sure yourself that the same symbol names are not accidentally defined in both the including and the included file(s).

Using DHParser’s include-mechanism for modularizing your EBNF-grammar(s) is a different thing than adding an include mechanism to your domain specific language. For the latter, see the manual of the preprocess-module.

Simplifying Syntax-Trees while Parsing

Usually, anonymous nodes are what you want to get rid of in the course of transforming the concrete syntax-tree into an abstract syntax-tree. (See transform.) By default, DHParser already eliminates all anonymous nodes that are not leaf-nodes by replacing them with their children during parsing. Anonymous leaf-nodes will be replaced by their content, if they are the single child of some parent, and otherwise be left in place. Without this optimization, each construct of the EBNF-grammar would leave a node in the syntax-tree:

>>> from DHParser.parse import CombinedParser, TreeReduction
>>> _ = TreeReduction(parser.json, CombinedParser.NO_TREE_REDUCTION)
>>> syntax_tree = parser(testdata)
>>> print(syntax_tree.pick('array').as_sxpr())
(array
  (:Text "[")
  (:Option
    (:Series
      (_element
        (number
          (_INT
            (:Alternative
              (:RegExp "1")))))
      (:ZeroOrMore
        (:Series
          (:Text ",")
          (:Whitespace " ")
          (_element
            (number
              (_INT
                (:Alternative
                  (:RegExp "2")))
              (:Option
                (_FRAC
                  (:Text ".")
                  (:RegExp "0"))))))
        (:Series
          (:Text ",")
          (:Whitespace " ")
          (_element
            (string
              (:Text '"')
              (_CHARS
                (:RegExp "a string"))
              (:Text '"')))))))
  (:Text "]"))

This can be helpful for understanding how parsing directed by an EBNF-grammar works (next to looking at the logs of the complete parsing-process, see trace), but other than that it is advisable to streamline the syntax-tree as early as possible, because the processing time of all subsequent tree-processing stages increases with the number of nodes in the tree.

Because of this, DHParser offers further means of simplifying syntax-trees already during the parsing stage. These are not turned on by default, because they allow dropping content or removing named nodes from the tree; they must be turned on by “directives” that are listed at the top of an EBNF-grammar and that guide the parser-generation process. DHParser-directives always start with an @-sign. For example, the @drop-directive advises the parser to drop certain nodes entirely, including their content. In the following example, the parser is directed to drop all insignificant whitespace:

>>> drop_insignificant_wsp = '@drop = whitespace \n'

Directives look similar to productions, except that the right-hand side of the equal sign is a list of parameters. In the case of the @drop-directive, these can either be names of non-anonymous nodes that shall be dropped or one of four particular classes of anonymous nodes (strings, backticked, regexp, whitespace) that will be dropped.
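
For example, a class and the name of a particular symbol can be combined in one directive. The following line (which anticipates the refined JSON-grammar further below) drops both all insignificant whitespace and the nodes produced by the _EOF-symbol, provided _EOF is also marked as hidden:

@drop = whitespace, _EOF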

Another useful directive advises the parser to treat named nodes as anonymous nodes and to eliminate them accordingly during parsing. This is useful if we have introduced certain names in our grammar only as placeholders to render the definition of the grammar a bit more readable, not because we are interested in the structure they capture in its own right:

>>> hide_symbols = '@hide = /_\\w+/ \n'

Instead of passing a comma-separated list of symbols to the directive, which would also have been possible, we have leveraged our convention of prefixing unimportant symbols with an underscore “_” by specifying the symbols that shall be anonymized with a regular expression.

Now, let’s examine the effect of these two directives:

>>> refined_grammar = drop_insignificant_wsp + hide_symbols + grammar
>>> parser = create_parser(refined_grammar, 'JSON')
>>> syntax_tree = parser(testdata)
>>> syntax_tree.content
'{"array":[1,2.0,"a string"],"number":-1.3e+25,"bool":false}'

You might have noticed that all insignificant whitespace adjacent to the delimiters has been removed this time (but, of course, not the significant whitespace between “a” and “string” in “a string”). The difference that these two directives make becomes even more obvious if we look at (a section of) the syntax-tree:

>>> print(syntax_tree.pick('array').as_sxpr())
(array
  (:Text "[")
  (number "1")
  (:Text ",")
  (number
    (:RegExp "2")
    (:Text ".")
    (:RegExp "0"))
  (:Text ",")
  (string
    (:Text '"')
    (:RegExp "a string")
    (:Text '"'))
  (:Text "]"))

This tree looks more streamlined. But it still contains more structure than we might like to see in an abstract syntax tree. In particular, it still contains all the delimiters (“[”, “,”, ‘”’, …) next to the data. But unlike in the textual representation of our json data, the delimiters are not needed any more, because the structural information is now retained in the tree-structure.

So how can we get rid of those delimiters? The rather coarse-grained tools that DHParser offers in the parsing stage require some care to do this properly.

The @drop-directive allows dropping all unnamed strings (i.e. strings that are not directly assigned to a symbol), back-ticked strings (for the difference between strings and back-ticked strings, see below) and regular expressions. However, using @drop = whitespace, strings, backticked would also drop those parts captured as strings that contain data:

>>> refined_grammar = '@drop = whitespace, strings, backticked \n' \
...                   + hide_symbols + grammar
>>> parser = create_parser(refined_grammar, 'JSON')
>>> syntax_tree = parser(testdata)
>>> print(syntax_tree.pick('array').as_sxpr(flatten_threshold=0))
(array
  (number "1")
  (number
    (:RegExp "2")
    (:RegExp "0"))
  (string "a string"))

Here, suddenly, the number “2.0” has been turned into “20”! There are three ways to get around this problem:

  1. Assigning all non-delimiter strings to symbols. In this case we would have to rewrite the definition of “number” as follows:

    number     = _INT _FRAC? _EXP? ~
      _INT     = _MINUS? ( /[1-9][0-9]+/ | /[0-9]/ )
      _FRAC    = _DOT /[0-9]+/
      _EXP     = (_Ecap|_Esmall) [_PLUS|_MINUS] /[0-9]+/
      _MINUS   = `-`
      _PLUS    = `+`
      _DOT     = `.`
      _Ecap    = `E`
      _Esmall  = `e`
    

    A simpler alternative to this technique would be to make use of the fact that document-parts captured by regular expressions are not dropped (although regular expressions can also be listed in the @drop-directive, if needed) and that, at the same time, delimiters are almost always simple strings containing keywords or punctuation characters. Thus, one only needs to rewrite as regular expressions those string-expressions that capture data:

    number     = _INT _FRAC? _EXP? ~
      _INT     = /[-]/ ( /[1-9][0-9]+/ | /[0-9]/ )
      _FRAC    = /[.]/ /[0-9]+/
      _EXP     = (/E/|/e/) [/[-+]/] /[0-9]+/
    
  2. Assigning all delimiter strings to symbols and dropping the nodes and content captured by these symbols. This means doing exactly the opposite of the first solution. Here is an excerpt of what a JSON-grammar employing this technique would look like:

    @hide = /_\\w+/
    @drop = whitespace, _BEGIN_ARRAY, _END_ARRAY, _KOMMA, _BEGIN_OBJECT, ...
    ...
    array = _BEGIN_ARRAY ~ ( _element ( _KOMMA ~ _element )* )? §_END_ARRAY ~
    ...
    

    It is important that all symbols listed for dropping are also listed in the hide-directive or - if the hide-directive is given a regular expression instead of a list of names - carry a name that is matched by that regular expression. DHParser does not allow dropping the content of named nodes unless they are hidden, because the default assumption is that symbols in the grammar are defined to capture meaningful parts of the document that contain relevant data.

  3. Bailing out and leaving the further simplification of the syntax-tree to the next tree-processing stage which, if you follow DHParser’s suggested usage pattern, is the abstract-syntax-tree-transformation proper and which allows for a much more fine-grained specification of transformation rules. See transform.

To round this section up, we present the full grammar for a streamlined JSON-Parser according to the first solution-strategy. Observe that the values of “bool” and “null” are now defined with regular expressions instead of string-literals, because the latter would be dropped due to the @drop = ... strings, ...-directive, leaving an empty named node without a value whenever a bool value or null occurs in the input:

>>> json_gr = '''
...     @hide      = /_\\w+/
...     @drop      = whitespace, strings, backticked, _EOF
...     json       = ~ _element _EOF
...       _EOF     = /$/
...     _element   = object | array | string | number | bool | null
...     object     = "{" ~ member ( "," ~ §member )* "}" ~
...     member     = string ":" ~ _element
...     array      = "[" ~ ( _element ( "," ~ _element )* )? "]" ~
...     string     = `"` _CHARS `"` ~
...       _CHARS   = /[^"\\\\]+/ | /\\[\\/bnrt\\]/
...     number     = _INT _FRAC? _EXP? ~
...       _INT     = /[-]/? ( /[1-9][0-9]+/ | /[0-9]/ )
...       _FRAC    = /[.]/ /[0-9]+/
...       _EXP     = /[Ee]/ [/[-+]/] /[0-9]+/
...     bool       = /true/ ~ | /false/ ~
...     null       = /null/ ~                                  '''
>>> json_parser = create_parser(json_gr, 'JSON')
>>> syntax_tree = json_parser(testdata)
>>> print(syntax_tree.pick('array').as_sxpr(flatten_threshold=0))
(array
  (number "1")
  (number
    (:RegExp "2")
    (:RegExp ".")
    (:RegExp "0"))
  (string "a string"))

This time the data is not distorted any more. One oddity remains, however: We are most probably not interested in the fact that the number 2.0 consists of three components, each of which has been captured by a regular expression. Luckily, there exists yet another directive that allows reducing the tree further by merging adjacent anonymous leaf-nodes:

>>> json_gr = '@reduction = merge \n' + json_gr
>>> json_parser = create_parser(json_gr, 'JSON')
>>> syntax_tree = json_parser(testdata)
>>> print(syntax_tree.as_sxpr())
(json
  (object
    (member
      (string "array")
      (array
        (number "1")
        (number "2.0")
        (string "a string")))
    (member
      (string "number")
      (number "-1.3e+25"))
    (member
      (string "bool")
      (bool "false"))))

Merging adjacent anonymous leaf-nodes takes place after the @drop-directive has come into effect. It should be observed that merging only produces the desired result if the delimiters have been dropped beforehand, because otherwise delimiters would be merged with content. Therefore, the @reduction = merge-directive should best only be applied in conjunction with the @drop and @hide-directives.

There are four possible values for the @reduction-directive that control “early” flattening and merging (i.e. flattening and merging already in the parsing-stage) of anonymous nodes:

value            effect
---------------  ------------------------------------------------------------
none             no tree-reduction during parsing stage
flatten          flatten anonymous nodes
merge_treetops   flatten anonymous nodes and merge anonymous leaf-nodes when
                 all siblings are anonymous leaf-nodes
merge            flatten anonymous nodes and merge anonymous leaf-nodes

Applying any of the tree-reductions (or “simplifications”, for that matter) described here requires a bit of careful planning concerning which nodes will be named and which nodes will be dropped. This, however, pays off in terms of speed and a considerably simplified abstract-syntax-tree generation stage, because most of the unnecessary structure of concrete syntax-trees has already been eliminated at the parsing stage.

Alternative Syntax with ->DROP and ->HIDE

Since DHParser version 1.5.1, there also exists an alternative syntax for specifying symbols that are to be considered disposable and elements that are to be dropped: local ->DROP (drop) and ->HIDE (dispose) markers within the grammar, rather than directives. These markers are appended to the parser (skip/drop) or to the symbol definition (hide/dispose), e.g.:

WHITESPACE = /\s+/ ->DROP

There is a subtle difference between the two markers, insofar as the “->DROP”-marker may appear at one or more places inside a (complex) definition or at its very end, while the “->HIDE”-marker may only appear at the very end of a definition. (The reason is that while it makes sense to drop nodes returned by anonymous parsers, it does not make sense to dispose of (i.e. hide) anonymous nodes, because these will be eliminated (“flattened”) anyway.)
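
As a hypothetical illustration of placing “->DROP” inside a definition (rather than relying on the @drop-directive for strings, as the examples below do), the delimiters of the object-rule could be marked individually. This is only a sketch; the exact binding of the marker to the preceding element should be checked against the grammar-reference:

object = "{" ->DROP ~ member ( "," ->DROP ~ §member )* "}" ->DROP ~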

Since existing EBNF-practices vary, you can also use the alias “->SKIP” instead of “->DROP” for elements to be dropped and “->DISPOSE” instead of “->HIDE” for hidden symbols. (Internally, DHParser uses the names drop and dispose, but this should not concern users of DHParser.)

Here is the last example, rewritten with “->DROP” and “->HIDE”-markers. Note that we still use the @drop-directive for whitespace, strings and back-ticked (strings), because otherwise we would have to add a “->DROP”-marker after each string-literal and tilde (whitespace) marker. Also, we have already added the “@reduction = merge”-directive:

>>> json_gr = '''
...     @drop      = whitespace, strings, backticked
...     @reduction = merge
...     json       = ~ element EOF
...       EOF     = /$/ -> DROP
...     element   = (object | array | string | number | bool | null) -> HIDE
...     object     = "{" ~ member ( "," ~ §member )* "}" ~
...     member     = string ":" ~ element
...     array      = "[" ~ ( element ( "," ~ element )* )? "]" ~
...     string     = `"` CHARS `"` ~
...       CHARS   = /[^"\\\\]+/ | /\\[\\/bnrt\\]/  -> HIDE
...     number     = INT FRAC? EXP? ~
...       INT     = /[-]/? ( /[1-9][0-9]+/ | /[0-9]/ ) -> HIDE
...       FRAC    = /[.]/ /[0-9]+/  -> HIDE
...       EXP     = /[Ee]/ [/[-+]/] /[0-9]+/ -> HIDE
...     bool       = /true/ ~ | /false/ ~
...     null       = /null/ ~                                  '''
>>> json_parser = create_parser(json_gr, 'JSON')
>>> syntax_tree = json_parser(testdata)
>>> print(syntax_tree.pick('array').as_sxpr(flatten_threshold=0))
(array
  (number "1")
  (number "2.0")
  (string "a string"))

The HIDE and DROP markers (or their aliases) can also be used as prefixes to definitions, which may make it more obvious for readers of the grammar which symbols will be hidden or dropped from the syntax-tree. If written as a prefix, they are written with a following colon “:” instead of a preceding arrow “->”, e.g. “HIDE:” and “DROP:”. For the sake of simplicity, and to mirror the internal conventions of DHParser, where the names of (disposable) anonymous nodes are preceded by a colon, the word “HIDE” can also be left out from the prefix annotation, so that a simple colon at the beginning of a definition marks a disposable symbol. (This should not be confused with a colon in front of a symbol on the right hand side of a definition, where it marks a “retrievable” value for that symbol.) With these conventions, our JSON-example looks like the following:

>>> json_gr = '''
...     @drop      = whitespace, strings, backticked
...     @reduction = merge
...     json       = ~ element EOF
...     DROP:EOF     = /$/
...     :element  = (object | array | string | number | bool | null)
...     object     = "{" ~ member ( "," ~ §member )* "}" ~
...     member     = string ":" ~ element
...     array      = "[" ~ ( element ( "," ~ element )* )? "]" ~
...     string     = `"` CHARS `"` ~
...     :CHARS   = /[^"\\\\]+/ | /\\[\\/bnrt\\]/
...     number     = INT FRAC? EXP? ~
...       :INT     = /[-]/? ( /[1-9][0-9]+/ | /[0-9]/ )
...       :FRAC    = /[.]/ /[0-9]+/
...       :EXP     = /[Ee]/ [/[-+]/] /[0-9]+/
...     bool       = /true/ ~ | /false/ ~
...     null       = /null/ ~                                  '''
>>> json_parser = create_parser(json_gr, 'JSON')
>>> syntax_tree = json_parser(testdata)
>>> print(syntax_tree.pick('array').as_sxpr(flatten_threshold=0))
(array
  (number "1")
  (number "2.0")
  (string "a string"))

Comments and Whitespace

Why whitespace isn’t trivial

Handling whitespace in text-documents is not at all trivial, because whitespace can serve several different purposes and there can be different kinds of whitespace: Whitespace can serve a syntactic function as a delimiter. But whitespace can also be purely aesthetic, rendering a document more readable.

Depending on the data model, whitespace can be considered significant and be included in the data, or it can be considered insignificant, be excluded from the data and only be re-inserted when displaying the data in a human-readable form. (For example, one can model a sentence as a sequence of words and spaces or just as a sequence of words.) Note that “significance” does not correlate with the syntactic or aesthetic function, but only depends on whether you’d like to keep the whitespace in your data or not.

There can be different kinds of whitespace with different meanings (and differing significance). For example, one can distinguish between horizontal whitespace (spaces and tabs) and vertical whitespace (including linefeeds). And there can be different sizes of whitespace with different meanings. For example, in LaTeX a single linefeed still counts as plain whitespace, while an empty line (i.e. whitespace including two or more linefeeds) signals a new paragraph.

Finally, even the position of whitespace can make a difference. A certain number of whitespace characters at the beginning of a line can have the meaning of “indentation” (as in Python code), while at the end of a line or between brackets it is just plain insignificant whitespace. (This is actually a case where the boundaries of the EBNF-formalism become visible, and you’d probably use a preprocessor or some kind of “semantic actions” to handle such cases. There is some support for either of these in DHParser.)

Significant Whitespace

A reasonable approach to coding whitespace is to use one particular symbol for each kind of whitespace. Those kinds of whitespace that are insignificant, i.e. that do not need to appear in the data, should be dropped from the syntax-tree. With DHParser this can be done already while parsing, using the @hide and @drop-directives described earlier.

But let’s first look at an example which only includes significant whitespace. The following parser parses sequences of paragraphs which consist of sequences of sentences which consist of sequences of main clauses and subordinate clauses which consist of sequences of words:

>>> text_gr = '''
...     @ hide         = /_\\w+/
...     document       = PBR* S? paragraph (PBR paragraph)* PBR* S? _EOF
...       _EOF         = /$/
...     paragraph      = sentence (S sentence)*
...     sentence       = (clause _c_delimiter S)* clause _s_delimiter
...       _c_delimiter = KOMMA | COLON | SEMICOLON
...       _s_delimiter = DOT | QUESTION_MARK | EXCLAMATION_MARK
...     clause         = word (S word)*
...     word           = /(?:[A-Z]|[a-z])[a-z']*/
...     DOT            = `.`
...     QUESTION_MARK  = `?`
...     EXCLAMATION_MARK = `!`
...     KOMMA          = `,`
...     COLON          = `:`
...     SEMICOLON      = `;`
...     PBR            = /[ \\t]*\\n[ \\t]*\\n[ \\t]*/
...     S              = /(?=[ \\n\\t])[ \\t]*(?:\\n[ \\t]*)?(?!\\n)/ '''

Here, we have two types of significant whitespace, PBR (“paragraph-break”) and S (“space”). Both types allow for a certain amount of flexibility, so that two whitespaces of the same type do not need to have exactly the same content, but we could always normalize these whitespaces in a subsequent transformation step.

Two typical design patterns for significant whitespace are noteworthy, here:

  1. Both whitespaces match only if there is at least one whitespace character. We may allow whitespace to be optional (as at the beginning and end of the document), but if the option has not been taken, we don’t want to see an empty whitespace-tag in the document later on. (For insignificant whitespace, the opposite convention can be more convenient, because, typically, insignificant whitespace is dropped anyway, whether it’s got content or not.)

  2. The grammar is constructed in such a way that whitespace always appears between different elements at the same level, but never after the last or before the first element. The whitespace after the last word of a sentence or before the first word of a sentence is really whitespace between two sentences. If we pick out a sentence or a clause, we will have no dangling whitespace at its beginning or end. (Again, for soon-to-be-dropped insignificant whitespace, another convention can be more advisable.)

Let’s just try our grammar on an example:

>>> text_example = '''
... I want to say, in all seriousness, that a great deal of harm is being
... done in the modern world by belief in the virtuousness of work, and that
... the road to happiness and prosperity lies in an organized diminution of
... work.
...
... First of all: what is work? Work is of two kinds: first, altering the
... position of matter at or near the earth's surface relatively to other
... such matter; second, telling other people to do so. The first kind is
... unpleasant and ill paid; the second is pleasant and highly paid.'''
>>> text_parser = create_parser(text_gr, 'Text')
>>> text_as_data = text_parser(text_example)
>>> sentence = text_as_data.pick(
... lambda nd: nd.name == "sentence" and nd.content.startswith('First'))
>>> print(sentence.as_sxpr())
(sentence
  (clause
    (word "First")
    (S " ")
    (word "of")
    (S " ")
    (word "all"))
  (COLON ":")
  (S " ")
  (clause
    (word "what")
    (S " ")
    (word "is")
    (S " ")
    (word "work"))
  (QUESTION_MARK "?"))

Again, it is a question of design whether we leave the whitespace in the data or not. Leaving it in has the advantage that serialization becomes as simple as printing the content of the data-tree:

>>> print(sentence)
First of all: what is work?

Otherwise, one would have to program a dedicated serialization routine. Especially if you receive data from a different source, you’ll appreciate not having to do this - and so will other people receiving your data. Think about it! However, dropping the whitespace will yield more concise data.

Coding Comments

Allowing comments in a domain-specific language almost always makes sense, because it allows users to annotate the source texts while working on them and to share those comments with collaborators. From a technical point of view, adding comments to a DSL raises two questions:

  1. At what places shall we allow comments to be inserted in the source code? Common answers are: a) at the end of a line, b) almost everywhere, or c) both.

  2. How do we avoid polluting the EBNF-grammar with comment markers? It already curtails readability that we have to put whitespace symbols in so many places. And speaking of comments at the end of a line: If line-feeds aren’t important for us - as in our toy-grammar for prose-text, above - we probably wouldn’t want to reframe our grammar just to allow for end-of-line comments.

Luckily, there exists a simple and highly intuitive solution that takes care of both of these concerns: We admit comments wherever whitespace is allowed. And we code this by defining a symbol that means: “whitespace and, optionally, a comment”.

Let’s try this with our prose-text-grammar. In order to do so, we have to define a symbol for comments, a symbol for pure whitespace, and, finally, a symbol for whitespace with an optional comment. Since, in our grammar, we actually have two kinds of whitespace, S and PBR, we’ll have to redefine both of them. As delimiters for comments, we use curly braces:

>>> wsp_gr = '''
...     PBR      = pure_PBR COMMENT (pure_PBR | pure_S)?
...              | (pure_S? COMMENT)? pure_PBR
...     S        = pure_S COMMENT pure_S? | COMMENT? pure_S
...     COMMENT  = /\\{[^}]*\\}/
...     pure_PBR = /[ \\t]*\\n[ \\t]*\\n[ \\t]*/
...     pure_S   = /(?=[ \\n\\t])[ \\t]*(?:\\n[ \\t]*)?(?!\\n)/'''

As can be seen, the concrete re-definition of the whitespace tokens requires a bit of careful consideration, because we want to allow additional whitespace next to comments, but at the same time avoid ending up with two whitespaces in sequence in our data. Let’s see if we have succeeded:

>>> extended_text_gr = text_gr[:text_gr.rfind(' PBR')] + wsp_gr
>>> extended_parser = create_parser(extended_text_gr, 'Text')
>>> syntax_tree = extended_parser('What {check this again!} is work?')
>>> print(' '.join(nd.name for nd in syntax_tree.pick('clause').children))
word S word S word
>>> print(syntax_tree.pick('clause').as_sxpr())
(clause
  (word "What")
  (S
    (pure_S " ")
    (COMMENT "{check this again!}")
    (pure_S " "))
  (word "is")
  (S
    (pure_S " "))
  (word "work"))

We will not worry about the sub-structure of the S-nodes right now. If we are not interested in the comments, we could use the @hide, @drop and @reduction = merge-directives to simplify these at the parsing stage. Or we could extract the comments and normalize the whitespace at a later tree-processing stage. For now, let’s just check whether our comments work as expected:

>>> syntax_tree = extended_parser('What{check this again!} is work?')
>>> print(' '.join(nd.name for nd in syntax_tree.pick('clause').children))
word S word S word
>>> syntax_tree = extended_parser('What {check this again!}is work?')
>>> print(' '.join(nd.name for nd in syntax_tree.pick('clause').children))
word S word S word
>>> syntax_tree = extended_parser('What{check this again!}is work?')
>>> print(syntax_tree.errors[0])
1:24: Error (1040): Parser "pure_S = /(?=[ \\n\\t])[ \\t]*(?:\\n[ \\t]*)?(?!\\n)/" did not match: »is work?«

The last error was to be expected, because we did not allow comments to serve as substitutes for whitespace. Let’s check whether putting comments near paragraph breaks works as well:

>>> test_text = '''Happiness lies in the diminution of work.
...
... { Here comes the comment }
...
... What is work?'''
>>> syntax_tree = extended_parser(test_text)
>>> print(' '.join(nd.name for nd in syntax_tree.children))
paragraph PBR paragraph
>>> test_text = '''Happiness lies in the diminution of work.
... { Here comes the comment }
... What is work?'''
>>> syntax_tree = extended_parser(test_text)
>>> print(' '.join(nd.name for nd in syntax_tree.children))
paragraph

The last result might look surprising at first, but since a paragraph break requires at least one empty line as a separator, the input text is correctly understood by the parser as a single paragraph with two sentences separated by a single whitespace, which, incidentally, contains a comment:

>>> print(' '.join(nd.name for nd in syntax_tree.pick('paragraph').children))
sentence S sentence
>>> print(syntax_tree.pick('paragraph')['S'].as_sxpr(flatten_threshold=0))
(S
  (pure_S
    ""
    "")
  (COMMENT "{ Here comes the comment }")
  (pure_S
    ""
    ""))

A common problem with whitespace is that it tends to pollute the grammar, because wherever you’d like to allow whitespace, you have to insert a symbol for whitespace. The same problem exists when it comes to allowing comments, because you’d probably want to allow comments in as many places as possible.

Insignificant Whitespace

Coding insignificant whitespace and comments is exactly the same as coding significant whitespace and comments and does not need to be repeated here. (The combination of insignificant whitespace and significant comments is slightly more complicated and probably best outsourced to some degree to the post-parsing processing stages. It will not be discussed here.) However, DHParser offers some special support for insignificant whitespace and comments, which can make working with these easier in some cases.

First of all, DHParser has a special dedicated token for insignificant whitespace, the tilde-character ~. We have seen it earlier in the definition of the json-Grammar.

The ~-whitespace-marker differs from the usual pattern for defining whitespace in that it is implicitly optional or, what amounts to the same, it matches the empty string. Normally, it is considered bad practice to define a symbol as optional. Rather, a symbol should always match something and should be marked as optional only at the places where it is used. If this rule is obeyed, it is always easy to tell whether some element is optional or not at a specific place in the grammar. Otherwise, it can become quite confusing indeed. However, since the tilde character is used very often, it is more convenient not to have to mark it with a question-mark or, if you use classical EBNF-syntax, to enclose it in square brackets.
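
To illustrate the difference, here is a hypothetical fragment that first defines an explicit whitespace symbol and marks it as optional at each place of use, followed by the equivalent formulation with the implicitly optional tilde (purely illustrative, with placeholder symbols; a real grammar would define pair only once):

pair = key WS? ":" WS? value      # explicit whitespace symbol, marked optional where used
WS   = /[ \t]+/
pair = key ~ ":" ~ value          # tilde: implicitly optional insignificant whitespace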

The default regular expression for the tilde-whitespace captures arbitrarily many spaces and tabs and at most one linefeed, but not an empty line ([ \t]*(?:\n[ \t]*)?(?!\n)), as this is the most convenient way to define whitespace for text-data. However, the tilde-whitespace can also be redefined with any other regular expression via the @whitespace-directive.

Let’s go back to our JSON-grammar and define the optional insignificant whitespace marked by the tilde-character in such a way that it matches any amount of horizontal or vertical whitespace, which makes much more sense in the context of json than the default tilde-whitespace that is restricted vertically to at most a single linefeed:

>>> testdata = '{"array": [1, 2.0, "a string"], \n\n\n "number": -1.3e+25, "bool": false}'
>>> syntax_tree = json_parser(testdata)
>>> print(syntax_tree.errors[0])
1:33: Error (1010): »member« expected by parser 'object', but »\n\n\n "numbe...« found instead!
>>> json_gr = '@whitespace = /\\s*/ \n' + json_gr
>>> json_parser = create_parser(json_gr, "JSON")
>>> syntax_tree = json_parser(testdata)
>>> print(syntax_tree.errors)
[]

When redefining the tilde-whitespace, make sure that your regular expression also matches the empty string! There is no need to worry that the syntax tree gets cluttered with empty whitespace-nodes, because tilde-whitespace always yields anonymous nodes and DHParser drops empty anonymous nodes right away.
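
For example, of the two following (hypothetical) redefinitions only the second one obeys this rule; the first requires at least one whitespace character and would therefore cause matches to fail wherever no whitespace is present:

@whitespace = /\s+/
@whitespace = /\s*/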

Comments can be defined using the @comment-directive. DHParser automatically intermingles comments and whitespace so that wherever tilde-whitespace is allowed, a comment defined by the @comment-directive is also allowed:

>>> json_gr = '@comment = /#[^\\n]*(?:\\n|$)/ \n' + json_gr
>>> json_parser = create_parser(json_gr, "JSON")
>>> testdata = '''{"array": [1, 2.0, "a string"], # a string
...                "number": -1.3e+25,  # a number
...                "bool": false}  # a bool'''
>>> syntax_tree = json_parser(testdata)
>>> print(syntax_tree.as_sxpr(compact = True))
(json
  (object
    (member
      (string "array")
      (array
        (number "1")
        (number "2.0")
        (string "a string")))
    (member
      (string "number")
      (number "-1.3e+25"))
    (member
      (string "bool")
      (bool "false"))))

Since the json-grammar still contains the @drop = whitespace, ...-directive from earlier on (along with the other tree-reductions), the comments have been nicely dropped along with the tilde-whitespace.

There is one caveat: When using comments alongside whitespace that captures at most one linefeed, the comments should be defined in such a way that the last character of a comment is never a linefeed.
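
For example, an end-of-line comment could be defined so that it stops just before the linefeed and leaves the linefeed to the whitespace-parser. This is a hypothetical variant of the comment-definition used above:

@comment = /#[^\n]*/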

Also, a few limitations of the tilde-whitespace and directive-defined comments should be kept in mind:

  1. Only one kind of insignificant whitespace can be defined this way. If there are more kinds of insignificant whitespace, all but one need to be defined conventionally as part of the grammar.

  2. Both directive-defined comments and tilde-whitespace can only be defined by a regular expression. In particular, nested comments cannot be defined with regular expressions alone.

However, using tilde-whitespace has yet one more benefit: With the tilde-whitespace, cluttering of the grammar with whitespace-markers can be avoided by adding implicit whitespace adjacent to string-literals. Remember the definition of the JSON-Grammar earlier. If you look at a definition like object = "{" ~ member ( "," ~ §member )* "}" ~, you’ll notice that there are three whitespace-markers, one next to each delimiter. Naturally so, because one usually wants to allow users of a domain specific language to put whitespace around delimiters.

You may wonder why the tilde appears only on the right hand side of the literals, although you’d probably like to allow whitespace on both sides of a literal like “{”. But if you look at the grammar closely, you’ll find that almost every symbol definition ends either with a tilde sign or with a symbol whose definition ends with a tilde sign, which means that these elements allow whitespace on their right hand side. Now, if all elements of the grammar allow whitespace on their right hand side, they automatically also allow whitespace on their left hand side, namely the whitespace on the right hand side of the previous element.

In order to reduce cluttering the grammar with tilde-signs, DHParser allows turning on implicit tilde-whitespace adjacent to any string literal with the directive @ literalws = right or @ literalws = left. As the argument of the directive suggests, whitespace is either “eaten” at the right hand side or at the left hand side of the literal. String literals can be enclosed either in double quotes “…” or single quotes ‘…’. Both kinds of literals will have implicit whitespace, if the @literalws-directive is used.

(Don’t confuse implicit whitespace with insignificant whitespace: Insignificant whitespace is whitespace you do not need any more after parsing. Implicit whitespace is whitespace you do not denote explicitly in the grammar. It’s a specialty of DHParser and DHParser allows only the insignificant whitespace denoted by the tilde-character to be declared as “implicit”.)

If right-adjacent whitespace is declared as implicit with the @literalws-directive, the expression:

object     = "{" ~ member ( "," ~ §member )* "}" ~

can be written as:

object     = "{" member ( "," §member )* "}"

which is easier to read.

For situations where implicit whitespace is not desired, DHParser has a special kind of string literal, written with backticks, which never carries any implicit whitespace. This is important when literals are used for signs that enclose content, like the quotation marks around the string values in our JSON-Grammar:

string     = `"` _CHARS '"'  # mind the difference between `"` and '"'!

Regular expressions also never carry implicit whitespace. So, if you are using regular expressions as delimiters, you still have to add the tilde character if adjacent insignificant whitespace is to be allowed:

bool       = /true/~ | /false/~

The complete json-grammar now looks like this:

>>> json_gr = '''
...     @hide      = /_\\w+/
...     @drop      = whitespace, strings, backticked, _EOF
...     @reduction = merge
...     @whitespace= /\\s*/
...     @comment   = /#[^\\n]*(?:\\n|$)/
...     @literalws = right
...     json       = ~ _element _EOF
...       _EOF     = /$/
...     _element   = object | array | string | number | bool | null
...     object     = "{" member ( "," §member )* "}"
...     member     = string ":" _element
...     array      = "[" ( _element ( "," _element )* )? "]"
...     string     = `"` _CHARS '"'
...       _CHARS   = /[^"\\\\]+/ | /\\[\\/bnrt\\]/
...     number     = _INT _FRAC? _EXP? ~
...       _INT     = /[-]/? ( /[1-9][0-9]+/ | /[0-9]/ )
...       _FRAC    = /[.]/ /[0-9]+/
...       _EXP     = /[Ee]/ [/[-+]/] /[0-9]+/
...     bool       = /true/~ | /false/~
...     null       = /null/~                                    '''
>>> json_parser = create_parser(json_gr, "JSON")
>>> syntax_tree_ = json_parser(testdata)
>>> assert syntax_tree_.equals(syntax_tree)

The whitespace defined by the @whitespace-directive can be accessed from within the grammar via the name WHITESPACE__. Unlike the tilde-sign, this name refers to the pure whitespace that is not intermingled with comments. Similarly, comments defined by the @comment-directive can be accessed via the symbol COMMENT__.
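
For instance, a grammar could capture a comment together with its preceding pure whitespace explicitly by referring to these implicitly defined symbols. The following one-liner is a hypothetical fragment that presupposes the @whitespace- and @comment-directives defined above:

annotation = WHITESPACE__ COMMENT__   # pure whitespace followed by a comment, without the tilde's intermingling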

Lookahead and Lookbehind

Lookahead and lookbehind operators are a convenient way to resolve or rather avoid ambiguities while at the same time keeping the DSL lean. Assume for example a simple DSL for writing definitions like:

>>> definitions = '''
...     dog   := carnivorous quadrupel that barks
...     human := featherless biped'''

Now, let’s try to draw up a grammar for “definitions”:

>>> def_DSL_first_try = ''' # WARNING: This grammar doesn't work, yet!
...     @literalws  = right
...     definitions = ~ definition { definition } EOF
...     definition  = definiendum ":=" definiens
...     definiendum = word
...     definiens   = word { word }
...     word        = /[a-z]+|[A-Z][a-z]*/~
...     EOF         = /$/ '''
>>> def_parser = create_parser(def_DSL_first_try, "defDSL")

Parsing our example with the generated parser yields an error, however:

>>> syntax_tree = def_parser(definitions)
>>> for e in syntax_tree.errors_sorted:  print(e)
3:11: Error (1040): Parser "word->/[a-z]+|[A-Z][a-z]*/" did not match: »:= featherless biped«

The reason for this error is that the parser definiens captures as many words as occur in a sequence, including the definiendum of the next definition, which is the word “human”. But then the next definition does not find its definiendum any more, because it has already been captured. (All this may not easily become clear from the error message itself, but can easily be found out by using the post-mortem debugger of module trace.)

A common technique to avoid this problem would be to introduce an end-of-statement marker, for example a semicolon ";". A more elegant way to solve the problem in this case is to make use of the fact that if a word is followed by the definition sign ":=", it cannot be part of the definiens any more, but must be a definiendum. This can be encoded by using the negative look-ahead operator "!":

>>> def_DSL = def_DSL_first_try.replace('definiens   = word { word }',
...                                     'definiens   = word { word !":=" }')
>>> def_parser = create_parser(def_DSL, "defDSL")
>>> syntax_tree = def_parser(definitions)
>>> for d in syntax_tree.select('definition'):
...    print(f'A {d["definiendum"]} is a {str(d["definiens"]).strip()}')
A dog    is a carnivorous quadrupel that barks
A human  is a featherless biped

The statement word !":=" is a sequence of a word and a negative lookahead. This whole sequence only matches if word matches and the negative look-ahead matches, which is only the case if the following text cannot be matched by ":=".

We could have achieved the same effect with a positive lookahead by checking whether any of the possible follow-up-sequences of parser definiens ensues:

>>> def_DSL = def_DSL_first_try.replace('definiens   = word { word }',
...                                     'definiens   = word { word &(word|EOF) }')
>>> def_parser = create_parser(def_DSL, "defDSL")
>>> syntax_tree = def_parser(definitions)
>>> for d in syntax_tree.select('definition'):
...    print(f'A {d["definiendum"]} is a {str(d["definiens"]).strip()}')
A dog    is a carnivorous quadrupel that barks
A human  is a featherless biped

Generally, lookahead operators, whether positive or negative, never capture any text. They merely match or fail depending on whether the following parser would match or would fail to match the next piece of text. The positive lookahead matches if the parser would match; the negative operator matches if it would fail. Lookahead operators can also be thought of as a boolean condition on the following text, where the positive lookahead operator "&" resembles an "and" and the negative lookahead operator an "and not". As in our example, these operators are very helpful for "exploring" the surroundings of a piece of text to be captured by a parser. They allow parsers to match or fail depending on the ensuing text.

A negative lookahead expression can also serve to encode the meaning of “without” if placed in front of another expression. Let’s rewrite our grammar of a definitions-DSL so as to exclude certain bad words:

>>> def_DSL = def_DSL[:def_DSL.find('word        =')] + '''
...     word        = !forbidden /[a-z]+|[A-Z][a-z]*/~
...     forbidden   = /[sf][a-z][a-z][a-z]/~
...     EOF         = /$/ '''
>>> def_parser = create_parser(def_DSL, "defDSL")
>>> syntax_tree = def_parser('nice := nice word')
>>> print(syntax_tree)
nice := nice word
>>> syntax_tree = def_parser('sxxx := bad word')
>>> print(str(syntax_tree).strip())
<<< Error on "sxxx := bad word" | Parser "word->!forbidden" did not match: »sxxx := bad word« >>>

The same effect can be achieved by using the subtraction operator "-". This is just syntactic sugar to make the use of the negative lookahead operator in the sense of "without" more intuitive:

>>> def_DSL = def_DSL[:def_DSL.find('word        =')] + '''
...     word        = all_words - forbidden
...     all_words   = /[a-z]+|[A-Z][a-z]*/~
...     forbidden   = /[sf][a-z][a-z][a-z]/~
...     EOF         = /$/ '''
>>> def_parser = create_parser(def_DSL, "defDSL")
>>> syntax_tree = def_parser('sxxx := bad word')
>>> print(str(syntax_tree).strip())
<<< Error on "sxxx := bad word" | Parser "word->!forbidden" did not match: »sxxx := bad word« >>>

Next to the lookahead operators, there also exist look-back operators. Be warned, though, that look-back operators are an experimental feature in DHParser and that their implementation is highly idiosyncratic, that is, it is most likely not compatible with any other parser-generator-toolkit based on EBNF-grammars. Also, look-back operators in DHParser are more restricted than lookahead-operators. They can only be used in combination with simple text or regular expression parsers and - here comes the idiosyncratic part - they work in the opposite direction. This means that if you want to check whether a parser is preceded, say, by the keyword "BEGIN", the text phrase that you have to check for with the look-back parser is actually "NIGEB". If that still does not put you off, here is how look-back operators are used: Let's assume that our definitions should not only allow for a definiens but, alternatively, for an enumeration, and that the difference is indicated by using a simple equal sign "=" instead of the definition symbol ":=". Then, using look-back operators to distinguish the two cases, we can rewrite our grammar as follows:

>>> def_DSL_extended = '''
...     @literalws  = right
...     definitions = ~ definition { definition } EOF
...     definition  = definiendum (":=" | "=") (definiens | enumeration)
...     definiendum = word
...     definiens   = <-& /\\s*=:/ word { word &(word|EOF) }
...     enumeration = <-& /\\s*=[^:]/ word { word &(word|EOF) }
...     word        = /[a-z]+|[A-Z][a-z]*/~
...     EOF         = /$/ '''
>>> def_parser = create_parser(def_DSL_extended, "defDSL")
>>> definitions = '''
...     dog   := carnivorous quadrupel that barks
...     drinks = water beer juice
...     human := featherless biped'''
>>> syntax_tree = def_parser(definitions)
>>> print(str(syntax_tree.pick('enumeration')).strip())
water beer juice

The look-back operators are <-& for the positive look-back and <-! for the negative look-back, each of which must be followed by a regular expression or a string. Of course, this example is rather contrived, and the grammar can easily be rewritten without the look-back operators.

Error-Catching

Providing users with proper error information is one of the most tenacious problems when implementing the parser for a domain specific language. There are three different challenges:

  1. Locating the error at the correct position in the source code.

  2. Providing proper error messages that explain the reason for the error.

  3. Resuming the parsing progress after an error has occurred at the nearest possible place without producing artificial follow-up errors.

In the following, DHParser’s techniques for the first two challenges, locating errors and customizing error messages, will be described. Techniques for resuming the parsing process after an error has occurred or for passing by erroneous passages in the source code will be explained below, under the heading “Fail-tolerant Parsing”.

Farthest-Fail-Heuristics

Without adding any hints to the grammar, DHParser applies only a very basic technique for locating the error if the grammar does not match a text, known as “farthest failure”: It locates the error at the “farthest” position where a parser failed, reporting the last named parser in the call chain (that first reached this location) as the cause of the failure. This approach often works surprisingly well for locating errors, unless the grammar relies too heavily on regular expressions capturing large chunks of text, because error location works only on the level of the parsing expression grammar, not on that of the atomic regular expressions. To see how the farthest-fail heuristics works, consider a parser for simple arithmetic expressions:

>>> arithmetic_grammar = '''@ literalws = right
... arithmetic = expression EOF
... expression = ~ term { ("+" | "-") term }
... term       = factor { ("*" | "/") factor }
... factor     = number | group
... group      = "(" expression ")"
... number     = /\\d+/~
... EOF        = /$/'''
>>> arithmetic = create_parser(arithmetic_grammar, "arithmetic")
>>> terms = arithmetic('(2 - 3 * (4 + 5)')
>>> print(terms.errors[0])
1:17: Error (1040): Parser "term->`*`" did not match: »«
>>> terms = arithmetic('(2 - 3) * ( )')
>>> print(terms.errors[0])
1:13: Error (1040): Parser "number->/\\d+/" did not match: »)«

As can be seen, the location of the error is captured well enough, at least when we keep in mind that the computer cannot guess where we would have placed the forgotten closing bracket. It can only report the point where the mistake becomes apparent.

However, the reported fact that it was the sub-parser `*` of parser term that failed at this location does little to enlighten us with respect to the cause of the failure. The “farthest fail”-method as implemented by DHParser yields the first parser (of possibly several) that has been tried at the position where the farthest fail occurred. Thus, in this case, a failure of the parser capturing `*` is reported rather than of the parser expression->`+`. Changing this by reporting the last parser or all parsers that failed at this location would do little to remedy the situation, however. In this example, it would be just as confusing to learn that expression->`+` failed at the end of the parsed string.

Mandatory items “§”

Thus, the “farthest fail”-method is not very suitable for explaining the failure or pinpointing which parser really was the culprit. Therefore, DHParser provides a simple annotation that allows raising a parsing error deliberately if a certain point in the chain of parsers has not been reached: By placing the “§”-sign as a “mandatory-marker” in front of a parser, that parser, as well as all subsequent parsers in the same sequence, will not simply return a non-match when failing, but will cause the entire parsing process to stop and report an error at the location of failure:

>>> arithmetic_grammar = arithmetic_grammar.replace(
...    'group      = "(" expression ")"',
...    'group      = "(" § expression ")"')
>>> arithmetic = create_parser(arithmetic_grammar, "arithmetic")
>>> terms = arithmetic('(2 - 3 * (4 + 5)')
>>> print(terms.errors[0])
1:17: Error (1010): »")"« expected by parser 'group', but END OF FILE found instead!
>>> terms = arithmetic('(2 - 3) * ( )')
>>> print(terms.errors[0])
1:13: Error (1010): »expression« expected by parser 'group', but »)« found instead!

The error messages give a much better indication of the cause of the error. What is reported as the cause is either the name of the parser that was expected to match, as in the second case, or the rule of the parser in the case of unnamed parsers, as in the first case. This usually, though unfortunately not always, yields a much better indication of the location and cause of an error than the farthest failure.

However, a little care has to be taken not to place the mandatory marker in front of a parser that might fail at a location that could still be reached and matched by another branch of the grammar. (In our example it is clear that round brackets enclose only groups. Thus, if the opening round bracket has matched, we can be sure that what follows must be an expression followed by a closing round bracket, or, if not, it is a mistake.) Luckily, although this may sound complicated, in practice it never is. Unless your grammar is very badly structured, you will hardly ever make this mistake, and if you do, you will notice soon enough.

Also, there is an important restriction: Only one §-marker is allowed per named parser. In case you have a long EBNF-expression on the right-hand side of a symbol-definition where you’d like to use the §-marker in more than one place, you can, however, always split it into several expressions by introducing new symbols. These symbols, if they serve no other purpose, can be marked as disposable with the @ hide-directive (see Simplifying Syntax-Trees while Parsing).
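For illustration, here is a purely hypothetical sketch (the symbols assignment, identifier, expression and _tail are made up) of how a definition that would otherwise need two §-markers can be split into two definitions, with the helper symbol marked as disposable:

@ hide     = _tail
assignment = identifier § "=" _tail
_tail      = expression § ";"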

The §-marker has proven to be a very simple means of pinpointing errors in the DSL-code, and I recommend using it early on in the process of developing a new grammar. Plus, the §-marker offers two further benefits, namely customizing error messages and resuming the parsing process after a failure. The latter is particularly helpful if the DSL is to be used with an integrated development environment, which benefits greatly from fail-tolerant parsing. However, I recommend starting to use these features only after the grammar has reached a certain amount of maturity, because changing the grammar often requires re-adjusting customized error messages and resume-clauses as well, which can become tedious.

Custom error messages

While the error messages produced by the use of the §-marker are often quite understandable for the engineer designing the grammar of a DSL, they might not be so for the user of the DSL, who might not know the names of the parsers of the grammar, let alone the expressions of the unnamed parsers, and will therefore not always be able to make much sense of error messages that report just these.

In order to customize error messages, the symbol-related directive @ SYMBOLNAME_error = CONDITION, ERROR_STRING is used. The directive’s name consists of the name of a symbol that contains a §-marker and the appendix _error. The directive always takes two arguments, separated as usual by a comma, of which the first is a condition-expression and the second an error message. The condition can be used to make the choice of an error message dependent on the text following the point of failure. It can either be a regular expression or a simple string which must match (or, in the case of the string, be equal to) the first part of the text at the position where the parser defined by the symbol failed to match and raised an error. Only if the condition matches will the error message given as the second argument be emitted. Otherwise, the fallback error message described above (”… expected by parser …”) will be shown. The empty string '' can be used as a fallback condition if the customized message shall always be emitted, no matter what the following text looks like.

The error string is a format string that may include either of the two arguments {0} or {1}, where {0} will be replaced by the name or string representation of the parser that was expected to match but didn’t, and {1} will be replaced by the first twenty or so letters of the unmatched rest of the text. Here is a simple example that could be part of a JSON-parser that is intended to deliver understandable error messages:

>>> grammar = '''
... @ string_error  = '', 'Illegal character(s) {1} in string.'
... string          = `"` §characters `"` ~
... characters      = { plain | escape }
... plain           = /[^"\\\\]+/
... escape          = /\\[\\/bnrt\\]/'''
>>> json_string = create_parser(grammar, 'json_string')
>>> print(json_string('"alpha"'))
"alpha"
>>> for e in json_string('"al\\pha"').errors:  print(e)
1:4: Error (1010): Illegal character(s) »\pha"« in string.

It is possible to place an error code or number at the beginning of the error string, separated by a colon to override the default code of 1010 for mandatory continuation errors. (See below for an example where it makes sense to do so.)
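For instance, the string-error directive from the example above could be given its own error code (2020 is an arbitrary custom number, chosen according to the numbering conventions explained further below) like this:

@ string_error  = '', '2020:Illegal character(s) {1} in string.'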

Customized error messages must always be specified in the grammar before the definition of the symbol they are related to. It is possible to “stack” several different error-directives with different conditions and messages but related to the same symbol. The conditions are evaluated in the order in which the error-directives appear in the grammar. The first error message whose condition matches at the error location will be picked. Therefore, the more specific conditions should always be placed first and the more general or fallback conditions below these:

>>> grammar = ("@ string_error  = /\\\\/, 'Illegal escape sequence {1} "
...            "Allowed values are b,n,r,t,u'") + grammar
>>> json_string = create_parser(grammar, 'json_string')
>>> for e in json_string(r'"al\pha"').errors:  print(e)
1:4: Error (1010): Illegal escape sequence »\pha"« Allowed values are b,n,r,t,u

Here, the more specific and more understandable error message has been selected. Careful readers might notice that the more general customized error message “Illegal character(s) … found in string” will now only be selected if the string contains a character that not even the regular expression engine recognizes, because the only other character that is not allowed within the string is the closing quotation mark that terminates the string, which does not cause the parser to fail (but only to terminate too early).

Also, it might be noticed that the errors are always caused by a failure to match the second "-sign, because the characters-parser also matches the empty string and thus never fails or raises any error. Nonetheless, the error can occur in the interior of the string and can - with the help of customized error messages - be described as such and properly be located.

Generally, though, emitting the right error messages based on a regular-expression-condition is not easy. In the worst case, badly customized error messages can even be more confusing than not customizing error messages at all.

One possible development-strategy is to wait for the feedback of testers and users or to monitor the errors that users typically make and then to customize the error messages to the actual needs of the users to help them understand why the computer refuses to parse a certain construct.

Enhancing grammars with error-paths

A frequently used approach to error catching is to encode alternative code paths for erroneous code right into the grammar. This is usually done by adding a further alternative with the alternative-operator | that is specifically meant to “throw” an error if all regular alternatives have been exhausted. DHParser offers a special syntax to add an error message if a certain part of the grammar is activated during the parsing of an erroneous source text: @Error("[CODE:]MESSAGE"), where MESSAGE has to be substituted by a particular error message and the optional “CODE:” by a natural number at the beginning of the error message, separated from it by a colon. This number can be chosen freely, but must adhere to the following convention:

  • numbers from 1 to 99 must be used for mere notices

  • numbers from 100 to 999 for warnings

  • numbers from 1000 to 9999 for errors

  • numbers of 10000 and more as fatal errors

Fatal errors stop the processing pipeline right after the step in which the error occurred. “Normal” errors result in invalid results, but DHParser will attempt to continue with the processing pipeline. This has the advantage that other errors further downstream will also be reported in the same pass. A possible disadvantage is that consequential errors may arise in the processing of faulty results from earlier stages in the pipeline. If too many consequential errors occur in the aftermath of an error, consider lifting this error to a fatal error by assigning a higher number to it.

Warnings and notices should only be used when the results will still be valid, though maybe not what they were intended to be (hence the warning). Error codes starting with the digit “1” are reserved by DHParser. So, for custom errors, codes starting with “2”, “3”, etc. must be used.
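For illustration, here are a few made-up @Error-markers (the messages and code numbers are arbitrary examples chosen to follow these conventions, not values taken from DHParser):

@Error("20:Value should probably be quoted")      # notice (1 - 99)
@Error("250:Delimiter ';' is deprecated")         # warning (100 - 999)
@Error("2010:Not a valid number!")                # error (1000 - 9999)
@Error("20000:Cannot recover from this mistake")  # fatal error (10000 and more)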

Here is an example of how error messages can be added within the grammar. The syntax uses the same @-marker as directives, but unlike directives, the error-markers do not resemble symbol-definitions; rather, they occur within the right-hand side of a symbol definition. The following is a grammar that parses a sequence of natural numbers separated by a blank or line-feed, e.g. “1 2 3”:

>>> sequence_of_natural_numbers = """@drop = whitespace
...     document = { ~ number | error } ~ EOF
...     number = /[1-9]\\d*/~
...     @error_error = "20:Parser stopped because of an error"
...     error  = !EOF           # no error-message on end of file
...              @Error("2010:Not a valid Number!")
...              § EOF
...     EOF = !/./"""
>>> number_parser = create_parser(sequence_of_natural_numbers)
>>> syntax_tree = number_parser("1 2 X 3 4 X 5")
>>> for error in syntax_tree.errors_sorted: print(error)
1:5: Error (2010): Not a valid Number!
1:5: Notice (20): Parser stopped because of an error

Let’s examine this in detail: The definition of the “document”-symbol without an error-branch would read:

document = { ~ number } ~ EOF

Here, we add an alternative branch to the number-parser in case the number parser does not match. Now, the number parser will not match on one of two conditions: a) the following text is not a number, or b) the end of the file has been reached. Now, since the latter is perfectly in order, we have to exclude this case at the beginning of the error-branch with a negative-lookahead:

error  = !EOF
         @Error("1010:Not a valid Number!")
         § EOF

While the first line of the definition of the error-branch excludes the kinds of non-match(es) that are not errors, the second line adds an error message at the current location of the parser. We add an error code, “2010”, to make it easier to identify this error in automated tests or the like. The last line is just a means to make the parser stop parsing right on the spot (that is, unless this error is caught by a resume-directive, see below). It will produce the consequential error “EOF expected”, though. In order not to confuse the user, this error has been reconfigured as a simple notice that explains the situation with the directive @error_error = "20:Parser stopped because of an error" in the line before.

Instead of stopping the parser by jumping right to the end of the document, one might also try to skip only as much text as is needed to continue the parsing process. In this trivial example, this is easily done by substituting the § EOF statement with a regular expression that consumes all characters until the next location where the parser might find a possibly valid item. The expression /.*?(?=\s|$)/ skips all characters up to, but not including, the next blank or up to the end of the file. Let’s try:

>>> alt_gr = sequence_of_natural_numbers.replace('§ EOF', '/.*?(?=\\s|$)/')
>>> reentrant_number_parser = create_parser(alt_gr)
>>> syntax_tree = reentrant_number_parser('1 2 X 3 4 X 5')
>>> for error in syntax_tree.errors_sorted:  print(error)
1:5: Error (2010): Not a valid Number!
1:11: Error (2010): Not a valid Number!

The location of the second error indicates that the parser has continued to read the document after the first error. We can double-check this by looking at the abstract syntax tree:

>>> print(syntax_tree.as_sxpr())
(document
  (number "1")
  (number "2")
  (error
    (ZOMBIE__ `(err "1:5: Error (2010): Not a valid Number!"))
    (:RegExp "X"))
  (number "3")
  (number "4")
  (error
    (ZOMBIE__ `(err "1:11: Error (2010): Not a valid Number!"))
    (:RegExp "X"))
  (number "5")
  (EOF))

Fail-tolerant Parsing

A serious limitation of all previously described error-handling mechanisms (with the exception of the very last example) is that the parsing process still stops at the very first error. This is particularly annoying for beginners learning to code data or program code with a new DSL, because the compiler must be run again and again before all errors have been found and corrected. It would be much better to receive a list of all errors on the first run.

Generic approach

Fail tolerant parsing means that:

  1. the parsing process continues after a syntax error has been encountered, ideally until the end of the document

  2. that, in doing so, the parser skips as little text as possible and thus reports as many errors as possible on the first run.

  3. that ideally no consequential errors (“Folgefehler”) are introduced, i.e. errors which are not original errors in the source document, but merely artifacts of badly chosen locations for the resumption of the parsing process.

There are a number of techniques for fail-tolerant parsing. One technique that has already been shown earlier and which is not specific to DHParser but can be used with any parser-generator is to add grammar-paths for possibly erroneous code to the grammar:

>>> grammar = '''
... string          = `"` ([characters] `"` | string_error [`"`]) ~
...   string_error  = /[^"]*/
... characters      = { plain | escape }+
... plain           = /[^"\\\\]+/
... escape          = /\\[\\/bnrt\\]/'''
>>> json_string = create_parser(grammar, 'json_string')
>>> tree = json_string('"al\\pha"')
>>> print(tree.as_sxpr(flatten_threshold=0))
(string
  (:Text '"')
  (string_error "al\pha")
  (:Text '"'))

This time the parser did not need to stop at the erroneous part. The erroneous part itself has been caught within a node that betrays only by its name that there was an error. To produce error messages we have to add them explicitly to such nodes, afterwards:

>>> for string_error in tree.select('string_error'):
...     _ = tree.new_error(string_error, 'Fehler im String: ' + str(string_error))
>>> for e in tree.errors:  print(e)
1:2: Error (1000): Fehler im String: al\pha

Unfortunately, the error location is not very precise. This can be remedied by refining our error junction code:

>>> grammar = '''
... string          = `"` ([characters] `"` | string_error [`"`]) ~
...   string_error    = [characters] { ups [characters] }+
...   ups             = /[^"]/
... characters      = { plain | escape }+
... plain           = /[^"\\\\]+/
... escape          = /\\[\\/bnrt\\]/'''
>>> json_string = create_parser(grammar, 'json_string')
>>> tree = json_string('"al\\pha"')
>>> print(tree.as_sxpr())
(string
  (:Text '"')
  (string_error
    (characters
      (plain "al"))
    (ups "\")
    (characters
      (plain "pha")))
  (:Text '"'))

Here, the node named “ups” pinpoints the precise error location.

Like most techniques for fail-tolerant parsing, this one is not quite as easy to master in practice as it might look. Generally, adding a junction for erroneous code works best when the passage that shall be by-passed is delineated by an easily recognizable follow-up string. In this example the follow-up string would be the "-sign. The method fails, of course, if the follow-up text is erroneous, too, or has even been forgotten. So, to be absolutely sure, one would have to consider different follow-up sequences, say, empty lines, keywords that mark new parts of the document, and the like.
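For example, a purely illustrative variant of the error-junction (a sketch, not taken from DHParser’s documentation) that gives up either at the next quotation mark or at the next empty line, whichever comes first, could look like this:

string_error  = { !(`"` | /\n[ \t]*\n/) /.|\n/ }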

Skip and Resume

DHParser also offers two other constructs for fail-tolerant parsing which are quite similar to the technique just described. However, they do not require adding code-paths to the grammar, and they reuse the error-locating ability of the §-marker. A disadvantage is that the DHParser-specific support for fail-tolerant parsing relies on regular expressions (or, as shown further below, PEG-rules) for finding the right re-entry points.

DHParser allows resuming parsing at a later point in the text after an error. When trying to resume parsing, two questions must be answered:

  1. At what location should the parsing process be resumed?

  2. Which parser in the parser call-chain should resume parsing? E.g. the parser that failed, the parser that called the parser that failed, … ?

The location where parsing should be resumed must be specified by a regular expression or a list of regular expressions. The resumption location is the nearest match of any of these expressions that does not fall into a comment (as specified by the @comment-directive described above). More precisely, it is the location directly after the match, because this allows searching for the reentry-location both by the text preceding and by the text following that location, using a lookahead operator inside the regular expression.

The parser that resumes parsing depends on the directive that guides the search for the reentry-point. DHParser offers two different directives for this purpose, the @..._skip-directive and the @..._resume-directive. The placeholder … stands for the name of a parser that contains a §-marker.

The skip-directive resumes parsing with the sequence-parser that contains the item(s) marked by the §-marker. In the following example, the skip-directive picks up parsing with the string-parser when an error was raised by the string-parser:

>>> grammar = '''
... @ string_error  = /\\\\/, 'Illegal escape sequence {1}'
... @ string_error  = '', 'Illegal character {1} in string.'
... @ string_skip   = /(?=")/
... string          = `"` §characters `"` ~
... characters      = { plain | escape }
... plain           = /[^"\\\\]+/
... escape          = /\\[\\/bnrt\\]/'''
>>> json_string = create_parser(grammar, 'json_string')
>>> tree = json_string('"al\\pha"')
>>> print(tree.content)
"al\pha"
>>> print(tree.errors[0])
1:4: Error (1010): Illegal escape sequence »\pha"«
>>> print(tree.as_sxpr())
(string
  (:Text '"')
  (characters
    (plain "al"))
  (ZOMBIE__ `(err "1:4: Error (1010): Illegal escape sequence »\pha\"«") "\pha")
  (:Text '"'))

After the error has occurred at the illegal escape sequence, the skip-directive catches the error, skips to the location just ahead of the "-character and continues parsing with the string-parser. The skipped passage is stored in a ZOMBIE__-Node within the syntax-tree and parsing can continue through to the end of the text.

In contrast to the skip-directive, the resume-directive leaves the parser that raised the error and resumes one level higher up in the call chain. The @ ..._resume-directive tells the calling parsers where to continue after the failure. So, in the following example, the parser resuming the parsing process is not the string-parser that has failed, but the first parser in the reverse call-stack of “string” that catches up at the location indicated by the @ ..._resume-directive. The location itself is determined by a regular expression, where the point for reentry is the location after the next match of the regular expression:

>>> grammar = grammar.replace('@ string_skip   = /(?=")/',
...                           r'@ string_resume = /("\s*)/')
>>> json_string = create_parser(grammar, 'json_string')
>>> tree = json_string('"al\\pha"')
>>> print(tree.content)
"al\pha"
>>> print(tree.errors[0])
1:4: Error (1010): Illegal escape sequence »\pha"«
>>> print(tree.as_sxpr())
(string
  (:Text '"')
  (characters
    (plain "al"))
  (ZOMBIE__ `(err "1:4: Error (1010): Illegal escape sequence »\pha\"«") '\pha"'))

Note that, this time, the zombie-node also contains the closing quotation mark. Also, it should be observed that the regular expression of the resume-directive stops after the closing quotation mark as well as any ensuing whitespace. This is because parsing will continue with the calling parser of the string-parser, so the resumption point must be at a reasonable place where the string-parser might have returned if no error had occurred.

A simple rule for specifying the reentry point of an error is to find a location where the next structural entity after the erroneous entity starts. Let’s try this for a (simplified) config-file parser:

>>> config_grammar = '''@literalws = right
... config     = ~ { section } EOF
... section    = heading { entry }
... heading    = "[" § identifier "]"
... entry      = identifier § ":" value
... identifier = /\\w+/~
... value      = `"` § TEXTLINE '"'
... TEXTLINE = /[^"\\n]*/
... EOF      =  !/./ '''
>>> config_parser = create_parser(config_grammar)

As of now, our parser is not fail-tolerant. This means it will stop parsing at the first error. Further errors are neither detected nor reported:

>>> cfg_data_with_errors = '''
... [entities]
... animal: "cat"
... plant: rose"
... Building: "Tour Eiffel"
... [colors
... red: "warm"
... blue 1: "cold"
... grey: "black and white" '''
>>> result = config_parser(cfg_data_with_errors)
>>> for error in result.errors_sorted:  print(error)
4:8: Error (1010): »value« expected by parser 'entry', but »rose"\nBuil...« found instead!

After adding suitable resume-clauses for those symbols whose definitions contain the mandatory marker §, all errors are reported in a single pass:

>>> config_grammar = '''
... @heading_resume = /\\n\\s*(?=\\w|\\[)/
... @entry_resume = /\\n\\s*(?=\\w|\\[)/
... ''' + config_grammar
>>> config_parser = create_parser(config_grammar)
>>> result = config_parser(cfg_data_with_errors)
>>> for error in result.errors_sorted:  print(error)
4:8: Error (1010): »value« expected by parser 'entry', but »rose"\nBuil...« found instead!
7:1: Error (1010): »"]"« expected by parser 'heading', but »red: "warm...« found instead!
8:6: Error (1010): »":"« expected by parser 'entry', but »1: "cold"\n...« found instead!

It can become difficult to find a reentry point with regular expressions that lies on the same level of the parser call chain (or one level higher up, in the case of the resume-directive) when an error occurs in a syntactic structure that can be recursively nested. Because of this, it is also possible to specify the re-entry point with a parser. In this case, however, the search term has a different semantics: If a parser is specified, it must match all characters from the point where the error occurred up to the reentry point. In the case of a simple string or a regular expression, by contrast, DHParser searches for the first match of the expression and then picks the location after that match. In order to distinguish the two cases clearly, PEG-rules must always be enclosed in round brackets. Thus, a single regular expression or a single string enclosed in round brackets will not be used as a search term but as a matching expression that determines the reentry-location by matching the complete text from the error location to the reentry-point.
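To give a first, schematic impression of the difference, here are two alternative (simplified, made-up) ways of specifying a resumption rule; the working rules for a concrete grammar follow in the examples below:

# searching (plain regular expression): find the next "]" and resume right after the match
@list_resume = /\]\s*/

# matching (PEG-rule in round brackets): the rule itself must consume the complete text
# from the error location up to the reentry point
@list_resume = ( /[^\]]*/ "]" )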

As an example, let’s try this with a parser for arbitrarily nested lists of positive integers. First, we write our grammar without any re-entry rules:

>>> number_list_grammar = r'''@ literalws   = right
... @ hide   = /_\w+/
... @ drop   = _EOF, whitespace, strings
... _document = ~ [ list ] §_EOF
... list     = "[" [_items] § "]"
... _items   = _item { "," §_item }
... _item    = number | list
... number   = `0` | /[1-9][0-9]*/
... _EOF     =  !/./'''
>>> list_parser = create_parser(number_list_grammar)
>>> list_parser('[[1,2], [3,4]]').as_sxpr()
'(list (list (number "1") (number "2")) (list (number "3") (number "4")))'

The following example contains three errors: the letter “A” in a place where a number should be, a bit later the wrong delimiter “;” instead of a comma, and finally a missing value (or a superfluous comma) right before the end. Since there are no rules for resuming the parser after an error has occurred, the parser will stop after the first error and also report that only part of the document could be parsed:

>>> example_with_errors = '[1, 2, A, [5, 6; 7], 8, ]'
>>> result = list_parser(example_with_errors)
>>> for e in result.errors: print(e)
1:8: Error (1010): »_item« expected by parser '_items', but »A, [5, 6; ...« found instead!

Now, let’s define some regular expression based rules to resume parsing after an error:

>>> resumption_rules = '''
... @list_resume = /]\\s*|$/
... @_items_skip = /(?=,)/, /(?=])/, /$/
... '''

Note that, for the “_items”-parser, several rules have been specified. DHParser will try all of these rules and resume at the closest of the locations these rules yield. The @list_resume-rule moves to a point after the list, where the list-parser might have returned if no error had occurred, which is after the closing square bracket plus any adjacent whitespace. The @_items_skip-rule moves to any point within the sequence of items where the _items-parser could catch up (another comma “,” followed by further items) or end, because the sequence is exhausted (“]” or the end of the document, caught by the regular expression marker for the end of the string: /$/). See how these rules play out in this particular example:

>>> list_parser = create_parser(resumption_rules + number_list_grammar)
>>> result = list_parser(example_with_errors)
>>> for e in result.errors: print(e)
1:8: Error (1010): »_item« expected by parser '_items', but »A, [5, 6; ...« found instead!
1:16: Error (1010): »"]"« expected by parser 'list', but »; 7], 8, ]« found instead!
1:25: Error (1010): »_item« expected by parser '_items', but »]« found instead!

All errors are located and reported properly in a single run and the parser continues right through until the end of the document as we’d expect from a fail-tolerant parser. However, the limitations of our regular-expression-based rules become apparent when nested structures are involved:

>>> example_with_errors_2 = '[1, 2, A, [5, 6; [7, 8], 9], 10, ]'
>>> result = list_parser(example_with_errors_2)
>>> for e in result.errors: print(e)
1:8: Error (1010): »_item« expected by parser '_items', but »A, [5, 6; ...« found instead!
1:16: Error (1010): »"]"« expected by parser 'list', but »; [7, 8], ...« found instead!
1:28: Error (1010): »_EOF« expected by parser '_document', but », 10, ]« found instead!

Here, the parser stopped before the end of the document, which shows that our resumption rules have been either incomplete or inadequate. Let’s turn on some debugging information to get a better insight into what went wrong:

>>> from DHParser.trace import resume_notices_on
>>> resume_notices_on(list_parser)
>>> result = list_parser(example_with_errors_2)
>>> for e in result.errors: print(e)
1:8: Error (1010): »_item« expected by parser '_items', but »A, [5, 6; ...« found instead!
1:9: Notice (50): Skipping from 1:8 'A, [5, ...' within _item->:_item to 1:9 ', [5, 6...'
1:16: Error (1010): »"]"« expected by parser 'list', but »; [7, 8], ...« found instead!
1:24: Notice (50): Resuming from list at 1:16 '; [7, 8...' with _items->:Series at 1:24 ', 9], 1...'
1:28: Error (1010): »_EOF« expected by parser '_document', but », 10, ]« found instead!

What is of interest here is the second notice: It seems that the error was caught within the “list”-parser. By moving on to the spot after the closing bracket, as determined by the @list_resume-directive, however, the parsing process did not end up at a location behind the grammatical structure where the error had occurred, but at the location after a nested structure, which in this example is the inner list “[7, 8]”.

This problem can be remedied by using the full power of parsing expression grammars for determining the resumption-position. (Note that, in order to distinguish PEG-rules unambiguously from plain regular expressions or simple strings, they must be enclosed in round brackets!):

>>> resumption_rules = '''
... @list_resume = ({ list | /[^\\[\\]]+/ } ["]"])
... @_items_skip = /(?=,)/, /(?=])/, /$/
... '''
>>> list_parser = create_parser(resumption_rules + number_list_grammar)
>>> resume_notices_on(list_parser)
>>> result = list_parser(example_with_errors_2)
>>> from DHParser.error import RESUME_NOTICE
>>> for e in result.errors:
...     if e.code != RESUME_NOTICE: print(e)
1:8: Error (1010): »_item« expected by parser '_items', but »A, [5, 6; ...« found instead!
1:16: Error (1010): »"]"« expected by parser 'list', but »; [7, 8], ...« found instead!
1:34: Error (1010): »_item« expected by parser '_items', but »]« found instead!

This time, the parser does not terminate before the end. The resume-notices show that resumption does not get caught on the nested structure, any more:

>>> for e in result.errors:
...     if e.code == RESUME_NOTICE: print(e)
1:9: Notice (50): Skipping from 1:8 'A, [5, ...' within _item->:_item to 1:9 ', [5, 6...'
1:28: Notice (50): Resuming from list at 1:16 '; [7, 8...' with _items->:Series at 1:28 ', 10, ]'
1:34: Notice (50): Skipping from 1:34 ']' within _items->:Series to 1:34 ']'

Programming fail-tolerant parsers can be quite a challenge. DHParser’s @skip- and @resume-directives help to separate the code for fail-tolerance from the grammar proper. The only hooks that are needed within the grammar proper are the mandatory markers (”§”). Still, it is a good strategy to start adding the fail-tolerance code only later in the development of a grammar and to keep things as simple as possible. This can best be done by choosing resumption points that are as unambiguous as possible (like keywords that introduce specific parts of the document), even if this means that larger parts of the document will be skipped and, consequently, some errors will remain undetected until errors earlier in the document have been fixed. If syntax errors are sparse - as can reasonably be assumed - the harm done by skipping larger portions of the text is probably negligible or at any rate smaller than the harm done by introducing consequential errors as a result of poorly chosen resumption rules.

Macros

In order to reduce code-repetition within grammar-specifications DHParser offers a substitution-based macro-system. (To avoid code-repetition between grammars, includes can be used.)

Macros are defined similarly to symbols, except that their name must always start with a dollar sign $:

$list($delimiter) = { /[^,.;:]+/ | !$delimiter /[,.;:]/ }
       { $delimiter ~ { /[^,.;:]+/ | !$delimiter /[,.;:]/ } }

The parameters of a macro are also marked with a dollar sign. Using macros is straightforward:

keywords = "Keywords:" $list(",")

Macros can use other macros in their definiens, but they may not be nested recursively. It is also possible to define macros without any arguments. This makes sense when breaking up a long macro definition into several macros, which works similarly to breaking up a long symbol-definition by introducing (disposable) helper-symbols. For this purpose, helper-macros may refer to parameters of the “calling” macro without these being explicitly passed as parameters:

$list($delimiter) = $item { $delimiter ~ $item }
$item = { /[^;:,.\n]+/ | !$delimiter /[.,:;]/  }+

A limitation of DHParser’s macro-system is that there is no way to define new symbols inside a macro-definition.

Here is a simple example for the use of macros to highlight particular words when parsing a text:

>>> hilight_grammar = '''@reduction = merge
...     song = { $skip("rain") rain } [$skip("rain")] EOF
...     $skip($phrase) = { !$phrase /.|\\n/ }
...     rain  = "rain"
...     EOF   = !/.|\\n/'''

>>> hilight_parser = create_parser(hilight_grammar)
>>> song = """
...     I'm singin' in the rain
...     Just singin' in the rain
...     What a glorious feeling
...     I'm happy again."""

>>> syntax_tree = hilight_parser(song)
>>> print(syntax_tree.as_sxpr())
(song
  (skip
    ""
    "    I'm singin' in the ")
  (rain "rain")
  (skip
    ""
    "    Just singin' in the ")
  (rain "rain")
  (skip
    ""
    "    What a glorious feeling"
    "    I'm happy again.")
  (EOF))

For another, more detailed example, see Macros.

Context sensitive parsers

DHParser intentionally does not contain support for semantic actions, because these can introduce a context-sensitivity that can be hard to handle with a recursive descent parser. And compiler-generation, where semantic actions are mostly needed, is not the main domain of application for DHParser. (There are a few loopholes that can be (mis-)used for semantic actions, though…)

Sometimes, however, it would be very convenient to break out of the paradigm of context-free grammars - if only just a little bit. For example, when encoding data in XML or HTML, the closing tag must have the same tag-name as the opening tag:

<line>O Rose thou art sick.</line>

If you encode the tag-name-parser roughly following the XML-specs as:

tag_name        = /(?![0-9])[:\\w][\\w:.-]*/

the following code would be accepted by the parser:

<line>O Rose thou art sick.</enil>

In this case, the remedy is easy: When post-processing the syntax tree, check whether all end-tags have the same tag-name as the corresponding start-tag and add an error message where this is not the case. However, this only works, because the tag-names have no influence on the structure of the syntax-tree of an XML-document.
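The following is a minimal, untested sketch of such a post-processing check. It uses only the node-API that appears elsewhere in this document (select, pick, content, new_error); the function name check_matching_tags and the wording of the error message are made up for illustration, and a context-free grammar is assumed in which every “element”-node starts with an “STag”-node and ends with an “ETag”-node, each containing a “TagName”-node:

def check_matching_tags(tree):
    # tree is the root node returned by the parser
    for element in tree.select('element'):
        if not element.children:
            continue
        stag, etag = element.children[0], element.children[-1]   # STag first, ETag last
        start, end = stag.pick('TagName'), etag.pick('TagName')
        if start is not None and end is not None and start.content != end.content:
            tree.new_error(etag, f'Tag mismatch: <{start.content}> closed by </{end.content}>')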

Therefore, the same remedy would not work in many other cases. In CommonMark, for example, a “code fence is a sequence of at least three [but possibly more] consecutive backtick characters (`) or tildes (~)”. This is a bit more complicated than the XML-example, because here the content of the closing marker needs to be known already at the time of parsing, since otherwise the structure could not be determined correctly: A smaller number of tildes or backticks than used at the opening of a code fence is part of the content of the fenced code and needs to be distinguished from the closing delimiter. The purpose is to allow you to fence code that may itself contain fenced code:

~~~~~
In common mark code can be fenced with tilde-characters. You can just write:
~~~
fenced code
~~~
~~~~~

Here, a possible remedy is to employ a preprocessor that distinguishes the fenced code from quoted fences and replaces the non-quoted fences by context-free opening and closing markers that can then be captured at the parsing stage. Using pre-processors is often a clean and pragmatic solution, and DHParser includes dedicated support for pre-processors. However, introducing pre-processors also has some downsides. One disadvantage is that a preprocessor requires a makeshift parser of its own that must be robust enough not to stumble over syntactic constructs that do not concern the preprocessor, like comments, for example.

DHParser therefore also offers another alternative for occasional context-sensitivity by allowing earlier values of a symbol to be retrieved and compared. Any users of these features should be aware, however, that extensive use of context-sensitive parsers may slow down parsing, because it does not play well with the memoizing optimization that is commonly used with parsing expression grammars and is also employed by DHParser. Also, since these features are specific to DHParser, switching to another parser generator will require factoring the context-sensitive parsers out of your grammar and re-implementing the functionality for which they have been used with the more conventional approach of pre- and post-processors.

Before explaining the full mechanism, let’s have a look at an example. The following minimal pseudo-XML-parser captures the value of the tag-name so that it can be compared with the tag-name of the ending-tag:

>>> miniXML = '''
... @ hide   = EOF
... @ drop   = EOF, whitespace, strings
... document = ~ element ~ §EOF
... element  = STag §content ETag
... STag     = '<' TagName §'>'
... ETag     = '</' ::TagName §'>'
... TagName  = /\\w+/
... content  = [CharData] { (element | COMMENT__) [CharData] }
... CharData = /(?:(?!\\]\\]>)[^<&])+/
... EOF      =  !/./
... '''
>>> parseXML = create_parser(miniXML)
>>> print(parseXML('<line>O Rose thou art sick.</line>').as_sxpr())
(document
  (element
    (STag
      (TagName "line"))
    (content
      (CharData "O Rose thou art sick."))
    (ETag
      (TagName "line"))))
>>> result = parseXML('<line>O Rose thou art sick.</enil>')
>>> print(result.errors[0])
1:28: Error (1010): »ETag = `</` ::TagName "line" § `>`« expected by parser 'element', but »</enil>« found instead!

Here, the TagName-parser in the definition of ETag has been prefixed with a double colon ::. This double colon is the “Pop”-operator and can be put in front of any symbol defined in the grammar. If a symbol is annotated with this operator, its parsing-rule will not be executed; instead, the last value that has been parsed by the symbol will be retrieved and matched against the following part of the document by simple text-comparison. If the last value matches, the “Pop”-parser will remove that value from the stack of earlier values and report a match. Otherwise, a non-match will be reported and the value will be left on the stack. In the example, since the Pop-parser ::TagName follows a mandatory marker §, the non-match causes a syntax error.

Sometimes it is useful to compare the following text with a stored value without removing that value from the stack. For this purpose, there is the “Retrieve”-operator, which is denoted by a single colon (:):

>>> fencedTextEBNF =  '''@whitespace = vertical
... @hide    = EOF, fence_re
... @drop    = whitespace, EOF
... document = ~ { text | fenced } EOF
... text     = /[^\\~\\n]+/ ~  # /(?:[^\\~]+|\\n)+/
... fenced   = fence ~ { text | no_fence } ::fence ~
... no_fence = ! :fence fence_re ~
... fence    = fence_re
... fence_re = /\\~+/
... EOF      = !/./
... '''

Here the Pop-operator :: is used in the definition of fenced in just the same way as in the earlier example. The Retrieve-operator : is used in the definition of no_fence in combination with a negative lookahead !. This allows the no_fence-parser to capture all “fences” which are not the closing fence of the current fenced environment:

>>> parseFenced = create_parser(fencedTextEBNF)
>>> fenced_test_1 = '''~~~
... fenced code
... ~~~'''
>>> print(parseFenced(fenced_test_1).as_sxpr())
(document (fenced (fence "~~~") (text "fenced code") (fence "~~~")))
>>> fenced_test_2 = '''~~~~~
... In common mark code can be fenced with tilde-characters. You can just write:
... ~~~
... fenced code
... ~~~
... ~~~~~'''
>>> print(parseFenced(fenced_test_2).as_sxpr())
(document
  (fenced
    (fence "~~~~~")
    (text "In common mark code can be fenced with tilde-characters. You can just write:")
    (no_fence "~~~")
    (text "fenced code")
    (no_fence "~~~")
    (fence "~~~~~")))

But what if the opening and closing fence are not one and the same string, but complements of each other, like opening and closing brackets? Say, you’d like to enclose code-examples in curly braces “{” and “}”, and since the code examples themselves may contain braces, you’d like to allow the markup-writer to use an arbitrary number of braces as opening and closing delimiters:

>>> markup = "This ist a code example: {{ mapping = { 'a': 1, 'b': 2} }} with braces."

Here, the recalled value would need to be transformed or otherwise interpreted before the following text is considered either a match or a non-match. Since such a transformation can hardly be encoded in EBNF, not even in an augmented EBNF, one of DHParser’s loopholes for semantic actions must be used: It is possible to assign Python filter-functions to those symbols whose values are retrieved. This is done with yet another directive, the @ XXXX_filter-directive, where “XXXX” stands for the name of the symbol:

>>> bracesExampleEBNF = '''
... @braces_filter = matching_bracket()
... document       = { text | codeblock }
... codeblock      = braces { text | opening_braces | (!:braces closing_braces) } ::braces
... braces         = opening_braces
... opening_braces = /\\{+/
... closing_braces = /\\}+/
... text           = /[^{}]+/
... '''

The function name that is passed to the directive must be the name of a function that is within reach of the generated parser (by either defining it in the same module or importing it) and it must have the signature:

Callable[[str, List[str]], Optional[str]]

This function takes the following text as well as the stack of previous values of the symbol that is being retrieved as arguments, and it must return either a stretch of matched text or None to indicate a non-match. The function matching_bracket() is already defined in DHParser.parse. Slightly simplified to cover only the case of curly braces, it looks like this:

>>> from typing import List, Optional
>>> def matching_bracket(text: str, stack: List[str]) -> Optional[str]:
...     value = stack[-1]
...     value = value.replace("{", "}")
...     return value if text[:len(value)] == value else None

>>> markup_parser = create_parser(bracesExampleEBNF)
>>> result = markup_parser(markup)
>>> print(result.as_sxpr())
(document
  (text "This ist a code example: ")
  (codeblock
    (braces
      (opening_braces "{{"))
    (text " mapping = ")
    (opening_braces "{")
    (text " 'a': 1, 'b': 2")
    (closing_braces "}")
    (text " ")
    (braces "}}"))
  (text " with braces."))

Here, the outer double braces “{{” and “}}” open up and close a new code block and could be discarded as delimiters during the AST-transformation, while the opening and closing braces within the code block are simply that: opening and closing braces.

Advanced usages

Apart from the Pop- and Retrieve-operators, DHParser offers a third retrieval operator that, like the Pop-operator, “pops” the last value from the stack, but matches either this value or the empty string. In other words, this operator always matches as long as there is still a value on the stack, but it captures the beginning of the following text only if that text matches the stored value. A non-match only happens when the stack has already been exhausted. This “Pop anyways”-operator is denoted by a colon followed by a question mark :?. Weird as this may sound, this operator has astonishingly manifold use cases. Think, for example, of a modification of our minimal pseudo-XML parser that allows coders to omit the tag name in closing tags to save them some typing:

>>> miniXML = miniXML.replace('::TagName', ':?TagName')
>>> parseXML = create_parser(miniXML)
>>> data_tree = parseXML('<line>O Rose thou art sick.</>').as_tree()
>>> print(data_tree)
document
  element
    STag
      TagName "line"
    content
      CharData "O Rose thou art sick."
    ETag
      TagName

Another, rather tricky use case is to let the value of certain symbols be determined on first use by marking all appearances of these symbols on the right-hand side of the definitions wherein they appear with the single-colon retrieve-operator : and clearing the stack with [:?symbol] after the end of file has been reached. This technique has been employed for the “FlexibleEBNF”-parser in the examples folder. The FlexibleEBNF-parser “magically” adjusts itself to different syntactical flavors of EBNF. Here is an abbreviated excerpt of the grammar of this parser, to show how this technique can be used:

syntax     = ~ { definition } EOF
definition = symbol §:DEF~ expression :ENDL~
DEF        = `=` | `:=` | `::=` | `<-` | /:\\n/ | `: `
ENDL       = `;` | ``
EOF        = !/./ [:?DEF] [:?ENDL]

This trick can also be used to parse indentation:

>>> tree_grammar = '''@whitespace = horizontal
... @hide = EOF, LF, SAME_INDENT
... @drop = strings, whitespace, EOF, LF, SAME_INDENT
...
... tree     = INDENT node DEDENT /\\s*/ EOF
... node     = tag_name [content]
... content  = string | children
... children = &(LF HAS_DEEPER_INDENT)
...            LF INDENT § node { LF SAME_INDENT § node }
...            !(LF HAS_DEEPER_INDENT) DEDENT
... tag_name = /\\w+/~
... string   = '"' § /(?:\\\\"|[^"\\n])*/ '"' ~
...
... INDENT            = / */
... SAME_INDENT       = :INDENT § !/ /
... HAS_DEEPER_INDENT = :INDENT / +/
... DEDENT            = &:?INDENT
...
... LF       = /\\n/
... EOF      = !/./
... '''
>>> tree_parser = create_parser(tree_grammar)
>>> syntax_tree = tree_parser(data_tree)
>>> # show but the first 22 lines of the syntax-tree:
>>> print('\n'.join(syntax_tree.as_sxpr().split('\n')[:22] + ['...']))
(tree
  (INDENT)
  (node
    (tag_name "document")
    (content
      (children
        (INDENT "  ")
        (node
          (tag_name "element")
          (content
            (children
              (INDENT "    ")
              (node
                (tag_name "STag")
                (content
                  (children
                    (INDENT "      ")
                    (node
                      (tag_name "TagName")
                      (content
                        (string "line")))
                    (DEDENT))))
...

In case you are surprised by the size of the resulting syntax-tree, keep in mind that the syntax-tree or “parse-tree” of the serialization of a data structure is not the data-structure itself, even if it happens to be a tree. However, the data-structure can be retrieved from the tree.

As can be seen, the INDENT-elements have captured indentation of various increasing lengths. Also, observe the use of the “querying parser” HAS_DEEPER_INDENT. A “querying parser” is a parser that is meant to be used only inside a negative or positive lookahead. The first query with HAS_DEEPER_INDENT makes sure that the next call to INDENT will indeed capture a longer stretch of whitespace than is represented by its present value, i.e. the last value on the “INDENT”-symbol’s value-stack.

DHParser makes sure that the value-stack of each captured variable will properly be unwound when returning from a non-match to an earlier position in the document. However, there is one case where this does not work, namely, if a symbol has matched the empty string. Because value-stack unwinding is tied to the position in the text instead of the call stack (which DHParser only keeps track of when testing or debugging), it is impossible to decide whether a zero-length capture must be unwound or not in case DHParser returns exactly to the position of that capture. In this case DHParser does not unwind the capture. This can lead to a non-empty value stack at the end of the parsing process, which will be reported as an error. There are three strategies to avoid this error:

  1. Make sure that all symbols for which values are retrieved capture at least one character of text if they match.

    In this example above, the symbol INDENT would have to be defined as INDENT = / +/ to achieve this. However, this would have made the formulation of the tree-grammar of this example more complicated.

  2. Make sure that in all cases where zero-length text could possibly be captured a subsequent non-match cannot occur before the zero-length value has been retrieved.

    In the example above, this situation can only occur in the parser tree and could be possibly avoided with a lookahead, e.g. tree = &(/ */ tag_name) INDENT node DEDENT. (However, since this avoidance strategy does not need to cover cases where the document is syntactically incorrect, anyway, we can rely on the fact that - since a syntactically correct document must contain a single root node at the very beginning - the parser will never retreat to the first captured INDENT.)

  3. As a last resort one could clear the stack with one or more optional [DEDENT] parsers that can safely be added after EOF. A modification that allows the document to consist of one or more top-level nodes would then look like: tree = { INDENT node DEDENT }+ /\\s*/ EOF [DEDENT]

The problem of error resumption

Resuming parsing after an error has occurred is complicated because, if done badly, it can lead to consequential errors, i.e. errors that are artifacts of the resumption rule or algorithm and not really errors in the source text. This can be quite confusing, and if worst comes to worst, one cannot trust any error message after the first one, which in turn defeats the whole point of resuming parsing rather than breaking off after the first error.

Unfortunately, consequential errors are even harder to avoid when using context-sensitive parsers. Say, we present our XML-parser with a little mistake:

>>> xmldoc = '''
... <doc>
...     <title>Heading</litle>
...     A few lines of Text
... </doc>'''
>>> result = parseXML(xmldoc)
>>> for e in result.errors_sorted: print(e)
3:21: Error (1010): »`>`« expected by parser 'ETag', but »litle>\n   ...« found instead!
5:7: Error (1050): Capture-stack not empty after end of parsing: TagName 1 item

Since our original mini-XML-grammar did not contain any error-resumption-directives, this is what we’d expect. Now let’s add a resume-directive to continue after misspelled tag-names:

>>> miniXML = '''
... @ hide   = EOF
... @ drop   = EOF, whitespace, strings
... document = ~ element ~ §EOF
... @element_resume = ('</' /\\w+/ '>')
... element  = STag §content ETag
... STag     = '<' TagName §'>'
... ETag     = '</' ::TagName §'>'
... TagName  = /\\w+/
... content  = [CharData] { (element | COMMENT__) [CharData] }
... CharData = /(?:(?!\\]\\]>)[^<&])+/
... EOF      =  !/./
... '''
>>> parseXML = create_parser(miniXML)
>>> result = parseXML(xmldoc)
>>> for e in result.errors_sorted: print(e)
3:19: Error (1010): »ETag = `</` ::TagName "title" § `>`« expected by parser 'element', but »</litle>\n ...« found instead!
5:1: Error (1010): »ETag = `</` ::TagName "title" § `>`« expected by parser 'element', but »</doc>« found instead!

The second error is merely a consequential error. One can imagine that, had the mistake of misspelling the ending tag occurred deeper in the XML hierarchy, a consequential error for every closing tag after the erroneous one would have been reported, because the name of the opening tag corresponding to the misspelled closing tag has - because of the misspelling - never been removed from the stack.

We can try to avoid this problem by “popping” one element from the stack within our resume rule. This is of course only possible, if we specify our resume rule as full PEG-expression, not just a regular expression:

>>> new_resume_rule = r"@element_resume = (:?TagName '</' /\w+/ '>')"

The important part of this rule is the “pop anyways”-operator :?TagName: Since this operator does not capture any text in the case of a misspelled tag-name, but merely removes its value from the stack, it is followed by the sequence '</' /\w+/ '>', which moves the parser forward past the misspelled ending-tag. (A regular expression like /\s*(?=>)/ could additionally be used to cover ending-tags that fail to parse merely because of superfluous blanks before the closing >.) Let’s replace the old rule with the new one and try again:

>>> i = miniXML.find('@element_resume')
>>> k = miniXML.find('\n', i)
>>> miniXML = miniXML[:i] + new_resume_rule + miniXML[k:]
>>> parseXML = create_parser(miniXML)
>>> result = parseXML(xmldoc)
>>> for e in result.errors_sorted: print(e)
3:19: Error (1010): »ETag = `</` ::TagName "title" § `>`« expected by parser 'element', but »</litle>\n ...« found instead!

If only it were so simple! Let’s see what happens if we present our parser with another, rather obvious or common mistake: forgetting the closing tag (or, what leads to a similar confusion of the parser, omitting the closing slash / in an empty tag):

>>> xmldoc = '''
... <doc>
...    <title>Heading <wrong></title>
... </doc>'''
>>> result = parseXML(xmldoc)
>>> for e in result.errors_sorted: print(e)
3:26: Error (1010): »ETag = `</` ::TagName "wrong" § `>`« expected by parser 'element', but »</title>\n<...« found instead!
4:1: Error (1010): »ETag = `</` ::TagName "title" § `>`« expected by parser 'element', but »</doc>« found instead!
4:7: Error (1010): »ETag = `</` ::TagName "doc" § `>`« expected by parser 'element', but END OF FILE found instead!

In this case the removal of the last value from the stack in the @element_resume-rule results in a cascade of consecutive errors (or “error-artifacts”), one for every outer closing tag.

Avoiding error-artifacts such as these is an art in itself and requires looking at the whole picture. As of now, there exists no comprehensive theory of how to do this. A good starting point might be to take a look at the errors that users of the domain-specific language make most frequently, as well as at those error messages that confuse them most. (Bug-reports, help-desk questions or telemetric data might yield the required empirical information for this approach.)

A more robust rule-set for our mini-XML-grammar might be:

>>> miniXML = r'''
... @ whitespace  = /\s*/
... @ hide   = EOF
... @ drop   = EOF, whitespace, strings
...
... document = ~ element ~ §EOF
... @element_resume = (:?TagName (&('</' :TagName '>') | '</' /\w+/ '>'))
... element  = STag §content ETag
... STag     = '<' TagName §'>'
... ETag     = '</' ::TagName §'>'
... TagName  = /\w+/
... content  = [CharData] { (element | COMMENT__) [CharData] }
...
... CharData = /(?:(?!\]\]>)[^<&])+/
... EOF      =  !/./        # no more characters ahead, end of file reached
... '''
>>> parseXML = create_parser(miniXML)
>>> result = parseXML(xmldoc)
>>> for e in result.errors_sorted: print(e)
3:26: Error (1010): »ETag = `</` ::TagName "wrong" § `>`« expected by parser 'element', but »</title>\n<...« found instead!

This rule has been designed to cover both of the previous cases. However, if we combine both errors, its limitations begin to show:

>>> xmldoc = '''
... <doc>
...     <title>Heading<wrong></litle>
...     A few lines of Text
... </doc>'''
>>> result = parseXML(xmldoc)
>>> for e in result.errors_sorted: print(e)
3:26: Error (1010): »ETag = `</` ::TagName "wrong" § `>`« expected by parser 'element', but »</litle>\n ...« found instead!
5:1: Error (1010): »ETag = `</` ::TagName "title" § `>`« expected by parser 'element', but »</doc>« found instead!

Here, it seems almost impossible to avoid an error-cascade in combination with context-sensitive parsers when met with a combination of different errors. In this particular case of an XML-parser, the best way out might be not to use context-sensitive parsers at all and to check the matching of XML-tags at a later processing stage.

Custom Parsers

Specifying a grammar in EBNF and generating a parser from this specification has the advantage that one gets a much better overview of the grammar than when writing a complete parser directly in Python. This also makes changing the grammar and explaining it to others much easier. However, there are cases where writing a parser, at least for particular components, in Python can be advantageous. These include cases where it is easier to write the parser in Python, where a Python-parser might be faster, or where a parser cannot be coded in EBNF at all, say, because it requires a Turing-complete programming language.

Custom-parser example

As you might remember from one of the macro-examples above (see Macros), a kind of parser that is a bit tedious to code in EBNF is a search parser. In the simplest imaginable case (e.g. ignoring comments and nested structures in which the search phrase might occur but should not be found) you could write { !"search_phrase" /.|\\n/ }. Be aware that this search parser could land you in the middle of a comment, e.g. when “searching” for “rain” in the line I'm singing in # the rain (assuming # as the comment sign). So it is better to write { !"search_phrase" (~ | /.|\\n/) }. To keep the example simple, we’ll leave such worries aside in the following:

>>> search_rain_grammar = '''@reduction = merge
...     song = { search_rain count_rain } [search_rain] EOF
...     search_rain = { !"rain" /.|\\n/ }
...     count_rain  = "rain"
...     EOF         = !/.|\\n/'''
>>> rain_counter = create_parser(search_rain_grammar)
>>> song = """
...     I'm singin' in the rain
...     Just singin' in the rain
...     What a glorious feeling
...     I'm happy again."""
>>> syntax_tree = rain_counter(song)
>>> print(syntax_tree.as_sxpr())
(song
  (search_rain
    ""
    "    I'm singin' in the ")
  (count_rain "rain")
  (search_rain
    ""
    "    Just singin' in the ")
  (count_rain "rain")
  (search_rain
    ""
    "    What a glorious feeling"
    "    I'm happy again.")
  (EOF))
>>> print(len(list(syntax_tree.select('count_rain'))))
2

Let’s now define a custom parser for “search_rain”. A custom parser is a factory-function that returns a “custom parse function”, which in turn is a function that takes a StringView-object as input and returns a nodetree.Node-object or None as result. This factory-function can be a class derived from class parse.Parser, but it does not need to be. In the EBNF-grammar it is specified as @name(param), where name is the name of the factory function on the Python side and param is either an identifier or - if enclosed in quotation marks - a single string parameter on the Python side, which can be used to configure the factory function.

It is also possible to specify a custom parser function directly instead of a factory function. However, it must then be specified as the parameter of the predefined factory function “Custom”, e.g. @Custom(parse_func).

The parser (factory) function itself must be defined in a global namespace that is reachable from the Grammar-class generated from the EBNF-code. A good place is the beginning of the generated “xxxParser.py”-script in case of DHParser-projects created with the “dhparser”-command. If, as in the following, the parser is created with create_parser(), the custom code should be passed as a Python-source-string to the additional_code-argument of the create_parser()-function. In the following snippet we therefore write it directly into a string.

>>> search_src = """
... from DHParser.parse import CustomParseFunc
... from DHParser.stringview import StringView
... def search(what: str) -> CustomParseFunc:
...     def parse_func(rest: StringView) -> Optional[Node]:
...         i = rest.find(what)
...         if i >= 0:
...             return Node('', rest[:i])
...         else:
...             return Node('', rest)
...     return parse_func
... """

Note that the name of the returned node is left empty. It will be filled in by the CustomParser-object, either with the anonymous-parser name “:CustomParser” or with the name of the symbol to which the parser is assigned, because this information is not available within the function. If, however, the custom parser is defined as a class derived from CustomParser, the right name must be given when creating the node. In most cases this will simply be self.node_name.

Also, keep in mind that the parameter passed to the parsing function is of type StringView and not simply a string. This view represents the remaining part of the document to be parsed.

As mentioned, referring to the custom parser factory function from the grammar is simple:

>>> search_rain_grammar = '''@reduction = merge
...     song = { search_rain count_rain } search_rain EOF
...     search_rain = @search("rain")
...     count_rain  = "rain"
...     EOF         = !/.|\\n/'''

>>> rain_counter = create_parser(search_rain_grammar,
...                              additional_code=search_src)

Let’s try using it:

>>> syntax_tree = rain_counter(song)
>>> print(syntax_tree.as_sxpr())
(song
  (search_rain
    ""
    "    I'm singin' in the ")
  (count_rain "rain")
  (search_rain
    ""
    "    Just singin' in the ")
  (count_rain "rain")
  (search_rain
    ""
    "    What a glorious feeling"
    "    I'm happy again.")
  (EOF))

Another good use case for custom parsers is matching one of a long list of fixed alternatives, say, a list of country names (“Afghanistan, Albania, … Zimbabwe”), which would clutter the grammar and slow down parsing. This is particularly true if such lists change frequently and are better fetched directly from a database than encoded in the grammar, which would have to be recompiled every time the data changes.
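A sketch of such a parser might look as follows. The set of names is hard-coded here and merely stands in for data fetched from a database; the factory name country is made up for this illustration, and - as in the factory-function examples above and below - re, Node, Optional, StringView and CustomParseFunc are assumed to be available in the generated parser module:

>>> country_src = r"""
... # stand-in for a list of alternatives fetched from a database at start-up
... COUNTRIES = {'Afghanistan', 'Albania', 'Zimbabwe'}
... RX_NAME = re.compile(r'\w+')
...
... def country() -> CustomParseFunc:
...     def parse_func(rest: StringView) -> Optional[Node]:
...         m = rest.match(RX_NAME)
...         if m and m.group(0) in COUNTRIES:
...             return Node('', m.group(0))   # consume exactly the matched name
...         return None                       # no match: let other parsers try
...     return parse_func
... """

Within the grammar, such a parser would be invoked like any other custom parser, e.g. country_name = @country(), and the source string would be passed to create_parser() via the additional_code-argument.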

Implementing Custom Parsers

There exist three different ways to implement a custom parser:

1. via factory-functions The “standard” way has been shown above: you write a factory function that returns the parsing function. The factory function is needed when you want to pass arguments to your parsing function, because these are saved in the closure of the factory function - as in the example above.

Caution

It is only possible to pass a single argument to custom parsers in the grammar which can either be a string (enclosed by quotation marks) or a symbol (without quotation marks).

If you want to pass more than one argument, you’ll have to encode the list of arguments in a single string and decode this string in your Python code.
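For example, a hypothetical factory that needs both an opening and a closing delimiter could receive them packed into one comma-separated string - say, @enclosed("(,)") in the grammar - and unpack it on the Python side. The following is only a sketch, not part of DHParser’s API:

>>> enclosed_src = """
... def enclosed(delimiters: str) -> CustomParseFunc:
...     # hypothetical factory: two arguments packed into a single string
...     opening, closing = delimiters.split(',')
...     def parse_func(rest: StringView) -> Optional[Node]:
...         if not rest.startswith(opening):
...             return None
...         i = rest.find(closing, len(opening))
...         if i < 0:
...             return None
...         return Node('', rest[:i + len(closing)])
...     return parse_func
... """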

This also works for parsing functions without arguments, of course:

>>> word_parser_factory = r"""
... def parse_word() -> CustomParseFunc:
...     RX_WORD = re.compile(r'\w+')
...     def inner_parse_word(rest: StringView) -> Optional[Node]:
...         m = rest.match(RX_WORD)
...         if m:
...             return Node('', rest[:m.end()])
...         return None
...     return inner_parse_word"""

As noted earlier, the parsing function (i.e. “inner_parse_word”) receives a StringView of the remaining part of the document as its parameter and should return a Node with an empty name, so that the proper node-name can be filled in by DHParser’s CustomParser-wrapper.

In a grammar, this custom parser needs to be invoked with parentheses “()”, but without an argument:

>>> word_grammar = """@reduction = merge
...     document = @parse_word()"""

>>> word_parser = create_parser(word_grammar, additional_code=word_parser_factory)
>>> print(word_parser("Hello").as_sxpr())
(document "Hello")

2. as a simple parser function If no arguments need to be passed to the custom parser, it can also be implemented without a factory-wrapper, just as a simple parsing function:

>>> plain_word_parser = r"""
... def parse_word(rest: StringView) -> Optional[Node]:
...     m = rest.match(re.compile(r'\w+'))
...     if m:
...         return Node('', rest[:m.end()])
...     return None"""

If the custom parsing function is implemented without a factory-wrapper in the Python code, it needs to be wrapped in a @Custom(…)-clause within the grammar. This tells DHParser that it should not expect the parsing function invoked in the grammar to be a factory function:

>>> word_grammar = """@reduction = merge
...     document = @Custom(parse_word)"""

>>> word_parser = create_parser(word_grammar, additional_code=plain_word_parser)
>>> print(word_parser("Hello").as_sxpr())
(document "Hello")

3. as a fully implemented parsing class Finally, custom parsers can also be implemented as descendants of parse.Parser. This is the most complicated case, because writing a class derived from Parser, parse.UnaryParser or NaryParser requires paying attention to all kinds of requirements, for example the need for a __deepcopy__-method. Therefore, implementing one’s own Parser-classes should be avoided if possible.

Nevertheless, here is the previous example implemented with a custom parser class:

>>> word_parser_class = """from DHParser.parse import Parser, copy_parser_base_attrs
... class ParseWord(Parser):
...     def __init__(self):
...         super().__init__()
...         self.RX_WORD = re.compile(r'\\w+')
...
...     def __deepcopy__(self, memo):
...         duplicate = self.__class__()
...         copy_parser_base_attrs(self, duplicate)
...         return duplicate
...
...     def _parse(self, location) -> Tuple[Optional[Node], int]:
...         document_view = self.grammar.document__
...         m = document_view.match(self.RX_WORD, location)
...         if m:
...             return Node(self.node_name, document_view[location: m.end()]), m.end()
...         return None, location"""

>>> word_grammar = """@reduction = merge
...     document = @ParseWord()"""

>>> word_parser = create_parser(word_grammar, additional_code=word_parser_class)
>>> print(word_parser("Hello").as_sxpr())
(document "Hello")

In contrast to the former two methods, implementing one’s own custom Parser-classes has the benefit of giving access to the Grammar-object and of allowing manipulation of the location pointer. A possible use case is when you would like to store the unstructured source-data of some part of the source-document in the parsing tree, next to the data structured by the parser:

>>> save_source = """from DHParser.parse import Parser
... RX_LINE = re.compile(r'[^ \\t\\n]+(?:[ \\t]*[^ \\t\\n]+)*')
...
... class SaveSource(Parser):
...     def _parse(self, location) -> Tuple[Optional[Node], int]:
...         match = RX_LINE.match(self.grammar.text__, location)
...         source_code = match.group(0) if match else ''
...         # return unstructured source but do not move location pointer forward!
...         return Node('Quelle', source_code), location"""

>>> grammar = """@reduction = merge
...     document = line { /\\\\n/ line }
...     line = @SaveSource() word { ' ' word }
...     word = /\\w+/"""

>>> parser = create_parser(grammar, additional_code=save_source)
>>> print(parser("Hello World").as_sxpr())
(document (line (Quelle "Hello World") (word "Hello") (:Text " ") (word "World")))
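The saved source can afterwards be retrieved from the tree like any other node, for instance with Node.pick(), which - assuming the tree shown above - returns the first matching descendant:

>>> tree = parser("Hello World")
>>> source = tree.pick('Quelle').content   # == "Hello World"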

Semantic Actions

Finally, custom parsers can also be used (or abused) for semantic actions of any kind. Tracing certain points of the parser by dumping debug-information would be one reasonable use case of this kind. When storing or manipulating data in your custom parser or producing other side effects, you should be aware that even if the parser matches, its results may be thrown away later, because the recursive-descent parser pursues another branch. DHParser offers no support for filling and unwinding data-stacks or the like during parsing - with the exception of the context-sensitive parsers described above. If you want to store data at all, it is best to save it in the attributes of the nodes that your custom function returns, because then you can be sure that only data from valid nodes survives to the end of the parsing process.

However, since parser functions or classes do not have access to the parsing tree during parsing, any variables stored in this way will be accessible during parsing only to custom parsers that directly or indirectly call other custom parsers (via the sub-trees returned by the called parsers). This does not mean that storing variables in attributes during parsing is entirely useless: variables stored in a node’s attributes will be accessible without any restrictions only in the processing stages following the parsing stage (see Processing Pipelines).
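As an illustration, the following sketch (with made-up names, and under the same assumptions as the custom-parser examples above) annotates the node it returns with a value stored in its attributes; the value travels with the node and can be evaluated in later processing stages, provided the node survives the parsing process:

>>> annotate_src = """
... import time
...
... def timestamp() -> CustomParseFunc:
...     # hypothetical parser that matches the empty string and merely annotates its node
...     def parse_func(rest: StringView) -> Optional[Node]:
...         node = Node('', '')
...         node.attr['parsed_at'] = str(time.time())   # data stored in the node's attributes
...         return node
...     return parse_func
... """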

Classes and Functions-Reference

Usually, the functions and classes of module DHParser.ebnf are not called directly; rather, the helper functions from module DHParser.dsl, such as DHParser.dsl.create_parser(), are called, or the dhparser-script is used from the command line. Therefore, no abbreviated, commented reference of the most important functions of module ebnf is provided here. See the source-code documentation of module DHParser.ebnf for a full reference of all functions and classes.