EBNF-Grammars
=============

Module :py:mod:`ebnf` provides an EBNF-parser-generator that compiles an
EBNF-Grammar into Python-code that can be executed to parse source text
conforming to this grammar into concrete syntax trees.


Writing Grammar-Definitions
---------------------------

Specifying Grammars with EBNF
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

With DHParser, Grammars can be specified either directly in Python-code
(see :py:mod:`parse`) or in one of several EBNF-dialects. (Yes,
DHParser supports several different variants of EBNF! This makes it easy
to crate a parser directly from Grammars found in external sources.)
"EBNF" stands for the "Extended-Backus-Naur-Form" which is a common
formalism for specifying Grammars for context-free-languages.
(see `EBNF in Wikipedia <https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form>`_)

The recommended way of compiling grammars with DHParser is to either
write the EBNF-specification for that Grammar into a text-file and then
compile EBNF-source to an executable as well as importable Python-module
with the help of the "dhparser"-script. Or, for bigger projects, to
create a new domain-specific-language-project with the DHParser-script
as described in the step-by-step-guide.

However, here we will show how to compile an EBNF-specified grammar
from within Python-code and how to execute the parser that was
generated by compiling the grammar.

As an example, we will realize a `json`_-parser.
Let's start with creating some test-data::

    >>> testobj = {'array': [1, 2.0, "a string"], 'number': -1.3e+25, 'bool': False}
    >>> import json
    >>> testdata = json.dumps(testobj)
    >>> testdata
    '{"array": [1, 2.0, "a string"], "number": -1.3e+25, "bool": false}'

We define the `json`_-Grammar in
top-down manner in EBNF. We'll use a regular-expression look-alike
syntax. EBNF, as you may recall, consists of a sequence of symbol
definitions (often also called "productions").
The definiens of those definitions either is a string
literal or regular expression or other symbols or a combination
of these with four different operators: 1. sequences
2. alternatives 3. options and 4. repetitions. Here is how these
elements are denoted in classical and regex-like EBNF-syntax:

========================  ==================  ================
literals and operators    classical EBNF      regex-like
========================  ==================  ================
insignificant whitespace  ~                   ~
string literal            "..." or \`...\`    "..." or \`...\`
regular expr.             /.../               /.../
sequences                 A B C               A B C
alternatives              A | B | C           A | B | C
options                   [ ... ]             ...?
repetitions               { ... }             ...*
one or more                                   ...+
grouping                  (...)               (...)
positive lookahead        & ...               & ...
negative lookahead        ! ...               ! ...
========================  ==================  ================

"insignificant whitespace" is a specialty of DHParser. Denoting
insignificant whitespace with a particular sign ``~`` allows to eliminate
it already during the parsing process without burdening later
syntax-tree-processing stages with this common task. DHParser offers
several more facilities to restrain the verbosity of the concrete
syntax tree, so that the outcome of the parsing stage comes close (or
at least closer) to the intended abstract-syntax-tree, already.

DHParser also contains a few experimental extensions to
the common EBNF and PEG (Parsing-Expression-Grammar) formalisms:

============================== ==========
additional operators           syntax
============================== ==========
interleave                     A ° B
repetition range               ... (i, k)
------------------------------ ----------
**lookbehind operators**
------------------------------ ----------
positive lookbehind            <-&
negative lookahead             <-!
------------------------------ ----------
**context sensitive parsers**
------------------------------ ----------
pop and match                  :: ...
retrieve and match             : ...
pop and always match           :? ...
============================== ==========

JSON consists of two complex data types, 1) associative arrays,
called "object" and sequences of heterogeneous data, called array; and
of four simple data types, 1) string, 2) number, 3) bool and 4) null.
The structure of a JSON file can easily be described in EBNF::

    >>> grammar = '''
    ...     json     = ~ _element _EOF
    ...     _EOF     = /$/
    ...     _element = object | array | string | number | bool | null
    ...     object   = "{" ~ member ( "," ~ §member )* "}" ~
    ...     member   = string ":" ~ _element
    ...     array    = "[" ~ ( _element ( "," ~ _element )* )? "]" ~
    ...     string   = `"` _CHARS `"` ~
    ...     _CHARS   = /[^"\\\\]+/ | /\\[\\/bnrt\\]/
    ...     number   = _INT _FRAC? _EXP? ~
    ...     _INT     = `-`? ( /[1-9][0-9]+/ | /[0-9]/ )
    ...     _FRAC    = `.` /[0-9]+/
    ...     _EXP     = (`E`|`e`) [`+`|`-`] /[0-9]+/
    ...     bool     = "true" ~ | "false" ~
    ...     null     = "null" ~                                   '''

This is a rather common EBNF-grammar. A few peculiarities are
noteworthy, though: First of all you might notice that some components
of the grammar (or "production rules" as they are commonly called) have
names with a leading underscore ``_``. It is a convention to mark those
elements, in which we are not interested on their own account, with an
underscore ``_``. When moving from the concrete syntax-tree to a more
abstract syntax-tree, these elements could be substituted by their
content to simplify the tree.

Secondly, some production rules carry a name written in capital letters.
This is also a convention to mark those symbols which represent tokens
that would be delivered by the lexical scanner of a classical parser generator.
DHParser, however, is a "scanner-less" parser, which means that the breaking down
of the string into meaningful tokens is done "in place" with regular
expressions (as seen in the definition of ``_EOF``, above) or simple combinations
of regular expressions (see the definition of ``_INT``, above). Their is
no sharp distinction between tokens and other symbols in DHParser.

Regular expressions are enclosed in
forward slashes and follow the standard syntax of Perl-style regular
expressions. As regular expressions are passed through to the `re`_-module
of the Python Standard Library this is exactly the regular expression
syntax of this module. If installed, Matthew Barnett's `regex`_-module
is preferred to the re-module of the Standard Library.

Finally, it is another helpful convention to indent the definitions of
symbols that have only been introduced to simplify an otherwise
unnecessarily complicated definition (e.g. the definition of ``number``,
above) or to make it more understandable by giving names to its
components (like ``_INT``, ``_FRAC`` and ``_EXP``).

Before we try this with some examples, we will for the purpose
of learning turn DHParser's
optimizations off. (Most of DHParser's speed-optimizations work
by "compressing" nested parsers into "smart" regular expressions
if possible. However, the serialization of the respective
parser objects might yield different node-names for anonymous-nodes
although functionally it is the same.) Speed optmizations can be
turned off by setting the configuration value "optimizations" to
an empty frozen set::

    >>> from DHParser.configuration import get_config_value, set_config_value
    >>> save_optimizations = get_config_value('optimizations')
    >>> set_config_value('optimizations', frozenset({}))

Let's try this grammar on our test-string.  In order to compile
this grammar into executable Python-code, we use the high-level-function
:py:func:`~dsl.create_parser` from the :py:mod:`dsl`-module.

    >>> from DHParser.dsl import create_parser
    >>> parser = create_parser(grammar, branding="JSON")
    >>> syntax_tree = parser(testdata)
    >>> syntax_tree.content  # serialize tree as string via the content-property
    '{"array": [1, 2.0, "a string"], "number": -1.3e+25, "bool": false}'

The thus created parser yields a syntax tree if called with source-code
string as argument. The syntax tree is made up of :py:class:`~nodtree.Node`-objects
as defined in DHParser's :py:mod:`nodetree`-module. (See section :ref:`node_objects`
in the manual about document-trees for more details.) Node-objects have
a ``content``-property that yields the tree originating in this object as
a flat string.

As expected serializing the content of the resulting syntax-tree yields exactly
the input-string of the parsing process. What we cannot see here, is that the
parser has structured the string into the individual elements described in the
grammar. Since the concrete syntax-tree that the parser yields is rather
verbose, it would not make sense to print it out. We'll just look at a small
part of it, to see what it looks like. Let's just pick the sub-tree that
captures the first json-array within the syntax-tree::

    >>> print(syntax_tree.pick('array').as_sxpr())
    (array
      (:Text "[")
      (_element
        (number
          (_INT "1")))
      (:Text ",")
      (:Whitespace " ")
      (_element
        (number
          (_INT "2")
          (_FRAC
            (:Text ".")
            (:RegExp "0"))))
      (:Text ",")
      (:Whitespace " ")
      (_element
        (string
          (:Text '"')
          (_CHARS "a string")
          (:Text '"')))
      (:Text "]"))

.. _definition_anonymous_nodes:

The nodes of the syntax-tree carry the names of the production rules
by which they have been generated. Nodes, that have been created by
components of a production receive the name of of the parser-type
that has created the node (see :py:mod:`parse`) prefixed
with a colon ":". In DHParser, these nodes are called "anonymous",
because they do not represent a symbol in the grammar on their own.

.. _include_directive:

Chaining Grammars with @include
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

While it is always advisable to keep a grammar definition short and concise,
so that it neatly fits within one file, the need might arise to reuse parts
of a grammar definition in a different grammar. In many case copy&paste will
do. But DHParser also supports a modest kind of
modularization of grammar-definitions via a simple include mechanism.
Just add ``@include = "FILENAME"``-directive at the place where
you'd like to include another grammar file.

Say, we'd like to make the grammar-definition of floating-point numbers
in our JSON-grammar reusable. In order to do so, we can write it to
a dedicated file::

    >>> number_ebnf = '''
    ...     number   = _INT _FRAC? _EXP? ~
    ...     _INT     = `-`? ( /[1-9][0-9]+/ | /[0-9]/ )
    ...     _FRAC    = `.` /[0-9]+/
    ...     _EXP     = (`E`|`e`) [`+`|`-`] /[0-9]+/   '''
    >>> with open('number_include.ebnf', 'w', encoding='utf-8') as f:  _ = f.write(number_ebnf)

Now, we can simply include this file an any grammar::

    >>> json_grammar_with_include = '''
    ...     json     = ~ _element _EOF
    ...     _EOF     = /$/
    ...     _element = object | array | string | number | bool | null
    ...     object   = "{" ~ member ( "," ~ §member )* "}" ~
    ...     member   = string ":" ~ _element
    ...     array    = "[" ~ ( _element ( "," ~ _element )* )? "]" ~
    ...     string   = `"` _CHARS `"` ~
    ...     _CHARS   = /[^"\\\\]+/ | /\\[\\/bnrt\\]/
    ...     @include = "number_include.ebnf"
    ...     bool     = "true" ~ | "false" ~
    ...     null     = "null" ~                                   '''
    >>> parser = create_parser(json_grammar_with_include, branding="JSON")
    >>> syntax_tree = parser(testdata)
    >>> syntax_tree.content
    '{"array": [1, 2.0, "a string"], "number": -1.3e+25, "bool": false}'

The following command merely removes the grammar on the hard-disk, so that
doctest-run of this manual does not leave any trails::

    >>> import os;  os.remove('number_include.ebnf')

It should be noted that DHParser's include-mechanism does not (yet) support
any kind of namespaces, so you have to make sure yourself that the same
symbol-names are not accidentally defined both in the including and included
file(s).

Using DHParser's include-mechanism for modularizing your EBNF-grammar(s) is
a different thing than adding an include mechanism to your domain specific
language. For the latter, see the manual of the :py:mod:`preprocess`-module.

.. _simplifying_syntax_trees:

Simplifying Syntax-Trees while Parsing
--------------------------------------

Usually, anonymous nodes are what you want to get rid of in the course
of transforming the concrete syntax-tree into an abstract syntax-tree.
(See :py:mod:`transform`). DHParser already eliminates per
default all anonymous nodes that are not leaf-nodes by replacing them
with their children during parsing. Anonymous leaf-nodes will be
replaced by their content, if they are a single child of some parent,
and otherwise be left in place. Without this optimization, each
construct of the EBNF-grammar would leave a node in the syntax-tree.
We will show this here just for the purpose of demonstration::

    >>> from DHParser.parse import CombinedParser, TreeReduction
    >>> _ = TreeReduction(parser.json, CombinedParser.NO_TREE_REDUCTION)
    >>> syntax_tree = parser(testdata)
    >>> print(syntax_tree.pick('array').as_sxpr())
    (array
      (:Text "[")
      (:Option
        (:Series
          (_element
            (number
              (_INT
                (:Alternative
                  (:RegExp "1")))))
          (:ZeroOrMore
            (:Series
              (:Text ",")
              (:Whitespace " ")
              (_element
                (number
                  (_INT
                    (:Alternative
                      (:RegExp "2")))
                  (:Option
                    (_FRAC
                      (:Text ".")
                      (:RegExp "0"))))))
            (:Series
              (:Text ",")
              (:Whitespace " ")
              (_element
                (string
                  (:Text '"')
                  (_CHARS
                    (:RegExp "a string"))
                  (:Text '"')))))))
      (:Text "]"))

This maximally verbose syntax-tree is usually not of much use. It can
be helpful, though. for understanding how parsing that is directed by
an EBNF-grammar works (next to looking at the logs of the complete
parsing-process, see :py:mod:`trace`), but other than that it
is advisable to streamline the syntax-tree as early on as possible,
because the processing time of all subsequent tree-processing stages
increases with the number of nodes in the tree.

Because of this, DHParser offers further means of simplifying
syntax-trees during the parsing stage, already. These are not turned
on by default, because they allow to drop content or to remove named
nodes from the tree. They need to be turned on by "directives" that
are listed at the top of an EBNF-grammar and that guide the
parser-generation process. DHParser-directives always start with an
``@``-sign. For example, the ``@drop``-directive advises the parser to
drop certain nodes entirely, including their content. For example, the
directive ``@drop = whitespace`` instructs the parser to drop all
:ref:`insignificant whitespace <insignificant_whitespace>`::

    >>> drop_insignificant_wsp = '@drop = whitespace \n'

Directives look similar to definitions, only that on the right hand
side of the equal sign follows a list of parameters. In the case
of the drop-directive these can be either names of non-anonymous
nodes that shall be dropped or one of four particular classes of
anonymous nodes (``strings``, ``backticked``, ``regexp``, ``whitespace``) that
will be dropped.

Another useful directive is the ``@hide``-directive that advises the parser
to treat certain named nodes as
anonymous nodes. Anonymous nodes may be eliminated during the parsing
process but, in contrast to the `@drop`-directive, this will only be
happen if their content can be preserved in another node. This
is useful, if we have introduced certain names in our grammar
only as placeholders to render the definition of the grammar a bit
more readable, not because we are interested in the structure that is
captured by the production associated with them in its own right::

    >>> hide_symbols = '@hide = /_\\w+/ \n'

.. HINT: In DHParser-versions below 1.7 the hide-directive was
   called "disposable" and had to be written as ``@disposable = ...``
   This has been changed for greater clarity. For backwards-compatibility,
   the old name "disposable" for the hide-directive will still be 
   understood by future versions of DHParser, however.  

Just as with the @drop-directive, it is possible to pass either
an explicit list of all symbols that shall be "hidden"
or a regular expression for identifying these symbols by their name to the
@hide-directive, Here, we have leveraged our convention
to prefix unimportant symbols with an underscore "_" by specifying the
symbols that shall by anonymized with a regular expression.
(See :ref:`below < _alternative_simplification_syntax>` for yet another,
arguably, more convenient syntax for specifying symbols that can
safely be dropped or hidden.)

Now, let's examine the effect of these two directives::

    >>> refined_grammar = drop_insignificant_wsp + hide_symbols + grammar
    >>> parser = create_parser(refined_grammar, 'JSON')
    >>> syntax_tree = parser(testdata)
    >>> syntax_tree.content
    '{"array":[1,2.0,"a string"],"number":-1.3e+25,"bool":false}'

You might have noticed that all insignificant white-spaces adjacent to
the delimiters have been removed this time (but, of course not the
significant whitespace between "a" and "string" in "a string"). And
the difference, the use of these two directives makes, is even more
obvious, if we look at (a section of) the syntax-tree::

    >>> print(syntax_tree.pick('array').as_sxpr())
    (array
      (:Text "[")
      (number "1")
      (:Text ",")
      (number
        (:RegExp "2")
        (:Text ".")
        (:RegExp "0"))
      (:Text ",")
      (string
        (:Text '"')
        (:RegExp "a string")
        (:Text '"'))
      (:Text "]"))

This tree looks more streamlined. But it still contains more structure
than we might like to see in an abstract syntax tree. In particular, it
still contains all the delimiters ("[", ",", '"', ...) next to the data. But
other than in the UTF-8 representation of our json data, the delimiters are
not needed any more, because the structural information is now retained
in the tree-structure.

So how can we get rid of those delimiters? The rather coarse-grained tools
that DHParser offers in the parsing stage require some care to do this
properly.

The @drop-directive allows to drop all unnamed strings (i.e. strings
that are not directly assigned to a symbol) and back-ticked strings (for
the difference between strings and back-ticked strings, see below) and
regular expressions. However, using ``@drop = whitespace, strings, backticked``
would also drop those parts captured as string that contain data::

    >>> refined_grammar = '@drop = whitespace, strings, backticked \n' \
    ...                   + hide_symbols + grammar
    >>> parser = create_parser(refined_grammar, 'JSON')
    >>> syntax_tree = parser(testdata)
    >>> print(syntax_tree.pick('array').as_sxpr(flatten_threshold=0))
    (array
      (number "1")
      (number
        (:RegExp "2")
        (:RegExp "0"))
      (string "a string"))

Here, suddenly, the number "2.0" has been turned into "20"! There
are three ways to get around this problem:

1. Assigning all non-delimiter strings to symbols. In this case
   we would have to rewrite the definition of "number" as such::

      number     = _INT _FRAC? _EXP? ~
        _INT     = _MINUS? ( /[1-9][0-9]+/ | /[0-9]/ )
        _FRAC    = _DOT /[0-9]+/
        _EXP     = (_Ecap|_Esmall) [_PLUS|MINUS] /[0-9]+/
        _MINUS   = `-`
        _PLUS    = `+`
        _DOT     = `.`
        _Ecap    = `E`
        _Esmall  = `e`

   A simpler alternative of this technique would be to make use of
   the fact that the document-parts captured by regular expression
   are not dropped (although regular expressions can also be listed
   in the @drop-directive, if needed) and that at the same time
   delimiters are almost always simple strings containing keywords
   or punctuation characters. Thus, one only needs to rewrite those
   string-expressions that capture data as regular expressions::

      number     = _INT _FRAC? _EXP? ~
        _INT     = /[-]/ ( /[1-9][0-9]+/ | /[0-9]/ )
        _FRAC    = /[.]/ /[0-9]+/
        _EXP     = (/E/|/e/) [/[-+]/] /[0-9]+/

2. Assigning all delimiter strings to symbols and drop the nodes
   and content captured by these symbols. This means doing exactly
   the opposite of the first solution. Here is an excerpt of what
   a JSON-grammar employing this technique would look like::

      @hide = /_\\w+/
      @drop = whitespace, _BEGIN_ARRAY, _END_ARRAY, _KOMMA, _BEGIN_OBJECT, ...
      ...
      array = _BEGIN_ARRAY ~ ( _element ( _KOMMA ~ _element )* )? §_END_ARRAY ~
      ...

   It is important that all symbols listed for dropping are also listed
   in the hide-directive or - if the the hide-directive is given a regular 
   expression instead of a list of names - carry a name that is matched by
   that regular expression. DHParser does not allow to drop 
   the content of named nodes unless they are hidden,
   because the default assumption is that symbols in the grammar are
   defined to capture meaningful parts of the document that contain
   relevant data.

3. Bailing out and leaving the further simplification of the syntax-tree
   to the next tree-processing stage which, if you follow DHParser's suggested
   usage pattern, is the abstract-syntax-tree-transformation proper
   and which allows for a much more fine-grained specification of
   transformation rules. See :py:mod:`transform`.

To round this section up, we present the full grammar for a streamlined
JSON-Parser according to the first solution-strategy. Observe, that the
values of "bool" and "null" are now defined with regular expressions
instead of string-literals, because the latter would be dropped because
of the ``@drop = ... strings, ...``-directive, leaving an empty named node
without a value, whenever a bool value or null occurs in the input::

    >>> json_gr = '''
    ...     @hide      = /_\\w+/
    ...     @drop      = whitespace, strings, backticked, _EOF
    ...     json       = ~ _element _EOF
    ...       _EOF     = /$/
    ...     _element   = object | array | string | number | bool | null
    ...     object     = "{" ~ member ( "," ~ §member )* "}" ~
    ...     member     = string ":" ~ _element
    ...     array      = "[" ~ ( _element ( "," ~ _element )* )? "]" ~
    ...     string     = `"` _CHARS `"` ~
    ...       _CHARS   = /[^"\\\\]+/ | /\\[\\/bnrt\\]/
    ...     number     = _INT _FRAC? _EXP? ~
    ...       _INT     = /[-]/? ( /[1-9][0-9]+/ | /[0-9]/ )
    ...       _FRAC    = /[.]/ /[0-9]+/
    ...       _EXP     = /[Ee]/ [/[-+]/] /[0-9]+/
    ...     bool       = /true/ ~ | /false/ ~
    ...     null       = /null/ ~                                  '''
    >>> json_parser = create_parser(json_gr, 'JSON')
    >>> syntax_tree = json_parser(testdata)
    >>> print(syntax_tree.pick('array').as_sxpr(flatten_threshold=0))
    (array
      (number "1")
      (number
        (:RegExp "2")
        (:RegExp ".")
        (:RegExp "0"))
      (string "a string"))

This time the data is not distorted, any more. One oddity remains, however: We
are most probably not interested in the fact that the number 2.0 consists of
three components, each of which hast been captured by a regular expression.
Luckily, there exists yet another directive that allows to reduce the tree
further by merging adjacent anonymous leaf-nodes::

    >>> json_gr = '@reduction = merge \n' + json_gr
    >>> json_parser = create_parser(json_gr, 'JSON')
    >>> syntax_tree = json_parser(testdata)
    >>> print(syntax_tree.as_sxpr())
    (json
      (object
        (member
          (string "array")
          (array
            (number "1")
            (number "2.0")
            (string "a string")))
        (member
          (string "number")
          (number "-1.3e+25"))
        (member
          (string "bool")
          (bool "false"))))

Merging adjacent anonymous leaf-nodes takes place after the @drop-directive
comes into effect. It should be observed that merging only produces the desired
result, if any delimiters have been dropped previously, because otherwise
delimiters would be merged with content. Therefore, the ``@reduction = merge``-
directive should at best only be applied in conjunction with the ``@drop`` and
``@hide``-directives.

.. _table_reduction_directive:

There are for possible values for the ``@reduction``-directive to
control "early" flattening and merging
(i.e. flattening and merging in the parsing-stage, already)
of anonymous-nodes:

+--------------------+----------------------------------------+
| value              | effect                                 |
+====================+========================================+
| ``none``           | no tree-reduction during parsing stage |
+--------------------+----------------------------------------+
| ``flatten``        | flatten anonymous nodes                |
+--------------------+----------------------------------------+
| ``merge_treetops`` | flatten anonymous nodes and merge      |
|                    | anonymous leaf-nodes when all          |
|                    | siblings are anonymous leaf-nodes      |
+--------------------+----------------------------------------+
| ``merge``          | flatten anonymous nodes and merge      |
|                    | anonymous leaf-nodes                   |
+--------------------+----------------------------------------+

Applying any of the here described tree-reductions (or "simplifications" for
that matter) requires a bit of careful planning concerning which nodes
will be named and which nodes will be dropped. This, however, pays off in
terms of speed and a considerably simplified abstract-syntax-tree generation
stage, because most of the unnecessary structure of concrete-syntax-trees
has already been eliminated at the parsing stage.

.. _alternative_simplification_syntax:

**Alternative Syntax with ->DROP and ->HIDE**

Since DHParser version 1.5.1 there also exists an alternative syntax to
specify symbols that are to be considered disposable and elements that are
to be dropped by local ``->DROP`` (drop) and ``->HIDE`` (dispose) markers
within the grammar, rather than with directives. These markers must be appended
to the parser (skip/drop) or to the symbol definition (hide/dispose), e.g.::

    WHITESPACE = /\s+/ ->DROP

There is a subtle difference between the two markers, in so far as the
"->DROP"-marker may appear at one or more places inside a (complex) definition
or at its very end, while the "->HIDE" marker may only appear at the very
end of the definition. (The reason is that while it makes sense to drop nodes
returned by anonymous parsers, it does not make sense to dispose (i.e. hide)
anonymous nodes, because these will be eliminated ("flattened"), anyway.)

Since existing EBNF-practices vary, you can also use the alias "->SKIP" 
instead of "->DROP" for elements to be dropped and "->DISPOSE" instead 
of "->HIDE" for hidden symbols. (Internally, DHParser uses the names
drop and dispose, but this should not concern users of DHParser.)

Here is the last example, rewritten with "->DROP" and "->HIDE"-markers. Note, that
we still use the @drop-directive for whitespace, strings and back-ticked (strings),
because otherwise we would have to add a "->DROP"-marker after each string-literal
and tilde (whitespace) -marker. Also, we already added the
"@reduction=merge"-directive::

    >>> json_gr = '''
    ...     @drop      = whitespace, strings, backticked
    ...     @reduction = merge
    ...     json       = ~ element EOF
    ...       EOF     = /$/ -> DROP
    ...     element   = (object | array | string | number | bool | null) -> HIDE
    ...     object     = "{" ~ member ( "," ~ §member )* "}" ~
    ...     member     = string ":" ~ element
    ...     array      = "[" ~ ( element ( "," ~ element )* )? "]" ~
    ...     string     = `"` CHARS `"` ~
    ...       CHARS   = /[^"\\\\]+/ | /\\[\\/bnrt\\]/  -> HIDE
    ...     number     = INT FRAC? EXP? ~
    ...       INT     = /[-]/? ( /[1-9][0-9]+/ | /[0-9]/ ) -> HIDE
    ...       FRAC    = /[.]/ /[0-9]+/  -> HIDE
    ...       EXP     = /[Ee]/ [/[-+]/] /[0-9]+/ -> HIDE
    ...     bool       = /true/ ~ | /false/ ~
    ...     null       = /null/ ~                                  '''
    >>> json_parser = create_parser(json_gr, 'JSON')
    >>> syntax_tree = json_parser(testdata)
    >>> print(syntax_tree.pick('array').as_sxpr(flatten_threshold=0))
    (array
      (number "1")
      (number "2.0")
      (string "a string"))

The HIDE and DROP markers can also be used as prefixes to definitions, which may
make it more obvious for the readers of the grammar which symbols will be hidden 
or dropped from the syntax-tree. If written as a prefix, they are written with
a following colon ":" instead of a preceding arrow "->", e.g. "HIDE:" and "DROP:".
For the sake of simplicity and to mirror the internal conventions of DHParser, 
where the names of (disposable) anonymous-nodes are preceded by a colon, the word
"HIDE" can also be left out from the prefix annotation, so that a simple 
colon at the beginning of a definition suggests a disposable symbol. (This
should not be confused with a colon in front of a symbol on the left hand side
of a definition where it marks a "retrievable" value for that symbols. )
With these conventions our JSON-example looks like the following::

    >>> json_gr = '''
    ...     @drop      = whitespace, strings, backticked
    ...     @reduction = merge
    ...     json       = ~ element EOF
    ...     DROP:EOF     = /$/
    ...     :element  = (object | array | string | number | bool | null)
    ...     object     = "{" ~ member ( "," ~ §member )* "}" ~
    ...     member     = string ":" ~ element
    ...     array      = "[" ~ ( element ( "," ~ element )* )? "]" ~
    ...     string     = `"` CHARS `"` ~
    ...     :CHARS   = /[^"\\\\]+/ | /\\[\\/bnrt\\]/
    ...     number     = INT FRAC? EXP? ~
    ...       :INT     = /[-]/? ( /[1-9][0-9]+/ | /[0-9]/ )
    ...       :FRAC    = /[.]/ /[0-9]+/
    ...       :EXP     = /[Ee]/ [/[-+]/] /[0-9]+/
    ...     bool       = /true/ ~ | /false/ ~
    ...     null       = /null/ ~                                  '''
    >>> json_parser = create_parser(json_gr, 'JSON')
    >>> syntax_tree = json_parser(testdata)
    >>> print(syntax_tree.pick('array').as_sxpr(flatten_threshold=0))
    (array
      (number "1")
      (number "2.0")
      (string "a string"))

.. _comments_and_whitespace:

Comments and Whitespace
-----------------------

Why whitespace isn't trivial
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Handling whitespace in text-documents is not all trivial, because
whitespace can serve several different purposes and there can be
different kinds of whitespace: Whitespace can serve a syntactic function
as delimiter. But whitespace can also be purely aesthetic to render
a document more readable.

Depending on the data model, whitespace can be considered as
significant and be included in the data or as
insignificant and be excluded from the data and only be re-inserted
when displaying the data in a human-readable-form. (For example, one
can model a sentence as a sequence of words and spaces or just as
a sequence of words.) Note, that "significance" does not correlate
with the syntactic or aesthetic function, but only depends on whether
you'd like to keep the whitespace in you data or not.

There can be different kinds of whitespace with different meaning
(and differing significance). For example, one can make a difference
between horizontal whitespace (spaces and tabs) and vertical
whitespace (including linefeeds). And there can be different sizes
of whitespace with different meaning. For example in LaTeX, a single
linefeed still counts as plain whitespace while an empty line (i.e.
whitespace including two or more not linefeeds) signals a new
paragraph.

Finally, even the position of whitespace can make a difference.
A certain number of whitespaces at the beginning of a line can
have the meaning of "indentation" (as in Python code) while at
the end of the line or between brackets it is just plain
insignificant whitespace. (This is actually something, where
the boundaries of the EBNF-formalism become visible and you'd
probably use a preprocessor or some kind of "semantic actions"
to handle such cases. There is some support for either of these
in DHParser.)

Significant Whitespace
^^^^^^^^^^^^^^^^^^^^^^

A reasonable approach to coding whitespace is to use one
particular symbol for each kind of whitespace. Those kinds of
whitespace that are insignificant, i.e. that do not need to
appear in the data, should be dropped from the syntax-tree.
With DHParser this can be done already while parsing, using
the ``@hide`` and ``@drop``-directives described earlier.

But let's first look at an example which only includes significant
whitespace. The following parser parses sequences of paragraphs which
consist of sequences of sentences which consist of sequences
of main clauses and subordinate clauses which consist of sequences
of words::

    >>> text_gr = '''
    ...     @ hide         = /_\\w+/
    ...     document       = PBR* S? paragraph (PBR paragraph)* PBR* S? _EOF
    ...       _EOF         = /$/
    ...     paragraph      = sentence (S sentence)*
    ...     sentence       = (clause _c_delimiter S)* clause _s_delimiter
    ...       _c_delimiter = KOMMA | COLON | SEMICOLON
    ...       _s_delimiter = DOT | QUESTION_MARK | EXCLAMATION_MARK
    ...     clause         = word (S word)*
    ...     word           = /(?:[A-Z]|[a-z])[a-z']*/
    ...     DOT            = `.`
    ...     QUESTION_MARK  = `?`
    ...     EXCLAMATION_MARK = `!`
    ...     KOMMA          = `,`
    ...     COLON          = `:`
    ...     SEMICOLON      = `;`
    ...     PBR            = /[ \\t]*\\n[ \\t]*\\n[ \\t]*/
    ...     S              = /(?=[ \\n\\t])[ \\t]*(?:\\n[ \\t]*)?(?!\\n)/ '''

Here, we have two types of significant whitespace ``PBR``
("paragraph-break") and ``S`` ("space"). Both types allow for a certain
amount of flexibility, so that two whitespaces of the same type do not
need to have exactly the same content, but we could always normalize
these whitespaces in a subsequent transformation step.

Two typical design patterns for significant whitespace are noteworthy, here:

1. Both whitespaces match only if there was at least one whitespace
   character. We may allow whitespace to be optional (as at the
   beginning and end of the document), but if the option has not been
   taken, we don't to see an empty whitespace-tag in the document, later
   on. (For insignificant whitespace, the opposite convention can be
   more convenient, because, typically, insignificant whitespace is
   dropped anyway, whether it's got content or not.)

2. The grammar is construed in such a way that the whitespace always
   appears *between* different elements at the same level, but not after
   the last or before the first element. The whitespace after the last
   word of a sentence or before the first word of a sentence is really
   whitespace between two sentences. If we pick out a sentence or a
   clause, we will have no dangling whitespace at its beginning or end.
   (Again, for soon to be dropped insignificant whitespace, another
   convention can be more advisable.)

Let's just try our grammar on an example::

    >>> text_example = '''
    ... I want to say, in all seriousness, that a great deal of harm is being
    ... done in the modern world by belief in the virtuousness of work, and that
    ... the road to happiness and prosperity lies in an organized diminution of
    ... work.
    ...
    ... First of all: what is work? Work is of two kinds: first, altering the
    ... position of matter at or near the earth's surface relatively to other
    ... such matter; second, telling other people to do so. The first kind is
    ... unpleasant and ill paid; the second is pleasant and highly paid.'''
    >>> text_parser = create_parser(text_gr, 'Text')
    >>> text_as_data = text_parser(text_example)
    >>> sentence = text_as_data.pick(
    ... lambda nd: nd.name == "sentence" and nd.content.startswith('First'))
    >>> print(sentence.as_sxpr())
    (sentence
      (clause
        (word "First")
        (S " ")
        (word "of")
        (S " ")
        (word "all"))
      (COLON ":")
      (S " ")
      (clause
        (word "what")
        (S " ")
        (word "is")
        (S " ")
        (word "work"))
      (QUESTION_MARK "?"))

Again, it is a question of design, whether we leave whitespace in the data or
not. Leaving it has the advantage, that serialization becomes as simple as
printing the content of the data-tree::

    >>> print(sentence)
    First of all: what is work?

Otherwise one would have to program a dedicated serialization routine.
Especially, if you receive data from a different source, you'll
appreciate not having to do this - and so will other people, receiving
your data. Think about it! However, dropping the whitespace will yield
more concise data.

Coding Comments
^^^^^^^^^^^^^^^

Allowing comments in a domain-specific language almost always makes sense,
because it allows users to annotate the source texts while working on them
and to share those comments with collaborators. From a technical point of
view, adding comments to a DSL raises two questions:

1. At what places shall we allow to insert comments in the source code?
   Common answers are: a) at the end of a line, b) almost everywhere, or
   c) both.

2. How do we avoid pollution of the EBNF-grammar with comment markers?
   It already curtails the readability that we have to put whitespace
   symbols in so many places. And speaking of comments at the end of
   the line: If line-feeds aren't important for us - as in our toy-grammar
   for prose-text, above - we probably wouldn't want to reframe our
   grammar just to allow for comments at the end of the line.

Luckily, there exists a simple and highly intuitive solution that takes
care of both of these concerns: We admit comments, wherever whitespace
is allowed. And we code this by defining a symbol that means: "whitespace
and, optionally, a comment".

Let's try this with our prose-text-grammar. In order to do so, we have
to define a symbols for comments, a symbol for pure whitespace, and,
finally, a symbol for whitespace with optional comment. Since, in
our grammar, we actually have two kinds of whitespace, ``S`` and ``PBR``,
we'll have to redefine both of them. As delimiters for comments, we
use curly braces::

    >>> wsp_gr = '''
    ...     PBR      = pure_PBR COMMENT (pure_PBR | pure_S)?
    ...              | (pure_S? COMMENT)? pure_PBR
    ...     S        = pure_S COMMENT pure_S? | COMMENT? pure_S
    ...     COMMENT  = /\\{[^}]*\\}/
    ...     pure_PBR = /[ \\t]*\\n[ \\t]*\\n[ \\t]*/
    ...     pure_S   = /(?=[ \\n\\t])[ \\t]*(?:\\n[ \\t]*)?(?!\\n)/'''

As can be seen, the concrete re-definition of the whitespace tokens
requires a bit of careful consideration, because we want to allow
additional whitespace next to comments, but at the same time avoid
ending up with two whitespaces in sequence in our data. Let's see, if
we have succeeded::

    >>> extended_text_gr = text_gr[:text_gr.rfind(' PBR')] + wsp_gr
    >>> extended_parser = create_parser(extended_text_gr, 'Text')
    >>> syntax_tree = extended_parser('What {check this again!} is work?')
    >>> print(' '.join(nd.name for nd in syntax_tree.pick('clause').children))
    word S word S word
    >>> print(syntax_tree.pick('clause').as_sxpr())
    (clause
      (word "What")
      (S
        (pure_S " ")
        (COMMENT "{check this again!}")
        (pure_S " "))
      (word "is")
      (S
        (pure_S " "))
      (word "work"))

We will not worry about the more sub-structure of the S-nodes right now. If
we are not interested in the comments, we could use the ``@hide``,
``@drop`` and ``@reduction = merge``-directives to simplify these at the
parsing stage. Or, we could extract the comments and normalize the whitespace
at a later tree-processing stage. For now, let's just check whether our
comments work as expected::

    >>> syntax_tree = extended_parser('What{check this again!} is work?')
    >>> print(' '.join(nd.name for nd in syntax_tree.pick('clause').children))
    word S word S word
    >>> syntax_tree = extended_parser('What {check this again!}is work?')
    >>> print(' '.join(nd.name for nd in syntax_tree.pick('clause').children))
    word S word S word
    >>> syntax_tree = extended_parser('What{check this again!}is work?')
    >>> print(syntax_tree.errors[0])
    1:24: Error (1040): Parser pure_S = /(?=[ \\n\\t])[ \\t]*(?:\\n[ \\t]*)?(?!\\n)/ did not match: »is work?«


The last error was to be expected, because we did not allow comments
to serve a substitutes for whitespace. Let's check whether putting comments
near paragraph breaks works as well::

    >>> test_text = '''Happiness lies in the diminution of work.
    ...
    ... { Here comes the comment }
    ...
    ... What is work?'''
    >>> syntax_tree = extended_parser(test_text)
    >>> print(' '.join(nd.name for nd in syntax_tree.children))
    paragraph PBR paragraph
    >>> test_text = '''Happiness lies in the diminution of work.
    ... { Here comes the comment }
    ... What is work?'''
    >>> syntax_tree = extended_parser(test_text)
    >>> print(' '.join(nd.name for nd in syntax_tree.children))
    paragraph

The last result might look surprising at first, but since a paragraph
break requires at least one empty line as a separator, the input text
is correctly understood by the parser as a single paragraph with
two sentence interspersed by a single whitespace which, incidentally,
contains a comment::

    >>> print(' '.join(nd.name for nd in syntax_tree.pick('paragraph').children))
    sentence S sentence
    >>> print(syntax_tree.pick('paragraph')['S'].as_sxpr(flatten_threshold=0))
    (S
      (pure_S
        ""
        "")
      (COMMENT "{ Here comes the comment }")
      (pure_S
        ""
        ""))

A common problem with whitespace is that it tends to pollute
the Grammar, because wherever you'd like to allow whitespace,
you'd have to insert a symbol for whitespace. The same problem
exists when it comes to allowing comments, because you'd
probably allow to insert comments in as many places as possible.

.. _insignificant_whitespace:

Insignificant Whitespace
^^^^^^^^^^^^^^^^^^^^^^^^

To make grammars more readable, DHParser uses the
tilde ``~``-character as a dedicated token for
insignificant whitespace. We have seen this earlier in the
definition of the json-Grammar.

The ``~``-whitespace-marker differs from the usual pattern for
defining whitespace in that it is implicitly optional, or what
amounts to the same, it matches the empty string. Normally,
it is to be considered bad practice to define a symbol as
optional. Rather, a symbol should always match something and
only at the places where it is used, it should be marked as
optional. If this rule is obeyed, it is always easy to tell,
whether some element is optional or not at a specific place
in the Grammar. Otherwise, it can become quite confusing
indeed. However, since the tilde character is usually used
very often, it is more convenient not to mark it with a
question-mark or, if you use classical EBNF-syntax, to enclose
it with square brackets.

The default regular expression for the tilde-whitespace captures
arbitrarily many spaces and tabs and at most one linefeed, but
not an empty line (``[ \\t]*(?:\\n[ \\t]*)?(?!\\n)``), as this is
the most convenient way to define whitespace for text-data.
However, the tilde whitespace can also be defined with any
other regular expression with the ``@whitespace``-directive.

Let's go back to our JSON-grammar and define the optional
insignificant whitespace marked by the tilde-character in such a
way that it matches any amount of horizontal or vertical
whitespace, which makes much more sense in the context of json
than the default tilde-whitespace that is restricted vertically
to at most a single linefeed::

    >>> testdata = '{"array": [1, 2.0, "a string"], \n\n\n "number": -1.3e+25, "bool": false}'
    >>> syntax_tree = json_parser(testdata)
    >>> print(syntax_tree.errors[0])
    1:33: Error (1010): member expected by parser 'object', but »\n\n\n "numbe...« found instead!
    >>> json_gr = '@whitespace = /\\s*/ \n' + json_gr
    >>> json_parser = create_parser(json_gr, "JSON")
    >>> syntax_tree = json_parser(testdata)
    >>> print(syntax_tree.errors)
    []

When redefining the tilde-whitespace, make sure that your regular expression
also matches the empty string! There is no need to worry that the syntax tree
get's cluttered by empty whitespace-nodes, because tilde-whitespace always
yields anonymous nodes and DHParser drops empty anonymous nodes right away.

Comments can be defined using the ``@comment``-directive. DHParser
automatically intermingles comments and whitespace so that where-ever
tilde-whitespace is allowed, a comment defined by the
``@comment``-directive is also allowed:

    >>> json_gr = '@comment = /#[^\\n]*(?:\\n|$)/ \n' + json_gr
    >>> json_parser = create_parser(json_gr, "JSON")
    >>> testdata = '''{"array": [1, 2.0, "a string"], # a string
    ...                "number": -1.3e+25,  # a number
    ...                "bool": false}  # a bool'''
    >>> syntax_tree = json_parser(testdata)
    >>> print(syntax_tree.as_sxpr(compact = True))
    (json
      (object
        (member
          (string "array")
          (array
            (number "1")
            (number "2.0")
            (string "a string")))
        (member
          (string "number")
          (number "-1.3e+25"))
        (member
          (string "bool")
          (bool "false"))))

Since the json-grammar still contains the ``@drop = whitespace, ...``-
directive from earlier on (next to other tree-reductions), the comments
have been dropped along with the tilde-whitespace.

There is one caveat: When using comments alongside with whitespace that
captures at most one linefeed, the comments should be defined in such
a way that the last character of a comment is never a linefeed.

Sometimes it is desirable to keep the comments in the abstract-syntax-tree,
for example, when the comments contain embedded processing instructions
for the compilation- or later processing-stages (see ref:`processing_pipelines`).
If you only want to drop the insignificant whitespace but not the
comments inside
(or between) the insignificant whitespace, you can add the keyword
"no_comments" to the "@drop"-directive, e.g.
``@drop = no_comments, ...``. Alternatively, you can
append the suffix "_keep" to the "@comment"-directive, e.g.
``@comment_keep = ...``. Either way of indicating that comments shall
not be dropped has the consequence that comments will remain in the
syntax tree in dedicated "comments\_\_"-nodes while whitespace
is dropped::

    >>> a = json_gr.find("@drop")
    >>> b = json_gr.find("\n", a, len(json_gr))
    >>> json_gr = json_gr[:a] + "@drop = strings, backticked, no_comments" + json_gr[b:]
    >>> json_parser = create_parser(json_gr, "JSON")
    >>> keep_comments_st = json_parser(testdata)
    >>> print(keep_comments_st.as_sxpr(compact = True))
    (json
      (object
        (member
          (string "array")
          (array
            (number "1")
            (number "2.0")
            (string "a string")))
        (comment__
          " # a string"
          "               ")
        (member
          (string "number")
          (number "-1.3e+25"))
        (comment__
          "  # a number"
          "               ")
        (member
          (string "bool")
          (bool "false"))
        (comment__ "  # a bool")))

Otherwise they will
either be dropped completely if whitespace is dropped or the content
of the comment appears as part of the whitespace inside the
":Whitespace"-nodes when whitespace is not dropped::

    >>> a = json_gr.find("@drop")
    >>> b = json_gr.find("\n", a, len(json_gr))
    >>> json_gr = json_gr[:a] + "@drop = strings, backticked" + json_gr[b:]
    >>> json_parser = create_parser(json_gr, "JSON")
    >>> keep_whitespace_st = json_parser(testdata)
    >>> print(keep_whitespace_st.as_sxpr(compact = True))
    (json
      (object
        (member
          (string "array")
          (:Whitespace " ")
          (array
            (number "1")
            (:Whitespace " ")
            (number "2.0")
            (:Whitespace " ")
            (string "a string")))
        (:Whitespace
          " # a string"
          "               ")
        (member
          (string "number")
          (:Whitespace " ")
          (number "-1.3e+25"))
        (:Whitespace
          "  # a number"
          "               ")
        (member
          (string "bool")
          (:Whitespace " ")
          (bool "false"))
        (:Whitespace "  # a bool")))

A few limitations of the tilde-whitespace and directive-defined
comments should be kept in mind: 1. Only one kind of insignificant
whitespace can be defined this way. If there are more kinds of
insignificant whitespace, all but one need to be defined conventionally
as part of the grammar. 2. Both directive-defined comments and
tilde-whitespace can only be defined by a regular expression. In
particular, nested comments are impossible to define with regular
expressions, only.

However, using tilde-whitespace has yet one more benefit: With the
tilde-whitespace, cluttering of the grammar with whitespace-markers
can be avoid, by adding implicit whitespace adjacent to string-literals.
Remember the definition of the JSON-Grammar earlier. If you look at
a definition like: ``object = "{" ~ member ( "," ~ §member )* "}" ~``,
you'll notice that there are three whitespace markers, one next to
each delimiter. Naturally so, because one usually wants to allow users
of a domain specific language to put whitespace around delimiters.

You may wonder, why the tilde appears only on the right hand side
of the literals, although you'd probably like to allow whitespace
on both side of a literal like "{". But if you look at the grammar
closely, you'll find that almost every symbol definition ends
either with a tilde sign or a symbol the definition of which ends
with a tilde sign, which means that they allow whitespace on the
right hand side. Now, if all elements of the grammar allow
whitespace on the right hand side, this means that automatically
also have whitespace on the left-hand side, too, which is, of
course the whitespace on the right hand side of the previous
element.

In order to reduce cluttering the grammar with tile-signs, DHParser
allows to turn on implicit tilde-whitespace adjacent to any
string literal with the directive ``@ literalws = right`` or
``@ literalws = left``. As the argument of the directive suggests,
whitespace is either "eaten" at the right hand side or the left
hand side of the literal. String literals can either be
enclose in double quotes "..." or single quotes '...'. Both
kinds of literals will have implicit whitespace, if the
``@literalws``-directive is used.

(Don't confuse implicit whitespace
with insignificant whitespace: Insignificant whitespace is whitespace
you do not need any more after parsing. Implicit whitespace is
whitespace you do not denote explicitly in the grammar. It's
a specialty of DHParser and DHParser allows only the insignificant
whitespace denoted by the tilde-character to be declared as
"implicit".)

If left-adjacent whitespace is declared as implicit with the
``@literalws``-directive, the expression::

    object     = "{" ~ member ( "," ~ §member )* "}" ~

can be written as::

    object     = "{" member ( "," §member )* "}"

which is easier to read.

For situations where implicit whitespace is not desired, DHParser
has a special kind of string literal, written with backticks, which
never carries any implicit whitespace. This is important, when
literals are used for signs that enclose content, like the quotation
marks for the string literals in our JSON-Grammar::

    string     = `"` _CHARS '"'  # mind the difference between `"` and '"'!

Regular expressions, also, never carry implicit whitespace.
So, if you are using regular expressions as delimiters, you
still have to add the tilde character, if adjacent insignificant
whitespace is to be allowed::

    bool       = /true/~ | /false/~

The complete json-grammar now looks like this::

    >>> json_gr = '''
    ...     @hide      = /_\\w+/
    ...     @drop      = whitespace, strings, backticked, _EOF
    ...     @reduction = merge
    ...     @whitespace= /\\s*/
    ...     @comment   = /#[^\\n]*(?:\\n|$)/
    ...     @literalws = right
    ...     json       = ~ _element _EOF
    ...       _EOF     = /$/
    ...     _element   = object | array | string | number | bool | null
    ...     object     = "{" member ( "," §member )* "}"
    ...     member     = string ":" _element
    ...     array      = "[" ( _element ( "," _element )* )? "]"
    ...     string     = `"` _CHARS '"'
    ...       _CHARS   = /[^"\\\\]+/ | /\\[\\/bnrt\\]/
    ...     number     = _INT _FRAC? _EXP? ~
    ...       _INT     = /[-]/? ( /[1-9][0-9]+/ | /[0-9]/ )
    ...       _FRAC    = /[.]/ /[0-9]+/
    ...       _EXP     = /[Ee]/ [/[-+]/] /[0-9]+/
    ...     bool       = /true/~ | /false/~
    ...     null       = /null/~                                    '''
    >>> json_parser = create_parser(json_gr, "JSON")
    >>> syntax_tree_ = json_parser(testdata)
    >>> assert syntax_tree_.equals(syntax_tree)

The whitespace defined by the ``@whitespace``-directive can be accessed from
within the grammar via the name ``WHITESPACE__``. Other than the tilde-sign
this name refers to the pure whitespace that is not intermingles with
comments. Similarly, comments defined by the ``@comment``-directive can
be accessed via the symbol ``COMMENT__``.

.. _lookaraound_operators:

Lookahead and Lookbehind
------------------------

Lookahead and lookbehind operators are a convenient way to resolve or rather
avoid ambiguities while at the same time keeping the DSL lean. Assume for
example a simple DSL for writing definitions like::

    >>> definitions = '''
    ...     dog   := carnivorous quadrupel that barks
    ...     human := featherless biped'''

Now, let's try to draw up a grammar for "definitions"::

    >>> def_DSL_first_try = ''' # WARNING: This grammar doesn't work, yet!
    ...     @literalws  = right
    ...     definitions = ~ definition { definition } EOF
    ...     definition  = definiendum ":=" definiens
    ...     definiendum = word
    ...     definiens   = word { word }
    ...     word        = /[a-z]+|[A-Z][a-z]*/~
    ...     EOF         = /$/ '''
    >>> def_parser = create_parser(def_DSL_first_try, "defDSL")

Parsing our example with the generated parser yields an error, however::

    >>> syntax_tree = def_parser(definitions)
    >>> for e in syntax_tree.errors_sorted:  print(e)
    3:11: Error (1040): Parser word->/[a-z]+|[A-Z][a-z]*/ did not match: »:= featherless biped«

The reason for this error is that the parser ``definiens`` captures as many
words as occur in a sequence, including the definiendum of the next definition
which is the word "human". But then the next definition does not find it
definiendum, any more, because it has already been captured. (All this may not
easily become clear from the error message itself, but can easily be found
out by using the post-mortem debugger of module :py:mod:`trace`.)

An common technique to avoid this problem would be to introduce an
end-of-statement, for example, a semicolon ";". A more elegant way to solve
the problem in this case is to make use of the fact that if a word is
followed by the definition sign ":=" it cannot be part of the definiens
any more, but must be a definiendum. This can be encoded by using the
negative look-ahead operator "!":

    >>> def_DSL = def_DSL_first_try.replace('definiens   = word { word }',
    ...                                     'definiens   = word { word !":=" }')
    >>> def_parser = create_parser(def_DSL, "defDSL")
    >>> syntax_tree = def_parser(definitions)
    >>> for d in syntax_tree.select('definition'):
    ...    print(f'A {d["definiendum"]} is a {str(d["definiens"]).strip()}')
    A dog    is a carnivorous quadrupel that barks
    A human  is a featherless biped

The statement ``word !":="`` is a sequence of a ``word`` and a negative
lookahead. This whole sequence only matches, if ``word`` matches and the
negative look-ahead matches, which is only the case of the following
text cannot be matched by ":=".

We could have achieved the same effect with a positive lookahead by
checking whether any of the possible follow-up-sequences of parser
``definiens`` ensues::

    >>> def_DSL = def_DSL_first_try.replace('definiens   = word { word }',
    ...                                     'definiens   = word { word &(word|EOF) }')
    >>> def_parser = create_parser(def_DSL, "defDSL")
    >>> syntax_tree = def_parser(definitions)
    >>> for d in syntax_tree.select('definition'):
    ...    print(f'A {d["definiendum"]} is a {str(d["definiens"]).strip()}')
    A dog    is a carnivorous quadrupel that barks
    A human  is a featherless biped

Generally, lookahead operators, whether positive or negative, never capture any
text. They merely match or fail depending on whether the following parser would
match or would fail to match the next piece of text. The positive lookahead
matches, if the parser would match. The negative operator matches, if it would
fail. Lookahead operators can also be though of as a boolean condition on the
following text, where the positive lookahead operator "&" resembles an and,
"and" the negative lookahead operator an "and not". As in our example, these
operators are very helpful for "exploring" the surroundings of a piece of text
to be captured by a parser. They allow parsers to match or fail depending on
the ensuing text.

A negative lookahead expression can also serve to encode the meaning of
"without" if placed in front of another expression. Let's rewrite our
grammar of a definitions-DSL so as to exclude certain bad words::

    >>> def_DSL = def_DSL[:def_DSL.find('word        =')] + '''
    ...     word        = !forbidden /[a-z]+|[A-Z][a-z]*/~
    ...     forbidden   = /[sf][a-z][a-z][a-z]/~
    ...     EOF         = /$/ '''
    >>> def_parser = create_parser(def_DSL, "defDSL")
    >>> syntax_tree = def_parser('nice := nice word')
    >>> print(syntax_tree)
    nice := nice word
    >>> syntax_tree = def_parser('sxxx := bad word')
    >>> print(str(syntax_tree).strip())
    <<< Error on "sxxx := bad word" | Parser word->!forbidden did not match: »sxxx := bad word« >>>

The same effect can be achieved by using the subtraction operator "-". This
is just syntactic sugar make the use of the negative lookahead operator
in the sense of "without" more intuitive::

    >>> def_DSL = def_DSL[:def_DSL.find('word        =')] + '''
    ...     word        = all_words - forbidden
    ...     all_words   = /[a-z]+|[A-Z][a-z]*/~
    ...     forbidden   = /[sf][a-z][a-z][a-z]/~
    ...     EOF         = /$/ '''
    >>> def_parser = create_parser(def_DSL, "defDSL")
    >>> syntax_tree = def_parser('sxxx := bad word')
    >>> print(str(syntax_tree).strip())
    <<< Error on "sxxx := bad word" | Parser word->!forbidden did not match: »sxxx := bad word« >>>

Next to the lookahead operators, there also exist look-back operators. Be
warned, though, that look back operators are an **experimental** feature
in DHParser and that their implementation is highly idiosyncratic, that
is, it is most likely not compatible with any other
parser-generator-toolkit based on EBNF-grammars. Also, look-back
operators in DHParser are more restricted than lookahead-operators. They
can only be used in combination with simple text or regular expression
parsers and - here comes the idiosyncratic part - they work in the
opposite direction. This means that if you want to check whether a
parser is preceded, say, by the keyword "BEGIN", the text phrase that
you have to check for with the look-back parser is actually "NIGEB". If
that still does not put you off, here is how look-back-operators are
used: Let's assume that our definition should not only allow for a
definiens but, alternatively for enumerations and that the difference is
indicated by using a simple equal sign "=" instead of the definition
symbol ":=". Then using look-back-operators to distinguish the case, we
can rewrite our grammar as follows::

    >>> def_DSL_extended = '''
    ...     @literalws  = right
    ...     definitions = ~ definition { definition } EOF
    ...     definition  = definiendum (":=" | "=") (definiens | enumeration)
    ...     definiendum = word
    ...     definiens   = <-& /\\s*=:/ word { word &(word|EOF) }
    ...     enumeration = <-& /\\s*=[^:]/ word { word &(word|EOF) }
    ...     word        = /[a-z]+|[A-Z][a-z]*/~
    ...     EOF         = /$/ '''
    >>> def_parser = create_parser(def_DSL_extended, "defDSL")
    >>> definitions = '''
    ...     dog   := carnivorous quadrupel that barks
    ...     drinks = water beer juice
    ...     human := featherless biped'''
    >>> syntax_tree = def_parser(definitions)
    >>> print(str(syntax_tree.pick('enumeration')).strip())
    water beer juice

The look-back operators are ``<-&`` for the positive look-back and ``<-!``
for the negative look-back, each of which must be followed by a regular
expression or a string. Of course, this example is rather wanton and the
grammar can easily be rewritten without the look-back-operators.

.. _error_catching:

Error-Catching
--------------

Providing users with proper error information is one of the most tenacious
problem when implementing the parser for a domain specific language.
There are three different challenges:

1. Locating the error at the correct position in the source code.
2. Providing proper error messages that explain the reason for the error.
3. Resuming the parsing progress after an error has occurred at the nearest
   possible place without producing artificial follow-up errors.

In the following, DHParser's techniques for the first two challenges,
locating errors and customizing error messages will be described.
Techniques for resuming the parsing process after an error occurred
or for passing by erroneous passages in the source code will be
explained :ref:`below <fail_tolerant_parsing>`.


Farthest-Fail-Heuristics
^^^^^^^^^^^^^^^^^^^^^^^^

Without adding any hints to the grammar, DHParser applies only a
very basic technique for locating the error if the grammar does
not match a text which is known as "farthest  failure" and locates
the error at the "farthest" position where a parser failed,
reporting the last named parser in the call chain (that first reached
this location) as the cause of the failure. This approach often works
surprisingly well for locating errors, unless the grammar relies to
heavy on regular expressions capturing large chunks of text, because
the error location works only on the level of the parsing expression
grammar not at that of the atomic regular expressions. To see how
farthest fail word, consider a parser for simple arithmetic
expressions::

    >>> arithmetic_grammar = '''@ literalws = right
    ... arithmetic = expression EOF
    ... expression = ~ term { ("+" | "-") term }
    ... term       = factor { ("*" | "/") factor }
    ... factor     = number | group
    ... group      = "(" expression ")"
    ... number     = /\\d+/~
    ... EOF        = /$/'''
    >>> arithmetic = create_parser(arithmetic_grammar, "arithmetic")
    >>> terms = arithmetic('(2 - 3 * (4 + 5)')
    >>> print(terms.errors[0])
    1:17: Error (1040): Parser term->`*` did not match: »«
    >>> terms = arithmetic('(2 - 3) * ( )')
    >>> print(terms.errors[0])
    1:13: Error (1040): Parser number->/\\d+/ did not match: »)«

As can be seen the location of the error is captured well enough,
at least when we keep in mind that the computer cannot guess where
we would have placed the forgotten closing bracket. It can only
report the point where the mistake becomes apparent.

However, the reported fact that it was the sub-parser \`*\` of
parser term that failed at this location does little to enlighten
us with respect to the cause of the failure. The "farthest fail"-method
as implemented by DHParser yields the
first parser (of possibly several) that has been tried at the
position where the farthest fail occurred. Thus, in this case,
a failure of the parser capturing \`*\` is reported rather than
of the parser expression->\`+\`. Changing this by reporting the
last parser or all parsers that failed at this location would
do little to remedy this situation, however. In this example,
it would just be as confusing to learn that expression->\`+\` failed
at the end of the parsed string.


.. _mandatory_items:

Mandatory items "§"
^^^^^^^^^^^^^^^^^^^

Thus, "farthest fail"-method is not very suitable for explaining
the failure or pinpointing which parser really was the culprit.
Therefore, DHParser provides a simple annotation that allows to
raise a parsing error deliberately, if a certain point in the
chain of parsers has not been reached: By placing the "§"-sign
as a "mandatory-marker" in front of a parser, the parser as well
as all subsequent parsers in the same sequence, will not simply
return a non-match when failing, but it will cause the entire
parsing process to stop and report an error at the location
of failure::

    >>> arithmetic_grammar = arithmetic_grammar.replace(
    ...    'group      = "(" expression ")"',
    ...    'group      = "(" § expression ")"')
    >>> arithmetic = create_parser(arithmetic_grammar, "arithmetic")
    >>> terms = arithmetic('(2 - 3 * (4 + 5)')
    >>> print(terms.errors[0])
    1:17: Error (1010): »)« expected by parser 'group', but END OF FILE found instead!
    >>> terms = arithmetic('(2 - 3) * ( )')
    >>> print(terms.errors[0])
    1:13: Error (1010): expression expected by parser 'group', but »)« found instead!

The error messages give a much better indication of the cause of the
error. What is reported as cause is either the name of the parser that
was expected to match, as in the second case, or the rule of the parser
in case of unnamed parsers, as in the first case. This usually, though
unfortunately not always, yields a much better indication of the
location and cause of an error than the farthest failure.

However, a little care has to be taken, not to place the mandatory
marker in front of a parser that might fail at a location that could
still be reached and matched by another branch of the grammar. (In our
example it is clear that round brackets enclose only groups. Thus, if
the opening round bracket has matched, we can be sure that what follows
must be an expression followed by a closing round bracket, or, if not it
is a mistake.) Luckily, although this may sound complicated, in practice
it never is. Unless you grammar is very badly structured, you will
hardly ever make this mistake, an if you do, you will notice soon
enough.

Also, there is an important restriction: There is only one §-marker
allowed per named parser. In case you have a long EBNF-expression on the
right hand side of a symbol-definition, where you'd like to use the
§-marker at more than one place, you can, however, always split it into
several expression by introducing new symbols. These symbols, if they
serve no other purpose, can be marked as disposable with the
``@ hide``-directive (see :ref:`simplifying_syntax_trees`).

The §-marker has proven to be a very simple means of pinpointing errors
the DSL-code, and I recommend to use it from early on in the process of
developing a new grammar. Plus, the §-marker offers two further benefits,
namely customizing error messages and resuming the parsing process after
a failure. That latter is particularly helpful if the DSL is to be
used in with an integrated development environment, which benefits greatly
from fail-tolerant parsing. However I only recommend to start using these,
only after the grammar has reached a certain amount of maturity, because
changing the grammar ofter requires re-adjusting customized error messages
and resume-clauses as well, which can become tedious.

.. _custom_error_messages:

Custom error messages
^^^^^^^^^^^^^^^^^^^^^

While the error messages produced by the use of the §-marker are often
quite understandable for the engineer designing the grammar of a DSL,
they might not be so for the user the DSL, who might not know the names
of the parsers of the grammar, let alone the expressions of the unnamed
parsers und will therefore not always be able to make much sense of
an error-messages that report just these.

In order to customize error messages, the symbol-related directive
``@ SYMBOLNAME_error = CONDITION, ERROR_STRING`` is used. The directive's
name consists of the name of a symbol that contains a §-marker and the
appendix ``_error``. The directive always takes two arguments, separated
as usual by a comma, of which the first is condition-expression and
the second an error message. The condition can be used to make
the choice of an error-message dependent on the text following the
point of failure. It can either be
a regular expression or a simple string which must match (or be equal
to in the case of the string) the first part of the text at the
position where the parser defined by the symbol failed to match and
raised an error. Only if the condition matches, the error message
given as the second argument will be emitted. Otherwise, the fallback
error-expression described above ("... expected by parser ...") will
be shown. The empty string ``''`` can be used as a fallback if the
customized message shall always be emitted, no matter what the
following text looks like.

The error string is a format string that may include any of the
two arguments ``{0}`` or ``{1}`` where ``{0}`` will be replaced by
the name or string representation of the parser that was expected
to match but didn't and ``{1}`` will be replaced by the first twenty
or so letters of the unmatched rest of the text. Here is a simple
example that could be part of a JSON-parser that is intended to
deliver understandable error-messages::

    >>> grammar = '''
    ... @ string_error  = '', 'Illegal character(s) {1} in string.'
    ... string          = `"` §characters `"` ~
    ... characters      = { plain | escape }
    ... plain           = /[^"\\\\]+/
    ... escape          = /\\[\\/bnrt\\]/'''
    >>> json_string = create_parser(grammar, 'json_string')
    >>> print(json_string('"alpha"'))
    "alpha"
    >>> for e in json_string('"al\\pha"').errors:  print(e)
    1:4: Error (1010): Illegal character(s) »\pha"« in string.
    1:4: Error (1040): Parser "string" stopped before end, at: »\pha"« Terminating parser.

It is possible to place an error code or number at the beginning of
the error string, separated by a colon to override the default code
of 1010 for mandatory continuation errors. (See
:ref:`below <grammar_code_for_errors>` for an example where it makes
sense to do so.)

Customized error-messages must always be specified in the grammar
before definition of the symbol they are related to. It is possible
to "stack" several different error-directives with
different conditions and messages but related to the same symbol.
The conditions are evaluated in the order the
error-directives appear in the grammar. The first error message
the matching condition matches the error location will be picked.
Therefore, the more specific conditions should always be placed first
and the more general or fallback conditions should be placed
below these::

    >>> grammar = ("@ string_error  = /\\\\/, 'Illegal escape sequence {1} "
    ...            "Allowed values are b,n,r,t,u'") + grammar
    >>> json_string = create_parser(grammar, 'json_string')
    >>> for e in json_string(r'"al\pha"').errors:  print(e)
    1:4: Error (1010): Illegal escape sequence »\pha"« Allowed values are b,n,r,t,u
    1:4: Error (1040): Parser "string" stopped before end, at: »\pha"« Terminating parser.

Here, the more specific and more understandable error message
has been selected. Careful readers might notice that the the
more general customized error message "Illegal character(s)
... found in string" will now only be selected, if the
string contains a character that not even regular expression
engine recognizes, because the only other character that
is not allowed within the string are the closing quotation
marks that terminate the string and which do not cause the
parser to fail (but only to terminate to early).

Also, it might be noticed that the errors are always caused
by a failure to match the second ``"``-sign, because the
characters-parser also matches the empty string and thus
never fails or raises any error. Nonetheless, the error
can occur in the interior of the string and can - with
the help of customized error messages - be described as such
and properly be located.

Generally, though, emitting the right error messages based on a
regular-expression-condition is not easy. In the worst case, badly
customized error messages can even be more confusing than not
customizing error messages at all.

One possible development-strategy is to wait for the feedback of testers and
users or to monitor the errors that users typically make and then to customize
the error messages to the actual needs of the users to help them understand why
the computer refuses to parse a certain construct.

.. _grammar_code_for_errors:

Enhancing grammars with error-paths
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A frequently used approach to error catching is the encoding of
alternative code paths for erroneous code right into the grammar. This
is usually done by adding a further alternative with the
alternative-operator ``|`` that is specifically meant to "throw" an
error if all regular alternatives have been exhausted. DHParser offers a
special syntax to add an error message if a certain part of the grammar
is activated during the parsing of an erroneous source-text:
``@Error("[CODE:]MESSAGE")`` where MESSAGE has to be substituted by a
particular error message and the optional "CODE:" with a natural number
at the beginning of the error-message from which the error-code must be
separated with a colon. This number can freely be chosen, but must
adhere to the following convention:

    * numbers from 1 to 99 must be used for mere notices
    * numbers from 100 to 999 for warnings
    * numbers from 1000 to 9999 for errors
    * numbers of 10000 and more as fatal errors

Fatal errors result in the stop of the :ref:`processing pipeline
<processing_pipelines>` right after the step where the error occurred.
"Normal" errors result in invalid results but DHParser will attempt to
continue with the processing pipeline. This has the advantage that other
errors further downstream will also be reported in the same pass. A
possible disadvantage is that consequential errors may arise in the
processing of faulty results from earlier stages in the pipeline. If too
many consequential errors occur in the aftermath of an error, it should
be considered to lift this error to a fatal error by assigning a higher
number to it.

Warnings and notices should never should be used when the results will
still be valid, though maybe not what they were intended to be. (Thus,
the warning) Error codes starting with the digit "1" are reserved by
DHParser. So, for custom-errors codes starting with "2", "3" etc. must
be used.

Here is an example how error-messages can be added within the grammar.
The syntax uses the same ``@``-marker as directives but other than
directives the error-markers do not resemble symbol-definitions but
occur within the right hand side of a symbol definition. The following
is a grammar that parses a sequence of natural numbers separated by a
blank of line-feed, e.g. "1 2 3"::

    >>> sequence_of_natural_numbers = """@drop = whitespace
    ...     document = { ~ number | error } ~ EOF
    ...     number = /[1-9]\\d*/~
    ...     @error_error = "20:Parser stopped because of an error"
    ...     error  = !EOF           # no error-message on end of file 
    ...              @Error("2010:Not a valid Number!") 
    ...              § EOF
    ...     EOF = !/./"""
    >>> number_parser = create_parser(sequence_of_natural_numbers)
    >>> syntax_tree = number_parser("1 2 X 3 4 X 5")
    >>> for error in syntax_tree.errors_sorted: print(error)
    1:5: Error (2010): Not a valid Number!
    1:5: Notice (20): Parser stopped because of an error
    1:5: Error (1040): Parser "document" stopped before end, at: »X 3 4 X 5« Terminating parser.

Let's examine this in detail: The definition of the "document"-symbol without
an error-branch would read::

    document = { ~ number } ~ EOF

Here, we add an alternative branch to the ``number``-parser in case the
number parser does not match. Now, the number parser will not match on
one of two conditions: a) the following text is not a number, or b) the
end of the file has been reached. Now, since the latter is perfectly in
order, we have to exclude this case at the beginning of the error-branch
with a negative-lookahead::

    error  = !EOF           
             @Error("1010:Not a valid Number!") 
             § EOF

While the first line of the definition of the error-branch excludes the
kinds of non-match(es) that are not errors, the second line adds an
error message with the current location of the parser. We add an error
code, "1010", to make it easier to identify this error in automated
tests or the like. The last line is just a means to make the parser to
stop parsing right on the spot (that is, unless this error is caught by
a resume-directive, see :ref:`below <fail_tolerant_parsing>`). It will
produce the consequential error "EOF expected", though. In order not to
confuse the use, the error has been reconfigured as a simple notice that
explains the situation with the directive ``@error_error = "20:Parser
stopped because of an error"`` in the line before.

Instead of stopping the parser by jumping right to the end of the document, one
might also try to skip only so much text as is needed to continue the
parsing-process. In this trivial example, this is easily done by substituting
the ``§ EOF`` statement with a regular expression that consumes all characters
until the next location where the parser might find a possibly valid item. The
expression ``/.*?(?=\s|$)/`` skips all characters up to but not including the
next blank or up to the end of the file. Let's try::

    >>> alt_gr = sequence_of_natural_numbers.replace('§ EOF', '/.*?(?=\\s|$)/')
    >>> reentrant_number_parser = create_parser(alt_gr)
    >>> syntax_tree = reentrant_number_parser('1 2 X 3 4 X 5')
    >>> for error in syntax_tree.errors_sorted:  print(error)
    1:5: Error (2010): Not a valid Number!
    1:11: Error (2010): Not a valid Number!

The location of the second error indicates that the parser has continued
to read the document after the first error. We can double-check this by
looking at the abstract syntax tree::

    >>> print(syntax_tree.as_sxpr())
    (document
      (number "1")
      (number "2")
      (error
        (ZOMBIE__ `(err "1:5: Error (2010): Not a valid Number!"))
        (:RegExp "X"))
      (number "3")
      (number "4")
      (error
        (ZOMBIE__ `(err "1:11: Error (2010): Not a valid Number!"))
        (:RegExp "X"))
      (number "5")
      (EOF))


.. _fail_tolerant_parsing:

Fail-tolerant Parsing
---------------------

A serious limitation of all previously described error-handling
mechanisms (with the exception the very last example) is that the
parsing process still stops on the very first error. This is
particularly annoying for beginners learning to code data or program
code with a new DSL, because the compiler must be run again and again
before all errors have been found and corrected. It would be much better
to receive a list of all errors on the first run.

.. _generic_fail_tolerant_parsing:

Generic approach
^^^^^^^^^^^^^^^^

Fail tolerant parsing means that:

1. the parsing process continues after a syntax error has been
   encountered, ideally until the end of the document

2. that doing so the parser skips as little text as possible
   and thus reports as many errors as possible on the first
   run.

3. that ideally no consequential errors ("Folgefehler") are
   introduced which are not original errors in the source
   document, but merely artifacts of badly chosen locations
   for the resumption of the parsing process.

There are a number of techniques for fail-tolerant parsing. One
technique that has already been shown :ref:`earlier
<grammar_code_for_errors>` and which is not specific to DHParser but can
be used with any parser-generator is to add grammar-paths for possibly
erroneous code to the grammar::

    >>> grammar = '''
    ... string          = `"` ([characters] `"` | string_error [`"`]) ~
    ...   string_error  = /[^"]*/
    ... characters      = { plain | escape }+
    ... plain           = /[^"\\\\]+/
    ... escape          = /\\[\\/bnrt\\]/'''
    >>> json_string = create_parser(grammar, 'json_string')
    >>> tree = json_string('"al\\pha"')
    >>> print(tree.as_sxpr(flatten_threshold=0))
    (string
      (:Text '"')
      (string_error "al\pha")
      (:Text '"'))

This time the parser did not need to stop at the erroneous part. The erroneous
part itself has been caught within a node that betrays only by its name
that there was an error. To produce error messages we have to add them
explicitly to such nodes, afterwards:

    >>> for string_error in tree.select('string_error'):
    ...     _ = tree.new_error(string_error, 'Fehler im String: ' + str(string_error))
    >>> for e in tree.errors:  print(e)
    1:2: Error (1000): Fehler im String: al\pha

Unfortunately, the error location is not very precise. This can be remedied
by refining our error junction code::

    >>> grammar = '''
    ... string          = `"` ([characters] `"` | string_error [`"`]) ~
    ...   string_error    = [characters] { ups [characters] }+
    ...   ups             = /[^"]/
    ... characters      = { plain | escape }+
    ... plain           = /[^"\\\\]+/
    ... escape          = /\\[\\/bnrt\\]/'''
    >>> json_string = create_parser(grammar, 'json_string')
    >>> tree = json_string('"al\\pha"')
    >>> print(tree.as_sxpr())
    (string
      (:Text '"')
      (string_error
        (characters
          (plain "al"))
        (ups "\")
        (characters
          (plain "pha")))
      (:Text '"'))

Here, the node named "ups" pinpoints the precise error location.

Like most techniques for fail-tolerant parsing, this one is not quite
as easy to master in practice as it might look. Generally, adding
a junction for erroneous code works best, when the passage that shall
be by-passed is delineated by a easily recognizable follow-up strings.
In this example the follow-up string would be the ``"``-sign. The method fails,
of course if the follow-up text is erroneous, too, or has even been
forgotten. So, to be absolutely sure, one would have to consider
different follow-up sequences, say empty lines, keywords that mark
new parts of the document and the like.

.. _skip_and_resume:

Skip and Resume
^^^^^^^^^^^^^^^

DHParser also offers two other constructs for fail-tolerant parsing
which are quite similar to the just described technique. However, they
do not require adding code-paths to the grammar and reuse the
error-locating ability of the §-marker. A disadvantage is that the
DHParser-specific support for fail-tolerant parsing presently relies
entirely on regular expressions for finding the right re-entry points.

DHParser allows to resume parsing after an error at a later point
in the text. When trying to resume parsing two questions must be
answered:

  1. At what location should the parsing process be resumed?

  2. Which parser in the parser call-chain should resume parsing?
     E.g. the parser that failed, the parser that called the parser
     that failed, ... ?

The location where parsing should be resumed must be specified by
a regular expression or a list of regular expressions. The resumption
location is the nearest match of any of these expressions that does
not fall into a comment (as specified by the ``@comment``-directive
described above). More precisely it is the location directly after
the match, because this allows to search for the reentry-location
both by the text preceding this location and the text following
this location by using a lookahead operator inside the regular
expression.

The parser that resumes parsing depends on the directive that guides
the search for the reentry-point. DHParser offers two different
directives for this purpose, the ``@..._skip``-directive and the
``@..._resume``-directive. The placeholder ... stands for the name
of a parser that contains a §-marker.

The skip-directive resumes parsing with the sequence-parser that
contains the item(s) marked by the §-marker. In the following
example, the skip-directive picks up parsing with the string-
parser when an error was raised by the string-parser::

    >>> grammar = '''
    ... @ string_error  = /\\\\/, 'Illegal escape sequence {1}'
    ... @ string_error  = '', 'Illegal character {1} in string.'
    ... @ string_skip   = /(?=")/
    ... string          = `"` §characters `"` ~
    ... characters      = { plain | escape }
    ... plain           = /[^"\\\\]+/
    ... escape          = /\\[\\/bnrt\\]/'''
    >>> json_string = create_parser(grammar, 'json_string')
    >>> tree = json_string('"al\\pha"')
    >>> print(tree.content)
    "al\pha"
    >>> print(tree.errors[0])
    1:4: Error (1010): Illegal escape sequence »\pha"«
    >>> print(tree.as_sxpr())
    (string
      (:Text '"')
      (characters
        (plain "al"))
      (string_skip_R1__ `(err "1:4: Error (1010): Illegal escape sequence »\pha\"«") "\pha")
      (:Text '"'))

After the error has occurred at the illegal escape-sequence, the
skip-directive catches the error and skips to the location where the
`"`-character lies just ahead and continues parsing with the string-parser.
The skipped passage is stored in a ``ZOMBIE__``-Node within the syntax-tree
and parsing can continue through to the end of the text.

In contrast to the skip-directive the resume-directive leaves the parser
that raised the error and resumes one level higher up in the call chain.
The ``@ ..._resume``-directive that tells the *calling*
parsers where to continue after the array parser has failed.
So, the parser resuming the parsing process is not the array parser that
has failed, but the first parser in the reverse call-stack of "array" that
catches up at the location indicated by the ``@ ..._resume``-directive.
The location itself is determined by a regular expression, where the
point for reentry is the location *after* the next match of the regular
expression::

    >>> grammar = grammar.replace('@ string_skip   = /(?=")/',
    ...                           r'@ string_resume = /("\s*)/')
    >>> json_string = create_parser(grammar, 'json_string')
    >>> tree = json_string('"al\\pha"')
    >>> print(tree.content)
    "al\pha"
    >>> print(tree.errors[0])
    1:4: Error (1010): Illegal escape sequence »\pha"«
    >>> print(tree.as_sxpr())
    (string
      (:Text '"')
      (characters
        (plain "al"))
      (string_resume_R1__ `(err "1:4: Error (1010): Illegal escape sequence »\pha\"«") '\pha"'))

Note, that this time, the zombie-node also contains the closing
quotation marks. Also, it should be observed, that the regular
expression of the resume-directives stops after the closing quotation
marks as well as any ensuing whitespace. This is because parsing will
continue with the calling parser of the string parser, so the resumption
point must be at a reasonable place where the string parser might have
returned, if no error had occurred.

A simple rule for specifying the reentry point of an error is to find a
location where the next structural entity after the erroneous entity
starts. Let's try this for a (simplified) `config-file
<https://docs.python.org/3/library/configparser.html>`_ parser::

    >>> config_grammar = '''@literalws = right
    ... config     = ~ { section } EOF
    ... section    = heading { entry }
    ... heading    = "[" § identifier "]"
    ... entry      = identifier § ":" value
    ... identifier = /\\w+/~
    ... value      = `"` § TEXTLINE '"'
    ... TEXTLINE = /[^"\\n]*/
    ... EOF      =  !/./ '''
    >>> config_parser = create_parser(config_grammar)

As of now, our parser is not fail-tolerant. This means it will stop parsing
at the first error. Further errors are neither detected nor reported::

    >>> cfg_data_with_errors = '''
    ... [entities]
    ... animal: "cat"
    ... plant: rose"
    ... Building: "Tour Eiffel"
    ... [colors
    ... red: "warm"
    ... blue 1: "cold"
    ... grey: "black and white" '''
    >>> result = config_parser(cfg_data_with_errors)
    >>> for error in result.errors_sorted:  print(error)
    4:8: Error (1010): value expected by parser 'entry', but »rose"\nBuil...« found instead!
    4:8: Error (1040): Parser "config" stopped before end, at: »rose"\nBuil...« Terminating parser.

After adding suitable `resume`-clauses for those symbols the definition
of which contain the mandatory marker `§`, all errors are reported in
a single pass::

    >>> config_grammar = '''
    ... @heading_resume = /\\n\\s*(?=\\w|\\[)/
    ... @entry_resume = /\\n\\s*(?=\\w|\\[)/
    ... ''' + config_grammar
    >>> config_parser = create_parser(config_grammar)
    >>> result = config_parser(cfg_data_with_errors)
    >>> for error in result.errors_sorted:  print(error)
    4:8: Error (1010): value expected by parser 'entry', but »rose"\nBuil...« found instead!
    7:1: Error (1010): »]« expected by parser 'heading', but »red: "warm...« found instead!
    8:6: Error (1010): »:« expected by parser 'entry', but »1: "cold"\n...« found instead!

It can become difficult to find a reentry point with regular expressions
that is on the same level of the parser call chain (or one level higher
up in the case of the resume-directive) when an error occurs in a
syntactic structure that can be recursively nested. Because of this it
is also possible to specify the re-entry point with a parser. In this
case, the search term has a different semantics however. If a parser is
specified, it must match all characters from the point where the error
occurred up to the reentry point. In the case of a simple string or a
regular expression, DHParser searches for the first match of the
expression and then picks the location after that match. In order to
distinguish the two cases clearly, PEG-rules must always be enclosed in
round brackets. Thus, a single regular expression or a singular string
enclosed in round brackets will not be used as a search term but as a
matching expression that determines the reentry-location by matching the
complete text from the error location to the reentry-point.

As an example, let's try this with a parser for arbitrarily nested lists
of positive integers. First, we write our grammar without any re-entry rules::

    >>> number_list_grammar = r'''@ literalws   = right
    ... @ hide   = /_\w+/
    ... @ drop   = _EOF, whitespace, strings
    ... _document = ~ [ list ] §_EOF
    ... list     = "[" [_items] § "]"
    ... _items   = _item { "," §_item }
    ... _item    = number | list
    ... number   = `0` | /[1-9][0-9]*/
    ... _EOF     =  !/./'''
    >>> list_parser = create_parser(number_list_grammar)
    >>> list_parser('[[1,2], [3,4]]').as_sxpr()
    '(list (list (number "1") (number "2")) (list (number "3") (number "4")))'

The following example contains three errors: The letter "A" in a place where a
number should be and, a bit later, the wrong delimiter ";" instead of a comma,
and finally a missing value (or a superfluous comma) right before the end.
Since there are no rule for resuming the parser after an error occurred, the
parser will stop after the first error and also report that only part of the
document could be parsed::

    >>> example_with_errors = '[1, 2, A, [5, 6; 7], 8, ]'
    >>> result = list_parser(example_with_errors)
    >>> for e in result.errors: print(e)
    1:8: Error (1010): _item expected by parser '_items', but »A, [5, 6; ...« found instead!
    1:8: Error (1040): Parser "_document" stopped before end, at: »A, [5, 6; ...« Terminating parser.

Now, let's define some regular expression based rules to resume parsing after
an error::

    >>> resumption_rules = '''
    ... @list_resume = /]\\s*|$/
    ... @_items_skip = /(?=,)/, /(?=])/, /$/
    ... '''

Note, that for the "_items"-parser, several rules have been specified.
DHParser will try all of these rules and resume at the closest of the
locations these rules yield. The `@list_resume`-rule moves to a point
after the list, where the list parser might have returned, if no error
had occurred, which is after the closing square bracket plus any
adjacent whitespace. The `@_items_skip`-rule moves to any point within
the sequence of items where the `_items` could catch up (another comma
"," followed by further items) or end, because the sequence is exhausted
("]" or the end of the document caught be the regular expression marker
for the end of the string: /$/). See, how these rules play out in this
particular example::

    >>> list_parser = create_parser(resumption_rules + number_list_grammar)
    >>> result = list_parser(example_with_errors)
    >>> for e in result.errors: print(e)
    1:8: Error (1010): _item expected by parser '_items', but »A, [5, 6; ...« found instead!
    1:16: Error (1010): »]« expected by parser 'list', but »; 7], 8, ]« found instead!
    1:25: Error (1010): _item expected by parser '_items', but »]« found instead!

All errors are located and reported properly in a single run and the
parser continues right through until the end of the document as we'd
expect from a fail-tolerant parser. However, the limitations of our
regular-expression-based rules become apparent when nested structures
are involved::

    >>> example_with_errors_2 = '[1, 2, A, [5, 6; [7, 8], 9], 10, ]'
    >>> result = list_parser(example_with_errors_2)
    >>> for e in result.errors: print(e)
    1:8: Error (1010): _item expected by parser '_items', but »A, [5, 6; ...« found instead!
    1:16: Error (1010): »]« expected by parser 'list', but »; [7, 8], ...« found instead!
    1:28: Error (1010): _EOF expected by parser '_document', but », 10, ]« found instead!
    1:28: Error (1040): Parser "_document" stopped before end, at: », 10, ]« Terminating parser.

Here, the parser stopped before the end of the document, which shows
that our resumption rules have been either incomplete or inadequate.
Let's turn on some debugging information to get a better insight into
what went wrong::

    >>> from DHParser.trace import resume_notices_on
    >>> resume_notices_on(list_parser)
    >>> result = list_parser(example_with_errors_2)
    >>> for e in result.errors: print(e)
    1:8: Error (1010): _item expected by parser '_items', but »A, [5, 6; ...« found instead!
    1:9: Notice (50): Skipping from 1:8 'A, [5, ...' within _item->:_item to 1:9 ', [5, 6...'
    1:16: Error (1010): »]« expected by parser 'list', but »; [7, 8], ...« found instead!
    1:24: Notice (50): Resuming from list at 1:16 '; [7, 8...' with _items->:Series at 1:24 ', 9], 1...'
    1:28: Error (1010): _EOF expected by parser '_document', but », 10, ]« found instead!
    1:28: Error (1040): Parser "_document" stopped before end, at: », 10, ]« Terminating parser.

What is of interest here, is the second notice: It seems that the error
was caught within the "list"-parser. By moving on to the spot after
closing bracket as determined by the `@list_resume`-directive, however,
the parsing-process did not end up at a location behind the grammatical
structure where the error had occurred, but at the location after a
nested structure, which in this example is the inner list "[7, 8]".

This problem can be remedied by using the full power of parsing
expression grammars for determining the resumption-position. (Note, that
in order to clearly distinguish PEG-rules unambiguously from plain
regular expressions or simple strings, they must be enclosed in round
rackets!)::

    >>> resumption_rules = '''
    ... @list_resume = ({ list | /[^\\[\\]]+/ } ["]"])
    ... @_items_skip = /(?=,)/, /(?=])/, /$/
    ... '''
    >>> list_parser = create_parser(resumption_rules + number_list_grammar)
    >>> resume_notices_on(list_parser)
    >>> result = list_parser(example_with_errors_2)
    >>> from DHParser.error import RESUME_NOTICE
    >>> for e in result.errors:
    ...     if e.code != RESUME_NOTICE: print(e)
    1:8: Error (1010): _item expected by parser '_items', but »A, [5, 6; ...« found instead!
    1:16: Error (1010): »]« expected by parser 'list', but »; [7, 8], ...« found instead!
    1:34: Error (1010): _item expected by parser '_items', but »]« found instead!

This time, the parser does not terminate before the end. The
resume-notices show that resumption does not get caught on the nested
structure, any more::

    >>> for e in result.errors:
    ...     if e.code == RESUME_NOTICE: print(e)
    1:9: Notice (50): Skipping from 1:8 'A, [5, ...' within _item->:_item to 1:9 ', [5, 6...'
    1:28: Notice (50): Resuming from list at 1:16 '; [7, 8...' with _items->:Series at 1:28 ', 10, ]'
    1:34: Notice (50): Skipping from 1:34 ']' within _items->:Series to 1:34 ']'


Programming fail-tolerant parsers can be quite a challenge. DHParser's @skip-
and @resume-directives help separating the code for fail-tolerance from the
grammar proper. The only hooks that are needed within the grammar proper are
the mandatory markers ("§"). Still, it is a good strategy, to start adding the
fail-tolerance code only later in the development of a grammar and to keep
things as simple as possible. This can best be done by choosing resumption
points that are as unambiguous as possible (like keywords that introduce
specific parts of the document), even if this means that larger
parts of the document will be skipped and, consequently, some errors will
remain undetected until errors earlier in the document have been fixed.
If syntax errors are sparse - as can reasonably be assumed - the harm done
by skipping larger portions of the text is probably negligible or at any
rate smaller than the harm done by introducing consequential errors as
a result of poorly chosen resumption rules.

.. _macro_system:

Macros
------

In order to reduce code-repetition within grammar-specifications
DHParser offers a substitution-based macro-system. (To avoid
code-repetition *between* grammars, :ref:`includes <include_directive>`
can be used.)

Macros are defined similar to symbols, only that their name must
always start with a dollar sign ``$``::

    $list($delimiter) = { /[^,.;:]+/ | !$delimiter /[,.;:]/ }
           { $delimiter ~ { /[^,.;:]+/ | !$delimiter /[,.;:]/ } }

The parameters of a macro are also marked with a dollar sign.
Using macros is straight forward::

    keywords = "Keywords:" $list(",")

Macros can use other macros in their definiens but they may not be
nested recursively. It is also possible to define macros without any
arguments. This makes sense, when breaking up a long macro definition
into several macros which works similar to breaking up a long
symbol-definition by introducing (disposable) helper-symbols. For this
purpose helper-macros may refer to parameters of the "calling" macro
without explicitly passing them as parameters::

    $list($delimiter) = $item { $delimiter ~ $item }
    $item = { /[^;:,.\n]+/ | !$delimiter /[.,:;]/  }+

A limitation of DHParser's macro-system is that there is no way to
define new symbols inside a macro-definition.

Here is a simple example for the use of macros to highlight particular
words when parsing a text::

    >>> hilight_grammar = '''@reduction = merge
    ...     song = { $skip("rain") rain } [$skip("rain")] EOF
    ...     $skip($phrase) = { !$phrase /.|\\n/ }
    ...     rain  = "rain"
    ...     EOF   = !/.|\\n/'''

    >>> hilight_parser = create_parser(hilight_grammar)
    >>> song = """
    ...     I'm singin' in the rain
    ...     Just singin' in the rain
    ...     What a glorious feeling
    ...     I'm happy again."""

    >>> syntax_tree = hilight_parser(song)
    >>> print(syntax_tree.as_sxpr())
    (song
      (skip
        ""
        "    I'm singin' in the ")
      (rain "rain")
      (skip
        ""
        "    Just singin' in the ")
      (rain "rain")
      (skip
        ""
        "    What a glorious feeling"
        "    I'm happy again.")
      (EOF))

For another, more detailed example, see the :any:`macros`-section in
the overview.

.. _context_sensitive_parsers:

Context sensitive parsers
-------------------------

DHParser does by intention not contain support for semantic actions,
because these can introduce a context-sensitivity that can be hard to
handle with a recursive descent parser. And compiler-generation, where
semantic actions are mostly needed, is not the main domain of
application for DHParser. (There are a few loopholes that can be
(mis-)used for semantic actions, though...)

Sometimes, however, it would be ever so comfortable to break out of the
paradigm of context free grammars - if only just a little bit. For
example, when encoding data in XML or HTML, the closing tag must have
the same tag-name as the opening tag::

    <line>O Rose thou art sick.</line>

If you encode the tag-name-parser roughly following the `XML-specs <https://www.w3.org/TR/xml/>`_
as::

    tag_name        = /(?![0-9][:\\w][\\w:.-]*/

the following code would be accepted by the parser:

    <line>O Rose thou art sick.</enil>

In this case, the remedy is easy: When post-processing the syntax tree,
check whether all end-tags have the same tag-name as the corresponding
start-tag and add an error message where this is not the case. However,
this only works, because the tag-names have no influence on the
structure of the syntax-tree of an XML-document.

Therefore, the same remedy would not work in many other cases. In
`CommonMark <https://commonmark.org/>`_, for example, a "`code fence
<https://spec.commonmark.org/0.30/#fenced-code-blocks>`_ is a sequence
of at least three [but possibly more] consecutive backtick characters
(`) or tildes (~)" Now, this is a bit more complicated than the
XML-example, because here the content of the closing marker needs to be
known at the time of parsing already, because otherwise the structure
could not be determined correctly. A smaller number of tildes or
backticks than used at the opening of a code fence would be part of the
content of the fenced code that needs to be distinguished from the
closing delimiter. The purpose is to allow you to fence code that may
contain fenced code itself.::

    ~~~~~
    In common mark code can be fenced with tilde-characters. You can just write:
    ~~~
    fenced code
    ~~~
    ~~~~~

Here, a possible remedy is to employ a preprocessor, that distinguishes
the fenced code from quoted fences and replaces the non-quoted fences by
context-free opening and closing markers that can then be captured at
parsing stage. Using pre-processors is often a clean and pragmatic
solution and DHParser includes dedicated support for pre-processors.
However, introducing pre-processors also hast some downsides. One
disadvantage is that a preprocessor requires a makeshift parser of its
own that must be strong enough not to stumble over the syntactic
constructs that do not concern the preprocessor like comments for
example.

DHParser therefore also offers another alternative for occasional
context sensitivity by allowing to retrieve and compare earlier values
of a symbol. Any users of these features should be aware, however, that
extensive use of context sensitive parsers may slow-down parsing,
because it does not play well with the the memoizing optimization that
is commonly used with parsing expression grammars and also employed by
DHParser. Also, since these features are specific for DHParser,
switching to another parser generator will require factoring the
context-sensitive-parser out of your grammar and re-implementing the
functionality for which they have been used with the more conventional
approach of pre- and post-processors.

Before explaining the full mechanism, let's have a look at an example.
The following minimal pseudo-XML-parser captures the value of the
tag-name so that it can compared with the tag-name of the ending-tag::

    >>> miniXML = '''
    ... @ hide   = EOF
    ... @ drop   = EOF, whitespace, strings
    ... document = ~ element ~ §EOF
    ... element  = STag §content ETag
    ... STag     = '<' TagName §'>'
    ... ETag     = '</' ::TagName §'>'
    ... TagName  = /\\w+/
    ... content  = [CharData] { (element | COMMENT__) [CharData] }
    ... CharData = /(?:(?!\\]\\]>)[^<&])+/
    ... EOF      =  !/./
    ... '''
    >>> parseXML = create_parser(miniXML)
    >>> print(parseXML('<line>O Rose thou art sick.</line>').as_sxpr())
    (document
      (element
        (STag
          (TagName "line"))
        (content
          (CharData "O Rose thou art sick."))
        (ETag
          (TagName "line"))))
    >>> result = parseXML('<line>O Rose thou art sick.</enil>')
    >>> print(result.errors[0])
    1:28: Error (1010): ETag = `</` ::TagName "line" § `>` expected by parser 'element', but »</enil>« found instead!

Here, the TagName-parser in the definition has been prefixed with a
double colon ``::``. This double colon is the "Pop"-operator and can be
put in front of any symbol defined in the grammar. If a symbol is
annotated with the operator, then its parsing-rule will not be executed,
but the last value that has been parsed by the symbol will be retrieved
and match against the following part of the document by simple
text-comparison. If the last value matches the "Pop"-parser will remove
that value from the stack of earlier values and report a match.
Otherwise a non-match will be reported and the value will be left on the
stack. In the example, since the Pop-parser ``::TagName`` follows a
mandatory marker ``§``, the non-match causes a syntax error.

Sometimes it is useful to compare the following text with a stored value
without removing that value from the stack. For this purpose, there is
the "Retrieve"-operator which is denoted by a single colon `:`::

    >>> fencedTextEBNF =  '''@whitespace = vertical
    ... @hide    = EOF, fence_re
    ... @drop    = whitespace, EOF
    ... document = ~ { text | fenced } EOF
    ... text     = /[^\\~\\n]+/ ~  # /(?:[^\\~]+|\\n)+/
    ... fenced   = fence ~ { text | no_fence } ::fence ~
    ... no_fence = ! :fence fence_re ~
    ... fence    = fence_re
    ... fence_re = /\\~+/
    ... EOF      = !/./
    ... '''

Here the Pop-operator ``::`` is used in the definition of ``fenced`` in
just the same way as in the earlier example. The Retrieve-operator ``:``
is used in the definition of ``no_fence`` in combination with a negative
lookahead ``!``. This allows the no_fence-parsers to capture all
"fences" which are not closing-fence of the current fenced environment::

    >>> parseFenced = create_parser(fencedTextEBNF)
    >>> fenced_test_1 = '''~~~
    ... fenced code
    ... ~~~'''
    >>> print(parseFenced(fenced_test_1).as_sxpr())
    (document (fenced (fence "~~~") (text "fenced code") (fence "~~~")))
    >>> fenced_test_2 = '''~~~~~
    ... In common mark code can be fenced with tilde-characters. You can just write:
    ... ~~~
    ... fenced code
    ... ~~~
    ... ~~~~~'''
    >>> print(parseFenced(fenced_test_2).as_sxpr())
    (document
      (fenced
        (fence "~~~~~")
        (text "In common mark code can be fenced with tilde-characters. You can just write:")
        (no_fence "~~~")
        (text "fenced code")
        (no_fence "~~~")
        (fence "~~~~~")))

But what if the opening and closing fence are not one and the same
string, but complements of each other, like opening and closing
brackets? Say, you'd like to enclose code-examples in curled braces "{"
and "}" and since the code examples themselves may contain braces, you'd
like to allow the markup-writer to use an arbitrary number of braces as
opening and closing delimiters::

    >>> markup = "This ist a code example: {{ mapping = { 'a': 1, 'b': 2} }} with braces."

Here, the recalled value would need to be transformed or otherwise
interpreted before the following text is either considered a match or a
non-match. Since such a transformation can hardly be encoded in EBNF
even an augmented EBNF, any more, one of DHParser's loopholes for
semantic actions must be used. It is possible to assign Python
filter-functions to symbols, the value of which is retrieved with yet
another directive, which is the ``@ XXXX_filter``-directive, where
"XXXX" stands for the name of the symbol::

    >>> bracesExampleEBNF = '''
    ... @braces_filter = matching_bracket()
    ... document       = { text | codeblock }
    ... codeblock      = braces { text | opening_braces | (!:braces closing_braces) } ::braces
    ... braces         = opening_braces
    ... opening_braces = /\\{+/
    ... closing_braces = /\\}+/
    ... text           = /[^{}]+/
    ... '''

The function name that is passed to the directive must be the name of a
function that is within reach of the generated parser (by either
defining it in the same module or importing it) and it must have the
signature:

     Callable[str, List[str]], Optional[str]]

This function takes the following text as well as the stack of previous
value of the symbol that is being retrieved as an argument and it must
return either a stretch of matched text of ``None`` to indicate a
non-match. The function ``matching_bracket()`` is already defined in
:py:mod:`DHParser.parse`. Slightly simplified to cover only the case of
curly braces, it looks like this::

    >>> from typing import List, Optional
    >>> def matching_bracket(text: str, stack: List[str]) -> Optional[str]:
    ...     value = stack[-1]
    ...     value = value.replace("{", "}")
    ...     return value if text[:len(value)] == value else None

    >>> markup_parser = create_parser(bracesExampleEBNF)
    >>> result = markup_parser(markup)
    >>> print(result.as_sxpr())
    (document
      (text "This ist a code example: ")
      (codeblock
        (braces
          (opening_braces "{{"))
        (text " mapping = ")
        (opening_braces "{")
        (text " 'a': 1, 'b': 2")
        (closing_braces "}")
        (text " ")
        (braces "}}"))
      (text " with braces."))

Here, the outer double braces "{{" and "}}" open up and close a new code
block and could be discarded as delimiters during the
AST-transformation, while the opening and closing braces within the code
block are simply that: opening and closing braces.

Advanced usages
^^^^^^^^^^^^^^^

Apart from the Pop- and Retrieve-operator, DHParser offers a third
retrieval operator that, like the Pop-operator, "pops" that last value
from the stack and matches either this value or the empty strings. In
other word, this operator always matches, as long as there is still a
value on the stack, but it captures the beginning of the following text
only if it matches the stored value. A non-match only happens, when the
stack has already been exhausted. This "Pop anyways"-operator is denoted
by a colon followed by a question mark ``:?``. Weird as this may sound,
this operator has astonishingly manifold use cases. Think for example of
a modification of our minimal pseudo-XML parser the allows coders to
omit the tag name in closing tags to save them some typing:

    >>> miniXML = miniXML.replace('::TagName', ':?TagName')
    >>> parseXML = create_parser(miniXML)
    >>> data_tree = parseXML('<line>O Rose thou art sick.</>').as_tree()
    >>> print(data_tree)
    document
      element
        STag
          TagName "line"
        content
          CharData "O Rose thou art sick."
        ETag
          TagName

Another, rather tricky use case is to let the value of certain symbols
be determined on first use by marking all appearances of theses symbols
on the right hand side of the definitions wherein they appear with the
single colon retrieval operator ``:`` and clearing the stack with
``[:?symbol]`` after the end of file has been reached. This technique
has been employed for the "FlexibleEBNF"-parser in the examples folder.
The FlexibleEBNF-parser "magically" adjusts itself to different
syntactical flavors of EBNF. Here is an abbreviated excerpt of the
grammar of this parser, to see how this technique can be used in a
grammar::

    syntax     = ~ { definition } EOF
    definition = symbol §:DEF~ expression :ENDL~
    DEF        = `=` | `:=` | `::=` | `<-` | /:\\n/ | `: `
    ENDL       = `;` | ``
    EOF        = !/./ [:?DEF] [:?ENDL]

This trick can also be used to parse indentation::

    >>> tree_grammar = '''@whitespace = horizontal
    ... @hide = EOF, LF, SAME_INDENT
    ... @drop = strings, whitespace, EOF, LF, SAME_INDENT
    ...
    ... tree     = INDENT node DEDENT /\\s*/ EOF
    ... node     = tag_name [content]
    ... content  = string | children
    ... children = &(LF HAS_DEEPER_INDENT)
    ...            LF INDENT § node { LF SAME_INDENT § node }
    ...            !(LF HAS_DEEPER_INDENT) DEDENT
    ... tag_name = /\\w+/~
    ... string   = '"' § /(?:\\\\"|[^"\\n])*/ '"' ~
    ...
    ... INDENT            = / */
    ... SAME_INDENT       = :INDENT § !/ /
    ... HAS_DEEPER_INDENT = :INDENT / +/
    ... DEDENT            = &:?INDENT
    ...
    ... LF       = /\\n/
    ... EOF      = !/./
    ... '''
    >>> tree_parser = create_parser(tree_grammar)
    >>> syntax_tree = tree_parser(data_tree)
    >>> # show but the first 22 lines of the syntax-tree:
    >>> print('\n'.join(syntax_tree.as_sxpr().split('\n')[:22] + ['...']))
    (tree
      (INDENT)
      (node
        (tag_name "document")
        (content
          (children
            (INDENT "  ")
            (node
              (tag_name "element")
              (content
                (children
                  (INDENT "    ")
                  (node
                    (tag_name "STag")
                    (content
                      (children
                        (INDENT "      ")
                        (node
                          (tag_name "TagName")
                          (content
                            (string "line")))
                        (DEDENT))))
    ...

In case you are surprised by the size of the resulting syntax-tree,
keep in mind that the syntax-tree or "parse-tree" of the
serialization of a data structure is not the data-structure itself,
even if it happens to be a tree. However, the data-structure can
be retrieved from the tree.

As can be seen the INDENT-elements have captured indentation of
various increasing length. Also, observe the use of the
"querying parser" HAS_DEEPER_INDENTATION. A "querying parser" is
a parser that is meant to be used only inside a negative or
positive lookahead. The first query with HAS_DEEPER_INDENTATION
makes sure that the next call to INDENT will indeed capture
a longer stretch of whitespace than represented by its present
value, i.e. the last value on the "INDENT"-symbol's value-stack.

DHParser makes sure that the value-stack of each captured
variable will properly be un-winded when returning from a
non-match to an earlier position in the document. However, there
is one case where this does not work, namely, if a symbol has matched
the empty string. Because value-stack unwinding is tied to the
position in the text instead of the call stack (which DHParser
only keeps track of when testing or debugging), it is impossible
to decide where a zero-length capture must be un-winded or not
in case DHParser returns exactly to the position of that capture.
In this case DHParser does not unwind the capture. This can lead
to a non-empty value stack at the end of the parsing process which
will be reported as an error. There are three strategies to avoid
this error:

1. Make sure that all symbols for which values are retrieved
   capture at least one character of text if they match.

   In this example above, the symbol INDENT would have to be
   defined as ``INDENT = / +/`` to achieve this. However, this
   would have made the formulation of the tree-grammar of this
   example more complicated.

2. Make sure that in all cases where zero-length text could
   possibly be captured a subsequent non-match cannot occur
   before the zero-length value has been retrieved.

   In the example above, this situation can only occur in the
   parser ``tree`` and could be possibly avoided with a lookahead, e.g.
   ``tree = &(/ */ tag_name) INDENT node DEDENT``. (However,
   since this avoidance strategy does not need to cover cases
   where the document is syntactically incorrect, anyway, we can
   rely on the fact that - since a syntactically correct document
   must contain a single root node at the very beginning - the
   parser will never retreat to the first captured INDENT.)

3. As a last resort one could clear the stack with one or more
   optional ``[DEDENT]`` parsers that can safely be added after
   ``EOF``. A modification that allows the document to consist
   of one or more top-level nodes would then look like:
   ``tree = { INDENT node DEDENT }+ /\\s*/ EOF [DEDENT]``


The problem of error resumption
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Resuming parsing after an error has occurred is complicated
because, if done badly, it can lead to consequential errors, i.e.
errors that are artifacts of the resumption rule or algorithm
but not really errors in source text. This can be quite
confusing and if worst comes to worst one cannot trust any
error message after the first error messages which in turn defies
the whole point of resuming parsing rather than breaking off
after the first error.

Unfortunately, consequential errors are even harder to avoid
when using context sensitive parsers. Say, we treat our
XML-Parser with a little mistake::

    >>> xmldoc = '''
    ... <doc>
    ...     <title>Heading</litle>
    ...     A few lines of Text
    ... </doc>'''
    >>> result = parseXML(xmldoc)
    >>> for e in result.errors_sorted: print(e)
    3:21: Error (1010): »>« expected by parser 'ETag', but »litle>\n   ...« found instead!
    3:21: Error (1040): Parser "document" stopped before end, at: »litle>\n   ...« Terminating parser.
    5:7: Error (1050): Capture-stack not empty after end of parsing: TagName 1 item


Since our original mini-XML-grammar did not contain any
error-resumption-directives, this is what we'd expect.
Now let's add a resume-directive to continue after misspelled
tag-names::

    >>> miniXML = '''
    ... @ hide   = EOF
    ... @ drop   = EOF, whitespace, strings
    ... document = ~ element ~ §EOF
    ... @element_resume = ('</' /\\w+/ '>')
    ... element  = STag §content ETag
    ... STag     = '<' TagName §'>'
    ... ETag     = '</' ::TagName §'>'
    ... TagName  = /\\w+/
    ... content  = [CharData] { (element | COMMENT__) [CharData] }
    ... CharData = /(?:(?!\\]\\]>)[^<&])+/
    ... EOF      =  !/./
    ... '''
    >>> parseXML = create_parser(miniXML)
    >>> result = parseXML(xmldoc)
    >>> for e in result.errors_sorted: print(e)
    3:19: Error (1010): ETag = `</` ::TagName "title" § `>` expected by parser 'element', but »</litle>\n ...« found instead!
    5:1: Error (1010): ETag = `</` ::TagName "title" § `>` expected by parser 'element', but »</doc>« found instead!

The second error is merely a consequential error. One can imagine that,
had the mistake of misspelling the ending tag occurred deeper in the XML
hierarchy then a consequential error for every closing tag after the erroneous
would have been reported, because the name of the opening tag corresponding
to the misspelled closing tag has - because of the misspelling - never been
removed from the stack.

We can try to avoid this problem by "popping" one element from the stack
within our resume rule. This is of course only possible, if we specify our
resume rule as full PEG-expression, not just a regular expression::

    >>> new_resume_rule = r"@element_resume = (:?TagName '</' /\w+/ '>')"

The regular expression ``/\s*(?=>)/`` merely has the purpose to cover
ending-tags that fail to parse because of superfluous blanks before the
closing ``>``. The important part is the "pop anyways"-operator
``:?TagName``. Since this operator does not capture any text in case
of a misspelled tag-name, but merely removes it value from the stack,
it it followed by the regular expression ``/\s*(?=>)/`` which moves the
parser forward to the next angular bracket.::

    >>> i = miniXML.find('@element_resume')
    >>> k = miniXML.find('\n', i)
    >>> miniXML = miniXML[:i] + new_resume_rule + miniXML[k:]
    >>> parseXML = create_parser(miniXML)
    >>> result = parseXML(xmldoc)
    >>> for e in result.errors_sorted: print(e)
    3:19: Error (1010): ETag = `</` ::TagName "title" § `>` expected by parser 'element', but »</litle>\n ...« found instead!

If it only was so simple! Let's see what happens if we treat our
parser with another, rather obvious or common mistake: Forgetting
the closing tag (or, what leads to a similar confusion of the parser,
omitting the closing slash ``/`` in an empty tag)::

    >>> xmldoc = '''
    ... <doc>
    ...    <title>Heading <wrong></title>
    ... </doc>'''
    >>> result = parseXML(xmldoc)
    >>> for e in result.errors_sorted: print(e)
    3:26: Error (1010): ETag = `</` ::TagName "wrong" § `>` expected by parser 'element', but »</title>\n<...« found instead!
    4:1: Error (1010): ETag = `</` ::TagName "title" § `>` expected by parser 'element', but »</doc>« found instead!
    4:7: Error (1010): ETag = `</` ::TagName "doc" § `>` expected by parser 'element', but END OF FILE found instead!

In this case the removal of the last value from the stack in the
``@element_resume``-rule results in a cascade of consecutive errors
(or "error-artifacts"), one for every outer closing tag.

Avoiding error-artifacts such as this is an art in itself and
requires to look at the whole picture. As of now, there does not
exist any comprehensive theory, how to do this. A good starting
point might be to take a look at the errors, users of the
domain specific language make most frequently, as well as on
those error messages that confuse them most. (Bug-reports,
help-desk questions or tele-metric data might yield the
required empirical information for this approach.)

A more robust rule-set for our mini-XML-grammar might be::

    >>> miniXML = r'''
    ... @ whitespace  = /\s*/
    ... @ hide   = EOF
    ... @ drop   = EOF, whitespace, strings
    ...
    ... document = ~ element ~ §EOF
    ... @element_resume = (:?TagName (&('</' :TagName '>') | '</' /\w+/ '>'))
    ... element  = STag §content ETag
    ... STag     = '<' TagName §'>'
    ... ETag     = '</' ::TagName §'>'
    ... TagName  = /\w+/
    ... content  = [CharData] { (element | COMMENT__) [CharData] }
    ...
    ... CharData = /(?:(?!\]\]>)[^<&])+/
    ... EOF      =  !/./        # no more characters ahead, end of file reached
    ... '''
    >>> parseXML = create_parser(miniXML)
    >>> result = parseXML(xmldoc)
    >>> for e in result.errors_sorted: print(e)
    3:26: Error (1010): ETag = `</` ::TagName "wrong" § `>` expected by parser 'element', but »</title>\n<...« found instead!

This rule has been designed to cover both of the previous cases. However,
if we combine both errors, its limitations begin to show::

    >>> xmldoc = '''
    ... <doc>
    ...     <title>Heading<wrong></litle>
    ...     A few lines of Text
    ... </doc>'''
    >>> result = parseXML(xmldoc)
    >>> for e in result.errors_sorted: print(e)
    3:26: Error (1010): ETag = `</` ::TagName "wrong" § `>` expected by parser 'element', but »</litle>\n ...« found instead!
    5:1: Error (1010): ETag = `</` ::TagName "title" § `>` expected by parser 'element', but »</doc>« found instead!

Here, it seems almost impossible to avoid an error-cascade in
combination with context-sensitive parsers if met with a combination of
different errors. In this particular case of an XML-parser, the best way
out might be not to use context-sensitive parsers at all and check the
matching XML-tags at a later processing stage.


.. _custom_parsers:

Custom Parsers
--------------

Specifying a grammar in EBNF and generating a parser from this
specification has the advantage that one can get a much better overview
over the grammar than when writing a completely parser directly in
Python. This also makes changing the grammar and explaining it to others
much easier. However, there are cases where writing a parsers at least
for particular components in Python can be advantageous. These include
cases, where it is easier to write a parser in Python, where a
Python-parser might be faster, or where a parser cannot be coded in
EBNF, say, because it requires as Turing-complete programming languages.

Custom-parser example
^^^^^^^^^^^^^^^^^^^^^

As you might remember from one of the macros-examples, above (see
:any:`macro_system`), a kind of parser that is a bit tedious to code in
EBNF is a search parser. In the simplest imaginable case (e.g. ignoring
comments ignoring nested structures, where the search phrase might
occur, but should not be found) you could write ``{ !"search_phrase"
/.|\\n/ }``. Be wary that this search parser could land you in the
middle of a comment, e.g. when "searching" for "rain" in the line ``I'm
singing in # the rain`` (assuming ``#`` as the comment sign). So it is
better to write ``{ !"search_phrase" (~ | /.|\\n/) }``. To keep the
example simple, we'll leave such worries aside in the following::

    >>> search_rain_grammar = '''@reduction = merge
    ...     song = { search_rain count_rain } [search_rain] EOF
    ...     search_rain = { !"rain" /.|\\n/ }
    ...     count_rain  = "rain"
    ...     EOF         = !/.|\\n/'''
    >>> rain_counter = create_parser(search_rain_grammar)
    >>> song = """
    ...     I'm singin' in the rain
    ...     Just singin' in the rain
    ...     What a glorious feeling
    ...     I'm happy again."""
    >>> syntax_tree = rain_counter(song)
    >>> print(syntax_tree.as_sxpr())
    (song
      (search_rain
        ""
        "    I'm singin' in the ")
      (count_rain "rain")
      (search_rain
        ""
        "    Just singin' in the ")
      (count_rain "rain")
      (search_rain
        ""
        "    What a glorious feeling"
        "    I'm happy again.")
      (EOF))
    >>> print(len(list(syntax_tree.select('count_rain'))))
    2

Let's now define a custom parser for "search_rain". A custom parser is a
factory-function that returns a "custom parse function" which in turn is
a function that takes a :py:class:`~stringview.StringView`-object as
input and returns a :py:class:`nodetree.Node`-object or ``None`` as
result. This factory-function can be a derived class from class
:py:class:`parse.Parser` but does not need to be. In the EBNF-grammar is
specified as ``@name(param)`` where ``name`` is the name for the factory
function on the Python side and ``param`` either an identifier or - if
enclosed in quotation marks - a single string parameter on the Python
side, which can be used to configure the factory function.

It is also possible to specify a custom parser function directly instead
of a factory function. However, it must then be specified as the
parameter of the predefined factory function "Custom", e.g.
``@Custom(parse_func)``.

The parser (factory) function itself must be defined in a global
namespace that is reachable from the Grammar-class generated from the
EBNF-code. A good place is the beginning of the generated
"xxxParser.py"-script in case of DHParser-projects created with the
"dhparser"-command. If, as in the following, the parser is created with
:py:func:`~dsl.create_parser`, the custom code should be passed as
Python-source-string to the ``additional_code``-argument of the
:py:func:`~dsl.create_parser`-function. In the following snipped, we,
therefore, write it directly into a string.

    >>> search_src = """
    ... from DHParser.parse import CustomParseFunc
    ... from DHParser.stringview import StringView
    ... def search(what: str) -> CustomParseFunc:
    ...     def parse_func(rest: StringView) -> Optional[Node]:
    ...         i = rest.find(what)
    ...         if i >= 0:
    ...             return Node('', rest[:i])
    ...         else:
    ...             return Node('', rest)
    ...     return parse_func
    ... """

Note, that the name of the returned node ist left empty. It will be
filled in by the :py:class:`~parse.CustomParser`-object either with the
anonymous-parser name ":CustomParser" or the name of the symbol to which
the parser is assigned, because this information is not available within
the function. If, however, the custom parser is defined as a class
derived of :py:class:`~parse.CustomParser`, the right name must be given
when creating the node. In most cases this will simply be
``self.node_name``.

Also, keep in mind that the parameter passed to the parsing function is
of type :py:class:`~stringview.StringView` and not simply a string. This
view represents the remaining part of the document to be parsed.

As said, referring to the custom parser factory function from the grammar is simple::

    >>> search_rain_grammar = '''@reduction = merge
    ...     song = { search_rain count_rain } search_rain EOF
    ...     search_rain = @search("rain")
    ...     count_rain  = "rain"
    ...     EOF         = !/.|\\n/'''

    >>> rain_counter = create_parser(search_rain_grammar,
    ...                              additional_code=search_src)

Let's try using it::

    >>> syntax_tree = rain_counter(song)
    >>> print(syntax_tree.as_sxpr())
    (song
      (search_rain
        ""
        "    I'm singin' in the ")
      (count_rain "rain")
      (search_rain
        ""
        "    Just singin' in the ")
      (count_rain "rain")
      (search_rain
        ""
        "    What a glorious feeling"
        "    I'm happy again.")
      (EOF))

Another good use cases for custom parsers are long lists of fixed
alternatives, say, a list of state-names, e.g. "Afghanistan, Albania,
... Zimbabwe" which would clutter the grammar and slow down parsing. In
particular, if these lists are changed frequently and better directly
fetched from a database then encoded in the grammar which would have to
be recompiled every time there is a change in the data.

Implementing Custom Parsers
^^^^^^^^^^^^^^^^^^^^^^^^^^^

There exist three different ways to implement a custom parser:

**1. via factory-functions** The "standard" way has been shown above,
where you write a factory function that returns the parser. The factory
function is required when you want to pass arguments to your parsing
function, because these will be saved in the closure of the factory
function - as in the example above.

.. caution:: It is only possible to pass a single argument to custom
    parsers in the grammar which can either be a string (enclosed by
    quotation marks) or a symbol (without quotation marks).

    If you want to pass more than one argument, you'll have to encode
    the list or arguments in a single string and decode this string in
    your Python code.

This also works for factory functions without arguments, of course::

    >>> word_parser_factory = r"""
    ... def parse_word() -> CustomParseFunc:
    ...     RX_WORD = re.compile(r'\w+')
    ...     def inner_parse_word(rest: StringView) -> Optional[Node]:
    ...         m = rest.match(RX_WORD)
    ...         if m:
    ...             return Node('', rest[:m.end()])
    ...         return None
    ...     return inner_parse_word"""

As noted earlier, the parsing function (i.e. "inner_parse_word")
receives a :py:class:`~stringview.StringView` on the remaining part of
the document as a parameter and that it should return a Node with an
empty name, so that the proper node-name can be filled in by DHParser's
:py:class:`~parse.CustomParser`-wrapper.

In a grammar this custom-parser needs to be invoked with brackets "()",
but without argument::

    >>> word_grammar = """@reduction = merge
    ...     document = @parse_word()"""

    >>> word_parser = create_parser(word_grammar, additional_code=word_parser_factory)
    >>> print(word_parser("Hello").as_sxpr())
    (document "Hello")

**2. as a simple parser function** If no arguments need to be passed
to the custom parser it can also be implemented without a
factory-wrapper, just as a simple parsing function::

    >>> plain_word_parser = r"""
    ... def parse_word(rest: StringView) -> Optional[Node]:
    ...     m = rest.match(re.compile(r'\w+'))
    ...     if m:
    ...         return Node('', rest[:m.end()])
    ...     return None"""

If the custom-parsing function is implemented without a wrapper in the
Python-code, it needs to be wrapped into a `@Custom(...)`-clause within
the grammar, however. This tells DHParser that it should not expect the
parsing-function invoked in the grammar to be factory-function::

    >>> word_grammar = """@reduction = merge
    ...     document = @Custom(parse_word)"""

    >>> word_parser = create_parser(word_grammar, additional_code=plain_word_parser)
    >>> print(word_parser("Hello").as_sxpr())
    (document "Hello")

**3. as a fully implemented parsing class** Finally, custom parsers can
also be implemented as descendants of :py:class:`~parse.Parse`. This is
the most complicated case, because writing a derivative class of
:py:class:`~parse.Parse` or :py:class:`parse.UnaryParser` or
:py:class:`~parse.NaryParser` requires to pay attention to all kinds of
requirements, as for example, the need for a ``__deepcopy__``-method.
Therefore, implementing one's own Parse-classes should be avoided if
possible.

Nevertheless, here is the previous example implemented with A
custom parser class::

   >>> word_parser_class = """from DHParser.parse import Parser, copy_parser_base_attrs
   ... class ParseWord(Parser):
   ...     def __init__(self):
   ...         super().__init__()
   ...         self.RX_WORD = re.compile(r'\\w+')
   ...
   ...     def __deepcopy__(self, memo):
   ...         duplicate = self.__class__()
   ...         copy_parser_base_attrs(self, duplicate)
   ...         return duplicate
   ...
   ...     def _parse(self, location) -> Tuple[Optional[Node], int]:
   ...         document_view = self.grammar.document__
   ...         m = document_view.match(self.RX_WORD, location)
   ...         if m:
   ...             return Node(self.node_name, document_view[location: m.end()]), m.end()
   ...         return None, location"""

    >>> word_grammar = """@reduction = merge
    ...     document = @ParseWord()"""

    >>> word_parser = create_parser(word_grammar, additional_code=word_parser_class)
    >>> print(word_parser("Hello").as_sxpr())
    (document "Hello")

In contrast to the former two methods, implementing one's own custom
Parser-classes has the benefits of giving access to the
:py:class:`~parse.Grammar`-object and to allow the manipulation of the location
pointer. A possible use case is when you'd like to store the
unstructured source-data of some part of the source-document in the
parsing tree, next to the data structured by the parser::

    >>> save_source = """from DHParser.parse import Parser
    ... RX_LINE = re.compile(r'[^ \\t\\n]+(?:[ \\t]*[^ \\t\\n]+)*')
    ...
    ... class SaveSource(Parser):
    ...     def _parse(self, location) -> Tuple[Optional[Node], int]:
    ...         match = RX_LINE.match(self.grammar.text__, location)
    ...         source_code = match.group(0) if match else ''
    ...         # return unstructured source but do not move location pointer forward!
    ...         return Node('Quelle', source_code), location"""

    >>> grammar = """@reduction = merge
    ...     document = line { /\\\\n/ line }
    ...     line = @SaveSource() word { ' ' word }
    ...     word = /\\w+/"""

    >>> parser = create_parser(grammar, additional_code=save_source)
    >>> print(parser("Hello World").as_sxpr())
    (document (line (Quelle "Hello World") (word "Hello") (:Text " ") (word "World")))


Semantic Actions
^^^^^^^^^^^^^^^^

Finally, custom parsers can also be used (or be abused) for semantic
actions of any kind. Tracing certain points of the parser by dumping
debug-information would be one reasonable use case of this kind. When
storing or manipulating data in your custom parser or producing other
side effects, you should be aware that even if the parser matches, it's
results may be thrown away later, because the recursive descent parser
pursues an other branch. There is no support by DHParser for filling and
unwinding data-stacks or the like during parsing - with the exception
the :ref:`context sensitive parsers <context_sensitive_parsers>`
described above. If you want to store data at all, it is best to save
the data in the attributes of the nodes that your custom function
returns, because then you can be sure that only data from valid nodes
survives to the end of the parsing process.

However, since parser functions or classes do not have access to the
parsing tree during parsing, any variables stored in this way will be
accessible during parsing only to custom parsers that directly or
indirectly call other custom parsers (via the sub-trees returned from
the called parsers). This does not mean that storing variables in
attributes during parsing is entirely useless. Variables stored in the
node's attributes will be accessible without any restrictions only in
the processing stages following the parsing-stage (see
:any:`processing_pipelines`).


Now, let's not forget to restore the turned off optimizations (see above)::

    >>> set_config_value('optimizations', save_optimizations)


*Classes and Functions-Reference*
---------------------------------

Usually, the functions and classes of module :py:mod:`DHParser.ebnf` are not
called directly, but the helper functions from module :py:mod:`DHParser.dsl`
like :py:func:`DHParser.dsl.create_parser` will be called or the
`dhparser`-script will be used from the command line. Therefore, no
commented abbreviated reference of the most important functions of
module ``ebnf`` will be provided here. See the source code documentation
of module :py:mod:`DHParser.ebnf` for a full reference of all functions and
classes.

For DHParser's EBNF-dirctives, seee the :ref:`Reference of EBNF-directives <directives_reference>`

.. _json: https://www.json.org/
.. _re: https://docs.python.org/3/library/re.html
.. _regex: https://github.com/mrabarnett/mrab-regex