EBNF-Grammars
Module ebnf provides an EBNF-parser-generator that compiles an
EBNF-Grammar into Python-code that can be executed to parse source text
conforming to this grammar into concrete syntax trees.
Writing Grammar-Definitions
Specifying Grammars with EBNF
With DHParser, Grammars can be specified either directly in Python-code
(see parse) or in one of several EBNF-dialects. (Yes,
DHParser supports several different variants of EBNF! This makes it easy
to crate a parser directly from Grammars found in external sources.)
“EBNF” stands for the “Extended-Backus-Naur-Form” which is a common
formalism for specifying Grammars for context-free-languages.
(see EBNF in Wikipedia)
The recommended way of compiling grammars with DHParser is to either write the EBNF-specification for that Grammar into a text-file and then compile EBNF-source to an executable as well as importable Python-module with the help of the “dhparser”-script. Or, for bigger projects, to create a new domain-specific-language-project with the DHParser-script as described in the step-by-step-guide.
However, here we will show how to compile an EBNF-specified grammar from within Python-code and how to execute the parser that was generated by compiling the grammar.
As an example, we will realize a json-parser. Let’s start with creating some test-data:
>>> testobj = {'array': [1, 2.0, "a string"], 'number': -1.3e+25, 'bool': False}
>>> import json
>>> testdata = json.dumps(testobj)
>>> testdata
'{"array": [1, 2.0, "a string"], "number": -1.3e+25, "bool": false}'
We define the json-Grammar in top-down manner in EBNF. We’ll use a regular-expression look-alike syntax. EBNF, as you may recall, consists of a sequence of symbol definitions. The definiens of those definitions either is a string literal or regular expression or other symbols or a combination of these with four different operators: 1. sequences 2. alternatives 3. options and 4. repetitions. Here is how these elements are denoted in classical and regex-like EBNF-syntax:
element |
classical EBNF |
regex-like |
|---|---|---|
insignificant whitespace |
~ |
~ |
string literal |
“…” or `…` |
“…” or `…` |
regular expr. |
/…/ |
/…/ |
sequences |
A B C |
A B C |
alternatives |
A | B | C |
A | B | C |
options |
[ … ] |
…? |
repetitions |
{ … } |
…* |
one or more |
…+ |
|
grouping |
(…) |
(…) |
positive lookahead |
& … |
& … |
negative lookahead |
! … |
! … |
“insignificant whitespace” is a specialty of DHParser. Denoting
insignificant whitespace with a particular sign ~ allows to eliminate
it already during the parsing process without burdening later
syntax-tree-processing stages with this common task. DHParser offers
several more facilities to restrain the verbosity of the concrete
syntax tree, so that the outcome of the parsing stage comes close (or
at least closer) to the intended abstract-syntax-tree, already.
DHParser also contains a few experimental extensions to the common EBNF and PEG (Parsing-Expression-Grammar) formalisms:
additional operators |
syntax |
|---|---|
interleave |
A ° B |
repetition range |
… (i, k) |
lookbehind operators |
|
postivie lookbehind |
<-& |
negtive lookahead |
<-! |
constext sensitive parsers |
|
pop and match |
:: … |
retrieve and match |
: … |
pop and always match |
:? … |
JSON consists of two complex data types, 1) associative arrays, called “object” and sequences of heterogeneous data, called array; and of four simple data types, 1) string, 2) number, 3) bool and 4) null. The structure of a JSON file can easily be described in EBNF:
>>> grammar = '''
... json = ~ _element _EOF
... _EOF = /$/
... _element = object | array | string | number | bool | null
... object = "{" ~ member ( "," ~ §member )* "}" ~
... member = string ":" ~ _element
... array = "[" ~ ( _element ( "," ~ _element )* )? "]" ~
... string = `"` _CHARS `"` ~
... _CHARS = /[^"\\\\]+/ | /\\[\\/bnrt\\]/
... number = _INT _FRAC? _EXP? ~
... _INT = `-`? ( /[1-9][0-9]+/ | /[0-9]/ )
... _FRAC = `.` /[0-9]+/
... _EXP = (`E`|`e`) [`+`|`-`] /[0-9]+/
... bool = "true" ~ | "false" ~
... null = "null" ~ '''
This is a rather common EBNF-grammar. A few peculiarities are noteworthy, though:
First of all you might notice that some components of the grammar
(or “production rules” as they are commonly called) have names with a leading
underscore _. It is a convention to mark those elements, in which we are on
interested on their own account, with an underscore _. When moving from the
concrete syntax-tree to a more abstract syntax-tree, these elements could be
substituted by their content, to simplify the tree.
Secondly, some production rules carry a name written in capital letters. This is also
a convention to mark those symbols which with other parser-generators would
represent tokens delivered by a lexical scanner. DHParser is a “scanner-less”
parser, which means that the breaking down of the string into meaningful tokens
is done in place with regular expressions (like in the definition of _EOF)
or simple combinations of regular expressions (see the definition of _INT above).
Their is no sharp distinction between tokens and other symbols in DHParser,
but we keep it as a loose convention. Regular expressions are enclosed in forward
slashes and follow the standard syntax of Perl-style regular expression that is
also used by the “re”-module of the Python standard library. (Don’t worry about
the number of backslashes in the line defining _CHARS for now!)
Finally, it is another helpful convention to indent the definitions of symbols
that have only been introduced to simplify an otherwise unnecessarily
complicated definition (e.g. the definition of number, above) or to make
it more understandable by giving names to its components (like _EOF).
Let’s try this grammar on our test-string. In order to compile
this grammar into executable Python-code, we use the high-level-function
create_parser() from the dsl-module.
>>> from DHParser.dsl import create_parser
>>> parser = create_parser(grammar, branding="JSON")
>>> syntax_tree = parser(testdata)
>>> syntax_tree.content
'{"array": [1, 2.0, "a string"], "number": -1.3e+25, "bool": false}'
As expected serializing the content of the resulting syntax-tree yields exactly the input-string of the parsing process. What we cannot see here, is that the parser has structured the string into the individual elements described in the grammar. Since the concrete syntax-tree that the parser yields is rather verbose, it would not make sense to print it out. We’ll just look at a small part of it, to see what it looks like. Let’s just pick the sub-tree that captures the first json-array within the syntax-tree:
>>> print(syntax_tree.pick('array').as_sxpr())
(array
(:Text "[")
(_element
(number
(_INT "1")))
(:Text ",")
(:Whitespace " ")
(_element
(number
(_INT "2")
(_FRAC
(:Text ".")
(:RegExp "0"))))
(:Text ",")
(:Whitespace " ")
(_element
(string
(:Text '"')
(_CHARS "a string")
(:Text '"')))
(:Text "]"))
The nodes of the syntax-tree carry the names of the production rules
by which they have been generated. Nodes, that have been created by
components of a production receive the name of of the parser-type
that has created the node (see parse) prefixed
with a colon “:”. In DHParser, these nodes are called “anonymous”,
because they lack the name of a proper grammatical component.
Chaining Grammars with @include
While it is always advisable to keep a grammar definition short and concise, so that it neatly fits within one file, the need might arise to reuse parts of a grammar definition in different grammar. While this can be done with copy & paste, using copy & past has the disadvantage that if the part that has been copied will be changed later (if only because an error has been detected and needs to be corrected), it must be copied to all places where it is used, once again.
In order to avoid such complications, DHParser supports a modest kind of
modularization of grammar-definitions via a simple include mechanism.
Simply add @include = "FILENAME"-directive at the place where
you’d like to include another grammar file.
Say, we’d like to make the grammar-definition of floating-point numbers in our JSON-grammar reusable. In order to do so, we can write it to a dedicated file:
>>> number_ebnf = '''
... number = _INT _FRAC? _EXP? ~
... _INT = `-`? ( /[1-9][0-9]+/ | /[0-9]/ )
... _FRAC = `.` /[0-9]+/
... _EXP = (`E`|`e`) [`+`|`-`] /[0-9]+/ '''
>>> with open('number_include.ebnf', 'w', encoding='utf-8') as f: _ = f.write(number_ebnf)
Now, we can simply include this file an any grammar:
>>> json_grammar_with_include = '''
... json = ~ _element _EOF
... _EOF = /$/
... _element = object | array | string | number | bool | null
... object = "{" ~ member ( "," ~ §member )* "}" ~
... member = string ":" ~ _element
... array = "[" ~ ( _element ( "," ~ _element )* )? "]" ~
... string = `"` _CHARS `"` ~
... _CHARS = /[^"\\\\]+/ | /\\[\\/bnrt\\]/
... @include = "number_include.ebnf"
... bool = "true" ~ | "false" ~
... null = "null" ~ '''
>>> parser = create_parser(json_grammar_with_include, branding="JSON")
>>> syntax_tree = parser(testdata)
>>> syntax_tree.content
'{"array": [1, 2.0, "a string"], "number": -1.3e+25, "bool": false}'
The following command merely removes the grammar on the hard-disk, so that our doctest-run does not leave any trails:
>>> import os; os.remove('number_include.ebnf')
It should be noted that DHParser’s include-mechanism does not (yet) support any kind of namespaces, so you have to make sure yourself that the same symbol names are not accidentally defined both in the including and included file(s).
Using DHParser’s include-mechanism for modularizing your EBNF-grammar(s) is
a different thing than adding an include mechanism to your domain specific
language. For the latter, see the manual of the DHParser.preprocess-module.
Simplifying Syntax-Trees while Parsing
Usually, anonymous nodes are what you want to get rid of in the course
of transforming the concrete syntax-tree into an abstract syntax-tree.
(See transform). DHParser already eliminates per
default all anonymous nodes that are not leaf-nodes by replacing them
with their children during parsing. Anonymous leaf-nodes will be
replaced by their content, if they are a single child of some parent,
and otherwise be left in place. Without this optimization, each
construct of the EBNF-grammar would leave a node in the syntax-tree:
>>> from DHParser.parse import CombinedParser, TreeReduction
>>> _ = TreeReduction(parser.json, CombinedParser.NO_TREE_REDUCTION)
>>> syntax_tree = parser(testdata)
>>> print(syntax_tree.pick('array').as_sxpr())
(array
(:Text "[")
(:Option
(:Series
(_element
(number
(_INT
(:Alternative
(:RegExp "1")))))
(:ZeroOrMore
(:Series
(:Text ",")
(:Whitespace " ")
(_element
(number
(_INT
(:Alternative
(:RegExp "2")))
(:Option
(_FRAC
(:Text ".")
(:RegExp "0"))))))
(:Series
(:Text ",")
(:Whitespace " ")
(_element
(string
(:Text '"')
(_CHARS
(:RegExp "a string"))
(:Text '"')))))))
(:Text "]"))
This can be helpful for understanding how parsing that is directed by
an EBNF-grammar works (next to looking at the logs of the complete
parsing-process, see trace), but other than that it
is advisable to streamline the syntax-tree as early on as possible,
because the processing time of all subsequent tree-processing stages
increases with the number of nodes in the tree.
Because of this, DHParser offers further means of simplifying
syntax-trees during the parsing stage, already. These are not turned
on by default, because they allow to drop content or to remove named
nodes from the tree; but they must be turned on by “directives” that
are listed at the top of an EBNF-grammar and that guide the
parser-generation process. DHParser-directives always start with an
@-sign. For example, the @drop-directive advises the parser to
drop certain nodes entirely, including their content. In the following
example, the parser is directed to drop all insignificant whitespace:
>>> drop_insignificant_wsp = '@drop = whitespace \n'
Directives look similar to productions, only that on the right hand
side of the equal sign follows a list of parameters. In the case
of the drop-directive these can be either names of non-anonymous
nodes that shall be dropped or one of four particular classes of
anonymous nodes (strings, backticked, regexp, whitespace) that
will be dropped.
Another useful directive advises the parser to treat named nodes as anonymous nodes and to eliminate them accordingly during parsing. This is useful, if we have introduced certain names in our grammar only as placeholders to render the definition of the grammar a bit more readable, not because we are interested in the text that is captured by the production associated with them in their own right:
>>> disposable_symbols = '@disposable = /_\\w+/ \n'
Instead of passing a comma-separated list of symbols to the directive, which would also have been possible, we have leveraged our convention to prefix unimportant symbols with an underscore “_” by specifying the symbols that shall by anonymized with a regular expression.
Now, let’s examine the effect of these two directives:
>>> refined_grammar = drop_insignificant_wsp + disposable_symbols + grammar
>>> parser = create_parser(refined_grammar, 'JSON')
>>> syntax_tree = parser(testdata)
>>> syntax_tree.content
'{"array":[1,2.0,"a string"],"number":-1.3e+25,"bool":false}'
You might have noticed that all insignificant whitespaces adjacent to the delimiters have been removed this time (but, of course not the significant whitespace between “a” and “string” in “a string”). And the difference, the use of these two directives makes, is even more obvious, if we look at (a section of) the syntax-tree:
>>> print(syntax_tree.pick('array').as_sxpr())
(array
(:Text "[")
(number "1")
(:Text ",")
(number
(:RegExp "2")
(:Text ".")
(:RegExp "0"))
(:Text ",")
(string
(:Text '"')
(:RegExp "a string")
(:Text '"'))
(:Text "]"))
This tree looks more streamlined. But it still contains more structure than we might like to see in an abstract syntax tree. In particular, it still contains als the delimiters (“[”, “,”, ‘”’, …) next to the data. But other than in the UTF-8 representation of our json data, the delimiters are not needed any more, because the structural information is now retained in the tree-structure.
So how can we get rid of those delimiters? The rather coarse-grained tools that DHParser offers in the parsing stage require some care to do this properly.
The @drop-directive allows to drop all unnamed strings (i.e. strings
that are not directly assigned to a symbol) and back-ticked strings (for
the difference between strings and back-ticked strings, see below) and
regular expressions. However, using @drop = whitespace, strings, backticked
would also drop those parts captured as string that contain data:
>>> refined_grammar = '@drop = whitespace, strings, backticked \n' \
... + disposable_symbols + grammar
>>> parser = create_parser(refined_grammar, 'JSON')
>>> syntax_tree = parser(testdata)
>>> print(syntax_tree.pick('array').as_sxpr(flatten_threshold=0))
(array
(number "1")
(number
(:RegExp "2")
(:RegExp "0"))
(string "a string"))
Here, suddenly, the number “2.0” has been turned into “20”! There are three ways to get around this problem:
Assigning all non-delimiter strings to symbols. In this case we would have to rewrite the definition of “number” as such:
number = _INT _FRAC? _EXP? ~ _INT = _MINUS? ( /[1-9][0-9]+/ | /[0-9]/ ) _FRAC = _DOT /[0-9]+/ _EXP = (_Ecap|_Esmall) [_PLUS|MINUS] /[0-9]+/ _MINUS = `-` _PLUS = `+` _DOT = `.` _Ecap = `E` _Esmall = `e`
A simpler alternative of this technique would be to make use of the fact that the document-parts captured by regular expression are not dropped (although regular expressions can also be listed in the @drop-directive, if needed) and that at the same time delimiters are almost always simple strings containing keywords or punctuation characters. Thus, one only needs to rewrite those string-expressions that capture data as regular expressions:
number = _INT _FRAC? _EXP? ~ _INT = /[-]/ ( /[1-9][0-9]+/ | /[0-9]/ ) _FRAC = /[.]/ /[0-9]+/ _EXP = (/E/|/e/) [/[-+]/] /[0-9]+/
Assigning all delimiter strings to symbols and drop the nodes and content captured by these symbols. This means doing exactly the opposite of the first solution. Here is an excerpt of what a JSON-grammar employing this technique would look like:
@disposable = /_\\w+/ @drop = whitespace, _BEGIN_ARRAY, _END_ARRAY, _KOMMA, _BEGIN_OBJECT, ... ... array = _BEGIN_ARRAY ~ ( _element ( _KOMMA ~ _element )* )? §_END_ARRAY ~ ...
It is important that all symbols listed for dropping are also made disposable, either by listing them in the disposable-directive as well or using names that the regular-expressions for disposables matches. Otherwise, DHParser does not allow to drop the content of named nodes, because the default assumption is that symbols in the grammar are defined to capture meaningful parts of the document that contain relevant data.
Bailing out and leaving the further simplification of the syntax-tree to the next tree-processing stage which, if you follow DHParser’s suggested usage pattern, is the abstract-syntax-tree-transformation proper and which allows for a much more fine-grained specification of transformation rules. See
transform.
To round this section up, we present the full grammar for a streamlined
JSON-Parser according to the first solution-strategy. Observe, that the
values of “bool” and “null” are now defined with regular expressions
instead of string-literals, because the latter would be dropped because
of the @drop = ... strings, ...-directive, leaving an empty named node
without a value, whenever a bool value or null occurs in the input:
>>> json_gr = '''
... @disposable = /_\\w+/
... @drop = whitespace, strings, backticked, _EOF
... json = ~ _element _EOF
... _EOF = /$/
... _element = object | array | string | number | bool | null
... object = "{" ~ member ( "," ~ §member )* "}" ~
... member = string ":" ~ _element
... array = "[" ~ ( _element ( "," ~ _element )* )? "]" ~
... string = `"` _CHARS `"` ~
... _CHARS = /[^"\\\\]+/ | /\\[\/bnrt\\]/
... number = _INT _FRAC? _EXP? ~
... _INT = /[-]/? ( /[1-9][0-9]+/ | /[0-9]/ )
... _FRAC = /[.]/ /[0-9]+/
... _EXP = /[Ee]/ [/[-+]/] /[0-9]+/
... bool = /true/ ~ | /false/ ~
... null = /null/ ~ '''
>>> json_parser = create_parser(json_gr, 'JSON')
>>> syntax_tree = json_parser(testdata)
>>> print(syntax_tree.pick('array').as_sxpr(flatten_threshold=0))
(array
(number "1")
(number
(:RegExp "2")
(:RegExp ".")
(:RegExp "0"))
(string "a string"))
This time the data is not distorted, any more. One oddity remains, however: We are most probably not interested in the fact that the number 2.0 consists of three components, each of which hast been captured by a regular expression. Luckily, there exists yet another directive that allows to reduce the tree further by merging adjacent anonymous leaf-nodes:
>>> json_gr = '@reduction = merge \n' + json_gr
>>> json_parser = create_parser(json_gr, 'JSON')
>>> syntax_tree = json_parser(testdata)
>>> print(syntax_tree.as_sxpr())
(json
(object
(member
(string "array")
(array
(number "1")
(number "2.0")
(string "a string")))
(member
(string "number")
(number "-1.3e+25"))
(member
(string "bool")
(bool "false"))))
Merging adjacent anonymous leaf-nodes takes place after the @drop-directive
comes into effect. It should be observed that merging only produces the desired
result, if any delimiters have been dropped previously, because otherwise
delimiters would be merged with content. Therefore, the @reduction = merge-
directive should at best only be applied in conjunction with the @drop and
@disposable-directives.
There are for possible values for the @direction-directive:
value |
effect |
|---|---|
|
no tree-reduction during parsing stage |
|
flatten anonymous nodes |
|
flatten anonymous nodes and merge anonymous leaf-nodes when all siblings are anonymous leaf-nodes |
|
flatten anonymous nodes and merge anonymous leaf-nodes |
Applying any of the here described tree-reductions (or “simplifications” for that matter) requires a bit of careful planning concerning which nodes will be named and which nodes will be dropped. This, however, pays off in terms of speed and a considerably simplified abstract-syntax-tree generation stage, because most of the unnecessary structure of concrete-syntax-trees has already been eliminated at the parsing stage.
Lookahead and Lookbehind
Lookahead and lookbehind operators are a convenient way to resolve or rather avoid ambiguities while at the same time keeping the DSL lean. Assume for example a simple DSL for writing definitions like:
>>> definitions = '''
... dog := carnivorous quadrupel that barks
... human := featherless biped'''
Now, let’s try to draw up a grammar for “definitions”:
>>> def_DSL_first_try = ''' # WARNING: This grammar doesn't work, yet!
... @literalws = right
... definitions = ~ definition { definition } EOF
... definition = definiendum ":=" definiens
... definiendum = word
... definiens = word { word }
... word = /[a-z]+|[A-Z][a-z]*/~
... EOF = /$/ '''
>>> def_parser = create_parser(def_DSL_first_try, "defDSL")
Parsing our example with the generated parser yields an error, however:
>>> syntax_tree = def_parser(definitions)
>>> for e in syntax_tree.errors_sorted: print(e)
3:11: Error (1040): Parser "word->/[a-z]+|[A-Z][a-z]*/" did not match: »:= featherless biped«
The reason for this error is that the parser definiens captures as many
words as occur in a sequence, including the definiendum of the next definition
which is the word “human”. But then the next definition does not find it
definiendum, any more, because it has already been captured. (All this may not
easily become clear from the error message itself, but can easily be found
out by using the post-mortem debugger of module trace.)
An common technique to avoid this problem would be to introduce an end-of-statement, for example, a semicolon “;”. A more elegant way to solve the problem in this case is to make use of the fact that if a word is followed by the definition sign “:=” it cannot be part of the definiens any more, but must be a definiendum. This can be encoded by using the negative look-ahead operator “!”:
>>> def_DSL = def_DSL_first_try.replace('definiens = word { word }',
... 'definiens = word { word !":=" }')
>>> def_parser = create_parser(def_DSL, "defDSL")
>>> syntax_tree = def_parser(definitions)
>>> for d in syntax_tree.select('definition'):
... print(f'A {d["definiendum"]} is a {str(d["definiens"]).strip()}')
A dog is a carnivorous quadrupel that barks
A human is a featherless biped
The statement word !":=" is a sequence of a word and a negative lookahead.
This whole sequence only matches, if word matches and the negative looakahead
matches, which is only the case of the following text cannot be matched by “:=”.
We could have achieved the same effect with a positive lookahead by checking
whether any of the possible follow-up-sequences of parser definiens ensues:
>>> def_DSL = def_DSL_first_try.replace('definiens = word { word }',
... 'definiens = word { word &(word|EOF) }')
>>> def_parser = create_parser(def_DSL, "defDSL")
>>> syntax_tree = def_parser(definitions)
>>> for d in syntax_tree.select('definition'):
... print(f'A {d["definiendum"]} is a {str(d["definiens"]).strip()}')
A dog is a carnivorous quadrupel that barks
A human is a featherless biped
Generally, lookahead operators, whether positive or negative, never capture any text. They merely match or fail depending on whether the following parser would match or would fail to match the next piece of text. The positive lookahead matches, if the parser would match. The negative operator matches, if it would fail. Lookahead operators can also be though of as a boolean condition on the following text, where the positive lookahead operator “&” resembles an and, “and” the negative lookahead operator an “and not”. As in our example, these operators are very helpful for “exploring” the surroundings of a piece of text to be captured by a parser. They allow parsers to match or fail depending on the ensuing text.
A negative lookahead expression can also serve to encode the meaning of “without” if placed in front of another expression. Let’s rewrite our grammar of a definitions-DSL so as to exclude certain bad words:
>>> def_DSL = def_DSL[:def_DSL.find('word =')] + '''
... word = !forbidden /[a-z]+|[A-Z][a-z]*/~
... forbidden = /[sf][a-z][a-z][a-z]/~
... EOF = /$/ '''
>>> def_parser = create_parser(def_DSL, "defDSL")
>>> syntax_tree = def_parser('nice := nice word')
>>> print(syntax_tree)
nice := nice word
>>> syntax_tree = def_parser('sxxx := bad word')
>>> print(str(syntax_tree).strip())
<<< Error on "sxxx := bad word" | Parser "word->!forbidden" did not match: »sxxx := bad word« >>>
The same effect can be achieved by using the subtraction operator “-”. This is just syntactic sugar make the use of the negative lookahead operator in the sense of “without” more intuitive:
>>> def_DSL = def_DSL[:def_DSL.find('word =')] + '''
... word = all_words - forbidden
... all_words = /[a-z]+|[A-Z][a-z]*/~
... forbidden = /[sf][a-z][a-z][a-z]/~
... EOF = /$/ '''
>>> def_parser = create_parser(def_DSL, "defDSL")
>>> syntax_tree = def_parser('sxxx := bad word')
>>> print(str(syntax_tree).strip())
<<< Error on "sxxx := bad word" | Parser "word->!forbidden" did not match: »sxxx := bad word« >>>
Next to the lookahead operators, there also exist lookback operators. Be warned, though, that look back operators are an experimental feature in DHParser and that their implementation is highly idiosyncratic, that is, it is most likely not compatible with any other parser-generator-toolkit based on EBNF-grammers. Also, lookback operators in DHParser are more restricted than lookahead-operators. They can only be used in combination with simple text or regular expression parsers and - here comes the idiosyncratic part - they work in the opposite direction. This means that if you want to check whether a parser is preceded, say, by the keyword “BEGIN”, the text phrase that you have to check for with the lookback parser is actually “NIGEB”. If that still does not put you off, here is how lookback-operators are used: Let’s assume that our definition should not only allow for a definiens but, alternatively for enumerations and that the difference is indicated by using a simple equal sign “=” instead of the definition symbol “:=”. Then using lookback-operators to distinguish the case, we can rewrite our grammar as follows:
>>> def_DSL_extended = '''
... @literalws = right
... definitions = ~ definition { definition } EOF
... definition = definiendum (":=" | "=") (definiens | enumeration)
... definiendum = word
... definiens = <-& /\\s*=:/ word { word &(word|EOF) }
... enumeration = <-& /\\s*=[^:]/ word { word &(word|EOF) }
... word = /[a-z]+|[A-Z][a-z]*/~
... EOF = /$/ '''
>>> def_parser = create_parser(def_DSL_extended, "defDSL")
>>> definitions = '''
... dog := carnivorous quadrupel that barks
... drinks = water beer juice
... human := featherless biped'''
>>> syntax_tree = def_parser(definitions)
>>> print(str(syntax_tree.pick('enumeration')).strip())
water beer juice
The lookback operators are <-& for the positive lookback and <-! for the
negative lookback, each of which must be followed by a regular expression or a string.
Of course, this example is rather wanton and the grammar can easily be rewritten
without the lookback-operators.
Error-Catching
Providing users with proper error information is one of the most tenacious problem when implementing the parser for a domain specific language. There are three different challenges:
Locating the error at the correct position in the source code.
Providing proper error messages that explain the reason for the error.
Resuming the parsing progress after an error has occurred at the nearest possible place without producing artificial follow-up errors.
In the following, DHParser’s techniques for the first two challenges, locating errors and customizing error messages will be described. Techniques for resuming the parsing process after an error occurred or for passing by erroneous passages in the source code will be explained below, under the heading “Fail-tolerant Parsing”.
Farthest-Fail-Heuristics
Without adding any hints to the grammar, DHParser applies only a very basic technique for locating the error if the grammar does not match a text which is known as “farthest failure” and locates the error at the “farthest” position where a parser failed, reporting the last named parser in the call chain (that first reached this location) as the cause of the failure. This approach often works surprisingly well for locating errors, unless the grammar relies to heavy on regular expressions capturing large chunks of text, because the error location works only on the level of the parsing expression grammar not at that of the atomic regular expressions. To see how farthest fail word, consider a parser for simple arithmetic expressions:
>>> arithmetic_grammar = '''@ literalws = right
... arithmetic = expression EOF
... expression = ~ term { ("+" | "-") term }
... term = factor { ("*" | "/") factor }
... factor = number | group
... group = "(" expression ")"
... number = /\\d+/~
... EOF = /$/'''
>>> arithmetic = create_parser(arithmetic_grammar, "arithmetic")
>>> terms = arithmetic('(2 - 3 * (4 + 5)')
>>> print(terms.errors[0])
1:17: Error (1040): Parser "term->`*`" did not match: »«
>>> terms = arithmetic('(2 - 3) * ( )')
>>> print(terms.errors[0])
1:13: Error (1040): Parser "number->/\\d+/" did not match: »)«
As can be seen the location of the error is captured well enough, at least when we keep in mind that the computer cannot guess where we would have placed the forgotten closing bracket. It can only report the point where the mistake becomes apparent.
However, the reported fact that it was the sub-parser `*` of parser term that failed at this location does little to enlighten us with respect to the cause of the failure. The “farthest fail”-method as implemented by DHParser yields the first parser (of possibly several) that has been tried at the position where the farthest fail occurred. Thus, in this case, a failure of the parser capturing `*` is reported rather than of the parser expression->`+`. Changing this by reporting the last parser or all parsers that failed at this location would do little to remedy this situation, however. In this example, it would just be as confusing to learn that expression->`+` failed at the end of the parsed string.
Mandatory items “§”
Thus, “farthest fail”-method is not very suitable for explaining the failure or pinpointing which parser really was the culprit. Therefore, DHParser provides a simple annotation that allows to raise a parsing error deliberately, if a certain point in the chain of parsers has not been reached: By placing the “§”-sign as a “mandatory-marker” in front of a parser, the parser as well as all subsequent parsers in the same sequence, will not simply return a non-match when failing, but it will cause the entire parsing process to stop and report an error at the location of failure:
>>> arithmetic_grammar = arithmetic_grammar.replace(
... 'group = "(" expression ")"',
... 'group = "(" § expression ")"')
>>> arithmetic = create_parser(arithmetic_grammar, "arithmetic")
>>> terms = arithmetic('(2 - 3 * (4 + 5)')
>>> print(terms.errors[0])
1:17: Error (1010): '`)` ~' expected by parser 'group', but »...« found instead!
>>> terms = arithmetic('(2 - 3) * ( )')
>>> print(terms.errors[0])
1:13: Error (1010): 'expression' expected by parser 'group', but »)...« found instead!
The error messages give a much better indication of the cause of the error. What is reported as cause is either the name of the parser that was expected to match, as in the second case, or the rule of the parser in case of unnamed parsers, as in the first case. This usually, though unfortunately not always, yields a much better indication of the location and cause of an error than the farthest failure.
However, a little care has to be taken, not to place the mandatory marker in front of a parser that might fail at a location that could still be reached and matched by another branch of the grammar. (In our example it is clear that round brackets enclose only groups. Thus, if the opening round bracket has matched, we can be sure that what follows must be an expression followed by a closing round bracket, or, if not it is a mistake.) Luckily, although this may sound complicated, in practice it never is. Unless you grammar is very badly structured, you will hardly ever make this mistake, an if you do, you will notice soon enough.
Also, there is an important restriction: There is only one §-marker
allowed per named parser. In case you have a long EBNF-expression on the
right hand side of a symbol-definition, where you’d like to use the
§-marker at more than one place, you can, however, always split it into
several expression by introducing new symbols. These symbols, if they
serve no other purpose, can be marked as disposable with the
@ disposable-directive (see Simplifying Syntax-Trees while Parsing).
The §-marker has proven to be a very simple means of pinpointing errors the DSL-code, and I recommend to use it from early on in the process of developing a new grammar. Plus, the §-marker offers two further benefits, namely customizing error messages and resuming the parsing process after a failure. That latter is particularly helpful if the DSL is to be used in with an integrated development environment, which benefits greatly from fail-tolerant parsing. However I only recommend to start using these, only after the grammar has reached a certain amount of maturity, because changing the grammar ofter requires re-adjusting customized error messages and resume-clauses as well, which can become tedious.
Custom error messages
While the error messages produced by the use of the §-marker are often quite understandable for the engineer designing the grammar of a DSL, they might not be so for the user the DSL, who might not know the names of the parsers of the grammar, let alone the expressions of the unnamed parsers und will therefore not always be able to make much sense of an error-messages that report just these.
In order to customize error messages, the symbol-related directive
@ SYMBOLNAME_error = CONDITION, ERROR_STRING is used. The directive’s
name consists of the name of a symbol that contains a §-marker and the
appendix _error. The directive always takes two arguments, separated
as usual by a comma, of which the first is condition-expression and
the second an error message. The condition can be used to make
the choice of an error-message dependant on the text following the
point of failure. It can either be
a regular expression or a simple string which must match (or be equal
to in the case of the string) the first part of the text at the
position where the parser defined by the symbol failed to match and
raised an error. Only if the condition matches, the error message
given as the second argument will be emitted. Otherwise, the fallback
error-expression described above (”… expected by parser …”) will
be shown. The empty string '' can be used as a fallback if the
customized message shall always be emitted, no matter what the
following text looks like.
The error string is a format string that may include any of the
two arguments {0} or {1} where {0} will be replaced by
the name or string representation of the parser that was expected
to match but didn’t and {1} will be replaced by the first twenty
or so letters of the unmatched rest of the text. Here is a simple
example that could be part of a JSON-parser that is intended to
deliver understandable error-messages:
>>> grammar = '''
... @ string_error = '', 'Illegal character(s) »{1}« in string.'
... string = `"` §characters `"` ~
... characters = { plain | escape }
... plain = /[^"\\\\]+/
... escape = /\\[\/bnrt\\]/'''
>>> json_string = create_parser(grammar, 'json_string')
>>> print(json_string('"alpha"'))
"alpha"
>>> for e in json_string('"al\\pha"').errors: print(e)
1:4: Error (1010): Illegal character(s) »\pha"...« in string.
It is possible to place an error code or number at the beginning of the error string, separated by a colon to override the default code of 1010 for mandatory continuation errors. (See below for ean example where it makes sense to do so.)
Customized error-messages must always be specified in the grammar before definition of the symbol they are related to. It is possible to “stack” several different error-directives with different conditions and messages but related to the same symbol. The conditions are evaluated in the order the error-directives appear in the grammar. The first error message the matching condition matches the error location will be picked. Therefore, the more specific conditions should always be placed first and the more general or fallback conditions should be placed below these:
>>> grammar = ("@ string_error = /\\\/, 'Illegal escape sequence »{1}« "
... "Allowed values are b,n,r,t,u'") + grammar
>>> json_string = create_parser(grammar, 'json_string')
>>> for e in json_string('"al\pha"').errors: print(e)
1:4: Error (1010): Illegal escape sequence »\pha"...« Allowed values are b,n,r,t,u
Here, the more specific and more understandable error message has been selected. Careful readers might notice that the the more general customized error message “Illegal character(s) … found in string” will now only be selected, if the string contains a character that not even regular expression engine recognizes, because the only other character that is not allowed within the string are the closing quotation marks that terminate the string and which do not cause the parser to fail (but only to terminate to early).
Also, it might be noticed that the errors are always caused
by a failure to match the second "-sign, because the
characters-parser also matches the empty string and thus
never fails or raises any error. Nonetheless, the error
can occur in the interior of the string and can - with
the help of customized error messages - be described as such
and properly be located.
Generally, though, emitting the right error messages based on a regular-expression-condition is not easy. In the worst case, badly customized error messages can even be more confusing than not customizing error messages at all.
One possible development-strategy is to wait for the feedback of testers and users or to monitor the errors that users typically make and then to customize the error messages to the actual needs of the users to help them understand why the computer refuses to parse a certain construct.
Enhancing grammars with error-paths
A frequently used approach to error catching is the encoding of alternative code
paths for erroneous code right into the grammar. This is usually done by adding
a further alternative with the alternative-operator | that is specifically
meant to “throw” an error if all regular alternatives have been exhausted.
DHParser offers a special syntax to add an error message if a certain part of
the grammar is activated during the parsing of an erroneous source-text:
@Error("[CODE:]MESSAGE") where MESSAGE has to be substituted by a particular
error message and the optional “CODE:” with a natural number at the beginning of
the error-message from which the error-code must be separated with a colon. This
number can freely be chosen, but must adhere to the following convention:
numbers from 1 to 99 must be used for mere notices
numbers from 100 to 999 for warnings
numbers from 1000 to 9999 for errors
numbers of 10000 and more as fatal errors
Fatal errors result in the stop of the processing pipeline right after the step where the error occurred. “Normal” errors result in invalid results but DHParser will attempt to continue with the processing pipeline. This has the advantage that other errors further downstream will also be reported in the same passe. A possible disadvantage is that consequential errors may arise in the processing of faulty results from earlier stages in the pipeline. If too many consequential errors occur in the aftermath of an error, it should be considered to lift this error to a fatal error by assigning a higher number to it.
Warnings and notices should never should be used when the results will still be valid, though maybe not what they were intended to be. (Thus, the warning) Error codes starting with the digit “1” are reserved by DHParser. So, for custom-errors codes starting with “2”, “3” etc. must be used.
Here is an example how error-messages can be added within the grammar. The
syntax uses the same @-marker as directives but other than directives the
error-markers do not resemble symbol-definitions but occur within the right hand
side of a symbol definition. The following is a grammar that parses a sequence
of natural numbers separated by a blank of line-feed, e.g. “1 2 3”:
>>> sequence_of_natural_numbers = """@drop = whitespace
... document = { ~ number | error } ~ EOF
... number = /[1-9]\\d*/~
... @error_error = "20:Parser stopped because of an error"
... error = !EOF # no error-message on end of file
... @Error("2010:Not a valid Number!")
... § EOF
... EOF = !/./"""
>>> number_parser = create_parser(sequence_of_natural_numbers)
>>> syntax_tree = number_parser("1 2 X 3 4 X 5")
>>> for error in syntax_tree.errors_sorted: print(error)
1:5: Error (2010): Not a valid Number!
1:5: Notice (20): Parser stopped because of an error
Let’s examine this in detail: The definition of the “document”-symbol without an error-branch would read:
document = { ~ number } ~ EOF
Here, we add an alternative branch to the number-parser in case the number
parser does not match. Now, the number parser will not match on one of two
conditions: a) the following text is not a number, or b) the end of the file has
been reached. Now, since the latter is perfectly in order, we have to exclude
this case at the beginning of the error-branch with a negative-lookahead:
error = !EOF
@Error("1010:Not a valid Number!")
§ EOF
While the first line of the definition of the error-branch excludes the kinds of
non-match(es) that are not errors, the second line adds an error message with
the current location of the parser. We add an error code, “1010”, to make it
easier to identify this error in automated tests or the like. The last line is
just a means to make the parser to stop parsing right on the spot (that is,
unless this error is caught by a resume-directive, see below). It will produce the consequential error “EOF
expected”, though. In order not to confuse the use, the error has been
reconfigured as a simple notice that explains the situation with the directive
@error_error = "20:Parser stopped because of an error" in the line before.
Instead of stopping the parser by jumping right to the end of the document, one
might also try to skip only so much text as is needed to continue the
parsing-process. In this trivial example, this is easily done by substituting
the § EOF statement with a regular expression that consumes all characters
until the next location where the parser might find a possibly valid item. The
expression /.*?(?=\s|$)/ skips all characters up to but not including the
next blank or up to the end of the file. Let’s try:
>>> alt_gr = sequence_of_natural_numbers.replace('§ EOF', '/.*?(?=\\s|$)/')
>>> reentrant_number_parser = create_parser(alt_gr)
>>> syntax_tree = reentrant_number_parser('1 2 X 3 4 X 5')
>>> for error in syntax_tree.errors_sorted: print(error)
1:5: Error (2010): Not a valid Number!
1:11: Error (2010): Not a valid Number!
The location of the second error indicates that the parser has continued to read the document after the first error. We can double-check this by looking at the abstract syntax tree:
>>> print(syntax_tree.as_sxpr())
(document
(number "1")
(number "2")
(error
(ZOMBIE__ `(err "1:5: Error (2010): Not a valid Number!"))
(:RegExp "X"))
(number "3")
(number "4")
(error
(ZOMBIE__ `(err "1:11: Error (2010): Not a valid Number!"))
(:RegExp "X"))
(number "5")
(EOF))
Fail-tolerant Parsing
A serious limitation of all previously described error-handling mechanisms (with the exception the very last example) is that the parsing process still stops on the very first error. This is particularly annoying for beginners learning to code data or program code with a new DSL, because the compiler must be run again and again before all errors have been found and corrected. It would be much better to receive a list of all errors on the first run.
Generic approach
Fail tolerant parsing means that:
the parsing process continues after a syntax error has been encountered, ideally until the end of the document
that doing so the parser skips as little text as possible and thus reports as many errors as possible on the first run.
that ideally no consequential errors (“Folgefehler”) are introduced which are not original errors in the source document, but merely artifacts of badly chosen locations for the resumption of the parsing process.
There are a number of techniques for fail-tolerant parsing. One technique that has already been shown earlier and which is not specific to DHParser but can be used with any parser-generator is to add grammar-paths for possibly erroneous code to the grammar:
>>> grammar = '''
... string = `"` ([characters] `"` | string_error [`"`]) ~
... string_error = /[^"]*/
... characters = { plain | escape }+
... plain = /[^"\\\\]+/
... escape = /\\[\\/bnrt\\]/'''
>>> json_string = create_parser(grammar, 'json_string')
>>> tree = json_string('"al\\pha"')
>>> print(tree.as_sxpr(flatten_threshold=0))
(string
(:Text '"')
(string_error "al\pha")
(:Text '"'))
This time the parser did not need to stop at the erroneous part. The erroneous part itself has been caught within a node that betrays only by its name that there was an error. To produce error messages we have to add them explicitly to such nodes, afterwards:
>>> for string_error in tree.select('string_error'):
... _ = tree.new_error(string_error, 'Fehler im String: ' + str(string_error))
>>> for e in tree.errors: print(e)
1:2: Error (1000): Fehler im String: al\pha
Unfortunately, the error location is not very precise. This can be remedied by refining our error junction code:
>>> grammar = '''
... string = `"` ([characters] `"` | string_error [`"`]) ~
... string_error = [characters] { ups [characters] }+
... ups = /[^"]/
... characters = { plain | escape }+
... plain = /[^"\\\\]+/
... escape = /\\[\\/bnrt\\]/'''
>>> json_string = create_parser(grammar, 'json_string')
>>> tree = json_string('"al\\pha"')
>>> print(tree.as_sxpr())
(string
(:Text '"')
(string_error
(characters
(plain "al"))
(ups "\")
(characters
(plain "pha")))
(:Text '"'))
Here, the node named “ups” pinpoints the precise error location.
Like most techniques for fail-tolerant parsing, this one is not quite
as easy to master in practice as it might look. Generally, adding
a junction for erroneous code works best, when the passage that shall
be by-passed is delineated by a easily recognizable follow-up strings.
In this example the follow-up string would be the "-sign. The method fails,
of course if the follow-up text is erroneous, too, or has even been
forgotten. So, to be absolutely sure, one would have to consider
different follow-up sequences, say empty lines, keywords that mark
new parts of the document and the like.
Skip and Resume
DHParser also offers two other constructs for fail-tolerant parsing which are quite similar to the just described technique. However, they do not require adding code-paths to the grammar and reuse the error-locating ability of the §-marker. A disadvantage is that the DHParser-specific support for fail-tolerant parsing presently relies entirely on regular expressions for finding the right re-entry points.
DHParser allows to resume parsing after an error at a later point in the text. When trying to resume parsing two questions must be answered:
At what location should the parsing process be resumed?
Which parser in the parser call-chain should resume parsing? E.g. the parser that failed, the parser that called the parser that failed, … ?
The location where parsing should be resumed must be specified by
a regular expression or a list of regular expressions. The resumption
location is the nearest match of any of these expressions that does
not fall into a comment (as specified by the @comment-directive
described above). More precisely it is the location directly after
the match, because this allows to search for the reentry-location
both by the text preceding this location and the text following
this location by using a lookahead operator inside the regular
expression.
The parser that resumes parsing depends on the directive that guides
the search for the reentry-point. DHParser offers two different
directives for this purpose, the @..._skip-directive and the
@..._resume-directive. The placeholder … stands for the name
of a parser that contains a §-marker.
The skip-directive resumes parsing with the sequence-parser that contains the item(s) marked by the §-marker. In the following example, the skip-directive picks up parsing with the string- parser when an error was raised by the string-parser:
>>> grammar = '''
... @ string_error = /\\\/, 'Illegal escape sequence »{1}«'
... @ string_error = '', 'Illegal character "{1}" in string.'
... @ string_skip = /(?=")/
... string = `"` §characters `"` ~
... characters = { plain | escape }
... plain = /[^"\\\\]+/
... escape = /\\[\\/bnrt\\]/'''
>>> json_string = create_parser(grammar, 'json_string')
>>> tree = json_string('"al\\pha"')
>>> print(tree.content)
"al\pha"
>>> print(tree.errors[0])
1:4: Error (1010): Illegal escape sequence »\pha"...«
>>> print(tree.as_sxpr())
(string
(:Text '"')
(characters
(plain "al"))
(ZOMBIE__ `(err "1:4: Error (1010): Illegal escape sequence »\pha\"...«") "\pha")
(:Text '"'))
After the error has occurred at the illegal escape-sequence, the
skip-directive catches the error and skips to the location where the
“-character lies just ahead and continues parsing with the string-parser.
The skipped passage is stored in a ZOMBIE__-Node within the syntax-tree
and parsing can continue through to the end of the text.
In contrast to the skip-directive the resume-directive leaves the parser
that raised the error and resumes one level higher up in the call chain.
The @ ..._resume-directive that tells the calling
parsers where to continue after the array parser has failed.
So, the parser resuming the parsing process is not the array parser that
has failed, but the first parser in the reverse call-stack of “array” that
catches up at the location indicated by the @ ..._resume-directive.
The location itself is determined by a regular expression, where the
point for reentry is the location after the next match of the regular
expression:
>>> grammar = grammar.replace('@ string_skip = /(?=")/',
... '@ string_resume = /("\s*)/')
>>> json_string = create_parser(grammar, 'json_string')
>>> tree = json_string('"al\\pha"')
>>> print(tree.content)
"al\pha"
>>> print(tree.errors[0])
1:4: Error (1010): Illegal escape sequence »\pha"...«
>>> print(tree.as_sxpr())
(string
(:Text '"')
(characters
(plain "al"))
(ZOMBIE__ `(err "1:4: Error (1010): Illegal escape sequence »\pha\"...«") '\pha"'))
Note, that this time, the zombie-node also contains the closing quotation marks. Also, it should be observed, that the regular expression of the resume-directives stops after the closing quotation marks as well as any ensuing whitespace. This is because parsing will continue with the calling parser of the string parser, so the resumption point must be at a reasonable place where the string parser might have returned, if no error had occurred.
A simple rule for specifying the reentry point of an error is to find a location where the next structural entity after the erroneous entity starts. Let’s try this for a (simplified) config-file parser:
>>> config_grammar = '''@literalws = right
... config = ~ { section } EOF
... section = heading { entry }
... heading = "[" § identifier "]"
... entry = identifier § ":" value
... identifier = /\\w+/~
... value = `"` § TEXTLINE '"'
... TEXTLINE = /[^"\\n]*/
... EOF = !/./ '''
>>> config_parser = create_parser(config_grammar)
As of now, our parser is not fail-tolerant. This means it will stop parsing at the first error. Further errors are neither detected nor reported:
>>> cfg_data_with_errors = '''
... [entities]
... animal: "cat"
... plant: rose"
... Building: "Tour Eiffel"
... [colors
... red: "warm"
... blue 1: "cold"
... grey: "black and white" '''
>>> result = config_parser(cfg_data_with_errors)
>>> for error in result.errors_sorted: print(error)
4:8: Error (1010): 'value' expected by parser 'entry', but »rose"\nBuil...« found instead!
After adding suitable resume-clauses for those symbols the definition of which contain the mandatory marker §, all errors are reported in a single pass:
>>> config_grammar = '''
... @heading_resume = /\\n\\s*(?=\\w|\\[)/
... @entry_resume = /\\n\\s*(?=\\w|\\[)/
... ''' + config_grammar
>>> config_parser = create_parser(config_grammar)
>>> result = config_parser(cfg_data_with_errors)
>>> for error in result.errors_sorted: print(error)
4:8: Error (1010): 'value' expected by parser 'entry', but »rose"\nBuil...« found instead!
7:1: Error (1010): '`]` ~' expected by parser 'heading', but »red: "warm...« found instead!
8:6: Error (1010): '`:` ~' expected by parser 'entry', but »1: "cold"\n...« found instead!
It can become difficult to find a reentry point with regular expressions that is on the same level of the parser call chain (or one level higher up in the case of the resume-directive) when an error occurs in a syntactic structure that can be recursively nested. Because of this it is also possible to specify the re-entry point with a parser. In this case, the search term has a different semantics however. If a parser is specified, it must match all characters from the point where the error occurred up to the reentry point. In the case of a simple string or a regular expression, DHParser searches for the first match of the expression and then picks the location after that match. In order to distinguish the two cases clearly, PEG-rules must always be enclosed in round brackets. Thus, a single regular expression or a singular string enclosed in round brackets will not be used as a search term but as a matching expression that determines the reentry-location by matching the complete text from the error location to the reentry-point.
As an example, let’s try this with a parser for arbitrarily nested lists of positive integers. First, we write our grammar without any re-entry rules:
>>> number_list_grammar = '''@ literalws = right
... @ disposable = /_\w+/
... @ drop = _EOF, whitespace, strings
... _document = ~ [ list ] §_EOF
... list = "[" [_items] § "]"
... _items = _item { "," §_item }
... _item = number | list
... number = `0` | /[1-9][0-9]*/
... _EOF = !/./'''
>>> list_parser = create_parser(number_list_grammar)
>>> list_parser('[[1,2], [3,4]]').as_sxpr()
'(list (list (number "1") (number "2")) (list (number "3") (number "4")))'
The following example contains three errors: The letter “A” in a place where a number should be and, a bit later, the wrong delimiter “;” instead of a comma, and finally a missing value (or a superfluous comma) right before the end. Since there are no rule for resuming the parser after an error ocurred, the parser will stop after the first error and also report that only part of the document could be parsed:
>>> example_with_errors = '[1, 2, A, [5, 6; 7], 8, ]'
>>> result = list_parser(example_with_errors)
>>> for e in result.errors: print(e)
1:8: Error (1010): '_item' expected by parser '_items', but »A, [5, 6; ...« found instead!
Now, let’s define some regular expression based rules to resume parsing after an error:
>>> resumption_rules = '''
... @list_resume = /]\\s*|$/
... @_items_skip = /(?=,)/, /(?=])/, /$/
... '''
Note, that for the “_items”-parser, several rules have been specified. DHParser will try all of these rules and resume at the closest of the locations these rules yield. The @list_resume-rule moves to a point after the list, where the list parser might have returned, if no error had occurred, which is after the closing square bracket plus any adjacent whitespace. The ‘@_items_skip`-rule moves to any point within the sequence of items where the _items could catch up (another comma “,” followed by further items) or end, because the sequence is exhausted (“]” or the end of the document caught be the regular expression marker for the end of the string: /$/). See, how these rules play out in this particular example:
>>> list_parser = create_parser(resumption_rules + number_list_grammar)
>>> result = list_parser(example_with_errors)
>>> for e in result.errors: print(e)
1:8: Error (1010): '_item' expected by parser '_items', but »A, [5, 6; ...« found instead!
1:16: Error (1010): '`]` ~' expected by parser 'list', but »; 7], 8, ]...« found instead!
1:25: Error (1010): '_item' expected by parser '_items', but »]...« found instead!
All errors are located and reported properly in a single run and the parser continues right through until the end of the document as we’d expect from a fail-tolerant parser. However, the limitations of our regular-expression-based rules become apparent when nested structures are involved:
>>> example_with_errors_2 = '[1, 2, A, [5, 6; [7, 8], 9], 10, ]'
>>> result = list_parser(example_with_errors_2)
>>> for e in result.errors: print(e)
1:8: Error (1010): '_item' expected by parser '_items', but »A, [5, 6; ...« found instead!
1:16: Error (1010): '`]` ~' expected by parser 'list', but »; [7, 8], ...« found instead!
1:28: Error (1010): '_EOF' expected by parser '_document', but », 10, ]...« found instead!
Here, the parser stopped before the end of the document, which shows that our resumption rules have been either incomplete or inadequate. Let’s turn on some debugging information to get a better insight into what went wrong:
>>> from DHParser.trace import resume_notices_on
>>> resume_notices_on(list_parser)
>>> result = list_parser(example_with_errors_2)
>>> for e in result.errors: print(e)
1:8: Error (1010): '_item' expected by parser '_items', but »A, [5, 6; ...« found instead!
1:9: Notice (50): Skipping from 1:8 'A, [5, ...' within _item->:_item to 1:9 ', [5, 6...'
1:16: Error (1010): '`]` ~' expected by parser 'list', but »; [7, 8], ...« found instead!
1:24: Notice (50): Resuming from list at 1:16 '; [7, 8...' with _items->:Series at 1:24 ', 9], 1...'
1:28: Error (1010): '_EOF' expected by parser '_document', but », 10, ]...« found instead!
What is of interest here, is the second notice: It seems that the error was caught within the “list”-parser. By moving on to the spot after closing bracket as determined by the @list_resume-directive, however, thar parsing-process did not end up at a location behind the grammatical structure where the error had occurred, but at the location after a nested structure, which in this example is the inner list “[7, 8]”.
This problem can be remedied by using the full power of parsing expression grammars for determining the resumption-position. (Note, that in order to clearly distinguish PEG-rules unambiguously from plain regular expressions or simple strings, they must be enclosed in round rackets!):
>>> resumption_rules = '''
... @list_resume = ({ list | /[^\\[\\]]+/ } ["]"])
... @_items_skip = /(?=,)/, /(?=])/, /$/
... '''
>>> list_parser = create_parser(resumption_rules + number_list_grammar)
>>> resume_notices_on(list_parser)
>>> result = list_parser(example_with_errors_2)
>>> from DHParser.error import RESUME_NOTICE
>>> for e in result.errors:
... if e.code != RESUME_NOTICE: print(e)
1:8: Error (1010): '_item' expected by parser '_items', but »A, [5, 6; ...« found instead!
1:16: Error (1010): '`]` ~' expected by parser 'list', but »; [7, 8], ...« found instead!
1:34: Error (1010): '_item' expected by parser '_items', but »]...« found instead!
This time, the parser does not terminate before the end. The resume-notices show that resumption does not get caught on the nested structure, any more:
>>> for e in result.errors:
... if e.code == RESUME_NOTICE: print(e)
1:9: Notice (50): Skipping from 1:8 'A, [5, ...' within _item->:_item to 1:9 ', [5, 6...'
1:28: Notice (50): Resuming from list at 1:16 '; [7, 8...' with _items->:Series at 1:28 ', 10, ]'
1:34: Notice (50): Skipping from 1:34 ']' within _items->:Series to 1:34 ']'
Programming fail-tolerant parsers can be quite a challenge. DHParser’s @skip- and @resume-directives help separating the code for fail-tolerance from the grammar proper. The only hooks that are needed within the grammar proper are the mandatory markers (”§”). Still, it is a good strategy, to start adding the fail-tolerance code only later in the development of a grammar and to keep things as simple as possible. This can best be done by choosing resumption points that are as unambiguous as possible (like keywords that introduce specific parts of the document), even if this means that larger parts of the document will be skipped and, consequently, some errors will remain undetected until errors earlier in the document have been fixed. If syntax errors are sparse - as can reasonably be assumed - the harm done by skipping larger portions of the text is probably negligible or at any rate smaller than the harm done by introducing consequential errors as a result of poorly chosen resumption rules.
Macros
In order to reduce code-repetition within grammar-specifications DHParser offers a substitution-based macro-system. (To avoid code-repetition between grammars, includes can be used.)
Macros are definined similar to symbols, only that their name must
always start with a dollar sign $:
$list($delimiter) = { /[^,.;:]+/ | !$delimiter /[,.;:]/ }
{ $delimiter ~ { /[^,.;:]+/ | !$delimiter /[,.;:]/ } }
The parameters of a macro are also marked with a dollar sign. Using macros is straight forward:
keywords = "Keywords:" $list(",")
Macros can use other macros in their definiens but they may not be nested recursively. It is also possible to define macros without any arguments. This makes sense, when breaking up a long macro definition into several macros which works similar to breaking up a long symbol-definition by introducing (disposable) helper-symbols. For this purpose helper-macros may refer to parameters of the “calling” macro without explicitly passing them as parameters:
$list($delimiter) = $item { $delimiter ~ $item }
$item = { /[^;:,.\n]+/ | !$delimiter /[.,:;]/ }+
A limitation of DHParser’s macro-system is that there is no way to define new symbols inside a macro-definition.
For a more detailed example, see Macros.
Context sensitive parsers
DHParser does by intention not contain support for semantic actions, because these can introduce a context-sensitivity that can be hard to handle with a recursive descent parser. And compiler-generation, where semantic actions are mostly needed, is not the main domain of application for DHParser. (There are a few loopholes that can be (mis-)used for semantic actions, though…)
Sometimes, however, it would be ever so comfortable to break out of the paradigm of context free grammars - if only just a little bit. For example, when encoding data in XML or HTML, the closing tag must have the same tag-name as the opening tag:
<line>O Rose thou art sick.</line>
If you encode the tag-name-parser roughly following the XML-specs as:
tag_name = /(?![0-9][:\\w][\\w:.-]*/
the following code would be accepted by the parser:
<line>O Rose thou art sick.</enil>
In this case, the remedy is easy: When post-processing the syntax tree, check whether all end-tags have the same tag-name as the corresponding start-tag and add an error message where this is not the case. However, this only works, because the tag-names have no influence on the structure of the syntax-tree of an XML-document.
Therefore, the same remedy would not work in many other cases. In CommonMark, for example, a “code fence is a sequence of at least three [but possibly more] consecutive backtick characters (`) or tildes (~)” Now, this is a bit more complicated than the XML-example, because here the content of the closing marker needs to be known at the time of parsing already, because otherwise the structure could not be determined correctly. A smaller number of tildes or backticks than used at the opening of a code fence would be part of the content of the fenced code that needs to be distinguished from the closing delimiter. The purpose is to allow you to fence code that may contain fenced code itself.:
~~~~~
In common mark code can be fenced with tilde-characters. You can just write:
~~~
fenced code
~~~
~~~~~
Here, a possible remedy is to employ a preprocessor, that distinguishes the fenced code from quoted fences and replaces the non-quoted fences by context-free opening and closing markers that can then be captured at parsing stage. Using preprocessors is often a clean and pragmatic solution and DHParser includes dedicated support for preprocessors. However, introducing preprocessors also hast some downsides. One disadvantage is that a preprocessor requires a makeshift parser of its own that must be strong enough not to stumble over the syntactic constructs that do not concern the preprocessor like comments for example.
DHParser therefore also offers another alternative for occasional context sensitivity by allowing to retrieve and compare earlier values of a symbol. Any users of these features should be aware, however, that extensive use of context sensitive parsers may slow-down parsing, because it does not play well with the the memoizing optimization that is commonly used with parsing expression grammars and also employed by DHParser. Also, since these features are specific for DHParser, switching to another parser generator will require factoring the context-sensitive-parser out of your grammar and re-implementing the functionality for which they have been used with the more conventional approach of pre- and post-processors.
Before explaining the full mechanism, let’s have a look at an example. The following minimal pseudo-XML-parser captures the value of the tag-name so that it can compared with the tag-name of the ending-tag:
>>> miniXML = '''
... @ disposable = EOF
... @ drop = EOF, whitespace, strings
... document = ~ element ~ §EOF
... element = STag §content ETag
... STag = '<' TagName §'>'
... ETag = '</' ::TagName §'>'
... TagName = /\\w+/
... content = [CharData] { (element | COMMENT__) [CharData] }
... CharData = /(?:(?!\\]\\]>)[^<&])+/
... EOF = !/./
... '''
>>> parseXML = create_parser(miniXML)
>>> print(parseXML('<line>O Rose thou art sick.</line>').as_sxpr())
(document
(element
(STag
(TagName "line"))
(content
(CharData "O Rose thou art sick."))
(ETag
(TagName "line"))))
>>> result = parseXML('<line>O Rose thou art sick.</enil>')
>>> print(result.errors[0])
1:28: Error (1010): 'ETag = `</` ::TagName "line" § `>`' expected by parser 'element', but »</enil>...« found instead!
Here, the TagName-parser in the definition has been prefixed with a
double colon ::. This double colon is the “Pop”-operator and can be
put in front of any symbol defined in the grammar. If a symbol is
annotated with the operator, then its parsing-rule will not be executed,
but the last value that has been parsed by the symbol will be retrieved
and match against the following part of the document by simple
text-comparison. If the last value matches the “Pop”-parser will remove
that value from the stack of earlier values and report a match.
Otherwise a non-match will be reported and the value will be left on the
stack. In the example, since the Pop-parser ::TagName follows a
mandatory marker §, the non-match causes a syntax error.
Sometimes it is useful to compare the following text with a stored value without removing that value from the stack. For this purpose, there is the “Retrieve”-operator which is denoted by a single colon ::
>>> fencedTextEBNF = '''@whitespace = vertical
... @disposable = EOF, fence_re
... @drop = whitespace, EOF
... document = ~ { text | fenced } EOF
... text = /[^\\~\\n]+/ ~ # /(?:[^\\~]+|\\n)+/
... fenced = fence ~ { text | no_fence } ::fence ~
... no_fence = ! :fence fence_re ~
... fence = fence_re
... fence_re = /\\~+/
... EOF = !/./
... '''
Here the Pop-operator :: is used in the definition of fenced in just the same way
as in the earlier example. The Retrieve-operator : is used in the definition of
no_fence in combination with a negative lookahead !. This allows the no_fence-parsers
to capture all “fences” which are not closing-fence of the current fenced environment:
>>> parseFenced = create_parser(fencedTextEBNF)
>>> fenced_test_1 = '''~~~
... fenced code
... ~~~'''
>>> print(parseFenced(fenced_test_1).as_sxpr())
(document (fenced (fence "~~~") (text "fenced code") (fence "~~~")))
>>> fenced_test_2 = '''~~~~~
... In common mark code can be fenced with tilde-characters. You can just write:
... ~~~
... fenced code
... ~~~
... ~~~~~'''
>>> print(parseFenced(fenced_test_2).as_sxpr())
(document
(fenced
(fence "~~~~~")
(text "In common mark code can be fenced with tilde-characters. You can just write:")
(no_fence "~~~")
(text "fenced code")
(no_fence "~~~")
(fence "~~~~~")))
But what if the opening and closing fence are not one and the same string, but complements of each other, like opening and closing brackets? Say, you’d like to enclose code-examples in curled braces “{” and “}” and since the code examples themselves may contain braces, you’d like to allow the markup-writer to use an arbitrary number of braces as opening and closing delimiters:
>>> markup = "This ist a code example: {{ mapping = { 'a': 1, 'b': 2} }} with braces."
Here, the recalled value would need to be transformed or otherwise interpreted before the
following text is either considered a match or a non-match. Since such a transformation
can hardly be encoded in EBNF even an augmented EBNF, any more, one of DHParser’s loopholes
for semantic actions must be used. It is possible to assign Python filter-functions to
symbols, the value of which is retrieved with yet another directive, which is the
@ XXXX_filter-directive, where “XXXX” stands for the name of the symbol:
>>> bracesExampleEBNF = '''
... @braces_filter = matching_bracket()
... document = { text | codeblock }
... codeblock = braces { text | opening_braces | (!:braces closing_braces) } ::braces
... braces = opening_braces
... opening_braces = /\\{+/
... closing_braces = /\\}+/
... text = /[^{}]+/
... '''
The function name that is passed to the directive must be the name of a function that is within reach of the generated parser (by either defining it in the same module or importing it) and it must have the signature:
Callable[str, List[str]], Optional[str]]
This function takes the following text as well as the stack of previous value of
the symbol that is being retrieved as an argument and it must return either
a stretch of matched text of None to indicate a non-match. The function
matching_bracket() is already defined in DHParser.parse. Slightly
simplified to cover only the case of curly braces, it looks like this:
>>> from typing import List, Optional
>>> def matching_bracket(text: str, stack: List[str]) -> Optional[str]:
... value = stack[-1]
... value = value.replace("{", "}")
... return value if text[:len(value)] == value else None
>>> markup_parser = create_parser(bracesExampleEBNF)
>>> result = markup_parser(markup)
>>> print(result.as_sxpr())
(document
(text "This ist a code example: ")
(codeblock
(braces
(opening_braces "{{"))
(text " mapping = ")
(opening_braces "{")
(text " 'a': 1, 'b': 2")
(closing_braces "}")
(text " ")
(braces "}}"))
(text " with braces."))
Here, the outer double braces “{{” and “}}” open up and close a new code block and could be discarded as delimiters during the AST-transformation, while the opening and closing braces within the code block are simply that: opening and closing braces.
Advanced usages
Apart from the Pop- and Retrieve-operator, DHParser offers a third retrieval operator that,
like the Pop-operator, “pops” that last value from the stack and matches either this value
or the empty strings. In other word, this operator always matches, as long as there is still
a value on the stack, but it captures the beginning of the following text only
if it matches the stored value. A non-match only happens, when the stack has already been
exhausted. This “Pop anyways”-operator is denoted by a colon followed by a question mark :?.
Weird as this may sound, this operator has astonishingly manifold use cases. Think for
example of a modification of our minimal pseudo-XML parser the allows coders to omit the
tag name in closing tags to save them some typing:
>>> miniXML = miniXML.replace('::TagName', ':?TagName')
>>> parseXML = create_parser(miniXML)
>>> data_tree = parseXML('<line>O Rose thou art sick.</>').as_tree()
>>> print(data_tree)
document
element
STag
TagName "line"
content
CharData "O Rose thou art sick."
ETag
TagName
Another, rather tricky use case is to let the value of certain symbols be determined
on first use by marking all appearances of theses symbols on the right hand side of
the definitions wherein they appear with the single colon retrieval operator : and
clearing the stack with [:?symbol] after the end of file has been reached.
This technique has been employed for the “FlexibleEBNF”-parser in the examples folder.
The FlexibleEBNF-parser “magically” adjusts itself to different syntactical flavors of
EBNF. Here is an abbreviated excerpt of the grammar of this parser, to see how this
technique can be used in a grammar:
syntax = ~ { definition } EOF
definition = symbol §:DEF~ expression :ENDL~
DEF = `=` | `:=` | `::=` | `<-` | /:\\n/ | `: `
ENDL = `;` | ``
EOF = !/./ [:?DEF] [:?ENDL]
This trick can also be used to parse indentation:
>>> tree_grammar = '''@whitespace = horizontal
... @disposable = EOF, LF, SAME_INDENT
... @drop = strings, whitespace, EOF, LF, SAME_INDENT
...
... tree = INDENT node DEDENT /\\s*/ EOF
... node = tag_name [content]
... content = string | children
... children = &(LF HAS_DEEPER_INDENT)
... LF INDENT § node { LF SAME_INDENT § node }
... !(LF HAS_DEEPER_INDENT) DEDENT
... tag_name = /\\w+/~
... string = '"' § /(?:\\\\"|[^"\\n])*/ '"' ~
...
... INDENT = / */
... SAME_INDENT = :INDENT § !/ /
... HAS_DEEPER_INDENT = :INDENT / +/
... DEDENT = &:?INDENT
...
... LF = /\\n/
... EOF = !/./
... '''
>>> tree_parser = create_parser(tree_grammar)
>>> syntax_tree = tree_parser(data_tree)
>>> # show but the first 22 lines of the syntax-tree:
>>> print('\n'.join(syntax_tree.as_sxpr().split('\n')[:22] + ['...']))
(tree
(INDENT)
(node
(tag_name "document")
(content
(children
(INDENT " ")
(node
(tag_name "element")
(content
(children
(INDENT " ")
(node
(tag_name "STag")
(content
(children
(INDENT " ")
(node
(tag_name "TagName")
(content
(string "line")))
(DEDENT))))
...
In case you are surprised by the size of the resulting syntax-tree, keep in mind that the syntax-tree or “parse-tree” of the serialization of a data structure is not the data-structure itself, even if it happens to be a tree. However, the data-structure can be retrieved from the tree.
As can be seen the INDENT-elements have captured indentation of various increasing length. Also, observe the use of the “querying parser” HAS_DEEPER_INDENTATION. A “querying parser” is a parser that is meant to be used only inside a negative or positive lookahead. The first query with HAS_DEEPER_INDENTATION makes sure that the next call to INDENT will indeed capture a longer stretch of whitespace than represented by its present value, i.e. the last value on the “INDENT”-symbol’s value-stack.
DHParser makes sure that the value-stack of each captured variable will properly be unwinded when returning from a non-match to an earlier position in the document. However, there is one case where this does not work, namely, if a symbol has matched the empty string. Because value-stack unwinding is tied to the position in the text instead of the call stack (which DHParser only keeps track of when testing or debugging), it is impossible to decide where a zero-length capture must be unwinded or not in case DHParser returns exactly to the position of that capture. In this case DHParser does not unwind the capture. This can lead to a non-empty value stack at the end of the parsing process which will be reported as an error. There are three strategies to avoid this error:
Make sure that all symbols for which values are retrieved capture at least one character of text if they match.
In this example above, the symbol INDENT would have to be defined as
INDENT = / +/to achieve this. However, this would have made the formulation of the tree-grammar of this example more complicated.Make sure that in all cases where zero-length text could possibly be captured a subsequent non-match cannot occur before the zero-length value has been retrieved.
In the example above, this situation can only occur in the parser
treeand could be possibly avoided with a lookahead, e.g.tree = &(/ */ tag_name) INDENT node DEDENT. (However, since this avoidance strategy does not need to cover cases where the document is syntactically incorrect, anyway, we can rely on the fact that - since a syntactically correct document must contain a single root node at the very beginning - the parser will never retreat to the first captured INDENT.)As a last resort one could clear the stack with one or more optional
[DEDENT]parsers that can safely be added afterEOF. A modification that allows the document to consist of one or more top-level nodes would then look like:tree = { INDENT node DEDENT }+ /\\s*/ EOF [DEDENT]
The problem of error resumption
Resuming parsing after an error has occurred is complicated because, if done badly, it can lead to consequential errors, i.e. errors that are artifacts of the resumption rule or algorithm but not really errors in source text. This can be quite confusing and if worst comes to worst one cannot trust any error message after the first error messages which in turn defies the whole point of resuming parsing rather than breaking off after the first error.
Unfortunately, consequential errors are even harder to avoid when using context sensitive parsers. Say, we treat our XML-Parser with a little mistake:
>>> xmldoc = '''
... <doc>
... <title>Heading</litle>
... A few lines of Text
... </doc>'''
>>> result = parseXML(xmldoc)
>>> for e in result.errors_sorted: print(e)
3:21: Error (1010): '`>`' expected by parser 'ETag', but »litle>\n ...« found instead!
5:7: Error (1050): Capture-stack not empty after end of parsing: TagName 1 item
Since our original mini-XML-grammar did not contain any error-resumption-directives, this is what we’d expect. Now let’s add a resume-directive to continue after misspelled tag-names:
>>> miniXML = '''
... @ disposable = EOF
... @ drop = EOF, whitespace, strings
... document = ~ element ~ §EOF
... @element_resume = ('</' /\w+/ '>')
... element = STag §content ETag
... STag = '<' TagName §'>'
... ETag = '</' ::TagName §'>'
... TagName = /\\w+/
... content = [CharData] { (element | COMMENT__) [CharData] }
... CharData = /(?:(?!\\]\\]>)[^<&])+/
... EOF = !/./
... '''
>>> parseXML = create_parser(miniXML)
>>> result = parseXML(xmldoc)
>>> for e in result.errors_sorted: print(e)
3:19: Error (1010): 'ETag = `</` ::TagName "title" § `>`' expected by parser 'element', but »</litle>\n ...« found instead!
5:1: Error (1010): 'ETag = `</` ::TagName "title" § `>`' expected by parser 'element', but »</doc>...« found instead!
The second error is merely a consequential error. One can imagine that, had the mistake of misspelling the ending tag occurred deeper in the XML hierarchy then a consequential error for every closing tag after the erroneous would have been reported, because the name of the opening tag corresponding to the misspelled closing tag has - because of the misspelling - never been removed from the stack.
We can try to avoid this problem by “popping” one element from the stack within our resume rule. This is of course only possible, if we specify our resume rule as full PEG-expression, not just a regular expression:
>>> new_resume_rule = "@element_resume = (:?TagName '</' /\w+/ '>')"
The regular expression /\s*(?=>)/ merely has the purpose to cover
ending-tags that fail to parse because of superfluous blanks before the
closing >. The important part is the “pop anyways”-operator
:?TagName. Since this operator does not capture any text in case
of a misspelled tag-name, but merely removes it value from the stack,
it it followed by the regular expression /\s*(?=>)/ which moves the
parser forward to the next angular bracket.:
>>> i = miniXML.find('@element_resume')
>>> k = miniXML.find('\n', i)
>>> miniXML = miniXML[:i] + new_resume_rule + miniXML[k:]
>>> parseXML = create_parser(miniXML)
>>> result = parseXML(xmldoc)
>>> for e in result.errors_sorted: print(e)
3:19: Error (1010): 'ETag = `</` ::TagName "title" § `>`' expected by parser 'element', but »</litle>\n ...« found instead!
If it only was so simple! Let’s see what happens if we treat our
parser with another, rather obvious or common mistake: Forgetting
the closing tag (or, what leads to a similar confusion of the parser,
omitting the closing slash / in an empty tag):
>>> xmldoc = '''
... <doc>
... <title>Heading <wrong></title>
... </doc>'''
>>> result = parseXML(xmldoc)
>>> for e in result.errors_sorted: print(e)
3:26: Error (1010): 'ETag = `</` ::TagName "wrong" § `>`' expected by parser 'element', but »</title>\n<...« found instead!
4:1: Error (1010): 'ETag = `</` ::TagName "title" § `>`' expected by parser 'element', but »</doc>...« found instead!
4:7: Error (1010): 'ETag = `</` ::TagName "doc" § `>`' expected by parser 'element', but »...« found instead!
In this case the removal of the last value from the stack in the
@element_resume-rule results in a cascade of consecutive errors
(or “error-artifacts”), one for every outer closing tag.
Avoiding error-artifacts such as this is an art in itself and requires to look at the whole picture. As of now, there does not exist any comprehensive theory, how to do this. A good starting point might be to take a look at the errors, users of the domain specific language make most frequently, as well as on those error messages that confuse them most. (Bug-reports, help-desk questions or tele-metric data might yield the required empirical information for this approach.)
A more robust rule-set for our mini-XML-grammar might be:
>>> miniXML = '''
... @ whitespace = /\s*/
... @ disposable = EOF
... @ drop = EOF, whitespace, strings
...
... document = ~ element ~ §EOF
... @element_resume = (:?TagName (&('</' :TagName '>') | '</' /\w+/ '>'))
... element = STag §content ETag
... STag = '<' TagName §'>'
... ETag = '</' ::TagName §'>'
... TagName = /\w+/
... content = [CharData] { (element | COMMENT__) [CharData] }
...
... CharData = /(?:(?!\]\]>)[^<&])+/
... EOF = !/./ # no more characters ahead, end of file reached
... '''
>>> parseXML = create_parser(miniXML)
>>> result = parseXML(xmldoc)
>>> for e in result.errors_sorted: print(e)
3:26: Error (1010): 'ETag = `</` ::TagName "wrong" § `>`' expected by parser 'element', but »</title>\n<...« found instead!
This rule has been designed to cover both of the previous cases. However, if we combine both errors, its limitations begin to show:
>>> xmldoc = '''
... <doc>
... <title>Heading<wrong></litle>
... A few lines of Text
... </doc>'''
>>> result = parseXML(xmldoc)
>>> for e in result.errors_sorted: print(e)
3:26: Error (1010): 'ETag = `</` ::TagName "wrong" § `>`' expected by parser 'element', but »</litle>\n ...« found instead!
5:1: Error (1010): 'ETag = `</` ::TagName "title" § `>`' expected by parser 'element', but »</doc>...« found instead!
Here, it seems almost impossible to avoid an error-cascade in combination with context-sensitive parsers if met with a combination of different errors. In this particular case of an XML-parser, the best way out might be not to use context-sensitive parsers at all and check the matching XML-tags at a later processing stage.
Custom Parsers
Specifying a grammar in EBNF and generating a parser from this specification has the advantage that one can get a much better overview over the grammar than when writing a completely parser directly in Python. This also makes changing the grammar and explaining it to others much easier. However, there are cases where writing a parsers at least for particular components in Python can be advantageous. These include cases, where it is easier to write a parser in Python, where a Python-parser might be faster, or where a parser cannot be coded in EBNF, say, because it requires as Turing-complete programming languages.
A kind of parser that is a bit tedious to code in EBNF is a search parser. In
the simplest imaginable case (e.g. ignoring comments ignoring nested structures,
where the search phrase might occur, but should not be found) you could write
{ !"search_phrase" /.|\\n/ }. Be wary that this search parser could land you in
the middle of a comment, e.g. when “searching” for “rain” in the line I'm
singing in # the rain (assuming # as the comment sign). So it is better to
write { !"search_phrase" (~ | /.|\\n/) }. To keep the example simple, we’ll leave
such worries aside in the following:
>>> search_rain_grammar = '''@reduction = merge
... song = { search_rain count_rain } [search_rain] EOF
... search_rain = { !"rain" /.|\\n/ }
... count_rain = "rain"
... EOF = !/.|\\n/'''
>>> rain_counter = create_parser(search_rain_grammar)
>>> song = """
... I'm singin' in the rain
... Just singin' in the rain
... What a glorious feeling
... I'm happy again."""
>>> syntax_tree = rain_counter(song)
>>> print(syntax_tree.as_sxpr())
(song
(search_rain
""
" I'm singin' in the ")
(count_rain "rain")
(search_rain
""
" Just singin' in the ")
(count_rain "rain")
(search_rain
""
" What a glorious feeling"
" I'm happy again.")
(EOF))
>>> print(len(list(syntax_tree.select('count_rain'))))
2
Let’s now define a custom parser for “search_rain”. A custom parser is a
factory-function that returns a “custom parse function” which in turn is a
function that takes a StringView-object as input and
returns a nodetree.Node-object or None as result. This
factory-function can be a derived class from class parse.Parser but
does not need to be. In the EBNF-grammar is specified as @name(param)
where name is the name for the factory function on the Python side and
param either an identifier or - if enclosed in quotation marks - a single string
paramter on the Python side, which can be used to configure the factory function.
It is also possible to specify a custom parser function directly instead of
a factory function. However, it must then be specified as the parameter of the
predefined factory function “Custom”, e.g. @Custom(parse_func).
The parser (factory) function itself must be defined in a global namespace
that is reachable from the Grammar-class generated from the EBNF-code. A good
place is the beginning of the generated “xxxParser.py”-script in case of
DHParser-projects created with the “dhparser”-command. If, as in the following,
the parser is created with create_parser(), the custom code
should be passed as Python-source-string to the additional_code-argument
of the create_parser()-function. In the following snipped, we,
therefore, write it directly into a string.
>>> search_src = """
... from DHParser.parse import CustomParseFunc
... from DHParser.stringview import StringView
... def search(what: string) -> CustomParseFunc:
... def parse_func(rest: StringView) -> Optional[Node]:
... i = rest.find(what)
... if i >= 0:
... return Node('', rest[:i])
... else:
... return Node('', rest)
... return parse_func
... """
Note, that the name of the returned node ist left empty. It will be filled in by
the CustomParser-object either with the anonymous-parser name
“:CustomParser” or the name of the symbol to which the parser is assigned,
because this information is not available within the function. If, however, the
custom parser is defined as a class derived of CustomParser,
the right name must be given when creating the node. In most cases this will
simply be self.node_name.
As said, referring to the custom parser factory function from the grammar is simple:
>>> search_rain_grammar = '''@reduction = merge
... song = { search_rain count_rain } search_rain EOF
... search_rain = @search("rain")
... count_rain = "rain"
... EOF = !/.|\\n/'''
>>> rain_counter = create_parser(search_rain_grammar,
... additional_code=search_src)
Let’s try using it:
>>> syntax_tree = rain_counter(song)
>>> print(syntax_tree.as_sxpr())
(song
(search_rain
""
" I'm singin' in the ")
(count_rain "rain")
(search_rain
""
" Just singin' in the ")
(count_rain "rain")
(search_rain
""
" What a glorious feeling"
" I'm happy again.")
(EOF))
Another good use cases for custom parsers are long lists of fixed alternatives, say, a list of state-names, e.g. “Afghanistan, Albania, … Zimbabwe” which would clutter the grammar and slow down parsing. In particular, if these lists are changed frequently and better directly fetched from a database then encoded in the grammar which would have to be recompiled every time there is a change in the data.
Finally, custom parsers can also be used or misused for semantic actions of any kind. Tracing certain points of the parser by dumping debug-information would be one reasonable use case of this kind. When storing or manipulating data in your custom parser of producing other side effects, you should be aware that even if the parser matches, it’s results may be thrown away later, because the rescursive descent parser pursues an other branch. There is no support by DHParser for filling and unwinding data-stacks or the like during parsing - with the exception the context sensitive parsers described above. If you want to store data at all, it is best to save the data in the attributes of the nodes that your custom function returns, because then you can be sure that only data from valid nodes survives to the end of the parsing process.
Classes and Functions-Reference
Usually, the functions and classes of module DHParser.ebnf are not
called directly, but the helper functions from module DHParser.dsl
like DHParser.dsl.create_parser() will be called or the
dhparser-script will be used from the command line. Therefore, no
commented abbreviated reference of the most important functions of
module ebnf will be provided here. See the source code documentation
of module DHParser.ebnf for a full reference of all functions and
classes.
Comments and Whitespace
Why whitespace isn’t trivial
Handling whitespace in text-documents is not all trivial, because whitespace can serve several different purposes and there can be different kinds of whitespace: Whitespace can serve a syntactic function as delimiter. But whitespace can also be purely aesthetic to render a document more readable.
Depending on the data model, whitespace can be considered as significant and be included in the data or as insignificant and be excluded from the data and only be re-inserted when displaying the data in a human-readable-form. (For example, one can model a sentence as a sequence of words and spaces or just as a sequence of words.) Note, that “significance” does not correlate with the syntactic or aesthetic function, but only depends on whether you’d like to keep the whitespace in you data or not.
There can be different kinds of whitespace with different meaning (and differing significance). For example, one can make a difference between horizontal whitespace (spaces and tabs) and vertical whitespace (including linefeeds). And there can be different sizes of whitespace with different meaning. For example in LaTeX, a single linefeed still counts as plain whitespace while an empty line (i.e. whitespace including two or more not linefeeds) signals a new paragraph.
Finally, even the position of whitespace can make a difference. A certain number of whitespaces at the beginning of a line can have the meaning of “indentation” (as in Python code) while at the end of the line or between brackets it is just plain insignificant whitespace. (This is actually something, where the boundaries of the EBNF-formalism become visible and you’d probably use a preprocessor or some kind of “semantic actions” to handle such cases. There is some support for either of these in DHParser.)
Significant Whitespace
A reasonable approach to coding whitespace is to use one particular symbol for each kind of whitespace. Those kinds of whitespace that are insignificant, i.e. that do not need to appear in the data, should be dropped from the syntax-tree. With DHParser this can be done already while parsing, using the
@disposableand@drop-directives described earlier.But let’s first look at an example which only includes significant whitespace. The following parser parses sequences of paragraphs which consist of sequences of sentences which consist of sequences of main clauses and subordinate clauses which consist of sequences of words:
Here, we have two types of significant whitespace
PBR(“paragraph-break”) andS(“space”). Both types allow for a certain amount of flexibility, so that two whitespaces of the same type do not need to have exactly the same content, but we could always normalize these whitespaces in a subsequent transformation step.Two typical design patterns for significant whitespace are noteworthy, here:
Both whitespaces match only if there was at least one whitespace character. We may allow whitespace to be optional (as at the beginning and end of the document), but if the option has not been taken, we don’t to see an empty whitespace-tag in the document, later on. (For insignificant whitespace, the opposite convention can be more convenient, because, typically, insignificant whitespace is dropped anyway, whether it’s got content or not.)
The grammar is construed in such a way that the whitespace always appears between different elements at the same level, but not after the last or before the first element. The whitespace after the last word of a sentence or before the first word of a sentence is really whitespace between two sentences. If we pick out a sentence or a clause, we will have no dangling whitespace at its beginning or end. (Again, for soon to be dropped insignificant whitespace, another convention can be more advisable.)
Let’s just try our grammar on an example:
Again, it is a question of design, whether we leave whitespace in the data or not. Leaving it has the advantage, that serialization become as simple as printing the content of the data-tree:
Otherwise one would have to program a dedicated serialization routine. Especially, if you receive data from a different source, you’ll appreciate not having to do this - and so will other people, receiving your data. Think about it! However, dropping the whitespace will yield more concise data.
Coding Comments
Allowing comments in a domain-specific language almost always makes sense, because it allows users to annotate the source texts while working on them and to share those comments with collaborators. From a technical point of view, adding comments to a DSL raises two questions:
At what places shall we allow to insert comments in the source code? Common answers are: a) at the end of a line, b) almost everywhere, or c) both.
How do we avoid pollution of the EBNF-grammar with comment markers? It’s already curtails the readability that we have to put whitespace symbols in so many places. And speaking of comments at the end of the line: If line-feeds aren’t important for us - as in our toy-grammar for prose-text, above - we probably wouldn’t want to reframe our grammar just to allow for at the end of the line comments.
Luckily, there exists a simple and highly intuitive solution that takes care of both of these concerns: We admit comments, wherever whitespace is allowed. And we code this by defining a symbol that means: “whitespace and, optionally, a comment”.
Let’s try this with our prose-text-grammar. In order to do so, we have to define a symbols for comments, a symbol for pure whitespace, and, finally, a symbol for whitespace with optional comment. Since, in our grammar, we actually have two kinds of whitespace,
SandPBR, we’ll have to redefine both of them. As delimiters for comments, we use curly braces:As can be seen, the concrete re-definition of the whitespace tokens requires a bit of careful consideration, because we want to allow additional whitespace next to comments, but at the same time avoid ending up with two whitespaces in sequence in our data. Let’s see, if we have succeeded:
We will not worry about the more sub-structure of the S-nodes right now. If we are not interested in the comments, we could use the
@disposable,@dropand@reduction = merge-directives to simplify these at the parsing stage. Or, we could extract the comments and normalize the whitespace at a later tree-processing stage. For now, let’s just check wether our comments work as expected:The last error was to be expected, because we did not allow comments to serve a substitutes for whitespace. Let’s check whether putting comments near paragraph breaks works as well:
The last result might look surprising at first, but since a paragraph break requires at least one empty line as a separator, the input text is correctly understood by the parser as a single paragraph with two sentence interspersed by a single whitespace which, incidentally, contains a comment:
A common problem with whitespace is that it tends to pollute the Grammar, because wherever you’d like to allow whitespace, you’d have to insert a symbol for whitespace. The same problem exists when it comes to allowing comments, because you’d probably allow to insert comments in as many places as possible.
insignificant whitespace
Coding insignificant whitespace and comments is exactly the same as coding significant whitespace and comments and does not need to be repeated, here. (The combination of insignificant whitespace and significant comments, is slightly more complicated, and probably best outsourced to some degree to the post-parsing processing stages. It will not be discussed here.) However, DHParser offers some special support for insignificant whitespace and comments, which can make working with these easier in some cases.
First of all, DHParser has a special dedicated token for insignificant whitespace which is the tilde
~-character. We have seen this earlier in the definition of the json-Grammar.The
~-whitespace-marker differs from the usual pattern for defining whitespace in that it is implicitly optional, or what amounts to the same, it matches the empty string. Normally, it is to be considered bad practice to define a symbol as optional. Rather, a symbol should always match something and only at the places where it is used, it should be marked as optional. If this rule is obeyed, it is always easy to tell, wether some element is optional or not at a specific place in the Grammar. Otherwise, it can become quite confusing indeed. However, since the tilde character is usually used very often, it is more convenient not to mark it with a question-mark or, if you use classical EBNF-syntax, to enclose it with square brackets.The default regular expression for the tilde-whitespace captures arbitrarily many spaces and tabs and at most one linefeed, but not an empty line (
[ \\t]*(?:\\n[ \\t]*)?(?!\\n)), as this is the most convenient way to define whitespace for text-data. However, the tilde whitespace can also be defined with any other regular expression with the@whitespace-directive.Let’s go back to our JSON-grammar and define the optional insignificant whitespace marked by the tilde-character in such a way that it matches any amount of horizontal or vertical whitespace, which makes much more sense in the context of json than the default tilde-whitespace that is restricted vertically to at most a single linefeed:
When redefining the tilde-whitespace, make sure that your regular expression also matches the empty string! There is no need to worry that the syntax tree get’s cluttered by empty whitespace-nodes, because tilde-whitespace always yields anonymous nodes and DHParser drops empty anonymous nodes right away.
Comments can be defined using the
@comment-directive. DHParser automatically intermingles comments and whitespace so that where-ever tilde-whitespace is allowed, a comment defined by the@comment-directive is also allowed:Since the json-grammar still contains the
@drop = whitespace, ...- directive from earlier on (next to other tree-reductions), the comments have been nicely dropped along with the tilde-whitespace.There is one caveat: When using comments alongside with whitespace that captures at most one linefeed, the comments should be defined in such a way that the last character of a comment is never a linefeed.
Also a few limitations of the tilde-whitespace and directive-defined comments should be kept in mind: 1. Only one kind of insignificant whitespace can be defined this way. If there are more kinds of insignificant whitespace, all but one need to be defined conventionally as part of the grammar. 2. Both directive-defined comments and tilde-whitespace can only be defined by a regular expression. In particular, nested comments are impossible to define with regular expressions, only.
However, using tilde-whitespace has yet one more benefit: With the tilde-whitespace, cluttering of the grammar with whitespace-markers can be avoid, by adding implicit whitespace adjacent to string-literals. Remember the definition of the JSON-Grammar earlier. If you look at a definition like:
object = "{" ~ member ( "," ~ §member )* "}" ~, you’ll notice that there are three whitespace markers, one next to each delimiter. Naturally so, because one usually wants to allow users of a domain specific language to put whitespace around delimiters.You may wonder, why the tilde appears only on the right hand side of the literals, although you’d probably like to allow whitespace on both side of a literal like “{”. But if you look at the grammar closely, you’ll find that almost every symbol definition ends either with a tilde sign or a symbol the definition of which ends with a tilde sign, which means that they allow whitespace on the right hand side. Now, if all elements of the grammar allow whitespace on the right hand side, this means that automatically also have whitespace on the left-hand side, too, which is, of course the whitespace on the right hand side of the previous element.
In order to reduce cluttering the grammar with tile-signs, DHParser allows to turn on implicit tilde-whitespace adjacent to any string literal with the directive
@ literalws = rightor@ literalws = left. As the argument of the directive suggests, whitespace is either “eaten” at the right hand side or the left hand side of the literal. String literals can either be enclose in double quotes “…” or single quotes ‘…’. Both kinds of literals will have implicit whitespace, if the@literalws-directive is used.(Don’t confuse implicit whitespace with insignificant whitespace: Insignificant whitespace is whitespace you do not need any more after parsing. Implicit whitespace is whitespace you do not denote explicitly in the grammar. It’s a specialty of DHParser and DHParser allows onl the insignificant whitespace denoted by the tilde-character to be declared as “implicit”.)
If left-adjacent whitespace is declared as implicit with the
@literalws-directive, the expression:object = "{" ~ member ( "," ~ §member )* "}" ~can be written as:
object = "{" member ( "," §member )* "}"which is easier to read.
For situations where implicit whitespace is not desired, DHParser has a special kind of string literal, written with backticks, which never carries any implicit whitespace. This is important, when literals are used for signs that enclose content, like the quotation marks for the string literals in our JSON-Grammar:
Regular expressions, also, never carry implicit whitespace. So, if you are using regular expressions as delimiters, you still have to add the tilde character, if adjacent insignificant whitespace is to be allowed:
The complete json-grammar now looks like this:
The whitespace defined by the
@whitespace-directive can be access from within the grammar via the nameWHITESPACE__. Other than the tilde-sign this name refers to the pure whitespace that is not intermingles with comments. Similarly, comments defined by the@comment-directive can be accessed via the symbolCOMMENT__.