# ebnf.py - EBNF -> Python-Parser compilation for DHParser
#
# Copyright 2016  by Eckhart Arnold (arnold@badw.de)
#                 Bavarian Academy of Sciences and Humanities (badw.de)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied.  See the License for the specific language governing
# permissions and limitations under the License.


'''Module ``ebnf`` provides an EBNF-parser-generator. The parser-generator
compiles an EBNF-Grammar into an executable Python-class. An instance of
this class can be called to parse text-documents conforming to the grammar.
It will return the concrete-syntax-tree of the document. Various flavors of
EBNF- or PEG- (Parsing-Expression-Grammar) notations are supported.

Usually, the classes and functions provided by DHParser.ebnf will
not be called directly, because it is much simpler to use the high-level
API in module :py:mod:`DHParser.dsl`, in particular
:py:func:`DHParser.dsl.create_parser`.

The ebnf-module is structured just like the parser-modules that are
generated by running the "dhparser"-script on an EBNF-grammar or
by executing the test-runner script in the project-directory of
a DHParser-project. It consists of the following sections, each
introduced by a comment-block with the section's name:

1. The *EBNF-preprocessor* takes care of
   chaining the main file and the included files (if there are
   any) into one single document that is passed on to the parser.
   (=> :py:func:`preprocessor_factory`)

2. The *EBNF-parser* translates the EBNF-grammar into a
   concrete-syntax-tree. (=> :py:class:`ConfigurableEBNFGrammar`)

3. The *AST-transformation* molds the concrete-syntax-tree into an
   abstract-syntax-tree (AST) that represents the syntactical structure
   of the grammar-source-code. (=> :py:func:`get_ebnf_transformer`)

4. Finally, the *EBNF-compiler* compiles the AST into an executable
   Python-class that is a descendant of :py:class:`~parse.Grammar`
   and is composed mostly of instances of the parser-classes
   from module :py:mod:`DHParser.parse`. (=> :py:class:`EBNFCompiler`)

   (Each symbol of the grammar
   is represented by a class variable to which a nested call of
   parser-class-instantiations is assigned. These parser-instances
   serve as a prototype from which grammar objects are derived
   via deep-copy upon the instantiation of the grammar-class.)

   This section also contains source-code-snippets and -templates
   for the Python-parser code the compiler produces.

The most notable difference from ordinary DHParser-projects is that
the DHParser.ebnf-module contains two :py:class:`~parse.Grammar`-classes, one
for parsing code that strictly follows DHParser's EBNF-syntax
(:py:class:`ConfigurableEBNFGrammar`) and another one that is able to
parse many different brands of EBNF-syntax (:py:class:`HeuristicEBNFGrammar`)
at the cost of parsing speed. When parsing or compiling an EBNF-grammar with
any of the high-level functions into Python-code, the faster "configurable"
EBNF-grammar is tried first, and, if that fails with particular errors which
suggest that the failure might be merely due to the use of a different brand
of EBNF, a second attempt is made with the slower heuristic EBNF-parser.

The EBNF-compiler is actually split into two classes: :py:class:`EBNFCompiler`,
which contains the EBNF-AST -> Python compiler proper, and
:py:class:`EBNFDirectives`, a helper class that keeps track of the
directives in use, so as to avoid overburdening the compiler-class with
instance variables.

Just like DHParser's (auto-)generated parser-scripts, the classes contained in
:py:mod:`DHParser.ebnf` should not be instantiated directly. Unlike
the parser scripts, though, the ebnf-module does not provide Junctions
with factory-functions for each stage from preprocessing to compiling.
Instead, it provides factory-functions that return one singleton instance
per thread of each class, namely:

* :py:func:`~ebnf.get_ebnf_parser`
* :py:func:`~ebnf.get_ebnf_transformer`
* :py:func:`~ebnf.get_ebnf_compiler`

These are supplemented by the quick-use-functions:

* :py:func:`~ebnf.preprocess_ebnf`
* :py:func:`~ebnf.parse_ebnf`
* :py:func:`~ebnf.transform_ebnf`
* :py:func:`~ebnf.compile_ebnf`

The following example shows how the classes and functions of the
ebnf-module can be connected to produce runnable Python-code from
an EBNF-grammar. It is meant to help understand the role of
these classes as well as, in a simplified manner, the
basic working mechanism of higher-level functions like
:py:func:`DHParser.dsl.create_parser`. In any practical application,
the use of the high-level functions from :py:mod:`DHParser.dsl` is
to be preferred. One can think of the DHParser.dsl-module
as the public API of the ebnf-module.

This said, here is how a Python-parser can be generated
from a grammar, step by step::

    >>> arithmetic_ebnf = r"""
    ... @ whitespace  = vertical
    ... @ literalws   = right
    ... @ drop        = whitespace, strings
    ... expression = term  { (add | sub) term}
    ... term       = factor { (div | mul) factor}
    ... factor     = [minus] (NUMBER | VARIABLE | group)
    ... group      = "(" expression ")"
    ... add        = "+"
    ... sub        = "-"
    ... mul        = "*"
    ... div        = "/"
    ... minus      = `-`
    ... NUMBER     = /(?:0|(?:[1-9]\\d*))(?:\\.\\d+)?/~
    ... VARIABLE   = /[A-Za-z]/~"""

    >>> # 1. Compilation of an EBNF-grammar into Python-source-code
    >>> ebnf_parser = ConfigurableEBNFGrammar()
    >>> ebnf_transformer = EBNFTransform()
    >>> ebnf_compiler = EBNFCompiler()

    >>> CST = ebnf_parser(arithmetic_ebnf)
    >>> AST = ebnf_transformer(CST)   # CST should be considered invalid after that
    >>> ebnf_compiler.set_grammar_name("Arithmetic")
    >>> from DHParser.configuration import get_config_value, set_config_value
    >>> save = get_config_value('optimization_level')
    >>> set_config_value('optimization_level', 0)
    >>> python_src = ebnf_compiler(AST)
    >>> set_config_value('optimization_level', save)
    >>> assert not AST.errors
    >>> print(python_src)
    class ArithmeticGrammar(Grammar):
        r"""Parser for an Arithmetic source file.
        """
        expression = Forward()
        disposable__ = re.compile('$.')
        static_analysis_pending__ = []  # type: List[bool]
        parser_initialization__ = ["upon instantiation"]
        COMMENT__ = r''
        comment_rx__ = RX_NEVER_MATCH
        WHITESPACE__ = r'\\s*'
        WSP_RE__ = mixin_comment(whitespace=WHITESPACE__, comment=COMMENT__)
        wsp__ = Whitespace(WSP_RE__)
        dwsp__ = Drop(Whitespace(WSP_RE__))
        VARIABLE = Series(RegExp('[A-Za-z]'), dwsp__)
        NUMBER = Series(RegExp('(?:0|(?:[1-9]\\\\d*))(?:\\\\.\\\\d+)?'), dwsp__)
        minus = Text("-")
        div = Series(Text("/"), dwsp__)
        mul = Series(Text("*"), dwsp__)
        sub = Series(Text("-"), dwsp__)
        add = Series(Text("+"), dwsp__)
        group = Series(Series(Drop(Text("(")), dwsp__), expression, Series(Drop(Text(")")), dwsp__))
        factor = Series(Option(minus), Alternative(NUMBER, VARIABLE, group))
        term = Series(factor, ZeroOrMore(Series(Alternative(div, mul), factor)))
        expression.set(Series(term, ZeroOrMore(Series(Alternative(add, sub), term))))
        root__ = expression
    <BLANKLINE>
    parsing: PseudoJunction = create_parser_junction(ArithmeticGrammar)
    get_grammar = parsing.factory # for backwards compatibility, only
    <BLANKLINE>

    >>> # 2. Execution of the Python-source and extraction of the Grammar-class
    >>> code = compile(DHPARSER_IMPORTS + python_src, '<string>', 'exec')
    >>> namespace = {}
    >>> exec(code, namespace)
    >>> ArithmeticGrammar = namespace['ArithmeticGrammar']

    >>> # 3. Instantiation of the Grammar class and parsing of an expression
    >>> arithmetic_parser = ArithmeticGrammar()
    >>> syntax_tree = arithmetic_parser("2 + 3 * 4")
    >>> print(syntax_tree.as_sxpr())
    (expression
      (term
        (factor
          (NUMBER "2")))
      (add "+")
      (term
        (factor
          (NUMBER "3"))
        (mul "*")
        (factor
          (NUMBER "4"))))

Of course, the first part, the compilation of the grammar to Python-code,
could also have been achieved with::

    >>> python_src = compile_ebnf(arithmetic_ebnf, "Arithmetic").result

And for the execution of the Python-source and extraction of the Grammar-class,
one can use :py:func:`DHParser.toolkit.compile_python_object`::

    >>> from DHParser import toolkit
    >>> ArithmeticGrammar = toolkit.compile_python_object(
    ...     DHPARSER_IMPORTS + python_src, "ArithmeticGrammar")
    >>> arithmetic_parser = ArithmeticGrammar()
    >>> syntax_tree_2 = arithmetic_parser("2 + 3 * 4")
    >>> assert syntax_tree_2.equals(syntax_tree)

The recommended canonical way for the last step, however, would be::

    >>> parsing = toolkit.compile_python_object(
    ...     DHPARSER_IMPORTS + python_src, "parsing")
    >>> arithmetic_parser = parsing.factory()
    >>> syntax_tree_3 = arithmetic_parser("2 + 3 * 4")
    >>> assert syntax_tree_3.equals(syntax_tree)

By using the factory-function of the parsing-junction to obtain a
grammar-object instead of instantiating it directly, the grammar-object
is instantiated at most once per thread. Re-using the same grammar-object
is more efficient than re-instantiating it for every new document to be
parsed. At the same time, grammar-objects must not be shared between
threads or processes. (See also
:py:class:`~toolkit.ThreadLocalSingletonFactory`.)
'''

from __future__ import annotations

import copy
from functools import partial
import keyword
import os
import sys
from typing import Callable, Dict, List, Set, FrozenSet, Tuple, Union, Optional

if sys.version_info >= (3, 6, 0):
    OrderedDict = dict
else:
    from collections import OrderedDict

try:
    import cython
except ImportError:
    import DHParser.externallibs.shadow_cython as cython

from DHParser.compile import CompilerError, Compiler, CompilationResult, compile_source
from DHParser.configuration import access_thread_locals, get_config_value, \
    NEVER_MATCH_PATTERN, ALLOWED_PRESET_VALUES
from DHParser.error import Error, AMBIGUOUS_ERROR_HANDLING, WARNING, REDECLARED_TOKEN_WARNING,\
    REDEFINED_DIRECTIVE, UNUSED_ERROR_HANDLING_WARNING, \
    DIRECTIVE_FOR_NONEXISTANT_SYMBOL, UNDEFINED_SYMBOL_IN_TRANSTABLE_WARNING, \
    UNCONNECTED_SYMBOL_WARNING, REORDERING_OF_ALTERNATIVES_REQUIRED, BAD_ORDER_OF_ALTERNATIVES, \
    EMPTY_GRAMMAR_ERROR, MALFORMED_REGULAR_EXPRESSION, PEG_EXPRESSION_IN_DIRECTIVE_WO_BRACKETS, \
    STRUCTURAL_ERROR_IN_AST, SYMBOL_NAME_IS_PYTHON_KEYWORD, UNDEFINED_SYMBOL, ERROR, FATAL, \
    WRONG_NUMBER_OF_ARGUMENTS, UNKNOWN_MACRO_ARGUMENT, \
    UNDEFINED_MACRO, RECURSIVE_MACRO_CALL, UNUSED_MACRO_ARGUMENTS_WARNING, has_errors
from DHParser.parse import Parser, Grammar, mixin_comment, mixin_nonempty, Forward, RegExp, SmartRE, \
    Drop, DropFrom, Lookahead, NegativeLookahead, Alternative, Series, Option, ZeroOrMore, OneOrMore, \
    Text, Capture, Retrieve, Pop, optional_last_value, GrammarError, Whitespace, Always, Never, \
    Synonym, INFINITE, matching_bracket, ParseFunc, update_scanner, CombinedParser, parser_names
from DHParser.preprocess import PreprocessorFunc, PreprocessorResult, gen_find_include_func, \
    preprocess_includes, make_preprocessor, chain_preprocessors
from DHParser.nodetree import Node, RootNode, Path, WHITESPACE_PTYPE, KEEP_COMMENTS_PTYPE, \
    TOKEN_PTYPE, ZOMBIE_TAG, flatten_sxpr
from DHParser.toolkit import load_if_file, wrap_str_literal, escape_ctrl_chars, md5, \
    sane_parser_name, re, expand_table, unrepr, compile_python_object, \
    ThreadLocalSingletonFactory, TypeAlias
from DHParser.transform import TransformerFunc, transformer, remove_brackets, change_name, \
    reduce_single_child, replace_by_single_child, is_empty, remove_children, add_error, \
    remove_tokens, remove_anonymous_tokens, flatten, forbid, assert_content, remove_children_if, \
    all_of, not_one_of, apply_if, neg, has_parent, BLOCK_LEAVES
from DHParser.versionnumber import __version__


__all__ = ('DHPARSER_IMPORTS',
           'get_ebnf_preprocessor',
           'get_ebnf_grammar',
           'get_ebnf_transformer',
           'get_ebnf_compiler',
           'preprocess_ebnf',
           'parse_ebnf',
           'transform_ebnf',
           'compile_ebnf_ast',
           'HeuristicEBNFGrammar',
           'ConfigurableEBNFGrammar',
           'EBNFTransform',
           'EBNFCompilerError',
           'EBNFDirectives',
           'WHITESPACE_TYPES',
           'EBNFCompiler',
           'grammar_changed',
           'compile_ebnf')


########################################################################
#
# EBNF preprocessing
#
########################################################################


RE_INCLUDE = r'@[ \t]*include[ \t]*=[ \t]*"(?P<name>.*)"'
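# The pattern above matches include-directives of the form
#     @include = "filename.ebnf"
# where "filename.ebnf" is merely an illustrative name; the name of the
# included file is captured by the regex-group "name".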


def EBNFTokenizer(original_text) -> Tuple[str, List[Error]]:
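    # EBNF-sources require no tokenization; therefore, this tokenizer simply
    # returns the source-text unchanged, together with an empty error-list.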
    return original_text, []


def preprocessor_factory() -> PreprocessorFunc:
    find_next_include = gen_find_include_func(RE_INCLUDE, HeuristicEBNFGrammar.comment_rx__)
    include_prep = partial(preprocess_includes, find_next_include=find_next_include)
    EBNFPreprocessor = make_preprocessor(EBNFTokenizer)
    return chain_preprocessors(include_prep, EBNFPreprocessor)


get_ebnf_preprocessor = ThreadLocalSingletonFactory(preprocessor_factory)


def preprocess_ebnf(ebnf: str, source_name="source") -> PreprocessorResult:
    r"""Preprocesses the @include-directives of an EBNF-source."""
    return get_ebnf_preprocessor()(ebnf, source_name)
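
# Minimal usage sketch (the file names are hypothetical): given an
# EBNF-source that contains an @include-directive, the preprocessor splices
# the referenced file into the source before parsing:
#
#     result = preprocess_ebnf('@include = "tokens.ebnf"\n...', 'main.ebnf')
#
# ``result`` is a PreprocessorResult named tuple; see DHParser.preprocess
# for its fields (preprocessed text, source mapping and errors).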


########################################################################
#
# EBNF parsing
#
########################################################################

class HeuristicEBNFGrammar(Grammar):
    r"""Parser for an EBNF source file that heuristically detects the
    used syntactical variant of EBNF on the fly.

    This grammar is tuned for flexibility, that is, it supports as many
    different flavors of EBNF as possible. However, this flexibility
    comes at the cost of some ambiguities. In particular:

    1. the alternative OR-operator / could be mistaken for the start
       of a regular expression and vice versa, and
    2. character ranges [a-z] can be mistaken for optional blocks
       and vice versa

    A strategy to avoid these ambiguities is to do all of the following:

    - replace the free_char-parser by a never matching parser
    - if this is done, it is safe to replace the char_range_heuristics-
      parser by an always matching parser
    - replace the regex_heuristics by an always matching parser

    Ambiguities can also be avoided by NOT using all the syntactic variants
    made possible by this EBNF-grammar within one and the same EBNF-document.

    EBNF-definition of the Grammar::

        @ comment    = /(?!#x[A-Fa-f0-9])#.*(?:\n|$)|\/\*(?:.|\n)*?\*\/|\(\*(?:.|\n)*?\*\)/
            # comments can be either C-Style: /* ... */
            # or pascal/modula/oberon-style: (* ... *)
            # or python-style: # ... \n, excluding, however, character markers: #x20
        @ whitespace = /\s*/                            # whitespace includes linefeed
        @ literalws  = right                            # trailing whitespace of literals will be ignored tacitly
        @ hide       = is_mdef, component, pure_elem, countable, no_range,
                       FOLLOW_UP, MOD_SYM, MOD_SEP, ANY_SUFFIX, EOF
        @ drop       = whitespace, MOD_SYM, EOF, no_range   # do not include these even in the concrete syntax tree
        @ RNG_BRACE_filter = matching_bracket()         # filter or transform content of RNG_BRACE on retrieve

        # re-entry-rules for resuming after parsing-error
        @ definition_resume = /\n\s*(?=@|\w+\w*\s*=)/
        @ directive_resume  = /\n\s*(?=@|\w+\w*\s*=)/

        # specialized error messages for certain cases
        @ definition_error = /,/, 'Delimiter "," not expected in definition!\nEither this was meant to '
                                  'be a directive and the directive symbol @ is missing\nor the error is '
                                  'due to inconsistent use of the comma as a delimiter\nfor the elements '
                                  'of a sequence.'

        #: top-level

        syntax     = ~ { definition | directive | macrodef } EOF
        definition = [modifier] symbol §:DEF~ [ :OR~ ] expression [ MOD_SYM~ hide ]
                     :ENDL~ & FOLLOW_UP  # [:OR~] to support v. Rossum's syntax
          modifier = (drop | [hide]) MOD_SEP   # node LF after modifier allowed!
          is_def   = [ MOD_SEP symbol ] :DEF | MOD_SEP is_mdef
          _is_def  = [ MOD_SEP symbol ] _DEF | MOD_SEP is_mdef
          MOD_SEP  = / *: */

        directive  = "@" §symbol "=" component { "," component } & FOLLOW_UP
          # component  = (regexp | literals | procedure | symbol !DEF)
          component  = regexp | literals | procedure | symbol !_DEF !_is_def
                     | &`$` !is_mdef § placeholder !is_def
                     | "(" expression ")" | RAISE_EXPR_WO_BRACKETS expression
          literals   = { literal }+       # string chaining, only allowed in directives!
          procedure  = SYM_REGEX "()"     # procedure name, only allowed in directives!

        macrodef   = [modifier] "$" name~ ["(" §placeholder { "," placeholder } ")"]
                     :DEF~ [ OR~ ] macrobody [ MOD_SYM~ hide ] :ENDL~ & FOLLOW_UP
          macrobody  = expression
          is_mdef    = "$" name ["(" placeholder { "," placeholder } ")"] ~:DEF

        FOLLOW_UP  = `@` | `$` | modifier | symbol | EOF

        #: components

        expression = sequence { :OR~ sequence }
        sequence   = ["§"] ( interleave | lookaround )  # "§" means all following terms mandatory
                     { !`@` !(symbol :DEF) :AND~ ["§"] ( interleave | lookaround ) }
        interleave = difference { "°" ["§"] difference }
        lookaround = flowmarker § part
        difference = term [!`->` "-" § part]
        term       = (oneormore | counted | repetition | option | pure_elem) [ MOD_SYM~ drop ]
        part       = (oneormore | pure_elem) [ MOD_SYM~ drop ]

        #: tree-reduction-markers aka "AST-hints"

        drop       = "DROP" | "Drop" | "drop" | "SKIP" | "Skip" | "skip"
        hide       = "HIDE" | "Hide" | "hide" | "DISPOSE" | "Dispose" | "dispose"

        #: elements

        countable  = option | oneormore | element
        pure_elem  = element § !ANY_SUFFIX   # element strictly without a suffix
        element    = [retrieveop] symbol !is_def  # negative lookahead to be sure it's not a definition
                   | literal
                   | plaintext
                   | char_ranges
                   | regexp
                   | char_range
                   | character ~
                   | any_char
                   | whitespace
                   | group
                   | macro !is_def
                   | placeholder !is_def
                   | parser                  # a user-defined parser
        ANY_SUFFIX = /[?*+]/

        #: flow-operators

        flowmarker = "!"  | "&"    # '!' negative lookahead, '&' positive lookahead
                   | "<-!" | "<-&" # '<-!' negative lookbehind, '<-&' positive lookbehind
        retrieveop = "::" | ":?" | ":"  # '::' pop, ':?' optional pop, ':' retrieve

        #: groups

        group      = "(" no_range §expression ")"
        oneormore  = "{" no_range expression "}+" | element "+"
        repetition = "{" no_range §expression "}" | element "*" no_range
        option     = !char_range "[" §expression "]" | element "?"
        counted    = countable range | countable :TIMES~ multiplier | multiplier :TIMES~ §countable

        range      = RNG_BRACE~ multiplier [ :RNG_DELIM~ multiplier ] ::RNG_BRACE~
        no_range   = !multiplier | &multiplier :TIMES
        multiplier = /[1-9]\d*/~

        #: leaf-elements

        parser     = "@" name "(" [argument] ")"  # a user defined parser
          argument = literal | name~
        symbol     = SYM_REGEX ~  # e.g. expression, term, parameter_list
        literal    = /"(?:(?<!\\)\\"|[^"])*?"/~   # e.g. "(", '+', 'while'
                   | /'(?:(?<!\\)\\'|[^'])*?'/~   # whitespace following literals will be ignored tacitly.
                   | /’(?:(?<!\\)\\’|[^’])*?’/~
        plaintext  = /`(?:(?<!\\)\\`|[^`])*?`/~   # like literal but does not eat whitespace
                   | /´(?:(?<!\\)\\´|[^´])*?´/~
        regexp     = :RE_LEADIN RE_CORE :RE_LEADOUT ~   # e.g. /\w+/, ~/#.*(?:\n|$)/~
        # regexp   = /\/(?:(?<!\\)\\(?:\/)|[^\/])*?\//~ # e.g. /\w+/, ~/#.*(?:\n|$)/~
        char_range = `[` &char_range_heuristics [`^`] { range_desc }+ "]"
        char_ranges = RE_LEADIN range_chain { `|` range_chain } RE_LEADOUT ~
          range_chain = `[` [`^`] { range_desc }+ `]`
          range_desc = (character | free_char) [ [`-`] (character | free_char) ]
        character  = :CH_LEADIN HEXCODE
        free_char  = /[^\n\[\]\\]/ | /\\[nrtfv`´'"(){}\[\]\/\\]/
        any_char   = "."
        whitespace = /~/~  # insignificant whitespace

        #: macros

        macro       = "$" name "(" no_range expression { "," no_range expression } ")"
        placeholder = "$" name !`(` ~
        name        = SYM_REGEX

        #: delimiters

        EOF = !/./ [:?ENDL] [:?DEF] [:?OR] [:?AND]   # [:?DEF], [:?OR], ... clear stack by eating stored value
              [:?RNG_DELIM] [:?BRACE_SIGN] [:?CH_LEADIN] [:?TIMES] [:?RE_LEADIN] [:?RE_LEADOUT]

        DEF  = _DEF
        _DEF = `=` | `:=` | `::=` | `<-` | /:\n/ | `: `  # with `: `, retrieve markers mustn't be followed by a blank!
        OR   = `|` | `/` !regex_heuristics
        AND  = `,` | ``
        ENDL = `;` | ``

        RNG_BRACE  = :BRACE_SIGN
        BRACE_SIGN = `{` | `(`
        RNG_DELIM  = `,`
        TIMES      = `*`

        RE_LEADIN  = `/` &regex_heuristics | `^/`
        RE_LEADOUT = `/`

        CH_LEADIN  = `0x` | `#x` | `\x` | `\u` | `\U`

        MOD_SYM    = `->`

        #: heuristics

        char_range_heuristics = ! ( /[\n]/ | more_than_one_blank
                                  | ~ literal_heuristics
                                  | ~ [`::`|`:?`|`:`] STRICT_SYM_REGEX /\s*\]/ )
                                & ({ range_desc }+ `]`)
          STRICT_SYM_REGEX    = /(?!\d)\w+/
        more_than_one_blank   = /[^ \]]*[ ][^ \]]*[ ]/
        literal_heuristics    = /~?\s*"(?:[\\]\]|[^\]]|[^\\]\[[^"]*)*"/
                              | /~?\s*'(?:[\\]\]|[^\]]|[^\\]\[[^']*)*'/
                              | /~?\s*`(?:[\\]\]|[^\]]|[^\\]\[[^`]*)*`/
                              | /~?\s*´(?:[\\]\]|[^\]]|[^\\]\[[^´]*)*´/
                              | /~?\s*\/(?:[\\]\]|[^\]]|[^\\]\[[^\/]*)*\//
        regex_heuristics      = ! ( / +`[^`]*` +\//
                                  | / +´[^´]*´ +\//
                                  | / +'[^']*' +\//
                                  | / +"[^"]*" +\//
                                  | / +\w+ +\// )
                                ( /[^\/\n*?+\\]*[*?+\\][^\/\n]*\//
                                | /[^\w]+\//
                                | /[^ ]/ )

        #: basic-regexes

        RE_CORE    = /(?:(?<!\\)\\(?:\/)|[^\/])*/   # core of a regular expression, i.e. the dots in /.../
        SYM_REGEX  = /(?!\d)\w(?:-?\w)*/            # regular expression for symbols
        HEXCODE    = /(?:[A-Fa-f1-9]|0(?!x)){1,8}/

        #: error-markers

        RAISE_EXPR_WO_BRACKETS = ``
    """
    countable = Forward()
    element = Forward()
    expression = Forward()
    source_hash__ = "ec18718eb3b0857b7831545a28098fe0"
    disposable__ = re.compile(
        '(?:$.)|(?:FOLLOW_UP$|is_mdef$|MOD_SYM$|countable$|EOF$|pure_elem$|MOD_SEP$|component$|ANY_SUFFIX$|no_range$)')
    static_analysis_pending__ = []  # type: List[bool]
    parser_initialization__ = ["upon instantiation"]
    error_messages__ = {'definition': [(re.compile(r','),
        'Delimiter "," not expected in definition!\\nEither this was meant to be a directive '
        'and the directive symbol @ is missing\\nor the error is due to inconsistent use of '
        'the comma as a delimiter\\nfor the elements of a sequence.')]}
    COMMENT__ = r'(?!#x[A-Fa-f0-9])#.*(?:\n|$)|\/\*(?:.|\n)*?\*\/|\(\*(?:.|\n)*?\*\)'
    comment_rx__ = re.compile(COMMENT__)
    WHITESPACE__ = r'\s*'
    WSP_RE__ = mixin_comment(whitespace=WHITESPACE__, comment=COMMENT__)
    wsp__ = Whitespace(WSP_RE__)
    dwsp__ = Drop(Whitespace(WSP_RE__))
    RAISE_EXPR_WO_BRACKETS = Text("")
    HEXCODE = RegExp('(?:[A-Fa-f1-9]|0(?!x)){1,8}')
    SYM_REGEX = RegExp('(?!\\d)\\w(?:-?\\w)*')
    RE_CORE = RegExp('(?:(?<!\\\\)\\\\(?:/)|[^/])*')
    regex_heuristics = Series(
        NegativeLookahead(Alternative(
            RegExp(' +`[^`]*` +/'), RegExp(' +´[^´]*´ +/'), RegExp(" +'[^']*' +/"),
            RegExp(' +"[^"]*" +/'), RegExp(' +\\w+ +/'))),
        Alternative(RegExp('[^/\\n*?+\\\\]*[*?+\\\\][^/\\n]*/'),
                    RegExp('[^\\w]+/'), RegExp('[^ ]')))
    literal_heuristics = Alternative(
        RegExp('~?\\s*"(?:[\\\\]\\]|[^\\]]|[^\\\\]\\[[^"]*)*"'),
        RegExp("~?\\s*'(?:[\\\\]\\]|[^\\]]|[^\\\\]\\[[^']*)*'"),
        RegExp('~?\\s*`(?:[\\\\]\\]|[^\\]]|[^\\\\]\\[[^`]*)*`'),
        RegExp('~?\\s*´(?:[\\\\]\\]|[^\\]]|[^\\\\]\\[[^´]*)*´'),
        RegExp('~?\\s*/(?:[\\\\]\\]|[^\\]]|[^\\\\]\\[[^/]*)*/'))
    more_than_one_blank = RegExp('[^ \\]]*[ ][^ \\]]*[ ]')
    STRICT_SYM_REGEX = RegExp('(?!\\d)\\w+')
    CH_LEADIN = Capture(Alternative(Text("0x"), Text("#x"), Text("\\x"), Text("\\u"),
                                    Text("\\U")), zero_length_warning=False)
    MOD_SYM = Drop(Text("->"))
    character = Series(Retrieve(CH_LEADIN), HEXCODE)
    RE_LEADOUT = Capture(Text("/"), zero_length_warning=True)
    RE_LEADIN = Capture(Alternative(Series(Text("/"), Lookahead(regex_heuristics)),
                                    Text("^/")), zero_length_warning=True)
    TIMES = Capture(Text("*"), zero_length_warning=False)
    RNG_DELIM = Capture(Text(","), zero_length_warning=False)
    BRACE_SIGN = Capture(Alternative(Text("{"), Text("(")), zero_length_warning=False)
    RNG_BRACE = Capture(Retrieve(BRACE_SIGN), zero_length_warning=True)
    ENDL = Capture(Alternative(Text(";"), Text("")), zero_length_warning=False)
    AND = Capture(Alternative(Text(","), Text("")), zero_length_warning=False)
    OR = Capture(Alternative(Text("|"), Series(Text("/"),
                                               NegativeLookahead(regex_heuristics))),
                 zero_length_warning=True)
    _DEF = Alternative(Text("="), Text(":="), Text("::="), Text("<-"),
                       RegExp(':\\n'), Text(": "))
    DEF = Capture(Synonym(_DEF), zero_length_warning=False)
    EOF = Drop(Series(NegativeLookahead(RegExp('.')),
                      Option(Pop(ENDL, match_func=optional_last_value)),
                      Option(Pop(DEF, match_func=optional_last_value)),
                      Option(Pop(OR, match_func=optional_last_value)),
                      Option(Pop(AND, match_func=optional_last_value)),
                      Option(Pop(RNG_DELIM, match_func=optional_last_value)),
                      Option(Pop(BRACE_SIGN, match_func=optional_last_value)),
                      Option(Pop(CH_LEADIN, match_func=optional_last_value)),
                      Option(Pop(TIMES, match_func=optional_last_value)),
                      Option(Pop(RE_LEADIN, match_func=optional_last_value)),
                      Option(Pop(RE_LEADOUT, match_func=optional_last_value))))
    name = Synonym(SYM_REGEX)
    placeholder = Series(Series(Text("$"), dwsp__), name,
                         NegativeLookahead(Text("(")), dwsp__)
    multiplier = Series(RegExp('[1-9]\\d*'), dwsp__)
    whitespace = Series(RegExp('~'), dwsp__)
    any_char = Series(Text("."), dwsp__)
    free_char = Alternative(RegExp('[^\\n\\[\\]\\\\]'),
                            RegExp('\\\\[nrtfv`´\'"(){}\\[\\]/\\\\]'))
    range_desc = Series(Alternative(character, free_char),
                        Option(Series(Option(Text("-")),
                                      Alternative(character, free_char))))
    char_range_heuristics = Series(
        NegativeLookahead(Alternative(
            RegExp('[\\n]'), more_than_one_blank,
            Series(dwsp__, literal_heuristics),
            Series(dwsp__, Option(Alternative(Text("::"), Text(":?"), Text(":"))),
                   STRICT_SYM_REGEX, RegExp('\\s*\\]')))),
        Lookahead(Series(OneOrMore(range_desc), Text("]"))))
    range_chain = Series(Text("["), Option(Text("^")), OneOrMore(range_desc), Text("]"))
    char_ranges = Series(RE_LEADIN, range_chain,
                         ZeroOrMore(Series(Text("|"), range_chain)), RE_LEADOUT, dwsp__)
    char_range = Series(Text("["), Lookahead(char_range_heuristics), Option(Text("^")),
                        OneOrMore(range_desc), Series(Text("]"), dwsp__))
    regexp = Series(Retrieve(RE_LEADIN), RE_CORE, Retrieve(RE_LEADOUT), dwsp__)
    plaintext = Alternative(Series(RegExp('`(?:(?<!\\\\)\\\\`|[^`])*?`'), dwsp__),
                            Series(RegExp('´(?:(?<!\\\\)\\\\´|[^´])*?´'), dwsp__))
    literal = Alternative(Series(RegExp('"(?:(?<!\\\\)\\\\"|[^"])*?"'), dwsp__),
                          Series(RegExp("'(?:(?<!\\\\)\\\\'|[^'])*?'"), dwsp__),
                          Series(RegExp('’(?:(?<!\\\\)\\\\’|[^’])*?’'), dwsp__))
    symbol = Series(SYM_REGEX, dwsp__)
    argument = Alternative(literal, Series(name, dwsp__))
    parser = Series(Series(Text("@"), dwsp__), name, Series(Text("("), dwsp__),
                    Option(argument), Series(Text(")"), dwsp__))
    no_range = Drop(Alternative(NegativeLookahead(multiplier),
                                Series(Lookahead(multiplier), Retrieve(TIMES))))
    macro = Series(Series(Text("$"), dwsp__), name, Series(Text("("), dwsp__),
                   no_range, expression,
                   ZeroOrMore(Series(Series(Text(","), dwsp__), no_range, expression)),
                   Series(Text(")"), dwsp__))
    range = Series(RNG_BRACE, dwsp__, multiplier,
                   Option(Series(Retrieve(RNG_DELIM), dwsp__, multiplier)),
                   Pop(RNG_BRACE, match_func=matching_bracket), dwsp__)
    counted = Alternative(Series(countable, range),
                          Series(countable, Retrieve(TIMES), dwsp__, multiplier),
                          Series(multiplier, Retrieve(TIMES), dwsp__, countable,
                                 mandatory=3))
    option = Alternative(
        Series(NegativeLookahead(char_range), Series(Text("["), dwsp__), expression,
               Series(Text("]"), dwsp__), mandatory=2),
        Series(element, Series(Text("?"), dwsp__)))
    repetition = Alternative(
        Series(Series(Text("{"), dwsp__), no_range, expression,
               Series(Text("}"), dwsp__), mandatory=2),
        Series(element, Series(Text("*"), dwsp__), no_range))
    oneormore = Alternative(
        Series(Series(Text("{"), dwsp__), no_range, expression,
               Series(Text("}+"), dwsp__)),
        Series(element, Series(Text("+"), dwsp__)))
    group = Series(Series(Text("("), dwsp__), no_range, expression,
                   Series(Text(")"), dwsp__), mandatory=2)
    retrieveop = Alternative(Series(Text("::"), dwsp__), Series(Text(":?"), dwsp__),
                             Series(Text(":"), dwsp__))
    flowmarker = Alternative(Series(Text("!"), dwsp__), Series(Text("&"), dwsp__),
                             Series(Text("<-!"), dwsp__), Series(Text("<-&"), dwsp__))
    ANY_SUFFIX = RegExp('[?*+]')
    is_mdef = Series(Series(Text("$"), dwsp__), name,
                     Option(Series(Series(Text("("), dwsp__), placeholder,
                                   ZeroOrMore(Series(Series(Text(","), dwsp__),
                                                     placeholder)),
                                   Series(Text(")"), dwsp__))),
                     dwsp__, Retrieve(DEF))
    pure_elem = Series(element, NegativeLookahead(ANY_SUFFIX), mandatory=1)
    MOD_SEP = RegExp(' *: *')
    hide = Alternative(Series(Text("HIDE"), dwsp__), Series(Text("Hide"), dwsp__),
                       Series(Text("hide"), dwsp__), Series(Text("DISPOSE"), dwsp__),
                       Series(Text("Dispose"), dwsp__), Series(Text("dispose"), dwsp__))
    drop = Alternative(Series(Text("DROP"), dwsp__), Series(Text("Drop"), dwsp__),
                       Series(Text("drop"), dwsp__), Series(Text("SKIP"), dwsp__),
                       Series(Text("Skip"), dwsp__), Series(Text("skip"), dwsp__))
    part = Series(Alternative(oneormore, pure_elem),
                  Option(Series(MOD_SYM, dwsp__, drop)))
    term = Series(Alternative(oneormore, counted, repetition, option, pure_elem),
                  Option(Series(MOD_SYM, dwsp__, drop)))
    difference = Series(term, Option(Series(NegativeLookahead(Text("->")),
                                            Series(Text("-"), dwsp__), part,
                                            mandatory=2)))
    lookaround = Series(flowmarker, part, mandatory=1)
    interleave = Series(difference,
                        ZeroOrMore(Series(Series(Text("°"), dwsp__),
                                          Option(Series(Text("§"), dwsp__)),
                                          difference)))
    sequence = Series(Option(Series(Text("§"), dwsp__)),
                      Alternative(interleave, lookaround),
                      ZeroOrMore(Series(NegativeLookahead(Text("@")),
                                        NegativeLookahead(Series(symbol, Retrieve(DEF))),
                                        Retrieve(AND), dwsp__,
                                        Option(Series(Text("§"), dwsp__)),
                                        Alternative(interleave, lookaround))))
    modifier = Series(Alternative(drop, Option(hide)), MOD_SEP)
    FOLLOW_UP = Alternative(Text("@"), Text("$"), modifier, symbol, EOF)
    is_def = Alternative(Series(Option(Series(MOD_SEP, symbol)), Retrieve(DEF)),
                         Series(MOD_SEP, is_mdef))
    macrobody = Synonym(expression)
    definition = Series(Option(modifier), symbol, Retrieve(DEF), dwsp__,
                        Option(Series(Retrieve(OR), dwsp__)), expression,
                        Option(Series(MOD_SYM, dwsp__, hide)),
                        Retrieve(ENDL), dwsp__, Lookahead(FOLLOW_UP), mandatory=2)
    procedure = Series(SYM_REGEX, Series(Text("()"), dwsp__))
    literals = OneOrMore(literal)
    macrodef = Series(Option(modifier), Series(Text("$"), dwsp__), name, dwsp__,
                      Option(Series(Series(Text("("), dwsp__), placeholder,
                                    ZeroOrMore(Series(Series(Text(","), dwsp__),
                                                      placeholder)),
                                    Series(Text(")"), dwsp__), mandatory=1)),
                      Retrieve(DEF), dwsp__, Option(Series(OR, dwsp__)), macrobody,
                      Option(Series(MOD_SYM, dwsp__, hide)),
                      Retrieve(ENDL), dwsp__, Lookahead(FOLLOW_UP))
    _is_def = Alternative(Series(Option(Series(MOD_SEP, symbol)), _DEF),
                          Series(MOD_SEP, is_mdef))
    component = Alternative(
        regexp, literals, procedure,
        Series(symbol, NegativeLookahead(_DEF), NegativeLookahead(_is_def)),
        Series(Lookahead(Text("$")), NegativeLookahead(is_mdef), placeholder,
               NegativeLookahead(is_def), mandatory=2),
        Series(Series(Text("("), dwsp__), expression, Series(Text(")"), dwsp__)),
        Series(RAISE_EXPR_WO_BRACKETS, expression))
    directive = Series(Series(Text("@"), dwsp__), symbol, Series(Text("="), dwsp__),
                       component, ZeroOrMore(Series(Series(Text(","), dwsp__),
                                                    component)),
                       Lookahead(FOLLOW_UP), mandatory=1)
    element.set(Alternative(
        Series(Option(retrieveop), symbol, NegativeLookahead(is_def)),
        literal, plaintext, char_ranges, regexp, char_range,
        Series(character, dwsp__), any_char, whitespace, group,
        Series(macro, NegativeLookahead(is_def)),
        Series(placeholder, NegativeLookahead(is_def)), parser))
    countable.set(Alternative(option, oneormore, element))
    expression.set(Series(sequence, ZeroOrMore(Series(Retrieve(OR), dwsp__, sequence))))
    syntax = Series(dwsp__, ZeroOrMore(Alternative(definition, directive, macrodef)), EOF)
    resume_rules__ = {'definition': [re.compile(r'\n\s*(?=@|\w+\w*\s*=)')],
                      'directive': [re.compile(r'\n\s*(?=@|\w+\w*\s*=)')]}
    root__ = syntax

    def __init__(self, root: Optional[Parser] = None,
                 static_analysis: Optional[bool] = None) -> None:
        Grammar.__init__(self, root, static_analysis)
        self.free_char_parsefunc__ = self.free_char._parse
        self.char_range_heuristics_parsefunc__ = self.char_range_heuristics._parse
        self.regex_heuristics_parserfunc__ = self.regex_heuristics._parse
        self.mode__ = 'fixed'

    @property
    def mode(self) -> str:
        def which(p: Parser) -> str:
            if p._parse_proxy.__qualname__ == 'Never._parse':
                return 'never'
            elif p._parse_proxy.__qualname__ == 'Always._parse':
                return 'always'
            else:
                return 'custom'
        signature = (which(self.free_char),
                     which(self.regex_heuristics),
                     which(self.char_range_heuristics))
        if signature == ('custom', 'custom', 'custom'):
            return 'heuristic'
        elif signature == ('never', 'always', 'always'):
            return 'strict'  # or 'classic'
        elif signature == ('custom', 'never', 'always'):
            return 'peg-like'
        elif signature == ('custom', 'always', 'always'):
            return 'regex-like'
        else:
            return "undefined"

    @mode.setter
    def mode(self, mode: str):
        if mode == self.mode:
            return

        def set_parsefunc(p: Parser, f: ParseFunc):
            method = f.__get__(p, type(p))  # bind function f to parser p
            p._parse_proxy = method

        always = Always._parse
        never = Never._parse
        if mode == 'heuristic':
            set_parsefunc(self.free_char, self.free_char_parsefunc__)
            set_parsefunc(self.regex_heuristics, self.regex_heuristics_parserfunc__)
            set_parsefunc(self.char_range_heuristics,
                          self.char_range_heuristics_parsefunc__)
        elif mode in ('strict', 'classic'):
            set_parsefunc(self.free_char, never)
            set_parsefunc(self.regex_heuristics, always)
            set_parsefunc(self.char_range_heuristics, always)
        elif mode == 'peg-like':
            set_parsefunc(self.free_char, self.free_char_parsefunc__)
            set_parsefunc(self.regex_heuristics, never)
            set_parsefunc(self.char_range_heuristics, always)
        elif mode == 'regex-like':
            set_parsefunc(self.free_char, self.free_char_parsefunc__)
            set_parsefunc(self.regex_heuristics, always)
            set_parsefunc(self.char_range_heuristics, always)
        else:
            raise ValueError('Mode must be one of: ' + ', '.join(
                ALLOWED_PRESET_VALUES['syntax_variant']))


class ConfigurableEBNFGrammar(Grammar):
    r"""A parser for an EBNF grammar that can be configured to parse
    different syntactical variants of EBNF. Unlike HeuristicEBNF, this
    parser does not detect the used variant while parsing.

    Different syntactical variants can be configured by adjusting the
    definitions of DEF, OR, AND, ENDL, RNG_OPEN, RNG_CLOSE, RNG_DELIM,
    CH_LEADIN, TIMES, RE_LEADIN, and RE_LEADOUT, either within this grammar
    definition or in the Grammar-object by changing the ``text``-field of
    the respective parser objects.

    EBNF-Definition of the Grammar::

        @ comment    = /(?!#x[A-Fa-f0-9])#.*(?:\n|$)|\/\*(?:.|\n)*?\*\/|\(\*(?:.|\n)*?\*\)/
            # comments can be either C-Style: /* ... */
            # or pascal/modula/oberon-style: (* ... *)
            # or python-style: # ... \n, excluding, however, character markers: #x20
        @ whitespace = /\s*/                            # whitespace includes linefeed
        @ literalws  = right                            # trailing whitespace of literals will be ignored tacitly
        @ hide       = is_mdef, component, pure_elem, countable, no_range,
                       FOLLOW_UP, ANY_SUFFIX, MOD_SYM, MOD_SEP, EOF
        @ drop       = whitespace, MOD_SYM, EOF, no_range   # do not include these even in the concrete syntax tree

        # re-entry-rules to resume parsing after a syntax-error
        @ definition_resume = /\n\s*(?=@|\w+\w*\s*=)/
        @ directive_resume  = /\n\s*(?=@|\w+\w*\s*=)/

        # specialized error messages for certain cases
        @ definition_error = /,/, 'Delimiter "," not expected in definition!\nEither this was meant to '
                                  'be a directive and the directive symbol @ is missing\nor the error is '
                                  'due to inconsistent use of the comma as a delimiter\nfor the elements '
                                  'of a sequence.'

        #: top-level

        syntax     = ~ { definition | directive | macrodef } EOF
        definition = [modifier] symbol §DEF~ [ OR~ ] expression [MOD_SYM~ hide]
                     ENDL~ &FOLLOW_UP  # [OR~] to support v. Rossum's syntax
          modifier = (drop | [hide]) MOD_SEP   # node LF after modifier allowed!
          is_def   = [MOD_SEP symbol] DEF | MOD_SEP is_mdef
          MOD_SEP  = / *: */

        directive  = "@" §symbol "=" component { "," component } &FOLLOW_UP
          component  = regexp | literals | procedure | symbol !is_def
                     | &`$` !is_mdef § placeholder !is_def
                     | "(" expression ")" | RAISE_EXPR_WO_BRACKETS expression
          literals   = { literal }+       # string chaining, only allowed in directives!
          procedure  = SYM_REGEX "()"     # procedure name, only allowed in directives!

        macrodef   = [modifier] "$" name~ ["(" §placeholder { "," placeholder } ")"]
                     DEF~ [ OR~ ] macrobody [MOD_SYM~ hide] ENDL~ & FOLLOW_UP
          macrobody  = expression
          is_mdef    = "$" name ["(" placeholder { "," placeholder } ")"] ~DEF

        FOLLOW_UP  = `@` | `$` | modifier | symbol | EOF

        #: components

        expression = sequence { OR~ sequence }
        sequence   = ["§"] ( interleave | lookaround )  # "§" means all following terms mandatory
                     { AND~ ["§"] ( interleave | lookaround ) }
        interleave = difference { "°" ["§"] difference }
        lookaround = flowmarker § part
        difference = term [!`->` "-" § part]
        term       = (oneormore | counted | repetition | option | pure_elem) [MOD_SYM~ drop]
        part       = (oneormore | pure_elem) [MOD_SYM~ drop]

        #: tree-reduction-markers aka "AST-hints"

        drop       = "DROP" | "Drop" | "drop" | "SKIP" | "Skip" | "skip"
        hide       = "HIDE" | "Hide" | "hide" | "DISPOSE" | "Dispose" | "dispose"

        #: elements

        countable  = option | oneormore | element
        pure_elem  = element § !ANY_SUFFIX   # element strictly without a suffix
        element    = [retrieveop] symbol !is_def
                   | literal
                   | plaintext
                   | char_ranges
                   | character ~
                   | regexp
                   | char_range
                   | any_char
                   | whitespace
                   | group
                   | macro !is_def
                   | placeholder !is_def
                   | parser                  # a user defined parser
        ANY_SUFFIX = /[?*+]/

        #: flow-operators

        flowmarker = "!"  | "&"    # '!' negative lookahead, '&' positive lookahead
                   | "<-!" | "<-&" # '<-!' negative lookbehind, '<-&' positive lookbehind
        retrieveop = "::" | ":?" | ":"  # '::' pop, ':?' optional pop, ':' retrieve

        #: groups

        group      = "(" no_range §expression ")"
        oneormore  = "{" no_range expression "}+" | element "+"
        repetition = "{" no_range §expression "}" | element "*" no_range
        option     = !char_range "[" §expression "]" | element "?"
        counted    = countable range | countable TIMES~ multiplier | multiplier TIMES~ §countable

        range      = RNG_OPEN~ multiplier [ RNG_DELIM~ multiplier ] RNG_CLOSE~
        no_range   = !multiplier | &multiplier TIMES   # should that be &(multiplier TIMES)??
        multiplier = /[1-9]\d*/~

        #: leaf-elements

        parser     = "@" name "(" §[argument] ")"  # a user defined parser
          argument = literal | name~
        symbol     = SYM_REGEX ~  # e.g. expression, term, parameter_list
        literal    = /"(?:(?<!\\)\\"|[^"])*?"/~   # e.g. "(", '+', 'while'
                   | /'(?:(?<!\\)\\'|[^'])*?'/~   # whitespace following literals will be ignored tacitly.
                   | /’(?:(?<!\\)\\’|[^’])*?’/~
        plaintext  = /`(?:(?<!\\)\\`|[^`])*?`/~   # like literal but does not eat whitespace
                   | /´(?:(?<!\\)\\´|[^´])*?´/~
        regexp     = RE_LEADIN RE_CORE RE_LEADOUT ~   # e.g. /\w+/, ~/#.*(?:\n|$)/~

        char_range = `[` [`^`] { restricted_range_desc }+ "]"
          restricted_range_desc = character [ `-` character ]
        char_ranges = RE_LEADIN range_chain { `|` range_chain } RE_LEADOUT ~
          range_chain = `[` [`^`] { range_desc }+ `]`
          range_desc = (character | free_char) [ `-` (character | free_char) ]
        character  = (CH_LEADIN | `\x` | `\u` | `\U`) HEXCODE
        free_char  = /[^\n\[\]\\]/ | /\\[nrtfv`´'"(){}\[\]\/\\]/
        any_char   = "."
        whitespace = /~/~  # insignificant whitespace

        #: macros

        macro       = "$" name "(" no_range expression { "," no_range expression } ")"
        placeholder = "$" name !`(` ~
        name        = SYM_REGEX

        #: delimiters

        EOF = !/./

        DEF        = `=`
        OR         = `|`
        AND        = ``
        ENDL       = ``

        RNG_OPEN   = `{`
        RNG_CLOSE  = `}`
        RNG_DELIM  = `,`
        TIMES      = `*`

        RE_LEADIN  = `/`
        RE_LEADOUT = `/`

        CH_LEADIN  = `0x`

        MOD_SYM    = `->`   # symbol for postfix modifier

        #: basic-regexes

        RE_CORE    = /(?:(?<!\\)\\(?:\/)|[^\/])*/   # core of a regular expression,
                                                    # i.e. the dots in /.../
        SYM_REGEX  = /(?!\d)\w+/                    # regular expression for symbols
        HEXCODE    = /(?:[A-Fa-f1-9]|0(?!x)){1,8}/

        #: error-markers

        RAISE_EXPR_WO_BRACKETS = ``
    """
    countable = Forward()
    element = Forward()
    expression = Forward()
    source_hash__ = "dcc1a4c37c097b00a142af4be0b9a49f"
    disposable__ = re.compile(
        '(?:$.)|(?:MOD_SEP$|ANY_SUFFIX$|EOF$|countable$|component$|FOLLOW_UP$|pure_elem$|is_mdef$|MOD_SYM$|no_range$)')
    static_analysis_pending__ = []  # type: List[bool]
    parser_initialization__ = ["upon instantiation"]
    error_messages__ = {'definition': [(re.compile(r','),
        'Delimiter "," not expected in definition!\\nEither this was meant to be a directive '
        'and the directive symbol @ is missing\\nor the error is due to inconsistent use of '
        'the comma as a delimiter\\nfor the elements of a sequence.')]}
    COMMENT__ = r'(?!#x[A-Fa-f0-9])#.*(?:\n|$)|\/\*(?:.|\n)*?\*\/|\(\*(?:.|\n)*?\*\)'
    comment_rx__ = re.compile(COMMENT__)
    WHITESPACE__ = r'\s*'
    WSP_RE__ = mixin_comment(whitespace=WHITESPACE__, comment=COMMENT__)
    wsp__ = Whitespace(WSP_RE__)
    dwsp__ = Drop(Whitespace(WSP_RE__))
    RAISE_EXPR_WO_BRACKETS = Text("")
    HEXCODE = RegExp('(?:[A-Fa-f1-9]|0(?!x)){1,8}')
    SYM_REGEX = RegExp('(?!\\d)\\w+')
    RE_CORE = RegExp('(?:(?<!\\\\)\\\\(?:/)|[^/])*')
    MOD_SYM = Drop(Text("->"))
    CH_LEADIN = Text("0x")
    RE_LEADOUT = Text("/")
    RE_LEADIN = Text("/")
    TIMES = Text("*")
    RNG_DELIM = Text(",")
    RNG_CLOSE = Text("}")
    RNG_OPEN = Text("{")
    ENDL = Text("")
    AND = Text("")
    OR = Text("|")
    DEF = Text("=")
    EOF = Drop(NegativeLookahead(RegExp('.')))
    name = Synonym(SYM_REGEX)
    placeholder = Series(Series(Text("$"), dwsp__), name,
                         NegativeLookahead(Text("(")), dwsp__)
    multiplier = Series(RegExp('[1-9]\\d*'), dwsp__)
    whitespace = Series(RegExp('~'), dwsp__)
    any_char = Series(Text("."), dwsp__)
    free_char = Alternative(RegExp('[^\\n\\[\\]\\\\]'),
                            RegExp('\\\\[nrtfv`´\'"(){}\\[\\]/\\\\]'))
    character = Series(Alternative(CH_LEADIN, Text("\\x"), Text("\\u"), Text("\\U")),
                       HEXCODE)
    range_desc = Series(Alternative(character, free_char),
                        Option(Series(Text("-"), Alternative(character, free_char))))
    range_chain = Series(Text("["), Option(Text("^")), OneOrMore(range_desc), Text("]"))
    char_ranges = Series(RE_LEADIN, range_chain,
                         ZeroOrMore(Series(Text("|"), range_chain)), RE_LEADOUT, dwsp__)
    restricted_range_desc = Series(character, Option(Series(Text("-"), character)))
    char_range = Series(Text("["), Option(Text("^")),
                        OneOrMore(restricted_range_desc), Series(Text("]"), dwsp__))
    regexp = Series(RE_LEADIN, RE_CORE, RE_LEADOUT, dwsp__)
    plaintext = Alternative(Series(RegExp('`(?:(?<!\\\\)\\\\`|[^`])*?`'), dwsp__),
                            Series(RegExp('´(?:(?<!\\\\)\\\\´|[^´])*?´'), dwsp__))
    literal = Alternative(Series(RegExp('"(?:(?<!\\\\)\\\\"|[^"])*?"'), dwsp__),
                          Series(RegExp("'(?:(?<!\\\\)\\\\'|[^'])*?'"), dwsp__),
                          Series(RegExp('’(?:(?<!\\\\)\\\\’|[^’])*?’'), dwsp__))
    symbol = Series(SYM_REGEX, dwsp__)
    argument = Alternative(literal, Series(name, dwsp__))
    parser = Series(Series(Text("@"), dwsp__), name, Series(Text("("), dwsp__),
                    Option(argument), Series(Text(")"), dwsp__), mandatory=3)
    no_range = Drop(Alternative(NegativeLookahead(multiplier),
                                Series(Lookahead(multiplier), TIMES)))
    macro = Series(Series(Text("$"), dwsp__), name, Series(Text("("), dwsp__),
                   no_range, expression,
                   ZeroOrMore(Series(Series(Text(","), dwsp__), no_range, expression)),
                   Series(Text(")"), dwsp__))
    range = Series(RNG_OPEN, dwsp__, multiplier,
                   Option(Series(RNG_DELIM, dwsp__, multiplier)), RNG_CLOSE, dwsp__)
    counted = Alternative(Series(countable, range),
                          Series(countable, TIMES, dwsp__, multiplier),
                          Series(multiplier, TIMES, dwsp__, countable, mandatory=3))
    option = Alternative(
        Series(NegativeLookahead(char_range), Series(Text("["), dwsp__), expression,
               Series(Text("]"), dwsp__), mandatory=2),
        Series(element, Series(Text("?"), dwsp__)))
    repetition = Alternative(
        Series(Series(Text("{"), dwsp__), no_range, expression,
               Series(Text("}"), dwsp__), mandatory=2),
        Series(element, Series(Text("*"), dwsp__), no_range))
    oneormore = Alternative(
        Series(Series(Text("{"), dwsp__), no_range, expression,
               Series(Text("}+"), dwsp__)),
        Series(element, Series(Text("+"), dwsp__)))
    group = Series(Series(Text("("), dwsp__), no_range, expression,
                   Series(Text(")"), dwsp__), mandatory=2)
    retrieveop = Alternative(Series(Text("::"), dwsp__), Series(Text(":?"), dwsp__),
                             Series(Text(":"), dwsp__))
    flowmarker = Alternative(Series(Text("!"), dwsp__), Series(Text("&"), dwsp__),
                             Series(Text("<-!"), dwsp__), Series(Text("<-&"), dwsp__))
    ANY_SUFFIX = RegExp('[?*+]')
    is_mdef = Series(Series(Text("$"), dwsp__), name,
                     Option(Series(Series(Text("("), dwsp__), placeholder,
                                   ZeroOrMore(Series(Series(Text(","), dwsp__),
                                                     placeholder)),
                                   Series(Text(")"), dwsp__))),
                     dwsp__, DEF)
    pure_elem = Series(element, NegativeLookahead(ANY_SUFFIX), mandatory=1)
    MOD_SEP = RegExp(' *: *')
    hide = Alternative(Series(Text("HIDE"), dwsp__), Series(Text("Hide"), dwsp__),
                       Series(Text("hide"), dwsp__), Series(Text("DISPOSE"), dwsp__),
                       Series(Text("Dispose"), dwsp__), Series(Text("dispose"), dwsp__))
    drop = Alternative(Series(Text("DROP"), dwsp__), Series(Text("Drop"), dwsp__),
                       Series(Text("drop"), dwsp__), Series(Text("SKIP"), dwsp__),
                       Series(Text("Skip"), dwsp__), Series(Text("skip"), dwsp__))
    part = Series(Alternative(oneormore, pure_elem),
                  Option(Series(MOD_SYM, dwsp__, drop)))
    term = Series(Alternative(oneormore, counted, repetition, option, pure_elem),
                  Option(Series(MOD_SYM, dwsp__, drop)))
    difference = Series(term, Option(Series(NegativeLookahead(Text("->")),
                                            Series(Text("-"), dwsp__), part,
                                            mandatory=2)))
    lookaround = Series(flowmarker, part, mandatory=1)
    interleave = Series(difference,
                        ZeroOrMore(Series(Series(Text("°"), dwsp__),
                                          Option(Series(Text("§"), dwsp__)),
                                          difference)))
    sequence = Series(Option(Series(Text("§"), dwsp__)),
                      Alternative(interleave, lookaround),
                      ZeroOrMore(Series(AND, dwsp__,
                                        Option(Series(Text("§"), dwsp__)),
                                        Alternative(interleave, lookaround))))
    modifier = Series(Alternative(drop, Option(hide)), MOD_SEP)
    FOLLOW_UP = Alternative(Text("@"), Text("$"), modifier, symbol, EOF)
    is_def = Alternative(Series(Option(Series(MOD_SEP, symbol)), DEF),
                         Series(MOD_SEP, is_mdef))
    macrobody = Synonym(expression)
    definition = Series(Option(modifier), symbol, DEF, dwsp__,
                        Option(Series(OR, dwsp__)), expression,
                        Option(Series(MOD_SYM, dwsp__, hide)),
                        ENDL, dwsp__, Lookahead(FOLLOW_UP), mandatory=2)
    procedure = Series(SYM_REGEX, Series(Text("()"), dwsp__))
    literals = OneOrMore(literal)
    component = Alternative(
        regexp, literals, procedure,
        Series(symbol, NegativeLookahead(is_def)),
        Series(Lookahead(Text("$")), NegativeLookahead(is_mdef), placeholder,
               NegativeLookahead(is_def), mandatory=2),
        Series(Series(Text("("), dwsp__), expression, Series(Text(")"), dwsp__)),
        Series(RAISE_EXPR_WO_BRACKETS, expression))
    directive = Series(Series(Text("@"), dwsp__), symbol, Series(Text("="), dwsp__),
                       component, ZeroOrMore(Series(Series(Text(","), dwsp__),
                                                    component)),
                       Lookahead(FOLLOW_UP), mandatory=1)
    macrodef = Series(Option(modifier), Series(Text("$"), dwsp__), name, dwsp__,
                      Option(Series(Series(Text("("), dwsp__), placeholder,
                                    ZeroOrMore(Series(Series(Text(","), dwsp__),
                                                      placeholder)),
                                    Series(Text(")"), dwsp__), mandatory=1)),
                      DEF, dwsp__, Option(Series(OR, dwsp__)), macrobody,
                      Option(Series(MOD_SYM, dwsp__, hide)),
                      ENDL, dwsp__, Lookahead(FOLLOW_UP))
    element.set(Alternative(
        Series(Option(retrieveop), symbol, NegativeLookahead(is_def)),
        literal, plaintext, char_ranges, Series(character, dwsp__),
        regexp, char_range, any_char, whitespace, group,
        Series(macro, NegativeLookahead(is_def)),
        Series(placeholder, NegativeLookahead(is_def)), parser))
    countable.set(Alternative(option, oneormore, element))
    expression.set(Series(sequence, ZeroOrMore(Series(OR, dwsp__, sequence))))
    syntax = Series(dwsp__, ZeroOrMore(Alternative(definition, directive, macrodef)), EOF)
    resume_rules__ = {'definition': [re.compile(r'\n\s*(?=@|\w+\w*\s*=)')],
                      'directive': [re.compile(r'\n\s*(?=@|\w+\w*\s*=)')]}
    root__ = syntax

    def __init__(self, root: Optional[Parser] = None,
                 static_analysis: Optional[bool] = None) -> None:
        Grammar.__init__(self, root, static_analysis)
        self.mode__ = 'fixed'


def grammar_changed(grammar_class, grammar_source: str) -> bool:
    """
    Returns ``True`` if ``grammar_class`` does not reflect the latest
    changes of ``grammar_source``.

    Parameters:
        grammar_class:  the parser class representing the grammar
            or the file name of a compiler suite containing the grammar
        grammar_source:  File name or string representation of the
            EBNF code of the grammar

    Returns (bool):
        True, if the source text of the grammar is different from the
        source from which the grammar class was generated
    """
    grammar = load_if_file(grammar_source)
    chksum = md5(grammar, __version__)
    if isinstance(grammar_class, str):
        # grammar_class = load_compiler_suite(grammar_class)[1]
        with open(grammar_class, 'r', encoding='utf8') as f:
            pycode = f.read()
        m = re.search(r'class \w*\(Grammar\)', pycode)
        if m:
            m = re.search(' {4}source_hash__ *= *"([a-z0-9]*)"',
                          pycode[m.span()[1]:])
            if m is None:
                return False
            return not (m.groups() and m.groups()[-1] == chksum)
        else:
            return True
    else:
        return chksum != grammar_class.source_hash__
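
# Usage sketch (the file names are hypothetical): regenerate a parser-script
# only if the EBNF-source has changed since the script was compiled:
#
#     if grammar_changed('arithmeticParser.py', 'arithmetic.ebnf'):
#         ...  # recompile the grammar, e.g. with compile_ebnf()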


def get_ebnf_grammar() -> Union[HeuristicEBNFGrammar, ConfigurableEBNFGrammar]:
    """Returns a thread-local EBNF-Grammar-object for parsing EBNF sources."""
    THREAD_LOCALS = access_thread_locals()
    mode = get_config_value('syntax_variant')
    try:
        grammar = THREAD_LOCALS.ebnf_grammar_singleton
        if mode in ('fixed', 'configurable'):
            if not isinstance(grammar, ConfigurableEBNFGrammar):
                raise AttributeError
        else:
            if not isinstance(grammar, HeuristicEBNFGrammar):
                raise AttributeError
    except AttributeError:
        if mode in ('fixed', 'configurable'):
            grammar = ConfigurableEBNFGrammar(static_analysis=False)
            if mode == "fixed":
                # configure grammar once
                update_scanner(grammar, get_config_value('delimiter_set'))
            grammar.mode__ = mode
            THREAD_LOCALS.ebnf_grammar_singleton = grammar
        else:
            grammar = HeuristicEBNFGrammar(static_analysis=False)
            grammar.mode__ = mode
            THREAD_LOCALS.ebnf_grammar_singleton = grammar
    if mode == 'configurable' or (mode == 'fixed' and grammar.mode__ != 'fixed'):
        # configure grammar on each request of the grammar object
        update_scanner(grammar, get_config_value('delimiter_set'))
        grammar.mode__ = mode
    return grammar
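
# Usage sketch: the returned grammar-flavor depends on the configuration-value
# 'syntax_variant'; values other than 'fixed' or 'configurable' select the
# slower HeuristicEBNFGrammar:
#
#     from DHParser.configuration import set_config_value
#     set_config_value('syntax_variant', 'heuristic')
#     grammar = get_ebnf_grammar()  # -> HeuristicEBNFGrammar instance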


def parse_ebnf(ebnf: str) -> Node:
    """Parses an EBNF-source-text and returns the concrete syntax tree
    of the EBNF-code."""
    return get_ebnf_grammar()(ebnf)


########################################################################
#
# EBNF concrete to abstract syntax tree transformation and validation
#
########################################################################


def pythonize_identifier(path: Path):
    r"""A makeshift solution to allow identifiers with dashes:
    replace minus with underscore"""
    assert not path[-1].children
    path[-1]._result = re.sub(r'(?<=\w)-(?=\w)', '_', path[-1]._result)


EBNF_AST_transformation_table = {
    # AST Transformations for EBNF-grammar
    "<": [remove_children_if(all_of(not_one_of('regexp', "RAISE_EXPR_WO_BRACKETS"),
                                    is_empty))],
    "syntax": [],
    "directive": [flatten, remove_tokens('@', '=', ',', '(', ')'),
                  remove_children("RAISE_EXPR_WO_BRACKETS")],
    "macrodef": [remove_anonymous_tokens],
    "macrobody": [],
    "macro": [remove_anonymous_tokens],
    "placeholder": [remove_anonymous_tokens, reduce_single_child],
    "procedure": [remove_tokens('()', '(', ')'), reduce_single_child],
    "literals": [replace_by_single_child],
    "definition": [flatten, remove_children('DEF', 'ENDL'),
                   remove_tokens('=')],  # remove_tokens('=') is only for backwards-compatibility
    "modifier": [],
    "expression": [apply_if(replace_by_single_child, neg(has_parent("directive"))),
                   # the restriction above is required to distinguish regex-find-semantics
                   # from regex-parser semantics in skip and resume-directives
                   flatten, remove_children('OR'),
                   remove_tokens('|')],  # remove_tokens('|') is only for backwards-compatibility
    "sequence": [replace_by_single_child, flatten, remove_children('AND')],
    "interleave": [replace_by_single_child, flatten, remove_tokens('°')],
    "lookaround": [],
    "difference": [remove_tokens('-'), replace_by_single_child],
    "term, part, pure_elem, element": [replace_by_single_child],
    "flowmarker, retrieveop": [reduce_single_child],
    "group": [remove_brackets],
    "oneormore, repetition, option": [reduce_single_child, remove_brackets,
                                      # remove_tokens('?', '*', '+'),
                                      forbid('repetition', 'option', 'oneormore'),
                                      assert_content(r'(?!§)(?:.|\n)*')],
    "counted": [remove_children('TIMES')],
    "range": [remove_children('BRACE_SIGN', 'RNG_BRACE', 'RNG_DELIM',
                              'RNG_OPEN', 'RNG_CLOSE')],
    "symbol, literal, any_char": [reduce_single_child],
    "plaintext": [],
    "regexp": [remove_children('RE_LEADIN', 'RE_LEADOUT'), reduce_single_child],
    "char_ranges, range_desc": [],
    "restricted_range_desc": [change_name("range_desc")],
    "char_range, range_chain": [flatten, remove_tokens('[', ']')],
    "character": [remove_children('CH_LEADIN'), reduce_single_child],
    "free_char": [],
    (TOKEN_PTYPE, WHITESPACE_PTYPE, "whitespace"): [reduce_single_child],
    "parser": [remove_tokens('@', '(', ')')],
    "name, argument": [reduce_single_child],
    "SYM_REGEX": [pythonize_identifier],
    "RAISE_EXPR_WO_BRACKETS":
        [add_error("PEG Expressions in directives must be enclosed in brackets (...)",
                   PEG_EXPRESSION_IN_DIRECTIVE_WO_BRACKETS)],
    "EOF, DEF, OR, AND, ENDL, BRACE_SIGN, RNG_BRACE, RNG_DELIM, RNG_OPEN, "
    "RNG_CLOSE, TIMES, RE_LEADIN, RE_CORE, RE_LEADOUT, CH_LEADIN": [],
    "*": [BLOCK_LEAVES, replace_by_single_child]
}


def EBNFTransform() -> TransformerFunc:
    return partial(transformer,
                   transformation_table=EBNF_AST_transformation_table.copy(),
                   src_stage='CST', dst_stage='AST')


def get_ebnf_transformer() -> TransformerFunc:
    THREAD_LOCALS = access_thread_locals()
    try:
        ebnf_transformer = THREAD_LOCALS.EBNF_transformer_singleton
    except AttributeError:
        THREAD_LOCALS.EBNF_transformer_singleton = EBNFTransform()
        ebnf_transformer = THREAD_LOCALS.EBNF_transformer_singleton
    return ebnf_transformer
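
# Usage sketch: the transformer mutates the syntax-tree in place, e.g.
# (``ebnf_source`` stands for any EBNF-grammar string):
#
#     cst = parse_ebnf(ebnf_source)
#     get_ebnf_transformer()(cst)  # cst now holds the abstract syntax tree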


def transform_ebnf(cst: RootNode) -> RootNode:
    """Transforms the concrete-syntax-tree of an EBNF-source-code
    into the abstract-syntax-tree. The transformation changes the
    syntax tree in place; the (transformed) root-node is returned
    for convenience."""
    root = get_ebnf_transformer()(cst)
    return root
######################################################################## # # EBNF abstract syntax tree to Python-parser compilation # ######################################################################## # source-code-snippets and -templates DHPARSER_IMPORTS = ''' import collections from functools import partial import os import sys from typing import Tuple, List, Union, Any, Optional, Callable, cast try: import regex as re except ImportError: import re try: scriptdir = os.path.dirname(os.path.realpath(__file__)) except NameError: scriptdir = '' if scriptdir and scriptdir not in sys.path: sys.path.append(scriptdir) try: from DHParser import versionnumber except (ImportError, ModuleNotFoundError): i = scriptdir.rfind("/DHParser/") if i >= 0: dhparserdir = scriptdir[:i + 10] # 10 = len("/DHParser/") if dhparserdir not in sys.path: sys.path.insert(0, dhparserdir) from DHParser.compile import Compiler, compile_source, Junction, full_compile from DHParser.configuration import set_config_value, add_config_values, get_config_value, \\ access_thread_locals, access_presets, finalize_presets, set_preset_value, \\ get_preset_value, NEVER_MATCH_PATTERN from DHParser import dsl from DHParser.dsl import recompile_grammar, never_cancel from DHParser.ebnf import grammar_changed from DHParser.error import ErrorCode, Error, canonical_error_strings, has_errors, NOTICE, \\ WARNING, ERROR, FATAL from DHParser.log import start_logging, suspend_logging, resume_logging from DHParser.nodetree import Node, WHITESPACE_PTYPE, TOKEN_PTYPE, RootNode, Path, ZOMBIE_TAG from DHParser.parse import Grammar, PreprocessorToken, Whitespace, Drop, DropFrom, AnyChar, Parser, \\ Lookbehind, Lookahead, Alternative, Pop, Text, Synonym, Counted, Interleave, INFINITE, ERR, \\ Option, NegativeLookbehind, OneOrMore, RegExp, SmartRE, Retrieve, Series, Capture, TreeReduction, \\ ZeroOrMore, Forward, NegativeLookahead, Required, CombinedParser, Custom, IgnoreCase, \\ LateBindingUnary, mixin_comment, last_value, matching_bracket, optional_last_value, \\ PARSER_PLACEHOLDER, UninitializedError from DHParser.pipeline import end_points, full_pipeline, create_parser_junction, \\ create_preprocess_junction, create_junction, PseudoJunction from DHParser.preprocess import nil_preprocessor, PreprocessorFunc, PreprocessorResult, \\ gen_find_include_func, preprocess_includes, make_preprocessor, chain_preprocessors from DHParser.stringview import StringView from DHParser.toolkit import is_filename, load_if_file, cpu_count, RX_NEVER_MATCH, \\ ThreadLocalSingletonFactory, expand_table from DHParser.trace import set_tracer, resume_notices_on, trace_history from DHParser.transform import is_empty, remove_if, TransformationDict, TransformerFunc, \\ transformation_factory, remove_children_if, move_fringes, normalize_whitespace, \\ is_anonymous, name_matches, reduce_single_child, replace_by_single_child, replace_or_reduce, \\ remove_whitespace, replace_by_children, remove_empty, remove_tokens, flatten, all_of, \\ any_of, transformer, merge_adjacent, collapse, collapse_children_if, transform_result, \\ remove_children, remove_content, remove_brackets, change_name, remove_anonymous_tokens, \\ keep_children, is_one_of, not_one_of, content_matches, apply_if, peek, \\ remove_anonymous_empty, keep_nodes, traverse_locally, strip, lstrip, rstrip, \\ replace_content_with, forbid, assert_content, remove_infix_operator, add_error, error_on, \\ left_associative, lean_left, node_maker, has_descendant, neg, has_ancestor, insert, \\ positions_of, replace_child_names, 
add_attributes, delimit_children, merge_connected, \\ has_attr, has_parent, has_children, has_child, apply_unless, apply_ifelse, traverse from DHParser import parse as parse_namespace__ ''' PREPROCESSOR_FACTORY = ''' # To capture includes, replace the NEVER_MATCH_PATTERN # by a pattern with group "name" here, e.g. r'\\input{{(?P<name>.*)}}' RE_INCLUDE = NEVER_MATCH_PATTERN RE_COMMENT = {COMMENT__} # THIS MUST ALWAYS BE THE SAME AS {NAME}Grammar.COMMENT__ !!! def {NAME}Tokenizer(original_text) -> Tuple[str, List[Error]]: # Here, a function body can be filled in that adds preprocessor tokens # to the source code and returns the modified source. return original_text, [] preprocessing: PseudoJunction = create_preprocess_junction( {NAME}Tokenizer, RE_INCLUDE, RE_COMMENT) ''' CUSTOM_PARSER_EXAMPLE = ''' # Examples for custom parsers. Use these as role models for your own # Python-implemented parsing-functions # 1. Plain parsing function (denoted as "@pCustom(parse_word)" is the grammar) def parse_word(s: StringView) -> Optional[Node]: m = s.match(r'\\w+') if m: return Node('word', s[:m.end()]) return None # 2. Simple parser generator: like plain parser function, but looks # in the grammar, i.e. "@parse_word()" ''' GRAMMAR_FACTORY = ''' parsing: PseudoJunction = create_parser_junction({NAME}Grammar) get_grammar = parsing.factory # for backwards compatibility, only ''' TRANSFORMER_FACTORY = ''' # DEPRECATED, because it requires pickling the transformation-table, which rules out lambdas! # ASTTransformation: Junction = create_junction( # {NAME}_AST_transformation_table, "CST", "AST", "transtable") def {NAME}Transformer() -> TransformerFunc: return partial(transformer, transformation_table={NAME}_AST_transformation_table.copy(), src_stage='CST', dst_stage='AST') ASTTransformation: Junction = Junction( 'CST', ThreadLocalSingletonFactory({NAME}Transformer), 'AST') ''' COMPILER_FACTORY = ''' compiling: Junction = create_junction( {NAME}Compiler, "AST", "{NAME}") ''' WHITESPACE_TYPES = {'linefeed': r'[ \t]*(?:\n[ \t]*(?![ \t]*\n))?', # default 'linestart': r'[ \t]*(?:\n(?![ \t]*\n))?', 'horizontal': r'[\t ]*', 'vertical': r'\s*'} DROP_STRINGS = 'strings' DROP_BACKTICKED = 'backticked' DROP_WSPC = 'whitespace' DROP_NO_COMMENTS = 'no_comments' DROP_REGEXP = 'regexps' DROP_VALUES = {DROP_STRINGS, DROP_BACKTICKED, DROP_WSPC, DROP_NO_COMMENTS, DROP_REGEXP} # Representation of Python code or, rather, something that will be output as Python code ReprType: TypeAlias = Union[str, unrepr] VALID_DIRECTIVES = { 'comment': r'Regular expression for comments, e.g. /#.*(?:\n|$)/', 'whitespace': r'Regular expression for whitespace or one of: horizontal, linefeed, linestart, vertical', 'literalws': 'Controls implicit whitespace adjacent to literals: left, right, both, none', 'ignorecase': 'Controls case-sensitivity: True, False', '[preprocessor_]tokens': 'List of the names of all preprocessor tokens', 'disposable|hide': 'List of symbols that shall not be turned into tag-names', 'drop': 'List of tags to be dropped with all their content from the tree, ' 'special values: strings, backticked, whitespace, regexps', '[tree_]reduction': 'Reduction level for simplifying the tree while parsing.' 
                         'Possible levels: none, flatten, merge_treetops, merge',
    '$SYMBOL_filter': 'Function that transforms captured values of the given symbol on retrieval',
    '$SYMBOL_error': 'Pair of a regular expression and a custom error message that is emitted '
                     'if the regex matches',
    '$SYMBOL_skip': 'List of regexes or functions to find the reentry point after an error',
    '$SYMBOL_resume': 'List of regexes or functions to find the reentry point for the parent parser'
}
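# For illustration, a grammar header exercising some of these directives
# might look like this (a sketch, not taken from the module's test suite):
#
#     @ comment    = /#.*(?:\n|$)/     # shell-style comments
#     @ whitespace = horizontal        # implicit whitespace: tabs and blanks
#     @ literalws  = right             # literals eat whitespace to the right
#     @ drop       = whitespace, strings
#     document     = 'Hello' 'World'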
class EBNFDirectives:
    r"""
    A record that keeps information about compiler directives
    during the compilation process.

    :ivar whitespace: the regular expression string for (insignificant)
            whitespace

    :ivar comment: the regular expression string for comments

    :ivar literalws: automatic "whitespace eating" next to literals. Can
            be either 'left', 'right', 'none', 'both'

    :ivar tokens: set of the names of preprocessor tokens

    :ivar filter: mapping of symbols to python match functions that
            will be called on any retrieve / pop - operations on
            these symbols

    :ivar error: mapping of symbols to tuples of match conditions and
            customized error messages. A match condition can be
            either a string or a regular expression. The first error
            message where the search condition matches will be
            displayed. An empty string '' as search condition always
            matches, so in case of multiple error messages, this
            condition should be placed at the end.

    :ivar skip: mapping of symbols to a list of search expressions. A
            search expression can be either a string or a regular
            expression. The closest match is the point of reentry
            for the series- or interleave-parser when a mandatory item
            failed to match the following text.

    :ivar resume: mapping of symbols to a list of search expressions. A
            search expression can be either a string or a regular
            expression. The closest match is the point of reentry
            after a parsing error has occurred. Unlike the skip-field,
            this configures resuming after the failing parser
            (:py:class:`parser.Series` or :py:class:`parser.Interleave`)
            has returned.

    :ivar disposable: A regular expression to identify "disposable" symbols,
            i.e. symbols that will not appear as tag-names. Instead,
            the nodes produced by the parsers associated with these
            symbols will yield anonymous nodes just like "inner"
            parsers that are not directly assigned to a symbol.

    :ivar drop: A set that may contain the elements ``DROP_STRINGS``,
            ``DROP_WSPC``, ``DROP_REGEXP`` or any name of a symbol
            of a disposable parser (e.g. '_linefeed'), the results of
            which will be dropped already during the parsing process.

    :ivar reduction: The reduction level (0-3) for early tree-reduction
            during the parsing stage.

    :ivar \_super\_ws: Cache for the "super whitespace" which is a regular
            expression that merges whitespace and comments. This
            property should only be accessed after the ``whitespace``-
            and ``comment``-field have been filled with the values
            parsed from the EBNF source.
""" __slots__ = ['whitespace', 'comment', 'literalws', 'tokens', 'filter', 'error', 'skip', 'resume', 'disposable', 'drop', 'reduction', '_super_ws'] REPEATABLE_DIRECTIVES = {'tokens', 'preprocessor_tokens', 'disposable', 'drop'} def __init__(self): self.whitespace = WHITESPACE_TYPES['linefeed'] # type: str self.comment = '' # type: str self.literalws = {get_config_value('default_literalws')} # type: Set[str] self.tokens = set() # type: Set[str] self.filter = dict() # type: Dict[str, str] self.error = dict() # type: Dict[str, List[Tuple[ReprType, ReprType]]] self.skip = dict() # type: Dict[str, List[Union[unrepr, str]]] self.resume = dict() # type: Dict[str, List[Union[unrepr, str]]] self.disposable = get_config_value('default_disposable_regexp') # type: str self.drop = set() # type: Set[str] self.reduction = 1 # type: int self._super_ws = None # type: Optional[str] def __getitem__(self, key): return getattr(self, key) def __setitem__(self, key, value): assert hasattr(self, key) setattr(self, key, value) @property def super_ws(self): if self._super_ws is None: self._super_ws = mixin_comment(self.whitespace, self.comment) return self._super_ws def keys(self): return self.__slots__ def add_to_disposable_regexp(self, pattern: str): if self.disposable == NEVER_MATCH_PATTERN: self.disposable = pattern else: self.disposable = '(?:%s)|(?:%s)' % (self.disposable, pattern) def add_to_disposable_symbols(self, symbols: Union[str, Container[str]]): symbols = {symbols} if isinstance(symbols, str) else set(symbols) for s in frozenset(symbols): if re.match(self.disposable, s): symbols.remove(s) if symbols: pattern = '$|'.join(symbols) + '$' self.disposable = '(?:%s)|(?:%s)' % (self.disposable, pattern)
class EBNFCompilerError(CompilerError):
    r"""Error raised by :py:class:`EBNFCompiler` class. (Not compilation
    errors in the strict sense, see :py:class:`~dsl.CompilationError`
    in module :py:mod:`~dsl`)"""
    pass
class EBNFCompiler(Compiler):
    r"""
    Generates a Parser from an abstract syntax tree of a grammar
    specified in EBNF-Notation.

    Usually, this class will not be instantiated or instances of this
    class be called directly. Rather, high-level functions like
    :py:func:`~dsl.create_parser` or :py:func:`~dsl.compileEBNF` will
    be used to generate callable :py:class:`~parse.Grammar`-objects or
    Python-source-code from an EBNF-grammar.

    Instances of this class must be called with the root-node of the
    abstract syntax tree from an EBNF-specification of a formal
    language. The returned value is the Python-source-code of a
    Grammar class for this language that can be used to parse texts in
    this language. See classes :py:class:`compile.Compiler` and
    :py:class:`parser.Grammar` for more information.

    Additionally, class EBNFCompiler provides helper methods to generate
    code-skeletons for a preprocessor, AST-transformation and full
    compilation of the formal language. These methods' names start with
    the prefix ``gen_``.

    :ivar current_symbols: During compilation, a list containing the root
            node of the currently compiled definition as first element and
            then the nodes of the symbols that are referred to in the
            currently compiled definition.

    :ivar cache_literal_symbols: A cache for all symbols that are defined
            by literals, e.g. ``head = "<head>"``. This is used by the
            on_expression()-method.

    :ivar rules: Dictionary that maps rule names to a list of Nodes that
            contain symbol-references in the definition of the rule.
            The first item in the list is the node of the rule-definition
            itself. Example::

                alternative = a | b

            Now ``[node.content for node in self.rules['alternative']]``
            yields ``['alternative = a | b', 'a', 'b']``

    :ivar referred_symbols_cache: A dictionary that caches the results of
            method ``referred_symbols()``. ``referred_symbols()`` maps a
            symbol to the set of symbols that are directly or indirectly
            referred to in the definition of that symbol.

    :ivar directly_referred_cache: A dictionary that caches the results of
            method :py:meth:`directly_referred_symbols`, which yields the
            set of symbols that are referred to in the definition of a
            particular symbol.

    :ivar referred_otherwise: A set of symbols which are directly referred
            to in a directive, macro or macro-symbol. It does not matter
            whether these symbols are reachable (i.e. directly or
            indirectly referred to) from the root-symbol.

    :ivar symbols: A mapping of symbol names to their usages (not their
            definition!) in the EBNF source.

    :ivar py_symbols: A set of symbols that are referred to in the grammar,
            but are (or must be) defined in Python-code outside the
            Grammar-class resulting from the compilation of the
            EBNF-source, as, for example, references to user-defined
            custom-parsers. (See :py:class:`~parse.Custom`)

    :ivar variables: A set of symbol names that are used with the Pop or
            Retrieve operator. Because the values of these symbols need
            to be captured, they are called variables. See
            ``test_parser.TestPopRetrieve`` for an example.

    :ivar forward: A set of symbols that require a forward operator.

    :ivar definitions: A dictionary of definitions. In contrast to
            ``rules``, this maps the symbols to their compiled definienda.

    :ivar macros: A dictionary that maps macro names to the
            macro-definition, or, more precisely, to a tuple of the node
            of the macro-symbol, the string-list of macro arguments and
            the node of the AST that is substituted for the macro-symbol.

    :ivar macro_stack: A stack (i.e. list) of macro names needed to ensure
            that macro calls are not recursively nested.
    :ivar required_keywords: A list of keywords (like ``comment__`` or
            ``whitespace__``) that need to be defined at the beginning of
            the grammar class because they are referred to later.

    :ivar deferred_tasks: A list of callables that is filled during
            compilation, but that will be executed only after compilation
            has finished. Typically, it contains semantic checks that
            require information that is only available upon completion of
            compilation.

    :ivar root_symbol: The name of the root symbol.

    :ivar directives: A record of all directives and their default values.

    :ivar defined_directives: A dictionary of all directives that have
            already been defined, mapped onto the list of nodes where they
            have been (re-)defined. Except for those directives contained
            in EBNFDirectives.REPEATABLE_DIRECTIVES, directives must only
            be defined once.

    :ivar consumed_custom_errors: A set of symbols for which a custom error
            has been defined and(!) consumed during compilation. This
            makes it possible to add a compiler error in those cases where
            (i) an error message has been defined but will never be used
            or (ii) an error message is accidentally used twice. For
            examples, see ``test_ebnf.TestErrorCustomization``.

    :ivar consumed_skip_rules: The same as ``consumed_custom_errors``, only
            for in-series-resume-rules (aka 'skip-rules') for
            Series-parsers.

    :ivar P: A dictionary that maps parser class names to qualified names
            in cases where a parser class name has also been used as a
            symbol name in the grammar.
            (e.g. Text -> parser_namespace\_\_.Text)

    :ivar re_flags: A set of regular expression flags to be added to all
            regular expressions found in the current parsing process.

    :ivar python_src: A string that contains the python source code that
            was the outcome of the last EBNF-compilation.

    :ivar grammar_name: The name of the grammar to be compiled.

    :ivar grammar_source: The source code of the grammar to be compiled.

    :ivar optimization_level: Turns on optimizing the parser by
            substituting SmartRE-parsers for compound parsers when
            possible. Any number higher than zero turns optimization on
            (the higher, the more; see "optimization_level" in
            DHParser.configuration). A value of zero turns optimization
            off.
    """

    COMMENT_KEYWORD = "COMMENT__"
    COMMENT_PARSER_KEYWORD = "comment__"
    DROP_COMMENT_PARSER_KEYWORD = "dcomment__"
    COMMENT_RX_KEYWORD = "comment_rx__"
    WHITESPACE_KEYWORD = "WSP_RE__"
    RAW_WS_KEYWORD = "WHITESPACE__"
    RAW_WS_PARSER_KEYWORD = "whitespace__"
    DROP_RAW_WS_PARSER_KEYWORD = "dwhitespace__"
    WHITESPACE_PARSER_KEYWORD = "wsp__"
    DROP_WHITESPACE_PARSER_KEYWORD = "dwsp__"
    RESUME_RULES_KEYWORD = "resume_rules__"
    SKIP_RULES_KEYWORD = 'skip_rules__'
    ERR_MSGS_KEYWORD = 'error_messages__'

    COMMENT_OR_WHITESPACE = {COMMENT_PARSER_KEYWORD, DROP_COMMENT_PARSER_KEYWORD,
                             RAW_WS_PARSER_KEYWORD, DROP_RAW_WS_PARSER_KEYWORD,
                             WHITESPACE_PARSER_KEYWORD, DROP_WHITESPACE_PARSER_KEYWORD}

    RESERVED_SYMBOLS = {COMMENT_KEYWORD, COMMENT_RX_KEYWORD, COMMENT_PARSER_KEYWORD,
                        WHITESPACE_KEYWORD, RAW_WS_KEYWORD, RAW_WS_PARSER_KEYWORD,
                        WHITESPACE_PARSER_KEYWORD, DROP_WHITESPACE_PARSER_KEYWORD,
                        RESUME_RULES_KEYWORD}

    KEYWORD_SUBSTITUTION = {COMMENT_KEYWORD: COMMENT_PARSER_KEYWORD,
                            COMMENT_PARSER_KEYWORD: COMMENT_PARSER_KEYWORD,
                            RAW_WS_KEYWORD: RAW_WS_PARSER_KEYWORD,
                            RAW_WS_PARSER_KEYWORD: RAW_WS_PARSER_KEYWORD,
                            WHITESPACE_KEYWORD: WHITESPACE_PARSER_KEYWORD,
                            WHITESPACE_PARSER_KEYWORD: WHITESPACE_PARSER_KEYWORD,
                            DROP_WHITESPACE_PARSER_KEYWORD: DROP_WHITESPACE_PARSER_KEYWORD}

    AST_ERROR = "Badly structured syntax tree. " \
                "Potentially due to erroneous AST transformation."
    PREFIX_TABLE = {'§': 'Required',
                    '&': 'Lookahead', '!': 'NegativeLookahead',
                    '<-&': 'Lookbehind', '<-!': 'NegativeLookbehind',
                    '::': 'Pop', ':?': 'Pop', ':': 'Retrieve'}

    def __init__(self, grammar_name="DSL", grammar_source=""):
        super(EBNFCompiler, self).__init__()  # calls the reset()-method
        self.set_grammar_name(grammar_name, grammar_source)
    def reset(self):
        super().reset()
        self.python_src = ''                # type: str
        self.re_flags = set()               # type: Set[str]
        self.rules = OrderedDict()          # type: OrderedDict[str, List[Node]]
        self.referred_symbols_cache = dict()   # type: Dict[str, FrozenSet[str]]
        self.directly_referred_cache = dict()  # type: Dict[str, FrozenSet[str]]
        self.referred_otherwise = set()     # type: Set[str]
        self.current_symbols = []           # type: List[Node]
        self.cache_literal_symbols = None   # type: Optional[Dict[str, str]]
        self.symbols = {}                   # type: Dict[str, List[Node]]
        self.py_symbols = set()             # type: Set[str]
        self.variables = {}                 # type: Dict[str, List[Node]]
        self.forward = set()                # type: Set[str]
        self.definitions = {}               # type: Dict[str, str]
        self.macros = {}                    # type: Dict[str, Tuple[Node, List[str], Node]]
        self.macro_stack = []               # type: List[str]
        self.required_keywords = set()      # type: Set[str]
        self.deferred_tasks = []            # type: List[Callable]
        self.root_symbol = ""               # type: str
        self.directives = EBNFDirectives()  # type: EBNFDirectives
        self.defined_directives = dict()    # type: Dict[str, List[Node]]
        self.consumed_custom_errors = set()  # type: Set[str]
        self.consumed_skip_rules = set()    # type: Set[str]
        self.P = {p: p for p in parser_names}  # type: Dict[str, str]
        self.optimization_level = get_config_value('optimization_level')  # type: int
    @property
    def result(self) -> str:
        return self.python_src
    def set_grammar_name(self, grammar_name: str = "", grammar_source: str = ""):
        """
        Changes the grammar name and source.

        The grammar name and the source text are metadata that do not
        affect the compilation process. They are only used to name and
        annotate the output.
        """
        assert grammar_name == "" or re.match(r'\w+\Z', grammar_name), grammar_name
        if not grammar_name and re.fullmatch(r'[\w/:\\]+', grammar_source):
            grammar_name = os.path.splitext(os.path.basename(grammar_source))[0]
        self.grammar_name = grammar_name or "NameUnknown"
        self.grammar_source = load_if_file(grammar_source)
# methods for generating skeleton code for preprocessor, transformer, and compiler
    def gen_preprocessor_skeleton(self) -> str:
        """
        Returns Python-skeleton-code for a preprocessor-function for
        the previously compiled formal language.
        """
        comment_re = self.directives.comment
        rs = repr(comment_re) if comment_re else 'NEVER_MATCH_PATTERN'
        return PREPROCESSOR_FACTORY.format(NAME=self.grammar_name, COMMENT__=rs)
    def gen_custom_parser_example(self) -> str:
        """Returns Python-example-code for a custom parser."""
        return CUSTOM_PARSER_EXAMPLE
    def gen_transformer_skeleton(self) -> str:
        """
        Returns Python-skeleton-code for the AST-transformation for the
        previously compiled formal language.
        """
        if not self.rules and not self._dirty_flag:
            raise EBNFCompilerError('Compiler must be run before calling '
                                    '"gen_transformer_skeleton()"!')
        tt_name = self.grammar_name + '_AST_transformation_table'
        transtable = [tt_name + ' = {',
                      '    # AST Transformations for the ' + self.grammar_name + '-grammar',
                      '    # "<": [],  # called for each node before calling its specific rules',
                      '    # "*": [],  # fallback for nodes that do not appear in this table',
                      '    # ">": [],  # called for each node after calling its specific rules']
        for name in self.rules:
            transformations = '[]'
            transtable.append('    "' + name + '": %s,' % transformations)
        transtable += ['}', '']
        transtable += [TRANSFORMER_FACTORY.format(NAME=self.grammar_name)]
        return '\n'.join(transtable)
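    # The skeleton generated by gen_transformer_skeleton() looks roughly
    # like this for a hypothetical grammar "Arithmetic" with the rules
    # `expression` and `term`:
    #
    #     Arithmetic_AST_transformation_table = {
    #         # AST Transformations for the Arithmetic-grammar
    #         # "<": [],  # called for each node before calling its specific rules
    #         # "*": [],  # fallback for nodes that do not appear in this table
    #         # ">": [],  # called for each node after calling its specific rules
    #         "expression": [],
    #         "term": [],
    #     }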
    def gen_compiler_skeleton(self) -> str:
        """
        Returns Python-skeleton-code for a Compiler-class for the
        previously compiled formal language.
        """
        if not self.rules and not self._dirty_flag:
            raise EBNFCompilerError('Compiler has not been run before calling '
                                    '"gen_compiler_skeleton()"!')
        compiler = ['class ' + self.grammar_name + 'Compiler(Compiler):',
                    '    """Compiler for the abstract-syntax-tree of a ',
                    f'    {self.grammar_name} source file.',
                    '    """', '',
                    '    def __init__(self):',
                    f'        super({self.grammar_name}Compiler, self).__init__()',
                    '        self.forbid_returning_None = True  '
                    '# set to False if any compilation-method is allowed to return None',
                    '',
                    '    def reset(self):',
                    '        super().reset()',
                    '        # initialize your variables here, not in the constructor!',
                    '',
                    '    def prepare(self, root: RootNode) -> None:',
                    '        assert root.stage == "AST", f"Source stage `AST` expected, '
                    'but `{root.stage}` found."',
                    f'        root.stage = "{self.grammar_name}"',
                    '',
                    '    def finalize(self, result: Any) -> Any:',
                    '        return result', '']
        c = Compiler()
        for name in self.rules:
            method_name = c.visitor_name(name)
            if name == self.root_symbol:
                compiler += ['    def ' + method_name + '(self, node):',
                             '        return self.fallback_compiler(node)', '']
            else:
                compiler += ['    # def ' + method_name + '(self, node):',
                             '    #     return node', '']
        compiler += [COMPILER_FACTORY.format(NAME=self.grammar_name)]
        return '\n'.join(compiler)
    def verify_transformation_table(self, transtable):
        """
        Checks for symbols that occur in the transformation-table but have
        never been defined in the grammar. Usually, this kind of
        inconsistency results from an error like a typo in the
        transformation table.
        """
        assert self._dirty_flag
        table_entries = set(expand_table(transtable).keys()) - {'*', '<', '>', '~'}
        symbols = set(self.rules.keys()) | set(self.macros.keys())
        symbols.add('ZOMBIE__')
        messages = []
        for entry in table_entries:
            if entry not in symbols and not entry[:1] == ":":
                messages.append(Error(
                    ('Symbol "%s" is not defined in grammar %s but appears in '
                     'the transformation table!') % (entry, self.grammar_name),
                    0, UNDEFINED_SYMBOL_IN_TRANSTABLE_WARNING))
        return messages
# def verify_compiler(self, compiler): # """ # Checks for on_XXXX()-methods that occur in the compiler, although XXXX # has never been defined in the grammar. Usually, this kind of # inconsistency results from an error like a typo in the compiler-code. # """ # pass # TODO: add verification code here
    def check_rx(self, node: Node, rx: str) -> str:
        """
        Checks whether the string `rx` represents a valid regular
        expression. Makes sure that regular expressions spanning several
        lines are prepended by the verbose-flag "(?x)". Returns the
        regular expression string.
        """
        # TODO: Support atomic grouping: https://stackoverflow.com/questions/13577372/
        #       do-python-regular-expressions-have-an-equivalent-to-rubys-atomic-grouping
        rx = rx.replace(r'\/', '/')
        flags = self.re_flags | {'x'} if rx.find('\n') >= 0 else self.re_flags
        if flags:
            rx = "(?%s)%s" % ("".join(flags), rx)
        try:
            re.compile(rx)
        except Exception as re_error:
            self.tree.new_error(node, "malformed regular expression %s: %s" %
                                (repr(rx), str(re_error)), MALFORMED_REGULAR_EXPRESSION)
        return rx
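    # Behaviour sketch (illustrative values): with `self.re_flags == {'i'}`,
    # check_rx() yields
    #
    #     check_rx(node, r'\w+')    # -> "(?i)\w+"
    #     check_rx(node, 'a\n b')   # -> "(?ix)a\n b"  (flag order may vary;
    #                               #    'x' is added for multi-line regexes)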
    def extract_regex(self, node: Node) -> str:
        """Extracts the regular expression string from a regexp-Node."""
        value = node.content
        if node.name in ('literal', 'plaintext'):
            assert value
            assert value[0] + value[-1] in ('""', "''", "``")
            value = re.escape(value[1:-1])
        else:
            assert node.name == "regexp"
            value = self.check_rx(node, value)
        return value
    def make_search_rule(self, node: Node, nd: Node, kind: str) -> ReprType:
        """Generates a search rule from the node's content. Depending on the
        kind of node, this is a regular expression, a compiled string search,
        the name of a Python procedure or referred symbol, or, otherwise, a
        parser-based search rule.

        :param node: The node of the directive
        :param nd: The node containing the AST of the search rule
        :param kind: The kind of the search rule, which must be one of
            "resume", "skip", "error"
        """
        assert kind in ('resume', 'skip', 'error')
        if nd.name == 'regexp':
            super_ws = self.directives.super_ws
            nonempty_ws = mixin_nonempty(super_ws)
            search_regex = self.extract_regex(nd)\
                .replace(r'\~!', nonempty_ws).replace(r'\~', super_ws)
            return unrepr("re.compile(r'%s')" % search_regex)
        elif nd.name == 'literal':
            s = nd.content[1:-1]  # remove quotation marks
            return unrepr(f"re.compile(r'(?={re.escape(s)})')")
        elif nd.name == 'procedure':
            proc_name = nd.content
            self.py_symbols.add(proc_name)
            return unrepr(proc_name)
        elif nd.name == 'symbol':
            referred_symbol = nd.content.strip()
            self.referred_otherwise.add(referred_symbol)
            return unrepr(referred_symbol)
        else:
            # create a parser-based search rule
            symbol = node.children[0].content.split('_')[0]
            stub = f"{symbol}_{kind}_"
            L = len(stub)
            nr = 1
            for rule in self.rules.keys():
                if rule[:L] == stub:
                    i = int(rule[L:].strip('_'))
                    if i > nr:
                        nr = i + 1
            rule = stub + str(nr) + '__'
            self.current_symbols = [node]
            self.rules[rule] = self.current_symbols
            defn = self.compile(nd)
            assert defn.find("(") >= 0  # synonyms are impossible here
            self.definitions[rule] = defn
            self.referred_otherwise.add(rule)
            return unrepr(rule)
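    # Sketch of what different right-hand sides of a resume-directive
    # compile to (hypothetical directive `@word_resume = ...`):
    #
    #     @word_resume = /\n\s*/      ->  unrepr("re.compile(r'\n\s*')")
    #     @word_resume = "stop"       ->  unrepr("re.compile(r'(?=stop)')")
    #     @word_resume = find_word()  ->  unrepr('find_word')
    #         # name of a Python procedure (exact node content assumed)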
    def directly_referred(self, symbol: str) -> FrozenSet[str]:
        """Returns the set of symbols that are referred to in the
        definition of `symbol`."""
        try:
            return self.directly_referred_cache[symbol]
        except KeyError:
            pass
        try:
            referred_nodes = self.rules[symbol]
        except KeyError:
            referred_nodes = []  # Missing Symbol definition error will be caught later
        result = frozenset({nd.content for nd in referred_nodes[1:]})
        self.directly_referred_cache[symbol] = result
        return result
    def referred_symbols(self, symbol: str) -> FrozenSet[str]:
        """Returns the set of all symbols that are directly or indirectly
        referred to in the definition of `symbol`. The symbol itself can be
        contained in this set, if and only if its rule is recursive.

        `referred_symbols()` only yields reliable results if the collection
        of definitions has been completed.
        """
        try:
            return self.referred_symbols_cache[symbol]
        except KeyError:
            pass
        collected = set()  # type: Set[str]

        def gather(sym: str):
            for s in self.directly_referred(sym):
                if s not in collected and s not in EBNFCompiler.RESERVED_SYMBOLS:
                    collected.add(s)
                    if s != symbol:
                        gather(s)

        gather(symbol)
        result = frozenset(collected)
        self.referred_symbols_cache[symbol] = result
        return result
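    # Example (sketch): for the grammar
    #
    #     a = b c
    #     b = d
    #     c = 'x'
    #     d = 'y'
    #
    # directly_referred('a') yields frozenset({'b', 'c'}), while
    # referred_symbols('a') also collects the indirect reference and
    # yields frozenset({'b', 'c', 'd'}).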
    def recursive_paths(self, symbol: str) -> FrozenSet[Tuple[str, ...]]:
        """Returns the recursive paths from `symbol` to itself. If the
        symbol is not recursive, the returned tuple (of paths) will be
        empty. This method exists only for debugging (so far...)."""
        path = []  # type: List[str]
        recursive_paths = set()  # type: Set[Tuple[str, ...]]

        def gather(sym: str):
            nonlocal path, recursive_paths
            path.append(sym)
            for s in self.directly_referred(sym):
                if s not in EBNFCompiler.RESERVED_SYMBOLS:
                    if s == symbol:
                        recursive_paths.add(tuple(path))
                    elif s not in path:
                        gather(s)
            path.pop()

        gather(symbol)
        return frozenset(recursive_paths)
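    # Example (sketch): for a left-recursive rule such as
    #
    #     expr = expr '+' term | term
    #
    # recursive_paths('expr') contains the one-element path ('expr',),
    # whereas for a non-recursive symbol the result is an empty frozenset.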
    @cython.locals(N=cython.int, top=cython.int, pointer=cython.int,
                   i=cython.int, k=cython.int, j=cython.int)
    def optimize_definitions_order(self, definitions: List[Tuple[str, str]]):
        """Reorders the definitions so as to minimize the number of Forward
        declarations. Forward declarations remain inevitable only where
        recursion is involved.
        """
        if not get_config_value('reorder_definitions'):
            return
        N = len(definitions)
        root = definitions[0][0] if N > 0 else ''
        recursive = {sym for sym in self.forward
                     if sym in self.referred_symbols(sym) or sym == root}

        # move recursive symbols to the top of the list
        top = 1  # however, leave the root symbol at the top
        while top < N and definitions[top][0] in recursive:
            top += 1
        for i in range(top, N):
            if definitions[i][0] in recursive:
                definitions[top], definitions[i] = definitions[i], definitions[top]
                top += 1

        # order the other definitions
        pointer = top
        while pointer < N:
            topsym = definitions[pointer][0]
            for i in range(pointer + 1, N):
                if topsym in self.directly_referred(definitions[i][0]):
                    definitions[pointer], definitions[i] = definitions[i], definitions[pointer]
                    break
            else:
                pointer += 1

        # try to place as many recursive symbols as possible among the
        # non-forwardly-defined-symbols
        i = top - len(recursive)
        while i < top:
            sym = definitions[i][0]
            referred = self.directly_referred(sym)
            if root not in referred:
                k = i
                while k < N and definitions[k][0] not in referred:
                    k += 1
                j = k
                while j < N and sym not in self.directly_referred(definitions[j][0]):
                    j += 1
                if j == N:
                    definitions.insert(k, definitions[i])
                    del definitions[i]
                    top -= 1
                    try:
                        recursive.remove(sym)
                    except KeyError:
                        # due to reiterated reordering sym may happen to be
                        # removed more than once
                        pass
            i += 1

        self.forward = recursive
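    # Effect sketch: for a non-recursive grammar like `document = A B;
    # A = 'a'; B = 'b'`, the definitions are reordered so that A and B are
    # assigned before `document` refers to them in the generated class, and
    # no `X = Forward()` declarations need to be emitted. Forward
    # declarations remain only for symbols that are reachable from
    # themselves.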
[docs] def assemble_parser(self, definitions: List[Tuple[str, str]], root_symbol: str) -> str: """ Creates the Python code for the parser after compilation of the EBNF-Grammar """ def pp_rules(rule_name: str, ruleset: Dict[str, List]) -> Tuple[str, str]: """Pretty-print skip- and resume-rule and error-messages dictionaries to avoid excessively long lines in the generated python source.""" assert ruleset indent = ",\n" + " " * (len(rule_name) + 8) rule_repr = [] for k, v in ruleset.items(): delimiter = indent + ' ' * (len(k) + 5) val = '[' + delimiter.join(str(it) for it in v) + ']' rule_repr.append("'{key}': {value}".format(key=k, value=val)) rule_repr[0] = '{' + rule_repr[0] rule_repr[-1] = rule_repr[-1] + '}' return rule_name, indent.join(rule_repr) def verify_directive_against_symbol(directive: str, related_symbol: str): """Adds an error if the symbol that the directive relates to does not exist.""" if related_symbol not in self.rules: nd = self.defined_directives[directive][0] self.tree.new_error(nd, 'Directive "%s" relates to a symbol that is ' 'nowhere defined!' % directive, DIRECTIVE_FOR_NONEXISTANT_SYMBOL) # execute deferred tasks, for example semantic checks that cannot # be done before the symbol table is complete for task in self.deferred_tasks: task() # minimize the necessary number of forward declarations self.optimize_definitions_order(definitions) self.root_symbol = root_symbol # provide for capturing of symbols that are variables, i.e. the # value of which will be retrieved at some point during the parsing process # The following is needed for cases, where the first retrival-use of a symbol # follows its definition if self.variables: for i, (symbol, code) in enumerate(definitions): if symbol in self.variables: if set(self.symbols[symbol]) - set(self.variables[symbol]): code = f'Capture({code}, zero_length_warning=True)' else: code = f'Capture({code}, zero_length_warning=False)' definitions[i] = (symbol, code) self.definitions[symbol] = code # add special fields for Grammar class if (DROP_WSPC in self.directives.drop or DROP_NO_COMMENTS in self.directives.drop or DROP_STRINGS in self.directives.drop): if DROP_NO_COMMENTS in self.directives.drop: drop_wspc_tmpl = 'Drop(Whitespace(%s, keep_comments=True))' # TODO: turn keep-comments into a global option else: drop_wspc_tmpl = 'Drop(Whitespace(%s))' definitions.append((EBNFCompiler.DROP_WHITESPACE_PARSER_KEYWORD, drop_wspc_tmpl % EBNFCompiler.WHITESPACE_KEYWORD)) definitions.append((EBNFCompiler.WHITESPACE_PARSER_KEYWORD, 'Whitespace(%s)' % EBNFCompiler.WHITESPACE_KEYWORD)) definitions.append((EBNFCompiler.WHITESPACE_KEYWORD, ("mixin_comment(whitespace=" + EBNFCompiler.RAW_WS_KEYWORD + ", comment=" + EBNFCompiler.COMMENT_KEYWORD + ")"))) if EBNFCompiler.RAW_WS_PARSER_KEYWORD in self.required_keywords: definitions.append((EBNFCompiler.RAW_WS_PARSER_KEYWORD, "Whitespace(%s)" % EBNFCompiler.RAW_WS_KEYWORD)) definitions.append((EBNFCompiler.RAW_WS_KEYWORD, "r'{}'".format(self.directives.whitespace))) comment_rx = ("re.compile(%s)" % EBNFCompiler.COMMENT_KEYWORD) \ if self.directives.comment else "RX_NEVER_MATCH" if EBNFCompiler.COMMENT_PARSER_KEYWORD in self.required_keywords: definitions.append((EBNFCompiler.COMMENT_PARSER_KEYWORD, "RegExp(%s)" % EBNFCompiler.COMMENT_RX_KEYWORD)) definitions.append((EBNFCompiler.COMMENT_RX_KEYWORD, comment_rx)) definitions.append((EBNFCompiler.COMMENT_KEYWORD, "r'{}'".format(self.directives.comment))) # prepare and add resume-rules if self.directives.resume: definitions.insert(0, 
pp_rules(self.RESUME_RULES_KEYWORD, self.directives.resume)) for symbol in self.directives.resume.keys(): verify_directive_against_symbol(symbol + '_resume', symbol) # prepare and add skip-rules if self.directives.skip: definitions.insert(0, pp_rules(self.SKIP_RULES_KEYWORD, self.directives.skip)) for symbol in self.directives.skip.keys(): verify_directive_against_symbol(symbol + '_skip', symbol) if symbol not in self.consumed_skip_rules: # check for indirectly reachable unconsumed mandatory markers for s in self.referred_symbols(symbol): if s not in self.directives.skip and s not in self.directives.resume: try: if self.definitions[s].find('mandatory=') >= 0: self.consumed_skip_rules.add(symbol) break except KeyError as ke: if len(ke.args) >= 1 and s == ke.args[0] and \ (s in self.rules or ('$' + s) in self.macros): raise ke # otherwise s is an undefined symbol and will be reported later! else: try: def_node = self.rules[symbol][0] self.tree.new_error( def_node, '"Skip-rules" for symbol "{}" will never be used, because ' 'the mandatory marker "§" does not appear in its definiendum or has ' 'already been consumed earlier.' .format(symbol), UNUSED_ERROR_HANDLING_WARNING) except KeyError: pass # error has already been notified earlier! # prepare and add customized error-messages if self.directives.error: definitions.append(pp_rules(self.ERR_MSGS_KEYWORD, self.directives.error)) for symbol in self.directives.error.keys(): verify_directive_against_symbol(symbol + '_error', symbol) if symbol in self.rules and symbol not in self.consumed_custom_errors: def_node = self.rules[symbol][0] self.tree.new_error( def_node, 'Customized error message for symbol "{}" will never be used, ' 'because the mandatory marker "§" appears nowhere in its definiendum!' .format(symbol), UNUSED_ERROR_HANDLING_WARNING) # prepare parser class header and docstring and # add EBNF grammar to the doc string of the parser class article = 'an ' if self.grammar_name[0:1] in "AaEeIiOoUu" else 'a ' # what about 'hour', 'universe' etc.? show_source = get_config_value('add_grammar_source_to_parser_docstring') declarations = ['class ' + self.grammar_name + 'Grammar(Grammar):', 'r"""Parser for ' + article + self.grammar_name + ' source file' + ('. Grammar:' if self.grammar_source and show_source else '.')] definitions.append(('parser_initialization__', '["upon instantiation"]')) definitions.append(('static_analysis_pending__', '[True]')) definitions.append(('disposable__', 're.compile(' + repr(self.directives.disposable) + ')')) if self.directives.reduction != CombinedParser.DEFAULT_OPTIMIZATION: opt = 'CombinedParser.' 
+ ('NO_TREE_REDUCTION', 'FLATTEN', 'MERGE_TREETOPS', 'MERGE_LEAVES')[self.directives.reduction] definitions.append(('early_tree_reduction__', opt)) if self.grammar_source: definitions.append(('source_hash__', '"%s"' % md5(self.grammar_source, __version__))) declarations.append('') if show_source: declarations += [line for line in self.grammar_source.split('\n')] while declarations[-1].strip() == '': declarations = declarations[:-1] declarations.append('"""') # turn definitions into declarations in reverse order definitions.reverse() declarations += [symbol + ' = Forward()' for symbol in sorted(list(self.forward))] for symbol, statement in definitions: if symbol in self.forward: declarations += [symbol + '.set(' + statement + ')'] else: declarations += [symbol + ' = ' + statement] # check for symbols used but never defined defined_symbols = set(self.rules.keys()) symbols_and_macros = defined_symbols | self.RESERVED_SYMBOLS \ | set(['$' + k for k in self.macros]) for symbol, usages in self.symbols.items(): if symbol not in symbols_and_macros: for usage in usages: passage = self.tree.source[usage.pos:usage.pos + len(symbol) + 1] if passage == symbol + '(': i = self.tree.source.find(')', usage.pos) if i > usage.pos + len(passage): passage = self.tree.source[usage.pos:i + 1] else: passage += "...)" hint = f' If {passage} was meant as a custom parser call, ' \ f'then "@" should be added in front of it.' else: hint = '' self.tree.new_error( usage, f'Missing definition for symbol "{symbol}"!' + hint, UNDEFINED_SYMBOL) # check for unconnected rules def remove_connections(symbol): """Recursively removes all symbols which appear in the definiens of a particular symbol.""" if symbol in defined_symbols: defined_symbols.remove(symbol) for related in self.rules[symbol][1:]: remove_connections(str(related)) remove_connections(self.root_symbol) for symbol in self.referred_otherwise: if symbol in defined_symbols: remove_connections(symbol) for leftover in defined_symbols - self.py_symbols: self.tree.new_error(self.rules[leftover][0], 'Rule "%s" is not connected to parser root "%s" !' % (leftover, self.root_symbol), UNCONNECTED_SYMBOL_WARNING) # check for filters assigned to non-existing or uncaptured symbols def directive_node(tree, directive) -> Node: """Returns the node, where the given directive was stated in the EBNF-source.""" for dr in tree.select_if(lambda nd: nd.name == 'directive'): if dr.pick_if(lambda nd: nd.name == 'symbol').content == directive: return dr return tree for symbol in self.directives.filter: if symbol not in self.symbols: self.tree.new_error(directive_node(self.tree, symbol + '_filter'), 'Filter declared for non-existent symbol "%s"' % symbol, WARNING) else: if not self.definitions[symbol][:8] == 'Capture(': self.tree.new_error(directive_node(self.tree, symbol + '_filter'), 'Filter declared for uncaptured symbol "%s"' % symbol, WARNING) # assemble python grammar definition if self.root_symbol: declarations.append('root__ = ' + self.root_symbol) else: declarations.append(f'root__ = RegExp(r"{NEVER_MATCH_PATTERN}")') declarations.append('') self.python_src = '\n '.join(declarations) \ + GRAMMAR_FACTORY.format(NAME=self.grammar_name) return self.python_src
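    # Shape of the assembled result (abridged sketch for a hypothetical
    # grammar named "Arithmetic" with root symbol `expression`):
    #
    #     class ArithmeticGrammar(Grammar):
    #         r"""Parser for an Arithmetic source file."""
    #         expression = Forward()
    #         ...
    #         root__ = expression
    #
    #     parsing: PseudoJunction = create_parser_junction(ArithmeticGrammar)
    #     get_grammar = parsing.factory  # for backwards compatibility, only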
# compilation methods ### def on_ZOMBIE__(self, node: Node) -> str: result = ['Illegal node in AST generated from EBNF-Source!'] if node.children: result.append(' Fragments found: ') result.extend([str(self.compile(child)) for child in node.children]) src_dump = node.content.replace('\n', '\\n') self.tree.new_error(node, f'Structural Error in AST, skipped source "{src_dump}"', STRUCTURAL_ERROR_IN_AST) return '\n'.join(result) def on_syntax(self, node: Node) -> str: self.optimization_level = get_config_value('optimization_level') # drop the wrapping sequence node if len(node.children) == 1 and node.children[0].anonymous: node = node.children[0] # check for possible naming conflicts with Python parser names for nd in node.children: if nd.name == "definition": assert nd[0].name == "symbol" \ or (nd[0].name == "modifier" and nd[1].name == "symbol") sym = nd[0].content if nd[0].name == "symbol" else nd[1].content if sym in self.P: self.P[sym] = 'parse_namespace__.' + sym # reorder children: directives first, then macro syms and definitions, # finally symbol definitions directivedefs, macrosyms, macrodefs, symdefs = [], [], [], [] for nd in node.children: if nd.name == 'directive': if nd[0].content == "disposable": directivedefs.insert(0, nd) # make sure that @disposable is always in front of @drop else: directivedefs.append(nd) elif nd.name == 'macrodef': if nd.get('placeholder', None): macrodefs.append(nd) else: # macros without arguments must be read first, later, because of early substitution macrosyms.append(nd) else: assert nd.name == 'definition', nd.as_sxpr() symdefs.append(nd) node.result = tuple(directivedefs + macrosyms + macrodefs + symdefs) # compile definitions and directives and collect definitions root_symbol = '' for nd in node.children: if nd.name == "definition": rule, defn = self.compile(nd) self.definitions[rule] = defn if not root_symbol: root_symbol = rule else: assert nd.name in ("macrodef", "directive", ZOMBIE_TAG), nd.as_sxpr() if nd.name != ZOMBIE_TAG: self.compile(nd) # If nd.name == ZOMBIE_TAG, some error has already been reported, earlier. # So it can be ignored here. 
if not self.definitions: self.tree.new_error(node, "Grammar does not contain any rules!", EMPTY_GRAMMAR_ERROR) # derive Python-parser and run static analysis of the Python-parser python_src = self.assemble_parser(list(self.definitions.items()), root_symbol) if get_config_value('static_analysis') == 'early' and not \ any((e.code >= FATAL or e.code in (MALFORMED_REGULAR_EXPRESSION, SYMBOL_NAME_IS_PYTHON_KEYWORD)) for e in self.tree.errors): errors = [] # circumvent name-errors for external symbols tmpl = "def {name}(*args, **kwargs):\n return ERR('')\n" probe_src = '\n'.join([DHPARSER_IMPORTS, '\n'] + [tmpl.format(name=name) for name in self.py_symbols if name not in self.rules] + [python_src]) try: grammar_class = compile_python_object( probe_src, (self.grammar_name or "DSL") + "Grammar") errors = grammar_class().static_analysis_errors__ python_src = python_src.replace( 'static_analysis_pending__ = [True]', 'static_analysis_pending__ = [] # type: List[bool]', 1) except NameError as ne: try: if ne.name not in self.symbols: # undefined symbols in the grammar have already been caught and reported raise(ne) except AttributeError: pass # In Python3.7/3.8/3.9 NameError does not have a name attribute except GrammarError as ge: errors = ge.errors if not has_errors([e.error for e in errors], ERROR): python_src = python_src.replace( 'static_analysis_pending__ = [True]', 'static_analysis_pending__ = [] # type: List[bool]', 1) except (TypeError, AttributeError) as te_ae: if not (errors or self.tree.errors): raise te_ae # not merely a consequntial error of another reported error except SyntaxError as se: src_lines = probe_src.split('\n') se.msg += f' "{src_lines[se.lineno - 1]}" ' raise se for sym, parser, err in errors: psym = parser.symbol.pname for dic, key in [(self.rules, sym), (self.rules, psym), (self.macros, psym)]: symdef_node = dic.get(key, [None])[0] if symdef_node: break else: symdef_node = node err.pos = symdef_node.pos self.tree.add_error(symdef_node, err) self.python_src = python_src return python_src def on_definition(self, node: Node) -> Tuple[str, str]: # process modifiers drop_flag = False if node[0].name == "modifier": assert node[1].name == "symbol" rule = node[1].content body = node[2] if node[0].pick('drop'): drop_flag = True self.directives.drop.add(rule) self.directives.add_to_disposable_symbols(rule) else: rule = node[0].content body = node[1] if node.children[-1].name == 'hide': self.directives.add_to_disposable_symbols(rule) if not drop_flag: drop_flag = rule in self.directives['drop'] \ and rule not in DROP_VALUES # check for various possible errors if rule.endswith('_error') or rule.endswith('_skip') \ or rule.endswith('_resume') or rule.endswith('_filter'): self.tree.new_error(node, 'Symbol name "%s" suggests directive, ' % rule + 'but directive marker @ is missing...', WARNING) if rule in self.rules: first = self.rules[rule][0] if not id(first) in self.tree.error_nodes: self.tree.new_error(first, 'First definition of rule "%s" ' 'followed by illegal redefinitions.' % rule) self.tree.new_error(node, 'A rule "%s" has already been defined earlier.' % rule) elif rule in EBNFCompiler.RESERVED_SYMBOLS: self.tree.new_error(node, 'Symbol "%s" is a reserved symbol.' % rule) elif not sane_parser_name(rule): self.tree.new_error(node, 'Illegal symbol "%s". Symbols must not start or ' ' end with a double underscore "__".' % rule) elif rule in self.directives.tokens: self.tree.new_error(node, 'Symbol "%s" has already been defined as ' 'a preprocessor token.' 
% rule) elif keyword.iskeyword(rule): self.tree.new_error(node, f'Python keyword "{rule}" may not be used as a symbol. ' '(This may change in the future. In the meantime it might suffice to use ' 'capital letters as a remedy, e.g. "And" instead of "and".)', SYMBOL_NAME_IS_PYTHON_KEYWORD) # process definition try: self.current_symbols = [node] self.rules[rule] = self.current_symbols defn = self.compile(body) if isinstance(defn, str): if defn.find("(") < 0: # assume it's a synonym, like 'page = REGEX_PAGE_NR' if not drop_flag and defn in self.directives['drop'] \ and re.match(self.directives['disposable'], rule): self.tree.new_error(node, f'Content of "{rule}" will be dropped, because ' f'it is a Synonym for the dropped symbol "{defn}". If this behaviour ' f'is undesired, swap the definitions of both symbols. Otherwise, add ' f'"{rule}" to the @drop-directive to avoid this warning.', ERROR) defn = f'{self.P["Synonym"]}({defn})' if drop_flag and not defn.startswith(self.P["Drop"] + "("): defn = f'{self.P["Drop"]}({defn})' else: assert isinstance(defn, Node) self.tree.new_error(node, 'Structural Error in AST, unexpected node-type: ' + flatten_sxpr(defn.as_sxpr()), STRUCTURAL_ERROR_IN_AST) defn = "Malformed AST" except TypeError as error: from traceback import extract_tb, format_list trace = ''.join(format_list((extract_tb(error.__traceback__)))) errmsg = "%s (TypeError: %s;\n%s)\n%s" \ % (EBNFCompiler.AST_ERROR, str(error), trace, node.as_sxpr()) self.tree.new_error(node, errmsg) rule, defn = rule + ':error', '"' + errmsg + '"' return rule, defn @staticmethod def join_literals(nd): assert nd.name == "literals" parts = [nd.children[0].content[:-1]] for child in nd.children[1:-1]: parts.append(child.content[1:-1]) parts.append(nd.children[-1].content[1:]) nd.result = "".join(parts) nd.name = "literal" # def add_to_disposable_regexp(self, pattern): # if self.directives.disposable == NEVER_MATCH_PATTERN: # self.directives.disposable = pattern # else: # old_pattern = self.directives.disposable # self.directives.disposable = '(?:%s)|(?:%s)' % (old_pattern, pattern) def on_directive(self, node: Node) -> str: for child in node.children: if child.name == "literal": child.result = escape_ctrl_chars(child.content) elif child.name == "literals": self.join_literals(child) child.result = escape_ctrl_chars(child.content) key = node.children[0].content assert key not in self.directives.tokens if key not in EBNFDirectives.REPEATABLE_DIRECTIVES and not key.endswith('_error') \ and key in self.defined_directives: self.tree.new_error(node, 'Directive "%s" has already been defined earlier. ' % key + 'Later definition will be ignored!', code=REDEFINED_DIRECTIVE) return "" self.defined_directives.setdefault(key, []).append(node) def check_argnum(n: int = 1): if len(node.children) > n + 1: self.tree.new_error(node, 'Directive "@%s" can have at most %i values.' % (key, n)) if key in {'comment', 'whitespace'}: check_argnum() if node.children[1].name == "symbol": value = node.children[1].content if key == 'whitespace' and value in WHITESPACE_TYPES: value = WHITESPACE_TYPES[value] # replace whitespace-name by regex else: self.tree.new_error( node, f'Value "{value}" not allowed for directive "{key}". 
' f'Expcted: {VALID_DIRECTIVES[key]}') else: value = self.extract_regex(node.children[1]) if key == 'whitespace': if not re.match(value, ''): self.tree.new_error(node, f"Implicit whitespace should " f"always match the empty string, /{value}/ does not.") elif not re.match(value, '\n\n'): self.tree.new_error(node, f"Implicit whitespace should always " f"match and at least return the empty string, but /{value}/ " f'does not match the empty beginning of "\\n\\n"!') self.directives[key] = value elif key in ('hide', 'disposable'): if node.children[1].name == "regexp": if len(node.children) > 2: self.tree.new_error(node, 'Directive "@disposable" can only have one argument' ' if specified as regexp and not as a list of symbols.') re_pattern = node.children[1].content if re.match(re_pattern, ''): self.tree.new_error( node, "The regular expression r'%s' matches any symbol, " "which is not allowed!" % re_pattern) else: self.directives.add_to_disposable_regexp(re_pattern) else: args = node.children[1:] symlist = [] phldr_list = [] for child in args: if child.name == "symbol": symlist.append(child.content) elif child.name == "placeholder": phldr_list.append(child.content) else: self.tree.new_error( child, f'Non-symbol argument: {flatten_sxpr(child.as_sxpr())}' ' cannot be mixed with symbols in @disposable-directive') for sym in symlist: self.symbols.setdefault(sym, []).append(node) for phldr in phldr_list: self.symbols.setdefault('$' + phldr, []).append(node) self.directives.add_to_disposable_symbols(symlist + phldr_list) # self.add_to_disposable_regexp('$|'.join(symlist + phldr_list) + '$') elif key == 'drop': if len(node.children) <= 1: self.tree.new_error(node, 'Directive "@ drop" requires as least one argument!') unmatched = [] # type: List[str] # dropped syms that are not already anonymous syms for child in node.children[1:]: content = child.content if re.match(self.directives.disposable, content): self.directives[key].add(content) elif content.lower() in DROP_VALUES: self.directives[key].add(content.lower()) else: unmatched.append(content) if child.name == "placeholder": content = '$' + content if self.directives.disposable == NEVER_MATCH_PATTERN or \ re.fullmatch(r'(?:\w+\$\|)*\w+\$', self.directives.disposable): self.tree.new_error(node, 'Illegal value "%s" for Directive "@ drop"! ' 'Should be one of %s or a disposable parser, where ' 'the "@disposable"-directive must precede the ' '@drop-directive.' % (content, str(DROP_VALUES))) else: self.tree.new_error( node, 'Illegal value "%s" for Directive "@ drop"! Should be one of ' '%s or a string matching r"%s", preceeded by "$" in case of a macro ' 'name' % (content, str(DROP_VALUES), self.directives.disposable)) if unmatched: self.directives[key].add(content) self.directives.add_to_disposable_symbols(unmatched) # self.add_to_disposable_regexp('$|'.join(unmatched) + '$') elif key in ('reduction', 'tree_reduction'): check_argnum(1) arg = node.children[1].content if arg not in ('none', 'flatten', 'merge_treetops', 'merge', 'merge_leaves', '0', '1', '2', '3'): self.tree.new_error(node, 'Wrong value "%s" for Directive "%s". 
' % (arg, key) + 'Choose from: 0-4 or none, flatten, merge_treetop, merge.') elif arg.isdigit(): self.directives.reduction = int(arg) else: level = {'none': 0, 'flatten': 1, 'merge_treetops': 2, 'merge': 3, 'merge_leaves': 3}[arg] self.directives.reduction = level elif key == 'ignorecase': check_argnum(1) if node.children[1].content.lower() not in {"off", "false", "no"}: self.re_flags.add('i') self.P['Text'] = 'IgnoreCase' elif key == 'literalws': values = {child.content.strip().lower() for child in node.children[1:]} if ((values - {'left', 'right', 'both', 'none'}) or ('none' in values and len(values) > 1)): self.tree.new_error(node, 'Directive "literalws" allows only `left`, `right`, ' '`both` or `none`, not `%s`' % ", ".join(values)) wsp = {'left', 'right'} if 'both' in values \ else set() if 'none' in values else values self.directives.literalws = wsp elif key in ('tokens', 'preprocessor_tokens'): tokens = {child.content.strip() for child in node.children[1:]} redeclared = self.directives.tokens & tokens if redeclared: self.tree.new_error(node, 'Tokens %s have already been declared earlier. ' % str(redeclared) + 'Later declaration will be ignored', code=REDECLARED_TOKEN_WARNING) self.directives.tokens |= tokens - redeclared elif key.endswith('_filter'): check_argnum() symbol = key[:-7] if node.children[1].name != "procedure": self.tree.new_error( node, 'Filter must be a procedure, denoted as "foo()", not not a %s: %s' % (node.children[1].name, node.children[1].content)) self.directives.filter[symbol] = node.children[1].content.strip() elif key.endswith('_error'): check_argnum(2) symbol = key[:-6] error_msgs = self.directives.error.get(symbol, []) if node.children[1 if len(node.children) == 2 else 2].name != 'literal': self.tree.new_error( node, 'Directive "%s" requires message string or a pair ' % key + '(regular expression or search string, message string) as argument!') if len(node.children) == 2: error_msgs.append(('', unrepr(node.children[1].content))) elif len(node.children) == 3: rule = self.make_search_rule(node, node.children[1], 'error') error_msgs.append((rule if rule else unrepr(node.children[1].content), unrepr(node.children[2].content))) else: self.tree.new_error(node, 'Directive "%s" allows at most two parameters' % key) self.directives.error[symbol] = error_msgs elif key.endswith('_skip'): symbol = key[:-5] # if symbol in self.rules: # self.tree.new_error(node, 'Skip list for resuming in series for symbol "{}"' # ' must be defined before the symbol!'.format(symbol)) self.directives.skip[symbol] = [self.make_search_rule(node, nd, 'skip') for nd in node.children[1:]] elif key.endswith('_resume'): symbol = key[:-7] self.directives.resume[symbol] = [self.make_search_rule(node, nd, 'resume') for nd in node.children[1:]] else: if any(key.startswith(directive) for directive in ('skip', 'error', 'resume')): kl = key.split('_') proper_usage = '_'.join(kl[1:] + kl[0:1]) self.tree.new_error(node, 'Directive "%s" must be used as postfix not prefix to ' 'the symbolname. Please, write: "%s"' % (kl[0], proper_usage)) else: if key == 'include': self.tree.new_error(node, 'Badly formatted @include-directive. Should be: ' '@include = "FILEPATH"') else: self.tree.new_error(node, 'Unknown directive %s ! 
(Known directives: %s.)' % (key, ', '.join(k for k in VALID_DIRECTIVES.keys()))) return "" def on_macrodef(self, node) -> str: macro_name = node['name'].content placeholders = [pl.content for pl in node.children if pl.name == 'placeholder'] body = node.pick_if(lambda nd: nd.name == 'macrobody') assert body and body.children and len(body.children) == 1 template = body.children[0] # process modifiers if node[0].name == "modifier": if node[0].pick('drop'): self.directives.drop.add(macro_name) self.directives.add_to_disposable_symbols(macro_name) elif node[-1].name == 'hide': self.directives.add_to_disposable_symbols(macro_name) elif template.name == "term" and template[-1].name == 'drop': self.directives.drop.add(macro_name) self.directives.add_to_disposable_symbols(macro_name) for sym in template.select_if(lambda nd: nd.name=='symbol', include_root=True): self.referred_otherwise.add(sym.content) used_placholders = set() replacement = 1 countdown = 20 while replacement and countdown > 0: replacement = None countdown -= 1 for pl in template.select(lambda nd: nd.name == 'placeholder', include_root=True): pl_name = pl.content if pl_name in placeholders: used_placholders.add(pl_name) elif pl_name == macro_name: # this covers but one of several cases self.tree.new_error(pl, f'Recursive nesting of component "{macro_name}"!', RECURSIVE_MACRO_CALL) return "" else: try: _, args, expansion = self.macros[pl_name] if args: self.tree.new_error(node, f'Macro "${pl_name}" requires arguments', WRONG_NUMBER_OF_ARGUMENTS) return "" replacement = Node('macroname', copy.deepcopy(expansion))\ .with_attr({'name': pl_name}).with_pos(pl.pos) pl.replace_by(replacement, merge_attr=True) except KeyError: pass if countdown <= 0: self.tree.new_error(pl, f'Component "{macro_name}" too deeply structured ' f'if not recursively nested!', RECURSIVE_MACRO_CALL) return "" leftover = set(placeholders) - used_placholders if leftover: self.tree.new_error(node, f"Unused macro arguments: {str(leftover)[1:-1]}", UNUSED_MACRO_ARGUMENTS_WARNING) self.macros[macro_name] = (node, placeholders, template) return ""
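    # EBNF-macro example (sketch) of the kind processed by on_macrodef()
    # and on_macro(): a parameterized list-template and a call that
    # instantiates it:
    #
    #     $list($item) = $item { ',' $item }
    #     numbers      = $list(/\d+/)
    #
    # Expansion substitutes the argument-expression /\d+/ for every
    # occurrence of the placeholder $item in the template.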
[docs] def non_terminal(self, node: Node, parser_class: str, custom_args: List[str] = []) -> str: """ Compiles any non-terminal, where `parser_class` indicates the Parser class name for the particular non-terminal. """ arguments = [self.compile(r) for r in node.children] + custom_args assert all(isinstance(arg, str) for arg in arguments), str(arguments) # remove drop clause for non-dropping definitions of forms like "/\w+/~" parser_class = self.P.get(parser_class, parser_class) drop_clause = f'{self.P["Drop"]}(' drop_regexp = drop_clause + f'{self.P["RegExp"]}(' drop_text = drop_clause + f'{self.P["Text"]}(' if (parser_class == self.P["Series"] and node.name not in self.directives.drop and len(arguments) <= 2 and DROP_REGEXP in self.directives.drop and self.path[-2].name == "definition" and all((arg.startswith(drop_regexp) or arg.startswith(drop_text) or arg in EBNFCompiler.COMMENT_OR_WHITESPACE) for arg in arguments) and any(arg in EBNFCompiler.COMMENT_OR_WHITESPACE for arg in arguments)): arguments = [arg.replace(drop_clause, '').replace('))', ')') for arg in arguments] return parser_class + '(' + ', '.join(arguments) + ')'
def on_macroname(self, node) -> str: macro_name = node.attr['name'] code = self.compile(node[0]) if macro_name in self.directives.drop and not code.startswith(self.P["Drop"]): return f'{self.P["Drop"]}({code})' elif not re.match(self.directives.disposable, macro_name): if len(self.path) >= 3 and self.path[-3].name != 'definition': if code.find('(') >= 0: return f'({code}).name("{macro_name}")' else: # code is symbol which already has a name which should not be overwritten. return f'{self.P["Synonym"]}({code}).name("{macro_name}")' return code def on_macro(self, node) -> str: macro_name = node['name'].content if node.children else node.content try: _, margs, expansion = self.macros[macro_name] except KeyError: if len(node.children) <= 1: self.tree.new_error(node, f'Unknown argument or macro-component "${macro_name}"!', UNKNOWN_MACRO_ARGUMENT) else: self.tree.new_error(node, f'Undefined macro "${macro_name}"!', UNDEFINED_MACRO) return "wsp__" self.symbols.setdefault('$' + macro_name, []).append(node) if macro_name in self.macro_stack: self.tree.new_error(node, f'Recursive nesting of macro "{macro_name}"!', RECURSIVE_MACRO_CALL) return "wsp__" values = node.children[1:] # [expr for expr in node.children[1:]] if len(values) != len(margs): self.tree.new_error(node, f'Wrong number of macro arguments: {len(margs)} expected, ' f'{len(values)} given!', WRONG_NUMBER_OF_ARGUMENTS) return "wsp__" # use wsp__ as a placeholder args = dict(zip(margs, values)) tmpl = copy.deepcopy(expansion) for arg in tmpl.select_if(lambda nd: nd.name == 'placeholder', include_root=True): arg_name = arg.content if arg_name in args: value = args[arg_name] # replace placeholder with argument expression arg.replace_by(value, merge_attr=True) node.replace_by(tmpl, merge_attr=True) self.macro_stack.append(macro_name) code = self.compile(Node('macroname', node).with_attr({'name': macro_name})) _ = self.macro_stack.pop() assert macro_name == _ return code def on_placeholder(self, node) -> str: return self.on_macro(node) def on_expression(self, node) -> str: # The following algorithm reorders literal alternatives, so that earlier alternatives # do not pre-empt later alternatives, e.g. 
        # 'ID' | 'IDREF' will be reordered as 'IDREF' | 'ID'

        def move_items(l: List, a: int, b: int):
            """Moves all items in the interval [a:b[ one position forward and
            moves the item at position b to position a."""
            if a > b:
                a, b = b, a
            last = l[b]
            for i in range(b, a, -1):
                l[i] = l[i - 1]
            l[a] = last

        def literal_content(nd: Node) -> Optional[str]:
            """Returns the literal content of either a literal-Node, a plaintext-Node
            or a symbol-Node, where the symbol is defined as a literal or plaintext."""
            if nd.name in ('literal', 'plaintext'):
                return nd.content[1:-1]
            elif nd.name == 'symbol':
                if self.cache_literal_symbols is None:
                    self.cache_literal_symbols = dict()
                    for df in self.tree.select_children('definition'):
                        if df.children[1].name in ('literal', 'plaintext'):
                            self.cache_literal_symbols[df.children[0].content] = \
                                df.children[1].content[1:-1]
                return self.cache_literal_symbols.get(nd.content, None)
            return None

        literals = []  # type: List[List]
        for i, child in enumerate(node.children):
            content = literal_content(child)
            if content is not None:
                literals.append([i, content])

        move = []  # type: List[Tuple[int, int]]
        snapshots = set()  # type: Set[Tuple[int, ...]]
        start_over = True  # type: bool
        while start_over:
            start_over = False
            for i in range(1, len(literals)):
                later = literals[i][1]
                for k in range(i):
                    earlier = literals[k][1]
                    if later.startswith(earlier):
                        a, b = literals[k][0], literals[i][0]
                        move.append((a, b))
                        literals[i][0] = a
                        for n in range(k, i):
                            literals[n][0] += 1
                        self.tree.new_error(
                            node, 'Alternative "%s" should not precede "%s" where it occurs '
                            'at the beginning! Reorder to avoid warning.' % (earlier, later),
                            REORDERING_OF_ALTERNATIVES_REQUIRED)
                        start_over = True
                        break
                if start_over:
                    break
            if start_over:
                move_items(literals, k, i)
                snapshot: Tuple[int, ...] = tuple(item[1] for item in literals)
                if snapshot in snapshots:
                    self.tree.new_error(
                        node, 'Reordering of alternatives "%s" and "%s" required '
                        'but not possible!' % (earlier, later),
                        BAD_ORDER_OF_ALTERNATIVES)
                    move = []
                    start_over = False
                else:
                    snapshots.add(snapshot)

        if move:
            children = list(node.children)
            for mv in move:
                move_items(children, mv[0], mv[1])
            node.result = tuple(children)

        if len(node.children) == 1:
            # can only occur inside directives, because here expressions are not
            # necessarily replaced by their single child.
            # (See AST-Transformation)
            return self.compile(node.children[0])
        return self.non_terminal(node, 'Alternative')

    def on_term(self, node):
        assert len(node.children) == 2, node.as_sxpr()
        assert node[1].name in ('drop', 'skip')  # 'skip' is left for backwards compatibility
        term_code = self.compile(node[0])
        if node[0].name == "symbol":
            return f'{self.P["DropFrom"]}({term_code})'
        elif self.current_symbols and len(self.path) >= 2 \
                and self.path[-2].name in ("definition", "macrodef"):
            curr_sym = self.current_symbols[0][0].content
            self.directives.add_to_disposable_symbols(curr_sym)
            # return f'{self.P["Drop"]}({term_code})' if term_code[:4] != "Drop" else term_code
        elif all(self.path[i].name == "group" for i in range(2, len(self.path) - 1)):
            return f'Synonym({self.P["Drop"]}({term_code}))' \
                if term_code[:4] != "Drop" else f'Synonym({term_code})'
        return f'{self.P["Drop"]}({term_code})' if term_code[:4] != "Drop" else term_code

    def on_part(self, node):
        return self.on_term(node)

    def _error_customization(self, node) -> Tuple[Tuple[Node, ...], List[str]]:
        """Generates the customization arguments (mandatory, error_msgs, skip) for
        `MandatoryNary`-parsers (Series, Interleave, ...)."""
        mandatory_marker = []
        filtered_children = []  # type: List[Node]
        for nd in node.children:
            if nd.name == TOKEN_PTYPE and nd.content == "§":
                mandatory_marker.append(len(filtered_children))
                if len(mandatory_marker) > 1:
                    self.tree.new_error(nd, 'One mandatory marker (§) is sufficient to declare '
                                        'the rest of the elements as mandatory.', WARNING)
            else:
                filtered_children.append(nd)
        custom_args = ['mandatory=%i' % mandatory_marker[0]] if mandatory_marker else []
        # add custom error message if it has been declared for the current definition
        if custom_args:
            try:
                current_symbol = next(reversed(list(self.rules.keys())))
                # list() needed for Python < 3.8
            except StopIteration:
                current_symbol = ''
            # add customized error messages, if defined
            if current_symbol in self.directives.error:
                if current_symbol in self.consumed_custom_errors:
                    self.tree.new_error(
                        node, "Cannot apply customized error messages unambiguously, because "
                        "symbol {} contains more than one parser with a mandatory marker '§' "
                        "in its definiens.".format(current_symbol),
                        AMBIGUOUS_ERROR_HANDLING)
                else:
                    self.consumed_custom_errors.add(current_symbol)
            # add skip-rules to resume parsing of a series, if rules have been declared
            if current_symbol in self.directives.skip:
                if current_symbol in self.consumed_skip_rules:
                    self.tree.new_error(
                        node, "Cannot apply 'skip-rules' unambiguously, because symbol "
                        "{} contains more than one parser with a mandatory marker '§' "
                        "in its definiens.".format(current_symbol),
                        AMBIGUOUS_ERROR_HANDLING)
                else:
                    self.consumed_skip_rules.add(current_symbol)
        return tuple(filtered_children), custom_args

    def on_sequence(self, node) -> str:
        filtered_result, custom_args = self._error_customization(node)
        mock_node = Node(node.name, filtered_result)
        return self.non_terminal(mock_node, 'Series', custom_args)

    def on_interleave(self, node) -> str:
        children = []
        repetitions = []
        filtered_result, custom_args = self._error_customization(node)
        for child in filtered_result:
            if child.name == "group":
                assert len(child.children) == 1
                if child.children[0].name == 'repetition':
                    child.name = "option"
                    child.children[0].name = "oneormore"
                elif child.children[0].name == 'option':
                    child = child.children[0]
                else:
                    repetitions.append((1, 1))
                    children.append(child.children[0])
                    continue
            if child.name == "oneormore":
                repetitions.append((1, INFINITE))
                assert len(child.children) == 1
                children.append(child.children[0])
            elif child.name == "repetition":
                repetitions.append((0, INFINITE))
                assert len(child.children) == 1
                children.append(child.children[0])
            elif child.name == "option":
                repetitions.append((0, 1))
                assert len(child.children) == 1
                children.append(child.children[0])
            elif child.name == "counted":
                what, r = self.extract_counted(child)
                repetitions.append(r)
                children.append(what)
            else:
                repetitions.append((1, 1))
                children.append(child)
        custom_args.append('repetitions=' + str(repetitions).replace(str(INFINITE), 'INFINITE'))
        mock_node = Node(node.name, tuple(children))
        return self.non_terminal(mock_node, 'Interleave', custom_args)

    def on_lookaround(self, node: Node) -> str:
        if not node.children or len(node.children) != 2 \
                or node.children[0].name != 'flowmarker':
            tree_dump = flatten_sxpr(node.as_sxpr())
            self.tree.new_error(
                node, 'Structural error in AST when compiling lookaround operator: '
                + tree_dump, STRUCTURAL_ERROR_IN_AST)
            return 'Structural Error in AST: ' + tree_dump
        prefix = node.children[0].content
        node.result = node.children[1:]
        assert prefix in {'&', '!', '<-&', '<-!'}
        parser_class = self.PREFIX_TABLE[prefix]
        result = self.non_terminal(node, parser_class)
        if prefix[:2] == '<-':
            def verify(node: Node):
                nd = node
                if len(nd.children) >= 1:
                    nd = nd.children[0]
                while nd.name == "symbol":
                    symlist = self.rules.get(nd.content, [])
                    if len(symlist) == 2:
                        nd = symlist[1]
                    else:
                        if len(symlist) == 1:
                            nd = symlist[0].children[1]
                        break
                if nd.name not in ("regexp", "char_ranges"):
                    self.tree.new_error(node, "Lookbehind-parser can only be used with RegExp"
                                        " or PlainText-parsers, not: " + nd.name)
            if not result.startswith(self.P['RegExp'] + '('):
                self.deferred_tasks.append(lambda: verify(node))
        return result

    def on_difference(self, node: Node) -> str:
        assert len(node.children) == 2, str(node.as_sxpr())
        left = self.compile(node.children[0])
        right = self.compile(node.children[1])
        return f"{self.P['Series']}({self.P['NegativeLookahead']}({right}), {left})"

    def on_element(self, node: Node) -> str:
        assert node.children
        assert len(node.children) == 2
        assert node.children[0].name == "retrieveop"
        assert node.children[1].name == "symbol"
        prefix = node.children[0].content  # type: str
        usage = node.children[1]
        arg = usage.content  # type: str
        node.result = node.children[1:]
        assert prefix in {'::', ':?', ':'}
        if re.match(self.directives.disposable, arg):
            self.tree.new_error(
                node, 'Retrieve operator "%s" does not work with disposable parsers like %s'
                % (prefix, arg))
            return arg
        custom_args = []  # type: List[str]
        match_func = 'last_value'  # type: str
        if arg in self.directives.filter:
            match_func = self.directives.filter[arg]
        if prefix.endswith('?'):
            match_func = 'optional_' + match_func
        if match_func != 'last_value':
            custom_args = ['match_func=%s' % match_func]
        self.variables.setdefault(arg, []).append(usage)
        parser_class = self.PREFIX_TABLE[prefix]
        return self.non_terminal(node, parser_class, custom_args)

    def on_option(self, node) -> str:
        return self.non_terminal(node, 'Option')

    def on_repetition(self, node) -> str:
        return self.non_terminal(node, 'ZeroOrMore')

    def on_oneormore(self, node) -> str:
        return self.non_terminal(node, 'OneOrMore')

    def on_group(self, node) -> str:
        assert len(node.children) == 1, node.as_sxpr()
        return self.compile(node.children[0])
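    # Illustration (invented symbols): a difference-term such as
    # ``word - reserved`` is compiled by on_difference into the source string
    # "Series(NegativeLookahead(reserved), word)", i.e. "match word, but only
    # where reserved does not match at the same position".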
    def extract_range(self, node) -> Tuple[int, int]:
        """Returns the range-value of a range-node as a tuple of two integers."""
        assert node.name == "range"
        assert all(child.name == 'multiplier' for child in node.children)
        if len(node.children) == 2:
            r = (int(node.children[0].content), int(node.children[1].content))
            if r[0] > r[1]:
                self.tree.new_error(
                    node, "Lower bound %i of range is greater than upper bound %i!" % r)
                return r[1], r[0]
            return r
        else:
            assert len(node.children) == 1
            return int(node.children[0].content), int(node.children[0].content)
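    # Example (sketch): a range-node with the multipliers 2 and 5 yields (2, 5);
    # a single multiplier such as 3 yields (3, 3); bounds given in the wrong
    # order are swapped and reported as an error.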
    def extract_counted(self, node) -> Tuple[Node, Tuple[int, int]]:
        """Returns the content of a counted-node in a normalized form:
        (node, (n, m)) where node is the root of the sub-parser that is counted,
        i.e. repeated n or n up to m times."""
        assert node.name == 'counted'
        if len(node.children) != 2:
            self.tree.new_error(node, f'Wrong number of arguments for repetition: '
                                f'{len(node.children)} (two expected)!')
            return Node(ZOMBIE_TAG, '').with_pos(node.pos), (0, 0)
        rng = node.get('range', None)
        if rng:
            r = self.extract_range(rng)
            what = node.children[0]
            assert what.name != 'range'
        else:  # multiplier specified instead of range
            if node.children[0].name == 'multiplier':
                m = int(node.children[0].content)
                what = node.children[1]
            else:
                assert node.children[1].name == 'multiplier'
                m = int(node.children[1].content)
                what = node.children[0]
            if what.name == 'option':
                what = what.children[0]
                r = (0, m)
            elif what.name == 'oneormore':
                what = what.children[0]
                r = (m, INFINITE)
            elif what.name == 'repetition':
                self.tree.new_error(
                    node, 'Counting zero or more repetitions of something does not make sense! '
                    "It's still zero or more repetitions, all the same.")
                what = what.children[0]
                r = (0, INFINITE)
            else:
                r = (m, m)
        return what, r
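    # Normalization sketch: for a counted node that combines a multiplier m
    # with a plain element the result is (element, (m, m)); combined with an
    # option it becomes (element, (0, m)); combined with a one-or-more
    # repetition it becomes (element, (m, INFINITE)).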
    def on_counted(self, node) -> str:
        what, r = self.extract_counted(node)
        rstr = str(r).replace(str(INFINITE), 'INFINITE')
        mock_node = Node(node.name, (what,))
        return self.non_terminal(mock_node, 'Counted', ['repetitions=' + rstr])

    def on_symbol(self, node: Node) -> str:  # called only for symbols on the right hand side!
        symbol = node.content
        if symbol in self.directives.tokens:
            return f'{self.P["PreprocessorToken"]}("{symbol}")'
        else:
            self.current_symbols.append(node)
            self.symbols.setdefault(symbol, []).append(node)
            # remember use of symbol, so that dangling references or
            # redundant definitions or usages of symbols can be detected later
            if symbol in self.rules:
                self.forward.add(symbol)
            if symbol in EBNFCompiler.KEYWORD_SUBSTITUTION:
                keyword = EBNFCompiler.KEYWORD_SUBSTITUTION[symbol]
                self.required_keywords.add(keyword)
                return keyword
            elif symbol.endswith('__'):
                self.tree.new_error(node, 'Illegal use of reserved symbol name "%s"!' % symbol)
            return symbol

    def drop_on(self, category):
        return category in self.directives.drop and self.path[-2].name != "definition"

    def TEXT_PARSER(self, text, drop):
        return f'{self.P["Drop"]}({self.P["Text"]}({text}))' if drop \
            else f'{self.P["Text"]}({text})'

    def REGEXP_PARSER(self, pattern, drop):
        REClass = "SmartRE" if pattern.find('(?P<') >= 0 else "RegExp"
        if pattern.find('(?x)') >= 0:
            m = next(re.finditer(r'\\n[ \t]*', pattern))
            indent = len(m.group(0)) - 2
            pattern = re.sub(r'\\n[ \t]*', '\\\\n\n', pattern)
            parts = pattern.split('\n')
            pattern = wrap_str_literal(parts, offset=len(REClass) + 1 + indent)
        else:
            pattern = wrap_str_literal(pattern, 80, len(REClass) + 1)
        return f'{self.P["Drop"]}({self.P[REClass]}({pattern}))' if drop \
            else f'{self.P[REClass]}({pattern})'

    def WSPC_PARSER(self, force_drop=False):
        if ((force_drop
             or DROP_WSPC in self.directives.drop
             or DROP_NO_COMMENTS in self.directives.drop)
                and (self.path[-2].name != "definition"
                     or self.path[-1].name == 'literal')):
            return 'dwsp__'
        return 'wsp__'

    def on_literals(self, node: Node) -> str:
        self.join_literals(node)
        return self.on_literal(node)
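    # Sketch: with a directive like ``@drop = whitespace`` in effect,
    # WSPC_PARSER() emits the reference 'dwsp__' (whitespace that is dropped
    # from the tree) instead of 'wsp__' -- except at the top level of a
    # definition, unless the whitespace belongs to a literal.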
    def prepare_literal(self, node: Node) -> Tuple[str, str, str]:
        """Returns content, left_whitespace, right_whitespace for a literal-node."""
        assert node.name == "literal"
        force = DROP_STRINGS in self.directives.drop
        left = self.WSPC_PARSER(force) if 'left' in self.directives.literalws else ''
        right = self.WSPC_PARSER(force) if 'right' in self.directives.literalws else ''
        content = node.content
        if (content[:1] + content[-1:]) == '’’':
            content = "'" + re.sub(r"(?<!\\)'", r"\'", content[1:-1]) + "'"
        return content, left, right
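    # Sketch: with ``@literalws = right`` in effect, a literal like "begin" is
    # prepared as (roughly) ('"begin"', '', 'wsp__'), so that the compiled
    # parser will consume adjacent whitespace to the right of the literal.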
    def whitespace_rx(self, ws: str) -> str:
        """Returns a regular expression for whitespace, possibly including comments."""
        # TODO: doctests required
        if ws:
            ws_rx = '{WSP_RE__}'
            if ws in (self.DROP_WHITESPACE_PARSER_KEYWORD, ""):
                if DROP_NO_COMMENTS in self.directives.drop:
                    return f'(?P<{KEEP_COMMENTS_PTYPE}>{ws_rx})'
                else:
                    return f'(?:{ws_rx})'
            else:
                assert ws == self.WHITESPACE_PARSER_KEYWORD, str(ws)
                return f'(?P<{WHITESPACE_PTYPE}>{ws_rx})'
        return ""
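    # Behavior sketch (the exact group names depend on module-level constants):
    # the dropping whitespace-parser keyword yields a non-capturing group such
    # as '(?:{WSP_RE__})', while the whitespace-keeping keyword wraps the same
    # pattern in a named group so that matched whitespace survives in the tree.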
    def literal_rx(self, content: str, left: str, right: str) -> str:
        """Returns a regular expression string to parse a literal. This can
        be used to optimize the parsing of literals with adjacent whitespace
        by defining a :py:class:`~parse.SmartRE`-parser with a regular expression."""
        # TODO: doctests required
        left_rx = self.whitespace_rx(left)
        right_rx = self.whitespace_rx(right)
        if self.drop_on(DROP_STRINGS):
            center_rx = f'(?:{re.escape(content)})'
        else:
            center_rx = f'(?P<{TOKEN_PTYPE}>{re.escape(content)})'
        return ''.join([left_rx, center_rx.replace('{', '{{').replace('}', '}}'), right_rx])
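    # Composition sketch: for the content 'begin' with right-adjacent
    # whitespace, the result is roughly '(?P<...>begin)(?:{WSP_RE__})' -- the
    # escaped literal in a (named or non-capturing) group, framed by the
    # whitespace patterns; braces in the center are doubled so that they
    # survive later template substitution.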
    def on_literal(self, node: Node) -> str:
        # TODO: Test with self.optimization_level > 1
        content, left, right = self.prepare_literal(node)
        if self.optimization_level >= 1 and (left or right):
            q = content[0]
            rxp = ''.join(['fr', q, self.literal_rx(content[1:-1], left, right), q])
            rxp = wrap_str_literal(rxp, 80, 12 if self.drop_on(DROP_STRINGS) else 7)
            content = escape_ctrl_chars(content)
            content = ''.join([q, '\\', content[:-1], '\\', content[-1], q])
            return f"{self.P['SmartRE']}({rxp}, {content})"
        center = self.TEXT_PARSER(escape_ctrl_chars(content), self.drop_on(DROP_STRINGS))
        if left or right:
            args = ", ".join(item for item in (left, center, right) if item)
            return f'{self.P["Series"]}({args})'
        return center

    def on_plaintext(self, node: Node) -> str:
        tk = escape_ctrl_chars(node.content)
        rpl = '"' if tk.find('"') < 0 else "'" if tk.find("'") < 0 else ''
        if rpl:
            tk = rpl + tk[1:-1] + rpl
        else:
            tk = rpl + tk.replace('"', '\\"')[1:-1] + rpl
        return self.TEXT_PARSER(tk, self.drop_on(DROP_BACKTICKED))

    def on_regexp(self, node: Node) -> str:
        rx = node.content
        try:
            arg = repr(self.check_rx(node, rx))
        except AttributeError as error:
            from traceback import extract_tb, format_list
            trace = ''.join(format_list(extract_tb(error.__traceback__)))
            errmsg = "%s (AttributeError: %s;\n%s)\n%s" \
                % (EBNFCompiler.AST_ERROR, str(error), trace, node.as_sxpr())
            self.tree.new_error(node, errmsg)
            return '"' + errmsg + '"'
        return self.REGEXP_PARSER(arg, self.drop_on(DROP_REGEXP))

    def on_char_ranges(self, node) -> str:
        # TODO: Add alternative char_ranges handling here
        node.result = "|".join('[' + child.content + ']' for child in node.children
                               if child.name == "range_chain")
        return self.on_regexp(node)

    def on_char_range(self, node) -> str:
        # TODO: Add alternative char_ranges handling here
        if 'range_desc' in node:
            node = self.fallback_compiler(node)
        else:  # old grammar, can be removed soon
            node = self.on_range_desc(node)
        re_str = re.sub(r"(?<!\\)'", r'\'', node.content)
        re_str = re.sub(r"(?<!\\)]", r'\]', re_str)
        return f"{self.P['RegExp']}('[{re_str}]')"

    def on_range_chain(self, node) -> Node:
        raise NotImplementedError

    def on_range_desc(self, node) -> Node:
        for child in node.children:
            if child.name == 'character':
                child.result = self.extract_character(child)
            elif child.name == 'free_char':
                child.result = self.extract_free_char(child)
        return node

    def extract_character(self, node: Node) -> str:
        hexcode = node.content
        if len(hexcode) > 4:
            assert len(hexcode) <= 8
            code = '\\U' + (8 - len(hexcode)) * '0' + hexcode
        elif len(hexcode) > 2:
            code = '\\u' + (4 - len(hexcode)) * '0' + hexcode
        else:
            code = '\\x' + (2 - len(hexcode)) * '0' + hexcode
        return code

    def on_character(self, node: Node) -> str:
        return f"{self.P['RegExp']}('[{self.extract_character(node)}]')"

    def extract_free_char(self, node: Node) -> str:
        assert len(node.content) == 1 \
            or (node.content[0:1] == '\\' and len(node.content) == 2), node.content
        bytestr = node.content.encode('unicode-escape')
        return bytestr.decode()

    def on_any_char(self, node: Node) -> str:
        return f'{self.P["AnyChar"]}()'

    def on_whitespace(self, node: Node) -> str:
        return self.WSPC_PARSER()

    def on_parser(self, node: Node) -> str:
        assert 1 <= len(node.children) <= 2, node.as_sxpr()
        name = self.compile(node['name'])
        if name not in ('Custom', 'CustomParser', 'Error', 'ERR'):
            self.py_symbols.add(name)
        if name == 'Error':
            name = 'ERR'
            argument = self.compile(node['argument'])
            return f'{self.P["Custom"]}({name}({argument}))'
        elif 'argument' in node:
            argument = self.compile(node['argument'])
            if argument[0:1] in ('"', "'"):
                arglist = re.sub(r',\s*', ',', argument.strip('"').strip("'"))
                for arg in arglist.split(','):
                    # identify potential identifiers in string args
                    if re.fullmatch(r'(?!\d)\w+', arg):
                        self.py_symbols.add(arg)
            else:
                self.py_symbols.add(argument)
            return f'{self.P["Custom"]}({name}({argument}))'
        else:
            return f'{self.P["Custom"]}({name}())'

    def on_name(self, node: Node) -> str:
        assert not node.children
        return node.content

    def on_argument(self, node: Node) -> str:
        assert not node.children
        return node.content
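    # Sketch (the parser name 'parse_date' is invented): a custom-parser clause
    # whose name compiles to 'parse_date' and that carries no argument is
    # rendered as "Custom(parse_date())"; 'parse_date' is additionally added to
    # self.py_symbols, so that the generated module will import or expect it.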
def get_ebnf_compiler(grammar_name="", grammar_source="") -> EBNFCompiler:
    THREAD_LOCALS = access_thread_locals()
    try:
        compiler = THREAD_LOCALS.ebnf_compiler_singleton
        compiler.set_grammar_name(grammar_name, grammar_source)
        return compiler
    except AttributeError:
        compiler = EBNFCompiler(grammar_name, grammar_source)
        THREAD_LOCALS.ebnf_compiler_singleton = compiler
        return compiler
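# Design note: like the other get_ebnf_*-factories of this module, this factory
# keeps one compiler instance per thread (a thread-local singleton), so that
# repeated compilation runs within the same thread reuse the stateful
# EBNFCompiler instead of constructing a fresh one each time.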
def compile_ebnf_ast(ast: RootNode) -> str:
    """Compiles the abstract-syntax-tree of an EBNF-source-text into
    Python code of a class derived from `parse.Grammar` that can
    parse text following the grammar described by the EBNF-code."""
    return get_ebnf_compiler()(ast)
########################################################################
#
# EBNF compiler
#
########################################################################
def compile_ebnf(ebnf_source: str, branding: str = 'DSL', *, preserve_AST: bool = False) \
        -> CompilationResult:  # -> (result, messages, AST)
    """Compiles an `ebnf_source` (file name or EBNF-string) and returns
    a tuple consisting of the Python code of the compiler, a list of
    warnings or errors, and the abstract syntax tree of the EBNF-source
    (the latter only if called with the keyword argument ``preserve_AST=True``).
    This function is merely syntactic sugar."""
    result = compile_source(ebnf_source,
                            get_ebnf_preprocessor(),
                            get_ebnf_grammar(),
                            get_ebnf_transformer(),
                            get_ebnf_compiler(branding, ebnf_source),
                            preserve_AST=preserve_AST)
    return result
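# Usage sketch (the one-line grammar is invented for illustration):
#
#     code, messages, _ = compile_ebnf('doc = /\\w+/~', branding="Demo")
#     if not messages:   # no errors or warnings
#         ...            # `code` now holds the Python source of the parser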