Reference

At the core of DHParser lies a parser generator for parsing expression grammars. As a parser generator it offers similar functionality as pyparsing or lark. But it goes far beyond a mere parser generator by offering rich support for testing and debugging grammars, tree-processing (always needed in the XML-prone Digital Humanities ;-) ), fail-tolerant grammars and some (as of now, experimental) support for editing via the language server protocol.

ebnf

Although DHParser also offers a Python-interface for specifying grammars (similar to pyparsing), the recommended way of using DHParser is to specify the grammar in EBNF. This section describes how grammars are specified in EBNF, how parsers can be auto-generated from these grammars, and how these parsers are used to parse text.

nodetree

Syntax-trees are the central data-structure of any parsing system. The description of this module explains how syntax-trees are represented within DHParser, how they can be manipulated and queried, and how they can be serialized or deserialized as XML, S-expressions or JSON.

transform

It is not untypical for digital humanities applications that document trees are transformed again and again to produce different representations of research data or various output forms. DHParser supplies the scaffolding for two different types of tree transformations, both of which are variations of the visitor pattern. The scaffolding supplied by the transform-module allows specifying tree-transformations in a declarative style by filling a dictionary of tag-names with lists of transformation functions that are called in sequence on a node. A number of pre-defined transformations cover the most common cases, which occur in particular when transforming concrete syntax trees into more abstract syntax trees. (An example of this kind of declaratively specified transformation is the EBNF_AST_transformation_table within DHParser’s ebnf-module.)
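
To illustrate the declarative style, here is a minimal sketch of such a transformation table (the node-names are made up; traverse, replace_by_single_child and flatten are transformation helpers provided by the transform-module, but the selection of functions is purely exemplary):

from DHParser.transform import traverse, replace_by_single_child, flatten

# Hypothetical AST-transformation table: each key is a node (tag) name, each
# value a list of transformation functions applied in sequence to matching nodes.
demo_AST_transformation_table = {
    "expression": [replace_by_single_child],   # collapse nodes with a single child
    "term": [replace_by_single_child],
    "*": [flatten],                            # default entry for all other nodes
}

def demo_transformer(tree):
    # traverse() walks the tree and applies the listed functions to every node
    # whose name matches a key of the table; the tree is changed in place.
    traverse(tree, demo_AST_transformation_table)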

compile

offers an object-oriented scaffolding for the visitor pattern that is more suitable for complex transformations that make heavy use of algorithms, as well as for transformations from trees to non-tree objects like program code. (An example of the latter kind of transformation is the EBNFCompiler-class of DHParser’s ebnf-module.)

pipeline

offers support for “processing-pipelines” composed of “junctions”. A processing-pipeline consists of a series of tree-transformations that are applied in sequence. “Junctions” declare which source-tree-stage is transformed by which transformation-routine into which destination tree-stage. Processing-pipelines can contain bifurcations, which are needed if different kinds of output-data shall be derived from one source-document.
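
The following sketch merely illustrates the idea of junctions and pipelines; the names and signatures are made up and do not reproduce the actual API of the pipeline-module:

from typing import Any, Callable, Dict, Iterable, NamedTuple

class DemoJunction(NamedTuple):           # hypothetical stand-in for a junction
    src: str                              # name of the source tree-stage, e.g. "CST"
    transform: Callable[[Any], Any]       # routine producing the destination stage
    dst: str                              # name of the destination stage, e.g. "AST"

def run_demo_pipeline(junctions: Iterable[DemoJunction],
                      stages: Dict[str, Any]) -> Dict[str, Any]:
    # Apply every junction to the stage it declares as its source. Two junctions
    # that share the same src-stage constitute a bifurcation of the pipeline.
    for junction in junctions:
        stages[junction.dst] = junction.transform(stages[junction.src])
    return stages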

testing

provides a rich framework for unit-testing of grammars, parsers and any kind of tree-transformation. Usually, developers will not need to interact with this module directly, but rely on the unit-testing script generated by the “dhparser.py” command-line tool. The tests themselves are specified declaratively in test-input-files (in the very simple “.ini”-format) that reside by default in the “test_grammar”-directory of a DHParser-project.
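
For orientation, a test-input file for a grammar with a symbol “expression” might look roughly like this (a sketch following the [match:…]/[fail:…] convention of DHParser’s grammar tests; treat the exact syntax as an approximation and consult the testing documentation for details):

[match:expression]
M1: 2 + 3 * 4
M2: (1 + x) / 2

[fail:expression]
F1: 2 +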

preprocess

provides support for DSL-pre-processors as well as source mapping of (error-)locations from the preprocessed document to the original document(s). Pre-processors are a practical means for adding features to a DSL which are difficult or impossible to define with context-free grammars in EBNF-notation, like, for example, scoping based on indentation (as used by Python) or chaining of source-texts via an “include”-directive.

parse

contains the parsing algorithms and the Python-interface for defining parsers. DHParser features a packrat-parser for parsing-expression-grammars with full left-recursion support as well as configurable error catching and continuation after errors. The Python-interface allows defining grammars directly as Python-code without the need to compile an EBNF-grammar first. This is an alternative approach to defining grammars, similar to that of pyparsing.
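
As a rough sketch of this Python-interface (using parser-classes that also appear in the auto-generated code shown further below; whitespace-handling and other refinements are omitted, so a real project may need additional class-variables):

from DHParser.parse import Grammar, RegExp, Series, Text

class KeyValueGrammar(Grammar):
    # a deliberately tiny grammar for a single "key=value" pair
    key    = RegExp(r'\w+')
    value  = RegExp(r'[^\n]*')
    entry  = Series(key, Text("="), value)
    root__ = entry                     # designates the root-parser of the grammar

kv_parser = KeyValueGrammar()          # instantiate (once per thread)
tree = kv_parser("answer=42")          # returns the concrete syntax tree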

dsl

contains high-level functions for compiling EBNF-grammars and domain-specific languages “on the fly”.
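
For example (a minimal sketch; the one-rule grammar is made up):

from DHParser.dsl import create_parser

# compile an EBNF-grammar "on the fly" into a ready-to-use parser object
word_parser = create_parser(r'word = /\w+/', 'Demo')
tree = word_parser('Hello')
print(tree.as_sxpr())                  # something like: (word "Hello")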

error

defines the Error-class, the objects of which describe errors in the source document. Errors are defined by - at least - an error code (which at the same time indicates the level of severity), a human-readable error message and a position in the source text.
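
A sketch of how such Error-objects typically surface after parsing (the grammar and input are made up; errors are collected in the errors-list of the root-node of the returned tree, as also used in the example further below):

from DHParser.dsl import create_parser

pair_parser = create_parser(r'pair = /\w+/ "=" /\w+/', 'Demo')
tree = pair_parser('key=')             # deliberately defective input
for error in tree.errors:              # the root-node collects Error-objects
    print(error)                       # each error carries a code, a message and a position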

trace

Apart from unit-testing, DHParser offers “post-mortem” debugging of the parsing process itself - as described in the Step by Step Guide. This is helpful for figuring out why a parser went wrong. Again, there is little need to interact with this module directly, as its functionality is turned on by setting the configuration variables history_tracking and, for tracing continuation after errors, resume_notices, which in turn can be triggered by calling the auto-generated -Parser.py-scripts with the parameter --debug.

log

provides logging facilities for DHParser as well as tracking of the parsing-history in connection with module trace.

configuration

the central place for all configuration settings of DHParser. Be sure to use the access-, set- and get-functions to change presets and configuration values, in order to make sure that changes to the configuration also work in combination with multithreading or multiprocessing.
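
For example (a minimal sketch; history_tracking is one of the configuration variables mentioned in connection with module trace above):

from DHParser.configuration import get_config_value, set_config_value

set_config_value('history_tracking', True)       # switch on parsing-history tracking
assert get_config_value('history_tracking') is True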

server

In order to avoid startup times or to provide a language server for a domain-specific language (DSL), DSL-parsers generated by DHParser can be run as a server. Module server provides the scaffolding for an asynchronous language server. The “-Server.py”-script generated by DHParser provides a minimal language server sufficient for compiling a DSL. Especially if used with the just-in-time compiler PyPy, running the -Server.py-script allows for a significant speed-up.

lsp

(as of now, this is just a stub!) provides data classes that resemble the TypeScript-interfaces of the language server protocol specification.

stringview

defines a low-level class that provides views on slices of strings. It is used by the parse-module to avoid excessive copying of data when slicing strings. (As a design decision, Python always creates a copy of the data when slicing strings.) If need be, this module can be sped up significantly by compiling it with Cython. (Use the cythonize_stringview-script in DHParser’s main directory or, even better, compile (almost) all modules with the build_cython-modules-script. This yields a 2-3x speed increase.)
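
A minimal sketch of its use:

from DHParser.stringview import StringView

sv = StringView("lots of text that should not be copied when sliced")
part = sv[8:12]                # a view on the original string; no copy is made
assert str(part) == "text"     # str() renders the view as a plain string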

toolkit

various little helper functions for DHParser. Usually, there is no need to call any of these directly.

Module ebnf

Module ebnf provides an EBNF-parser-generator. The parser-generator compiles an EBNF-grammar into an executable Python-class. An instance of this class can be called to parse text-documents conforming to the grammar. It returns the concrete syntax tree of the document. Various flavors of EBNF- or PEG- (Parsing-Expression-Grammar-) notations are supported.

Usually, the classes and functions provided by DHParser.ebnf will not be called directly, because it is much simpler to use the high-level API in module DHParser.dsl, in particular DHParser.dsl.create_parser().

The ebnf-module is structured just like the parser-modules that are generated by running the “dhparser”-script on an EBNF-grammar or by executing the test-runner script in the project-directory of a DHParser-project. It consists of the following sections, each introduced by a comment-block with the section’s name:

  1. The EBNF-preprocessor takes care of chaining the main file and the included files (if there are any) into one single document that is passed on to the parser. (=> preprocessor_factory())

  2. The EBNF-parser translates the EBNF-grammar into a concrete-syntax-tree. (=> ConfigurableEBNFGrammar)

  3. The AST-transformation molds the concrete-syntax-tree into an abstract syntax tree (AST) that represents the syntactical structure of the grammar-source-code. (=> get_ebnf_transformer())

  4. Finally, the EBNF-compiler compiles the AST into an executable Python-class that is a descendant of Grammar and is composed mostly of instances of the parser-classes from module DHParser.parse. (=> EBNFCompiler)

    (Each symbol of the grammar is represented by a class variable to which a nested call of parser-class-instantiations is assigned. These parser-instances serve as a prototype from which grammar objects are derived via deep-copy upon the instantiation of the grammar-class.)

    This section also contains source-code-snippets and -templates for the Python-parser code the compiler produces.

The most notable difference to ordinary DHParser-projects is that the DHParser.ebnf-module contains two Grammar-classes, one for parsing code that strictly follows DHParser’s EBNF-syntax (ConfigurableEBNFGrammar) and another one that is able to parse many different brands of EBNF-syntax (HeuristicEBNFGrammar) at the cost of parsing speed. When parsing or compiling an EBNF-grammar with any of the high-level functions into Python-code, the faster “configurable” EBNF-grammar is tried first, and, if that fails with particular errors which suggest that the failure might be merely due to the use of a different brand of EBNF, a second attempt is made with the slower heuristic EBNF-parser.

The EBNF-compiler is actually split into two classes: EBNFCompiler, which contains the EBNF-AST -> Python compiler proper, and EBNFDirectives, a helper class that keeps track of the directives used, so as not to overburden the compiler-class with instance variables.

Just as with DHParser’s (auto-)generated parser-scripts, the classes contained in DHParser.ebnf should not be instantiated directly. Unlike the parser scripts, though, the ebnf-module does not provide Junctions with factory-functions for each stage from preprocessing to compiling. Instead it provides factory-functions that return one singleton instance per thread of each class, namely:

  • get_ebnf_parser()

  • get_ebnf_transformer()

  • get_ebnf_compiler()

These are supplemented by the quick-use functions preprocess_ebnf(), parse_ebnf(), transform_ebnf() and compile_ebnf(), which are documented further below.

The following example shows how the classes and functions of the ebnf-module can be connected to produce runnable Python-code from an EBNF-grammar. It is meant to help understand the role of these classes as well as - in a simplified manner - the basic working mechanisms of higher-level functions like DHParser.dsl.create_parser(). In any practical application, the use of the high-level functions from DHParser.dsl is to be preferred. One can think of the DHParser.dsl-module as the public API of the ebnf-module.

This said, here is how a Python-parser can be generated from a grammar, step by step:

>>> arithmetic_ebnf = r"""
... @ whitespace  = vertical
... @ literalws   = right
... @ drop        = whitespace, strings
... expression = term  { (add | sub) term}
... term       = factor { (div | mul) factor}
... factor     = [minus] (NUMBER | VARIABLE | group)
... group      = "(" expression ")"
... add        = "+"
... sub        = "-"
... mul        = "*"
... div        = "/"
... minus      = `-`
... NUMBER     = /(?:0|(?:[1-9]\d*))(?:\.\d+)?/~
... VARIABLE   = /[A-Za-z]/~"""

>>> # 1. Compilation of an EBNF-grammar into Python-source-code
>>> ebnf_parser = ConfigurableEBNFGrammar()
>>> ebnf_transformer = EBNFTransform()
>>> ebnf_compiler = EBNFCompiler()

>>> CST = ebnf_parser(arithmetic_ebnf)
>>> AST = ebnf_transformer(CST)   # CST should be considered invalid after that
>>> ebnf_compiler.set_grammar_name("Arithmetic")
>>> python_src = ebnf_compiler(AST)
>>> assert not AST.errors
>>> print(python_src)
class ArithmeticGrammar(Grammar):
    r"""Parser for an Arithmetic source file.
    """
    expression = Forward()
    disposable__ = re.compile('$.')
    static_analysis_pending__ = []  # type: List[bool]
    parser_initialization__ = ["upon instantiation"]
    COMMENT__ = r''
    comment_rx__ = RX_NEVER_MATCH
    WHITESPACE__ = r'\s*'
    WSP_RE__ = mixin_comment(whitespace=WHITESPACE__, comment=COMMENT__)
    wsp__ = Whitespace(WSP_RE__)
    dwsp__ = Drop(Whitespace(WSP_RE__))
    VARIABLE = Series(RegExp('[A-Za-z]'), dwsp__)
    NUMBER = Series(RegExp('(?:0|(?:[1-9]\\d*))(?:\\.\\d+)?'), dwsp__)
    minus = Text("-")
    div = Series(Text("/"), dwsp__)
    mul = Series(Text("*"), dwsp__)
    sub = Series(Text("-"), dwsp__)
    add = Series(Text("+"), dwsp__)
    group = Series(Series(Drop(Text("(")), dwsp__), expression, Series(Drop(Text(")")), dwsp__))
    factor = Series(Option(minus), Alternative(NUMBER, VARIABLE, group))
    term = Series(factor, ZeroOrMore(Series(Alternative(div, mul), factor)))
    expression.set(Series(term, ZeroOrMore(Series(Alternative(add, sub), term))))
    root__ = expression

parsing: PseudoJunction = create_parser_junction(ArithmeticGrammar)
get_grammar = parsing.factory # for backwards compatibility, only


>>> # 2. Execution of the Python-source and extraction of the Grammar-class
>>> code = compile(DHPARSER_IMPORTS + python_src, '<string>', 'exec')
>>> namespace = {}
>>> exec(code, namespace)
>>> ArithmeticGrammar = namespace['ArithmeticGrammar']

>>> # 3. Instantiation of the Grammar class and parsing of an expression
>>> arithmetic_parser = ArithmeticGrammar()
>>> syntax_tree = arithmetic_parser("2 + 3 * 4")
>>> print(syntax_tree.as_sxpr())
(expression
  (term
    (factor
      (NUMBER "2")))
  (add "+")
  (term
    (factor
      (NUMBER "3"))
    (mul "*")
    (factor
      (NUMBER "4"))))

Of course, the first part, compiling the grammar to Python-code, could also have been achieved with:

>>> python_src = compile_ebnf(arithmetic_ebnf, "Arithmetic").result

And for the execution of the Python-source and extraction of the Grammar-class, one can use DHParser.toolkit.compile_python_object():

>>> from DHParser import toolkit
>>> ArithmeticGrammar = toolkit.compile_python_object(
...     DHPARSER_IMPORTS + python_src, "ArithmeticGrammar")
>>> arithmetic_parser = ArithmeticGrammar()
>>> syntax_tree_2 = arithmetic_parser("2 + 3 * 4")
>>> assert syntax_tree_2.equals(syntax_tree)

The recommended canonical way for the last step, however, would be:

>>> parsing = toolkit.compile_python_object(
...     DHPARSER_IMPORTS + python_src, "parsing")
>>> arithmetic_parser = parsing.factory()
>>> syntax_tree_3 = arithmetic_parser("2 + 3 * 4")
>>> assert syntax_tree_3.equals(syntax_tree)

By using the factory function of the parsing-junction to obtain a grammar-object instead of instantiating it directly, one avoids instantiating the grammar-object more than once per thread. Re-using the same grammar-object is more efficient than re-instantiating it for every new document to be parsed. At the same time, grammar-objects must not be shared between threads or processes. (See also ThreadLocalSingletonFactory.)

class ConfigurableEBNFGrammar(root: Parser | None = None, static_analysis: bool | None = None)[source]

A parser for an EBNF grammar that can be configured to parse different syntactical variants of EBNF. Unlike HeuristicEBNFGrammar, this parser does not detect the variant used while parsing.

Different syntactical variants can be configured by adjusting the definitions of DEF, OR, AND, ENDL, RNG_OPEN, RNG_CLOSE, RNG_DELIM, CH_LEADIN, TIMES, RE_LEADIN, RE_LEADOUT, either within this grammar definition or in the Grammar-object by changing the text-field of the respective parser objects.

EBNF-Definition of the Grammar:

@ comment    = /(?!#x[A-Fa-f0-9])#.*(?:\n|$)|\/\*(?:.|\n)*?\*\/|\(\*(?:.|\n)*?\*\)/
    # comments can be either C-Style: /* ... */
    # or pascal/modula/oberon-style: (* ... *)
    # or python-style: # ... \n, excluding, however, character markers: #x20
@ whitespace = /\s*/                            # whitespace includes linefeed
@ literalws  = right                            # trailing whitespace of literals will be ignored tacitly
@ hide       = is_mdef, component, pure_elem, countable, no_range, FOLLOW_UP,
               ANY_SUFFIX, MOD_SYM, MOD_SEP, EOF
@ drop       = whitespace, MOD_SYM, EOF, no_range   # do not include these even in the concrete syntax tree


# re-entry-rules to resume parsing after a syntax-error

@ definition_resume = /\n\s*(?=@|\w+\w*\s*=)/
@ directive_resume  = /\n\s*(?=@|\w+\w*\s*=)/

# specialized error messages for certain cases

@ definition_error  = /,/, 'Delimiter "," not expected in definition!\nEither this was meant to '
                           'be a directive and the directive symbol @ is missing\nor the error is '
                           'due to inconsistent use of the comma as a delimiter\nfor the elements '
                           'of a sequence.'

#: top-level

syntax     = ~ { definition | directive | macrodef } EOF
definition = [modifier] symbol §DEF~ [ OR~ ] expression [MOD_SYM~ hide]
             ENDL~ &FOLLOW_UP  # [OR~] to support v. Rossum's syntax
  modifier = (drop | [hide]) MOD_SEP   # node LF after modifier allowed!
  is_def   = [MOD_SEP symbol] DEF | MOD_SEP is_mdef
  MOD_SEP  = / *: */

directive  = "@" §symbol "=" component { "," component } &FOLLOW_UP
  component  = regexp | literals | procedure | symbol !is_def
             | &`$` !is_mdef § placeholder !is_def
             | "(" expression ")" | RAISE_EXPR_WO_BRACKETS expression
  literals   = { literal }+                       # string chaining, only allowed in directives!
  procedure  = SYM_REGEX "()"                     # procedure name, only allowed in directives!

macrodef   = [modifier] "$" name~ ["(" §placeholder { "," placeholder }  ")"]
             DEF~ [ OR~ ] macrobody [MOD_SYM~ hide] ENDL~ & FOLLOW_UP
  macrobody  = expression
  is_mdef    = "$" name ["(" placeholder { "," placeholder }  ")"] ~DEF

FOLLOW_UP  = `@` | `$` | modifier | symbol | EOF


#: components

expression = sequence { OR~ sequence }
sequence   = ["§"] ( interleave | lookaround )  # "§" means all following terms mandatory
             { AND~ ["§"] ( interleave | lookaround ) }
interleave = difference { "°" ["§"] difference }
lookaround = flowmarker § part
difference = term [!`->` "-" § part]
term       = (oneormore | counted | repetition | option | pure_elem) [MOD_SYM~ drop]
part       = (oneormore | pure_elem) [MOD_SYM~ drop]


#: tree-reduction-markers aka "AST-hints"

drop       = "DROP" | "Drop" | "drop" | "SKIP" | "Skip" | "skip"
hide       = "HIDE" | "Hide" | "hide" | "DISPOSE" | "Dispose" | "dispose"


#: elements

countable  = option | oneormore | element
pure_elem  = element § !ANY_SUFFIX              # element strictly without a suffix
element    = [retrieveop] symbol !is_def
           | literal
           | plaintext
           | char_ranges
           | character ~
           | regexp
           | char_range
           | any_char
           | whitespace
           | group
           | macro !is_def
           | placeholder !is_def
           | parser                            # a user defined parser
ANY_SUFFIX = /[?*+]/


#: flow-operators

flowmarker = "!"  | "&"                         # '!' negative lookahead, '&' positive lookahead
           | "<-!" | "<-&"                      # '<-!' negative lookbehind, '<-&' positive lookbehind
retrieveop = "::" | ":?" | ":"                  # '::' pop, ':?' optional pop, ':' retrieve


#: groups

group      = "(" no_range §expression ")"
oneormore  = "{" no_range expression "}+" | element "+"
repetition = "{" no_range §expression "}" | element "*" no_range
option     = !char_range "[" §expression "]" | element "?"
counted    = countable range | countable TIMES~ multiplier | multiplier TIMES~ §countable

range      = RNG_OPEN~ multiplier [ RNG_DELIM~ multiplier ] RNG_CLOSE~
no_range   = !multiplier | &multiplier TIMES   # should that be &(multiplier TIMES)??
multiplier = /[1-9]\d*/~


#: leaf-elements

parser     = "@" name "(" §[argument] ")"        # a user defined parser
  argument = literal | name~

symbol     = SYM_REGEX ~                        # e.g. expression, term, parameter_list
literal    = /"(?:(?<!\\)\\"|[^"])*?"/~         # e.g. "(", '+', 'while'
           | /'(?:(?<!\\)\\'|[^'])*?'/~         # whitespace following literals will be ignored tacitly.
           | /’(?:(?<!\\)\\’|[^’])*?’/~
plaintext  = /`(?:(?<!\\)\\`|[^`])*?`/~         # like literal but does not eat whitespace
           | /´(?:(?<!\\)\\´|[^´])*?´/~
regexp     = RE_LEADIN RE_CORE RE_LEADOUT ~   # e.g. /\w+/, ~/#.*(?:\n|$)/~

char_range = `[` [`^`] { restricted_range_desc }+ "]"
  restricted_range_desc = character [ `-` character ]
char_ranges = RE_LEADIN range_chain { `|` range_chain } RE_LEADOUT ~
  range_chain = `[` [`^`] { range_desc }+ `]`
  range_desc = (character | free_char) [ `-` (character | free_char) ]

character  = (CH_LEADIN | `\x` | `\u` | `\U`) HEXCODE
free_char  = /[^\n\[\]\\]/ | /\\[nrtfv`´'"(){}\[\]\/\\]/
any_char   = "."
whitespace = /~/~                               # insignificant whitespace


#: macros

macro       = "$" name "(" no_range expression { "," no_range expression } ")"
placeholder = "$" name !`(` ~

name        = SYM_REGEX


#: delimiters

EOF = !/./

DEF        = `=`
OR         = `|`
AND        = ``
ENDL       = ``

RNG_OPEN   = `{`
RNG_CLOSE  = `}`
RNG_DELIM  = `,`
TIMES      = `*`

RE_LEADIN  = `/`
RE_LEADOUT = `/`

CH_LEADIN  = `0x`

MOD_SYM    = `->`  # symbol for postfix modifier

#: basic-regexes

RE_CORE    = /(?:(?<!\\)\\(?:\/)|[^\/])*/       # core of a regular expression, i.e. the dots in /.../
SYM_REGEX  = /(?!\d)\w+/                        # regular expression for symbols
HEXCODE    = /(?:[A-Fa-f1-9]|0(?!x)){1,8}/


#: error-markers

RAISE_EXPR_WO_BRACKETS = ``
class EBNFCompiler(grammar_name='DSL', grammar_source='')[source]

Generates a Parser from an abstract syntax tree of a grammar specified in EBNF-Notation.

Usually, this class will not be instantiated, nor will instances of this class be called directly. Rather, high-level functions like create_parser() or compileEBNF() will be used to generate callable Grammar-objects or Python-source-code from an EBNF-grammar.

Instances of this class must be called with the root-node of the abstract syntax tree from an EBNF-specification of a formal language. The returned value is the Python-source-code of a Grammar class for this language that can be used to parse texts in this language. See classes compile.Compiler and parser.Grammar for more information.

Additionally, class EBNFCompiler provides helper methods to generate code-skeletons for a preprocessor, an AST-transformation and a full compilation of the formal language. These methods’ names start with the prefix gen_.

Variables:
  • current_symbols – During compilation, a list containing the root node of the currently compiled definition as first element and then the nodes of the symbols that are referred to in the currently compiled definition.

  • cache_literal_symbols – A cache for all symbols that are defined by literals, e.g. head = "<head>". This is used by the on_expression()-method.

  • rules

    Dictionary that maps rule names to a list of Nodes that contain symbol-references in the definition of the rule. The first item in the list is the node of the rule definition itself. Example:

    alternative = a | b
    

    Now [node.content for node in self.rules['alternative']] yields ['alternative = a | b', 'a', 'b']

  • referred_symbols_cache – A dictionary that caches the results of method referred_symbols(). referred_symbols() maps a symbol to the set of symbols that are directly or indirectly referred to in the definition of that symbol.

  • directly_referred_cache – A dictionary that caches the results of method directly_referred_symbols(), which yields the set of symbols that are referred to in the definition of a particular symbol.

  • referred_otherwise – A set of symbols which are directly referred to in a directive, macro or macro-symbol. It does not matter whether these symbols are reachable (i.e. directly or indirectly referred to) from the root-symbol.

  • symbols – A mapping of symbol names to their usages (not their definition!) in the EBNF source.

  • py_symbols – A set of symbols that are referred to in the grammar, but are (or must be) defined in Python-code outside the Grammar-class resulting from the compilation of the EBNF-source, as, for example, references to user-defined custom-parsers. (See Custom)

  • variables – A set of symbols names that are used with the Pop or Retrieve operator. Because the values of these symbols need to be captured they are called variables. See test_parser.TestPopRetrieve for an example.

  • forward – A set of symbols that require a forward operator.

  • definitions – A dictionary of definitions. Unlike rules, this maps the symbols to their compiled definienda.

  • macros – A dictionary that maps macro names to the macro-definition, or, more precisely, to a tuple of the node of the macro-symbol, the string-list of macro arguments and the node of the AST that is substituted for the macro-symbol.

  • macro_stack – A stack (i.e. list) of macro names needed to ensure that macro calls are not recursively nested.

  • required_keywords – A list of keywords (like comment__ or whitespace__) that need to be defined at the beginning of the grammar class because they are referred to later.

  • deferred_tasks – A list of callables that is filled during compilation, but that will be executed only after compilation has finished. Typically, it contains semantic checks that require information that is only available upon completion of compilation.

  • root_symbol – The name of the root symbol.

  • directives – A record of all directives and their default values.

  • defined_directives – A dictionary of all directives that have already been defined, mapped onto the list of nodes where they have been (re-)defined. Except for those directives contained in EBNFDirectives.REPEATABLE_DIRECTIVES, directives must only be defined once.

  • consumed_custom_errors – A set of symbols for which a custom error has been defined and(!) consumed during compilation. This allows adding a compiler error in those cases where (i) an error message has been defined but will never be used or (ii) an error message is accidentally used twice. For examples, see test_ebnf.TestErrorCustomization.

  • consumed_skip_rules – The same as consumed_custom_errors only for in-series-resume-rules (aka ‘skip-rules’) for Series-parsers.

  • P – a dictionary that maps parser class names to qualified names in cases where a parser class name has also been used as a symbol name in the grammar. (e.g. Text -> parser_namespace__.Text)

  • re_flags – A set of regular expression flags to be added to all regular expressions found in the current parsing process

  • python_src – A string that contains the python source code that was the outcome of the last EBNF-compilation.

  • grammar_name – The name of the grammar to be compiled

  • grammar_source – The source code of the grammar to be compiled.

assemble_parser(definitions: List[Tuple[str, str]], root_symbol: str) str[source]

Creates the Python code for the parser after compilation of the EBNF-grammar.

check_rx(node: Node, rx: str) str[source]

Checks whether the string rx represents a valid regular expression. Makes sure that multi-line regular expressions are prepended by the multi-line-flag. Returns the regular expression string.

directly_referred(symbol: str) FrozenSet[str][source]

Returns the set of symbols that are referred to in the definition of symbol.

extract_counted(node) Tuple[Node, Tuple[int, int]][source]

Returns the content of a counted-node in a normalized form: (node, (n, m)) where node is root of the sub-parser that is counted, i.e. repeated n or n up to m times.

extract_range(node) Tuple[int, int][source]

Returns the range-value of a range-node as a tuple of two integers.

extract_regex(node: Node) str[source]

Extracts regular expression string from regexp-Node.

gen_compiler_skeleton() str[source]

Returns Python-skeleton-code for a Compiler-class for the previously compiled formal language.

gen_custom_parser_example() str[source]

Returns Python-example-code for a custom parser.

gen_preprocessor_skeleton() str[source]

Returns Python-skeleton-code for a preprocessor-function for the previously compiled formal language.

gen_transformer_skeleton() str[source]

Returns Python-skeleton-code for the AST-transformation for the previously compiled formal language.

make_search_rule(node: Node, nd: Node, kind: str) ReprType[source]

Generates a search rule, which can be either a string for simple string search or a regular expression from the node’s content. Returns an empty string in case the node is neither regexp nor literal.

Parameters:
  • node – The node of the directive

  • nd – The node containing the AST of the search rule

  • kind – The kind of the search rule, which must be one of “resume”, “skip”, “error”

non_terminal(node: Node, parser_class: str, custom_args: List[str] = []) str[source]

Compiles any non-terminal, where parser_class indicates the Parser class name for the particular non-terminal.

optimize_definitions_order(definitions: List[Tuple[str, str]])[source]

Reorders the definitions so as to minimize the number of Forward declarations. Forward declarations remain inevitable only where recursion is involved.

recursive_paths(symbol: str) FrozenSet[Tuple[str, ...]][source]

Returns the recursive paths from symbol to itself. If symbol is not recursive, the returned tuple (of paths) will be empty. This method exists only for debugging (so far…).

referred_symbols(symbol: str) FrozenSet[str][source]

Returns the set of all symbols that are directly or indirectly referred to in the definition of symbol. The symbol itself can be contained in this set, if and only if its rule is recursive.

referred_symbols() only yields reliable results if the collection of definitions has been completed.

reset()[source]

Resets all variables to their default values before the next call of the object.

set_grammar_name(grammar_name: str = '', grammar_source: str = '')[source]

Changes the grammar name and source.

The grammar name and the source text are metadata that do not affect the compilation process. They are used to name and annotate the output. Returns self.

verify_transformation_table(transtable)[source]

Checks for symbols that occur in the transformation-table but have never been defined in the grammar. Usually, this kind of inconsistency results from an error like a typo in the transformation table.

exception EBNFCompilerError[source]

Error raised by EBNFCompiler class. (Not compilation errors in the strict sense, see CompilationError in module dsl)

class EBNFDirectives[source]

A Record that keeps information about compiler directives during the compilation process.

Variables:
  • whitespace – the regular expression string for (insignificant) whitespace

  • comment – the regular expression string for comments

  • literalws – automatic “whitespace eating” next to literals. Can be either ‘left’, ‘right’, ‘none’, ‘both’

  • tokens – set of the names of preprocessor tokens

  • filter – mapping of symbols to python match functions that will be called on any retrieve / pop - operations on these symbols

  • error – mapping of symbols to tuples of match conditions and customized error messages. A match condition can be either a string or a regular expression. The first error message where the search condition matches will be displayed. An empty string ‘’ as search condition always matches, so in case of multiple error messages, this condition should be placed at the end.

  • skip – mapping of symbols to a list of search expressions. A search expression can be either a string or a regular expression. The closest match is the point of reentry for the series- or interleave-parser when a mandatory item failed to match the following text.

  • resume – mapping of symbols to a list of search expressions. A search expression can be either a string or a regular expression. The closest match is the point of reentry after a parsing error has occurred. Unlike the skip field, this configures resuming after the failing parser (parser.Series or parser.Interleave) has returned.

  • disposable – A regular expression to identify “disposable” symbols, i.e. symbols that will not appear as tag-names. Instead, the nodes produced by the parsers associated with these symbols will yield anonymous nodes just like “inner” parsers that are not directly assigned to a symbol.

  • drop – A set that may contain the elements DROP_STRINGS and DROP_WSP, DROP_REGEXP or any name of a symbol of a disposable parser (e.g. ‘_linefeed’) the results of which will already be dropped during the parsing process.

  • reduction – The reduction level (0-3) for early tree-reduction during the parsing stage.

  • _super_ws – Cache for the “super whitespace” which is a regular expression that merges whitespace and comments. This property should only be accessed after the whitespace- and comment-field have been filled with the values parsed from the EBNF source.

class HeuristicEBNFGrammar(root: Parser | None = None, static_analysis: bool | None = None)[source]

Parser for an EBNF source file that heuristically detects the used syntactical variant of EBNF on the fly.

This grammar is tuned for flexibility, that is, it supports as many different flavors of EBNF as possible. However, this flexibility comes at the cost of some ambiguities. In particular:

  1. the alternative OR-operator / could be mistaken for the start of a regular expression and vice versa, and

  2. character ranges [a-z] can be mistaken for optional blocks and vice versa

A strategy to avoid these ambiguities is to do all of the following:

  • replace the free_char-parser by a never matching parser

  • if this is done, it is safe to replace the char_range_heuristics- parser by an always matching parser

  • replace the regex_heuristics by an always matching parser

Ambiguities can also be avoided by NOT using all the syntactic variants made possible by this EBNF-grammar within one and the same EBNF-document.

EBNF-definition of the Grammar:

@ comment    = /(?!#x[A-Fa-f0-9])#.*(?:\n|$)|\/\*(?:.|\n)*?\*\/|\(\*(?:.|\n)*?\*\)/
    # comments can be either C-Style: /* ... */
    # or pascal/modula/oberon-style: (* ... *)
    # or python-style: # ... \n, excluding, however, character markers: #x20
@ whitespace = /\s*/                            # whitespace includes linefeed
@ literalws  = right                            # trailing whitespace of literals will be ignored tacitly
@ hide       = is_mdef, component, pure_elem, countable, no_range, FOLLOW_UP,
               MOD_SYM, MOD_SEP, ANY_SUFFIX, EOF
@ drop       = whitespace, MOD_SYM, EOF, no_range        # do not include these even in the concrete syntax tree
@ RNG_BRACE_filter = matching_bracket()         # filter or transform content of RNG_BRACE on retrieve

# re-entry-rules for resuming after parsing-error

@ definition_resume = /\n\s*(?=@|\w+\w*\s*=)/
@ directive_resume  = /\n\s*(?=@|\w+\w*\s*=)/

# specialized error messages for certain cases

@ definition_error  = /,/, 'Delimiter "," not expected in definition!\nEither this was meant to '
                           'be a directive and the directive symbol @ is missing\nor the error is '
                           'due to inconsistent use of the comma as a delimiter\nfor the elements '
                           'of a sequence.'


#: top-level

syntax     = ~ { definition | directive | macrodef } EOF
definition = [modifier] symbol §:DEF~ [ :OR~ ] expression [ MOD_SYM~ hide ]
             :ENDL~ & FOLLOW_UP  # [:OR~] to support v. Rossum's syntax
  modifier = (drop | [hide]) MOD_SEP   # node LF after modifier allowed!
  is_def   = [ MOD_SEP symbol ] :DEF | MOD_SEP is_mdef
  _is_def  = [ MOD_SEP symbol ] _DEF | MOD_SEP is_mdef
  MOD_SEP  = / *: */

directive  = "@" §symbol "=" component { "," component } & FOLLOW_UP
  # component  = (regexp | literals | procedure | symbol !DEF)
  component  = regexp | literals | procedure | symbol !_DEF !_is_def
             | &`$` !is_mdef § placeholder !is_def
             | "(" expression ")"  | RAISE_EXPR_WO_BRACKETS expression
  literals   = { literal }+                       # string chaining, only allowed in directives!
  procedure  = SYM_REGEX "()"                     # procedure name, only allowed in directives!

macrodef   =  [modifier] "$" name~ ["(" §placeholder { "," placeholder }  ")"]
             :DEF~ [ OR~ ] macrobody [ MOD_SYM~ hide ] :ENDL~ & FOLLOW_UP
  macrobody  = expression
  is_mdef    = "$" name ["(" placeholder { "," placeholder }  ")"] ~:DEF

FOLLOW_UP  = `@` | `$` | modifier | symbol | EOF


#: components

expression = sequence { :OR~ sequence }
sequence   = ["§"] ( interleave | lookaround )  # "§" means all following terms mandatory
             { !`@` !(symbol :DEF) :AND~ ["§"] ( interleave | lookaround ) }
interleave = difference { "°" ["§"] difference }
lookaround = flowmarker § part
difference = term [!`->` "-" § part]
term       = (oneormore | counted | repetition | option | pure_elem) [ MOD_SYM~ drop ]
part       = (oneormore | pure_elem) [ MOD_SYM~ drop ]


#: tree-reduction-markers aka "AST-hints"

drop       = "DROP" | "Drop" | "drop" | "SKIP" | "Skip" | "skip"
hide       = "HIDE" | "Hide" | "hide" | "DISPOSE" | "Dispose" | "dispose"


#: elements

countable  = option | oneormore | element
pure_elem  = element § !ANY_SUFFIX              # element strictly without a suffix
element    = [retrieveop] symbol !is_def        # negative lookahead to be sure it's not a definition
           | literal
           | plaintext
           | char_ranges
           | regexp
           | char_range
           | character ~
           | any_char
           | whitespace
           | group
           | macro !is_def
           | placeholder !is_def
           | parser                             # a user-defined parser


ANY_SUFFIX = /[?*+]/


#: flow-operators

flowmarker = "!"  | "&"                         # '!' negative lookahead, '&' positive lookahead
           | "<-!" | "<-&"                      # '<-!' negative lookbehind, '<-&' positive lookbehind
retrieveop = "::" | ":?" | ":"                  # '::' pop, ':?' optional pop, ':' retrieve


#: groups

group      = "(" no_range §expression ")"
oneormore  = "{" no_range expression "}+" | element "+"
repetition = "{" no_range §expression "}" | element "*" no_range
option     = !char_range "[" §expression "]" | element "?"
counted    = countable range | countable :TIMES~ multiplier | multiplier :TIMES~ §countable

range      = RNG_BRACE~ multiplier [ :RNG_DELIM~ multiplier ] ::RNG_BRACE~
no_range   = !multiplier | &multiplier :TIMES
multiplier = /[1-9]\d*/~


#: leaf-elements

parser     = "@" name "(" [argument] ")"        # a user defined parser
  argument = literal | name~

symbol     = SYM_REGEX ~                        # e.g. expression, term, parameter_list
literal    = /"(?:(?<!\\)\\"|[^"])*?"/~         # e.g. "(", '+', 'while'
           | /'(?:(?<!\\)\\'|[^'])*?'/~         # whitespace following literals will be ignored tacitly.
           | /’(?:(?<!\\)\\’|[^’])*?’/~
plaintext  = /`(?:(?<!\\)\\`|[^`])*?`/~         # like literal but does not eat whitespace
           | /´(?:(?<!\\)\\´|[^´])*?´/~
regexp     = :RE_LEADIN RE_CORE :RE_LEADOUT ~   # e.g. /\w+/, ~/#.*(?:\n|$)/~
# regexp     = /\/(?:(?<!\\)\\(?:\/)|[^\/])*?\//~     # e.g. /\w+/, ~/#.*(?:\n|$)/~

char_range = `[` &char_range_heuristics [`^`] { range_desc }+ "]"
char_ranges = RE_LEADIN range_chain { `|` range_chain } RE_LEADOUT ~
  range_chain = `[` [`^`] { range_desc }+ `]`
  range_desc = (character | free_char) [ [`-`] (character | free_char) ]

character  = :CH_LEADIN HEXCODE
free_char  = /[^\n\[\]\\]/ | /\\[nrtfv`´'"(){}\[\]\/\\]/
any_char   = "."
whitespace = /~/~                               # insignificant whitespace


#: macros

macro       = "$" name "(" no_range expression { "," no_range expression } ")"
placeholder = "$" name !`(` ~

name        = SYM_REGEX


#: delimiters

EOF = !/./ [:?ENDL] [:?DEF] [:?OR] [:?AND]      # [:?DEF], [:?OR], ... clear stack by eating stored value
           [:?RNG_DELIM] [:?BRACE_SIGN] [:?CH_LEADIN] [:?TIMES] [:?RE_LEADIN] [:?RE_LEADOUT]

DEF        = _DEF
_DEF       = `=` | `:=` | `::=` | `<-` | /:\n/ | `: `  # with `: `, retrieve markers mustn't be followed by a blank!
OR         = `|` | `/` !regex_heuristics
AND        =  `,` | ``
ENDL       = `;` | ``

RNG_BRACE  = :BRACE_SIGN
BRACE_SIGN = `{` | `(`
RNG_DELIM  = `,`
TIMES      = `*`

RE_LEADIN  = `/` &regex_heuristics | `^/`
RE_LEADOUT = `/`

CH_LEADIN  = `0x` | `#x` | `\x` | `\u` | `\U`

MOD_SYM   = `->`

#: heuristics

char_range_heuristics  = ! ( /[\n]/ | more_than_one_blank
                           | ~ literal_heuristics
                           | ~ [`::`|`:?`|`:`] STRICT_SYM_REGEX /\s*\]/ )
                         & ({ range_desc }+ `]`)
  STRICT_SYM_REGEX     = /(?!\d)\w+/
more_than_one_blank    = /[^ \]]*[ ][^ \]]*[ ]/
literal_heuristics     = /~?\s*"(?:[\\]\]|[^\]]|[^\\]\[[^"]*)*"/
                       | /~?\s*'(?:[\\]\]|[^\]]|[^\\]\[[^']*)*'/
                       | /~?\s*`(?:[\\]\]|[^\]]|[^\\]\[[^`]*)*`/
                       | /~?\s*´(?:[\\]\]|[^\]]|[^\\]\[[^´]*)*´/
                       | /~?\s*\/(?:[\\]\]|[^\]]|[^\\]\[[^\/]*)*\//
regex_heuristics       = ! ( / +`[^`]*` +\//
                           | / +´[^´]*´ +\//
                           | / +'[^']*' +\//
                           | / +"[^"]*" +\//
                           | / +\w+ +\// )
                         ( /[^\/\n*?+\\]*[*?+\\][^\/\n]*\//
                         | /[^\w]+\//
                         | /[^ ]/ )

#: basic-regexes

RE_CORE    = /(?:(?<!\\)\\(?:\/)|[^\/])*/       # core of a regular expression, i.e. the dots in /.../
SYM_REGEX  = /(?!\d)\w(?:-?\w)*/                # regular expression for symbols
HEXCODE    = /(?:[A-Fa-f1-9]|0(?!x)){1,8}/


#: error-markers

RAISE_EXPR_WO_BRACKETS = ``
compile_ebnf(ebnf_source: str, branding: str = 'DSL', *, preserve_AST: bool = False) CompilationResult[source]

Compiles an ebnf_source (file_name or EBNF-string) and returns a tuple of the python code of the compiler, a list of warnings or errors and the abstract syntax tree (if called with the keyword argument preserve_AST=True) of the EBNF-source. This function is merely syntactic sugar.

compile_ebnf_ast(ast: RootNode) str[source]

Compiles the abstract-syntax-tree of an EBNF-source-text into Python code of a class derived from parse.Grammar that can parse text following the grammar described by the EBNF-code.

get_ebnf_grammar() HeuristicEBNFGrammar | ConfigurableEBNFGrammar[source]

Returns a thread-local EBNF-Grammar-object for parsing EBNF sources.

grammar_changed(grammar_class, grammar_source: str) bool[source]

Returns True if grammar_class does not reflect the latest changes of grammar_source

Parameters:
  • grammar_class – the parser class representing the grammar or the file name of a compiler suite containing the grammar

  • grammar_source – File name or string representation of the EBNF code of the grammar

Returns (bool):

True, if the source text of the grammar is different from the source from which the grammar class was generated

parse_ebnf(ebnf: str) Node[source]

Parses an EBNF-source-text and returns the concrete syntax tree of the EBNF-code.

preprocess_ebnf(ebnf: str, source_name='source') PreprocessorResult[source]

Preprocesses the @include-directives of an EBNF-source.

transform_ebnf(cst: RootNode) RootNode[source]

Transforms the concrete-syntax-tree of an EBNF-source-code into the abstract-syntax-tree. The transformation changes the syntax tree in place. No value is returned.

Module nodetree

Module nodetree encapsulates the functionality for creating and handling trees of nodes, in particular, syntax-trees. This includes serialization and deserialization of node-trees, navigating and searching node-trees as well as annotating node-trees with attributes and error messages.

nodetree can also be seen as a document-tree-library for handling any kind of XML- or S-expression-data. In contrast to ElementTree and lxml, nodetree maps mixed content to dedicated nodes, which simplifies the programming of algorithms that run on the data stored in the (XML-)tree.
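
A small example of building, querying and serializing a node-tree (parse_sxpr also appears in the doctests further below):

from DHParser.nodetree import parse_sxpr

tree = parse_sxpr('(sentence (word "Hello") (blank " ") (word "World"))')
print(tree.content)            # flat string-content of the tree: "Hello World"
print(tree.as_xml())           # the same tree serialized as XML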

The source code of module nodetree consists of four main sections:

  1. Node-classes, i.e. Node, FrozenNode and RootNode, as well as a number of top-level functions closely related to these. The Node-classes in turn provide several groups of functionality:

    1. Capturing segments of documents and organizing them in trees. (A node is either a “leaf”-node with string-content or a “branch”-node with children.)

    2. Retaining its source-position in the document (important for error reporting, in particular when errors occur later in the processing-pipeline.)

    3. Storing and retrieving of (XML-)attributes. Like XML-attributes, attribute-names are strings, but unlike XML-attributes, attributes attached to Node-objects can take any Python-type as value that is serializable with “str()”.

    4. Tree-traversal, in particular node- and path-selection based on arbitrary criteria (passed as match-node or match-path-function)

    5. Experimental (XML-)milestone-support: Node.milestone_segment(). See also: ContentMapping

    6. A very simple method for tree-“evaluation”. (More elaborate scaffolding for tree evaluation is found in traverse and compile.)

    7. Functions for serialization and deserialization as XML, S-expressions or JSON, as well as conversion to and from ElementTree/lxml-representations.

    8. Class RootNode serving as both root-node of the tree and a hub for storing data for the tree as a whole (as, for example, the list of errors that have occurred during parsing or further processing) as well as information on the current processing-stage.

  2. Attribute-handling: Functions to handle attribute-values that are organized as blank-separated sets of strings, like, for example, the class-attribute in HTML.

  3. Path-Navigation: Functions that help navigating with paths through the tree. A path is the list of nodes that connects the root-node of the tree with one particular node inside or at the leaf of the tree.

  4. Context-mappings: A class (ContentMapping) for relating the flat string-content of a document-tree to its structure. This allows using the string-content for searching in the document and then switching to the tree-structure to manipulate it.

class ContentMapping(origin: ~nodetree.Node, select: PathSelector = <function LEAF_PATH>, ignore: PathSelector = <function deny>, greedy: bool = True, divisibility: ~typing.Dict[str, ~typing.Container] | ~typing.Container | str = frozenset({':EMPTY', ':RegExp', ':Text', ':Whitespace'}), chain_attr_name: str = '', auto_cleanup: bool = True)[source]

ContentMapping represents a path-mapping of the string-content of all or a specific selection of the leaf-nodes of a tree. A content-mapping is an ordered mapping of the first text position of every (selected) leaf-node to the path of this node.

Path-mappings allow searching the flat document with regular expressions or simple text search and then changing the tree at the appropriate places, for example by adding markup (i.e. nodes) in these places.

The ContentMapping class provides methods for adding markup-nodes. In cases where the new markup-nodes cut across the existing tree-hierarchy, the markup-method takes care of splitting up either the newly created or some of the existing nodes to fit in the markup.

Public properties:

Variables:
  • path_list – A list of paths covering the selected leaves of the tree from left to right.

  • pos_list – The list of positions of the paths in path_list

Location-related instance variables:

Variables:
  • origin – The origin of the tree for which a path mapping shall be generated. This can be a branch of another tree and therefore does not need to be a RootNode-object.

  • select_func – Only leaf-paths for which this function returns True will be considered when generating the content-mapping. This function integrates both the select- and ignore-criteria passed to the constructor of the class. Note that the select-criterion must only accept leaf-paths. Otherwise a ValueError will be raised.

  • ignore_func – The ignore function derives from the ignore-parameter of the __init__()-constructor of class ContentMapping.

  • content – The string content of the selected parts of the tree.

Markup-related instance variables:

Variables:
  • greedy – If True, the algorithm for adding markup minimizes the number of required cuts by switching child and parent nodes if the markup fills up a node completely, as well as by including empty nodes in the markup. In any case, the string content of the added markup remains the same, but it might cover more tags than strictly necessary.

  • chain_attr – An attribute that will receive one and the same identifier as value for all nodes belonging to the chain of one split-up node.

  • auto_cleanup – Update the content mapping after the markup has been finished. Should always be True if it is intended to reuse the same content mapping for further markups in the same range or for other purposes.

Parameters:

divisibility

A dictionary that contains the information which tags (or nodes as identified by their name) are “harder” than other tags. Each key-tag in the dictionary is harder than (i.e. is allowed to split up) all tags in the associated value (which is a set of node-, or, for that matter, tag-names). Tag- or node-names associated with the wildcard key * can be split by any tag.

If the markup-method reaches nodes that cannot be split, it will split the markup-node instead, in order to cover the string to be marked up completely.

get_path_and_offset(pos: int, left_biased: bool = False) Tuple[Path, int][source]

Yields the path and relative position for the absolute position pos. See ContentMapping.get_path_index() for a description of the parameters.

Returns:

tuple (path, offset) where the offset is the position of pos relative to the actual position of the last node in the path.

Raises:

IndexError if not 0 <= position < length of document

get_path_index(pos: int, left_biased: bool = False) int[source]

Yields the index of the path in the given context-mapping that contains the position pos.

Parameters:
  • pos – a position in the content of the tree for which the content mapping was generated

  • left_biased – yields the location after the end of the previous path rather than the location at the very beginning of the next path. Default value is “False”.

Returns:

the integer index of the path in self.path_list that covers the given position pos

Raises:

IndexError if not 0 <= position < length of document

Example:

>>> tree = parse_sxpr('(a (b "123") (c (d "45") (e "67")))')
>>> cm = ContentMapping(tree)
>>> i = cm.get_path_index(4)
>>> path = cm.path_list[i]
>>> print(pp_path(path, 1, ', '))
a, c, d "45"
insert_node(pos: int, node: Node) Node[source]

Inserts a node at a specific position into the last or, if need be, the second-to-last node of the path in the context mapping that covers this position. Returns the parent of the newly inserted node.

iterate_paths(start_pos: int, end_pos: int, left_biased: bool = False) Iterator[Path][source]

Yields all paths from position start_pos up to and including position end_pos. Example:

>>> tree = parse_sxpr('(a (b "123") (c (d "456") (e "789")) (f "ABC"))')
>>> cm = ContentMapping(tree)
>>> [[nd.name for nd in p] for p in cm.iterate_paths(1, 12)]
[['a', 'b'], ['a', 'c', 'd'], ['a', 'c', 'e'], ['a', 'f']]
markup(start_pos: int, end_pos: int, name: str, *attr_dict, **attributes) Node[source]

Marks up the span [start_pos, end_pos[ by adding one or more Nodes with the given name, cutting through divisible nodes where necessary. Returns the nearest common ancestor of start_pos and end_pos.

Parameters:
  • start_pos – The string-position of the first character to be marked up. Note that this is the position in the string-content of the tree over which the content mapping has been generated and not the position in the XML or any other serialization of the tree!

  • end_pos – The string-position up to which the markup reaches, i.e. the position of the first character after the marked-up span. As in the slicing of Python lists or strings, [start_pos, end_pos[ thus defines a half-open interval (cf. the examples below, where the character at end_pos is not included in the markup). Also note that end_pos is the position in the string-content of the tree over which the content mapping has been generated and not the position in the XML or any other serialization of the tree!

  • name – The name, or “tag-name” in XML-terminology, of the element (or tag) to be added.

  • attr_dict – A dictionary of attributes that will be added to the newly created tag.

  • attributes – Alternatively, the attributes can also be passed as a list of named parameters.

Returns:

The nearest (from the top of the tree) node within which the entire markup lies.

Examples:

>>> from DHParser.toolkit import printw
>>> tree = parse_sxpr('(X (l ",.") (A (O "123") (P "456")) (m "!?") '
...                   ' (B (Q "789") (R "abc")) (n "+-"))')
>>> X = copy.deepcopy(tree)
>>> t = ContentMapping(X)
>>> _ = t.markup(2, 8, 'em')
>>> printw(X.as_sxpr(flatten_threshold=-1))
(X (l ",.") (A (em (O "123") (P "456"))) (m "!?") (B (Q "789") (R "abc"))
 (n "+-"))
>>> X = copy.deepcopy(tree)
>>> t = ContentMapping(X)
>>> _ = t.markup(2, 10, 'em')
>>> printw(X.as_sxpr(flatten_threshold=-1))
(X (l ",.") (em (A (O "123") (P "456")) (m "!?")) (B (Q "789") (R "abc"))
 (n "+-"))
>>> X = copy.deepcopy(tree)
>>> t = ContentMapping(X, divisibility={'A'})
>>> _ = t.markup(5, 10, 'em')
>>> printw(X.as_sxpr(flatten_threshold=-1))
(X (l ",.") (A (O "123")) (em (A (P "456")) (m "!?")) (B (Q "789") (R "abc"))
 (n "+-"))
>>> X = copy.deepcopy(tree)
>>> t = ContentMapping(X)
>>> _ = t.markup(2, 13, 'em')
>>> printw(X.as_sxpr(flatten_threshold=-1))
(X (l ",.") (em (A (O "123") (P "456")) (m "!?")) (B (em (Q "789")) (R "abc"))
 (n "+-"))
>>> X = copy.deepcopy(tree)
>>> t = ContentMapping(X)
>>> _ = t.markup(5, 16, 'em')
>>> printw(X.as_sxpr(flatten_threshold=-1))
(X (l ",.") (A (O "123") (em (P "456"))) (em (m "!?") (B (Q "789") (R "abc")))
 (n "+-"))
>>> X = copy.deepcopy(tree)
>>> t = ContentMapping(X)
>>> _ = t.markup(5, 13, 'em')
>>> printw(X.as_sxpr(flatten_threshold=-1))
(X (l ",.") (A (O "123") (em (P "456"))) (em (m "!?")) (B (em (Q "789"))
 (R "abc")) (n "+-"))
>>> X = copy.deepcopy(tree)
>>> t = ContentMapping(X)
>>> _ = t.markup(6, 12, 'em')
>>> printw(X.as_sxpr(flatten_threshold=-1))
(X (l ",.") (A (O "123") (P (:Text "4") (em "56"))) (em (m "!?"))
 (B (Q (em "78") (:Text "9")) (R "abc")) (n "+-"))
>>> X = copy.deepcopy(tree)
>>> t = ContentMapping(X)
>>> _ = t.markup(1, 17, 'em')
>>> printw(X.as_sxpr(flatten_threshold=-1))
(X (l (:Text ",") (em ".")) (em (A (O "123") (P "456")) (m "!?") (B (Q "789")
 (R "abc"))) (n (em "+") (:Text "-")))
>>> X = copy.deepcopy(tree)
>>> t = ContentMapping(X, divisibility={'em': {'l', 'n'}})
>>> _ = t.markup(1, 17, 'em')
>>> printw(X.as_sxpr(flatten_threshold=-1))
(X (l ",") (em (l ".") (A (O "123") (P "456")) (m "!?") (B (Q "789") (R "abc"))
 (n "+")) (n "-"))
rebuild_mapping(start_pos: int, end_pos: int)[source]

Reconstructs a particular section of the context mapping after the underlying tree has been restructured.

Parameters:
  • start_pos – The string position of the beginning of the text-area that has been affected by earlier changes.

  • end_pos – The string position of the ending of the text-area that has been affected by earlier changes.

rebuild_mapping_slice(first_index: int, last_index: int)[source]

Reconstructs a particular section of the context mapping after the underlying tree has been restructured. Unlike ContentMapping.rebuild_mapping(), the section that needs repairing is here defined by path indices and not by string positions.

Parameters:
  • first_index – The index (not the position within the string-content!) of the first path that has been affected by the restructuring of the tree. Use ContentMapping.get_path_index() to determine the path-index if only the position is known.

  • last_index – The index (not the position within the string-content!) of the last path that has been affected by the restructuring of the tree. Use ContentMapping.get_path_index() to determine the path-index if only the position is known.

Examples:

>>> tree = parse_sxpr('(a (b (c "123") (d "456")) (e (f (g "789") (h "ABC")) (i "DEF")))')
>>> cm = ContentMapping(tree)
>>> print(cm)
0 -> a, b, c "123"
3 -> a, b, d "456"
6 -> a, e, f, g "789"
9 -> a, e, f, h "ABC"
12 -> a, e, i "DEF"
>>> b = tree.pick('b')
>>> b.result = (b[0], Node('x', 'xyz'), b[1])
>>> cm.rebuild_mapping_slice(0, 1)
>>> print(cm)
0 -> a, b, c "123"
3 -> a, b, x "xyz"
6 -> a, b, d "456"
9 -> a, e, f, g "789"
12 -> a, e, f, h "ABC"
15 -> a, e, i "DEF"
>>> cm.auto_cleanup = False
>>> common_ancestor = cm.markup(10, 16, 'Y')
>>> print(common_ancestor.as_sxpr())
(e (f (g (:Text "7") (Y "89")) (Y (h "ABC"))) (i (Y "D") (:Text "EF")))
>>> print(cm)
0 -> a, b, c "123"
3 -> a, b, x "xyz"
6 -> a, b, d "456"
9 -> a, e, f, g (:Text "7") (Y "89")
12 -> a, e, f, h "ABC"
15 -> a, e, i (Y "D") (:Text "EF")
>>> a = cm.get_path_index(10)
>>> b = cm.get_path_index(16, left_biased=True)
>>> a, b
(3, 5)
>>> cm.rebuild_mapping_slice(3, 5)
>>> print(cm)
0 -> a, b, c "123"
3 -> a, b, x "xyz"
6 -> a, b, d "456"
9 -> a, e, f, g, :Text "7"
10 -> a, e, f, g, Y "89"
12 -> a, e, f, Y, h "ABC"
15 -> a, e, i, Y "D"
16 -> a, e, i, :Text "EF"

>>> tree = parse_sxpr('(a (b (c "123") (d "456")) (e (f (g "789") (h "ABC")) (i "DEF")))')
>>> cm = ContentMapping(tree, auto_cleanup=False)
>>> common_ancestor = cm.markup(0, 6, 'Y')
>>> print(common_ancestor.as_sxpr())
(b (Y (c "123") (d "456")))
>>> a = cm.get_path_index(0)
>>> b = cm.get_path_index(6, left_biased=True)
>>> a, b
(0, 1)
>>> cm.rebuild_mapping_slice(a, b)
>>> print(cm)
0 -> a, b, Y, c "123"
3 -> a, b, Y, d "456"
6 -> a, e, f, g "789"
9 -> a, e, f, h "ABC"
12 -> a, e, i "DEF"
class DHParser_JSONEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)[source]

A JSON-encoder that also encodes nodetree.Node as valid json objects. Node-objects are encoded using Node.as_json.

default(obj)[source]

Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).

For example, to support arbitrary iterators, you could implement default like this:

def default(self, o):
    try:
        iterable = iter(o)
    except TypeError:
        pass
    else:
        return list(iterable)
    # Let the base class default method raise the TypeError
    return JSONEncoder.default(self, o)
class FrozenNode(name: str, result: ResultType, leafhint: bool = True)[source]

FrozenNode is an immutable kind of Node, i.e. it must not be changed after initialization. The purpose is mainly to allow certain kinds of optimizations, like not having to instantiate empty nodes (because they are always the same and will be dropped while parsing, anyway) and to be able to trigger errors if the program tries to treat such temporary nodes as regular ones. (See DHParser.parse)

Frozen nodes must only be used temporarily during parsing or tree-transformation and should not occur in the product of the transformation anymore; this can be verified with tree_sanity_check(). Apart from that, frozen nodes can serve as comparison criteria for content equality when picking or selecting nodes or paths from a tree (see create_match_function()).

property attr

Returns a dictionary of XML-attributes attached to the node.

Examples:

>>> node = Node('', '')
>>> print('Any attributes present?', node.has_attr())
Any attributes present? False
>>> node.attr['id'] = 'identificator'
>>> node.attr
{'id': 'identificator'}
>>> node.attr['id']
'identificator'
>>> del node.attr['id']
>>> node.attr
{}

NOTE: Use Node.has_attr() rather than bool(node.attr) to probe the presence of attributes. Attribute dictionaries are created lazily, and node.attr would create a dictionary even though it may never be needed.

static from_json_obj(json_obj: Dict | Sequence) Node[source]

Converts a JSON-object representing a node (or tree) back into a Node object. Raises a ValueError, if json_obj does not represent a node.

property pos

Returns the position of the Node’s content in the source text.

property result: Tuple[Node, ...] | StringView | str

Returns the result from the parser that created the node.

to_json_obj(as_dict: bool = False, include_pos: bool = True) List[source]

Converts the tree into a JSON-serializable nested list. Nodes can be serialized in list-flavor (faster) or dictionary-flavor (as_dict=True, slower).

In list-flavor, Nodes are serialized as JSON-lists with either two or three elements:

  1. name (always a string),

  2. content (either a string or a list of JSON-serialized Nodes)

  3. optional: a dictionary that maps attribute names to attribute values, both of which are strings.

In dictionary flavor, Nodes are serialized as dictionaries that map the node’s name to a string (in case of a leaf node) or to a dictionary of its children (in case all the children’s names are unique) or to a list of pairs (child name, child’s result).

Examples (list flavor):

>>> Node('root', 'content').with_attr(importance="high").to_json_obj()
['root', 'content', {'importance': 'high'}]
>>> node = parse_sxpr('(letters (a "A") (b "B") (c "C"))')
>>> node.to_json_obj()
['letters', [['a', 'A'], ['b', 'B'], ['c', 'C']]]
Examples (dictionary flavor):
>>> node.to_json_obj(as_dict=True)
{'letters': {'a': 'A', 'b': 'B', 'c': 'C'}}
>>> node.result = node.children + (Node('a', 'doublette'),)
>>> node.to_json_obj(as_dict=True)
{'letters': [['a', 'A'], ['b', 'B'], ['c', 'C'], ['a', 'doublette']]}
with_pos(pos: int) Node[source]

Initializes the node’s position value. Usually, the parser takes care of assigning the positions in the document to the nodes of the parse-tree. However, when Nodes are created outside the reach of the parser guard, their document-position must be assigned manually. Position values of the child nodes are assigned recursively, too. Example:

>>> node = Node('test', 'position').with_pos(10)
>>> node.pos
10

>>> tree = parse_sxpr('(a (b (c "0") (d (e "1")(f "2"))) (g "3"))')
>>> _ = tree.with_pos(0)
>>> [(nd.name, nd.pos) for nd in tree.select(ANY_NODE, include_root=True)]
[('a', 0), ('b', 0), ('c', 0), ('d', 1), ('e', 1), ('f', 2), ('g', 3)]
Parameters:

pos – The position to be assigned to the node. Value must be >= 0.

Returns:

the node itself (for convenience).

Raises:

AssertionError if position has already been assigned or if parameter pos has a value < 0.

class Node(name: str, result: Tuple[Node, ...] | Node | StringView | str, leafhint: bool = False)[source]

Represents a node in a tree data structure. This can, for example, be the concrete or abstract syntax tree that is produced by a recursive descent parser.

There are three different kinds of nodes:

  1. Branch nodes that have children, but no string content. Other than in XML, there are no mixed-content nodes that contain strings as well as other tags. This constraint simplifies tree-processing considerably.

    The conversion to and from XML works by enclosing strings in a mixed-content tag with some freely chosen tag name, and dropping the tag name again when serializing to XML. Since this is easily done, there is no serious restriction involved in not allowing mixed-content nodes. See Node.as_xml() (parameter string_tags) and parse_xml().

  2. Leaf nodes that do not have children but only string content.

  3. The root node which contains further properties that are global properties of the parsing tree, such as the error list (which cannot be stored locally in the nodes, because nodes might be dropped during tree-processing, but error messages should not be forgotten!). Because of that, the root node requires a different class (RootNode) while leaf-nodes as well as branch nodes are both instances of class Node.

A node always has a tag name (which can be empty, though) and a result field, which stores the results of the parsing process and contains either a string or a tuple of child nodes.

All other properties are either optional or represent different views on these two properties. Among these are the attr-field, which contains a dictionary of XML-attributes, the children-field, which contains a tuple of child-nodes or an empty tuple if the node does not have child nodes, the content-field, which contains the string content of the node, and the pos-field, which contains the position of the node’s content in the source code, but may also be left uninitialized.

Variables:
  • name

    The name of the node, which is either its parser’s name or, if that is empty, the parser’s class name.

    By convention the parser’s class name when used as tag name is prefixed with a colon “:”. A node, the tag name of which starts with a colon “:” or the tag name of which is the empty string is considered as “anonymous”. See Node.anonymous()-property

  • result – The result of the parser which generated this node, which can be either a string or a tuple of child nodes.

  • children – The tuple of child nodes or an empty tuple if there are no child nodes. READ ONLY!

  • content – Yields the contents of the tree as string. The difference to str(node) is that node.content does not add the error messages to the returned string. READ ONLY!

  • pos

    the position of the node within the parsed text.

    The default value of pos is -1 meaning invalid by default. Setting pos to a value >= 0 will trigger the assignment of position values of all child nodes relative to this value.

    The pos field is WRITE ONCE, i.e. once assigned it cannot be reassigned. The assignment of the pos values happens either during the parsing process or when the node is later added to a tree whose pos-values have already been initialized.

    Thus, pos-values always retain their position in the source text. If, in any tree-processing stage after parsing, nodes are added or deleted, the pos values will not represent the position within the string value of the tree.

    Retaining the original source positions is crucial for correctly locating errors which might only be detected at later stages of the tree-transformation within the source text.

  • attr – An optional dictionary of attributes attached to the node. This dictionary is created lazily upon first usage.

property anonymous: bool

Returns True, if the Node is an “anonymous” Node, i.e. a node that has not been created by a named parser.

The tag name of an anonymous node contains a colon followed by the class name of the parser that created the node, e.g. ":Series". It is recommended practice to remove (or name) all anonymous nodes during the AST-transformation.

as_etree(ET=None, string_tags: AbstractSet[str] = frozenset({':Text'}), empty_tags: AbstractSet[str] = frozenset({}))[source]

Returns the tree as standard-library- or lxml-ElementTree.

Parameters:
  • ET – The ElementTree-library to be used. If None, the STL ElementTree will be used.

  • string_tags – A set of tags the content of which will be written without tag-name into the mixed content of the parent.

  • empty_tags – A set of tags that will be considered empty tags like “<br/>”. Nodes with any of these tags must not contain any content.

Returns:

The tree of Nodes as an ElementTree

as_html(css: str = '', head: str = '', lang: str = 'en', **kwargs) str[source]

Serialize as HTML-page. See Node.as_xml() for the further keyword-arguments.

as_json(indent: int | None = 2, ensure_ascii=False, as_dict: bool = False, include_pos: bool = True) str[source]

Serializes the tree originating in self as JSON-string. Nodes are serialized as JSON-lists with either two or three elements:

  1. name (always a string),

  2. content (either a string or a list of JSON-serialized Nodes)

  3. optional: a dictionary that maps attribute names to attribute values, both of which are strings.

If as_dict is True, nodes are serialized as JSON dictionaries, which can be more human-readable when serialized. Keep in mind, though, that while this renders the JSON files more readable, not all JSON parsers honor the order of the entries of dictionaries. Thus, serializing node trees as ordered JSON-dictionaries is not strictly in accordance with the JSON-specification! Also, serializing and de-serializing the dictionary-flavored JSON is slower.

Example:

>>> node = Node('root', 'content').with_attr(importance="high")
>>> node.as_json(indent=0)
'["root","content",{"importance":"high"}]'
>>> node.as_json(indent=0, as_dict=True)
'{"root":{"content__":"content","attributes__":{"importance":"high"}}}'
as_sxml(src: str | None = None, indentation: int = 2, compact: bool = True, flatten_threshold: int = 92, normal_form: int = 1) str[source]

Serializes the tree as SXML

as_sxpr(src: str | None = None, indentation: int = 2, compact: bool = True, flatten_threshold: int = 92, sxml: int = 0) str[source]

Serializes the tree as S-expression, i.e. in lisp-like form. If this method is called on a RootNode-object, error strings will be displayed as pseudo-attributes of the nodes where the error is located.

Parameters:
  • src – The source text or None. In case the source text is given the position of the element in the text will be reported as position, line, column. In case the empty string is given rather than None, only the position value will be reported in case it has been initialized, i.e. pos >= 0.

  • indentation – The number of whitespaces for indentation

  • compact – If True, a compact representation is returned where closing brackets remain on the same line as the last element.

  • flatten_threshold – Return the S-expression in flattened form if the flattened expression does not exceed the threshold length. A negative number means that it will always be flattened.

  • sxml

    If >= 1, attributes are rendered according to the SXML-conventions, e.g. (@ (attr "value")) instead of `(attr "value"). If 2, the attribute-node (@) will always be present, even if empty.

Returns:

A string containing the S-expression serialization of the tree.

as_tree() str[source]

Serialize as a simple indented text-tree.

as_xml(src: str | None = None, indentation: int = 2, inline_tags: AbstractSet[str] = frozenset({}), string_tags: AbstractSet[str] = frozenset({':Text'}), empty_tags: AbstractSet[str] = frozenset({}), strict_mode: bool = True) str[source]

Serializes the tree of nodes as XML.

Parameters:
  • src – The source text or None. In case the source text is given, the position will also be reported as line and column.

  • indentation – The number of whitespaces for indentation

  • inline_tags – A set of tag names, the content of which will always be written on a single line, unless it contains explicit line feeds (\n). In addition, all nodes that have the attribute xml:space="preserve" will be inlined.

  • string_tags – A set of tags from which only the content will be printed, but neither the opening tag nor its attr nor the closing tag. This allows producing a mix of plain text and child tags in the output, which otherwise is not supported by the Node object, because it requires its content to be either a tuple of children or string content.

  • empty_tags – A set of tags which shall be rendered as empty elements, e.g. “<empty/>” instead of “<empty></empty>”.

  • strict_mode – If True, violation of stylistic or interoperability rules raises a ValueError.

Returns:

The XML-string representing the tree originating in self

property attr

Returns a dictionary of XML-attributes attached to the node.

Examples:

>>> node = Node('', '')
>>> print('Any attributes present?', node.has_attr())
Any attributes present? False
>>> node.attr['id'] = 'identificator'
>>> node.attr
{'id': 'identificator'}
>>> node.attr['id']
'identificator'
>>> del node.attr['id']
>>> node.attr
{}

NOTE: Use Node.has_attr() rather than bool(node.attr) to probe the presence of attributes. Attribute dictionaries are created lazily, and node.attr would create a dictionary even though it may never be needed.

property children: ChildrenType

Returns the tuple of child-nodes or an empty tuple if the node does not have any child-nodes but only string content.

property content: str

Returns content as string. If the node has child-nodes, the string content of the child-nodes is recursively read and then concatenated.

equals(other: Node, ignore_attr_order: bool = True) bool[source]

Equality of value: Two nodes are considered as having the same value, if their tag name is the same, if their results are equal and if their attributes and attribute values are the same and if either ignore_attr_order is True or the attributes also appear in the same order.

Parameters:
  • other – The node to which self shall be compared.

  • ignore_attr_order – If True (default), two sets of attributes are considered as equal if their attribute-names and attribute-values are the same, no matter in which order the attributes have been added.

Returns:

True, if the tree originating in node self is equal by value to the tree originating in node other.

evaluate(actions: Dict[str, Callable], path: Path = []) Any[source]

Simple tree evaluation: For each node the action associated with the node’s tag-name is called with either the tuple of the evaluated children or, in case of a leaf-node, the result-string as parameter(s):

>>> tree = parse_sxpr('(plus (number 3) (mul (number 5) (number 4)))')
>>> from operator import add, mul
>>> actions = {'plus': add, 'mul': mul, 'number': int}
>>> tree.evaluate(actions)
23

evaluate() can operate in two modes. In the basic mode, shown in the example above, only the evaluated values of the children are passed to each function in the action dictionary. However, if evaluate() is called with the beginning of the path passed to its path-argument, each function will be called with the current path as its first argument and the evaluated values of its children as the following arguments, e.g. result = node.evaluate(actions, path=[node]). This more sophisticated mode gives the action functions access to the nodes of the tree as well.
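
For illustration, here is a minimal sketch of the path-passing mode. It is not taken from the module's own doctests and merely assumes the calling convention described above:

>>> tree = parse_sxpr('(plus (number 3) (number 4))')
>>> actions = {'plus': lambda path, a, b: a + b,        # path is passed first
...            'number': lambda path, txt: int(txt)}    # leaf: path plus the result-string
>>> tree.evaluate(actions, path=[tree])
7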

Parameters:
  • actions – A dictionary that maps node-names to action functions.

  • path – If not empty, the current tree-path will be passed as first argument (before the evaluation results of the children) to each action. Start with a list of the node itself to trigger passing the path.

Raises:
  • KeyError – if an action is missing in the table. Use the joker ‘*’ to avoid this error, e.g. { ..., '*': lambda node: node.content, ...}.

  • ValueError – in case any of the action functions cannot handle the passed arguments.

Returns:

the result of the evaluation

find_parent(node) Node | None[source]

Finds and returns the parent of node within the tree represented by self. If the tree does not contain node, the value None is returned.
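
A brief illustration (not from the module's own doctests; it simply follows the description above):

>>> tree = parse_sxpr('(a (b (c "1")))')
>>> grandchild = tree.pick('c')
>>> tree.find_parent(grandchild).name
'b'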

static from_etree(et, string_tag: str = ':Text') Node[source]

Converts a standard-library- or lxml-ElementTree to a tree of nodes.

Parameters:
  • et – the root element-object of the ElementTree

  • string_tag – A tag-name that will be used for the strings occurring in mixed content.

Returns:

a tree of nodes.

static from_json_obj(json_obj: Dict | Sequence) Node[source]

Converts a JSON-object representing a node (or tree) back into a Node object. Raises a ValueError, if json_obj does not represent a node.

get(key: int | slice | NodeSelector, surrogate: Node | Sequence[Node]) Node | Sequence[Node][source]

Returns the child node with the given index if key is an integer or the first child node with the given tag name. If no child with the given index or name exists, the surrogate is returned instead. This mimics the behavior of Python’s dictionary’s get()-method.

The type of the return value is always the same type as that of the surrogate. If the surrogate is a Node, but there are several items matching key, then only the first of these will be returned.
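
An illustrative sketch of this behavior (not part of the original documentation):

>>> tree = parse_sxpr('(a (b "1") (c "2") (b "3"))')
>>> surrogate = Node('empty', '')
>>> tree.get('b', surrogate).content
'1'
>>> tree.get('x', surrogate) is surrogate
True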

get_attr(attribute: str, default: Any) Any[source]

Returns the value of 'attribute' if the attribute exists. If not, the default value is returned. This function has the same semantics as node.attr.get(attribute, default), but with the advantage that, unlike node.attr.get(), it does not automatically create an attribute dictionary on (first) access.

Parameters:
  • attribute – The attribute, the value of which shall be looked up

  • default – A default value that is returned, in case attribute does not exist.

Returns:

the attribute’s value or, if unassigned, the default value.
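
A minimal usage sketch (not taken from the module's own doctests):

>>> node = Node('test', 'content').with_attr(status='ready')
>>> node.get_attr('status', 'unknown')
'ready'
>>> node.get_attr('missing', 'unknown')
'unknown'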

has_attr(attr: str = '') bool[source]

Returns True, if the node has the attribute attr or, in case attr is the empty string, any attributes at all; False otherwise.

This function does not create an attribute dictionary, therefore it should be preferred to querying node.attr when testing for the existence of any attributes.

has_equal_attr(other: Node, ignore_order: bool = True) bool[source]

Returns True, if self and other have the same attributes with the same attribute values. If ignore_order is False, the attributes must also appear in the same order.

index(selector: NodeSelector, start: int = 0, stop: int = 1073741824) int[source]

Returns the index of the first child that fulfills the given criterion (selector). If the parameters start and stop are given, the search is restricted to the children with indices from the half-open interval [start:stop[. If no such child exists, a ValueError is raised.

Parameters:
  • selector – the criterion by which the child is identified, the index of which shall be returned.

  • start – the first index to start searching.

  • stop – the last index that shall be searched

Returns:

the index of the first child that matches the criterion.

Raises:

ValueError, if no child matching the criterion was found.
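
An illustrative sketch, assuming the semantics described above:

>>> tree = parse_sxpr('(a (b "X") (c "Y") (b "Z"))')
>>> tree.index('c')
1
>>> tree.index('b', start=1)
2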

indices(selector: NodeSelector) Tuple[int, ...][source]

Returns the indices of all children that fulfil the given criterion.
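
A brief illustration (not taken from the module's own doctests):

>>> tree = parse_sxpr('(a (b "X") (c "Y") (b "Z"))')
>>> tree.indices('b')
(0, 2)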

insert(index: int, node: Node)[source]

Inserts a node at position index

locate(location: int) Node | None[source]

Returns the leaf-Node that covers the given location, where location is the actual position within self.content (not the source code position that the pos-attribute represents). If the location lies outside the node’s string content, None is returned.

See also ContentMapping for a more general approach to locating string positions within the tree.
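
A short sketch of the described behavior (not from the original documentation):

>>> tree = parse_sxpr('(a (b "123") (c "456"))')
>>> tree.locate(4).name
'c'
>>> tree.locate(99) is None
True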

locate_path(location: int) Path[source]

Like Node.locate(), only that the entire path (i.e. chain of descendants) relative to self is returned.

milestone_segment(begin: Path | Node, end: Path | Node) Node[source]

EXPERIMENTAL!!! Picks a segment from a tree beginning with the begin-milestone and ending with the end-milestone.

Parameters:
  • begin – the opening milestone (will be included in the result)

  • end – the closing milestone (will be included in the result)

Returns:

a tree(-segment) encompassing all nodes from the opening milestone up to and including the closing milestone.

pick(criteria: NodeSelector, include_root: bool = False, reverse: bool = False, skip_subtree: NodeSelector = <function deny>) Node | None[source]

Picks the first (or last if run in reverse mode) descendant that fulfils the given criterion. See create_match_function() for a catalogue of possible criteria.

This function is syntactic sugar for next(node.select(criterion, …)). However, rather than raising a StopIteration-error if no descendant with the given tag-name exists, it returns None.

pick_child(criteria: NodeSelector, reverse: bool = False) Node | None[source]

Picks the first child (or last, if run in reverse mode) that fulfils the given criterion. See create_match_function() for a catalogue of possible criteria.

This function is syntactic sugar for next(node.select_children(criterion, False)). However, rather than raising a StopIteration-error if no child with the given tag-name exists, it returns None.

pick_if(match_func: NodeMatchFunction, include_root: bool = False, reverse: bool = False, skip_subtree: NodeSelector = <function deny>) Node | None[source]

Picks the first (or last, if run in reverse mode) descendant for which the match-function returns True, or returns None if no matching node exists.

pick_path(criteria: PathSelector, include_root: bool = False, reverse: bool = False, skip_subtree: PathSelector = <function deny>) Path[source]

Like Node.pick(), only that the entire path (i.e. chain of descendants) relative to self is returned.

pick_path_if(match_func: PathMatchFunction, include_root: bool = False, reverse: bool = False, skip_subtree: NodeSelector = <function deny>) Path[source]

Picks the first (or last, if run in reverse mode) descendant-path for which the match-function returns True, or returns None if no matching path exists.

property pos: int

Returns the position of the Node’s content in the source text.

reconstruct_path(node: Node) Path[source]

Determines the chain of ancestors of a node that leads up to self. Note: The use of this quite inefficient method can most of the time be avoided by traversing the tree with the path-selector-methods (e.g. Node.select_path()) right from the start.

Parameters:

node – the descendant node, the ancestry of which shall be determined.

Returns:

the list of nodes starting with self and leading to node

Raises:

ValueError, in case node does not occur in the tree rooted in self

remove(node: Node)[source]

Removes node from the children of the node.

replace_by(replacement: Node, merge_attr: bool = False)[source]

Replaces the node’s name, result and attributes by those of another node. This effectively allows replacing the node without needing to change the children-tuple of the parent node. (See the example after the parameter list below.)

Parameters:
  • replacement – the node by which self shall be “replaced”.

  • merge_attr – if True, attributes are merged (by updating the attr-dictionary with that of the replacement node) rather than simply being replaced.
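
An illustrative sketch of replace_by() (not from the module's own doctests; the underscore assignment merely discards any return value):

>>> tree = parse_sxpr('(a (b "old"))')
>>> _ = tree.pick('b').replace_by(Node('c', 'new'))
>>> print(tree.as_sxpr())
(a (c "new"))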

property repr: str

Return a full (re-)executable representation of self including attributes and position value.

property result: StrictResultType

Returns the result from the parser that created the node.

select(criteria: NodeSelector, include_root: bool = False, reverse: bool = False, skip_subtree: NodeSelector = <function deny>) Iterator[Node][source]

Generates an iterator over all nodes in the tree that fulfill the given criterion. See create_match_function() for a catalogue of possible criteria.

Parameters:
  • criteria – The criteria for selecting nodes.

  • include_root – If False, only descendant nodes will be checked for a match.

  • reverse – If True, the tree will be walked in reverse order, i.e. last children first.

  • skip_subtree – A criterion to identify subtrees that the returned iterator shall not dive into. Note that the root-node of the subtree will still be yielded by the iterator.

Returns:

An iterator over all descendant nodes which fulfill the given criterion. Traversal is pre-order.

Examples:

>>> tree = parse_sxpr('(a (b "X") (X (c "d")) (e (X "F")))')
>>> list(flatten_sxpr(item.as_sxpr()) for item in tree.select("X", False))
['(X (c "d"))', '(X "F")']
>>> list(flatten_sxpr(item.as_sxpr()) for item in tree.select({"X", "b"}, False))
['(b "X")', '(X (c "d"))', '(X "F")']
>>> any(tree.select('a', False))
False
>>> list(flatten_sxpr(item.as_sxpr()) for item in tree.select('a', True))
['(a (b "X") (X (c "d")) (e (X "F")))']
>>> flatten_sxpr(next(tree.select("X", False)).as_sxpr())
'(X (c "d"))'
>>> tree = parse_sxpr('(a (b (c "") (d (e "")(f ""))) (g ""))')
>>> [nd.name for nd in tree.select(ANY_NODE)]
['b', 'c', 'd', 'e', 'f', 'g']
select_children(criteria: NodeSelector, reverse: bool = False) Iterator[Node][source]

Returns an iterator over all direct children of a node that fulfil the given criterion. See Node.select() for a description of the parameters.

select_if(match_func: NodeMatchFunction, include_root: bool = False, reverse: bool = False, skip_func: NodeMatchFunction = <function deny>) Iterator[Node][source]

Generates an iterator over all nodes in the tree for which match_function() returns True. See the more general function Node.select() for a detailed description and examples. The tree is traversed pre-order by the iterator.

select_path(criteria: PathSelector, include_root: bool = False, reverse: bool = False, skip_subtree: PathSelector = <function deny>) Iterator[Path][source]

Like Node.select() but yields the entire path (i.e. list of descendants, the last one being the matching node) instead of just the matching nodes.

select_path_if(match_func: PathMatchFunction, include_root: bool = False, reverse: bool = False, skip_func: PathMatchFunction = <function deny>) Iterator[Path][source]

Like Node.select_if() but yields the entire path (i.e. list of descendants, the last one being the matching node) instead of just the matching nodes. NOTE: In contrast to select_if(), match_function receives the complete path as argument, rather than just the last node!

serialize(how: str = 'default') str[source]

Serializes the tree originating in the node self either as S-expression, XML, JSON, or in compact form. Possible values for how are ‘S-expression’, ‘XML’, ‘JSON’, ‘indented’ accordingly, or ‘AST’, ‘CST’, ‘default’, in which case the value of the respective configuration variable determines the serialization format. (See module DHParser.configuration.)

strlen() int[source]

Returns the length of the string-content of this node. Mind that len(node) returns the number of children of this node!
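
A brief illustration of the difference (not from the original documentation):

>>> node = parse_sxpr('(a (b "123") (c "45"))')
>>> node.strlen()
5
>>> len(node)
2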

to_json_obj(as_dict: bool = False, include_pos: bool = True) List | Dict[source]

Converts the tree into a JSON-serializable nested list. Nodes can be serialized in list-flavor (faster) or dictionary-flavor (as_dict=True, slower).

In list-flavor, Nodes are serialized as JSON-lists with either two or three elements:

  1. name (always a string),

  2. content (either a string or a list of JSON-serialized Nodes)

  3. optional: a dictionary that maps attribute names to attribute values, both of which are strings.

In dictionary flavor, Nodes are serialized as dictionaries that map the node’s name to a string (in case of a leaf node) or to a dictionary of its children (in case all the children’s names are unique) or to a list of pairs (child name, child’s result).

Examples (list flavor):

>>> Node('root', 'content').with_attr(importance="high").to_json_obj()
['root', 'content', {'importance': 'high'}]
>>> node = parse_sxpr('(letters (a "A") (b "B") (c "C"))')
>>> node.to_json_obj()
['letters', [['a', 'A'], ['b', 'B'], ['c', 'C']]]
Examples (dictionary flavor):
>>> node.to_json_obj(as_dict=True)
{'letters': {'a': 'A', 'b': 'B', 'c': 'C'}}
>>> node.result = node.children + (Node('a', 'doublette'),)
>>> node.to_json_obj(as_dict=True)
{'letters': [['a', 'A'], ['b', 'B'], ['c', 'C'], ['a', 'doublette']]}
walk_tree(include_root: bool = False, reverse: bool = False) Iterator[Node][source]

Yields all nodes of the tree. (Faster than select())

walk_tree_paths(include_root: bool = False, reverse: bool = False)[source]

Yields all paths of the tree. (Faster than select_path_if())

with_attr(*attr_dict, **attributes) Node[source]

Adds the attributes which are passed to with_attr() either as an attribute dictionary or as keyword parameters to the node’s attributes and returns self. Example:

>>> node = Node('test', '').with_attr(animal = "frog", plant= "tree")
>>> dict(node.attr)
{'animal': 'frog', 'plant': 'tree'}
>>> node.with_attr({'building': 'skyscraper'}).repr
"Node('test', '').with_attr({'animal': 'frog', 'plant': 'tree', 'building': 'skyscraper'})"
Parameters:
  • attr_dict – a dictionary of attribute keys and values

  • attributes – alternatively, a sequence of keyword parameters

Returns:

self

with_pos(pos: int) Node[source]

Initializes the node’s position value. Usually, the parser takes care of assigning the positions in the document to the nodes of the parse-tree. However, when Nodes are created outside the reach of the parser guard, their document-position must be assigned manually. Position values of the child nodes are assigned recursively, too. Example:

>>> node = Node('test', 'position').with_pos(10)
>>> node.pos
10

>>> tree = parse_sxpr('(a (b (c "0") (d (e "1")(f "2"))) (g "3"))')
>>> _ = tree.with_pos(0)
>>> [(nd.name, nd.pos) for nd in tree.select(ANY_NODE, include_root=True)]
[('a', 0), ('b', 0), ('c', 0), ('d', 1), ('e', 1), ('f', 2), ('g', 3)]
Parameters:

pos – The position to be assigned to the node. Value must be >= 0.

Returns:

the node itself (for convenience).

Raises:

AssertionError if position has already been assigned or if parameter pos has a value < 0.

class RootNode(*args, **kwargs)[source]

The root node for the node-tree is a special kind of node that keeps and manages global properties of the tree as a whole. These are first and foremost the list of errors that occurred during tree generation (i.e. parsing) or any transformation of the tree.

Other properties concern the customization of the XML-serialization and meta-data about the processed document and processing stage.

Although errors are local properties that occur at a specific point or chunk of source code, the list of errors is managed globally by the root-node object instead of attaching the errors to the nodes on which they occurred. Otherwise, it would be hard to keep track of the errors when, during the transformation of trees, nodes that might contain error messages are replaced or dropped.

The root node can be instantiated before the tree is fully parsed. This is necessary, because the root node is needed for managing error messages during the parsing process, already. In order to connect the root node to the tree, when parsing is finished, the swallow()-method must be called.

Variables:
  • errors – A list of all errors that have occurred so far during processing (i.e. parsing, AST-transformation, compiling) of this tree. The errors are ordered by the time of their being added to the list.

  • errors_sorted – (read-only property) The list of errors ordered by their position.

  • error_nodes – A mapping of node-ids to a list of errors that occurred on the node with the respective id.

  • error_positions – A mapping of locations to a set of ids of nodes that contain an error at that particular location.

  • error_flag – the highest warning or error level of all errors that occurred.

  • source – The source code (after preprocessing)

  • source_mapping – A source mapping function to map source code positions to the positions of the non-preprocessed source. See module preprocess

  • lbreaks – A list of indices of all linebreaks in the source.

  • inline_tags – see Node.as_xml() for an explanation.

  • string_tags – see Node.as_xml() for an explanation.

  • empty_tags – see Node.as_xml() for an explanation.

  • docname – a name for the document

  • stage – a name for the current processing stage or the empty string (default). This name, if present, is used for verifying the stage in DHParser.compile.run_pipeline(). If stage contains the empty string, stage-verification is turned off (which may result in obscure error messages in case a tree-transformation is run on the wrong stage). Stage-names should be considered case-insensitive, i.e. “AST” is treated as the same stage as “ast”.

  • serialization_type – The kind of serialization for the current processing stage. Can be one of ‘XML’, ‘json’, ‘indented’, ‘S-expression’ or ‘default’. (The latter picks the default serialization from the configuration.)

  • data – Compiled data. If the data still is a tree this simply contains a reference to self.

add_error(node: Node | None, error: Error) RootNode[source]

Adds an Error object to the tree, locating it at a specific node.

as_xml(src: str | None = None, indentation: int = 2, inline_tags: AbstractSet[str] = frozenset({}), string_tags: AbstractSet[str] = frozenset({}), empty_tags: AbstractSet[str] = frozenset({}), strict_mode: bool = True) str[source]

Serializes the tree of nodes as XML.

Parameters:
  • src – The source text or None. In case the source text is given, the position will also be reported as line and column.

  • indentation – The number of whitespaces for indentation

  • inline_tags – A set of tag names, the content of which will always be written on a single line, unless it contains explicit line feeds (\n). In addition, all nodes that have the attribute xml:space="preserve" will be inlined.

  • string_tags – A set of tags from which only the content will be printed, but neither the opening tag nor its attr nor the closing tag. This allows producing a mix of plain text and child tags in the output, which otherwise is not supported by the Node object, because it requires its content to be either a tuple of children or string content.

  • empty_tags – A set of tags which shall be rendered as empty elements, e.g. “<empty/>” instead of “<empty></empty>”.

  • strict_mode – If True, violation of stylistic or interoperability rules raises a ValueError.

Returns:

The XML-string representing the tree originating in self

continue_with_data(data: Any)[source]

Drops the swallowed tree in favor of the (non-tree) data resulting from the compilation of the tree. The data can then be retrieved from the field self.data, which before the tree has been dropped contains a reference to the tree itself.

did_match() bool[source]

Returns True, if the parser that has generated this tree did match, False otherwise. Depending on whether the Grammar-object that generated the node-tree was called with complete_match=True or not, this requires either the complete document to have been matched or only the beginning.

Note: If the parser did match, this does not mean that it must have matched without errors. It simply means that no PARSER_STOPPED_BEFORE_END-error has occurred.

error_safe(level: ErrorCode = 1000) RootNode[source]

Asserts that the given tree does not contain any errors with a code equal to or higher than the given level. Returns the tree if this is the case, raises an AssertionError otherwise.

property errors_sorted: List[Error]

Returns the list of errors, ordered by their position.

new_error(node: Node, message: str, code: ErrorCode = 1000) RootNode[source]

Adds an error to this tree, locating it at a specific node.

Parameters:
  • node – the node where the error occurred

  • message – a string with the error message

  • code – an error code to identify the type of the error

node_errors(node: Node) List[Error][source]

Returns the list of errors that occurred on the node or on any child node at the node's position that has already been removed from the tree, for example, because it was an anonymous empty child node. The position of the node is here understood to cover the range [node.pos, node.pos + node.strlen()[.

serialize(how: str = '') str[source]

Serializes the tree originating in the node self either as S-expression, XML, JSON, or in compact form. Possible values for how are ‘S-expression’, ‘XML’, ‘JSON’, ‘indented’ accordingly, or ‘AST’, ‘CST’, ‘default’, in which case the value of the respective configuration variable determines the serialization format. (See module DHParser.configuration.)

swallow(node: Node | None, source: str | StringView = '', source_mapping: SourceMapFunc | None = None) RootNode[source]

Put self in the place of node by copying all its data. Returns self.

This is done by the parse.Grammar object after parsing has finished, so that the Grammar object always returns a node-tree rooted in a RootNode object.

It is possible to add errors to a RootNode object, before it has actually swallowed the root of the node-tree.

transfer_errors(src: Node, dest: Node)[source]

Transfers errors to a different node. While errors never get lost during AST-transformation, because they are kept by the RootNode, the nodes they are connected to may be dropped in the course of the transformation. This function allows attaching errors from a node that will be dropped to a different node.

add_class(node: Node, tokens: str, *, attribute: str = 'class')

Adds all tokens to attribute of node.
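
A minimal sketch, assuming add_class() follows the token semantics of add_token() below (the underscore assignment merely discards any return value):

>>> node = Node('span', 'text')
>>> _ = add_class(node, 'emphasis')
>>> node.attr['class']
'emphasis'
>>> _ = add_class(node, 'bold emphasis')
>>> node.attr['class']
'emphasis bold'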

add_token(token_sequence: str, tokens: str) str[source]

Adds the tokens from ‘tokens’ that are not already contained in token_sequence to the end of token_sequence:

>>> add_token('', 'italic')
'italic'
>>> add_token('bold italic', 'large')
'bold italic large'
>>> add_token('bold italic', 'bold')
'bold italic'
>>> add_token('red thin', 'stroked red')
'red thin stroked'
add_token_to_attr(node: Node, tokens: str, attribute: str)[source]

Adds all tokens to attribute of node.

can_split(t: Path, i: int, left_biased: bool = True, greedy: bool = True, match_func: PathMatchFunction = <function affirm>, skip_func: PathMatchFunction = <function deny>, divisible: AbstractSet[str] = frozenset({':EMPTY', ':RegExp', ':Text', ':Whitespace'})) int[source]

Returns the negative index of the first node in the path, from which on all nodes can be split or do not need to be split, because the split-index lies to the left or right of the node.

Examples:

>>> tree = parse_sxpr('(doc (p (:Text "ABC")))')
>>> can_split([tree, tree[0], tree[0][0]], 1)
-1
>>> can_split([tree, tree[0], tree[0][0]], 0)
-2
>>> can_split([tree, tree[0], tree[0][0]], 3)
-2
>>> # anonymous nodes, like ":Text" are always divisible
>>> can_split([tree, tree[0], tree[0][0]], 1, divisible=set())
-1
>>> # However, non anonymous nodes aren't ...
>>> tree = parse_sxpr('(doc (p (Text "ABC")))')
>>> can_split([tree, tree[0], tree[0][0]], 1, divisible=set())
0
>>> # ... unless explicitly mentioned
>>> tree = parse_sxpr('(doc (p (Text "ABC")))')
>>> can_split([tree, tree[0], tree[0][0]], 1, divisible={'Text'})
-1
>>> tree = parse_sxpr('(X (Z "!?") (A (B "123") (C "456")))')
>>> can_split(tree.pick_path('B'), 0)
-2

# edge cases
>>> can_split([parse_sxpr('(p "123")')], 1)
0
>>> can_split([parse_sxpr('(:Text "123")')], 1)
0
content_of(segment: Node | Tuple[Node, ...] | StringView | str, select: PathSelector = <function LEAF_PATH>, ignore: PathSelector = <function deny>) str[source]

Returns the string content from a single node or a tuple of Nodes.
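
A brief illustration (not taken from the module's own doctests):

>>> content_of(parse_sxpr('(a (b "123") (c "456"))'))
'123456'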

create_match_function(criterion: NodeSelector) NodeMatchFunction[source]

Creates a node-match-function (Node -> bool) for the given criterion that returns True, if the node passed to the function matches the criterion.

criterion               type of match
----------------------  ----------------------------------------------------
id (int)                object identity
Node                    object identity
FrozenNode              equality of tag name, string content and attributes
tag name (str)          equality of tag name only
multiple tag names      equality of tag name with one of the given names
pattern (re.Pattern)    full match of content with pattern
match function          function returns True

Parameters:

criterion – Either a node, the id of a node, a frozen node, a name or a container (usually a set) of multiple tag names, a regular expression pattern or another match function.

Returns:

a match-function (Node -> bool) for the given criterion.

create_path_match_function(criterion: PathSelector) PathMatchFunction[source]

Creates a path-match-function (Path -> bool) for the given criterion that returns True, if the last node in the path passed to the function matches the criterion.

See create_match_function() for a description of the possible criteria and their meaning.

Parameters:

criterion – Either a node, the id of a node, a frozen node, a name or a container (usually a set) of multiple tag names, a regular expression pattern or another match function.

Returns:

a match-function (Path -> bool) for the given criterion.

deep_split(path: Path, i: int, left_biased: bool = True, greedy: bool = True, match_func: PathMatchFunction = <function affirm>, skip_func: PathMatchFunction = <function deny>, chain_attr_name: str = '') int[source]

Split all nodes from the end of the path up to the i-th element, but excluding the first node in the path. Returns the index of the split-location in the first node of the path.

Examples:

>>> from DHParser.toolkit import printw
>>> tree = parse_sxpr('(X (s "") (A (u "") (C "One, ") (D "two, ")) '
...                   '(B (E "three, ") (F "four!") (t "")))')
>>> X = copy.deepcopy(tree)
>>> C = X.pick_path('C')
>>> a = deep_split(C, 0)
>>> a
1
>>> F = X.pick_path('F', reverse=True)
>>> b = deep_split(F, F[-1].strlen(), left_biased=False)
>>> b
3
>>> printw(X.as_sxpr())
(X (s) (A (u) (C "One, ") (D "two, ")) (B (E "three, ") (F "four!") (t)))
>>> a = deep_split(C, 0, greedy=False)
>>> a
2
>>> b = deep_split(F, F[-1].strlen(), left_biased=False, greedy=False)
>>> b
4
>>> printw(X.as_sxpr(flatten_threshold=-1))
(X (s) (A (u)) (A (C "One, ") (D "two, ")) (B (E "three, ") (F "four!"))
 (B (t)))

>>> X = copy.deepcopy(tree).with_pos(0)
>>> C = X.pick_path('C')
>>> a = deep_split(C, 4)
>>> E = X.pick_path('E')
>>> b = deep_split(E, 0, left_biased=False)
>>> a, b
(2, 3)
>>> printw(X.as_sxpr(flatten_threshold=-1))
(X (s) (A (u) (C "One,")) (A (C " ") (D "two, ")) (B (E "three, ") (F "four!")
 (t)))
>>> X.result = X[:a] + (Node('em', X[a:b]).with_pos(X[a].pos),) + X[b:]
>>> printw(X.as_sxpr(flatten_threshold=-1))
(X (s) (A (u) (C "One,")) (em (A (C " ") (D "two, "))) (B (E "three, ")
 (F "four!") (t)))

# edge cases
>>> Y = parse_sxpr('(Y "123")')
>>> deep_split([Y], 1)
1
>>> print(Y.as_sxpr())
(Y "123")
deserialize(xml_sxpr_or_json: str) Node | None[source]

Parses either an XML-, an S-expression- or a JSON-representation of a syntax-tree. The kind of representation is detected automatically.

ensuing_str(path: Path, length: int = -1) str[source]

Returns length characters from the string content succeeding the path.
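
An illustrative sketch, assuming the behavior described above:

>>> tree = parse_sxpr('(a (b "123") (c "456"))')
>>> ensuing_str(tree.pick_path('b'), 3)
'456'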

eq_tokens(token_sequence1: str, token_sequence2: str) bool[source]

Returns True if both token sequences contain the same tokens, no matter in what order:

>>> eq_tokens('red thin stroked', 'stroked red thin')
True
>>> eq_tokens('red thin', 'thin blue')
False
find_common_ancestor(path_A: Path, path_B: Path) Tuple[Node | None, int][source]

Returns the last common ancestor of path_A and path_B and its index in the path. If there is no common ancestor, (None, undefined integer) is returned.

flatten_sxpr(sxpr: str, threshold: int = -1) str[source]

Returns S-expression sxpr as a one-liner without unnecessary whitespace.

The threshold value is a maximum number of characters allowed in the flattened expression. If this number is exceeded the un-flattened S-expression is returned. A negative number means that the S-expression will always be flattened. Zero or (any positive integer <= 3) essentially means that the expression will not be flattened. Example:

>>> flatten_sxpr('(a\n    (b\n        c\n    )\n)\n')
'(a (b c))'
Parameters:
  • sxpr – an S-expression in string form

  • threshold – maximum allowed string-length of the flattened S-expression. A value < 0 means that it may be arbitrarily long.

Returns:

Either flattened S-expression or, if the threshold has been overstepped, the original S-expression without leading or trailing whitespace.

flatten_xml(xml: str) str[source]

Returns an XML-tree as a one-liner without unnecessary whitespace, i.e. only whitespace within leaf-nodes is preserved.

A more precise alternative to flatten_xml() is to use Node.as_xml(), passing a set containing the top-level tag to the parameter inline_tags.

Parameters:

xml – the XML-Text to be “flattened”

Returns:

the flattened XML-Text

foregoing_str(path: Path, length: int = -1) str[source]

Returns length characters from the string content preceding the path.

has_class(node: Node, tokens: str, *, attribute: str = 'class')

Returns True, if ‘attribute’ of ‘node’ contains all ‘tokens’.

has_token(token_sequence: str, tokens: str) bool[source]

Returns True, if token is contained in the blank-separated token sequence. If token itself is a blank-separated sequence of tokens, True is returned if all tokens are contained in token_sequence:

>>> has_token('bold italic', 'italic')
True
>>> has_token('bold italic', 'normal')
False
>>> has_token('bold italic', 'italic bold')
True
>>> has_token('bold italic', 'bold normal')
False
has_token_on_attr(node: Node, tokens: str, attribute: str)[source]

Returns True, if ‘attribute’ of ‘node’ contains all ‘tokens’.

insert_node(leaf_path: Path, rel_pos: int, node: Node, divisible_leaves: Container = frozenset({':EMPTY', ':RegExp', ':Text', ':Whitespace'})) Node[source]

Inserts a node at a specific position into the last or, possibly, the second-to-last node in the path. The path must be a “leaf”-path, i.e. a path that ends in a leaf. Returns the parent of the newly inserted node.

leaf_path(path: Path | None, pick_child: PickChildFunction, *, steps: int = -1) Path | None

Returns the path of a descendant that follows steps generations down the tree originating in path[-1]. If steps < 0, this will be as many generations as are needed to reach a leaf-node. The function pick_child determines which branch to follow during each iteration, as long as the top of the path is not yet a leaf node. A path-parameter value of None will simply be passed through.

leaf_paths(criterion: PathSelector) PathMatchFunction[source]

Creates a path-match function that matches only, and all, leaf-paths among the paths that the given criterion matches. Warning: This may be slower than a custom algorithm that matches only leaf-paths right from the start. Example:

>>> xml = '''<doc><p>In München<footnote><em>München</em> is the German
... name of the city of Munich</footnote> is a Hofbräuhaus</p></doc>'''
>>> tree = parse_xml(xml)
>>> for path in tree.select_path(leaf_paths('footnote')):
...    pp_path(path, 1)
'doc <- p <- footnote <- em "München"'
'doc <- p <- footnote <- :Text " is the German\nname of the city of Munich"'

Compare this with the result without the leaf_paths-filter:

>>> for path in tree.select_path('footnote'):
...    pp_path(path, 1)
'doc <- p <- footnote "München is the German\nname of the city of Munich"'
match_path_str(path_str: str, glob_pattern: str) bool[source]

Matches a path_str against a glob-pattern.

next_path(path: Path) Path | None[source]

Returns the path of the successor of the last node in the path. The successor is the sibling of the same parent Node succeeding the node, or if it already is the last sibling, the parent’s sibling succeeding the parent, or grand parent’s sibling and so on. In case no successor is found when the first ancestor has been reached, None is returned.
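
A minimal sketch of this behavior (not from the original documentation):

>>> tree = parse_sxpr('(a (b (c "1")) (d "2"))')
>>> path = tree.pick_path('c')
>>> next_path(path)[-1].name
'd'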

parse_json(json_str: str) RootNode[source]

Parses a JSON-representation of a node-tree. Other than parse_xml(), this function does not convert any JSON document into a node-tree, but only JSON documents that represent a node-tree, e.g. a JSON document that has been produced by Node.as_json()!

parse_sxml(sxml: str | StringView) RootNode[source]

Generates a tree of nodes from SXML. Example:

>>> sxml = '(employee(@ (branch "Secret Service") (id "007")) "James Bond")'
>>> tree = parse_sxml(sxml)
>>> print(tree.as_xml())
<employee branch="Secret Service" id="007">James Bond</employee>
parse_sxpr(sxpr: str | StringView) RootNode[source]

Generates a tree of nodes from an S-expression.

This can - among other things - be used for deserialization of trees that have been serialized with Node.as_sxpr() or as a convenient way to generate test data.

Example:

>>> parse_sxpr('(a (b c))').as_sxpr(flatten_threshold=0)
'(a\n  (b "c"))'

parse_sxpr() does not initialize the node’s pos-values. This can be done with Node.with_pos():

>>> tree = parse_sxpr('(A (B "x") (C "y"))').with_pos(0)
>>> tree['C'].pos
1
parse_xml(xml: str | StringView, string_tag: str = ':Text', ignore_pos: bool = False, out_empty_tags: Set[str] = {}, strict_mode: bool = True) RootNode[source]

Generates a tree of nodes from a (Pseudo-)XML-source.

Parameters:
  • xml – The XML-string to be parsed into a tree of Nodes

  • string_tag – A tag-name that will be used for strings inside mixed-content-tags.

  • ignore_pos – if True, ‘_pos’-attributes will be understood as normal XML-attributes. Otherwise, ‘_pos’ will be understood as a special attribute, the value of which will be written to node._pos and not transferred to the node.attr-dictionary.

  • out_empty_tags – A set that is filled with the names of those tags that are empty tags, e.g. “<br/>”

  • strict_mode – If True, errors are raised if XML contains stylistic or interoperability errors, like using one and the same tag-name for empty and non-empty tags, for example.

path_sanity_check(path: Path) bool[source]

Checks whether consecutive nodes in the path-list really are immediate descendants of each other.

path_str(path: Path) str[source]

Returns the path as pseudo filepath of tag-names.

pick_from_path(path: Path, criterion: NodeSelector, reverse: bool = False) Node | None[source]

Picks the first node from the path that fulfils the criterion. Returns None if the path does not contain any node fulfilling the criterion.

pick_from_path_if(path: Path, match_func: NodeMatchFunction, reverse: bool = False) Node | None[source]

Picks the first node from the path for which match_func is True. Returns None if the path does not contain any node for which this is the case.

pick_path(start_path: Path, criteria: PathSelector, include_root: bool = False, reverse: bool = False, skip_subtree: PathSelector = <function deny>) Path | None[source]

Returns the first path for which the criterion matches. If no path matches in the given direction (forward by default, or backward if parameter reverse is True), None is returned.

pick_path_if(start_path: Path, match_func: PathMatchFunction, include_root: bool = False, reverse: bool = False, skip_subtree: PathSelector = <function deny>) Path | None[source]

Returns the first path for which match_func becomes True. If no path matches in the given direction (forward by default, or backward if parameter reverse is True), None is returned.

pp_path(path: Path, with_content: int = 0, delimiter: str = ' <- ') str[source]

Serializes a path as string.

Parameters:
  • path – the path to be serialized.

  • with_content – the number of nodes from the end of the path for which the content will be displayed next to the name.

  • delimiter – The delimiter separating the nodes in the returned string.

Returns:

the string-serialization of the given path.
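
An illustrative sketch, following the output format of the leaf_paths()-example above:

>>> tree = parse_sxpr('(a (b (c "content")))')
>>> pp_path(tree.pick_path('c'), 1)
'a <- b <- c "content"'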

pred_siblings(path: Path, criteria: NodeSelector = <function affirm>, reverse: bool = False) Iterator[Node][source]

Returns an iterator over the siblings preceding the end-node in the path. Siblings are iterated left to right, so if the end-node of the path is the 5th child of its parent (path[-2]), the siblings will be iterated starting with the 1st child (not with the 4th!). This can be reversed with reverse=True.

prev_path(path: Path) Path | None[source]

Returns the path of the predecessor of the last node in the path. The predecessor is the sibling of the same parent-node preceding the node, or, if it already is the first sibling, the parent’s sibling preceding the parent, or grandparent’s sibling and so on. In case no predecessor is found, when the first ancestor has been reached, None is returned.

remove_class(node: Node, tokens: str, *, attribute: str = 'class')

Removes all tokens from attribute of node.

remove_token(token_sequence, tokens: str) str[source]

Removes all tokens given in ‘tokens’ from token_sequence:

>>> remove_token('red thin stroked', 'thin')
'red stroked'
>>> remove_token('red thin stroked', 'blue')
'red thin stroked'
>>> remove_token('red thin stroked', 'blue stroked')
'red thin'
remove_token_from_attr(node: Node, tokens: str, attribute: str)[source]

Removes all tokens from attribute of node.

reset_chain_ID(chain_length: int = 3)[source]

For testing and debugging, reset the chain_id counter to ensure deterministic results.

Parameters:

chain_length – The starting length of the letter-chain used as ID value

select_from_path(path: Path, criteria: NodeSelector, reverse: bool = False) Iterator[Node][source]

Yields all nodes from path which fulfill the criterion.

select_from_path_if(path: Path, match_func: NodeMatchFunction, reverse: bool = False) Iterator[Node][source]

Yields all nodes from path for which the match_function is true.

select_path(start_path: Path, criteria: PathSelector, include_root: bool = False, reverse: bool = False, skip_subtree: PathSelector = <function deny>) Iterator[Path][source]

Like select_path_if() but yields the entire path (i.e. list of descendants, the last one being the matching node) instead of just the matching nodes.

select_path_if(start_path: Path, match_func: PathMatchFunction, include_root: bool = False, reverse: bool = False, skip_func: PathMatchFunction = <function deny>) Iterator[Path][source]

Creates an Iterator yielding all paths for which the match_function is true, starting from path.

split_node(node: Node, parent: Node, i: int, left_biased: bool = True, chain_attr: dict | None = None) int[source]

Splits a node at the given index (in case of a branch-node) or string-position (in case of a leaf-node). Returns the index of the right part within the parent node after the split. (This means that with node.insert(index, nd), nd will be inserted exactly at the split location.)

Non-anonymous node that have been split will be marked by updateing their attribute-dictionary with the chain_attr-dictionary if given.

Parameters:
  • node – the node to be split

  • parent – the node’s parent

  • i – the index either of the child or of the character before which the node will be split.

  • left_biased – if True, yields the location after the end of the previous path rather than the location at the very beginning of the next path. Default value is “True”.

  • chain_attr – a dictionary with a single key and value resembling an attribute and value that will be added to the attributes-dictionary of both nodes after the split, if the node is a named node.

Returns:

the index of the split within the children’s tuple of the parent node.

Examples:

>>> test_tree = parse_sxpr('(X (A "Hello, ") (B "Peter") (C " Smith"))').with_pos(0)
>>> X = copy.deepcopy(test_tree)

# test edge cases first
>>> split_node(X['B'], X, 0)
1
>>> print(X.as_sxpr())
(X (A "Hello, ") (B "Peter") (C " Smith"))
>>> split_node(X['B'], X, X['B'].strlen())
2
>>> print(X.as_sxpr())
(X (A "Hello, ") (B "Peter") (C " Smith"))

# standard case
>>> split_node(X['B'], X, 2)
2
>>> print(X.as_sxpr())
(X (A "Hello, ") (B "Pe") (B "ter") (C " Smith"))
>>> print(X.pick('B', reverse=True).pos)
9

# use split() as preparation for adding markup
>>> X = copy.deepcopy(test_tree)
>>> a = split_node(X['A'], X, 6)
>>> a
1
>>> b = split_node(X['C'], X, 1)
>>> b
4
>>> print(X.as_sxpr())
(X (A "Hello,") (A " ") (B "Peter") (C " ") (C "Smith"))
>>> markup = Node('em', X[a:b]).with_pos(X[a].pos)
>>> X.result = X[:a] + (markup,) + X[b:]
>>> print(X.as_sxpr())
(X (A "Hello,") (em (A " ") (B "Peter") (C " ")) (C "Smith"))

# a more complex case: add markup to a nested tree
>>> X = parse_sxpr('(X (A "Hello, ") (B "Peter") (bold (C " Smith")))').with_pos(0)
>>> a = split_node(X['A'], X, 6)
>>> b0 = split_node(X['bold']['C'], X['bold'], 1)
>>> b0
1
>>> print(X.as_sxpr())
(X (A "Hello,") (A " ") (B "Peter") (bold (C " ") (C "Smith")))
>>> b = split_node(X['bold'], X, b0)
>>> b
4
>>> print(X.as_sxpr())
(X (A "Hello,") (A " ") (B "Peter") (bold (C " ")) (bold (C "Smith")))
>>> markup = Node('em', X[a:b]).with_pos(X[a].pos)
>>> X.result = X[:a] + (markup,) + X[b:]
>>> print(X.as_sxpr())
(X (A "Hello,") (em (A " ") (B "Peter") (bold (C " "))) (bold (C "Smith")))

# use left_bias hint for potentially ambiguous cases:
>>> X = parse_sxpr('(X (A ""))')
>>> split_node(X['A'], X, X['A'].strlen())
0
>>> split_node(X['A'], X, X['A'].strlen(), left_biased=False)
1
strlen_of(segment: Union[Node, Sequence[Node, ...], StringView, str], select: PathSelector = <function LEAF_PATH>, ignore: PathSelector = <function deny>) int[source]

Returns the string size from a single node or a tuple of Nodes.

succ_siblings(path: Path, criteria: NodeSelector = <function affirm>, reverse: bool = False) Iterator[Node][source]

Returns an iterator over the siblings succeeding the end-node in the path. Siblings are iterated left to right. This can be reversed with reverse=True.

tree_sanity_check(tree: Node) bool[source]

Sanity check for node-trees: One and the same node must never appear twice in the node-tree. Frozen nodes (EMPTY_NODE, PLACEHOLDER) should only exist temporarily and must have been dropped or eliminated before any kind of tree generation (i.e. parsing) or transformation is finished.

Parameters:
  • tree – the root of the tree to be checked

Returns:

True, if the tree is “sane”, False otherwise.

validate_token_sequence(token_sequence: str) bool[source]

Returns True, if token_sequence is properly formed.

Token sequences are strings of words which are separated by single blanks, with no leading or trailing blank.
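
Illustrative values derived from this rule (not taken from the original docs):

validate_token_sequence('red thin stroked')    # True:  single blanks only
validate_token_sequence(' red thin stroked')   # False: leading blank
validate_token_sequence('red  thin')           # False: double blank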

zoom_into_path(path: Path | None, pick_child: PickChildFunction, steps: int) Path | None[source]

Returns the path of a descendant that follows steps generations up the tree originating in path[-1]. If steps < 0 this will be as many generations as are needed to reach a leaf-node. The function pick_child determines which branch to follow during each iteration, as long as the top of the path is not yet a leaf node. A path-parameter value of None will simply be passed through.

Module transform

Module transform contains the functions for transforming the concrete syntax tree (CST) into an abstract syntax tree (AST).

As these functions are very generic, they can in principle be used for any kind of tree transformations, not necessarily only for CST -> AST transformations.

add_attributes(path: Path, attributes: dict)[source]
add_attributes(*args: dict, **kwargs)

Adds the attributes in the given dictionary to the XML-attributes of the last node in the given path.

add_error(path: Path, error_msg: str, error_code: ErrorCode = 1000)[source]
add_error(*args: str, **kwargs)

Raises an error unconditionally. This makes sense in case illegal paths are encoded in the syntax to provide more accurate error messages.

all_of(path: Path, bool_func_set: AbstractSet[Callable]) bool[source]
all_of(*args: Callable)
all_of(*args: Set, **kwargs)

Returns True, if all of the bool functions in bool_func_set evaluate to True for the given path.

always(path: Path) bool[source]

Always returns True, no matter what the state of the path is.

any_of(path: Path, bool_func_set: AbstractSet[Callable]) bool[source]
any_of(*args: Callable)
any_of(*args: Set, **kwargs)

Returns True, if any of the bool functions in bool_func_set evaluate to True for the given path.

apply_if(path: Path, transformation: Callable | Tuple[Callable, ...], condition: Callable)[source]
apply_if(*args: Callable, **kwargs)
apply_if(*args: tuple, **kwargs)

Applies a transformation only if a certain condition is met.
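
A hedged sketch of how apply_if typically appears in a transformation table (the tag name 'paragraph' is illustrative):

ast_table = {
    'paragraph': [apply_if(remove, is_empty)],   # drop empty paragraphs only
}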

apply_ifelse(path: Path, if_transformation: Callable | Tuple[Callable, ...], else_transformation: Callable | Tuple[Callable, ...], condition: Callable)[source]
apply_ifelse(*args: Callable, **kwargs)
apply_ifelse(*args: tuple, **kwargs)

Applies one particular transformation if a certain condition is met and another transformation otherwise.

apply_unless(path: Path, transformation: Callable | Tuple[Callable, ...], condition: Callable)[source]
apply_unless(*args: Callable, **kwargs)
apply_unless(*args: tuple, **kwargs)

Applies a transformation if a certain condition is not met.

assert_has_children(path: Path, *, condition: Callable = <function <lambda>>, error_msg: str = 'Element "%s" has no children', error_code: ErrorCode = 1000)

Checks for condition; adds an error or warning message if condition is not met.

change_name(path: Path, name: str)[source]
change_name(*args: str, **kwargs)

Changes the tag name of the last node in the path.

Parameters:
  • path – the path whose last node shall be renamed

  • name – The new tag name.

collapse(path: Path)[source]

Collapses all sub-nodes of a node by replacing them with the string representation of the node. USE WITH CARE!

>>> sxpr = '(place (abbreviation "p.") (page "26") (superscript "b") (mark ",") (page "18"))'
>>> tree = parse_sxpr(sxpr)
>>> collapse([tree])
>>> print(flatten_sxpr(tree.as_sxpr()))
(place "p.26b,18")
collapse_children_if(path: Path, condition: ~typing.Callable, target_name: str, merge_rule: ~typing.Callable[[~typing.Sequence[~DHParser.nodetree.Node]], ~DHParser.nodetree.Node] = <function join_content>)[source]
collapse_children_if(*args: Callable, **kwargs)

(Recursively) merges the content of all adjacent child nodes that fulfill the given condition into a single leaf node with the tag-name target_name. Nodes that do not fulfill the condition will be preserved.

>>> sxpr = '(place (abbreviation "p.") (page "26") (superscript "b") (mark ",") (page "18"))'
>>> tree = parse_sxpr(sxpr)
>>> collapse_children_if([tree], not_one_of({'superscript', 'subscript'}), 'text')
>>> print(flatten_sxpr(tree.as_sxpr()))
(place (text "p.26") (superscript "b") (text ",18"))

See test_transform.TestComplexTransformations for more examples.

contains_only_whitespace(path: Path) bool[source]

Returns True for nodes that contain only whitespace regardless of the name, i.e. nodes the content of which matches the regular expression /\s*/, including empty nodes. Note that this is not true for anonymous whitespace nodes that contain comments.

content_matches(path: Path, regexp: str) bool[source]
content_matches(*args: str, **kwargs)

Checks a node’s content against a regular expression.

In contrast to re.match the regular expression must match the complete string and not just the beginning of the string to succeed!

del_attributes(path: Path, attributes: Set = frozenset({}))[source]
del_attributes(*args: Set, **kwargs)

Removes XML-attributes from the last node in the given path. If the given set is empty, all attributes will be deleted.

delimit_children(path: Path, node_factory: Callable)[source]
delimit_children(*args: Callable, **kwargs)

Add a delimiter drawn from the node_factory between all children.
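
A hedged sketch (tag names illustrative); node_maker(), documented further below, supplies the delimiter factory:

ast_table = {
    'enumeration': [delimit_children(node_maker(':Text', ', '))],
}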

error_on(path: Path, condition: Callable, error_msg: str = '', error_code: ErrorCode = 1000)[source]
error_on(*args: Callable, **kwargs)

Checks for condition; adds an error or warning message if condition is not met.

flatten(path: Path, condition: ~typing.Callable = <function is_anonymous>, recursive: bool = True)[source]
flatten(*args: Callable, **kwargs)

Flattens all children, that fulfill the given condition (default: all unnamed children). Flattening means that wherever a node has child nodes, the child nodes are inserted in place of the node.

If the parameter recursive is True the same will recursively be done with the child-nodes, first. In other words, all leaves of this node and its child nodes are collected in-order as direct children of this node.

Applying flatten recursively will result in these kinds of structural transformation:

(1 (+ 2) (+ 3))    ->   (1 + 2 + 3)
(1 (+ (2 + (3))))  ->   (1 + 2 + 3)
has_ancestor(path: Path, name_set: AbstractSet[str], generations: int = -1, until: AbstractSet[str] | str = frozenset({})) bool[source]
has_ancestor(*args: Set, **kwargs)

Checks whether a node with one of the given tag names appears somewhere in the path before the last node in the path.

Parameters:
  • generations – determines how deep has_ancestor should dive into the ancestry. “1” means only the immediate parents will be considered, “2” means also the grandparents, and so on. A value smaller than or equal to zero means all ancestors will be considered.

  • until – node-names which, when reached, will stop has_ancestor from searching further, even if the generations-parameter would allow a deeper search.
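
A hedged sketch of has_ancestor used as a condition (tag names illustrative): anonymous whitespace is kept inside 'verbatim' blocks and removed everywhere else:

ast_table = {
    ':Whitespace': [apply_unless(remove, has_ancestor({'verbatim'}))],
}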

has_attr(path: Path, attr: str, value: str | None = None) bool[source]
has_attr(*args: str, **kwargs)

Returns true, if the node has the attribute attr and its value equals value. If value is None, True is returned if the attribute exists, no matter what its value is.

has_child(path: Path, name_set: AbstractSet[str]) bool[source]
has_child(*args: str)
has_child(*args: Set, **kwargs)

Checks whether at least one child (i.e. immediate descendant) has one of the given tags.

has_children(path: Path) bool[source]

Checks whether last node in path has children.

has_content(path: Path, content: str) bool[source]
has_content(*args: str, **kwargs)

Checks a node’s content against a given string. This is faster than content_matches for mere string comparisons.

has_parent(path: Path, name_set: AbstractSet[str]) bool[source]
has_parent(*args: str)
has_parent(*args: Set, **kwargs)

Checks whether the immediate predecessor in the path has one of the given tags.

insert(path: Path, position: int | tuple | Callable, node_factory: Callable)[source]
insert(*args: int, **kwargs)
insert(*args: tuple, **kwargs)
insert(*args: Callable, **kwargs)

Inserts a delimiter at a specific position within the children. If position is None nothing will be inserted. Position values greater than or equal to the number of children mean that the delimiter will be appended to the tuple of children.

Example:

insert(pos_of('paragraph'), node_maker('LF', '\n'))
is_anonymous(path: Path) bool[source]

Returns True if the current node is anonymous.

is_anonymous_leaf(path: Path) bool[source]

Returns True if path ends in an anonymous leaf-node

is_empty(path: Path) bool[source]

Returns True if the current node’s content is empty.

is_named(path: Path) bool[source]

Returns True if the current node’s parser is a named parser.

is_one_of(path: Path, name_set: AbstractSet[str]) bool[source]
is_one_of(*args: str)
is_one_of(*args: Set, **kwargs)

Returns true, if the node’s name is one of the given tag names.

is_token(path: Path, tokens: AbstractSet[str] = frozenset({})) bool[source]
is_token(*args: str)
is_token(*args: Set, **kwargs)

Checks whether the last node in the path has the name “:Text” and its content matches one of the given tokens. Leading and trailing whitespace-tokens will be ignored. In case an empty set of tokens is passed, any token is a match.

keep_children(path: Path, section: slice = slice(None, None, None))[source]
keep_children(*args: slice, **kwargs)

Keeps only child-nodes which fall into a slice of the result field.

keep_children_if(path: Path, condition: Callable)[source]
keep_children_if(*args: Callable, **kwargs)

Keeps only those children for which condition() returns True; all other children are removed.

keep_content(path: Path, regexp: str)[source]
keep_content(*args: str, **kwargs)

Keeps only those children whose content matches the regular expression regexp; all other children are removed.

keep_nodes(path: Path, names: AbstractSet[str])[source]
keep_nodes(*args: str)
keep_nodes(*args: Set, **kwargs)

Keeps only those children whose tag name is contained in names; all other children are removed.

keep_tokens(path: Path, tokens: AbstractSet[str] = frozenset({}))[source]
keep_tokens(*args: str)
keep_tokens(*args: Set, **kwargs)

Keeps only those tokens among the immediate descendants of a node that belong to the given set; all other children are removed. If tokens is the empty set, any token counts as a match.

key_node_name(node: Node) str[source]

Returns the tag name of the node as key for selecting transformations from the transformation table in function traverse.

lean_left(path: Path, operators: AbstractSet[str])[source]
lean_left(*args: str)
lean_left(*args: Set, **kwargs)

Turns a right leaning tree into a left leaning tree:

(op1 a (op2 b c)) -> (op2 (op1 a b) c)

If a left-associative operator is parsed with a right-recursive parser, lean_left can be used to rearrange the tree structure so that it properly reflects the order of association.

This transformation is needed, if you want to get the order of precedence right, when writing a grammar, say, for arithmetic that avoids left-recursion. (DHParser does support left-recursion but left-recursive grammars might not be compatible with other PEG-frameworks any more.)

ATTENTION: This transformation function moves forward recursively, so grouping nodes must not be eliminated during traversal! This must be done in a second pass.

left_associative(path: Path)[source]

Rearranges a flat node with infix operators into a left associative tree.

lstrip(path: Path, condition: ~typing.Callable = <function contains_only_whitespace>)[source]
lstrip(*args: Callable, **kwargs)

Recursively removes all leading child-nodes that fulfill a given condition.

merge_adjacent(path: Path, condition: Callable, preferred_name: str = '')[source]
merge_adjacent(*args: Callable, **kwargs)

Merges adjacent nodes that fulfill the given condition. It is assumed that condition is never true for leaf-nodes and non-leaf-nodes alike. Otherwise, a type-error might ensue!

The merged node’s name will be set to the value preferred_name unless that value is the empty string. In this case the name of the first node of the merge will be chosen. (Note that the assignment of the preferred name only happens if a merge actually took place, i.e. if there are at least two nodes that have been merged. merge_adjacent() will not rename single nodes.)

‘merge_adjacent’ differs from collapse_children_if() in two respects:

  1. The merged nodes are not “collapsed” to their string content.

  2. The naming rule for merged nodes is different, in so far as the ‘preferred_name’ passed to merge_adjacent is only used if it actually occurs among the nodes to be merged.

Thus, if ‘merge_adjacent’ is substituted for ‘collapse_children_if’ in the doc-string example of the latter function, the example yields:

>>> sxpr = '(place (abbreviation "p.") (page "26") (superscript "b") (mark ",") (page "18"))'
>>> tree = parse_sxpr(sxpr)
>>> merge_adjacent([tree], not_one_of({'superscript', 'subscript'}), '')
>>> print(flatten_sxpr(tree.as_sxpr()))
(place (abbreviation "p.26") (superscript "b") (mark ",18"))
merge_connected(path: Path, content: Callable, delimiter: Callable, content_name: str = '', delimiter_name: str = '')[source]
merge_connected(*args: Callable, **kwargs)

Merges sequences of content and delimiters. Unlike merge_adjacent(), which does not make this distinction, delimiters at the fringe of content blocks are not included in the merge.

Parameters:
  • path – The path, i.e. list of “ancestor” nodes, ranging from the root node (path[0]) to the current node (path[-1])

  • content – Condition to identify content nodes. (Path -> bool)

  • delimiter – Condition to identify delimiter nodes. (Path -> bool)

  • content_name – tag name for the merged content blocks

  • delimiter_name – tag name for the merged delimiters at the fringe

ATTENTION: The condition to identify content nodes and the condition to identify delimiter nodes must never come true for one and the same node!!!

merge_leaves(path: Path, *, condition: Callable = <function is_anonymous_leaf>, preferred_name: str = '')

Merges adjacent nodes that fulfill the given condition. It is assumed that condition is never true for leaf-nodes and non-leaf-nodes alike. Otherwise, a type-error might ensue!

The merged node’s name will be set to the value preferred_name unless that value is the empty string. In this case the name of the first node of the merge will be chosen. (Note that the assignment of the preferred name only happens if a merge actually took place, i.e. if there are at least two nodes that have been merged. merge_adjacent() will not rename single nodes.)

‘merge_adjacent’ differs from collapse_children_if() in two respects:

  1. The merged nodes are not “collapsed” to their string content.

  2. The naming rule for merged nodes is different, in so far as the ‘preferred_name’ passed to merge_adjacent is only used if it actually occurs among the nodes to be merged.

Thus, if ‘merge_adjacent’ is substituted for ‘collapse_children_if’ in the doc-string example of the latter function, the example yields:

>>> sxpr = '(place (abbreviation "p.") (page "26") (superscript "b") (mark ",") (page "18"))'
>>> tree = parse_sxpr(sxpr)
>>> merge_adjacent([tree], not_one_of({'superscript', 'subscript'}), '')
>>> print(flatten_sxpr(tree.as_sxpr()))
(place (abbreviation "p.26") (superscript "b") (mark ",18"))
merge_results(dest: Node, src: Tuple[Node, ...], root: RootNode) bool[source]

Merges the results of nodes src and writes them to the result of dest type-safely, if all src nodes are leaf-nodes (in which case their result-strings are concatenated) or none are leaf-nodes (in which case the tuples of children are concatenated). Returns True in case of a successful merge, False if some source nodes were leaf-nodes and some weren’t and the merge could thus not be done.

Example

>>> head, tail = Node('head', '123'), Node('tail', '456')
>>> merge_results(head, (head, tail), head)  # merge head and tail (in that order) into head
True
>>> str(head)
'123456'
merge_treetops(node: Node)[source]

Recursively traverses the tree and “merges” nodes that contain only anonymous child nodes that are leaves. “Merging” here means the node’s result will be replaced by the merged content of the children.

move_fringes(path: Path, condition: Callable, merge: bool = True)[source]
move_fringes(*args: Callable, **kwargs)

Moves adjacent nodes on the left and right fringe that fulfill the given condition to the parent node. If the merge-flag is set, a moved node will be merged with its predecessor (or successor, respectively) in the parent node in case it also fulfills the given condition. Example:

>>> tree = parse_sxpr('''(paragraph
...  (sentence
...    (word "Hello ")
...    (S " ")
...    (word "world,")
...    (S " "))
...  (sentence
...    (word "said")
...    (S " ")
...    (word "Hal.")))''')
>>> tree = traverse(tree, {'sentence': move_fringes(is_one_of({'S'}))})
>>> print(tree.as_sxpr())
(paragraph
  (sentence
    (word "Hello ")
    (S " ")
    (word "world,"))
  (S " ")
  (sentence
    (word "said")
    (S " ")
    (word "Hal.")))

In this example the blank at the end of the first sentence has been moved BETWEEN the two sentences. This is desirable, because if you extract a sentence from the data, most likely you are not interested in the trailing blank. Of course, this situation can best be avoided by a careful formulation of the grammar in the first place.

WARNING: This function should never follow replace_by_children() in the transformation list!!!

name_matches(path: Path, regexp: str) bool[source]
name_matches(*args: str, **kwargs)

Returns true, if the node’s name matches the regular expression regexp completely. For example, ‘:.*’ matches all anonymous nodes.

neg(path: Path, bool_func: Callable) bool | None[source]
neg(*args: Callable, **kwargs)

Returns the inverted boolean result of bool_func(path), unless the result is None. In that case None is passed through.

never(path: Path) bool[source]

Always returns False, no matter what the state of the path is.

node_maker(name: str, result: Tuple[Callable[[], Node], ...] | Callable[[], Node] | str, attributes: dict = {}) Callable[source]

Returns a parameter-free function that upon calling returns a freshly instantiated node with the given result, where result can again contain recursively nested node-factory functions which will be evaluated before instantiating the node.

Example

>>> factory = node_maker('d', (node_maker('c', ','), node_maker('l', ' ')))
>>> node = factory()
>>> node.serialize()
'(d (c ",") (l " "))'
normalize_position_representation(path: Path, position: int | tuple | Callable) Tuple[int, ...][source]

Converts a position-representation in any of the forms that PositionType allows into a (possibly empty) tuple of integers.

normalize_whitespace(path)[source]

Normalizes Whitespace inside a leaf node, i.e. any sequence of whitespaces, tabs and line feeds will be replaced by a single whitespace. Empty (i.e. zero-length) Whitespace remains empty, however.

not_one_of(path: Path, name_set: AbstractSet[str]) bool[source]
not_one_of(*args: str)
not_one_of(*args: Set, **kwargs)

Returns true, if the node’s name is not one of the given tag names.

peek(path: Path)[source]

For debugging: Prints the last node in the path as S-expression.

positions_of(path: Path, names: AbstractSet[str] = frozenset({})) Tuple[int, ...][source]
positions_of(*args: str)
positions_of(*args: Set, **kwargs)

Returns a (potentially empty) tuple of the positions of the children that have one of the given names.

reduce_single_child(path: Path)[source]

Reduces a single branch node by transferring the result of its immediate descendant to this node, but keeping this node’s parser entry. Reduction only takes place if the last node in the path has exactly one child. Attributes will be merged. In case one and the same attribute is defined for the child as well as the parent, the parent’s attribute value takes precedence.

remove(path: Path)[source]

Removes node unconditionally.

remove_anonymous_empty(path: Path, *, condition: Callable = <function <lambda>>)

Removes all children for which condition() returns True.

remove_anonymous_tokens(path: Path, *, condition: Callable = <function <lambda>>)

Removes all children for which condition() returns True.

remove_brackets(path: Path)[source]

Removes any leading or trailing sequence of whitespaces, tokens or regexps.

remove_children(path: Path, names: AbstractSet[str])[source]
remove_children(*args: str)
remove_children(*args: Set, **kwargs)

Removes children by tag name.

remove_children_if(path: Path, condition: Callable)[source]
remove_children_if(*args: Callable, **kwargs)

Removes all children for which condition() returns True.

remove_content(path: Path, regexp: str)[source]
remove_content(*args: str, **kwargs)

Removes children depending on their string value.

remove_empty(path: Path, *, condition: Callable = <function is_empty>)

Removes all children for which condition() returns True.

remove_if(path: Path, condition: Callable)[source]
remove_if(*args: Callable, **kwargs)

Removes the node if condition is True.

remove_infix_operator(path: Path, *, section: slice = slice(0, None, 2))

Keeps only child-nodes which fall into a slice of the result field.

remove_tokens(path: Path, tokens: AbstractSet[str] = frozenset({}))[source]
remove_tokens(*args: str)
remove_tokens(*args: Set, **kwargs)

Removes any among a particular set of tokens from the immediate descendants of a node. If tokens is the empty set, all tokens are removed.

remove_whitespace(path: Path, *, condition: Callable = functools.partial(<function is_one_of>, name_set={':Whitespace'}))

Removes all children for which condition() returns True.

replace_by_children(path: Path)[source]

Eliminates the last node in the path by replacing it with its children. The attributes of this node will be dropped. In case the last node is the root-node (i.e. len(path) == 1), it will only be eliminated, if there is but one child.

WARNING: This should never be followed by move_fringes() in the transformation list!!!

replace_by_single_child(path: Path)[source]

Removes a single branch node, replacing it by its immediate descendant. Replacement only takes place, if the last node in the path has exactly one child. Attributes will be merged. In case one and the same attribute is defined for the child as well as the parent, the child’s attribute value takes precedence.

replace_child_names(path: Path, replacements: Dict[str, str])[source]
replace_child_names(*args: str)
replace_child_names(*args: dict, **kwargs)

Replaces the tag names of the children of the last node in the path according to the replacement dictionary.

Parameters:
  • path – The current path (i.e. list of ancestors and current node)

  • replacements – A dictionary of names. Each tag name of a child node that exists as a key in the dictionary will be replaced by the value for that key.
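
A hedged sketch (tag names illustrative):

ast_table = {
    'entry': [replace_child_names({'KEY': 'key', 'VALUE': 'value'})],
}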

replace_content_with(path: Path, content: str)[source]
replace_content_with(*args: str, **kwargs)

Replaces the content of the node with the given text content.

replace_or_reduce(path: Path, condition: ~typing.Callable = <function is_named>)[source]
replace_or_reduce(*args: Callable, **kwargs)

Replaces node by a single child, if condition is True on child. Reduces the child, if condition is not True and not None. If the condition is None nothing is changed.

rstrip(path: Path, condition: ~typing.Callable = <function contains_only_whitespace>)[source]
rstrip(*args: Callable, **kwargs)

Recursively removes all trailing nodes that fulfill a given condition.

strip(path: Path, condition: ~typing.Callable = <function contains_only_whitespace>)[source]
strip(*args: Callable, **kwargs)

Removes leading and trailing child-nodes that fulfill a given condition.

swap_attributes(node: Node, other: Node)[source]

Exchanges the attributes between node and other. This might be needed when re-arranging trees.

transform_result(path: Path, func: Callable)[source]
transform_result(*args: Callable, **kwargs)

Replaces the result of the node. func takes the node’s result as an argument and returns the mapped result.
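
A hedged sketch (tag name illustrative); for a leaf node the result is its string content:

ast_table = {
    'keyword': [transform_result(lambda result: str(result).upper())],
}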

transformation_factory(t1=None, t2=None, t3=None, t4=None, t5=None)[source]

Creates factory functions from transformation-functions that dispatch on the first parameter after the path parameter.

Decorating a transformation-function that has more than merely the path-parameter with transformation_factory creates a function with the same name, which returns a partial-function that takes just the path-parameter.

Additionally, there is some syntactic sugar for transformation-functions that receive a collection as their second parameter and do not have any further parameters. In this case a list of parameters passed to the factory function will be converted into a single collection-parameter.

The primary benefit is the readability of the transformation-tables.

Usage:

@transformation_factory(AbstractSet[str])
def remove_tokens(path, tokens):
    ...

or, alternatively:

@transformation_factory
def remove_tokens(path, tokens: AbstractSet[str]):
    ...

Example:

trans_table = { 'expression': remove_tokens('+', '-') }

instead of:

trans_table = { 'expression': partial(remove_tokens, tokens={'+', '-'}) }
Parameters:

t1 – type of the second argument of the transformation function, only necessary if the transformation functions’ parameter list does not have type annotations.

transformer(tree: ~DHParser.nodetree.RootNode, transformation_table: TransformationTableType, key_func: KeyFunc = <function key_node_name>, src_stage: str = '', dst_stage: str = '') RootNode[source]

Same as traverse(), but expects a node of type RootNode to be passed in parameter tree and returns this RootNode. Furthermore, the names of the source and destination stages can be passed optionally in the parameters src_stage and dst_stage. If these parameters are not empty strings, tree.stage will be checked against src_stage before transforming the tree and set to dst_stage after the transformation.

See traverse() for the first three parameters and the general explanation of what transform does.

Parameters:
  • src_stage – The name of the source stage or the empty string (default) if the source stage shall not be checked.

  • dst_stage – The name of the destination stage or the empty string (default)

Raises:

ValueError, if tree.stage != src_stage
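
A hedged usage sketch (stage names and transformation table are illustrative):

ast = transformer(cst_root, my_AST_transformation_table,
                  src_stage='cst', dst_stage='ast')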

traverse(tree: ~DHParser.nodetree.Node, transformation_table: TransformationTableType, key_func: KeyFunc = <function key_node_name>) Node[source]

Traverses the syntax tree starting with the given node depth first and applies the sequences of callback-functions registered in the transformation_table-dictionary.

The most important use case is the transformation of a concrete syntax tree into an abstract tree (AST). But it is also imaginable to employ tree-traversal for the semantic analysis of the AST.

In order to assign sequences of callback-functions to nodes, a dictionary (“processing table”) is used. The keys usually represent tag names, but any other key function is possible. There exist three special keys:

  • ‘<’: always called (before any other processing function)

  • ‘*’: called for those nodes for which no (other) processing function appears in the table

  • ‘>’: always called (after any other processing function)

Parameters:
  • tree – The root-node of the syntax tree to be traversed

  • transformation_table – A mapping node key -> sequence of functions that will be applied to matching nodes in order. This dictionary is interpreted as a compact_table. See expand_table() or EBNFCompiler.EBNFTransTable()

  • key_func – A mapping key_func(node) -> keystr. The default key_func yields node.name.

Returns:

The tree that has been transformed in-place. The returned object is the same that has been passed in parameter tree, but be aware that this tree has been changed in-place!

Example:

table = { "term": [replace_by_single_child, flatten],
          "factor, flowmarker, retrieveop": replace_by_single_child }
traverse(node, table)
traverse_locally(path: Path, transformation_table: dict, key_func: ~typing.Callable = <function key_node_name>)[source]
traverse_locally(*args: dict, **kwargs)

Transforms the syntax tree starting from the last node in the path according to the given transformation table. The purpose of this function is to apply certain transformations locally, i.e. only for those nodes that have the last node in the path as their parent node.
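
A hedged sketch (tag names illustrative): the sub-table is applied only within 'citation' nodes:

ast_table = {
    'citation': [traverse_locally({':Whitespace': [remove]})],
}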

update_attr(dest: Node, src: Node | Tuple[Node, ...], root: RootNode)[source]

Adds all attributes from src to dest and transfers all errors from src to dest. This is needed, in order to keep the attributes if the child node is going to be eliminated.

Module compile

Module compile contains a skeleton class for syntax driven compilation support. Class Compiler can serve as base class for a compiler. Compiler objects are callable and receive the Abstract syntax tree (AST) as argument and yield whatever output the compiler produces. In most Digital Humanities applications this will be XML-code. However, it can also be anything else, like binary code or, as in the case of DHParser’s EBNF-compiler, Python source code.

Function compile_source invokes all stages of the compilation process, i.e. pre-processing, parsing, CST to AST-transformation and compilation.

See module ebnf for a sample of the implementation of a compiler object.

class CompilationResult(result, messages, AST)
AST

Alias for field number 2

messages

Alias for field number 1

result

Alias for field number 0

class Compiler[source]

Class Compiler is the abstract base class for compilers. Compiler objects are callable and take the root node of the abstract syntax tree (AST) as argument and return the compiled code in a format chosen by the compiler itself.

Subclasses implementing a compiler must define on_XXX()-methods for each node name that can occur in the AST, where ‘XXX’ is the node’s name (for unnamed nodes it is the node’s ptype without the leading colon ‘:’).

These compiler methods take the node on which they are run as argument. Other than in the AST transformation, which runs depth-first, compiler methods are called moving forward, starting with the root node, and they are responsible for compiling the child nodes themselves. This should be done by invoking the compile(node)-method, which will pick the right on_XXX()-method, or, more commonly, by calling the fallback_compiler()-method, which compiles the child-nodes and updates the tuple of children according to the results. It is not recommended to call the on_XXX()-methods directly!
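
The following is a minimal sketch of a Compiler subclass (not taken from the DHParser documentation; the node names are illustrative). Each on_XXX()-method compiles its children itself by calling self.compile() and returns a plain string:

class ToyCompiler(Compiler):
    def on_sentence(self, node):
        # compile the child nodes ourselves and join their results
        return ' '.join(self.compile(child) for child in node.children)

    def on_word(self, node):
        return node.content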

Variables that are (re-)set only in the constructor and retain their value if changed during subsequent calls:

Variables:

forbid_returning_None

Default value: True. Most of the time, if a compiler-method (i.e. on_XXX()) returns None, this is a mistake due to a forgotten return statement. The method compile() checks for this mistake and raises an error if a compiler-method returns None. However, some compilers require the possibility to return None-values. In this case forbid_returning_None should be set to False in the constructor of the derived class.

(An alternative would be to return a sentinel object instead of None.)

Object-Variables that are reset after each call of the Compiler-object:

Variables:
  • path – A list of parent nodes that ends with the currently compiled node.

  • tree – The root of the abstract syntax tree.

  • finalizers – A stack of tuples (function, parameters) that will be called in reverse order after compilation.

  • has_attribute_visitors – A flag indicating that the class has attribute-visitor-methods which are named ‘attr_ATTRIBUTENAME’ and will be called if the currently processed node has one or more attributes for which such visitors exist.

  • _dirty_flag – A flag indicating that the compiler has already been called at least once and that therefore all compilation variables must be reset when it is called again.

  • _debug – A flag indicating that debugging is turned on. The value for this flag is read from the configuration before each call (see the debugging section in DHParser.configuration). If debugging is turned on, the compiler class raises an error if there is an attempt to compile one and the same node a second time.

  • _debug_already_compiled – A set of nodes that have already been compiled.

attr_visitor_name(attr_name: str) str[source]

Returns the visitor_method name for attr_name, e.g.:

>>> c = Compiler()
>>> c.attr_visitor_name('class')
'attr_class'
compile(node: Node) Any[source]

Calls the compilation method for the given node and returns the result of the compilation.

The method’s name is derived from either the node’s parser name or, if the parser is disposable, the node’s parser’s class name by adding the prefix on_.

Note that compile does not call any compilation functions for the parsers of the sub nodes by itself. Rather, this should be done within the compilation methods.

fallback_compiler(node: Node) Any[source]

This is a generic compiler function which will be called on all those node types for which no compiler method on_XXX has been defined.

finalize(result: Any) Any[source]

A finalization method that is called after compilation has finished and after all tasks from the finalizers-stack have been executed

prepare(root: RootNode) None[source]

A preparation method that will be called after everything else has been initialized and immediately before compilation starts. This method can be overwritten in order to implement preparation tasks.

reset()[source]

Resets all variables to their default values before the next call of the object.

visitor_name(node_name: str) str[source]

Returns the visitor_method name for node_name, e.g.:

>>> c = Compiler()
>>> c.visitor_name('expression')
'on_expression'
>>> c.visitor_name('!--')
'on_212d2d'
wildcard(node: Node) Any[source]

The wildcard method is called on nodes for which no other compilation-method has been specified. This allows one to check whether illegal nodes occur in the tree (although a static structural validation is to be preferred) or whether a compilation method has been forgotten.

By default, wildcard() just redirects to self.fallback_compiler()

exception CompilerError[source]

Exception raised when an error of the compiler itself is detected. Compiler errors are not to be confused with errors in the source code to be compiled, which do not raise Exceptions but are merely reported as an error.

NoTransformation(root: RootNode) RootNode[source]

Simply passes through the unaltered node-tree.

compile_source(source: str, preprocessor: PreprocessorFunc | None, parser: ~DHParser.parse.Grammar | ~typing.Callable[[str], ~DHParser.nodetree.RootNode] | ~functools.partial, transformer: ~typing.Callable[[~DHParser.nodetree.RootNode], ~DHParser.nodetree.RootNode] | ~functools.partial = <function NoTransformation>, compiler: CompilerFunc = <function NoTransformation>, *, preserve_AST: bool = False) CompilationResult[source]

Compiles a source in four stages:

  1. Pre-Processing (if needed)

  2. Parsing

  3. AST-transformation

  4. Compiling.

The later stages, AST-transformation and compilation, will only be invoked if no fatal errors occurred in any of the earlier stages of the processing pipeline. Function “compile_source” does not invoke any postprocessing after compiling. See the functions run_pipeline() and full_compile() for postprocessing and for compiling plus postprocessing.

Parameters:
  • source – The input text for compilation or the name of a file containing the input text.

  • preprocessor – text -> text. A preprocessor function or None, if no preprocessor is needed.

  • parser – A parsing function or grammar class

  • transformer – A transformation function that takes the root-node of the concrete syntax tree as an argument and transforms it (in place) into an abstract syntax tree.

  • compiler – A compiler function or compiler class instance

  • preserve_AST – Preserves the AST-tree.

Returns:

The result of the compilation as a 3-tuple (result, errors, abstract syntax tree). In detail:

  1. The result as returned by the compiler or None in case of failure

  2. A list of error or warning messages

  3. The root-node of the abstract syntax tree if preserve_AST is True or None otherwise.
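
A hedged usage sketch (the grammar, transformer and compiler objects are assumed to come from a generated parser module):

result, messages, ast = compile_source(source_text, None, grammar,
                                       transformer, compiler,
                                       preserve_AST=True)
for msg in messages:
    print(msg)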

process_tree(tp: CompilerFunc, tree: RootNode) Any[source]

Process a tree with the tree-processor tp only if no fatal error has occurred so far. Catch any Python-exceptions in case any normal errors have occurred earlier in the processing pipeline. Don’t catch Python-exceptions if no errors have occurred earlier.

This behavior is based on the assumption that given any non-fatal errors have occurred earlier, the tree passed through the pipeline might not be in a state that is expected by the later stages, thus if an exception occurs it is not really to be considered a programming error. Processing stages should be written with possible errors occurring in earlier stages in mind, though. However, because it could be difficult to provide for all possible kinds of badly structured trees resulting from errors, exceptions occurring when processing potentially faulty trees will be dealt with gracefully.

Tree processing should generally be assumed to change the tree in place. If the input tree shall be preserved, it is necessary to make a deep copy of the input tree, before calling process_tree.

Module parse

Module parse contains the Python classes and functions for DHParser’s packrat-parser. Its central class is the Grammar-class, which is the base class for any concrete Grammar. Grammar-objects are callable and parsing is done by calling a Grammar-object with a source text as argument.

The different parsing functions are callable descendants of class Parser. Usually, they are organized in a tree and defined within the namespace of a grammar-class. See ebnf.EBNFGrammar for an example.

class Alternative(*parsers: Parser)[source]

Matches if one of several alternatives matches. Returns the first match.

This parser represents the EBNF-operator “|” with the qualification that both the symmetry and the ambiguity of the EBNF-or-operator are broken by selecting the first match:

# the order of the sub-expression matters!
>>> number = RE(r'\d+') | RE(r'\d+') + RE(r'\.') + RE(r'\d+')
>>> str(Grammar(number)("3.1416"))
'3 <<< Error on ".1416" | Parser "root" stopped before end, at: ».1416« Terminating parser. >>> '

# the most selective expression should be put first:
>>> number = RE(r'\d+') + RE(r'\.') + RE(r'\d+') | RE(r'\d+')
>>> Grammar(number)("3.1416").content
'3.1416'

EBNF-Notation: ... | ...

EBNF-Example: number = /\d+\.\d+/ | /\d+/

is_optional() bool | None[source]

Returns True, if the parser can never fail, i.e. never yields None instead of a node. Returns False, if the parser can fail. Returns None if it is not known whether the parser can fail.

static_analysis() List[AnalysisError][source]

Analyses the parser for logical errors after the grammar has been instantiated.

class Always[source]

A parser that always matches, but does not capture anything.

is_optional() bool | None[source]

Returns True, if the parser can never fail, i.e. never yields None instead of a node. Returns False, if the parser can fail. Returns None if it is not known whether the parser can fail.

class AnalysisError(symbol, parser, error)
error

Alias for field number 2

parser

Alias for field number 1

symbol

Alias for field number 0

class AnyChar[source]

A parser that returns the next unicode character of the document whatever that is. The parser fails only at the very end of the text.

class Capture(parser: Parser, zero_length_warning: bool = True)[source]

Applies the contained parser and, in case of a match, saves the result in a variable. A variable is a stack of values associated with the contained parser’s name. This requires the contained parser to be named.

static_analysis() List[AnalysisError][source]

Analyses the parser for logical errors after the grammar has been instantiated.

class CombinedParser[source]

Class CombinedParser is the base class for all parsers that call (“combine”) other parsers. It contains functions for the optimization of return values of such parsers (i.e. descendants of classes UnaryParser and NaryParser).

One optimization consists in flattening the tree by eliminating anonymous nodes. This is the same as what the function DHParser.transform.flatten() does, only at an earlier stage. The reasoning is that the earlier the tree is reduced, the less work remains to do at all later processing stages. As these typically run through all nodes of the syntax tree, this saves memory and presumably also time.

Regarding the latter, however, performing flattening or merging during the parsing stage also means that it will be performed on all those tree-structures that are discarded later in the parsing process, as well.

Doing flattening or merging during the AST-transformation will ensure that it is performed only on those nodes that made it into the concrete syntax tree. Merging, in particular, might become costly because of potentially many string-concatenations. But then again, the usual depth-first traversal during the AST-transformation will take longer, because of the much more verbose tree. (Experiments suggest that not much is to be gained by postponing flattening and merging to the AST-transformation stage.)

Another optimization consists in returning the singleton EMPTY_NODE for dropped contents, rather than creating a new empty node every time empty content is returned. This optimization should always work.

location_info() str[source]

Returns a description of the location of the parser within the grammar for the purpose of transparent error reporting.

class ContextSensitive(parser: Parser)[source]

Base class for context-sensitive parsers.

Context-Sensitive-Parsers are parsers that either manipulate (store or change) the values of variables or that read values of variables and use them to determine whether the parser matches or not.

While context-sensitive parsers are quite useful, grammars that use them will not be context-free anymore. Plus, they breach the technology of packrat-parsers. In particular, their results cannot simply be memoized by storing them in a dictionary of locations. (In other words, the memoization function is not a function of parser and location anymore, but would need to be a function of parser, location and variable (stack-)state.) DHParser blocks memoization for context-sensitive parsers (see Parser.__call__() and Forward.__call__()). As a consequence, the parsing time cannot be assumed to be strictly proportional to the size of the document anymore. Therefore, it is recommended to use context-sensitive parsers sparingly.

class Counted(parser: Parser, repetitions: Tuple[int, int])[source]

Counted applies a parser for a number of repetitions within a given range, i.e. the parser must at least match the lower bound number of repetitions and it can at most match the upper bound number of repetitions.

Examples:

>>> A2_4 = Counted(Text('A'), (2, 4))
>>> A2_4
`A`{2,4}
>>> Grammar(A2_4)('AA').as_sxpr()
'(root (:Text "A") (:Text "A"))'
>>> Grammar(A2_4)('AAAAA', complete_match=False).as_sxpr()
'(root (:Text "A") (:Text "A") (:Text "A") (:Text "A"))'
>>> Grammar(A2_4)('A', complete_match=False).as_sxpr()
'(ZOMBIE__ `(err "1:1: Error (1040): Parser did not match!"))'
>>> moves = OneOrMore(Counted(Text('A'), (1, 3)) + Counted(Text('B'), (1, 3)))
>>> result = Grammar(moves)('AAABABB')
>>> result.name, result.content
('root', 'AAABABB')
>>> moves = Counted(Text('A'), (2, 3)) * Counted(Text('B'), (2, 3))
>>> moves
`A`{2,3} ° `B`{2,3}
>>> Grammar(moves)('AAABB').as_sxpr()
'(root (:Text "A") (:Text "A") (:Text "A") (:Text "B") (:Text "B"))'

While a Counted-parser could be treated as a special case of Interleave-parser, defining a dedicated class makes the purpose clearer and runs slightly faster.

is_optional() bool | None[source]

Returns True, if the parser can never fail, i.e. never yields None instead of a node. Returns False, if the parser can fail. Returns None if it is not known whether the parser can fail.

static_analysis() List[AnalysisError][source]

Analyses the parser for logical errors after the grammar has been instantiated.

class CustomParser(parse_func: CustomParseFunc)[source]

A wrapper for a simple custom parser function defined by the user:

>>> def parse_magic_number(rest: StringView) -> Node:
...     return Node('', rest[:4]) if rest.startswith('1234') else EMPTY_NODE
>>> parser = Grammar(CustomParser(parse_magic_number))
>>> result = parser('1234')
>>> print(result.as_sxpr())
(root "1234")
>>> result = parser('abcd')
>>> for e in result.errors:  print(e)
1:1: Error (1040): Parser "root" stopped before end, at: »abcd« Terminating parser.
DTKN(token, wsL='', wsR='\\s*')[source]

Syntactic Sugar for ‘Series(Whitespace(wsL), DropText(token), Whitespace(wsR))’

Drop(parser: Parser) Parser[source]

Returns the parser with the parser.drop_content-property set to True. The parser must be anonymous and disposable. Use DropFrom instead when this requirement is not met.

DropFrom(parser: Parser) Parser[source]

Encapsulates the parser in an anonymous Synonym-parser and sets the drop_content-flag of the latter. This leaves the drop-flag of the parser itself untouched. This is needed if you want to drop the result of a named parser only in one particular context where it is referred to.

class ERR(err_msg: str, err_code: ErrorCode = 1000)[source]

ERR is a pseudo-parser that does not consume any text, but adds an error message at the current location.

class ErrorCatchingNary(*parsers: Parser, mandatory: int = 1073741824)[source]

ErrorCatchingNary is the parent class for N-ary parsers that can be configured to fail with a parsing error in case of a non-match, if all contained parsers from a specific subset of non-mandatory parsers have already matched successfully, so that only “mandatory” parsers are left for matching. The idea is that once all non-mandatory parsers have been consumed it is clear that this parser is a match, so that the failure to match any of the following mandatory parsers indicates a syntax error in the processed document at the location where a mandatory parser fails to match.

For the sake of simplicity, the division between the set of non-mandatory parsers and mandatory parsers is realized by an index into the list of contained parsers. All parsers from the mandatory-index onward are considered mandatory once all parsers up to the index have been consumed.

In the following example, Series is a descendant of ErrorCatchingNary:

>>> fraction = Series(Text('.'), RegExp(r'[0-9]+'), mandatory=1).name('fraction')
>>> number = (RegExp(r'[0-9]+') + Option(fraction)).name('number')
>>> num_parser = Grammar(TreeReduction(number, CombinedParser.MERGE_TREETOPS))
>>> num_parser('25').as_sxpr()
'(number "25")'
>>> num_parser('3.1415').as_sxpr()
'(number (:RegExp "3") (fraction ".1415"))'
>>> str(num_parser('3.1415'))
'3.1415'
>>> str(num_parser('3.'))
'3. <<< Error on "" | »/[0-9]+/« expected by parser \'fraction\', but END OF FILE found instead! >>> '

In this example, the first item of the fraction, i.e. the decimal dot, is non-mandatory, because only the parsers with an index of one or more are mandatory (mandatory=1). In this case that is only the regular-expression parser capturing the decimal digits after the dot. This means that if there is no dot, the fraction parser simply will not match. However, if there is a dot, it will fail with an error if the following mandatory item, i.e. the decimal digits, is missing.

Variables:

mandatory – Number of the element starting at which the element and all following elements are considered “mandatory”. This means that rather than returning a non-match an error message is issued. The default value is NO_MANDATORY, which means that no elements are mandatory. NOTE: The semantics of the mandatory-parameter might change depending on the subclass implementing it.

get_reentry_point(location: int) Tuple[int, Node][source]

Returns a tuple of the integer index of the closest reentry point and a Node capturing all text from rest up to this point, or (-1, None) if no reentry-point was found. If no reentry-point was found or the skip-list is empty, -1 and a zombie-node are returned.

mandatory_violation(location: int, failed_on_lookahead: bool, expected: str, reloc: int, err_node: Node) Tuple[Error, int][source]

Chooses the right error message in case of a mandatory violation and returns an error with this message, an error node, to which the error is attached, and the text segment where parsing is to continue.

This is a helper function that abstracts functionality that is needed by the Interleave-parser as well as the Series-parser.

Parameters:
  • location – the point where the mandatory violation happened. As usual, the string view represents the remaining text from this point.

  • failed_on_lookahead – True if the violating parser was a Lookahead-Parser.

  • expected – the expected (but not found) text at this point.

  • err_node – A zombie-node that captures the text from the position where the error occurred to a suggested reentry-position.

  • reloc – A position offset that represents the reentry point for parsing after the error occurred.

Returns:

a tuple of an error object and a location for the continuation of the parsing process

static_analysis() List[AnalysisError][source]

Analyses the parser for logical errors after the grammar has been instantiated.

class Forward[source]

Forward allows to declare a parser before it is actually defined. Forward declarations are needed for parsers that are recursively nested, e.g.:

>>> class Arithmetic(Grammar):
...     r'''
...     expression =  term  { ("+" | "-") term }
...     term       =  factor  { ("*" | "/") factor }
...     factor     =  INTEGER | "("  expression  ")"
...     INTEGER    =  /\d+/~
...     '''
...     expression = Forward()
...     INTEGER    = RE('\\d+')
...     factor     = INTEGER | TKN("(") + expression + TKN(")")
...     term       = factor + ZeroOrMore((TKN("*") | TKN("/")) + factor)
...     expression.set(term + ZeroOrMore((TKN("+") | TKN("-")) + term))
...     root__     = expression
Variables:

recursion_counter – Mapping of places to how often the parser has already been called recursively at this place. This is needed to implement left recursion. The number of calls becomes irrelevant once a result has been memoized.

property repr: str

Returns the parser’s name if it has a name or repr(self) if not.

reset()[source]

Initializes or resets any parser variables. If overwritten, the reset()-method of the parent class must be called from the reset()-method of the derived class.

set(parser: Parser)[source]

Sets the parser to which the calls to this Forward-object shall be delegated.

set_proxy(proxy: ParseFunc | None)[source]

set_proxy has no effects on Forward-objects!

class Grammar(root: Parser | None = None, static_analysis: bool | None = None)[source]

Class Grammar directs the parsing process and stores global state information of the parsers, i.e. state information that is shared across parsers.

Grammars are basically collections of parser objects, which are connected to an instance object of class Grammar. There exist two ways of connecting parsers to grammar objects: Either by passing the root parser object to the constructor of a Grammar object (“direct instantiation”), or by assigning the root parser to the class variable root__ of a descendant class of class Grammar.

Example for direct instantiation of a grammar:

>>> number = RE(r'\d+') + RE(r'\.') + RE(r'\d+') | RE(r'\d+')
>>> number_parser = Grammar(number)
>>> number_parser("3.1416").content
'3.1416'

Collecting the parsers that define a grammar in a descendant class of class Grammar and assigning the named parsers to class variables rather than global variables has several advantages:

  1. It keeps the namespace clean.

  2. The parser names of named parsers do not need to be passed to the constructor of the Parser object explicitly, but it suffices to assign them to class variables, which results in better readability of the Python code. See classmethod Grammar._assign_parser_names__()

  3. The parsers in the class do not necessarily need to be connected to one single root parser, which is helpful for testing and when building up a parser gradually from several components.

As a consequence, though, it is highly recommended that a Grammar class should not define any other variables or methods with names that are legal parser names. A name ending with a double underscore __ is not a legal parser name and can safely be used.

Example:

class Arithmetic(Grammar):
    # special fields for implicit whitespace and comment configuration
    COMMENT__ = r'#.*(?:\n|$)'  # Python style comments
    wspR__ = mixin_comment(whitespace=r'[\t ]*', comment=COMMENT__)

    # parsers
    expression = Forward()
    INTEGER = RE('\\d+')
    factor = INTEGER | TKN("(") + expression + TKN(")")
    term = factor + ZeroOrMore((TKN("*") | TKN("/")) + factor)
    expression.set(term + ZeroOrMore((TKN("+") | TKN("-")) + term))
    root__ = expression

Upon instantiation the parser objects are deep-copied to the Grammar object and assigned to object variables of the same name. For any parser that is directly assigned to a class variable the field parser.pname contains the variable name after instantiation of the Grammar class. The parser will nevertheless remain anonymous with respect to the tag names of the nodes it generates, if its name is included in the disposable__-set or, if disposable__ has been defined by a regular expression, matched by that regular expression. If one and the same parser is assigned to several class variables such as, for example, the parser expression in the example above, which is also assigned to root__, the first name sticks.

Grammar objects are callable. Calling a grammar object with a UTF-8 encoded document, initiates the parsing of the document with the root parser. The return value is the concrete syntax tree. Grammar objects can be reused (i.e. called again) after parsing. Thus, it is not necessary to instantiate more than one Grammar object per thread.
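
For illustration, a minimal sketch (the parser and variable names are not part of the original examples): the same Grammar object can simply be called again for the next document.

word_grammar = Grammar(RegExp(r'\w+').name('word'))
first_tree  = word_grammar('Hello')   # concrete syntax tree of the first document
second_tree = word_grammar('World')   # the same object is reused for the next document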

Grammar classes contain a few special class fields for implicit whitespace and comments that should be overwritten, if the defaults (no comments, horizontal right aligned whitespace) don’t fit:

Class Attributes:

Variables:
  • root__ – The root parser of the grammar. Theoretically, all parsers of the grammar should be reachable by the root parser. However, for testing of yet incomplete grammars class Grammar does not assume that this is the case.

  • resume_rules__ – A mapping of parser names to a list of regular expressions that act as rules to find the reentry point if a ParserError was thrown during the execution of the parser with the respective name.

  • skip_rules__ – A mapping of parser names to a list of regular expressions that act as rules to find the reentry point if a ParserError was thrown during the execution of the parser with the respective name.

  • error_messages__ – A mapping of parser names to a tuple of regular expressions and error messages. If a mandatory violation error occurs on a specific symbol (i.e. parser name) and any of the regular expressions matches, the error message of the first matching expression is used instead of the generic mandatory violation error message. This makes it possible to answer typical kinds of errors (say, putting a comma “,” where a semicolon “;” is expected) with more informative error messages.

  • disposable__ – A set of parser-names or a regular expression to identify names of parsers that are assigned to class fields but shall nevertheless yield anonymous nodes (i.e. nodes the tag name of which starts with a colon “:” followed by the parser’s class name).

  • parser_initialization__ – Before the grammar class (!) has been initialized, which happens upon the first time it is instantiated (see _assign_parser_names() for an explanation), this class field contains a value other than “done”. A value of “done” indicates that the class has already been initialized.

  • static_analysis_pending__ – True as long as no static analysis (see the method with the same name for more information) has been done to check the parser tree for correctness. Static analysis is done at instantiation, and the flag is then set to False. It can, however, also be carried out once the class has been generated (by DHParser.ebnf.EBNFCompiler), in which case the flag is already set to False in the definition of the grammar class.

  • static_analysis_errors__ – A list of errors and warnings that were found in the static analysis

  • parser_names__ – The list of the names of all named parsers defined in the grammar class

  • python_src__ – For the purpose of debugging and inspection, this field can take the python src of the concrete grammar class (see dsl.grammar_provider()).

Instance Attributes:

Variables:
  • all_parsers__ – A set of all parsers connected to this grammar object

  • comment_rx__ – The compiled regular expression for comments. If no comments have been defined, it defaults to RX_NEVER_MATCH. This instance-attribute will only be defined if a class-attribute with the same name does not already exist!

  • start_parser__ – During parsing, the parser with which the parsing process was started (see method __call__) or None if no parsing process is running.

  • unconnected_parsers__ – A set of parsers that are not connected to the root parser. The set of parsers is collected during instantiation.

  • resume_parsers__ – A set of parsers that appear either in a resume-rule or a skip-rule. This set is a subset of unconnected_parsers__

  • _dirty_flag__ – A flag indicating that the Grammar has been called at least once so that the parsing-variables need to be reset when it is called again.

  • text__ – The text that is currently being parsed or that has most recently been parsed.

  • document__ – A string view on the text that has most recently been parsed or that is currently being parsed.

  • document_length__ – the length of the document.

  • document_lbreaks__ – (property) list of linebreaks within the document, starting with -1 and ending with EOF. This helps to generate line and column number for history recording and will only be initialized if history_tracking__ is true.

  • tree__ – The root-node of the parsing tree. This variable is available for error-reporting already during parsing via self.grammar.tree__.add_error, but it references the full parsing tree only after parsing has been finished.

  • _reversed__ – the same text in reverse order - needed by the Lookbehind-parsers.

  • variables__ – A mapping for variable names to a stack of their respective string values - needed by the Capture-, Retrieve- and Pop-parsers.

  • rollback__ – A list of tuples (location, rollback-function) that are deposited by the Capture- and Pop-parsers. If the parsing process reaches a dead end then all rollback-functions up to the point to which it retreats will be called and the state of the variable stack restored accordingly.

  • last_rb__loc__ – The last, i.e. most advanced location in the text where a variable changing operation occurred. If the parser backtracks to a location at or before last_rb__loc__ (i.e. location < last_rb__loc__) then a rollback of all variable changing operations is necessary that occurred after the location to which the parser backtracks. This is done by calling method rollback_to__(location).

  • ff_pos__ – The “farthest fail”, i.e. the highest location in the document where a parser failed. This gives a good indication where and why parsing failed, if the grammar did not match a text.

  • ff_parser__ – The parser that failed at the “farthest fail”-location ff_pos__

  • suspend_memoization__ – A flag that if set suspends memoization of results from returning parsers. This flag is needed by the left-recursion handling algorithm (see Parser.__call__() and Forward.__call__()) as well as the context-sensitive parsers (see function Grammar.push_rollback__()).

  • left_recursion__ – Turns on left-recursion handling. This prevents the recursive descent parser to get caught in an infinite loop (resulting in a maximum recursion depth reached error) when the grammar definition contains left recursions.

  • associated_symbol_cache__

    A cache for the associated_symbol__() -method.

    # mirrored class attributes:

  • static_analysis_pending__ – A pointer to the class attribute of the same name. (See the description above.) If the class is instantiated with a parser, this pointer will be overwritten with an instance variable that serves the same function.

  • static_analysis_errors__ – A pointer to the class attribute of the same name. (See the description above.) If the class is instantiated with a parser, this pointer will be overwritten with an instance variable that serves the same function.

Tracing and debugging support:

The following parameters are needed by the debugging functions in module trace.py. They should not be manipulated by the users of class Grammar directly.

Variables:
  • history_tracking__ – A flag indicating that the parsing history is being tracked. This flag should not be manipulated by the user. Use trace.set_tracer(grammar, trace.trace_history) to turn (full) history tracking on and trace.set_tracer(grammar, None) to turn it off. Default is off.

  • resume_notices__ – A flag indicating that resume messages are generated in addition to the error messages, in case the parser was able to resume after an error. Use trace.resume_notices(grammar) to turn resume messages on and trace.set_tracer(grammar, None) to turn resume messages (as well as history recording) off. Default is off.

  • call_stack__ – A stack of the tag names and locations of all parsers in the call chain to the currently processed parser during parsing. The call stack can be thought of as a breadcrumb path. This is required for recording the parser history (for debugging) and, eventually, i.e. one day in the future, for tracing through the parsing process.

  • history__ – A list of history records. A history record is appended to the list each time a parser either matches, fails or if a parser-error occurs. See class log.HistoryRecord. History records store copies of the current call stack.

  • moving_forward__ – This flag indicates that the parsing process is currently moving forward. It is needed to reduce noise in history recording and should not be considered as having a valid value if history recording is turned off! (See Parser.__call__())

  • most_recent_error__ – The most recent parser error that has occurred or None. This can be read by tracers. See module trace

Configuration parameters:

The values of these parameters are copied from the global configuration in the constructor of the Grammar object. (See module configuration.)

Variables:
  • max_parser_dropouts__ – Maximum allowed number of retries after errors where the parser would exit before the complete document has been parsed. Default is 1, as usually the retry-attempts lead to a proliferation of senseless error messages.

  • reentry_search_window__ – The number of following characters that the parser considers when searching a reentry point when a syntax error has been encountered. Default is 10,000 characters.

as_ebnf__() str[source]

Serializes the Grammar object as a grammar-description in the Extended Backus-Naur-Form. Does not serialize directives and may contain abbreviations with three dots “ … “ for very long expressions.

associated_symbol__(parser: Parser) Parser[source]

Returns the closest named parser that contains parser. If parser is a named parser itself, parser is returned. If parser is not connected to any symbol in the Grammar, an AttributeError is raised. Example:

>>> word = Series(RegExp(r'\w+'), Whitespace(r'\s*'))
>>> word.pname = 'word'
>>> gr = Grammar(word)
>>> anonymous_re = gr['word'].parsers[0]
>>> gr.associated_symbol__(anonymous_re).pname
'word'
fill_associated_symbol_cache__()[source]

Pre-fills the associated symbol cache with an algorithm that is more efficient than filling the cache by calling associated_symbol__() on each parser individually.

fullmatch(parser: str | Parser, string: str, source_mapping: SourceMapFunc | None = None)[source]

Returns the matched string, if the parser matches the complete string or None if the parser does not match.

get_memoization_dict__(parser: Parser) MemoizationDict[source]

Returns the memoization dictionary for the parser’s equivalence class.

match(parser: str | Parser, string: str, source_mapping: SourceMapFunc | None = None)[source]

Returns the matched string, if the parser matches the beginning of a string or None if the parser does not match.
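
A hedged sketch of match() and fullmatch(), assuming the Arithmetic grammar defined earlier in this reference (the exact matched substrings depend on the implicit whitespace handling):

arithmetic = Arithmetic()
arithmetic.match('INTEGER', '42 + 7')         # '42' (plus any implicit trailing whitespace)
arithmetic.fullmatch('INTEGER', '42 + 7')     # None - INTEGER does not match the whole string
arithmetic.fullmatch('expression', '42 + 7')  # the whole string is matched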

push_rollback__(location, func)[source]

Adds a rollback function that either removes or re-adds values on the variable stack (self.variables) that have been added (or removed) by Capture or Pop Parsers, the results of which have been dismissed.

property reversed__: StringView

Returns a reversed version of the currently parsed document. As about the only case where this is needed is the Lookbehind-parser, this is done lazily.

rollback_to__(location)[source]

Rolls back the variable stacks (self.variables) to its state at an earlier location in the parsed document.

static_analysis__() List[AnalysisError][source]

Checks the parser tree statically for possible errors.

This function is called by the constructor of class Grammar and does not need to (and should not) be called externally.

Returns:

a list of error-tuples consisting of the narrowest containing named parser (i.e. the symbol on which the failure occurred), the actual parser that failed and an error object.

exception GrammarError(static_analysis_result: List[AnalysisError])[source]

GrammarError will be raised if static analysis reveals errors in the grammar.

class IgnoreCase(text: str)[source]

Parses plain text strings, ignoring the case, e.g. “head” == “HEAD” == “Head”. (Could be done by RegExp as well, but is faster.)

Example:

>>> tag = IgnoreCase("head")
>>> Grammar(tag)("HEAD").content
'HEAD'
>>> Grammar(tag)("Head").content
'Head'
class Interleave(*parsers: Parser, mandatory: int = 1073741824, repetitions: Sequence[Tuple[int, int]] = ())[source]

Parse elements in arbitrary order.

Examples:

>>> prefixes = TKN("A") * TKN("B")
>>> Grammar(prefixes)('A B').content
'A B'
>>> Grammar(prefixes)('B A').content
'B A'

>>> prefixes = Interleave(TKN("A"), TKN("B"), repetitions=((0, 1), (0, 1)))
>>> Grammar(prefixes)('A B').content
'A B'
>>> Grammar(prefixes)('B A').content
'B A'
>>> Grammar(prefixes)('B').content
'B'

EBNF-Notation: ... ° ...

EBNF-Example: float =  { /\d/ }+ ° /\./

is_optional() bool | None[source]

Returns True, if the parser can never fail, i.e. never yields None instead of a node. Returns False, if the parser can fail. Returns None if it is not known whether the parser can fail.

static_analysis() List[AnalysisError][source]

Analyses the parser for logical errors after the grammar has been instantiated.

class LateBindingUnary(parser_name: str)[source]

Superclass for late-binding unary parsers. LateBindingUnary only stores the name of a parser upon object creation. This name is resolved at the time when the late-binding-parser-object is connected to the grammar.

EXPERIMENTAL !!

A possible use case is a custom parser derived from LateBindingUnary that calls another parser without having to worry about whether the called parser has already been defined earlier in the Grammar-class.

LateBindingUnary is not to be confused with Forward and should not be abused for recursive parser calls either!

class Lookahead(parser: Parser)[source]

Matches, if the contained parser would match for the following text, but does not consume any text.
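
A minimal sketch (not taken from the original examples): the Lookahead checks for the exclamation mark without consuming it, so the following Text-parser still has to match it.

command = Series(RegExp(r'\w+'), Lookahead(Text('!')), Text('!'))
Grammar(command)('stop!')   # matches; the lookahead succeeds but consumes nothing
Grammar(command)('stop?')   # fails, because the lookahead does not match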

static_analysis() List[AnalysisError][source]

Analyses the parser for logical errors after the grammar has been instantiated.

class Lookbehind(parser: Parser)[source]

Matches, if the contained parser would match backwards. Requires the contained parser to be a RegExp-, _RE-, or Text-parser.

EXPERIMENTAL

class NaryParser(*parsers: Parser)[source]

Base class of all n-ary parsers, i.e. parsers that contain one or more other parsers, like, for example, the alternative parser.

The NaryParser base class supplies __deepcopy__() and other methods for n-ary parsers. The __deepcopy__()-method needs to be overwritten, however, if the constructor of a derived class takes additional parameters.

class NegativeLookahead(parser: Parser)[source]

Matches, if the contained parser would not match for the following text.
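
A minimal sketch (the names are illustrative, not from the original docs): an identifier that must not be followed by a digit.

identifier = Series(RegExp(r'[A-Za-z]+'), NegativeLookahead(RegExp(r'\d')))
Grammar(identifier)('abc')    # matches
Grammar(identifier)('abc1')   # fails: the negative lookahead sees the digit '1'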

match(bool_value) bool[source]

Returns the value. Can be overridden to return the inverted bool.

class NegativeLookbehind(parser: Parser)[source]

Matches, if the contained parser would not match backwards. Requires the contained parser to be a RegExp-parser.

match(bool_value) bool[source]

Returns the value. Can be overridden to return the inverted bool.

class Never[source]

A parser that never matches.

class OneOrMore(parser: Parser)[source]

OneOrMore applies a parser repeatedly as long as this parser matches. Unlike ZeroOrMore, which always matches, OneOrMore requires at least one match.

Examples:

>>> sentence = OneOrMore(RE(r'\w+,?')) + TKN('.')
>>> Grammar(sentence)('Wo viel der Weisheit, da auch viel des Grämens.').content
'Wo viel der Weisheit, da auch viel des Grämens.'
>>> str(Grammar(sentence)('.'))  # an empty sentence also matches
' <<< Error on "." | Parser "root->/\\\\w+,?/" did not match: ».« >>> '
>>> forever = OneOrMore(RegExp('(?=.)|$'))
>>> Grammar(forever)('')  # infinite loops will automatically be broken
Node('root', '')

Except at the end of file, a warning will be emitted if an infinite loop is detected.

EBNF-Notation: { ... }+

EBNF-Example: sentence = { /\w+,?/ }+

static_analysis() List[AnalysisError][source]

Analyses the parser for logical errors after the grammar has been instantiated.

class Option(parser: Parser)[source]

Parser Option always matches, even if its child-parser did not match.

If the child-parser did not match Option returns a node with no content and does not move forward in the text.

If the child-parser did match, Option returns a node with the node returned by the child-parser as its single child and the text at the position where the child-parser left it.

Examples:

>>> number = Option(TKN('-')) + RegExp(r'\d+') + Option(RegExp(r'\.\d+'))
>>> Grammar(number)('3.14159').content
'3.14159'
>>> Grammar(number)('3.14159').as_sxpr()
'(root (:RegExp "3") (:RegExp ".14159"))'
>>> Grammar(number)('-1').content
'-1'

EBNF-Notation: [ ... ]

EBNF-Example: number = ["-"]  /\d+/  [ /\.\d+/ ]

is_optional() bool | None[source]

Returns True, if the parser can never fail, i.e. never yields None instead of a node. Returns False, if the parser can fail. Returns None if it is not known whether the parser can fail.

static_analysis() List[AnalysisError][source]

Analyses the parser for logical errors after the grammar has been instantiated.

class Parser[source]

(Abstract) Base class for Parser combinator parsers. Any parser object that is actually used for parsing (i.e. no mock parsers) should be derived from this class.

Since parsers can contain other parsers (see classes UnaryOperator and NaryOperator) they form a cyclical directed graph. A root parser is a parser from which all other parsers can be reached. Usually, there is one root parser which serves as the starting point of the parsing process. When speaking of “the root parser” it is this root parser object that is meant.

There are two different types of parsers:

  1. Named parsers for which a name is set in field parser.pname. The results produced by these parsers can later be retrieved in the AST by the parser name.

  2. Disposable parsers where the name-field just contains the empty string. AST-transformation of disposable parsers can be hooked only to their class name, and not to the individual parser.

Parser objects are callable and parsing is done by calling a parser object with the text to parse.

If the parser matches it returns a tuple consisting of a node representing the root of the concrete syntax tree resulting from the match as well as the substring text[i:] where i is the length of matched text (which can be zero in the case of parsers like ZeroOrMore or Option). If i > 0 then the parser has “moved forward”.

If the parser does not match it returns (None, text). Note that this is not the same as an empty match ("", text). Any empty match can for example be returned by the ZeroOrMore-parser in case the contained parser is repeated zero times.

Variables:
  • pname – The parser’s name.

  • disposable – A property indicating that the parser returns anonymous nodes. For performance reasons this is implemented as an object variable rather than a property. This property should always be equal to self.name[0] == ":".

  • drop_content – A property (for performance reasons implemented as a simple field) that, if set, induces the parser, in case of a match, not to return the parsed content or subtree but the dummy EMPTY_NODE. In effect, the parsed content will already be dropped from the concrete syntax tree. Only anonymous (or pseudo-anonymous) parsers are allowed to drop content.

  • node_name – The name for the nodes that are created by the parser. If the parser is named, this is the same as pname, otherwise it is the name of the parser’s type prefixed with a colon “:”.

  • eq_class – A unique number for the class of functionally equivalent parsers that this parser belongs to. (This serves the purpose of optimizing memoization, by tying memoization dictionaries to the classes of functionally equivalent parsers, rather than to the individual parsers themselves.)

  • visited – Mapping of places this parser has already been to during the current parsing process onto the results the parser returned at the respective place. This dictionary is used to implement memoizing.

  • parse_proxy – Usually just a reference to self._parse, but can be overwritten to run the call to the _parse-method through a proxy like, for example, a tracing debugger. See module trace.

  • sub_parsers

    set of parsers that are directly referred to by this parser, e.g. parser “a” defined by the EBNF-expression “a = b (b | c)” has the sub-parser-set {b, c}.

    Notes: 1. The set is empty for parsers that derive neither from UnaryParser nor from NaryParser. 2. Unary parsers have exactly one sub-parser. 3. N-ary parsers have one or more sub-parsers. For n-ary parsers len(p.sub_parsers) can be lower than len(p.parsers), in case one and the same parser is referred to more than once in the contained parsers’ list.

  • _grammar – A reference to the Grammar object to which the parser is attached.

  • _symbol – The closest named parser to which this parser is connected in a grammar. If the parser itself is named, this is the same as self. _symbol is private and should be accessed only via the symbol-property which will initialize its value on first use.

  • _descendants_cache – A cache of all descendant parsers that can be reached from this parser.

  • _desc_trails_cache – A cache of the trails (i.e. list of parsers) from this parser to all other parsers that can be reached from this parser.

apply(func: ApplyFunc, grammar=None) bool | None[source]

Applies function func(parser) recursively to this parser and all descendant parsers as long as func() returns None or False. Traversal is pre-order. Stops the further application of func and returns True once func has returned True.

If func has been applied to all descendant parsers without issuing a stop signal by returning True, False is returned.

If apply is called for the first time on a parser, the parser will be connected to grammar. This use of the return value makes it possible to use the apply-method both for tests on all descendant parsers (including self), which may be decided already after some parsers have been visited without any need to visit further parsers, and for simply applying a procedure to all descendant parsers (including self) without worrying about the procedure’s return value, because a return value of None means “carry on”.
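
A hedged sketch of a typical use of apply(): collecting the names of all named parsers reachable from a symbol of the Arithmetic grammar defined earlier in this reference.

arithmetic = Arithmetic()
names = []
def collect_names(parser):
    if parser.pname:
        names.append(parser.pname)
    # returning None (or False) lets the traversal continue
arithmetic['expression'].apply(collect_names)
# names now contains 'expression', 'term', 'factor' and 'INTEGER'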

apply_to_trail(func: Callable[[Tuple[Parser]], bool | None]) bool | None[source]

Same as Parser.apply(), only that the applied function receives the complete “trail”, i.e. list of parsers that lead from self to the visited parser as argument.

descendant_trails() AbstractSet[ParserTrail][source]

Returns a set of the trails of self and all descendant parsers, avoiding circles. NOTE: The algorithm is rather sloppy and the returned set is not really comprehensive, but sufficient to trace anonymous parsers to their nearest named ancestor.

descendants(grammar=None) AbstractSet[Parser][source]

Returns a set of self and all descendant parsers, avoiding circles.

gen_memoization_dict() dict[source]

Create and return an empty memoization dictionary. This makes it possible to customize memoization dictionaries. The default is to just return a new plain dictionary.

is_optional() bool | None[source]

Returns True, if the parser can never fail, i.e. never yields None instead of a node. Returns False, if the parser can fail. Returns None if it is not known whether the parser can fail.

name(pname: str = '', disposable: bool | None = None) Parser[source]

Sets the parser name to pname and returns self. If disposable is True, the nodes produced by the parser will also be marked as disposable, i.e. they can be eliminated, but their content will be retained. The same can be achieved by prefixing the pname-string with a colon “:” or with “HIDE:”. Another possible prefix is “DROP:”, in which case the nodes will be dropped entirely, including their content. (This is useful to keep delimiters out of the syntax-tree.)
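
A hedged sketch (the parser names are illustrative): naming parsers fluently; the “DROP:”-prefix keeps the delimiter out of the resulting syntax tree.

key        = RegExp(r'\w+').name('key')
equal_sign = Text('=').name('DROP:equal_sign')
value      = RegExp(r'\w+').name('value')
assignment = Series(key, equal_sign, value).name('assignment')
# Parsing "answer=42" yields a tree with 'key' and 'value' children, but no '='-node.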

property repr: str

Returns the parser’s name if it has a name and self.__repr__() otherwise.

reset()[source]

Initializes or resets any parser variables. If overwritten, the reset()-method of the parent class must be called from the reset()-method of the derived class.

set_proxy(proxy: ParseFunc | None)[source]

Sets a proxy that replaces the _parse()-method. Call set_proxy with None to remove a previously set proxy. Typical use case is the installation of a tracing debugger. See module trace.

signature() Hashable[source]

Returns a value that is identical for two different parser objects if they are functionally equivalent, i.e. yield the same return value for the same call parameters:

>>> a = Text('[')
>>> b = Text('[')
>>> c = Text(']')
>>> a is b
False
>>> a.signature() == b.signature()
True
>>> a.signature() == c.signature()
False

The purpose of parser-signatures is to improve memoization in cases of code repetition in the grammar.

DO NOT OVERRIDE THIS METHOD. In order to implement a signature function, the protected method _signature should be overridden instead.

static_analysis() List[AnalysisError][source]

Analyses the parser for logical errors after the grammar has been instantiated.

property symbol: str

Returns the symbol with which the parser is associated in a grammar. This is the closest parser with a pname that contains this parser.

exception ParserError(origin: Parser, node: Node, node_orig_len: int, location: int, error: Error, *, first_throw: bool)[source]

A ParserError is thrown for those parser errors that allow the controlled re-entrance of the parsing process after the error occurred. If a reentry-rule has been configured for the parser where the error occurred, the parser guard can resume the parsing process.

Currently, the only case when a ParserError is thrown (and not some different kind of error like UnknownParserError) is when a Series or Interleave-parser detects a missing mandatory element.

Variables:
  • origin – The parser within which the error has been raised

  • node – The node within which the error is located

  • node_orig_len – The original size of that node. The actual size of that node may change due to later processing steps and thus no longer be reliable for the description of the error.

  • location – The location in the document where the parser that caused the error started. This is not to be confused with the location where the error occurred, because by the time the error occurs the parser may already have read some part of the document.

  • error – The Error object containing among other things the exact error location.

  • first_throw – A flag that indicates that the error has not yet been re-raised

  • attributes_locked – A frozenset of attributes that must not be overwritten once the ParserError-object has been initialized by its constructor

  • callstack_snapshot – A snapshot of the callstack (if history-recording has been turned on) at the point where the error occurred.

new_PE(**kwargs)[source]

Returns a new ParserError object with the same attribute values as self, except those that are reassigned in kwargs.:

>>> pe = ParserError(Parser(), Node('test', ""), 0, 0, Error("", 0), first_throw=True)
>>> pe_derived = pe.new_PE(first_throw = False)
>>> pe.first_throw
True
>>> pe_derived.first_throw
False
class Pop(symbol: Parser, match_func: MatchVariableFunc | None = None)[source]

Matches if the following text starts with the value of a particular variable. Since a variable in this context means a stack of values, the last value will be compared with the following text. Unlike the Retrieve-parser, the Pop-parser removes the value from the stack in case of a match.

The constructor parameter symbol determines which variable is used.
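
A hedged sketch of a context-sensitive rule (Capture, which stores the value, is documented elsewhere in this reference; the names are illustrative): the closing delimiter must be exactly the string that was captured for the opening delimiter, and Pop removes it from the variable stack on a match.

delimiter = RegExp(r'"+').name('delimiter')
quoted    = Series(Capture(delimiter), RegExp(r'[^"]+'), Pop(delimiter))
# matches '"text"' as well as '""text""', but not '""text"'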

reset()[source]

Initializes or resets any parser variables. If overwritten, the reset()-method of the parent class must be called from the reset()-method of the derived class.

class PreprocessorToken(token: str)[source]

Parses tokens that have been inserted by a preprocessor.

Preprocessors can generate Tokens with the make_token-function. These tokens start and end with magic characters that can only be matched by the PreprocessorToken Parser. Such tokens can be used to insert BEGIN - END delimiters at the beginning or ending of a quoted block, for example.

RE(regexp, wsL='', wsR='\\s*')[source]

Syntactic Sugar for ‘Series(Whitespace(wsL), RegExp(regexp), Whitespace(wsR))’

class RegExp(regexp)[source]

Regular expression parser.

The RegExp-parser parses text that matches a regular expression. RegExp can also be considered as the “atomic parser”, because all other parsers delegate part of the parsing job to other parsers, but do not match text directly.

Example:

>>> word = RegExp(r'\w+')
>>> Grammar(word)("Haus").content
'Haus'

EBNF-Notation: / ... /

EBNF-Example: word = /\w+/

is_optional() bool | None[source]

Returns True, if the parser can never fail, i.e. never yields None instead of a node. Returns False, if the parser can fail. Returns None if it is not known whether the parser can fail.

class Retrieve(symbol: Parser, match_func: MatchVariableFunc | None = None)[source]

Matches if the following text starts with the value of a particular variable. Since a variable in this context means a stack of values, the last value will be compared with the following text. It will not be removed from the stack! (This is the difference between the Retrieve and the Pop parser.) The constructor parameter symbol determines which variable is used.

Variables:
  • symbol – The parser that has stored the value to be retrieved, in other words: “the observed parser”

  • match_func – a procedure through which the processing of the retrieved symbols is channeled. In the simplest case it merely returns the last string stored by the observed parser. This can be (mis-)used to execute any kind of semantic action.

get_node_name() str[source]

Returns a name for the retrieved node. If the Retrieve-parser has a node-name, this overrides the node-name of the retrieved symbol’s parser.

retrieve_and_match(location: int) ParsingResult[source]

Retrieves variable from stack through the match function passed to the class’ constructor and tries to match the variable’s value with the following text. Returns a Node containing the value or None accordingly.

property symbol_pname: str

Returns the watched symbol’s pname, properly, i.e. even in cases where the symbol’s parser is shielded by a Forward-parser

class Series(*parsers: Parser, mandatory: int = 1073741824)[source]

Matches if each of a series of parsers matches exactly in the order of the series.

Example:

>>> variable_name = RegExp(r'(?!\d)\w') + RE(r'\w*')
>>> Grammar(variable_name)('variable_1').content
'variable_1'
>>> str(Grammar(variable_name)('1_variable'))
' <<< Error on "1_variable" | Parser "root->/(?!\\\\d)\\\\w/" did not match: »1_variable« >>> '

EBNF-Notation: ... ... (sequence of parsers separated by a blank or new line)

EBNF-Example: series = letter letter_or_digit

is_optional() bool | None[source]

Returns True, if the parser can never fail, i.e. never yields None instead of a node. Returns False, if the parser can fail. Returns None if it is not known whether the parser can fail.

class Synonym(parser: Parser)[source]

Simply calls another parser and encapsulates the result in another node if that parser matches.

This parser is needed to support synonyms in EBNF, e.g.:

jahr       = JAHRESZAHL
JAHRESZAHL = /\d\d\d\d/

Otherwise, the first line could not be represented by any parser class, in which case it would be unclear whether the parser RegExp(r'\d\d\d\d') carries the name ‘JAHRESZAHL’ or ‘jahr’.
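
Expressed directly with parser objects, the two definitions above roughly correspond to the following sketch:

JAHRESZAHL = RegExp(r'\d\d\d\d').name('JAHRESZAHL')
jahr       = Synonym(JAHRESZAHL).name('jahr')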

TKN(token, wsL='', wsR='\\s*')[source]

Syntactic Sugar for ‘Series(Whitespace(wsL), Text(token), Whitespace(wsR))’

class Text(text: str)[source]

Parses plain text strings. (Could be done by RegExp as well, but is faster.)

Example:

>>> while_token = Text("while")
>>> Grammar(while_token)("while").content
'while'
is_optional() bool | None[source]

Returns True, if the parser can never fail, i.e. never yields None instead of a node. Returns False, if the parser can fail. Returns None if it is not known whether the parser can fail.

TreeReduction(root_or_parserlist: Parser | Collection[Parser], level: int = 1) Parser[source]

Applies the given tree-reduction level to the CombinedParsers either in the collection of parsers passed as first argument or in the graph of interconnected parsers originating in the single “root” parser passed as first argument. Returns the root-parser or, if a collection has been passed, the PARSER_PLACEHOLDER.

Examples of how tree-reduction works:

>>> root = Text('A') + Text('B') | Text('C') + Text('D')
>>> grammar = Grammar(TreeReduction(root, CombinedParser.NO_TREE_REDUCTION))
>>> tree = grammar('AB')
>>> print(tree.as_sxpr())
(root (:Series (:Text "A") (:Text "B")))
>>> grammar = Grammar(TreeReduction(root, CombinedParser.FLATTEN))  # default
>>> tree = grammar('AB')
>>> print(tree.as_sxpr())
(root (:Text "A") (:Text "B"))
>>> grammar = Grammar(TreeReduction(root, CombinedParser.MERGE_TREETOPS))
>>> tree = grammar('AB')
>>> print(tree.as_sxpr())
(root "AB")
>>> grammar = Grammar(TreeReduction(root, CombinedParser.MERGE_LEAVES))
>>> tree = grammar('AB')
>>> print(tree.as_sxpr())
(root "AB")

>>> root = Series(Text('A'), Text('B'), Text('C').name('important') | Text('D'))
>>> grammar = Grammar(TreeReduction(root, CombinedParser.NO_TREE_REDUCTION))
>>> tree = grammar('ABC')
>>> print(tree.as_sxpr())
(root (:Text "A") (:Text "B") (:Alternative (important "C")))
>>> grammar = Grammar(TreeReduction(root, CombinedParser.FLATTEN))  # default
>>> tree = grammar('ABC')
>>> print(tree.as_sxpr())
(root (:Text "A") (:Text "B") (important "C"))
>>> tree = grammar('ABD')
>>> print(tree.as_sxpr())
(root (:Text "A") (:Text "B") (:Text "D"))
>>> grammar = Grammar(TreeReduction(root, CombinedParser.MERGE_TREETOPS))
>>> tree = grammar('ABC')
>>> print(tree.as_sxpr())
(root (:Text "A") (:Text "B") (important "C"))
>>> tree = grammar('ABD')
>>> print(tree.as_sxpr())
(root "ABD")
>>> grammar = Grammar(TreeReduction(root, CombinedParser.MERGE_LEAVES))
>>> tree = grammar('ABC')
>>> print(tree.as_sxpr())
(root (:Text "AB") (important "C"))
>>> tree = grammar('ABD')
>>> print(tree.as_sxpr())
(root "ABD")
class UnaryParser(parser: Parser)[source]

Base class of all unary parsers, i.e. parsers that contain one and only one other parser, like, for example, the optional parser.

The UnaryParser base class supplies __deepcopy__() and other methods for unary parsers. The __deepcopy__()-method needs to be overwritten, however, if the constructor of a derived class has additional parameters.

exception UninitializedError(msg: string)[source]

An error that results from uninitialized objects. This can be a consequence of some broken bootstrapping process.

class Whitespace(regexp, keep_comments: bool = False)[source]

A variant of RegExp that is meant to be used for insignificant whitespace. In contrast to RegExp, Whitespace always returns a match. If the defining regular expression did not match, an empty match is returned.

Variables:

keep_comments – A boolean indicating whether or not whitespace containing comments should be kept, even if the self.drop_content flag is True. If keep_comments and drop_content are both True, a stretch of whitespace containing a comment will be renamed to “comment__” and whitespace that does not contain any comments will be dropped.

Example:

>>> ws = Whitespace(mixin_comment(r'\s+', r'#.*'))
>>> Grammar(ws)("   # comment").as_sxpr()
'(root "   # comment")'
>>> dws = Drop(Whitespace(mixin_comment(r'\s+', r'#.*')))
>>> Grammar(dws)("   # comment").as_sxpr()
'(:EMPTY)'
>>> dws = Drop(Whitespace(mixin_comment(r'\s+', r'#.*'), keep_comments=True))
>>> Grammar(Synonym(dws))("   # comment").as_sxpr()
'(root (comment__ "   # comment"))'
>>> Grammar(Synonym(dws))("   ").as_sxpr()
'(root)'
>>> Grammar(dws)("   # comment").as_sxpr()
'(root "   # comment")'
>>> Grammar(dws)("   ").as_sxpr()
'(:EMPTY)'
is_optional() bool | None[source]

Returns True, if the parser can never fail, i.e. never yields None instead of a node. Returns False, if the parser can fail. Returns None if it is not known whether the parser can fail.

class ZeroOrMore(parser: Parser)[source]

ZeroOrMore applies a parser repeatedly as long as this parser matches. Like Option the ZeroOrMore parser always matches. In case of zero repetitions, the empty match ((), text) is returned.

Examples:

>>> sentence = ZeroOrMore(RE(r'\w+,?')) + TKN('.')
>>> Grammar(sentence)('Wo viel der Weisheit, da auch viel des Grämens.').content
'Wo viel der Weisheit, da auch viel des Grämens.'
>>> Grammar(sentence)('.').content  # an empty sentence also matches
'.'
>>> forever = ZeroOrMore(RegExp('(?=.)|$'))
>>> Grammar(forever)('')  # infinite loops will automatically be broken
Node('root', '')

Except at the end of file, a warning will be emitted if an infinite loop is detected.

EBNF-Notation: { ... }

EBNF-Example: sentence = { /\w+,?/ } "."

copy_parser_base_attrs(src: Parser, duplicate: Parser)[source]

Duplicates all attributes of the Parser-class from src to duplicate.

extract_error_code(err_msg: str, err_code: ErrorCode = 1000) Tuple[str, ErrorCode][source]

Extracts the error-code-prefix from an error message.

Example:

>>> msg = '2010:Big mistake!'
>>> print(extract_error_code(msg))
('Big mistake!', 2010)
>>> msg = "Syntax error at: {1}"
>>> print(extract_error_code(msg))
('Syntax error at: {1}', 1000)
is_context_sensitive(parser: Parser) bool[source]

Returns True, if parser is a context-sensitive parser or calls a context-sensitive parser.

is_parser_placeholder(parser: Parser | None) bool[source]

Returns True, if parser is None or merely a placeholder for a parser.

last_value(text: StringView | str, stack: List[str]) str | None[source]

Matches text with the most recent value on the capture stack. This is the default case when retrieving captured substrings.

longest_match(strings: List[str], text: StringView | str, n: int = 1) str[source]

Returns the longest string from a given list of strings that matches the beginning of text. Examples:

>>> l = ['a', 'ab', 'ca', 'cd']
>>> longest_match(l, 'a')
'a'
>>> longest_match(l, 'abcdefg')
'ab'
>>> longest_match(l, 'ac')
'a'
>>> longest_match(l, 'cb')
''
>>> longest_match(l, 'cab')
'ca'
matching_bracket(text: StringView | str, stack: List[str]) str | None[source]

Returns a closing bracket for the opening bracket on the capture stack, i.e. if “[” was captured, “]” will be retrieved.

mixin_comment(whitespace: str, comment: str, always_match: bool = True) str[source]

Returns a regular expression pattern that merges comment and whitespace regexps. Thus comments can occur wherever whitespace is allowed and will be skipped just as implicit whitespace.

Note that, because this works on the level of regular expressions, nesting comments is not possible. It also makes it much harder to use directives inside comments (which isn’t recommended, anyway).

Examples

>>> import re
>>> combined = mixin_comment(r"\s+", r"#.*")
>>> print(combined)
(?:(?:\s+)?(?:(?:#.*)(?:\s+)?)*)
>>> rx = re.compile(combined)
>>> rx.match('   # comment').group(0)
'   # comment'
>>> combined = mixin_comment(r"\s+", r"#.*", always_match=False)
>>> print(combined)
(?:(?:\s+)(?:(?:#.*)(?:\s+))*)
>>> rx = re.compile(combined)
>>> rx.match('   # comment').group(0)
'   # '
mixin_nonempty(whitespace: str) str[source]

Returns a regular expression pattern that matches only if the regular expression pattern whitespace matches AND if the match is not empty.

If whitespace does not match the empty string ‘’, anyway, then it will be returned unaltered.

WARNING: mixin_nonempty() does not work for regular expressions the matched strings of which can be followed by a symbol that can also occur at the start of the regular expression.

In particular, it does not work for fixed-size regular expressions, that is, / /, /  /, or /\t/ won’t work, but / */, /\s*/, or /\s+/ do work. There is no test for this. Fixed-size regular expressions run through mixin_nonempty will not match anymore if they are applied to the beginning or the middle of a sequence of whitespaces!

In order to be safe, your whitespace regular expressions should follow the rule: “Whitespace cannot be followed by whitespace” or “Either grab it all or leave it all”.

Parameters:

whitespace – a regular expression pattern

Returns:

a new regular expression pattern that does not match the empty string ‘’ anymore.

optional_last_value(text: StringView | str, stack: List[str]) str | None[source]

Matches text with the most recent value on the capture stack or with the empty string, i.e. optional_last_value never returns None but either the value on the stack or the empty string.

Use case: Implement shorthand notation for matching tags, i.e.:

Good Morning, Mrs. <emph>Smith</>!

update_scanner(grammar: Grammar, leaf_parsers: Dict[str, str])[source]

Updates the “scanner” of a grammar by overwriting the text or regex-fields of some of or all of its leaf parsers with new values. This works only for those parsers that are assigned to a symbol in the Grammar class.

Parameters:
  • grammar – The grammar-object for which the leaf parsers shall be updated.

  • leaf_parsers – A mapping of parser names onto strings that are interpreted as plain text (if the parser name refers to a Text-parser) or as regular expressions (if the parser name refers to a RegExp-parser)

Raises:

AttributeError – in case a leaf parser name in the dictionary does not exist or does not refer to a Text or RegExp-parser.
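
A hedged sketch (grammar and the symbol name NUMBER are assumptions, not part of the original documentation): replacing the regular expression of a leaf parser that is bound directly to a symbol.

# 'NUMBER' is assumed to be a symbol bound directly to a RegExp-parser of `grammar`.
update_scanner(grammar, {'NUMBER': r'[0-9]+(?:\.[0-9]+)?'})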

Module dsl

Module dsl contains high-level functions for the compilation of domain specific languages based on an EBNF-grammar.

exception CompilationError(errors, dsl_text, dsl_grammar, AST, result)[source]

Raised when a string or file in a domain specific language (DSL) contains errors. These can also contain definition errors that have been caught early.

exception DefinitionError(errors, grammar_src)[source]

Raised when (already) the grammar of a domain specific language (DSL) contains errors. Usually, these are repackaged parse.GrammarError(s).

batch_process(file_names: ~typing.List[str], out_dir: str, process_file: ~typing.Callable[[~typing.Tuple[str, str]], str], *, submit_func: ~typing.Callable | None = None, log_func: ~typing.Callable | None = None, cancel_func: ~typing.Callable = <function never_cancel>) List[str][source]

Compiles all files listed in file_names and writes the results and/or error messages to the directory out_dir. Returns a list of error-message files.

compileDSL(text_or_file: str, preprocessor: PreprocessorFunc | None, dsl_grammar: str | Grammar, ast_transformation: Callable[[RootNode], RootNode] | partial, compiler: Compiler, fail_when: ErrorCode = 1000) Any[source]

Compiles a text in a domain specific language (DSL) with an EBNF-specified grammar. Returns the compiled text or raises a compilation error.

Raises:

CompilationError if any errors occurred during compilation

compileEBNF(ebnf_src: str, branding='DSL') str[source]

Compiles an EBNF source file and returns the source code of a compiler suite with skeletons for preprocessor, transformer and compiler.

Parameters:
  • ebnf_src (str) – Either the file name of an EBNF-grammar or the EBNF-grammar itself as a string.

  • branding (str) – Branding name for the compiler suite source code.

Returns:

The complete compiler suite skeleton as Python source code.

Raises:

CompilationError if any errors occurred during compilation

compile_on_disk(source_file: str, parser_name: str = '', compiler_suite: str = '', extension: str = '.xml') Iterable[Error][source]

Compiles a source file with a given compiler and writes the result to a file.

If no compiler_suite is given, it is assumed that the source file is an EBNF grammar. In this case the result will be a Python script containing a parser for that grammar as well as the skeletons for a preprocessor, AST transformation table, and compiler. If the Python script already exists, only the parser name in the script will be updated. (For this to work, the different names need to be delimited by section marker blocks.) compile_on_disk() returns a list of error messages or an empty list if no errors occurred.

Parameters:
  • source_file – The file name of the source text to be compiled.

  • parser_name – The name of the generated parser. If the empty string is passed, the default name “…Parser.py” will be used.

  • compiler_suite – The file name of the parser/compiler-suite (usually ending with ‘Parser.py’), with which the source file shall be compiled. If this is left empty, the source file is assumed to be an EBNF-Grammar that will be compiled with the internal EBNF-Compiler.

  • extension – The result of the compilation (if successful) is written to a file with the same name but a different extension than the source file. This parameter sets the extension.

Returns:

A (potentially empty) list of error or warning messages.

create_parser(ebnf_src: str, branding='DSL', additional_code: str = '') Grammar[source]

Compiles the ebnf source into a callable Grammar-object. This is essentially syntactic sugar for grammar_provider(ebnf)().
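
A hedged sketch (the grammar is illustrative, not from the original docs): compiling a small EBNF grammar on the fly and parsing a string with the resulting Grammar object.

parser = create_parser(
    'greeting = "Hello" name\n'
    'name     = /\\w+/~\n',
    branding='Demo')
syntax_tree = parser('Hello World')
print(syntax_tree.as_sxpr())   # serializes the resulting tree as an S-expression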

create_scripts(ebnf_filename: str, parser_name: str = '', server_name: str | None = '', app_name: str | None = '', overwrite: bool = False)[source]

Creates a parser script from the grammar with the filename ebnf_filename or, if ebnf_filename refers to a directory, from all grammars in files ending with “.ebnf” in that directory.

If server_name is not None, a script for starting a parser-server will be created as well. Running the parser as a server has the advantage that the startup time for calling the parser is greatly reduced for subsequent parser calls. (While the same can be achieved by running the parser script in batch-processing-mode by passing a directory or several filenames on the command line to the parser script, batch processing is not suitable for all application cases. For example, it is not usable when implementing language servers to feed editors with data from the parsing process.)

If app_name is not None, an application script with a tkinter-based graphical user interface will be created as well. (When distributing this script with pyinstaller, parallel processing should be turned off at least on MS Windows systems!)

Parameters:
  • ebnf_filename – The filename of the grammar, from which the server script’s filename is derived.

  • parser_name – The filename of the parser script or the empty string if the default filename shall be used.

  • server_name – The filename of the server script or the empty string if the default filename shall be used, or None if no server script shall be created.

  • app_name – The filename of the app script or the empty string if the default filename shall be used, or None if no app-script shall be created.

  • overwrite – If True an existing server script will be overwritten.

grammar_provider(ebnf_src: str, branding='DSL', additional_code: str = '', fail_when: ErrorCode = 1000) Callable[[], Grammar | Callable[[str], RootNode] | partial] | partial[source]

Compiles an EBNF-grammar and returns a grammar-parser provider function for that grammar.

Parameters:
  • ebnf_src (str) – Either the file name of an EBNF grammar or the EBNF-grammar itself as a string.

  • branding (str or bool) – Branding name for the compiler suite source code.

  • additional_code – Python code added to the generated source. This typically contains the source code of semantic actions referred to in the generated source, e.g. filter-functions, resume-point-search-functions

Returns:

A provider function for a grammar object for texts in the language defined by ebnf_src.

load_compiler_suite(compiler_suite: str) Tuple[Callable[[], Callable[[str, str], PreprocessorResult] | partial], Callable[[], Grammar | Callable[[str], RootNode] | partial] | partial, Callable[[], Callable[[RootNode], RootNode] | partial], CompilerFactory][source]

Extracts a compiler suite from file or string compiler_suite and returns it as a tuple (preprocessor, parser, ast, compiler).

Returns:

4-tuple (preprocessor function, parser class, ast transformer function, compiler class)

process_file(source: str, out_dir: str, preprocessor_factory: Callable[[], Callable[[str, str], PreprocessorResult] | partial], parser_factory: Callable[[], Grammar | Callable[[str], RootNode] | partial] | partial, junctions: Set[Junction], targets: Set[str], serializations: Dict[str, List[str]]) str[source]

Compiles the source and writes the serialized results back to disk, unless any fatal errors have occurred. Error and Warning messages are written to a file with the same name as result_filename with an appended “_ERRORS.txt” or “_WARNINGS.txt” in place of the name’s extension. Returns the name of the error-messages file or an empty string, if no errors or warnings occurred.

Parameters:
  • source – the source document or the filename of the source-document

  • out_dir – the path of the output-directory. If the output-directory does not exist, it will be created.

  • preprocessor_factory – A factory-function that returns a preprocessing function.

  • parser_factory – A factory-function that returns a parser function which, usually, is a parse.Grammar-object.

  • junctions – a set of junctions for all processing stages beyond parsing.

  • serializations – A dictionary of serialization names, e.g. “sxpr”, “xml”, “json” for those target stages that still are node-trees. These will be serialized and written to disk in all given serializations.

Returns:

either the empty string or the file name of a file that contains the errors or warnings that occurred while processing the source.

raw_compileEBNF(ebnf_src: str, branding='DSL', fail_when: ErrorCode = 1000) EBNFCompiler[source]

Compiles an EBNF grammar file and returns the compiler object that was used and which can now be queried for the result as well as skeleton code for preprocessor, transformer and compiler objects.

Parameters:
  • ebnf_src (str) – Either the file name of an EBNF grammar or the EBNF grammar itself as a string.

  • branding (str) – Branding name for the compiler suite source code.

Returns:

An instance of class ebnf.EBNFCompiler

Raises:

CompilationError if any errors occurred during compilation

read_template(template_name: str) str[source]

Reads a script-template from a template file named template_name in the template-directory and returns it as a string.

recompile_grammar(ebnf_filename: str, parser_name: str = '', force: bool = False, notify: ~typing.Callable = <function <lambda>>) bool[source]

Re-compiles an EBNF-grammar if necessary, that is, if either no corresponding ‘XXXXParser.py’-file exists or if that file is outdated.

Parameters:
  • ebnf_filename – The filename of the ebnf-source of the grammar. In case this is a directory and not a file, all files within this directory ending with .ebnf will be compiled.

  • parser_name – The name of the compiler script. If not given the ebnf-filename without extension and with the addition of “Parser.py” will be used.

  • force – If False (default), the grammar will only be recompiled if it has been changed.

  • notify – ‘notify’ is a function without parameters that is called when recompilation actually takes place. This can be used to inform the user.

Returns:

True, if recompilation of grammar has been successful or did not take place, because the Grammar hasn’t changed since the last compilation. False, if the recompilation of the grammar has been attempted but failed.

Module preprocess

Module preprocess contains functions for preprocessing source code before the parsing stage as well as source mapping facilities to map the locations of parser and compiler errors to the non-preprocessed source text.

Preprocessing (and source mapping of errors) will only be needed for some domain specific languages, most notably those that cannot be described entirely with context-free grammars.

class IncludeInfo(begin, length, file_name)
begin

Alias for field number 0

file_name

Alias for field number 2

length

Alias for field number 1

class PreprocessorResult(original_text, preprocessed_text, back_mapping, errors)
back_mapping

Alias for field number 2

errors

Alias for field number 3

original_text

Alias for field number 0

preprocessed_text

Alias for field number 1

chain_preprocessors(*preprocessors) PreprocessorFunc[source]

Merges a sequence of preprocessor functions into a single function.

gen_find_include_func(rx: str | ~typing.Any, comment_rx: str | ~typing.Any | None = None, derive_file_name: DeriveFileNameFunc = <function <lambda>>) FindIncludeFunc[source]

Generates a function to find include-statements in a file.

Parameters:

rx – A regular expression (either as string or compiled regular expression) to catch the names of the includes in a document. The expression should catch

gen_neutral_srcmap_func(original_text: StringView | str, original_name: str = '') SourceMapFunc[source]

Generates a source map function that maps positions onto themselves.

make_preprocessor(tokenizer: Tokenizer) PreprocessorFunc[source]

Generates a preprocessor function from a “naive” tokenizer, i.e. a function that merely adds preprocessor tokens to a source text and returns the modified source.

make_token(token: str, argument: str = '') str[source]

Turns the token and argument into a special token that will be caught by the PreprocessorToken-parser.

This function is a support function that should be used by preprocessors to inject preprocessor tokens into the source text.

nil_preprocessor(original_text: str, original_name: str) PreprocessorResult[source]

A preprocessor that does nothing, i.e. just returns the input.

prettyprint_tokenized(tokenized: str) str[source]

Returns a pretty-printable version of a document that contains tokens.

strip_tokens(tokenized: str) str[source]

Replaces all tokens with the token’s arguments.

tokenized_to_original_mapping(tokenized_text: str, original_text: str, original_name: str = 'UNKNOWN_FILE') SourceMap[source]

Generates a source map for mapping positions in a text that has been enriched with token markers to their original positions.

Parameters:
  • tokenized_text – the source text enriched with token markers

  • original_text – the original source text

  • original_name – the name or path or uri of the original source file

Returns:

a source map, i.e. a list of positions and a list of corresponding offsets. The list of positions is ordered from smallest to highest. An offset is valid for its associated position and all following positions until (and excluding) the next position in the list of positions.

Module error

Module error defines class Error and a few helpful functions that are needed for error reporting of DHParser. Usually, what is of interest are the string representations of the error objects. For example:

from DHParser import compile_source, has_errors

result, errors, ast = compile_source(source, preprocessor, grammar,
                                     transformer, compiler)
if errors:
    for error in errors:
        print(error)

    if has_errors(errors):
        print("There have been fatal errors!")
        sys.exit(1)
    else:
        print("There have been warnings, but no errors.")

The central class of module DHParser’s error is the Error-class. The easiest way to create an error object is by instantiating the Error class with an error message and a source position:

>>> error = Error('Something went wrong', 123)
>>> print(error)
Error (1000): Something went wrong

However, in order to report errors, usually at least a line and column-number is needed as well.

class Error(message: str, pos: int, code: ErrorCode = 1000, line: int = -1, column: int = -1, length: int = 1, related: Sequence[Error] = [], orig_pos: int = -1, orig_doc: str = '')[source]

The Error class encapsulates all the information for a single error.

Variables:
  • message – the error message as text string

  • pos – the position where the error occurred in the preprocessed text

  • code

    the error-code, which also indicates the severity of the error:

    ========= ===========
    code      severity
    ========= ===========
    0         no error
    < 100     notice
    < 1000    warning
    < 10000   error
    >= 10000  fatal error
    ========= ===========
    

In case of a fatal error (error code >= 10000), no further compilation stages will be processed, because it is assumed that the syntax tree is too distorted for further processing.

  • orig_pos – the position of the error in the original source file, not in the preprocessed document. This is a write-once value!

  • orig_doc – the name or path or url of the original source file to which orig_pos is related. This is relevant, if the preprocessed document has been plugged together from several source files.

  • line – the line number where the error occurred in the original text. Lines are counted from 1 onward.

  • column – the column where the error occurred in the original text. Columns are counted from 1 onward.

  • length – the length in characters of the faulty passage (default is 1)

  • end_line – the line number of the position after the last character covered by the error in the original source.

  • end_column – the column number of the position after the last character covered by the error in the original source.

  • related – a sequence of related errors.

diagnostic_obj() dict[source]

Returns the Error as Language Server Protocol Diagnostic object. https://microsoft.github.io/language-server-protocol/specifications/specification-current/#diagnostic

range_obj() dict[source]

Returns the range (position plus length) of the error as an LSP-Range-Object. https://microsoft.github.io/language-server-protocol/specifications/specification-current/#range

property severity

Returns a string representation of the error level, e.g. “warning”.

signature() bytes[source]

Returns a signature to quickly check the equality of errors

visualize(document: str) str[source]

Shows the line of the document and the position where the error occurred.

class ErrorCode[source]
class SourceLocation(original_name, original_text, pos)
original_name

Alias for field number 0

original_text

Alias for field number 1

pos

Alias for field number 2

class SourceMap(original_name, positions, offsets, file_names, originals_dict)
file_names

Alias for field number 3

offsets

Alias for field number 2

original_name

Alias for field number 0

originals_dict

Alias for field number 4

positions

Alias for field number 1

add_source_locations(errors: List[Error], source_mapping: SourceMapFunc)[source]

Adds (or adjusts) line and column numbers of error messages in place.

Parameters:
  • errors – The list of errors as returned by the method errors() of a Node object

  • source_mapping – A function that maps error positions to their positions in the original source file.

canonical_error_strings(errors: List[Error]) List[str][source]

Returns the list of error strings in canonical form that can be parsed by most editors, i.e. “relative filepath : line : column : severity (code) : error string”

has_errors(messages: Iterable[Error], level: ErrorCode = 1000) bool[source]

Returns True, if at least one entry in messages has at least the given error level.

is_error(code: Error | int) bool[source]

Returns True, if error is a (fatal) error, not just a warning.

is_fatal(code: Error | int) bool[source]

Returns True, if the error is fatal. Fatal errors are typically raised when a crash (i.e. Python exception) occurs at later stages of the processing pipeline (e.g. ast transformation, compiling).

is_warning(code: Error | int) bool[source]

Returns True, if error is merely a warning or a message.

only_errors(messages: Iterable[Error], level: ErrorCode = 1000) Iterator[Error][source]

Returns an Iterator that yields only those messages that have at least the given error level.
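
A small sketch of how the severity predicates and filters above can be combined; the error codes follow the table in the Error class documentation (codes below 1000 are notices or warnings, codes from 1000 upward are errors):

from DHParser.error import Error, has_errors, is_warning, only_errors

errors = [Error('just a warning', 10, 200),   # code 200 -> warning
          Error('real problem', 42)]          # default code 1000 -> error

print(is_warning(200))            # True
print(has_errors(errors))         # True, one entry reaches error level
for err in only_errors(errors):   # yields only the 'real problem' entry
    print(err)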

Module testing

Module testing contains support for unit-testing domain specific languages. Tests for arbitrarily small components of the Grammar can be written into test files with ini-file syntax in order to test whether the parser matches or fails as expected. It can also be tested whether it produces an expected concrete or abstract syntax tree. Usually, however, unexpected failure to match a certain string is the main cause of trouble when constructing a context free Grammar.

class MockStream(name='')[source]

Simulates a stream that can be written to from one side and read from the other side like a pipe. Usage pattern:

pipe = MockStream()
reader = StreamReaderProxy(pipe)
writer = StreamWriterProxy(pipe)

async def main(text):
    writer.write((text + '\n').encode())
    await writer.drain()
    data = (await reader.read()).decode()
    writer.close()
    return data

asyncio.run(main('Hello World'))

data_available() int[source]

Returns the size of the available data.

clean_report(report_dir='REPORT')[source]

Deletes any test-report-files in the REPORT sub-directory and removes the REPORT sub-directory, if it is empty after deleting the files.

create_test_templates(symbols_or_ebnf: str | SymbolsDictType, path: str, fmt: str = '.ini') None[source]

Creates template files for grammar unit-tests for the given symbols.

Parameters:
  • symbols_or_ebnf – Either a dictionary that matches section names to the grammar’s symbols under that section or an EBNF-grammar or file name of an EBNF-grammar from which the symbols shall be extracted.

  • path – the path to the grammar-test directory (usually ‘tests_grammar’). If the last element of the path does not exist, the directory will be created.

  • fmt – the test-file-format. At the moment only ‘.ini’ is supported
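
A minimal usage sketch; the grammar file name “arithmetic.ebnf” is an assumption made for illustration:

from DHParser.testing import create_test_templates

# generates (empty) .ini test-templates for every symbol of the grammar
create_test_templates('arithmetic.ebnf', 'tests_grammar')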

extract_symbols(ebnf_text_or_file: str) SymbolsDictType[source]

Extracts all defined symbols from an EBNF-grammar. This can be used to prepare grammar-tests. The symbols will be returned as lists of strings which are grouped by the sections to which they belong and returned as an ordered dictionary, the keys of which are the section names. In order to define a section in the ebnf-source, add a comment-line starting with “#:”, followed by the section name. It is recommended to use valid file names as section names. Example:

#: components

expression = term { EXPR_OP~ term}
term       = factor { TERM_OP~ factor}
factor     = [SIGN] ( NUMBER | VARIABLE | group ) { VARIABLE | group }
group      = "(" expression ")"

#: leaf_expressions

EXPR_OP = /\+/ | /-/
TERM_OP = /\*/ | /\//
SIGN    = /-/

NUMBER   = /(?:0|(?:[1-9]\d*))(?:\.\d+)?/~
VARIABLE = /[A-Za-z]/~

If no sections have been defined in the comments, there will be only one group with the empty string as a key.

Parameters:

ebnf_text_or_file – Either an ebnf-grammar or the file-name of an ebnf-grammar

Returns:

Ordered dictionary mapping the section names of the grammar to lists of symbols that appear under that section.

get_report(test_unit, serializations: Dict[str, List[str]] = {}) str[source]

Returns a text-report of the results of a grammar unit test. The report lists the source of all tests as well as the error messages, if a test failed, or the abstract-syntax-tree (AST) in case of success.

If an asterisk has been appended to the test name, then the concrete syntax tree will also be added to the report in this particular case.

The purpose of the latter is to help with constructing and debugging AST-transformations. It is better to switch the CST-output on and off with the asterisk marker when needed than to output the CST for all tests, which would unnecessarily bloat the test reports.

grammar_suite(directory, parser_factory, transformer_factory, fn_patterns=('*test*',), ignore_unknown_filetypes=False, report='REPORT', verbose=True, junctions={}, show={}, serializations: Dict[str, List[str]] = {})[source]

Runs all grammar unit tests in a directory. A file is considered a test unit, if it has the word “test” in its name.
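
A hedged sketch of a test-runner script along these lines; get_grammar and get_transformer stand in for the parser- and transformer-factories of a generated DHParser-project, and it is assumed that grammar_suite() returns a textual error report that is empty if all tests have passed:

import sys
from DHParser.testing import grammar_suite
from arithmeticParser import get_grammar, get_transformer   # hypothetical project module

error_report = grammar_suite('tests_grammar', get_grammar, get_transformer, verbose=True)
if error_report:
    print(error_report)
    sys.exit(1)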

grammar_unit(test_unit, parser_factory, transformer_factory, report='REPORT', verbose=False, junctions={}, show={}, serializations: Dict[str, List[str]] = {})[source]

Unit tests for a grammar-parser and ast-transformations.

Parameters:
  • test_unit – The test-unit in a json-like dictionary format as it is returned by unit_from_file().

  • parser_factory – the parser-factory-object, typically an instance of Grammar.

  • transformer_factory – A factory-function for the AST-transformation-function.

  • report – the name of the subdirectory where the test-reports will be saved. If the name is the empty string, no reports will be generated.

  • verbose – If True, more information will be printed to the console during testing.

  • junctions – A set of Junction-objects that define further processing stages after the AST-transformation.

  • show – A set of stage names that shall be shown in the report apart from the AST. (The abstract-syntax-tree will always be shown!)

reset_unit(test_unit)[source]

Resets the tests in test_unit by removing all results and error messages.

runner(tests, namespace, profile=False)[source]

Runs all or some selected Python unit tests found in the namespace. To run all tests in a module, call runner("", globals()) from within that module.

Unit-tests are either classes whose names start with “Test”, methods whose names start with “test” contained in such classes, or functions whose names start with “test”.

If tests is either the empty string or an empty sequence, runner checks sys.argv for specified tests. In case sys.argv[0] (i.e. the script’s file name) starts with ‘test’, any argument in sys.argv[1:] (i.e. the rest of the command line) that starts with ‘test’ or ‘Test’ is considered the name of a test function or test method (of a test-class) that shall be run. Test-methods are specified in the form: class_name.method_name, e.g. “TestServer.test_connection”.

Parameters:
  • tests – String or list of strings with the names of tests to run. If empty, runner searches by itself for all objects the names of which start with ‘test’ and runs them (if they are functions) or all of their methods that start with “test” (if they are classes), plus the “setup” and “teardown” methods if they exist.

  • namespace – The namespace for running the test, usually globals() should be used.

  • profile – If True, the tests will be run with the profiler on. The results will be displayed after the test-results. Profiling will also be turned on, if the parameter --profile has been provided on the command line.

Example:

class TestSomething:
    def setup(self):
        pass
    def teardown(self):
        pass
    def test_something(self):
        pass

if __name__ == "__main__":
    from DHParser.testing import runner
    runner("", globals())

unique_name(file_name: str) str[source]

Turns the file or dirname into a unique name by adding a time stamp. This helps to avoid race conditions when running tests in parallel that create and delete files on the disk.

unit_from_config(config_str: str, filename: str, allowed_stages=frozenset({'AST', 'CST', 'fail', 'match', 'match*'}))[source]

Reads grammar unit tests contained in a file in config file (.ini) syntax.

Parameters:
  • config_str – A string containing a config-file with Grammar unit-tests

  • filename – The file-name of the config-file containing config_str.

  • allowed_stages – A set of stage names of stages in the processing pipeline for which the test-file may contain tests.

Returns:

A JSON-like object (i.e. dictionary) representing the unit tests.
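
A sketch of a small test-unit read from a config-string; the section names use the stage names listed above (“match”, “fail”, “AST”, “CST”), and the symbol name “expression” as well as the file name are assumed for illustration:

from DHParser.testing import unit_from_config

test_source = '''
[match:expression]
M1: "1 + 2 * 3"

[fail:expression]
F1: "1 +"
'''
test_unit = unit_from_config(test_source, 'example_test.ini')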

unit_from_file(filename, additional_stages=frozenset({'AST', 'CST', 'fail', 'match', 'match*'}))[source]

Reads a grammar unit test from a file. The format of the file is determined by the ending of its name.

unit_from_json(json_str, filename, allowed_stages=frozenset({'AST', 'CST', 'fail', 'match', 'match*'}))[source]

Reads grammar unit tests from a json string.

Module trace

Module trace provides trace-debugging functionality for the parser. The tracers are added or removed via monkey patching to all or some parsers of a grammar and trace the actions of these parsers, making use of the call_stack__, history__, moving_forward__ and most_recent_error__ hooks of the Grammar-object.

This functionality can be used for several purposes:

  1. “live” or “breakpoint”-debugging (not implemented)

  2. recording of parsing history and “post-mortem”-debugging, implemented here and in module log

  3. Interrupting long running parser processes by polling a threading.Event or multiprocessing.Event once in a while

resume_notices_off(grammar: Grammar)[source]

Turns resume-notices as well as history tracking off!

resume_notices_on(grammar: Grammar)[source]

Turns resume-notices as well as history tracking on!

set_tracer(parsers: Grammar | Parser | Iterable[Parser], tracer: ParseFunc | None)[source]

Adds or removes a tracing function to (or from) a single parser, a set of parsers or all parsers in a grammar.

Parameters:
  • parsers – the parsers or single parser or grammar-object containing parsers where the tracer shall be added or removed.

  • tracer – a tracer function or None. If None, any existing tracer will be removed. If not None, tracer must be a parsing function. It is up to the tracer to call the original parsing function (self._parse()).
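
A minimal sketch of turning tracing on and off around a single parser run; grammar stands for a Grammar-instance that has been created elsewhere:

from DHParser.trace import resume_notices_on, resume_notices_off, set_tracer

resume_notices_on(grammar)       # turns history tracking and resume-notices on
syntax_tree = grammar('1 + 2')   # parse with tracing enabled
resume_notices_off(grammar)
set_tracer(grammar, None)        # removes any remaining tracers from all parsers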

Module log

Module log contains logging and debugging support for the parsing process.

For logging functionality, the global variable LOGGING is defined which contains the name of a directory where log files shall be placed. By setting its value to the empty string “” logging can be turned off.

To read the directory name, the function LOGS_DIR() should be called rather than reading the variable LOGGING. LOGS_DIR() makes sure the directory exists and raises an error if a file with the same name already exists.

For debugging of the parsing process, the parsing history can be logged and written to an html-File.

For ease of use module log defines a context-manager logging to which either False (turn off logging), a log directory name or True for the default logging directory is passed as argument. The other components of DHParser check whether logging is on and write log files in the logging directory accordingly. Usually, this will be concrete and abstract syntax trees as well as the full and abbreviated parsing history.

Example:

from DHParser import compile_source, start_logging, set_config_value
start_logging("LOGS")
set_config_value('log_syntax_trees', {'CST', 'AST'})
set_config_value('history_tracking', True)
set_config_value('resume_notices', True)
result, errors, ast = compile_source(source, preprocessor, grammar,
                                     transformer, compiler)
class HistoryRecord(call_stack: List[CallItem] | Tuple[CallItem, ...], node: Node | None, text: StringView, line_col: Tuple[int, int], errors: List[Error] = [])[source]

Stores debugging information about one completed step in the parsing history.

A parsing step is “completed” when the last one of a nested sequence of parser-calls returns. The call stack including the last parser call will be frozen in the HistoryRecord-object. In addition, a reference to the generated leaf node (if any) will be stored and the result status of the last parser call, which is either MATCH, FAIL (i.e. no match) or ERROR.

class Snapshot(line, column, stack, status, text)
column

Alias for field number 1

line

Alias for field number 0

stack

Alias for field number 2

status

Alias for field number 3

text

Alias for field number 4

as_csv_line() str[source]

Returns history record formatted as a csv table row.

as_html_tr() str[source]

Returns history record formatted as an html table row.

as_tuple() Snapshot[source]

Returns history record formatted as a snapshot tuple.

static last_match(history: List[HistoryRecord]) HistoryRecord | None[source]

Returns the last match from the parsing-history.

Parameters:

history – the parsing-history as a list of HistoryRecord objects

Returns:

the history record of the last match or none if either history is empty or no parser could match

static most_advanced_fail(history: List[HistoryRecord]) HistoryRecord | None[source]

Returns the closest-to-the-end-fail from the parsing-history.

Parameters:

history – the parsing-history as a list of HistoryRecord objects

Returns:

the history record of the closest-to-the-end-fail or none if either history is empty or no parser failed

static most_advanced_match(history: List[HistoryRecord]) HistoryRecord | None[source]

Returns the closest-to-the-end-match from the parsing-history.

Parameters:

history – the parsing-history as a list of HistoryRecord objects

Returns:

the history record of the closest-to-the-end-match or none if either history is empty or no parser could match

append_log(log_name: str, *strings, echo: bool = False) None[source]

Appends one or more strings to the log-file with the name log_name, if logging is turned on and log_name is not the empty string, or log_name contains path information.

Parameters:
  • log_name – The name of the log file. The file must already exist. (See: create_log() above).

  • strings – One or more strings that will be written to the log-file. No delimiters will be added, i.e. all delimiters like blanks or linefeeds need to be added explicitly to the list of strings, before calling append_log().

  • echo – If True, the log message will be echoed on the terminal. In this case the message will even be echoed if logging is turned off.
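
A minimal sketch; the log file name “preprocessing.log” is an arbitrary example, and delimiters are added explicitly, as required:

from DHParser.log import start_logging, create_log, append_log

start_logging('LOGS')
log_name = create_log('preprocessing.log')
if log_name:
    append_log(log_name, 'started preprocessing', '\n')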

callstack_as_str(callstack: Sequence[CallItem], depth=-1) str[source]

Returns a string representation of the callstack!

clear_logs(logfile_types=frozenset({'.ast', '.cst', '.log'}))[source]

Removes all logs from the log-directory and removes the log-directory if it is empty.

create_log(log_name: str) str[source]

Creates a new log file. If log_name is not just a file name but a path with at least one directory (which can be ‘./’) the file is not created in the configured log directory but at the given path. If a file with the same name already exists, it will be overwritten.

Parameters:

log_name – The file name of the log file to be created

Returns:

the file name of the log file or an empty string if the log-file has not been created (e.g. because logging is still turned off and no log-directory set).

is_logging(thread_local_query: bool = True) bool[source]

Returns True, if logging is turned on.

local_log_dir(path: str = './LOGS')[source]

Context manager for temporarily switching to a different log-directory.

log_ST(syntax_tree, log_file_name) bool[source]

Writes an S-expression-representation of the syntax_tree to a file, if logging is turned on. Returns True, if logging was turned on and log could be written. Returns False, if logging was turned off. Raises a FileSystem error if writing the log failed for some reason.

log_dir(path: str = '') str[source]

Creates a directory for log files (if it does not exist) and returns its path.

WARNING: Any files in the log dir will eventually be overwritten. Don’t use a directory name that could be the name of a directory for other purposes than logging.

ATTENTION: The log-dir is stored thread-locally, which means the log-dir as well as the information whether logging is turned on or off will not automatically be transferred to any subprocesses. This needs to be done explicitly. (See testing.grammar_suite() for an example of how this can be done.)

Parameters:

path – The directory path. If empty, the configured value will be used, i.e. configuration.get_config_value('log_dir').

Returns:

str - name of the logging directory or “” if logging is turned off.

log_parsing_history(grammar, log_file_name: str = '', as_html: bool = True) bool[source]

Writes a log of the parsing history of the most recently parsed document, if logging is turned on. Returns True, if that was the case and writing the history was successful.

Parameters:
  • grammar (Grammar) – The Grammar object from which the parsing history shall be logged.

  • log_file_name (str) – The (base-)name of the log file to be written. If no name is given (default), then the class name of the grammar object will be used.

  • as_html (bool) – If true (default), the log will be output as html-table, otherwise as plain text. (Browsers might take a few seconds or minutes to display the table for long histories.)

resume_logging(log_dir: str = '')[source]

Resumes logging in the current thread with the given log-dir.

start_logging(dirname: str = 'LOGS')[source]

Turns logging on and sets the log-directory to dirname. The log-directory, if it does not already exist, will be created lazily, i.e. only when logging actually starts.

suspend_logging() str[source]

Suspends logging in the current thread. Returns the log-dir for resuming logging later.

Module configuration

Module “configuration.py” defines the default configuration for DHParser. The configuration values can be read and changed while running via the get_config_value() and set_config_value()-functions.

The presets can also be overwritten before(!) spawning any parsing processes by overwriting the values in the CONFIG_PRESET dictionary.

The recommended way to use a different configuration in any custom code using DHParser is to use the second method, i.e. to overwrite the values for which this is desired in the CONFIG_PRESET dictionary right after the start of the program and before any DHParser-function is invoked.

access_presets()[source]

Allows read and write access to preset values via get_preset_value() and set_preset_value(). Any call to access_presets() should be matched by a call to finalize_presets() to ensure propagation of changed preset-values to spawned processes. For an explanation why, see: https://docs.python.org/3/library/multiprocessing.html#the-spawn-and-forkserver-start-methods
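
A sketch of changing a preset value before any parsing process is spawned; the exact signature of set_preset_value() is assumed here:

from DHParser.configuration import access_presets, set_preset_value, finalize_presets

access_presets()
set_preset_value('history_tracking', True)
finalize_presets()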

access_thread_locals() Any[source]

Initializes (if not done yet) and returns the thread local variable store. (Call this function before using THREAD_LOCALS. Direct usage of THREAD_LOCALS is DEPRECATED!)

finalize_presets(fail_on_error: bool = False)[source]

Finalizes changes of the presets of the configuration values. This method should always be called after changing preset values to make sure the changes will be visible to processes spawned later.

get_config_value(key: str, default: ~typing.Any = <configuration.NoDefault object>) Any[source]

Retrieves a configuration value thread-safely.

Parameters:
  • key – the key (an immutable, usually a string)

  • default – a default value that is returned if no config-value exists for the key.

Returns:

the value

get_config_values(key_pattern: str) Dict[source]

Returns a dictionary of all configuration entries that match key_pattern.

get_preset_values(key_pattern: str) Dict[source]

Returns a dictionary of all presets that match key_pattern.

read_local_config(ini_filename: str) str[source]

Reads a local config file and updates the presets accordingly. If the file is not found at the given path, the same base name will be tried in the current working directory, then in the application’s config-directory and, ultimately, in the calling script’s directory. This configuration file must be in the .ini-file format so that it can be parsed with “configparser” from the Python standard library. Any key/value-pair under the section “DHParser” will directly be transferred to the configuration presets. For other sections, the section name will be added as a qualifier to the key: “section.key”. Thus, only values under the “DHParser” section modify the DHParser-configuration, while configuration parameters under other sections are free to be evaluated by the calling script and cannot interfere with DHParser’s configuration.

Parameters:

ini_filename – the file path and name of the configuration file.

Returns:

the file path of the actually read .ini-file or the empty string if no .ini-file with the given name could be found either at the given path, in the current working directory or in the calling script’s path.

set_config_value(key: str, value: Any, allow_new_key: bool = False)[source]

Changes a configuration value thread-safely. The configuration value will be set only for the current thread. In order to set configuration values for any new thread, add the key and value to CONFIG_PRESET, before any thread accessing config values is started.

Parameters:
  • key – the key (an immutable, usually a string)

  • value – the value

validate_value(key: str, value: Any)[source]

Raises a Type- or ValueError, if the values of variable key are restricted to a certain set or range and the value does not lie within this set or range.

Module server

Module server contains an asynchronous tcp-server that receives compilation requests and runs custom compilation functions in a multiprocessing.Pool.

This makes it possible to start a DHParser-compilation environment just once and to save DHParser’s startup time for each subsequent compilation. In particular, with a just-in-time-compiler like PyPy (https://pypy.org) setting up a compilation-server is highly recommended, because jit-compilers typically sacrifice startup-speed for running-speed.

It is up to the compilation function to either return the result of the compilation in serialized form, or just save the compilation results on the file system and merely return a success or failure message. Module server does not define any of these messages; this is completely left to the clients of module server, i.e. the compilation-modules, to decide.

The communication, i.e. requests and responses, follows the json-rpc protocol:

<https://www.jsonrpc.org/specification>

For JSON see:

<https://json.org/>

The server-module contains some rudimentary support for the language server protocol. For the specification and implementation of the language server protocol, see:

<https://code.visualstudio.com/api/language-extensions/language-server-extension-guide>

<https://microsoft.github.io/language-server-protocol/>

<https://langserver.org/>

class Connection(reader: StreamReader | StreamReaderProxy, writer: StreamWriter | StreamWriterProxy, exec_env: ExecutionEnvironment)[source]

Class Connection encapsulates connection-specific data for the Server class (see below). At the moment, however, only one connection is accepted at a time, assuming there is a one-to-one relationship between the text-editor (i.e. the client) and the language server.

Currently, logging is not encapsulated, assuming that for the purpose of debugging the language server it is better not to have more than one connection at a time, anyway.

alive

Boolean flag, indicating that the connection is still alive. When set to false the connection will be closed, but the server will not be stopped.

reader

the stream-reader for this connection

writer

the stream-writer for this connection

exec

the execution environment of the server

active_tasks

A dictionary that maps task id’s (resp. jsonrpc id’s) to their futures to keep track of any running task.

finished_tasks

a set of task id’s (resp. jsonrpc id’s) for tasks that have been finished and should be removed from the active_tasks-dictionary at the next possible time.

response_queue

An asynchronous queue which stores the json-rpc responses and errors received from a language server client as result of commands initiated by the server.

pending_responses

A dictionary of jsonrpc-/task-id’s to lists of JSON-objects that have been fetched from the response queue but not yet been collected by the calling task.

lsp_initialized

A string-flag indicating that the connection to a language server via json-rpc has been established.

lsp_shutdown

A string-flag indicating that the connection to a language server via json-rpc is or is being shut down.

log_file

Name of the server-log. Mirrors Server.log_file

echo_log

If True, log messages will be echoed to the console. Mirrors Server.echo_log

async client_response(call_id: int) JSON_Type[source]

Waits for and returns the response from the lsp-client to the call with the id call_id.

put_response(json_obj: JSON_Type)[source]

Adds a client-response to the waiting queue. The responses to a particular task can be queried with the client_response()-coroutine.

verify_initialization(method: str, strict: bool = True) RPC_Error_Type[source]

Implements the LSP-initialization logic and returns an rpc-error if either initialization went wrong or an rpc-method other than ‘initialize’ was called on an uninitialized language server.

class ExecutionEnvironment(event_loop: AbstractEventLoop)[source]

Class ExecutionEnvironment provides methods for executing server tasks in separate processes, threads, as asynchronous task or as simple function.

Variables:
  • process_executor – A process-pool-executor for cpu-bound tasks

  • thread_executor – A thread-pool-executor for blocking tasks

  • submit_pool – A secondary process-pool-executor to submit tasks synchronously and thread-safely.

  • submit_pool_lock – A threading.Lock to ensure that submissions to the submit_pool will be thread-safe

  • manager – a multiprocessing.SyncManager-object that can be used to share data across different processes.

  • loop – The asynchronous event loop for running coroutines

  • log_file – The name of the log-file to which error messages are written if an executor raises a Broken-Error.

  • _closed – A flag that is set to True after the shutdown-method has been called. After that, any call to the execute()-method yields an error.

async execute(executor: Executor | None, method: Callable, params: dict | tuple | list) Tuple[JSON_Type | BytesType, RPC_Error_Type | None][source]

Executes a method with the given parameters in a given executor (ThreadPoolExecutor or ProcessPoolExecutor). execute() waits for the completion and returns the JSON result and an RPC error tuple (see the type definition above). The result may be None and the error may be zero, i.e. no error. If executor is None, the method will be called directly instead of deferring it to an executor.

shutdown(wait: bool = True)[source]

Shuts down the thread- and process-executor of the execution environment. The wait parameter is passed to the shutdown-method of the thread- and process-executor.

submit_as_process(func, *args) Future[source]

Submits a long-running function to the secondary process-pool. Unlike execute(), this works synchronously and is thread-safe.

class Server(rpc_functions: RPC_Type, cpu_bound: ~typing.Set[str] = {'*'}, blocking: ~typing.Set[str] = {}, connection_callback: ConnectionCallback = <function connection_cb_dummy>, server_name: str = '', strict_lsp: bool = True)[source]

Class Server contains all the boilerplate code for a Language-Server-Protocol-Server. Class Server should be considered final, i.e. do not derive from this class to add LSP-functionality; rather, implement the LSP-functionality in a dedicated class (or set of functions) and pass it via the rpc_functions-parameter to the constructor of this class.

Variables:
  • server_name – A name for the server. Defaults to CLASSNAME_OBJECTID

  • strict_lsp – Enforce Language-Server-Protocol on json-rpc-calls. If False json-rpc calls will be processed even without prior initialization, just like plain data or http calls.

  • cpu_bound – Set of function names of functions that are cpu-bound and will be run in separate processes.

  • blocking – Set of functions that contain blocking calls (e.g. IO-calls) and will therefore be run in separate threads.

  • rpc_table – Table mapping LSP-method names to Python functions

  • known_methods – Set of all known LSP-methods. This includes the methods in the rpc-table and the four initialization methods, initialize(), initialized(), shutdown(), exit

  • connection_callback – A callback function that is called with the connection object as argument when a connection to a client is established

  • max_data_size – Maximal size of a data chunk that can be read by the server at a time.

  • stage – The operation stage the server is in. Can be one of the four values: SERVER_OFFLINE, SERVER_STARTING, SERVER_ONLINE, SERVER_TERMINATING

  • host – The host, the server runs on, e.g. “127.0.0.1”

  • port – The port of the server, e.g. 8890

  • server – The asyncio.Server if the server is online, or None.

  • serving_task – The task in which the asyncio.Server is run.

  • stop_response – The response string that is written to the stream as answer to a stop request.

  • service_calls – A set of names of functions that can be called as “service calls” from a second connection, even if another connection is still open.

  • echo_log – Read from the global configuration. If True, any log message will also be echoed on the console.

  • log_file – The file-name of the server-log.

  • log_filter – A filter to allow or block logging of specific json-rpc calls.

  • use_jsonrpc_header – Read from the global configuration. If True, jsonrpc-calls or responses will always be preceded by a simple header of the form: Content-Length: {NUM}\n\n, where {NUM} stands for the byte-size of the rpc-package.

  • exec – An instance of the execution environment that delegates tasks to separate processes, threads, asynchronous tasks or simple function calls.

  • connection – An instance of the connection class representing the data of the current connection or None, if there is no connection at the moment. There can be only one connection to the server at a time!

  • kill_switch – If True, the server will be shut down.

  • loop – The asyncio event loop within which the asyncio stream server is run.

async handle_plaindata_request(task_id: int, reader: StreamReader | StreamReaderProxy, writer: StreamWriter | StreamWriterProxy, data: BytesType, service_call: bool = False)[source]

Processes a request in plain-data-format, i.e. neither http nor json-rpc

register_service_rpc(name, method)[source]

Registers a service request, i.e. a call that will be accepted from a second connection. Otherwise, requests coming from a new connection will be rejected if a connection has already been established, because language servers only accept one client at a time.

async respond(writer: StreamWriter | StreamWriterProxy, response: str | BytesType)[source]

Sends a response to the given writer. Depending on the configuration, the response will be logged. If the response appears to be a json-rpc response a JSONRPC_HEADER will be added depending on self.use_jsonrpc_header.

rpc_identify_server(service_call: bool = False, html: bool = False, *args, **kwargs)[source]

Returns an identification string for the server.

rpc_info(service_call: bool = False, html: bool = False, *args, **kwargs) str[source]

Returns information on the implemented LSP- and service-functions.

rpc_logging(*args, **kwargs) str[source]

Starts logging with either a) the default filename, if args is empty or the empty string; or b) the given log file name if args[0] is a non-empty string; or c) stops logging if args[0] is None.

rpc_serve_page(file_path: str, service_call: bool = False, html: bool = False, *args, **kwargs) str | bytes[source]

Loads and returns the HTML page or other files stored in the file file_path

async run(method_name: str, method: Callable, params: Dict | List | Tuple) Tuple[JSON_Type | BytesType | None, RPC_Error_Type | None][source]

Picks the right execution method (process, thread or direct execution) and runs it in the respective executor. In case of a broken ProcessPoolExecutor it restarts the ProcessPoolExecutor and tries to execute the method again.

run_stream_server(reader: StreamReader | StreamReaderProxy, writer: StreamWriter | StreamWriterProxy)[source]

Start a DHParser-server that listens on a reader-stream and answers on a writer-stream.

run_tcp_server(host: str = '', port: int = -1, loop=None)[source]

Starts a DHParser-server that listens on a tcp port. This function will not return until the DHParser-Server is stopped by sending a STOP_SERVER_REQUEST.

start_logging(filename: str = '') str[source]

Starts logging to a file. If filename is empty or a directory, an auto-generated file name will be used. The file will be written to the standard log-dir, unless a path is specified in filename.

stop_logging()[source]

Stops logging.
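
A hedged sketch of a minimal plain-data server; that a plain dictionary of functions can be passed as rpc_functions is an assumption about the RPC_Type-parameter, and the server name is made up:

from DHParser.server import Server

def greet(name: str) -> str:
    return 'Hello, ' + name + '!'

server = Server(rpc_functions={'greet': greet}, cpu_bound=set(), blocking=set(),
                server_name='GreetServer')
server.run_tcp_server('127.0.0.1', 8890)   # blocks until a STOP_SERVER_REQUEST arrives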

async asyncio_connect(host: str = '', port: int = -1, retry_timeout: float = 3.0) Tuple[StreamReader, StreamWriter][source]

Opens a connection with timeout retry-timeout.

asyncio_run(coroutine: Awaitable, loop=None) Any[source]

Backward-compatible version of Python 3.7’s asyncio.run()

async has_server_stopped(host: str = '', port: int = -1, timeout: float = 3.0) bool[source]

Returns True, if no server is running or any server that is running has stopped within the given timeout. Returns False, if server has not stopped and is still running.

async probe_tcp_server(host, port, timeout=3) str[source]

Connects to server and sends an identify-request. Returns the response or an empty string if connection failed or command timed out.

rpc_entry_info(name: str, rpc_table: RPC_Table, html: bool = False) str[source]

Returns the name, signature and doc-string of a function in the rpc-table as string or HTML-snippet.

rpc_table_info(rpc_table: RPC_Table, html: bool = False) str[source]

Returns the names, function signatures and doc-string of all functions in the rpc_table as a (more or less) well-formatted string or as HTML-snippet.

spawn_tcp_server(host: str = '', port: int = -1, parameters: ~typing.Tuple | ~typing.Dict | ~typing.Callable = <function echo_requests>, Concurrent: ~server.ConcurrentType = <class 'multiprocessing.context.Process'>) ConcurrentType[source]

Starts DHParser-Server that communicates via tcp in a separate process or thread. Can be used for writing test code.

Servers started with this function sometimes seem to run into race conditions. Therefore, USE THIS ONLY FOR TESTING!

Parameters:
  • host – The host for the tcp-communication, e.g. 127.0.0.1

  • port – the port number for the tcp-communication.

  • parameters – The parameter-tuple or -dict for initializing the server or simply an rpc-handling function that takes a string-request as argument and returns a string response.

  • Concurrent – The concurrent class, either multiprocessing.Process or threading.Thread, for running the server.

Returns:

the multiprocessing.Process-object of the already started server-process.

split_header(data: BytesType) Tuple[BytesType, BytesType, BytesType][source]

Splits the given data-chunk and returns tuple (header, data, backlog). If the data-chunk is incomplete it will be returned unchanged while the returned header remains empty.

stop_tcp_server(host: str = '', port: int = -1, timeout: float = 3.0) Exception | None[source]

Sends a STOP_SERVER_REQUEST to a running tcp server. Returns any legitimate exceptions that occur if the server has already been closed.
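
A minimal sketch for test code that spawns, probes and stops a server; the echo-handler is a stand-in, and timing issues (the server needing a moment to come up) are glossed over:

import asyncio
from DHParser.server import spawn_tcp_server, stop_tcp_server, probe_tcp_server

def echo(request: str) -> str:
    return request

process = spawn_tcp_server('127.0.0.1', 8890, echo)
print(asyncio.run(probe_tcp_server('127.0.0.1', 8890)))   # identification string or ''
stop_tcp_server('127.0.0.1', 8890)
process.join()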

Module lsp

Module lsp.py defines (some of) the constants and data structures from the Language Server Protocol. See: <https://microsoft.github.io/language-server-protocol/specifications/specification-current/>

EXPERIMENTAL!!!

class AnnotatedTextEdit[source]
class ApplyWorkspaceEditParams[source]
class ApplyWorkspaceEditResponse[source]
class CallHierarchyClientCapabilities[source]
class CallHierarchyIncomingCall[source]
class CallHierarchyIncomingCallsParams[source]
class CallHierarchyItem[source]
class CallHierarchyOptions[source]
class CallHierarchyOutgoingCall[source]
class CallHierarchyOutgoingCallsParams[source]
class CallHierarchyPrepareParams[source]
class CallHierarchyRegistrationOptions[source]
class CancelParams[source]
class ChangeAnnotation[source]
class ClientCapabilities[source]
class General_[source]
class Window_[source]
class Workspace_[source]
class FileOperations_[source]
class CodeAction[source]
class Disabled_[source]
class CodeActionClientCapabilities[source]
class CodeActionLiteralSupport_[source]
class CodeActionKind_[source]
class ResolveSupport_[source]
class CodeActionContext[source]
class CodeActionKind(value)[source]

An enumeration.

class CodeActionOptions[source]
class CodeActionParams[source]
class CodeActionRegistrationOptions[source]
class CodeDescription[source]
class CodeLens[source]
class CodeLensClientCapabilities[source]
class CodeLensOptions[source]
class CodeLensParams[source]
class CodeLensRegistrationOptions[source]
class CodeLensWorkspaceClientCapabilities[source]
class Color[source]
class ColorInformation[source]
class ColorPresentation[source]
class ColorPresentationParams[source]
class Command[source]
class CompletionClientCapabilities[source]
class CompletionItemKind_[source]
class CompletionItem_[source]
class InsertTextModeSupport_[source]
class ResolveSupport_[source]
class TagSupport_[source]
class CompletionContext[source]
class CompletionItem[source]
class CompletionItemKind(value)[source]

An enumeration.

class CompletionItemTag(value)[source]

An enumeration.

class CompletionList[source]
class CompletionOptions[source]
class CompletionParams[source]
class CompletionRegistrationOptions[source]
class CompletionTriggerKind(value)[source]

An enumeration.

class ConfigurationItem[source]
class ConfigurationParams[source]
class CreateFile[source]
class CreateFileOptions[source]
class CreateFilesParams[source]
class DeclarationClientCapabilities[source]
class DeclarationOptions[source]
class DeclarationParams[source]
class DeclarationRegistrationOptions[source]
class DefinitionClientCapabilities[source]
class DefinitionOptions[source]
class DefinitionParams[source]
class DefinitionRegistrationOptions[source]
class DeleteFile[source]
class DeleteFileOptions[source]
class DeleteFilesParams[source]
class Diagnostic[source]
class DiagnosticRelatedInformation[source]
class DiagnosticSeverity(value)[source]

An enumeration.

class DiagnosticTag(value)[source]

An enumeration.

class DidChangeConfigurationClientCapabilities[source]
class DidChangeConfigurationParams[source]
class DidChangeTextDocumentParams[source]
class DidChangeWatchedFilesClientCapabilities[source]
class DidChangeWatchedFilesParams[source]
class DidChangeWatchedFilesRegistrationOptions[source]
class DidChangeWorkspaceFoldersParams[source]
class DidCloseTextDocumentParams[source]
class DidOpenTextDocumentParams[source]
class DidSaveTextDocumentParams[source]
class DocumentColorClientCapabilities[source]
class DocumentColorOptions[source]
class DocumentColorParams[source]
class DocumentColorRegistrationOptions[source]
class DocumentFilter[source]
class DocumentFormattingClientCapabilities[source]
class DocumentFormattingOptions[source]
class DocumentFormattingParams[source]
class DocumentFormattingRegistrationOptions[source]
class DocumentHighlight[source]
class DocumentHighlightClientCapabilities[source]
class DocumentHighlightKind(value)[source]

An enumeration.

class DocumentHighlightOptions[source]
class DocumentHighlightParams[source]
class DocumentHighlightRegistrationOptions[source]
class DocumentLinkClientCapabilities[source]
class DocumentLinkOptions[source]
class DocumentLinkParams[source]
class DocumentLinkRegistrationOptions[source]
class DocumentOnTypeFormattingClientCapabilities[source]
class DocumentOnTypeFormattingOptions[source]
class DocumentOnTypeFormattingParams[source]
class DocumentOnTypeFormattingRegistrationOptions[source]
class DocumentRangeFormattingClientCapabilities[source]
class DocumentRangeFormattingOptions[source]
class DocumentRangeFormattingParams[source]
class DocumentRangeFormattingRegistrationOptions[source]
class DocumentSymbol[source]
class DocumentSymbolClientCapabilities[source]
class SymbolKind_[source]
class TagSupport_[source]
class DocumentSymbolOptions[source]
class DocumentSymbolParams[source]
class DocumentSymbolRegistrationOptions[source]
class ErrorCodes(value)[source]

An enumeration.

class ExecuteCommandClientCapabilities[source]
class ExecuteCommandOptions[source]
class ExecuteCommandParams[source]
class ExecuteCommandRegistrationOptions[source]
class FailureHandlingKind(value)[source]

An enumeration.

class FileChangeType(value)[source]

An enumeration.

class FileCreate[source]
class FileDelete[source]
class FileEvent[source]
class FileOperationFilter[source]
class FileOperationPattern[source]
class FileOperationPatternKind(value)[source]

An enumeration.

class FileOperationPatternOptions[source]
class FileOperationRegistrationOptions[source]
class FileRename[source]
class FileSystemWatcher[source]
class FoldingRange[source]
class FoldingRangeClientCapabilities[source]
class FoldingRangeKind(value)[source]

An enumeration.

class FoldingRangeOptions[source]
class FoldingRangeParams[source]
class FoldingRangeRegistrationOptions[source]
class FormattingOptions[source]
class Hover[source]
class HoverClientCapabilities[source]
class HoverOptions[source]
class HoverParams[source]
class HoverRegistrationOptions[source]
class ImplementationClientCapabilities[source]
class ImplementationOptions[source]
class ImplementationParams[source]
class ImplementationRegistrationOptions[source]
class InitializeError[source]
class InitializeParams[source]
class ClientInfo_[source]
class InitializeResult[source]
class ServerInfo_[source]
class InitializedParams[source]
class InsertReplaceEdit[source]
class InsertTextFormat(value)[source]

An enumeration.

class InsertTextMode(value)[source]

An enumeration.

class LinkedEditingRangeClientCapabilities[source]
class LinkedEditingRangeOptions[source]
class LinkedEditingRangeParams[source]
class LinkedEditingRangeRegistrationOptions[source]
class LinkedEditingRanges[source]
class Location[source]
class LogMessageParams[source]
class LogTraceParams[source]
class MarkdownClientCapabilities[source]
class MarkedString_1[source]
class MarkupContent[source]
class MarkupKind(value)[source]

An enumeration.

class Message[source]
class MessageActionItem[source]
class MessageType(value)[source]

An enumeration.

class Moniker[source]
class MonikerClientCapabilities[source]
class MonikerKind(value)[source]

An enumeration.

class MonikerOptions[source]
class MonikerParams[source]
class MonikerRegistrationOptions[source]
class NotificationMessage[source]
class OptionalVersionedTextDocumentIdentifier[source]
class ParameterInformation[source]
class PartialResultParams[source]
class Position[source]
class PrepareRenameParams[source]
class PrepareSupportDefaultBehavior(value)[source]

An enumeration.

class ProgressParams[source]
class PublishDiagnosticsClientCapabilities[source]
class TagSupport_[source]
class PublishDiagnosticsParams[source]
class Range[source]
class ReferenceClientCapabilities[source]
class ReferenceContext[source]
class ReferenceOptions[source]
class ReferenceParams[source]
class ReferenceRegistrationOptions[source]
class Registration[source]
class RegistrationParams[source]
class RegularExpressionsClientCapabilities[source]
class RenameClientCapabilities[source]
class RenameFile[source]
class RenameFileOptions[source]
class RenameFilesParams[source]
class RenameOptions[source]
class RenameParams[source]
class RenameRegistrationOptions[source]
class RequestMessage[source]
class ResourceOperationKind(value)[source]

An enumeration.

class ResponseError[source]
class ResponseMessage[source]
class SaveOptions[source]
class SelectionRange[source]
class SelectionRangeClientCapabilities[source]
class SelectionRangeOptions[source]
class SelectionRangeParams[source]
class SelectionRangeRegistrationOptions[source]
class SemanticTokenModifiers(value)[source]

An enumeration.

class SemanticTokenTypes(value)[source]

An enumeration.

class SemanticTokens[source]
class SemanticTokensClientCapabilities[source]
class Requests_[source]
class Full_1[source]
class Range_1[source]
class SemanticTokensDelta[source]
class SemanticTokensDeltaParams[source]
class SemanticTokensDeltaPartialResult[source]
class SemanticTokensEdit[source]
class SemanticTokensLegend[source]
class SemanticTokensOptions[source]
class Full_1[source]
class Range_1[source]
class SemanticTokensParams[source]
class SemanticTokensPartialResult[source]
class SemanticTokensRangeParams[source]
class SemanticTokensRegistrationOptions[source]
class SemanticTokensWorkspaceClientCapabilities[source]
class ServerCapabilities[source]
class Workspace_[source]
class FileOperations_[source]
class SetTraceParams[source]
class ShowDocumentClientCapabilities[source]
class ShowDocumentParams[source]
class ShowDocumentResult[source]
class ShowMessageParams[source]
class ShowMessageRequestClientCapabilities[source]
class MessageActionItem_[source]
class ShowMessageRequestParams[source]
class SignatureHelp[source]
class SignatureHelpClientCapabilities[source]
class SignatureInformation_[source]
class ParameterInformation_[source]
class SignatureHelpContext[source]
class SignatureHelpOptions[source]
class SignatureHelpParams[source]
class SignatureHelpRegistrationOptions[source]
class SignatureHelpTriggerKind(value)[source]

An enumeration.

class SignatureInformation[source]
class StaticRegistrationOptions[source]
class SymbolInformation[source]
class SymbolKind(value)[source]

An enumeration.

class SymbolTag(value)[source]

An enumeration.

class TextDocumentChangeRegistrationOptions[source]
class TextDocumentClientCapabilities[source]
class TextDocumentContentChangeEvent_0[source]
class TextDocumentContentChangeEvent_1[source]
class TextDocumentEdit[source]
class TextDocumentIdentifier[source]
class TextDocumentItem[source]
class TextDocumentPositionParams[source]
class TextDocumentRegistrationOptions[source]
class TextDocumentSaveReason(value)[source]

An enumeration.

class TextDocumentSaveRegistrationOptions[source]
class TextDocumentSyncClientCapabilities[source]
class TextDocumentSyncKind(value)[source]

An enumeration.

class TextDocumentSyncOptions[source]
class TextEdit[source]
class TokenFormat(value)[source]

An enumeration.

class TypeDefinitionClientCapabilities[source]
class TypeDefinitionOptions[source]
class TypeDefinitionParams[source]
class TypeDefinitionRegistrationOptions[source]
class UniquenessLevel(value)[source]

An enumeration.

class Unregistration[source]
class UnregistrationParams[source]
class VersionedTextDocumentIdentifier[source]
class WatchKind(value)[source]

An enumeration.

class WillSaveTextDocumentParams[source]
class WorkDoneProgressBegin[source]
class WorkDoneProgressCancelParams[source]
class WorkDoneProgressCreateParams[source]
class WorkDoneProgressEnd[source]
class WorkDoneProgressOptions[source]
class WorkDoneProgressParams[source]
class WorkDoneProgressReport[source]
class WorkspaceEdit[source]
class WorkspaceEditClientCapabilities[source]
class ChangeAnnotationSupport_[source]
class WorkspaceFolder[source]
class WorkspaceFoldersChangeEvent[source]
class WorkspaceFoldersServerCapabilities[source]
class WorkspaceSymbolClientCapabilities[source]
class SymbolKind_[source]
class TagSupport_[source]
class WorkspaceSymbolOptions[source]
class WorkspaceSymbolParams[source]
class WorkspaceSymbolRegistrationOptions[source]
gen_lsp_name(func_name: str, prefix: str = 'lsp_') str[source]

Generates the name of an lsp-method from a function name, e.g. “lsp_S_cancelRequest” -> “$/cancelRequest”

gen_lsp_table(lsp_funcs_or_instance: Iterable[Callable] | Any, prefix: str = 'lsp_') Dict[str, Callable][source]

Creates an RPC-table from a list of functions or from the methods of a class that implement the language server protocol. The dictionary keys are derived from the function name by replacing an underscore _ with a slash / and a single capital S with a $-sign. If prefix is not the empty string, only functions or methods that start with prefix will be added to the table. The prefix will be removed before converting the functions’ name to a dictionary key.

>>> class LSP:
...     def lsp_initialize(self, **kw):
...         pass  # return InitializeResult
...     def lsp_shutdown(self, **kw):
...         pass
>>> lsp = LSP()
>>> gen_lsp_table(lsp, 'lsp_').keys()
dict_keys(['initialize', 'shutdown'])

lsp_candidates(cls: Any, prefix: str = 'lsp_') Iterator[str][source]

Returns an iterator over all method names from a class that either have a certain prefix or, if no prefix was given, all non-special and non-private-methods of the class.

Module stringview

StringView provides string-slicing without copying. Slicing Python-strings always yields copies of a segment of the original string. See: https://mail.python.org/pipermail/python-dev/2008-May/079699.html However, this becomes costly (in terms of space and as a consequence also time) when parsing longer documents. Unfortunately, Python’s memoryview does not work for unicode strings. Hence, the StringView class.

It is recommended to compile this module with the Cython-compiler for speedup. The module comes with a stringview.pxd that contains some type declarations to more fully exploit the benefits of the Cython-compiler.

class StringView(text: str, begin: int | None = 0, end: int | None = None)[source]

A rudimentary StringView class, just enough for the use cases in parse.py. The difference between a StringView and the python builtin strings is that StringView-objects do slicing without copying, i.e. slices are just a view on a section of the sliced string.
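
A minimal sketch of the central point, namely that slicing yields views rather than copies:

from DHParser.stringview import StringView

sv = StringView('Hello, World!')
world = sv[7:12]                  # a view on the original string, no copy is made
print(str(world))                 # "World"
print(world.startswith('Wo'))     # True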

count(sub: str, start: int | None = None, end: int | None = None) int[source]

Returns the number of non-overlapping occurrences of substring sub in StringView S[start:end]. Optional arguments start and end are interpreted as in slice notation.

endswith(suffix: str, start: int = 0, end: int | None = None) bool[source]

Return True if S ends with the specified suffix, False otherwise. With optional start, test S beginning at that position. With optional end, stop comparing S at that position.

find(sub: str, start: int | None = None, end: int | None = None) int[source]

Returns the lowest index in S where substring sub is found, such that sub is contained within S[start:end]. Optional arguments start and end are interpreted as in slice notation. Returns -1 on failure.

finditer(regex)[source]

Executes regex.finditer on the StringView object and returns the iterator of match objects. Keep in mind that match.end(), match.span() etc. are mapped to the underlying text, not the StringView-object!!!

get_text() str[source]

Returns the underlying string.

index(absolute_index: int) int[source]

Converts an index for a string watched by a StringView object to an index relative to the string view object, e.g.:

>>> import re
>>> sv = StringView('xxIxx')[2:3]
>>> match = sv.match(re.compile('I'))
>>> match.end()
3
>>> sv.index(match.end())
1

indices(absolute_indices: Iterable[int]) Tuple[int, ...][source]

Converts indices for a string watched by a StringView object to indices relative to the string view object. See also: sv_index()

lstrip(chars=' \n\t') StringView[source]

Returns a copy of self with leading whitespace removed.

match(regex, flags: int = 0)[source]

Executes regex.match on the StringView object and returns the result, which is either a match-object or None. Keep in mind that match.end(), match.span() etc. are mapped to the underlying text, not the StringView-object!!!

replace(old, new) str[source]

Returns a string where old is replaced by new.

rfind(sub: str, start: int | None = None, end: int | None = None) int[source]

Returns the highest index in S where substring sub is found, such that sub is contained within S[start:end]. Optional arguments start and end are interpreted as in slice notation. Returns -1 on failure.

rstrip(chars=' \n\t') StringView[source]

Returns a copy of self with trailing whitespace removed.

search(regex, start: int | None = None, end: int | None = None)[source]

Executes regex.search on the StringView object and returns the result, which is either a match-object or None. Keep in mind that match.end(), match.span() etc. are mapped to the underlying text, not the StringView-object!!!

split(sep=None)[source]

Returns a list of the words in self, using sep as the delimiter string. If sep is not specified or is None, any whitespace string is a separator and empty strings are removed from the result.

startswith(prefix: str, start: int = 0, end: int | None = None) bool[source]

Return True if S starts with the specified prefix, False otherwise. With optional start, test S beginning at that position. With optional end, stop comparing S at that position.

strip(chars: str = ' \n\r\t') StringView[source]

Returns a copy of the StringView self with leading and trailing whitespace removed.

class TextBuffer(text: str | StringView, version: int = 0)[source]

TextBuffer class manages a copy of an edited text for a language server. The text can be changed via incremental edits. TextBuffer keeps track of the state of the complete text at any point in time. It works line oriented and lines of text can be retrieved via indexing or slicing.

snapshot(eol: str = '\n') str | StringView[source]

Returns the current state of the entire text, using the given end of line marker (\n or \r\n)

text_edits(edits: list | dict, version: int = -1)[source]

Incorporates one or more text-edits or change-events into the text. A text-edit is a dictionary of this form:

{"range": {"start": {"line": 0, "character": 0 },
           "end":   {"line": 0, "character": 0 } },
 "newText": "..."}

In case of a change-event, the key “newText” is replaced by “text”.

update(l1: int, c1: int, l2: int, c2: int, replacement: str | StringView)[source]

Replaces the text-range from line and column (l1, c1) to line and column (l2, c2) with the replacement-string.
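
A minimal usage sketch (illustrative only; it assumes the zero-based line- and column-numbering of the language server protocol):

tb = TextBuffer("Hello World")
tb.update(0, 6, 0, 11, "DHParser")   # replace columns 6-11 of line 0, i.e. "World"
assert str(tb.snapshot()) == "Hello DHParser"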

real_indices(begin: int | None, end: int | None, length: int) Tuple[int, int][source]

Python callable real-indices function for testing.

Module toolkit

Module toolkit contains utility functions that are needed across several of the other DHParser modules. Helper functions that are not needed in more than one module are best placed within that module rather than in the toolkit module. An acceptable exception to this rule are functions that are very generic.

class JSONnull[source]

JSONnull is a special type that is serialized as null by json_dumps. This can be used whenever it is inconvenient to use None as the null-value.

class JSONstr(serialized_json: str)[source]

JSONstr is a special type that encapsulates already serialized json-chunks in json object-trees. json_dumps will insert the content of a JSONstr-object literally, rather than serializing it like other objects.

class Protocol[source]

Base class for protocol classes.

Protocol classes are defined as:

class Proto(Protocol):
    def meth(self) -> int:
        ...

Such classes are primarily used with static type checkers that recognize structural subtyping (static duck-typing), for example:

class C:
    def meth(self) -> int:
        return 0

def func(x: Proto) -> int:
    return x.meth()

func(C())  # Passes static type check

See PEP 544 for details. Protocol classes decorated with @typing.runtime_checkable act as simple-minded runtime protocols that check only the presence of given attributes, ignoring their type signatures. Protocol classes can be generic, they are defined as:

class GenProto(Protocol[T]):
    def meth(self) -> T:
        ...

class SingleThreadExecutor[source]

SingleThreadExecutor is a replacement for concurrent.futures.ProcessPoolExecutor and concurrent.futures.ThreadPoolExecutor that executes any submitted task immediately in the submitting thread. This helps to avoid writing extra code for the case that multithreading or multiprocessing has been turned off in the configuration, which is helpful for debugging.

It is not recommended to use this in asynchronous code or code that relies on the submit()- or map()-method of executors to return quickly.

submit(fn, *args, **kwargs) Future[source]

Run function “fn” with the given args and kwargs synchronously without multithreading or multiprocessing.
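
A minimal sketch of how the executor can be used as a drop-in replacement (an illustrative example, not taken from the original documentation):

>>> executor = SingleThreadExecutor()
>>> future = executor.submit(lambda x: x * 2, 21)
>>> future.result()
42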

class ThreadLocalSingletonFactory(class_or_factory, name: str = '', *, uniqueID: str | int = 0, ident=None)[source]

Generates a singleton-factory that returns one and the same instance of class_or_factory for one and the same thread, but different instances for different threads.

Note: Parameter uniqueID should be provided if class_or_factory is not unique but generic. See source code of DHParser.dsl.create_transtable_junction()
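
An illustrative sketch, assuming that the generated factory object is simply called without arguments to obtain the thread-local instance:

>>> class Scanner:
...     pass
>>> factory = ThreadLocalSingletonFactory(Scanner)
>>> factory() is factory()   # same instance within the same thread
True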

abbreviate_middle(s: str, max_length: int) str[source]

Shortens string s by replacing the middle part with an ellipsis sign ` … ` if the size of the string exceeds max_length.

as_identifier(s: str, replacement: str = '_') str[source]

Converts a string to an identifier that matches the regular expression \w+ by substituting any character not matching \w with the given replacement string:

>>> as_identifier('EBNF-m')
'EBNF_m'

as_list(item_or_sequence) List[Any][source]

Turns an arbitrary sequence or a single item into a list. In case of a single item, the list contains this element as its sole item.

as_tuple(item_or_sequence) Tuple[Any][source]

Turns an arbitrary sequence or a single item into a tuple. In case of a single item, the tuple contains this element as its sole item.
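
Two illustrative calls (hedged examples, not taken from the original docstrings):

>>> as_list(5)
[5]
>>> as_tuple([1, 2])
(1, 2)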

compile_python_object(python_src: str, catch_obj='DSLGrammar') Any[source]

Compiles the python source code and returns the (first) object the name of which either equals or is matched by catch_obj. If catch_obj is the empty string, the namespace dictionary will be returned.
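
A minimal sketch of the intended use (illustrative; the variable name "answer" is arbitrary):

>>> compile_python_object('answer = 42', 'answer')
42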

concurrent_ident() str[source]

Returns an identifier for the current process and thread.

cpu_count() int[source]

Returns the number of cpus that are accessible to the current process or 1 if the cpu count cannot be determined.

deprecated(message: str) Callable[source]

Decorator that marks a function as deprecated and emits a deprecation message on its first use:

>>> @deprecated('This function is deprecated!')
... def bad():
...     pass
>>> save = get_config_value('deprecation_policy')
>>> set_config_value('deprecation_policy', 'fail')
>>> try: bad()
... except DeprecationWarning as w:  print(w)
This function is deprecated!
>>> set_config_value('deprecation_policy', save)

deprecation_warning(message: str)[source]

Issues a deprecation warning. Makes sure that each message is only called once.

escape_ctrl_chars(strg: str) str[source]

Replaces all control characters (e.g. \n, \t) in a string by their back-slashed representation and replaces backslash by double backslash.
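
An illustrative example (assuming the back-slashed representation described above):

>>> escape_ctrl_chars('tab:\there')
'tab:\\there'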

escape_formatstr(s: str) str[source]

Replaces single curly braces by double curly braces in string s, so that they are not misinterpreted as place-holders by str.format().
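
An illustrative example:

>>> escape_formatstr('100% {certain}')
'100% {{certain}}'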

escape_re(strg: str) str[source]

Returns the string with all regular expression special characters escaped. TODO: Remove this function in favor of re.escape()

expand_table(compact_table: Dict) Dict[source]

Expands a table by separating keywords that are tuples or strings containing comma separated words into single keyword entries with the same values. Returns the expanded table. Example:

>>> expand_table({"a, b": 1, ('d','e','f'):5, "c":3})
{'a': 1, 'b': 1, 'd': 5, 'e': 5, 'f': 5, 'c': 3}

first(item_or_sequence: Sequence | Any) Any[source]

Returns an item or the first item of a sequence of items.
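
Illustrative examples (not taken from the original docstring):

>>> first([1, 2, 3])
1
>>> first(4)
4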

fix_XML_attribute_value(value: Any) str[source]

Returns the quoted XML-attribute value. In case the value contains illegal characters, like ‘<’, these will be replaced by XML-entities.

identify_python() str[source]

Returns a reasonable identification string for the python interpreter, e.g. “cpython 3.8.6”.

identity(x)[source]

Canonical identity function. The purpose of defining identity() here is to allow it to serve as a default value and to be able to check whether a function parameter has been assigned another than the default value or not.

instantiate_executor(allow_parallel: bool, preferred_executor: Type[Executor], *args, **kwargs) Executor[source]

Instantiates an Executor of a particular type, if the value of the configuration variable ‘debug_parallel_execution’ allows doing so. Otherwise, a surrogate executor will be returned. If allow_parallel is False, a SingleThreadExecutor will be instantiated, regardless of the preferred_executor and any configuration values.

is_filename(strg: str) bool[source]

Tries to guess whether string strg is a file name.

is_html_name(url: str) bool[source]

Returns True if url ends with .htm or .html.

is_python_code(text_or_file: str) bool[source]

Checks whether ‘text_or_file’ is python code or the name of a file that contains python code.

isgenerictype(t)[source]

Returns True if t is a generic type. WARNING: This is very “hackish”. Caller must make sure that t actually is a type!

issubtype(sub_type, base_type) bool[source]

Returns True if sub_type is a subtype of base_type. WARNING: Implementation is somewhat “hackish” and might break with new Python-versions.

json_dumps(obj: JSON_Type, *, cls=<class 'json.encoder.JSONEncoder'>, partially_serialized: bool = False) str[source]

Returns a json-object as string. Other than the standard-library’s json.dumps()-function, json_dumps allows including already serialized parts (in the form of JSONstr-objects) in the json-object. Example:

>>> already_serialized = '{"width":640,"height":400}'
>>> literal = JSONstr(already_serialized)
>>> json_obj = {"jsonrpc": "2.0", "method": "report_size", "params": literal, "id": None}
>>> json_dumps(json_obj, partially_serialized=True)
'{"jsonrpc":"2.0","method":"report_size","params":{"width":640,"height":400"},"id":null}'

Parameters:
  • obj – A json-object (or a tree of json-objects) to be serialized

  • cls – The class of a custom json-encoder derived from json.JSONEncoder

  • partially_serialized – If True, JSONstr-objects within the json tree will be encoded (by inserting their content). If False, JSONstr-objects will raise a TypeError, but encoding will be faster.

Returns:

The string-serialized form of the json-object.

json_rpc(method: str, params: JSON_Type = [], ID: int | None = None, partially_serialized: bool = True) str[source]

Generates a JSON-RPC-call string for method with parameters params.

Parameters:
  • method – The name of the rpc-function that shall be called

  • params – A json-object representing the parameters of the call

  • ID – An ID for the json-rpc-call or None

  • partially_serialized – If True, the params-object may contain already serialized parts in the form of JSONstr-objects. If False, any JSONstr-objects will lead to a TypeError.

Returns:

The string-serialized form of the json-object.

last(item_or_sequence: Sequence | Any) Any[source]

Returns an item or the last item of a sequence of items.
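
Illustrative examples (not taken from the original docstring):

>>> last([1, 2, 3])
3
>>> last(4)
4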

line_col(lbreaks: List[int], pos: int) Tuple[int, int][source]

Returns the position within a text as (line, column)-tuple based on a list of all line breaks, including -1 and EOF.

linebreaks(text: StringView | str) List[int][source]

Returns a list of the indices of all line breaks in the text.

load_if_file(text_or_file) str[source]

Reads and returns content of a text-file if parameter text_or_file is a file name (i.e. a single line string), otherwise (i.e. if text_or_file is a multi-line string) text_or_file is returned.

lxml_XML_attribute_value(value: Any) str[source]

Makes sure that the attribute value works with the lxml-library, at the cost of replacing all characters with a code > 256 by a question mark.

Parameters:

value – the original attribute value

Returns:

the quoted and lxml-compatible attribute value.

matching_brackets(text: str, openB: str, closeB: str, unmatched: list = []) List[Tuple[int, int]][source]

Returns a list of matching bracket positions. Fills an empty list passed to parameter unmatched with the positions of all unmatched brackets.

>>> matching_brackets('(a(b)c)', '(', ')')
[(2, 4), (0, 6)]
>>> matching_brackets('(a)b(c)', '(', ')')
[(0, 2), (4, 6)]
>>> unmatched = []
>>> matching_brackets('ab(c', '(', ')', unmatched)
[]
>>> unmatched
[2]

md5(*txt)[source]

Returns the md5-checksum for txt. This can be used to test if some piece of text, for example a grammar source file, has changed.

multiprocessing_broken() str[source]

Returns an error message if, for any reason, multiprocessing is not safe to be used. For example, multiprocessing does not work with PyInstaller (under Windows) or GraalVM.

normalize_circular_path(path: Tuple[str, ...] | AbstractSet[Tuple[str, ...]]) Tuple[str, ...] | Set[Tuple[str, ...]][source]

Returns a normalized version of a circular path represented as a tuple or - if called with a set of paths instead of a single path - a set of normalized paths.

A circular (or “recursive”) path is a tuple of items, the order of which matters, but not the starting point. This can, for example, be a tuple of references from one symbol defined in an EBNF source text back to (but excluding) itself.

For example, when defining a grammar for arithmetic, the tuple (‘expression’, ‘term’, ‘factor’) is a recursive path, because the definition of a factor includes a (bracketed) expression and thus refers back to expression. Normalizing is done by “ring-shifting” the tuple so that it starts with the alphabetically first symbol in the path.

>>> normalize_circular_path(('term', 'factor', 'expression'))
('expression', 'term', 'factor')

normalize_docstring(docstring: str) str[source]

Strips leading indentation as well as leading and trailing empty lines from a docstring.

pp_json(obj: JSON_Type, *, cls=<class 'json.encoder.JSONEncoder'>) str[source]

Returns a json-object as pretty-printed string. Other than the standard-library’s json.dumps()-function, pp_json allows including already serialized parts (in the form of JSONstr-objects) in the json-object. Example:

>>> already_serialized = '{"width":640,"height":400}'
>>> literal = JSONstr(already_serialized)
>>> json_obj = {"jsonrpc": "2.0", "method": "report_size", "params": literal, "id": None}
>>> print(pp_json(json_obj))
{
  "jsonrpc": "2.0",
  "method": "report_size",
  "params": {"width":640,"height":400"},
  "id": null}

Parameters:
  • obj – A json-object (or a tree of json-objects) to be serialized

  • cls – The class of a custom json-encoder derived from json.JSONEncoder

Returns:

The pretty-printed string-serialized form of the json-object.

pp_json_str(jsons: str) str[source]

Pretty-prints an already serialized (but possibly ugly-printed) json object in a well-readable form. Syntactic sugar for: pp_json(json.loads(jsons)).

printw(s: Any, wrap_column: int = 79, tolerance: int = 24, wrap_chars: str = ')]>, ')[source]

Prints the string or other object nicely wrapped. See wrap_str_nicely().

re_find(s, r, pos=0, endpos=9223372036854775807)[source]

Returns the match of the first occurrence of the regular expression r in string (or byte-sequence) s. This is essentially a wrapper for re.finditer() that avoids a try-except StopIteration block. If r cannot be found, None will be returned.
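
A small sketch of the intended use (illustrative example, assuming a compiled regular expression is passed):

>>> import re
>>> m = re_find('DHParser 1.0', re.compile(r'\d+\.\d+'))
>>> m.group() if m else None
'1.0'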

relative_path(from_path: str, to_path: str) str[source]

Returns the relative path in order to open a file from to_path when the script is running in from_path. Example:

>>> relative_path('project/common/dir_A', 'project/dir_B').replace(chr(92), '/')
'../../dir_B'

sane_parser_name(name) bool[source]

Checks whether given name is an acceptable parser name. Parser names must not be preceded or succeeded by a double underscore ‘__’!
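
Illustrative examples (not taken from the original docstring):

>>> sane_parser_name('expression')
True
>>> sane_parser_name('__expression__')
False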

smart_list(arg: str | Iterable | Any) Sequence | Set[source]

Returns the argument as list, depending on its type and content.

If the argument is a string, it will be interpreted as a list of comma separated values, trying ';', ',', ' ' as possible delimiters in this order, e.g.

>>> smart_list('1; 2, 3; 4')
['1', '2, 3', '4']
>>> smart_list('2, 3')
['2', '3']
>>> smart_list('a b cd')
['a', 'b', 'cd']

If the argument is a collection other than a string, it will be returned as is, e.g.

>>> smart_list((1, 2, 3))
(1, 2, 3)
>>> smart_list({1, 2, 3})
{1, 2, 3}

If the argument is another iterable than a collection, it will be converted into a list, e.g.

>>> smart_list(i for i in {1, 2, 3})
[1, 2, 3]

Finally, if none of the above is true, the argument will be wrapped in a list and returned, e.g.

>>> smart_list(125)
[125]

split_path(path: str) Tuple[str, ...][source]

Splits a filesystem path into its components. Other than os.path.split(), it does not only split off the last part:

>>> split_path('a/b/c')
('a', 'b', 'c')
>>> os.path.split('a/b/c')  # for comparison.
('a/b', 'c')

text_pos(text: StringView | str, line: int, column: int, lbreaks: List[int] = []) int[source]

Returns the text-position for a given line and column or -1 if the line and column exceed the size of the text.

class unrepr(s: str)[source]

unrepr encapsulates a string representing a python function call in such a way that its representation yields the function call itself rather than a quoted string of that call.

Example

>>> "re.compile(r'abc+')"
"re.compile(r'abc+')"
>>> unrepr("re.compile(r'abc+')")
re.compile(r'abc+')

validate_XML_attribute_value(value: Any) str[source]

Validates an XML-attribute value and returns the quoted string-value if successful. Otherwise, raises a ValueError.

wrap_str_nicely(s: str, wrap_column: int = 79, tolerance: int = 24, wrap_chars: str = ')]>, ') str[source]

Line-wraps a single-line output string at ‘wrap_column’. Tries to find a suitable point for wrapping, i.e. after any of the characters in wrap_chars.

If the string spans multiple lines, the existing linebreaks will be kept and no rewrapping takes place. In order to enforce rewrapping of multiline strings, use: wrap_str_nicely(repr(s)[1:-1]). (repr() replaces linebreaks by the \n-marker; the slicing [1:-1] removes the opening and closing quotation marks that repr() adds.)

Examples:

>>> s = ('(X (l ",.") (A (O "123") (P (:Text "4") (em "56"))) '
...      '(em (m "!?")) (B (Q (em "78") (:Text "9")) (R "abc")) '
...      '(n "+-"))')
>>> print(wrap_str_nicely(s))
(X (l ",.") (A (O "123") (P (:Text "4") (em "56"))) (em (m "!?"))
 (B (Q (em "78") (:Text "9")) (R "abc")) (n "+-"))
>>> s = ('(X (s) (A (u) (C "One,")) (em (A (C " ") (D "two, ")))'
...      '(B (E "three, ") (F "four!") (t))))')
>>> print(wrap_str_nicely(s))
(X (s) (A (u) (C "One,")) (em (A (C " ") (D "two, ")))(B (E "three, ")
 (F "four!") (t))))
>>> s = ("[Node('word', 'This'), Node('word', 'is'), "
...      "Node('word', 'Buckingham'), Node('word', 'Palace')]")
>>> print(wrap_str_nicely(s))
[Node('word', 'This'), Node('word', 'is'), Node('word', 'Buckingham'),
 Node('word', 'Palace')]
>>> s = ("Node('phrase', (Node('word', 'Buckingham'), "
...      "Node('blank', ' '), Node('word', 'Palace')))")
>>> print(wrap_str_nicely(s))
Node('phrase', (Node('word', 'Buckingham'), Node('blank', ' '),
 Node('word', 'Palace')))
>>> s = ('<hard>Please mark up <foreign lang="de">Stadt\n<lb/></foreign>'
...      '<location><foreign lang="de"><em>München</em></foreign> '
...      'in Bavaria</location> in this sentence.</hard>')
>>> print(wrap_str_nicely(s))
<hard>Please mark up <foreign lang="de">Stadt
<lb/></foreign><location><foreign lang="de"><em>München</em></foreign> in Bavaria</location> in this sentence.</hard>
>>> print(wrap_str_nicely(repr(s)[1:-1]))  # repr to ignore the linebreaks
<hard>Please mark up <foreign lang="de">Stadt\n<lb/></foreign><location>
<foreign lang="de"><em>München</em></foreign> in Bavaria</location>
 in this sentence.</hard>

Module versionnumber