Reference#
EBNF-Reference#
Syntax#
DHParser supports two different variants of the EBNF-syntax, called classical and regular-expression-like (or “regex-like”). The former resembles the ISO-standard for EBNF, the latter is more akin to the commonly used EBNF-syntax for parsing expression grammars.
A grammar consists of a sequence of definitions (also known as “productions”) of the form: symbol = rule. A rule is always a sequence of literals and operators. DHParser supports the following literals and operators:
| literals and operators | classical syntax | regex-like |
|---|---|---|
| literals | | |
| insignificant whitespace ¹⁾ | ~ | ~ |
| string literal | "…" or `…` | "…" or `…` |
| regular expr. | /…/ | /…/ |
| operators | | |
| sequences | A B C | A B C |
| alternatives | A \| B \| C | A \| B \| C |
| interleave ²⁾ | A ° B ° C | A ° B ° C |
| grouping | (…) | (…) |
| options | [ … ] | …? |
| repetitions | { … } | …* |
| one or more | { … }+ | …+ |
| repetition range | …(i, k) | …{i, k} |
| lookahead assertions | | |
| positive lookahead | & … | & … |
| negative lookahead | ! … | ! … |
| lookbehind assertions ³⁾ | | |
| positive lookbehind | <-& … | <-& … |
| negative lookbehind | <-! … | <-! … |
| error raising ⁴⁾ | | |
| mandatory-marker | § … | § … |
| error-message | @Error("…") | @Error("…") |
| macros ⁵⁾ | | |
| macro definition | $macro($p1, …) = | same |
| macro usage | $macro(expr, …) | same |
| custom parsers ⁶⁾ | | |
| custom parser | @Custom(parse_func) | same |
| parser factory | @factory_func("…") | same |
| context sensitive ops. ⁷⁾ | | |
| pop and match | :: … | :: … |
| retrieve and match | : … | : … |
| pop and always match | :? … | :? … |
| filter-definition | @SYM_filter = func() | same |
- ¹⁾ Insignificant whitespace is whitespace that neither carries any syntactic significance (say, as a delimiter) nor has any semantic relevance (say, as part of the data).
- ²⁾ Interleave means that the following elements must appear (as in a sequence), but it does not matter in which order.
- ³⁾ In DHParser lookbehind-assertions always operate on the reversed input string! This makes it possible to exploit the full capabilities of regular expressions without having to worry about regex-engines that support only constant-length look-behinds etc. Keep in mind that looking back for the keyword “BEGIN” then means that you have to check for “NIGEB”, e.g. “<-& /\s*NIGEB/” means that the previous token reads “BEGIN”.
- ⁴⁾ Mandatory markers express expectations about the following items in a document. If a sequence has matched up to the marker but then fails to match an element after the marker, this is not just a non-match, but an error that will be reported. Error messages should be added to paths in the grammar that should never be reached if the parsed document is correct.
- ⁵⁾ See the Macros-section in the manual for a detailed explanation of how macros work in DHParser.
- ⁶⁾ Custom parsers are parsers that are defined as Python-functions which will be called from the generated Python-parser during parsing.
- ⁷⁾ Context-sensitive parsers break with the paradigm of context-free-grammars and may slow down the parsing process. In connection with filter-functions they provide, next to custom parsers, another means of realizing semantic actions.
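The remark on lookbehind-assertions (footnote ³) can be illustrated with Python's standard re module. The following is only a minimal sketch of the idea of matching on the reversed input; it is not DHParser's implementation, and the helper name preceded_by is hypothetical.

```python
import re


def preceded_by(text: str, pos: int, reversed_pattern: str) -> bool:
    """Return True if the text before `pos`, read backwards,
    matches `reversed_pattern` at its very beginning."""
    reversed_prefix = text[:pos][::-1]
    return re.match(reversed_pattern, reversed_prefix) is not None


source = "BEGIN  x := 1"
pos = source.index("x")

# Looking back for "BEGIN" followed by optional whitespace means
# matching "\s*NIGEB" against the reversed prefix "  NIGEB".
print(preceded_by(source, pos, r"\s*NIGEB"))  # True
print(preceded_by(source, pos, r"\s*DNE"))    # False
```

Because the pattern is applied to a reversed string, it may have variable length (here `\s*`), which many regex engines do not allow inside a native look-behind.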
Example#
Here is an example for the classical-syntax:
# Arithmetic-grammar
# directives
@ whitespace = vertical # implicit whitespace, denoted by ~, includes any number of line feeds
@ literalws = right # literals have implicit whitespace on the right hand side
@ comment = /#.*/ # comments range from a '#'-character to the end of the line
@ ignorecase = False # literals and regular expressions are case-sensitive
@ drop = whitespace, strings # drop anonymous whitespace and (anonymous) string literals
# grammar
expression = term { (PLUS | MINUS) term }
term = factor { (DIV | MUL) factor }
factor = [sign] (NUMBER | VARIABLE | group) { VARIABLE | group }
sign = POSITIVE | NEGATIVE
group = "(" expression ")"
PLUS = "+"
MINUS = "-"
MUL = "*"
DIV = "/"
POSITIVE = /[+]/ # no implicit whitespace after signs!
NEGATIVE = /[-]/
NUMBER = /(?:0|(?:[1-9]\d*))(?:\.\d+)?/~
VARIABLE = /[A-Za-z]/~
In regex-like syntax the first part of the grammar for arithmetical expressions would read like this:
expression = term ( (PLUS | MINUS) term )*
term = factor ( (DIV | MUL) factor )*
factor = sign? (NUMBER | VARIABLE | group) (VARIABLE | group )*
sign = POSITIVE | NEGATIVE
group = "(" expression ")"
DHParser has a heuristic-mode that allows grammars to be written with other characters as delimiters. For example, the same passage in strict ISO-style would look like:
@flavor = heuristic
expression ::= term { (PLUS | MINUS) term };
term ::= factor { (DIV | MUL) factor };
factor ::= [sign] (NUMBER | VARIABLE | group) { VARIABLE | group };
sign ::= POSITIVE | NEGATIVE;
group ::= "(" expression ")";
Or, in the more recent parsing expression grammar (PEG)-style:
@flavor = heuristic
expression = term ( (PLUS / MINUS) term )*
term = factor ( (DIV / MUL) factor )*
factor = sign? (NUMBER / VARIABLE / group) (VARIABLE / group )*
sign = POSITIVE / NEGATIVE
group = "(" expression ")"
Mind that in PEG-style it can be difficult to avoid ambiguities when using regular expressions to define the atomic-expressions. This can lead to unexpected parser-errors. Therefore it is better to use the |-sign for denoting alternatives, rather than a slash /. The meaning is in any case the same, namely, that of PEG-grammars where the first alternative that matches is used rather than checking all alternatives (as Earley-parsers would do).
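The consequence of "the first alternative that matches is used" can be shown with a few lines of plain Python. This is a hand-written sketch of ordered choice, not DHParser code; the names text, alternative and matches_completely are made up for the illustration.

```python
from typing import Callable, Optional

# A parser takes the source and a position and returns the end position
# of the match, or None if it does not match.
Parser = Callable[[str, int], Optional[int]]


def text(literal: str) -> Parser:
    def parse(src: str, pos: int) -> Optional[int]:
        return pos + len(literal) if src.startswith(literal, pos) else None
    return parse


def alternative(*parsers: Parser) -> Parser:
    def parse(src: str, pos: int) -> Optional[int]:
        for parser in parsers:
            end = parser(src, pos)
            if end is not None:
                return end  # first match wins, later alternatives are never tried
        return None
    return parse


def matches_completely(parser: Parser, src: str) -> bool:
    return parser(src, 0) == len(src)


short_first = alternative(text("a"), text("ab"))
long_first = alternative(text("ab"), text("a"))

print(matches_completely(short_first, "ab"))  # False: "a" matched, "b" is left over
print(matches_completely(long_first, "ab"))   # True
```

With ordered choice, an alternative that is a prefix of a later one shadows it, so the longer or more specific alternative should come first.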
Note that despite being based on parsing expression grammars, DHParser fully supports direct and indirect left recursion in the grammar-definition. In order to avoid endless loops it employs the seed-and-grow algorithm.
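The seed-and-grow idea can be sketched for a single directly left-recursive rule, expr = expr "+" num | num. The code below is a simplified illustration under that assumption and says nothing about how DHParser implements it internally; all function names are hypothetical.

```python
import re
from typing import Dict, Optional

NUM = re.compile(r"\d+")


def parse_num(text: str, pos: int) -> Optional[int]:
    m = NUM.match(text, pos)
    return m.end() if m else None


def expr_body(text: str, pos: int, memo: Dict[int, Optional[int]]) -> Optional[int]:
    # first alternative: expr "+" num
    left = parse_expr(text, pos, memo)  # the recursive call yields the current seed
    if left is not None and text.startswith("+", left):
        right = parse_num(text, left + 1)
        if right is not None:
            return right
    # second alternative: num
    return parse_num(text, pos)


def parse_expr(text: str, pos: int, memo: Dict[int, Optional[int]]) -> Optional[int]:
    """Return the end position of the longest expr starting at pos, or None."""
    if pos in memo:
        return memo[pos]      # recursion at the same position: return the seed
    memo[pos] = None          # initial seed: the recursive call fails
    while True:
        end = expr_body(text, pos, memo)
        if end is None or (memo[pos] is not None and end <= memo[pos]):
            break             # the match did not grow any further
        memo[pos] = end       # grow the seed and try again
    return memo[pos]


print(parse_expr("1+2+3", 0, {}))  # 5
```

The memo entry for the current position breaks the endless recursion: the first pass can only succeed via the non-recursive alternative, and every following pass extends the previous result by one more `"+" num`.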
Directives#
The following is a table of DHParser’s EBNF-directives. EBNF-Directives control how the grammar is interpreted and processed by DHParser. They also influence the verbosity or sparseness of the concrete-syntax-tree that the parser yields.
EBNF-Directives always have the form:
@[DIRECTIVE] = [VALUES]
| Directive | purpose and possible values |
|---|---|
| @comment | Regular expression for comments, e.g. /#.*(?:\n\|$)/ |
| @whitespace | Regular expression for whitespace or one of the predefined values: horizontal, linefeed, linestart, vertical |
| @literalws | Implicitly assume insignificant whitespace adjacent to string-literals: left, right, both or none |
| @ignorecase | Global case-sensitivity: True or False |
| @include | Include other EBNF files |
| @tokens | List of names of all valid preprocessor-tokens |
| @hide | List of symbols that shall produce anonymous nodes instead of nodes named by the symbol |
| @drop | List of symbols that shall be dropped entirely from the tree |
| @reduction | Reduction level for simplifying trees while parsing: none, flatten, merge_treetops, merge |
| @[SYM]_filter | Name of a Python function that yields a counterpart of the captured symbols, e.g. ")" for "(" |
| @[SYM]_error | A regular expression followed by an error message that is produced if the expression matches at the error-location |
| @[SYM]_skip | List of regular expressions or grammar symbols or function names to find a reentry point after an error |
| @[SYM]_resume | Same as above, only this time the parent parser rather than the failing parser continues at the reentry point |
| @optimizations | Optimization level for speeding up the parsing-process: none, some, all |
| @Error(…) | Attach a parsing-error to the syntax-tree if the parser reaches this point |
| @flavor | The syntax for the EBNF-grammar: dhparser (strict) or heuristic (tolerant with respect to variants of EBNF, but slower) |
Example 1: Comments and Whitespace:
@comment = /#.*(?:\n|$)/ # from the first `#` everything until
# the end of the line or file is a comment
@whitespace = /[ \t\n]*/ # insignificant whitespace consists of blanks, tabs or linefeeds
# Note: Insignificant whitespace should always be defined in such
# a way that the empty string is also always matched!
@literalws = right # If a string literal is defined in the grammar it is always assumed
# to be succeeded by insignificant whitespace, i.e. you can
# write "+" instead of "+"~ in your grammar to make it easier to read
Example 2: Merge, hide and drop parts of the concrete syntax tree:
@hide = /_\w+/ # Let all symbols starting with an underscore produce anonymous Nodes.
# Symbol names with an underscore can then be used for symbols that are
# merely introduced for structuring the grammar.
@hide = INT, FRAC, EXP # Capture parts of a number, e.g. 1.5E+10 but the
or @disposable (deprecated!) # division into parts is not needed, anymore, after it's been captured
# Instead of using a directive, "HIDE:" can be written in front of the symbols, e.g.
HIDE:INT = [`-`] ( /[1-9][0-9]+/ | /[0-9]/ )
@hide = comma, full_stop # Drop the delimiters comma and full_stop entirely. Note: Only
@drop = comma, full_stop # hidden symbols can be dropped! Thus, @drop can only be used in
# combination with @hide if used as a directive
# Instead of using a directive, "DROP:" can be written in front of the symbols to be dropped, e.g.
DROP:comma = ","
# There is also an inline version of DROP, e.g.
list = word { ("," -> DROP) word }
@reduction = merge # reduce and merge anonymous nodes wherever possible
# in the concrete syntax tree
Example 3: Error Handling and Fail-tolerant parsing:
# Attach an error to the syntax-tree if an illegal character follows a backspace
escape = backspace ("n" | "s" | "t" | @Error("1001:Unknown escape sequence") /./)
@street_error = /(?!\d)/, "1001:House number expected!"
@street_error = '', "1002:House number too high!"
@street_skip = /\d+/ # in case "street" fails, skip behind the next sequence of digits
@street_resume = /\n/ # if that does not work, resume street's parent parser with the next line
street = /\w+/~ § /\d\d?/ !/\d/
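What "skip behind the next sequence of digits" means can be sketched with Python's re module. The function below is a hypothetical helper, not part of DHParser; it assumes that reentry rules are plain regular expressions and that the nearest match after the error location wins.

```python
import re
from typing import Iterable, Optional


def reentry_point(text: str, error_pos: int, patterns: Iterable[str]) -> Optional[int]:
    """Return the position right behind the nearest match of any pattern
    at or after error_pos, or None if no pattern matches."""
    candidates = []
    for pattern in patterns:
        m = re.compile(pattern).search(text, error_pos)
        if m:
            candidates.append((m.start(), m.end()))
    return min(candidates)[1] if candidates else None


document = "Broadway abc 12\nMain 7\n"
error_pos = 9  # a house number was expected here, but "abc" was found

print(reentry_point(document, error_pos, [r"\d+"]))  # 15, right behind "12"
print(reentry_point(document, error_pos, [r"\n"]))   # 16, start of the next line
```

In the terms of the example above, the first call corresponds to the skip-rule (the failing parser continues behind the digits) and the second to the resume-rule (the parent parser continues with the next line).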
Example 4: Other directives:
@include = "numbers.ebnf" # insert the content of "numbers.ebnf" here
@tokens = BEGIN_INDENTATION, END_INDENTATION # names of preprocessor tokens. These should match
or @preprocessor_tokens = ... # the names in the Python code for the preprocessor
@bracket_filter = matching_bracket # matching_bracket = lambda s: {"(": ")", "[": "]"}[s]
remark = bracket text ::bracket # a remark is a text block enclosed by either (...) or [...]
bracket = "(" | "["
@flavor = heuristic # try to auto-detect the EBNF-style used (e.g ISO, PEG, etc.)
@ignorecase = True # Capital and small letters do not make a difference when capturing a
# document. Note: Usually it is better not to set this globally, but
# to specify it locally with the I-Flag within the regular expressions
# that capture the "atomic" parts, e.g. word = /(?i)[a-z]/
@optimizations = all # make the parser a little faster by using regular expressions for
# some non-recursive parts of the grammar
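The interplay of capturing a symbol, popping it and passing it through a filter, as in the @bracket_filter example above, can be imitated in plain Python. This sketch follows the one-argument lambda given in the comment of the example; the signature that DHParser actually expects of filter functions may differ, and parse_remark is a hypothetical stand-in for the generated parser.

```python
from typing import List, Optional


def matching_bracket(opening: str) -> str:
    """Filter: map the captured opening bracket to its closing counterpart."""
    return {"(": ")", "[": "]"}[opening]


def parse_remark(src: str) -> Optional[str]:
    """Imitates: remark = bracket text ::bracket"""
    captured: List[str] = []            # value stack for the symbol "bracket"
    if src[:1] not in ("(", "["):
        return None
    captured.append(src[0])             # "bracket" has matched: remember its text
    closing = matching_bracket(captured.pop())  # "::bracket" pops and filters the value
    end = src.find(closing, 1)
    if end < 0 or end != len(src) - 1:
        return None                     # closing bracket missing or misplaced
    return src[1:end]


print(parse_remark("(a note)"))   # a note
print(parse_remark("[a note]"))   # a note
print(parse_remark("(a note]"))   # None
```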
Modules#
- ebnf: Although DHParser also offers a Python-interface for specifying grammars (similar to pyparsing), the recommended way of using DHParser is by specifying the grammar in EBNF. The documentation of this module describes how grammars are specified in EBNF, how parsers can be auto-generated from these grammars, and how they are used to parse text.
- nodetree: Syntax-trees are the central data-structure of any parsing system. The description of this module explains how syntax-trees are represented within DHParser and how they can be manipulated, queried and serialized or deserialized as XML, S-expressions or json.
- transform: It is not unusual for digital humanities applications that document trees are transformed again and again to produce different representations of research data or various output forms. DHParser supplies the scaffolding for two different types of tree transformations, both of which are variations of the visitor pattern. The scaffolding supplied by the transform-module allows tree-transformations to be specified in a declarative style by filling in a dictionary of tag-names with lists of transformation functions that are called in sequence on a node. A number of transformations are pre-defined that cover the most common cases, which occur in particular when transforming concrete syntax trees to more abstract syntax trees. (An example of this kind of declaratively specified transformation is the EBNF_AST_transformation_table within DHParser’s ebnf-module.)
- compile: offers an object-oriented scaffolding for the visitor pattern that is more suitable for complex transformations that make heavy use of algorithms as well as transformations from trees to non-tree objects like program code. (An example of the latter kind of transformation is the EBNFCompiler-class of DHParser’s ebnf-module.)
- pipeline: offers support for “processing-pipelines” composed of “junctions”. A processing-pipeline consists of a series of tree-transformations that are applied in sequence. “Junctions” declare which source-tree-stage is transformed by which transformation-routine into which destination tree-stage. Processing-pipelines can contain bifurcations, which are needed if different kinds of output-data shall be derived from one source-document.
- testing: provides a rich framework for unit-testing of grammars, parsers and any kind of tree-transformation. Usually, developers will not need to interact with this module directly, but rely on the unit-testing script generated by the “dhparser.py” command-line tool. The tests themselves are specified declaratively in test-input-files (in the very simple “.ini”-format) that reside by default in the “test_grammar”-directory of a DHParser-project.
- preprocess: provides support for DSL-pre-processors as well as source mapping of (error-)locations from the preprocessed document to the original document(s). Pre-processors are a practical means for adding features to a DSL which are difficult or impossible to define with context-free-grammars in EBNF-notation, for example scoping based on indentation (as used by Python) or chaining of source-texts via an “include”-directive.
- parse: contains the parsing algorithms and the Python-Interface for defining parsers. DHParser features a packrat-parser for parsing-expression-grammars with full left-recursion support as well as configurable error catching and continuation after error. The Python-Interface allows grammars to be defined directly as Python-code without the need to compile an EBNF-grammar first. This is an alternative approach to defining grammars similar to that of pyparsing.
- dsl: contains high-level functions for compiling ebnf-grammars and domain specific languages “on the fly”.
- error: defines the Error-class, the objects of which describe errors in the source document. Errors are defined by - at least - an error code (indicating at the same time the level of severity), a human readable error message and a position in the source text.
- trace: Apart from unit-testing, DHParser offers “post-mortem” debugging of the parsing process itself - as described in A Step by Step Guide. This is helpful to figure out why a parser went wrong. Again, there is little need to interact with this module directly, as its functionality is turned on by setting the configuration variables history_tracking and, for tracing continuation after errors, resume_notices, which in turn can be triggered by calling the auto-generated -Parser.py-scripts with the parameter --debug.
- log: logging facilities for DHParser as well as tracking of the parsing-history in connection with module trace.
- configuration: the central place for all configuration settings of DHParser. Be sure to use the access, set and get functions to change presets and configuration values in order to make sure that changes to the configuration work when used in combination with multithreading or multiprocessing.
- server: In order to avoid startup times or to provide a language server for a domain-specific-language (DSL), DSL-parsers generated by DHParser can be run as a server. Module server provides the scaffolding for an asynchronous language server. The “-Server.py”-script generated by DHParser provides a minimal language server (sufficient) for compiling a DSL. Especially if used with the just-in-time compiler pypy, using the -Server.py script allows for a significant speed-up.
- lsp: (as of now, this is just a stub!) provides data classes that resemble the typescript-interfaces of the language server protocol specification.
- stringview: defines a low level class that provides views on slices of strings. It is used by the parse-module to avoid excessive copying of data when slicing strings. (Python always creates a copy of the data when slicing strings as a design decision.) This module, if any, can be sped up significantly by compiling it with cython. (Use the cythonize_stringview-script in DHParser’s main directory or, even better, compile (almost) all modules with the build_cython-modules-script. This yields a 2-3x speed increase.)
- toolkit: various little helper functions for DHParser. Usually, there is no need to call any of these directly.
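The declarative style described for the transform-module, a dictionary of tag-names with lists of functions that are called in sequence on a node, can be sketched in a few lines of plain Python. The Node class, the three transformation functions and the depth-first traversal order below are assumptions made for this illustration; they are not DHParser's own classes or functions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)
    text: str = ""


Transformation = Callable[[Node], None]


def strip_whitespace(node: Node) -> None:
    node.text = node.text.strip()


def replace_by_single_child(node: Node) -> None:
    # A node with exactly one child takes over that child's identity.
    if len(node.children) == 1:
        child = node.children[0]
        node.name, node.children, node.text = child.name, child.children, child.text


def transform(node: Node, table: Dict[str, List[Transformation]]) -> None:
    for child in node.children:
        transform(child, table)              # children first, then the node itself
    for func in table.get(node.name, []):    # functions are called in sequence
        func(node)


table: Dict[str, List[Transformation]] = {
    "NUMBER": [strip_whitespace],
    "factor": [replace_by_single_child],
    "term": [replace_by_single_child],
}

tree = Node("expression", [Node("term", [Node("factor", [Node("NUMBER", text=" 2 ")])])])
transform(tree, table)
print(tree.children[0].name, repr(tree.children[0].text))  # NUMBER '2'
```

The table separates what happens to each kind of node from the traversal itself, which is why this style suits the step from concrete to abstract syntax trees.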
Module ebnf#
Module ebnf provides an EBNF-parser-generator. The parser-generator
compiles an EBNF-Grammar into an executable Python-class. An instance of
this class can be called to parse text-documents conforming to the grammar.
It will return the concrete-syntax-tree of the document. Various flavors of
EBNF- or PEG- (Parsing-Expression-Grammar) notations are supported.
Usually, the classes and functions provided by the DHParser.ebnf will
not be called directly, because it is much simpler to use the high-level
API in module DHParser.dsl, in particular
DHParser.dsl.create_parser().
The ebnf-module is structured just like the parser-modules that are generated by running the “dhparser”-script on an EBNF-grammar or by executing the test-runner script in the project-directory of a DHParser-project. It consists of the following sections, divided by comment-blocks with the section's name:
- The EBNF-preprocessor takes care of chaining the main file and the included files (if there are any) into one single document that is passed on to the parser. (=> preprocessor_factory())
- The EBNF-parser translates the EBNF-grammar into a concrete-syntax-tree. (=> ConfigurableEBNFGrammar)
- The AST-transformation molds the concrete-syntax-tree into an abstract-syntax-tree (AST) that represents the syntactical structure of the grammar-source-code. (=> get_ebnf_transformer())
- Finally, the EBNF-compiler compiles the AST into an executable Python-class that is a descendant of Grammar and is composed mostly of instances of the parser-classes from module DHParser.parse. (=> EBNFCompiler)
  (Each symbol of the grammar is represented by a class variable to which a nested call of parser-class-instantiations is assigned. These parser-instances serve as a prototype from which grammar objects are derived via deep-copy upon the instantiation of the grammar-class.)
  This section also contains source-code-snippets and -templates for the Python-parser code the compiler produces.
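The remark that class-level parser-instances serve as prototypes which are deep-copied when the grammar-class is instantiated can be illustrated as follows. The classes below are simplified stand-ins written for this sketch, not DHParser's Parser and Grammar classes.

```python
import copy


class Parser:
    """Stand-in for a parser object that keeps per-instance state."""

    def __init__(self, name, *parts):
        self.name = name
        self.parts = parts   # sub-parsers this parser is composed of
        self.cache = {}      # mutable state that must not be shared


class Grammar:
    def __init__(self):
        memo = {}  # one shared memo keeps references between prototypes intact
        for attr, value in vars(type(self)).items():
            if isinstance(value, Parser):
                setattr(self, attr, copy.deepcopy(value, memo))


class ArithmeticGrammar(Grammar):
    NUMBER = Parser("NUMBER")
    factor = Parser("factor", NUMBER)


g1 = ArithmeticGrammar()
g2 = ArithmeticGrammar()
g1.NUMBER.cache[0] = "2"

print(g2.NUMBER.cache)                  # {}  (state is not shared between grammar objects)
print(g1.factor.parts[0] is g1.NUMBER)  # True (references inside one copy stay consistent)
```

Every grammar object thus owns an independent parser network, while the class variables remain untouched templates.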
The most notable difference to ordinary DHParser-projects is that
the DHParser.ebnf-module contains two Grammar-classes, one
for parsing code that strictly follows DHParser’s EBNF-syntax
(ConfigurableEBNFGrammar) and another one that is able to
parse many different brands of EBNF-syntax
(DHParser.ebnf_flavors.heuristic.HeuristicEBNFGrammar)
at the cost of parsing speed. When parsing or compiling an EBNF-grammar with
any of the high-level functions into Python-code, the faster “configurable”
EBNF-grammar is tried first, and, if that fails with particular errors which
suggest that the failure might be merely due to the use of a different brand
of EBNF, a second attempt is made with the slower heuristic EBNF-parser.
The EBNF-compiler is actually split into two classes, EBNFCompiler,
which contains the EBNF-AST -> Python compiler proper, and
EBNFDirectives which is a helper class to keep track of the
directives used to avoid overburdening the compiler-class with instance
variables.
Just as with DHParser’s (auto-)generated parser-scripts, the classes contained in
DHParser.ebnf should not be instantiated directly. Unlike
the parser scripts, though, the ebnf-module does not provide Junctions
with factory-functions for each stage from preprocessing to compiling.
Instead, it provides factory-functions that return one singleton instance
per thread of each class, namely:
- get_ebnf_parser()
- get_ebnf_transformer()
- get_ebnf_compiler()
These are supplemented by the quick-use-functions:
The following example shows how the classes and functions of the
ebnf-module can be connected to produce runnable Python-code from
an EBNF-grammar. It is meant as a help to understand the role of
these classes better as well as - in a simplified manner - the
basic working mechanisms of higher level functions like
DHParser.dsl.create_parser(). In any practical application,
the use of the high-level functions from DHParser.dsl is
to be preferred. One can think of the DHParser.dsl-module
as the public API of the ebnf-module.
This said, here is how a Python-parser can be generated from a grammar, step by step:
>>> from DHParser.ebnf import ConfigurableEBNFGrammar, EBNFTransform, EBNFCompiler, DHPARSER_IMPORTS, compile_ebnf
>>> arithmetic_ebnf = r"""
... @ whitespace = vertical
... @ literalws = right
... @ drop = whitespace, strings
... expression = term { (add | sub) term}
... term = factor { (div | mul) factor}
... factor = [minus] (NUMBER | VARIABLE | group)
... group = "(" expression ")"
... add = "+"
... sub = "-"
... mul = "*"
... div = "/"
... minus = `-`
... NUMBER = /(?:0|(?:[1-9]\d*))(?:\.\d+)?/~
... VARIABLE = /[A-Za-z]/~"""
>>> # 1. Compilation of an EBNF-grammar into Python-source-code
>>> ebnf_parser = ConfigurableEBNFGrammar()
>>> ebnf_transformer = EBNFTransform()
>>> ebnf_compiler = EBNFCompiler()
>>> CST = ebnf_parser(arithmetic_ebnf)
>>> AST = ebnf_transformer(CST) # CST should be considered invalid after that
>>> ebnf_compiler.set_grammar_name("Arithmetic")
>>> from DHParser.configuration import get_config_value, set_config_value
>>> save = get_config_value('optimizations')
>>> set_config_value('optimizations', frozenset())
>>> python_src = ebnf_compiler(AST)
>>> set_config_value('optimizations', save)
>>> assert not AST.errors
>>> print(python_src[:python_src.find("\ntry:")]) # leave out consistency check at the end, here
class ArithmeticGrammar(Grammar):
    r"""Parser for an Arithmetic source file.
    """
    expression = Forward()
    disposable__ = re.compile('$.')
    static_analysis_pending__ = [] # type: List[bool]
    parser_initialization__ = ["upon instantiation"]
    COMMENT__ = r''
    comment_rx__ = RX_NEVER_MATCH
    WHITESPACE__ = r'\s*'
    WSP_RE__ = mixin_comment(whitespace=WHITESPACE__, comment=COMMENT__)
    wsp__ = Whitespace(WSP_RE__)
    dwsp__ = Drop(Whitespace(WSP_RE__))
    VARIABLE = Series(RegExp('[A-Za-z]'), dwsp__)
    NUMBER = Series(RegExp('(?:0|(?:[1-9]\\d*))(?:\\.\\d+)?'), dwsp__)
    minus = Text("-")
    div = Series(Text("/"), dwsp__)
    mul = Series(Text("*"), dwsp__)
    sub = Series(Text("-"), dwsp__)
    add = Series(Text("+"), dwsp__)
    group = Series(Series(Drop(Text("(")), dwsp__), expression, Series(Drop(Text(")")), dwsp__))
    factor = Series(Option(minus), Alternative(NUMBER, VARIABLE, group))
    term = Series(factor, ZeroOrMore(Series(Alternative(div, mul), factor)))
    expression.set(Series(term, ZeroOrMore(Series(Alternative(add, sub), term))))
    root__ = expression
parsing: PseudoJunction = create_parser_junction(ArithmeticGrammar)
get_grammar = parsing.factory # for backwards compatibility, only
>>> # 2. Execution of the Python-source and extraction of the Grammar-class
>>> code = compile(DHPARSER_IMPORTS + python_src, '<string>', 'exec')
>>> namespace = {}
>>> exec(code, namespace)
>>> ArithmeticGrammar = namespace['ArithmeticGrammar']
>>> # 3. Instantiation of the Grammar class and parsing of an expression
>>> arithmetic_parser = ArithmeticGrammar()
>>> syntax_tree = arithmetic_parser("2 + 3 * 4")
>>> print(syntax_tree.as_sxpr())
(expression
  (term
    (factor
      (NUMBER "2")))
  (add "+")
  (term
    (factor
      (NUMBER "3"))
    (mul "*")
    (factor
      (NUMBER "4"))))
Of course, the first part, compiling of the grammar to Python-code, could also have been achieved with:
>>> python_src = compile_ebnf(arithmetic_ebnf, "Arithmetic").result
And for the execution of the Python-source and extraction of the Grammar-class,
one can use DHParser.toolkit.compile_python_object():
>>> from DHParser import toolkit
>>> ArithmeticGrammar = toolkit.compile_python_object(
... DHPARSER_IMPORTS + python_src, "ArithmeticGrammar")
>>> arithmetic_parser = ArithmeticGrammar()
>>> syntax_tree_2 = arithmetic_parser("2 + 3 * 4")
>>> assert syntax_tree_2.equals(syntax_tree)
The recommended canonical way for the last step, however, would be:
>>> parsing = toolkit.compile_python_object(
... DHPARSER_IMPORTS + python_src, "parsing")
>>> arithmetic_parser = parsing.factory()
>>> syntax_tree_3 = arithmetic_parser("2 + 3 * 4")
>>> assert syntax_tree_3.equals(syntax_tree)
Using the factory function of the parsing-junction to obtain a
grammar-object, instead of instantiating the grammar-class directly, ensures
that the grammar-object is instantiated no more than once per thread. Re-using
the same grammar-object is more efficient than re-instantiating it for every
new document to be parsed. At the same time, grammar-objects must not be shared
between threads or processes. (See also ThreadLocalSingletonFactory.)
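The principle of "one instance per thread" can be sketched with threading.local from the standard library. The class PerThreadFactory below is a hypothetical illustration of that principle; it is not the implementation of DHParser's ThreadLocalSingletonFactory, and the empty ArithmeticGrammar class merely stands in for a generated grammar-class.

```python
import threading


class PerThreadFactory:
    """Returns the same instance of cls on every call within one thread,
    but a different instance in every other thread."""

    def __init__(self, cls):
        self._cls = cls
        self._local = threading.local()

    def __call__(self):
        try:
            return self._local.instance
        except AttributeError:
            # first call in this thread: create and remember the instance
            self._local.instance = self._cls()
            return self._local.instance


class ArithmeticGrammar:
    """Placeholder for a grammar-class that is expensive to instantiate."""


get_grammar = PerThreadFactory(ArithmeticGrammar)

first = get_grammar()
print(first is get_grammar())   # True: re-used within the same thread

seen_in_other_thread = []
worker = threading.Thread(target=lambda: seen_in_other_thread.append(get_grammar()))
worker.start()
worker.join()
print(seen_in_other_thread[0] is first)  # False: another thread gets its own object
```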
- class ConfigurableEBNFGrammar(root: Parser | None = None, static_analysis: bool | None = None)[source]#
A parser for an EBNF grammar that can be configured to parse different syntactical variants of EBNF. Unlike HeuristicEBNF, this parser does not detect the variant in use while parsing.
Different syntactical variants can be configured by adjusting the definitions of DEF, OR, AND, ENDL, RNG_OPEN, RNG_CLOSE, RNG_DELIM, CH_LEADIN, TIMES, RE_LEADIN, RE_LEADOUT, either within this grammar definition or in the Grammar-object by changing the text-field of the respective parser objects.
EBNF-Definition of the Grammar:
@ comment = /(?!#x[A-Fa-f0-9])#.*(?:\n|$)|\/\*(?:.|\n)*?\*\/|\(\*(?:.|\n)*?\*\)/
    # comments can be either C-Style: /* ... */
    # or pascal/modula/oberon-style: (* ... *)
    # or python-style: # ... \n, excluding, however, character markers: #x20
@ whitespace = /\s*/    # whitespace includes linefeed
@ literalws = right     # trailing whitespace of literals will be ignored tacitly
@ hide = is_mdef, component, pure_elem, countable, no_range, FOLLOW_UP,
         ANY_SUFFIX, MOD_SYM, MOD_SEP, EOF
@ drop = whitespace, MOD_SYM, EOF, no_range    # do not include these even in the concrete syntax tree

# re-entry-rules to resume parsing after a syntax-error

@ definition_resume = /\n\s*(?=@|\w+\w*\s*=)/
@ directive_resume = /\n\s*(?=@|\w+\w*\s*=)/

# specialized error messages for certain cases

@ definition_error = /,/, 'Delimiter "," not expected in definition!\nEither this was meant to '
                          'be a directive and the directive symbol @ is missing\nor the error is '
                          'due to inconsistent use of the comma as a delimiter\nfor the elements '
                          'of a sequence.'

#: top-level

syntax = ~ { definition | directive | macrodef } EOF
definition = [modifier] symbol §DEF~ [ OR~ ] expression [MOD_SYM~ hide] ENDL~ &FOLLOW_UP
    # [OR~] to support v. Rossum's syntax
modifier = (drop | [hide]) MOD_SEP    # node LF after modifier allowed!
is_def = [MOD_SEP symbol] DEF | MOD_SEP is_mdef
MOD_SEP = / *: */

directive = "@" §symbol "=" component { "," component } &FOLLOW_UP
component = regexp | literals | procedure | symbol !is_def
          | &`$` !is_mdef § placeholder !is_def
          | "(" expression ")" | RAISE_EXPR_WO_BRACKETS expression
literals = { literal }+       # string chaining, only allowed in directives!
procedure = SYM_REGEX "()"    # procedure name, only allowed in directives!

macrodef = [modifier] "$" name~ ["(" §placeholder { "," placeholder } ")"]
           DEF~ [ OR~ ] macrobody [MOD_SYM~ hide] ENDL~ & FOLLOW_UP
macrobody = expression
is_mdef = "$" name ["(" placeholder { "," placeholder } ")"] ~DEF

FOLLOW_UP = `@` | `$` | modifier | symbol | EOF

#: components

expression = sequence { OR~ sequence }
sequence = ["§"] ( interleave | lookaround )    # "§" means all following terms mandatory
           { AND~ ["§"] ( interleave | lookaround ) }
interleave = difference { "°" ["§"] difference }
lookaround = flowmarker § part
difference = term [!`->` "-" § part]
term = (oneormore | counted | repetition | option | pure_elem) [MOD_SYM~ drop]
part = (oneormore | pure_elem) [MOD_SYM~ drop]

#: tree-reduction-markers aka "AST-hints"

drop = "DROP" | "Drop" | "drop" | "SKIP" | "Skip" | "skip"
hide = "HIDE" | "Hide" | "hide" | "DISPOSE" | "Dispose" | "dispose"

#: elements

countable = option | oneormore | element
pure_elem = element § !ANY_SUFFIX    # element strictly without a suffix
element = [retrieveop] symbol !is_def
        | literal
        | plaintext
        | char_ranges
        | character ~
        | regexp
        | char_range
        | any_char
        | whitespace
        | group
        | macro !is_def
        | placeholder !is_def
        | parser    # a user defined parser
ANY_SUFFIX = /[?*+]/

#: flow-operators

flowmarker = "!" | "&"        # '!' negative lookahead, '&' positive lookahead
           | "<-!" | "<-&"    # '<-!' negative lookbehind, '<-&' positive lookbehind
retrieveop = "::" | ":?" | ":"    # '::' pop, ':?' optional pop, ':' retrieve

#: groups

group = "(" no_range §expression ")"
oneormore = "{" no_range expression "}+" | element "+"
repetition = "{" no_range §expression "}" | element "*" no_range
option = !char_range "[" §expression "]" | element "?"
counted = countable range | countable TIMES~ multiplier | multiplier TIMES~ §countable

range = RNG_OPEN~ multiplier [ RNG_DELIM~ multiplier ] RNG_CLOSE~
no_range = !multiplier | &multiplier TIMES    # should that be &(multiplier TIMES)??
multiplier = /\d+/~

#: leaf-elements

parser = "@" name "(" §[argument] ")"    # a user defined parser
argument = literal | name~

symbol = SYM_REGEX ~    # e.g. expression, term, parameter_list
literal = /"(?:(?<!\\)\\"|[^"])*?"/~    # e.g. "(", '+', 'while'
        | /'(?:(?<!\\)\\'|[^'])*?'/~    # whitespace following literals will be ignored tacitly.
        | /’(?:(?<!\\)\\’|[^’])*?’/~
plaintext = /`(?:(?<!\\)\\`|[^`])*?`/~    # like literal but does not eat whitespace
          | /´(?:(?<!\\)\\´|[^´])*?´/~
regexp = RE_LEADIN RE_CORE RE_LEADOUT ~    # e.g. /\w+/, ~/#.*(?:\n|$)/~

char_range = `[` [`^`] { restricted_range_desc }+ "]"
restricted_range_desc = character [ `-` character ]
char_ranges = RE_LEADIN range_chain { `|` range_chain } RE_LEADOUT ~
range_chain = `[` [`^`] { range_desc }+ `]`
range_desc = (character | free_char) [ `-` (character | free_char) ]

character = (CH_LEADIN | `\x` | `\u` | `\U`) HEXCODE
free_char = /[^\n\[\]\\]/ | /\\[nrtfv`´'"(){}\[\]\/\\]/
any_char = "."
whitespace = /~/~    # insignificant whitespace

#: macros

macro = "$" name "(" no_range expression { "," no_range expression } ")"
placeholder = "$" name !`(` ~
name = SYM_REGEX

#: delimiters

EOF = !/./

DEF = `=`
OR = `|`
AND = ``
ENDL = ``

RNG_OPEN = `{`
RNG_CLOSE = `}`
RNG_DELIM = `,`
TIMES = `*`

RE_LEADIN = `/`
RE_LEADOUT = `/`

CH_LEADIN = `0x`

MOD_SYM = `->`    # symbol for postfix modifier

#: basic-regexes

RE_CORE = /(?:(?<!\\)\\(?:\/)|[^\/])*/    # core of a regular expression, i.e. the dots in /.../
SYM_REGEX = /(?!\d)\w+/    # regular expression for symbols
HEXCODE = /(?:[A-Fa-f1-9]|0(?!x)){1,8}/

#: error-markers

RAISE_EXPR_WO_BRACKETS = ``
- class EBNFCompiler(grammar_name='DSL', grammar_source='')[source]#
Generates a Parser from an abstract syntax tree of a grammar specified in EBNF-Notation.
Usually, this class will not be instantiated or instances of this class be called directly. Rather, high-level functions like create_parser() or compileEBNF() will be used to generate callable Grammar-objects or Python-source-code from an EBNF-grammar.
Instances of this class must be called with the root-node of the abstract syntax tree from an EBNF-specification of a formal language. The returned value is the Python-source-code of a Grammar class for this language that can be used to parse texts in this language. See classes compile.Compiler and parser.Grammar for more information.
Additionally, class EBNFCompiler provides helper methods to generate code-skeletons for a preprocessor, AST-transformation and full compilation of the formal language. The names of these methods start with the prefix gen_.
- Variables:
current_symbols – During compilation, a list containing the root node of the currently compiled definition as first element and then the nodes of the symbols that are referred to in the currently compiled definition.
cache_literal_symbols – A cache for all symbols that are defined by literals, e.g. head = "<head>". This is used by the on_expression()-method.
rules – Dictionary that maps rule names to a list of Nodes that contain symbol-references in the definition of the rule. The first item in the list is the node of the rule definition itself. Example:
alternative = a | b
Now [node.content for node in self.rules['alternative']] yields ['alternative = a | b', 'a', 'b']
referred_symbols_cache – A dictionary that caches the results of method referred_symbols(). referred_symbols() maps a symbol to the set of symbols that are directly or indirectly referred to in the definition of the symbol.
directly_referred_cache – A dictionary that caches the results of method directly_referred_symbols(), which yields the set of symbols that are referred to in the definition of a particular symbol.
referred_otherwise – A set of symbols which are directly referred to in a directive, macro or macro-symbol. It does not matter whether these symbols are reachable (i.e. directly or indirectly referred to) from the root-symbol.
symbols – A mapping of symbol names to their usages (not their definition!) in the EBNF source.
py_symbols – A set of symbols that are referred to in the grammar, but are (or must be) defined in Python-code outside the Grammar-class resulting from the compilation of the EBNF-source, as, for example, references to user-defined custom-parsers. (See Custom)
variables – A set of symbol names that are used with the Pop or Retrieve operator. Because the values of these symbols need to be captured they are called variables. See test_parser.TestPopRetrieve for an example.
forward – A set of symbols that require a forward operator.
definitions – A dictionary of definitions. Other than rules this maps the symbols to their compiled definienda.
macros – A dictionary that maps macro names to the macro-definition, or, more precisely, to a tuple of the node of the macro-symbol, the string-list of macro arguments and the node of the AST that is substituted for the macro-symbol.
macro_stack – A stack (i.e. list) of macro names needed to ensure that macro calls are not recursively nested.
required_keywords – A list of keywords (like comment__ or whitespace__) that need to be defined at the beginning of the grammar class because they are referred to later.
deferred_tasks – A list of callables that is filled during compilation, but that will be executed only after compilation has finished. Typically, it contains semantic checks that require information that is only available upon completion of compilation.
root_symbol – The name of the root symbol.
directives – A record of all directives and their default values.
defined_directives – A dictionary of all directives that have already been defined, mapped onto the list of nodes where they have been (re-)defined. Except for those directives contained in EBNFDirectives.REPEATABLE_DIRECTIVES, directives must only be defined once.
consumed_custom_errors – A set of symbols for which a custom error has been defined and(!) consumed during compilation. This makes it possible to add a compiler error in those cases where (i) an error message has been defined but will never be used or (ii) an error message is accidentally used twice. For examples, see test_ebnf.TestErrorCustomization.
consumed_skip_rules – The same as consumed_custom_errors only for in-series-resume-rules (aka ‘skip-rules’) for Series-parsers.
P – a dictionary that maps parser class names to qualified names in cases a parser class name has also been used as a symbol name in the grammar. (e.g. Text -> parser_namespace__.Text)
re_flags – A set of regular expression flags to be added to all regular expressions found in the current parsing process
python_src – A string that contains the python source code that was the outcome of the last EBNF-compilation.
grammar_name – The name of the grammar to be compiled
grammar_source – The source code of the grammar to be compiled.
- assemble_parser(definitions: List[Tuple[str, str]], root_symbol: str) str[source]#
Creates the Python code for the parser after compilation of the EBNF-Grammar
- check_rx(node: Node, rx: str, smartRE: bool = False) str[source]#
Checks whether the string rx represents a valid regular expression. Makes sure that multi-line regular expressions are prepended by the multi-line-flag. Returns the regular expression string.
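A check of this kind can be sketched with Python's standard re module alone. The sketch below is not DHParser's implementation: the function name checked_regex is hypothetical, and it assumes that the "multi-line-flag" corresponds to Python's verbose flag (?x), which lets a pattern span several lines.

```python
import re


def checked_regex(rx: str) -> str:
    """Return rx, prefixed with the verbose flag if it spans several lines.

    Raises re.error if the (possibly prefixed) pattern does not compile.
    Assumption: "(?x)" stands in for the flag that check_rx() prepends.
    """
    if '\n' in rx and not rx.startswith('(?x)'):
        rx = '(?x)' + rx
    re.compile(rx)  # validation only; the compiled pattern is discarded
    return rx


single = checked_regex(r'\d+')
multi = checked_regex('\\d+   # integer part\n\\.\\d+ # fractional part')
```

A single-line pattern is returned unchanged, while the two-line pattern gains the flag prefix and can then be matched against a string such as "3.14".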
- directly_referred(symbol: str) FrozenSet[str][source]#
Returns the set of symbols that are referred to in the definition of symbol.
- extract_counted(node) Tuple[Node, Tuple[int, int]][source]#
Returns the content of a counted-node in a normalized form: (node, (n, m)) where node is root of the sub-parser that is counted, i.e. repeated n or n up to m times.
- extract_range(node) Tuple[int, int][source]#
Returns the range-value of a range-node as a tuple of two integers.
- gen_compiler_skeleton() str[source]#
Returns Python-skeleton-code for a Compiler-class for the previously compiled formal language.
- gen_preprocessor_skeleton() str[source]#
Returns Python-skeleton-code for a preprocessor-function for the previously compiled formal language.
- gen_transformer_skeleton() str[source]#
Returns Python-skeleton-code for the AST-transformation for the previously compiled formal language.
- literal_rx(content: str, left: str, right: str, escape_braces=False) str[source]#
Returns a regular expression string to parse a literal. This can be used to optimize the parsing of literals with adjacent whitespace by defining a
parse:SmartRe-parser with a regular expression.
- make_search_rule(node: Node, nd: Node, kind: str) ReprType[source]#
Generates a search rule, which can be either a string for simple string search or a regular expression from the node’s content. Returns an empty string in case the node is neither regexp nor literal.
- Parameters:
node – The node of the directive
nd – The node containing the AST of the search rule
kind – The kind of the search rule, which must be one of “resume”, “skip”, “error”
- non_terminal(node: Node, parser_class: str, custom_args: List[str] = []) str[source]#
Compiles any non-terminal, where parser_class indicates the Parser class name for the particular non-terminal.
- optimize_definitions_order(definitions: List[Tuple[str, str]])[source]#
Reorders the definitions to minimize the number of Forward declarations. Forward declarations remain inevitable only where recursion is involved.
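Why recursion makes forward declarations unavoidable can be illustrated with a plain depth-first ordering. This is only a sketch under stated assumptions: the function order_definitions and the dictionary layout (symbol name mapped to the set of directly referred symbols) are hypothetical, and the sketch does not claim to reproduce the ordering that DHParser actually computes or to find the minimum number of forward declarations.

```python
from typing import Dict, List, Set, Tuple


def order_definitions(direct: Dict[str, Set[str]], root: str) -> Tuple[List[str], Set[str]]:
    """Order symbols so that each one is defined after the symbols it refers to.

    Returns (ordered symbols, symbols needing a forward declaration).
    A symbol needs a forward declaration if it is referred to while its
    own definition is still being processed, i.e. if it lies on a cycle.
    """
    ordered: List[str] = []
    forward: Set[str] = set()
    done: Set[str] = set()
    in_progress: Set[str] = set()

    def visit(symbol: str) -> None:
        if symbol in done:
            return
        if symbol in in_progress:
            forward.add(symbol)       # back-reference: recursion detected
            return
        in_progress.add(symbol)
        for referred in sorted(direct.get(symbol, ())):
            visit(referred)
        in_progress.discard(symbol)
        done.add(symbol)
        ordered.append(symbol)        # all dependencies are already listed

    visit(root)
    return ordered, forward


grammar = {'expr': {'term'}, 'term': {'factor'},
           'factor': {'NUMBER', 'expr'}, 'NUMBER': set()}
ordered, forward = order_definitions(grammar, 'expr')
```

In this toy grammar only expr ends up in the forward set, because factor refers back to it; all other symbols can be defined before their first use.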
- prepare_literal(node: Node) Tuple[str, str, str][source]#
Returns content, left_whitespace, right_whitespace for a literal-node.
- recursive_paths(symbol: str) FrozenSet[Tuple[str, ...]][source]#
Returns the recursive paths from symbol to itself. If symbol is not recursive, the returned set (of paths) will be empty. This method exists only for debugging (so far…).
- referred_symbols(symbol: str) FrozenSet[str][source]#
Returns the set of all symbols that are directly or indirectly referred to in the definition of symbol. The symbol itself can be contained in this set, if and only if its rule is recursive.
referred_symbols() only yields reliable results if the collection of definitions has been completed.
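The relation between directly and indirectly referred symbols amounts to a transitive closure. The following sketch illustrates that idea on a plain dictionary; the function name transitive_references and the data layout are assumptions made for illustration and are not taken from DHParser's source.

```python
from typing import Dict, FrozenSet, Set


def transitive_references(direct: Dict[str, Set[str]], symbol: str) -> FrozenSet[str]:
    """Collect all symbols reachable from symbol via direct references."""
    collected: Set[str] = set()
    pending = list(direct.get(symbol, ()))
    while pending:
        current = pending.pop()
        if current not in collected:
            collected.add(current)
            pending.extend(direct.get(current, ()))
    return frozenset(collected)


grammar = {'expr': {'term'}, 'term': {'factor'},
           'factor': {'NUMBER', 'expr'}, 'NUMBER': set()}
reachable = transitive_references(grammar, 'expr')
```

Because factor refers back to expr, the symbol expr is contained in its own result set, which matches the statement that a symbol appears in the set if and only if its rule is recursive. The result is only complete if the dictionary already contains all definitions.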
- reorder_alternatives(node)[source]#
The following algorithm reorders literal alternatives, so that earlier alternatives do not pre-empt later alternatives, e.g. ‘ID’ | ‘IDREF’ will be reordered as ‘IDREF’ | ‘ID’
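The effect can be sketched with a deliberately simplified strategy: a stable sort by descending length guarantees that no earlier literal is a proper prefix of a later one. The function name reorder_literals is hypothetical, and the sketch may move more alternatives than the algorithm used by DHParser.

```python
from typing import List


def reorder_literals(alternatives: List[str]) -> List[str]:
    """Return the literals ordered so that no literal pre-empts a later one.

    A proper prefix is always shorter than the literal it is a prefix of,
    so sorting by descending length places every prefix after it. Python's
    sort is stable, hence literals of equal length keep their order.
    """
    return sorted(alternatives, key=len, reverse=True)


reordered = reorder_literals(['ID', 'IDREF'])
```

For literals that are not prefixes of each other the order is irrelevant for matching, which is why the coarse length criterion suffices for this illustration.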
- set_grammar_name(grammar_name: str = '', grammar_source: str = '')[source]#
Changes the grammar name and source.
The grammar name and the source text are metadata that do not affect the compilation process. They are used to name and annotate the output. Returns self.
- exception EBNFCompilerError[source]#
Error raised by EBNFCompiler class. (Not compilation errors in the strict sense, see CompilationError in module dsl)
- class EBNFDirectives[source]#
A Record that keeps information about compiler directives during the compilation process.
- Variables:
whitespace – the regular expression string for (insignificant) whitespace
comment – the regular expression string for comments
literalws – automatic “whitespace eating” next to literals. Can be either ‘left’, ‘right’, ‘none’, ‘both’
tokens – set of the names of preprocessor tokens
filter – mapping of symbols to python match functions that will be called on any retrieve / pop - operations on these symbols
error – mapping of symbols to tuples of match conditions and customized error messages. A match condition can be either a string or a regular expression. The first error message where the search condition matches will be displayed. An empty string ‘’ as search condition always matches, so in case of multiple error messages, this condition should be placed at the end.
skip – mapping of symbols to a list of search expressions. A search expression can be either a string or a regular expression. The closest match is the point of reentry for the series- or interleave-parser when a mandatory item failed to match the following text.
resume – mapping of symbols to a list of search expressions. A search expression can be either a string or a regular expression. The closest match is the point of reentry after a parsing error has occurred. Other than the skip field, this configures resuming after the failing parser (parser.Series or parser.Interleave) has returned.
disposable – A regular expression to identify “disposable” symbols, i.e. symbols that will not appear as tag-names. Instead, the nodes produced by the parsers associated with these symbols will yield anonymous nodes just like “inner” parsers that are not directly assigned to a symbol.
drop – A set that may contain the elements DROP_STRINGS, DROP_WSP, DROP_REGEXP or any name of a symbol of a disposable parser (e.g. ‘_linefeed’) the results of which will be dropped during the parsing process, already.
reduction – The reduction level (0-3) for early tree-reduction during the parsing stage.
optimizations – Turns on optimizing parser by substituting SmartRE-parsers for compound parsers when possible. (see “optimizations” in DHParser.config.py). An empty set means all optimizations are turned off.
flavor – Selects the EBNF-flavor (or “syntax-variant”) to be used.
_super_ws – Cache for the “super whitespace” which is a regular expression that merges whitespace and comments. This property should only be accessed after the whitespace- and comment-field have been filled with the values parsed from the EBNF source.
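The lookup order described for the error field (first matching condition wins, the empty string matches always) can be sketched as follows. The function select_error_message and its parameters are hypothetical; in particular, it is an assumption of this sketch that a string condition must occur directly at the error location and that a regular expression is matched at that location.

```python
import re
from typing import List, Tuple, Union

Condition = Union[str, re.Pattern]


def select_error_message(rules: List[Tuple[Condition, str]],
                         following_text: str,
                         default: str = 'Parser stopped here') -> str:
    """Return the message of the first rule whose condition matches."""
    for condition, message in rules:
        if isinstance(condition, str):
            # The empty string is a prefix of every text and thus always matches.
            matched = following_text.startswith(condition)
        else:
            matched = condition.match(following_text) is not None
        if matched:
            return message
    return default


rules = [('(', 'Unexpected opening bracket'),
         (re.compile(r'\d'), 'Unexpected number'),
         ('', 'Unexpected input')]
message = select_error_message(rules, '5 + 3')
```

Placing the empty-string condition last turns it into a catch-all; placed first, it would shadow every other rule.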
- compile_ebnf(ebnf_source: str, branding: str = 'DSL', *, preserve_AST: bool = False) CompilationResult[source]#
Compiles an ebnf_source (file_name or EBNF-string) and returns a tuple of the python code of the compiler, a list of warnings or errors and the abstract syntax tree (if called with the keyword argument
preserve_AST=True) of the EBNF-source. This function is merely syntactic sugar.
- compile_ebnf_ast(ast: RootNode) str[source]#
Compiles the abstract-syntax-tree of an EBNF-source-text into python code of a class derived from parse.Grammar that can parse text following the grammar described with the EBNF-code.
- get_ebnf_grammar() Grammar[source]#
Returns a thread-local EBNF-Grammar-object for parsing EBNF sources.
- grammar_changed(grammar_class, grammar_source: str) bool[source]#
Returns True if grammar_class does not reflect the latest changes of grammar_source
- Parameters:
grammar_class – the parser class representing the grammar or the file name of a compiler suite containing the grammar
grammar_source – File name or string representation of the EBNF code of the grammar
- Returns (bool):
True, if the source text of the grammar is different from the source from which the grammar class was generated
- parse_ebnf(ebnf: str) Node[source]#
Parses an EBNF-source-text and returns the concrete syntax tree of the EBNF-code.
Module nodetree#
Module nodetree encapsulates the functionality for creating and handling
trees of nodes, in particular, syntax-trees. This includes serialization
and deserialization of node-trees, navigating and searching node-trees as well
as annotating node-trees with attributes and error messages.
nodetree can also be seen as a document-tree-library
for handling any kind of XML or S-expression-data. In contrast to
ElementTree
and lxml, nodetree maps mixed content to dedicated nodes,
which simplifies the programming of algorithms that run on the data stored
in the (XML-)tree.
The source code of module nodetree consists of four main sections:
Node-classes, i.e. Node, FrozenNode and RootNode, as well as a number of top-level functions closely related to these. The Node-classes in turn provide several groups of functionality:
    Capturing segments of documents and organizing them in trees. (A node is either a “leaf”-node with string-content or a “branch”-node with children.)
    Retaining its source-position in the document (important for error reporting, in particular when errors occur later in the processing-pipeline.)
    Storing and retrieving of (XML-)attributes. Like XML-attributes, attribute-names are strings, but other than XML-attributes, attributes attached to Node-objects can take any Python-type as value that is serializable with “str()”.
    Tree-traversal, in particular node- and path-selection based on arbitrary criteria (passed as match-node or match-path-function)
    Experimental (XML-)milestone-support: Node.milestone_segment() and Node.split(). See also: ContentMapping
    A very simple method for tree-“evaluation”: (More elaborate scaffolding for tree-evaluation is found in traverse and compile.)
    Functions for serialization and deserialization as XML, S-Expression, JSON as well as conversion to and from ElementTree/LXML-representations.
    Class RootNode serving as both root-node of the tree and a hub for storing data for the tree as a whole (as, for example, the list of errors that have occurred during parsing or further processing) as well as information on the current processing-stage.
Attribute-handling: Functions to handle attribute-values that are organized as blank-separated sets of strings like, for example, the class-attribute in HTML.
Path-Navigation: Functions that help to navigate with paths through the tree. A path is the list of nodes that connects the root-node of the tree with one particular node inside or at the leaf of the tree.
Context-mappings: A class (ContentMapping) for relating the flat string-content of a document-tree to its structure. This allows using the string-content for searching in the document and then switching to the tree-structure to manipulate it.
    An (experimental) special case are serialization-mappings that map positions within a serialized version of the syntax-tree (XML, S-Expression or SXML) to locations within the tree (Node, Node-position within the serialization, offset and a flag indicating whether the position falls into the opening-tag, closing-tag or content of the Node.)
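The attribute-handling section mentioned in the list above treats an attribute value as a blank-separated set of strings. How such a value can be manipulated is sketched below with three hypothetical helper functions (tokens_of, with_token, without_token); these names are illustrative only and are not the functions that module nodetree provides.

```python
from typing import List


def tokens_of(value: str) -> List[str]:
    """Split a blank-separated attribute value into its tokens."""
    return value.split()


def with_token(value: str, token: str) -> str:
    """Return value with token added, unless it is already present."""
    tokens = tokens_of(value)
    if token not in tokens:
        tokens.append(token)
    return ' '.join(tokens)


def without_token(value: str, token: str) -> str:
    """Return value with every occurrence of token removed."""
    return ' '.join(t for t in tokens_of(value) if t != token)


css_class = with_token('bold italic', 'small')
```

Keeping the value a plain string means that it can be stored in an attribute and serialized as XML without any further conversion.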
- class ContentMapping(origin: ~nodetree.Node, select: PathSelector = <function LEAF_PATH>, ignore: PathSelector = <function deny>, greedy: bool = True, divisibility: ~typing.Dict[str, ~typing.Container] | ~typing.Container | str = frozenset({':EMPTY', ':RegExp', ':Text', ':Whitespace'}), chain_attr_name: str = '', auto_cleanup: bool = True)[source]#
ContentMapping represents a path-mapping of the string-content of all or a specific selection of the leaf-nodes of a tree. A content-mapping is an ordered mapping of the first text position of every (selected) leaf-node to the path of this node.
Path-mappings make it possible to search the flat document with regular expressions or simple text search and then change the tree at the appropriate places, for example by adding markup (i.e. nodes) in these places.
The ContentMapping class provides methods for adding markup-nodes. In cases where the new markup-nodes cut across the existing tree-hierarchy, the markup-method takes care of splitting up either the newly created or some of the existing nodes to fit in the markup.
Public properties:
- Variables:
path_list – A list of paths covering the selected leaves of the tree from left to right.
pos_list – The list of positions of the paths in path_list
Location-related instance variables:
- Variables:
origin – The origin of the tree for which a path mapping shall be generated. This can be a branch of another tree and therefore does not need to be a RootNode-object.
select_func – Only leaf-paths for which this is true will be considered when generating the content-mapping. This function integrates both the select- and ignore-criteria passed to the constructor of the class. Note that the select-criterion must only accept leaf-paths. Otherwise, a ValueError will be raised.
ignore_func – The ignore function derives from the ignore-parameter of the __init__()-constructor of class ContentMapping.
content – The string content of the selected parts of the tree.
Markup-related instance variables:
- Variables:
greedy – If True, the algorithm for adding markup minimizes the number of required cuts by switching child and parent nodes if the markup fills up a node completely as well as including empty nodes in the markup. In any case, the string content of the added markup remains the same, but it might cover more tags than strictly necessary.
chain_attr_name – An attribute that will receive one and the same identifier as value for all nodes belonging to the chain of one split-up node.
auto_cleanup – Update the content mapping after the markup has been finished. Should always be true, if it is intended to reuse the same content mapping for further markups in the same range or other purposes.
- Parameters:
divisibility –
A dictionary that contains the information which tags (or nodes as identified by their name) are “harder” than other tags. Each key-tag in the dictionary is harder than (i.e. is allowed to split up) all tags in the associated value (which is a set of node-names or, for that matter, tag-names). Tag or node-names associated to the wildcard key * can be split by any tag.
If the markup-method reaches nodes that cannot be split, it will split the markup-node instead to cover the string to be marked up completely.
- get_node_index(node: Node, reverse: bool = False) int[source]#
Returns the index in the path_list of the first or last (if reverse is True) path that contains node or -1, if the node cannot be found. Note: If node is a leaf node, the first and last index are the same. Otherwise, it occurs (if at all) more often than once if it or any of its children has more than one child.
Examples:
>>> tree = parse_sxpr('(A (B (x "1") (y "2")) (C (z "3")))')
>>> cm = ContentMapping(tree)
>>> B = tree.pick('B')
>>> cm.get_node_index(B)
0
>>> cm.get_node_index(B, reverse=True)
1
>>> cm.get_node_index(tree.pick('y'))
1
>>> cm.get_node_index(tree.pick('z'))
2
>>> cm.get_node_index(tree.pick('A', include_root=True), reverse=True)
2
- get_node_position(node: Node, reverse: bool = False) int[source]#
Returns the string-position of first or last + 1 (if reverse is True) character of the first or last occurrence of node within the mapping. If node is not contained in any path of the mapping, -1 will be returned. If node is a leaf node it occurs only once (if at all) in the mapping. Otherwise, it occurs (if at all) more often than once if it or any of its children has more than one child.
Independent of whether the node is a leaf-node, the position of the first and last + 1 character is the same if and only if the string content of node is empty.
Examples:
>>> tree = parse_sxpr('(A (B (x "1") (y "2")) (C (z "3")))')
>>> cm = ContentMapping(tree)
>>> B = tree.pick('B')
>>> cm.get_node_position(B)
0
>>> cm.get_node_position(B, reverse=True)
2
>>> cm.get_node_position(tree.pick('y'))
1
>>> z = tree.pick('z')
>>> cm.get_node_position(z)
2
>>> cm.get_node_position(z, reverse=True)
3
>>> cm.get_node_position(tree.pick('A', include_root=True), reverse=True)
3
- get_path_and_offset(pos: int, left_biased: bool = False, index_out: List[int] | None = None) ContentLocation[source]#
Returns the path and relative position within the leaf-node of the path.
- Parameters:
pos – the position in the string-content for which the path and offset should be determined.
left_biased – yields the location after the end of the previous path rather than the location at the very beginning of the next path. Default value is “False”.
index_out – the index of the path in the content mapping’s path-list will be appended to index_out (optional parameter!)
- Returns:
tuple (path, offset) where the offset is the position of pos relative to the actual position of the last node in the path.
- Raises:
IndexError if not 0 <= position < length of document
- get_path_index(pos: int, left_biased: bool = False) int[source]#
Yields the index for the path in the given content-mapping that contains the position pos.
- Parameters:
pos – a position in the content of the tree for which the path mapping cm was generated
left_biased – yields the location after the end of the previous path rather than the location at the very beginning of the next path. Default value is “False”.
- Returns:
the integer index of the path in self.path_list that covers the given position pos
- Raises:
IndexError if not 0 <= position < length of document
Example:
>>> tree = parse_sxpr('(a (b "012") (c (d "34") (e "56")))')
>>> cm = ContentMapping(tree)
>>> i = cm.get_path_index(4)
>>> path = cm.path_list[i]
>>> print(pp_path(path, 1, ', '))
a, c, d "34"
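Since a content mapping records the first text position of every leaf in ascending order, looking up the path index for a position is essentially a binary search. The sketch below illustrates this with the standard bisect module on a bare list of positions. The function path_index is hypothetical, ignores empty leaves, and is not DHParser's implementation; its treatment of left_biased follows the description given above.

```python
from bisect import bisect_right
from typing import List


def path_index(pos_list: List[int], total_length: int, pos: int,
               left_biased: bool = False) -> int:
    """Return the index of the leaf whose text span contains pos.

    pos_list holds the first text position of every leaf in ascending
    order; total_length is the length of the whole string content.
    """
    if not 0 <= pos < total_length:
        raise IndexError(f'position {pos} out of range')
    index = bisect_right(pos_list, pos) - 1
    # At the exact border between two leaves, a left-biased lookup
    # yields the leaf that ends there rather than the one that starts.
    if left_biased and index > 0 and pos_list[index] == pos:
        index -= 1
    return index


# Leaves "012", "34" and "56" start at positions 0, 3 and 5.
index = path_index([0, 3, 5], 7, 4)
```

With the positions of the example above, position 4 falls into the second leaf (index 1), which corresponds to the path a, c, d.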
- insert_node(pos: int, node: Node, left_biased: bool = False) NodeLocation[source]#
Inserts a node at a specific position into the last or possibly the second to last node in the path from the content mapping that covers this position. Returns the parent of the newly inserted node.
- iterate_paths(start_pos: int, end_pos: int, left_biased: bool = False) Iterator[Path][source]#
Yields all paths from position start_pos up to and including position end_pos. Example:
>>> tree = parse_sxpr('(a (b "123") (c (d "456") (e "789")) (f "ABC"))')
>>> cm = ContentMapping(tree)
>>> [[nd.name for nd in p] for p in cm.iterate_paths(1, 12)]
[['a', 'b'], ['a', 'c', 'd'], ['a', 'c', 'e'], ['a', 'f']]
- markup(start_pos: int, end_pos: int, name: str, *attr_dict, **attributes) NodeLocation[source]#
Marks the span [start_pos, end_pos[ up by adding one or more Nodes with name, possibly cutting through divisible nodes. Returns the nearest common ancestor of start_pos and end_pos.
- Parameters:
cm – A content mapping of the document (or a part thereof) where the markup shall be inserted. See generate_content_mapping()
start_pos – The string-position of the first character to be marked up. Note that this is the position in the string-content of the tree over which the content mapping has been generated and not the position in the XML or any other serialization of the tree!
end_pos – The string-position after the last character to be included in the markup. Similar to the slicing of Python lists or strings, the beginning and ending define a half-open interval, [start_pos, end_pos[ . The character indexed by end_pos is not included in the markup. Also, keep in mind that end_pos is the position in the string-content of the tree over which the content mapping has been generated and not the position in the XML or any other serialization of the tree!
name – The name, or “tag-name” in XML-terminology, of the element (or tag) to be added.
attr_dict – A dictionary of attributes that will be added to the newly created tag.
attributes – Alternatively, the attributes can also be passed as a list of named parameters.
- Returns:
The nearest (from the top of the tree) node, i.e. “ancestor”, within which the entire markup lies as well as the first path-index of that ancestor.
Examples:
>>> from DHParser.toolkit import printw
>>> tree = parse_sxpr('(X (l ",.") (A (O "123") (P "456")) (m "!?") '
...                   ' (B (Q "789") (R "abc")) (n "+-"))')
>>> X = copy.deepcopy(tree)
>>> t = ContentMapping(X)
>>> _ = t.markup(2, 8, 'em')
>>> printw(X.as_sxpr(flatten_threshold=-1))
(X (l ",.") (A (em (O "123") (P "456"))) (m "!?") (B (Q "789") (R "abc")) (n "+-"))
>>> Y = copy.deepcopy(X)
>>> t = ContentMapping(Y, divisibility={'bf': {':Text', 'em', 'A', 'P'}})
>>> _ = t.markup(0, 7, 'bf')
>>> printw(Y.as_sxpr(flatten_threshold=-1))
(X (bf (l ",.") (A (em (O "123") (P "45")))) (A (em (P "6"))) (m "!?") (B (Q "789") (R "abc")) (n "+-"))
>>> X = copy.deepcopy(tree)
>>> t = ContentMapping(X)
>>> _ = t.markup(2, 10, 'em')
>>> printw(X.as_sxpr(flatten_threshold=-1))
(X (l ",.") (em (A (O "123") (P "456")) (m "!?")) (B (Q "789") (R "abc")) (n "+-"))
>>> X = copy.deepcopy(tree)
>>> t = ContentMapping(X, divisibility={'A'})
>>> _ = t.markup(5, 10, 'em')
>>> printw(X.as_sxpr(flatten_threshold=-1))
(X (l ",.") (A (O "123")) (em (A (P "456")) (m "!?")) (B (Q "789") (R "abc")) (n "+-"))
>>> X = copy.deepcopy(tree)
>>> t = ContentMapping(X)
>>> _ = t.markup(2, 13, 'em')
>>> printw(X.as_sxpr(flatten_threshold=-1))
(X (l ",.") (em (A (O "123") (P "456")) (m "!?")) (B (em (Q "789")) (R "abc")) (n "+-"))
>>> X = copy.deepcopy(tree)
>>> t = ContentMapping(X)
>>> _ = t.markup(5, 16, 'em')
>>> printw(X.as_sxpr(flatten_threshold=-1))
(X (l ",.") (A (O "123") (em (P "456"))) (em (m "!?") (B (Q "789") (R "abc"))) (n "+-"))
>>> X = copy.deepcopy(tree)
>>> t = ContentMapping(X)
>>> _ = t.markup(5, 13, 'em')
>>> printw(X.as_sxpr(flatten_threshold=-1))
(X (l ",.") (A (O "123") (em (P "456"))) (em (m "!?")) (B (em (Q "789")) (R "abc")) (n "+-"))
>>> X = copy.deepcopy(tree)
>>> t = ContentMapping(X)
>>> _ = t.markup(6, 12, 'em')
>>> printw(X.as_sxpr(flatten_threshold=-1))
(X (l ",.") (A (O "123") (P (:Text "4") (em "56"))) (em (m "!?")) (B (Q (em "78") (:Text "9")) (R "abc")) (n "+-"))
>>> X = copy.deepcopy(tree)
>>> t = ContentMapping(X)
>>> _ = t.markup(1, 17, 'em')
>>> printw(X.as_sxpr(flatten_threshold=-1))
(X (l (:Text ",") (em ".")) (em (A (O "123") (P "456")) (m "!?") (B (Q "789") (R "abc"))) (n (em "+") (:Text "-")))
>>> X = copy.deepcopy(tree)
>>> t = ContentMapping(X, divisibility={'em': {'l', 'n'}})
>>> _ = t.markup(1, 17, 'em')
>>> printw(X.as_sxpr(flatten_threshold=-1))
(X (l ",") (em (l ".") (A (O "123") (P "456")) (m "!?") (B (Q "789") (R "abc")) (n "+")) (n "-"))
- rebuild_mapping(start_pos: int, end_pos: int)[source]#
Reconstructs a particular section of the content mapping after the underlying tree has been restructured.
- Parameters:
start_pos – The string position of the beginning of the text-area that has been affected by earlier changes.
end_pos – The string position of the ending of the text-area that has been affected by earlier changes.
- rebuild_mapping_slice(first_index: int, last_index: int)[source]#
Reconstructs a particular section of the content mapping after the underlying tree has been restructured. Other than ContentMapping.rebuild_mapping(), the section that needs repairing is defined by the path indices and not the string positions.
- Parameters:
first_index – The index (not the position within the string-content!) of the first path that has been affected by restructuring of the tree. Use ContentMapping.get_path_index() to determine the path-index if only the position is known.
last_index – The index (not the position within the string-content!) of the last path that has been affected by restructuring of the tree. Use ContentMapping.get_path_index() to determine the path-index if only the position is known.
Examples:
>>> tree = parse_sxpr('(a (b (c "123") (d "456")) (e (f (g "789") (h "ABC")) (i "DEF")))')
>>> cm = ContentMapping(tree)
>>> print(cm)
0 -> a, b, c "123"
3 -> a, b, d "456"
6 -> a, e, f, g "789"
9 -> a, e, f, h "ABC"
12 -> a, e, i "DEF"
>>> b = tree.pick('b')
>>> b.result = (b[0], Node('x', 'xyz'), b[1])
>>> cm.rebuild_mapping_slice(0, 1)
>>> print(cm)
0 -> a, b, c "123"
3 -> a, b, x "xyz"
6 -> a, b, d "456"
9 -> a, e, f, g "789"
12 -> a, e, f, h "ABC"
15 -> a, e, i "DEF"
>>> cm.auto_cleanup = False
>>> common_ancestor, _ = cm.markup(10, 16, 'Y')
>>> print(common_ancestor.as_sxpr())
(e (f (g (:Text "7") (Y "89")) (Y (h "ABC"))) (i (Y "D") (:Text "EF")))
>>> print(cm)
0 -> a, b, c "123"
3 -> a, b, x "xyz"
6 -> a, b, d "456"
9 -> a, e, f, g (:Text "7") (Y "89")
12 -> a, e, f, h "ABC"
15 -> a, e, i (Y "D") (:Text "EF")
>>> a = cm.get_path_index(10)
>>> b = cm.get_path_index(16, left_biased=True)
>>> a, b
(3, 5)
>>> cm.rebuild_mapping_slice(3, 5)
>>> print(cm)
0 -> a, b, c "123"
3 -> a, b, x "xyz"
6 -> a, b, d "456"
9 -> a, e, f, g, :Text "7"
10 -> a, e, f, g, Y "89"
12 -> a, e, f, Y, h "ABC"
15 -> a, e, i, Y "D"
16 -> a, e, i, :Text "EF"
>>> tree = parse_sxpr('(a (b (c "123") (d "456")) (e (f (g "789") (h "ABC")) (i "DEF")))')
>>> cm = ContentMapping(tree, auto_cleanup=False)
>>> common_ancestor, _ = cm.markup(0, 6, 'Y')
>>> print(common_ancestor.as_sxpr())
(b (Y (c "123") (d "456")))
>>> a = cm.get_path_index(0)
>>> b = cm.get_path_index(6, left_biased=True)
>>> a, b
(0, 1)
>>> cm.rebuild_mapping_slice(a, b)
>>> print(cm)
0 -> a, b, Y, c "123"
3 -> a, b, Y, d "456"
6 -> a, e, f, g "789"
9 -> a, e, f, h "ABC"
12 -> a, e, i "DEF"
- select(criterion: NodeSelector, start_from: int = -1073741824, reverse: bool = False) Iterator[NodeLocation][source]#
See ContentMapping.select_if()
Example:
>>> tree = parse_sxpr('(A (B (x "1") (y "2")) (B "!") (C (z "3")))')
>>> cm = ContentMapping(tree)
>>> print([(i, nd.as_sxpr()) for nd, i in cm.select("y")])
[(1, '(y "2")')]
- select_if(match_func: NodeMatchFunction, start_from: int = -1073741824, reverse: bool = False) Iterator[NodeLocation][source]#
Yields the node and its path-index for all nodes that are matched by the match function. Searching starts from the path with the index start_from. Searching within a path starts from the end of the path and only the last matching node in every path is returned. Only the path-index from the first path that contains a matching node is returned. Subsequent paths that contain the same node are skipped.
Examples:
>>> tree = parse_sxpr('(A (B (x "1") (y "2")) (B "!") (C (z "3")))')
>>> cm = ContentMapping(tree)
>>> mf = create_match_function("B")
>>> print([(i, nd.as_sxpr()) for nd, i in cm.select_if(mf)])
[(0, '(B (x "1") (y "2"))'), (2, '(B "!")')]
>>> print([(i, nd.as_sxpr()) for nd, i in cm.select_if(mf, reverse=True)])
[(2, '(B "!")'), (1, '(B (x "1") (y "2"))')]
>>> i = cm.get_node_index(tree.pick("B", reverse=True))
>>> print([(i, nd.as_sxpr()) for nd, i in cm.select_if(mf, start_from=i)])
[(2, '(B "!")')]
>>> i = cm.get_node_index(tree.pick("y"))
>>> print([(i, nd.as_sxpr()) for nd, i in cm.select_if(mf, start_from=i, reverse=True)])
[(1, '(B (x "1") (y "2"))')]
- class DHParser_JSONEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)[source]#
A JSON-encoder that also encodes nodetree.Node as valid json objects. Node-objects are encoded using Node.as_json.
- default(o)[source]#
Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).
For example, to support arbitrary iterators, you could implement default like this:
def default(self, o):
    try:
        iterable = iter(o)
    except TypeError:
        pass
    else:
        return list(iterable)
    # Let the base class default method raise the TypeError
    return JSONEncoder.default(self, o)
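How an encoder subclass of this kind hooks into json.dumps can be sketched with the standard library alone. The classes Leaf and LeafEncoder below are hypothetical stand-ins invented for this illustration; they are neither DHParser's Node nor DHParser_JSONEncoder.

```python
import json


class Leaf:
    """Hypothetical stand-in for a node-like object with name and content."""

    def __init__(self, name: str, content: str):
        self.name = name
        self.content = content

    def to_json_obj(self) -> list:
        return [self.name, self.content]


class LeafEncoder(json.JSONEncoder):
    """Encodes Leaf objects; defers everything else to the base class."""

    def default(self, o):
        if isinstance(o, Leaf):
            return o.to_json_obj()
        # The base implementation raises TypeError for unsupported types.
        return super().default(o)


encoded = json.dumps({'tree': Leaf('word', 'hello')}, cls=LeafEncoder)
```

The encoder is passed via the cls parameter of json.dumps, so that node-like objects may appear anywhere inside an otherwise ordinary JSON structure.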
- class FrozenNode(name: str, result: ResultType, leafhint: bool = True)[source]#
FrozenNode is an immutable kind of Node, i.e. it must not be changed after initialization. The purpose is mainly to allow certain kinds of optimizations, like not having to instantiate empty nodes (because they are always the same and will be dropped while parsing, anyway) and to be able to trigger errors if the program tries to treat such temporary nodes as regular ones. (See DHParser.parse)
Frozen nodes must only be used temporarily during parsing or tree-transformation and should not occur in the product of the transformation anymore. This can be verified with tree_sanity_check(). They may also be used as comparison criterion for content equality when picking or selecting nodes or paths from a tree (see create_match_function()).
- property attr#
Returns a dictionary of XML-attributes attached to the node.
Examples:
>>> node = Node('', '')
>>> print('Any attributes present?', node.has_attr())
Any attributes present? False
>>> node.attr['id'] = 'identificator'
>>> node.attr
{'id': 'identificator'}
>>> node.attr['id']
'identificator'
>>> del node.attr['id']
>>> node.attr
{}
NOTE: Use Node.has_attr() rather than bool(node.attr) to probe the presence of attributes. Attribute dictionaries are created lazily, and accessing node.attr would create a dictionary even though it may never be needed.
- static from_json_obj(json_obj: Dict | Sequence) Node[source]#
Converts a JSON-object representing a node (or tree) back into a Node object. Raises a ValueError, if json_obj does not represent a node.
- property pos#
Returns the position of the Node’s content in the source text.
- property result: Tuple[Node, ...] | StringView | str#
Returns the result from the parser that created the node.
- to_json_obj(as_dict: bool = False, include_pos: bool = True) List[source]#
Converts the tree into a JSON-serializable nested list. Nodes can be serialized in list-flavor (faster) or dictionary-flavor (as_dict=True, slower).
In list-flavor, Nodes are serialized as JSON-lists with either two or three elements:
name (always a string),
content (either a string or a list of JSON-serialized Nodes)
optional: a dictionary that maps attribute names to attribute values, both of which are strings.
In dictionary flavor, Nodes are serialized as dictionaries that map the node’s name to a string (in case of a leaf node) or to a dictionary of its children (in case all the children’s names are unique) or to a list of pairs (child name, child’s result).
Examples (list flavor):
>>> Node('root', 'content').with_attr(importance="high").to_json_obj()
['root', 'content', {'importance': 'high'}]
>>> node = parse_sxpr('(letters (a "A") (b "B") (c "C"))')
>>> node.to_json_obj()
['letters', [['a', 'A'], ['b', 'B'], ['c', 'C']]]
- Examples (dictionary flavor)
>>> node.to_json_obj(as_dict=True)
{'letters': {'a': 'A', 'b': 'B', 'c': 'C'}}
>>> node.result = node.children + (Node('a', 'doublette'),)
>>> node.to_json_obj(as_dict=True)
{'letters': [['a', 'A'], ['b', 'B'], ['c', 'C'], ['a', 'doublette']]}
- with_pos(pos: int) Node[source]#
Initializes the node’s position value. Usually, the parser takes care of assigning the positions in the document to the nodes of the parse-tree. However, when Nodes are created outside the reach of the parser guard, their document-position must be assigned manually. Position values of the child nodes are assigned recursively, too. Example:
>>> node = Node('test', 'position').with_pos(10)
>>> node.pos
10
>>> tree = parse_sxpr('(a (b (c "0") (d (e "1")(f "2"))) (g "3"))')
>>> _ = tree.with_pos(0)
>>> [(nd.name, nd.pos) for nd in tree.select(ANY_NODE, include_root=True)]
[('a', 0), ('b', 0), ('c', 0), ('d', 1), ('e', 1), ('f', 2), ('g', 3)]
- Parameters:
pos – The position to be assigned to the node. Value must be >= 0 if the position value has already been initialized before. If not, a value < 0 has no effect.
- Returns:
the node itself (for convenience).
- Raises:
AssertionError if position has already been assigned or if parameter pos has a value < 0.
- class Node(name: str, result: Tuple[Node, ...] | Node | StringView | str, leafhint: bool = False)[source]#
Represents a node in a tree data structure. This can, for example, be the concrete or abstract syntax tree that is produced by a recursive descent parser.
There are three different kinds of nodes:
Branch nodes that have children, but no string content. Unlike in XML, there are no mixed-content nodes that contain strings as well as other tags. This constraint simplifies tree-processing considerably.
The conversion to and from XML works by enclosing strings in a mixed-content tag with some freely chosen tag name, and dropping the tag name again when serializing to XML. Since this is easily done, there is no serious restriction involved when not allowing mixed-content nodes. See Node.as_xml() (parameter string_tags) and parse_xml.
Leaf nodes that do not have children but only string content.
The root node which contains further properties that are global properties of the parsing tree, such as the error list (which cannot be stored locally in the nodes, because nodes might be dropped during tree-processing, but error messages should not be forgotten!). Because of that, the root node requires a different class (RootNode) while leaf-nodes as well as branch nodes are both instances of class Node.
A node always has a tag name (which can be empty, though) and a result field, which stores the results of the parsing process and contains either a string or a tuple of child nodes.
All other properties are either optional or represent different views on these two properties. Among these are the attr-field that contains a dictionary of xml-attributes, the children-field that contains a tuple of child-nodes or an empty tuple if the node does not have child nodes, the content-field which contains the string content of the node and the pos-field which contains the position of the node’s content in the source code, but may also be left uninitialized.
- Variables:
name –
The name of the node, which is either its parser’s name or, if that is empty, the parser’s class name.
By convention the parser’s class name when used as tag name is prefixed with a colon “:”. A node whose tag name starts with a colon “:” or is the empty string is considered “anonymous”. See the Node.anonymous property.
result – The result of the parser which generated this node, which can be either a string or a tuple of child nodes.
children – The tuple of child nodes or an empty tuple if there are no child nodes. READ ONLY!
content – Yields the contents of the tree as string. The difference to str(node) is that node.content does not add the error messages to the returned string. READ ONLY!
pos –
The position of the node within the parsed text.
The default value of pos is -1, meaning invalid by default. Setting pos to a value >= 0 will trigger the assignment of position values of all child nodes relative to this value.
The pos field is WRITE ONCE, i.e. once assigned it cannot be reassigned. The assignment of the pos values happens either during the parsing process or later, when the node is added to a tree whose pos-values have already been initialized.
Thus, pos-values always retain their position in the source text. If nodes are added or deleted in any tree-processing stage after parsing, the pos values will no longer represent the position within the string value of the tree.
Retaining the original source positions is crucial for correctly locating, within the source text, errors which might only be detected at later stages of the tree-transformation.
attr – An optional dictionary of attributes attached to the node. This dictionary is created lazily upon first usage.
- property anonymous: bool#
Returns True, if the Node is an “anonymous” Node, i.e. a node that has not been created by a named parser.
The tag name of an anonymous node contains a colon followed by the class name of the parser that created the node, e.g. “:Series”. It is the recommended practice to remove (or name) all anonymous nodes during the AST-transformation.
- as_etree(ET=None, string_tags: AbstractSet[str] = frozenset({':CharRef', ':EMPTY', ':EntityRef', ':RegExp', ':Text', ':Whitespace'}), empty_tags: AbstractSet[str] = frozenset({}))[source]#
Returns the tree as standard-library- or lxml-ElementTree.
- Parameters:
ET – The ElementTree-library to be used. If None, the standard library's ElementTree will be used.
string_tags – A set of tags the content of which will be written without tag-name into the mixed content of the parent.
empty_tags – A set of tags that will be considered empty tags like “<br/>”. A Node with any of these tags must not contain any content.
- Returns:
The tree of Nodes as an ElementTree
- as_html(css: str = '', head: str = '', lang: str = 'en', **kwargs) str[source]#
Serializes the tree as an HTML-page. See
Node.as_xml() for the further keyword-arguments.
- as_json(indent: int | None = 2, ensure_ascii=False, as_dict: bool = False, include_pos: bool = True) str[source]#
Serializes the tree originating in self as JSON-string. Nodes are serialized as JSON-lists with either two or three elements:
name (always a string),
content (either a string or a list of JSON-serialized Nodes)
optional: a dictionary that maps attribute names to attribute values, both of which are strings.
If as_dict is True, nodes are serialized as JSON dictionaries, which can be easier for humans to read. Keep in mind, though, that while this renders the json files more readable, not all json parsers honor the order of the entries of dictionaries. Thus, serializing node trees as ordered JSON-dictionaries is not strictly in accordance with the JSON-specification! Also, serializing and de-serializing the dictionary-flavored JSON is slower.
Example:
>>> node = Node('root', 'content').with_attr(importance="high")
>>> node.as_json(indent=0)
'["root","content",{"importance":"high"}]'
>>> node.as_json(indent=0, as_dict=True)
'{"root":{"content__":"content","attributes__":{"importance":"high"}}}'
- as_ndst(src: str | None = '', indent: int | None = 2) str[source]#
Serializes the tree as Abstract-Syntax-Tree following the unist-Specification.
- Parameters:
indent – number of spaces for indentation
- as_sxml(src: str | None = None, indentation: int = 2, compact: bool = True, flatten_threshold: int = 92, normal_form: int = 1, reflow_col: int = 0, mapping: Dict[Node, Tuple[int, int | Sequence[int], int]] = {"don't generate a serialization mapping ": (-1, -1, -1)}) str[source]#
Serializes the tree as SXML. See
as_sxpr() for a description of the parameters.
- as_sxpr(src: str | None = None, indentation: int = 2, compact: bool = True, flatten_threshold: int = 92, sxml: int = 0, reflow_col: int = 0, mapping: Dict[Node, Tuple[int, int | Sequence[int], int]] = {"don't generate a serialization mapping ": (-1, -1, -1)}) str[source]#
Serializes the tree as S-expression, i.e. in lisp-like form. If this method is called on a RootNode-object, error strings will be displayed as pseudo-attributes of the nodes where the error is located.
- Parameters:
src – The source text or None. In case the source text is given the position of the element in the text will be reported as position, line, column. In case the empty string is given rather than None, only the position value will be reported in case it has been initialized, i.e. pos >= 0.
indentation – The number of whitespaces for indentation
compact – If True, a compact representation is returned where closing brackets remain on the same line as the last element.
flatten_threshold – Return the S-expression in flattened form if the flattened expression does not exceed the threshold length. A negative number means that it will always be flattened.
sxml –
If >= 1, attributes are rendered according to the SXML-conventions, e.g. (@ (attr "value")) instead of (attr "value"). If 2, the attribute node (@) will always be present, even if empty.
reflow_col – If > 0, the serialized form of the tree will be reflowed to the given column width.
mapping – If not NO_MAPPING_SENTINEL, the passed dictionary will be filled with a mapping of the nodes to the length of their opening, overall length and closing, respectively, e.g. ‘(name “Fritz”)’ -> (5, 14, 1)
- Returns:
A string containing the S-expression serialization of the tree.
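To illustrate the flattened S-expression form, here is a minimal standard-library sketch. It assumes a hypothetical tree encoding of nested `(name, result)` tuples rather than DHParser's Node class, ignores attributes, positions and indentation, and uses its own simple escaping rule for quotes.

```python
from typing import Tuple, Union

# Hypothetical encoding: (name, result), where result is a string (leaf)
# or a tuple of further (name, result) pairs (branch).
Tree = Tuple[str, Union[str, tuple]]


def as_flat_sxpr(tree: Tree) -> str:
    """Serializes a (name, result) tree as a one-line S-expression."""
    name, result = tree
    if isinstance(result, str):
        # Escaping convention of this sketch only
        escaped = result.replace("\\", "\\\\").replace('"', '\\"')
        return f'({name} "{escaped}")'
    return "(" + " ".join([name] + [as_flat_sxpr(child) for child in result]) + ")"


sketch_tree = ("A", (("B", (("x", "1"), ("y", "2"))), ("B", "!"), ("C", (("z", "3"),))))
flat = as_flat_sxpr(sketch_tree)
print(flat)
```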
- as_unist(src: str | None = '', indent: int | None = 2) str[source]#
Just for lost souls. Everybody else uses
Node.as_ndst()
- as_unist_obj(flavor: str = 'xast', lbreaks: List[int] = []) Dict[source]#
Returns the tree as a JSON-object conforming to the xast-Specification. In order to get the actual xast, the returned JSON-object needs to be serialized.
- as_xast(src: str | None = '', indent: int | None = 2) str[source]#
Serializes the tree as XML-Abstract-Syntax-Tree following the xast-Specification.
- Parameters:
indent – number of spaces for indentation
- as_xml(src: str | None = None, indentation: int = 2, inline_tags: AbstractSet[str] = frozenset({}), string_tags: AbstractSet[str] = frozenset({':CharRef', ':EMPTY', ':EntityRef', ':RegExp', ':Text', ':Whitespace'}), empty_tags: AbstractSet[str] = frozenset({'__AUTO_EMPTY_TAGS__'}), strict_mode: bool = True, reflow_col: int = 0, mapping: Dict[Node, Tuple[int, int | Sequence[int], int]] = {"don't generate a serialization mapping ": (-1, -1, -1)}) str[source]#
Serializes the tree of nodes as XML.
- Parameters:
src – The source text or None. In case the source text is given, the position will also be reported as line and column.
indentation – The number of whitespaces for indentation
inline_tags – A set of tag names, the content of which will always be written on a single line, unless it contains explicit line feeds (\n). In addition, all nodes that have the attribute xml:space="preserve" will be inlined.
string_tags – A set of tags from which only the content will be printed, but neither the opening tag nor its attr nor the closing tag. This allows producing a mix of plain text and child tags in the output, which otherwise is not supported by the Node object, because it requires its content to be either a tuple of children or string content.
empty_tags – A set of tags which shall be rendered as empty elements, e.g. “<empty/>” instead of “<empty></empty>”.
strict_mode – If True, violation of stylistic or interoperability rules raises a ValueError.
reflow_col – If > 0, the serialized form of the tree will be reflowed to the given column width. This is useful for pretty-printing.
- Returns:
The XML-string representing the tree originating in self
- property attr#
Returns a dictionary of XML-attributes attached to the node.
Examples:
>>> node = Node('', '')
>>> print('Any attributes present?', node.has_attr())
Any attributes present? False
>>> node.attr['id'] = 'identificator'
>>> node.attr
{'id': 'identificator'}
>>> node.attr['id']
'identificator'
>>> del node.attr['id']
>>> node.attr
{}
NOTE: Use Node.has_attr() rather than bool(node.attr) to probe the presence of attributes. Attribute dictionaries are created lazily, and accessing node.attr would create a dictionary even though it may never be needed.
- property children: ChildrenType#
Returns the tuple of child-nodes or an empty tuple if the node does not have any child-nodes but only string content.
- collect_empty_tags() Set[str][source]#
Collects the names of all nodes for which it is True that all nodes with that name are empty. Example:
>>> tree = parse_sxpr('(r (e "") (f "") (g "X") (e "") (f "X") (g ""))')
>>> print(tree.collect_empty_tags())
{'e'}
- Returns:
The set of names of always empty-nodes
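The rule "a name counts only if every node with that name is empty" can be sketched with the standard library alone. The nested `(name, result)` tuples below are a hypothetical encoding chosen for illustration, not DHParser's Node class, and the sketch also inspects the root node, which may or may not match the real method.

```python
from typing import Dict, Iterator, Set, Tuple, Union

# Hypothetical encoding: (name, result), result is a string or a tuple of children
Tree = Tuple[str, Union[str, tuple]]


def walk(tree: Tree) -> Iterator[Tree]:
    """Yields the tree's nodes in pre-order, root included."""
    yield tree
    _, result = tree
    if not isinstance(result, str):
        for child in result:
            yield from walk(child)


def collect_empty_names(tree: Tree) -> Set[str]:
    """Returns the names for which every node carrying that name is empty."""
    always_empty: Dict[str, bool] = {}
    for name, result in walk(tree):
        is_empty = not result  # "" and () both count as empty in this sketch
        always_empty[name] = always_empty.get(name, True) and is_empty
    return {name for name, flag in always_empty.items() if flag}


sample = ("r", (("e", ""), ("f", ""), ("g", "X"), ("e", ""), ("f", "X"), ("g", "")))
empty_names = collect_empty_names(sample)
print(empty_names)
```

Here `f` and `g` are excluded because each has one non-empty occurrence, which mirrors the doctest above.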
- property content: str#
Returns content as string. If the node has child-nodes, the string content of the child-nodes is recursively read and then concatenated.
- equals(other: Node, ignore_attr_order: bool = True) bool[source]#
Equality of value: Two nodes are considered as having the same value, if their tag name is the same, if their results are equal and if their attributes and attribute values are the same and if either ignore_attr_order is True or the attributes also appear in the same order.
- Parameters:
other – The node to which self shall be compared.
ignore_attr_order – If True (default), two sets of attributes are considered as equal if their attribute-names and attribute-values are the same, no matter in which order the attributes have been added.
- Returns:
True, if the tree originating in node self is equal by value to the tree originating in node other.
- evaluate(actions: Dict[str, Callable], path: Path = []) Any[source]#
Simple tree evaluation: For each node the action associated with the node’s tag-name is called with either the tuple of the evaluated children or, in case of a leaf-node, the result-string as parameter(s):
>>> tree = parse_sxpr('(plus (number 3) (mul (number 5) (number 4)))')
>>> from operator import add, mul
>>> actions = {'plus': add, 'mul': mul, 'number': int}
>>> tree.evaluate(actions)
23
evaluate() can operate in two modes. In the basic mode, shown in the example, only the evaluated values of the children are passed to each function in the action dictionary. However, if evaluate is called with the beginning of the path passed to its path-argument, each function will be called with the current path as its first argument and the evaluated values of its children as the following arguments, e.g. result = node.evaluate(actions, path=[node]). This more sophisticated mode gives the action function access to the nodes of the tree as well.
- Parameters:
actions – A dictionary that maps node-names to action functions.
path – If not empty, the current tree-path will be passed as first argument (before the evaluation results of the children) to each action. Start with a list of the node itself to trigger passing the path.
- Raises:
KeyError – if an action is missing in the table. Use the joker ‘*’ to avoid this error, e.g. { ..., '*': lambda node: node.content, ...}.
ValueError – in case any of the action functions cannot handle the passed arguments.
- Returns:
the result of the evaluation
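The basic evaluation mode amounts to a bottom-up fold over the tree. The following standard-library sketch shows the idea on a hypothetical encoding of nested `(name, result)` tuples; it is not DHParser's implementation and leaves out the path mode and the ‘*’ joker.

```python
from operator import add, mul
from typing import Any, Callable, Dict, Tuple, Union

# Hypothetical encoding: (name, result), result is a string or a tuple of children
Tree = Tuple[str, Union[str, tuple]]


def evaluate(tree: Tree, actions: Dict[str, Callable]) -> Any:
    """Calls the action registered for each node name, children first."""
    name, result = tree
    action = actions[name]  # raises KeyError if no action is registered
    if isinstance(result, str):
        return action(result)  # leaf: the action receives the string
    # branch: the action receives the evaluated children as arguments
    return action(*(evaluate(child, actions) for child in result))


expr = ("plus", (("number", "3"), ("mul", (("number", "5"), ("number", "4")))))
actions = {"plus": add, "mul": mul, "number": int}
value = evaluate(expr, actions)
print(value)
```

Since children are evaluated before their parent, `mul` receives 5 and 4 before `plus` receives 3 and 20.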
- find_parent(node) Node | None[source]#
Finds and returns the parent of node within the tree represented by self. If the tree does not contain node, the value None is returned.
- static from_etree(et, string_tag: str = ':Text') Node[source]#
Converts a standard-library- or lxml-ElementTree to a tree of nodes.
- Parameters:
et – the root element-object of the ElementTree
string_tag – A tag-name that will be used for the strings occurring in mixed content.
- Returns:
a tree of nodes.
- static from_json_obj(json_obj: Dict | Sequence) Node[source]#
Converts a JSON-object representing a node (or tree) back into a Node object. Raises a ValueError, if json_obj does not represent a node.
- get(key: int | slice | NodeSelector, surrogate: Node | Sequence[Node]) Node | Sequence[Node][source]#
Returns the child node with the given index if key is an integer or the first child node with the given tag name. If no child with the given index or name exists, the surrogate is returned instead. This mimics the behavior of the get()-method of Python’s dictionaries.
The type of the return value is always the same type as that of the surrogate. If the surrogate is a Node, but there are several items matching key, then only the first of these will be returned.
- get_attr(attribute: str, default: Any) Any[source]#
Returns the value of ‘attribute’ if attribute exists. If not, the default value is returned. This function has the same semantics as node.attr.get(attribute, default), but with the advantage that, unlike node.attr.get, it does not automatically create an attribute dictionary on (first) access.
- Parameters:
attribute – The attribute, the value of which shall be looked up
default – A default value that is returned, in case attribute does not exist.
- Returns:
the attribute’s value or, if unassigned, the default value.
- has_attr(attr: str = '') bool[source]#
Returns True, if the node has the attribute attr or, in case attr is the empty string, any attributes at all; False otherwise.
This function does not create an attribute dictionary, therefore it should be preferred to querying node.attr when testing for the existence of any attributes.
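Why probing through has_attr() or get_attr() is cheaper than touching node.attr can be seen in a small standard-library sketch. `LazyAttrNode` and its private `_attr` field are hypothetical names used only for illustration; DHParser's internal layout may differ.

```python
from typing import Any, Dict, Optional


class LazyAttrNode:
    """Hypothetical node whose attribute dictionary is created on first access."""

    def __init__(self, name: str):
        self.name = name
        self._attr: Optional[Dict[str, Any]] = None  # nothing allocated yet

    @property
    def attr(self) -> Dict[str, Any]:
        if self._attr is None:
            self._attr = {}  # merely reading .attr allocates the dictionary
        return self._attr

    def has_attr(self, attr: str = "") -> bool:
        if not self._attr:  # covers both "never created" and "empty"
            return False
        return attr in self._attr if attr else True

    def get_attr(self, attribute: str, default: Any) -> Any:
        if self._attr is None:
            return default
        return self._attr.get(attribute, default)


node = LazyAttrNode("test")
probed = (node.has_attr(), node.get_attr("id", "n/a"))
allocated_after_probe = node._attr is not None    # still no dictionary
_ = bool(node.attr)                               # this access creates one
allocated_after_access = node._attr is not None
print(probed, allocated_after_probe, allocated_after_access)
```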
- has_equal_attr(other: Node, ignore_order: bool = True) bool[source]#
Returns True, if self and other have the same attributes with the same attribute values. If ignore_order is False, the attributes must also appear in the same order.
- index(selector: NodeSelector, start: int = 0, stop: int = 1073741824) int[source]#
Returns the index of the first child that fulfills the criterion selector. If the parameters start and stop are given, the search is restricted to the children with indices from the half-open interval [start:stop[. If no such child exists a ValueError is raised.
- Parameters:
selector – the criterion by which the child is identified, the index of which shall be returned.
start – the first index to start searching.
stop – the index at which the search stops (exclusive, since the interval is half-open).
- Returns:
the index of the first child that matches selector.
- Raises:
ValueError, if no child matching the criterion selector was found.
- indices(selector: NodeSelector) Tuple[int, ...][source]#
Returns the indices of all children that fulfil the criterion selector.
- locate(location: int) Node | None[source]#
Returns the leaf-Node that covers the given location, where location is the actual position within self.content (not the source code position that the pos-attribute represents). If the location lies outside the node’s string content, None is returned.
See also ContentMapping for a more general approach to locating string positions within the tree.
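Locating a leaf by content position means walking the children while accumulating the lengths of their string content. The sketch below uses only the standard library and a hypothetical encoding of nested `(name, result)` tuples; how the real method treats empty leaves or boundary positions is not covered here.

```python
from typing import Optional, Tuple, Union

# Hypothetical encoding: (name, result), result is a string or a tuple of children
Tree = Tuple[str, Union[str, tuple]]


def content(tree: Tree) -> str:
    """Concatenated string content of the tree."""
    _, result = tree
    return result if isinstance(result, str) else "".join(content(c) for c in result)


def locate(tree: Tree, location: int) -> Optional[Tree]:
    """Returns the leaf covering `location` in content(tree), or None."""
    if location < 0:
        return None
    _, result = tree
    if isinstance(result, str):
        return tree if location < len(result) else None
    offset = 0
    for child in result:
        length = len(content(child))
        if location < offset + length:
            return locate(child, location - offset)
        offset += length
    return None


doc = ("doc", (("word", "Hello"), ("blank", " "), ("word", "world")))
hit = locate(doc, 6)      # position of the "w" in "Hello world"
miss = locate(doc, 11)    # one past the end of the content
print(hit, miss)
```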
- locate_path(location: int) Path[source]#
Like
Node.locate(), only that the entire path (i.e. chain of descendants) relative to self is returned.
- milestone_segment(begin: Path | Node, end: Path | Node) Node[source]#
EXPERIMENTAL!!! Picks a segment from a tree beginning with begin and ending with end.
- Parameters:
begin – the opening milestone (will be included in the result)
end – the closing milestone (will be included in the result)
- Returns:
a tree(-segment) encompassing all nodes from the opening milestone up to and including the closing milestone.
- pick(criteria: NodeSelector, include_root: bool = False, reverse: bool = False, skip_subtree: NodeSelector = <function deny>) Node | None[source]#
Picks the first (or last if run in reverse mode) descendant that fulfils the given criterion. See create_match_function() for a catalogue of possible criteria.
This function is syntactic sugar for next(node.select(criterion, …)). However, rather than raising StopIteration if no descendant fulfils the criterion, it returns None.
- pick_child(criteria: NodeSelector, reverse: bool = False) Node | None[source]#
Picks the first (or last if run in reverse mode) child that fulfils the given criterion. See create_match_function() for a catalogue of possible criteria.
This function is syntactic sugar for next(node.select_children(criterion, False)). However, rather than raising StopIteration if no child fulfils the criterion, it returns None.
- pick_if(match_func: NodeMatchFunction, include_root: bool = False, reverse: bool = False, skip_func: NodeMatchFunction = <function deny>) Node | None[source]#
Picks the first (or last if run in reverse mode) descendant for which the match-function returns True, or returns None if no matching node exists.
- pick_path(criteria: PathSelector, include_root: bool = False, reverse: bool = False, skip_subtree: PathSelector = <function deny>) Path[source]#
Like
Node.pick(), only that the entire path (i.e. chain of descendants) relative to self is returned.
- pick_path_if(match_func: PathMatchFunction, include_root: bool = False, reverse: bool = False, skip_func: PathMatchFunction = <function deny>) Path | None[source]#
Picks the first (or last if run in reverse mode) descendant-path for which the match-function returns True, or returns None if no matching path exists.
- reconstruct_path(node: Node) Path[source]#
Determines the chain of ancestors of a node that leads up to self. Note: The use of this quite inefficient method can most of the time be avoided by traversing the tree with the path-selector-methods (e.g. Node.select_path()) right from the start.
- Parameters:
node – the descendant node, the ancestry of which shall be determined.
- Returns:
the list of nodes starting with self and leading to node
- Raises:
ValueError, in case node does not occur in the tree rooted in self
- replace_by(replacement: Node, merge_attr: bool = False)[source]#
Replaces the node’s name, result and attributes by that of another node. This allows replacing the node without needing to change the parent node’s children’s tuple.
- Parameters:
replacement – the node by which self shall be “replaced”.
merge_attr – if True, attributes are merged (by updating the attr dictionary with that of the replacement node) rather than simply being replaced.
- property repr: str#
Return a full (re-)executable representation of self including attributes and position value.
- property result: StrictResultType#
Returns the result from the parser that created the node.
- select(criteria: NodeSelector, include_root: bool = False, reverse: bool = False, skip_subtree: NodeSelector = <function deny>) Iterator[Node][source]#
Generates an iterator over all nodes in the tree that fulfill the given criterion. See create_match_function() for a catalogue of possible criteria.
- Parameters:
criteria – The criteria for selecting nodes.
include_root – If False, only descendant nodes will be checked for a match.
reverse – If True, the tree will be walked in reverse order, i.e. last children first.
skip_subtree – A criterion to identify subtrees that the returned iterator shall not dive into. Note that the root-node of the subtree will still be yielded by the iterator.
- Returns:
An iterator over all descendant nodes which fulfill the given criterion. Traversal is pre-order.
Examples:
>>> tree = parse_sxpr('(a (b "X") (X (c "d")) (e (X "F")))')
>>> list(flatten_sxpr(item.as_sxpr()) for item in tree.select("X", False))
['(X (c "d"))', '(X "F")']
>>> list(flatten_sxpr(item.as_sxpr()) for item in tree.select({"X", "b"}, False))
['(b "X")', '(X (c "d"))', '(X "F")']
>>> any(tree.select('a', False))
False
>>> list(flatten_sxpr(item.as_sxpr()) for item in tree.select('a', True))
['(a (b "X") (X (c "d")) (e (X "F")))']
>>> flatten_sxpr(next(tree.select("X", False)).as_sxpr())
'(X (c "d"))'
>>> tree = parse_sxpr('(a (b (c "") (d (e "")(f ""))) (g ""))')
>>> [nd.name for nd in tree.select(ANY_NODE)]
['b', 'c', 'd', 'e', 'f', 'g']
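The traversal order and the relation between selecting and picking can be sketched with a plain generator. The code below uses only the standard library and a hypothetical encoding of nested `(name, result)` tuples; the function names mirror the methods described here, but the bodies are an illustration under these assumptions, not DHParser's source.

```python
from typing import Callable, Iterator, Optional, Tuple, Union

# Hypothetical encoding: (name, result), result is a string or a tuple of children
Tree = Tuple[str, Union[str, tuple]]
Match = Callable[[Tree], bool]


def deny(node: Tree) -> bool:
    return False


def select_if(tree: Tree, match: Match, include_root: bool = False,
              reverse: bool = False, skip: Match = deny) -> Iterator[Tree]:
    """Pre-order traversal yielding every node for which match(node) is true."""
    if include_root and match(tree):
        yield tree
    _, result = tree
    if isinstance(result, str):
        return
    for child in (reversed(result) if reverse else result):
        if match(child):
            yield child  # the root of a skipped subtree is still yielded
        if not skip(child):
            yield from select_if(child, match, False, reverse, skip)


def pick_if(tree: Tree, match: Match, reverse: bool = False) -> Optional[Tree]:
    # next() with a default returns None instead of raising StopIteration
    return next(select_if(tree, match, False, reverse), None)


sample = ("a", (("b", (("c", ""), ("d", (("e", ""), ("f", ""))))), ("g", "")))
any_node = lambda nd: True
forward = [nd[0] for nd in select_if(sample, any_node)]
backward = [nd[0] for nd in select_if(sample, any_node, reverse=True)]
pruned = [nd[0] for nd in select_if(sample, any_node, skip=lambda nd: nd[0] == "d")]
print(forward, backward, pruned)
```

In this sketch, reverse mode visits the last child first but still yields a parent before its children.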
- select_children(criteria: NodeSelector, reverse: bool = False) Iterator[Node][source]#
Returns an iterator over all direct children of a node that fulfil the given criterion. See
Node.select() for a description of the parameters.
- select_if(match_func: NodeMatchFunction, include_root: bool = False, reverse: bool = False, skip_func: NodeMatchFunction = <function deny>) Iterator[Node][source]#
Generates an iterator over all nodes in the tree for which match_func() returns True. See the more general function
Node.select() for a detailed description and examples. The tree is traversed pre-order by the iterator.
- select_path(criteria: PathSelector, include_root: bool = False, reverse: bool = False, skip_subtree: PathSelector = <function deny>) Iterator[Path][source]#
Like
Node.select() but yields the entire path (i.e. list of descendants, the last one being the matching node) instead of just the matching nodes.
- select_path_if(match_func: PathMatchFunction, include_root: bool = False, reverse: bool = False, skip_func: PathMatchFunction = <function deny>) Iterator[Path][source]#
Like
Node.select_if() but yields the entire path (i.e. list of descendants, the last one being the matching node) instead of just the matching nodes. NOTE: In contrast to select_if(), match_func receives the complete path as argument, rather than just the last node!
- serialize(how: str = 'default') str[source]#
Serializes the tree originating in the node self either as S-expression, XML, JSON, or in compact form. Possible values for how are ‘S-expression’, ‘XML’, ‘JSON’, ‘indented’ accordingly, or ‘AST’, ‘CST’, ‘default’, in which case the value of the respective configuration variable determines the serialization format. (See module
DHParser.configuration.)
- split(milestone: PathSelector, skip_subtree: PathSelector = <function deny>) Tuple[Node, ...][source]#
Splits the entire tree into several trees at every Path for which the milestone-selector yields True. The last node in the path for which the milestone-selector yields True will be removed from the tree. Thus, split() resembles the split-method of the Python-string-object.
EXPERIMENTAL!
- Parameters:
milestone – The criterion for a milestone-path
skip_subtree – The criterion for subtrees (identified by their path) within which no milestones will be searched. (Default: all subtrees will be searched).
- split_if(milestone: PathMatchFunction, skip_func: PathMatchFunction = <function deny>) Tuple[Node, ...][source]#
Splits the entire tree into several trees at every Path for which the milestone-function yields True. The last node in the path for which the milestone-function yields True will be removed from the tree. Thus, split_if() resembles the split-method of the Python-string-object.
EXPERIMENTAL!
- strlen() int[source]#
Returns the length of the string-content of this node. Use len(node.children) for the number of children of this node!
- to_json_obj(as_dict: bool = False, include_pos: bool = True) List | Dict[source]#
Converts the tree into a JSON-serializable nested list. Nodes can be serialized in list-flavor (faster) or dictionary-flavor (as_dict=True, slower).
In list-flavor, Nodes are serialized as JSON-lists with either two or three elements:
name (always a string),
content (either a string or a list of JSON-serialized Nodes)
optional: a dictionary that maps attribute names to attribute values, both of which are strings.
In dictionary flavor, Nodes are serialized as dictionaries that map the node’s name to a string (in case of a leaf node) or to a dictionary of its children (in case all the children’s names are unique) or to a list of pairs (child name, child’s result).
Examples (list flavor):
>>> Node('root', 'content').with_attr(importance="high").to_json_obj()
['root', 'content', {'importance': 'high'}]
>>> node = parse_sxpr('(letters (a "A") (b "B") (c "C"))')
>>> node.to_json_obj()
['letters', [['a', 'A'], ['b', 'B'], ['c', 'C']]]
- Examples (dictionary flavor)
>>> node.to_json_obj(as_dict=True)
{'letters': {'a': 'A', 'b': 'B', 'c': 'C'}}
>>> node.result = node.children + (Node('a', 'doublette'),)
>>> node.to_json_obj(as_dict=True)
{'letters': [['a', 'A'], ['b', 'B'], ['c', 'C'], ['a', 'doublette']]}
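The difference between the two flavors, and the fallback to a list of pairs when child names repeat, can be sketched as follows. This uses only the standard library and a hypothetical encoding of nested `(name, result)` tuples; attributes and positions are left out, so it is a simplification of what to_json_obj() produces.

```python
import json
from typing import Dict, List, Tuple, Union

# Hypothetical encoding: (name, result), result is a string or a tuple of children
Tree = Tuple[str, Union[str, tuple]]


def to_list_flavor(tree: Tree) -> List:
    """[name, content], where content is a string or a list of encoded children."""
    name, result = tree
    if isinstance(result, str):
        return [name, result]
    return [name, [to_list_flavor(child) for child in result]]


def to_dict_flavor(tree: Tree) -> Dict:
    """{name: value}; children become a dict if their names are unique."""
    def value(result):
        if isinstance(result, str):
            return result
        names = [child_name for child_name, _ in result]
        if len(set(names)) == len(names):          # unique names: dictionary
            return {n: value(r) for n, r in result}
        return [[n, value(r)] for n, r in result]  # otherwise: list of pairs
    name, result = tree
    return {name: value(result)}


letters = ("letters", (("a", "A"), ("b", "B"), ("c", "C")))
as_list = to_list_flavor(letters)
as_dict = to_dict_flavor(letters)
with_duplicate = to_dict_flavor(("letters", letters[1] + (("a", "doublette"),)))
print(json.dumps(as_list))
```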
- walk_tree(include_root: bool = False, reverse: bool = False) Iterator[Node][source]#
Yields all nodes of the tree. (Faster than select())
- walk_tree_paths(include_root: bool = False, reverse: bool = False)[source]#
Yields all paths of the tree. (Faster than select_path_if())
- with_attr(*attr_dict, **attributes) Node[source]#
Adds the attributes which are passed to with_attr() either as an attribute dictionary or as keyword parameters to the node’s attributes and returns self. Example:
>>> node = Node('test', '').with_attr(animal = "frog", plant= "tree")
>>> dict(node.attr)
{'animal': 'frog', 'plant': 'tree'}
>>> node.with_attr({'building': 'skyscraper'}).repr
"Node('test', '').with_attr({'animal': 'frog', 'plant': 'tree', 'building': 'skyscraper'})"
- Parameters:
attr_dict – a dictionary of attribute keys and values
attributes – alternatively, a sequence of keyword parameters
- Returns:
self
- with_pos(pos: int) Node[source]#
Initializes the node’s position value. Usually, the parser takes care of assigning the positions in the document to the nodes of the parse-tree. However, when Nodes are created outside the reach of the parser guard, their document-position must be assigned manually. Position values of the child nodes are assigned recursively, too. Example:
>>> node = Node('test', 'position').with_pos(10)
>>> node.pos
10
>>> tree = parse_sxpr('(a (b (c "0") (d (e "1")(f "2"))) (g "3"))')
>>> _ = tree.with_pos(0)
>>> [(nd.name, nd.pos) for nd in tree.select(ANY_NODE, include_root=True)]
[('a', 0), ('b', 0), ('c', 0), ('d', 1), ('e', 1), ('f', 2), ('g', 3)]
- Parameters:
pos – The position to be assigned to the node. Value must be >= 0 if the position value has already been initialized before. If not, a value < 0 has no effect.
- Returns:
the node itself (for convenience).
- Raises:
AssertionError if position has already been assigned or if parameter pos has a value < 0.
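Judging from the example above, a node's position is the number of characters of string content that precede it in document order. The following is a minimal sketch of that recursive assignment on plain nested tuples. The tuple layout and the name `assign_positions` are assumptions made for illustration and are not part of DHParser.

```python
from typing import List, Tuple, Union

# Hypothetical stand-in for a node: (name, string content or list of children)
Tree = Tuple[str, Union[str, List["Tree"]]]


def assign_positions(tree: Tree, pos: int, out: List[Tuple[str, int]]) -> int:
    """Record (name, pos) for every node in document order and
    return the position directly after the node's content."""
    name, result = tree
    out.append((name, pos))
    if isinstance(result, str):
        return pos + len(result)      # a leaf advances by its content length
    for child in result:
        pos = assign_positions(child, pos, out)
    return pos


tree = ('a', [('b', [('c', '0'), ('d', [('e', '1'), ('f', '2')])]), ('g', '3')])
positions: List[Tuple[str, int]] = []
end = assign_positions(tree, 0, positions)
print(positions)
# [('a', 0), ('b', 0), ('c', 0), ('d', 1), ('e', 1), ('f', 2), ('g', 3)]
```

The printed list matches the positions shown in the with_pos() example for the same tree.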
- class RootNode(*args, **kwargs)[source]#
The root node for the node-tree is a special kind of node that keeps and manages global properties of the tree as a whole. These are first and foremost the list of errors that occurred during tree generation (i.e. parsing) or any transformation of the tree.
Other properties concern the customization of the XML-serialization and meta-data about the processed document and processing stage.
Although errors are local properties that occur at a specific point or chunk of source code, the list of errors is managed globally by the root-node object instead of attaching the errors to the nodes on which they have occurred. Otherwise, it would be hard to keep track of the errors when, during the transformation of trees, nodes that might also contain error messages are replaced or dropped.
The root node can be instantiated before the tree is fully parsed. This is necessary, because the root node is needed for managing error messages during the parsing process, already. In order to connect the root node to the tree, when parsing is finished, the swallow()-method must be called.
- Variables:
errors – A list of all errors that have occurred so far during processing (i.e. parsing, AST-transformation, compiling) of this tree. The errors are ordered by the time of their being added to the list.
errors_sorted – (read-only property) The list of errors ordered by their position.
error_nodes – A mapping of node-ids to a list of errors that occurred on the node with the respective id.
error_positions – A mapping of locations to a set of ids of nodes that contain an error at that particular location.
error_flag – the highest warning or error level of all errors that occurred.
source – The source code (after preprocessing)
source_mapping – A source mapping function to map source code positions to the positions of the non-preprocessed source. See module preprocess
lbreaks – A list of indices of all linebreaks in the source.
inline_tags – see Node.as_xml() for an explanation.
string_tags – see Node.as_xml() for an explanation.
empty_tags – see Node.as_xml() for an explanation.
docname – a name for the document
stage – a name for the current processing stage or the empty string (default). This name, if present, is used for verifying the stage in DHParser.compile.run_pipeline(). If stage contains the empty string, stage-verification is turned off (which may result in obscure error messages in case a tree-transformation is run on the wrong stage). Stage-names should be considered as case-insensitive, i.e. “AST” is treated as the same stage as “ast”.
serialization_type – The kind of serialization for the current processing stage. Can be one of ‘XML’, ‘json’, ‘indented’, ‘S-expression’ or ‘default’. (The latter picks the default serialization from the configuration.)
data – Compiled data. If the data still is a tree this simply contains a reference to self.
- add_error(node: Node | None, error: Error) RootNode[source]#
Adds an Error object to the tree, locating it at a specific node.
- as_xml(src: str | None = None, indentation: int = 2, inline_tags: AbstractSet[str] = frozenset({''}), string_tags: AbstractSet[str] = frozenset({''}), empty_tags: AbstractSet[str] = frozenset({''}), strict_mode: bool = True, reflow_col: int = 0, mapping={"don't generate a serialization mapping ": (-1, -1, -1)}) str[source]#
Serializes the tree of nodes as XML.
- Parameters:
src – The source text or None. In case the source text is given, the position will also be reported as line and column.
indentation – The number of whitespaces for indentation
inline_tags – A set of tag names, the content of which will always be written on a single line, unless it contains explicit line feeds (\n). In addition, all nodes that have the attribute xml:space="preserve" will be inlined.
string_tags – A set of tags from which only the content will be printed, but neither the opening tag nor its attr nor the closing tag. This allows producing a mix of plain text and child tags in the output, which otherwise is not supported by the Node object, because it requires its content to be either a tuple of children or string content.
empty_tags – A set of tags which shall be rendered as empty elements, e.g. “<empty/>” instead of “<empty></empty>”.
strict_mode – If True, violation of stylistic or interoperability rules raises a ValueError.
reflow_col – If > 0, the serialized form of the tree will be reflown to the given column width. This is useful for pretty-printing
- Returns:
The XML-string representing the tree originating in self
- continue_with_data(data: Any)[source]#
Drops the swallowed tree in favor of the (non-tree) data resulting from the compilation of the tree. The data can then be retrieved from the field self.data, which before the tree has been dropped contains a reference to the tree itself.
- did_match() bool[source]#
Returns True, if the parser that has generated this tree did match, False otherwise. Depending on whether the Grammar-object that generated the node-tree was called with complete_match=True or not, this requires either the complete document to have been matched or only the beginning.
Note: If the parser did match, this does not mean that it must have matched without errors. It simply means that no PARSER_STOPPED_BEFORE_END-error has occurred.
- error_safe(level: ErrorCode = 1000) RootNode[source]#
Asserts that the given tree does not contain any errors with a code equal or higher than the given level. Returns the tree if this is the case, raises an AssertionError otherwise.
- new_error(node: Node, message: str, code: ErrorCode = 1000) RootNode[source]#
Adds an error to this tree, locating it at a specific node.
- Parameters:
node – the node where the error occurred
message – a string with the error message
code – an error code to identify the type of the error
- node_errors(node: Node) List[Error][source]#
Returns the List of errors that occurred on the node or any child node at the position of the node that has already been removed from the tree, for example, because it was an anonymous empty child node. The position of the node is here understood to cover the range: [node.pos, node.pos + node.strlen()[
- serialize(how: str = '') str[source]#
Serializes the tree originating in the node self either as S-expression, XML, JSON, or in compact form. Possible values for how are ‘S-expression’, ‘XML’, ‘JSON’, ‘indented’ accordingly, or ‘AST’, ‘CST’, ‘default’, in which case the value of the respective configuration variable determines the serialization format. (See module DHParser.configuration.)
- swallow(node: Node | None, source: str | StringView = '', source_mapping: SourceMapFunc | None = None) RootNode[source]#
Put self in the place of node by copying all its data. Returns self.
This is done by the parse.Grammar object after parsing has finished, so that the Grammar object always returns a node-tree rooted in a RootNode object.
It is possible to add errors to a RootNode object, before it has actually swallowed the root of the node-tree.
- transfer_errors(src: Node, dest: Node)[source]#
Transfers errors to a different node. While errors never get lost during AST-transformation, because they are kept by the RootNode, the nodes they are connected to may be dropped in the course of the transformation. This function allows attaching errors from a node that will be dropped to a different node.
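The idea of keeping errors in a central registry keyed by node identity, so that they survive when a node is dropped, can be sketched as follows. This is a simplified illustration under stated assumptions: the class `ErrorRegistrySketch`, its plain-string errors, and the `Dummy` node class are hypothetical and do not reproduce RootNode's actual implementation.

```python
from collections import defaultdict
from typing import DefaultDict, List


class ErrorRegistrySketch:
    """Hypothetical stand-in for the error bookkeeping of a root node."""

    def __init__(self) -> None:
        self.errors: List[str] = []   # ordered by the time of being added
        self.error_nodes: DefaultDict[int, List[str]] = defaultdict(list)

    def new_error(self, node: object, message: str) -> "ErrorRegistrySketch":
        self.errors.append(message)
        self.error_nodes[id(node)].append(message)
        return self

    def transfer_errors(self, src: object, dest: object) -> None:
        # Re-attach the errors of a node that is about to be dropped.
        moved = self.error_nodes.pop(id(src), [])
        if moved:
            self.error_nodes[id(dest)].extend(moved)


class Dummy:
    """Placeholder for a tree node; only its identity matters here."""


old_node, new_node = Dummy(), Dummy()
registry = ErrorRegistrySketch().new_error(old_node, 'unexpected token')
registry.transfer_errors(old_node, new_node)
print(registry.error_nodes[id(new_node)])  # ['unexpected token']
```

Because the global list `errors` is never touched by the transfer, no error is lost even if `old_node` is removed from the tree afterwards.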
- class SerLocation(path: Path, ser_pos: int, offset: int, part: SerPart)[source]#
A location within a serialized version of the tree (XML, S-expression, SXML).
- path: Path#
Alias for field number 0
- class SerializationMapping(tree: Node, serialization: str, raw_mapping: Dict[Node, Tuple[int, int | Sequence[int], int]])[source]#
Maps serializations (e.g. XML, SXML, S-Expression) to paths. EXPERIMENTAL AND UNTESTED!!!
- content_pos(node: Node, ser_pos: int, offset: int, part: SerPart = SerPart.INSIDE) int[source]#
Returns the corresponding position within the pure string content of the tree.
- get_path(pos: int, left_biased: bool = False) SerLocation[source]#
Returns the path of the innermost node which covers the character at position pos in the serialization. The second return value is the position of the node within the serialization. The third return value is -1 if the character is part of the opening tag, 0 if it is part of the data, and 1 if it is part of the closing tag.
The offset of pos within the node’s serialization can easily be determined by subtracting from pos the returned serialization-position of the node at the end of the returned path.
- class XMLSpacePolicy(value)[source]#
Policy for treating the xml:space attribute when reformatting XML.
- FAIL
an error will be raised when trying to reformat XML-code inside a tag with an xml:space attribute.
- IGNORE
the xml:space attribute will simply be ignored
- RESPECT
no reflow inside of tags that are “protected” by the xml:space-Attribute
- add_class(node: Node, token: str, *, attribute: str = 'class')#
Adds all tokens to attribute of node.
- add_token(token_sequence: str, token: str) str[source]#
Adds the tokens from token that are not already contained in token_sequence to the end of token_sequence:
>>> add_token('', 'italic')
'italic'
>>> add_token('bold italic', 'large')
'bold italic large'
>>> add_token('bold italic', 'bold')
'bold italic'
>>> add_token('red thin', 'stroked red')
'red thin stroked'
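The behaviour shown in these examples can be reproduced with a few lines of plain Python. The following is a minimal sketch, assuming that tokens are separated by whitespace; `add_token_sketch` is a hypothetical name and not the library's implementation.

```python
def add_token_sketch(token_sequence: str, token: str) -> str:
    """Append every token from `token` that is not yet present,
    keeping the existing order of `token_sequence`."""
    present = token_sequence.split()
    for tk in token.split():
        if tk not in present:
            present.append(tk)
    return ' '.join(present)


print(add_token_sketch('red thin', 'stroked red'))  # red thin stroked
```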
- add_token_to_attr(node: Node, token: str, attribute: str)[source]#
Adds all tokens to attribute of node.
- can_split(t: Path, i: int, left_biased: bool = True, greedy: bool = True, match_func: PathMatchFunction = <function affirm>, skip_func: PathMatchFunction = <function deny>, divisible: Container[str] = frozenset({':EMPTY', ':RegExp', ':Text', ':Whitespace'})) int[source]#
Returns the negative index of the first node in the path, from which on all nodes can be split or do not need to be split, because the split-index lies to the left or right of the node.
Examples:
>>> tree = parse_sxpr('(doc (p (:Text "ABC")))')
>>> can_split([tree, tree[0], tree[0][0]], 1)
-1
>>> can_split([tree, tree[0], tree[0][0]], 0)
-2
>>> can_split([tree, tree[0], tree[0][0]], 3)
-2
>>> # anonymous nodes, like ":Text" are always divisible
>>> can_split([tree, tree[0], tree[0][0]], 1, divisible=set())
-1
>>> # However, non-anonymous nodes aren't ...
>>> tree = parse_sxpr('(doc (p (Text "ABC")))')
>>> can_split([tree, tree[0], tree[0][0]], 1, divisible=set())
0
>>> # ... unless explicitly mentioned
>>> tree = parse_sxpr('(doc (p (Text "ABC")))')
>>> can_split([tree, tree[0], tree[0][0]], 1, divisible={'Text'})
-1
>>> tree = parse_sxpr('(X (Z "!?") (A (B "123") (C "456")))')
>>> can_split(tree.pick_path('B'), 0)
-2

# edge cases
>>> can_split([parse_sxpr('(p "123")')], 1)
0
>>> can_split([parse_sxpr('(:Text "123")')], 1)
0
- content_of(segment: Node | Tuple[Node, ...] | StringView | str, select: PathSelector = <function LEAF_PATH>, ignore: PathSelector = <function deny>) str[source]#
Returns the string content from a single node or a tuple of Nodes.
- create_match_function(criterion: NodeSelector) NodeMatchFunction[source]#
Creates a node-match-function (Node -> bool) for the given criterion that returns True, if the node passed to the function matches the criterion.
| criterion | type of match |
|---|---|
| id (int) | object identity |
| Node | object identity |
| FrozenNode | equality of tag name, string content and attributes |
| tag name (str) | equality of tag name only |
| multiple tag names | equality of tag name with one of the given names |
| pattern (re.Pattern) | full match of content with pattern |
| match-function | function returns True |
- Parameters:
criterion – Either a node, the id of a node, a frozen node, a name or a container (usually a set) of multiple tag names, a regular expression pattern or another match function.
- Returns:
a match-function (Node -> bool) for the given criterion.
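The dispatch from criterion type to match behaviour can be sketched as below. This is an illustration only: `Nd` is a hypothetical stand-in for a node with a name and string content, `match_function_sketch` is not DHParser's function, and the FrozenNode case is left out.

```python
import re
from typing import Callable


class Nd:
    """Hypothetical minimal node: a tag name plus string content."""

    def __init__(self, name: str, content: str = '') -> None:
        self.name = name
        self.content = content


def match_function_sketch(criterion) -> Callable[[Nd], bool]:
    if isinstance(criterion, int):             # id of a node: object identity
        return lambda nd: id(nd) == criterion
    if isinstance(criterion, Nd):              # a node: object identity
        return lambda nd: nd is criterion
    if isinstance(criterion, str):             # tag name: equality of names
        return lambda nd: nd.name == criterion
    if isinstance(criterion, re.Pattern):      # pattern: full match of content
        return lambda nd: criterion.fullmatch(nd.content) is not None
    if callable(criterion):                    # already a match-function
        return criterion
    if isinstance(criterion, (set, frozenset, list, tuple)):  # several names
        return lambda nd: nd.name in criterion
    raise TypeError(f'unsupported criterion: {criterion!r}')


node = Nd('em', '42')
is_number = match_function_sketch(re.compile(r'\d+'))
print(is_number(node))  # True
```

Turning every kind of criterion into a plain function once, up front, keeps the inner loop of a tree traversal free of type checks.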
- create_path_match_function(criterion: PathSelector) PathMatchFunction[source]#
Creates a path-match-function (Path -> bool) for the given criterion that returns True, if the last node in the path passed to the function matches the criterion.
See create_match_function() for a description of the possible criteria and their meaning.
- Parameters:
criterion – Either a node, the id of a node, a frozen node, a name or a container (usually a set) of multiple tag names, a regular expression pattern or another match function.
- Returns:
a match-function (Path -> bool) for the given criterion.
- deep_split(path: Path, i: int, left_biased: bool = True, greedy: bool = True, match_func: PathMatchFunction = <function affirm>, skip_func: PathMatchFunction = <function deny>, chain_attr_name: str = '') int[source]#
Splits a tree along the path where i is the offset (or relative index) of the split in the last node of the path. Returns the index of the split-location in the first node of the path.
Examples:
>>> from DHParser.toolkit import printw
>>> tree = parse_sxpr('(X (s "") (A (u "") (C "One, ") (D "two, ")) '
...                   '(B (E "three, ") (F "four!") (t "")))')
>>> X = copy.deepcopy(tree)
>>> C = X.pick_path('C')
>>> a = deep_split(C, 0)
>>> a
1
>>> F = X.pick_path('F', reverse=True)
>>> b = deep_split(F, F[-1].strlen(), left_biased=False)
>>> b
3
>>> printw(X.as_sxpr())
(X (s) (A (u) (C "One, ") (D "two, ")) (B (E "three, ") (F "four!") (t)))
>>> a = deep_split(C, 0, greedy=False)
>>> a
2
>>> b = deep_split(F, F[-1].strlen(), left_biased=False, greedy=False)
>>> b
4
>>> printw(X.as_sxpr(flatten_threshold=-1))
(X (s) (A (u)) (A (C "One, ") (D "two, ")) (B (E "three, ") (F "four!")) (B (t)))
>>> X = copy.deepcopy(tree).with_pos(0)
>>> C = X.pick_path('C')
>>> a = deep_split(C, 4)
>>> E = X.pick_path('E')
>>> b = deep_split(E, 0, left_biased=False)
>>> a, b
(2, 3)
>>> printw(X.as_sxpr(flatten_threshold=-1))
(X (s) (A (u) (C "One,")) (A (C " ") (D "two, ")) (B (E "three, ") (F "four!") (t)))
>>> X.result = X[:a] + (Node('em', X[a:b]).with_pos(X[a].pos),) + X[b:]
>>> printw(X.as_sxpr(flatten_threshold=-1))
(X (s) (A (u) (C "One,")) (em (A (C " ") (D "two, "))) (B (E "three, ") (F "four!") (t)))

# edge cases
>>> Y = parse_sxpr('(Y "123")')
>>> deep_split([Y], 1)
1
>>> print(Y.as_sxpr())
(Y "123")
- deserialize(xml_sxpr_or_json: str) Node | None[source]#
Parses either XML or S-expressions or a JSON representation of a syntax-tree. Which of these formats is used is detected automatically.
- drop_leaf(leaf_path: Path)[source]#
Drops the last node in the leaf_path, if it is a leaf. Recursively drops all parent nodes, if they are empty after dropping the leaf. Examples:
>>> tree = parse_sxpr('(A (B (C (D (B (E "?"))))))')
>>> drop_leaf(tree.pick_path('E'))
>>> print(tree.as_sxpr())
(A)
>>> tree = parse_sxpr('(A (B (C (D (B (E "?"))) (F "!"))))')
>>> drop_leaf(tree.pick_path('E'))
>>> print(tree.as_sxpr())
(A (B (C (F "!"))))
>>> tree = parse_sxpr('(A (B (C (D (B (E "?"))) (F "!"))))')
>>> drop_leaf(tree.pick_path('F'))
>>> print(tree.as_sxpr())
(A (B (C (D (B (E "?"))))))
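The upward pruning described for drop_leaf() can be sketched with a tiny stand-in tree. The class `Nd`, the helper `sxpr` and the function `drop_leaf_sketch` are hypothetical names introduced for this illustration; they only mimic the documented behaviour.

```python
from typing import List, Union


class Nd:
    """Hypothetical minimal node: result is a string or a list of children."""

    def __init__(self, name: str, result: Union[str, List["Nd"]] = '') -> None:
        self.name = name
        self.result = result


def sxpr(nd: Nd) -> str:
    """Serialize the stand-in tree in a flat S-expression-like form."""
    if isinstance(nd.result, str):
        return f'({nd.name} "{nd.result}")' if nd.result else f'({nd.name})'
    inner = ' '.join(sxpr(child) for child in nd.result)
    return f'({nd.name} {inner})' if inner else f'({nd.name})'


def drop_leaf_sketch(leaf_path: List[Nd]) -> None:
    last = leaf_path[-1]
    if isinstance(last.result, list) and last.result:
        return                                   # not a leaf: nothing to do
    # Walk from the leaf towards the root and stop at the first parent
    # that still has other children left.
    for i in range(len(leaf_path) - 1, 0, -1):
        child, parent = leaf_path[i], leaf_path[i - 1]
        parent.result = [nd for nd in parent.result if nd is not child]
        if parent.result:
            break


E, F = Nd('E', '?'), Nd('F', '!')
B2 = Nd('B', [E])
D = Nd('D', [B2])
C = Nd('C', [D, F])
B = Nd('B', [C])
A = Nd('A', [B])
drop_leaf_sketch([A, B, C, D, B2, E])
print(sxpr(A))  # (A (B (C (F "!"))))
```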
- ensuing_str(path: Path, length: int = -1) str[source]#
Returns length characters from the string content succeeding the path.
- eq_tokens(token_sequence1: str, token_sequence2: str) bool[source]#
Returns True if both token sequences contain the same tokens, no matter in what order:
>>> eq_tokens('red thin stroked', 'stroked red thin')
True
>>> eq_tokens('red thin', 'thin blue')
False
- find_common_ancestor(path_A: Path, path_B: Path) Tuple[Node | None, int][source]#
Returns the last common ancestor of path_A and path_B and its index in the path. If there is no common ancestor, (None, undefined integer) is returned.
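Since both paths start at the root, the last common ancestor is the last position at which the two paths still hold the very same object. A minimal sketch, assuming paths are plain lists and using -1 as a placeholder for the "undefined integer" (the name `common_ancestor_sketch` is hypothetical):

```python
from typing import List, Optional, Tuple


def common_ancestor_sketch(path_a: List[object],
                           path_b: List[object]) -> Tuple[Optional[object], int]:
    ancestor, index = None, -1        # -1 merely stands in for "undefined"
    for i, (a, b) in enumerate(zip(path_a, path_b)):
        if a is not b:                # the paths diverge here
            break
        ancestor, index = a, i
    return ancestor, index


root, p, em, bold = object(), object(), object(), object()
ancestor, index = common_ancestor_sketch([root, p, em], [root, p, bold])
print(ancestor is p, index)  # True 1
```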
- flatten_sxpr(sxpr: str, threshold: int = -1) str[source]#
Returns S-expression sxpr as a one-liner without unnecessary whitespace.
The threshold value is a maximum number of characters allowed in the flattened expression. If this number is exceeded, the un-flattened S-expression is returned. A negative number means that the S-expression will always be flattened. Zero or any positive integer <= 3 essentially means that the expression will not be flattened. Example:
>>> flatten_sxpr('(a\n (b\n c\n )\n)\n')
'(a (b c))'
- Parameters:
sxpr – an S-expression in string form
threshold – maximum allowed string-length of the flattened S-expression. A value < 0 means that it may be arbitrarily long.
- Returns:
Either flattened S-expression or, if the threshold has been overstepped, the original S-expression without leading or trailing whitespace.
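The flattening and the threshold rule can be approximated with two regular-expression substitutions. This is a rough sketch under a simplifying assumption: whitespace inside quoted strings is not protected, which a real implementation would have to handle. `flatten_sxpr_sketch` is a hypothetical name.

```python
import re


def flatten_sxpr_sketch(sxpr: str, threshold: int = -1) -> str:
    flat = re.sub(r'\s+', ' ', sxpr).strip()   # collapse runs of whitespace
    flat = re.sub(r' \)', ')', flat)           # no blank before a closing bracket
    if 0 <= threshold < len(flat):
        return sxpr.strip()                    # too long: keep original layout
    return flat


multiline = '(a\n    (b\n        c\n    )\n)\n'
print(flatten_sxpr_sketch(multiline))  # (a (b c))
```

With `threshold=3` the same input comes back unflattened (only stripped), because no non-trivial S-expression fits into three characters.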
- flatten_xml(xml: str) str[source]#
Returns an XML-tree as a one-liner without unnecessary whitespace, i.e. only whitespace within leaf-nodes is preserved.
A more precise alternative to flatten_xml is to use Node.as_xml() and pass a set containing the top level tag to parameter inline_tags.
- Parameters:
xml – the XML-Text to be “flattened”
- Returns:
the flattened XML-Text
- foregoing_str(path: Path, length: int = -1) str[source]#
Returns length characters from the string content preceding the path.
- full_split(path: Path, i: int, left_biased: bool = True, greedy: bool = True, match_func: PathMatchFunction = <function affirm>, skip_func: PathMatchFunction = <function deny>, chain_attr_name: str = '') Tuple[Node, Node][source]#
Like deep_split(), but splits the first node in the path and returns two trees, one to the left of the split, one to the right. As an edge case either tree can be an empty node. Note that the type RootNode will only be preserved for the first returned Node. The second Node will always be an ordinary node. Also, no attributes will be transferred to the second Node from the root of the path, nor will its pos-value be initialized. (Instantiate a new RootNode-object and use RootNode.swallow() to turn it into a RootNode-object.)
- has_class(node: Node, token: str, *, attribute: str = 'class', all: bool = True) bool#
Returns True, if ‘attribute’ of ‘node’ contains all ‘tokens’.
- has_token(token_sequence: str, token: str, *, all: bool = True) bool[source]#
Returns True if token is contained in the blank-separated token sequence. If token itself is a blank-separated sequence of tokens then, depending on the value of all, True is returned if either all tokens are contained in token_sequence or at least one token is contained in token_sequence:
>>> has_token('bold italic', 'italic')
True
>>> has_token('bold italic', 'normal')
False
>>> has_token('bold italic', 'italic bold')
True
>>> has_token('bold italic', 'bold normal')
False
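The "all tokens" versus "at least one token" distinction maps directly onto set operations. The following sketch assumes whitespace-separated tokens; the function name `has_token_sketch` and the keyword `all_required` (used instead of `all` to avoid shadowing the builtin) are hypothetical.

```python
def has_token_sketch(token_sequence: str, token: str, *,
                     all_required: bool = True) -> bool:
    wanted = set(token.split())
    present = set(token_sequence.split())
    if all_required:
        return wanted <= present        # every wanted token must be present
    return bool(wanted & present)       # one common token is enough


print(has_token_sketch('bold italic', 'bold normal'))                      # False
print(has_token_sketch('bold italic', 'bold normal', all_required=False))  # True
```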
- has_token_on_attr(node: Node, token: str, attribute: str, *, all: bool = True) bool[source]#
Returns True, if ‘attribute’ of ‘node’ contains all ‘tokens’.
- insert_node(leaf_path: Path, rel_pos: int, node: Node, divisible_leaves: Container = frozenset({':EMPTY', ':RegExp', ':Text', ':Whitespace'})) Node[source]#
Inserts a node at a specific position into the last or possibly the second-to-last node in the path. The path must be a “leaf”-path, i.e. a path that ends in a leaf. Returns the parent of the newly inserted node.
This is a convenient function for inserting milestones into a tree-structured document.
Examples:
>>> tree = parse_sxpr('(A "Guten Morgen!")')
>>> _ = insert_node([tree], 6, Node('M', ''), divisible_leaves={'A'})
>>> print(tree.as_sxpr())
(A (A "Guten ") (M) (A "Morgen!"))
>>> tree = parse_sxpr('(A (B "Guten") (S " ") (C "Morgen"))')
>>> path = [tree, tree['S']]
>>> _ = insert_node(path, 0, Node('M', ''), divisible_leaves={'B', 'S', 'C'})
>>> print(tree.as_sxpr())
(A (B "Guten") (M) (S " ") (C "Morgen"))
>>> del tree['M']
>>> _ = insert_node(path, 1, Node('M', ''), divisible_leaves={'B', 'S', 'C'})
>>> print(tree.as_sxpr())
(A (B "Guten") (S " ") (M) (C "Morgen"))
>>> del tree['M']
>>> path = [tree, tree['B']]
>>> _ = insert_node(path, 2, Node('Hicks!', ''), divisible_leaves={'B', 'S', 'C'})
>>> print(tree.as_sxpr())
(A (B "Gu") (Hicks!) (B "ten") (S " ") (C "Morgen"))
>>> tree = parse_sxpr('(A (B "Guten") (S " ") (C "Morgen"))')
>>> path = [tree['B']]  # same tree, but path is confined to the leaf node!
>>> _ = insert_node(path, 2, Node('Hicks!', ''), divisible_leaves={'B', 'S', 'C'})
>>> print(tree.as_sxpr())
(A (B (B "Gu") (Hicks!) (B "ten")) (S " ") (C "Morgen"))
- leaf_path(path: Path | None, pick_child: PickChildFunction, *, steps: int = -1) Path | None#
Returns the path of a descendant that follows steps generations up the tree originating in path[-1]. If steps < 0 this will be as many generations as are needed to reach a leaf-node. The function pick_child determines which branch to follow during each iteration, as long as the top of the path is not yet a leaf node. A path-parameter value of None will simply be passed through.
- leaf_paths(criterion: PathSelector) PathMatchFunction[source]#
Creates a path-match function that matches only and all leaf paths for those paths that the criterion matches. Warning: This may be slower than a custom algorithm that matches only leaf-paths right from the start. Example:
>>> xml = '''<doc><p>In München<footnote><em>München</em> is the German
... name of the city of Munich</footnote> is a Hofbräuhaus</p></doc>'''
>>> tree = parse_xml(xml)
>>> for path in tree.select_path(leaf_paths('footnote')):
...     pp_path(path, 1)
'doc <- p <- footnote <- em "München"'
'doc <- p <- footnote <- :Text " is the German\nname of the city of Munich"'
Compare this with the result without the leaf_paths-filter:
>>> for path in tree.select_path('footnote'):
...     pp_path(path, 1)
'doc <- p <- footnote "München is the German\nname of the city of Munich"'
- match_path_str(path_str: str, glob_pattern: str) bool[source]#
Matches a path_str against a glob-pattern.
- next_path(path: Path) Path | None[source]#
Returns the path of the successor of the last node in the path. The successor is the sibling of the same parent Node succeeding the node, or if it already is the last sibling, the parent’s sibling succeeding the parent, or grandparent’s sibling and so on. In case no successor is found when the first ancestor has been reached, None is returned.
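The search for a successor, climbing towards the root until some ancestor has a following sibling, can be sketched like this. `Nd` and `next_path_sketch` are hypothetical names for this illustration; a path is assumed to be a list of nodes starting at the root.

```python
from typing import List, Optional


class Nd:
    """Hypothetical minimal node with a name and a list of children."""

    def __init__(self, name: str, *children: "Nd") -> None:
        self.name = name
        self.children = list(children)


def next_path_sketch(path: List[Nd]) -> Optional[List[Nd]]:
    # Try the last node first, then its parent, its grandparent and so on.
    for i in range(len(path) - 1, 0, -1):
        node, parent = path[i], path[i - 1]
        siblings = parent.children
        k = next(j for j, nd in enumerate(siblings) if nd is node)
        if k + 1 < len(siblings):
            return path[:i] + [siblings[k + 1]]
    return None                       # reached the root without a successor


C, D, E = Nd('C'), Nd('D'), Nd('E')
B = Nd('B', C, D)
A = Nd('A', B, E)
print([nd.name for nd in next_path_sketch([A, B, C])])  # ['A', 'B', 'D']
print([nd.name for nd in next_path_sketch([A, B, D])])  # ['A', 'E']
print(next_path_sketch([A, E]))                         # None
```

A predecessor search in the style of prev_path() would mirror this loop, testing `k > 0` and picking `siblings[k - 1]` instead.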
- normalize_token_sequence(token_sequence: str) str[source]#
Normalizes the token sequence, i.e. whitespace at the beginning and the end will be stripped, any other whitespace will be replaced by a single blank.
- parse_json(json_str: str) RootNode[source]#
Parses a JSON-representation of a node-tree. Unlike parse_xml, this function does not convert any json-document into a node-tree, but only json-documents that represent a node-tree, e.g. a json-document that has been produced by Node.as_json()!
- parse_sxml(sxml: str | StringView) RootNode[source]#
Generates a tree of nodes from SXML. Example:
>>> sxml = '(employee (@ (branch "Secret Service") (id "007")) "James Bond")'
>>> tree = parse_sxml(sxml)
>>> print(tree.as_xml())
<employee branch="Secret Service" id="007">James Bond</employee>
- parse_sxpr(sxpr: str | StringView) RootNode[source]#
Generates a tree of nodes from an S-expression.
This can - among other things - be used for deserialization of trees that have been serialized with Node.as_sxpr() or as a convenient way to generate test data.
Example:
>>> parse_sxpr("(a (b c))").as_sxpr(flatten_threshold=0)
'(a\n  (b "c"))'
parse_sxpr() does not initialize the node’s pos-values. This can be done with Node.with_pos():
>>> tree = parse_sxpr('(A (B "x") (C "y"))').with_pos(0)
>>> tree['C'].pos
1
- parse_xml(xml: str | StringView, string_tag: str = ':Text', ignore_pos: bool = False, out_empty_tags: AbstractSet[str] = frozenset({''}), strict_mode: bool = True) RootNode[source]#
Generates a tree of nodes from a (Pseudo-)XML-source. This simplified XML-parser only parses the XML-content; processing instructions, document-type definitions etc. will be ignored. The source-document should not contain embedded DTDs or CData-sections, as the outcome is undefined, i.e. either an error will be raised or the returned tree will contain (possibly inconclusive) fragments of these parts.
For a more precise XML-parser that conforms to the W3C-XML-specification, use parse_XML() or parse_HTML().
- Parameters:
xml – The XML-string to be parsed into a tree of Nodes
string_tag – A tag-name that will be used for strings inside mixed-content-tags.
ignore_pos – if True, ‘_pos’-attributes will be understood as normal XML-attributes. Otherwise, ‘_pos’ will be understood as a special attribute, the value of which will be written to node._pos and not transferred to the node.attr-dictionary.
out_empty_tags – A set that is filled with the names of those tags that are empty tags, e.g. “<br/>”
strict_mode – If True, errors are raised if XML contains stylistic or interoperability errors, like using one and the same tag-name for empty and non-empty tags, for example.
- path_head(path: Path, criterion: NodeSelector, greedy: bool = False) Path[source]#
Returns the beginning of the path until and including the first node for which the criterion holds. Examples:
>>> tree = parse_sxpr('(A (B (C (D (B (E "?"))))))')
>>> path = tree.pick_path('E')
>>> print(pp_path(path))
A <- B <- C <- D <- B <- E
>>> head = path_head(path, 'B')
>>> print(pp_path(head))
A <- B
>>> head2 = path_head(path, 'B', greedy=True)
>>> print(pp_path(head2))
A <- B <- C <- D <- B
>>> failure = path_head(path, '?')
>>> print(failure)
[]
- path_head_if(path: Path, match_func: NodeMatchFunction, greedy: bool = False) Path[source]#
Returns the beginning of the path until and including the first node for which match_func is True.
- path_sanity_check(path: Path) bool[source]#
Checks whether the nodes following in the path-list are really immediate descendants of each other.
- path_tail(path: Path, criterion: NodeSelector, greedy: bool = False) Path[source]#
Returns the ending of the path until and including the first node for which the criterion holds. Examples:
>>> tree = parse_sxpr('(A (B (C (D (B (E "?"))))))')
>>> path = tree.pick_path('E')
>>> print(pp_path(path))
A <- B <- C <- D <- B <- E
>>> tail = path_tail(path, 'B')
>>> print(pp_path(tail))
B <- E
>>> tail2 = path_tail(path, 'B', greedy=True)
>>> print(pp_path(tail2))
B <- C <- D <- B <- E
>>> failure = path_tail(path, '?')
>>> print(failure)
[]
- path_tail_if(path: Path, match_func: NodeMatchFunction, greedy: bool = False) Path[source]#
Returns the ending of the path until and including the first node for which match_func is True.
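The effect of greedy on the head and tail functions documented above can be sketched on a plain list of names. Here strings stand in for nodes, and `head_if` and `tail_if` are hypothetical helpers written for this illustration.

```python
from typing import Callable, List, TypeVar

T = TypeVar('T')


def head_if(path: List[T], match: Callable[[T], bool],
            greedy: bool = False) -> List[T]:
    hits = [i for i, item in enumerate(path) if match(item)]
    if not hits:
        return []
    cut = hits[-1] if greedy else hits[0]   # greedy: longest possible head
    return path[:cut + 1]


def tail_if(path: List[T], match: Callable[[T], bool],
            greedy: bool = False) -> List[T]:
    hits = [i for i, item in enumerate(path) if match(item)]
    if not hits:
        return []
    cut = hits[0] if greedy else hits[-1]   # greedy: longest possible tail
    return path[cut:]


names = ['A', 'B', 'C', 'D', 'B', 'E']


def is_b(name: str) -> bool:
    return name == 'B'


print(' <- '.join(head_if(names, is_b)))               # A <- B
print(' <- '.join(tail_if(names, is_b, greedy=True)))  # B <- C <- D <- B <- E
```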
- pick_from_path(path: Path, criterion: NodeSelector, reverse: bool = False) Node | None[source]#
Picks the first node from the path that fulfils the criterion. Returns None if the path does not contain any node fulfilling the criterion.
- pick_from_path_if(path: Path, match_func: NodeMatchFunction, reverse: bool = False) Node | None[source]#
Picks the first node from the path for which match_func is True. Returns None if the path does not contain any node for which this is the case.
- pick_path(start_path: Path, criteria: PathSelector, include_root: bool = False, reverse: bool = False, skip_subtree: PathSelector = <function deny>) Path | None[source]#
Returns the first path for which the criterion matches. If no path matches in the given direction (forward by default, or reverse if parameter reverse is True), None is returned.
- pick_path_if(start_path: Path, match_func: PathMatchFunction, include_root: bool = False, reverse: bool = False, skip_func: PathMatchFunction = <function deny>) Path | None[source]#
Returns the first path for which match_func becomes True. If no path matches in the given direction (forward by default, or reverse if parameter reverse is True), None is returned.
- pp_path(path: Path, with_content: int | Tuple[int, int] = 0, delimiter: str = ' <- ') str[source]#
Serializes a path as string.
- Parameters:
path – the path to be serialized.
with_content – the number of nodes from the end of the path for which the content will be displayed next to the name. If a tuple is given, the first integer is the number of nodes and the second integer is the maximum length of the content string, before it is abbreviated.
delimiter – The delimiter separating the nodes in the returned string.
- Returns:
the string-serialization of the given path.
- pred_siblings(path: Path, criteria: NodeSelector = <function affirm>, reverse: bool = False) Iterator[Node][source]#
Returns an iterator over the siblings preceding the end-node in the path. Siblings are iterated left to right, so if the end-node of path is the 5th child of its parent (path[-2]) the siblings will be iterated starting with the 1st child (not with the 4th!). This can be reversed with reverse=True.
- prev_path(path: Path) Path | None[source]#
Returns the path of the predecessor of the last node in the path. The predecessor is the sibling of the same parent-node preceding the node, or, if it already is the first sibling, the parent’s sibling preceding the parent, or grandparent’s sibling and so on. In case no predecessor is found, when the first ancestor has been reached, None is returned.
- reflow_as_oneliner(tree: Node, leaf_criterion: NodeSelector = <function LEAF_NODE>, whitespace_re=re.compile('\\s+'), xml_space_policy: XMLSpacePolicy = XMLSpacePolicy.IGNORE) None[source]#
Removes all line-breaks and indentations in the content of leaf-nodes selected by the leaf_criterion. (If any of these nodes have children, a TypeError will be raised.) ‘whitespace_re’ is the regular expression that is used to capture whitespace.
One use case of this function is to normalize XML-code.
- remove_class(node: Node, token: str, *, attribute: str = 'class', remove_empty_attr: bool = True) Node#
Removes all tokens from attribute of node.
- remove_token(token_sequence, token: str) str[source]#
Removes all tokens contained in token from token_sequence:
>>> remove_token('red thin stroked', 'thin')
'red stroked'
>>> remove_token('red thin stroked', 'blue')
'red thin stroked'
>>> remove_token('red thin stroked', 'blue stroked')
'red thin'
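A filter over the split token sequence is enough to reproduce these results. This is a minimal sketch that assumes whitespace-separated tokens; `remove_token_sketch` is a hypothetical name, not the library function.

```python
def remove_token_sketch(token_sequence: str, token: str) -> str:
    unwanted = set(token.split())
    # Keep the original order of the remaining tokens.
    return ' '.join(tk for tk in token_sequence.split() if tk not in unwanted)


print(remove_token_sketch('red thin stroked', 'blue stroked'))  # red thin
```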
- remove_token_from_attr(node: Node, token: str, attribute: str, remove_empty_attr: bool = True) Node[source]#
Removes all tokens from attribute of node.
- reset_chain_ID(chain_length: int = 3)[source]#
For testing and debugging, reset the chain_id counter to ensure deterministic results.
- Parameters:
chain_length – The starting length of the letter-chain used as ID value
- select_from_path(path: Path, criteria: NodeSelector, reverse: bool = False) Iterator[Node][source]#
Yields all nodes from path which fulfill the criterion.
- select_from_path_if(path: Path, match_func: NodeMatchFunction, reverse: bool = False) Iterator[Node][source]#
Yields all nodes from path for which the match_function is true.
- select_path(start_path: Path, criteria: PathSelector, include_root: bool = False, reverse: bool = False, skip_subtree: PathSelector = <function deny>) Iterator[Path][source]#
Like select_path_if() but yields the entire path (i.e. list of descendants, the last one being the matching node) instead of just the matching nodes.
- select_path_if(start_path: Path, match_func: PathMatchFunction, include_root: bool = False, reverse: bool = False, skip_func: PathMatchFunction = <function deny>) Iterator[Path][source]#
Creates an Iterator yielding all paths for which the match_function is true, starting from path.
- split_node(node: Node, parent: Node, i: int, left_biased: bool = True, chain_attr: dict | None = None) int[source]#
Splits a node at the given index (in case of a branch-node) or string-position (in case of a leaf-node). Returns the index of the right part within the parent node after the split. (This means that with node.insert(index, nd) nd will be inserted exactly at the split location.)
Non-anonymous nodes that have been split will be marked by updating their attribute-dictionary with the chain_attr-dictionary if given.
- Parameters:
node – the node to be split
parent – the node’s parent
i – the index either of the child or of the character before which the node will be split.
left_biased – if True, yields the location after the end of the previous path rather than the location at the very beginning of the next path. Default value is “True”.
chain_attr – a dictionary with a single key and value resembling an attribute and value that will be added to the attributes-dictionary of both nodes after the split, if the node is a named node.
- Returns:
the index of the split within the children’s tuple of the parent node.
Examples:
>>> test_tree = parse_sxpr('(X (A "Hello, ") (B "Peter") (C " Smith"))').with_pos(0)
>>> X = copy.deepcopy(test_tree)

# test edge cases first
>>> split_node(X['B'], X, 0)
1
>>> print(X.as_sxpr())
(X (A "Hello, ") (B "Peter") (C " Smith"))
>>> split_node(X['B'], X, X['B'].strlen())
2
>>> print(X.as_sxpr())
(X (A "Hello, ") (B "Peter") (C " Smith"))

# standard case
>>> split_node(X['B'], X, 2)
2
>>> print(X.as_sxpr())
(X (A "Hello, ") (B "Pe") (B "ter") (C " Smith"))
>>> print(X.pick('B', reverse=True).pos)
9

# use split() as preparation for adding markup
>>> X = copy.deepcopy(test_tree)
>>> a = split_node(X['A'], X, 6)
>>> a
1
>>> b = split_node(X['C'], X, 1)
>>> b
4
>>> print(X.as_sxpr())
(X (A "Hello,") (A " ") (B "Peter") (C " ") (C "Smith"))
>>> markup = Node('em', X[a:b]).with_pos(X[a].pos)
>>> X.result = X[:a] + (markup,) + X[b:]
>>> print(X.as_sxpr())
(X (A "Hello,") (em (A " ") (B "Peter") (C " ")) (C "Smith"))

# a more complex case: add markup to a nested tree
>>> X = parse_sxpr('(X (A "Hello, ") (B "Peter") (bold (C " Smith")))').with_pos(0)
>>> a = split_node(X['A'], X, 6)
>>> b0 = split_node(X['bold']['C'], X['bold'], 1)
>>> b0
1
>>> print(X.as_sxpr())
(X (A "Hello,") (A " ") (B "Peter") (bold (C " ") (C "Smith")))
>>> b = split_node(X['bold'], X, b0)
>>> b
4
>>> print(X.as_sxpr())
(X (A "Hello,") (A " ") (B "Peter") (bold (C " ")) (bold (C "Smith")))
>>> markup = Node('em', X[a:b]).with_pos(X[a].pos)
>>> X.result = X[:a] + (markup,) + X[b:]
>>> print(X.as_sxpr())
(X (A "Hello,") (em (A " ") (B "Peter") (bold (C " "))) (bold (C "Smith")))

# use left_bias hint for potentially ambiguous cases:
>>> X = parse_sxpr('(X (A ""))')
>>> split_node(X['A'], X, X['A'].strlen())
0
>>> split_node(X['A'], X, X['A'].strlen(), left_biased=False)
1
- strlen_of(segment: ~nodetree.Node | ~typing.Sequence[~nodetree.Node] | ~DHParser.stringview.StringView | str, select: PathSelector = <function LEAF_PATH>, ignore: PathSelector = <function deny>) int[source]#
Returns the string size from a single node or a tuple of Nodes.
- succ_siblings(path: Path, criteria: NodeSelector = <function affirm>, reverse: bool = False) Iterator[Node][source]#
Returns an iterator over the siblings succeeding the end-node in the path. Siblings are iterated left to right. This can be reversed with reverse=True.
- tree_sanity_check(tree: Node) bool[source]#
Sanity check for node-trees: One and the same node must never appear twice in the node-tree. Frozen Nodes (EMPTY_NODE, PLACEHOLDER) should only exist temporarily and must have been dropped or eliminated before any kind of tree generation (i.e. parsing) or transformation is finished.
- Parameters:
tree – the root of the tree to be checked
- Returns:
True, if the tree is “sane”, False otherwise.
- validate_token_sequence(token_sequence: str) bool[source]#
Returns True, if token_sequence is properly formed, i.e. normalized.
Token sequences are strings or words which are separated by single blanks with no leading or trailing blank.
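The rule stated above (words separated by single blanks, no leading or trailing blank) can be illustrated with a short sketch. The function name is_normalized_token_sequence is hypothetical and this is not DHParser's implementation; in particular, treating the empty string as well-formed is an assumption of this sketch.

```python
def is_normalized_token_sequence(token_sequence: str) -> bool:
    """Hypothetical stand-in for validate_token_sequence().

    Splitting on arbitrary whitespace and re-joining with single blanks
    produces the normalized form; the input is well-formed exactly if it
    is already identical to that form.
    """
    return token_sequence == " ".join(token_sequence.split())


# Well-formed: single blanks only, nothing leading or trailing
well_formed = is_normalized_token_sequence("alpha beta gamma")
# Malformed: double blank, leading blank, or a tab instead of a blank
double_blank = is_normalized_token_sequence("alpha  beta")
leading_blank = is_normalized_token_sequence(" alpha beta")
tab_separated = is_normalized_token_sequence("alpha\tbeta")
```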
- zoom_into_path(path: Path | None, pick_child: PickChildFunction, steps: int) Path | None[source]#
Returns the path of a descendant that follows steps generations up the tree originating in path[-1]. If steps < 0 this will be as many generations as are needed to reach a leaf-node. The function pick_child determines which branch to follow during each iteration, as long as the top of the path is not yet a leaf node. A path-parameter value of None will simply be passed through.
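The stepping behaviour described above can be sketched with a bare-bones tree. The Node class and zoom_sketch below are illustrative stand-ins, not DHParser's classes, and the assumption that pick_child receives a node and returns one of its children is made only for this sketch.

```python
from typing import Callable, List, Optional, Tuple, Union


class Node:
    """Bare-bones stand-in for a tree node (not DHParser's Node class)."""

    def __init__(self, name: str, result: Union[str, Tuple["Node", ...]]):
        self.name = name
        self.result = result

    @property
    def children(self) -> Tuple["Node", ...]:
        # Leaf nodes carry a string result and therefore have no children
        return self.result if isinstance(self.result, tuple) else ()


Path = List[Node]


def zoom_sketch(path: Optional[Path],
                pick_child: Callable[[Node], Node],
                steps: int) -> Optional[Path]:
    if path is None:          # None is simply passed through
        return None
    path = list(path)         # do not modify the caller's list
    # A negative step count never reaches zero, so the loop runs to a leaf
    while steps != 0 and path[-1].children:
        path.append(pick_child(path[-1]))
        steps -= 1
    return path


leaf = Node("word", "Hello")
tree = Node("doc", (Node("p", (leaf, Node("word", "world"))),))
first_child = lambda node: node.children[0]

one_step = zoom_sketch([tree], first_child, 1)
to_leaf = zoom_sketch([tree], first_child, -1)
```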
Module transform#
Module transform contains the functions for transforming the
concrete syntax tree (CST) into an abstract syntax tree (AST).
As these functions are very generic, they can in principle be used for any kind of tree transformations, not necessarily only for CST -> AST transformations.
- add_attributes(path: Path, attributes: dict)[source]#
- add_attributes(*args: dict, **kwargs)
Adds the attributes in the given dictionary to the XML-attributes of the last node in the given path.
- add_error(path: Path, error_msg: str, error_code: ErrorCode = 1000)[source]#
- add_error(*args: str, **kwargs)
Raises an error unconditionally. This makes sense in case illegal paths are encoded in the syntax to provide more accurate error messages.
- all_of(path: Path, bool_func_set: AbstractSet[Callable]) bool[source]#
- all_of(*args: Callable)
- all_of(*args: Set, **kwargs)
Returns True, if all the bool functions in bool_func_set evaluate to True for the given path.
- any_of(path: Path, bool_func_set: AbstractSet[Callable]) bool[source]#
- any_of(*args: Callable)
- any_of(*args: Set, **kwargs)
Returns True, if any of the bool functions in bool_func_set evaluate to True for the given path.
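As a rough illustration of how all_of and any_of combine path predicates, the following sketch uses plain dictionaries as nodes. The names all_of_sketch, any_of_sketch, is_leaf and is_anonymous are hypothetical helpers for this example only; the convention that anonymous node names start with a colon follows the documentation of name_matches.

```python
from typing import Callable, Iterable, List

Path = List[dict]  # stand-in: a path is a list of dict-nodes, root first


def all_of_sketch(path: Path, bool_funcs: Iterable[Callable[[Path], bool]]) -> bool:
    return all(func(path) for func in bool_funcs)


def any_of_sketch(path: Path, bool_funcs: Iterable[Callable[[Path], bool]]) -> bool:
    return any(func(path) for func in bool_funcs)


def is_leaf(path: Path) -> bool:
    # In this sketch a leaf is a node whose result is a string
    return isinstance(path[-1]["result"], str)


def is_anonymous(path: Path) -> bool:
    return path[-1]["name"].startswith(":")


named_leaf = [{"name": "word", "result": "Hello"}]
anonymous_leaf = [{"name": ":Text", "result": "Hello"}]

both_named = all_of_sketch(named_leaf, {is_leaf, is_anonymous})
either_named = any_of_sketch(named_leaf, {is_leaf, is_anonymous})
both_anonymous = all_of_sketch(anonymous_leaf, {is_leaf, is_anonymous})
```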
- apply_if(path: Path, transformation: Callable | Tuple[Callable, ...], condition: Callable[[List[Node]], bool])[source]#
- apply_if(*args: Callable, **kwargs)
- apply_if(*args: tuple, **kwargs)
Applies a transformation only if a certain condition is met.
- apply_ifelse(path: Path, if_transformation: Callable | Tuple[Callable, ...], else_transformation: Callable | Tuple[Callable, ...], condition: Callable[[List[Node]], bool])[source]#
- apply_ifelse(*args: Callable, **kwargs)
- apply_ifelse(*args: tuple, **kwargs)
Applies one particular transformation if a certain condition is met and another transformation otherwise.
- apply_unless(path: Path, transformation: Callable | Tuple[Callable, ...], condition: Callable[[List[Node]], bool])[source]#
- apply_unless(*args: Callable, **kwargs)
- apply_unless(*args: tuple, **kwargs)
Applies a transformation if a certain condition is not met.
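The three conditional wrappers apply_if, apply_ifelse and apply_unless can be pictured with the following sketch. It uses dictionaries as stand-in nodes, and the helper names are hypothetical. That a tuple of transformations is applied in order is an assumption of this sketch, inferred from the type signature only.

```python
from typing import Callable, List, Tuple, Union

Path = List[dict]  # stand-in: list of dict-nodes, the last one is the current node
Transformation = Union[Callable[[Path], None], Tuple[Callable[[Path], None], ...]]


def _apply(path: Path, transformation: Transformation) -> None:
    # Assumption: a tuple of transformations is applied one after the other
    funcs = transformation if isinstance(transformation, tuple) else (transformation,)
    for func in funcs:
        func(path)


def apply_if_sketch(path, transformation, condition):
    if condition(path):
        _apply(path, transformation)


def apply_unless_sketch(path, transformation, condition):
    if not condition(path):
        _apply(path, transformation)


def apply_ifelse_sketch(path, if_transformation, else_transformation, condition):
    _apply(path, if_transformation if condition(path) else else_transformation)


def is_leaf(path):
    return isinstance(path[-1]["result"], str)


def upper(path):
    path[-1]["result"] = path[-1]["result"].upper()


def mark_branch(path):
    path[-1]["name"] = "branch"


leaf = {"name": "word", "result": "hello"}
branch = {"name": "p", "result": [leaf]}

apply_if_sketch([branch, leaf], upper, is_leaf)             # condition met: applied
apply_unless_sketch([branch, leaf], mark_branch, is_leaf)   # condition met: skipped
apply_ifelse_sketch([branch], upper, mark_branch, is_leaf)  # not a leaf: else-branch
```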
- assert_has_children(path: Path, *, condition: CondFunc = <function <lambda>>, error_msg: str = 'Element "%s" has no children', error_code: ErrorCode = 1000)#
Checks for condition; adds an error or warning message if condition is not met.
- change_name(path: Path, name: str)[source]#
- change_name(*args: str, **kwargs)
Changes the tag name of the last node in the path.
- Parameters:
path – the path where the parser shall be replaced
name – The new tag name.
- collapse(path: Path)[source]#
Collapses all sub-nodes of a node by replacing them with the string representation of the node. USE WITH CARE!
>>> sxpr = '(place (abbreviation "p.") (page "26") (superscript "b") (mark ",") (page "18"))'
>>> tree = parse_sxpr(sxpr)
>>> collapse([tree])
>>> print(flatten_sxpr(tree.as_sxpr()))
(place "p.26b,18")
- collapse_children_if(path: Path, condition: ~typing.Callable[[~typing.List[~DHParser.nodetree.Node]], bool], target_name: str, merge_rule: MergeRule = <function join_content>)[source]#
- collapse_children_if(*args: Callable, **kwargs)
(Recursively) merges the content of all adjacent child nodes that fulfill the given condition into a single leaf node with the tag-name target_name. Nodes that do not fulfill the condition will be preserved.
>>> sxpr = '(place (abbreviation "p.") (page "26") (superscript "b") (mark ",") (page "18"))'
>>> tree = parse_sxpr(sxpr)
>>> collapse_children_if([tree], not_one_of({'superscript', 'subscript'}), 'text')
>>> print(flatten_sxpr(tree.as_sxpr()))
(place (text "p.26") (superscript "b") (text ",18"))
See test_transform.TestComplexTransformations for more examples.
- contains_only_whitespace(path: Path) bool[source]#
Returns True for nodes that contain only whitespace regardless of the name, i.e. nodes the content of which matches the regular expression /\s*/, including empty nodes. Note that this is not true for anonymous whitespace nodes that contain comments.
- content_matches(path: Path, regexp: str) bool[source]#
- content_matches(*args: str, **kwargs)
Checks a node’s content against a regular expression.
In contrast to re.match, the regular expression must match the complete string and not just the beginning of the string to succeed!
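The difference between matching a prefix and matching the complete string can be seen with Python's standard re module. The helper content_matches_sketch is a hypothetical stand-in that operates on a plain string; that the library behaves like re.fullmatch is an assumption based on the description above.

```python
import re


def content_matches_sketch(content: str, regexp: str) -> bool:
    """Hypothetical stand-in: succeeds only if regexp covers the whole string."""
    return re.fullmatch(regexp, content) is not None


# re.match is satisfied with a matching prefix ...
prefix_match = re.match(r"\d+", "26b") is not None
# ... whereas a complete match fails because of the trailing "b"
complete_match = content_matches_sketch("26b", r"\d+")
digits_only = content_matches_sketch("26", r"\d+")
```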
- del_attributes(path: Path, attributes: Set = frozenset({}))[source]#
- del_attributes(*args: Set, **kwargs)
Removes XML-attributes from the last node in the given path. If the given set is empty, all attributes will be deleted.
- delimit_children(path: Path, node_factory: Callable[[List[Node]], bool])[source]#
- delimit_children(*args: Callable, **kwargs)
Add a delimiter drawn from the node_factory between all children.
- error_on(path: Path, condition: Callable[[List[Node]], bool], error_msg: str = '', error_code: ErrorCode = 1000)[source]#
- error_on(*args: Callable, **kwargs)
Checks for condition; adds an error or warning message if condition is not met.
- flatten(path: Path, condition: ~typing.Callable[[~typing.List[~DHParser.nodetree.Node]], bool] = <function is_anonymous>, recursive: bool = True)[source]#
- flatten(*args: Callable, **kwargs)
Flattens all children that fulfill the given condition (default: all unnamed children). Flattening means that wherever a node has child nodes, the child nodes are inserted in place of the node.
If the parameter recursive is True, the same will recursively be done with the child-nodes first. In other words, all leaves of this node and its child nodes are collected in-order as direct children of this node.
Applying flatten recursively will result in these kinds of structural transformation:
(1 (+ 2) (+ 3)) -> (1 + 2 + 3)
(1 (+ (2 + (3)))) -> (1 + 2 + 3)
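The two structural transformations shown above can be reproduced with nested Python lists. This is only a sketch of the idea: flatten_sketch is a hypothetical helper, and it treats every nested list as fulfilling the condition, whereas the library function decides per child via its condition parameter.

```python
from typing import Any, List


def flatten_sketch(children: List[Any], recursive: bool = True) -> List[Any]:
    """Replaces every nested list by its elements (hypothetical helper)."""
    flat: List[Any] = []
    for child in children:
        if isinstance(child, list):
            # With recursive=True the child is flattened first, then spliced in
            flat.extend(flatten_sketch(child) if recursive else child)
        else:
            flat.append(child)
    return flat


first = flatten_sketch([1, ["+", 2], ["+", 3]])
second = flatten_sketch([1, ["+", [2, "+", [3]]]])
one_level_only = flatten_sketch([1, ["+", [2, "+", [3]]]], recursive=False)
```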
- fuse(result: Sequence[Node], swallow: Callable[[List[Node]], bool] | None = None) str | Tuple[Node, ...][source]#
Merges the nodes in the given sequence of nodes by merging either their content, if they are all leaf nodes, or their results.
- Parameters:
result – The sequence of nodes to merge.
swallow – A function that takes a node as an argument and returns True if the node should be added as a whole without merging its content. See merge_adjacent() for an example.
- Returns:
The merged result, either a tuple of nodes or a string with the merged content in case all nodes were leaf-nodes and no node was “swallowed”.
- fuse_anonymous_leaves(result: list[Node]) list[Node, ...][source]#
Merges all anonymous leaf nodes and returns a list of the merged nodes, e.g.:
>>> tree = parse_sxpr('(p (:t "alpha") (:t "beta") (x "zzz") (:y (:t "uuu")) (:t "gamma"))')
>>> tree.result = tuple(fuse_anonymous_leaves(list(tree.children)))
>>> print(tree.as_sxpr())
(p (:t "alphabeta") (x "zzz") (:y (:t "uuu")) (:t "gamma"))
- has_ancestor(path: Path, name_set: AbstractSet[str], generations: int = -1, until: AbstractSet[str] | str = frozenset({})) bool[source]#
- has_ancestor(*args: Set, **kwargs)
Checks whether a node with one of the given tag names appears somewhere in the path before the last node in the path.
- Parameters:
generations – determines how deep has_ancestor should dive into the ancestry. “1” means only the immediate parents will be considered, “2” means also the grandparents, and so on. A value less than or equal to zero means all ancestors will be considered.
until – node-names which, when reached, will stop has_ancestor from searching further, even if the generations-parameter would allow a deeper search.
- has_attr(path: Path, attr: str = '', value: str | None = None) bool[source]#
- has_attr(*args: str, **kwargs)
Returns true, if the node has the attribute attr and its value equals value. If value is None, True is returned if the attribute exists, no matter what its value is.
- has_child(path: Path, name_set: AbstractSet[str]) bool[source]#
- has_child(*args: str)
- has_child(*args: Set, **kwargs)
Checks whether at least one child (i.e. immediate descendant) has one of the given tags.
- has_content(path: Path, content: str) bool[source]#
- has_content(*args: str, **kwargs)
Checks a node’s content against a given string. This is faster than content_matches for mere string comparisons.
- has_descendant(path: Path, name_set: AbstractSet[str], generations: int = -1, until: AbstractSet[str] | str = frozenset({})) bool[source]#
- has_descendant(*args: Set, **kwargs)
Checks whether a node with one of the given tag names appears somewhere among the descendants (children and children’s children etc.) of the last node in the path.
- Parameters:
generations – determines how deep has_descendant should dive into the descendants. “1” means only the immediate children will be considered, “2” means also the grandchildren, and so on. A value less than or equal to zero means all descendants will be considered.
until – node-names which, when reached, will stop has_descendant from searching further, even if the generations-parameter would allow a deeper search.
- has_parent(path: Path, name_set: AbstractSet[str]) bool[source]#
- has_parent(*args: str)
- has_parent(*args: Set, **kwargs)
Checks whether the immediate predecessor in the path has one of the given tags.
- has_sibling(path: Path, name_set: AbstractSet[str])[source]#
- has_sibling(*args: str)
- has_sibling(*args: Set, **kwargs)
Checks whether the last node in the path has a node with one of the given names as sibling.
- insert(path: Path, position: int | tuple | Callable, node_factory: Callable[[List[Node]], bool])[source]#
- insert(*args: int, **kwargs)
- insert(*args: tuple, **kwargs)
- insert(*args: Callable, **kwargs)
Inserts a delimiter at a specific position within the children. If position is None, nothing will be inserted. Position values greater than or equal to the number of children mean that the delimiter will be appended to the tuple of children.
Example:
insert(pos_of('paragraph'), node_maker('LF', '\n'))
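The position rules described for insert can be sketched on a plain tuple of children. The helper insert_sketch is hypothetical and handles only None and non-negative integer positions; tuple and callable positions, which the real function also accepts, are left out. The parameter-free factory mirrors what node_maker is documented to return.

```python
from typing import Callable, Optional, Tuple


def insert_sketch(children: Tuple[str, ...],
                  position: Optional[int],
                  node_factory: Callable[[], str]) -> Tuple[str, ...]:
    if position is None:                  # nothing to insert
        return children
    pos = min(position, len(children))    # too large a position means "append"
    # The factory is called so that every insertion yields a fresh object
    return children[:pos] + (node_factory(),) + children[pos:]


line_feed = lambda: "LF"
in_between = insert_sketch(("p1", "p2"), 1, line_feed)
appended = insert_sketch(("p1", "p2"), 5, line_feed)
unchanged = insert_sketch(("p1", "p2"), None, line_feed)
```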
- is_one_of(path: Path, name_set: AbstractSet[str]) bool[source]#
- is_one_of(*args: str)
- is_one_of(*args: Set, **kwargs)
Returns true, if the node’s name is one of the given tag names.
- is_token(path: Path, tokens: AbstractSet[str] = frozenset({})) bool[source]#
- is_token(*args: str)
- is_token(*args: Set, **kwargs)
Checks whether the last node in the path has the name “:Text” and its content matches one of the given tokens. Leading and trailing whitespace-tokens will be ignored. In case an empty set of tokens is passed, any token is a match.
- keep_children(path: Path, section: slice = slice(None, None, None))[source]#
- keep_children(*args: slice, **kwargs)
Keeps only child-nodes which fall into a slice of the result field.
- keep_children_if(path: Path, condition: Callable[[List[Node]], bool])[source]#
- keep_children_if(*args: Callable, **kwargs)
Keeps only those children for which condition() returns True; all other children are removed.
- keep_content(path: Path, regexp: str)[source]#
- keep_content(*args: str, **kwargs)
Keeps children depending on their string value, i.e. only children whose content matches regexp are kept.
- keep_nodes(path: Path, names: AbstractSet[str])[source]#
- keep_nodes(*args: str)
- keep_nodes(*args: Set, **kwargs)
Keeps children by tag name, i.e. only children with one of the given names are kept.
- keep_tokens(path: Path, tokens: AbstractSet[str] = frozenset({}))[source]#
- keep_tokens(*args: str)
- keep_tokens(*args: Set, **kwargs)
Keeps only those immediate descendants of a node that are tokens from a particular set of tokens. If tokens is the empty set, all tokens are kept.
- key_node_name(node: Node) str[source]#
Returns the tag name of the node as key for selecting transformations from the transformation table in function traverse.
- lean_left(path: Path, operators: AbstractSet[str])[source]#
- lean_left(*args: str)
- lean_left(*args: Set, **kwargs)
Turns a right leaning tree into a left leaning tree:
(op1 a (op2 b c)) -> (op2 (op1 a b) c)
If a left-associative operator is parsed with a right-recursive parser, lean_left can be used to rearrange the tree structure so that it properly reflects the order of association.
This transformation is needed, if you want to get the order of precedence right, when writing a grammar, say, for arithmetic that avoids left-recursion. (DHParser does support left-recursion but left-recursive grammars might not be compatible with other PEG-frameworks anymore.)
ATTENTION: This transformation function moves forward recursively, so grouping nodes must not be eliminated during traversal! This must be done in a second pass.
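The rotation (op1 a (op2 b c)) -> (op2 (op1 a b) c) can be tried out on nested tuples of the form (operator, left, right). The following is a sketch of the idea rather than the library's algorithm: lean_left_sketch is a hypothetical helper that rotates every nested right operand, whereas lean_left restricts itself to the operators passed to it. The evaluate function is added only to show why the shape of the tree matters.

```python
from typing import Tuple, Union

Tree = Union[int, Tuple[str, "Tree", "Tree"]]


def lean_left_sketch(tree: Tree) -> Tree:
    """Turns a right-leaning chain of binary operations into a left-leaning one."""
    if not isinstance(tree, tuple):
        return tree
    op, left, right = tree
    while isinstance(right, tuple):
        inner_op, middle, rest = right
        left = (op, left, middle)    # (op1 a b) becomes the new left operand
        op, right = inner_op, rest   # op2 moves to the top, c stays on the right
    return (op, left, right)


def evaluate(tree: Tree) -> int:
    if not isinstance(tree, tuple):
        return tree
    op, left, right = tree
    a, b = evaluate(left), evaluate(right)
    return a - b if op == "-" else a + b


# "10 - 3 - 2" as delivered by a right-recursive parser: 10 - (3 - 2)
right_leaning = ("-", 10, ("-", 3, 2))
left_leaning = lean_left_sketch(right_leaning)   # (10 - 3) - 2
```

Evaluating both trees shows the effect: the right-leaning tree computes 10 - (3 - 2), the rearranged tree computes (10 - 3) - 2, which is the conventional reading of a left-associative operator.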
- left_associative(path: Path)[source]#
Rearranges a flat node with infix operators into a left associative tree.
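A flat sequence of operands and infix operators can be folded into a left-associative tree as sketched below. The tuple representation (operator, left, right) and the name left_associative_sketch are assumptions made for this example, not the library's data model.

```python
from typing import List, Tuple, Union

Tree = Union[int, Tuple[str, "Tree", "Tree"]]


def left_associative_sketch(flat: List[Union[int, str]]) -> Tree:
    """Folds [a, op1, b, op2, c, ...] into (op2, (op1, a, b), c) and so on."""
    items = iter(flat)
    tree: Tree = next(items)
    # zip(items, items) pairs each operator with the operand that follows it
    for operator, operand in zip(items, items):
        tree = (operator, tree, operand)
    return tree


nested = left_associative_sketch([10, "-", 3, "+", 2])
single = left_associative_sketch([42])
```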
- lstrip(path: Path, condition: ~typing.Callable[[~typing.List[~DHParser.nodetree.Node]], bool] = <function contains_only_whitespace>)[source]#
- lstrip(*args: Callable, **kwargs)
Recursively removes all leading child-nodes that fulfill a given condition.
- merge_adjacent(path: Path, condition: Callable[[List[Node]], bool], preferred_name: str = '', *, swallow: Callable[[List[Node]], bool] | None = None)[source]#
- merge_adjacent(*args: Callable, **kwargs)
Merges adjacent nodes that fulfill the given condition. In case some nodes are leaf-nodes, but others are not, the leaf-nodes’ content will be added as TOKEN_PTYPE-Node to the result of the merged node.
The merged node’s name will be set to the value preferred_name unless that value is the empty string. In this case the name of the first node of the merge will be chosen. (Note that the assignment of the preferred name only happens if a merge actually took place, i.e. if there are at least two nodes that have been merged. merge_adjacent() will not rename single nodes.)
merge_adjacent() differs from collapse_children_if() in two respects:
1. The merged nodes are not “collapsed” to their string content.
2. The naming rule for merged nodes is different, in so far as the preferred_name passed to merge_adjacent() is only used if it actually occurs among the nodes to be merged.
Thus, if merge_adjacent() is substituted for collapse_children_if() in the doc-string example of the latter function, the example yields:
>>> sxpr = '(place (abbreviation "p.") (page "26") (superscript "b") (mark ",") (page "18"))'
>>> tree = parse_sxpr(sxpr)
>>> merge_adjacent([tree], not_one_of({'superscript', 'subscript'}), '')
>>> print(flatten_sxpr(tree.as_sxpr()))
(place (abbreviation "p.26") (superscript "b") (mark ",18"))
The parameter swallow takes a function that must yield true for those nodes that shall be swallowed as a whole, i.e. of which the content shall not be merged and which keep their name. This is useful if you’d like to keep certain nodes intact, for example internet links, when merging a sequence of nodes, as seen below. Without the swallow-parameter, the link (node named “a”) will not be retained in the merged node, but merely its attribute is copied, which may not be what had been intended:
>>> tree = parse_sxpr('''(p (t "In ") (a `(href "www.munich.de") "München")
...     (t "steht ") (t "ein ") (t "Hofbräuhaus."))''')
>>> import copy; tree_copy = copy.deepcopy(tree)
>>> merge_adjacent([tree], is_one_of('t', 'a'))
>>> print(tree.as_sxpr())
(p (t `(href "www.munich.de") "In Münchensteht ein Hofbräuhaus."))
To retain the link-node, merge_adjacent must be instructed to swallow nodes with name “a” as a whole:
>>> merge_adjacent([tree_copy], is_one_of('t', 'a'), swallow=is_a('a'))
>>> print(tree_copy.as_sxpr())
(p (t (:Text "In ") (a `(href "www.munich.de") "München") (:Text "steht ein Hofbräuhaus.")))
- merge_connected(path: Path, content: Callable[[List[Node]], bool], delimiter: Callable[[List[Node]], bool], content_name: str = '', delimiter_name: str = '', *, swallow: Callable[[List[Node]], bool] | None = None)[source]#
- merge_connected(*args: Callable, **kwargs)
Merges sequences of content and delimiters. Unlike merge_adjacent(), which does not make this distinction, delimiters at the fringe of content blocks are not included in the merge. (Note that, unlike with merge_adjacent(), the content name is always assigned to content nodes, but not to delimiters.)
- Parameters:
path – The path, i.e. list of “ancestor” nodes, ranging from the root node (path[0]) to the current node (path[-1])
content – Condition to identify content nodes. (Path -> bool)
delimiter – Condition to identify delimiter nodes. (Path -> bool)
content_name – tag name for the merged content blocks
delimiter_name – tag name for the merged delimiters at the fringe
ATTENTION: The condition to identify content nodes and the condition to identify delimiter nodes must never come true for one and the same node!!!
- merge_leaves(path: Path, *, condition: CondFunc = <function is_anonymous_leaf>, preferred_name: str = '', swallow: Optional[CondFunc] = None)#
Merges adjacent nodes that fulfill the given condition. In case some nodes are leaf-nodes, but others are not, the leaf-nodes’ content will be added as TOKEN_PTYPE-Node to the result of the merged node.
- merge_results(dest: Node, src: Tuple[Node, ...], root: Node) bool[source]#
Merges the results of nodes src and writes them to the result of dest type-safely, if all src nodes are leaf-nodes (in which case their result-strings are concatenated) or none are leaf-nodes (in which case the tuples of children are concatenated). Returns True in case of a successful merge, False if some source nodes were leaf-nodes and some weren’t and the merge could thus not be done.
Example
>>> head, tail = Node('head', '123'), Node('tail', '456')
>>> merge_results(head, (head, tail), head)  # merge head and tail (in that order) into head
True
>>> str(head)
'123456'
- merge_treetops(node: Node)[source]#
Recursively traverses the tree and “merges” nodes that contain only anonymous child nodes that are leaves. “Merging” here means the node’s result will be replaced by the merged content of the children.
- move_fringes(path: Path, condition: Callable[[List[Node]], bool], *, side: str = 'both', merge: bool = True)[source]#
- move_fringes(*args: Callable, **kwargs)
Moves adjacent nodes on the left and right fringe that fulfill the given condition to the parent node. If the merge-flag is set, a moved node will be merged with its predecessor (or successor, respectively) in the parent node in case it also fulfills the given condition. Example:
>>> tree = parse_sxpr('''(paragraph
...   (sentence
...     (word "Hello ")
...     (S " ")
...     (word "world,")
...     (S " "))
...   (sentence
...     (word "said")
...     (S " ")
...     (word "Hal.")))''')
>>> tree = traverse(tree, {'sentence': move_fringes(is_one_of({'S'}))})
>>> print(tree.as_sxpr())
(paragraph (sentence (word "Hello ") (S " ") (word "world,")) (S " ") (sentence (word "said") (S " ") (word "Hal.")))
In this example the blank at the end of the first sentence has been moved BETWEEN the two sentences. This is desirable, because if you extract a sentence from the data, most likely you are not interested in the trailing blank. Of course, this situation can best be avoided by a careful formulation of the grammar in the first place.
WARNING: This function should never follow replace_by_children() in the transformation list!!!
- name_matches(path: Path, regexp: str) bool[source]#
- name_matches(*args: str, **kwargs)
Returns true, if the node’s name matches the regular expression regexp completely. For example, ‘:.*’ matches all anonymous nodes.
- neg(path: Path, bool_func: Callable) bool | None[source]#
- neg(*args: Callable, **kwargs)
Returns the inverted boolean result of bool_func(path), unless the result is None. In that case None is passed through.
- node_maker(name: str, result: Tuple[Callable[[], Node], ...] | Callable[[], Node] | str, attributes: dict = {}) Callable[source]#
Returns a parameter-free function that upon calling returns a freshly instantiated node with the given result, where result can again contain recursively nested node-factory functions which will be evaluated before instantiating the node.
Example
>>> factory = node_maker('d', (node_maker('c', ','), node_maker('l', ' ')))
>>> node = factory()
>>> node.serialize()
'(d (c ",") (l " "))'
- normalize_position_representation(path: Path, position: int | tuple | Callable) Tuple[int, ...][source]#
Converts a position-representation in any of the forms that PositionType allows into a (possibly empty) tuple of integers.
- normalize_whitespace(path)[source]#
Normalizes Whitespace inside a leaf node, i.e. any sequence of whitespaces, tabs and line feeds will be replaced by a single whitespace. Empty (i.e. zero-length) Whitespace remains empty, however.
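On a plain string, the normalization described above corresponds to a single regular-expression substitution. The helper normalize_whitespace_sketch is hypothetical and works on a string instead of a path; it only illustrates the rule, including the special case of empty whitespace.

```python
import re


def normalize_whitespace_sketch(text: str) -> str:
    """Replaces every run of blanks, tabs and line feeds by a single blank."""
    # \s+ requires at least one character, so the empty string stays empty
    return re.sub(r"\s+", " ", text)


collapsed = normalize_whitespace_sketch(" \t\n  ")
still_empty = normalize_whitespace_sketch("")
```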
- not_one_of(path: Path, name_set: AbstractSet[str]) bool[source]#
- not_one_of(*args: str)
- not_one_of(*args: Set, **kwargs)
Returns true, if the node’s name is not one of the given tag names.
- positions_of(path: Path, names: AbstractSet[str] = frozenset({})) Tuple[int, ...][source]#
- positions_of(*args: str)
- positions_of(*args: Set, **kwargs)
Returns a (potentially empty) tuple of the positions of the children that have one of the given names.
- pull_up(path)[source]#
Moves the last Node in the list one level up in the hierarchy.
>>> tree = parse_sxpr('(p (t "A") (i (t "1") (X "---") (t "2")) (t "B"))')
>>> path = tree.pick_path('X')
>>> pull_up(path)
>>> print(tree.as_sxpr())
(p (t "A") (i (t "1")) (X (i "---")) (i (t "2")) (t "B"))

>>> tree = parse_sxpr('(p (t "A") (i (X "---") (t "2")) (t "B"))')
>>> path = tree.pick_path('X')
>>> pull_up(path)
>>> print(tree.as_sxpr())
(p (t "A") (X (i "---")) (i (t "2")) (t "B"))

>>> tree = parse_sxpr('(p (t "A") (i (t "1") (X "---")) (t "B"))')
>>> path = tree.pick_path('X')
>>> pull_up(path)
>>> print(tree.as_sxpr())
(p (t "A") (i (t "1")) (X (i "---")) (t "B"))

>>> tree = parse_sxpr('(p (t "A") (i (X "---")) (t "B"))')
>>> path = tree.pick_path('X')
>>> pull_up(path)
>>> print(tree.as_sxpr())
(p (t "A") (X (i "---")) (t "B"))
EXPERIMENTAL!!!
- reduce_single_child(path: Path)[source]#
Reduces a single branch node by transferring the result of its immediate descendant to this node, but keeping this node’s parser entry. Reduction only takes place if the last node in the path has exactly one child. Attributes will be merged. In case one and the same attribute is defined for the child as well as the parent, the parent’s attribute value takes precedence.
- remove_anonymous_empty(path: Path, *, condition: CondFunc = <function <lambda>>)#
Removes all children for which condition() returns True.
- remove_anonymous_tokens(path: Path, *, condition: CondFunc = <function <lambda>>)#
Removes all children for which condition() returns True.
- remove_brackets(path: Path)[source]#
Removes any leading or trailing sequence of whitespaces, tokens or regexps.
- remove_children(path: Path, names: AbstractSet[str])[source]#
- remove_children(*args: str)
- remove_children(*args: Set, **kwargs)
Removes children by tag name.
- remove_children_if(path: Path, condition: Callable[[List[Node]], bool])[source]#
- remove_children_if(*args: Callable, **kwargs)
Removes all children for which condition() returns True.
- remove_content(path: Path, regexp: str)[source]#
- remove_content(*args: str, **kwargs)
Removes children depending on their string value.
- remove_empty(path: Path, *, condition: CondFunc = <function is_empty>)#
Removes all children for which condition() returns True.
- remove_if(path: Path, condition: Callable[[List[Node]], bool])[source]#
- remove_if(*args: Callable, **kwargs)
Removes node if condition is True
- remove_infix_operator(path: Path, *, section: slice = slice(0, None, 2))#
Keeps only child-nodes which fall into a slice of the result field.
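The default section slice(0, None, 2) selects every second child, starting with the first. On a plain tuple this looks as follows; keep_children_sketch is a hypothetical stand-in that operates on a tuple of strings rather than on a path.

```python
from typing import Tuple


def keep_children_sketch(children: Tuple[str, ...],
                         section: slice = slice(None, None, None)) -> Tuple[str, ...]:
    """Keeps only the children that fall into the given slice."""
    return children[section]


children = ("1", "+", "2", "-", "3")
# Operands sit at the even positions, infix operators at the odd ones
operands = keep_children_sketch(children, slice(0, None, 2))
everything = keep_children_sketch(children)
```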
- remove_tokens(path: Path, tokens: AbstractSet[str] = frozenset({}))[source]#
- remove_tokens(*args: str)
- remove_tokens(*args: Set, **kwargs)
Removes any among a particular set of tokens from the immediate descendants of a node. If tokens is the empty set, all tokens are removed.
- remove_whitespace(path: Path, *, condition: CondFunc = functools.partial(<function is_one_of>, name_set={':Whitespace'}))#
Removes all children for which condition() returns True.
- replace_by_children(path: Path)[source]#
Eliminates the last node in the path by replacing it with its children. The attributes of this node will be dropped. In case the last node is the root-node (i.e. len(path) == 1), it will only be eliminated if there is but one child.
WARNING: This should never be followed by move_fringes() in the transformation list!!!
- replace_by_single_child(path: Path)[source]#
Removes a single branch node, replacing it by its immediate descendant. Replacement only takes place if the last node in the path has exactly one child. Attributes will be merged. In case one and the same attribute is defined for the child as well as the parent, the child’s attribute value takes precedence.
- replace_child_names(path: Path, replacements: Dict[str, str])[source]#
- replace_child_names(*args: str)
- replace_child_names(*args: dict, **kwargs)
Replaces the tag names of the children of the last node in the path according to the replacement dictionary.
- Parameters:
path – The current path (i.e. list of ancestors and current node)
replacements – A dictionary of names. Each tag name of a child node that exists as a key in the dictionary will be replaced by the value for that key.
- replace_content_with(path: Path, content: str)[source]#
- replace_content_with(*args: str, **kwargs)
Replaces the content of the node with the given text content.
- replace_or_reduce(path: Path, condition: ~typing.Callable[[~typing.List[~DHParser.nodetree.Node]], bool] = <function is_named>)[source]#
- replace_or_reduce(*args: Callable, **kwargs)
Replaces node by a single child, if condition is True on child. Reduces the child, if condition is not True and not None. If the condition is None nothing is changed.
- rstrip(path: Path, condition: ~typing.Callable[[~typing.List[~DHParser.nodetree.Node]], bool] = <function contains_only_whitespace>)[source]#
- rstrip(*args: Callable, **kwargs)
Recursively removes all trailing nodes that fulfill a given condition.
- strip(path: Path, condition: ~typing.Callable[[~typing.List[~DHParser.nodetree.Node]], bool] = <function contains_only_whitespace>)[source]#
- strip(*args: Callable, **kwargs)
Removes leading and trailing child-nodes that fulfill a given condition.
- swap_attributes(node: Node, other: Node)[source]#
Exchanges the attributes between node and other. This might be needed when re-arranging trees.
- transform_result(path: Path, func: Callable[[List[Node]], bool])[source]#
- transform_result(*args: Callable, **kwargs)
Replaces the result of the node.
func takes the node’s result as an argument and returns the mapped result.
- transformation_factory(t1=None, t2=None, t3=None, t4=None, t5=None)[source]#
Creates factory functions from transformation-functions that dispatch on the first parameter after the path parameter.
Decorating a transformation-function that has more than merely the path-parameter with transformation_factory creates a function with the same name, which returns a partial-function that takes just the path-parameter.
Additionally, there is some syntactic sugar for transformation-functions that receive a collection as their second parameter and do not have any further parameters. In this case a list of parameters passed to the factory function will be converted into a single collection-parameter.
The primary benefit is the readability of the transformation-tables.
Usage:
    @transformation_factory(AbstractSet[str])
    def remove_tokens(path, tokens):
        ...
or, alternatively:
    @transformation_factory
    def remove_tokens(path, tokens: AbstractSet[str]):
        ...
Example:
    trans_table = { 'expression': remove_tokens('+', '-') }
instead of:
    trans_table = { 'expression': partial(remove_tokens, tokens={'+', '-'}) }
- Parameters:
t1 – type of the second argument of the transformation function, only necessary if the transformation functions’ parameter list does not have type annotations.
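The mechanism can be approximated in a few lines. The following is a minimal sketch and not the actual transformation_factory: factory_sketch is a hypothetical decorator that handles only the collection case, hard-codes the keyword name tokens, and does no type dispatch. Nodes are modelled as plain lists of strings.

```python
from functools import partial, wraps
from typing import AbstractSet, Callable, List


def factory_sketch(func: Callable) -> Callable:
    """Turn func(path, tokens) into a factory that returns a partial function
    taking only the path. The positional arguments passed to the factory are
    gathered into a single frozenset (the collection-parameter)."""
    @wraps(func)
    def make(*items: str) -> Callable:
        return partial(func, tokens=frozenset(items))
    return make


@factory_sketch
def remove_tokens(path: List[list], tokens: AbstractSet[str]) -> None:
    """Drop all children of the last node in the path that are listed in tokens."""
    node = path[-1]
    node[:] = [child for child in node if child not in tokens]


trans_table = {"expression": remove_tokens("+", "-")}

expression = ["1", "+", "2", "-", "3"]
trans_table["expression"]([expression])   # the table entry takes just the path
print(expression)
```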
- transformer(tree: ~DHParser.nodetree.RootNode, transformation_table: TransformationTableType, key_func: KeyFunc = <function key_node_name>, src_stage: str = '', dst_stage: str = '') RootNode[source]#
Same as traverse(), but expects a node of type RootNode to be passed in parameter tree and returns this RootNode. Furthermore, the names of the source and destination stages can be passed optionally in the parameters src_stage and dst_stage. If these parameters are not empty strings, tree.stage will be checked against src_stage before transforming the tree and set to dst_stage after the transformation.
See traverse() for the first three parameters and the general explanation of what transform does.
- Parameters:
src_stage – The name of the source stage or the empty string (default) if the source stage shall not be checked.
dst_stage – The name of the destination stage or the empty string (default) if the stage of the tree shall not be changed.
- Raises:
ValueError, if tree.stage != src_stage
- traverse(tree: ~DHParser.nodetree.Node, transformation_table: TransformationTableType, key_func: KeyFunc = <function key_node_name>) Node[source]#
Traverses the syntax tree starting with the given node depth first and applies the sequences of callback-functions registered in the transformation_table-dictionary.
The most important use case is the transformation of a concrete syntax tree into an abstract tree (AST). But it is also imaginable to employ tree-traversal for the semantic analysis of the AST.
In order to assign sequences of callback-functions to nodes, a dictionary (“processing table”) is used. The keys usually represent tag names, but any other key function is possible. There exist three special keys:
‘<’: always called (before any other processing function)
‘*’: called for those nodes for which no (other) processing function appears in the table
‘>’: always called (after any other processing function)
- Parameters:
tree – The root-node of the syntax tree to be traversed
transformation_table – A mapping node key -> sequence of functions that will be applied to matching nodes in order. This dictionary is interpreted as a compact_table. See expand_table() or EBNFCompiler.EBNFTransTable()
key_func – A mapping key_func(node) -> keystr. The default key_func yields node.name.
- Returns:
The tree that has been transformed in-place. The returned object is the same that has been passed in parameter tree, but be aware that this tree has been changed in-place!
Example:
    table = { "term": [replace_by_single_child, flatten],
              "factor, flowmarker, retrieveop": replace_by_single_child }
    traverse(node, table)
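How a compact table and the three special keys interact can be sketched without DHParser. In this illustration nodes are plain dictionaries, and expand_sketch, traverse_sketch, record and tag are hypothetical helpers; the sketch assumes that the functions for a node run after its children have been processed.

```python
from typing import Callable, Dict, List, Sequence, Union

Tree = dict  # hypothetical node format: {"name": str, "children": [Tree, ...]}
Table = Dict[str, Union[Callable, Sequence[Callable]]]


def expand_sketch(compact: Table) -> Dict[str, List[Callable]]:
    """Split comma-separated keys and wrap single functions into lists."""
    expanded: Dict[str, List[Callable]] = {}
    for key, funcs in compact.items():
        sequence = [funcs] if callable(funcs) else list(funcs)
        for name in key.split(","):
            expanded[name.strip()] = sequence
    return expanded


def traverse_sketch(tree: Tree, compact: Table) -> Tree:
    table = expand_sketch(compact)

    def visit(path: List[Tree]) -> None:
        node = path[-1]
        for child in list(node["children"]):   # depth first: children come first
            visit(path + [child])
        specific = table.get(node["name"], table.get("*", []))
        for func in table.get("<", []) + specific + table.get(">", []):
            func(path)

    visit([tree])
    return tree   # the very same object, changed in place


visited: List[str] = []


def record(path: List[Tree]) -> None:
    visited.append(path[-1]["name"])


def tag(label: str) -> Callable:
    def apply(path: List[Tree]) -> None:
        path[-1]["tag"] = label
    return apply


tree = {"name": "term", "children": [{"name": "factor", "children": []},
                                     {"name": "comment", "children": []}]}
traverse_sketch(tree, {"<": record, "term, factor": tag("known"), "*": tag("other")})
print(visited)
```

The "<" entry runs for every node, whereas the "*" entry only applies to the node named "comment", which has no entry of its own.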
- traverse_locally(path: Path, transformation_table: dict, key_func: KeyFunc = <function key_node_name>)[source]#
- traverse_locally(*args: dict, **kwargs)
Transforms the syntax tree starting from the last node in the path according to the given transformation table. The purpose of this function is to apply certain transformations locally, i.e. only for those nodes that have the last node in the path as their parent node.
Module compile#
Module compile contains a skeleton class for syntax
driven compilation support. Class Compiler can serve as base
class for a compiler. Compiler objects
are callable and receive the Abstract syntax tree (AST)
as argument and yield whatever output the compiler produces. In
most Digital Humanities applications, this will be
XML-code. However, it can also be anything else, like binary
code or, as in the case of DHParser’s EBNF-compiler, Python
source code.
Function compile_source invokes all stages of the compilation
process, i.e. pre-processing, parsing, CST to AST-transformation
and compilation.
See module ebnf for a sample of the implementation of a
compiler object.
- class Compiler[source]#
Class Compiler is the abstract base class for compilers. Compiler objects are callable and take the root node of the abstract syntax tree (AST) as argument and return the compiled code in a format chosen by the compiler itself.
Subclasses implementing a compiler must define on_XXX()-methods for each node name that can occur in the AST where ‘XXX’ is the node’s name (for unnamed nodes it is the node’s ptype without the leading colon ‘:’).
These compiler methods take the node on which they are run as argument. Unlike the AST transformation, which runs depth-first, compiler methods are called top-down, starting with the root node, and they are responsible for compiling the child nodes themselves. This should be done by invoking the compile(node)-method, which will pick the right on_XXX()-method, or, more commonly, by calling the fallback_compiler()-method, which compiles the child-nodes and updates the tuple of children according to the results. It is not recommended to call the on_XXX()-methods directly!
Variables that are (re-)set only in the constructor and retain their value if changed during subsequent calls:
- Variables:
forbid_returning_None –
Default value: True. Most of the time, if a compiler-method (i.e. on_XXX()) returns None, this is a mistake due to a forgotten return statement. The method compile() checks for this mistake and raises an error if a compiler-method returns None. However, some compilers require the possibility to return None values. In this case forbid_returning_None should be set to False in the constructor of the derived class.
(Another alternative would be to return a sentinel object instead of None.)
Object-Variables that are reset after each call of the Compiler-object:
- Variables:
path – A list of parent nodes that ends with the currently compiled node.
tree – The root of the abstract syntax tree.
finalizers – A stack of tuples (function, parameters) that will be called in reverse order after compilation.
method_dict – A cache that maps node-names to their respective compile-methods.
has_attribute_visitors – A flag indicating that the class has attribute-visitor-methods which are named ‘attr_ATTRIBUTENAME’ and will be called if the currently processed node has one or more attributes for which such visitors exist.
forbid_returning_None – A boolean flag that is true by default to catch a common error (i.e. omitting the return value) when filling in compile-methods. Should be set to False in sub-classes that do want to allow compile-methods to return None.
cancel_query – An optional cancel_query function that will be called by the compile method and stop short compilation with a fatal error if it returns True.
_dirty_flag – A flag indicating that the compiler has already been called at least once and that therefore all compilation variables must be reset when it is called again.
_debug – A flag indicating that debugging is turned on. The value for this flag is read from the configuration before each call (see debugging-section in DHParser.configuration). If debugging is turned on, the compiler class raises an error if there is an attempt to compile one and the same node a second time.
_debug_already_compiled – A set of nodes that have already been compiled.
- attr_visitor_name(attr_name: str) str[source]#
Returns the visitor_method name for attr_name, e.g.:
    >>> c = Compiler()
    >>> c.attr_visitor_name('class')
    'attr_class'
- compile(node: Node, find_compilation_method: Callable[[Node], Callable[[Node], Any]] | None = None) Any[source]#
Calls the compilation method for the given node and returns the result of the compilation.
The method’s name is derived from either the node’s parser name or, if the parser is disposable, the node’s parser’s class name by adding the prefix on_. Other ways of determining the right compilation method are possible by providing a function that returns a compilation-method for a given node to the parameter “find_compilation_method”.
Note that compile does not call any compilation functions for the parsers of the sub nodes by itself. Rather, this should be done within the compilation methods.
- Parameters:
node – The node that shall be compiled next. (The path of nodes leading from the root of the tree is kept in the instance-variable self.path.)
find_compilation_method – A function that returns a compilation method for a given node. If None, the default method described above will be used.
- Returns:
An object of any type (determined by the sub-class deriving from class Compiler).
- fallback_compiler(node: Node) Any[source]#
This is a generic compiler function that will be called on all those node types for which no compiler method on_XXX has been defined.
- finalize(result: Any) Any[source]#
A finalization method that is called after compilation has finished, and after all tasks from the finalizers-stack have been executed.
- prepare(root: RootNode) None[source]#
A preparation method that will be called after everything else has been initialized and immediately before compilation starts. This method can be overwritten in order to implement preparation tasks.
- visitor_name(node_name: str) str[source]#
Returns the visitor_method name for name, e.g.:
    >>> c = Compiler()
    >>> c.visitor_name('expression')
    'on_expression'
    >>> c.visitor_name('!--')
    'on_212d2d'
- wildcard(node: Node) Any[source]#
The wildcard method is called on nodes for which no other compilation method has been specified. This makes it possible to check whether illegal nodes occur in the tree (although a static structural validation is to be preferred) or whether a compilation method has been forgotten.
Per default, wildcard() just redirects to self.fallback_compiler()
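The dispatch scheme described for class Compiler can be approximated by a small visitor. The following is a sketch under simplifying assumptions and not DHParser's implementation: CompilerSketch and Calculator are hypothetical classes, nodes are (name, payload) tuples, and the fallback returns a new tuple instead of updating a node in place.

```python
from typing import Any, Callable, Dict, List


class CompilerSketch:
    """Hypothetical mini-compiler. A node is a tuple (name, payload), where the
    payload is either a string (leaf) or a list of child nodes."""

    forbid_returning_none = True

    def __init__(self) -> None:
        self.path: List[tuple] = []
        self.method_dict: Dict[str, Callable] = {}   # cache: node name -> method

    def compile(self, node: tuple) -> Any:
        name = node[0]
        method = self.method_dict.get(name)
        if method is None:
            method = getattr(self, "on_" + name, self.fallback_compiler)
            self.method_dict[name] = method
        self.path.append(node)
        try:
            result = method(node)
        finally:
            self.path.pop()
        if result is None and self.forbid_returning_none:
            raise ValueError(f"compile-method for '{name}' returned None")
        return result

    def fallback_compiler(self, node: tuple) -> Any:
        """Compile the children, leave the node itself as it is."""
        name, payload = node
        if isinstance(payload, list):
            return (name, [self.compile(child) for child in payload])
        return node


class Calculator(CompilerSketch):
    def on_number(self, node: tuple) -> int:
        return int(node[1])

    def on_sum(self, node: tuple) -> int:
        # Children are compiled via compile(), never via on_XXX() directly.
        return sum(self.compile(child) for child in node[1])


ast = ("sum", [("number", "1"), ("sum", [("number", "2"), ("number", "3")])])
result = Calculator().compile(ast)
print(result)
```

Routing every child through compile() keeps the path up to date and makes sure that the check for forgotten return statements runs for every node.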
- exception CompilerError[source]#
Exception raised when an error of the compiler itself is detected. Compiler errors are not to be confused with errors in the source code to be compiled, which do not raise Exceptions but are merely reported as an error.
- compile_source(source: str, preprocessor: PreprocessorFunc | None, parser: ParserCallable, transformer: TransformerFunc = <function NoTransformation>, compiler: CompilerFunc = <function NoTransformation>, *, start_parser: str | ~DHParser.parse.Parser = 'root_parser__', preserve_AST: bool = False, cancel_query: ~typing.Callable[[], bool] | None = None) CompilationResult[source]#
Compiles a source in four stages:
Pre-Processing (if needed)
Parsing
AST-transformation
Compiling.
The later stages, AST-transformation and compilation, will only be invoked if no fatal errors occurred in any of the earlier stages of the processing pipeline. Function “compile_source” does not invoke any postprocessing after compiling. See functions run_pipeline() and full_compile() for postprocessing and compiling plus postprocessing.
- Parameters:
source – The input text for compilation or the name of a file containing the input text.
preprocessor – text -> text. A preprocessor function or None, if no preprocessor is needed.
parser – A parsing function or grammar class
transformer – A transformation function that takes the root-node of the concrete syntax tree as an argument and transforms it (in place) into an abstract syntax tree.
compiler – A compiler function or compiler class instance
start_parser – The name of the parser (or the parser-object itself) with which to start. This is useful for compiling sections of entire documents without the need to provide a dummy-wrapper.
preserve_AST – Preserves the AST-tree.
cancel_query – A boolean-valued function that will be called between the different compilation-stages as to whether further processing shall be canceled.
- Returns:
The result of the compilation as a 3-tuple (result, errors, abstract syntax tree). In detail:
The result as returned by the compiler or None in case of failure
A list of error or warning messages
The root-node of the abstract syntax tree if preserve_AST is True or None otherwise.
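The control flow of the four stages and of the returned 3-tuple can be sketched as follows. This is an illustration only: compile_source_sketch, the stage signature (data, errors) -> data, the error format (code, message) and the threshold FATAL are assumptions of the sketch, not DHParser's API.

```python
import copy
from typing import Any, Callable, List, Optional, Tuple

ErrorList = List[Tuple[int, str]]          # (code, message); hypothetical format
Stage = Callable[[Any, ErrorList], Any]    # every stage may append to the error list
FATAL = 10000                              # assumed threshold for fatal error codes


def has_fatal(errors: ErrorList) -> bool:
    return any(code >= FATAL for code, _ in errors)


def compile_source_sketch(source: str, preprocessor: Optional[Stage], parser: Stage,
                          transformer: Stage, compiler: Stage,
                          preserve_ast: bool = False) -> Tuple[Any, ErrorList, Any]:
    errors: ErrorList = []
    text = preprocessor(source, errors) if preprocessor else source
    tree = parser(text, errors)
    result, ast = None, None
    if not has_fatal(errors):
        tree = transformer(tree, errors)
        if preserve_ast:
            ast = copy.deepcopy(tree)      # the compiler may change the tree in place
        if not has_fatal(errors):
            result = compiler(tree, errors)
    return result, errors, ast


def parse_ints(text: str, errors: ErrorList) -> List[str]:
    items = text.split()
    bad = [item for item in items if not item.isdigit()]
    if bad:
        errors.append((FATAL, f"not a number: {bad[0]}"))
    return items


def to_ints(tree: List[str], errors: ErrorList) -> List[int]:
    return [int(item) for item in tree]


def add_up(tree: List[int], errors: ErrorList) -> int:
    return sum(tree)


ok = compile_source_sketch("1 2 3", None, parse_ints, to_ints, add_up, preserve_ast=True)
failed = compile_source_sketch("1 x 3", None, parse_ints, to_ints, add_up)
print(ok, failed)
```

In the failing case the transformer and the compiler are never called, so the result and the preserved tree are both None and only the error list carries information.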
- process_tree(tp: CompilerFunc, tree: RootNode) Any[source]#
Process a tree with the tree-processor tp only if no fatal error has occurred so far. Catch any Python exceptions in case any normal errors have occurred earlier in the processing pipeline. Don’t catch Python exceptions if no errors have occurred earlier.
This behavior is based on the assumption that, if any non-fatal errors have occurred earlier, the tree passed through the pipeline might not be in a state that is expected by the later stages. Thus, if an exception occurs, it is not really to be considered a programming error. Processing stages should be written with possible errors occurring in earlier stages in mind, though. However, because it could be difficult to provide for all possible kinds of badly structured trees resulting from errors, exceptions occurring when processing potentially faulty trees will be dealt with gracefully.
Tree processing should generally be assumed to change the tree in place. If the input tree shall be preserved, it is necessary to make a deep copy of the input tree, before calling process_tree.
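The error-handling policy of process_tree can be sketched in isolation. The names process_tree_sketch and brittle are hypothetical, errors are plain strings here, and what is returned after a skipped or failed stage is an assumption of the sketch.

```python
from typing import Any, Callable, List


def process_tree_sketch(processor: Callable[[Any], Any], tree: Any,
                        errors: List[str], fatal: bool = False) -> Any:
    """Run processor on tree; tolerate exceptions only if earlier errors exist."""
    if fatal:
        return tree                 # skip processing after a fatal error
    if not errors:
        return processor(tree)      # clean input: exceptions surface as bugs
    try:
        return processor(tree)
    except Exception as exc:        # the tree may be malformed due to earlier errors
        errors.append(f"processing stopped: {exc!r}")
        return None


def brittle(tree: dict) -> int:
    return tree["value"] * 2        # raises KeyError on a malformed tree


clean = process_tree_sketch(brittle, {"value": 21}, [])
earlier = ["1:1: Error: reported by an earlier stage"]
damaged = process_tree_sketch(brittle, {}, earlier)
print(clean, damaged, len(earlier))
```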
Module pipeline#
Module pipeline implements support for processing-pipelines, which
connect successive stages by tree transformations (called
“junctions”). Processing pipelines have
one starting point, the source-document, but can have one or more
end points. For example, if the source is a text-document, the
end points can be an HTML document for the online-presentation
and a LaTeX-document to produce a printed version.
Each junction is a triple of the name of the source-stage, the transformation-function and the name of the destination-stage. Pipelines are defined by the set of junctions from which paths connecting the source-point to the end-points are derived algorithmically.
- class Junction(src: str, factory: ParserFactory | CompilerFactory | TransformerFactory, dst: str)[source]#
A junction is a triple of the name of the source-stage, a factory-function that returns the actual transformation function and the name of the destination-stage.
- class PseudoJunction(factory)[source]#
- factory: PreprocessorFactory | ParserFactory#
Alias for field number 0
- create_compiler_junction(compile_class: type, src_stage: str, dst_stage: str) Junction[source]#
Creates a thread-safe transformation function and function-factory from a compile.Compiler or another callable class.
- create_evaluation_junction(actions: Dict[str, Callable], src_stage: str, dst_stage: str, supply_path_arg: bool = True) Junction[source]#
Creates a thread-safe transformation function and function-factory from an evaluation-table (see nodetree.Node.evaluate()).
- create_junction(tool: dict | type, src_stage: str, dst_stage: str, hint: str = '?') Junction[source]#
Generic stage-creation function for tree-transforming stages where a tree-transforming stage is a stage which either reshapes a node-tree or transforms a nodetree into something else, but not a stage where something else (e.g. a text) is turned into a node-tree.
- create_parser_junction(grammar_class: type) PseudoJunction[source]#
Creates a factory for thread-safe parser functions as well as a thread-safe parser function.
- create_preprocess_junction(prep_func: PreprocessorFunc | Tokenizer, include_regex, comment_regex, derive_file_name: DeriveFileNameFunc = <function same_name>, func_type: type | None = None) PseudoJunction[source]#
Creates a factory for thread-safe preprocessing functions as well as a thread-safe preprocessing function.
- end_points(junctions: Iterable[Junction]) Set[str][source]#
Returns all “final” destination stages, i.e. destinations that are not a source of another junction.
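The definition of a “final” stage translates almost literally into set operations. The following sketch uses a hypothetical named tuple JunctionSketch in place of DHParser's Junction class; the factory field is filled with a placeholder.

```python
from typing import Callable, Iterable, NamedTuple, Set


class JunctionSketch(NamedTuple):
    """Hypothetical stand-in for the triple (src, factory, dst)."""
    src: str
    factory: Callable
    dst: str


def end_points_sketch(junctions: Iterable[JunctionSketch]) -> Set[str]:
    """Return all destinations that are not the source of another junction."""
    junction_list = list(junctions)          # allow one-shot iterables
    sources = {junction.src for junction in junction_list}
    return {junction.dst for junction in junction_list if junction.dst not in sources}


def placeholder() -> Callable:
    return lambda tree: tree


junctions = [JunctionSketch("AST", placeholder, "XML"),
             JunctionSketch("XML", placeholder, "HTML"),
             JunctionSketch("XML", placeholder, "LaTeX")]
finals = end_points_sketch(junctions)
print(sorted(finals))
```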
- extract_data(tree_or_data: RootNode | Node | Any) Any[source]#
Retrieves the data from the given tree or just passes the data through if argument tree_or_data is not of type RootNode.
- full_pipeline(source: str, preprocessor_factory: PreprocessorFactory, parser_factory: ParserFactory, junctions: Set[Junction], target_stages: Set[str], start_parser: str | Parser = 'root_parser__', *, cancel_query: Callable[[], bool] | None = None) PipelineResult[source]#
Runs a processing pipeline starting from the source-code (in contrast to “run_pipeline()” which starts from any tree-stage, typically from the concrete syntax-tree (CST)).
“full_pipeline()” first preprocesses and compiles the source-document. Then it post-processes the result into the given target stages. Mind that if there are fatal errors earlier in the pipeline, some or all target stages might not be reached and thus not be included in the result.
- run_pipeline(junctions: Set[Junction], source_stages: Dict[str, RootNode], target_stages: Set[str], *, cancel_query: Callable[[], bool] | None = None) PipelineResult[source]#
Runs all the intermediary compilation-steps that are necessary to produce the “target-stages” from the given “source-stages”. Here, each source-stage consists of a name for that stage, say “AST”, and a node-tree that represents the data at this stage of the processing pipeline. In the target-stage, the data can be a node-tree or data of any other kind.
The stages are connected through chains of junctions, where a junction is essentially a function that transforms a tree from one particular stage (identified by its name) to another stage, again identified by its name.
TODO: Parallelize processing of junctions? Requires copying a lot of tree-data!?
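How chains of junctions lead from the source-stages to the target-stages can be sketched with plain tuples. This is not DHParser's algorithm: run_pipeline_sketch is a hypothetical function which assumes that every stage is produced by exactly one junction, that the transformations are pure functions on strings rather than in-place tree transformations, and it ignores errors and cancellation.

```python
from typing import Any, Callable, Dict, Set, Tuple

# (source stage, factory returning the transformation, destination stage)
JunctionTuple = Tuple[str, Callable[[], Callable[[Any], Any]], str]


def run_pipeline_sketch(junctions: Set[JunctionTuple], source_stages: Dict[str, Any],
                        target_stages: Set[str]) -> Dict[str, Any]:
    by_dst = {dst: (src, factory) for src, factory, dst in junctions}
    results = dict(source_stages)

    def reach(stage: str) -> Any:
        if stage not in results:
            src, factory = by_dst[stage]     # KeyError: the stage cannot be reached
            results[stage] = factory()(reach(src))
        return results[stage]

    return {stage: reach(stage) for stage in target_stages}


def to_xml() -> Callable[[str], str]:
    return lambda text: f"<p>{text}</p>"


def to_html() -> Callable[[str], str]:
    return lambda xml: f"<html>{xml}</html>"


def to_latex() -> Callable[[str], str]:
    return lambda xml: xml.replace("<p>", "").replace("</p>", "\\par")


junction_set = {("AST", to_xml, "XML"), ("XML", to_html, "HTML"),
                ("XML", to_latex, "LaTeX")}
outputs = run_pipeline_sketch(junction_set, {"AST": "Hello"}, {"HTML", "LaTeX"})
print(outputs["HTML"], outputs["LaTeX"])
```

The intermediate stage "XML" is computed only once and shared by both target stages, because intermediate results are kept in the results dictionary.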
Module parse#
Module parse contains the python classes and functions for
DHParser’s packrat-parser. Its central class is the
Grammar-class, which is the base class for any concrete
Grammar. Grammar-objects are callable and parsing is done by
calling a Grammar-object with a source text as argument.
The different parsing functions are callable descendants of class
Parser. Usually, they are organized in a tree and defined
within the namespace of a grammar-class. See ebnf.EBNFGrammar
for an example.
- class Alternative(*parsers: Parser)#
Matches if one of several alternatives matches. Returns the first match.
This parser represents the EBNF-operator “|” with the qualification that both the symmetry and the ambiguity of the EBNF-or-operator are broken by selecting the first match:
    # the order of the sub-expression matters!
    >>> number = RE(r'\d+') | RE(r'\d+') + RE(r'\.') + RE(r'\d+')
    >>> str(Grammar(number)("3.1416"))
    '3 <<< Error on ".1416" | Parser "root" stopped before end, at: ».1416« Terminating parser. >>> '

    # the most selective expression should be put first:
    >>> number = RE(r'\d+') + RE(r'\.') + RE(r'\d+') | RE(r'\d+')
    >>> Grammar(number)("3.1416").content
    '3.1416'
EBNF-Notation: ... | ...
EBNF-Example: number = /\d+\.\d+/ | /\d+/
- is_optional() bool | None#
Returns True, if the parser can never fail, i.e. never yields None instead of a node. Returns False, if the parser can fail. Returns None if it is not known whether the parser can fail.
- static_analysis() List[AnalysisError]#
Analyses the parser for logical errors after the grammar has been instantiated.
- class Always#
A parser that always matches, but does not capture anything.
- class AnyChar#
A parser that returns the next Unicode character of the document whatever that is. The parser fails only at the very end of the text.
- class Capture(parser: Parser, zero_length_warning: bool = True)#
Applies the contained parser and, in case of a match, saves the result in a variable. A variable is a stack of values associated with the contained parser’s name. This requires the contained parser to be named.
- static_analysis() List[AnalysisError]#
Analyses the parser for logical errors after the grammar has been instantiated.
- class CombinedParser#
Class CombinedParser is the base class for all parsers that call (“combine”) other parsers. It contains functions for the optimization of return values of such parsers (i.e. descendants of classes UnaryParser and NaryParser).
One optimization consists in flattening the tree by eliminating anonymous nodes. This is the same as what the function DHParser.transform.flatten() does, only at an earlier stage. The reasoning is that the earlier the tree is reduced, the less work remains to do at all later processing stages. As these typically run through all nodes of the syntax tree, this saves memory and presumably also time.
Regarding the latter, however, performing flattening or merging during the parsing stage also means that it will be performed on all those tree-structures that are discarded later in the parsing process, as well.
Doing flattening or merging during AST-transformation will ensure that it is performed only on those nodes that made it into the concrete-syntax-tree. Merging, in particular, might become costly because of potentially many string-concatenations. But then again, the usual depth-first-traversal during AST-transformation will take longer, because of the much more verbose tree. (Experiments suggest that not much is to be gained by postponing flattening and merging to the AST-transformation stage.)
Another optimization consists in returning the singleton EMPTY_NODE for dropped contents, rather than creating a new empty node every time empty content is returned. This optimization should always work.
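The flattening optimization can be pictured with a small sketch. It is not the code used by CombinedParser: nodes are modelled as (name, payload) tuples, flatten_sketch is a hypothetical function, and the sketch follows the convention that the names of anonymous nodes start with a colon.

```python
from typing import List, Tuple, Union

FlatNode = Tuple[str, Union[str, List["FlatNode"]]]   # (name, text or children)


def flatten_sketch(node: FlatNode) -> FlatNode:
    """Splice the children of anonymous branch nodes into their parent."""
    name, payload = node
    if isinstance(payload, str):
        return node                       # leaf nodes are kept as they are
    flat: List[FlatNode] = []
    for child in payload:
        child = flatten_sketch(child)
        if child[0].startswith(":") and not isinstance(child[1], str):
            flat.extend(child[1])         # anonymous branch: lift its children
        else:
            flat.append(child)
    return (name, flat)


nested = ("expr", [(":Series", [("term", "1"), (":Text", "+")]), ("term", "2")])
flattened = flatten_sketch(nested)
print(flattened)
```

Anonymous leaf nodes such as (":Text", "+") are kept, since only branch nodes without a name of their own are dissolved.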
- class ContextSensitive(parser: Parser)#
Base class for context-sensitive parsers.
Context-Sensitive-Parsers are parsers that either manipulate (store or change) the values of variables or that read values of variables and use them to determine whether the parser matches or not.
While context-sensitive-parsers are quite useful, grammars that use them will not be context-free anymore. Plus, they break an assumption of packrat-parsing. In particular, their results cannot simply be memoized by storing them in a dictionary of locations. (In other words, the memoization function is not a function of parser and location anymore, but would need to be a function of parser, location and variable (stack-)state.) DHParser blocks memoization for context-sensitive-parsers (see Parser.__call__() and Forward.__call__()). As a consequence, the parsing time cannot be assumed to be strictly proportional to the size of the document anymore. Therefore, it is recommended to use context-sensitive-parsers sparingly.
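Why a memoization table keyed by location alone goes wrong for context-sensitive parsers can be demonstrated with a toy example. All names below are hypothetical and the parser is reduced to a function that matches the delimiter on top of a variable stack. Note that DHParser, as described above, blocks memoization for such parsers; the extended key in the second half merely illustrates the parenthetical remark.

```python
from typing import Dict, List, Optional, Tuple


def match_closing(text: str, location: int, stack: List[str]) -> Optional[int]:
    """Match the most recently captured delimiter; return the end position or None."""
    if not stack:
        return None
    delimiter = stack[-1]
    return location + len(delimiter) if text.startswith(delimiter, location) else None


memo: Dict[int, Optional[int]] = {}


def memoized_by_location(text: str, location: int, stack: List[str]) -> Optional[int]:
    if location not in memo:
        memo[location] = match_closing(text, location, stack)
    return memo[location]


memo_with_state: Dict[Tuple[int, Tuple[str, ...]], Optional[int]] = {}


def memoized_with_state(text: str, location: int, stack: List[str]) -> Optional[int]:
    key = (location, tuple(stack))        # the variable state is part of the key
    if key not in memo_with_state:
        memo_with_state[key] = match_closing(text, location, stack)
    return memo_with_state[key]


document = 'say "hi"'
first = memoized_by_location(document, 7, ['"'])    # matches the closing quote
stale = memoized_by_location(document, 7, ["'"])    # wrong: reuses the old result
fresh = match_closing(document, 7, ["'"])           # correct: no match
print(first, stale, fresh)
```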
- class Counted(parser: Parser, repetitions: Tuple[int, int])#
Counted applies a parser for a number of repetitions within a given range, i.e. the parser must at least match the lower bound number of repetitions, and it can at most match the upper bound number of repetitions.
Examples:
    >>> A2_4 = Counted(Text('A'), (2, 4))
    >>> A2_4
    `A`{2,4}
    >>> Grammar(A2_4)('AA').as_sxpr()
    '(root (:Text "A") (:Text "A"))'
    >>> Grammar(A2_4)('AAAAA', complete_match=False).as_sxpr()
    '(root (:Text "A") (:Text "A") (:Text "A") (:Text "A"))'
    >>> Grammar(A2_4)('A', complete_match=False).as_sxpr()
    '(ZOMBIE__ `(err "1:1: Error (1040): Parser did not match!"))'
    >>> moves = OneOrMore(Counted(Text('A'), (1, 3)) + Counted(Text('B'), (1, 3)))
    >>> result = Grammar(moves)('AAABABB')
    >>> result.name, result.content
    ('root', 'AAABABB')
    >>> moves = Counted(Text('A'), (2, 3)) * Counted(Text('B'), (2, 3))
    >>> moves
    `A`{2,3} ° `B`{2,3}
    >>> Grammar(moves)('AAABB').as_sxpr()
    '(root (:Text "A") (:Text "A") (:Text "A") (:Text "B") (:Text "B"))'
While a Counted-parser could be treated as a special case of Interleave-parser, defining a dedicated class makes the purpose clearer and runs slightly faster.
- is_optional() bool | None#
Returns True, if the parser can never fail, i.e. never yields None instead of a node. Returns False, if the parser can fail. Returns None if it is not known whether the parser can fail.
- static_analysis() List[AnalysisError]#
Analyses the parser for logical errors after the grammar has been instantiated.
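The matching rule of a counted repetition can be sketched with plain functions instead of parser objects. In this illustration a parser is a hypothetical function (text, position) -> end position or None; text and counted are not DHParser names.

```python
from typing import Callable, Optional, Tuple

ParseFunc = Callable[[str, int], Optional[int]]   # returns the end position or None


def text(token: str) -> ParseFunc:
    return lambda s, pos: pos + len(token) if s.startswith(token, pos) else None


def counted(parser: ParseFunc, repetitions: Tuple[int, int]) -> ParseFunc:
    lower, upper = repetitions

    def parse(s: str, pos: int) -> Optional[int]:
        count = 0
        while count < upper:              # never match more than the upper bound
            end = parser(s, pos)
            if end is None:
                break
            pos, count = end, count + 1
        return pos if count >= lower else None   # fail below the lower bound
    return parse


a2_4 = counted(text("A"), (2, 4))
print(a2_4("AA", 0), a2_4("AAAAA", 0), a2_4("A", 0))
```

As in the doctest above, five "A"s yield a match of four, leaving the fifth unconsumed, while a single "A" does not match at all.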
- class CustomParser(parse_func: CustomParseFunc)#
A wrapper for a simple custom parser function defined by the user:
    >>> def parse_magic_number(rest: StringView) -> Node:
    ...     return Node('', rest[:4]) if rest.startswith('1234') else EMPTY_NODE
    >>> parser = Grammar(CustomParser(parse_magic_number))
    >>> result = parser('1234')
    >>> print(result.as_sxpr())
    (root "1234")
    >>> result = parser('abcd')
    >>> for e in result.errors: print(e)
    1:1: Error (1040): Parser "root" stopped before end, at: »abcd« Terminating parser.
- DTKN(token, wsL='', wsR='\\s*')#
Syntactic Sugar for ‘Series(Whitespace(wsL), DropText(token), Whitespace(wsR))’
- Drop(parser: Parser) Parser#
Returns the parser with the parser.drop_content-property set to True. Parser must be anonymous and disposable. Use DropFrom instead when this requirement is not met.
- DropFrom(parser: Parser) Parser#
Encapsulates the parser in an anonymous Synonym-Parser and sets the drop_content-flag of the latter. This leaves the drop-flag of the parser itself untouched. This is needed, if you want to drop the result from a named-parser in one particular context where it is referred to, only.
- class ERR(err_msg: str, err_code: ErrorCode = 1000)#
ERR is a pseudo-parser that does not consume any text, but adds an error message at the current location.
- class ErrorCatchingNary(*parsers: Parser, mandatory: int = 1073741824)#
ErrorCatchingNary is the parent class for N-ary parsers that can be configured to fail with a parsing error in case of a non-match, if all contained parsers from a specific subset of non-mandatory parsers have already matched successfully, so that only “mandatory” parsers are left for matching. The idea is that once all non-mandatory parsers have been consumed, it is clear that this parser is a match, so that the failure to match any of the following mandatory parsers indicates a syntax error in the processed document at the location where a mandatory parser fails to match.
For the sake of simplicity, the division between the set of non-mandatory parsers and mandatory parsers is realized by an index into the list of contained parsers. All parsers from the mandatory-index onward are considered mandatory once all parsers up to the index have been consumed.
In the following example, Series is a descendant of ErrorCatchingNary:
    >>> fraction = Series(Text('.'), RegExp(r'[0-9]+'), mandatory=1).name('fraction')
    >>> number = (RegExp(r'[0-9]+') + Option(fraction)).name('number')
    >>> num_parser = Grammar(TreeReduction(number, CombinedParser.MERGE_TREETOPS))
    >>> num_parser('25').as_sxpr()
    '(number "25")'
    >>> num_parser('3.1415').as_sxpr()
    '(number (:RegExp "3") (fraction ".1415"))'
    >>> str(num_parser('3.1415'))
    '3.1415'
    >>> str(num_parser('3.'))
    '3. <<< Error on "" | /[0-9]+/ expected by parser \'fraction\', but END OF FILE found instead! >>> '
In this example, the first item of the fraction, i.e. the decimal dot, is non-mandatory, because only the parsers with an index of one or more are mandatory (mandatory=1). In this case this is only the regular expression parser capturing the decimal digits after the dot. This means, if there is no dot, the fraction parser simply will not match. However, if there is a dot, it will fail with an error if the following mandatory item, i.e. the decimal digits, is missing.
- Variables:
mandatory – Number of the element starting at which the element and all following elements are considered “mandatory”. This means that rather than returning a non-match an error message is issued. The default value is NO_MANDATORY, which means that no elements are mandatory. NOTE: The semantics of the mandatory-parameter might change depending on the subclass implementing it.
- get_reentry_point(location: int) Tuple[int, Node]#
Returns a tuple of integer index of the closest reentry point and a Node capturing all text from rest up to this point or (-1, None) if no reentry-point was found. If no reentry-point was found or the skip-list is empty, -1 and a zombie-node are returned.
- mandatory_violation(location: int, failed_on_lookahead: bool, expected: str, reloc: int, err_node: Node) Tuple[Error, int]#
Chooses the right error message in case of a mandatory violation and returns an error with this message, an error node, to which the error is attached, and the text segment where parsing is to continue.
This is a helper function that abstracts functionality that is needed by the Interleave-parser as well as the Series-parser.
- Parameters:
location – the point, where the mandatory violation happened. As usual, the string view represents the remaining text from this point.
failed_on_lookahead – True if the violating parser was a Lookahead-Parser.
expected – the expected (but not found) text at this point. position where the error occurred to a suggested reentry-position.
reloc – A position offset that represents the reentry point for parsing after the error occurred.
err_node – The node to which the mandatory violation error shall be attached. Usually, this is the skip-node so that the error will be located exactly where the violation occurred.
- Returns:
a tuple of an error object and a location for the continuation of the parsing process
- static_analysis() List[AnalysisError]#
Analyses the parser for logical errors after the grammar has been instantiated.
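The effect of the mandatory-index can be sketched with function-based parsers. This is a simplified illustration and not DHParser's Series: the helpers text, regex and series are hypothetical, and the sketch raises an exception (MandatoryViolation) where DHParser adds an error to the tree and looks for a reentry point.

```python
import re
from typing import Callable, Optional

ParseFunc = Callable[[str, int], Optional[int]]   # returns the end position or None
NO_MANDATORY = 2 ** 30                            # index no real series ever reaches


class MandatoryViolation(Exception):
    """Raised by the sketch where a real parser would report a syntax error."""


def text(token: str) -> ParseFunc:
    return lambda s, pos: pos + len(token) if s.startswith(token, pos) else None


def regex(pattern: str) -> ParseFunc:
    compiled = re.compile(pattern)

    def parse(s: str, pos: int) -> Optional[int]:
        match = compiled.match(s, pos)
        return match.end() if match else None
    return parse


def series(*parsers: ParseFunc, mandatory: int = NO_MANDATORY) -> ParseFunc:
    def parse(s: str, pos: int) -> Optional[int]:
        for index, parser in enumerate(parsers):
            end = parser(s, pos)
            if end is None:
                if index >= mandatory:
                    raise MandatoryViolation(f"item {index} expected at position {pos}")
                return None     # ordinary non-match; alternatives may still be tried
            pos = end
        return pos
    return parse


fraction = series(text("."), regex(r"[0-9]+"), mandatory=1)
print(fraction("3.1415", 1))   # dot and digits match
print(fraction("25", 2))       # no dot: plain non-match, no error
```

Calling fraction("3.", 1) raises MandatoryViolation, because the dot has matched and the digits at index 1 are mandatory.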
- class Forward#
Forward makes it possible to declare a parser before it is actually defined. Forward declarations are needed for parsers that are recursively nested, e.g.:
    >>> class Arithmetic(Grammar):
    ...     r'''
    ...     expression = term { ("+" | "-") term }
    ...     term = factor { ("*" | "/") factor }
    ...     factor = INTEGER | "(" expression ")"
    ...     INTEGER = /\d+/~
    ...     '''
    ...     expression = Forward()
    ...     INTEGER = RE('\\d+')
    ...     factor = INTEGER | TKN("(") + expression + TKN(")")
    ...     term = factor + ZeroOrMore((TKN("*") | TKN("/")) + factor)
    ...     expression.set(term + ZeroOrMore((TKN("+") | TKN("-")) + term))
    ...     root__ = expression
- Variables:
recursion_counter – Mapping of places to how often the parser has already been called recursively at this place. This is needed to implement left recursion. The number of calls becomes irrelevant once a result has been memoized.
- reset()#
Initializes or resets any parser variables. If overwritten, the reset()-method of the parent class must be called from the reset()-method of the derived class.
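The idea of a forward declaration can be sketched independently of DHParser: a placeholder object is created first and bound to the real parser later, so that recursive definitions can refer to it in between. ForwardSketch and the function-based combinators below are hypothetical and leave out memoization and left-recursion handling entirely.

```python
import re
from typing import Callable, Optional

ParseFunc = Callable[[str, int], Optional[int]]   # returns the end position or None


class ForwardSketch:
    """Placeholder that delegates to a parser assigned later with set()."""

    def __init__(self) -> None:
        self.parser: Optional[ParseFunc] = None

    def set(self, parser: ParseFunc) -> None:
        self.parser = parser

    def __call__(self, s: str, pos: int) -> Optional[int]:
        if self.parser is None:
            raise RuntimeError("forward-declared parser used before set()")
        return self.parser(s, pos)


def text(token: str) -> ParseFunc:
    return lambda s, pos: pos + len(token) if s.startswith(token, pos) else None


def integer(s: str, pos: int) -> Optional[int]:
    match = re.compile(r"\d+").match(s, pos)
    return match.end() if match else None


def series(*parsers: ParseFunc) -> ParseFunc:
    def parse(s: str, pos: int) -> Optional[int]:
        for parser in parsers:
            end = parser(s, pos)
            if end is None:
                return None
            pos = end
        return pos
    return parse


def alternative(*parsers: ParseFunc) -> ParseFunc:
    def parse(s: str, pos: int) -> Optional[int]:
        for parser in parsers:
            end = parser(s, pos)
            if end is not None:
                return end              # first match wins
        return None
    return parse


def zero_or_more(parser: ParseFunc) -> ParseFunc:
    def parse(s: str, pos: int) -> Optional[int]:
        while True:
            end = parser(s, pos)
            if end is None or end == pos:
                return pos
            pos = end
    return parse


expression = ForwardSketch()            # declared before it is defined
factor = alternative(integer, series(text("("), expression, text(")")))
expression.set(series(factor, zero_or_more(series(text("+"), factor))))

print(expression("(1+2)+3", 0), expression("(1+2", 0))
```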
- class Grammar(root: Parser | None = None, static_analysis: bool | None = None)#
Class Grammar directs the parsing process and stores global state information of the parsers, i.e. state information that is shared across parsers.
Grammars are basically collections of parser objects, which are connected to an instance object of class Grammar. There exist two ways of connecting parsers to grammar objects: Either by passing the root parser object to the constructor of a Grammar object (“direct instantiation”), or by assigning the root parser to the class variable root__ of a descendant class of class Grammar.
Example for direct instantiation of a grammar:
    >>> number = RE(r'\d+') + RE(r'\.') + RE(r'\d+') | RE(r'\d+')
    >>> number_parser = Grammar(number)
    >>> number_parser("3.1416").content
    '3.1416'
Collecting the parsers that define a grammar in a descendant class of class Grammar and assigning the named parsers to class variables rather than to global variables has several advantages:
1. It keeps the namespace clean.
2. The parser names of named parsers do not need to be passed to the constructor of the Parser object explicitly, but it suffices to assign them to class variables, which results in better readability of the Python code. See classmethod Grammar._assign_parser_names__()
3. The parsers in the class do not necessarily need to be connected to one single root parser, which is helpful for testing and when building up a parser gradually from several components.
As a consequence, though, it is highly recommended that a Grammar class should not define any other variables or methods with names that are legal parser names. A name ending with a double underscore __ is not a legal parser name and can safely be used.
Example:
class Arithmetic(Grammar):
    # special fields for implicit whitespace and comment configuration
    COMMENT__ = r'#.*(?:\n|$)'  # Python style comments
    wspR__ = mixin_comment(whitespace=r'[\t ]*', comment=COMMENT__)
    # parsers
    expression = Forward()
    INTEGER = RE('\\d+')
    factor = INTEGER | TKN("(") + expression + TKN(")")
    term = factor + ZeroOrMore((TKN("*") | TKN("/")) + factor)
    expression.set(term + ZeroOrMore((TKN("+") | TKN("-")) + term))
    root__ = expression
Upon instantiation the parser objects are deep-copied to the Grammar object and assigned to object variables of the same name. For any parser that is directly assigned to a class variable the field parser.pname contains the variable name after instantiation of the Grammar class. The parser will nevertheless remain anonymous with respect to the tag names of the nodes it generates, if its name is included in the disposable__-set or, if disposable__ has been defined by a regular expression, matched by that regular expression. If one and the same parser is assigned to several class variables such as, for example, the parser expression in the example above, which is also assigned to root__, the first name sticks.
Grammar objects are callable. Calling a grammar object with a UTF-8 encoded document initiates the parsing of the document with the root parser. The return value is the concrete syntax tree. Grammar objects can be reused (i.e. called again) after parsing. Thus, it is not necessary to instantiate more than one Grammar object per thread.
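How names could be derived from class variables, including the rule that the first name sticks, can be sketched in plain Python. This is not DHParser's actual implementation; `ToyParser`, `ToyGrammar` and `assign_parser_names` are hypothetical names used only for this illustration.

```python
class ToyParser:
    def __init__(self) -> None:
        self.pname = ""          # an empty name means "anonymous"


def assign_parser_names(cls) -> None:
    """Give every ToyParser stored in a class variable that variable's name.

    Names ending with a double underscore are skipped, and a parser that
    already has a name keeps it ("the first name sticks").
    """
    for name, value in vars(cls).items():
        if isinstance(value, ToyParser) and not name.endswith("__"):
            if not value.pname:
                value.pname = name


class ToyGrammar:
    expression = ToyParser()
    term = ToyParser()
    alias = expression           # a second name for the same object
    root__ = expression          # not a legal parser name, hence ignored


assign_parser_names(ToyGrammar)
print(ToyGrammar.expression.pname, ToyGrammar.alias.pname, ToyGrammar.term.pname)
# expression expression term
```

The sketch relies on class dictionaries preserving definition order, which is why `expression` wins over `alias`.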
Grammar classes contain a few special class fields for implicit whitespace and comments that should be overwritten, if the defaults (no comments, horizontal right aligned whitespace) don’t fit:
Class Attributes:
- Variables:
root__ – The root parser of the grammar. Theoretically, all parsers of the grammar should be reachable by the root parser. However, for testing of yet incomplete grammars class Grammar does not assume that this is the case.
resume_rules__ – A mapping of parser names to a list of regular expressions that act as rules to find the reentry point if a ParserError was thrown during the execution of the parser with the respective name.
skip_rules__ – A mapping of parser names to a list of regular expressions that act as rules to find the reentry point if a ParserError was thrown during the execution of the parser with the respective name.
error_messages__ – A mapping of parser names to a tuple of regular expressions and error messages. If a mandatory violation error occurs on a specific symbol (i.e. parser name) and any of the regular expressions matches, the error message of the first matching expression is used instead of the generic mandatory violation error message. This makes it possible to answer typical kinds of errors (say putting a comma “,” where a semicolon “;” is expected) with more informative error messages.
disposable__ – A set of parser-names or a regular expression to identify names of parsers that are assigned to class fields but shall nevertheless yield anonymous nodes (i.e. nodes the tag name of which starts with a colon “:” followed by the parser’s class name).
parser_initialization__ – Before the grammar class (!) has been initialized, which happens upon the first time it is instantiated (see _assign_parser_names() for an explanation), this class field contains a value other than “done”. A value of “done” indicates that the class has already been initialized.
static_analysis_pending__ – True as long as no static analysis (see the method with the same name for more information) has been done to check the parser tree for correctness. Static analysis is done at instantiation and the flag is then set to false, but it can also be carried out once the class has been generated (by DHParser.ebnf.EBNFCompiler) and then be set to false in the definition of the grammar class already.
static_analysis_errors__ – A list of errors and warnings that were found in the static analysis
parser_names__ – The list of the names of all named parsers defined in the grammar class
python_src__ – For the purpose of debugging and inspection, this field can take the python src of the concrete grammar class (see
dsl.grammar_provider()).
Instance Attributes:
- Variables:
all_parsers__ – A set of all parsers connected to this grammar object
comment_rx__ – The compiled regular expression for comments. If no comments have been defined, it defaults to RX_NEVER_MATCH. This instance-attribute will only be defined if a class-attribute with the same name does not already exist!
start_parser__ – During parsing, the parser with which the parsing process was started (see method __call__) or None if no parsing process is running.
unconnected_parsers__ – A set of parsers that are not connected to the root parser. The set of parsers is collected during instantiation.
resume_parsers__ – A set of parsers that appear either in a resume-rule or a skip-rule. This set is a subset of unconnected_parsers__.
_dirty_flag__ – A flag indicating that the Grammar has been called at least once so that the parsing-variables need to be reset when it is called again.
text__ – The text that is currently being parsed or that has most recently been parsed.
document__ – A string view on the text that has most recently been parsed or that is currently being parsed.
document_length__ – The length of the document.
document_lbreaks__ – (property) list of linebreaks within the document, starting with -1 and ending with EOF. This helps to generate line and column number for history recording and will only be initialized if history_tracking__ is true.
tree__ – The root-node of the parsing tree. This variable is available for error-reporting already during parsing via self.grammar.tree__.add_error, but it references the full parsing tree only after parsing has been finished.
_reversed__ – The same text in reverse order, needed by the Lookbehind-parsers.
variables__ – A mapping of variable names to a stack of their respective string values, needed by the Capture-, Retrieve- and Pop-parsers.
rollback__ – A list of tuples (location, rollback-function) that are deposited by the Capture- and Pop-parsers. If the parsing process reaches a dead end then all rollback-functions up to the point to which it retreats will be called and the state of the variable stack restored accordingly.
last_rb__loc__ – The last, i.e. most advanced location in the text where a variable changing operation occurred. If the parser backtracks to a location at or before last_rb__loc__ (i.e. location < last_rb__loc__) then a rollback of all variable changing operations is necessary that occurred after the location to which the parser backtracks. This is done by calling method rollback_to__(location).
ff_pos__ – The “farthest fail”, i.e. the highest location in the document where a parser failed. This gives a good indication where and why parsing failed, if the grammar did not match a text.
ff_parser__ – The parser that failed at the “farthest fail”-location ff_pos__.
suspend_memoization__ – A flag that, if set, suspends memoization of results from returning parsers. This flag is needed by the left-recursion handling algorithm (see Parser.__call__() and Forward.__call__()) as well as the context-sensitive parsers (see function Grammar.push_rollback__()).
left_recursion__ – Turns on left-recursion handling. This prevents the recursive descent parser from getting caught in an infinite loop (resulting in a maximum recursion depth reached error) when the grammar definition contains left recursions.
associated_symbol_cache__ – A cache for the associated_symbol__()-method.
# mirrored class attributes:
static_analysis_pending__ – A pointer to the class attribute of the same name. (See the description above.) If the class is instantiated with a parser, this pointer will be overwritten with an instance variable that serves the same function.
static_analysis_errors__ – A pointer to the class attribute of the same name. (See the description above.) If the class is instantiated with a parser, this pointer will be overwritten with an instance variable that serves the same function.
Tracing and debugging support:
The following parameters are needed by the debugging functions in module trace.py. They should not be manipulated by the users of class Grammar directly.
- Variables:
history_tracking__ – A flag indicating that the parsing history is being tracked. This flag should not be manipulated by the user. Use trace.set_tracer(grammar, trace.trace_history) to turn (full) history tracking on and trace.set_tracer(grammar, None) to turn it off. Default is off.
resume_notices__ – A flag indicating that resume messages are generated in addition to the error messages, in case the parser was able to resume after an error. Use trace.resume_notices(grammar) to turn resume messages on and trace.set_tracer(grammar, None) to turn resume messages (as well as history recording) off. Default is off.
call_stack__ – A stack of the tag names and locations of all parsers in the call chain to the currently processed parser during parsing. The call stack can be thought of as a breadcrumb path. This is required for recording the parser history (for debugging) and, eventually, i.e. one day in the future, for tracing through the parsing process.
history__ – A list of history records. A history record is appended to the list each time a parser either matches, fails or if a parser-error occurs. See class log.HistoryRecord. History records store copies of the current call stack.
moving_forward__ – This flag indicates that the parsing process is currently moving forward. It is needed to reduce noise in history recording and should not be considered as having a valid value if history recording is turned off! (See Parser.__call__())
most_recent_error__ – The most recent parser error that has occurred or None. This can be read by tracers. See module trace.
Configuration parameters:
The values of these parameters are copied from the global configuration in the constructor of the Grammar object. (See module configuration.)
- Variables:
max_parser_dropouts__ – Maximum allowed number of retries after errors where the parser would exit before the complete document has been parsed. Default is 1, as usually the retry-attempts lead to a proliferation of senseless error messages.
reentry_search_window__ – The number of following characters that the parser considers when searching a reentry point when a syntax error has been encountered. Default is 10,000 characters.
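The interplay of resume rules and the search window can be pictured with a small stdlib-only sketch. It is an illustration under assumptions, not DHParser's algorithm: the function name `find_reentry_point`, the choice to resume at the end of the earliest-ending rule match, and the return value -1 for "nothing found" are all made up for this example.

```python
import re
from typing import Sequence


def find_reentry_point(text: str, error_pos: int,
                       rules: Sequence[str], search_window: int = 10_000) -> int:
    """Return the position right after the closest rule match that follows
    error_pos, or -1 if no rule matches within the search window.

    Assumption of this sketch: parsing resumes at the END of the match.
    """
    window = text[error_pos:error_pos + search_window]
    best = -1
    for rule in rules:
        m = re.search(rule, window)
        if m and (best < 0 or error_pos + m.end() < best):
            best = error_pos + m.end()
    return best


source = "x = 1 +* 2; y = 3;"
# Suppose parsing failed at the "*" (position 7): skip to the next statement.
print(find_reentry_point(source, 7, [r";\s*", r"\}"]))                    # 12
print(find_reentry_point(source, 7, [r";\s*", r"\}"], search_window=3))   # -1
```

A small window keeps the search cheap on large documents, at the price of giving up when the next plausible reentry point lies further away.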
- as_ebnf__() str#
Serializes the Grammar object as a grammar-description in the Extended Backus-Naur-Form. Does not serialize directives and may contain abbreviations with three dots “ … “ for very long expressions.
- associated_symbol__(parser: Parser) Parser#
Returns the closest named parser that contains parser. If parser is a named parser itself, parser is returned. If parser is not connected to any symbol in the Grammar, an AttributeError is raised. Example:
>>> word = Series(RegExp(r'\w+'), Whitespace(r'\s*'))
>>> word.pname = 'word'
>>> gr = Grammar(word)
>>> anonymous_re = gr['word'].parsers[0]
>>> gr.associated_symbol__(anonymous_re).pname
'word'
- fill_associated_symbol_cache__()#
Pre-fills the associated symbol cache with an algorithm that is more efficient than filling the cache by calling
associated_symbol__()on each parser individually.
- fullmatch__(parser: str | Parser, string: str, source_mapping: SourceMapFunc | None = None) str | None#
Returns the matched string, if the parser matches the complete string or None if the parser does not match.
- match__(parser: str | Parser, string: str, source_mapping: SourceMapFunc | None = None) str | None#
Returns the matched string, if the parser matches the beginning of a string or None if the parser does not match.
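The difference between matching the beginning of a string and matching the complete string is the same as between `re.match` and `re.fullmatch` in Python's standard library. The sketch below uses plain regular expressions in place of parsers; `toy_match` and `toy_fullmatch` are hypothetical stand-ins, not the methods documented above.

```python
import re
from typing import Optional


def toy_match(pattern: str, string: str) -> Optional[str]:
    """Matched text if the pattern matches the beginning of string, else None."""
    m = re.match(pattern, string)
    return m.group(0) if m else None


def toy_fullmatch(pattern: str, string: str) -> Optional[str]:
    """Matched text if the pattern matches the complete string, else None."""
    m = re.fullmatch(pattern, string)
    return m.group(0) if m else None


print(toy_match(r"\d+", "2024-05"))       # 2024
print(toy_fullmatch(r"\d+", "2024-05"))   # None
print(toy_fullmatch(r"\d+", "2024"))      # 2024
```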
- push_rollback__(location, func)#
Adds a rollback function that either removes or re-adds values on the variable stack (
self.variables) that have been added (or removed) by Capture or Pop Parsers, the results of which have been dismissed.
- property reversed__: StringView#
Returns a reversed version of the currently parsed document. As about the only case where this is needed is the Lookbehind-parser, this is done lazily.
- rollback_to__(location)#
Rolls back the variable stacks (self.variables) to their state at an earlier location in the parsed document.
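The cooperation of rollback functions and variable stacks can be sketched as follows. This is a simplified model, not DHParser's code: the class `VariableStacks` and its method names are hypothetical, and treating a change recorded exactly at the target location as "to be undone" is an assumption of the sketch.

```python
from typing import Callable, Dict, List, Tuple


class VariableStacks:
    def __init__(self) -> None:
        self.variables: Dict[str, List[str]] = {}
        self.rollback: List[Tuple[int, Callable[[], object]]] = []
        self.last_rb_loc = -1

    def push_rollback(self, location: int, func: Callable[[], object]) -> None:
        """Deposit an undo-function for a change made at location."""
        self.rollback.append((location, func))
        self.last_rb_loc = location

    def capture(self, location: int, name: str, value: str) -> None:
        stack = self.variables.setdefault(name, [])
        stack.append(value)
        self.push_rollback(location, stack.pop)          # undo: remove it again

    def pop(self, location: int, name: str) -> str:
        stack = self.variables[name]
        value = stack.pop()
        self.push_rollback(location, lambda: stack.append(value))  # undo: re-add
        return value

    def rollback_to(self, location: int) -> None:
        """Undo, newest first, every change recorded at or after location."""
        while self.rollback and self.rollback[-1][0] >= location:
            _, undo = self.rollback.pop()
            undo()
        self.last_rb_loc = self.rollback[-1][0] if self.rollback else -1


state = VariableStacks()
state.capture(0, "delimiter", "'''")
state.capture(10, "delimiter", '"')
state.pop(20, "delimiter")
state.rollback_to(5)          # the parser backtracks to location 5
print(state.variables)        # {'delimiter': ["'''"]}
```

Undoing in reverse order matters: the value removed at location 20 is first re-added and only then is the capture from location 10 removed.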
- static_analysis__() List[AnalysisError]#
Checks the parser tree statically for possible errors.
This function is called by the constructor of class Grammar and does not need to (and should not) be called externally.
- Returns:
a list of error-tuples consisting of the narrowest containing named parser (i.e. the symbol on which the failure occurred), the actual parser that failed and an error object.
- exception GrammarError(static_analysis_result: List[AnalysisError])#
GrammarError will be raised if static analysis reveals errors in the grammar.
- class IgnoreCase(text: str)#
Parses plain text strings, ignoring the case, e.g. “head” == “HEAD” == “Head”. (Could be done by RegExp as well, but is faster.)
Example:
>>> tag = IgnoreCase("head")
>>> Grammar(tag)("HEAD").content
'HEAD'
>>> Grammar(tag)("Head").content
'Head'
- class Interleave(*parsers: Parser, mandatory: int = 1073741824, repetitions: Sequence[Tuple[int, int]] = ())#
Parse elements in arbitrary order.
Examples:
>>> prefixes = TKN("A") * TKN("B")
>>> Grammar(prefixes)('A B').content
'A B'
>>> Grammar(prefixes)('B A').content
'B A'
>>> prefixes = Interleave(TKN("A"), TKN("B"), repetitions=((0, 1), (0, 1)))
>>> Grammar(prefixes)('A B').content
'A B'
>>> Grammar(prefixes)('B A').content
'B A'
>>> Grammar(prefixes)('B').content
'B'
EBNF-Notation:
... ° ...
EBNF-Example:
float = { /\d/ }+ ° /\./
- is_optional() bool | None#
Returns True, if the parser can never fail, i.e. never yields None instead of a node. Returns False, if the parser can fail. Returns None if it is not known whether the parser can fail.
- static_analysis() List[AnalysisError]#
Analyses the parser for logical errors after the grammar has been instantiated.
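What "arbitrary order" combined with per-element repetition ranges means can be shown with a stdlib-only sketch. It works on literal tokens rather than parsers and is not DHParser's implementation; the function `interleave` and its signature are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def interleave(text: str, tokens: Sequence[str],
               repetitions: Sequence[Tuple[int, int]]) -> Optional[str]:
    """Match the tokens in any order. Token i must occur at least
    repetitions[i][0] and at most repetitions[i][1] times.
    Returns the matched prefix of text, or None if a minimum is not met."""
    counts = [0] * len(tokens)
    pos = 0
    progress = True
    while progress:
        progress = False
        for i, token in enumerate(tokens):
            if counts[i] < repetitions[i][1] and text.startswith(token, pos):
                counts[i] += 1
                pos += len(token)
                progress = True
                break
    if all(counts[i] >= repetitions[i][0] for i in range(len(tokens))):
        return text[:pos]
    return None


print(interleave("BA", ["A", "B"], [(1, 1), (1, 1)]))   # BA
print(interleave("B", ["A", "B"], [(1, 1), (1, 1)]))    # None
print(interleave("B", ["A", "B"], [(0, 1), (0, 1)]))    # B
```

With the range (1, 1) every element is required exactly once, while (0, 1) makes it optional, which mirrors the two variants in the examples above.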
- class LateBindingUnary(parser_name: str)#
Superclass for late-binding unary parsers. LateBindingUnary only stores the name of a parser upon object creation. This name is resolved at the time when the late-binding-parser-object is connected to the grammar.
EXPERIMENTAL !!!
A possible use case is a custom parser derived from LateBindingUnary that calls another parser without having to worry about whether the called parser has already been defined earlier in the Grammar-class.
LateBindingUnary is not to be confused with Forward and should not be abused for recursive parser calls either!
- class Lookahead(parser: Parser)#
Matches, if the contained parser would match for the following text, but does not consume any text.
- static_analysis() List[AnalysisError]#
Analyses the parser for logical errors after the grammar has been instantiated.
- class Lookbehind(parser: Parser)#
Matches, if the contained parser would match backwards. Requires the contained parser to be a RegExp, _RE or Text parser.
EXPERIMENTAL
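Because lookbehind operates on the reversed input, the pattern has to be written backwards as well. The following sketch shows the principle with Python's `re` module only; the helper `lookbehind` is hypothetical, and for simplicity it reverses the text on every call, whereas the `reversed__` property described above is computed lazily once.

```python
import re


def lookbehind(text: str, location: int, reversed_pattern: str) -> bool:
    """True if reversed_pattern matches the reversed text before location."""
    reversed_text = text[:location][::-1]
    return re.match(reversed_pattern, reversed_text) is not None


source = "BEGIN  x := 1"
location = source.index("x")
print(lookbehind(source, location, r"\s*NIGEB"))   # True
print(lookbehind(source, location, r"\s*BEGIN"))   # False
```

Anchoring the match at the start of the reversed text corresponds to "immediately before the current location" in the original text, and variable-length patterns such as `\s*` pose no problem.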
- class NaryParser(*parsers: Parser)#
Base class of all n-ary parsers, i.e. parsers that contain one or more other parsers, like the alternative parser for example.
The NaryParser base class supplies __deepcopy__() and methods for n-ary parsers. The __deepcopy__()-method needs to be overwritten, however, if the constructor of a derived class takes additional parameters.
- class NegativeLookahead(parser: Parser)#
Matches, if the contained parser would not match for the following text.
- class NegativeLookbehind(parser: Parser)#
Matches, if the contained parser would not match backwards. Requires the contained parser to be a RegExp-parser.
- class Never#
A parser that never matches.
- class NoMemoizationParser#
Base class for parsers that should not memoize
- reset()#
Initializes or resets any parser variables. If overwritten, the reset()-method of the parent class must be called from the reset()-method of the derived class.
- class OneOrMore(parser: Parser)#
OneOrMore applies a parser repeatedly as long as this parser matches. Unlike ZeroOrMore, which always matches, OneOrMore requires at least one match.
Examples:
>>> sentence = OneOrMore(RE(r'\w+,?')) + TKN('.')
>>> Grammar(sentence)('Wo viel der Weisheit, da auch viel des Grämens.').content
'Wo viel der Weisheit, da auch viel des Grämens.'
>>> str(Grammar(sentence)('.'))  # an empty sentence does not match
' <<< Error on "." | Parser root->/\\\\w+,?/ did not match: ».« >>> '
>>> forever = OneOrMore(RegExp('(?=.)|$'))
>>> Grammar(forever)('')  # infinite loops will automatically be broken
Node('root', '')
Except for the end of file a warning will be emitted, if an infinite-loop is detected.
EBNF-Notation:
{ ... }+
EBNF-Example:
sentence = { /\w+,?/ }+
- static_analysis() List[AnalysisError]#
Analyses the parser for logical errors after the grammar has been instantiated.
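The repetition logic, including the guard against endless loops on empty matches, can be sketched with regular expressions alone. This is an illustration and not DHParser's implementation; `one_or_more` is a hypothetical function that returns the matched text and the rest, or `None` if there was no match at all.

```python
import re
from typing import Optional, Tuple


def one_or_more(pattern: str, text: str) -> Optional[Tuple[str, str]]:
    """Apply pattern repeatedly; fail (None) unless it matched at least once."""
    rx = re.compile(pattern)
    pos, matches = 0, 0
    while True:
        m = rx.match(text, pos)
        if m is None:
            break
        matches += 1
        if m.end() == pos:      # empty match: stop instead of looping forever
            break
        pos = m.end()
    return (text[:pos], text[pos:]) if matches else None


print(one_or_more(r"\w+,?\s*", "Wo viel der Weisheit, da"))
print(one_or_more(r"\w+,?\s*", "."))        # None
print(one_or_more(r"(?=.)|$", ""))          # ('', '')
```

The last call corresponds to the `forever` example above: the pattern matches the empty string at the end of the text, so the loop is left after one successful but non-advancing match.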
- class Option(parser: Parser)#
Parser Option always matches, even if its child-parser did not match.
If the child-parser did not match, Option returns a node with no content and does not move forward in the text.
If the child-parser did match, Option returns a node with the node returned by the child-parser as its single child and the text at the position where the child-parser left it.
Examples:
>>> number = Option(TKN('-')) + RegExp(r'\d+') + Option(RegExp(r'\.\d+'))
>>> Grammar(number)('3.14159').content
'3.14159'
>>> Grammar(number)('3.14159').as_sxpr()
'(root (:RegExp "3") (:RegExp ".14159"))'
>>> Grammar(number)('-1').content
'-1'
EBNF-Notation:
[ ... ]
EBNF-Example:
number = ["-"] /\d+/ [ /\.\d+/ ]
- is_optional() bool | None#
Returns True, if the parser can never fail, i.e. never yields None instead of a node. Returns False, if the parser can fail. Returns None if it is not known whether the parser can fail.
- static_analysis() List[AnalysisError]#
Analyses the parser for logical errors after the grammar has been instantiated.
- class Parser#
(Abstract) Base class for Parser combinator parsers. Any parser object that is actually used for parsing (i.e. no mock parsers) should be derived from this class.
Since parsers can contain other parsers (see classes UnaryOperator and NaryOperator) they form a cyclical directed graph. A root parser is a parser from which all other parsers can be reached. Usually, there is one root parser which serves as the starting point of the parsing process. When speaking of “the root parser” it is this root parser object that is meant.
There are two different types of parsers:
1. Named parsers for which a name is set in field parser.pname. The results produced by these parsers can later be retrieved in the AST by the parser name.
2. Disposable parsers where the name-field just contains the empty string. AST-transformation of disposable parsers can be hooked only to their class name, and not to the individual parser.
Parser objects are callable and parsing is done by calling a parser object with the text to parse.
If the parser matches, it returns a tuple consisting of a node representing the root of the concrete syntax tree resulting from the match as well as the substring text[i:] where i is the length of the matched text (which can be zero in the case of parsers like ZeroOrMore or Option). If i > 0 then the parser has “moved forward”.
If the parser does not match, it returns (None, text). Note that this is not the same as an empty match ("", text). An empty match can, for example, be returned by the ZeroOrMore-parser in case the contained parser is repeated zero times.
- Variables:
pname – The parser’s name.
disposable – A property indicating that the parser returns anonymous nodes. For performance reasons this is implemented as an object variable rather than a property. This property should always be equal to self.name[0] == ":".
drop_content – A property (for performance reasons implemented as a simple field) that, if set, induces the parser not to return the parsed content or subtree if it has matched, but the dummy EMPTY_NODE. In effect the parsed content will be dropped from the concrete syntax tree already. Only anonymous (or pseudo-anonymous) parsers are allowed to drop content.
node_name – The name for the nodes that are created by the parser. If the parser is named, this is the same as pname, otherwise it is the name of the parser’s type prefixed with a colon “:”.
visited – Mapping of places this parser has already been to during the current parsing process onto the results the parser returned at the respective place. This dictionary is used to implement memoizing.
_parse_proxy – Usually, just a reference to self._parse, but can be overwritten to run the call to the _parse-method through a proxy like, for example, a tracing debugger. See trace.
_sub_parsers – Set of parsers that are directly referred to by this parser, e.g. parser “a” defined by the EBNF-expression “a = b (b | c)” has the sub-parser-set {b, c}.
Notes:
1. The set is empty for parsers that derive neither from UnaryParser nor from NaryParser.
2. Unary parsers have exactly one sub-parser.
3. N-ary parsers have one or more sub-parsers. For n-ary parsers len(p.sub_parser) can be lower than len(p.parsers), in case one and the same parser is referred to more than once in the contained parser’s list.
_grammar – A reference to the Grammar object to which the parser is attached.
_symbol – The closest named parser to which this parser is connected in a grammar. If the parser itself is named, this is the same as self. _symbol is private and should be accessed only via the symbol-property which will initialize its value on first use.
_descendants_cache – A cache of all descendant parsers that can be reached from this parser.
_desc_trails_cache – A cache of the trails (i.e. list of parsers) from this parser to all other parsers that can be reached from this parser.
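The calling convention described above (a result and the remaining text, with `None` signalling failure) and the role of the `visited` mapping can be sketched in a few lines. `ToyRegExp` is a hypothetical stand-in, not a DHParser class; it returns plain strings instead of nodes and, as a simplification, uses the remaining text itself as the key that identifies a place.

```python
import re
from typing import Dict, Optional, Tuple

Result = Tuple[Optional[str], str]   # (matched content or None, remaining text)


class ToyRegExp:
    def __init__(self, pattern: str) -> None:
        self.rx = re.compile(pattern)
        self.visited: Dict[str, Result] = {}   # memo: remaining text -> result
        self.calls = 0                         # counts real (non-memoized) runs

    def __call__(self, text: str) -> Result:
        if text in self.visited:               # been here before: reuse result
            return self.visited[text]
        self.calls += 1
        m = self.rx.match(text)
        result: Result = (m.group(0), text[m.end():]) if m else (None, text)
        self.visited[text] = result
        return result

    def reset(self) -> None:
        self.visited.clear()


digits = ToyRegExp(r"\d*")
letters = ToyRegExp(r"[a-z]+")
print(digits("abc"))     # ('', 'abc')    empty match, the parser succeeded
print(letters("123"))    # (None, '123')  no match
print(letters("abc"), letters("abc"), letters.calls)
```

The second call of `letters("abc")` is answered from the memo, so `letters.calls` ends up at 2 (one run for "123", one for "abc").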
- apply(func: ApplyFunc, grammar=None) bool | None#
Applies function func(parser) recursively to this parser and all descendant parsers as long as func() returns None or False. Traversal is pre-order. Stops the further application of func and returns True once func has returned True.
If func has been applied to all descendant parsers without issuing a stop signal by returning True, False is returned.
If apply is called for the first time on the parser, the parser will be connected to grammar.
This use of the return value makes it possible to use the apply-method both to issue tests on all descendant parsers (including self), which may be decided already after some parsers have been visited without any need to visit further parsers. At the same time apply can be used to simply apply a procedure to all descendant parsers (including self) without worrying about forgetting the return value of the procedure, because a return value of None means “carry on”.
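The two usage patterns, "run a procedure on everything" and "search until found", follow from the return-value convention and can be demonstrated on a toy parser graph. `ToyParser` and its `apply` method below are hypothetical and only imitate the described behaviour; the connection to a grammar object is left out.

```python
from typing import Callable, List, Optional, Set


class ToyParser:
    def __init__(self, name: str, *children: "ToyParser") -> None:
        self.name = name
        self.children = children

    def apply(self, func: Callable[["ToyParser"], Optional[bool]],
              _seen: Optional[Set[int]] = None) -> bool:
        """Pre-order traversal; stops and returns True once func returns True."""
        seen = set() if _seen is None else _seen
        if id(self) in seen:          # parser graphs may contain cycles
            return False
        seen.add(id(self))
        if func(self):
            return True
        return any(child.apply(func, seen) for child in self.children)


factor = ToyParser("factor")
term = ToyParser("term", factor, factor)
expression = ToyParser("expression", term, ToyParser("comment"))

visited: List[str] = []
# A procedure that returns None is applied to every parser.
print(expression.apply(lambda p: visited.append(p.name)))   # False
print(visited)   # ['expression', 'term', 'factor', 'comment']

visited.clear()
# A test that returns True stops the traversal early.
print(expression.apply(lambda p: visited.append(p.name) or p.name == "term"))  # True
print(visited)   # ['expression', 'term']
```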
- apply_to_trail(func: Callable[[Tuple[Parser]], bool | None]) bool | None#
Same as
Parser.apply(), only that the applied function receives the complete “trail”, i.e. list of parsers that lead from self to the visited parser as argument.
- descendant_trails() AbstractSet[ParserTrail]#
Returns a set of the trails of self and all descendant parsers, avoiding circles. NOTE: The algorithm is rather sloppy and the returned set is not really comprehensive, but sufficient to trace anonymous parsers to their nearest named ancestor.
- descendants(grammar=None) AbstractSet[Parser]#
Returns a set of self and all descendant parsers, avoiding circles.
- is_optional() bool | None#
Returns True, if the parser can never fail, i.e. never yields None instead of a node. Returns False, if the parser can fail. Returns None if it is not known whether the parser can fail.
- name(pname: str = '', disposable: bool | None = None) Parser#
Sets the parser name to pname and returns self. If disposable is True, the nodes produced by the parser will also be marked as disposable, i.e. they can be eliminated but their content will be retained. The same can be achieved by prefixing the pname-string with a colon “:” or with “HIDE:”. Another possible prefix is “DROP:” in which case the nodes will be dropped entirely, including their content. (This is useful to keep delimiters out of the syntax-tree.)
- reset()#
Initializes or resets any parser variables. If overwritten, the reset()-method of the parent class must be called from the reset()-method of the derived class.
- set_proxy(proxy: ParseFunc | None)#
Sets a proxy that replaces the _parse()-method. Call set_proxy with None to remove a previously set proxy. Typical use case is the installation of a tracing debugger. See module trace.
- static_analysis() List[AnalysisError]#
Analyses the parser for logical errors after the grammar has been instantiated.
- exception ParserError(parser: Parser, node: Node, node_orig_len: int, location: int, error: Error, *, first_throw: bool)#
A ParserError is thrown for those parser errors that allow the controlled re-entrance of the parsing process after the error occurred. If a reentry-rule has been configured for the parser where the error occurred, the parser guard can resume the parsing process.
Currently, the only case when a ParserError is thrown (and not some different kind of error like UnknownParserError) is when a Series- or Interleave-parser detects a missing mandatory element.
- Variables:
parser – The parser within which the error has been raised
node – The node within which the error is located
node_orig_len – The original size of that node. The actual size of that node may change due to later processing steps and thus not be reliable anymore for the description of the error.
location – The location in the document where the parser that caused the error started. This is not to be confused with the location where the error occurred, because by the time the error occurs the parser may already have read some part of the document.
error – The Error object containing among other things the exact error location.
first_throw – A flag that indicates that the error has not yet been re-raised
attributes_locked – A frozenset of attributes that must not be overwritten once the ParserError-object has been initialized by its constructor
callstack_snapshot – A snapshot of the callstack (if history-recording has been turned on) at the point where the error occurred.
- new_PE(**kwargs)#
Returns a new ParserError object with the same attribute values as self, except those that are reassigned in kwargs:
>>> pe = ParserError(Parser(), Node('test', ""), 0, 0, Error("", 0), first_throw=True)
>>> pe_derived = pe.new_PE(first_throw = False)
>>> pe.first_throw
True
>>> pe_derived.first_throw
False
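The "copy with some attributes reassigned" pattern is available in Python's standard library as `dataclasses.replace`. The sketch below uses it on a hypothetical, much reduced `ToyParserError`; it is an analogy and says nothing about how `new_PE` is actually implemented. Freezing the dataclass loosely mirrors the idea of locked attributes.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyParserError:
    location: int
    message: str
    first_throw: bool = True

    def new_pe(self, **kwargs) -> "ToyParserError":
        """Copy of self with the attributes given in kwargs reassigned."""
        return replace(self, **kwargs)


pe = ToyParserError(location=42, message="';' expected")
pe_derived = pe.new_pe(first_throw=False)
print(pe.first_throw, pe_derived.first_throw, pe_derived.location)   # True False 42
```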
- class Pop(symbol: Parser, match_func: MatchVariableFunc | None = None)#
Matches if the following text starts with the value of a particular variable. As a variable in this context means a stack of values, the last value will be compared with the following text. Unlike the Retrieve-parser, the Pop-parser removes the value from the stack in case of a match.
The constructor parameter symbol determines which variable is used.
- reset()#
Initializes or resets any parser variables. If overwritten, the reset()-method of the parent class must be called from the reset()-method of the derived class.
- class PreprocessorToken(token: str)#
Parses tokens that have been inserted by a preprocessor.
Preprocessors can generate Tokens with the
make_token-function. These tokens start and end with magic characters that can only be matched by the PreprocessorToken Parser. Such tokens can be used to insert BEGIN - END delimiters at the beginning or ending of a quoted block, for example.
- RE(regexp, wsL='', wsR='\\s*')#
Syntactic Sugar for ‘Series(Whitespace(wsL), RegExp(regexp), Whitespace(wsR))’
- class RegExp(regexp)#
Regular expression parser.
The RegExp-parser parses text that matches a regular expression. RegExp can also be considered as the “atomic parser”, because all other parsers delegate part of the parsing job to other parsers, but do not match text directly.
Example:
>>> word = RegExp(r'\w+')
>>> Grammar(word)("Haus").content
'Haus'
EBNF-Notation:
/ ... /
EBNF-Example:
word = /\w+/
- class Retrieve(symbol: Parser, match_func: MatchVariableFunc | None = None)#
Matches if the following text starts with the value of a particular variable. As a variable in this context means a stack of values, the last value will be compared with the following text. It will not be removed from the stack! (This is the difference between the Retrieve and the Pop parser.) The constructor parameter symbol determines which variable is used.
- Variables:
parser – The name of the parser that has stored the value to be retrieved, in other words: “the observed parser”. (Class Retrieve reuses the instance variable “parser” of its superclass Unary with slightly different semantics.)
match – A procedure through which the processing of the retrieved symbols is channeled. In the simplest case, it merely returns the last string stored by the observed parser. This can be (mis-)used to execute any kind of semantic action.
- get_node_name() str#
Returns a name for the retrieved node. If the Retrieve-parser has a node-name, this overrides the node-name of the retrieved symbol’s parser.
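The difference between retrieving and popping a value can be made concrete with a plain dictionary of stacks. The functions `retrieve` and `pop` and the module-level `variables` dictionary below are hypothetical simplifications for illustration; they compare literal strings and ignore the match function mentioned above.

```python
from typing import Dict, List, Optional

# One stack of captured values per variable (here: opening quotation marks).
variables: Dict[str, List[str]] = {"quote": ['"', "'"]}


def retrieve(symbol: str, text: str) -> Optional[str]:
    """Match the top value of the stack, but leave the stack untouched."""
    stack = variables[symbol]
    if stack and text.startswith(stack[-1]):
        return stack[-1]
    return None


def pop(symbol: str, text: str) -> Optional[str]:
    """Like retrieve, but remove the value from the stack on a match."""
    value = retrieve(symbol, text)
    if value is not None:
        variables[symbol].pop()
    return value


print(retrieve("quote", "' rest"), variables["quote"])   # ' ['"', "'"]
print(pop("quote", "' rest"), variables["quote"])        # ' ['"']
print(pop("quote", "' rest"), variables["quote"])        # None ['"']
```

After the successful pop the top of the stack is the double quote, so the same text no longer matches and the stack stays as it is.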
- class Series(*parsers: Parser, mandatory: int = 1073741824)#
Matches if each of a series of parsers matches exactly in the order of the series.
Example:
>>> variable_name = RegExp(r'(?!\d)\w') + RE(r'\w*')
>>> Grammar(variable_name)('variable_1').content
'variable_1'
>>> str(Grammar(variable_name)('1_variable'))
' <<< Error on "1_variable" | Parser root->/(?!\\\\d)\\\\w/ did not match: »1_variable« >>> '
EBNF-Notation:
... ... (sequence of parsers separated by a blank or new line)
EBNF-Example:
series = letter letter_or_digit
- class SmartRE(pattern, repr_str: str = '')#
Regular expression parser that returns a tree with a node for every captured group (named as the group or as the number of the group, in case it is not a named group). The space between groups is dropped.
Example:
>>> name = SmartRE(r'(?P<christian_name>\w+)\s+(?P<family_name>\w+)').name("name")
>>> Grammar(name)("Arthur Schopenhauer").as_sxpr()
'(name (christian_name "Arthur") (family_name "Schopenhauer"))'
>>> name = SmartRE(r'(?P<christian_name>\w+)(\s+)(?P<family_name>\w+)').name("name")
>>> Grammar(name)("Arthur Schopenhauer").as_sxpr()
'(name (christian_name "Arthur") (:RegExp " ") (family_name "Schopenhauer"))'
EBNF-Notation:
/ ... /
EBNF-Example:
name = /(?P<first_name>\w+)\s+(?P<last_name>\w+)/
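How captured groups can be turned into named pieces is easy to show with Python's `re` module alone. The function `groups_as_nodes` below is a hypothetical helper that returns simple (name, content) pairs instead of tree nodes; it follows the prose description above in labeling unnamed groups with their group number.

```python
import re
from typing import List, Tuple


def groups_as_nodes(pattern: str, text: str) -> List[Tuple[str, str]]:
    """Return one (name, content) pair per captured group. Unnamed groups are
    labeled with their group number; text between groups is dropped."""
    rx = re.compile(pattern)
    m = rx.match(text)
    if m is None:
        return []
    # rx.groupindex maps group names to numbers; invert it for the lookup.
    names = {index: name for name, index in rx.groupindex.items()}
    return [(names.get(i, str(i)), m.group(i))
            for i in range(1, rx.groups + 1) if m.group(i) is not None]


print(groups_as_nodes(r"(?P<christian_name>\w+)\s+(?P<family_name>\w+)",
                      "Arthur Schopenhauer"))
# [('christian_name', 'Arthur'), ('family_name', 'Schopenhauer')]
print(groups_as_nodes(r"(\w+)\s+(?P<family_name>\w+)", "Arthur Schopenhauer"))
# [('1', 'Arthur'), ('family_name', 'Schopenhauer')]
```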
- class Synonym(parser: Parser)#
Simply calls another parser and encapsulates the result in another node if that parser matches.
This parser is needed to support synonyms in EBNF, e.g.:
jahr = JAHRESZAHL
JAHRESZAHL = /\d\d\d\d/
Otherwise, the first line could not be represented by any parser class, in which case it would be unclear whether the parser RegExp('\d\d\d\d') carries the name ‘JAHRESZAHL’ or ‘jahr’.
- TKN(token, wsL='', wsR='\\s*')#
Syntactic Sugar for ‘Series(Whitespace(wsL), Text(token), Whitespace(wsR))’
- class Text(text: str)#
Parses plain text strings. (Could be done by RegExp as well, but is faster.)
Example:
>>> while_token = Text("while")
>>> Grammar(while_token)("while").content
'while'
- TreeReduction(root_or_parserlist: Parser | Collection[Parser], level: int = 1) Parser#
Applies tree-reduction level to CombinedParsers either in the collection of parsers passed in the first argument or in the graph of interconnected parsers originating in the single “root” parser passed as first argument. Returns the root-parser or, if a collection has been passed, the PARSER_PLACEHOLDER.
Examples, how tree-reduction works:
>>> root = Text('A') + Text('B') | Text('C') + Text('D')
>>> grammar = Grammar(TreeReduction(root, CombinedParser.NO_TREE_REDUCTION))
>>> tree = grammar('AB')
>>> print(tree.as_sxpr())
(root (:Series (:Text "A") (:Text "B")))
>>> grammar = Grammar(TreeReduction(root, CombinedParser.FLATTEN))  # default
>>> tree = grammar('AB')
>>> print(tree.as_sxpr())
(root (:Text "A") (:Text "B"))
>>> grammar = Grammar(TreeReduction(root, CombinedParser.MERGE_TREETOPS))
>>> tree = grammar('AB')
>>> print(tree.as_sxpr())
(root "AB")
>>> grammar = Grammar(TreeReduction(root, CombinedParser.MERGE_LEAVES))
>>> tree = grammar('AB')
>>> print(tree.as_sxpr())
(root "AB")
>>> root = Series(Text('A'), Text('B'), Text('C').name('important') | Text('D'))
>>> grammar = Grammar(TreeReduction(root, CombinedParser.NO_TREE_REDUCTION))
>>> tree = grammar('ABC')
>>> print(tree.as_sxpr())
(root (:Text "A") (:Text "B") (:Alternative (important "C")))
>>> grammar = Grammar(TreeReduction(root, CombinedParser.FLATTEN))  # default
>>> tree = grammar('ABC')
>>> print(tree.as_sxpr())
(root (:Text "A") (:Text "B") (important "C"))
>>> tree = grammar('ABD')
>>> print(tree.as_sxpr())
(root (:Text "A") (:Text "B") (:Text "D"))
>>> grammar = Grammar(TreeReduction(root, CombinedParser.MERGE_TREETOPS))
>>> tree = grammar('ABC')
>>> print(tree.as_sxpr())
(root (:Text "A") (:Text "B") (important "C"))
>>> tree = grammar('ABD')
>>> print(tree.as_sxpr())
(root "ABD")
>>> grammar = Grammar(TreeReduction(root, CombinedParser.MERGE_LEAVES))
>>> tree = grammar('ABC')
>>> print(tree.as_sxpr())
(root (:Text "AB") (important "C"))
>>> tree = grammar('ABD')
>>> print(tree.as_sxpr())
(root "ABD")
- class UnaryParser(parser: Parser)#
Base class of all unary parsers, i.e. parsers that contain one and only one other parser, like the optional parser for example.
The UnaryParser base class supplies __deepcopy__() and other methods for unary parsers. The __deepcopy__()-method needs to be overridden, however, if the constructor of a derived class has additional parameters.
- exception UninitializedError(msg: str)#
An error that results from uninitialized objects. This can be a consequence of some broken boot-strapping-process.
- class Whitespace(regexp, keep_comments: bool = False)#
A variant of RegExp that is meant to be used for insignificant whitespace. In contrast to RegExp, Whitespace always returns a match. If the defining regular expression did not match, an empty match is returned.
- Variables:
keep_comments – A boolean indicating whether or not whitespace containing comments should be kept, even if the self.drop_content flag is True. If keep_comments and drop_content are both True, a stretch of whitespace containing a comment will be renamed to “comment__” and whitespace that does not contain any comments will be dropped.
Example:
>>> ws = Whitespace(mixin_comment(r'\s+', r'#.*'))
>>> Grammar(ws)(" # comment").as_sxpr()
'(root " # comment")'
>>> dws = Drop(Whitespace(mixin_comment(r'\s+', r'#.*')))
>>> Grammar(dws)(" # comment").as_sxpr()
'(:EMPTY)'
>>> dws = Drop(Whitespace(mixin_comment(r'\s+', r'#.*'), keep_comments=True))
>>> Grammar(Synonym(dws))(" # comment").as_sxpr()
'(root (comment__ " # comment"))'
>>> Grammar(Synonym(dws))(" ").as_sxpr()
'(root)'
>>> Grammar(dws)(" # comment").as_sxpr()
'(root " # comment")'
>>> Grammar(dws)(" ").as_sxpr()
'(:EMPTY)'
- class ZeroOrMore(parser: Parser)#
ZeroOrMore applies a parser repeatedly as long as this parser matches. Like Option, the ZeroOrMore parser always matches. In case of zero repetitions, the empty match ((), text) is returned.
Examples:
>>> sentence = ZeroOrMore(RE(r'\w+,?')) + TKN('.')
>>> Grammar(sentence)('Wo viel der Weisheit, da auch viel des Grämens.').content
'Wo viel der Weisheit, da auch viel des Grämens.'
>>> Grammar(sentence)('.').content  # an empty sentence also matches
'.'
>>> forever = ZeroOrMore(RegExp('(?=.)|$'))
>>> Grammar(forever)('')  # infinite loops will automatically be broken
Node('root', '')
A warning is emitted if an infinite loop is detected, except when this happens at the end of the file.
EBNF-Notation:
{ ... }
EBNF-Example:
sentence = { /\w+,?/ } "."
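The loop-breaking behaviour described for ZeroOrMore can be illustrated without DHParser. The following is a minimal sketch, not DHParser's implementation: `regex_matcher` and `zero_or_more_sketch` are hypothetical names, and the rule "stop as soon as a repetition consumes no input" is an assumption about how an infinite loop can be avoided.

```python
import re
from typing import Callable, List, Optional, Tuple

# Hypothetical helper type: a matcher receives (text, position) and returns
# the end position of its match, or None if it does not match.
Matcher = Callable[[str, int], Optional[int]]


def regex_matcher(pattern: str) -> Matcher:
    """Wraps a regular expression as a position-based matcher."""
    rx = re.compile(pattern)

    def match_at(text: str, pos: int) -> Optional[int]:
        m = rx.match(text, pos)
        return m.end() if m else None

    return match_at


def zero_or_more_sketch(matcher: Matcher, text: str, pos: int = 0) -> Tuple[List[str], int]:
    """Applies matcher repeatedly; always succeeds, possibly with no pieces."""
    pieces: List[str] = []
    while True:
        end = matcher(text, pos)
        if end is None:
            break          # the inner matcher failed: repetition ends
        if end == pos:
            break          # empty match: stop here, otherwise we would loop forever
        pieces.append(text[pos:end])
        pos = end
    return pieces, pos


words, end = zero_or_more_sketch(regex_matcher(r'\w+,?\s*'), 'Wo viel der Weisheit, da')
stuck = zero_or_more_sketch(regex_matcher(r'(?=.)|$'), 'abc')
```

The second call uses the same always-empty pattern as the `forever` example above; the sketch returns immediately with no pieces instead of looping.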
- copy_parser_base_attrs(src: Parser, duplicate: Parser)#
Duplicates all attributes of the Parser-class from src to duplicate.
- extract_error_code(err_msg: str, err_code: ErrorCode = 1000) Tuple[str, ErrorCode]#
Extracts the error-code-prefix from an error message.
Example:
>>> msg = '2010:Big mistake!'
>>> print(extract_error_code(msg))
('Big mistake!', 2010)
>>> msg = "Syntax error at: {1}"
>>> print(extract_error_code(msg))
('Syntax error at: {1}', 1000)
- fullmatch(grammar: Grammar, parser: str | Parser, string: str, source_mapping: SourceMapFunc | None = None) str | None#
- is_context_sensitive(parser: Parser) bool#
Returns True if parser is a context-sensitive parser or calls a context-sensitive parser.
- is_parser_placeholder(parser: Parser | None) bool#
Returns True if parser is None or merely a placeholder for a parser.
- last_value(text: StringView | str, stack: List[str]) str | None#
Matches text with the most recent value on the capture stack. This is the default case when retrieving captured substrings.
- longest_match(strings: List[str], text: StringView | str, n: int = 1) str#
Returns the longest string from a given list of strings that matches the beginning of text. Examples:
>>> l = ['a', 'ab', 'ca', 'cd']
>>> longest_match(l, 'a')
'a'
>>> longest_match(l, 'abcdefg')
'ab'
>>> longest_match(l, 'ac')
'a'
>>> longest_match(l, 'cb')
''
>>> longest_match(l, 'cab')
'ca'
- match(grammar: Grammar, parser: str | Parser, string: str, source_mapping: SourceMapFunc | None = None) str | None#
- matching_bracket(text: StringView | str, stack: List[str]) str | None#
Returns a closing bracket for the opening bracket on the capture stack, i.e. if “[” was captured, “]” will be retrieved.
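Retrieval filters such as last_value() and matching_bracket() share one shape: they look at the capture stack and return the string that must appear next in the text, or None if it does not. The sketch below only illustrates that shape under stated assumptions. The `_sketch` functions are hypothetical stand-ins, plain `str` is used instead of StringView, and returning None for an empty stack is an assumption.

```python
from typing import List, Optional

# Translation table from opening to closing brackets (illustrative choice).
CLOSING = str.maketrans("([{<", ")]}>")


def last_value_sketch(text: str, stack: List[str]) -> Optional[str]:
    """Returns the most recently captured value if text starts with it."""
    if not stack:
        return None
    value = stack[-1]
    return value if text.startswith(value) else None


def matching_bracket_sketch(text: str, stack: List[str]) -> Optional[str]:
    """Returns the closing counterpart of the captured opening bracket(s)."""
    if not stack:
        return None
    value = stack[-1].translate(CLOSING)
    return value if text.startswith(value) else None


same = last_value_sketch("[ rest", ["["])          # the text repeats the captured "["
closed = matching_bracket_sketch("] rest", ["["])  # the text continues with "]"
wrong = matching_bracket_sketch(") rest", ["["])   # wrong kind of closing bracket
```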
- mixin_comment(whitespace: str, comment: str, always_match: bool = True) str#
Returns a regular expression pattern that merges comment and whitespace regexps. Thus comments can occur wherever whitespace is allowed and will be skipped just as implicit whitespace.
Note, that because this works on the level of regular expressions, nesting comments is not possible. It also makes it much harder to use directives inside comments (which isn’t recommended, anyway).
Examples:
>>> import re
>>> combined = mixin_comment(r"\s+", r"#.*")
>>> print(combined)
(?:(?:\s+)?(?:(?:#.*)(?:\s+)?)*)
>>> rx = re.compile(combined)
>>> rx.match(' # comment').group(0)
' # comment'
>>> combined = mixin_comment(r"\s+", r"#.*", always_match=False)
>>> print(combined)
(?:(?:\s+)(?:(?:#.*)(?:\s+))*)
>>> rx = re.compile(combined)
>>> rx.match(' # comment').group(0)
' # '
- mixin_nonempty(whitespace: str) str#
Returns a regular expression pattern that matches only if the regular expression pattern whitespace matches AND if the match is not empty.
If whitespace does not match the empty string '' anyway, it will be returned unaltered.
WARNING: mixin_nonempty() does not work for regular expressions whose matched strings can be followed by a symbol that can also occur at the start of the regular expression.
In particular, it does not work for fixed-size regular expressions, that is, / / or /   / or /\t/ won’t work, but / */ or /\s*/ or /\s+/ do work. There is no test for this. Fixed-size regular expressions run through mixin_nonempty will not match at all anymore if they are applied to the beginning or the middle of a sequence of whitespace!
In order to be safe, your whitespace regular expressions should follow the rule: “Whitespace cannot be followed by whitespace” or “Either grab it all or leave it all”.
- Parameters:
whitespace – a regular expression pattern
- Returns:
new regular expression pattern that does not match the empty string ‘’, anymore.
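The warning above becomes easier to follow with a concrete pattern. The sketch below shows one way a "non-empty" wrapper can be built with a capturing lookahead and a negative back-reference. It is an assumption made for illustration and not necessarily what mixin_nonempty() generates. `nonempty_sketch` is a hypothetical name, and the optional blank `' ?'` is chosen here only to show the failure mode.

```python
import re


def nonempty_sketch(whitespace: str) -> str:
    """Wraps a whitespace pattern so that it no longer matches the empty string."""
    if re.match(whitespace, ''):
        # (?=(.|\n)) remembers the next character without consuming it;
        # (?!\1) then demands that the character following the match is a
        # different one, which fails if nothing has been consumed.
        return r'(?:(?=(.|\n))' + whitespace + r'(?!\1))'
    return whitespace  # cannot match '' anyway: returned unaltered


greedy = re.compile(nonempty_sketch(r'\s*'))
grabbed = greedy.match('  x').group(0)   # grabs all of the whitespace
empty = greedy.match('x')                # an empty match is rejected

fixed = re.compile(nonempty_sketch(r' ?'))
single = fixed.match(' x').group(0)      # one blank followed by 'x' matches
middle = fixed.match('  x')              # blank followed by another blank fails
```

The last line shows the rule "Whitespace cannot be followed by whitespace" in action: a pattern that consumes at most one blank fails inside a run of blanks, because the character after the match equals the remembered one.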
- optional_last_value(text: StringView | str, stack: List[str]) str | None#
Matches text with the most recent value on the capture stack or with the empty string, i.e. optional_last_value never returns None but either the value on the stack or the empty string.
Use case: Implement shorthand notation for matching tags, i.e.:
Good Morning, Mrs. <emph>Smith</>!
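For the shorthand closing tag in the example above, the filter has to accept both the full tag name and nothing at all. A minimal sketch of that behaviour, assuming a non-empty capture stack and using the hypothetical name `optional_last_value_sketch`:

```python
from typing import List


def optional_last_value_sketch(text: str, stack: List[str]) -> str:
    """Returns the last captured value if text starts with it, else ''."""
    value = stack[-1]
    return value if text.startswith(value) else ""


stack = ["emph"]                                       # tag name captured at <emph>
full = optional_last_value_sketch("emph>!", stack)     # closing tag written as </emph>
short = optional_last_value_sketch(">!", stack)        # closing tag written as </>
```

In both cases a string is returned, so a parser using such a filter would proceed: it consumes the tag name in the first case and nothing in the second.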
- update_scanner(grammar: Grammar, leaf_parsers: Dict[str, str])#
Updates the “scanner” of a grammar by overwriting the text or regex-fields of some or all of its leaf parsers with new values. This works only for those parsers that are assigned to a symbol in the Grammar class.
- Parameters:
- Raises:
AttributeError – in case a leaf parser name in the dictionary does not exist or does not refer to a Text or RegExp-parser.
Module dsl#
Module dsl contains high-level functions for the compilation
of domain-specific languages based on an EBNF-grammar.
- exception CompilationError(errors, dsl_text, dsl_grammar, AST, result)[source]#
Raised when a string or file in a domain specific language (DSL) contains errors. These can also contain definition errors that have been caught early.
- exception DefinitionError(errors, grammar_src)[source]#
Raised when (already) the grammar of a domain specific language (DSL) contains errors. Usually, these are repackaged parse.GrammarError(s).
- batch_process(file_names: List[str], out_dir: str, process_file: Callable[[Tuple[str, str, Callable[[], bool]]], str] | Callable[[Tuple[str, str]], str], *, submit_func: Callable[[Callable, str, str], object] | None = None, log_func: Callable[[str], None] | None = None, cancel_query: Callable[[], bool] | None = None, cancel_func: Callable[[], bool] | None = None) List[str][source]#
Compiles all files listed in file_names and writes the results and/or error messages to the directory out_dir. Returns a list of error-message files.
- Parameters:
file_names – A list of names of files to be processed with the function passed to the process_file-parameter.
out_dir – The name of a directory, to which the output will be written (i.e. compilation and/or pipeline-processing results as well as possibly intermediate stages and error-logs)
process_file – A function for processing source-texts. This function receives the name of a source file and output-directory as parameters and returns the name of an error-log-file (if there have been errors) or the empty string if there were none.
submit_func – A function which calls the process_file function. This additional indirection allows executing the call in a separate thread or process, if desired. submit_func must return a concurrent.futures.Future()-object!
log_func – A function that receives a string as parameter and which is called with an appropriate message whenever another file has been fully processed.
cancel_query – A callback-function without parameters that returns either True if further processing shall be canceled or False if processing shall continue. This allows interrupting long-running batch-processing tasks. Typically, cancel_query will be the event.is_set-function of a multiprocessing.Event-object that has been instantiated before calling batch_process.
cancel_func – DEPRECATED. Please, use cancel_query.
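The cancel_query parameter follows a simple polling pattern: the callback is asked before each unit of work whether processing should stop. The sketch below reproduces only this pattern with a hypothetical stand-in loop (`process_all_sketch`) and does not call batch_process itself. It uses `threading.Event` instead of `multiprocessing.Event` to stay lightweight; both offer an `is_set` method.

```python
import threading
from typing import Callable, List, Optional


def process_all_sketch(file_names: List[str],
                       process_one: Callable[[str], str],
                       cancel_query: Optional[Callable[[], bool]] = None) -> List[str]:
    """Processes files one by one, polling cancel_query before each file."""
    results: List[str] = []
    for name in file_names:
        if cancel_query is not None and cancel_query():
            break                      # stop early; remaining files are skipped
        results.append(process_one(name))
    return results


cancel_event = threading.Event()


def fake_process(name: str) -> str:
    """Hypothetical worker: requests cancellation while handling 'b.dsl'."""
    if name == "b.dsl":
        cancel_event.set()
    return ""                          # empty string: no error log was written


processed = process_all_sketch(["a.dsl", "b.dsl", "c.dsl"],
                               fake_process, cancel_event.is_set)
```

Here `c.dsl` is never processed, because the event has been set before the loop reaches it.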
- compileDSL(text_or_file: str, preprocessor: PreprocessorFunc | None, dsl_grammar: str | Grammar, ast_transformation: TransformerFunc, compiler: Compiler, fail_when: ErrorCode = 1000) Any[source]#
Compiles a text in a domain specific language (DSL) with an EBNF-specified grammar. Returns the compiled text or raises a compilation error.
- Raises:
CompilationError – if any errors occurred during compilation
- compileEBNF(ebnf_src: str, branding='DSL') str[source]#
Compiles an EBNF source file and returns the source code of a compiler suite with skeletons for preprocessor, transformer and compiler.
- Parameters:
- Returns:
The complete compiler suite skeleton as Python source code.
- Raises:
CompilationError – if any errors occurred during compilation
- compile_on_disk(source_file: str, parser_name: str = '', compiler_suite: str = '', extension: str = '.xml') Sequence[Error][source]#
Compiles a source file with a given compiler and writes the result to a file.
If no compiler_suite is given, it is assumed that the source file is an EBNF grammar. In this case the result will be a Python script containing a parser for that grammar as well as the skeletons for a preprocessor, AST transformation table, and compiler. If the Python script already exists, only the parser name in the script will be updated. (For this to work, the different names need to be delimited by section marker blocks.) compile_on_disk() returns a list of error messages or an empty list if no errors occurred.
- Parameters:
source_file – The file name of the source text to be compiled.
parser_name – The name of the generated parser. If the empty string is passed, the default name “…Parser.py” will be used.
compiler_suite – The file name of the parser/compiler-suite (usually ending with ‘Parser.py’), with which the source file shall be compiled. If this is left empty, the source file is assumed to be an EBNF-Grammar that will be compiled with the internal EBNF-Compiler.
extension – The result of the compilation (if successful) is written to a file with the same name but a different extension than the source file. This parameter sets the extension.
- Returns:
A (potentially empty) list of error or warning messages.
- create_parser(ebnf_src: str, branding='DSL', additional_code: str = '') Grammar[source]#
Compiles the ebnf source into a callable Grammar-object. This is essentially syntactic sugar for
grammar_provider(ebnf)().
- create_scripts(ebnf_filename: str, parser_name: str = '', server_name: str | None = '', app_name: str | None = '', overwrite: bool = False)[source]#
Creates a parser script from the grammar with the filename ebnf_filename or, if ebnf_filename refers to a directory, from all grammars in files ending with “.ebnf” in that directory.
If server_name is not None, a script for starting a parser-server will be created as well. Running the parser as a server has the advantage that the startup time for calling the parser is greatly reduced for subsequent parser calls. (While the same can be achieved with running the parser script in batch-processing-mode by passing a directory or several filenames on the command line to the parser script, batch processing is not suitable for all application cases. For example, it is not usable when implementing language servers to feed editors with data from the parsing process.)
If app_name is not None, an application script with a tkinter-based graphical user interface will be created as well. (When distributing this script with pyinstaller, parallel processing should be turned off at least on MS Windows systems!)
- Parameters:
ebnf_filename – The filename of the grammar, from which the server script’s filename is derived.
parser_name – The filename of the parser script or the empty string if the default filename shall be used.
server_name – The filename of the server script or the empty string if the default filename shall be used, or None if no server script shall be created.
app_name – The filename of the app script or the empty string if the default filename shall be used, or None if no app-script shall be created.
overwrite – If True an existing server script will be overwritten.
- grammar_provider(ebnf_src: str, branding='DSL', additional_code: str = '', fail_when: ErrorCode = 1000) ParserFactory[source]#
Compiles an EBNF-grammar and returns a grammar-parser provider function for that grammar.
- Parameters:
ebnf_src (str) – Either the file name of an EBNF grammar or the EBNF-grammar itself as a string.
branding (str or bool) – Branding name for the compiler suite source code.
additional_code – Python code added to the generated source. This typically contains the source code of semantic actions referred to in the generated source, e.g. filter-functions, resume-point-search-functions
- Returns:
A provider function for a grammar object for texts in the language defined by
ebnf_src.
- load_compiler_suite(compiler_suite: str) Tuple[PreprocessorFactory, ParserFactory, Callable[[], Callable[[RootNode], RootNode] | partial], CompilerFactory][source]#
Extracts a compiler suite from file or string compiler_suite and returns it as a tuple (preprocessor, parser, ast, compiler).
- Returns:
4-tuple (preprocessor function, parser class, ast transformer function, compiler class)
- process_file(source: str, out_dir: str, preprocessor_factory: PreprocessorFactory, parser_factory: ParserFactory, junctions: Set[Junction], targets: Set[str], serializations: Dict[str, List[str]], cancel_query: Callable[[], bool] | None = None) str[source]#
Compiles the source and writes the serialized results back to disk, unless any fatal errors have occurred. Error and Warning messages are written to a file with the same name as result_filename with an appended “_ERRORS.txt” or “_WARNINGS.txt” in place of the name’s extension. Returns the name of the error-messages file or an empty string, if no errors or warnings occurred.
- Parameters:
source – the source document or the filename of the source-document
out_dir – the path of the output-directory. If the output-directory does not exist, it will be created.
preprocessor_factory – A factory-function that returns a preprocessing function.
parser_factory – A factory-function that returns a parser function which, usually, is a parse.Grammar-object.
junctions – a set of junctions for all processing stages beyond parsing.
serializations – A dictionary of serialization names, e.g. “sxpr”, “xml”, “json” for those target stages that still are node-trees. These will be serialized and written to disk in all given serializations.
cancel_query – A boolean-valued function without parameters that is polled during processing. If it returns True, processing will be canceled.
- Returns:
either the empty string or the file name of a file that contains the errors or warnings that occurred while processing the source.
- raw_compileEBNF(ebnf_src: str, branding='DSL', fail_when: ErrorCode = 1000) EBNFCompiler[source]#
Compiles an EBNF grammar file and returns the compiler object that was used and which can now be queried for the result as well as skeleton code for preprocessor, transformer and compiler objects.
- read_template(template_name: str) str[source]#
Reads a script-template from a template file named template_name in the template-directory and returns it as a string.
- recompile_grammar(ebnf_filename: str, parser_name: str = '', force: bool = False, notify: ~typing.Callable = <function <lambda>>) bool[source]#
Re-compiles an EBNF-grammar if necessary, that is, if either no corresponding ‘XXXXParser.py’-file exists or if that file is outdated.
- Parameters:
ebnf_filename – The filename of the ebnf-source of the grammar. In case this is a directory and not a file, all files within this directory ending with .ebnf will be compiled.
parser_name – The name of the compiler script. If not given the ebnf-filename without extension and with the addition of “Parser.py” will be used.
force – If False (default), the grammar will only be recompiled if it has been changed.
notify – ‘notify’ is a function without parameters that is called when recompilation actually takes place. This can be used to inform the user.
- Returns:
True, if recompilation of grammar has been successful or did not take place, because the Grammar hasn’t changed since the last compilation. False, if the recompilation of the grammar has been attempted but failed.
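The decision "recompile only if necessary" can be pictured as a small predicate. The sketch below is an illustration only: it assumes that "outdated" means "older than the grammar file" and compares modification times, which may differ from how recompile_grammar() actually detects changes. The function name and file names are hypothetical.

```python
import os
import tempfile


def needs_recompilation_sketch(ebnf_path: str, parser_path: str,
                               force: bool = False) -> bool:
    """True if the parser script is missing, older than the grammar, or force is set."""
    if force or not os.path.exists(parser_path):
        return True
    return os.path.getmtime(ebnf_path) > os.path.getmtime(parser_path)


with tempfile.TemporaryDirectory() as tmp:
    ebnf = os.path.join(tmp, "Arithmetic.ebnf")
    parser = os.path.join(tmp, "ArithmeticParser.py")
    open(ebnf, "w").close()
    missing = needs_recompilation_sketch(ebnf, parser)     # no parser script yet
    open(parser, "w").close()
    os.utime(ebnf, (1000, 1000))      # grammar last modified at t=1000
    os.utime(parser, (2000, 2000))    # parser script generated later, at t=2000
    up_to_date = needs_recompilation_sketch(ebnf, parser)
    os.utime(ebnf, (3000, 3000))      # grammar edited after the script was generated
    outdated = needs_recompilation_sketch(ebnf, parser)
    forced = needs_recompilation_sketch(ebnf, parser, force=True)
```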
Module preprocess#
Module preprocess contains functions for preprocessing source
code before the parsing stage as well as source mapping facilities
to map the locations of parser and compiler errors to the
non-preprocessed source text. (See SourceMap)
Preprocessing (and source mapping of errors) is useful in cases where a syntax or certain syntactical features (like marking blocks with indentation, for example) cannot be described completely with context-free grammars.
- class PreprocessorResult(original_text, preprocessed_text, back_mapping, errors)[source]#
- back_mapping: SourceMapFunc#
Alias for field number 2
- class SourceLocation(original_name: str, original_text: str | StringView, pos: int)[source]#
A particular location in the original, not preprocessed source code.
- Variables:
original_name – The original source filename. If the document is composed of a master file and (possibly nested) includes this will be the name of the file that the position is related to.
original_text – The original, i.e. not yet preprocessed text-content of the file the position relates to.
pos – The location within original_text.
- class SourceMap(original_name: str, positions: List[int], offsets: List[int], file_names: List[str], originals_dict: Dict[str, str | StringView])[source]#
Class SourceMap captures a mapping from the preprocessed source code (that is possibly also stitched together from different files) to the original source files and source positions. It is possible to use more than one source map (see apply_src_mappings()). Thus, several preprocessing stages can be applied in sequence and the positions, say where errors occurred, can still be back-propagated to the original input file(s).
- Variables:
original_name – The original source filename. If the source allows includes, this should be the name of the master file.
positions – A list of locations in the processed file. Each location is to be understood as a marker from which on positions in the processed file must be shifted by a different offset to obtain the position in the original file. The first element in the list of positions should always be 0, and the last element should be the length of the processed source plus 1 (or higher). (The +1 allows the location to exceed the end of the text by 1, which makes writing algorithms easier than if the location was not allowed to point beyond the end of the text.)
offsets – The list of offsets corresponding to the positions. For each position entry positions[n], the corresponding offsets value offsets[n] contains the offset (positive or negative or zero) that will be added to all locations in the half open interval [ positions[n], positions[n + 1] [
file_names – A list of file names corresponding to the positions, i.e. for each position[n] the name of the file that the text from this position just until before the next position was taken from.
originals_dict – A dictionary, mapping the file-names to their text-content in form of a
StringView-object.
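How the positions, offsets and file_names lists work together can be sketched with a binary search. The following is a simplified illustration, not the library's source_map() function: `map_position_sketch` is a hypothetical name, it returns a plain (file name, position) pair instead of a SourceLocation, and the numbers describe an invented document in which a 15-character include file replaces an 8-character include directive after the first 10 characters of the master file.

```python
import bisect
from typing import List, Tuple


def map_position_sketch(position: int, positions: List[int], offsets: List[int],
                        file_names: List[str]) -> Tuple[str, int]:
    """Maps a position in the processed text to (file name, original position)."""
    if not 0 <= position < positions[-1]:
        raise ValueError(f"position {position} lies outside the source map")
    # Index of the last marker that is <= position, i.e. the interval
    # [positions[i], positions[i + 1]) that contains the position.
    i = bisect.bisect_right(positions, position) - 1
    return file_names[i], position + offsets[i]


# Processed text: master[0:10] + include[0:15] + master[18:33], length 40.
positions = [0, 10, 25, 41]            # last entry: length of processed text + 1
offsets = [0, -10, -7, -7]             # shift to apply within each interval
file_names = ["master.txt", "part.txt", "master.txt", "master.txt"]

in_master = map_position_sketch(5, positions, offsets, file_names)
in_include = map_position_sketch(10, positions, offsets, file_names)
after_include = map_position_sketch(25, positions, offsets, file_names)
```

Because the intervals are half open, position 10 already belongs to the included file, while position 25 is back in the master file.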
- apply_src_mappings(position: int, mappings: List[SourceMapFunc]) SourceLocation[source]#
Sequentially apply a number of mapping functions to a source position. In the context of source mapping, the source position usually is a position within a preprocessed source text and mappings should therefore be a list of reverse-mappings in reversed order.
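The phrase "reverse-mappings in reversed order" can be made concrete with two invented preprocessing stages. In this sketch, which simplifies mapping functions to plain int-to-int functions (the real SourceMapFunc returns a source location), stage 1 strips 20 characters of front matter and stage 2 inserts a 5-character marker at the start of the text. All names are hypothetical.

```python
from typing import Callable, List


def apply_mappings_sketch(position: int, mappings: List[Callable[[int], int]]) -> int:
    """Applies the mapping functions one after another to a position."""
    for mapping in mappings:
        position = mapping(position)
    return position


def undo_marker_insertion(pos: int) -> int:
    """Reverse of stage 2: positions shift back by the 5 inserted characters."""
    return max(pos - 5, 0)


def undo_front_matter_stripping(pos: int) -> int:
    """Reverse of stage 1: positions shift forward by the 20 stripped characters."""
    return pos + 20


# The last stage is undone first, hence the reversed order of the list.
original_pos = apply_mappings_sketch(
    12, [undo_marker_insertion, undo_front_matter_stripping])
```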
- chain_preprocessors(*preprocessors) PreprocessorFunc[source]#
Merges a sequence of preprocessor functions into a single function.
- gen_find_include_func(rx: str | ~typing.Any, comment_rx: str | ~typing.Any | None = None, derive_file_name: DeriveFileNameFunc = <function <lambda>>) FindIncludeFunc[source]#
Generates a function to find include-statements in a file.
- Parameters:
rx – A regular expression (either as string or compiled regular expression) to catch the names of the includes in a document. The expression should catch
comment_rx – The regular expression for comments. (This should always either be NEVER_MATCH_PATTERN or exactly the same as the comment regular expression defined in the grammar!)
- gen_neutral_srcmap_func(original_text: StringView | str, original_name: str = '') SourceMapFunc[source]#
Generates a source map function that maps each position to itself.
- make_preprocessor(tokenizer: Tokenizer) PreprocessorFunc[source]#
Generates a preprocessor function from a “naive” tokenizer, i.e. a function that merely adds preprocessor tokens to a source text and returns the modified source.
- make_token(token: str, argument: str = '') str[source]#
Turns the token and argument into a special token that will be caught by the PreprocessorToken-parser.
This function is a support function that should be used by preprocessors to inject preprocessor tokens into the source text.
- nil_preprocessor(original_text: str, original_name: str) PreprocessorResult[source]#
A preprocessor that does nothing, i.e. just returns the input.
- preprocess_includes(original_text: str | None, original_name: str, find_next_include: FindIncludeFunc, include_reader: ~typing.Callable[[], ~typing.Callable[[str], str]] = <class 'preprocess.ReadIncludeClass'>) PreprocessorResult[source]#
Preprocesses include statements in a file.
- Parameters:
original_text – The original source file (if already read from disk)
original_name – The file-name of the original source
find_next_include – The function to find the next include-statement
include_reader – A factory that returns a function that retrieves the content from an included file.
- Returns:
the result of the preprocessing: (original document, processed document, source mapping, (possibly empty) list of errors)
- prettyprint_tokenized(tokenized: str) str[source]#
Returns a pretty-printable version of a document that contains tokens.
- source_map(position: int, srcmap: SourceMap) SourceLocation[source]#
Maps a position in a (pre-)processed text to its corresponding position in the original document according to the given source map.
- Parameters:
position – the position in the processed text
srcmap – the source map, i.e. a mapping of locations to offset values and source texts.
- Returns:
the mapped position
- tokenized_to_original_mapping(tokenized_text: str, original_text: str, original_name: str = 'UNKNOWN_FILE') SourceMap[source]#
Generates a source map for mapping positions in a text that has been enriched with token markers to their original positions.
- Parameters:
tokenized_text – the source text enriched with token markers
original_text – the original source text
original_name – the name or path or uri of the original source file
- Returns:
a source map, i.e. a list of positions and a list of corresponding offsets. The list of positions is ordered from smallest to highest. An offset is valid for its associated position and all following positions until (and excluding) the next position in the list of positions.
Module error#
Module error defines class Error and a few helpful functions that are
needed for error reporting of DHParser. Usually, what is of interest are
the string representations of the error objects. For example:
import sys
from DHParser import compile_source, has_errors
result, errors, ast = compile_source(source, preprocessor, grammar,
                                     transformer, compiler)
if errors:
    for error in errors:
        print(error)
    if has_errors(errors):
        print("There have been fatal errors!")
        sys.exit(1)
    else:
        print("There have been warnings, but no errors.")
The central class of module DHParser.error is the Error-class.
The easiest way to create an error object is by instantiating
the Error class with an error message and a source position:
>>> error = Error('Something went wrong', 123)
>>> print(error)
Error (1000): Something went wrong
Without a line and column where the error occurred, error-messages are
not very useful. In particular, it is required that the line and
column-locations are those of the original source-file before any
preprocessing (like inserting included files or stripping front-matter
or meta-data sections). For this purpose module error also provides
“source-mapping” facilities, in particular the function
add_source_locations().
- class Error(message: str, pos: int, code: ErrorCode = 1000, line: int = -1, column: int = -1, length: int = 1, related: Sequence[Error] = [], orig_pos: int = -1, orig_doc: str = '')[source]#
The Error class encapsulates all the information for a single error.
- Variables:
message – the error message as text string
pos – the position where the error occurred in the preprocessed text
code –
the error-code, which also indicates the severity of the error:
========= ===========
code      severity
========= ===========
0         no error
< 100     notice
< 1000    warning
< 10000   error
>= 10000  fatal error
========= ===========
In case of a fatal error (error code >= 10000), no further compilation stages will be processed, because it is assumed that the syntax tree is too distorted for further processing.
orig_pos – the position of the error in the original source file, not in the preprocessed document. This is a write-once value!
orig_doc – the name or path or url of the original source file to which orig_pos is related. This is relevant, if the preprocessed document has been plugged together from several source files.
line – the line number where the error occurred in the original text. Lines are counted from 1 onward.
column – the column where the error occurred in the original text. Columns are counted from 1 onward.
length – the length in characters of the faulty passage (default is 1)
end_line – the line number of the position after the last character covered by the error in the original source.
end_column – the column number of the position after the last character covered by the error in the original source.
related – a sequence of related errors.
- diagnostic_obj() dict[source]#
Returns the Error as Language Server Protocol Diagnostic object. https://microsoft.github.io/language-server-protocol/specifications/specification-current/#diagnostic
- range_obj() dict[source]#
Returns the range (position plus length) of the error as an LSP-Range-Object. https://microsoft.github.io/language-server-protocol/specifications/specification-current/#range
- property severity#
Returns a string representation of the error level, e.g. “warning”.
- add_source_locations(errors: List[Error], source_mapping: Callable[[int], SourceLocationAlias] | partial)[source]#
Adds (or adjusts) line and column numbers of error messages in place.
- Parameters:
errors – The list of errors as returned by the method errors() of a Node object
source_mapping – A function that maps error positions to their positions in the original source file.
- canonical_error_strings(errors: List[Error]) List[str][source]#
Returns the list of error strings in canonical form that can be parsed by most editors, i.e. “relative filepath : line : column : severity (code) : error string”
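A line in such a canonical form could be assembled as in the following sketch. The exact separators and spacing are assumptions made for illustration; only the order of the fields is taken from the description above. `canonical_string_sketch` and the sample values are hypothetical.

```python
def canonical_string_sketch(path: str, line: int, column: int,
                            severity: str, code: int, message: str) -> str:
    """Formats one error as 'path:line:column: severity (code): message'."""
    return f"{path}:{line}:{column}: {severity} ({code}): {message}"


line_str = canonical_string_sketch("src/doc.dsl", 3, 14, "Error", 1010,
                                   "Unexpected token")
```

Many editors recognise the leading `path:line:column` triple and turn it into a clickable location.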
- has_errors(messages: Iterable[Error], level: ErrorCode = 1000) bool[source]#
Returns True if at least one entry in messages has at least the given error level.
- is_error(code: Error | int) bool[source]#
Returns True, if error is a (fatal) error, not just a warning.
- is_fatal(code: Error | int) bool[source]#
Returns True, if error is fatal. Fatal errors are typically raised when a crash (i.e. Python exception) occurs at later stages of the processing pipeline (e.g. ast transformation, compiling).
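The severity thresholds documented for the Error class (notice below 100, warning below 1000, error below 10000, fatal error from 10000) translate directly into a few comparisons. The sketch below works on bare integer codes and uses hypothetical names. The real functions accept Error objects as well.

```python
from typing import Iterable


def severity_sketch(code: int) -> str:
    """Maps an error code to the severity class given in the table of codes."""
    if code == 0:
        return "no error"
    if code < 100:
        return "notice"
    if code < 1000:
        return "warning"
    if code < 10000:
        return "error"
    return "fatal error"


def has_errors_sketch(codes: Iterable[int], level: int = 1000) -> bool:
    """True if at least one code reaches the given level (default: error)."""
    return any(code >= level for code in codes)


labels = [severity_sketch(c) for c in (0, 50, 999, 1000, 10000)]
only_warnings = has_errors_sketch([120, 999])
with_error = has_errors_sketch([120, 1010])
```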
Module testing#
Module testing contains support for unit-testing domain specific
languages. Tests for arbitrarily small components of the Grammar can
be written into test files with ini-file syntax in order to test
whether the parser matches or fails as expected. It can also be
tested whether it produces an expected concrete or abstract syntax tree.
Usually, however, unexpected failure to match a certain string is the
main cause of trouble when constructing a context free Grammar.
- class MockStream(name='')[source]#
Simulates a stream that can be written to from one side and read from the other side like a pipe. Usage pattern:
import asyncio
pipe = MockStream()
reader = StreamReaderProxy(pipe)
writer = StreamWriterProxy(pipe)
async def main(text):
    writer.write((text + '\n').encode())
    await writer.drain()
    data = (await reader.read()).decode()
    writer.close()
    return data
asyncio.run(main('Hello World'))
- clean_report(report_dir='REPORT')[source]#
Deletes any test-report-files in the REPORT subdirectory and removes the REPORT subdirectory, if it is empty after deleting the files.
- create_test_templates(symbols_or_ebnf: str | SymbolsDictType, path: str, fmt: str = '.ini') None[source]#
Creates template files for grammar unit-tests for the given symbols.
- Parameters:
symbols_or_ebnf – Either a dictionary that matches section names to the grammar’s symbols under that section or an EBNF-grammar or file name of an EBNF-grammar from which the symbols shall be extracted.
path – the path to the grammar-test directory (usually ‘tests_grammar’). If the last element of the path does not exist, the directory will be created.
fmt – the test-file-format. At the moment, only ‘.ini’ is supported.
- extract_symbols(ebnf_text_or_file: str) SymbolsDictType[source]#
Extracts all defined symbols from an EBNF-grammar. This can be used to prepare grammar-tests. The symbols will be returned as lists of strings which are grouped by the sections to which they belong and returned as an ordered dictionary, the keys of which are the section names. In order to define a section in the ebnf-source, add a comment-line starting with “#:”, followed by the section name. It is recommended to use valid file names as section names. Example:
#: components
expression = term { EXPR_OP~ term}
term = factor { TERM_OP~ factor}
factor = [SIGN] ( NUMBER | VARIABLE | group ) { VARIABLE | group }
group = "(" expression ")"
#: leaf_expressions
EXPR_OP = /\+/ | /-/
TERM_OP = /\*/ | /\//
SIGN = /-/
NUMBER = /(?:0|(?:[1-9]\d*))(?:\.\d+)?/~
VARIABLE = /[A-Za-z]/~
If no sections have been defined in the comments, there will be only one group with the empty string as a key.
- Parameters:
ebnf_text_or_file – Either an ebnf-grammar or the file-name of an ebnf-grammar
- Returns:
Ordered dictionary mapping the section names of the grammar to lists of symbols that appear under that section.
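The grouping of symbols by “#:”-sections can be sketched with two regular expressions. This is a simplified illustration, not DHParser's implementation: the function name sketch_extract_symbols is hypothetical, and the sketch assumes that every definition starts on its own line.

```python
import re
from collections import OrderedDict
from typing import Dict, List

SECTION = re.compile(r'^\s*#:\s*(.*?)\s*$')   # comment line "#: section name"
DEFINITION = re.compile(r'^\s*(\w+)\s*=')     # "symbol = ..." at the start of a line


def sketch_extract_symbols(ebnf: str) -> Dict[str, List[str]]:
    """Groups the defined symbols by the section comment that precedes them."""
    sections: Dict[str, List[str]] = OrderedDict()
    current = ''                               # key used if no section was declared
    for line in ebnf.splitlines():
        section = SECTION.match(line)
        if section:
            current = section.group(1)
            sections.setdefault(current, [])
            continue
        definition = DEFINITION.match(line)
        if definition:
            sections.setdefault(current, []).append(definition.group(1))
    return sections


GRAMMAR = r'''
#: components
expression = term { EXPR_OP~ term}
term       = factor { TERM_OP~ factor}
#: leaf_expressions
EXPR_OP    = /\+/ | /-/
NUMBER     = /(?:0|(?:[1-9]\d*))(?:\.\d+)?/~
'''
symbols = sketch_extract_symbols(GRAMMAR)
unsectioned = sketch_extract_symbols('word = /\\w+/')
```

Because the section names become dictionary keys that may later be used as file names for test templates, valid file names are the safest choice.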
- get_report(test_unit, serializations: Dict[str, List[str]] = {}) str[source]#
Returns a text-report of the results of a grammar unit test. The report lists the source of all tests as well as the error messages, if a test failed or the abstract-syntax-tree (AST) in case of success.
If an asterisk has been appended to the test-name then the concrete syntax tree will also be added to the report in this particular case.
The purpose of the latter is to help with constructing and debugging AST-transformations. It is better to switch the CST-output on and off with the asterisk marker when needed than to output the CST for all tests, which would unnecessarily bloat the test reports.
- grammar_suite(directory, parser_factory, transformer_factory, fn_patterns=('*test*', ), ignore_unknown_filetypes=False, report='REPORT', verbose=True, junctions={}, show={}, serializations: ~typing.Dict[str, ~typing.List[str]] = {}, preprocessor_factory=<function nil_preprocessor_factory>)[source]#
Runs all grammar unit tests in a directory. A file is considered a test-unit, if it has the word “test” in its name.
- Parameters:
directory – The path of a directory that contains test-files.
parser_factory – the parser-factory-object, typically an instance of Grammar.
transformer_factory – A factory-function for the AST-transformation-function.
fn_patterns – Glob patterns for selecting those files in the test-directory that shall be used for testing.
ignore_unknown_filetypes – If True, unknown file types will silently be ignored. Otherwise, an error will be raised if an unknown file-type is encountered
report – the name of the subdirectory where the test-reports will be saved. If the name is the empty string, no reports will be generated.
verbose – If True, more information will be printed to the console during testing.
junctions – A set of Junction-objects that define further processing stages after the AST-transformation.
show – A set of stage names that shall be shown in the report apart from the AST. (The abstract-syntax-tree will always be shown!)
serializations – A (not necessarily complete) dictionary of stages -> serialization that allows to override the default serialization for specific stages.
preprocessor_factory – The preprocessor factory. This will be ignored if the configuration variable test_skip_preprocessor is set to True. Beware that in this case, all source snippets must already have been preprocessed.
- grammar_unit(test_unit, parser_factory, transformer_factory, report='REPORT', verbose=False, junctions={}, show={}, serializations: ~typing.Dict[str, ~typing.List[str]] = {}, preprocessor_factory=<function nil_preprocessor_factory>)[source]#
Unit tests for a grammar-parser and ast-transformations.
- Parameters:
test_unit – The test-unit in a json-like dictionary format as it is returned by unit_from_file().
parser_factory – the parser-factory-object, typically an instance of Grammar.
transformer_factory – A factory-function for the AST-transformation-function.
report – the name of the subdirectory where the test-reports will be saved. If the name is the empty string, no reports will be generated.
verbose – If True, more information will be printed to the console during testing.
junctions – A set of Junction-objects that define further processing stages after the AST-transformation.
show – A set of stage names that shall be shown in the report apart from the AST. (The abstract-syntax-tree will always be shown!)
serializations – A (not necessarily complete) dictionary of stages -> serialization that allows to override the default serialization for specific stages.
preprocessor_factory – The preprocessor factory. This will be ignored if the configuration variable test_skip_preprocessor is set to True. Beware that in this case, all source snippets must already have been preprocessed.
- merge_test_units(*test_units) Dict[source]#
Merges the tests from one or more test units into a single test-unit. ATTENTION: Test-units will be normalized before merging.
- reset_unit(test_unit)[source]#
Resets the tests in test_unit by removing all results and error messages.
- runner(tests, namespace, profile=False)[source]#
Runs all or some selected Python unit tests found in the namespace. To run all tests in a module, call runner("", globals()) from within that module.
Unit-tests are either classes whose names start with “Test”, together with the methods of such classes whose names start with “test”, or functions whose names start with “test”.
If tests is either the empty string or an empty sequence, runner checks sys.argv for specified tests. If sys.argv[0] (i.e. the script’s file name) starts with ‘test’, any argument in sys.argv[1:] (i.e. the rest of the command line) that starts with ‘test’ or ‘Test’ is considered the name of a test function or test method (of a test-class) that shall be run. Test-methods are specified in the form class_name.method_name, e.g. “TestServer.test_connection”.
- Parameters:
tests – String or list of strings with the names of tests to run. If empty, runner searches by itself for all objects the name of which starts with ‘test’ and runs each of them, if it is a function, or, if it is a class, all of its methods that start with “test”, plus the “setup” and “teardown” methods if they exist.
namespace – The namespace for running the test; usually globals() should be used.
profile – If True, the tests will be run with the profiler on. Results will be displayed after the test-results. Profiling will also be turned on if the parameter --profile has been provided on the command line.
Example:
class TestSomething:
    def setup(self):
        pass

    def teardown(self):
        pass

    def test_something(self):
        pass


if __name__ == "__main__":
    from DHParser.testing import runner
    runner("", globals())
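The discovery rules described for runner() can be sketched in plain Python. The functions sketch_collect_tests() and sketch_run_tests() are hypothetical helpers written for this illustration; they ignore sys.argv and profiling and are not DHParser's implementation.

```python
import inspect
from typing import Any, Dict, List


def sketch_collect_tests(namespace: Dict[str, Any]) -> List[str]:
    """Lists 'test...' functions and the 'test...' methods of 'Test...' classes."""
    found: List[str] = []
    for name, obj in namespace.items():
        if inspect.isclass(obj) and name.startswith('Test'):
            found.extend(f'{name}.{attr}' for attr in dir(obj)
                         if attr.startswith('test') and callable(getattr(obj, attr)))
        elif inspect.isfunction(obj) and name.startswith('test'):
            found.append(name)
    return found


def sketch_run_tests(names: List[str], namespace: Dict[str, Any]) -> Dict[str, str]:
    """Runs each test; methods are wrapped in setup() and teardown() if present."""
    results: Dict[str, str] = {}
    for full_name in names:
        owner, _, method = full_name.partition('.')
        try:
            if method:
                instance = namespace[owner]()
                getattr(instance, 'setup', lambda: None)()
                try:
                    getattr(instance, method)()
                finally:
                    getattr(instance, 'teardown', lambda: None)()
            else:
                namespace[owner]()
            results[full_name] = 'ok'
        except AssertionError as failure:
            results[full_name] = f'failed: {failure}'
    return results


def build_demo_namespace() -> Dict[str, Any]:
    """Builds a small namespace with one test class and one test function."""
    class TestArithmetic:
        def setup(self):
            self.base = 2

        def test_add(self):
            assert self.base + 2 == 4

        def test_broken(self):
            assert self.base == 3, 'base is not 3'

    def test_standalone():
        assert True

    return {'TestArithmetic': TestArithmetic, 'test_standalone': test_standalone}


demo = build_demo_namespace()
names = sketch_collect_tests(demo)
report = sketch_run_tests(names, demo)
```

In a real test module, the namespace would simply be globals(), as in the example above.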
- unique_name(file_name: str) str[source]#
Turns the file or dirname into a unique name by adding a time stamp. This helps to avoid race conditions when running tests in parallel that create and delete files on the disk.
- unit_from_config(config_str: str, filename: str, allowed_stages=frozenset({'AST', 'CST', 'fail', 'match', 'match*'}))[source]#
Reads grammar unit tests contained in a file in config file (.ini) syntax.
- Parameters:
config_str – A string containing a config-file with Grammar unit-tests
filename – The file-name of the config-file containing config_str.
allowed_stages – A set of stage names of stages in the processing pipeline for which the test-file may contain tests.
- Returns:
A JSON-like object (i.e. dictionary) representing the unit tests.
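How a test file in .ini-syntax might map to such a dictionary can be sketched with configparser from the standard library. The section layout "[stage:symbol]", the test names M1, M2, F1 and the nesting symbol -> stage -> test name are assumptions made for this illustration; the real reader may accept a richer syntax than configparser does.

```python
import configparser
from typing import Dict

ALLOWED_STAGES = frozenset({'AST', 'CST', 'fail', 'match', 'match*'})


def sketch_unit_from_config(config_str: str) -> Dict[str, Dict[str, Dict[str, str]]]:
    """Parses '[stage:symbol]' sections into {symbol: {stage: {test_name: source}}}."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str            # keep test names such as "M1" case-sensitive
    parser.read_string(config_str)
    unit: Dict[str, Dict[str, Dict[str, str]]] = {}
    for section in parser.sections():
        stage, _, symbol = section.partition(':')
        if stage not in ALLOWED_STAGES or not symbol:
            raise ValueError(f'Unexpected section [{section}]')
        unit.setdefault(symbol, {}).setdefault(stage, {}).update(parser[section])
    return unit


TEST_FILE = """
[match:expression]
M1: 2 + 3
M2: 4 * (5 - 1)

[fail:expression]
F1: 2 +
"""
unit = sketch_unit_from_config(TEST_FILE)
```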
- unit_from_file(filename, additional_stages=frozenset({'AST', 'CST', 'fail', 'match', 'match*'}))[source]#
Reads a grammar unit test from a file. The format of the file is determined by the ending of its name.
Module trace#
Module trace provides trace-debugging functionality for the
parser. The tracers are added or removed via monkey patching to
all or some parsers of a grammar and trace the actions
of these parsers, making use of the call_stack__, history__
and moving_forward__, most_recent_error__-hooks in the
Grammar-object.
This functionality can be used for several purposes:
“live” or “breakpoint”-debugging (not implemented)
recording of parsing history and “post-mortem”-debugging, implemented here and in module log
interrupting long-running parser processes by polling a threading.Event or multiprocessing.Event once in a while
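The last purpose, interrupting a long-running process by polling an event, can be sketched independently of any parser. All names below (Interrupted, long_running, items_with_signal, poll_every) are hypothetical and serve only to show the polling pattern.

```python
import threading
from typing import Iterable, Iterator


class Interrupted(Exception):
    """Raised when the stop event was found set during polling."""


def long_running(items: Iterable[int], stop_event: threading.Event,
                 poll_every: int = 100) -> int:
    """Processes items, checking the event only every poll_every steps."""
    processed = 0
    for i, _ in enumerate(items):
        if i % poll_every == 0 and stop_event.is_set():
            raise Interrupted(f'interrupted after {processed} items')
        processed += 1
    return processed


def items_with_signal(n: int, stop_event: threading.Event, signal_at: int) -> Iterator[int]:
    """Yields n items and sets the event at item signal_at, simulating an outside request."""
    for i in range(n):
        if i == signal_at:
            stop_event.set()
        yield i


uninterrupted = long_running(range(1000), threading.Event())

event = threading.Event()
try:
    long_running(items_with_signal(1000, event, 250), event)
    outcome = 'completed'
except Interrupted as stop:
    outcome = str(stop)
```

Polling only once in a while keeps the overhead low, at the price that the interruption takes effect with some delay: here the event is set at item 250 but noticed at the next poll.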
- set_tracer(parsers: Grammar | Parser | Iterable[Parser], tracer: ParseFunc | None)[source]#
Adds or removes a tracing function to (or from) a single parser, a set of parsers or all parsers in a grammar.
- Parameters:
parsers – the parsers or single parser or grammar-object containing parsers where the tracer shall be added or removed.
tracer – a tracer function or None. If None, any existing tracer will be removed. If not None, tracer must be a parsing function. It is up to the tracer to call the original parsing function (self._parse()).
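The monkey-patching idea behind set_tracer() can be sketched with a toy parser class. SketchParser, sketch_set_tracer() and recording_tracer() are hypothetical names; the sketch only shows how a tracer is bound to a single parser object, calls the original _parse() itself, and is removed again.

```python
import types
from typing import Callable, List, Optional


class SketchParser:
    """Hypothetical minimal parser that matches a fixed literal at the start of a text."""

    def __init__(self, literal: str):
        self.literal = literal

    def _parse(self, text: str) -> Optional[str]:
        return self.literal if text.startswith(self.literal) else None

    def parse(self, text: str) -> Optional[str]:
        return self._parse(text)


def sketch_set_tracer(parser: SketchParser, tracer: Optional[Callable]) -> None:
    """Binds tracer as the parser's parse-method or removes it if tracer is None."""
    if tracer is None:
        parser.__dict__.pop('parse', None)      # fall back to the class's method
    else:
        parser.parse = types.MethodType(tracer, parser)


history: List[str] = []


def recording_tracer(self: SketchParser, text: str) -> Optional[str]:
    result = self._parse(text)                  # the tracer calls the original parser
    status = 'MATCH' if result is not None else 'FAIL'
    history.append(f'{self.literal!r} on {text!r}: {status}')
    return result


begin = SketchParser('BEGIN')
sketch_set_tracer(begin, recording_tracer)
first = begin.parse('BEGIN x')
second = begin.parse('END')
sketch_set_tracer(begin, None)
untraced = begin.parse('BEGIN y')
```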
Module log#
Module log contains logging and debugging support for the
parsing process.
For logging functionality, the global variable LOGGING is defined which contains the name of a directory where log files shall be placed. By setting its value to the empty string “” logging can be turned off.
To read the directory name function LOGS_DIR() should be called
rather than reading the variable LOGGING. LOGS_DIR() makes sure
the directory exists and raises an error if a file with the same name
already exists.
For debugging of the parsing process, the parsing history can be logged and written to an html-File.
For ease of use module log defines a context-manager logging
to which either False (turn off logging), a log directory name or
True for the default logging directory is passed as argument.
The other components of DHParser check whether logging is on and
write log files in the logging directory accordingly. Usually,
this will be concrete and abstract syntax trees as well as the full
and abbreviated parsing history.
Example:
from DHParser import compile_source, start_logging, set_config_value
start_logging("LOGS")
set_config_value('log_syntax_trees', {'CST', 'AST'})
set_config_value('history_tracking', True)
set_config_value('resume_notices', True)
result, errors, ast = compile_source(source, preprocessor, grammar,
transformer, compiler)
- class HistoryRecord(call_stack: List[CallItem] | Tuple[CallItem, ...], node: Node | None, text: StringView, line_col: Tuple[int, int], errors: List[Error] = [])[source]#
Stores debugging information about one completed step in the parsing history.
A parsing step is “completed” when the last one of a nested sequence of parser-calls returns. The call stack including the last parser call will be frozen in the HistoryRecord-object. In addition, a reference to the generated leaf node (if any) will be stored and the result status of the last parser call, which is either MATCH, FAIL (i.e. no match) or ERROR.
- class Snapshot(line, column, stack, status, text)#
- column#
Alias for field number 1
- line#
Alias for field number 0
- stack#
Alias for field number 2
- status#
Alias for field number 3
- text#
Alias for field number 4
- static last_match(history: List[HistoryRecord]) HistoryRecord | None[source]#
Returns the last match from the parsing-history.
- Parameters:
history – the parsing-history as a list of HistoryRecord objects
- Returns:
the history record of the last match or None if either history is empty or no parser could match
- static most_advanced_fail(history: List[HistoryRecord]) HistoryRecord | None[source]#
Returns the closest-to-the-end-fail from the parsing-history.
- Parameters:
history – the parsing-history as a list of HistoryRecord objects
- Returns:
the history record of the closest-to-the-end-fail or None if either history is empty or no parser failed
- static most_advanced_match(history: List[HistoryRecord]) HistoryRecord | None[source]#
Returns the closest-to-the-end-match from the parsing-history.
- Parameters:
history – the parsing-history as a list of HistoryRecord objects
- Returns:
the history record of the closest-to-the-end-match or None if either history is empty or no parser could match
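The difference between the last and the most advanced record matters whenever the parser backtracks. A minimal sketch, assuming a simplified record that stores the number of characters not yet consumed; SketchRecord and both helper functions are hypothetical and only mirror the behaviour described above:

```python
from typing import List, NamedTuple, Optional


class SketchRecord(NamedTuple):
    """Hypothetical, simplified history record."""
    parser: str
    status: str       # 'MATCH', 'FAIL' or 'ERROR'
    remaining: int    # number of characters of the document not yet consumed


def sketch_last_match(history: List[SketchRecord]) -> Optional[SketchRecord]:
    """Returns the most recent record with status MATCH."""
    for record in reversed(history):
        if record.status == 'MATCH':
            return record
    return None


def sketch_most_advanced(history: List[SketchRecord], status: str) -> Optional[SketchRecord]:
    """Returns the record with the given status that got closest to the end."""
    candidates = [record for record in history if record.status == status]
    return min(candidates, key=lambda record: record.remaining, default=None)


history = [
    SketchRecord('word', 'MATCH', 20),
    SketchRecord('word', 'MATCH', 8),
    SketchRecord('number', 'FAIL', 3),
    SketchRecord('sign', 'MATCH', 14),   # recorded last, after backtracking
]
last = sketch_last_match(history)
furthest_match = sketch_most_advanced(history, 'MATCH')
furthest_fail = sketch_most_advanced(history, 'FAIL')
nothing = sketch_most_advanced([], 'MATCH')
```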
- append_log(log_name: str, *strings, echo: bool = False) None[source]#
Appends one or more strings to the log-file with the name log_name, if logging is turned on and log_name is not the empty string, or log_name contains path information.
- Parameters:
log_name – The name of the log file. The file must already exist. (See: create_log().)
strings – One or more strings that will be written to the log-file. No delimiters will be added, i.e. all delimiters like blanks or linefeeds need to be added explicitly to the list of strings, before calling append_log().
echo – If True, the log message will be echoed on the terminal. In this case the message will even be echoed if logging is turned off.
- callstack_as_str(callstack: Sequence[CallItem], depth=-1) str[source]#
Returns a string representation of the callstack!
- clear_logs(logfile_types=frozenset({'.ast', '.cst', '.log'}))[source]#
Removes all logs from the log-directory and removes the log-directory if it is empty.
- create_log(log_name: str) str[source]#
Creates a new log file. If log_name is not just a file name but a path with at least one directory (which can be ‘./’) the file is not created in the configured log directory but at the given path. If a file with the same name already exists, it will be overwritten.
- Parameters:
log_name – The file name of the log file to be created
- Returns:
the file name of the log file or an empty string if the log-file has not been created (e.g. because logging is still turned off and no log-directory set).
- local_log_dir(path: str = './LOGS')[source]#
Context manager for temporarily switching to a different log-directory.
- log_ST(syntax_tree, log_file_name) bool[source]#
Writes an S-expression-representation of the syntax_tree to a file, if logging is turned on. Returns True, if logging was turned on and log could be written. Returns False, if logging was turned off. Raises a FileSystem error if writing the log failed for some reason.
- log_dir(path: str = '') str[source]#
Creates a directory for log files (if it does not exist) and returns its path.
WARNING: Any files in the log dir will eventually be overwritten. Don’t use a directory name that could be the name of a directory for other purposes than logging.
ATTENTION: The log-dir is stored thread locally, which means the log-dir as well as the information whether logging is turned on or off will not automatically be transferred to any subprocesses. This needs to be done explicitly. (See testing.grammar_suite() for an example of how this can be done.)
- Parameters:
path – The directory path. If empty, the configured value will be used, i.e. configuration.get_config_value('log_dir').
- Returns:
str - name of the logging directory or “” if logging is turned off.
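Both the thread-local storage of the log-dir and a context manager for switching it temporarily can be sketched with the standard library. The names current_log_dir() and sketch_local_log_dir() are hypothetical; the sketch creates no directories and only shows why a setting made in one thread is invisible in another.

```python
import contextlib
import threading
from typing import Iterator, List

_state = threading.local()      # every thread sees its own attributes


def current_log_dir() -> str:
    """Returns the log-dir of the current thread; '' means logging is off."""
    return getattr(_state, 'log_dir', '')


@contextlib.contextmanager
def sketch_local_log_dir(path: str = './LOGS') -> Iterator[str]:
    """Temporarily switches the log-dir and restores the old value afterwards."""
    saved = current_log_dir()
    _state.log_dir = path
    try:
        yield path
    finally:
        _state.log_dir = saved


before = current_log_dir()
seen_in_worker: List[str] = []
with sketch_local_log_dir('LOGS_A'):
    inside = current_log_dir()
    worker = threading.Thread(target=lambda: seen_in_worker.append(current_log_dir()))
    worker.start()
    worker.join()
after = current_log_dir()
```

The worker thread still sees the empty string, which is why the log-dir has to be handed over explicitly to threads or subprocesses.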
- log_parsing_history(grammar, log_file_name: str = '', as_html: bool = True) bool[source]#
Writes a log of the parsing history of the most recently parsed document, if logging is turned on. Returns True, if that was the case and writing the history was successful.
- Parameters:
grammar (Grammar) – The Grammar object from which the parsing history shall be logged.
log_file_name (str) – The (base-)name of the log file to be written. If no name is given (default), then the class name of the grammar object will be used.
as_html (bool) – If true (default), the log will be output as html-table, otherwise as plain text. (Browsers might take a few seconds or minutes to display the table for long histories.)
- resume_logging(log_dir: str = '')[source]#
Resumes logging in the current thread with the given log-dir.
- start_logging(dirname: str = 'LOGS')[source]#
Turns logging on and sets the log-directory to dirname. The log-directory, if it does not already exist, will be created lazily, i.e. only when logging actually starts.
This function should best be called before spawning any subprocesses, otherwise the dirname might not be propagated to the subprocesses.
Module configuration#
Module “configuration.py” defines the default configuration for DHParser.
The best way to change the configuration or to add custom configurations for your own project is to place a [DSL-name]config.ini-file in the main-directory of your DSL-project, that is, the directory where your parsing and compilation-scripts reside. This configuration can be overwritten by a configuration-file with the same name in the current working-directory if different configurations for different workspaces are needed.
The configuration values can be read and changed while running via the get_config_value() and set_config_value()-functions. However, the functions only affect the configuration values for the current thread. Changes will not be visible to any spawned threads or processes.
In order to change the configuration values for spawned processes or threads, the presets can also be overwritten before(!) spawning any parsing processes with:
access_presets()
set_preset_value() and get_preset_value()
finalize_presets()
Unless configuration-files are used (see above), the recommended way to use a different configuration in any custom code using DHParser is to use the second method, i.e. to overwrite the values for which this is desired in the CONFIG_PRESET dictionary right after the start of the program and before any DHParser-function is invoked.
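The distinction between thread-local configuration values and presets can be sketched as follows. The dictionary PRESETS and the functions sketch_get() and sketch_set() are hypothetical stand-ins; the sketch covers threads only and leaves out the propagation to spawned processes that access_presets() and finalize_presets() take care of.

```python
import threading
from typing import Any, Dict, List

# Hypothetical presets; the two keys are borrowed from examples in this reference.
PRESETS: Dict[str, Any] = {'log_dir': 'LOGS', 'history_tracking': False}
_local = threading.local()


def _thread_config() -> Dict[str, Any]:
    """Returns the current thread's configuration, copied from the presets on first use."""
    if not hasattr(_local, 'config'):
        _local.config = dict(PRESETS)
    return _local.config


def sketch_get(key: str) -> Any:
    return _thread_config()[key]


def sketch_set(key: str, value: Any) -> None:
    _thread_config()[key] = value        # visible in the current thread only


def read_in_new_thread(key: str) -> Any:
    """Reads a configuration value from inside a freshly started thread."""
    seen: List[Any] = []
    worker = threading.Thread(target=lambda: seen.append(sketch_get(key)))
    worker.start()
    worker.join()
    return seen[0]


sketch_set('history_tracking', True)
in_current_thread = sketch_get('history_tracking')
in_thread_before = read_in_new_thread('history_tracking')   # still the preset value
PRESETS['history_tracking'] = True                          # change the preset itself
in_thread_after = read_in_new_thread('history_tracking')
```

A change made with the setter never reaches the new thread, whereas a change of the preset made before the thread starts does.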
- access_presets()[source]#
Allows read and write access to preset values via get_preset_value() and set_preset_value(). Any call to access_presets() should be matched by a call to finalize_presets() to ensure propagation of changed preset-values to spawned processes. For an explanation why, see: https://docs.python.org/3/library/multiprocessing.html#the-spawn-and-forkserver-start-methods
- access_thread_locals() Any[source]#
Initializes (if not done yet) and returns the thread local variable store. (Call this function before using THREAD_LOCALS. Direct usage of THREAD_LOCALS is DEPRECATED!)
- add_config_values(configuration: dict)[source]#
Adds (or overwrites) new configuration values.
- Parameters:
configuration – additional configuration values
- dump_config_data(*key_patterns, use_headings: bool = True) str[source]#
Returns the configuration variables the name of which matches one of the key_patterns as a config.ini-string.
- finalize_presets(fail_on_error: bool = False)[source]#
Finalizes changes of the presets of the configuration values. This method should always be called after changing preset values to make sure the changes will be visible to processes spawned later.
- get_config_value(key: str, default: ~typing.Any = <configuration.NoDefault object>) Any[source]#
Retrieves a configuration value thread-safely.
- Parameters:
key – the key (an immutable, usually a string)
default – a default value that is returned if no config-value exists for the key.
- Returns:
the value
- get_config_values(key_pattern: str = '*', *additional_patterns) Dict[source]#
Returns a dictionary of all configuration entries that match key_pattern.
- get_preset_values(key_pattern: str) Dict[source]#
Returns a dictionary of all presets that match key_pattern.
- read_local_config(ini_filename: str) List[str][source]#
Reads local config file(s) and updates the presets accordingly. All config-files with the same basename (i.e. name without path) as “ini_filename” will be read from the following directories and processed in this order, which means config-values in files processed later will overwrite config-values from files processed earlier:
script-directory
exact path of “ini_filename”
User’s config-file directory, e.g. ~/.config/basename/
current working directory
This configuration file must be in the .ini-file format so that it can be parsed with “configparser” from the Python standard library. Any key,value-pair under the section “DHParser” will directly be transferred to the configuration presets. For other sections, the section name will be added as a qualifier to the key: “section.key”. Thus, only values under the “DHParser” section modify the DHParser-configuration while configuration parameters under other sections are free to be evaluated by the calling script and cannot interfere with DHParser’s configuration.
- Parameters:
ini_filename – the file path and name of the configuration file.
- Returns:
the file paths of the actually read .ini-files or the empty string if no .ini-file with the given name could be found either at the given path, in the current working directory or in the calling script’s path.
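The key-qualification rule and the override order can be sketched with configparser. The function sketch_presets_from_ini(), the section “project” and its key are hypothetical; values are left as strings here, whereas the real function may convert them to other types.

```python
import configparser
from typing import Dict


def sketch_presets_from_ini(ini_text: str) -> Dict[str, str]:
    """Keys under [DHParser] are taken as they are, all others become 'section.key'."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str                     # keep the keys case-sensitive
    parser.read_string(ini_text)
    presets: Dict[str, str] = {}
    for section in parser.sections():
        for key, value in parser[section].items():
            qualified = key if section == 'DHParser' else f'{section}.{key}'
            presets[qualified] = value
    return presets


SCRIPT_DIR_INI = """
[DHParser]
log_dir = LOGS

[project]
author = Jane Doe
"""

WORKING_DIR_INI = """
[DHParser]
log_dir = LOGS_WORKSPACE
"""

# Files processed later overwrite values from files processed earlier.
merged: Dict[str, str] = {}
for ini_text in (SCRIPT_DIR_INI, WORKING_DIR_INI):
    merged.update(sketch_presets_from_ini(ini_text))
```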
- set_config_value(key: str, value: Any, allow_new_key: bool = False)[source]#
Changes a configuration value thread-safely. The configuration value will be set only for the current thread. In order to set configuration values for any new thread, add the key and value to CONFIG_PRESET, before any thread accessing config-values is started.
- Parameters:
key – the key (an immutable, usually a string)
value – the value
Module server#
Module server contains an asynchronous tcp-server that receives compilation requests and runs custom compilation functions in a Process- or InterpreterExecutor.
This allows to start a DHParser-compilation environment just once and save the startup time of DHParser for each subsequent compilation. In particular, with a just-in-time-compiler like PyPy (https://pypy.org) setting up a compilation-server is highly recommended, because jit-compilers typically sacrifice startup-speed for running-speed.
It is up to the compilation function to either return the result of the compilation in serialized form, or just save the compilation results on the file system and merely return a success or failure message. Module server does not define any of these messages. This is completely up to the clients of module server, i.e. the compilation-modules, to decide.
The communication, i.e. requests and responses, follows the json-rpc protocol:
<https://www.jsonrpc.org/specification>
For JSON see:
The server-module contains some rudimentary support for the language server protocol. For the specification and implementation of the language server protocol, see:
<https://code.visualstudio.com/api/language-extensions/language-server-extension-guide>
<https://microsoft.github.io/language-server-protocol/>
- class Connection(reader: StreamReader | StreamReaderProxy, writer: StreamWriter | StreamWriterProxy, exec_env: ExecutionEnvironment)[source]#
Class Connection encapsulates connection-specific data for the Server class (see below). At the moment, however, only one connection is accepted at a time, assuming there is a one-to-one relationship between the Text-Editor (i.e. the client) and the language server.
Currently, logging is not encapsulated, assuming that for the purpose of debugging the language server it is better not to have more than one connection at a time, anyway.
- alive#
Boolean flag, indicating that the connection is still alive. When set to false the connection will be closed, but the server will not be stopped.
- reader#
the stream-reader for this connection
- writer#
the stream-writer for this connection
- exec#
the execution environment of the server
- active_tasks#
A dictionary that maps task id’s (resp. jsonrpc id’s) to their futures to keep track of any running task.
- finished_tasks#
a set of task id’s (resp. jsonrpc id’s) for tasks that have been finished and should be removed from the active_tasks-dictionary at the next possible time.
- response_queue#
An asynchronous queue which stores the json-rpc responses and errors received from a language server client as result of commands initiated by the server.
- pending_responses#
A dictionary of jsonrpc-/task-id’s to lists of JSON-objects that have been fetched from the response queue but not yet been collected by the calling task.
- lsp_initialized#
A string-flag indicating that the connection to a language server via json-rpc has been established.
- lsp_shutdown#
A string-flag indicating that the connection to a language server via json-rpc has been or is being shut down.
- log_file#
Name of the server-log. Mirrors Server.log_file
- echo_log#
If True, log messages will be echoed to the console. Mirrors Server.echo_log
- async client_response(call_id: int) JSON_Type[source]#
Waits for and returns the response from the lsp-client to the call with the id call_id.
- class ExecutionEnvironment(event_loop: AbstractEventLoop)[source]#
Class ExecutionEnvironment provides methods for executing server tasks in separate processes, in threads, as asynchronous tasks or as simple function calls.
- Variables:
process_executor – A process-pool-executor for cpu-bound tasks
thread_executor – A thread-pool-executor for blocking tasks
submit_pool – A secondary process-pool-executor to submit tasks synchronously and thread-safe.
submit_pool_lock – A threading.Lock to ensure that submissions to the submit_pool will be thread_safe
loop – The asynchronous event loop for running coroutines
log_file – The name of the log-file to which error messages are written if an executor raises a Broken-Error.
_closed – A flag that is set to True after the shutdown-method has been called. After that any call to the execute()-method yields an error.
- async execute(executor: Executor | None, method: Callable, params: dict | tuple | list) Tuple[JSON_Type | BytesType, RPC_Error_Type | None][source]#
Executes a method with the given parameters in a given executor (ThreadPoolExecutor or ProcessPoolExecutor). execute() waits for the completion and returns the JSON result and an RPC error tuple (see the type definition above). The result may be None and the error may be zero, i.e. no error. If executor is None, the method will be called directly instead of deferring it to an executor.
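Deferring a call to an executor and turning exceptions into an error value can be sketched with asyncio's run_in_executor(). The function sketch_execute(), the error shape (code, message) and the use of None for “no error” are assumptions of this sketch; -32000 is merely the first code of the range that JSON-RPC 2.0 reserves for implementation-defined server errors.

```python
import asyncio
import functools
from concurrent.futures import Executor, ThreadPoolExecutor
from typing import Any, Callable, Optional, Tuple, Union

SketchRPCError = Optional[Tuple[int, str]]    # hypothetical error shape: (code, message)


async def sketch_execute(executor: Optional[Executor], method: Callable,
                         params: Union[dict, tuple, list]) -> Tuple[Any, SketchRPCError]:
    """Runs method(params) in the executor, or directly if executor is None."""
    if isinstance(params, dict):
        call = functools.partial(method, **params)
    else:
        call = functools.partial(method, *params)
    try:
        if executor is None:
            result = call()
        else:
            loop = asyncio.get_running_loop()
            result = await loop.run_in_executor(executor, call)
        return result, None
    except Exception as error:                # report the failure instead of raising
        return None, (-32000, f'{type(error).__name__}: {error}')


def divide(a: float, b: float) -> float:
    return a / b


async def demo():
    with ThreadPoolExecutor(max_workers=1) as pool:
        ok = await sketch_execute(pool, divide, {'a': 6, 'b': 3})
        failed = await sketch_execute(pool, divide, (1, 0))
    direct = await sketch_execute(None, divide, [9, 3])
    return ok, failed, direct


ok, failed, direct = asyncio.run(demo())
```

A ProcessPoolExecutor can be passed in the same way, provided the method and its parameters can be pickled.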
- class Server(rpc_functions: RPC_Type, cpu_bound: ~typing.Set[str] = {'*'}, blocking: ~typing.Set[str] = {}, connection_callback: ConnectionCallback = <function connection_cb_dummy>, server_name: str = '', strict_lsp: bool = True)[source]#
Class Server contains all the boilerplate code for a Language-Server-Protocol-Server. Class Server should be considered final, i.e. do not derive from this class to add LSP-functionality, rather implement the lsp_functionality in a dedicated class (or set of functions) and pass the LSP-functionality via the rpc_functions-parameter to the constructor of this class.
- Variables:
server_name – A name for the server. Defaults to CLASSNAME_OBJECTID
strict_lsp – Enforce Language-Server-Protocol on json-rpc-calls. If False json-rpc calls will be processed even without prior initialization, just like plain data or http calls.
cpu_bound – Set of function names of functions that are cpu-bound and will be run in separate processes.
blocking – Set of functions that contain blocking calls (e.g. IO-calls) and will therefore be run in separate threads.
rpc_table – Table mapping LSP-method names to Python functions
known_methods – Set of all known LSP-methods. This includes the methods in the rpc-table and the four initialization methods, initialize(), initialized(), shutdown(), exit()
connection_callback – A callback function that is called with the connection object as argument when a connection to a client is established
max_data_size – Maximal size of a data chunk that can be read by the server at a time.
stage – The operation stage the server is in. Can be one of the four values: SERVER_OFFLINE, SERVER_STARTING, SERVER_ONLINE, SERVER_TERMINATING
host – The host, the server runs on, e.g. “127.0.0.1”
port – The port of the server, e.g. 8890
server – The asyncio.Server if the server is online, or None.
serving_task – The task in which the asyncio.Server is run.
stop_response – The response string that is written to the stream as answer to a stop request.
service_calls – A set of names of functions that can be called as “service calls” from a second connection, even if another connection is still open.
echo_log – Read from the global configuration. If True, any log message will also be echoed on the console.
log_file – The file-name of the server-log.
log_filter – A filter to allow or block logging of specific json-rpc calls.
use_jsonrpc_header – Read from the global configuration. If True, jsonrpc-calls or responses will always be preceded by a simple header of the form: Content-Length: {NUM}\n\n, where {NUM} stands for the byte-size of the rpc-package.
exec – An instance of the execution environment that delegates tasks to separate processes, threads, asynchronous tasks or simple function calls.
connection – An instance of the connection class representing the data of the current connection or None, if there is no connection at the moment. There can be only one connection to the server at a time!
kill_switch – If True, the server will be shut down.
loop – The asyncio event loop within which the asyncio stream server is run.
- async handle_plaindata_request(task_id: int, reader: StreamReader | StreamReaderProxy, writer: StreamWriter | StreamWriterProxy, data: BytesType, service_call: bool = False)[source]#
Processes a request in plain-data-format, i.e. neither http nor json-rpc
- register_service_rpc(name, method)[source]#
Registers a service request, i.e. a call that will be accepted from a second connection. All other requests arriving on a new connection while a connection is already established will be rejected, because language servers only accept one client at a time.
- async respond(writer: StreamWriter | StreamWriterProxy, response: str | BytesType)[source]#
Sends a response to the given writer. Depending on the configuration, the response will be logged. If the response appears to be a json-rpc response a JSONRPC_HEADER will be added depending on self.use_jsonrpc_header.
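Adding and reading such a header can be sketched as follows, assuming a header of the form “Content-Length: N” followed by an empty line. The functions add_header() and split_message() are hypothetical. The newline sequence is a parameter because the Language Server Protocol specifies “\r\n”, while this sketch defaults to “\n”.

```python
import json
from typing import Tuple


def add_header(body: bytes, newline: bytes = b'\n') -> bytes:
    """Prefixes the body with its size in bytes (not in characters)."""
    return b'Content-Length: ' + str(len(body)).encode('ascii') + newline * 2 + body


def split_message(data: bytes, newline: bytes = b'\n') -> Tuple[bytes, bytes]:
    """Returns the first message body and whatever data follows it."""
    header, _, rest = data.partition(newline * 2)
    length = int(header.split(b':', 1)[1].decode('ascii').strip())
    return rest[:length], rest[length:]


response = json.dumps({'jsonrpc': '2.0', 'id': 1, 'result': 'Grüße'},
                      ensure_ascii=False).encode('utf-8')
framed = add_header(response)
body, tail = split_message(framed + b'next message')
lsp_body, _ = split_message(add_header(response, b'\r\n'), b'\r\n')
```

Since the length counts bytes, the two non-ASCII characters in the example make the announced size larger than the number of characters in the JSON text.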
- rpc_identify_server(service_call: bool = False, html: bool = False, *args, **kwargs)[source]#
Returns an identification string for the server.
- rpc_info(service_call: bool = False, html: bool = False, *args, **kwargs) str[source]#
Returns information on the implemented LSP- and service-functions.
- rpc_logging(*args, **kwargs) str[source]#
Starts logging with either a) the default filename, if args is empty or the empty string; or b) the given log file name if args[0] is a non-empty string; or c) stops logging if args[0] is None.
- rpc_serve_page(file_path: str, service_call: bool = False, html: bool = False, *args, **kwargs) str | bytes[source]#
Loads and returns the HTML page or other files stored in file file_path
- async run(method_name: str, method: Callable, params: Dict | List | Tuple) Tuple[JSON_Type | BytesType | None, RPC_Error_Type | None][source]#
Picks the right execution method (process, thread or direct execution) and runs it in the respective executor. In case of a broken ProcessPoolExecutor it restarts the ProcessPoolExecutor and tries to execute the method again.
- run_stream_server(reader: StreamReader | StreamReaderProxy, writer: StreamWriter | StreamWriterProxy)[source]#
Starts a DHParser-server that listens on a reader-stream and answers on a writer-stream.
- run_tcp_server(host: str = '', port: int = -1, loop=None)[source]#
Starts a DHParser-server that listens on a tcp port. This function will not return until the DHParser-Server is stopped by sending a STOP_SERVER_REQUEST.
- class StreamReaderProxy(io_reader: IOBase)[source]#
StreamReaderProxy simulates an asyncio.StreamReader that sends and receives data through an io.IOBase-Stream.
- class StreamWriterProxy(io_writer: IOBase)[source]#
StreamWriterProxy simulates an asyncio.StreamWriter that sends and receives data through an io.IOBase-Stream.
- async asyncio_connect(host: str = '', port: int = -1, retry_timeout: float = 3.0) Tuple[StreamReader, StreamWriter][source]#
Opens a connection, retrying until the timeout retry_timeout has elapsed.
- asyncio_run(coroutine: Awaitable, loop=None) Any[source]#
Backward compatible version of Python 3.7’s asyncio.run()
- async has_server_stopped(host: str = '', port: int = -1, timeout: float = 3.0) bool[source]#
Returns True, if no server is running or any server that is running has stopped within the given timeout. Returns False, if server has not stopped and is still running.
- async probe_tcp_server(host, port, timeout=3) str[source]#
Connects to server and sends an identify-request. Returns the response or an empty string if connection failed or command timed out.
- rpc_entry_info(name: str, rpc_table: RPC_Table, html: bool = False) str[source]#
Returns the name, signature and doc-string of a function in the rpc-table as string or HTML-snippet.
- rpc_table_info(rpc_table: RPC_Table, html: bool = False) str[source]#
Returns the names, function signatures and doc-string of all functions in the rpc_table as a (more or less) well-formatted string or as HTML-snippet.
- spawn_tcp_server(host: str = '', port: int = -1, parameters: ~typing.Tuple | ~typing.Dict | ~typing.Callable = <function echo_requests>, Concurrent: ~server.ConcurrentType = <class 'multiprocessing.context.Process'>) ConcurrentType[source]#
Starts DHParser-Server that communicates via tcp in a separate process or thread. Can be used for writing test code.
Servers started with this function sometimes seem to run into race conditions. Therefore, USE THIS ONLY FOR TESTING!
- Parameters:
host – The host for the tcp-communication, e.g. 127.0.0.1
port – the port number for the tcp-communication.
parameters – The parameter-tuple or -dict for initializing the server or simply a rpc-handling function that takes a string-request as argument and returns a string response.
Concurrent – The concurrent class, either multiprocessing.Process or threading.Thread, for running the server.
- Returns:
the multiprocessing.Process-object of the already started server-process.
Module stringview#
StringView provides string-slicing without copying. Slicing Python-strings always yields copies of a segment of the original string. See: https://mail.python.org/pipermail/python-dev/2008-May/079699.html However, this becomes costly (in terms of space and as a consequence also time) when parsing longer documents. Unfortunately, Python’s memoryview does not work for Unicode strings. Hence, the StringView class.
It is recommended to compile this module with the Cython-compiler for
speedup. The module comes with a stringview.pxd that contains some type
declarations to more fully exploit the benefits of the Cython-compiler.
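The idea of slicing without copying can be illustrated with a minimal stand-alone sketch. The class MiniView and its attributes are hypothetical and much simpler than DHParser's StringView; the sketch only assumes that a view stores a reference to the shared text plus a begin and an end offset.

```python
from typing import Optional


class MiniView:
    """Hypothetical, simplified view on a section of a string.
    This is not DHParser's StringView, only an illustration."""

    def __init__(self, text: str, begin: int = 0, end: Optional[int] = None):
        self._text = text                    # shared reference, never copied
        self._begin = begin
        self._end = len(text) if end is None else end

    def __len__(self) -> int:
        return self._end - self._begin

    def __getitem__(self, sl: slice) -> "MiniView":
        # Only plain slices with step 1 are supported in this sketch.
        start, stop, step = sl.indices(len(self))
        assert step == 1
        stop = max(start, stop)
        # A new view merely shifts the offsets; no characters are copied.
        return MiniView(self._text, self._begin + start, self._begin + stop)

    def __str__(self) -> str:
        # The only place where characters are actually copied.
        return self._text[self._begin:self._end]

    def index(self, absolute_index: int) -> int:
        """Converts an index of the underlying text to a view-relative index."""
        return absolute_index - self._begin


view = MiniView("xxIxx")[2:3]
nested = MiniView("hello world")[6:][1:3]
```

Because a regular expression applied to the underlying text reports absolute positions, a conversion like index() is needed to get back to positions relative to the view.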
- class StringView(text: str, begin: int | None = 0, end: int | None = None)[source]#
A rudimentary StringView class, just enough for the use cases in parse.py. The difference between a StringView and the python builtin strings is that StringView-objects do slicing without copying, i.e. slices are just a view on a section of the sliced string.
- count(sub: str, start: int | None = None, end: int | None = None) int[source]#
Returns the number of non-overlapping occurrences of substring sub in StringView S[start:end]. Optional arguments start and end are interpreted as in slice notation.
- endswith(suffix: str, start: int = 0, end: int | None = None) bool[source]#
Return True if S ends with the specified suffix, False otherwise. With optional start, test S beginning at that position. With optional end, stop comparing S at that position.
- find(sub: str, start: int | None = None, end: int | None = None) int[source]#
Returns the lowest index in S where substring sub is found, such that sub is contained within S[start:end]. Optional arguments start and end are interpreted as in slice notation. Returns -1 on failure.
- finditer(regex)[source]#
Executes regex.finditer on the StringView object and returns the iterator of match objects. Keep in mind that match.end(), match.span() etc. are mapped to the underlying text, not the StringView-object!!!
- index(absolute_index: int) int[source]#
Converts an index for a string watched by a StringView object to an index relative to the string view object, e.g.:
>>> import re
>>> sv = StringView('xxIxx')[2:3]
>>> match = sv.match(re.compile('I'))
>>> match.end()
3
>>> sv.index(match.end())
1
- indices(absolute_indices: Iterable[int]) Tuple[int, ...][source]#
Converts indices for a string watched by a StringView object to indices relative to the string view object. See also: sv_index()
- lstrip(chars=' \n\t') StringView[source]#
Returns a copy of self with leading whitespace removed.
- match(regex, flags: int = 0)[source]#
Executes regex.match on the StringView object and returns the result, which is either a match-object or None. Keep in mind that match.end(), match.span() etc. are mapped to the underlying text, not the StringView-object!!!
- rfind(sub: str, start: int | None = None, end: int | None = None) int[source]#
Returns the highest index in S where substring sub is found, such that sub is contained within S[start:end]. Optional arguments start and end are interpreted as in slice notation. Returns -1 on failure.
- rstrip(chars=' \n\t') StringView[source]#
Returns a copy of self with trailing whitespace removed.
- search(regex, start: int | None = None, end: int | None = None)[source]#
Executes regex.search on the StringView object and returns the result, which is either a match-object or None. Keep in mind that match.end(), match.span() etc. are mapped to the underlying text, not the StringView-object!!!
- split(sep=None)[source]#
Returns a list of the words in self, using sep as the delimiter string. If sep is not specified or is None, any whitespace string is a separator and empty strings are removed from the result.
- startswith(prefix: str, start: int = 0, end: int | None = None) bool[source]#
Return True if S starts with the specified prefix, False otherwise. With optional start, test S beginning at that position. With optional end, stop comparing S at that position.
- strip(chars: str = ' \n\r\t') StringView[source]#
Returns a copy of the StringView self with leading and trailing whitespace removed.
- class TextBuffer(text: str | StringView, version: int = 0)[source]#
TextBuffer class manages a copy of an edited text for a language server. The text can be changed via incremental edits. TextBuffer keeps track of the state of the complete text at any point in time. It works line oriented and lines of text can be retrieved via indexing or slicing.
- snapshot(eol: str = '\n') str | StringView[source]#
Returns the current state of the entire text, using the given end of line marker (\n or \r\n).
- text_edits(edits: list | dict, version: int = -1)[source]#
Incorporates the one or more text-edits or change-events into the text. A Text-Edit is a dictionary of this form:
{"range": {"start": {"line": 0, "character": 0 }, "end": {"line": 0, "character": 0 } }, "newText": "..."}
In case of a change-event, the key “newText” is replaced by “text”.
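How a single text-edit of this form changes a text can be sketched as follows. The function apply_text_edit is hypothetical and not part of DHParser; it treats "character" as a plain Python string index, whereas the Language Server Protocol counts UTF-16 code units by default, and it handles only one edit at a time.

```python
def apply_text_edit(text: str, edit: dict) -> str:
    """Applies one text-edit or change-event (hypothetical helper)."""
    lines = text.split("\n")
    # Text-edits carry "newText", change-events carry "text".
    key = "newText" if "newText" in edit else "text"
    start, end = edit["range"]["start"], edit["range"]["end"]
    # Keep what precedes the range on its first line
    # and what follows the range on its last line.
    head = lines[start["line"]][:start["character"]]
    tail = lines[end["line"]][end["character"]:]
    replacement = (head + edit[key] + tail).split("\n")
    lines[start["line"]:end["line"] + 1] = replacement
    return "\n".join(lines)


replaced = apply_text_edit(
    "first line\nsecond line",
    {"range": {"start": {"line": 0, "character": 6},
               "end": {"line": 1, "character": 7}},
     "newText": "and last "})

inserted = apply_text_edit(
    "a\nb",
    {"range": {"start": {"line": 1, "character": 0},
               "end": {"line": 1, "character": 0}},
     "text": "x\n"})
```

An edit whose start and end coincide is a pure insertion, as in the second call.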
Module toolkit#
Module toolkit contains utility functions that are needed across
several of the other DHParser-Modules. Helper functions that are not
needed in more than one module are best placed within that module and
not in the toolkit-module. An acceptable exception to this rule are
functions that are very generic.
- class JSONnull[source]#
JSONnull is a special type that is serialized as null by json_dumps. This can be used whenever it is inconvenient to use None as the null-value.
- class JSONstr(serialized_json: str)[source]#
JSONstr is a special type that encapsulates already serialized json-chunks in json object-trees. json_dumps will insert the content of a JSONstr-object literally, rather than serializing it as other objects.
- class LazyRE(regexp: str, flags=0)[source]#
A lazily-evaluating regular expression. This allows to define as many regular expressions on the top-level as you like without wasting startup-time.
>>> rx = LazyRE(r'\w+')
>>> rx.match('abc').group(0)
'abc'
>>> rx.match('!?')
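The principle of deferring compilation until first use can be sketched in a few lines. LazyPattern is a hypothetical stand-in, not DHParser's LazyRE; it assumes that forwarding attribute access to the compiled pattern is sufficient.

```python
import re


class LazyPattern:
    """Hypothetical stand-in: compiles its regular expression on first use."""

    def __init__(self, regexp: str, flags: int = 0):
        self.regexp = regexp
        self.flags = flags
        self._compiled = None          # nothing is compiled at definition time

    def __getattr__(self, name):
        # Called only for attributes that are not found on the instance,
        # e.g. match, search or finditer.
        if self._compiled is None:
            self._compiled = re.compile(self.regexp, self.flags)
        return getattr(self._compiled, name)


rx = LazyPattern(r'\w+')
compiled_before_use = rx._compiled is not None
first_word = rx.match('abc').group(0)
compiled_after_use = rx._compiled is not None
no_match = rx.match('!?')
```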
- class Protocol[source]#
Base class for protocol classes.
Protocol classes are defined as:
class Proto(Protocol):
    def meth(self) -> int:
        ...
Such classes are primarily used with static type checkers that recognize structural subtyping (static duck-typing), for example:
class C:
    def meth(self) -> int:
        return 0

def func(x: Proto) -> int:
    return x.meth()

func(C())  # Passes static type check
See PEP 544 for details. Protocol classes decorated with @typing.runtime_checkable act as simple-minded runtime protocols that check only the presence of given attributes, ignoring their type signatures. Protocol classes can be generic, they are defined as:
class GenProto(Protocol[T]):
    def meth(self) -> T:
        ...
- class SingleThreadExecutor[source]#
SingleThreadExecutor is a replacement for concurrent.futures.ProcessPoolExecutor and concurrent.futures.ThreadPoolExecutor that executes any submitted task immediately in the submitting thread. This helps to avoid writing extra code for the case that multithreading or multiprocessing has been turned off in the configuration. Turning them off is helpful for debugging.
It is not recommended to use this in asynchronous code or code that relies on the submit()- or map()-method of executors to return quickly.
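An executor that runs each task immediately in the submitting thread can be sketched on top of the standard library's concurrent.futures.Executor base class. ImmediateExecutor is a hypothetical name; DHParser's SingleThreadExecutor may be implemented differently.

```python
from concurrent.futures import Executor, Future


class ImmediateExecutor(Executor):
    """Hypothetical stand-in: runs every task at once in the calling thread."""

    def submit(self, fn, *args, **kwargs) -> Future:
        future = Future()
        try:
            # The task is executed right here, before submit() returns.
            future.set_result(fn(*args, **kwargs))
        except Exception as error:
            future.set_exception(error)
        return future


with ImmediateExecutor() as executor:
    done_future = executor.submit(pow, 2, 10)
    failed_future = executor.submit(int, "not a number")
    # Executor.map() is inherited and relies on submit().
    mapped = list(executor.map(abs, [-1, 2, -3]))
```

Since submit() blocks until the task has finished, the caveat above applies: code that expects submit() or map() to return quickly will stall.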
- class ThreadLocalSingletonFactory(class_or_factory, name: str = '', *, uniqueID: str | int = 0, ident=None)[source]#
Generates a singleton-factory that returns one and the same instance of class_or_factory for one and the same thread, but different instances for different threads.
Note: Parameter uniqueID should be provided if class_or_factory is not unique but generic. See source code of DHParser.dsl.create_transtable_junction()
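The behaviour "same instance within a thread, different instances across threads" can be sketched with threading.local from the standard library. PerThreadSingleton is a hypothetical name, and the sketch leaves out the name, uniqueID and ident parameters of the real class.

```python
import threading


class PerThreadSingleton:
    """Hypothetical sketch of a factory that yields one instance per thread."""

    def __init__(self, factory):
        self._factory = factory
        self._local = threading.local()     # separate attribute space per thread

    def __call__(self):
        try:
            return self._local.instance
        except AttributeError:
            # First call in this thread: create and remember the instance.
            self._local.instance = self._factory()
            return self._local.instance


get_buffer = PerThreadSingleton(list)
main_a = get_buffer()
main_b = get_buffer()

from_other_thread = []
worker = threading.Thread(target=lambda: from_other_thread.append(get_buffer()))
worker.start()
worker.join()
```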
- abbreviate_middle(s: str, max_length: int) str[source]#
Shortens string s by replacing the middle part with an ellipsis sign ` … ` if the size of the string exceeds max_length.
- as_identifier(s: str, replacement: str = '_') str[source]#
Converts a string to an identifier that matches /\w+/ by substituting any character not matching /\w/ with the given replacement string:
>>> as_identifier('EBNF-m')
'EBNF_m'
- as_list(item_or_sequence) List[Any][source]#
Turns an arbitrary sequence or a single item into a list. In case of a single item, the list contains this element as its sole item.
- as_tuple(item_or_sequence) Tuple[Any][source]#
Turns an arbitrary sequence or a single item into a tuple. In case of a single item, the tuple contains this element as its sole item.
- cached_load(file_name: str, deserialize: Callable[[str], Any] | partial, cachedir: str = '~/.cache') Any[source]#
Loads and deserializes a file into a python-object. The Python-object will be pickled and written to “cachedir”. If a pickled version already exists, the same file will not be deserialized again, but the pickled version will be loaded. If cachedir == “”, the pickled version will always be preferred, even if the original file has been updated. In this case, in order to invalidate the cache, the pickled version must be deleted manually. Otherwise, and this includes the case cachedir == “.”, a hash value is used to check whether the original file has been updated, in which case, the source-file will be loaded and deserialized anew.
- Parameters:
file_name – The name of the file to load.
deserialize – The function to deserialize the content of the file.
cachedir – The directory to cache the pickled version in.
- Returns:
The deserialized python object.
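The hash-checked variant of this caching scheme can be sketched as follows. The function cached_load_sketch, the cache file naming and the choice of storing the digest inside the pickle are assumptions made for illustration; DHParser's cached_load() may organize its cache differently.

```python
import hashlib
import json
import os
import pickle
import tempfile


def cached_load_sketch(file_name, deserialize, cachedir):
    """Hypothetical sketch: deserializes a file unless an up-to-date
    pickled version is found in cachedir."""
    with open(file_name, "r", encoding="utf-8") as f:
        source = f.read()
    digest = hashlib.md5(source.encode("utf-8")).hexdigest()
    cache_file = os.path.join(cachedir, os.path.basename(file_name) + ".pickle")
    if os.path.exists(cache_file):
        with open(cache_file, "rb") as f:
            cached_digest, obj = pickle.load(f)
        if cached_digest == digest:       # source unchanged: use the cache
            return obj
    obj = deserialize(source)             # cache missing or stale
    os.makedirs(cachedir, exist_ok=True)
    with open(cache_file, "wb") as f:
        pickle.dump((digest, obj), f)
    return obj


calls = []

def counting_loads(text):
    calls.append(text)                    # records each real deserialization
    return json.loads(text)

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "data.json")
    cache = os.path.join(tmp, "cache")
    with open(src, "w", encoding="utf-8") as f:
        f.write('{"a": 1}')
    first = cached_load_sketch(src, counting_loads, cache)
    second = cached_load_sketch(src, counting_loads, cache)   # served from cache
    with open(src, "w", encoding="utf-8") as f:
        f.write('{"a": 2}')
    third = cached_load_sketch(src, counting_loads, cache)    # hash differs
```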
- clear_from_cache(file_name: str, cachedir: str = '~/.cache')[source]#
Removes the cached version of file_name from the cache. (See cached_load())
- compile_python_object(python_src: str, catch_obj='DSLGrammar') Any[source]#
Compiles the python source code and returns the (first) object the name of which is either equal to or matched by catch_obj_regex. If catch_obj is the empty string, the namespace dictionary will be returned.
- cpu_count() int[source]#
Returns the number of cpus that are accessible to the current process or 1 if the cpu count cannot be determined.
- deprecated(message: str) Callable[source]#
Decorator that marks a function as deprecated and emits a deprecation message on its first use:
>>> @deprecated('This function is deprecated!')
... def bad():
...     pass
>>> save = get_config_value('deprecation_policy')
>>> set_config_value('deprecation_policy', 'fail')
>>> try: bad()
... except DeprecationWarning as w: print(w)
This function is deprecated!
>>> set_config_value('deprecation_policy', save)
- deprecation_warning(message: str)[source]#
Issues a deprecation warning. Makes sure that each message is issued only once.
- escape_ctrl_chars(strg: str) str[source]#
Replaces all control characters (e.g. \n, \t) in a string by their back-slashed representation and replaces backslash by double backslash.
- escape_formatstr(s: str) str[source]#
Replaces single curly braces by double curly-braces in string s, so that they are not misinterpreted as place-holder by “”.format().
- escape_re(strg: str) str[source]#
Returns the string with all regular expression special characters escaped.
- expand_table(compact_table: Dict) Dict[source]#
Expands a table by separating keywords that are tuples or strings containing comma separated words into single keyword entries with the same values. Returns the expanded table. Example:
>>> expand_table({"a, b": 1, ('d','e','f'):5, "c":3})
{'a': 1, 'b': 1, 'd': 5, 'e': 5, 'f': 5, 'c': 3}
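The expansion described above can be sketched in a few lines. expand_table_sketch is a hypothetical re-implementation written for illustration only; it is not taken from the DHParser sources.

```python
def expand_table_sketch(compact_table: dict) -> dict:
    """Hypothetical sketch: one entry per keyword, values shared."""
    expanded = {}
    for key, value in compact_table.items():
        if isinstance(key, str):
            # "a, b" stands for the two keywords "a" and "b"
            names = [part.strip() for part in key.split(",")]
        elif isinstance(key, tuple):
            names = list(key)
        else:
            names = [key]
        for name in names:
            expanded[name] = value
    return expanded


table = expand_table_sketch({"a, b": 1, ('d', 'e', 'f'): 5, "c": 3})
```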
- first(item_or_sequence: Sequence | Any) Any[source]#
Returns an item or the first item of a sequence of items.
- fix_XML_attribute_value(value: Any) str[source]#
Returns the quoted XML-attribute value. In case the value contains illegal characters, like ‘<’, these will be replaced by XML-entities.
- identify_python() str[source]#
Returns a reasonable identification string for the python interpreter, e.g. “cpython 3.8.6”.
- identity(x)[source]#
Canonical identity function. The purpose of defining identity() here is to allow it to serve as a default value and to be able to check whether a function parameter has been assigned another than the default value or not.
- instantiate_executor(allow_parallel: bool, preferred_executor=<toolkit.PickMultiCoreExecutorShim object>, *args, **kwargs)[source]#
Instantiates an Executor of a particular type.
If allow_parallel is False, a SingleThreadExecutor will be instantiated, regardless of the preferred_executor and any configuration values.
Parallel execution can still be blocked by the configuration variable ‘debug_parallel_execution’. (The default is to allow full multiprocessing unless a command-line switch was used to trigger a different behavior.) Otherwise, a surrogate executor will be returned.
- Parameters:
allow_parallel – If false, a SingleThreadExecutor-object will be returned. If true, it depends on the config value of ‘debug_parallel_executor’ (see comments in config.py for a detailed explanation)
preferred_executor – the preferred executor class that is used if the parameter allow_parallel is true and ‘debug_parallel_executor’ does not forbid the use of this kind of executor. The unofficial default value is MultiCoreExecutor, which yields a ProcessPoolExecutor for Python versions <= 3.13 and a wrapped InterpreterPoolExecutor for Python versions 3.14 and above.
- Returns:
an executor-object, either an instance of concurrent.futures.Executor or SingleThreadExecutor (see above).
- is_filename(strg: str) bool[source]#
Tries to guess whether the given string is a file name. It is assumed that it is NOT a filename if any of the following conditions is true:
it starts with a byte-order mark, i.e. ‘\ufffe’ or ‘\ufeff’
it starts or ends with a blank, i.e. “ “
it contains any of the characters in the set [*?”<>|]
For disambiguation of non-filenames it is best to add a byteorder-mark to the beginning of the string, because this will be stripped by the DHParser’s parser, anyway!
- is_python_code(text_or_file: str) bool[source]#
Checks whether ‘text_or_file’ is python code or the name of a file that contains python code.
- isgenerictype(t)[source]#
Returns True if t is a generic type. WARNING: This is very “hackish”. Caller must make sure that t actually is a type!
- issubtype(sub_type, base_type) bool[source]#
Returns True if sub_type is a subtype of base_type. WARNING: Implementation is somewhat “hackish” and might break with new Python-versions.
- json_dumps(obj: JSON_Type, *, cls=<class 'json.encoder.JSONEncoder'>, partially_serialized: bool = False) str[source]#
Returns json-object as string. Unlike the standard-library’s json.dumps()-function, json_dumps allows including already serialized parts (in the form of JSONstr-objects) in the json-object. Example:
>>> already_serialized = '{"width":640,"height":400"}'
>>> literal = JSONstr(already_serialized)
>>> json_obj = {"jsonrpc": "2.0", "method": "report_size", "params": literal, "id": None}
>>> json_dumps(json_obj, partially_serialized=True)
'{"jsonrpc":"2.0","method":"report_size","params":{"width":640,"height":400"},"id":null}'
- Parameters:
obj – A json-object (or a tree of json-objects) to be serialized
cls – The class of a custom json-encoder derived from json.JSONEncoder
partially_serialized – If True, JSONstr-objects within the json tree will be encoded (by inserting their content). If False, JSONstr-objects will raise a TypeError, but encoding will be faster.
- Returns:
The string-serialized form of the json-object.
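Inserting already serialized chunks literally can be sketched with a small recursive serializer. RawJSON and dumps_with_raw are hypothetical names; the sketch supports only dictionaries, lists, tuples and the basic types that json.dumps() handles, and it does not validate the pre-serialized chunk.

```python
import json


class RawJSON:
    """Hypothetical wrapper around an already serialized json chunk."""

    def __init__(self, serialized: str):
        self.serialized = serialized


def dumps_with_raw(obj) -> str:
    """Serializes obj compactly, copying RawJSON chunks verbatim."""
    if isinstance(obj, RawJSON):
        return obj.serialized             # inserted literally, not re-encoded
    if isinstance(obj, dict):
        items = (json.dumps(str(k)) + ":" + dumps_with_raw(v)
                 for k, v in obj.items())
        return "{" + ",".join(items) + "}"
    if isinstance(obj, (list, tuple)):
        return "[" + ",".join(dumps_with_raw(v) for v in obj) + "]"
    return json.dumps(obj)                # str, numbers, bool, None


chunk = RawJSON('{"width":640,"height":400}')
message = dumps_with_raw({"jsonrpc": "2.0", "method": "report_size",
                          "params": chunk, "id": None})
```

Skipping the re-encoding of large, already serialized parts is the point of the design; the price is that a malformed chunk passes through unnoticed.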
- json_rpc(method: str, params: JSON_Type = [], ID: int | None = None, partially_serialized: bool = True) str[source]#
Generates a JSON-RPC-call string for method with parameters params.
- Parameters:
method – The name of the rpc-function that shall be called
params – A json-object representing the parameters of the call
ID – An ID for the json-rpc-call or None
partially_serialized – If True, the params-object may contain already serialized parts in form of JSONStr-objects. If False, any JSONStr-objects will lead to a TypeError.
- Returns:
The string-serialized form of the json-object.
- last(item_or_sequence: Sequence | Any) Any[source]#
Returns an item or the last item of a sequence of items.
- line_col(lbreaks: List[int], pos: int) Tuple[int, int][source]#
Returns the position within a text as (line, column)-tuple based on a list of all line breaks, including -1 and EOF.
- linebreaks(text: StringView | str) List[int][source]#
Returns a list of the indices of all line breaks in the text.
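The interplay of a line-break list and a position lookup can be sketched with the standard library's bisect module. Both functions are hypothetical re-implementations; the sketch assumes that the list starts with -1 and ends with the text length (EOF), as described for line_col(), and that lines and columns are counted from 1.

```python
import bisect
from typing import List, Tuple


def linebreaks_sketch(text: str) -> List[int]:
    """Indices of all line breaks, framed by -1 and the end of the text."""
    breaks = [-1]
    pos = text.find("\n")
    while pos >= 0:
        breaks.append(pos)
        pos = text.find("\n", pos + 1)
    breaks.append(len(text))
    return breaks


def line_col_sketch(lbreaks: List[int], pos: int) -> Tuple[int, int]:
    """Binary search for the line; the column is the distance
    to the preceding line break."""
    line = bisect.bisect_left(lbreaks, pos)
    column = pos - lbreaks[line - 1]
    return line, column


lbreaks = linebreaks_sketch("ab\ncd\nef")
start_of_text = line_col_sketch(lbreaks, 0)
start_of_second_line = line_col_sketch(lbreaks, 3)
last_character = line_col_sketch(lbreaks, 7)
```

Computing the list once and reusing it makes each lookup logarithmic in the number of lines, which matters when many error positions are reported for one document.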
- load_if_file(text_or_file) str[source]#
Reads and returns content of a text-file if parameter text_or_file is a file name (i.e. a single line string), otherwise (i.e. if text_or_file is a multi-line string) text_or_file is returned.
- lxml_XML_attribute_value(value: Any) str[source]#
Makes sure that the attribute value works with the lxml-library, at the cost of replacing all characters with a code > 256 by a question mark.
- Parameters:
value – the original attribute value
- Returns:
the quoted and lxml-compatible attribute value.
- matching_brackets(text: str, openB: str, closeB: str, unmatched: list = [], is_regex: bool = False) List[Tuple[int, int]][source]#
Returns a list of matching bracket positions. Fills an empty list passed to parameter unmatched with the positions of all unmatched brackets. If is_regex is True, escaped brackets and brackets inside charsets will be ignored. In other words, only brackets that are control-characters of the regular expression will be considered. Examples:
>>> matching_brackets('(a(b)c)', '(', ')')
[(2, 4), (0, 6)]
>>> matching_brackets('(a)b(c)', '(', ')')
[(0, 2), (4, 6)]
>>> unmatched = []
>>> matching_brackets('ab(c', '(', ')', unmatched)
[]
>>> unmatched
[2]
>>> matching_brackets(r'([^\d()]*(?=[\d(]))', '(', ')', is_regex=True)
[(9, 17), (0, 18)]
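The non-regex case can be sketched with a simple stack. matching_brackets_sketch is a hypothetical re-implementation that handles only single-character brackets and ignores the is_regex option entirely.

```python
from typing import List, Optional, Tuple


def matching_brackets_sketch(text: str, open_b: str, close_b: str,
                             unmatched: Optional[list] = None
                             ) -> List[Tuple[int, int]]:
    """Hypothetical sketch: pairs brackets with the help of a stack."""
    stack, matches = [], []
    for i, ch in enumerate(text):
        if ch == open_b:
            stack.append(i)                       # remember the opening position
        elif ch == close_b:
            if stack:
                matches.append((stack.pop(), i))  # innermost pair closes first
            elif unmatched is not None:
                unmatched.append(i)               # closing bracket without partner
    if unmatched is not None:
        unmatched.extend(stack)                   # opening brackets never closed
    return matches


nested = matching_brackets_sketch('(a(b)c)', '(', ')')
sequential = matching_brackets_sketch('(a)b(c)', '(', ')')
leftover = []
incomplete = matching_brackets_sketch('ab(c', '(', ')', leftover)
```

Pairs are appended when their closing bracket is reached, which is why the inner pair precedes the outer pair in the first result.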
- md5(*txt)[source]#
Returns the md5-checksum for txt. This can be used to test if some piece of text, for example a grammar source file, has changed.
- multiprocessing_broken() str[source]#
Returns an error message, if, for any reason multiprocessing is not safe to be used. For example, multiprocessing does not work with PyInstaller (under Windows) or GraalVM.
- normalize_circular_path(path: Tuple[str, ...]) Tuple[str, ...][source]#
Returns a normalized version of a circular path represented as a tuple.
A circular (or “recursive”) path is a tuple of items, the order of which matters, but not the starting point. This can, for example, be a tuple of references from one symbol defined in an EBNF source text back to (but excluding) itself.
For example, when defining a grammar for arithmetic, the tuple (‘expression’, ‘term’, ‘factor’) is a recursive path, because the definition of a factor includes a (bracketed) expression and thus refers back to expression. Normalizing is done by “ring-shifting” the tuple so that it starts with the alphabetically first symbol in the path:
>>> normalize_circular_path(('term', 'factor', 'expression'))
('expression', 'term', 'factor')
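Ring-shifting can be sketched as a rotation that starts at the smallest element. normalize_sketch is a hypothetical re-implementation; it assumes that the alphabetically first symbol occurs only once in the path.

```python
from typing import Tuple


def normalize_sketch(path: Tuple[str, ...]) -> Tuple[str, ...]:
    """Hypothetical sketch: rotates the path so that the
    alphabetically first symbol comes first."""
    if not path:
        return path
    start = path.index(min(path))
    return path[start:] + path[:start]


rotations = [('expression', 'term', 'factor'),
             ('term', 'factor', 'expression'),
             ('factor', 'expression', 'term')]
normalized = {normalize_sketch(p) for p in rotations}
```

All rotations of the same circular path collapse into one normal form, which makes circular paths comparable and usable as set members.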
- normalize_circular_paths(path: Tuple[str, ...] | AbstractSet[Tuple[str, ...]]) Tuple[str, ...] | MutableSet[Tuple[str, ...]] | MutableSet[source]#
Like normalize_circular_path(), but normalizes a whole set of paths at once.
- normalize_docstring(docstring: str) str[source]#
Strips leading indentation as well as leading and trailing empty lines from a docstring.
- pp_json(obj: JSON_Type, *, cls=<class 'json.encoder.JSONEncoder'>) str[source]#
Returns json-object as pretty-printed string. Other than the standard-library’s json.dumps()-function pp_json allows to include already serialized parts (in the form of JSONStr-objects) in the json-object. Example:
>>> already_serialized = '{"width":640,"height":400"}'
>>> literal = JSONstr(already_serialized)
>>> json_obj = {"jsonrpc": "2.0", "method": "report_size", "params": literal, "id": None}
>>> print(pp_json(json_obj))
{
  "jsonrpc": "2.0",
  "method": "report_size",
  "params": {"width":640,"height":400"},
  "id": null}
- Parameters:
obj – A json-object (or a tree of json-objects) to be serialized
cls – The class of a custom json-encoder derived from json.JSONEncoder
- Returns:
The pretty-printed string-serialized form of the json-object.
- pp_json_str(jsons: str) str[source]#
Pretty-prints an already serialized (but possibly ugly-printed) json object in a well-readable form. Syntactic sugar for: pp_json(json.loads(jsons)).
- printw(s: Any, wrap_column: int = 79, tolerance: int = 24, wrap_chars: str = ')]>, ')[source]#
Prints the string or other object nicely wrapped. See wrap_str_nicely().
- relative_path(from_path: str, to_path: str) str[source]#
Returns the relative path in order to open a file from to_path when the script is running in from_path. Example:
>>> relative_path('project/common/dir_A', 'project/dir_B').replace(chr(92), '/')
'../../dir_B'
- sane_parser_name(name) bool[source]#
Checks whether given name is an acceptable parser name. Parser names must not be preceded or succeeded by a double underscore ‘__’!
- smart_list(arg: str | Iterable | Any) Sequence | Set[source]#
Returns the argument as list, depending on its type and content.
If the argument is a string, it will be interpreted as a list of comma separated values, trying ‘;’, ‘,’, ‘ ‘ as possible delimiters in this order, e.g.:
>>> smart_list('1; 2, 3; 4')
['1', '2, 3', '4']
>>> smart_list('2, 3')
['2', '3']
>>> smart_list('a b cd')
['a', 'b', 'cd']
If the argument is a collection other than a string, it will be returned as is, e.g.:
>>> smart_list((1, 2, 3))
(1, 2, 3)
>>> smart_list({1, 2, 3})
{1, 2, 3}
If the argument is another iterable than a collection, it will be converted into a list, e.g.:
>>> smart_list(i for i in {1,2,3})
[1, 2, 3]
Finally, if none of the above is true, the argument will be wrapped in a list and returned, e.g.:
>>> smart_list(125)
[125]
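The four cases described above can be sketched with the abstract base classes from collections.abc. smart_list_sketch is a hypothetical re-implementation; how DHParser's smart_list() treats edge cases such as empty strings is not covered here.

```python
from collections.abc import Collection, Iterable


def smart_list_sketch(arg):
    """Hypothetical sketch of the case distinction described above."""
    if isinstance(arg, str):
        # Try the delimiters in order; the first one that splits wins.
        for delimiter in (";", ",", " "):
            parts = arg.split(delimiter)
            if len(parts) > 1:
                return [part.strip() for part in parts]
        return [arg]
    if isinstance(arg, Collection):
        return arg                    # tuples, lists, sets: returned as is
    if isinstance(arg, Iterable):
        return list(arg)              # e.g. generators and iterators
    return [arg]                      # anything else is wrapped in a list


by_semicolon = smart_list_sketch('1; 2, 3; 4')
by_comma = smart_list_sketch('2, 3')
by_blank = smart_list_sketch('a b cd')
unchanged = smart_list_sketch((1, 2, 3))
from_iterator = smart_list_sketch(iter([1, 2, 3]))
wrapped = smart_list_sketch(125)
```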
- split_path(path: str) Tuple[str, ...][source]#
Splits a filesystem path into its components. Unlike os.path.split(), it does not split off only the last part:
>>> split_path('a/b/c')
('a', 'b', 'c')
>>> os.path.split('a/b/c')  # for comparison.
('a/b', 'c')
- static#
alias of
staticmethod
- text_pos(text: StringView | str, line: int, column: int, lbreaks: List[int] = []) int[source]#
Returns the text-position for a given line and column or -1 if the line and column exceed the size of the text.
- class unrepr(s: str)[source]#
unrepr encapsulates a string representing a python function in such a way that the representation of the string yields the function call itself rather than a string representing the function call and delimited by quotation marks. Example:
>>> "re.compile(r'abc+')"
"re.compile(r'abc+')"
>>> unrepr("re.compile(r'abc+')")
re.compile(r'abc+')
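The effect rests on overriding __repr__. A minimal sketch, using the hypothetical class name UnreprSketch, shows why this is handy when Python source code is generated by calling repr() on data structures:

```python
class UnreprSketch:
    """Hypothetical sketch: repr() yields the wrapped string without quotes."""

    def __init__(self, s: str):
        self.s = s

    def __repr__(self) -> str:
        return self.s


plain = repr("re.compile(r'abc+')")
wrapped = repr(UnreprSketch("re.compile(r'abc+')"))
# Containers call repr() on their items, so the call survives as code.
generated = repr({'rx': UnreprSketch("re.compile(r'abc+')")})
```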
- validate_XML_attribute_value(value: Any) str[source]#
Validates an XML-attribute value and returns the quoted string-value if successful. Otherwise, raises a ValueError.
- wrap_str_literal(s: str | List[str], column: int = 80, offset: int = 0) str[source]#
Wraps an excessively long string literal over several lines. Example:
>>> s = '"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"'
>>> print(wrap_str_literal(s, 25))
"abcdefghijklmnopqrstuvwx"
"yzABCDEFGHIJKLMNOPQRSTUVW"
"XYZ0123456789"
>>> s = 'r"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"'
>>> print("Call(" + wrap_str_literal(s, 40, 5) + ")")
Call(r"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLM"
     r"NOPQRSTUVWXYZ0123456789")
>>> s = 'fr"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"'
>>> print("Call(" + wrap_str_literal(s, 40, 5) + ")")
Call(fr"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLM"
     fr"NOPQRSTUVWXYZ0123456789")
>>> s = ['r"abcde', 'ABCDE"']
>>> print(wrap_str_literal(s))
r"abcde"
r"ABCDE"
- wrap_str_nicely(s: str, wrap_column: int = 79, tolerance: int = 24, wrap_chars: str = ')]>, ') str[source]#
Line-wraps a single-line output string at ‘wrap_column’. Tries to find a suitable point for wrapping, i.e. after any of the wrap_characters.
If the strings spans multiple lines, the existing linebreaks will be kept and no rewrapping takes place. In order to enforce rewrapping of multiline strings, use:
wrap_str_nicely(repr(s)[1:-1]). (repr() replaces linebreaks by the \n-marker. The slicing [1:-1] removes the opening and closing quotation marks that repr adds.) Examples:
>>> s = ('(X (l ",.") (A (O "123") (P (:Text "4") (em "56"))) '
...      '(em (m "!?")) (B (Q (em "78") (:Text "9")) (R "abc")) '
...      '(n "+-"))')
>>> print(wrap_str_nicely(s))
(X (l ",.") (A (O "123") (P (:Text "4") (em "56"))) (em (m "!?")) (B (Q (em "78") (:Text "9")) (R "abc")) (n "+-"))
>>> s = ('(X (s) (A (u) (C "One,")) (em (A (C " ") (D "two, ")))'
...      '(B (E "three, ") (F "four!") (t))))')
>>> print(wrap_str_nicely(s))
(X (s) (A (u) (C "One,")) (em (A (C " ") (D "two, ")))(B (E "three, ") (F "four!") (t))))
>>> s = ("[Node('word', 'This'), Node('word', 'is'), "
...      "Node('word', 'Buckingham'), Node('word', 'Palace')]")
>>> print(wrap_str_nicely(s))
[Node('word', 'This'), Node('word', 'is'), Node('word', 'Buckingham'), Node('word', 'Palace')]
>>> s = ("Node('phrase', (Node('word', 'Buckingham'), "
...      "Node('blank', ' '), Node('word', 'Palace')))")
>>> print(wrap_str_nicely(s))
Node('phrase', (Node('word', 'Buckingham'), Node('blank', ' '), Node('word', 'Palace')))
>>> s = ('<hard>Please mark up <foreign lang="de">Stadt\n<lb/></foreign>'
...      '<location><foreign lang="de"><em>München</em></foreign> '
...      'in Bavaria</location> in this sentence.</hard>')
>>> print(wrap_str_nicely(s))
<hard>Please mark up <foreign lang="de">Stadt <lb/></foreign><location><foreign lang="de"><em>München</em></foreign> in Bavaria</location> in this sentence.</hard>
>>> print(wrap_str_nicely(repr(s)[1:-1]))  # repr to ignore the linebreaks
<hard>Please mark up <foreign lang="de">Stadt\n<lb/></foreign><location> <foreign lang="de"><em>München</em></foreign> in Bavaria</location> in this sentence.</hard>