Document-Trees¶

Module nodetree encapsulates the functionality for creating and handling document trees and, in particular syntax-trees generated by a parser. This includes serialization and deserialization of node-trees, navigating and searching node-trees as well as annotating node-trees with attributes and error messages.

Node-objects¶

Syntax trees are composed of Node-objects which are linked uni-directionally from parent to children. Nodes can contain either child-nodes, in which case they are informally called “branch-nodes”, or text-strings, in which case they informally called “leaf nodes”, but not both at the same time. (There is no mixed content as in XML!)

Apart from their content, the most important property of a Node-object is its name. Nodes are initialized with their name and content as arguments:

>>> from DHParser.nodetree import *
>>> number_1 = Node('number', "5")
>>> number_1.name
'number'

The Node-object number_1 now has the tag-name “number” and the content “5”. Since the content is a string and not a tuple of child-nodes, the node constructed is a leaf-node.

(By convention, if the tag-name of a node starts with a colon “:”, the node is considered “anonymous”. This distinction is helpful when a tree of nodes is generated in a parsing process to distinguish nodes that contain important pieces of data from nodes that merely contain delimiters or structural information.)

Several nodes can be connected to a tree:

>>> number_2 = Node('number', "4")
>>> addition = Node('add', (number_1, number_2))

Trees spanned by a node can conveniently be serialized as S-expressions (well-known from the computer languages “lisp” and “scheme”):

>>> print(addition.as_sxpr())
(add (number "5") (number "4"))

It is also possible to serialize nodes as XML-snippet:

>>> print(addition.as_xml())
<add>
  <number>5</number>
  <number>4</number>
</add>

or as indented tree:

>>> print(addition.as_tree())
add
  number "5"
  number "4"

or as JSON-data (see further below). Trees can also be deserialized from any of these formats with the exception of the indented tree (see below).

In order to test whether a Node is leaf-node one can check for the absence of children:

>>> node = Node('word', 'Palace')
>>> assert not node.children

The data of a node can be queried by reading the result-property:

>>> node.result
'Palace'

The result is always a string or a tuple of Nodes, even if the node-object has been initialized with a single node:

>>> parent = Node('phrase', node)
>>> parent.result
(Node('word', 'Palace'),)

The result-property can be assigned to, in order to change the data of a node:

>>> parent.result = (Node('word', 'Buckingham'), Node('blank', ' '), node)
>>> print(parent.as_sxpr())
(phrase (word "Buckingham") (blank " ") (word "Palace"))

Content-equality of Nodes must be tested with the equals()-method. The equality operator == tests merely for the identity of the node-object, not for the equality of the content of two different node-objects:

>>> n1 = Node('dollars', '1')
>>> n2 = Node('dollars', '1')
>>> n1.equals(n2)
True
>>> n1 == n2
False

An empty node is always a leaf-node, that is, if initialized with an empty tuple, the node’s result will actually be the empty string:

>>> empty = Node('void', ())
>>> empty.result
''
>>> assert empty.equals(Node('void', ''))

Next to the result-property, a node’s content can be queried with either its children-property or its content-property. The former yields the tuple of child-nodes. The latter yields the string-content of the node, which in the case of a “branch-node” is the (recursively generated) concatenated string-content of all of its children:

>>> node.content
'Palace'
>>> node.children
()
>>> parent.content
'Buckingham Palace'
>>> parent.children
(Node('word', 'Buckingham'), Node('blank', ' '), Node('word', 'Palace'))

Both the content-property and the children-property are read-only-properties. In order to change the data of a node, its result-property must be assigned to (as shown above).

Just like HTML- oder XML-tags, nodes can be annotated with attributes. Unlike XML and HTML, however, the value of these attributes can be of any type, not only strings. The only requirement is that the value is serializable as string. Be aware, though of the possible loss of information when serializing nodes or converting nodes to ElementTree-elements, if there are attributes with non-string values! Attributes are stored in an ordered dictionary that maps string identifiers, i.e. the attribute name, to the content of the attribute. This dictionary can be accessed via the attr-property:

>>> node.attr['price'] = 'very high'
>>> print(node.as_xml())
<word price="very high">Palace</word>

When serializing as S-expressions attributes are shown as a nested list marked with a “tick”:

>>> print(node.as_sxpr())
(word `(price "very high") "Palace")

Attributes can be queried via the has_attr() and get_attr()-methods. This is to be preferred over accessing the attr-property for querying, because the attribute dictionary is created lazily on the first access of the attr-property:

>>> node.has_attr('price')
True
>>> node.get_attr('price', '')
'very high'
>>> parent.get_attr('price', 'unknown')
'unknown'

If called with no parameters or an empty string as attribute name, has_attr() returns True, if at least one attribute is present:

>>> parent.has_attr()
False

Attributes can be deleted like dictionary entries:

>>> del node.attr['price']
>>> node.has_attr('price')
False

Node-objects contain a special “write once, read afterwards”-property named pos that is meant to capture the source code position of the content represented by the Node. Usually, the pos values are initialized with the corresponding source code location by the parser.

The main purpose of keeping source-code locations in the node-objects is to equip the messages of errors that are detected in later processing stages with source code locations. In later processing stages the tree may already have been reshaped and its string-content may have been changed, say, by normalizing whitespace or dropping delimiters.

Before the pos-field can be read, it must have been initialized with the with_pos-method, which recursively initializes the pos-field of the child nodes according to the offset of the string values from the main field:

>>> import copy; essentials = copy.deepcopy(parent)
>>> print(essentials.with_pos(0).as_xml(src=essentials.content))
<phrase line="1" col="1">
  <word line="1" col="1">Buckingham</word>
  <blank line="1" col="11"> </blank>
  <word line="1" col="12">Palace</word>
</phrase>
>>> essentials[-1].pos, essentials.content.find('Palace')
(11, 11)
>>> essentials.result = tuple(child for child in essentials.children
...                           if child.name != 'blank')
>>> print(essentials.as_xml(src=essentials.content))
<phrase line="1" col="1">
  <word line="1" col="1">Buckingham</word>
  <word line="1" col="12">Palace</word>
</phrase>
>>> essentials[-1].pos, essentials.content.find('Palace')
(11, 10)

Serialization¶

Syntax trees can be serialized as S-expressions, XML, JSON and indented text. Module ‘nodetree’ also contains a few simple parsers (parse_sxpr(), parse_xml()) or parse_json() to convert XML-snippets, S-expressions or json objects into trees composed of Node-objects.

In order to make it easier to parameterize serialization, the Node-class also defines a generic serialize()-method next to the more specialized as_sxpr()-, as_json()- and as_xml()-methods:

>>> s = ('(sentence (word "This") (blank " ") (word "is") (blank " ") '
...      '(phrase (word "Buckingham") (blank " ") (word "Palace")))')
>>> sentence = parse_sxpr(s)
>>> print(sentence.serialize(how='indented'))
sentence
  word "This"
  blank " "
  word "is"
  blank " "
  phrase
    word "Buckingham"
    blank " "
    word "Palace"
>>> sxpr = sentence.serialize(how='sxpr')
>>> round_trip = parse_sxpr(sxpr)
>>> assert sentence.equals(round_trip)

When serializing as XML, there will be no mixed-content and, likewise, no empty tags per default, because these do not exist in DHParser’s data model:

>>> print(sentence.as_xml())
<sentence>
  <word>This</word>
  <blank> </blank>
  <word>is</word>
  <blank> </blank>
  <phrase>
    <word>Buckingham</word>
    <blank> </blank>
    <word>Palace</word>
  </phrase>
</sentence>

However, mixed-content can be simulated with string_tags-parameter of the as_xml()-method.:

>>> print(sentence.as_xml(inline_tags={'sentence'}, string_tags={'word', 'blank'}))
<sentence>This is <phrase>Buckingham Palace</phrase></sentence>

The inline_tags-parameter ensures that all listed tags and contained tags will be printed on a single line. This is helpful when opening the XML-serialization in an internet-browser in order to avoid spurious blanks when a line-break occurs in the HTML/XML-source.

Finally, empty tags that do not have a closing tag (e.g. <br />) can be declared as such with the empty_tags-parameter.

Note that using string_tags can lead to a loss of information. A loss of information is inevitable if, like in the example above, more than one tag is listed in the string_tags-set passed to the as_xml()-method. Deserializing the XML-string yields:

>>> tree = parse_xml(
...    '<sentence>This is <phrase>Buckingham Palace</phrase></sentence>',
...    string_tag='MIXED')
>>> print(tree.serialize(how='indented'))
sentence
  MIXED "This is "
  phrase "Buckingham Palace"

ElementTree-Exchange¶

Although DHParser offers rich support for tree-transformation, the wish may arise to use standard XML-tools for tree-transformation as an alternative or supplement to the tools DHParser offers. One way to do so, would be to serialize the tree of Node-objects, then use the XML-tools and, possibly, to deserialize the transformed XML again.

A more efficient method, however, is to utilize any of the various Python-libraries for XML. In order to make this as easy as possible trees of Node-objects can be converted to ElementTree-objects either from the python standard library or from the lxml-library:

>>> import xml.etree.ElementTree as ET  # for lxml write: from lxml import etree as ET
>>> et = sentence.as_etree(ET)
>>> ET.dump(et)
<sentence><word>This</word><blank> </blank><word>is</word><blank> </blank><phrase><word>Buckingham</word><blank> </blank><word>Palace</word></phrase></sentence>
>>> tree = Node.from_etree(et)
>>> print(tree.equals(sentence))
True

The first parameter of as_etree() is the ElementTree-library to be used. If omitted, the standard-library-ElementTree is used.

Like the as_xml()-method, the as_etree() and from_etree() can be parameterized in order to support mixed-content and empty-tags:

>>> et = sentence.as_etree(ET, string_tags={'word', 'blank'})
>>> ET.dump(et)
<sentence>This is <phrase>Buckingham Palace</phrase></sentence>

Tree-Traversal¶

Transforming syntax trees is usually done by traversing the complete tree and applying specific transformation functions on each node. Modules “transform” and “compile” provide high-level interfaces and scaffolding classes for the traversal and transformation of syntax-trees.

Module nodetree does not provide any functions for transforming trees, but it provides low-level functions for navigating trees. These functions cover three different purposes:

Downtree-navigation within the subtree spanned by a particular node.
Uptree- and horizontal navigation to the neighborhood (“siblings”) ancestry of a given node.
Navigation by looking at the string-representation of the tree.

Navigating “downtree”¶

There are a number of useful functions to help navigating a tree spanned by a node and finding particular nodes within in a tree:

>>> from DHParser import printw
>>> printw(list(sentence.select('word')))
[Node('word', 'This'), Node('word', 'is'), Node('word', 'Buckingham'),
 Node('word', 'Palace')]
>>> list(sentence.select(lambda node: node.content == ' '))
[Node('blank', ' '), Node('blank', ' '), Node('blank', ' ')]

The pick functions always picks the first node fulfilling the criterion:

>>> sentence.pick('word')
Node('word', 'This')

Or, reversing the direction:

>>> last_match = sentence.pick('word', reverse=True)
>>> last_match
Node('word', 'Palace')

While nodes contain references to their children, a node does not contain a references to its parent. The method pick_pach() (described below) can be used to pick the complete list of ancestors leading up to and including a particular node. As a last resort (because it is slow) the node’s parent can be found by the find_parent-function which must be executed on any ancestor of the node:

>>> printw(sentence.find_parent(last_match))
Node('phrase', (Node('word', 'Buckingham'), Node('blank', ' '),
 Node('word', 'Palace')))

Sometimes, one only wants to select or pick particular children of a node. Apart from accessing these via node.children, there is a tuple-like access to the immediate children via indices and slices:

>>> sentence[0]
Node('word', 'This')
>>> printw(sentence[-1])
Node('phrase', (Node('word', 'Buckingham'), Node('blank', ' '),
 Node('word', 'Palace')))
>>> sentence[0:3]
(Node('word', 'This'), Node('blank', ' '), Node('word', 'is'))
>>> sentence.index('blank')
1
>>> sentence.indices('word')
(0, 2)

as well as a dictionary-like access, with the difference that a “key” may occur several times:

>>> sentence['word']
(Node('word', 'This'), Node('word', 'is'))
>>> printw(sentence['phrase'])
Node('phrase', (Node('word', 'Buckingham'), Node('blank', ' '),
 Node('word', 'Palace')))

Be aware that always all matching values will be returned and that the return type can accordingly be either a tuple of Nodes or a single Node! An IndexError is raised in case the “key” does not exist or an index is out of range.

It is also possible to delete children conveniently with Python’s del-operator:

>>> s_copy = copy.deepcopy(sentence)
>>> del s_copy['blank'];  print(s_copy)
ThisisBuckingham Palace
>>> del s_copy[2][0:2]; print(s_copy.serialize())
(sentence (word "This") (word "is") (phrase (word "Palace")))

One can also use the Node.pick_child() or Node.select_children()-method in order to select children with an arbitrary condition:

>>> tuple(sentence.select_children(lambda nd: nd.content.find('s') >= 0))
(Node('word', 'This'), Node('word', 'is'))
>>> printw(sentence.pick_child(lambda nd: nd.content.find('i') >= 0,
...                            reverse=True))
Node('phrase', (Node('word', 'Buckingham'), Node('blank', ' '),
 Node('word', 'Palace')))

Often, one is neither interested in selecting form the children of a node, nor from the entire subtree, but from a certain “depth-range” of a tree-structure. Say, you would like to pick all word’s from the sentence that are not inside a phrase and assume at the same time that words may occur in nested structures:

>>> nested = copy.deepcopy(sentence)
>>> i = nested.index(lambda nd: nd.content == 'is')
>>> nested[i].result = Node('word', nested[i].result)
>>> nested[i].name = 'italic'
>>> nested[0:i + 1]
(Node('word', 'This'), Node('blank', ' '), Node('italic', (Node('word', 'is'))))

Now, in order to select all words on the level of the sentence, but excluding any sub-phrases, it would not be helpful to use methods based on the selection of children (i.e. immediate descendants), because the word nested in an ‘italic’-Node would be missed. For this purpose the various selection()-methods of class node have a skip_subtree-parameter which can be used to block subtrees from the iterator based on a criteria (which can be a function, a tag name or set of tag names and the like):

>>> tuple(nested.select('word', skip_subtree='phrase'))
(Node('word', 'This'), Node('word', 'is'))

Navigating “uptree”¶

Instead of keeping a link within each node to its parent, it is much more elegant to keep track of a node’s ancestry by using the lineage or “tree-path” which is a simple list of ancestors starting with the root-node and including the node itself as its last item. For most search methods such as select() or pick(), there exists a pendant that returns this path instead of just the node itself:

>>> last_path = sentence.pick_path('word', reverse=True)
>>> last_path[-1] == last_match
True
>>> last_path[0] == sentence
True
>>> pp_path(last_path)
'sentence <- phrase <- word'

One can also think of a tree-path as a breadcrumb-trail or, rather, ant-trail that “points” to a particular part of text by marking the path from the root to the node, the content of which contains this text. This node does not need to be a leaf node, but can be any branch-node on the way from the root to the leaves of the tree. When analyzing or transforming a tree-structured text, it is often helpful to “zoom” in and out of a particular part of text (pointed to by a path) or to move forward and backward from a particular location (again represented by a path).

The next_path() and prev_path()-functions allow to move one step forward or backward from a given path:

>>> pointer = prev_path(last_path)
>>> pp_path(pointer, with_content=-1)
'sentence "This is Buckingham Palace" <- phrase "Buckingham Palace" <- blank " "'

prev_path() and next_path() automatically zoom out by one step, if they move past the first or last child of the last but one node in the list:

>>> pointer = prev_path(pointer)
>>> pp_path(pointer, with_content=-1)
'sentence "This is Buckingham Palace" <- phrase "Buckingham Palace" <- word "Buckingham"'
>>> pp_path(prev_path(pointer), with_content=-1)
'sentence "This is Buckingham Palace" <- blank " "'

Thus:

>>> next_path(prev_path(pointer)) == pointer
False
>>> pointer = prev_path(pointer)
>>> pp_path(next_path(pointer), with_content=-1)
'sentence "This is Buckingham Palace" <- phrase "Buckingham Palace"'

The reason for this behavior is that prev_path() and next_path() try to move to the path which contains the string content preceding or succeeding that of the given path. Therefore, these functions move to the next sibling on the same branch, rather traversing the complete tree like the select() and select_path()- methods of the Node-class. However, when moving past the first or last sibling, it is not clear what the next node on the same level should be. To keep it easy, the function “zooms out” and returns the next sibling of the parent.

It is, of course, possible to zoom back into a path:

>>> pp_path(zoom_into_path(next_path(pointer), FIRST_CHILD, steps=1),
...                with_content=-1)
'sentence "This is Buckingham Palace" <- phrase "Buckingham Palace" <- word "Buckingham"'

Often it is preferable to move through the leaf-nodes and their paths right away. Functions like next_leaf_path() and prev_leaf_path() provide syntactic sugar for this case:

>>> pointer = next_leaf_path(pointer)
>>> pp_path(pointer, with_content=-1)
'sentence "This is Buckingham Palace" <- phrase "Buckingham Palace" <- word "Buckingham"'

It is also possible to inspect just the string content surrounding a path, rather than its structural environment:

>>> ensuing_str(pointer)
' Palace'
>>> assert foregoing_str(pointer, length=1) == ' ', "Blank expected!"

It is also possible to systematically iterate through the paths forward or backward - just like the node.select_path()-method, but starting from an arbitrary path, instead of the one end or the other end of the tree rooted in node:

>>> t = parse_sxpr('(A (B 1) (C (D (E 2) (F 3))) (G 4) (H (I 5) (J 6)) (K 7))')
>>> pointer = t.pick_path('G')
>>> printw([pp_path(ctx, with_content=1)
...         for ctx in select_path(pointer, ANY_PATH, include_root=True)])
['A <- G "4"', 'A <- H "56"', 'A <- H <- I "5"', 'A <- H <- J "6"',
 'A <- K "7"', 'A "1234567"']
>>> printw([pp_path(ctx, with_content=1)
...         for ctx in select_path(
...             pointer, ANY_PATH, include_root=True, reverse=True)])
['A <- G "4"', 'A <- C "23"', 'A <- C <- D "23"', 'A <- C <- D <- F "3"',
 'A <- C <- D <- E "2"', 'A <- B "1"', 'A "1234567"']

Another important difference, besides the starting point, is that the select()-generators of the nodetree-module traverse the tree post-order (or “depth first”), while the respective methods of the Node-class traverse the tree pre-order. See the difference:

>>> l = [pp_path(ctx, with_content=1)
...      for ctx in t.select_path(ANY_PATH, include_root=True)]
>>> l[l.index('A <- G "4"'):]
['A <- G "4"', 'A <- H "56"', 'A <- H <- I "5"', 'A <- H <- J "6"', 'A <- K "7"']
>>> l = [pp_path(ctx, with_content=1)
...      for ctx in t.select_path(ANY_PATH, include_root=True, reverse=True)]
>>> printw(l[l.index('A <- G "4"'):])
['A <- G "4"', 'A <- C "23"', 'A <- C <- D "23"', 'A <- C <- D <- F "3"',
 'A <- C <- D <- E "2"', 'A <- B "1"']

Content Mappings¶

Basics¶

For finding a passage in the text or identifying certain textual features like, for example, matching brackets, traversing the document-tree is not really an option, if only, because a passage may extend over several nodes, possibly even on different levels of the tree hierarchy. For such cases it is possible to generate a content mapping that maps text positions within the pure string-content to the paths of the leaf-nodes to which they belong. This mapping can be thought of as a “string-view” on the tree:

>>> ctx_mapping = ContentMapping(sentence)
>>> print(ctx_mapping.content)
This is Buckingham Palace
>>> print(ctx_mapping)
0 -> sentence, word "This"
4 -> sentence, blank " "
5 -> sentence, word "is"
7 -> sentence, blank " "
8 -> sentence, phrase, word "Buckingham"
18 -> sentence, phrase, blank " "
19 -> sentence, phrase, word "Palace"

Note that the path in the first line of the output is different from the path in the third line, although the sequence of node-names that appears in the pretty-printed version shown here is the same, i.e. “sentence, word”, because the paths really consist of different Nodes.

Now let’s find all letters that are followed by a whitespace character:

>>> import re
>>> locations = [m.start() for m in re.finditer(r'\w ', ctx_mapping.content)]
>>> targets = [ctx_mapping.get_path_and_offset(loc) for loc in locations]

Tip

Other than the node’s content property, the content mappings content field is not generated on the fly every time it is retrieved, but only when instantiating or rebuilding the mapping. Performance-wise it is advisable to always use the content mapping’s content field.

The target returned by get_path_and_offset() is a tuple of the target path and the relative position of the location that falls within this path:

>>> [(pp_path(path), relative_pos) for path, relative_pos in targets]
[('sentence <- word', 3), ('sentence <- word', 1), ('sentence <- phrase <- word', 9)]

Now, the structured text can be manipulated at the precise locations where string search yielded a match. Let’s turn our text into a little riddle by replacing the letters of the leaf-nodes before the match locations with three dots:

>>> for path, pos in targets:
...     path[-1].result = '...' + path[-1].content[pos:]
>>> str(sentence)
'...s ...s ...m Palace'

The positions resemble the text positions of the text represented by the tree at the very moment when the path mapping is generated, not the source positions captured by the pos-property of the node-objects! This also means that the mapping becomes outdated, when the tree is being restructured. Unless you use the methods provided by ContentMapping itself in order to make changes to the tree, you need to either call rebuild_mapping() to update the content mapping at the affected places or instantiate an entirely new content mapping.

Restricted Mappings¶

A very powerful feature of context mappings is that they allow to restrict the string view onto the document tree to selected parts of the tree, which makes it possible to exclude these parts from the search, e.g.:

>>> xml = '''<doc><p>In München<footnote><em>München</em> is the German
... name of the city of Munich</footnote> is a Hofbräuhaus</p></doc>'''
>>> tree = parse_xml(xml)

Now, assume you would like to find all occurrences of “München” in the main text but not in the footnotes, then you can issue a context mapping that ignores all footnotes:

>>> cm = ContentMapping(tree, select=LEAF_PATH, ignore={'footnote'})
>>> list(re.finditer('München', cm.content))
[<re.Match object; span=(3, 10), match='München'>]

In order to restrict the content mapping to certain parts of the tree, the ContentMapping-class takes a same pair of path selectors similar to the “criteria” and “skip_subtree” parameters of Node.select_path() and Node.pick(). However, there is a subtle but important difference: The “select”-parameter of the ContentMapping-class must only accept leaf-paths! Otherwise a ValueError will be raised.

In contrast to the restricted content mapping, the search in the string-content of the entire tree yields:

>>> printw(list(re.finditer('München', tree.content)))
[<re.Match object; span=(3, 10), match='München'>, <re.Match object; span=(10,
 17), match='München'>]

Although, the string locations in a context mappings that has been restricted to certain parts of the tree have shifted with respect to the string locations in the full document tree, there is no need to worry that the mapped location within the tree had changed:

>>> tree_pos = tree.content.find('Hofbräuhaus')
>>> print(tree_pos)
64
>>> tm = ContentMapping(tree)
>>> tm.content.find('Hofbräuhaus')  # should be the same as above
64
>>> cm_pos = cm.content.find('Hofbräuhaus')
>>> print(cm_pos)
16

The string-position is not the same, because the mapping cm omits the footnote-text. Yet, the path and offset within the tree remain the same. (Remember that the :Text-nodes are “anonymous” nodes that the XML-parser inserts for the character data of XML-elements with mixed content.):

>>> cm_path, cm_offset = cm.get_path_and_offset(cm_pos)
>>> print(pp_path(cm_path, delimiter=', '), '->', cm_offset)
doc, p, :Text -> 6
>>> tm_path, tm_offset = tm.get_path_and_offset(tree_pos)
>>> print(pp_path(tm_path, delimiter=', '), '->', tm_offset)
doc, p, :Text -> 6
>>> assert tm_path == cm_path  # paths are really the same sequence of nodes

This can easily be confirmed by looking at the complete mappings in direct comparison. First the unrestricted mapping:

>>> print(tm)
-> doc, p, :Text "In München"
-> doc, p, footnote, em "München"
-> doc, p, footnote, :Text " is the German" "name of the city of Munich"
-> doc, p, :Text " is a Hofbräuhaus"

Now, the mapping that omits the footnotes:

>>> print(cm)
0 -> doc, p, :Text "In München"
10 -> doc, p, :Text " is a Hofbräuhaus"

Note, that the numbers at the beginning of each line represent the string position which is different for the same path, but this has no bearing on the offsets which count from the content-mapping-specific position of each path in the content mapping.

Conversely, we could also have restricted the content mapping only to the footnote(s):

>>> fm = ContentMapping(tree, select=leaf_paths('footnote'), ignore=NO_PATH)
>>> print(fm)
0 -> doc, p, footnote, em "München"
7 -> doc, p, footnote, :Text " is the German" "name of the city of Munich"

Here, the parameter ignore=NO_PATH has to be understood as “from the selected paths do not ignore any paths”. Note, the leaf_path()-filter used to define the value of the select-argument. ContentMapping raises a ValueError if the select-criterion allows paths that are not leaf-path. The leaf_paths-filter is a simple, though slightly costly in terms of speed, means of turning any criteria into a “criteria is true for path AND path is a leaf-path”-condition.

Now, let’s look for the string “München” in the footnotes only:

>>> i = fm.content.find('München')
>>> path, offset = fm.get_path_and_offset(i)
>>> pp_path(path, 1)
'doc <- p <- footnote <- em "München"'
>>> print(offset)
0

We can now manipulate the tree through the path and offset. Let’s insert the word “Stadt” in front of “München”. We do so by changing the result of the leaf node of the path to the term at the given offset:

>>> path[-1].result = path[-1].result[:offset] + "Stadt " + \
...                   path[-1].result[offset:]

In this particular case, because the offset is zero, we could also have written "Stadt " + path[-1].result, but the formula above also works for the general case where cannot be sure that the offset will always be 0.

We expect that the change is reflected in the tree at the right position, i.e. inside the footnote:

>>> printw(tree.as_xml(inline_tags={'doc'}))
<doc><p>In München<footnote><em>Stadt München</em> is the German
name of the city of Munich</footnote> is a Hofbräuhaus</p></doc>

As mentioned earlier, the content mapping should be considered tainted if the underlying tree has been changed by any other means than the methods of the ContentMapping-object itself. In order to rebuild the affected path of the content mapping ContentMapping.rebuild_mapping() must be called for the affected sections of the content mapping which are defined by the first and last path index of the content mapping where a change has taken place:

>>> fm.rebuild_mapping(i, i)
>>> print(fm)
0 -> doc, p, footnote, em "Stadt München"
13 -> doc, p, footnote, :Text " is the German" "name of the city of Munich"

Limitations¶

As of now, a limitation of the content mappings provided by DHParser.nodetree consists in the fact that they remain completely agnostic with respect to any textual meaning of the nodes. For example assume that the node-name “pB” signifies a page break, which implies that there is a gap between the two parts separated by the page break. However, because this is considered part of the meaning of “pb” it may not be required by the encoding guide-lines for the document that a gap, say, a blank character or a linefeed is also redundantly encoded in the string content of the document as well. (It may even be forbidden to do so!) But then a search on the string content may miss phrases separated by a page break:

>>> tree = parse_xml('<doc>xyz New<pb/>York xyz</doc>')
>>> print(tree.content)
xyz NewYork xyz
>>> re.search(r'New\s+York', tree.content)

Currently, the only remedy is to either allow redundant encoding of textual meanings within the string-content or adding specific nodes that carry the redundant textual meanings within their string-content and removing them again, after searches etc. have been finished.

Markup insertion¶

Class ContentMapping provides powerful markup-methods that allows you to add markup at any position you like simply by passing the start- and end-position in the string-representation of the document-tree and “automagically” taking care of such perils as cross-cutting tag-boundaries or overlapping hierarchies.

This solves a common challenge when processing tree structured text-data which consists in adding new nodes that cover certain ranges of the string content that may already have been covered by other elements. The problem is the same as adding further markup to an existing XML or HTML-document. In trivial cases like:

>>> trivial_xml = parse_xml("<trivial>Please mark up Stadt München "
...     "in Bavaria in this sentence.</trivial>")

we would hardly need any help by a library to markup a string “Stadt München”. But both to find certain sub-strings and to mark them up can easily become complicated:

>>> hard_xml = parse_xml("<hard>Please mark up Stadt\n<lb/>"
...     "<location><em>München</em> in Bavaria</location> in this "
...     "sentence.</hard>")

Let’s start with the simple case to see how searching and marking strings works with DHParser:

>>> mapping = ContentMapping(trivial_xml)
>>> match = re.search(r"Stadt\s+München", mapping.content)
>>> _ = mapping.markup(match.start(), match.end(), "foreign",
...                    {'lang': 'de'})
>>> printw(trivial_xml.as_xml(inline_tags={'trivial'}))
<trivial>Please mark up <foreign lang="de">Stadt München</foreign>
 in Bavaria in this sentence.</trivial>

In order to search for the text-string, a regular expression is used rather than a simple search for “Stadt München”, because we cannot assume that it appears in exactly the same form in the text. For example, it could be broken up by a line break, e.g. “Stadt\nMünchen”.

Now, let’s try the more complicated case. Because we will try different configurations, we use copied of the tree “hard_xml”:

>>> hard_xml_copy = copy.deepcopy(hard_xml)
>>> mapping = ContentMapping(hard_xml_copy)
>>> match = re.search(r"Stadt\s+München", mapping.content)
>>> _ = mapping.markup(match.start(), match.end(), "foreign",
...                    {'lang': 'de'})
>>> xml_str = hard_xml_copy.as_xml(empty_tags={'lb'})
>>> print(xml_str)
<hard>
  Please mark up
  <foreign lang="de">
    Stadt
    <lb/>
  </foreign>
  <location>
    <foreign lang="de">
      <em>München</em>
    </foreign>
    in Bavaria
  </location>
  in this sentence.
</hard>

As can be seen the <foreign>-tag is split into two parts, because the markup runs across the border of another tag, in this case <location>. Note, that the <lb/>-tag lies inside the <foreign>-tag. But that makes sense, because it would also have been inside the <foreign>-tag, had there been no <location>-tag and no need to split. (Per default, the algorithm behaves somewhat “greedy”, which, however can be configured with a parameter with the same name passed to the constructor of class ContentMapping.)

But what if you don’t wand the <foreign>-tag to be split up in two or more parts, as the case may be. Well, in this case, you need to allow those tags, the borders of which the new markup runs across, to be split by that markup:

>>> hard_xml_copy = copy.deepcopy(hard_xml)
>>> divisibility_map = {'foreign': {'location', ':Text'},
...                     '*': {':Text'}}
>>> mapping = ContentMapping(hard_xml_copy, divisibility=divisibility_map)
>>> match = re.search(r"Stadt\s+München", mapping.content)
>>> _ = mapping.markup(match.start(), match.end(), "foreign",
...                    {'lang': 'de'})
>>> xml_str = hard_xml_copy.as_xml(empty_tags={'lb'})
>>> print(xml_str)
<hard>
  Please mark up
  <foreign lang="de">
    Stadt
    <lb/>
    <location>
      <em>München</em>
    </location>
  </foreign>
  <location>
    in Bavaria
  </location>
  in this sentence.
</hard>

See the difference? This time the <foreign>-element remains intact, while the <location>-element has been split. This behavior can be configures by the divisibility-map that is passed to the parameter divisibility of the ContentMapping-constructor. It maps elements (or, rather, their names) to sets of elements that can be cut by them. The asterisk * is a wildcard and contains those elements that can be cut by any other element. An element that does not appear in the value-set anywhere in the mapping cannot be cut by any other element. It is also possible to pass a simple set of element-names instead of a dictionary to the divisibility-parameter. In this case any element with a name in this set can be cut by any other element. Any element the name of which is not a member of the set cannot be cut when markup is added.

In cases where markup overlaps element-borders, it is unavoidable to decide which element will be divided and which not. It is a general limitation of tree structures that they do not allow overlapping hierarchies. In this particular example, it would most probably be more reasonable to keep the <location>-element intact, because locations should probably be recognizable as units, while this does not really seem to matter for a foreign language annotation.

The case may arise, though, where you cannot avoid splitting elements that form units. At this point you probably should consider using an entirely different data-structure, say, a graph. But if this is not an option, DHParser.nodetree allows you to mark split elements as belonging to the same “chain” of elements. In order to do so you can pass a chain_attr_name to the constructor of class ContentMapping. This is an (arbitrary) name for an attribute which will contain a unique short string that all elements (of the same name) belonging to one and the same chain share with each other, but not with any other elements. Let’s try this on the previous example:

>>> reset_chain_ID()  # just to ensure deterministic ID values for doctest

>>> hard_xml_copy = copy.deepcopy(hard_xml)
>>> match = re.search(r"Stadt\s+München", hard_xml_copy.content)
>>> divisibility_map = {'foreign': {'location', ':Text'},
...                     '*': {':Text'}}
>>> mapping = ContentMapping(hard_xml_copy, divisibility=divisibility_map,
...                          chain_attr_name="chain")
>>> _ = mapping.markup(match.start(), match.end(), "foreign",
...                    {'lang': 'de'})
>>> xml_str = hard_xml_copy.as_xml(empty_tags={'lb'})
>>> print(xml_str)
<hard>
  Please mark up
  <foreign lang="de">
    Stadt
    <lb/>
    <location chain="VZT">
      <em>München</em>
    </location>
  </foreign>
  <location chain="VZT">
    in Bavaria
  </location>
  in this sentence.
</hard>

Markup plays well together with restricted content mappings as the following example may show:

>>> tree = parse_xml("<doc>Please mark up Stadt\n<lb/>"
...     "<em>München</em><footnote>'Stadt <em>München</em>'"
...     " is German for 'City of Munich'</footnote> in Bavaria"
...     " in this sentence.</doc>")

Let’s assume we’d like to markup locations and text-passages in foreign languages, but only in the main text and not within footnotes and the like. For that purpose, we build a context mapping that is restricted to non-footnote-text:

>>> cm = ContentMapping(tree, select=LEAF_PATH, ignore='footnote',
...                     chain_attr_name='chain')
>>> print(cm.content)
Please mark up Stadt
München in Bavaria in this sentence.

Now, let’s assume for the sake of the example that we have list of location names to be marked up that contains the phrase “München in Bavaria”. So, we search for this phrase and add the required location markup:

>>> m = re.search(r"München\s+in\s+Bavaria", cm.content)
>>> print(m)
<re.Match object; span=(21, 39), match='München in Bavaria'>

>>> _ = cm.markup(m.start(), m.end(), 'location')
>>> print(tree.as_xml(empty_tags={'lb'}))
<doc>
  Please mark up Stadt
  <lb/>
  <location>
    <em>München</em>
    <footnote>
      'Stadt
      <em>München</em>
      ' is German for 'City of Munich'
    </footnote>
    in Bavaria
  </location>
  in this sentence.
</doc>

The <location>-element covers the entire span, including the footnote. This is to be expected as changes are always carried out on the full tree. Only, the mapping is restricted to certain parts of the document. Usually, this is also the desired behavior, though, admittedly, depending on the use case another behavior (e.g. splitting the <location>-element into one part before the <footnote>-element and one part after that element) might be preferable. Such cases are not covered by the markup-method of class ContentMapping.

Because, the <location>-element did not need to be split, it does not need and therefore does not have a “chain”-attribute.

Next, let’s add the <foreign>-element. (We substitute the value of its chain-attribute, so that the doctest does not break, when another random key is picked!):

>>> m = re.search(r'Stadt\s+München', cm.content)
>>> _ = cm.markup(m.start(), m.end(), 'foreign', lang="de")
>>> print(tree.as_xml(empty_tags={'lb'}))
<doc>
  Please mark up
  <foreign lang="de" chain="RZC">
    Stadt
    <lb/>
  </foreign>
  <location>
    <foreign lang="de" chain="RZC">
      <em>München</em>
    </foreign>
    <footnote>
      'Stadt
      <em>München</em>
      ' is German for 'City of Munich'
    </footnote>
    in Bavaria
  </location>
  in this sentence.
</doc>

Here again, one might ask, why the <foreign>-tag contains the <lb>-tag, but the choice makes sense, because if put together again, it should cover the complete stretch including the line-break. Again, different use cases and different choices are imaginable which, however, are not covered by the ContentMapping.markup()-method.

Error Messages¶

Although errors are typically located at a particular point or range of the source code, DHParser treats them as global properties of the syntax tree (albeit with a location), rather than attaching them to particular nodes. This has two advantages:

When restructuring the tree and removing or adding nodes during the abstract-syntax-tree-transformation and possibly further tree-transformation, error messages do not accidentally get lost.
It is not necessary to add another slot to the Node class for keeping an error list which most of the time would remain empty, anyway.

In order to track errors and other global properties, Module nodetree provides the RootNode-class. The root-object of a syntax-tree produced by parsing is of type RootNode. If a root node needs to be created manually, it is necessary to create a Node-object and either pass it to RootNode as parameter on instantiation or, later, to the swallow()-method of the RootNode-object:

>>> document = RootNode(sentence, str(sentence))

The second parameter is normally the source code. In this example we simply use the string representation of the syntax-tree originating in sentence. Before any errors can be added the source-position fields of the nodes of the tree must have be been initialized. Usually, this is done by the parser. Since the syntax-tree in this example does not stem from a parsing-process, we have to do it manually:

>>> _ = document.with_pos(0)

Now, let’s mark all “word”-nodes that contain non-letter characters with an error-message. There should be plenty of them, because, earlier, we have replaced some of the words partially with “…”:

>>> import re
>>> len([document.new_error(node, "word contains illegal characters")
...      for node in document.select('word')
...          if re.fullmatch(r'\w*', node.content) is None])
3
>>> for error in document.errors_sorted:  print(error)
1:1: Error (1000): word contains illegal characters
1:6: Error (1000): word contains illegal characters
1:11: Error (1000): word contains illegal characters

The format of the string representation of Error-objects resembles that of compilers and is understood by many Text-Editors which mark the errors in the source code.

Attribute-Handling¶

While the “Node.attr”-field can be used to store data of any kind, it will often just serve to store XML-attributes, the value of which is always a string. The DHParser.nodetree-module provides a mini-API to simplify typical use cases of XML-attributes.

One important use case of attributes is to add or remove css-classes to the “class”-attribute. The “class”-attribute understood as containing a set of whitespace delimited strings. Module “nodetree” provides a few functions to simplify class-handling:

>>> paragraph = Node('p', 'veni vidi vici')
>>> add_class(paragraph, 'smallprint')
>>> paragraph.attr['class']
'smallprint'

Although the class-attribute is filled with a sequence of strings, it should behave like a set of strings. For example, one and the same class name should not appear twice in the class attribute:

>>> add_class(paragraph, 'smallprint justified')
>>> paragraph.attr['class']
'smallprint justified'

Plus, the order of the class strings does not matter, when checking for elements:

>>> has_class(paragraph, 'justified smallprint')
True
>>> remove_class(paragraph, 'smallprint')
>>> has_class(paragraph, 'smallprint')
False
>>> has_class(paragraph, 'justified smallprint')
False
>>> has_class(paragraph, 'justified')
True

The same logic of treating blank separated sequences of strings as sets can also be applied to other attributes:

>>> car = Node('car', 'Porsche')
>>> add_token_to_attr(car, "Linda Peter", 'owner')
>>> car.attr['owner']
'Linda Peter'

Or, more generally, to strings containing whitespace-separated substrings:

>>> add_token('Linda Paula', 'Peter Paula')
'Linda Paula Peter'

Classes and Functions-Reference¶

The full documentation of all classes and functions can be found in module DHParser.nodetree. The following table of contents lists the most important of these:

class Node¶

Node: the central building-block of a node-tree
- result: either the child nodes or the node’s string content
- children: the node’s immediate children or an empty tuple
- content: the concatenated string content of all descendants
- tag_name: the node’s name
- attr: the dictionary of the node’s attributes
- pos: the source-code position of this node, in case the node stems from a parsing process
  
  Navigation
- select(): Selects nodes from the tree of
  descendants.
- pick(): Picks a particular node from the tree of
  descendants.
- locate(): Finds the leaf-node covering a
  particular location of string content of the tree originating in this node.
- select_path(): Selects paths
  from the tree of descendants.
- pick_path(): Picks a particular path from
  the tree of descendants.
- locate_path(): Finds the path of the
  leaf-node covering a particular location of string content of the tree originating in this node.
  
  Serialization
- as_sxpr(): Serializes the tree originating in a node as S-expression.
- as_xml(): Serializes the tree as XML.
- as_json(): Serializes the tree as JSON.
  
  XML-exchange
- as_etree(): Converts the tree to an
  XML-ElementTree as defined by the respective module of Python’s standard library.
- from_etree(): Converts an XML-ElementTree
  into a tree of Node-objects.
  
  Evaluation
- evaluate(): “Evaluates” a tree by running
  one of a set of functions on each node depending on its tag-name.

Reading serialized trees¶

parse_sxpr(): Converts any S-expression string to a
tree of nodes.
parse_xml(): Converts any XML-document to a tree of
nodes.
parse_json(): Converts a JSON-document that has
previously been created with as_json() from a tree of nodes back to a tree of nodes.
deserialize(): Tries to guess the data-type of a
string and then calls any of the above deserialization-functions accordingly.

Traversing trees via paths¶

prev_path(): Returns the path
preceding a given path.
next_path(): Returns the path
following a given path.
generate_content_mapping(): Generates a
path-mapping for all leaf-nodes of a tree, i.e. a dictionary mapping the current text position of each leaf-node (not the source-code position!) to the leaf-node itself.
get_path_and_offset(): Returns the leaf-node for a
given text position and the number of characters of this position into the leaf-node.

Attribute-handling¶

has_token_attr(): Checks whether an attribute of a node
contains one or more tokens, i.e. blank separated sequences of letters.
ad_token_to_attr(): Adds a token to a particular
attribute of a node.
ad_token_to_attr(): Removes a token from a particular
attribute of a node.
has_class(), has_class(),
has_class(): the same as above, only that these methods manipulate the tokens specifically of the class-attribute

class RootNode¶

Any Node-object can be considered as the origin of a tree and none of the “navigation”-functions requires a tree of nodes to start with a RootNode-object. However, RootNode-objects provide support for certain “global” aspects of a tree like keeping track of the source code with line and column numbers and adding error messages. RootNode-objects can either be initialized with a code node that will then be replaced by the root-node or swallow a a tree originating in a common node later.

RootNode: additional functionality for a tree of nodes
- errors: a list of errors
- errors_sorted: the errors sorted by
  their position in the source code instead of the time of their having been added
- inline_tags: a set of tags that will
  be printed on a single line with their content when serializing. (This helps to avoid undesired whitespace when exporting to HTML!)
- string_tags: a set of tags that will be
  converted to simple strings that appear as mixed content inside their parent when serializing as XML
- empty_tags: a set of tags that will be
  rendered as empty tags, e.g. <mytag /> when serializing as XML
- swallow(): Can be called once in the
  lifetime of the RootNode-object to assign this root-node to an existing tree of nodes.
- new_error(): Creates and adds new error.
- customized_XML(): Serializes the tree as
  XML taking into account the XML-customization attributes of the RootNode-object.

class ContentMapping¶

ContentMapping represents a path-mapping of the string-content of all or a specific selection of the leave-nodes of a tree. A content-mapping is an ordered mapping of the first text position of every (selected) leaf-node to the path of this node.

The class provides methods for mapping string positions to paths and offsets (relative to the beginning of the leaf-node of the path)

ContentMapping: Mapping the tree to its string-content
- get_path_and_offset(): Maps
  positions in string-content of the ContentMapping to the leaf-path into which they fall
- iterate_paths(): Yields all paths from
  position start_pos up to and including position end_pos.
- insert_node(): Inserts a node at a
  particular text-position.
- markup(): Adds markup (i.e. an element)
  to a particular stretch of text.