Document-Trees¶
Module nodetree encapsulates the functionality for creating and handling document trees and, in particular, syntax trees generated by a parser. This includes serialization and deserialization of node-trees, navigating and searching node-trees as well as annotating node-trees with attributes and error messages.
Node-objects¶
Syntax trees are composed of Node-objects which are linked uni-directionally from parent to children. Nodes can contain either child-nodes, in which case they are informally called “branch-nodes”, or text-strings, in which case they are informally called “leaf-nodes”, but not both at the same time. (There is no mixed content as in XML!)
Apart from their content, the most important property of a Node-object is its name. Nodes are initialized with their name and content as arguments:
>>> from DHParser.nodetree import *
>>> number_1 = Node('number', "5")
>>> number_1.name
'number'
The Node-object number_1 now has the tag-name “number” and the content “5”. Since the content is a string and not a tuple of child-nodes, the node constructed is a leaf-node.
(By convention, if the tag-name of a node starts with a colon “:”, the node is considered “anonymous”. This distinction is helpful when a tree of nodes is generated in a parsing process to distinguish nodes that contain important pieces of data from nodes that merely contain delimiters or structural information.)
Several nodes can be connected to a tree:
>>> number_2 = Node('number', "4")
>>> addition = Node('add', (number_1, number_2))
Trees spanned by a node can conveniently be serialized as S-expressions (well-known from the computer languages “lisp” and “scheme”):
>>> print(addition.as_sxpr())
(add (number "5") (number "4"))
It is also possible to serialize nodes as an XML-snippet:
>>> print(addition.as_xml())
<add>
  <number>5</number>
  <number>4</number>
</add>
or as indented tree:
>>> print(addition.as_tree())
add
  number "5"
  number "4"
or as JSON-data (see further below). Trees can also be deserialized from any of these formats with the exception of the indented tree (see below).
In order to test whether a Node is a leaf-node, one can check for the absence of children:
>>> node = Node('word', 'Palace')
>>> assert not node.children
The data of a node can be queried by reading the result-property:
>>> node.result
'Palace'
The result is always a string or a tuple of Nodes, even if the node-object has been initialized with a single node:
>>> parent = Node('phrase', node)
>>> parent.result
(Node('word', 'Palace'),)
The result-property can be assigned to, in order to change the data of a node:
>>> parent.result = (Node('word', 'Buckingham'), Node('blank', ' '), node)
>>> print(parent.as_sxpr())
(phrase (word "Buckingham") (blank " ") (word "Palace"))
Content-equality of Nodes must be tested with the equals()-method. The equality operator == tests merely for the identity of the node-object, not for the equality of the content of two different node-objects:
>>> n1 = Node('dollars', '1')
>>> n2 = Node('dollars', '1')
>>> n1.equals(n2)
True
>>> n1 == n2
False
An empty node is always a leaf-node, that is, if initialized with an empty tuple, the node’s result will actually be the empty string:
>>> empty = Node('void', ())
>>> empty.result
''
>>> assert empty.equals(Node('void', ''))
Next to the result-property, a node’s content can be queried with either its children-property or its content-property. The former yields the tuple of child-nodes. The latter yields the string-content of the node, which in the case of a “branch-node” is the (recursively generated) concatenated string-content of all of its children:
>>> node.content
'Palace'
>>> node.children
()
>>> parent.content
'Buckingham Palace'
>>> parent.children
(Node('word', 'Buckingham'), Node('blank', ' '), Node('word', 'Palace'))
Both the content-property and the children-property are read-only-properties. In order to change the data of a node, its result-property must be assigned to (as shown above).
Just like HTML- or XML-tags, nodes can be annotated with attributes. Unlike XML and HTML, however, the value of these attributes can be of any type, not only strings. The only requirement is that the value is serializable as a string. Be aware, though, of the possible loss of information when serializing nodes or converting nodes to ElementTree-elements if there are attributes with non-string values! Attributes are stored in an ordered dictionary that maps string identifiers, i.e. the attribute names, to the content of the attributes. This dictionary can be accessed via the attr-property:
>>> node.attr['price'] = 'very high'
>>> print(node.as_xml())
<word price="very high">Palace</word>
When serializing as S-expressions, attributes are shown as a nested list marked with a “tick”:
>>> print(node.as_sxpr())
(word `(price "very high") "Palace")
Attributes can be queried via the has_attr() and get_attr()-methods. This is to be preferred over accessing the attr-property for querying, because the attribute dictionary is created lazily on the first access of the attr-property:
>>> node.has_attr('price')
True
>>> node.get_attr('price', '')
'very high'
>>> parent.get_attr('price', 'unknown')
'unknown'
If called with no parameters or an empty string as attribute name, has_attr() returns True if at least one attribute is present:
>>> parent.has_attr()
False
Attributes can be deleted like dictionary entries:
>>> del node.attr['price']
>>> node.has_attr('price')
False
Node-objects contain a special “write once, read afterwards”-property named pos that is meant to capture the source code position of the content represented by the Node. Usually, the pos values are initialized with the corresponding source code location by the parser.
The main purpose of keeping source-code locations in the node-objects is to equip the messages of errors that are detected in later processing stages with source code locations. In later processing stages the tree may already have been reshaped and its string-content may have been changed, say, by normalizing whitespace or dropping delimiters.
Before the pos-field can be read, it must have been initialized with the with_pos-method, which recursively initializes the pos-fields of the child-nodes according to the offsets of their string values within the overall content:
>>> import copy; essentials = copy.deepcopy(parent)
>>> print(essentials.with_pos(0).as_xml(src=essentials.content))
<phrase line="1" col="1">
  <word line="1" col="1">Buckingham</word>
  <blank line="1" col="11"> </blank>
  <word line="1" col="12">Palace</word>
</phrase>
>>> essentials[-1].pos, essentials.content.find('Palace')
(11, 11)
>>> essentials.result = tuple(child for child in essentials.children
... if child.name != 'blank')
>>> print(essentials.as_xml(src=essentials.content))
<phrase line="1" col="1">
  <word line="1" col="1">Buckingham</word>
  <word line="1" col="12">Palace</word>
</phrase>
>>> essentials[-1].pos, essentials.content.find('Palace')
(11, 10)
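The offset arithmetic performed by with_pos() can be illustrated in plain Python. This is a simplified sketch (assign_positions is a hypothetical helper, not part of DHParser): each leaf starts where the content of the previous leaf ended.

```python
def assign_positions(leaves, start=0):
    """Compute source positions for a flat sequence of (name, content)
    leaf pairs: each leaf starts where the previous content ended."""
    positions = []
    pos = start
    for name, content in leaves:
        positions.append((name, pos))
        pos += len(content)
    return positions

# "Buckingham" at pos 0, " " at pos 10, "Palace" at pos 11,
# matching the line/col-attributes in the XML above.
assign_positions([('word', 'Buckingham'), ('blank', ' '), ('word', 'Palace')])
```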
Serialization¶
Syntax trees can be serialized as S-expressions, XML, JSON and indented text.
Module ‘nodetree’ also contains a few simple parsers (parse_sxpr(), parse_xml() and parse_json()) to convert S-expressions, XML-snippets or JSON-objects into trees composed of Node-objects.
In order to make it easier to parameterize serialization, the Node-class also defines a generic serialize()-method next to the more specialized as_sxpr()-, as_json()- and as_xml()-methods:
>>> s = ('(sentence (word "This") (blank " ") (word "is") (blank " ") '
... '(phrase (word "Buckingham") (blank " ") (word "Palace")))')
>>> sentence = parse_sxpr(s)
>>> print(sentence.serialize(how='indented'))
sentence
  word "This"
  blank " "
  word "is"
  blank " "
  phrase
    word "Buckingham"
    blank " "
    word "Palace"
>>> sxpr = sentence.serialize(how='sxpr')
>>> round_trip = parse_sxpr(sxpr)
>>> assert sentence.equals(round_trip)
When serializing as XML, there will be no mixed-content and, likewise, no empty tags per default, because these do not exist in DHParser’s data model:
>>> print(sentence.as_xml())
<sentence>
  <word>This</word>
  <blank> </blank>
  <word>is</word>
  <blank> </blank>
  <phrase>
    <word>Buckingham</word>
    <blank> </blank>
    <word>Palace</word>
  </phrase>
</sentence>
However, mixed-content can be simulated with the string_tags-parameter of the as_xml()-method:
>>> print(sentence.as_xml(inline_tags={'sentence'}, string_tags={'word', 'blank'}))
<sentence>This is <phrase>Buckingham Palace</phrase></sentence>
The inline_tags-parameter ensures that all listed tags and contained tags will be printed on a single line. This is helpful when opening the XML-serialization in an internet-browser in order to avoid spurious blanks when a line-break occurs in the HTML/XML-source.
Finally, empty tags that do not have a closing tag (e.g. <br />) can be declared as such with the empty_tags-parameter.
Note that using string_tags can lead to a loss of information. A loss of information is inevitable if, as in the example above, more than one tag is listed in the string_tags-set passed to the as_xml()-method. Deserializing the XML-string yields:
>>> tree = parse_xml(
... '<sentence>This is <phrase>Buckingham Palace</phrase></sentence>',
... string_tag='MIXED')
>>> print(tree.serialize(how='indented'))
sentence
  MIXED "This is "
  phrase "Buckingham Palace"
ElementTree-Exchange¶
Although DHParser offers rich support for tree-transformation, the wish may arise to use standard XML-tools for tree-transformation as an alternative or supplement to the tools DHParser offers. One way to do so would be to serialize the tree of Node-objects, then use the XML-tools and, possibly, deserialize the transformed XML again.
A more efficient method, however, is to utilize any of the various Python-libraries for XML. In order to make this as easy as possible, trees of Node-objects can be converted to ElementTree-objects, either from the Python standard library or from the lxml-library:
>>> import xml.etree.ElementTree as ET # for lxml write: from lxml import etree as ET
>>> et = sentence.as_etree(ET)
>>> ET.dump(et)
<sentence><word>This</word><blank> </blank><word>is</word><blank> </blank><phrase><word>Buckingham</word><blank> </blank><word>Palace</word></phrase></sentence>
>>> tree = Node.from_etree(et)
>>> print(tree.equals(sentence))
True
The first parameter of as_etree() is the ElementTree-library to be used. If omitted, the standard-library-ElementTree is used.
Like the as_xml()-method, the as_etree()- and from_etree()-methods can be parameterized in order to support mixed-content and empty-tags:
>>> et = sentence.as_etree(ET, string_tags={'word', 'blank'})
>>> ET.dump(et)
<sentence>This is <phrase>Buckingham Palace</phrase></sentence>
Tree-Traversal¶
Transforming syntax trees is usually done by traversing the complete tree and applying specific transformation functions on each node. Modules “transform” and “compile” provide high-level interfaces and scaffolding classes for the traversal and transformation of syntax-trees.
Module nodetree does not provide any functions for transforming trees, but it provides low-level functions for navigating trees. These functions cover three different purposes:
Downtree-navigation within the subtree spanned by a particular node.
Uptree- and horizontal navigation to the neighborhood (“siblings”) and ancestry of a given node.
Navigation by looking at the string-representation of the tree.
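As a rough illustration of downtree-navigation, here is a minimal depth-first selection over a stand-in node class (MiniNode and select are hypothetical sketches, not DHParser's actual Node.select-API):

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class MiniNode:
    name: str
    children: Tuple['MiniNode', ...] = ()

def select(node, name):
    # Depth-first downtree traversal: yield every node (including the
    # start node itself) whose name matches.
    if node.name == name:
        yield node
    for child in node.children:
        yield from select(child, name)

tree = MiniNode('sentence', (MiniNode('word'), MiniNode('blank'),
                             MiniNode('phrase', (MiniNode('word'),))))
print([n.name for n in select(tree, 'word')])  # ['word', 'word']
```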
Content Mappings¶
Basics¶
For finding a passage in the text or identifying certain textual features like, for example, matching brackets, traversing the document-tree is not really an option, if only because a passage may extend over several nodes, possibly even on different levels of the tree hierarchy. For such cases it is possible to generate a content mapping that maps text positions within the pure string-content to the paths of the leaf-nodes to which they belong. This mapping can be thought of as a “string-view” on the tree:
>>> ctx_mapping = ContentMapping(sentence)
>>> print(ctx_mapping.content)
This is Buckingham Palace
>>> print(ctx_mapping)
0 -> sentence, word "This"
4 -> sentence, blank " "
5 -> sentence, word "is"
7 -> sentence, blank " "
8 -> sentence, phrase, word "Buckingham"
18 -> sentence, phrase, blank " "
19 -> sentence, phrase, word "Palace"
Note that the path in the first line of the output is different from the path in the third line, although the sequence of node-names that appears in the pretty-printed version shown here is the same, i.e. “sentence, word”, because the paths really consist of different Nodes.
Now let’s find all letters that are followed by a whitespace character:
>>> import re
>>> locations = [m.start() for m in re.finditer(r'\w ', ctx_mapping.content)]
>>> targets = [ctx_mapping.get_path_and_offset(loc) for loc in locations]
Tip
Unlike the node’s content-property, the content mapping’s content-field is not generated on the fly every time it is retrieved, but only when instantiating or rebuilding the mapping. Performance-wise, it is therefore advisable to always use the content mapping’s content-field.
The target returned by get_path_and_offset() is a tuple of the target path and the relative position of the location that falls within this path:
>>> [(pp_path(path), relative_pos) for path, relative_pos in targets]
[('sentence <- word', 3), ('sentence <- word', 1), ('sentence <- phrase <- word', 9)]
Now, the structured text can be manipulated at the precise locations where string search yielded a match. Let’s turn our text into a little riddle by replacing the letters of the leaf-nodes before the match locations with three dots:
>>> for path, pos in targets:
... path[-1].result = '...' + path[-1].content[pos:]
>>> str(sentence)
'...s ...s ...m Palace'
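The position lookup behind get_path_and_offset() can be sketched with Python’s bisect module, under the assumption that the mapping stores the sorted segment start positions shown above (find_segment is a hypothetical illustration, not DHParser's implementation):

```python
from bisect import bisect_right

# Segment start positions, copied from the mapping printed above.
starts = [0, 4, 5, 7, 8, 18, 19]

def find_segment(starts, pos):
    # The segment containing pos is the last one whose start is <= pos;
    # the relative offset is pos minus that segment's start.
    i = bisect_right(starts, pos) - 1
    return i, pos - starts[i]

print(find_segment(starts, 17))  # (4, 9): inside "Buckingham" at offset 9
```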
The positions refer to the text positions of the text represented by the tree at the very moment when the content mapping is generated, not the source positions captured by the pos-property of the node-objects! This also means that the mapping becomes outdated when the tree is being restructured. Unless you use the methods provided by ContentMapping itself in order to make changes to the tree, you need to either call rebuild_mapping() to update the content mapping at the affected places or instantiate an entirely new content mapping.
Restricted Mappings¶
A very powerful feature of content mappings is that they allow restricting the string-view onto the document tree to selected parts of the tree, which makes it possible to exclude parts of the tree from a search, e.g.:
>>> xml = '''<doc><p>In München<footnote><em>München</em> is the German
... name of the city of Munich</footnote> is a Hofbräuhaus</p></doc>'''
>>> tree = parse_xml(xml)
Now, assume you would like to find all occurrences of “München” in the main text, but not in the footnotes. Then you can create a content mapping that ignores all footnotes:
>>> cm = ContentMapping(tree, select=LEAF_PATH, ignore={'footnote'})
>>> list(re.finditer('München', cm.content))
[<re.Match object; span=(3, 10), match='München'>]
In order to restrict the content mapping to certain parts of the tree, the ContentMapping-class takes a pair of path selectors similar to the “criteria” and “skip_subtree” parameters of Node.select_path() and Node.pick(). However, there is a subtle but important difference: The “select”-parameter of the ContentMapping-class must only accept leaf-paths! Otherwise a ValueError will be raised.
In contrast to the restricted content mapping, the search in the string-content of the entire tree yields:
>>> printw(list(re.finditer('München', tree.content)))
[<re.Match object; span=(3, 10), match='München'>, <re.Match object; span=(10,
17), match='München'>]
Although the string locations in a content mapping that has been restricted to certain parts of the tree are shifted with respect to the string locations in the full document tree, there is no need to worry that the mapped locations within the tree have changed:
>>> tree_pos = tree.content.find('Hofbräuhaus')
>>> print(tree_pos)
64
>>> tm = ContentMapping(tree)
>>> tm.content.find('Hofbräuhaus') # should be the same as above
64
>>> cm_pos = cm.content.find('Hofbräuhaus')
>>> print(cm_pos)
16
The string-position is not the same, because the mapping cm omits the footnote-text. Yet, the path and offset within the tree remain the same. (Remember that the :Text-nodes are “anonymous” nodes that the XML-parser inserts for the character data of XML-elements with mixed content.):
>>> cm_path, cm_offset = cm.get_path_and_offset(cm_pos)
>>> print(pp_path(cm_path, delimiter=', '), '->', cm_offset)
doc, p, :Text -> 6
>>> tm_path, tm_offset = tm.get_path_and_offset(tree_pos)
>>> print(pp_path(tm_path, delimiter=', '), '->', tm_offset)
doc, p, :Text -> 6
>>> assert tm_path == cm_path # paths are really the same sequence of nodes
This can easily be confirmed by looking at the complete mappings in direct comparison. First the unrestricted mapping:
>>> print(tm)
0 -> doc, p, :Text "In München"
10 -> doc, p, footnote, em "München"
17 -> doc, p, footnote, :Text " is the German" "name of the city of Munich"
58 -> doc, p, :Text " is a Hofbräuhaus"
Now, the mapping that omits the footnotes:
>>> print(cm)
0 -> doc, p, :Text "In München"
10 -> doc, p, :Text " is a Hofbräuhaus"
Note that the numbers at the beginning of each line represent the string position, which is different for the same path. This has no bearing on the offsets, however, which count from the content-mapping-specific position of each path in the content mapping.
Conversely, we could also have restricted the content mapping only to the footnote(s):
>>> fm = ContentMapping(tree, select=leaf_paths('footnote'), ignore=NO_PATH)
>>> print(fm)
0 -> doc, p, footnote, em "München"
7 -> doc, p, footnote, :Text " is the German" "name of the city of Munich"
Here, the parameter ignore=NO_PATH has to be understood as “from the selected paths do not ignore any paths”. Note the leaf_paths()-filter used to define the value of the select-argument. ContentMapping raises a ValueError if the select-criterion allows paths that are not leaf-paths. The leaf_paths-filter is a simple, though in terms of speed slightly costly, means of turning any criterion into a “criterion is true for path AND path is a leaf-path”-condition.
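The effect of such a filter can be sketched as a wrapper that conjoins a given path-criterion with a leaf-test (leaf_paths_only and the stand-in class N are hypothetical illustrations, not DHParser's implementation):

```python
class N:
    # Minimal stand-in node with just a name and a children-tuple.
    def __init__(self, name, children=()):
        self.name, self.children = name, children

def leaf_paths_only(criterion):
    # Wrap a path-criterion so that it only matches paths ending in a
    # leaf-node, i.e. a node without children.
    def wrapped(path):
        return not path[-1].children and criterion(path)
    return wrapped

em = N('em')
footnote = N('footnote', (em,))
in_footnote = leaf_paths_only(lambda path: any(n.name == 'footnote' for n in path))
print(in_footnote((footnote, em)))  # True: a leaf-path inside a footnote
print(in_footnote((footnote,)))     # False: not a leaf-path
```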
Now, let’s look for the string “München” in the footnotes only:
>>> i = fm.content.find('München')
>>> path, offset = fm.get_path_and_offset(i)
>>> pp_path(path, 1)
'doc <- p <- footnote <- em "München"'
>>> print(offset)
0
We can now manipulate the tree through the path and offset. Let’s insert the word “Stadt” in front of “München”. We do so by inserting the new term into the result of the path’s leaf-node at the given offset:
>>> path[-1].result = path[-1].result[:offset] + "Stadt " + \
... path[-1].result[offset:]
In this particular case, because the offset is zero, we could also have written "Stadt " + path[-1].result, but the formula above also works for the general case where we cannot be sure that the offset will always be 0.
We expect that the change is reflected in the tree at the right position, i.e. inside the footnote:
>>> printw(tree.as_xml(inline_tags={'doc'}))
<doc><p>In München<footnote><em>Stadt München</em> is the German
name of the city of Munich</footnote> is a Hofbräuhaus</p></doc>
As mentioned earlier, the content mapping should be considered tainted if the underlying tree has been changed by any other means than the methods of the ContentMapping-object itself. In that case, ContentMapping.rebuild_mapping() must be called for the affected sections of the content mapping, which are defined by the first and last path index of the content mapping where a change has taken place:
>>> fm.rebuild_mapping(i, i)
>>> print(fm)
0 -> doc, p, footnote, em "Stadt München"
13 -> doc, p, footnote, :Text " is the German" "name of the city of Munich"
Limitations¶
As of now, a limitation of the content mappings provided by DHParser.nodetree consists in the fact that they remain completely agnostic with respect to any textual meaning of the nodes. For example, assume that the node-name “pb” signifies a page break, which implies that there is a gap between the two parts separated by the page break. However, because this is considered part of the meaning of “pb”, the encoding guidelines for the document may not require that the gap, say, a blank character or a linefeed, is also redundantly encoded in the string content of the document. (It may even be forbidden to do so!) But then a search on the string content may miss phrases separated by a page break:
>>> tree = parse_xml('<doc>xyz New<pb/>York xyz</doc>')
>>> print(tree.content)
xyz NewYork xyz
>>> re.search(r'New\s+York', tree.content)
Currently, the only remedy is either to allow redundant encoding of textual meanings within the string-content or to add specific nodes that carry the redundant textual meanings within their string-content and to remove them again after searches etc. have been finished.
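The “redundant encoding” remedy can be sketched in plain Python: inject a linefeed into the flat string wherever a page-break node occurs, before searching (flat_content is a hypothetical helper, not part of DHParser):

```python
import re

def flat_content(leaves, break_names=frozenset({'pb'})):
    # Join leaf contents, but insert a linefeed for every page-break
    # node, so that string searches do not run over the hidden gap.
    return ''.join('\n' if name in break_names else text
                   for name, text in leaves)

s = flat_content([(':Text', 'xyz New'), ('pb', ''), (':Text', 'York xyz')])
print(re.search(r'New\s+York', s))  # now the phrase is found
```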
Markup insertion¶
Class ContentMapping provides a powerful markup()-method that allows you to add markup at any position you like, simply by passing the start- and end-position in the string-representation of the document-tree. It “automagically” takes care of such perils as cross-cutting tag-boundaries or overlapping hierarchies.
This solves a common challenge when processing tree-structured text-data, which consists in adding new nodes that cover certain ranges of the string content that may already be covered by other elements. The problem is the same as adding further markup to an existing XML- or HTML-document. In trivial cases like:
>>> trivial_xml = parse_xml("<trivial>Please mark up Stadt München "
... "in Bavaria in this sentence.</trivial>")
we would hardly need any help from a library to mark up the string “Stadt München”. But both finding certain sub-strings and marking them up can easily become complicated:
>>> hard_xml = parse_xml("<hard>Please mark up Stadt\n<lb/>"
... "<location><em>München</em> in Bavaria</location> in this "
... "sentence.</hard>")
Let’s start with the simple case to see how searching and marking strings works with DHParser:
>>> mapping = ContentMapping(trivial_xml)
>>> match = re.search(r"Stadt\s+München", mapping.content)
>>> _ = mapping.markup(match.start(), match.end(), "foreign",
... {'lang': 'de'})
>>> printw(trivial_xml.as_xml(inline_tags={'trivial'}))
<trivial>Please mark up <foreign lang="de">Stadt München</foreign>
in Bavaria in this sentence.</trivial>
In order to search for the text-string, a regular expression is used rather than a simple search for “Stadt München”, because we cannot assume that it appears in exactly the same form in the text. For example, it could be broken up by a line break, e.g. “Stadt\nMünchen”.
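A whitespace-tolerant pattern of this kind can be generated from a plain phrase, for example (flexible_pattern is a hypothetical helper, not part of DHParser):

```python
import re

def flexible_pattern(phrase):
    # Build a regex that matches the phrase even if its words are
    # separated by arbitrary whitespace, e.g. a line break.
    return re.compile(r'\s+'.join(re.escape(word) for word in phrase.split()))

print(bool(flexible_pattern('Stadt München').search('Stadt\nMünchen')))  # True
```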
Now, let’s try the more complicated case. Because we will try different configurations, we use copies of the tree “hard_xml”:
>>> hard_xml_copy = copy.deepcopy(hard_xml)
>>> mapping = ContentMapping(hard_xml_copy)
>>> match = re.search(r"Stadt\s+München", mapping.content)
>>> _ = mapping.markup(match.start(), match.end(), "foreign",
... {'lang': 'de'})
>>> xml_str = hard_xml_copy.as_xml(empty_tags={'lb'})
>>> print(xml_str)
<hard>
  Please mark up
  <foreign lang="de">
    Stadt
    <lb/>
  </foreign>
  <location>
    <foreign lang="de">
      <em>München</em>
    </foreign>
    in Bavaria
  </location>
  in this sentence.
</hard>
As can be seen, the <foreign>-tag is split into two parts, because the markup runs across the border of another tag, in this case <location>. Note that the <lb/>-tag lies inside the <foreign>-tag. But that makes sense, because it would also have been inside the <foreign>-tag, had there been no <location>-tag and no need to split. (By default, the algorithm behaves somewhat “greedily”, which, however, can be configured with a parameter of the same name passed to the constructor of class ContentMapping.)
But what if you don’t want the <foreign>-tag to be split up into two or more parts, as the case may be? Well, in this case you need to allow those tags across whose borders the new markup runs to be split by that markup:
>>> hard_xml_copy = copy.deepcopy(hard_xml)
>>> divisibility_map = {'foreign': {'location', ':Text'},
... '*': {':Text'}}
>>> mapping = ContentMapping(hard_xml_copy, divisibility=divisibility_map)
>>> match = re.search(r"Stadt\s+München", mapping.content)
>>> _ = mapping.markup(match.start(), match.end(), "foreign",
... {'lang': 'de'})
>>> xml_str = hard_xml_copy.as_xml(empty_tags={'lb'})
>>> print(xml_str)
<hard>
  Please mark up
  <foreign lang="de">
    Stadt
    <lb/>
    <location>
      <em>München</em>
    </location>
  </foreign>
  <location>
    in Bavaria
  </location>
  in this sentence.
</hard>
See the difference? This time the <foreign>-element remains intact, while the <location>-element has been split. This behavior can be configured by the divisibility-map that is passed to the parameter divisibility of the ContentMapping-constructor. It maps elements (or, rather, their names) to sets of elements that can be cut by them. The asterisk * is a wildcard whose value-set contains those elements that can be cut by any other element. An element that does not appear in any value-set of the mapping cannot be cut by any other element. It is also possible to pass a simple set of element-names instead of a dictionary to the divisibility-parameter. In this case, any element with a name in this set can be cut by any other element, while any element whose name is not a member of the set cannot be cut when markup is added.
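The lookup implied by a divisibility-map can be sketched as follows (may_cut is a hypothetical illustration of the semantics just described, not DHParser's code):

```python
def may_cut(divisibility, cutter, cuttee):
    # True if an element named `cutter` may split an element named
    # `cuttee`. The '*'-entry lists elements that any element may cut.
    allowed = divisibility.get(cutter, set()) | divisibility.get('*', set())
    return cuttee in allowed

divisibility_map = {'foreign': {'location', ':Text'}, '*': {':Text'}}
print(may_cut(divisibility_map, 'foreign', 'location'))  # True: splits <location>
print(may_cut(divisibility_map, 'location', 'foreign'))  # False: <foreign> stays intact
```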
In cases where markup overlaps element-borders, it is unavoidable to decide which element will be divided and which not. It is a general limitation of tree structures that they do not allow overlapping hierarchies. In this particular example, it would most probably be more reasonable to keep the <location>-element intact, because locations should probably be recognizable as units, while this does not really seem to matter for a foreign language annotation.
The case may arise, though, where you cannot avoid splitting elements that form units. At this point you should probably consider using an entirely different data-structure, say, a graph. But if this is not an option, DHParser.nodetree allows you to mark split elements as belonging to the same “chain” of elements. In order to do so, you can pass a chain_attr_name to the constructor of class ContentMapping. This is an (arbitrary) name for an attribute which will contain a unique short string that all elements (of the same name) belonging to one and the same chain share with each other, but not with any other elements.
Let’s try this on the previous example:
>>> reset_chain_ID() # just to ensure deterministic ID values for doctest
>>> hard_xml_copy = copy.deepcopy(hard_xml)
>>> match = re.search(r"Stadt\s+München", hard_xml_copy.content)
>>> divisibility_map = {'foreign': {'location', ':Text'},
... '*': {':Text'}}
>>> mapping = ContentMapping(hard_xml_copy, divisibility=divisibility_map,
... chain_attr_name="chain")
>>> _ = mapping.markup(match.start(), match.end(), "foreign",
... {'lang': 'de'})
>>> xml_str = hard_xml_copy.as_xml(empty_tags={'lb'})
>>> print(xml_str)
<hard>
  Please mark up
  <foreign lang="de">
    Stadt
    <lb/>
    <location chain="VZT">
      <em>München</em>
    </location>
  </foreign>
  <location chain="VZT">
    in Bavaria
  </location>
  in this sentence.
</hard>
Markup plays well together with restricted content mappings, as the following example shows:
>>> tree = parse_xml("<doc>Please mark up Stadt\n<lb/>"
... "<em>München</em><footnote>'Stadt <em>München</em>'"
... " is German for 'City of Munich'</footnote> in Bavaria"
... " in this sentence.</doc>")
Let’s assume we’d like to mark up locations and text-passages in foreign languages, but only in the main text and not within footnotes and the like. For that purpose, we build a content mapping that is restricted to non-footnote-text:
>>> cm = ContentMapping(tree, select=LEAF_PATH, ignore='footnote',
... chain_attr_name='chain')
>>> print(cm.content)
Please mark up Stadt
München in Bavaria in this sentence.
Now, let’s assume for the sake of the example that we have a list of location names to be marked up that contains the phrase “München in Bavaria”. So, we search for this phrase and add the required location markup:
>>> m = re.search(r"München\s+in\s+Bavaria", cm.content)
>>> print(m)
<re.Match object; span=(21, 39), match='München in Bavaria'>
>>> _ = cm.markup(m.start(), m.end(), 'location')
>>> print(tree.as_xml(empty_tags={'lb'}))
<doc>
  Please mark up Stadt
  <lb/>
  <location>
    <em>München</em>
    <footnote>
      'Stadt
      <em>München</em>
      ' is German for 'City of Munich'
    </footnote>
    in Bavaria
  </location>
  in this sentence.
</doc>
The <location>-element covers the entire span, including the footnote. This is to be expected, as changes are always carried out on the full tree; only the mapping is restricted to certain parts of the document. Usually, this is also the desired behavior, though, admittedly, depending on the use case another behavior (e.g. splitting the <location>-element into one part before the <footnote>-element and one part after that element) might be preferable. Such cases are not covered by the markup-method of class ContentMapping.
Because the <location>-element did not need to be split, it does not have a “chain”-attribute.
Next, let’s add the <foreign>-element. (We substitute the value of its chain-attribute so that the doctest does not break when another random key is picked!):
>>> m = re.search(r'Stadt\s+München', cm.content)
>>> _ = cm.markup(m.start(), m.end(), 'foreign', lang="de")
>>> print(tree.as_xml(empty_tags={'lb'}))
<doc>
  Please mark up
  <foreign lang="de" chain="RZC">
    Stadt
    <lb/>
  </foreign>
  <location>
    <foreign lang="de" chain="RZC">
      <em>München</em>
    </foreign>
    <footnote>
      'Stadt
      <em>München</em>
      ' is German for 'City of Munich'
    </footnote>
    in Bavaria
  </location>
  in this sentence.
</doc>
Here again, one might ask why the <foreign>-tag contains the <lb>-tag. But the choice makes sense, because if put together again, it should cover the complete stretch including the line-break. Again, different use cases and different choices are imaginable, which, however, are not covered by the ContentMapping.markup()-method.
Error Messages¶
Although errors are typically located at a particular point or range of the source code, DHParser treats them as global properties of the syntax tree (albeit with a location), rather than attaching them to particular nodes. This has two advantages:
When restructuring the tree and removing or adding nodes during the abstract-syntax-tree-transformation and possibly further tree-transformation, error messages do not accidentally get lost.
It is not necessary to add another slot to the Node class for keeping an error list which most of the time would remain empty, anyway.
In order to track errors and other global properties, module nodetree provides the RootNode-class. The root-object of a syntax-tree produced by parsing is of type RootNode. If a root node needs to be created manually, it is necessary to create a Node-object and either pass it to RootNode as parameter on instantiation or, later, to the swallow()-method of the RootNode-object:
>>> document = RootNode(sentence, str(sentence))
The second parameter is normally the source code. In this example we simply use the string representation of the syntax-tree originating in sentence. Before any errors can be added, the source-position fields of the nodes of the tree must have been initialized. Usually, this is done by the parser. Since the syntax-tree in this example does not stem from a parsing-process, we have to do it manually:
>>> _ = document.with_pos(0)
Now, let’s mark all “word”-nodes that contain non-letter characters with an error-message. There should be plenty of them, because earlier we partially replaced some of the words with dots:
>>> import re
>>> len([document.new_error(node, "word contains illegal characters")
... for node in document.select('word')
... if re.fullmatch(r'\w*', node.content) is None])
3
>>> for error in document.errors_sorted: print(error)
1:1: Error (1000): word contains illegal characters
1:6: Error (1000): word contains illegal characters
1:11: Error (1000): word contains illegal characters
The format of the string representation of Error-objects resembles that of compiler error messages and is understood by many text editors, which can mark the errors in the source code.
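The "1:1"-style prefixes of these error strings locate an error by line and column. Internally, positions are stored as flat string offsets into the source code. As an aside, converting such a flat offset into a line and column pair can be sketched in plain Python (this is an illustrative helper, not DHParser's actual implementation):

```python
from bisect import bisect_right

def line_col(source: str, pos: int) -> tuple:
    """Convert a flat string position into a 1-based (line, column) pair."""
    # record the offset of the first character of every line
    line_starts = [0]
    for i, ch in enumerate(source):
        if ch == '\n':
            line_starts.append(i + 1)
    line = bisect_right(line_starts, pos)      # 1-based line number
    column = pos - line_starts[line - 1] + 1   # 1-based column number
    return line, column

source = "fox jumps\nover the dog"
print(line_col(source, 0))    # (1, 1)
print(line_col(source, 10))   # (2, 1)
```

Keeping the line-start offsets sorted allows a fast binary search (`bisect_right`) instead of re-scanning the source for every error.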
Attribute-Handling¶
While the “Node.attr”-field can be used to store data of any kind, it will often just serve to store XML-attributes, the value of which is always a string. The DHParser.nodetree-module provides a mini-API to simplify typical use cases of XML-attributes.
One important use case of attributes is adding or removing CSS-classes in the “class”-attribute. The “class”-attribute is understood as containing a set of whitespace-delimited strings. Module “nodetree” provides a few functions to simplify class-handling:
>>> paragraph = Node('p', 'veni vidi vici')
>>> add_class(paragraph, 'smallprint')
>>> paragraph.attr['class']
'smallprint'
Although the class-attribute is filled with a sequence of strings, it should behave like a set of strings. For example, one and the same class name should not appear twice in the class attribute:
>>> add_class(paragraph, 'smallprint justified')
>>> paragraph.attr['class']
'smallprint justified'
Plus, the order of the class strings does not matter when checking for elements:
>>> has_class(paragraph, 'justified smallprint')
True
>>> remove_class(paragraph, 'smallprint')
>>> has_class(paragraph, 'smallprint')
False
>>> has_class(paragraph, 'justified smallprint')
False
>>> has_class(paragraph, 'justified')
True
The same logic of treating blank separated sequences of strings as sets can also be applied to other attributes:
>>> car = Node('car', 'Porsche')
>>> add_token_to_attr(car, "Linda Peter", 'owner')
>>> car.attr['owner']
'Linda Peter'
Or, more generally, to strings containing whitespace-separated substrings:
>>> add_token('Linda Paula', 'Peter Paula')
'Linda Paula Peter'
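The set semantics of these token functions can be sketched in plain Python. The following is an illustrative reimplementation, not DHParser's actual code; it only shows why adding “Peter Paula” to “Linda Paula” yields “Linda Paula Peter” and why token order does not matter when checking:

```python
def add_token(token_string: str, tokens: str) -> str:
    """Add each whitespace-separated token from `tokens` to `token_string`,
    skipping tokens that are already present (set semantics)."""
    present = token_string.split()
    for tok in tokens.split():
        if tok not in present:
            present.append(tok)
    return ' '.join(present)

def has_token(token_string: str, tokens: str) -> bool:
    """True if all tokens in `tokens` occur in `token_string`, in any order."""
    return set(tokens.split()) <= set(token_string.split())

print(add_token('Linda Paula', 'Peter Paula'))                  # Linda Paula Peter
print(has_token('smallprint justified', 'justified smallprint'))  # True
```

Note that the existing tokens keep their original order; only genuinely new tokens are appended at the end.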
Classes and Functions Reference¶
The full documentation of all classes and functions can be found in module DHParser.nodetree. The following table of contents lists the most important of these:
class Node¶
Node: the central building-block of a node-tree
result: either the child nodes or the node’s string content
children: the node’s immediate children or an empty tuple
content: the concatenated string content of all descendants
tag_name: the node’s name
attr: the dictionary of the node’s attributes
pos: the source-code position of this node, in case the node stems from a parsing process
Navigation
select(): Selects nodes from the tree of descendants.
pick(): Picks a particular node from the tree of descendants.
locate(): Finds the leaf-node covering a particular location of the string content of the tree originating in this node.
select_path(): Selects paths from the tree of descendants.
pick_path(): Picks a particular path from the tree of descendants.
locate_path(): Finds the path of the leaf-node covering a particular location of the string content of the tree originating in this node.
Serialization
as_sxpr(): Serializes the tree originating in a node as an S-expression.
as_xml(): Serializes the tree as XML.
as_json(): Serializes the tree as JSON.
XML-exchange
as_etree(): Converts the tree to an XML-ElementTree as defined by the respective module of Python’s standard library.
from_etree(): Converts an XML-ElementTree into a tree of Node-objects.
Evaluation
evaluate(): “Evaluates” a tree by running one of a set of functions on each node depending on its tag-name.
Reading serialized trees¶
parse_sxpr(): Converts any S-expression string to a tree of nodes.
parse_xml(): Converts any XML-document to a tree of nodes.
parse_json(): Converts a JSON-document that has previously been created with as_json() from a tree of nodes back to a tree of nodes.
deserialize(): Tries to guess the data-type of a string and then calls one of the above deserialization-functions accordingly.
Traversing trees via paths¶
prev_path(): Returns the path preceding a given path.
next_path(): Returns the path following a given path.
generate_content_mapping(): Generates a path-mapping for all leaf-nodes of a tree, i.e. a dictionary mapping the current text position of each leaf-node (not the source-code position!) to the leaf-node itself.
get_path_and_offset(): Returns the leaf-node for a given text position and the number of characters by which this position reaches into the leaf-node.
Attribute-handling¶
has_token_attr(): Checks whether an attribute of a node contains one or more tokens, i.e. blank-separated sequences of letters.
add_token_to_attr(): Adds a token to a particular attribute of a node.
remove_token_from_attr(): Removes a token from a particular attribute of a node.
has_class(), add_class(), remove_class(): the same as above, only that these functions manipulate the tokens specifically of the class-attribute.
class RootNode¶
Any Node-object can be considered the origin of a tree, and none of the “navigation”-functions requires a tree of nodes to start with a RootNode-object. However, RootNode-objects provide support for certain “global” aspects of a tree, like keeping track of the source code with line and column numbers and adding error messages. RootNode-objects can either be initialized with a node that will then be replaced by the root-node or swallow a tree originating in a common node later.
RootNode: additional functionality for a tree of nodes
errors: a list of errors
errors_sorted: the errors sorted by their position in the source code instead of the time of their having been added
inline_tags: a set of tags that will be printed on a single line with their content when serializing. (This helps to avoid undesired whitespace when exporting to HTML!)
string_tags: a set of tags that will be converted to simple strings that appear as mixed content inside their parent when serializing as XML
empty_tags: a set of tags that will be rendered as empty tags, e.g. <mytag />, when serializing as XML
swallow(): Can be called once in the lifetime of the RootNode-object to assign this root-node to an existing tree of nodes.
new_error(): Creates and adds a new error.
customized_XML(): Serializes the tree as XML, taking into account the XML-customization attributes of the RootNode-object.
class ContentMapping¶
ContentMapping represents a path-mapping of the string-content of all, or a specific selection of, the leaf-nodes of a tree. A content-mapping is an ordered mapping of the first text position of every (selected) leaf-node to the path of this node.
The class provides methods for mapping string positions to paths and offsets (relative to the beginning of the leaf-node of the path).
ContentMapping: Mapping of the tree to its string-content
get_path_and_offset(): Maps positions in the string-content of the ContentMapping to the leaf-path into which they fall
iterate_paths(): Yields all paths from position start_pos up to and including position end_pos.
insert_node(): Inserts a node at a particular text-position.
markup(): Adds markup (i.e. an element) to a particular stretch of text.