Document-Trees
==============

Module :py:mod:`nodetree` encapsulates the functionality for creating and
handling document trees and, in particular, syntax-trees generated by a
parser. This includes serialization and deserialization of node-trees,
navigating and searching node-trees as well as annotating node-trees with
attributes and error messages.

.. _node_objects:

Node-objects
------------

Syntax trees are composed of Node-objects which are linked uni-directionally
from parent to children. Nodes can contain either child-nodes, in which case
they are informally called "branch-nodes", or text-strings, in which case
they are informally called "leaf-nodes", but not both at the same time.
(There is no mixed content as in XML!) Apart from their content, the most
important property of a Node-object is its ``name``. Nodes are initialized
with their name and content as arguments::

    >>> from DHParser.nodetree import *
    >>> number_1 = Node('number', "5")
    >>> number_1.name
    'number'

The Node-object ``number_1`` now has the tag-name "number" and the content
"5". Since the content is a string and not a tuple of child-nodes, the node
constructed is a leaf-node.

(By convention, if the tag-name of a node starts with a colon ":", the node
is considered "anonymous". This distinction is helpful when a tree of nodes
is generated in a parsing process to distinguish nodes that contain important
pieces of data from nodes that merely contain delimiters or structural
information.)

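
The naming convention for anonymous nodes can be illustrated without DHParser
itself. The following is only a minimal sketch that models nodes as plain
``(name, content)`` tuples; the helpers ``is_anonymous`` and ``named_leaves``
are hypothetical names chosen for this illustration and are not part of the
`nodetree`-API:

```python
# Toy model, NOT DHParser's Node class: a node is a (name, content) pair,
# where content is either a string (leaf) or a tuple of nodes (branch).

def is_anonymous(name: str) -> bool:
    """By convention, a name starting with a colon marks an anonymous node."""
    return name.startswith(':')


def named_leaves(node):
    """Yield (name, text) for every leaf that is not anonymous."""
    name, content = node
    if isinstance(content, str):
        if not is_anonymous(name):
            yield name, content
    else:
        for child in content:
            yield from named_leaves(child)


# An addition where the operator sign is kept in an anonymous node.
tree = ('add', (('number', '5'), (':Text', ' + '), ('number', '4')))
data = list(named_leaves(tree))
```

Filtering by the leading colon keeps the two ``number``-leaves and drops the
anonymous delimiter node.
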
Several nodes can be connected to a tree::

    >>> number_2 = Node('number', "4")
    >>> addition = Node('add', (number_1, number_2))

Trees spanned by a node can conveniently be serialized as S-expressions
(well-known from the computer languages "lisp" and "scheme")::

    >>> print(addition.as_sxpr())
    (add (number "5") (number "4"))

It is also possible to serialize nodes as an XML-snippet::

    >>> print(addition.as_xml())
    <add>
      <number>5</number>
      <number>4</number>
    </add>

or as an indented tree::

    >>> print(addition.as_tree())
    add
      number "5"
      number "4"

or as JSON-data (see further below). Trees can also be deserialized from any
of these formats with the exception of the indented tree (see below).

In order to test whether a Node is a leaf-node, one can check for the absence
of children::

    >>> node = Node('word', 'Palace')
    >>> assert not node.children

The data of a node can be queried by reading the result-property::

    >>> node.result
    'Palace'

The `result` is always a string or a tuple of Nodes, even if the node-object
has been initialized with a single node::

    >>> parent = Node('phrase', node)
    >>> parent.result
    (Node('word', 'Palace'),)

The `result`-property can be assigned to, in order to change the data of a
node::

    >>> parent.result = (Node('word', 'Buckingham'), Node('blank', ' '), node)
    >>> print(parent.as_sxpr())
    (phrase (word "Buckingham") (blank " ") (word "Palace"))

Content-equality of Nodes must be tested with the `equals()`-method. The
equality operator `==` merely tests for the identity of the node-object, not
for the equality of the content of two different node-objects::

    >>> n1 = Node('dollars', '1')
    >>> n2 = Node('dollars', '1')
    >>> n1.equals(n2)
    True
    >>> n1 == n2
    False

An empty node is always a leaf-node, that is, if initialized with an empty
tuple, the node's result will actually be the empty string::

    >>> empty = Node('void', ())
    >>> empty.result
    ''
    >>> assert empty.equals(Node('void', ''))

Next to the `result`-property, a node's content can be queried with either
its `children`-property or its `content`-property.
The former yields the tuple of child-nodes. The latter yields the
string-content of the node, which in the case of a "branch-node" is the
(recursively generated) concatenated string-content of all of its children::

    >>> node.content
    'Palace'
    >>> node.children
    ()
    >>> parent.content
    'Buckingham Palace'
    >>> parent.children
    (Node('word', 'Buckingham'), Node('blank', ' '), Node('word', 'Palace'))

Both the `content`-property and the `children`-property are
read-only-properties. In order to change the data of a node, its
`result`-property must be assigned to (as shown above).

Just like HTML- or XML-tags, nodes can be annotated with attributes. Unlike
XML and HTML, however, the value of these attributes can be of any type, not
only strings. The only requirement is that the value is serializable as a
string. Be aware, though, of the possible loss of information when
serializing nodes or converting nodes to ElementTree-elements, if there are
attributes with non-string values!

Attributes are stored in an ordered dictionary that maps string identifiers,
i.e. the attribute names, to the content of the attribute. This dictionary
can be accessed via the `attr`-property::

    >>> node.attr['price'] = 'very high'
    >>> print(node.as_xml())
    <word price="very high">Palace</word>

When serializing as S-expressions, attributes are shown as a nested list
marked with a "tick"::

    >>> print(node.as_sxpr())
    (word `(price "very high") "Palace")

Attributes can be queried via the `has_attr()`- and `get_attr()`-methods.
This is to be preferred over accessing the `attr`-property for querying, because the attribute dictionary is created lazily on the first access of the `attr`-property:: >>> node.has_attr('price') True >>> node.get_attr('price', '') 'very high' >>> parent.get_attr('price', 'unknown') 'unknown' If called with no parameters or an empty string as attribute name, `has_attr()` returns True, if at least one attribute is present:: >>> parent.has_attr() False Attributes can be deleted like dictionary entries:: >>> del node.attr['price'] >>> node.has_attr('price') False Node-objects contain a special "write once, read afterwards"-property named `pos` that is meant to capture the source code position of the content represented by the Node. Usually, the `pos` values are initialized with the corresponding source code location by the parser. The main purpose of keeping source-code locations in the node-objects is to equip the messages of errors that are detected in later processing stages with source code locations. In later processing stages the tree may already have been reshaped and its string-content may have been changed, say, by normalizing whitespace or dropping delimiters. Before the `pos`-field can be read, it must have been initialized with the `with_pos`-method, which recursively initializes the `pos`-field of the child nodes according to the offset of the string values from the main field:: >>> import copy; essentials = copy.deepcopy(parent) >>> print(essentials.with_pos(0).as_xml(src=essentials.content)) Buckingham Palace >>> essentials[-1].pos, essentials.content.find('Palace') (11, 11) >>> essentials.result = tuple(child for child in essentials.children ... if child.name != 'blank') >>> print(essentials.as_xml(src=essentials.content)) Buckingham Palace >>> essentials[-1].pos, essentials.content.find('Palace') (11, 10) .. _serialization: Serialization ------------- Syntax trees can be serialized as S-expressions, XML, JSON and indented text. 
As special cases of S-expressions and JSON, `SXML`_ and `unist`_ are also
supported.

Module 'nodetree' also contains a few simple parsers
(:py:func:`~nodetree.parse_sxpr`, :py:func:`~nodetree.parse_sxml`,
:py:func:`~nodetree.parse_xml`, :py:func:`~nodetree.parse_json`) to convert
XML-snippets, S-expressions or JSON-objects into trees composed of
Node-objects.

.. note::
   Function :py:func:`~nodetree.parse_xml` can deserialize *any* XML-file
   and function :py:func:`~nodetree.parse_sxml` can deserialize *any* SXML.
   The other parsing functions, however, can parse only the restricted
   subset of S-expressions or JSON into Node-trees that is used when
   serializing into these formats. There are no functions to deserialize
   indented text or `unist`_-JSON.

In order to make it easier to parameterize serialization, the Node-class
also defines a generic :py:meth:`~nodetree.Node.serialize`-method next to
the more specialized :py:meth:`~nodetree.Node.as_sxpr`-,
:py:meth:`~nodetree.Node.as_json`- and
:py:meth:`~nodetree.Node.as_xml`-methods::

    >>> s = ('(sentence (word "This") (blank " ") (word "is") (blank " ") '
    ...      '(phrase (word "Buckingham") (blank " ") (word "Palace")))')
    >>> sentence = parse_sxpr(s)
    >>> print(sentence.serialize(how='indented'))
    sentence
      word "This"
      blank " "
      word "is"
      blank " "
      phrase
        word "Buckingham"
        blank " "
        word "Palace"
    >>> sxpr = sentence.serialize(how='sxpr')
    >>> round_trip = parse_sxpr(sxpr)
    >>> assert sentence.equals(round_trip)

When serializing as XML, there will be no mixed-content and, likewise, no
empty tags per default, because these do not exist in DHParser's data
model::

    >>> print(sentence.as_xml())
    <sentence>
      <word>This</word>
      <blank> </blank>
      <word>is</word>
      <blank> </blank>
      <phrase>
        <word>Buckingham</word>
        <blank> </blank>
        <word>Palace</word>
      </phrase>
    </sentence>

However, mixed-content can be simulated with the `string_tags`-parameter of
the :py:meth:`~nodetree.Node.as_xml`-method::

    >>> print(sentence.as_xml(inline_tags={'sentence'}, string_tags={'word', 'blank'}))
    <sentence>This is <phrase>Buckingham Palace</phrase></sentence>

The `inline_tags`-parameter ensures that all listed tags as well as any tag
contained in a listed tag will be printed on a single line. This is helpful
when opening the XML-serialization in an internet-browser in order to avoid
spurious blanks when a line-break occurs in the HTML/XML-source. Finally,
empty tags that do not have a closing tag (e.g. `<br/>`) can be declared as
such with the `empty_tags`-parameter.

Note that using `string_tags` can lead to a loss of information. A loss of
information is inevitable if, like in the example above, more than one tag
is listed in the `string_tags`-set passed to the
:py:meth:`~nodetree.Node.as_xml`-method.

When deserializing an XML-string, the text-parts within elements with
mixed-content will be assigned to nodes of their own with the default name
`:Text`::

    >>> tree = parse_xml(
    ...     '<sentence>This is <phrase>Buckingham Palace</phrase></sentence>')
    >>> print(tree.serialize(how='indented'))
    sentence
      :Text "This is "
      phrase "Buckingham Palace"

The name of these text-nodes can be configured with the
`string_tag`-parameter of the :py:func:`~nodetree.parse_xml`-function::

    >>> tree = parse_xml(
    ...     '<sentence>This is <phrase>Buckingham Palace</phrase></sentence>',
    ...     string_tag='MIXED')
    >>> print(tree.serialize(how='indented'))
    sentence
      MIXED "This is "
      phrase "Buckingham Palace"

.. _xml_formatting:

XML-reflow
^^^^^^^^^^

A notorious problem when working with XML is the proper handling of
whitespace. It happens every so often that, when an XML-document is
rewrapped or reformatted, whitespace is either added or removed in places
where this can change the meaning of the data. For example, when
reformatting the snippet::

King Charles was crowned by the Archbishop of Cantabury

it may happen (depending on the tool used) that a whitespace gets lost::

King Charleswas crowned by the Archbishop of Cantabury

Here, the whitespace between "Charles" and "was" has been erroneously deleted. Or that a whitespace is added, where it should not::

King Charles was crowned by the Archbishop of Cantabury

This time round, the data marked up by name also encompasses a line-break
and a few blanks, an addition that can mess up algorithms that rely on
precise data.

Surely, there is the
`xml:space <https://www.w3.org/TR/xml/#sec-white-space>`_-attribute. But
this is often forgotten by the people encoding data in XML and sometimes
also by the programmers that develop XML-tools. Because of this, DHParser
offers the `inline_tags`-parameter (which can be passed to the
xml-serialization functions and also be set as an attribute of the
:py:class:`~nodetree.RootNode`-class).

The main problem with the `xml:space`-attribute consists in the fact that
it only either allows that all whitespace is preserved literally
(xml:space="preserve") or that whitespace may be added or removed at liberty
to format the XML-document (xml:space="default"). These two states are not
sufficiently fine-grained to allow the reflow of texts without distorting
the text data. Non-distorting reflow requires that whitespace inside (but
not at the fringes of) particular text-containing elements, like for example
`<p>`, is readily expandable to an arbitrarily long sequence of whitespace
characters (blanks, tabs and line-feeds) and likewise compressible to a
single blank without causing harm to the data. As you may notice, this is
true for paragraphs of prose-text but not for poems. But this only means
that not all text-data is reflowable and that reflow should only be applied
to text-data where reflow makes sense.

This constraint for data-preserving reflow assumes that whitespace can
always be substituted by other (larger or smaller) whitespace, but must not
be added or removed. If this rule is strictly obeyed, then any form of the
data (i.e. formatting to a particular column-number) can always be
reconstructed and will in fact yield identical results for the same
reflow-column and the same indentation (which is two blanks by default).

DHParser's reflow-algorithm can be triggered by assigning a column-number to
the `reflow_col`-parameter of the :py:meth:`~nodetree.Node.as_xml`-method::

    >>> text = '

King Charles was crowned by the Archbishop of Cantabury

' >>> tree = parse_xml(text) >>> reflow = tree.as_xml(inline_tags={'p'}, reflow_col=40) >>> print(reflow)

King Charles was crowned by the Archbishop of Cantabury

No blanks are introduced for the sake of formatting after the opening
`<p>`-tag or before the closing `</p>`-tag. The same, although this is not
visible in the example above, is also true for all tags contained inside the
`<p>`-tag. (Contained tags inherit the inline-property!)

It should also be noted that assigning a value to the reflow-parameter
changes the meaning of the `inline_tags`-parameter in a subtle way - and
likewise the meaning of the `xml:space`-attribute if that is used. Without
reflow, the `inline_tags`-parameter marks tags, the content of which is
strictly preserved when serializing. (Unless the data itself contains a
line-break, it will be written entirely on a single line, thus the name
"inline".) However, if the reflow parameter receives a value different from
0, the content of the "inline-tags" and their descendants is not serialized
on a single line any more but allowed to reflow according to the above rule.

Also, the data can always be "normalized" by reformatting it to a particular
column. A special case of this consists in reducing it to one and the same
one-line-form, by replacing all line-feeds inside inline-tags by blanks and
any sequence of blanks by a single blank. (Line-feeds can still be
preserved, if necessary, by hard-coding them with tags like `<br/>`.) This
can be achieved by calling the special function
:py:func:`~nodetree.reflow_as_oneliner`::

    >>> from DHParser.nodetree import reflow_as_oneliner
    >>> tree = parse_xml(reflow)
    >>> print(tree.as_xml(inline_tags={tree.name}))

King Charles was crowned by the Archbishop of Cantabury

>>> reflow_as_oneliner(tree) >>> print(tree.as_xml(inline_tags={tree.name}))

King Charles was crowned by the Archbishop of Cantabury

Note that the call `tree.as_xml(inline_tags={tree.name})`, which treats all
tags from the root of the tree onward as inline-tags and does not apply
reflow, yields a "neutral" serialization in the sense that no formatting is
applied anywhere.

DHParser also provides a command-line tool to reflow xml-files, conveniently
named "xml_reflow". It can be called with::

    $ xml_reflow --column 80 FILENAME.xml

An alternative to reflowing the content of XML-files manually in this way is
to use a text-editor that can reflow (and properly indent) lines with excess
length. An advantage is that this also works with XML-files that contain
areas where the data is not reflowable and must be literally preserved.

ElementTree-Exchange
--------------------

Although DHParser offers rich support for tree-transformation, the wish may
arise to use standard XML-tools for tree-transformation as an alternative or
supplement to the tools DHParser offers. One way to do so would be to
serialize the tree of :py:class:`~nodetree.Node`-objects, then use the
XML-tools and, possibly, deserialize the transformed XML again. A more
efficient method, however, is to utilize any of the various Python-libraries
for XML. In order to make this as easy as possible, trees of
:py:class:`~nodetree.Node`-objects can be converted to
`ElementTree`_-objects either from the python standard library or from the
`lxml <https://lxml.de/>`_-library::

    >>> import xml.etree.ElementTree as ET  # for lxml write: from lxml import etree as ET
    >>> et = sentence.as_etree(ET)
    >>> ET.dump(et)
    <sentence><word>This</word><blank> </blank><word>is</word><blank> </blank><phrase><word>Buckingham</word><blank> </blank><word>Palace</word></phrase></sentence>
    >>> tree = Node.from_etree(et)
    >>> print(tree.equals(sentence))
    True

The first parameter of :py:meth:`~nodetree.Node.as_etree` is the
ElementTree-library to be used. If omitted, the standard-library-ElementTree
is used.
Like the :py:meth:`~nodetree.Node.as_xml`-method, the
:py:meth:`~nodetree.Node.as_etree`- and
:py:meth:`~nodetree.Node.from_etree`-methods can be parameterized in order
to support mixed-content and empty-tags::

    >>> et = sentence.as_etree(ET, string_tags={'word', 'blank'})
    >>> ET.dump(et)
    <sentence>This is <phrase>Buckingham Palace</phrase></sentence>

.. _paths:

Tree-Traversal
--------------

Transforming syntax trees is usually done by traversing the complete tree
and applying specific transformation functions on each node. Modules
"transform" and "compile" provide high-level interfaces and scaffolding
classes for the traversal and transformation of syntax-trees.

Module `nodetree` does not provide any functions for transforming trees, but
it provides low-level functions for navigating trees. These functions cover
three different purposes:

1. Downtree-navigation within the subtree spanned by a particular node.
2. Uptree- and horizontal navigation to the neighborhood ("siblings") and
   ancestry of a given node.
3. Navigation by looking at the string-representation of the tree.

Navigating "downtree"
^^^^^^^^^^^^^^^^^^^^^

There are a number of useful functions to help navigating a tree spanned by
a node and finding particular nodes within a tree::

    >>> from DHParser.toolkit import printw
    >>> printw(list(sentence.select('word')))
    [Node('word', 'This'), Node('word', 'is'), Node('word', 'Buckingham'),
     Node('word', 'Palace')]
    >>> list(sentence.select(lambda node: node.content == ' '))
    [Node('blank', ' '), Node('blank', ' '), Node('blank', ' ')]

The pick-method always picks the first node fulfilling the criterion::

    >>> sentence.pick('word')
    Node('word', 'This')

Or, reversing the direction::

    >>> last_match = sentence.pick('word', reverse=True)
    >>> last_match
    Node('word', 'Palace')

While nodes contain references to their children, a node does not contain a
reference to its parent.
The method :py:meth:`~nodetree.Node.pick_path` (described below) can be used
to pick the complete list of ancestors leading up to and including a
particular node. As a last resort (because it is slow) the node's parent can
be found by the `find_parent`-function which must be executed on any
ancestor of the node::

    >>> printw(sentence.find_parent(last_match))
    Node('phrase', (Node('word', 'Buckingham'), Node('blank', ' '),
     Node('word', 'Palace')))

Sometimes, one only wants to select or pick particular children of a node.
Apart from accessing these via `node.children`, there is a tuple-like access
to the immediate children via indices and slices::

    >>> sentence[0]
    Node('word', 'This')
    >>> printw(sentence[-1])
    Node('phrase', (Node('word', 'Buckingham'), Node('blank', ' '),
     Node('word', 'Palace')))
    >>> sentence[0:3]
    (Node('word', 'This'), Node('blank', ' '), Node('word', 'is'))
    >>> sentence.index('blank')
    1
    >>> sentence.indices('word')
    (0, 2)

as well as a dictionary-like access, with the difference that a "key" may
occur several times::

    >>> sentence['word']
    (Node('word', 'This'), Node('word', 'is'))
    >>> printw(sentence['phrase'])
    Node('phrase', (Node('word', 'Buckingham'), Node('blank', ' '),
     Node('word', 'Palace')))

Be aware that always all matching values will be returned and that the
return type can accordingly be either a tuple of Nodes or a single Node! An
IndexError is raised in case the "key" does not exist or an index is out of
range.

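
Because a lookup by name may return either a single node or a tuple of
nodes, calling code often needs to normalize the return value. The following
is a minimal sketch of the access rules described above, using a simplified
stand-in class; ``ToyNode`` and the helper ``as_tuple`` are hypothetical and
do not reproduce DHParser's actual implementation:

```python
class ToyNode:
    """Simplified stand-in for a node; NOT DHParser's Node class."""

    def __init__(self, name, result):
        self.name = name
        self.result = result  # a string or a tuple of ToyNode objects

    @property
    def children(self):
        return self.result if isinstance(self.result, tuple) else ()

    def __getitem__(self, key):
        if isinstance(key, (int, slice)):
            # tuple-like access; an out-of-range integer raises IndexError
            return self.children[key]
        matches = tuple(c for c in self.children if c.name == key)
        if not matches:
            raise IndexError(f"no child named {key!r}")
        # one match is returned as a node, several matches as a tuple
        return matches[0] if len(matches) == 1 else matches


def as_tuple(value):
    """Normalize the result of a lookup by name to a tuple of nodes."""
    return value if isinstance(value, tuple) else (value,)


toy_sentence = ToyNode('sentence', (
    ToyNode('word', 'This'), ToyNode('blank', ' '), ToyNode('word', 'is'),
    ToyNode('blank', ' '),
    ToyNode('phrase', (ToyNode('word', 'Buckingham'),))))

words = as_tuple(toy_sentence['word'])      # two matches: already a tuple
phrases = as_tuple(toy_sentence['phrase'])  # one match: wrapped in a tuple
```

With such a helper, loops over the matches work the same way regardless of
how many children carry the requested name.
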
It is also possible to delete children conveniently with Python's
`del`-operator::

    >>> s_copy = copy.deepcopy(sentence)
    >>> del s_copy['blank']; print(s_copy)
    ThisisBuckingham Palace
    >>> del s_copy[2][0:2]; print(s_copy.serialize())
    (sentence (word "This") (word "is") (phrase (word "Palace")))

One can also use the `Node.pick_child()`- or
`Node.select_children()`-method in order to select children with an
arbitrary condition::

    >>> tuple(sentence.select_children(lambda nd: nd.content.find('s') >= 0))
    (Node('word', 'This'), Node('word', 'is'))
    >>> printw(sentence.pick_child(lambda nd: nd.content.find('i') >= 0,
    ...                            reverse=True))
    Node('phrase', (Node('word', 'Buckingham'), Node('blank', ' '),
     Node('word', 'Palace')))

Often, one is neither interested in selecting from the children of a node,
nor from the entire subtree, but from a certain "depth-range" of a
tree-structure. Say, you would like to pick all words from the sentence that
are not inside a phrase and assume at the same time that words may occur in
nested structures::

    >>> nested = copy.deepcopy(sentence)
    >>> i = nested.index(lambda nd: nd.content == 'is')
    >>> nested[i].result = Node('word', nested[i].result)
    >>> nested[i].name = 'italic'
    >>> nested[0:i + 1]
    (Node('word', 'This'), Node('blank', ' '), Node('italic', (Node('word', 'is'))))

Now, in order to select all words on the level of the sentence, but
excluding any sub-phrases, it would not be helpful to use methods based on
the selection of children (i.e. immediate descendants), because the word
nested in an 'italic'-Node would be missed.
For this purpose the various selection-methods of class Node have a
`skip_subtree`-parameter which can be used to block subtrees from the
iterator based on a criterion (which can be a function, a tag name or a set
of tag names and the like)::

    >>> tuple(nested.select('word', skip_subtree='phrase'))
    (Node('word', 'This'), Node('word', 'is'))

Navigating "uptree"
^^^^^^^^^^^^^^^^^^^

Instead of keeping a link within each node to its parent, it is much more
elegant to keep track of a node's ancestry by using the lineage or
"tree-path", which is a simple list of ancestors starting with the root-node
and including the node itself as its last item. For most search methods such
as select() or pick(), there exists a counterpart that returns this path
instead of just the node itself::

    >>> last_path = sentence.pick_path('word', reverse=True)
    >>> last_path[-1] == last_match
    True
    >>> last_path[0] == sentence
    True
    >>> pp_path(last_path)
    'sentence <- phrase <- word'

One can also think of a tree-path as a breadcrumb-trail or, rather,
ant-trail that "points" to a particular part of text by marking the path
from the root to the node, the content of which contains this text. This
node does not need to be a leaf node, but can be any branch-node on the way
from the root to the leaves of the tree. When analyzing or transforming a
tree-structured text, it is often helpful to "zoom" in and out of a
particular part of text (pointed to by a path) or to move forward and
backward from a particular location (again represented by a path).
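
The notion of a tree-path as a list of ancestors can be illustrated without
DHParser. The following is only a rough sketch on nested ``(name, content)``
tuples; the functions ``select_paths`` and ``pp`` are hypothetical stand-ins
written for this illustration and not the functions of the
`nodetree`-module:

```python
# Toy model, NOT DHParser's API: a node is a (name, content) pair, where
# content is either a string (leaf) or a tuple of child nodes (branch).

def select_paths(node, name, path=()):
    """Yield, in pre-order, the path (root first, match last) of every
    node called `name` in the tree spanned by `node`."""
    path = path + (node,)
    if node[0] == name:
        yield path
    if not isinstance(node[1], str):
        for child in node[1]:
            yield from select_paths(child, name, path)


def pp(path):
    """Pretty-print a path as a chain of node names."""
    return ' <- '.join(node[0] for node in path)


toy_sentence = ('sentence', (
    ('word', 'This'), ('blank', ' '),
    ('phrase', (('word', 'Buckingham'), ('blank', ' '), ('word', 'Palace')))))

paths = list(select_paths(toy_sentence, 'word'))
toy_last_path = paths[-1]
toy_parent = toy_last_path[-2]  # the parent is the last-but-one path item
```

Since the path already contains all ancestors, the parent of the matched
node can be read off the path directly, without any back-link in the node.
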
The ``next_path()``- and ``prev_path()``-functions allow moving one step
forward or backward from a given path::

    >>> pointer = prev_path(last_path)
    >>> pp_path(pointer, with_content=-1)
    'sentence "This is Buckingham Palace" <- phrase "Buckingham Palace" <- blank " "'

``prev_path()`` and ``next_path()`` automatically zoom out by one step, if
they move past the first or last child of the last but one node in the
list::

    >>> pointer = prev_path(pointer)
    >>> pp_path(pointer, with_content=-1)
    'sentence "This is Buckingham Palace" <- phrase "Buckingham Palace" <- word "Buckingham"'
    >>> pp_path(prev_path(pointer), with_content=-1)
    'sentence "This is Buckingham Palace" <- blank " "'

Thus::

    >>> next_path(prev_path(pointer)) == pointer
    False
    >>> pointer = prev_path(pointer)
    >>> pp_path(next_path(pointer), with_content=-1)
    'sentence "This is Buckingham Palace" <- phrase "Buckingham Palace"'

The reason for this behavior is that ``prev_path()`` and ``next_path()`` try
to move to the path which contains the string content preceding or
succeeding that of the given path. Therefore, these functions move to the
next sibling on the same branch, rather than traversing the complete tree
like the ``select()``- and ``select_path()``-methods of the Node-class.
However, when moving past the first or last sibling, it is not clear what
the next node on the same level should be. To keep things simple, the
function "zooms out" and returns the next sibling of the parent.

It is, of course, possible to zoom back into a path::

    >>> pp_path(zoom_into_path(next_path(pointer), FIRST_CHILD, steps=1),
    ...         with_content=-1)
    'sentence "This is Buckingham Palace" <- phrase "Buckingham Palace" <- word "Buckingham"'

Often it is preferable to move through the leaf-nodes and their paths right
away.
Functions like ``next_leaf_path()`` and ``prev_leaf_path()`` provide
syntactic sugar for this case::

    >>> pointer = next_leaf_path(pointer)
    >>> pp_path(pointer, with_content=-1)
    'sentence "This is Buckingham Palace" <- phrase "Buckingham Palace" <- word "Buckingham"'

It is also possible to inspect just the string content surrounding a path,
rather than its structural environment::

    >>> ensuing_str(pointer)
    ' Palace'
    >>> assert foregoing_str(pointer, length=1) == ' ', "Blank expected!"

It is also possible to systematically iterate through the paths forward or
backward - just like the `node.select_path()`-method, but starting from an
arbitrary path, instead of the one end or the other end of the tree rooted
in `node`::

    >>> t = parse_sxpr('(A (B 1) (C (D (E 2) (F 3))) (G 4) (H (I 5) (J 6)) (K 7))')
    >>> pointer = t.pick_path('G')
    >>> printw([pp_path(ctx, with_content=1)
    ...         for ctx in select_path(pointer, ANY_PATH, include_root=True)])
    ['A <- G "4"', 'A <- H "56"', 'A <- H <- I "5"', 'A <- H <- J "6"',
     'A <- K "7"', 'A "1234567"']
    >>> printw([pp_path(ctx, with_content=1)
    ...         for ctx in select_path(
    ...             pointer, ANY_PATH, include_root=True, reverse=True)])
    ['A <- G "4"', 'A <- C "23"', 'A <- C <- D "23"', 'A <- C <- D <- F "3"',
     'A <- C <- D <- E "2"', 'A <- B "1"', 'A "1234567"']

Another important difference, besides the starting point, is that the
`select()`-generators of the `nodetree`-module traverse the tree post-order
(or "depth first"), while the respective methods of the Node-class traverse
the tree pre-order. See the difference::

    >>> l = [pp_path(ctx, with_content=1)
    ...      for ctx in t.select_path(ANY_PATH, include_root=True)]
    >>> l[l.index('A <- G "4"'):]
    ['A <- G "4"', 'A <- H "56"', 'A <- H <- I "5"', 'A <- H <- J "6"', 'A <- K "7"']
    >>> l = [pp_path(ctx, with_content=1)
    ...      for ctx in t.select_path(ANY_PATH, include_root=True, reverse=True)]
    >>> printw(l[l.index('A <- G "4"'):])
    ['A <- G "4"', 'A <- C "23"', 'A <- C <- D "23"', 'A <- C <- D <- F "3"',
     'A <- C <- D <- E "2"', 'A <- B "1"']

Content Mappings
----------------

Basics
^^^^^^

For finding a passage in the text or identifying certain textual features
like, for example, matching brackets, traversing the document-tree is not
really an option, if only because a passage may extend over several nodes,
possibly even on different levels of the tree hierarchy. For such cases it
is possible to generate a content mapping that maps text positions within
the pure string-content to the paths of the leaf-nodes to which they belong.
This mapping can be thought of as a "string-view" on the tree::

    >>> sentence = parse_sxpr(
    ...     '(sentence (word "This") (blank " ") (word "is") (blank " ")'
    ...     ' (phrase (word "Buckingham") (blank " ") (word "Palace")))')
    >>> ctx_mapping = ContentMapping(sentence)
    >>> print(ctx_mapping.content)
    This is Buckingham Palace
    >>> print(ctx_mapping)
    0 -> sentence, word "This"
    4 -> sentence, blank " "
    5 -> sentence, word "is"
    7 -> sentence, blank " "
    8 -> sentence, phrase, word "Buckingham"
    18 -> sentence, phrase, blank " "
    19 -> sentence, phrase, word "Palace"

Note that the path in the first line of the output is different from the
path in the third line, although the sequence of node-names that appears in
the pretty-printed version shown here is the same, i.e. "sentence, word",
because the paths really consist of different Nodes.

Now let's find all letters that are followed by a whitespace character::

    >>> import re
    >>> locations = [m.start() for m in re.finditer(r'\w ', ctx_mapping.content)]
    >>> targets = [ctx_mapping.get_path_and_offset(loc) for loc in locations]

.. tip::
   Other than the node's content property, the content mapping's content
   field is not generated on the fly every time it is retrieved, but only
   when instantiating or rebuilding the mapping. Performance-wise it is
   advisable to always use the content mapping's content field.

The target returned by :py:meth:`~nodetree.ContentMapping.get_path_and_offset`
is a tuple of the target path and the relative position of the location that
falls within this path::

    >>> [(pp_path(path), relative_pos) for path, relative_pos in targets]
    [('sentence <- word', 3), ('sentence <- word', 1), ('sentence <- phrase <- word', 9)]

Now, the structured text can be manipulated at the precise locations where
the string search yielded a match. Let's turn our text into a little riddle
by replacing the letters of the leaf-nodes before the match locations with
three dots::

    >>> for path, pos in targets:
    ...     path[-1].result = '...' + path[-1].content[pos:]
    >>> str(sentence)
    '...s ...s ...m Palace'

The positions represent the text positions of the text represented by the
tree at the very moment when the content mapping is generated, not the
source positions captured by the `pos`-property of the node-objects! This
also means that the mapping becomes outdated when the tree is being
restructured. Unless you use the methods provided by
:py:class:`~nodetree.ContentMapping` itself in order to make changes to the
tree, you need to either call
:py:meth:`~nodetree.ContentMapping.rebuild_mapping` to update the content
mapping at the affected places or instantiate an entirely new content
mapping.

Restricted Mappings
^^^^^^^^^^^^^^^^^^^

A very powerful feature of content mappings is that they allow restricting
the string view onto the document tree to selected parts of the tree, which
makes it possible to exclude the remaining parts from the search, e.g.::

    >>> xml = '''<doc><p>In München<footnote><em>München</em> is the German
    ... name of the city of Munich</footnote> is a Hofbräuhaus</p></doc>'''
    >>> tree = parse_xml(xml)

Now, assume you would like to find all occurrences of "München" in the main
text but not in the footnotes. Then you can create a content mapping that
ignores all footnotes::

    >>> cm = ContentMapping(tree, select=LEAF_PATH, ignore={'footnote'})
    >>> list(re.finditer('München', cm.content))
    [<re.Match object; span=(3, 10), match='München'>]

In order to restrict the content mapping to certain parts of the tree, the
ContentMapping-class takes a pair of path selectors similar to the
"criteria" and "skip_subtree" parameters of :py:meth:`Node.select_path` and
:py:meth:`Node.pick`. However, there is a subtle but important difference:
The "select"-parameter of the ContentMapping-class must only accept
leaf-paths! Otherwise a ValueError will be raised.

In contrast to the restricted content mapping, the search in the
string-content of the entire tree yields::

    >>> printw(list(re.finditer('München', tree.content)))
    [<re.Match object; span=(3, 10), match='München'>,
     <re.Match object; span=(10, 17), match='München'>]

Although the string locations in a content mapping that has been restricted
to certain parts of the tree have shifted with respect to the string
locations in the full document tree, there is no need to worry that the
mapped locations within the tree have changed::

    >>> tree_pos = tree.content.find('Hofbräuhaus')
    >>> print(tree_pos)
    64
    >>> tm = ContentMapping(tree)
    >>> tm.content.find('Hofbräuhaus')  # should be the same as above
    64
    >>> cm_pos = cm.content.find('Hofbräuhaus')
    >>> print(cm_pos)
    16

The string-position is not the same, because the mapping ``cm`` omits the
footnote-text. Yet, the path and offset within the tree remain the same.
(Remember that the ``:Text``-nodes are "anonymous" nodes that the XML-parser
inserts for the character data of XML-elements with `mixed content`_.)::

    >>> cm_path, cm_offset = cm.get_path_and_offset(cm_pos)
    >>> print(pp_path(cm_path, delimiter=', '), '->', cm_offset)
    doc, p, :Text -> 6
    >>> tm_path, tm_offset = tm.get_path_and_offset(tree_pos)
    >>> print(pp_path(tm_path, delimiter=', '), '->', tm_offset)
    doc, p, :Text -> 6
    >>> assert tm_path == cm_path  # paths are really the same sequence of nodes

This can easily be confirmed by looking at the complete mappings in direct
comparison. First the unrestricted mapping::

    >>> print(tm)
    0 -> doc, p, :Text "In München"
    10 -> doc, p, footnote, em "München"
    17 -> doc, p, footnote, :Text " is the German"
        "name of the city of Munich"
    58 -> doc, p, :Text " is a Hofbräuhaus"

Now, the mapping that omits the footnotes::

    >>> print(cm)
    0 -> doc, p, :Text "In München"
    10 -> doc, p, :Text " is a Hofbräuhaus"

Note that the numbers at the beginning of each line represent the string
position, which is different for the same path in the two mappings. This has
no bearing on the offsets, which count from the content-mapping-specific
position of each path in the content mapping.

Conversely, we could also have restricted the content mapping only to the
footnote(s)::

    >>> fm = ContentMapping(tree, select=leaf_paths('footnote'), ignore=NO_PATH)
    >>> print(fm)
    0 -> doc, p, footnote, em "München"
    7 -> doc, p, footnote, :Text " is the German"
        "name of the city of Munich"

Here, the parameter ``ignore=NO_PATH`` has to be understood as "from the
selected paths do not ignore any paths". Note the
:py:func:`leaf_paths`-filter used to define the value of the
select-argument. ContentMapping raises a ValueError if the select-criterion
allows paths that are not leaf-paths. The leaf_paths-filter is a simple,
though slightly costly in terms of speed, means of turning any criterion
into a "criterion is true for path AND path is a leaf-path"-condition.

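
The "criterion AND leaf-path" idea behind such a filter can be sketched in
a few lines. The following toy version works on paths of ``(name, content)``
tuples; it is not DHParser's implementation, and the helpers ``all_paths``
and ``inside`` as well as the way the criterion is evaluated are assumptions
made for the sake of the illustration:

```python
# Toy model, NOT DHParser's API: a node is a (name, content) pair and a
# path is a tuple of nodes from the root down to a particular node.

def all_paths(node, path=()):
    """Yield the paths of all nodes of the tree in pre-order."""
    path = path + (node,)
    yield path
    if not isinstance(node[1], str):
        for child in node[1]:
            yield from all_paths(child, path)


def inside(name):
    """Criterion: some node on the path is called `name`."""
    return lambda path: any(node[0] == name for node in path)


def leaf_paths(criterion):
    """Turn `criterion` into 'criterion is true AND path ends in a leaf'."""
    def combined(path):
        return criterion(path) and isinstance(path[-1][1], str)
    return combined


toy_doc = ('doc', (('p', (
    (':Text', 'In München'),
    ('footnote', (('em', 'München'), (':Text', ' is the German name'))),
    (':Text', ' is a Hofbräuhaus'))),))

def names(path):
    return ', '.join(node[0] for node in path)

unfiltered = [names(p) for p in all_paths(toy_doc) if inside('footnote')(p)]
leaves_only = [names(p) for p in all_paths(toy_doc)
               if leaf_paths(inside('footnote'))(p)]
```

In this sketch the plain criterion also accepts the path of the
``footnote``-branch itself, whereas the combined condition accepts only the
two leaf-paths below it. The extra check on every path is where the small
speed penalty comes from.
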
Now, let's look for the string "München" in the footnotes only::

    >>> i = fm.content.find('München')
    >>> path, offset = fm.get_path_and_offset(i)
    >>> pp_path(path, 1)
    'doc <- p <- footnote <- em "München"'
    >>> print(offset)
    0

We can now manipulate the tree through the path and offset. Let's insert the
word "Stadt" in front of "München". We do so by splicing the new text into the
result of the leaf node of the path at the given offset::

    >>> path[-1].result = path[-1].result[:offset] + "Stadt " + \
    ...     path[-1].result[offset:]

In this particular case, because the offset is zero, we could also have
written ``"Stadt " + path[-1].result``, but the formula above also works for
the general case where we cannot be sure that the offset will always be 0.

We expect that the change is reflected in the tree at the right position,
i.e. inside the footnote::

    >>> printw(tree.as_xml(inline_tags={'doc'}))
    In MünchenStadt München is the German name of the city of Munich is a Hofbräuhaus

As mentioned earlier, the content mapping should be considered tainted if the
underlying tree has been changed by any other means than the methods of the
ContentMapping-object itself. In order to rebuild the affected part of the
content mapping, :py:meth:`ContentMapping.rebuild_mapping` must be called
with the first and last path index of the section of the content mapping
where a change has taken place::

    >>> fm.rebuild_mapping(i, i)
    >>> print(fm)
    0 -> doc, p, footnote, em "Stadt München"
    13 -> doc, p, footnote, :Text " is the German" "name of the city of Munich"

Limitations
^^^^^^^^^^^

As of now, a limitation of the content mappings provided by
:py:mod:`DHParser.nodetree` consists in the fact that they remain completely
agnostic with respect to any textual meaning of the nodes. For example,
assume that the node-name "pb" signifies a page break, which implies that
there is a gap between the two parts separated by the page break. However,
because this is considered part of the meaning of "pb", the encoding
guidelines for the document may not require that a gap, say, a blank
character or a linefeed, is also redundantly encoded in the string content of
the document. (It may even be forbidden to do so!) But then a search on the
string content may miss phrases separated by a page break::

    >>> tree = parse_xml('xyz NewYork xyz')
    >>> print(tree.content)
    xyz NewYork xyz
    >>> m = re.search(r'New\s+York', tree.content)
    >>> print(m)
    None

Currently, the only remedy is either to allow redundant encoding of textual
meanings within the string-content or to add specific nodes that carry the
redundant textual meanings within their string-content and to remove them
again after searches etc. have been finished.
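The second remedy can be sketched with Python's standard
``xml.etree.ElementTree`` instead of DHParser's own classes. The sample
document, the element name ``pb`` and the helper ``content`` are assumptions
chosen for this illustration. The page-break element temporarily receives a
blank as string content, the search is run on the padded text, and the blank
is removed again afterwards.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element names are illustrative only.
root = ET.fromstring('<doc>xyz New<pb/>York xyz</doc>')


def content(elem):
    """Concatenated string content of an element and all its descendants."""
    return ''.join(elem.itertext())


# Without a redundant gap the phrase is not found.
before = re.search(r'New\s+York', content(root))

# Temporarily encode the gap that a page break implies as a blank.
page_breaks = list(root.iter('pb'))
for pb in page_breaks:
    pb.text = ' '
match = re.search(r'New\s+York', content(root))

# Remove the redundant blanks again once the search is finished.
for pb in page_breaks:
    pb.text = None

print(before, match.group() if match else None, repr(content(root)))
```

Note that positions found in the padded text are shifted with respect to the
original string content, so they should be translated back (or only be used
while the padding is still in place).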
Markup insertion
----------------

Class :py:class:`ContentMapping` provides powerful markup-methods that allow
you to add markup at any position you like, simply by passing the start- and
end-position in the string-representation of the document-tree. These methods
"automagically" take care of such perils as cross-cutting tag-boundaries or
overlapping hierarchies.

This solves a common challenge when processing tree-structured text-data,
namely that of adding new nodes that cover certain ranges of the string
content that may already have been covered by other elements. The problem is
the same as adding further markup to an existing XML- or HTML-document. In
trivial cases like::

    >>> trivial_xml = parse_xml("Please mark up Stadt München "
    ...     "in Bavaria in this sentence.")

we would hardly need any help from a library to mark up the string
"Stadt München". But both finding certain sub-strings and marking them up can
easily become complicated::

    >>> hard_xml = parse_xml("Please mark up Stadt\n"
    ...     "München in Bavaria in this "
    ...     "sentence.")

Let's start with the simple case to see how searching and marking strings
works with DHParser::

    >>> mapping = ContentMapping(trivial_xml)
    >>> match = re.search(r"Stadt\s+München", mapping.content)
    >>> _ = mapping.markup(match.start(), match.end(), "foreign",
    ...                    {'lang': 'de'})
    >>> printw(trivial_xml.as_xml(inline_tags={'trivial'}))
    Please mark up Stadt München in Bavaria in this sentence.

In order to search for the text-string, a regular expression is used rather
than a simple search for "Stadt München", because we cannot assume that it
appears in exactly the same form in the text. For example, it could be broken
up by a line break, e.g. "Stadt\\nMünchen".

Now, let's try the more complicated case.
Because we will try different configurations, we use copies of the tree
"hard_xml"::

    >>> hard_xml_copy = copy.deepcopy(hard_xml)
    >>> mapping = ContentMapping(hard_xml_copy)
    >>> match = re.search(r"Stadt\s+München", mapping.content)
    >>> _ = mapping.markup(match.start(), match.end(), "foreign",
    ...                    {'lang': 'de'})
    >>> xml_str = hard_xml_copy.as_xml(empty_tags={'lb'})
    >>> print(xml_str)
    Please mark up Stadt München in Bavaria in this sentence.

As can be seen, the -tag is split into two parts, because the markup runs
across the border of another tag, in this case . Note, that the -tag lies
inside the -tag. But that makes sense, because it would also have been inside
the -tag, had there been no -tag and no need to split. (By default, the
algorithm behaves somewhat "greedy", which, however, can be configured with a
parameter of the same name passed to the constructor of class ContentMapping.)

But what if you don't want the -tag to be split up in two or more parts, as
the case may be? Well, in this case, you need to allow those tags, the
borders of which the new markup runs across, to be split by that markup::

    >>> hard_xml_copy = copy.deepcopy(hard_xml)
    >>> divisibility_map = {'foreign': {'location', ':Text'},
    ...                     '*': {':Text'}}
    >>> mapping = ContentMapping(hard_xml_copy, divisibility=divisibility_map)
    >>> match = re.search(r"Stadt\s+München", mapping.content)
    >>> _ = mapping.markup(match.start(), match.end(), "foreign",
    ...                    {'lang': 'de'})
    >>> xml_str = hard_xml_copy.as_xml(empty_tags={'lb'})
    >>> print(xml_str)
    Please mark up Stadt München in Bavaria in this sentence.

See the difference? This time the -element remains intact, while the -element
has been split. This behavior can be configured by the divisibility-map that
is passed to the parameter ``divisibility`` of the ContentMapping-constructor.
It maps elements (or, rather, their names) to sets of elements that can be cut
by them.
The asterisk ``*`` is a wildcard; its value-set contains those elements that
can be cut by any other element. An element that does not appear in any
value-set of the mapping cannot be cut by any other element.

It is also possible to pass a simple set of element-names instead of a
dictionary to the divisibility-parameter. In this case any element with a
name in this set can be cut by any other element. Any element the name of
which is not a member of the set cannot be cut when markup is added.

In cases where markup overlaps element-borders, it is unavoidable to decide
which element will be divided and which not. It is a general limitation of
tree structures that they do not allow overlapping hierarchies. In this
particular example, it would most probably be more reasonable to keep the
-element intact, because locations should probably be recognizable as units,
while this does not really seem to matter for a foreign language annotation.

The case may arise, though, where you cannot avoid splitting elements that
form units. At this point you probably should consider using an entirely
different data-structure, say, a graph. But if this is not an option,
:py:mod:`DHParser.nodetree` allows you to mark split elements as belonging to
the same "chain" of elements. In order to do so you can pass a
``chain_attr_name`` to the constructor of class ContentMapping. This is an
(arbitrary) name for an attribute which will contain a unique short string
that all elements (of the same name) belonging to one and the same chain
share with each other, but not with any other elements. Let's try this on the
previous example::

    >>> reset_chain_ID()  # just to ensure deterministic ID values for doctest
    >>> hard_xml_copy = copy.deepcopy(hard_xml)
    >>> match = re.search(r"Stadt\s+München", hard_xml_copy.content)
    >>> divisibility_map = {'foreign': {'location', ':Text'},
    ...                     '*': {':Text'}}
    >>> mapping = ContentMapping(hard_xml_copy, divisibility=divisibility_map,
    ...                          chain_attr_name="chain")
    >>> _ = mapping.markup(match.start(), match.end(), "foreign",
    ...                    {'lang': 'de'})
    >>> xml_str = hard_xml_copy.as_xml(empty_tags={'lb'})
    >>> print(xml_str)
    Please mark up Stadt München in Bavaria in this sentence.

Markup plays well together with restricted content mappings, as the following
example may show::

    >>> tree = parse_xml("Please mark up Stadt\n"
    ...     "München'Stadt München'"
    ...     " is German for 'City of Munich' in Bavaria"
    ...     " in this sentence.")

Let's assume we'd like to mark up locations and text-passages in foreign
languages, but only in the main text and not within footnotes and the like.
For that purpose, we build a content mapping that is restricted to
non-footnote-text::

    >>> cm = ContentMapping(tree, select=LEAF_PATH, ignore='footnote',
    ...                     chain_attr_name='chain')
    >>> print(cm.content)
    Please mark up Stadt München in Bavaria in this sentence.

Now, let's assume for the sake of the example that we have a list of location
names to be marked up that contains the phrase "München in Bavaria". So, we
search for this phrase and add the required location markup::

    >>> m = re.search(r"München\s+in\s+Bavaria", cm.content)
    >>> print(m)
    >>> _ = cm.markup(m.start(), m.end(), 'location')
    >>> print(tree.as_xml(empty_tags={'lb'}))
    Please mark up Stadt München 'Stadt München ' is German for 'City of Munich' in Bavaria in this sentence.

The -element covers the entire span, including the footnote. This is to be
expected as changes are always carried out on the full tree. Only the mapping
is restricted to certain parts of the document. Usually, this is also the
desired behavior, though, admittedly, depending on the use case another
behavior (e.g. splitting the -element into one part before the -element and
one part after that element) might be preferable. Such cases are not covered
by the markup-method of class ContentMapping. Because the -element did not
need to be split, it does not need and therefore does not have a
"chain"-attribute.
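Once elements have been split and tagged with a chain-attribute, a consumer
of the exported XML can put the pieces together again. The following is a
minimal sketch of that idea using only Python's standard
``xml.etree.ElementTree``. The sample markup, the attribute value ``A1`` and
the helper ``collect_chains`` are assumptions made for this illustration; the
ID values that DHParser actually generates may look different.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical export: the two <foreign> pieces share the made-up chain ID "A1".
xml_source = (
    '<doc>Please mark up <foreign lang="de" chain="A1">Stadt</foreign>'
    '<location><foreign lang="de" chain="A1"> München</foreign>'
    ' in Bavaria</location> in this sentence.</doc>'
)
root = ET.fromstring(xml_source)


def collect_chains(root, chain_attr='chain'):
    """Join the text of all elements that share a name and a chain ID."""
    chains = defaultdict(list)
    for elem in root.iter():  # iterates in document order
        chain_id = elem.get(chain_attr)
        if chain_id is not None:
            chains[(elem.tag, chain_id)].append(''.join(elem.itertext()))
    return {key: ''.join(parts) for key, parts in chains.items()}


print(collect_chains(root))
```

Grouping by both element name and chain ID mirrors the statement above that a
chain value is shared only among elements of the same name.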
Next, let's add the -element. (We substitute the value of its
chain-attribute, so that the doctest does not break, when another random key
is picked!)::

    >>> m = re.search(r'Stadt\s+München', cm.content)
    >>> _ = cm.markup(m.start(), m.end(), 'foreign', lang="de")
    >>> print(tree.as_xml(empty_tags={'lb'}))
    Please mark up Stadt München 'Stadt München ' is German for 'City of Munich' in Bavaria in this sentence.

Here again, one might ask why the -tag contains the -tag, but the choice
makes sense, because if put together again, it should cover the complete
stretch including the line-break. Again, different use cases and different
choices are imaginable which, however, are not covered by the
:py:meth:`ContentMapping.markup`-method.

Error Messages
--------------

Although errors are typically located at a particular point or range of the
source code, DHParser treats them as global properties of the syntax tree
(albeit with a location), rather than attaching them to particular nodes.
This has two advantages:

1. When restructuring the tree and removing or adding nodes during the
   abstract-syntax-tree-transformation and possibly further
   tree-transformations, error messages do not accidentally get lost.

2. It is not necessary to add another slot to the Node class for keeping an
   error list which most of the time would remain empty, anyway.

In order to track errors and other global properties, module
:py:mod:`~nodetree` provides the :py:class:`~nodetree.RootNode`-class. The
root-object of a syntax-tree produced by parsing is of type
:py:class:`~nodetree.RootNode`. If a root node needs to be created manually,
it is necessary to create a plain :py:class:`~nodetree.Node`-object and
either pass it to :py:class:`~nodetree.RootNode` as parameter on
instantiation or, later, to the :py:meth:`~nodetree.RootNode.swallow`-method
of the RootNode-object::

    >>> document = RootNode(sentence, str(sentence))

The second parameter is normally the source code.
In this example we simply use the string representation of the syntax-tree
originating in `sentence`. Before any errors can be added, the
source-position fields of the nodes of the tree must have been initialized.
Usually, this is done by the parser. Since the syntax-tree in this example
does not stem from a parsing-process, we have to do it manually::

    >>> _ = document.with_pos(0)

Now, let's mark all ``word``-nodes that contain non-letter characters with an
error-message. There should be plenty of them, because, earlier, we have
replaced some of the words partially with "..."::

    >>> import re
    >>> len([document.new_error(node, "word contains illegal characters")
    ...      for node in document.select('word')
    ...      if re.fullmatch(r'\w*', node.content) is None])
    3
    >>> for error in document.errors_sorted: print(error)
    1:1: Error (1000): word contains illegal characters
    1:6: Error (1000): word contains illegal characters
    1:11: Error (1000): word contains illegal characters

The format of the string representation of Error-objects resembles that of
compilers and is understood by many text-editors, which can then mark the
errors in the source code.

Attribute-Handling
------------------

While the "Node.attr"-field can be used to store data of any kind, it will
often just serve to store XML-attributes, the value of which is always a
string. The :py:mod:`DHParser.nodetree`-module provides a mini-API to
simplify typical use cases of XML-attributes.

One important use case of attributes is to add CSS-classes to or remove them
from the "class"-attribute. The "class"-attribute is understood as containing
a set of whitespace-delimited strings. Module "nodetree" provides a few
functions to simplify class-handling::

    >>> paragraph = Node('p', 'veni vidi vici')
    >>> add_class(paragraph, 'smallprint')
    >>> paragraph.attr['class']
    'smallprint'

Although the class-attribute is filled with a sequence of strings, it should
behave like a set of strings.
For example, one and the same class name should not appear twice in the class
attribute::

    >>> add_class(paragraph, 'smallprint justified')
    >>> paragraph.attr['class']
    'smallprint justified'

Plus, the order of the class strings does not matter when checking for
elements::

    >>> has_class(paragraph, 'justified smallprint')
    True
    >>> remove_class(paragraph, 'smallprint')
    >>> has_class(paragraph, 'smallprint')
    False
    >>> has_class(paragraph, 'justified smallprint')
    False
    >>> has_class(paragraph, 'justified')
    True

The same logic of treating blank-separated sequences of strings as sets can
also be applied to other attributes::

    >>> car = Node('car', 'Porsche')
    >>> add_token_to_attr(car, "Linda Peter", 'owner')
    >>> car.attr['owner']
    'Linda Peter'

Or, more generally, to strings containing whitespace-separated substrings::

    >>> add_token('Linda Paula', 'Peter Paula')
    'Linda Paula Peter'

*Classes and Functions-Reference*
---------------------------------

The full documentation of all classes and functions can be found in module
:py:mod:`DHParser.nodetree`. The following table of contents lists the most
important of these:

class Node
^^^^^^^^^^

* :py:class:`~nodetree.Node`: the central building-block of a node-tree
* :py:attr:`~nodetree.Node.result`: either the child nodes or the node's
  string content
* :py:attr:`~nodetree.Node.children`: the node's immediate children or an
  empty tuple
* :py:attr:`~nodetree.Node.content`: the concatenated string content of all
  descendants
* :py:attr:`~nodetree.Node.name`: the node's name
* :py:attr:`~nodetree.Node.attr`: the dictionary of the node's attributes
* :py:attr:`~nodetree.Node.pos`: the source-code position of this node, in
  case the node stems from a parsing process

**Navigation**

* :py:meth:`~nodetree.Node.select`: Selects nodes from the tree of
  descendants.
* :py:meth:`~nodetree.Node.pick`: Picks a particular node from the tree of
  descendants.
* :py:meth:`~nodetree.Node.locate`: Finds the leaf-node covering a particular
  location of string content of the tree originating in this node.
* :py:meth:`~nodetree.Node.select_path`: Selects :ref:`paths ` from the tree
  of descendants.
* :py:meth:`~nodetree.Node.pick_path`: Picks a particular path from the tree
  of descendants.
* :py:meth:`~nodetree.Node.locate_path`: Finds the path of the leaf-node
  covering a particular location of string content of the tree originating in
  this node.

**Serialization**

* :py:meth:`~nodetree.Node.as_sxpr`: Serializes the tree originating in a
  node as S-expression.
* :py:meth:`~nodetree.Node.as_xml`: Serializes the tree as XML.
* :py:meth:`~nodetree.Node.as_json`: Serializes the tree as JSON.

**XML-exchange**

* :py:meth:`~nodetree.Node.as_etree`: Converts the tree to an
  XML-`ElementTree`_ as defined by the respective module from Python's
  standard library.
* :py:meth:`~nodetree.Node.from_etree`: Converts an XML-`ElementTree`_ into a
  tree of :py:class:`~nodetree.Node`-objects.

**Evaluation**

* :py:meth:`~nodetree.Node.evaluate`: "Evaluates" a tree by picking the
  function to be run on each node from a dictionary that maps tag-names to
  functions.

Reading serialized trees
^^^^^^^^^^^^^^^^^^^^^^^^

* :py:func:`~nodetree.parse_sxpr`: Converts any S-expression string to a tree
  of nodes.
* :py:func:`~nodetree.parse_xml`: Converts any XML-document to a tree of
  nodes.
* :py:func:`~nodetree.parse_json`: Converts a JSON-document that has
  previously been created with :py:meth:`~nodetree.Node.as_json` from a tree
  of nodes back to a tree of nodes.
* :py:func:`~nodetree.deserialize`: Tries to guess the data-type of a string
  and then calls any of the above deserialization-functions accordingly.

Traversing trees via paths
^^^^^^^^^^^^^^^^^^^^^^^^^^

* :py:func:`~nodetree.prev_path`: Returns the :ref:`path ` preceding a given
  path.
* :py:func:`~nodetree.next_path`: Returns the :ref:`path ` following a given
  path.
* :py:func:`~nodetree.pp_path`: Pretty-prints the given :ref:`path `

Attribute-handling
^^^^^^^^^^^^^^^^^^

* :py:func:`~nodetree.has_token_on_attr`: Checks whether an attribute of a
  node contains one or more tokens, i.e. blank-separated sequences of letters.
* :py:func:`~nodetree.add_token_to_attr`: Adds a token to a particular
  attribute of a node.
* :py:func:`~nodetree.remove_token_from_attr`: Removes a token from a
  particular attribute of a node.
* :py:func:`~nodetree.has_class`, :py:func:`~nodetree.add_class`,
  :py:func:`~nodetree.remove_class`: the same as above, only that these
  functions manipulate the tokens specifically of the class-attribute

class RootNode
^^^^^^^^^^^^^^

Any Node-object can be considered as the origin of a tree and none of the
"navigation"-functions requires a tree of nodes to start with a
RootNode-object. However, RootNode-objects provide support for certain
"global" aspects of a tree like keeping track of the source code with line
and column numbers and adding error messages. RootNode-objects can either be
initialized with a node that will then be replaced by the root-node or
swallow a tree originating in a common node later.

* :py:class:`~nodetree.RootNode`: additional functionality for a tree of nodes
* :py:data:`~nodetree.RootNode.errors`: a list of errors
* :py:attr:`~nodetree.RootNode.errors_sorted`: the errors sorted by their
  position in the source code instead of the time of their having been added
* :py:data:`~nodetree.RootNode.inline_tags`: a set of tags that will be
  printed on a single line with their content when serializing. (This helps
  to avoid undesired whitespace when exporting to HTML!)
* :py:data:`~nodetree.RootNode.string_tags`: a set of tags that will be
  converted to simple strings that appear as mixed content inside their
  parent when serializing as XML
* :py:data:`~nodetree.RootNode.empty_tags`: a set of tags that will be
  rendered as empty tags, e.g. ```` when serializing as XML
* :py:meth:`~nodetree.RootNode.swallow`: Can be called once in the lifetime
  of the RootNode-object to assign this root-node to an existing tree of
  nodes.
* :py:meth:`~nodetree.RootNode.new_error`: Creates and adds a new error.
* :py:meth:`~nodetree.RootNode.as_xml`: Serializes the tree as XML taking
  into account the XML-customization attributes of the RootNode-object.

class ContentMapping
^^^^^^^^^^^^^^^^^^^^

ContentMapping represents a path-mapping of the string-content of all or a
specific selection of the leaf-nodes of a tree. A content-mapping is an
ordered mapping of the first text position of every (selected) leaf-node to
the path of this node. The class provides methods for mapping string
positions to paths and offsets (relative to the beginning of the leaf-node of
the path).

* :py:class:`~nodetree.ContentMapping`: Mapping the tree to its string-content
* :py:meth:`~nodetree.ContentMapping.get_path_and_offset`: Maps positions in
  string-content of the ContentMapping to the leaf-path into which they fall
* :py:meth:`~nodetree.ContentMapping.iterate_paths`: Yields all paths from
  position ``start_pos`` up to and including position ``end_pos``.
* :py:meth:`~nodetree.ContentMapping.insert_node`: Inserts a node at a
  particular text-position.
* :py:meth:`~nodetree.ContentMapping.markup`: Adds markup (i.e. an element)
  to a particular stretch of text.

.. _ElementTree: https://docs.python.org/3/library/xml.etree.elementtree.html
.. _mixed content: https://www.w3.org/TR/xml/#sec-mixed-content
.. _unist: https://github.com/syntax-tree/unist
.. _SXML: https://okmij.org/ftp/Scheme/SXML.html