Document-Trees
==============


Module :py:mod:`nodetree` encapsulates the functionality for creating
and handling document trees and, in particular, syntax-trees generated by
a parser. This includes serialization and deserialization of node-trees,
navigating and searching node-trees as well as annotating node-trees
with attributes and error messages.

.. _node_objects:

Node-objects
------------

Syntax trees are composed of Node-objects which are linked
uni-directionally from parent to children. Nodes can contain either
child-nodes, in which case they are informally called "branch-nodes", or
text-strings, in which case they are informally called "leaf-nodes", but not
both at the same time. (There is no mixed content as in XML!)

Apart from their content, the most important property of a Node-object is its
``name``. Nodes are initialized with their name and content as
arguments::

    >>> from DHParser.nodetree import *
    >>> number_1 = Node('number', "5")
    >>> number_1.name
    'number'

The Node-object ``number_1`` now has the tag-name "number" and the content "5".
Since the content is a string and not a tuple of child-nodes, the node
constructed is a leaf-node.

(By convention, if the tag-name of a node starts with a colon ":", the node is
considered "anonymous". This distinction is helpful when a tree of nodes is
generated in a parsing process to distinguish nodes that contain important
pieces of data from nodes that merely contain delimiters or structural
information.)

Several nodes can be connected to a tree::

    >>> number_2 = Node('number', "4")
    >>> addition = Node('add', (number_1, number_2))

Trees spanned by a node can conveniently be serialized as S-expressions
(well-known from the computer languages "Lisp" and "Scheme")::

    >>> print(addition.as_sxpr())
    (add (number "5") (number "4"))

It is also possible to serialize nodes as an XML-snippet::

    >>> print(addition.as_xml())
    <add>
      <number>5</number>
      <number>4</number>
    </add>

or as an indented tree::

    >>> print(addition.as_tree())
    add
      number "5"
      number "4"

or as JSON-data (see further below). Trees can also be deserialized from any of
these formats with the exception of the indented tree (see below).

In order to test whether a Node is a leaf-node, one can check for the absence of
children::

    >>> node = Node('word', 'Palace')
    >>> assert not node.children

The data of a node can be queried by reading the result-property::

    >>> node.result
    'Palace'

The `result` is always a string or a tuple of Nodes, even if the
node-object has been initialized with a single node::

    >>> parent = Node('phrase', node)
    >>> parent.result
    (Node('word', 'Palace'),)

The `result`-property can be assigned to, in order to change the data of a
node::

    >>> parent.result = (Node('word', 'Buckingham'), Node('blank', ' '), node)
    >>> print(parent.as_sxpr())
    (phrase (word "Buckingham") (blank " ") (word "Palace"))

Content-equality of Nodes must be tested with the `equals()`-method. The
equality operator `==` tests merely for the identity of the node-object,
not for the equality of the content of two different node-objects::

    >>> n1 = Node('dollars', '1')
    >>> n2 = Node('dollars', '1')
    >>> n1.equals(n2)
    True
    >>> n1 == n2
    False

An empty node is always a leaf-node, that is, if initialized with an empty
tuple, the node's result will actually be the empty string::

    >>> empty = Node('void', ())
    >>> empty.result
    ''
    >>> assert empty.equals(Node('void', ''))

Next to the `result`-property, a node's content can be queried with
either its `children`-property or its `content`-property. The former
yields the tuple of child-nodes. The latter yields the string-content of
the node, which in the case of a "branch-node" is the (recursively
generated) concatenated string-content of all of its children::

    >>> node.content
    'Palace'
    >>> node.children
    ()
    >>> parent.content
    'Buckingham Palace'
    >>> parent.children
    (Node('word', 'Buckingham'), Node('blank', ' '), Node('word', 'Palace'))

Both the `content`-property and the `children`-property are
read-only-properties. In order to change the data of a node, its
`result`-property must be assigned to (as shown above).
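
How these three properties relate to each other can be pictured with a small
stand-in class. The following is only a sketch under simplifying assumptions,
not DHParser's implementation; the class ``ToyNode`` is hypothetical and
exists for illustration only::

    class ToyNode:
        """Hypothetical stand-in: holds either a string or a tuple of nodes."""

        def __init__(self, name, result):
            self.name = name
            # a single node is wrapped into a tuple, an empty tuple becomes ''
            if isinstance(result, ToyNode):
                result = (result,)
            self.result = result if result else ''

        @property
        def children(self):
            # leaf-nodes have no children
            return self.result if isinstance(self.result, tuple) else ()

        @property
        def content(self):
            # branch-nodes concatenate the content of their children recursively
            if self.children:
                return ''.join(child.content for child in self.children)
            return self.result


    word = ToyNode('word', 'Palace')
    phrase = ToyNode('phrase', (ToyNode('word', 'Buckingham'),
                                ToyNode('blank', ' '), word))
    assert word.children == ()
    assert phrase.content == 'Buckingham Palace'
    assert ToyNode('void', ()).result == ''

In this sketch only ``result`` is stored, while ``children`` and ``content``
are derived from it, which is why they cannot be assigned to.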

Just like HTML- or XML-tags, nodes can be annotated with attributes.
Unlike XML and HTML, however, the value of these attributes can be of any
type, not only strings. The only requirement is that the value is serializable
as a string. Be aware, though, of the possible loss of information when
serializing nodes or converting nodes to ElementTree-elements, if there are
attributes with non-string values!
Attributes are stored in an ordered dictionary that maps string identifiers,
i.e. the attribute name, to the content of the attribute. This dictionary
can be accessed via the `attr`-property::

    >>> node.attr['price'] = 'very high'
    >>> print(node.as_xml())
    <word price="very high">Palace</word>

When serializing as S-expressions, attributes are shown as a nested list marked
with a "tick"::

    >>> print(node.as_sxpr())
    (word `(price "very high") "Palace")

Attributes can be queried via the `has_attr()` and `get_attr()`-methods.
This is to be preferred over accessing the `attr`-property for querying,
because the attribute dictionary is created lazily on the first access
of the `attr`-property::

    >>> node.has_attr('price')
    True
    >>> node.get_attr('price', '')
    'very high'
    >>> parent.get_attr('price', 'unknown')
    'unknown'

If called with no parameters or an empty string as attribute name, `has_attr()`
returns True, if at least one attribute is present::

    >>> parent.has_attr()
    False

Attributes can be deleted like dictionary entries::

    >>> del node.attr['price']
    >>> node.has_attr('price')
    False
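
The reason for preferring `has_attr()` and `get_attr()` for querying can be
illustrated with a stand-in class. This is a rough sketch of the
lazy-creation idea, not DHParser's actual code; ``ToyNode`` and its
``_attributes``-field are assumed names::

    class ToyNode:
        """Hypothetical stand-in that creates its attribute dictionary lazily."""

        def __init__(self, name):
            self.name = name
            self._attributes = None   # no dictionary exists until it is needed

        @property
        def attr(self):
            if self._attributes is None:
                self._attributes = {}   # created on first access
            return self._attributes

        def has_attr(self, attribute=''):
            if not self._attributes:
                return False
            return attribute in self._attributes if attribute else True

        def get_attr(self, attribute, default):
            if self._attributes is None:
                return default
            return self._attributes.get(attribute, default)


    node = ToyNode('word')
    assert node.get_attr('price', 'unknown') == 'unknown'
    assert not node.has_attr()
    assert node._attributes is None    # querying did not create a dictionary
    node.attr['price'] = 'very high'   # writing goes through the attr-property
    assert node.has_attr('price') and node.has_attr()

Querying through the methods thus leaves nodes without attributes untouched,
whereas reading ``node.attr`` would allocate an empty dictionary for each of
them.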

Node-objects contain a special "write once, read afterwards"-property named
`pos` that is meant to capture the source code position of the content
represented by the Node. Usually, the `pos` values are initialized with the
corresponding source code location by the parser.

The main purpose of keeping source-code locations in the node-objects is
to equip the messages of errors that are detected in later processing
stages with source code locations. In later processing stages the tree
may already have been reshaped and its string-content may have been
changed, say, by normalizing whitespace or dropping delimiters.

Before the `pos`-field can be read, it must have been initialized with the
`with_pos`-method, which recursively initializes the `pos`-fields of the child
nodes according to the offsets of their string content relative to the
position passed to `with_pos`::

    >>> import copy; essentials = copy.deepcopy(parent)
    >>> print(essentials.with_pos(0).as_xml(src=essentials.content))
    <phrase line="1" col="1">
      <word line="1" col="1">Buckingham</word>
      <blank line="1" col="11"> </blank>
      <word line="1" col="12">Palace</word>
    </phrase>
    >>> essentials[-1].pos, essentials.content.find('Palace')
    (11, 11)
    >>> essentials.result = tuple(child for child in essentials.children
    ...                           if child.name != 'blank')
    >>> print(essentials.as_xml(src=essentials.content))
    <phrase line="1" col="1">
      <word line="1" col="1">Buckingham</word>
      <word line="1" col="12">Palace</word>
    </phrase>
    >>> essentials[-1].pos, essentials.content.find('Palace')
    (11, 10)
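
The last line shows that the stored positions still refer to the original
text, whereas the string-content has changed. How positions are derived from
the lengths of the preceding content can be sketched with plain tuples. The
function ``with_positions`` below is a hypothetical helper, not part of
DHParser, and nodes are modelled as simple ``(name, result)``-tuples::

    def with_positions(node, pos=0):
        """Return a copy of `node` as (name, pos, result) plus the position
        right after the node's content."""
        name, result = node
        if isinstance(result, str):
            return (name, pos, result), pos + len(result)
        children = []
        offset = pos
        for child in result:
            # each child starts where its predecessor ended
            positioned, offset = with_positions(child, offset)
            children.append(positioned)
        return (name, pos, tuple(children)), offset


    phrase = ('phrase', (('word', 'Buckingham'), ('blank', ' '), ('word', 'Palace')))
    positioned, end = with_positions(phrase)
    assert positioned == ('phrase', 0, (('word', 0, 'Buckingham'),
                                        ('blank', 10, ' '),
                                        ('word', 11, 'Palace')))
    assert end == len('Buckingham Palace')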


.. _serialization:

Serialization
-------------

Syntax trees can be serialized as S-expressions, XML, JSON and indented text.
As special cases of S-expressions and JSON, `SXML`_ and `unist`_
are also supported. Module 'nodetree' also contains a few simple parsers
(:py:func:`~nodetree.parse_sxpr`, :py:func:`~nodetree.parse_sxml`,
:py:func:`~nodetree.parse_xml`, :py:func:`~nodetree.parse_json()`)
to convert XML-snippets, S-expressions or
JSON-objects into trees composed of Node-objects.

.. note:: Function :py:func:`~nodetree.parse_xml` can deserialize *any* XML-file
    and function :py:func:`~nodetree.parse_sxml` can deserialize *any* SXML.
    The other parsing functions, however, can parse only the restricted
    subset of S-expressions or JSON into Node-trees that is used when serializing
    into these formats. There are no functions to deserialize indented text or
    `unist`_-JSON.

In order to make it easier to parameterize serialization, the Node-class also
defines a generic :py:meth:`~nodetree.Node.serialize()`-method next to the more
specialized :py:meth:`~nodetree.Node.as_sxpr`-,
:py:meth:`~nodetree.Node.as_json`- and
:py:meth:`~nodetree.Node.as_xml()`-methods::

    >>> s = ('(sentence (word "This") (blank " ") (word "is") (blank " ") '
    ...      '(phrase (word "Buckingham") (blank " ") (word "Palace")))')
    >>> sentence = parse_sxpr(s)
    >>> print(sentence.serialize(how='indented'))
    sentence
      word "This"
      blank " "
      word "is"
      blank " "
      phrase
        word "Buckingham"
        blank " "
        word "Palace"
    >>> sxpr = sentence.serialize(how='sxpr')
    >>> round_trip = parse_sxpr(sxpr)
    >>> assert sentence.equals(round_trip)

When serializing as XML, there will be no mixed-content and, likewise, no empty
tags by default, because these do not exist in DHParser's data model::

    >>> print(sentence.as_xml())
    <sentence>
      <word>This</word>
      <blank> </blank>
      <word>is</word>
      <blank> </blank>
      <phrase>
        <word>Buckingham</word>
        <blank> </blank>
        <word>Palace</word>
      </phrase>
    </sentence>

However, mixed-content can be simulated with the `string_tags`-parameter of the
:py:meth:`~nodetree.Node.as_xml`-method::

    >>> print(sentence.as_xml(inline_tags={'sentence'}, string_tags={'word', 'blank'}))
    <sentence>This is <phrase>Buckingham Palace</phrase></sentence>

The `inline_tags`-parameter ensures that all listed tags as well as
any tag contained in a listed tag will be printed on a single line.
This is helpful when opening the XML-serialization in an internet-browser
in order to avoid spurious blanks when a line-break occurs in the HTML/XML-source.

Finally, empty tags that do not have a closing tag (e.g. <br />) can be
declared as such with the `empty_tags`-parameter.

Note that using `string_tags` can lead to a loss of information. A loss of
information is inevitable if, like in the example above, more than one tag is
listed in the `string_tags`-set passed to the
:py:meth:`~nodetree.Node.as_xml`-method.

When deserializing an XML-string, the text-parts within elements
with mixed-content will be assigned to nodes of their own with the default
name `:Text`::

    >>> tree = parse_xml(
    ...    '<sentence>This is <phrase>Buckingham Palace</phrase></sentence>')
    >>> print(tree.serialize(how='indented'))
    sentence
      :Text "This is "
      phrase "Buckingham Palace"

The name of these text-nodes can be configured with the `string_tag`-parameter
of the :py:func:`~nodetree.parse_xml`-function::

    >>> tree = parse_xml(
    ...    '<sentence>This is <phrase>Buckingham Palace</phrase></sentence>',
    ...    string_tag='MIXED')
    >>> print(tree.serialize(how='indented'))
    sentence
      MIXED "This is "
      phrase "Buckingham Palace"

.. _xml_formatting:

XML-reflow
^^^^^^^^^^

A notorious problem when working with XML is the proper handling
of whitespace. It happens all too often that, when an XML-document
is rewrapped or reformatted, whitespace is either added or
removed in places where this can change the meaning of the data.
For example, when reformatting the snippet::

   <p>
     King <name>Charles</name>
     was crowned by the Archbishop
     of Cantabury
   </p>

it may happen (depending on the tool used) that a whitespace gets
lost::

    <p>
      King <name>Charles</name>was crowned
      by the Archbishop of Cantabury
    </p>

Here, the whitespace between "Charles" and "was" has been
erroneously deleted. Or it may happen that whitespace is added
where it should not be::

    <p>
      King <name>Charles
      </name> was crowned
      by the Archbishop
      of Cantabury
    </p>

This time round, the data marked up by the name-tag also encompasses
a line-break and a few blanks, an addition that can mess up
algorithms that rely on precise data.
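
The difference is easy to overlook in the source, but not in the data. The
following sketch uses the ElementTree-module from Python's standard library
(not DHParser) to compare the content of the name-element before and after
such a reformatting; the shortened sample strings are made up for this
illustration::

    import xml.etree.ElementTree as ET

    original = ET.fromstring('<p>King <name>Charles</name> was crowned</p>')
    reformatted = ET.fromstring('<p>King <name>Charles\n  </name> was crowned</p>')

    # after reformatting, the name carries a line-break and two blanks
    assert original.find('name').text == 'Charles'
    assert reformatted.find('name').text == 'Charles\n  '
    assert original.find('name').text != reformatted.find('name').text

Any lookup of the name "Charles" in a register of persons would fail for the
second variant unless the whitespace is stripped first.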

Admittedly, there is the
`xml:space <https://www.w3.org/TR/REC-xml/#sec-white-space>`_-attribute.
But this is often forgotten by the people encoding data in XML
and sometimes also by the programmers who develop XML-tools.
Because of this, DHParser offers the `inline_tags`-parameter
(which can be passed to the xml-serialization functions and also
be set as an attribute of the :py:class:`~nodetree.RootNode`-class).

The main problem with the `xml:space`-attribute is that it only
allows that either all whitespace is preserved literally
(xml:space="preserve") or that whitespace may be added or removed
at liberty to format the XML-document (xml:space="default"). These two
states are not sufficiently fine-grained to allow reflowing texts
without distorting the text data.

Non-distorting reflow requires that whitespace inside (but not at the fringes)
of particular text-containing elements, like for example `<p>`, is readily
expandable to an arbitrarily long sequence of whitespace characters
(blanks, tabs and line-feeds)
and likewise compressible to a single blank without causing harm to
the data.

As you may notice this is true for paragraphs of prose-text but not
for poems. But this only means that not all text-data is reflowable and
that reflow should only be applied to text-data where reflow makes sense.

This constraint for data-preserving reflow assumes that whitespace can
always be substituted by other (larger or smaller) whitespace, but must
not be added or removed. If this rule is
strictly obeyed then any form of the data (i.e. formatting to a particular
column-number) can always be reconstructed and will in fact yield identical
results for the same reflow-column and the same indentation (which is two
blanks by default).
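
The rule can be tried out on plain text with the standard library. In the
following sketch, ``textwrap.fill`` merely serves as a stand-in for a reflow
algorithm (it knows nothing about XML-tags and is not what DHParser uses), and
``one_line`` is a hypothetical helper that compresses every run of whitespace
to a single blank::

    import textwrap

    text = 'King Charles was crowned by the Archbishop of Cantabury'

    def one_line(s):
        """Replace every run of whitespace by a single blank."""
        return ' '.join(s.split())

    def reflow(s, column):
        """Wrap at `column`, substituting whitespace only, never splitting words."""
        return textwrap.fill(s, width=column, break_long_words=False,
                             break_on_hyphens=False)

    narrow = reflow(text, 20)
    wide = reflow(text, 40)

    assert narrow != wide                               # different layout ...
    assert one_line(narrow) == one_line(wide) == text   # ... same data
    assert reflow(one_line(narrow), 40) == wide         # layout is reproducible

Because whitespace is only ever substituted by other whitespace, each layout
can be reconstructed from any other one.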

DHParser's reflow-algorithm can be triggered by assigning a column-number to the
`reflow_col`-parameter of the :py:meth:`~nodetree.Node.as_xml`-method::

    >>> text = '<p>King <name>Charles</name> was crowned by the Archbishop of Cantabury</p>'
    >>> tree = parse_xml(text)
    >>> reflow = tree.as_xml(inline_tags={'p'}, reflow_col=40)
    >>> print(reflow)
    <p>King <name>Charles</name> was
      crowned by the Archbishop of
      Cantabury</p>

No blanks are introduced for the sake of formatting after the
opening `<p>`-tag or before the closing `</p>`-tag. The same, although this is not
visible in the example above, is also true for all tags contained inside the `<p>`-tag.
(Contained tags inherit the inline-property!)

It should also be noted that assigning a value to the reflow-parameter changes the
meaning of the `inline_tags`-parameter in a subtle way - and likewise the meaning
of the `xml:space`-attribute if that is used. Without reflow, the
`inline_tags`-parameter marks tags, the content of which is strictly preserved
when serializing. (Unless the data itself contains a line-break, it will be
written entirely on a single line, thus the name "inline".) However, if the
reflow parameter receives a value different from 0, the content of the
"inline-tags" and their descendants is not serialized on a single line any more
but allowed to reflow according to the above rule.

Also, the data can always be "normalized" by reformatting it to a particular
column. A special case of this consists in reducing it to one and the same
one-line-form, by replacing all line-feeds inside inline-tags by blanks and
any sequence of blanks by a single blank.
(Line-feeds can still be preserved if necessary by hard-coding them with
tags like `<br/>`.) This can be achieved by calling the special
function :py:func:`~nodetree.reflow_as_oneliner`::

    >>> from DHParser.nodetree import reflow_as_oneliner
    >>> tree = parse_xml(reflow)
    >>> print(tree.as_xml(inline_tags={tree.name}))
    <p>King <name>Charles</name> was
      crowned by the Archbishop of
      Cantabury</p>
    >>> reflow_as_oneliner(tree)
    >>> print(tree.as_xml(inline_tags={tree.name}))
    <p>King <name>Charles</name> was crowned by the Archbishop of Cantabury</p>

Note that the call `tree.as_xml(inline_tags={tree.name})`, which treats all tags
from the root of the tree onward as inline-tags and does not apply reflow,
yields a "neutral" serialization in the sense that no formatting is applied anywhere.

DHParser also provides a command line-tool to reflow xml-files, conveniently
named "xml_reflow". It can be called with::

    $ xml_reflow --column 80 FILENAME.xml

An alternative to reflowing the content of XML-files manually in this way
is to use a text-editor that can reflow (and properly indent) lines with
excess length. An advantage is that this also works with XML-files that
contain areas where the data is not reflowable and must be literally preserved.

ElementTree-Exchange
--------------------

Although DHParser offers rich support for tree-transformation, the wish
may arise to use standard XML-tools for tree-transformation as an
alternative or supplement to the tools DHParser offers. One way to do
so would be to serialize the tree of
:py:class:`~nodetree.Node`-objects, then use the XML-tools and,
possibly, deserialize the transformed XML again.

A more efficient method, however, is to utilize any of the various
Python-libraries for XML. In order to make this as easy as possible, trees of
:py:class:`~nodetree.Node`-objects can be converted to `ElementTree`_-objects
either from the Python standard library or from the `lxml <https://lxml.de/>`_-library::

    >>> import xml.etree.ElementTree as ET  # for lxml write: from lxml import etree as ET
    >>> et = sentence.as_etree(ET)
    >>> ET.dump(et)
    <sentence><word>This</word><blank> </blank><word>is</word><blank> </blank><phrase><word>Buckingham</word><blank> </blank><word>Palace</word></phrase></sentence>
    >>> tree = Node.from_etree(et)
    >>> print(tree.equals(sentence))
    True

The first parameter of :py:meth:`~nodetree.Node.as_etree` is the
ElementTree-library to be used. If omitted, the standard-library-ElementTree is
used.

Like the :py:meth:`~nodetree.Node.as_xml`-method, the
:py:meth:`~nodetree.Node.as_etree`- and :py:meth:`~nodetree.Node.from_etree`-methods
can be parameterized in order to support mixed-content and empty-tags::

    >>> et = sentence.as_etree(ET, string_tags={'word', 'blank'})
    >>> ET.dump(et)
    <sentence>This is <phrase>Buckingham Palace</phrase></sentence>
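
In an ElementTree, the text-parts of mixed content are not elements of their
own, but are stored in the ``text``- and ``tail``-fields of the neighboring
elements. The following sketch uses only the standard library to show where
the strings end up; the sample sentence has been extended by a trailing
" in London" (an addition made for this illustration) so that a tail occurs::

    import xml.etree.ElementTree as ET

    root = ET.fromstring(
        '<sentence>This is <phrase>Buckingham Palace</phrase> in London</sentence>')
    phrase = root.find('phrase')

    assert root.text == 'This is '             # text before the first child
    assert phrase.text == 'Buckingham Palace'  # text inside the child
    assert phrase.tail == ' in London'         # text following the child
    assert ''.join(root.itertext()) == 'This is Buckingham Palace in London'

This is the reason why a conversion between the two data models has to be told
which tags merely stand for such text-parts.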


.. _paths:

Tree-Traversal
--------------

Transforming syntax trees is usually done by traversing the complete
tree and applying specific transformation functions on each node.
Modules "transform" and "compile" provide high-level interfaces and
scaffolding classes for the traversal and transformation of
syntax-trees.

Module `nodetree` does not provide any functions for transforming trees,
but it provides low-level functions for navigating trees. These functions
cover three different purposes:

1. Downtree-navigation within the subtree spanned by a particular node.
2. Uptree- and horizontal navigation to the neighborhood ("siblings") and
   ancestry of a given node.
3. Navigation by looking at the string-representation of the tree.


Navigating "downtree"
^^^^^^^^^^^^^^^^^^^^^

There are a number of useful functions to help navigate a tree spanned by a
node and find particular nodes within a tree::

    >>> from DHParser.toolkit import printw
    >>> printw(list(sentence.select('word')))
    [Node('word', 'This'), Node('word', 'is'), Node('word', 'Buckingham'),
     Node('word', 'Palace')]
    >>> list(sentence.select(lambda node: node.content == ' '))
    [Node('blank', ' '), Node('blank', ' '), Node('blank', ' ')]

The `pick()`-method always picks the first node fulfilling the criterion::

    >>> sentence.pick('word')
    Node('word', 'This')

Or, reversing the direction::

    >>> last_match = sentence.pick('word', reverse=True)
    >>> last_match
    Node('word', 'Palace')

While nodes contain references to their children, a node does not
contain a reference to its parent. The method
:py:meth:`~nodetree.Node.pick_path` (described below) can be used to pick
the complete list of ancestors leading up to and including a particular
node. As a last resort (because it is slow) the node's parent can be
found by the `find_parent`-function which must be executed on any
ancestor of the node::

    >>> printw(sentence.find_parent(last_match))
    Node('phrase', (Node('word', 'Buckingham'), Node('blank', ' '),
     Node('word', 'Palace')))

Sometimes, one only wants to select or pick particular children of a
node. Apart from accessing these via `node.children`, there is a
tuple-like access to the immediate children via indices and slices::

    >>> sentence[0]
    Node('word', 'This')
    >>> printw(sentence[-1])
    Node('phrase', (Node('word', 'Buckingham'), Node('blank', ' '),
     Node('word', 'Palace')))
    >>> sentence[0:3]
    (Node('word', 'This'), Node('blank', ' '), Node('word', 'is'))
    >>> sentence.index('blank')
    1
    >>> sentence.indices('word')
    (0, 2)

as well as a dictionary-like access, with the difference that a "key" may occur
several times::

    >>> sentence['word']
    (Node('word', 'This'), Node('word', 'is'))
    >>> printw(sentence['phrase'])
    Node('phrase', (Node('word', 'Buckingham'), Node('blank', ' '),
     Node('word', 'Palace')))

Be aware that always all matching values will be returned and that the return
type can accordingly be either a tuple of Nodes or a single Node! An IndexError
is raised in case the "key" does not exist or an index is out of range.

It is also possible to delete children conveniently with Python's
`del`-operator::

    >>> s_copy = copy.deepcopy(sentence)
    >>> del s_copy['blank'];  print(s_copy)
    ThisisBuckingham Palace
    >>> del s_copy[2][0:2]; print(s_copy.serialize())
    (sentence (word "This") (word "is") (phrase (word "Palace")))

One can also use the `Node.pick_child()` or `Node.select_children()`-method in
order to select children with an arbitrary condition::

    >>> tuple(sentence.select_children(lambda nd: nd.content.find('s') >= 0))
    (Node('word', 'This'), Node('word', 'is'))
    >>> printw(sentence.pick_child(lambda nd: nd.content.find('i') >= 0,
    ...                            reverse=True))
    Node('phrase', (Node('word', 'Buckingham'), Node('blank', ' '),
     Node('word', 'Palace')))

Often, one is neither interested in selecting from the children of a node, nor
from the entire subtree, but from a certain "depth-range" of a tree-structure.
Say, you would like to pick all words from the sentence that are not inside a
phrase and assume at the same time that words may occur in nested structures::

    >>> nested = copy.deepcopy(sentence)
    >>> i = nested.index(lambda nd: nd.content == 'is')
    >>> nested[i].result = Node('word', nested[i].result)
    >>> nested[i].name = 'italic'
    >>> nested[0:i + 1]
    (Node('word', 'This'), Node('blank', ' '), Node('italic', (Node('word', 'is'))))

Now, in order to select all words on the level of the sentence, but excluding
any sub-phrases, it would not be helpful to use methods based on the selection
of children (i.e. immediate descendants), because the word nested in an
'italic'-Node would be missed. For this purpose, the various selection-methods
of class Node have a `skip_subtree`-parameter which can be used to block
subtrees from the iterator based on a criterion (which can be a function, a tag
name or set of tag names and the like)::

    >>> tuple(nested.select('word', skip_subtree='phrase'))
    (Node('word', 'This'), Node('word', 'is'))
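
The effect of blocking subtrees can be sketched with plain tuples. The
generator ``toy_select`` below is a simplified, hypothetical stand-in for the
selection-methods (it only matches by name), and nodes are modelled as
``(name, result)``-tuples rather than Node-objects::

    def toy_select(node, name, skip_subtree=None):
        """Yield all descendants called `name` in document order without
        descending into subtrees called `skip_subtree`."""
        _, result = node
        if isinstance(result, str):
            return                      # leaf-nodes have no descendants
        for child in result:
            if child[0] == name:
                yield child
            if child[0] != skip_subtree:
                yield from toy_select(child, name, skip_subtree)


    nested = ('sentence', (
        ('word', 'This'), ('blank', ' '),
        ('italic', (('word', 'is'),)), ('blank', ' '),
        ('phrase', (('word', 'Buckingham'), ('blank', ' '), ('word', 'Palace')))))

    everywhere = [content for _, content in toy_select(nested, 'word')]
    outside = [content for _, content
               in toy_select(nested, 'word', skip_subtree='phrase')]
    assert everywhere == ['This', 'is', 'Buckingham', 'Palace']
    assert outside == ['This', 'is']

The nested word "is" is still found, because only the 'phrase'-subtree is
excluded from the traversal, not every level below the immediate children.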


Navigating "uptree"
^^^^^^^^^^^^^^^^^^^

Instead of keeping a link within each node to its parent, it is much more
elegant to keep track of a node's ancestry by using the lineage or "tree-path"
which is a simple list of ancestors starting with the root-node and including
the node itself as its last item. For most search methods such as select() or
pick(), there exists a counterpart that returns this path instead of just the node
itself::

    >>> last_path = sentence.pick_path('word', reverse=True)
    >>> last_path[-1] == last_match
    True
    >>> last_path[0] == sentence
    True
    >>> pp_path(last_path)
    'sentence <- phrase <- word'
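
Why a path makes links to parent-nodes dispensable can be shown with a toy
version of such a search. In this sketch, ``toy_pick_path`` is a hypothetical
helper, nodes are plain ``(name, result)``-tuples and the name 'proper_noun'
has been invented to obtain a unique target::

    def toy_pick_path(root, name):
        """Return the list of nodes leading from `root` to the first
        descendant called `name`, or an empty list if there is none."""
        _, result = root
        if isinstance(result, str):
            return []
        for child in result:
            if child[0] == name:
                return [root, child]
            tail = toy_pick_path(child, name)
            if tail:
                return [root] + tail
        return []


    phrase = ('phrase', (('proper_noun', 'Buckingham'), ('blank', ' '),
                         ('word', 'Palace')))
    sentence = ('sentence', (('word', 'This'), ('blank', ' '), phrase))

    path = toy_pick_path(sentence, 'proper_noun')
    assert ' <- '.join(node[0] for node in path) == 'sentence <- phrase <- proper_noun'
    assert path[-2] is phrase       # the parent is the last but one item
    assert path[0] is sentence      # the root is the first item

All ancestors of the node found are available from the path itself, so no
further search is needed to move up the tree.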

One can also think of a tree-path as a breadcrumb-trail or, rather,
ant-trail that "points" to a particular part of text by marking the path
from the root to the node, the content of which contains this text. This
node does not need to be a leaf node, but can be any branch-node on the
way from the root to the leaves of the tree. When analyzing or
transforming a tree-structured text, it is often helpful to "zoom" in
and out of a particular part of text (pointed to by a path) or to move
forward and backward from a particular location (again represented by a
path).

The ``next_path()`` and ``prev_path()``-functions allow moving one step
forward or backward from a given path::

    >>> pointer = prev_path(last_path)
    >>> pp_path(pointer, with_content=-1)
    'sentence "This is Buckingham Palace" <- phrase "Buckingham Palace" <- blank " "'

``prev_path()`` and ``next_path()`` automatically zoom out by one step, if
they move past the first or last child of the last but one node in the list::

    >>> pointer = prev_path(pointer)
    >>> pp_path(pointer, with_content=-1)
    'sentence "This is Buckingham Palace" <- phrase "Buckingham Palace" <- word "Buckingham"'
    >>> pp_path(prev_path(pointer), with_content=-1)
    'sentence "This is Buckingham Palace" <- blank " "'

Thus::

    >>> next_path(prev_path(pointer)) == pointer
    False
    >>> pointer = prev_path(pointer)
    >>> pp_path(next_path(pointer), with_content=-1)
    'sentence "This is Buckingham Palace" <- phrase "Buckingham Palace"'

The reason for this behavior is that ``prev_path()`` and ``next_path()``
try to move to the path which contains the string content preceding or
succeeding that of the given path. Therefore, these functions move to
the next sibling on the same branch, rather than traversing the complete tree
like the ``select()`` and ``select_path()``- methods of the Node-class.
However, when moving past the first or last sibling, it is not clear
what the next node on the same level should be. To keep it easy, the
function "zooms out" and returns the next sibling of the parent.
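
The "zooming out" can be made explicit in a few lines. The following is a
simplified sketch of the behavior described above, not the code of
``next_path()`` itself; ``ToyNode`` and ``toy_next_path`` are hypothetical
names and the sample sentence is made up::

    class ToyNode:
        """Hypothetical stand-in for a node with a name and a result."""

        def __init__(self, name, result):
            self.name, self.result = name, result

        @property
        def children(self):
            return self.result if isinstance(self.result, tuple) else ()


    def toy_next_path(path):
        """Return the path of the next sibling, zooming out where no sibling
        is left, or None at the end of the tree."""
        path = list(path)
        while len(path) > 1:
            siblings = path[-2].children
            i = siblings.index(path[-1])    # ToyNode compares by identity
            if i + 1 < len(siblings):
                return path[:-1] + [siblings[i + 1]]
            path.pop()                      # last sibling: zoom out by one step
        return None


    def names(path):
        return [node.name for node in path]


    first, last = ToyNode('word', 'Buckingham'), ToyNode('word', 'Palace')
    phrase = ToyNode('phrase', (first, ToyNode('blank', ' '), last))
    verb = ToyNode('word', 'shines')
    sentence = ToyNode('sentence', (phrase, ToyNode('blank', ' '), verb))

    assert names(toy_next_path([sentence, phrase, first])) == ['sentence', 'phrase', 'blank']
    assert names(toy_next_path([sentence, phrase, last])) == ['sentence', 'blank']
    assert toy_next_path([sentence, verb]) is None

The second assertion shows the zoom-out: the returned path is one item shorter
than the path passed to the function.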

It is, of course, possible to zoom back into a path::

    >>> pp_path(zoom_into_path(next_path(pointer), FIRST_CHILD, steps=1),
    ...                with_content=-1)
    'sentence "This is Buckingham Palace" <- phrase "Buckingham Palace" <- word "Buckingham"'

Often it is preferable to move through the leaf-nodes and their paths right
away. Functions like ``next_leaf_path()`` and ``prev_leaf_path()`` provide
syntactic sugar for this case::

    >>> pointer = next_leaf_path(pointer)
    >>> pp_path(pointer, with_content=-1)
    'sentence "This is Buckingham Palace" <- phrase "Buckingham Palace" <- word "Buckingham"'

It is also possible to inspect just the string content surrounding a path,
rather than its structural environment::

    >>> ensuing_str(pointer)
    ' Palace'
    >>> assert foregoing_str(pointer, length=1) == ' ', "Blank expected!"

It is also possible to systematically iterate through the paths forward or
backward - just like the `node.select_path()`-method, but starting from an
arbitrary path, instead of the one end or the other end of the tree rooted in
`node`::

    >>> t = parse_sxpr('(A (B 1) (C (D (E 2) (F 3))) (G 4) (H (I 5) (J 6)) (K 7))')
    >>> pointer = t.pick_path('G')
    >>> printw([pp_path(ctx, with_content=1)
    ...         for ctx in select_path(pointer, ANY_PATH, include_root=True)])
    ['A <- G "4"', 'A <- H "56"', 'A <- H <- I "5"', 'A <- H <- J "6"',
     'A <- K "7"', 'A "1234567"']
    >>> printw([pp_path(ctx, with_content=1)
    ...         for ctx in select_path(
    ...             pointer, ANY_PATH, include_root=True, reverse=True)])
    ['A <- G "4"', 'A <- C "23"', 'A <- C <- D "23"', 'A <- C <- D <- F "3"',
     'A <- C <- D <- E "2"', 'A <- B "1"', 'A "1234567"']

Another important difference, besides the starting point, is that the
`select()`-generators of the `nodetree`-module traverse the tree post-order
(or "depth first"), while the respective methods of the Node-class traverse the
tree pre-order. See the difference::

    >>> l = [pp_path(ctx, with_content=1)
    ...      for ctx in t.select_path(ANY_PATH, include_root=True)]
    >>> l[l.index('A <- G "4"'):]
    ['A <- G "4"', 'A <- H "56"', 'A <- H <- I "5"', 'A <- H <- J "6"', 'A <- K "7"']
    >>> l = [pp_path(ctx, with_content=1)
    ...      for ctx in t.select_path(ANY_PATH, include_root=True, reverse=True)]
    >>> printw(l[l.index('A <- G "4"'):])
    ['A <- G "4"', 'A <- C "23"', 'A <- C <- D "23"', 'A <- C <- D <- F "3"',
     'A <- C <- D <- E "2"', 'A <- B "1"']


Content Mappings
----------------

Basics
^^^^^^

For finding a passage in the text or identifying certain textual
features like, for example, matching brackets, traversing the
document-tree is not really an option, if only because a passage may
extend over several nodes, possibly even on different levels of the tree
hierarchy. For such cases it is possible to generate a content mapping
that maps text positions within the pure string-content to the paths of
the leaf-nodes to which they belong. This mapping can be thought of as a
"string-view" on the tree::

    >>> sentence = parse_sxpr(
    ...     '(sentence (word "This") (blank " ") (word "is") (blank " ")'
    ...     ' (phrase (word "Buckingham") (blank " ") (word "Palace")))')
    >>> ctx_mapping = ContentMapping(sentence)
    >>> print(ctx_mapping.content)
    This is Buckingham Palace
    >>> print(ctx_mapping)
    0 -> sentence, word "This"
    4 -> sentence, blank " "
    5 -> sentence, word "is"
    7 -> sentence, blank " "
    8 -> sentence, phrase, word "Buckingham"
    18 -> sentence, phrase, blank " "
    19 -> sentence, phrase, word "Palace"

Note that the path in the first line of the output is different from the path
in the third line, although the sequence of node-names that appears in the
pretty-printed version shown here is the same, i.e. "sentence, word", because
the paths really consist of different Nodes.

Now let's find all letters that are followed by a whitespace character::

    >>> import re
    >>> locations = [m.start() for m in re.finditer(r'\w ', ctx_mapping.content)]
    >>> targets = [ctx_mapping.get_path_and_offset(loc) for loc in locations]

.. tip::
    Unlike the node's content property, the content mapping's content
    field is not generated on the fly every time it is retrieved, but
    only when instantiating or rebuilding the mapping. Performance-wise
    it is advisable to always use the content mapping's content field.

The target returned by :py:meth:`~nodetree.ContentMapping.get_path_and_offset`
is a tuple of the target path and the relative position of the location that
falls within this path::

    >>> [(pp_path(path), relative_pos) for path, relative_pos in targets]
    [('sentence <- word', 3), ('sentence <- word', 1), ('sentence <- phrase <- word', 9)]
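
The principle behind such a lookup can be sketched with the standard library.
The following is a minimal illustration under simplifying assumptions, not the
implementation of the ContentMapping-class: nodes are plain
``(name, result)``-tuples, and ``leaf_paths`` and ``locate`` are hypothetical
helpers::

    import bisect
    import re

    def leaf_paths(node, path=()):
        """Yield the path (root first, leaf last) of every leaf in document order."""
        path = path + (node,)
        if isinstance(node[1], str):
            yield path
        else:
            for child in node[1]:
                yield from leaf_paths(child, path)

    sentence = ('sentence', (
        ('word', 'This'), ('blank', ' '), ('word', 'is'), ('blank', ' '),
        ('phrase', (('word', 'Buckingham'), ('blank', ' '), ('word', 'Palace')))))

    paths = list(leaf_paths(sentence))
    content = ''.join(path[-1][1] for path in paths)

    # start position of every leaf within the string-content
    offsets, pos = [], 0
    for path in paths:
        offsets.append(pos)
        pos += len(path[-1][1])

    def locate(location):
        """Return the leaf-path containing `location` and the position
        relative to the beginning of that leaf."""
        i = bisect.bisect_right(offsets, location) - 1
        return paths[i], location - offsets[i]

    targets = [locate(m.start()) for m in re.finditer(r'\w ', content)]
    found = [(' <- '.join(node[0] for node in path), rel) for path, rel in targets]
    assert found == [('sentence <- word', 3), ('sentence <- word', 1),
                     ('sentence <- phrase <- word', 9)]

Since the start positions are sorted, a binary search suffices to find the
leaf to which a text position belongs.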

Now, the structured text can be manipulated at the precise locations where
string search yielded a match. Let's turn our text into a little riddle by
replacing the letters of the leaf-nodes before the match locations with three
dots::

    >>> for path, pos in targets:
    ...     path[-1].result = '...' + path[-1].content[pos:]
    >>> str(sentence)
    '...s ...s ...m Palace'

The positions represent the text positions of the text represented by the
tree at the very moment when the path mapping is generated, not the
source positions captured by the `pos`-property of the node-objects!
This also means that the mapping becomes outdated when the tree is
being restructured. Unless you use the methods provided by
:py:class:`~nodetree.ContentMapping` itself in order to make changes to
the tree, you need to either call
:py:meth:`~nodetree.ContentMapping.rebuild_mapping` to update the
content mapping at the affected places or instantiate an entirely new
content mapping.
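
Why a cached mapping goes stale can be seen in a toy model. The sketch below
is plain Python and does not use DHParser. The helper ``start_positions`` is a
hypothetical name; it merely recomputes the start offsets of a list of leaf
texts, which is what "rebuilding" amounts to in this simplified picture.

```python
from itertools import accumulate


def start_positions(texts):
    """Start offset of every leaf text within the concatenated content."""
    ends = list(accumulate(len(t) for t in texts))
    return [0] + ends[:-1]


texts = ["This", " ", "is", " ", "Buckingham", " ", "Palace"]
cached = start_positions(texts)       # computed once, like a mapping

texts[0] = "...s"                     # same length: cached offsets still fit
still_valid = cached == start_positions(texts)

texts[4] = "...m"                     # shorter text: later offsets shift
stale = cached != start_positions(texts)

fresh = start_positions(texts)        # recomputing brings the offsets up to date
```

Only changes that alter the length of a leaf or the structure of the tree
invalidate the cached offsets in this model, but since that is hard to
guarantee in general, treating the mapping as outdated after any external
change is the safer habit.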


Restricted Mappings
^^^^^^^^^^^^^^^^^^^

A very powerful feature of content mappings is that they allow you to
restrict the string view onto the document tree to selected parts of the
tree, which makes it possible to exclude the remaining parts from the search,
e.g.::

    >>> xml = '''<doc><p>In München<footnote><em>München</em> is the German
    ... name of the city of Munich</footnote> is a Hofbräuhaus</p></doc>'''
    >>> tree = parse_xml(xml)

Now, assume you would like to find all occurrences of "München" in the main
text but not in the footnotes. Then you can create a content mapping that
ignores all footnotes::

    >>> cm = ContentMapping(tree, select=LEAF_PATH, ignore={'footnote'})
    >>> list(re.finditer('München', cm.content))
    [<re.Match object; span=(3, 10), match='München'>]

In order to restrict the content mapping to certain parts of the tree, the
ContentMapping-class takes a pair of path selectors similar to the
"criteria" and "skip_subtree" parameters of :py:meth:`Node.select_path`
and :py:meth:`Node.pick`. However, there is a subtle but important difference:
The "select"-parameter of the ContentMapping-class must only accept leaf-paths!
Otherwise a ValueError will be raised.

In contrast to the restricted content mapping, the search in the
string-content of the entire tree yields::
    
    >>> printw(list(re.finditer('München', tree.content)))
    [<re.Match object; span=(3, 10), match='München'>, <re.Match object; span=(10,
     17), match='München'>]

Although the string locations in a content mapping that has been
restricted to certain parts of the tree are shifted with respect to the
string locations in the full document tree, there is no need to worry
that the mapped locations within the tree have changed::

    >>> tree_pos = tree.content.find('Hofbräuhaus')
    >>> print(tree_pos)
    64
    >>> tm = ContentMapping(tree)
    >>> tm.content.find('Hofbräuhaus')  # should be the same as above
    64
    >>> cm_pos = cm.content.find('Hofbräuhaus')
    >>> print(cm_pos)
    16

The string-position is not the same, because the mapping ``cm`` omits the
footnote-text. Yet, the path and offset within the tree remain the same.
(Remember that the ``:Text``-nodes are "anonymous" nodes that the XML-parser
inserts for the character data of XML-elements with `mixed content`_.)::

    >>> cm_path, cm_offset = cm.get_path_and_offset(cm_pos)
    >>> print(pp_path(cm_path, delimiter=', '), '->', cm_offset)
    doc, p, :Text -> 6
    >>> tm_path, tm_offset = tm.get_path_and_offset(tree_pos)
    >>> print(pp_path(tm_path, delimiter=', '), '->', tm_offset)
    doc, p, :Text -> 6
    >>> assert tm_path == cm_path  # paths are really the same sequence of nodes

This can easily be confirmed by looking at the complete mappings in
direct comparison. First the unrestricted mapping::

    >>> print(tm)
    0 -> doc, p, :Text "In München"
    10 -> doc, p, footnote, em "München"
    17 -> doc, p, footnote, :Text " is the German" "name of the city of Munich"
    58 -> doc, p, :Text " is a Hofbräuhaus"

Now, the mapping that omits the footnotes::

    >>> print(cm)
    0 -> doc, p, :Text "In München"
    10 -> doc, p, :Text " is a Hofbräuhaus"

Note that the numbers at the beginning of each line represent the
string position, which differs between the two mappings for the same
path. This has no bearing on the offsets, which count from the
mapping-specific start position of each path in the content mapping.
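
The relation between shifted string positions and stable offsets can be
reproduced with a small model in plain Python. This is a sketch under
simplifying assumptions, not DHParser code: ``LEAVES``, ``build_mapping`` and
``locate`` are hypothetical names, paths are reduced to tuples of node-names,
and a leaf is ignored whenever one of the ignored names occurs in its path.

```python
from bisect import bisect_right

# Hypothetical leaves of the example document: (path as names, text)
LEAVES = [
    (("doc", "p", ":Text"), "In München"),
    (("doc", "p", "footnote", "em"), "München"),
    (("doc", "p", "footnote", ":Text"),
     " is the German\nname of the city of Munich"),
    (("doc", "p", ":Text"), " is a Hofbräuhaus"),
]


def build_mapping(leaves, ignore=frozenset()):
    """Return (content, start positions, indices of the leaves that were kept)."""
    content, starts, kept = "", [], []
    for index, (path, text) in enumerate(leaves):
        if set(ignore).intersection(path):
            continue                    # leaf lies inside an ignored element
        starts.append(len(content))
        kept.append(index)
        content += text
    return content, starts, kept


def locate(mapping, position):
    """Map a string position to (index of the leaf in LEAVES, offset in the leaf)."""
    _, starts, kept = mapping
    k = bisect_right(starts, position) - 1
    return kept[k], position - starts[k]


full = build_mapping(LEAVES)
restricted = build_mapping(LEAVES, ignore={"footnote"})
full_pos = full[0].find("Hofbräuhaus")
restricted_pos = restricted[0].find("Hofbräuhaus")
```

Both positions resolve to the same leaf and the same offset, although the
string positions themselves differ.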

Conversely, we could also have restricted the content mapping only to
the footnote(s)::

    >>> fm = ContentMapping(tree, select=leaf_paths('footnote'), ignore=NO_PATH)
    >>> print(fm)
    0 -> doc, p, footnote, em "München"
    7 -> doc, p, footnote, :Text " is the German" "name of the city of Munich"

Here, the parameter ``ignore=NO_PATH`` has to be understood as "from the
selected paths do not ignore any paths". Note the
:py:func:`leaf_paths`-filter used to define the value of the
select-argument. ContentMapping raises a ValueError if the
select-criterion allows paths that are not leaf-paths. The
leaf_paths-filter is a simple, though slightly costly in terms of speed,
means of turning any criterion into a "criterion is true for path AND path
is a leaf-path"-condition.
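
The idea of such a combined condition can be sketched in a few lines of plain
Python. The names ``is_leaf_path`` and ``leaf_paths_only`` as well as the
dictionary-based toy nodes are assumptions made for this illustration; they
are not how DHParser represents nodes or implements its filter.

```python
def is_leaf_path(path):
    """True if the last node of the path has no children (toy model)."""
    return not path[-1]["children"]


def leaf_paths_only(criterion):
    """Turn `criterion` into 'criterion holds AND path is a leaf-path'."""
    def combined(path):
        return is_leaf_path(path) and criterion(path)
    return combined


# Toy tree: nodes are dicts, paths are lists of nodes from the root downwards
em = {"name": "em", "children": []}
footnote = {"name": "footnote", "children": [em]}
text = {"name": ":Text", "children": []}
p = {"name": "p", "children": [text, footnote]}
all_paths = [[p], [p, text], [p, footnote], [p, footnote, em]]


def in_footnote(path):
    """Example criterion: the path runs through a footnote."""
    return any(node["name"] == "footnote" for node in path)


selected = [path for path in all_paths if leaf_paths_only(in_footnote)(path)]
selected_names = [[node["name"] for node in path] for path in selected]
```

The extra check is evaluated for every candidate path, which gives an
intuition for why such a wrapper costs a little speed.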

Now, let's look for the string "München" in the footnotes only::

    >>> i = fm.content.find('München')
    >>> path, offset = fm.get_path_and_offset(i)
    >>> pp_path(path, 1)
    'doc <- p <- footnote <- em "München"'
    >>> print(offset)
    0

We can now manipulate the tree through the path and offset. Let's insert
the word "Stadt" in front of "München". We do so by inserting the term
into the result of the path's leaf node at the given offset::

    >>> path[-1].result = path[-1].result[:offset] + "Stadt " + \
    ...                   path[-1].result[offset:]

In this particular case, because the offset is zero, we could also have
written ``"Stadt " + path[-1].result``, but the formula above also
works for the general case where we cannot be sure that the offset will
always be 0.

We expect that the change is reflected in the tree at the right position, i.e.
inside the footnote::

    >>> printw(tree.as_xml(inline_tags={'doc'}))
    <doc><p>In München<footnote><em>Stadt München</em> is the German
    name of the city of Munich</footnote> is a Hofbräuhaus</p></doc>

As mentioned earlier, the content mapping should be considered tainted if the
underlying tree has been changed by any other means than the methods of the
ContentMapping-object itself. In order to rebuild the affected path of the
content mapping :py:meth:`ContentMapping.rebuild_mapping` must be called for
the affected sections of the content mapping which are defined by the first
and last path index of the content mapping where a change has taken place::

    >>> fm.rebuild_mapping(i, i)
    >>> print(fm)
    0 -> doc, p, footnote, em "Stadt München"
    13 -> doc, p, footnote, :Text " is the German" "name of the city of Munich"


Limitations
^^^^^^^^^^^

As of now, a limitation of the content mappings provided
by :py:mod:`DHParser.nodetree` consists in the fact that they remain
completely agnostic with respect to any textual meaning of the nodes.
For example, assume that the node-name "pb" signifies a page break, which
implies that there is a gap between the two parts separated by the page
break. However, because this is considered part of the meaning
of "pb" it may not be required by the encoding guide-lines for the
document that a gap, say, a blank character or a linefeed is also
redundantly encoded in the string content of the document.
(It may even be forbidden to do so!) But then a search on the
string content may miss phrases separated by a page break::

    >>> tree = parse_xml('<doc>xyz New<pb/>York xyz</doc>')
    >>> print(tree.content)
    xyz NewYork xyz
    >>> m = re.search(r'New\s+York', tree.content)
    >>> print(m)
    None

Currently, the only remedy is either to allow redundant encoding
of textual meanings within the string-content or to add specific
nodes that carry the redundant textual meanings within their
string-content and to remove them again after searches etc. have
been finished.
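
The second remedy can be pictured with a plain-Python sketch that does not
touch DHParser at all. It assumes a hypothetical table ``IMPLIED_GAP`` that
names the elements which imply a gap and the text that stands in for them
while searching.

```python
import re

# Hypothetical leaves of the example: (node-name, text); "pb" is empty
leaves = [(":Text", "xyz New"), ("pb", ""), (":Text", "York xyz")]

# Assumed convention: elements with these names imply a gap in the text
IMPLIED_GAP = {"pb": " "}

plain = "".join(text for _, text in leaves)
with_gaps = "".join(IMPLIED_GAP.get(name, text) for name, text in leaves)

miss = re.search(r"New\s+York", plain)
hit = re.search(r"New\s+York", with_gaps)
```

Positions found in ``with_gaps`` refer to the temporarily enriched text, so
they have to be mapped back (or the inserted gaps removed) before they are
used on the original content.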


Markup insertion
----------------

Class :py:class:`ContentMapping` provides powerful markup-methods
that allow you to add markup at any position you like simply by
passing the start- and end-position in the string-representation of
the document-tree, while "automagically" taking care of such perils
as cross-cutting tag-boundaries or overlapping hierarchies.

This solves a common challenge when processing tree structured text-data
which consists in adding new nodes that cover certain ranges of the
string content that may already have been covered by other elements. The
problem is the same as adding further markup to an existing XML or
HTML-document. In trivial cases like::

    >>> trivial_xml = parse_xml("<trivial>Please mark up Stadt München "
    ...     "in Bavaria in this sentence.</trivial>")

we would hardly need any help from a library to mark up the string
"Stadt München". But both finding certain sub-strings and marking them up
can easily become complicated::

    >>> hard_xml = parse_xml("<hard>Please mark up Stadt\n<lb/>"
    ...     "<location><em>München</em> in Bavaria</location> in this "
    ...     "sentence.</hard>")

Let's start with the simple case to see how searching and marking
strings works with DHParser::

    >>> mapping = ContentMapping(trivial_xml)
    >>> match = re.search(r"Stadt\s+München", mapping.content)
    >>> _ = mapping.markup(match.start(), match.end(), "foreign",
    ...                    {'lang': 'de'})
    >>> printw(trivial_xml.as_xml(inline_tags={'trivial'}))
    <trivial>Please mark up <foreign lang="de">Stadt München</foreign>
     in Bavaria in this sentence.</trivial>

In order to search for the text-string, a regular expression is used
rather than a simple search for "Stadt München", because we cannot
assume that it appears in exactly the same form in the text. For
example, it could be broken up by a line break, e.g. "Stadt\\nMünchen".

Now, let's try the more complicated case. Because we will try
different configurations, we use copies of the tree "hard_xml"::

    >>> import copy
    >>> hard_xml_copy = copy.deepcopy(hard_xml)
    >>> mapping = ContentMapping(hard_xml_copy)
    >>> match = re.search(r"Stadt\s+München", mapping.content)
    >>> _ = mapping.markup(match.start(), match.end(), "foreign",
    ...                    {'lang': 'de'})
    >>> xml_str = hard_xml_copy.as_xml(empty_tags={'lb'})
    >>> print(xml_str)
    <hard>
      Please mark up
      <foreign lang="de">
        Stadt
        <lb/>
      </foreign>
      <location>
        <foreign lang="de">
          <em>München</em>
        </foreign>
        in Bavaria
      </location>
      in this sentence.
    </hard>

As can be seen, the <foreign>-tag is split into two parts, because the markup
runs across the border of another tag, in this case <location>. Note that the
<lb/>-tag lies inside the <foreign>-tag. But that makes sense, because it would
also have been inside the <foreign>-tag, had there been no <location>-tag and
no need to split. (By default, the algorithm behaves somewhat "greedy", which,
however, can be configured with a parameter of the same name passed to the
constructor of class ContentMapping.)
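
The reason for the split becomes apparent when the matched range is cut at the
boundaries of the leaves it touches. The following plain-Python sketch only
computes these pieces; it is not the markup algorithm of DHParser, and
``split_range`` as well as the flat list of leaf texts are assumptions made
for illustration.

```python
import re


def split_range(leaf_texts, start, end):
    """Cut the content range [start, end) into (leaf index, lo, hi) pieces."""
    pieces, leaf_start = [], 0
    for index, text in enumerate(leaf_texts):
        leaf_end = leaf_start + len(text)
        lo, hi = max(start, leaf_start), min(end, leaf_end)
        if lo < hi:                      # the range overlaps this leaf
            pieces.append((index, lo - leaf_start, hi - leaf_start))
        leaf_start = leaf_end
    return pieces


# Leaf texts of the "hard" example in document order:
# text, <lb/>, <em> inside <location>, rest of <location>, trailing text
leaf_texts = ["Please mark up Stadt\n", "", "München", " in Bavaria",
              " in this sentence."]
content = "".join(leaf_texts)
match = re.search(r"Stadt\s+München", content)
pieces = split_range(leaf_texts, match.start(), match.end())
```

The match covers the end of the first text leaf and the whole of the
``em``-leaf. Since these two leaves have different parents, a single new
element cannot enclose both without either being split itself or splitting
``location``.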

But what if you don't want the <foreign>-tag to be split up in two or
more parts, as the case may be? Well, in this case, you need to allow
those tags, the borders of which the new markup runs across, to be split
by that markup::

    >>> hard_xml_copy = copy.deepcopy(hard_xml)
    >>> divisibility_map = {'foreign': {'location', ':Text'},
    ...                     '*': {':Text'}}
    >>> mapping = ContentMapping(hard_xml_copy, divisibility=divisibility_map)
    >>> match = re.search(r"Stadt\s+München", mapping.content)
    >>> _ = mapping.markup(match.start(), match.end(), "foreign",
    ...                    {'lang': 'de'})
    >>> xml_str = hard_xml_copy.as_xml(empty_tags={'lb'})
    >>> print(xml_str)
    <hard>
      Please mark up
      <foreign lang="de">
        Stadt
        <lb/>
        <location>
          <em>München</em>
        </location>
      </foreign>
      <location>
        in Bavaria
      </location>
      in this sentence.
    </hard>

See the difference? This time the <foreign>-element remains intact,
while the <location>-element has been split. This behavior can be
configured by the divisibility-map that is passed to the parameter
``divisibility`` of the ContentMapping-constructor. It maps elements
(or, rather, their names) to sets of elements that can be cut by them.
The asterisk ``*`` is a wildcard and contains those elements that can be
cut by any other element. An element that does not appear in the
value-set anywhere in the mapping cannot be cut by any other element. It
is also possible to pass a simple set of element-names instead of a
dictionary to the divisibility-parameter. In this case any element with
a name in this set can be cut by any other element. Any element the name
of which is not a member of the set cannot be cut when markup is added.
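
The rules just described can be written down as a small predicate. The
function ``can_cut`` below is a hypothetical helper that encodes one reading
of the paragraph above in plain Python; it is not part of DHParser and may
differ from the library's actual decision logic.

```python
def can_cut(divisibility, cutting, target):
    """True if an element named `cutting` may split an element named `target`."""
    if not isinstance(divisibility, dict):
        # A plain set: its members can be cut by any other element
        return target in divisibility
    allowed = divisibility.get(cutting, set()) | divisibility.get('*', set())
    return target in allowed


divisibility_map = {'foreign': {'location', ':Text'},
                    '*': {':Text'}}

foreign_cuts_location = can_cut(divisibility_map, 'foreign', 'location')
location_cuts_foreign = can_cut(divisibility_map, 'location', 'foreign')
anything_cuts_text = can_cut(divisibility_map, 'em', ':Text')
```

With this map, ``foreign`` may cut ``location`` but not vice versa, which
corresponds to the outcome of the doctest above.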

In cases where markup overlaps element-borders, it is unavoidable to
decide which element will be divided and which not. It is a general
limitation of tree structures that they do not allow overlapping
hierarchies. In this particular example, it would most probably be more
reasonable to keep the <location>-element intact, because locations
should probably be recognizable as units, while this does not really
seem to matter for a foreign language annotation.

The case may arise, though, where you cannot avoid splitting elements
that form units. At this point you probably should consider using an
entirely different data-structure, say, a graph. But if this is not an
option, :py:mod:`DHParser.nodetree` allows you to mark split elements as
belonging to the same "chain" of elements. In order to do so you can
pass a ``chain_attr_name`` to the constructor of class ContentMapping.
This is an (arbitrary) name for an attribute which will contain a unique
short string that all elements (of the same name) belonging to one and
the same chain share with each other, but not with any other elements.
Let's try this on the previous example::

    >>> reset_chain_ID()  # just to ensure deterministic ID values for doctest

    >>> hard_xml_copy = copy.deepcopy(hard_xml)
    >>> match = re.search(r"Stadt\s+München", hard_xml_copy.content)
    >>> divisibility_map = {'foreign': {'location', ':Text'},
    ...                     '*': {':Text'}}
    >>> mapping = ContentMapping(hard_xml_copy, divisibility=divisibility_map,
    ...                          chain_attr_name="chain")
    >>> _ = mapping.markup(match.start(), match.end(), "foreign",
    ...                    {'lang': 'de'})
    >>> xml_str = hard_xml_copy.as_xml(empty_tags={'lb'})
    >>> print(xml_str)
    <hard>
      Please mark up
      <foreign lang="de">
        Stadt
        <lb/>
        <location chain="VZT">
          <em>München</em>
        </location>
      </foreign>
      <location chain="VZT">
        in Bavaria
      </location>
      in this sentence.
    </hard>

Markup plays well together with restricted content mappings as the
following example may show::

    >>> tree = parse_xml("<doc>Please mark up Stadt\n<lb/>"
    ...     "<em>München</em><footnote>'Stadt <em>München</em>'"
    ...     " is German for 'City of Munich'</footnote> in Bavaria"
    ...     " in this sentence.</doc>")

Let's assume we'd like to mark up locations and text-passages in foreign
languages, but only in the main text and not within footnotes and the
like. For that purpose, we build a content mapping that is restricted to
non-footnote-text::

    >>> cm = ContentMapping(tree, select=LEAF_PATH, ignore='footnote',
    ...                     chain_attr_name='chain')
    >>> print(cm.content)
    Please mark up Stadt
    München in Bavaria in this sentence.

Now, let's assume for the sake of the example that we have a list of
location names to be marked up that contains the phrase "München in
Bavaria". So, we search for this phrase and add the required location
markup::

    >>> m = re.search(r"München\s+in\s+Bavaria", cm.content)
    >>> print(m)
    <re.Match object; span=(21, 39), match='München in Bavaria'>

    >>> _ = cm.markup(m.start(), m.end(), 'location')
    >>> print(tree.as_xml(empty_tags={'lb'}))
    <doc>
      Please mark up Stadt
      <lb/>
      <location>
        <em>München</em>
        <footnote>
          'Stadt
          <em>München</em>
          ' is German for 'City of Munich'
        </footnote>
        in Bavaria
      </location>
      in this sentence.
    </doc>

The <location>-element covers the entire span, including the footnote. This
is to be expected as changes are always carried out on the full tree. Only
the mapping is restricted to certain parts of the document. Usually, this
is also the desired behavior, though, admittedly, depending on the use case
another behavior (e.g. splitting the <location>-element into one part before
the <footnote>-element and one part after that element) might be preferable.
Such cases are not covered by the markup-method of class ContentMapping.

Because the <location>-element did not need to be split, it does not need
and therefore does not have a "chain"-attribute.

Next, let's add the <foreign>-element. (We substitute the value of its
chain-attribute, so that the doctest does not break, when another random
key is picked!)::

    >>> m = re.search(r'Stadt\s+München', cm.content)
    >>> _ = cm.markup(m.start(), m.end(), 'foreign', lang="de")
    >>> print(tree.as_xml(empty_tags={'lb'}))
    <doc>
      Please mark up
      <foreign lang="de" chain="RZC">
        Stadt
        <lb/>
      </foreign>
      <location>
        <foreign lang="de" chain="RZC">
          <em>München</em>
        </foreign>
        <footnote>
          'Stadt
          <em>München</em>
          ' is German for 'City of Munich'
        </footnote>
        in Bavaria
      </location>
      in this sentence.
    </doc>

Here again, one might ask why the <foreign>-tag contains the <lb>-tag,
but the choice makes sense, because if put together again, it should
cover the complete stretch including the line-break. Again, different
use cases and different choices are imaginable which, however, are not
covered by the :py:meth:`ContentMapping.markup`-method.
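
Putting the parts of a split element together again is straightforward once
they carry a common chain-attribute. The sketch below is plain Python on a
hypothetical flat list of ``(name, attributes, text)`` fragments; the function
``join_chains`` is not a DHParser function.

```python
from collections import defaultdict

# Hypothetical fragments in document order, as they might be collected
# from the tree of the last example
fragments = [
    ("foreign", {"lang": "de", "chain": "RZC"}, "Stadt\n"),
    ("foreign", {"lang": "de", "chain": "RZC"}, "München"),
    ("location", {}, "München in Bavaria"),
]


def join_chains(fragments, chain_attr="chain"):
    """Concatenate the texts of all fragments that share a name and chain ID."""
    chains = defaultdict(str)
    for name, attributes, text in fragments:
        if chain_attr in attributes:
            chains[(name, attributes[chain_attr])] += text
    return dict(chains)


joined = join_chains(fragments)
```

Elements without the chain-attribute, like ``location`` here, were never split
and therefore do not show up in the result.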


Error Messages
--------------

Although errors are typically located at a particular point or range of the
source code, DHParser treats them as global properties of the syntax tree
(albeit with a location), rather than attaching them to particular nodes. This
has two advantages:

1. When restructuring the tree and removing or adding nodes during the
   abstract-syntax-tree-transformation and possibly further
   tree-transformations, error messages do not accidentally get lost.

2. It is not necessary to add another slot to the Node class for keeping an
   error list which most of the time would remain empty, anyway.

In order to track errors and other global properties, module :py:mod:`~nodetree`
provides the :py:class:`~nodetree.RootNode`-class. The root-object of
a syntax-tree produced by parsing is of type :py:class:`~nodetree.RootNode`.
If a root node needs to be created manually, it
is necessary to create a plain :py:class:`~nodetree.Node`-object and either
pass it to :py:class:`~nodetree.RootNode` as parameter on instantiation or, later,
to the :py:meth:`~nodetree.RootNode.swallow`-method of the RootNode-object::

    >>> document = RootNode(sentence, str(sentence))

The second parameter is normally the source code. In this example we
simply use the string representation of the syntax-tree originating in
`sentence`. Before any errors can be added, the source-position fields of
the nodes of the tree must have been initialized. Usually, this is
done by the parser. Since the syntax-tree in this example does not stem
from a parsing-process, we have to do it manually::

    >>> _ = document.with_pos(0)

Now, let's mark all ``word``-nodes that contain non-letter characters with an
error-message. There should be plenty of them, because, earlier, we have
replaced some of the words partially with "..."::

    >>> import re
    >>> len([document.new_error(node, "word contains illegal characters")
    ...      for node in document.select('word')
    ...          if re.fullmatch(r'\w*', node.content) is None])
    3
    >>> for error in document.errors_sorted:  print(error)
    1:1: Error (1000): word contains illegal characters
    1:6: Error (1000): word contains illegal characters
    1:11: Error (1000): word contains illegal characters

The format of the string representation of Error-objects resembles that of
compiler messages and is understood by many text editors, which can then mark
the errors in the source code.
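
How a source position turns into the ``line:column`` prefix of such a message
can be sketched in plain Python. The helper ``line_col`` and the tuple layout
of ``errors`` are assumptions for this illustration and not DHParser's
Error-class; only the output format is modeled on the doctest above.

```python
def line_col(source, pos):
    """Return the 1-based line and column of position `pos` in `source`."""
    line = source.count("\n", 0, pos) + 1
    column = pos - source.rfind("\n", 0, pos)   # rfind yields -1 on line 1
    return line, column


source = "...s ...s ...m Palace"

# Hypothetical error records (position, code, message) in order of creation
errors = [
    (10, 1000, "word contains illegal characters"),
    (0, 1000, "word contains illegal characters"),
    (5, 1000, "word contains illegal characters"),
]

report = []
for pos, code, message in sorted(errors):       # sorted by source position
    line, column = line_col(source, pos)
    report.append(f"{line}:{column}: Error ({code}): {message}")
```

Sorting by position rather than by the time of creation corresponds to the
difference between ``errors`` and ``errors_sorted``.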


Attribute-Handling
------------------

While the "Node.attr"-field can be used to store data of any kind, it
will often just serve to store XML-attributes, the value of which is
always a string. The :py:mod:`DHParser.nodetree`-module provides a
mini-API to simplify typical use cases of XML-attributes.

One important use case of attributes is to add or remove CSS-classes to or from
the "class"-attribute. The "class"-attribute is understood as containing a set
of whitespace-delimited strings. Module "nodetree" provides a few functions to
simplify class-handling::

    >>> paragraph = Node('p', 'veni vidi vici')
    >>> add_class(paragraph, 'smallprint')
    >>> paragraph.attr['class']
    'smallprint'

Although the class-attribute is filled with a sequence of strings, it should
behave like a set of strings. For example, one and the same class name should
not appear twice in the class attribute::

    >>> add_class(paragraph, 'smallprint justified')
    >>> paragraph.attr['class']
    'smallprint justified'

Plus, the order of the class strings does not matter, when checking for
elements::

    >>> has_class(paragraph, 'justified smallprint')
    True
    >>> remove_class(paragraph, 'smallprint')
    >>> has_class(paragraph, 'smallprint')
    False
    >>> has_class(paragraph, 'justified smallprint')
    False
    >>> has_class(paragraph, 'justified')
    True

The same logic of treating blank separated sequences of strings as sets
can also be applied to other attributes::

    >>> car = Node('car', 'Porsche')
    >>> add_token_to_attr(car, "Linda Peter", 'owner')
    >>> car.attr['owner']
    'Linda Peter'

Or, more generally, to strings containing whitespace-separated substrings::

    >>> add_token('Linda Paula', 'Peter Paula')
    'Linda Paula Peter'
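
The set-like treatment of blank-separated strings is easy to re-create in
plain Python. The three functions below are a minimal sketch with deliberately
different, hypothetical names; they are not the functions exported by
:py:mod:`DHParser.nodetree` and only mirror the behavior shown in the doctests
above.

```python
def tokens_add(sequence, tokens):
    """Append those tokens that are not yet contained, keeping the order."""
    result = sequence.split()
    for token in tokens.split():
        if token not in result:
            result.append(token)
    return " ".join(result)


def tokens_remove(sequence, tokens):
    """Remove all of the given tokens from the sequence."""
    unwanted = set(tokens.split())
    return " ".join(t for t in sequence.split() if t not in unwanted)


def tokens_contained(sequence, tokens):
    """True if every given token occurs in the sequence, in any order."""
    return set(tokens.split()) <= set(sequence.split())


owners = tokens_add('Linda Paula', 'Peter Paula')
classes = tokens_add('smallprint', 'smallprint justified')
```

Keeping the tokens in a list while adding preserves their original order,
whereas the membership test compares sets, so the order of the queried tokens
does not matter.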


*Classes and Functions-Reference*
---------------------------------

The full documentation of all classes and functions can be found in module
:py:mod:`DHParser.nodetree`. The following table of contents lists the most
important of these:

class Node
^^^^^^^^^^

* :py:class:`~nodetree.Node`: the central building-block of a node-tree

  * :py:attr:`~nodetree.Node.result`:
    either the child nodes or the node's string content

  * :py:attr:`~nodetree.Node.children`:
    the node's immediate children or an empty tuple

  * :py:attr:`~nodetree.Node.content`:
    the concatenated string content of all descendants

  * :py:attr:`~nodetree.Node.name`:
    the node's name

  * :py:attr:`~nodetree.Node.attr`:
    the dictionary of the node's attributes

  * :py:attr:`~nodetree.Node.pos`:
    the source-code position of this node, in case the node stems from
    a parsing process

    **Navigation**

  * :py:meth:`~nodetree.Node.select`:
    Selects nodes from the tree of descendants.

  * :py:meth:`~nodetree.Node.pick`:
    Picks a particular node from the tree of descendants.

  * :py:meth:`~nodetree.Node.locate`:
    Finds the leaf-node covering a particular location of string
    content of the tree originating in this node.

  * :py:meth:`~nodetree.Node.select_path`:
    Selects :ref:`paths <paths>` from the tree of descendants.

  * :py:meth:`~nodetree.Node.pick_path`:
    Picks a particular path from the tree of descendants.

  * :py:meth:`~nodetree.Node.locate_path`:
    Finds the path of the leaf-node covering a particular location
    of string content of the tree originating in this node.

    **Serialization**

  * :py:meth:`~nodetree.Node.as_sxpr`:
    Serializes the tree originating in a node as S-expression.

  * :py:meth:`~nodetree.Node.as_xml`:
    Serializes the tree as XML.

  * :py:meth:`~nodetree.Node.as_json`:
    Serializes the tree as JSON.

    **XML-exchange**

  * :py:meth:`~nodetree.Node.as_etree`:
    Converts the tree to an XML-`ElementTree`_ as defined by the
    respective module from Python's standard library.

  * :py:meth:`~nodetree.Node.from_etree`:
    Converts an XML-`ElementTree`_ into a tree
    of :py:class:`~nodetree.Node`-objects.

    **Evaluation**

  * :py:meth:`~nodetree.Node.evaluate`:
    "Evaluates" a tree by picking the function to be run on each
    node from a dictionary that maps tag-names to functions.
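
The principle of such a dictionary-driven evaluation can be shown with nested
tuples instead of Node-objects. This is a sketch in plain Python; the function
``evaluate`` below is a stand-alone illustration and makes the assumption that
the function for a branch receives the evaluated children as positional
arguments, while the function for a leaf receives the string content.

```python
import operator


def evaluate(tree, actions):
    """Evaluate a (name, result) tuple-tree bottom-up via a name-to-function table."""
    name, result = tree
    if isinstance(result, str):                 # leaf: pass the string content
        return actions[name](result)
    children = [evaluate(child, actions) for child in result]
    return actions[name](*children)             # branch: pass evaluated children


actions = {"number": int, "add": operator.add}
tree = ("add", (("number", "5"), ("number", "4")))
value = evaluate(tree, actions)
```

For the small addition tree from the beginning of this document, the sketch
computes the sum of the two numbers.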


Reading serialized trees
^^^^^^^^^^^^^^^^^^^^^^^^

* :py:func:`~nodetree.parse_sxpr`:
  Converts any S-expression string to a tree of nodes.

* :py:func:`~nodetree.parse_xml`:
  Converts any XML-document to a tree of nodes.

* :py:func:`~nodetree.parse_json`:
  Converts a JSON-document that has
  previously been created with :py:meth:`~nodetree.Node.as_json` from a
  tree of nodes back to a tree of nodes.

* :py:func:`~nodetree.deserialize`:
  Tries to guess the data-type of a string and then calls any of the
  above deserialization-functions accordingly.


Traversing trees via paths
^^^^^^^^^^^^^^^^^^^^^^^^^^^

* :py:func:`~nodetree.prev_path`:
  Returns the :ref:`path <paths>` preceding a given path.

* :py:func:`~nodetree.next_path`:
  Returns the :ref:`path <paths>` following a given path.

* :py:func:`~nodetree.pp_path`:
  Pretty-prints the given :ref:`path <paths>`.


Attribute-handling
^^^^^^^^^^^^^^^^^^

* :py:func:`~nodetree.has_token_on_attr`:
  Checks whether an attribute of a node
  contains one or more tokens, i.e. blank separated sequences of letters.

* :py:func:`~nodetree.add_token_to_attr`:
  Adds a token to a particular attribute of a node.

* :py:func:`~nodetree.remove_token_from_attr`:
  Removes a token from a particular attribute of a node.

* :py:func:`~nodetree.has_class`, :py:func:`~nodetree.add_class`, :py:func:`~nodetree.remove_class`:
  the same as above, only that these methods manipulate the tokens
  specifically of the class-attribute


class RootNode
^^^^^^^^^^^^^^

Any Node-object can be considered as the origin of a tree and none of
the "navigation"-functions requires a tree of nodes to start with
a RootNode-object. However, RootNode-objects provide support for certain
"global" aspects of a tree like keeping track of the source code with line
and column numbers and adding error messages. RootNode-objects can either
be initialized with a code node that will then be replaced by the
root-node or swallow a tree originating in a common node later.

* :py:class:`~nodetree.RootNode`: additional functionality for a tree of nodes

  * :py:data:`~nodetree.RootNode.errors`:
    a list of errors

  * :py:attr:`~nodetree.RootNode.errors_sorted`:
    the errors sorted by their position in the source code instead of the
    time of their having been added

  * :py:data:`~nodetree.RootNode.inline_tags`:
    a set of tags that will
    be printed on a single line with their content when serializing. (This
    helps to avoid undesired whitespace when exporting to HTML!)

  * :py:data:`~nodetree.RootNode.string_tags`:
    a set of tags that will be
    converted to simple strings that appear as mixed content inside their
    parent when serializing as XML

  * :py:data:`~nodetree.RootNode.empty_tags`:
    a set of tags that will be
    rendered as empty tags, e.g. ``<mytag />`` when serializing as XML

  * :py:meth:`~nodetree.RootNode.swallow`:
    Can be called once in the lifetime of the RootNode-object to assign
    this root-node to an existing tree of nodes.

  * :py:meth:`~nodetree.RootNode.new_error`:
    Creates and adds a new error.

  * :py:meth:`~nodetree.RootNode.as_xml`:
    Serializes the tree as XML taking into account the XML-customization
    attributes of the RootNode-object.


class ContentMapping
^^^^^^^^^^^^^^^^^^^^

ContentMapping represents a path-mapping of the string-content of
all or a specific selection of the leaf-nodes of a tree. A
content-mapping is an ordered mapping of the first text position of
every (selected) leaf-node to the path of this node.

The class provides methods for mapping string positions to paths and
offsets (relative to the beginning of the leaf-node of the path).

* :py:class:`~nodetree.ContentMapping`: Mapping the tree to its string-content

    * :py:meth:`~nodetree.ContentMapping.get_path_and_offset`:
      Maps positions in string-content of the ContentMapping to the
      leaf-path into which they fall

    * :py:meth:`~nodetree.ContentMapping.iterate_paths`:
      Yields all paths from
      position ``start_pos`` up to and including position ``end_pos``.

    * :py:meth:`~nodetree.ContentMapping.insert_node`:
      Inserts a node at a
      particular text-position.

    * :py:meth:`~nodetree.ContentMapping.markup`:
      Adds markup (i.e. an element) to a particular stretch of text.

.. _ElementTree: https://docs.python.org/3/library/xml.etree.elementtree.html
.. _mixed content: https://www.w3.org/TR/xml/#sec-mixed-content
.. _unist: https://github.com/syntax-tree/unist
.. _SXML: https://okmij.org/ftp/Scheme/SXML.html