Testing

DHParser provides a powerful unit-testing framework for testing individual components of a grammar separately through all stages of the processing pipeline. This framework makes it possible to:

- break down the complicated process of writing a grammar into the
  relatively simpler tasks of designing grammar-components, each
  of which can be tested individually

- turn writing and refactoring grammars into a controlled and
  manageable process. In fact, without unit-testing, refactoring
  grammars is hardly feasible at all.

- extend a grammar incrementally without breaking existing code.

- check the syntax-tree-construction as well as all subsequent
  transformations (as long as their result is serializable).

Because writing grammars can be difficult, refactoring and testing of grammars is crucial for the success of a project. Also, one and the same formal language can be described by different grammars. (Here, two formal languages are considered equal if the set of grammatically correct sentences is the same.) The grammar used influences the shape of the syntax-tree that the parser yields. Therefore, it is quite common to rewrite a grammar or parts of it more than once in the course of developing a formal notation. For example, in the first iteration one tries to find a grammar that matches the given or intended notation. In the second iteration, the grammar is refined to yield a well-shaped syntax-tree that makes further processing easier. Unit-tests help to safeguard this process against later changes breaking earlier ones.

Writing grammar-tests

The canonical way to write grammar-tests in a DHParser-project is to put them into a file with a name that matches the glob-pattern *_test_*.ini. Tests can then simply be executed by running the tst_..._grammar.py-script that has been generated by the dhparser-command (see Full scale DSLs). The format of these files strongly resembles that of common config-files. Here is a test for the outline-parser example from the overview (Macros):

[match:document]
M1: """# Main Heading
    ## Section 1
    ### SubSection 1.1
    ### SubSection 1.2
    ## Section 2"""

[fail:document]
F1: """# Main Heading
    ## Section 1
    #### SubSubSection 1.1.1  BADLY NESTED!!!
    ### SubSection 1.2
    ## Section 2"""

Test-files (“.ini”-files) are separated into sections by headings that are enclosed in square-brackets, i.e. [ and ]. A heading’s name consists of two parts, separated by a colon :. The first part indicates the kind of tests that are specified under the heading. The second part is the name of the parser that will be called with these tests as input. Thus, the heading “[match:document]” means that the following tests will be fed to the parser “document” and the parser is expected to match. The following heading “[fail:document]” introduces tests where the parser “document” is expected to fail, i.e. the test is successful if the parser does not match the snippet fed to it.

The tests themselves are specified by a name, followed by a colon :, followed by a single-line or multi-line string which, like Python-strings, can be enclosed in single or triple single-quotation-marks ' or double quotation-marks ". Multi-line strings must always be indented by four spaces for all lines except the first line! (The indentation will automatically be removed before running the test.)

Attention

The names of fail-tests must differ from the names of match-tests! One way to do so is to mark the tests with a special letter, like “M” for match-tests and “F” for fail-tests, e.g. “M1”, “M2”, “M3”, … and “F1”, “F2”, “F3”, …

Running grammar-tests

The test-runner-script

Grammar-tests can be run by calling the (auto-generated) test-grammar-script with the filename of a test-file as argument. Alternatively, the script can be run without any arguments, in which case it will look for a “tests_grammar” folder inside the current directory and then run all test-files in this directory, where any file whose name matches the glob-pattern *_test_*.ini is considered a test-file.

Tip

It is a good idea to add the DHParser-project’s tst_..._grammar.py-script to the executable tools of your Python-IDE. Then it suffices to simply point to the test in the IDE’s file-manager and pick the tool from the menu to run a particular test.

This works pretty well with PyCharm, but is also possible with most other integrated development environments or code-editors.

From the command-line grammar-tests can be run with a call like this one:

$ tst_outline_grammar.py

to run all tests from the “tests_grammar”-subdirectory that are contained in test-files the filename of which matches the test-filename-pattern *_test_*.ini, or, in order to run just the tests from a single test-file:

$ tst_outline_grammar.py tests_grammar/03_test_Outline.ini

In the above examples the project name is “outline”, hence the middle part “_outline_” of the test-script name. In other projects the name of the auto-generated test-script will differ accordingly.

When calling the script with a single file-name as argument, it is not necessary that the file-name matches the test-filename-pattern. For example:

$ tst_outline_grammar.py tests_grammar/Playground.ini

works just as well, as long as the file “tests_grammar/Playground.ini” exists, even though its name does not match the test-filename-pattern and will therefore be overlooked if the script is called without any arguments. This can be quite useful if you want to experiment with tests that you might not (yet) want to add to your regular test-suite.

Test-reports

After the test has been run, the results can be found in the “REPORT”-subdirectory of the tests-directory. For each test-file that has been executed the REPORT-subdirectory contains a Markdown-file with the detailed results.

Failures and successes as such will also directly be reported in the terminal-output of the command. If all tests have been successful, the last line of the terminal-output reads: “SUCCESS! All tests passed :-)”. If one or more failures occurred, the number of failed tests will be reported.

The test-code for each test will be repeated in the report-file, followed by the abstract-syntax-tree (AST) that the code generated in the case of (successful) match-tests or the error-messages in case of successful fail-tests. This information is not only helpful for testing purposes, but also for the implementation of further processing stages which rely on the shape of the abstract syntax-tree.

In our example of the outline-parser tests, an excerpt from the report file might look like this:

Match-test "M3"
----------------

### Test-code:

    # Main Heading
    ## Section 1
    ### SubSection 1.1
    ### SubSection 1.2
    ## Section 2

### AST

    (document
      (main
        (heading "Main Heading")
        (section
          (heading "Section 1")
          (subsection
            (heading "SubSection 1.1"))
          (subsection
            (heading "SubSection 1.2")))
        (section
          (heading "Section 2"))))

...

Fail-test "F2"
---------------

### Test-code:

    # Main Heading
    ## Section 1
    #### BADLY NESTED SubSubSection 1.1.1
    ### SubSection 1.2
    ## Section 2

### Messages:

3:1: Error (1010): 'EOF' expected by parser 'document', but »#### BADLY...« found instead!
3:4: Error (1040): Parser "document" stopped before end, at: »# BADLY NE...« Terminating parser.

You might expect that a test-report of the parser would show the concrete-syntax-tree (CST) rather than the AST. However, the CST can be quite verbose, depending on how far it has already been simplified in the grammar-definition (see Simplifying Syntax-Trees while Parsing), and it is usually less informative than the AST. Typically, you’ll want to see it only in very particular cases and only when debugging the AST-generation. For this purpose, DHParser’s testing-framework makes it possible to quickly turn the additional output of the CST in the test-report on and off by simply placing an asterisk * after the name of any match-test, or removing it again once the debugging has been done. If, for example, your test’s name is “M1”, you’d simply write M1*: ... in the test-input-file with the “.ini”-ending.
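
For instance, a match-test whose CST should additionally appear in the report could look like this (the test-content is just an illustration):

[match:document]
M1*: """# Main Heading
    ## Section 1"""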

In case a test fails, the error-messages will appear in the report-file. DHParser will still attempt to produce an abstract-syntax-tree (AST) and, potentially, the results of further processing-stages. But these will not necessarily represent any reasonable structures. Typically, for example, the AST will contain nodes named “ZOMBIE__” which either capture passages of the source code that could not be parsed properly due to the failure or, if empty, have been added as anchors for error-messages.

Debugging failed tests

More important is the fact that for each failed test an HTML-log will be produced in the “LOGS”-subdirectory which resides on the same level as the “REPORT”-subdirectory. (If this directory does not exist, it will be created the next time a test fails. Like the REPORT-directory it can safely be deleted, because it will always be recreated and populated anew during the next test-run.) The HTML-log contains a detailed log of the parsing process. It can be thought of as a post-mortem debugger for parsing that helps to find the cause of a test-failure. The most frequent causes of test-failures are 1) EBNF-coding-errors, i.e. some part of the EBNF-encoded grammar does not capture or reject a piece of the source text that it was expected to capture or reject, or 2) the grammar does not yet encode certain constructs of the formal target-language and needs to be extended. Here is an excerpt from the test-log of a failed test from a converter for TypeScript-type-definitions which does not yet know the “extends”-keyword and therefore fails a particular unit-test:

L   C  parser call sequence               success  text matched or failed

1   1  type_alias->`export`               DROP     export type Exact<T extends { [key: stri…
1   8  type_alias->`type`                 DROP     type Exact<T extends { [key: string]: un…
1  13  type_alias->identifier->!`true`    !FAIL    Exact<T extends { [key: string]: unk …
1  13  type_alias->identifier->!`false`   !FAIL    Exact<T extends { [key: string]: unk …
1  13  type_alias->identifier->_part      MATCH    Exact<T extends { [key: string]: unknown…
1  18  type_alias->identifier->`.`        FAIL     <T extends { [key: string]: unknown …
1  13  type_alias->identifier             MATCH    Exact<T extends { [key: string]: unknown…
1  18  type_alias->type_parameters->`<`   DROP     <T extends { [key: string]: unknown }…
.   .  ...                                ...      ...
1  19  … ->parameter_types                MATCH    T extends { [key: string]: unknown }> = …
1  21  type_alias->type_parameters->`,`   FAIL     extends { [key: string]: unknown }> …
1  21  type_alias->type_parameters->`>`   FAIL     extends { [key: string]: unknown }> …
1  21  type_alias->type_parameters        ERROR    ERROR 1010, 50 extends { [key: string]: …

Typically, the parsing-log is quite long and the error becomes apparent only at the very end. So it is advisable to scroll right to the bottom of the page and look at the error-message to see what has caused the test to fail. (For the sake of brevity the message has been omitted from the above excerpt, though the error-number 1010 for mandatory-continuation-errors already indicates that an item other than the following “extends” was expected.)

The parsing-log logs the match or non-match of every leaf-parser (i.e. every parser that does not call other parsers but tries to match the next part of the text directly) that is applied during the parsing process. The steps leading up to the call of a leaf-parser are not recorded individually but can be seen from the call-stack which follows the line- and column-number of the place in the document where the parser tried to match.

The match or non-match of the leaf-parser is indicated by the success-state. There are six different success-states:

success  meaning
MATCH    the parser matched a part of the following text
DROP     the parser matched, but the matched text was dropped from the CST
FAIL     the parser failed to match the following text
!MATCH   the parser matched, but as part of a negative lookahead this counts as a failure
!FAIL    the parser failed, but as part of a negative lookahead this counts as a match
ERROR    a syntax error was detected during parsing

Finally, the last part of each entry (i.e. line) in the log is an excerpt from the document at the location where the parser stood. In the HTML-log, colors indicate which part of the excerpt was matched. (In the pure text-output as shown above this can only be inferred from the next line.)

With this information in mind you should be able to “read” the above log-excerpt. It takes a while to get used to reading parsing-logs, though. Reading logs can become confusing when lookahead or, in particular, lookbehind-parsers are involved. Also, keep in mind that DHParser uses memoizing to avoid parsing the same part of a document over and over again with the same parser. Thus, if you encounter a line in the log where the call-stack appears to be clipped, this is usually due to memoizing and the same parser having been called at the same location earlier in the parsing process. (You might find the first instance by looking for the same line and column in the earlier part of the log.) Still, looking at the parsing-log helps to find and understand the causes of unexpected parser-behavior quickly.

Tip

Parsing-logs are by default only generated for failed tests. In case you’d like to see the parsing-log of a successful test, a simple trick is to flip the type of the test from “match” to “fail” in the “.ini”-file or vice versa.

The test with the flipped type will then be reported as a failure, but the parsing-log is just the same as if it had been a success. Once you have seen the log, you can flip the type back again to get a correct test-report.
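
For example, to see the parsing-log of the successful match-test “M1” from above, you could temporarily change the heading of its section (a sketch; the test will, of course, now be reported as a failure):

[fail:document]
M1: """# Main Heading
    ## Section 1
    ### SubSection 1.1
    ### SubSection 1.2
    ## Section 2"""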

Development-Workflows

The development workflows for writing parsers for domain-specific languages (DSLs) and for parsing (semi-)structured text-data are very similar. The difference is merely that in the latter case plenty of sample material already exists, while in the former case one usually starts by drawing up some examples.

In both cases, however, it takes many iterations of adjustments and refinements before the grammar stands. In the case of a DSL, even the DSL itself might be adjusted in the course of the development, requiring corresponding changes to the grammar.

This is where test-driven grammar-development comes into play. Before even writing a grammar and running it on complete documents, you start with a small subset that you gradually extend. There are basically two strategies for grammar-development:

  1. Top-down grammar-development, where one starts with the macro-structure and uses summary-parsers to gloss over the micro-structure, which will be replaced later.

  2. Bottom-up grammar-development, where you start with parsers for the parts of the documents and later connect them with higher-level parsers.

Of course, it is also possible to work from both ends and to follow both strategies at the same time, until top-down and bottom-up development meet in the middle.

In the following, we will look at both strategies using the example of our outline-parser. In case you want to re-enact the steps yourself, you should start by creating a new project with the dhparser-command:

$ dhparser outline
$ cd outline

Top-Down-Grammar-Development

Suppose you’d like to write a Markdown-parser. With a top-down-strategy you’d start with the outermost elements, which in this case is the outline of the document, i.e. the structure of headings and sub-headings. In the true spirit of test-driven-development we start by writing some tests before even coding the first draft of our grammar. So we add a file tests_grammar/01_test_outline.ini to the freshly created project with the following content:

[match:document]
M1: """# Main Heading
    ## Section 1
    ### SubSection 1.1
    ### SubSection 1.2
    ## Section 2"""

[fail:document]
F1: """# Main Heading
    ## Section 1
    #### BADLY NESTED SubSubSection 1.1.1
    ### SubSection 1.2
    ## Section 2"""

The meaning of these two test-cases should be obvious: The first is a document that only contains an outline, but not yet any content - because we will start writing our grammar top-down with the definition of the outline-elements, leaving out the content-elements for now. The match-test will check that our grammar matches a properly formed document-outline.

The second is a fail test, which checks that the parser for our grammar does not accidentally match a badly structured document. Now, we will start writing a grammar that is suitable to capture the snippet from our match-test. As you’ll see in the following, this already requires quite a few definitions. Here is our first attempt (which still contains a mistake!):

# First attempt at an outline-grammar. Can you spot the error?

#  EBNF-Directives
@ whitespace = /[ \t]*/  # only horizontal whitespace, no line-feeds
@ reduction  = merge     # simplify tree as much as possible
@ hide       = WS, EOF, LINE, GAP
@ drop       = WS, EOF, backticked, whitespace

#:  Outline
document = [WS] main [WS] §EOF

main  = `#` ~ heading { WS section }
section  = `##` ~ heading { WS subsection }
subsection  = `###` ~ heading { WS subsubsection }
subsubsection  = `####` ~ heading { WS s5section }
s5section  = `#####` ~ heading { WS s6section }
s6section  = `######` ~ heading

heading = LINE

#:  Regular Expressions
LINE = /[^\n]+/         # everything up to the next linefeed
GAP  = /(?:[ \t]*\n)+/  # any ws at line-end and all following empty lines
WS   = GAP              # same as GAP, but will be dropped
EOF  =  !/./  # no more characters ahead, end of file reached

When running the grammar-tests, we notice that while the match-test passes as expected, the fail-test fails, that is, it captures the badly structured outline, although it shouldn’t. The output of the tst-grammar-script on the console looks like this:

GRAMMAR TEST UNIT: 01_test_outline
  Match-Tests for parser "document"
    match-test "M1" ... OK
  Fail-Tests for parser "document"
    fail-test  "F1" ... FAIL

Can you guess why the fail-test did not pass? If not, it helps to cast a look at the parsing-log of the failed test that has been stored in the file “tests_grammar/LOGS/fail_document_F1_parser.log.html”. There you find the suspicious lines:

3  1  document->main->section->subsection->`###`          DROP   #### BADLY NESTED SubSubSection 1.1.1 ##…
3  4  document->main->section->subsection->heading->LINE  MATCH  # BADLY NESTED SubSubSection 1.1.1 ### S…

Obviously, the parser “subsection” found its marker consisting of three #-signs, but then it did not stop at the next #-sign; instead, it left the rest to be captured by its “heading”-parser, which simply reads the rest of the line, no matter what it looks like.

The remedy is simple: We add a negative lookahead to check that no further #-sign follows the heading-marker. Otherwise, the respective section-, subsection-, etc. parser simply won’t match. So, in the “Outline”-section of our grammar, we change the following definitions accordingly:

main  = [WS] `#` !`#` ~ heading { WS section }
section  = `##` !`#` ~ heading { WS subsection }
subsection  = `###` !`#` ~ heading { WS subsubsection }
subsubsection  = `####` !`#` ~ heading { WS s5section }
s5section  = `#####` !`#` ~ heading { WS s6section }
s6section  = `######` !`#` ~ heading

This time the grammar-tests yield the desired result:

GRAMMAR TEST UNIT: 01_test_outline
  Match-Tests for parser "document"
    match-test "M1" ... OK
  Fail-Tests for parser "document"
    fail-test  "F1" ... OK

Before going further down with our top-down-design of the grammar, we draw up a test-case that contains more structural details. For this purpose we add under the heading [match:document] another test-case with a little more structure:

M2: """# Main Heading

    Some introductory Text

    ## Section 1
    One paragraph of text

    Another paragraph of text. This
    time stretching over several lines.

    ## Section 2
    ### Section 2.1
    ### Section 2.2

    The previous section is (still) empty.
    This one is not.
    """

If we run the test now, it will expectedly fail with an error-message like “3:1: Error (1010): ‘EOF’ expected by parser ‘document’, but »Some intro…« found instead!”. Before the test succeeds, we need to extend our grammar so as to capture the content inside of sections as well. In true top-down fashion, we first provide for the new content-elements, which we will call “blocks”, in the definitions of the section-elements:

main  = [WS] `#` !`#` ~ heading [WS blocks] { WS section }
section  = `##` !`#` ~ heading [WS blocks] { WS subsection }
subsection  = `###` !`#` ~ heading [WS blocks] { WS subsubsection }
subsubsection  = `####` !`#` ~ heading [WS blocks] { WS s5section }
s5section  = `#####` !`#` ~ heading [WS blocks] { WS s6section }
s6section  = `######` !`#` ~ heading [WS blocks]

Then, we define the “blocks”-element:

blocks  = !is_heading LINE { GAP !is_heading LINE }
is_heading = /##?#?#?#?#?(?!#)/
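
As an aside, the negative lookahead (?!#) in the “is_heading”-regex is what keeps it from matching at positions where more than six #-signs follow. If you want to convince yourself of this, a quick check with Python’s re-module might look like this (an illustrative sketch, not part of the generated project):

import re

# the same regular expression as in the grammar: one to six '#' not followed by another '#'
is_heading = re.compile(r'##?#?#?#?#?(?!#)')

print(bool(is_heading.match('### SubSection 1.1')))   # True  -> recognized as a heading marker
print(bool(is_heading.match('####### too deep')))     # False -> seven '#' in a row are rejected
print(bool(is_heading.match('plain text')))           # False -> no heading marker at all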

This time, the grammar passes the recently added test. However, the new element “blocks” is still a placeholder that does not capture the individual paragraphs, let alone other elements like lists or enumerations, as can easily be seen by looking at the generated abstract-syntax-tree (AST) in the test-report:

(document
  (main
    (heading "Main Heading")
    (blocks "Some introductory Text")
    (section
      (heading "Section 1")
      (blocks
        "One paragraph of text "
        ""
        "Another paragraph of text. This"
        "time stretching over several lines."))
    (section
      (heading "Section 2")
      (subsection
        (heading "Section 2.1"))
      (subsection
        (heading "Section 2.2")
        (blocks
          "The previous section is (still) empty."
          "This one is not.")))))

This use of “placeholder”-parsers which sweepingly capture larger chunks of text without dissecting their detailed structure is typical for the top-down approach. We could continue by replacing (or amending) the “blocks”-parser stepwise with more detailed parsers that capture individual paragraphs, highlighted passages etc., possibly making use of AST-tests (see below) in the process.

However, we will now turn the tables and start with the detail- or “fine”-structure of our outlined text in order to see how the bottom-up-approach works.

Bottom-Up-Grammar-Development

For the bottom-up-approach one must first consider what the smallest elements are that need to be semantically captured. Surely, it would be overkill to capture individual letters. One might think of words and lines, but individual words do not really matter in Markdown-texts, and lines have the disadvantage that highlighted elements might stretch over several lines.

A possible choice are pieces of text consisting of letters, punctuation and whitespace that may, but need not, stretch over more than a single line; that is, they may contain line-feeds, but they should not encompass empty lines. So, basically, text consists of non-whitespace elements interspersed with whitespace and single line-feeds. Let’s first write a few tests and then cast this into a formal definition, which in my humble opinion is even clearer than the verbal description. Here are the tests:

[match:TEXT]
M1: "A bit of text."
M2: """A bit of text
    over two lines!"""

[fail:TEXT]
F1: "  No leading whitespace"
F2: """Empty lines

     separate paragraphs!"""

And here is the definition of a piece of text (which, as is typical for the most atomic parsers, consists mostly of regular expressions enclosed by slashes):

TEXT      = CHARS { LLF CHARS }
CHARS     = /[^\s]+/             # sequence of non whitespace and non line-feed characters
LLF       = L | LF               # whitespace or single linefeed
L         = /[ \t]+/             # significant whitespace
LF        = /[ \t]*\n[ \t]*(?!\n)/  # a single linefeed

I am not going to explain the idioms used for encoding text-blocks (aka “paragraphs”) separated by empty lines here, as the code above should be clear enough with the given comments.

The next step will be a little bit more complicated: We would like to allow inline-markup inside paragraphs. Loosely following the Markdown conventions we would like to use a single underscore character (_) to mark emphasized text, e.g. _emphasized_, and double underscore markers to mark bold text, e.g. __bold__. Again, we start with writing test-code. We assume “emphasis” as the name of the parser for emphasized text and “bold” for bold text:

[match:emphasis]
M1: "_emphasized text_"
M2: """_multi
    line_"""

[match:bold]
M1: "__bold text__"
M2: """__multi
    line bold__"""

Now, using underscore characters to mark up emphasized or bold text raises the question of what to write if we would like to use the underscore as a normal character in our text without the intention of marking an emphasized block. For this purpose, we add an ordinary escape mechanism that allows any character to be used literally if it is preceded by a backslash. Let’s write a quick test:

[match:emphasis]
M3: "_emphasis with an escaped \_ character_"

[fail:emphasis]
F1: "_cannot complete parsing, because of a dangling _ underscore_"

Of course, we also need tests for markup text _containing_ emphasized or bold elements:

[match:markup]
M1: "This is **bold** and this is *emphasized*"
M2: """This is a text *with several
    emphasized words* as well as some
    **bold text that contains *emphasized words***."""

Now, let’s start coding! In the first step we will implement our escape-mechanism. For this purpose we define a new text element, named “text” with small letters (in contrast to the “TEXT”-parser defined above). Again, we write the test first:

[match:text]
M1: "Text with \_ three \\ escaped \x elements"

In this case it makes sense to also specify the expected result. With the following test, we test the flat string-representation of the abstract-syntax-tree (AST) that the parser “text” yields for the match-test “M1”. Note that the names of AST-tests and, in fact, of any other tests further down the processing pipeline must be the same as the names of the match-tests they refer to. (See below for more information on AST-testing.):

[AST:text]
M1: "Text with _ three \ escaped x elements"

Now we are ready to fill in the definitions for the parsers for which we have just written our tests:

text      = (TEXT | ESCAPED) { [LLF] (TEXT | ESCAPED) }
ESCAPED   = `\` /./

Note that since we drop any back-ticked literals (see the @drop-directive further above), the “ESCAPED”-parser should always yield the escaped character without the backslash in front of it.

Unfortunately, the AST-test fails with an error-message:

ast-test "M1" for parser "text" or deserialization of expected value failed:
    Expr.:     Text with \_ three \\ escaped \x elements
    Expected:  Text with _ three \ escaped x elements
    Received:  Text with \_ three \\ escaped \x elements

(The proviso “or deserialization of expected value failed” means that, had we specified the actual AST (e.g. (text “Text with _ three \ escaped x elements”)) rather than its flat string-representation, the cause of the error might also have been a syntax-error in the written-down abstract syntax-tree.)

Something went wrong! In order to find out what exactly went wrong, we could either look into the test-log or into the test-report which shows the full abstract-syntax-tree, which looks like this:

(text
  (TEXT
    (CHARS "Text")
    (LLF
      (L " "))
    (CHARS "with")
    (LLF
      (L " "))
    (CHARS "\_")
    ...

From this it becomes obvious that the “ESCAPED”-parser is never invoked but that the escape-sign “\” is captured by the “CHARS”-parser. Thus, we have to exclude it from the “CHARS”-parser explicitly, to avoid it being captured by CHARS and thus, indirectly, also by TEXT (with capital letters). At the same time we can take care to also exclude the underscore-delimiter from the regular expression defining the CHARS-parser:

CHARS = /[^\s\\_]+/

We also need to keep in mind that, should we add more inline-elements in the future, we have to exclude their delimiters from the CHARS-parser as well. Now, the test should succeed. (Or, if we have forgotten to add the contained parsers of “text” to the @hide-directive, we’ll find that the test still fails, but that at least the AST is sound in the sense that all escaped characters have been properly captured by the ESCAPED-parser.)

For defining markup-text that also contains bold and emphasized elements, we use the same idiom that we employed for enriching simple TEXT with ESCAPED characters in the definition of “text”:

markup    = [indent] (text | bold | emphasis)
            { [LLF] (text | bold | emphasis) }
indent    = /[ \t]+(?=[^\s])/
bold      = `__` §inner_bold `__`
emphasis  = `_` §inner_emph `_`
inner_emph = [~ &`_`]
             (text | bold) { [LLF] (text | bold) }
             [<-&`_` ~]
inner_bold = [~ &`_`]
             (text | emphasis) { [LLF] (text | emphasis) }
             [<-&`_` ~]

Note that by placing the emphasis-parser after the bold-parser in the definition of the markup-parser, we make sure that a bold-element is not accidentally captured as an emphasized-element containing another emphasized element.

The “mandatory marker” § ensures that errors when marking up bold or emphasized text will be located precisely. (See Mandatory items “§”.) For example, neither bold nor emphasized text may start or end with whitespace, e.g. * emphasized * must be written as *emphasized* instead.

The introduction of “inner_emph” and “inner_bold” is due to the fact that the beginning- and end-markers for emphasis and bold-text, respectively, are the same, which makes things a little more complicated from a parser-development point of view than using different beginning- and end-markers. Also, it shall be ensured that - while emphasis and bold-text can be nested within each other - emphasis and bold are not redundantly nested within themselves.

Both “inner_emph” and “inner_bold” - unlike “text” - allow leading and trailing (insignificant) whitespace before and after their content, in case it precedes or succeeds a nested emphasis- or bold-marker. This makes it possible to disambiguate nested bold and emphasized elements, when necessary, by adding whitespace. (Because the whitespace between bold and emphasis markers is only needed for disambiguation, it is treated as insignificant whitespace.) Otherwise:

* **bold** text inside emphasized text that can be parsed*

would have to be written as:

***bold** text inside emphasized text that fails to parse*

which leads to a parser-error. (See the CommonMark-specification for a more complicated solution to the same problem. Think about the pros and cons of either solution for a while, if you like!)

We could have skipped the introduction of the intermediary “text”-parser by adding ESCAPED-elements directly to the “markup”-parser, e.g.:

markup    = [indent] (TEXT | ESCAPED | bold | emphasis)
            { [LLF] (TEXT | ESCAPED | bold | emphasis) }

The reason this has not been done is that we would like to flatten ESCAPED chars and other TEXT, but not the markup-structures. If we add further inline-elements, like internet-links for example, we would not add more intermediaries but rather extend the “markup”-parser-definition. (You may want to try to add internet-links enclosed by < and > as an exercise!)

Before we stop the bottom-up-approach at this point, there is one last touch we might want to add: The abstract syntax tree (AST) still looks rather verbose, e.g.:

(markup
  (text
    (TEXT
      (CHARS "This")
      (LLF
        (L " "))
      (CHARS "is")))
...

Since, for further processing, we are only interested in distinguishing text from highlighted elements (i.e. emphasized and bold text), we add the more atomic elements LLF, L, LF, CHARS, TEXT, ESCAPED to the list of disposables at the beginning of the EBNF-grammar, which makes them disappear, merging their content into the higher-level elements. Thus we change the @hide-directive at the top of the grammar to:

@ hide = WS, EOF, LINE, GAP, LLF, L, LF,
         CHARS, TEXT, ESCAPED, inner_emph, inner_bold

Now, the syntax-trees look much smoother:

(markup
  (text "This is a text")
  (:L " ")
  (emphasis
    (markup
      (text
        "with several"
        "emphasized words")))
...

For even further refinement, you need to work with the AST-transformation-table that is found in the outlineParser.py-file. For example, by adding an entry to merge the whitespace nodes with the text-nodes:

"markup, bold, emphasis": [merge_adjacent(is_one_of('text', ':L', ':LF'), 'text')]

For now, we’ll stop with the bottom-up development and see if and how we can link the two parts of our grammar that we have developed so far, one in top-down and one in bottom-up-style.

Linking both approaches

In the top-down approach we have defined the largest, most encompassing elements, from the whole document and its sections down to the block-elements that make up the sections of the document. For the block-elements we have (for the time being) only defined a simple makeshift parser as a fill-in, to be replaced by parsers for the fine-structure later:

blocks  = !is_heading LINE { GAP !is_heading LINE }

Let’s again write a test first. Then it will be easy to spot the differences:

# Simple Test

## A test of bold- and emphasis-markup

  This paragraph contains *emphasized
text* that spreads over two lines.

  But what is this: ** *emphasized* and bold**
or * **bold** and emphasized*?

The AST reveals the use of the summary-parser for blocks that does not capture any markup inside paragraphs. In fact, it does not even divide the text into separate paragraphs:

(document
  (main
    (heading
      "Simple Test"
    )
    (section
      (heading
        "A test of bold- and emphasis-markup"
      )
      (blocks
        "  This paragraph contains *emphasized"
        "text* that spreads over two lines."
        ""
        "  But what is this: ** *emphasized* and bold**"
        "or * **bold** and emphasized*?"
      )
    )
  )
)

Now, we can replace the “LINE”-parser in the definition of “blocks”, above, by the parser for the markup-block that we have arrived at with the bottom-up-approach:

blocks  = !is_heading markup { GAP !is_heading markup }

The abstract syntax-tree is, as expected, much more verbose, because now it reflects the detailed structure of the markup:

(document
  (main
    (heading "Simple Test")
    (section
      (heading "A test of bold- am emphasis-markup")
      (blocks
        (markup
          (indent "  ")
          (text "This paragraph contains ")
          (emphasis
            (text
              "emphasized"
              "text"))
          (text " that spreads over two lines."))
        (GAP
          ""
          ""
          "")
        (markup
          (indent "  ")
          (text "But what is this: ")
          (bold
            (emphasis
              (text "emphasized"))
            (text " and bold"))
          (text
            ""
            "or ")
          (emphasis
            (bold
              (text "bold"))
            (text " and emphasized"))
          (text "?"))))))

As far as explaining the basics of test-driven grammar-development goes, this should suffice as an example. Admittedly, as far as coding a grammar for Markdown is concerned, there are still a few things to do, which are left as an exercise to the reader. Here are some suggestions for more exercises in test-driven grammar-development:

  1. The AST still keeps the content of L, LF and GAP literally, although L and LF are merged with adjacent text-nodes during the AST-transformation or even earlier. Ideally, though, they should be normalized (before merging). In order to do so, remove these tags from the list of disposable tags, and add normalizations to the AST-transformation-table in the parser-script.

  2. There are more inline-elements than bold and emphasis in Markdown. Add support for inline-quotes and URLs. Think about which symbol-definitions in the grammar need to be changed for which kind of inline-element in order to do so: “markup” or “text” or both? Or should new, intermediary symbols be introduced?

  3. You may have noticed that headings starting with one or more #-characters must be separated by at least one empty line from any preceding text-blocks (other headings do not count as text-blocks!). If you haven’t noticed this, verify it with suitable match- or fail-tests!

    How would the grammar have to be changed to allow headings to be detected as such, even if they directly follow a text block?

    Should the grammar be changed in that way? Or does it have advantages (for whom, the writer of the grammar or the writer of markdown-text?) to require headings to be separated by an empty line from preceding text?

  4. Add support for block-quotes, enumerations and unordered lists.

Final remarks

Specifying formal grammars is often considered a painstaking process. Test-driven development encourages you to try things out and just start writing grammar-code without worrying too much about whether you have thought of every detail before writing down the specification. You just start coding the grammar and worry about the details later, as you add more and more tests.

Top-down and bottom-up development are two different but complementary strategies for incremental development. There is no rule for when to use which of these approaches. Rather, one will switch between them in the course of the grammar-development as appropriate. The bottom-up approach is a bit simpler insofar as it does not require summary- or scaffolding-parsers to skip parts of a document for which the grammar has not yet been spelled out.

In connection with the bottom-up and top-down development-strategies test-driven grammar-development allows for “rapid prototyping” of grammars. DHParser’s ability to detect changes in the grammar-code and automatically recompile it before the parser is run allows for short turn-around-times and makes it easy to refactor the grammar frequently.

In Digital-Humanities-scenarios, formal grammars are not only used for parsing strictly defined formal notations (e.g. LaTeX) but also for retro-digitization or, rather, re-structuring of “semi-structured” human-written documents with a notation whose rules are only verbally described, often somewhat vague and incomplete, and in practice not always followed diligently. Examples are dictionaries (see Zacherl 2022), (specialized) bibliographies and the like. In these application-cases, parser-development requires very many iterations, and test-driven grammar-development becomes an almost invaluable tool. (An alternative approach would be to use machine-learning to “read” this kind of data, e.g. GROBID for bibliographies. Your mileage may vary with either approach. It is also, at least in principle, possible to employ machine-learning to find formal grammars that match a large set of test-cases (“grammar induction”).)

Monitoring AST-creation

So far, we have only written tests that allow us to check whether our parser(s) match or fail certain kinds of input as expected. However, we might also be interested in testing whether the abstract-syntax-tree (AST) that the parser yields has the expected shape - particularly since this shape is not strictly determined by our grammar (as is that of the concrete-syntax-tree) but also depends on the set of AST-transformations that we apply in order to transform the concrete-syntax-tree (CST) into the abstract-syntax-tree (AST). And these transformations may, of course, contain bugs.

One important method for checking tree-structures (as well as any other data-structures) is structural validation. This, however, requires specifying the structure of a valid AST in another formal language like Relax-NG, which is similar to and not much less complicated than specifying the grammar of a formal language with EBNF. For rapid prototyping of grammars, and especially in the early stages of grammar-development, this is hardly a viable option.

DHParser does not yet support structural validation of tree-data. However, DHParser makes it possible to compare the resulting syntax-tree (CST or AST) or its string-content against a given result for any match-test. It is also possible to check the data-trees or the string-serialized results of any further processing-stages in the same way (see below).

ASTs can be tested by adding an [AST:parser_name]-section to the test-file. “parser_name” must of course be replaced by a valid parser (symbol) of the grammar. Moreover, it must be a name for which a [match:parser_name]-section exists in the same test-file. Each AST-test is related to a match-test for the same parser. The relation between the AST-test and its match-test is established by using the same test-name, e.g. “M1”, “M2”, …, for both.

There exist two different types of AST-tests:

  1. Tests of the structure and content of the AST. Here, the test code is a complete tree that must be specified either as S-expression or as XML-code.

  2. Tests of the concatenated string-content (or “flat string-content”) of the tree. In this case, the test-code consists of a string that is enclosed in either single (') or double (") quotation-marks for single-line strings, or triple single (''') or triple double (""") quotation-marks for multiline strings - just like strings in Python.

    The following lines after the first line of multiline strings MUST be indented by 4 spaces. The indentation does not count as part of the test-string and will be automatically removed before the test-result is compared.

The following examples are motivated by a common requirement of electronic document-processing, namely the normalization of the document. Let’s assume that we want to perform the following three types of normalization on our text-data:

  1. The “GAP”-nodes between “markup”-nodes shall be dropped from the syntax-tree. After all, gaps of one or more empty lines merely serve as delimiters for paragraphs. Generally, delimiters are not needed in a syntax-tree any more, because the document-structure is already expressed by the tree-structure.

  2. Line-feeds within paragraphs should be replaced by single blank characters to achieve a stronger regularity of the text-content. After all, the exact place where a line-feed occurs is not relevant anyway and may change depending on the output-form or device.

  3. Sequences of blank characters should be normalized to a single blank character.

Testing the tree-structure directly

As mentioned, test-cases for ASTs consist of two parts, a match-test-case and a related AST-test-case. A simple trick for writing the AST-test-case quickly is to write the match-test first, let the test-script run, and then copy and paste the AST from the report-file into the test-file (“.ini”-file) as the AST-test-case. Finally, edit this AST to its desired shape and take it as a starting point for programming the AST-transformation or earlier tree-simplifications via the @hide- and @drop-directives.

Here is the full test case for dropping GAP-nodes:

[match:document]
A1: """# No gaps. please

    one paragraph

    and another paragraph"""

[AST:document]
A1: (document
      (main
        (heading "No gaps. please")
        (blocks
          (markup
            (text "one paragraph"))
          (markup
            (text "and another paragraph")))))

Note that the AST-test-case has the same name, in this case “A1”, and that the code of the AST-test is, of course, not enclosed in quotation marks. The code describing the tree can - as in the example above - be denoted either as an S-expression or as XML. The results will nevertheless be reported as S-expressions. (If you prefer XML-output, you need to change the respective configuration-value for tree-serializations.)

Since we have not yet adjusted the grammar and the AST-transformation-code in order to drop the GAP-nodes, running the test-script yields a failure of the AST-test “A1”:

Abstract syntax tree test "A1" for parser "document" failed:
...
        Expected:  (document
                     (main (heading "No gaps. please")
                     (blocks
                       (markup (text "one paragraph"))
                       (markup (text "and another paragraph")))))
        Received:  (document
                     (main (heading "No gaps. please")
                     (blocks (markup (text "one paragraph"))
                     (GAP "" "" "")
                     (markup (text "and another paragraph")))))

The required adjustments for making the test pass are quite trivial: Simply add the “GAP”-symbol to both the @hide- and the @drop-directive of the grammar, and the reported AST-test-failure will disappear.
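
With the directives as they were left at the end of the bottom-up example, the adjusted directives might then read as follows (a sketch; only GAP is new in the @drop-directive, since it was already listed under @hide):

@ hide = WS, EOF, LINE, GAP, LLF, L, LF,
         CHARS, TEXT, ESCAPED, inner_emph, inner_bold
@ drop = WS, EOF, GAP, backticked, whitespace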

Testing the string-content of a tree

For the conversion of line-feeds to single whitespace-characters, we will use a simple string-comparison instead of a full tree-comparison as the test (see above):

[match:text]
A1*: """Text in
    two lines"""

[AST:text]
A1: "Text in two lines"

This time, because it is a string-comparison, the test-code must stand within quotation marks. We mark the match-test with an asterisk “*” in order to receive output for the CST in the report, too. This will be helpful for engineering the AST-transformations that we need for the normalization. The AST-test shows what kind of result we expect in the end. Again, as we have not yet changed our grammar or parser-script, we will receive an error-message when running the test-script:

AST-test A1* for parser text or deserialization of expected value failed:
...
        Expected:  Text in two lines
        Received:  Text in
two lines

As we did not specify the expected result as an (S-expression) tree but as a string, the expected and received results are also printed as strings in the error-message. Also, the error-message is slightly more vague, because there is the possibility that the comparison of expected and received result failed due to the expected result having unintentionally been mis-specified - which, however, is not the case here.

If we look up the AST and CST-trees in the report file, we find that both read as:

(text "Text in" "two lines")

Note that multiline text in tree-nodes is rendered by DHParser as a sequence of strings rather than as a multiline string with line-feeds. So, the “text”-node really consists of one string with a line-break in the middle. The line-break is not explicit in the form of an “LF”-node, because LF has - just like the significant whitespace and the character-sequences that make up the text-element - been added to the @hide-directive in the grammar. These LF- and L-nodes will already be merged with CHARS-nodes wherever possible during the parsing-stage.

There are two possible strategies for replacing the line-feeds with whitespace: either a) replacing the line-feed-characters in the string-content of the text-nodes during the AST-transformation by writing a dedicated transformation-procedure, or b) removing LF from the list of disposable symbols in the grammar, then exchanging the content of each LF-node with a single whitespace-character during the AST-transformation-stage - perhaps also renaming it to “L” in the course of doing so - and, finally, merging any CHARS- and L-nodes within all nodes where they could possibly appear (i.e. text, bold and emphasis) into a single flat node again.
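
Strategy a) could be realized with a small, dedicated transformation-procedure along the following lines (a rough sketch only; the function-name is made up and the code assumes the usual path-based calling convention for functions in DHParser's transformation-tables):

def replace_linefeeds_by_blanks(path):
    # 'path' is the list of nodes from the root down to the node to be transformed;
    # the node itself is the last element of the path
    node = path[-1]
    if not node.children:                                # only touch leaf-nodes
        node.result = node.content.replace('\n', ' ')    # turn line-feeds into blanks

# hypothetical table-entry:
#     "text": [replace_linefeeds_by_blanks],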

Although b) is more complicated, we will follow it, because it is the more general approach. So, we start by adjusting our list of disposable nodes, i.e. nodes that, like anonymous nodes, can already be flattened during parsing. It then reads:

@ hide  = WS, EOF, LINE, GAP, LLF, L, CHARS, TEXT, ESCAPED,
          inner_emph, inner_bold, blocks

If we run the test script, the result of the test in the report file now reads:

Match-test "A1*"
-----------------

### Test-code:

    Text in
    two lines

### Error:

AST-test A1* for parser text or deserialization of expected value failed:
    Expr.:     Text in
    two lines
    Expected:  Text in two lines
    Received:  Text in
two lines

### CST

    (text (:Text "Text in") (LF "" "") (:Text "two lines"))

### AST

    (text (:Text "Text in") (LF "" "") (:Text "two lines"))

As expected, the AST-error is duly reported, and the CST and the AST are identical, since we have not yet programmed the AST-transformation to change the line-feed nodes (“LF”) into whitespace-nodes.

What might appear surprising is the occurrence of “:Text”-nodes inside the “text”-nodes. “:Text” is DHParser’s stock name for anonymous leaf-nodes. (“:Text”-nodes can be thought of as the equivalent of the plain-text parts inside mixed-content XML-tags.) “:Text”-nodes are created when the parser merges anonymous leaf-nodes of different types. (See the definition of anonymous nodes and Simplifying Syntax-Trees while Parsing to understand the role of anonymous nodes and the “early merging” of leaf-nodes.)

Now, let’s adjust the tree-transformation so that all line-feeds will be replaced by whitespace. The following shows one possible way this can be achieved. (For a general understanding of this kind of quasi-declarative tree-transformation, see Declarative Tree-Transformation.):

outline_AST_transformation_table = {
    "LF": [replace_content_with(' '), change_name(':L')],
    "markup, bold, emphasis, text":
          [merge_adjacent(is_one_of('text', ':Text', ':L'), 'text'),
           apply_if(reduce_single_child, is_one_of('text'))],
}

Compared to the earlier version of the transformation-table one entry (“markup, bold, …”) has been changed in several ways and another entry (“LF”) has been added. Let’s look at the differences one by one:

  1. First of all, the entry “LF” has been added. The transformations that are performed on “LF”-nodes replace their content with a single whitespace and then rename these nodes to “:L” (anonymous significant whitespace).

    The latter is not strictly necessary, but helpful (for debugging) to avoid surprises, since with the change of the content, the semantics of the name “LF” are not appropriate any more.

  2. Then, “text” has been added to the list of elements within which adjacent child-elements with purely textual content are merged. Therefore the key of this table entry now reads “markup, bold, emphasis, text”.

  3. Then, the list of elements passed to the boolean check is_one_of() in the condition-clause of merge_adjacent() has been adjusted by adding “:Text” and removing “:LF”, which cannot occur anymore anyway, because the transformations are applied depth-first by DHParser, and before an LF-child-node is merged with other nodes by its parent-element, it has already been renamed.

    Note that merge_adjacent() is a transformation that (potentially) merges (some of) the children of the node on which it is called or, more precisely, of the last node in the path on which it is called.

  4. Following the merging of adjacent nodes, the reduce_single_child-transformation will be applied under the condition that the path on which it is called ends in a “text”-node.

    This is done to ensure that “text”-nodes always come out as leaf-nodes and are not nested within themselves. By contrast, “bold” and “emphasis” are always supposed to be branch nodes, even in the case they contain a single “text”-node as child.

    The deeper reason for both modeling “text”-nodes as leaf-nodes and “markup”, “bold”, “emphasis” as branch-nodes is that it makes further processing easier when nodes of the same kind (i.e. nodes with the same name) always have the same type (branch or leaf).

With these changes in place, the indirect test of the AST via its string-content succeeds. There are no line-feeds in the string anymore, but only whitespace.
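
If you would like to observe the effect of the transformation-table outside of a test-run, you can apply it to a small hand-written tree with DHParser's traverse-function, roughly like this (a sketch; it assumes that it is run from the project-directory, so that the generated outlineParser-module and its table can be imported):

from DHParser import parse_sxpr, traverse

from outlineParser import outline_AST_transformation_table

# the CST from the report-file above, written down as an S-expression
tree = parse_sxpr('(text (:Text "Text in") (LF "" "") (:Text "two lines"))')
traverse(tree, outline_AST_transformation_table)
print(tree.as_sxpr())   # should now print: (text "Text in two lines")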

We will use the same strategy for the last normalization step, i.e. normalizing sequences of whitespace to single whitespaces. Let’s again draw up a test-case, first:

[match:text]
A2: "Testing  whitespace   normalization"

[AST:text]
A2: "Testing whitespace normalization"

Now, in order to normalize whitespace, we could, just as before, devise a tree-transformation for “L”- or “:L”-nodes. However, another solution consists in combining significant whitespace, strictly defined as a single whitespace-character (0x20), with insignificant whitespace. This is more performant, because the normalization already happens in the parsing-stage (as a by-product of the elimination of insignificant whitespace). We only need to change the definition of “L” from /[ \t]+/ to:

L  = /[ \t]/~

Here, the tilde-character “~” stands for insignificant whitespace. This simple change suffices to normalize (horizontal) whitespace and make the test succeed. Again, coding the AST-test as a simple string suffices, because the actual tree-structure is not involved here. Of course, we could also have written a tree-test, which, since we decided to ensure that text-nodes are always leaf-nodes after the AST-transformation, is quite trivial in this case:

[AST:text]
A2: (text "Testing whitespace normalization")

It is also possible to test the concrete-syntax-tree (CST) in just the same way as the AST. Since the last normalization is performed in the parsing stage, already, it might appear more logical to test the CST rather than the AST. A reason to refrain from CST-tests is that the CST can be awfully verbose. And if the AST-test succeeds one can most of the time assume that the CST has been correct as well.

In this example, however, the CST-test is just as simple as the AST-test. In fact, it differs just by a single letter:

[CST:text]
A2: (text "Testing whitespace normalization")

Testing the processing-pipeline

Also, later stages of the processing pipeline can be tested with the same apparatus as long as their results are either node-trees or serializable as strings. To illustrate both of these cases, let us extend our “outline”-parser so that it transforms the input documents to HTML in two steps.

In the first step the abstract syntax-tree is transformed into a DOM-tree (of sorts). In the second step the DOM-tree is serialized as an HTML-document. With DHParser’s as_xml()-function the second step is almost trivial, but for the sake of illustration we will nevertheless implement it as a separate processing-stage. This also has the benefit that we can test the structure of the DOM-tree independently of the formatting of the final HTML-document.
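
To give an idea of why the second step is almost trivial: the serialization-stage could essentially boil down to something like the following (a sketch under the assumption that the DOM-tree is an ordinary DHParser node-tree with "body" as its root; the function-name and the HTML-scaffold are made up for illustration):

from DHParser import Node

def serialize_html(dom_tree: Node) -> str:
    # wrap the XML-serialization of the "body"-tree in a minimal HTML-scaffold
    return ('<!DOCTYPE html>\n<html>\n'
            '<head><meta charset="utf-8" /></head>\n'
            + dom_tree.as_xml()
            + '\n</html>\n')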

Preparing tests for a processing-stage

In true test-driven-development spirit, we start by looking at the ASTs of a couple of examples and then ask ourselves what the DOM should look like for these examples. We write down the DOM-trees as tests and then start to program the necessary transformations. With the transformations in place, we finally run our tests to see if everything works as expected. So let’s go ahead and write some tests or, what amounts to the same, pick some of the already written tests and look at the resulting ASTs in the report-file. Here are some tests:

[match:emphasis]
D1: "*emphasized*"

[match:blocks]
D1: """First paragraph of text.

    Next paragraph
    of text."""

[match:document]
D1: """# Simple Test

    ## A test of bold- and emphasis-markup

      This paragraph contains *emphasized
    text* that spreads over two lines.

      But what is this: ** *emphasized* and bold**
    or * **bold** and emphasized*?"""

The report files show the following ASTs for these tests:

(emphasis (text "emphasized"))

(:blocks
  (markup (text "First paragraph of text."))
  (markup (text "Next paragraph of text.")))

In case you wonder why there is a colon in front of “blocks”: This is due to the fact that we have added “blocks” to the list of disposable nodes in our grammar earlier. In cases where a disposable node cannot be disposed of - as in this case, where it cannot be flattened, because it is the root-node - it will be marked with a colon just like an “anonymous” node. In the following example the “blocks”-node has been flattened during parsing as expected, leaving the “markup”-nodes as direct children of “section”:

(document
  (main
    (heading "Simple Test")
    (section
      (heading "A test of bold- and emphasis-markup")
      (markup
        (indent "  ")
        (text "This paragraph contains ")
        (emphasis
          (text "emphasized text"))
        (text " that spreads over two lines."))
      (markup
        (indent "  ")
        (text "But what ist this: ")
        (bold
          (emphasis
            (text "emphasized"))
          (text " and bold"))
        (text " or ")
        (emphasis
          (bold
            (text "bold"))
          (text " and emphasized"))
        (text "?")))))

Now, let’s think about what the HTML-equivalents for the node-names and structures in the AST would be. Here is a list:

  1. “text”-nodes should be reduced when they are the single child of some other node, i.e. the text-node will be dissolved and its content will be attached directly to the parent node.

  2. the simplest case is that of the “bold”- and “emphasis”-nodes, which can simply be renamed to [b](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/b) and [i](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/i), respectively. (Alternatively, they could be renamed to [span](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/span) with either the CSS-property “font-weight: bold” or “font-style: italic” in the style-attribute of that node.) Example:

    (bold (text "bold"))  -> (b "bold")
    
  3. “markup”-blocks should be renamed to [p](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/p), which HTML uses for paragraphs. “indent”-nodes, which can occur only as the first child inside a “markup”-node (as can be seen in the grammar), shall be removed after adding the property “text-indent: 2em” to the style-attribute of the parent “markup”-node, e.g.:

    (markup (text "some text"))  -> (p "some text")
    
  4. “heading”-nodes must be renamed to “h1”, “h2”, “h3” … “h6” according to their place in the hierarchy of nested “main”, “section”, “subsection” … “s6section”-elements.

  5. After that, the “main”, “section”, … -elements can either be dissolved (i.e. reduced) or renamed to “[section](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/section)”. The latter yields a semantically more explicit DOM while the former yields a more concise document-tree. We will go for the more concise tree:

    (section
      (heading "Section 1")
      (markup
        (indent "    ")
        (text "A paragraph of text"))
    
    ->
    
    (h1 "Section 1")
    (p `(style "text-indent: 2em;") "A paragraph of text")
    
  6. Finally, the “document”-node can simply be renamed to “[body](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/body)”. Later, when serializing as HTML, we only need to add a header and the enclosing “html”-tags.

Having made up our mind about how to transform the AST into a DOM-tree, we could now transform the AST shown above by hand into a DOM-tree to get our first test-case. However, a much more pragmatic approach is to program the transformation first (or ask an artificial intelligence to do it, following the above recipe), then correct the output - which, on the first try, will most probably still contain errors - by hand, and add it as a test-case.

There are many different ways of programming a tree-transformation. DHParser offers scaffolding-code for several of these, most notably a table-based tree-transformation (see AST-Transformation) and an object-oriented, class-based approach (see Post-Processing), both of which are variants of the well-known visitor-pattern. As of now, DHParser does not offer a pattern-matching approach like XSLT. The following is a solution with the class-based approach; the table-based approach is usually much more concise, but less readable.
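
To give a rough impression of the table-based alternative, the handling of the inline-elements described further below might look roughly like the following sketch. This is only a hedged illustration: the rename_to()-helper is a hypothetical convenience function, and the sketch glosses over details such as only reducing single “text”-children.

from DHParser.transform import traverse, reduce_single_child

def rename_to(name):
    # hypothetical helper: renames the node at the end of the current path
    def rename(path):
        path[-1].name = name
    return rename

# keys are node-names; values are lists of transformations that traverse()
# applies to every matching node while walking the tree depth-first
DOM_transformation_table = {
    "bold":     [reduce_single_child, rename_to("b")],
    "emphasis": [reduce_single_child, rename_to("i")],
}

def table_based_DOM_transform(root):
    traverse(root, DOM_transformation_table)  # transforms the tree in place
    return root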

In the (autogenerated) “outlineParser.py”-file there is already an empty outlineCompiler-class in the “COMPILE SECTION”, which can be filled in and renamed as follows. Let’s set out with the glue-code:

class DOMCompiler(Compiler):
    def prepare(self, root: RootNode) -> None:
        assert root.stage == "AST", f"Source stage `AST` expected, `but `{root.stage}`found."
        root.stage = "DOM"

In the prepare-method, the destination-stage needs to be set so that the processing-pipeline can keep track of the progress. This is essential for the processing-pipeline to work. The check of the source-stage with the assert-statement is not necessary, but it helps to locate potential programming errors in case something goes wrong.

Note

We use the term “stage” in two different meanings (which, however, are easily distinguished by the context) to denote either:

  1. the data-stage, which denotes the state the data is in at a specific point in the processing pipeline, e.g. source document, preprocessed source document, AST, CST, etc.

  2. the “processing-” or “transformation-stage”, which denotes a transformation-function from one data-stage to the next.

In the source code, the name stage is always used for the data-stage, while the transformation stage is either a Junction-tuple or the transformation-callable that is returned by the factory-field of that tuple.

Next, we define the finalize-method. This might be surprising at first, because after the AST has been run through all visitor-methods, the tree should already consist only of HTML-nodes. This, however, is only true if we feed complete documents to our processing-pipeline. During unit-testing, though, we mostly feed mere snippets of documents to the pipeline. Thus, we cannot assume that the visitor-method for the root-node of complete documents, “on_document()”, or any other particular visitor-method will be called in all circumstances.

Generally speaking, all transformations in the processing-pipeline should be designed in such a way that they work not only on entire documents but also on parts of documents. Usually, this requires only a little extra effort. In cases where this is not possible, the only alternative left is to write mockups for each unit-test that represent complete documents. This is much more cumbersome and not well supported by DHParser’s testing-framework, which groups the tests by the symbol-names of the parts of the grammar that shall be tested.

In our case, it could happen that the root-node does not carry a valid HTML-tag-name after compiling an AST that does not represent an entire document. We can use the finalize-method to rename the root-node (whatever that may be) to “div” in cases where its name is not that of an HTML-tag. While it would not do much harm to leave it as it is (HTML5 allows custom tag-names and most internet browsers are pretty tolerant even towards invalid tag-names), it can be confusing to get test-outputs that look like mistakes. Also, since the results will be passed on to a further processing-stage (HTML-serialization), it is better to avoid testing-artifacts at this stage.

In order to fix left-behind node-names, we could either check the root-node’s name against a list of valid tag-names or against a list of potentially left-behind tag-names from the AST. In our case it is easier to pursue the latter strategy, because only “container”-nodes, i.e. nodes that can contain more than one child and at the same time are meant to be flattened if possible, can become “left behind”. (In case you think this would be hard to know or analyze beforehand: Don’t worry! You can start to provide for these cases when they occur, and you can even confine yourself to those cases that come up in your tests - because, as said, this is a harmless problem, if it is a problem at all.) So, let’s just rename all of these node-names to “div” if they appear as the root-name:

def finalize(self, result: Any) -> Any:
    if result.name in ('main', 'section', 'subsection', 'subsubsection',
                       's5section', 's6section', 'blocks', ':blocks'):
        result.name = 'div'
    return result

The following visitor-method for the document-node is self-explanatory. As described in Post-Processing, visitor-methods are called by the scaffolding-code of Compiler when the tree-traversal reaches a node whose name corresponds to the name of the visitor-method. The scaffolding-code also updates the self.path-variable (which we will make use of further below). The traversal of the child-nodes must be triggered explicitly by the visitor-method by calling fallback_compiler(), which is usually done at the beginning of the visitor-method. Every visitor-method is required to return the compilation-result:

def on_document(self, node):
    node = self.fallback_compiler(node)
    node.name = "body"
    return node

Since the transformation of the structural components (i.e. sections, subsections etc.) is very similar for each of them, we factor out the similarities into a helper-method that is called by the visitor-methods for the structural components:

def compile_structure(self, node, heading_name):
    node = self.fallback_compiler(node)
    node['heading'].name = heading_name
    replace_by_children(self.path)
    return node

def on_main(self, node):
    return self.compile_structure(node, "h1")

def on_section(self, node):
    return self.compile_structure(node, "h2")

def on_subsection(self, node):
    return self.compile_structure(node, "h3")

def on_subsubsection(self, node):
    return self.compile_structure(node, "h4")

def on_s5section(self, node):
    return self.compile_structure(node, "h5")

def on_s6section(self, node):
    return self.compile_structure(node, "h6")

Finally, we provide visitor-methods for the paragraph- and inline-elements. For the actual transformation-work, we are, of course, free to delegate to the transformation-functions of the transform-module, like replace_by_children() or reduce_single_child():

def on_markup(self, node):
    node = self.fallback_compiler(node)
    if node[0].name == 'indent':
        node.attr['style'] = "text-indent: 2em;"
        del node[0]
    if len(node.children) == 1 and node[0].name == 'text':
        reduce_single_child(self.path)
    node.name = "p"
    return node

def on_bold(self, node):
    node = self.fallback_compiler(node)
    if len(node.children) == 1 and node[0].name == 'text':
        reduce_single_child(self.path)
    node.name = "b"
    return node

def on_emphasis(self, node):
    node = self.fallback_compiler(node)
    if len(node.children) == 1 and node[0].name == 'text':
        reduce_single_child(self.path)
    node.name = "i"
    return node

Before the test-reports yield the results of this processing-stage for the match-tests defined in the test-files, the junction for this stage needs to be declared:

compiling: Junction = create_junction(DOMCompiler, "AST", "DOM")

and the junction must be added to the set of junctions which is defined a bit further below in the Processing-Pipeline-section:

junctions: Set[Junction] = {ASTTransformation, compiling}

By default, all destinations of all available junctions will be written to the test-report. If this is not desired, the variable “test_targets” in the same section must be supplied with a set of names of those data-stages that shall be reported.
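
For example, to restrict the reports to the AST- and DOM-stages, one might write something like the following (a sketch; it merely assumes that “test_targets” takes a plain set of stage-names, as described above):

# report only the results of the AST- and DOM-stages in the test-reports
test_targets: set = {'AST', 'DOM'}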

When we run the test-script (“tst_outline_grammar.py”) again, the results will not only report the AST but also the DOM-stage, e.g.:

Test of parser: "emphasis"
==========================

Match-test "M1"
----------------

### Test-code:
    *emphasized*
### AST
    (emphasis (text "emphasized"))
### DOM
    (i "emphasized")

By default, stages whose results are trees are serialized as S-expressions. This can be changed by adjusting the global “serializations”-variable in the …Parser.py-script:

serializations: dict[str, list[str]] = {'DOM': ['XML'], '*': ['sxpr']}

“serializations” is a dictionary where the keys are the names of data-stages and the values are lists of serialization-types, which can be any of “sxpr” (S-expression), “SXML” (S-expression conforming to the SXML-specification), “XML” or “tree” (a simple indented tree). The serialization of test-results is determined by the first item of the value-list. The other items are only used when serializing the output of the …Parser.py-script. The asterisk “*” serves as a wildcard, which means that any stage that has not been explicitly specified with a key in the serializations-dictionary will use the serialization(s) associated with the asterisk.

The output for a given stage will be serialized in all formats of the list associated with this stage. Because the same value will often be associated with several keys, it can be helpful to use a “multi-keyword table” of the kind that is also used for specifying AST-transformations. See expand_table().
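
For instance, the “serializations”-table above could be written with a multi-keyword roughly as follows (a sketch; it assumes that expand_table() can be imported from DHParser’s transform-module - adjust the import if it resides elsewhere):

from DHParser.transform import expand_table

# 'CST' and 'AST' share the same serialization-list; expand_table() splits
# the comma-separated key into one entry per name
serializations = expand_table({'CST, AST': ['sxpr'],
                               'DOM': ['XML'],
                               '*': ['sxpr']})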

Writing tests for a processing-stage

Once the junctions for the steps beyond the AST-transformation have been specified, writing tests for data-stages beyond the AST is as simple as writing an AST-test. If the data at the respective stage is tree-data, you can write down the expected tree as an XML-snippet or S-expression. In case the data is of any other type, you write down the expected string that results from applying “str()” to the test-data at that stage.

Since the DOM-tree is still tree-data, we specify the expected result in a tree-serialization-format, say, as an S-expression. As mentioned earlier, it is a pragmatic approach to run the test first, then copy and paste the result for the DOM-tree from the report-file and adjust it by hand so that it reflects what we want the result to look like at this stage (even if it does not yet - but that is what test-driven development is about).

Note that the same symbol-names from the grammar are used to specify tests for later processing-stages. It does not matter if the name of the root-node has been changed, as in this case, where the “document”-root-node is renamed to “body”. The test must nevertheless be specified as “[DOM:document]”, because it takes the data that has been passed on from the [match:document]-test.

This is what the DOM-tests look like in the test-input-file (“.ini”-file):

[match:emphasis]
D1: "*emphasized*"

[match:blocks]
D1: """First paragraph of text.

Next paragraph
of text."""

[match:document]
D1: M4: """# Simple Test

    ## A test of bold- and emphasis-markup

      This paragraph contains *emphasized
    text* that spreads over two lines.

      But what ist this: ** *emphasized* and bold**
    or * **bold** and emphasized*?"""

...

[DOM:emphasis]
D1: (i "emphasized")

[DOM:blocks]
D1: (div (p "First paragraph of text.") (p "Next paragraph of text."))

[DOM:document]  # element-name reflects the original grammar-symbol!
D1: (body
    (h1 "Simple Test")
    (h2 "A test of bold- and emphasis-markup")
    (p `(style "text-indent: 2em;")
      (text "This paragraph contains ")
      (i "emphasized text")
      (text " that spreads over two lines."))
    (p `(style "text-indent: 2em;")
      (text "But what ist this: ")
      (b
        (i "emphasized")
        (text " and bold"))
      (text " or ")
      (i
        (b "bold")
        (text " and emphasized"))
      (text "?")))

The next time we run the test, the DOM-trees that are generated during the test will be matched against the trees specified under the “[DOM:…]”-headings of the test-file.

Writing tests for the last stage, the HTML-output, is just the same. Again, for the sake of simplicity, we will start with the transformation. The transformation from the DOM-tree to actual HTML is pretty simple, because we only need to add an HTML-header and serialize the DOM-tree as XML. For the header, we can simply write a string-template:

HTML_TMPL = """<!DOCTYPE html>
<html lang="en-GB">
<head>
    <title>{title}</title>
    <meta charset="utf8"/>
</head>
{body}
</html>
"""

class HTMLSerializer(Compiler):
    def prepare(self, root: RootNode) -> None:
        assert root.stage == "DOM", f"Source stage `DOM` expected, `but `{root.stage}` found."
        root.stage = "html"
        h1 = root.pick('h1')
        self.title = h1.content if h1 else ''

    def on_body(self, node: Node) -> str:
        body = node.as_xml(string_tags={'text'})
        return HTML_TMPL.format(title=self.title, body=body)

    def wildcard(self, node: Node) -> str:
        return node.as_xml(string_tags={'text'})

serializing: Junction = create_junction(HTMLSerializer, "DOM", "html")

Again, don’t forget to add the “serializing”-junction to the set of junctions that define the processing-pipeline:

junctions: Set[Junction] = {ASTTransformation, compiling, serializing}

The HTML-serializer above is pretty straightforward. Almost all of the work is done by as_xml(), called in the wildcard-method. Note that the wildcard-method does not even descend into the tree; it just returns the XML-serialized tree for whatever element is the root-element of the DOM-tree received from the last stage.

We add the HTML-header in the visitor-method of the root-element (body) and neither in the wildcard-method for serializing all other nodes nor in the “finalize”-method, which is not even present here. This has the effect that the header is only added in tests of entire documents, but not in tests of single elements, where it would be cumbersome.

Unlike the AST->DOM-transformation, we return the HTML-document (or the HTML-snippets from tests of document-parts) as strings. Therefore, we must specify the test-code for the html-stage in the form of strings as well. This means that the formatting of the html-test-code must be exactly the same as that returned by the DOM->html-transformation, with the same line-breaks, indentation and all:

[html:emphasis]
D1: "<i>emphasized</i>"

[html:blocks]
D1: """<div>
      <p>First paragraph of text.</p>
      <p>Next paragraph of text.</p>
    </div>"""


[html:document]
D1: """<!DOCTYPE html>
    <html lang="en-GB">
    <head>
        <title>Simple Test</title>
        <meta charset="utf8"/>
    </head>
    <body>
      <h1>Simple Test</h1>
      <h2>A test of bold- and emphasis-markup</h2>
      <p style="text-indent: 2em;">
        This paragraph contains
        <i>emphasized text</i>
        that spreads over two lines.
      </p>
      <p style="text-indent: 2em;">
        But what ist this:
        <b>
          <i>emphasized</i>
          and bold
        </b>
        or
        <i>
          <b>bold</b>
          and emphasized
        </i>
        ?
      </p>
    </body>
    </html>"""

Conventional Unit-Testing

Although DHParser’s testing-framework is by now quite comprehensive, there are situations where you might want to resort to conventional unit-testing:

  1. Testing for the occurrence of particular error-codes or error-messages. As of version 1.7, DHParser’s testing-framework has no support for testing errors. This may change with future versions.

  2. Testing of data-stages in which the data is no longer compositional in the sense that you could test isolated pieces of data (without context) and still receive meaningful results.

    This assumption, which holds for the parsing of context-free grammars (save for the use of lookahead- and lookbehind-parsers, for which DHParser’s testing-framework offers some workarounds), might not hold for later processing-stages any more. The assumption of context-independence is baked into DHParser’s testing-framework and will therefore not change in the future.

Therefore, we will illustrate how conventional unit-testing works. Let’s assume that we want to test error-localization (a common pain-point of parser-building) and error-reporting in case something goes wrong. The glue code for a conventional unit-test looks like this:
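
(The following is only a rough sketch under a few assumptions: it presumes that the generated outlineParser.py-script exposes a compile_src()-function that returns the compilation-result together with a list of errors, and that error-objects carry a line-attribute.)

import unittest

from outlineParser import compile_src   # assumed to exist in the generated script


class TestErrorReporting(unittest.TestCase):
    def test_badly_nested_section(self):
        source = ('# Main Heading\n'
                  '## Section 1\n'
                  '#### SubSubSection 1.1.1\n')   # badly nested!
        result, errors = compile_src(source)
        # at least one error should be reported for the badly nested section ...
        self.assertTrue(errors)
        # ... and its location should point into the offending region
        self.assertTrue(any(getattr(err, 'line', -1) >= 3 for err in errors))


if __name__ == '__main__':
    unittest.main()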

TO BE CONTINUED...