3.  The Document Processing Model

3.1. Document Components
3.2. Line Breaks and Indentation
3.3. Text Handling

XML documents consist primarily of elements arranged in nested fashion. Elements may also contain text. xmlformat acts to rearrange elements by removing or adding line breaks and indentation, and to reformat text.

3.1.  Document Components

XML elements within input documents may be of three types:

  • block elements

    This is the default element type. The DocBook <chapter>, <sect1>, and <para> elements are examples of block elements.

    Typically a block element will begin a new line. (That is the default formatting behavior, although xmlformat allows you to override it.)

    Spacing between sub-elements can be controlled, and sub-elements can be indented. Whitespace in block element text may be normalized. If normalization is in effect, line-wrapping may be applied as well. Normalization and line-wrapping may be appropriate for a block element with mixed content (such as <para>).

  • inline elements

    These are elements that are contained within a block or within other inlines. The DocBook <emphasis> and <literal> elements are examples of inline elements.

    Normalization and line-wrapping of inline element tags and content are handled the same way as for the enclosing block element. In essence, an inline element is treated as part of its parent's "text" content.

  • verbatim elements

    No formatting is done for verbatim elements. The DocBook <programlisting> and <screen> elements are examples of verbatim elements.

    Verbatim element content is written out exactly as it appears in the input document. This also applies to child elements. Any formatting that would otherwise be performed on them is suppressed when they occur within a verbatim element.

xmlformat never reformats element tags. In particular, it does not change whitespace between attributes or within attribute values. This is true even for inline tags within line-wrapped block elements.

xmlformat handles empty elements as follows:

  • If an element appears as <abc/> in the input document, it is written as <abc/>.

  • If an element appears as <abc></abc>, it is written as <abc></abc>. No line break is placed between the two tags.

XML documents may contain other constructs besides elements and text:

  • Processing instructions

  • Comments

  • DOCTYPE declaration

  • CDATA sections

xmlformat handles these constructs much the same way as verbatim elements. It does not reformat them.

3.2.  Line Breaks and Indentation

Line breaks within block elements are controlled by the entry-break, element-break, and exit-break formatting options. A break value of n means n newlines. (This produces n-1 blank lines.)

Example. Suppose input text looks like this:

<elt>
<subelt/> <subelt/> <subelt/>
</elt>

Here, an <elt> element contains three nested <subelt> elements, which for simplicity are empty.

This input can be formatted several ways, depending on the configuration options. The following examples show how to do this.

  1. To produce output with all sub-elements on the same line as the <elt> element, add a section to the configuration file that defines <elt> as a block element and sets all its break values to 0:

    elt
      format          block
      entry-break     0
      exit-break      0
      element-break   0
    

    Result:

    <elt><subelt/><subelt/><subelt/></elt>
    
  2. To leave the sub-elements together on the same line, but on a separate line between the <elt> tags, leave the element-break value set to 0, but set the entry-break and exit-break values to 1. To suppress sub-element indentation, set subindent to 0.

    elt
      format          block
      entry-break     1
      exit-break      1
      element-break   0
      subindent       0
    

    Result:

    <elt>
    <subelt/><subelt/><subelt/>
    </elt>
    
  3. To indent the sub-elements, make the subindent value greater than zero.

    elt
      format          block
      entry-break     1
      exit-break      1
      element-break   0
      subindent       2
    

    Result:

    <elt>
      <subelt/><subelt/><subelt/>
    </elt>
    
  4. To cause each sub-element to begin a new line, change the element-break to 1.

    elt
      format          block
      entry-break     1
      exit-break      1
      element-break   1
      subindent       2
    

    Result:

    <elt>
      <subelt/>
      <subelt/>
      <subelt/>
    </elt>
    
  5. To add a blank line between sub-elements, increase the element-break from 1 to 2.

    elt
      format          block
      entry-break     1
      exit-break      1
      element-break   2
      subindent       2
    

    Result:

    <elt>
      <subelt/>
    
      <subelt/>
    
      <subelt/>
    </elt>
    
  6. To also produce a blank line after the <elt> opening tag and before the closing tag, increase the entry-break and exit-break values from 1 to 2.

    elt
      format          block
      entry-break     2
      exit-break      2
      element-break   2
      subindent       2
    

    Result:

    <elt>
    
      <subelt/>
    
      <subelt/>
    
      <subelt/>
    
    </elt>
    
  7. To have blank lines only after the opening tag and before the closing tag, but not have blank lines between the sub-elements, decrease the element-break from 2 to 1.

    elt
      format          block
      entry-break     2
      exit-break      2
      element-break   1
      subindent       2
    

    Result:

    <elt>
    
      <subelt/>
      <subelt/>
      <subelt/>
    
    </elt>
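
The break and indent arithmetic used in the preceding examples can be sketched in a few lines of Python. This is only an illustrative model of the behavior described above, not xmlformat's own code; the function name format_block and its parameters are hypothetical and simply mirror the entry-break, element-break, exit-break, and subindent options. It assumes the sub-elements are already serialized and that the parent tag itself is not indented.

```python
def format_block(tag, children, entry_break, element_break, exit_break, subindent):
    """Lay out already-serialized child tags inside a <tag> block element.

    Hypothetical model: a break value of n emits n newlines, which
    leaves n-1 blank lines between the two items it separates.
    """
    if not children:
        # An element written as <abc></abc> gets no break between its tags.
        return "<" + tag + "></" + tag + ">"
    indent = " " * subindent
    parts = ["<" + tag + ">"]
    for position, child in enumerate(children):
        # The first child follows the opening tag (entry-break);
        # every later child follows a sibling (element-break).
        breaks = entry_break if position == 0 else element_break
        parts.append("\n" * breaks)
        if breaks > 0:
            parts.append(indent)  # indent only where a new line starts
        parts.append(child)
    parts.append("\n" * exit_break)
    parts.append("</" + tag + ">")
    return "".join(parts)


subelts = ["<subelt/>"] * 3
print(format_block("elt", subelts, 0, 0, 0, 0))  # layout of example 1
print(format_block("elt", subelts, 2, 1, 2, 2))  # layout of example 7
```

Calling the function with the option values from any of the seven examples should reproduce the corresponding result shown above.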
    

Breaks within block elements are suppressed in certain cases:

  • Breaks apply to nested block or verbatim elements, but not to inline elements, which are, after all, inline. (If you really want an inline to begin a new line, define it as a block element.)

  • Breaks are not applied to text within non-normalized blocks. Non-normalized text should not be changed, and adding line breaks changes the text.

    For example, if <x> elements are normalized, you might elect to format this:

    <x>This is a sentence.</x>
    

    Like this:

    <x>
    This is a sentence.
    </x>
    

    Here, breaks are added before and after the text to place it on a separate line. But if <x> is not normalized, the text content will be written as it appears in the input, to avoid changing it.

3.3.  Text Handling

The XML standard considers whitespace nodes insignificant in elements that contain only other elements. In other words, for elements that have element content, sub-elements may optionally be separated by whitespace, but that whitespace is insignificant and may be ignored.

An element that has mixed content may have text (#PCDATA) content, optionally interspersed with sub-elements. In this case, whitespace-only nodes may be significant.

xmlformat treats only literal whitespace as whitespace. This includes the space, tab, newline (linefeed), and carriage return characters. xmlformat does not resolve entity references, so entities such as &#32; or &#x20; that represent whitespace characters are seen as non-whitespace text, not as whitespace.
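
The distinction between literal whitespace and whitespace entities can be illustrated with a small Python check. This is a sketch rather than xmlformat's implementation: the names LITERAL_WHITESPACE and is_all_whitespace are hypothetical, and the sketch assumes the four characters listed above are the complete set.

```python
# Assumption: exactly these four literal characters count as whitespace.
LITERAL_WHITESPACE = " \t\n\r"


def is_all_whitespace(text):
    """Return True if text consists solely of literal whitespace characters."""
    # Entity references are not resolved, so "&#32;" is five ordinary
    # characters here and never counts as whitespace.
    return text.strip(LITERAL_WHITESPACE) == ""


print(is_all_whitespace(" \t\r\n"))   # literal whitespace only
print(is_all_whitespace("&#32;"))     # unresolved entity: ordinary text
```

The first call prints True and the second prints False, because the entity reference is compared character by character and is never expanded.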

xmlformat doesn't know whether a block element has element content or mixed content. It handles text content as follows:

  • If an element has element content, it will have only sub-elements and possibly all-whitespace text nodes. In this case, it is assumed that you'll want to control line-break behavior between sub-elements, so that the (all-whitespace) text nodes can be discarded and replaced with the proper number of newlines, and possibly indentation.

  • If an element has mixed content, you may want to leave text nodes alone, or you may want to normalize (and possibly line-wrap) them. In xmlformat, normalization converts runs of whitespace characters to single spaces, and discards leading and trailing whitespace.

To achieve this kind of formatting, xmlformat recognizes normalize and wrap-length configuration options for block elements. They affect text formatting as follows:

  • You can enable or disable text normalization by setting the normalize option to yes or no.

  • Within a normalized block, runs of whitespace are converted to single spaces. Leading and trailing whitespace is discarded. Line-wrapping and indenting may be applied.

  • In a non-normalized block, text nodes are not changed as long as they contain any non-whitespace characters. No line-wrapping or indenting is applied. However, if a text node contains only whitespace (for example, a space or newline between sub-elements), it is assumed to be insignificant and is discarded. It may be replaced by line breaks and indentation when output formatting occurs.
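
The rules in the preceding list can be summarized as a decision made for each text node. The following Python sketch shows one way to express that decision; the function handle_text_node is hypothetical, it looks at a single text node in isolation, and it ignores inline sub-elements and line-wrapping.

```python
import re

# Runs of literal whitespace (space, tab, newline, carriage return).
LITERAL_WS_RUN = re.compile(r"[ \t\n\r]+")


def handle_text_node(text, normalize):
    """Return the text to keep for one text node, or None if nothing is kept."""
    if text.strip(" \t\n\r") == "":
        # All-whitespace node: discarded in a non-normalized block, and
        # reduced to nothing by normalization in a normalized one.
        return None
    if not normalize:
        # Non-normalized block: text with any real content is left as is.
        return text
    # Normalized block: collapse whitespace runs, trim both ends.
    return LITERAL_WS_RUN.sub(" ", text).strip(" ")


print(repr(handle_text_node(" \n ", normalize=False)))
print(repr(handle_text_node(" A ", normalize=False)))
print(repr(handle_text_node(" A \n  B ", normalize=True)))
```

The three calls print None, ' A ', and 'A B' respectively: the whitespace-only node is dropped, the non-normalized text is untouched, and the normalized text has its whitespace collapsed and trimmed.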

Consider the following input:

<row> <cell> A </cell> <cell> B </cell> </row>

Suppose that the <row> and <cell> elements both are to be treated as non-normalized. The contents of the <cell> elements are text nodes that contain non-whitespace characters, so they would not be reformatted. On the other hand, the spaces between tags are all-whitespace text nodes and are not significant. This means that you could reformat the input like this:

<row><cell> A </cell><cell> B </cell></row>

Or like this:

<row>
<cell> A </cell><cell> B </cell>
</row>

Or like this:

<row>
  <cell> A </cell>
  <cell> B </cell>
</row>

In each of those cases, the whitespace between tags was subject to reformatting, but the text content of the <cell> elements was not.

The input would not be formatted like this:

<row><cell>A</cell><cell>B</cell></row>

Or like this:

<row>
  <cell>
    A
  </cell>
  <cell>
    B
  </cell>
</row>

In both of those cases, the text content of the <cell> elements has been modified, which is not allowed within non-normalized blocks. You would have to declare <cell> to have a normalize value of yes to achieve either of those output styles.
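
To make the <row> example concrete, the following Python sketch removes the all-whitespace text nodes from the input while leaving the <cell> text untouched. It is an illustration only: the function drop_whitespace_nodes is hypothetical, and the tag pattern is deliberately naive (it does not handle comments, CDATA sections, or a ">" inside an attribute value).

```python
import re

# Naive tag matcher, good enough for the simple fragment used here.
TAG = re.compile(r"(<[^>]+>)")
LITERAL_WHITESPACE = " \t\n\r"


def drop_whitespace_nodes(fragment):
    """Remove text nodes that contain only literal whitespace."""
    # re.split with a capturing group keeps the tags in the result list.
    pieces = [piece for piece in TAG.split(fragment) if piece != ""]
    kept = [
        piece
        for piece in pieces
        if piece.startswith("<") or piece.strip(LITERAL_WHITESPACE) != ""
    ]
    return "".join(kept)


source = "<row> <cell> A </cell> <cell> B </cell> </row>"
print(drop_whitespace_nodes(source))
```

The output is <row><cell> A </cell><cell> B </cell></row>: the spaces between tags are gone, but the spaces around A and B remain because those text nodes contain non-whitespace characters.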

Now consider the following input:

<para> This is a        sentence. </para>

Suppose that <para> is to be treated as a normalized element. It could be reformatted like this:

<para>This is a sentence.</para>

Or like this:

<para>
This is a sentence.
</para>

Or like this:

<para>
  This is a sentence.
</para>

Or even (with line-wrapping) like this:

<para>
  This is a
  sentence.
</para>
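
The normalized <para> layouts above can be approximated with Python's standard textwrap module. This is a sketch under stated assumptions, not a description of how xmlformat wraps text: the function format_normalized_block and its parameters are hypothetical, and the sketch applies the wrap width to the text alone, without counting the indentation.

```python
import re
import textwrap


def format_normalized_block(tag, text, wrap_width, subindent):
    """Normalize text, wrap it, and place it indented between the tags."""
    # Normalization: collapse literal whitespace runs, trim both ends.
    normalized = re.sub(r"[ \t\n\r]+", " ", text).strip(" ")
    # Assumption: the width limits the text only; the indent is added after.
    lines = textwrap.wrap(
        normalized,
        width=wrap_width,
        break_long_words=False,
        break_on_hyphens=False,
    )
    indent = " " * subindent
    body = "\n".join(indent + line for line in lines)
    return "<{0}>\n{1}\n</{0}>".format(tag, body)


print(format_normalized_block("para", " This is a        sentence. ", 10, 2))
```

With a width of 10 the text is split after "This is a", which matches the line-wrapped layout shown last; a larger width, such as 70, keeps the sentence on a single indented line.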

The preceding description of normalization is a bit oversimplified. Normalization is complicated by the possibility that non-normalized elements may occur as sub-elements of a normalized block. In the following example, a verbatim block occurs in the middle of a normalized block:

<para>This is a paragraph that contains
<programlisting>
a code listing
</programlisting>
in the middle.
</para>

In general, when this occurs, any whitespace in text nodes adjacent to non-reformatted nodes is discarded.

There is no "preserve all whitespace as is" mode for block elements. Even if normalization is disabled for a block, any all-whitespace text nodes are considered dispensable. If you really want all text within an element to be preserved intact, you should declare it as a verbatim element. (Within verbatim elements, nothing is ever reformatted, so whitespace is significant as a result.)

If you want to see how xmlformat handles whitespace nodes and text normalization, invoke it with the --canonized-output option. This option causes xmlformat to display the document after it has been canonized by removing whitespace nodes and performing text normalization, but before it has been reformatted in final form. By examining the canonized document, you can see what effect your configuration options have on treatment of the document before line-wrapping and indentation are performed and line breaks are added.