5.  How xmlformat Works

Briefly, xmlformat processes an XML document using the following steps:

  1. Read the document into memory as a single string.

  2. Parse the document into a list of tokens.

  3. Convert the list of tokens into nodes in a tree structure, tagging each node according to the token type.

  4. Discard extraneous whitespace nodes and normalize text nodes. (The meaning of "normalize" is described in Section 3.3, “Text Handling”.)

  5. Process the tree to produce a single string representing the reformatted document.

  6. Print the string.

xmlformat is not an XSLT processor. In essence, all it does is add or delete whitespace to control line breaking, indentation, and text normalization.

xmlformat uses the REX parser developed by Robert D. Cameron (see Section 7, “References”). REX performs a parse based on a regular expression that operates on a string representing the XML document. The parse produces a list of tokens. REX does a pure lexical scan that performs no alteration of the text except to tokenize it. In particular:

xmlformat expects its input documents to be legal XML. It does not consider fixing broken documents to be its job, so if xmlformat finds error tokens in the result produced by REX, it lists them and exits.
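
The kind of regex-driven lexical scan described above can be sketched as follows. This is not REX: the pattern is a much smaller, hypothetical stand-in that ignores DOCTYPE internal subsets and ">" characters inside attribute values, and the names TOKEN_RE, tokenize, and error_tokens are invented for illustration. It shows only the two properties that matter here: the text is split but never altered, and anything that cannot be recognized as markup surfaces as an error token.

```python
import re

# Simplified stand-in for REX (an assumption, not the real pattern).
# Markup alternatives come first, then runs of text; a "<" that starts
# no recognizable markup falls through to the last alternative.
TOKEN_RE = re.compile(
    r"<!--.*?-->"            # comment
    r"|<\?.*?\?>"            # processing instruction
    r"|<!\[CDATA\[.*?\]\]>"  # CDATA section
    r"|<!DOCTYPE[^>]*>"      # DOCTYPE declaration without internal subset
    r"|</?[^<>]+>"           # element opening or closing tag
    r"|[^<]+"                # text, with whitespace kept exactly as found
    r"|<",                   # stray "<": treated as an error token
    re.DOTALL,
)


def tokenize(doc):
    """Split doc into tokens without changing any characters."""
    return TOKEN_RE.findall(doc)


def error_tokens(tokens):
    """Return the tokens this sketch considers errors."""
    return [tok for tok in tokens if tok == "<"]
```

Because every character of the input lands in exactly one token, joining the token list reproduces the original string, which is what makes a pure lexical scan safe as the first stage of a formatter.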

Assuming the document contains no error tokens, xmlformat uses the token list to construct a tree structure. It categorizes each token based on its initial characters:

Initial Characters    Token Type
<?                    processing instruction (this includes the <?xml?> instruction)
<!DOCTYPE             DOCTYPE declaration
<![                   CDATA section
</                    element closing tag
<                     element opening tag

Any token not beginning with one of the sequences shown in the preceding table is a text token.
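
Categorizing by initial characters amounts to an ordered prefix test, sketched below. The names PREFIXES and token_type are hypothetical, not taken from xmlformat. The "<!--" entry for comments is also an assumption: it is included only because comment nodes appear among the node types, and "<!--" is the standard XML comment opener. Order matters, because the bare "<" prefix would otherwise match every markup token.

```python
# Ordered (prefix, token type) pairs; longer prefixes must precede "<".
PREFIXES = [
    ("<!--", "comment"),  # assumption: standard XML comment opener
    ("<?", "processing instruction"),
    ("<!DOCTYPE", "DOCTYPE declaration"),
    ("<![", "CDATA section"),
    ("</", "element closing tag"),
    ("<", "element opening tag"),
]


def token_type(token):
    """Return the type of a token based on its initial characters."""
    for prefix, kind in PREFIXES:
        if token.startswith(prefix):
            return kind
    # Anything that matches no prefix is a text token.
    return "text"
```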

The token categorization determines the node types of nodes in the document tree. Each node has a label that identifies the node type:

Label      Node Type
comment    comment node
pi         processing instruction node
DOCTYPE    DOCTYPE declaration node
CDATA      CDATA section node
elt        element node
text       text node

If the document is not well-formed, tree construction will fail. In this case, xmlformat displays one or more error messages and exits. For example, this document is not well-formed, because the <strong> element is never closed:

<p>This is a <strong>malformed document.</p>

Running that document through xmlformat produces the following result:

MISMATCH open (strong), close (p); malformed document?
Non-empty tag stack; malformed document?
Non-empty children stack; malformed document?
Cannot continue.

That is admittedly cryptic, but remember that it's not xmlformat's job to repair (or even diagnose) bad XML. If a document is not well-formed, you may find Tidy a useful tool for fixing it up.
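
The MISMATCH message above is what a stack-based tree builder naturally reports. The following is a minimal sketch of that general technique, not xmlformat's actual code: build_tree, tag_name, and MalformedDocument are hypothetical names, nodes are plain dictionaries, and for brevity the input is assumed to contain only element tags and text tokens.

```python
import re


class MalformedDocument(Exception):
    """Raised when opening and closing tags do not pair up."""


def tag_name(tag):
    """Extract the element name from an opening or closing tag."""
    return re.match(r"</?\s*([^\s/>]+)", tag).group(1)


def build_tree(tokens):
    """Build a list of top-level nodes from element and text tokens."""
    root = {"type": "elt", "name": None, "children": []}
    stack = [root]  # elements that are currently open
    for tok in tokens:
        if tok.startswith("</"):
            name = tag_name(tok)
            open_name = stack[-1]["name"]
            if len(stack) == 1 or open_name != name:
                raise MalformedDocument(
                    f"mismatch: open ({open_name}), close ({name})"
                )
            stack.pop()
        elif tok.startswith("<"):
            # The opening tag is stored intact, attributes included.
            node = {"type": "elt", "name": tag_name(tok),
                    "open_tag": tok, "children": []}
            stack[-1]["children"].append(node)
            if not tok.endswith("/>"):  # empty elements have no children
                stack.append(node)
        else:
            stack[-1]["children"].append({"type": "text", "content": tok})
    if len(stack) != 1:
        raise MalformedDocument("non-empty tag stack")
    return root["children"]
```

Fed the tokens of the malformed example, the sketch fails at </p> because the element on top of the stack is strong. It stops at the first mismatch rather than listing further stack problems as xmlformat does.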

Tokens of each type except element tokens correspond to single distinct nodes in the document. Elements are more complex. They may consist of multiple tokens, and may contain children:

Element opening tag tokens include any attributes that are present, because xmlformat performs no tag reformatting. Tags are preserved intact in the output, including any whitespace between attributes or within attribute values.

In addition to the type value that labels a node as a given node type, each node has content:

After constructing the node tree, xmlformat performs two operations on it:

Here's an example input document, representing a single-row table:


After reading this in and constructing the tree, the canonized output looks like this:


The output after applying the default formatting options looks like this: