Thursday, April 10, 2008

Best way to create an XML document

I posted the following question on the xml-dev mailing list about this topic:

Let's suppose I need to create an XML document from scratch in a Java program.

What's the best way to do this?

I have seen that a quick way to do this is to write the XML string by hand.

For example:

String xml_str = "<x><y><z/></y></x>";

I want to understand the pros and cons of this approach.
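One drawback of the hand-written string is easy to show. The following is only a minimal sketch: the class `HandBuiltXml` and its methods are hypothetical names of mine, not something from the thread. It wraps a value in markup by plain concatenation and then asks a standard JAXP parser whether the result is well-formed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
class HandBuiltXml {

    // Builds the document by plain string concatenation; nothing is escaped.
    static String wrap(String value) {
        return "<x><y>" + value + "</y></x>";
    }

    // Returns true if the string parses as well-formed XML.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the input
        } catch (Exception e) {
            throw new IllegalStateException(e); // configuration or I/O problem
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed(wrap("plain text")));
        System.out.println(isWellFormed(wrap("a & b"))); // bare '&' is not allowed
    }
}
```

The string approach is fine while every value is known to be safe, but as soon as a value comes from outside the program the escaping becomes the caller's job.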

Most of the time I prefer to use an API like DOM to create an in-memory representation, and then serialize the tree to a String.

The following are my arguments in favor of the DOM approach:

1) Creating an XML string by hand becomes cumbersome when the XML is large. Keeping the parent-child relationships correct by hand in a huge document (imagine one of 50 MB) is hard, and it makes debugging difficult. A DOM API maintains this structure for you in memory.

2) It is difficult to remember the rules for XML names when writing them by hand.

For example, <9abc> is an invalid XML name (because it starts with a digit).

There are more rules for XML names.

The DOM API checks names automatically.

3) The DOM API can check the well-formedness of entity references (such as &abc;). Doing this by hand in a string can become difficult.
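The arguments above can be tried out with the JAXP classes that ship with the JDK. This is a sketch under my own assumptions: `DomBuildSketch` and its method names are hypothetical, and the identity `Transformer` is just one of several ways to serialize a DOM tree. It builds the x/y/z document, puts text in it that needs escaping, and checks whether the DOM accepts a given element name.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical class, for illustration only.
class DomBuildSketch {

    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Builds <x><y><z>text</z></y></x> in memory and serializes it.
    static String build(String text) {
        try {
            Document doc = newDocument();
            Element x = doc.createElement("x");
            Element y = doc.createElement("y");
            Element z = doc.createElement("z");
            doc.appendChild(x);
            x.appendChild(y);
            y.appendChild(z);
            z.setTextContent(text); // stored as-is, escaped on output

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns false when the DOM rejects the element name.
    static boolean acceptsName(String name) {
        try {
            newDocument().createElement(name);
            return true;
        } catch (DOMException e) {
            return false; // e.g. INVALID_CHARACTER_ERR for "9abc"
        }
    }

    public static void main(String[] args) {
        System.out.println(build("a < b & c"));
        System.out.println(acceptsName("9abc"));
    }
}
```

The nesting comes from the appendChild calls rather than from matching tags by eye, and the special characters in the text are escaped by the serializer.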

The following people added useful comments to this thread.

Martin Gallagher: Creating via DOM can be advantageous if further manipulation via the DOM is required. This will remove the overhead of creating a string and the initial parse to a DOM object.

Alain Couthures: I do appreciate not having too much program lines so I really prefer the string approach. You can always copy/paste the string in a text editor (I use NotePad++ myself...) which will check it is well-formed !

Michael Kay: My preference is to use a SAX-like serializer driven by calls such as

startElement("x")
attribute("a", "3")
text("content")
endElement("x")

This avoids the overhead of creating a tree in memory, while still giving the benefits of having the system take care of matters such as escaping special characters.

Robert Koberg: Rob pointed to the following library as a way to implement Mike's idea in Java:

http://www.megginson.com/downloads/xml-writer-0.2.zip

(Mukul: I tried this, but found that this XML writer library does not check the well-formedness of the documents it writes, which is something of a drawback.)

Andrew Welch: Create an empty SAXTransformerFactory get the TransformerHandler and then just call the event methods on that... (using Xalan or Saxon rather than Xerces)

Alternatively use the streaming api in Java 6 - see XMLStreamWriter.

Here is working code based on Andrew's suggestion:

// TransformerFactoryImpl is the XSLT processor's own factory class (Xalan or Saxon).
TransformerFactoryImpl tfi = new TransformerFactoryImpl();
// With no stylesheet, the handler simply serializes the SAX events it receives.
TransformerHandler tHandler = tfi.newTransformerHandler();
tHandler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
tHandler.setResult(new StreamResult(System.out));
tHandler.startDocument();
tHandler.startElement("", "x", "x", null);
// Attributes for <y>; "CDATA" is the SAX type for an undeclared attribute.
AttributesImpl attrs = new AttributesImpl();
attrs.addAttribute("", "attr1", "attr1", "CDATA", "123");
attrs.addAttribute("", "attr2", "attr2", "CDATA", "456");
tHandler.startElement("", "y", "y", attrs);
tHandler.startElement("", "z", "z", null);
// Every startElement must be closed by hand, innermost first.
tHandler.endElement("", "z", "z");
tHandler.endElement("", "y", "y");
tHandler.endElement("", "x", "x");
tHandler.endDocument();

Unfortunately, this technique does not ensure the well-formedness of the XML output either.
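For readers without Xalan or Saxon on the classpath, the same event calls can be sketched against the factory that the JDK itself provides. This is my own variation and not what Andrew posted: the class name `SaxSerializerSketch` is hypothetical, it writes to a String instead of System.out, and it passes an empty `AttributesImpl` where an element has no attributes.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical class, for illustration only.
class SaxSerializerSketch {

    static String write() {
        try {
            TransformerFactory tf = TransformerFactory.newInstance();
            // Not every factory supports SAX handlers, so check before casting.
            if (!tf.getFeature(SAXTransformerFactory.FEATURE)) {
                throw new IllegalStateException("SAXTransformerFactory not supported");
            }
            TransformerHandler handler =
                    ((SAXTransformerFactory) tf).newTransformerHandler();
            handler.getTransformer()
                    .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            handler.setResult(new StreamResult(out));

            AttributesImpl none = new AttributesImpl(); // no attributes
            AttributesImpl attrs = new AttributesImpl();
            attrs.addAttribute("", "attr1", "attr1", "CDATA", "123");
            attrs.addAttribute("", "attr2", "attr2", "CDATA", "456");

            handler.startDocument();
            handler.startElement("", "x", "x", none);
            handler.startElement("", "y", "y", attrs);
            handler.startElement("", "z", "z", none);
            handler.endElement("", "z", "z");
            handler.endElement("", "y", "y");
            handler.endElement("", "x", "x");
            handler.endDocument();
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(write());
    }
}
```

As with the code above, the caller is still responsible for pairing every startElement with the right endElement.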

I then asked:
Here I am using the transformer functionality to create XML, which looks more like an XSLT feature (i.e., a transformation task). Should we not have this capability in the XML parser (for example, in Xerces)? Should something like xml-writer (which Rob pointed to) be built into Xerces, possibly as an enhancement?

Michael Kay provided the following explanation for this scenario:
XSLT processors include a serializer because it's defined in the XSLT specification, and since they have one, it makes sense to expose it even if you aren't doing a transformation.

Michael also explained why the XSLT serializer does not check the well-formedness of its XML output:
"Because the spec doesn't say it has to. And because XSLT serializers were primarily written to get their input from XSLT transformers, which they trust; so why incur the extra expense?"

Following the XMLStreamWriter approach suggested by Andrew, here is the working code:

XMLOutputFactory factory = XMLOutputFactory.newInstance();
XMLStreamWriter writer = factory.createXMLStreamWriter(System.out);
writer.writeStartDocument();
writer.writeStartElement("x");
writer.writeStartElement("y");
writer.writeAttribute("attr1", "123");
writer.writeEndDocument();
writer.flush();
writer.close();

This looks better than the SAX writer approach as far as well-formedness is concerned: writeEndDocument() closes any elements that are still open, so the tags always nest properly.
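A self-contained variant of this code, writing to a String so the result can be inspected, might look like the following. It is a sketch only: `StaxWriterSketch` is a hypothetical name, and the writeCharacters call is my addition to show the writer escaping text content.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical class, for illustration only.
class StaxWriterSketch {

    static String write(String text) {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter writer =
                    XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writer.writeStartElement("x");
            writer.writeStartElement("y");
            writer.writeAttribute("attr1", "123");
            writer.writeCharacters(text); // '<' and '&' are escaped here
            writer.writeEndDocument();    // closes <y> and <x> for us
            writer.flush();
            writer.close();
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(write("a < b & c"));
    }
}
```

There are no explicit writeEndElement calls here; the writer keeps track of the open elements itself.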

Michael Glavassevich further commented on this approach:

It keeps the tags nested properly but you can still write all sorts of garbage.

Extending your example:

writer.writeStartDocument();
writer.writeDTD("<!GARBAGE>");
writer.writeComment("bad -- comment");
writer.writeStartElement("x");
writer.writeCharacters("\u0000");
writer.writeProcessingInstruction("xml", "version=\"1.0\"");
writer.writeStartElement("y");
writer.writeAttribute("attr1", "123");
writer.writeAttribute("attr1", "456");
writer.writeEndElement();
writer.writeEndElement();
writer.writeEmptyElement("3");
writer.writeEndDocument();
writer.flush();
writer.close();

produces:

<?xml version="1.0" ?><!GARBAGE><!--bad -- comment--><x><?xml version="1.0"?><y attr1="123" attr1="456"></y></x><3/>

with a NUL char after <x>.
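Since none of the writers above guarantees well-formed output, one defensive option is to parse the result again before using it. The sketch below does this with the StAX reader from the same Java 6 API. It is my own illustration rather than something suggested in the thread, and `ReparseCheck` is a hypothetical name.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical class, for illustration only.
class ReparseCheck {

    // Reads the whole document; any well-formedness error ends the loop early.
    static boolean isWellFormed(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                reader.next();
            }
            reader.close();
            return true;
        } catch (XMLStreamException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The duplicate attribute from the output shown above.
        System.out.println(isWellFormed("<x><y attr1=\"123\" attr1=\"456\"></y></x>"));
        System.out.println(isWellFormed("<x><y attr1=\"123\"></y></x>"));
    }
}
```

This costs a second pass over the document, which is the extra expense Michael Kay mentions, so it probably belongs in tests or at trust boundaries rather than on every write.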
