I posted the following question on the xml-dev mailing list, regarding this topic:
Let's suppose I need to create an XML document from scratch in a Java program.
What's the best way to do this?
I have seen that a quick way to do this is to prepare an XML string by hand.
For example,
String xml_str = "<x><y><z/></y></x>";
I want to understand the pros and cons of this approach.
Most of the time I prefer using an API like DOM to create an in-memory representation, and then serializing the tree to a String.
Following are my arguments in favor of the DOM approach:
1) Creating an XML string by hand becomes cumbersome if the XML is huge. Maintaining the correct parent-child relationships by hand is difficult for a large document (imagine a document of size 50 MB), and mistakes are hard to debug. With a DOM API, the tree structure is maintained for you in memory.
2) It's difficult to remember the correct XML naming rules when writing by hand.
For example, <9abc> is an invalid XML name (because it starts with a digit).
There are more rules for XML names.
The DOM API checks names automatically.
3) The DOM API can check that entity references (like &abc;) are well-formed. Doing this by hand in a string is difficult.
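To make the DOM arguments above concrete, here is a minimal sketch using only the JAXP classes that ship with the JDK. The class and method names (DomBuildSketch, buildXml, isNameRejected) are made up for illustration. It builds the same x/y/z document as the hand-written string, serializes it with an identity Transformer, and shows that createElement refuses a name such as 9abc.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical example class, not part of any library.
public class DomBuildSketch {

    // Creates an empty DOM Document using the JDK's default parser.
    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Builds <x><y><z/></y></x> as a tree and serializes it to a String.
    static String buildXml() {
        Document doc = newDocument();
        Element x = doc.createElement("x");
        Element y = doc.createElement("y");
        Element z = doc.createElement("z");
        doc.appendChild(x);
        x.appendChild(y);
        y.appendChild(z);
        try {
            // An identity transformer copies the DOM tree to the output unchanged.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter sw = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(sw));
            return sw.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns true if the DOM refuses to create an element with this name.
    static boolean isNameRejected(String name) {
        try {
            newDocument().createElement(name);
            return false;
        } catch (DOMException e) {
            return e.code == DOMException.INVALID_CHARACTER_ERR;
        }
    }

    public static void main(String[] args) {
        System.out.println(buildXml());
        System.out.println("9abc rejected: " + isNameRejected("9abc"));
    }
}
```

The name check happens at createElement time, so the mistake surfaces where it is made rather than when some later consumer parses the string.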
The following people added useful comments to this thread.
Martin Gallagher: Creating via DOM can be advantageous if further manipulation via the DOM is required. This will remove the overhead of creating a string and the initial parse to a DOM object.
Alain Couthures: I do appreciate not having too much program lines so I really prefer the string approach. You can always copy/paste the string in a text editor (I use NotePad++ myself...) which will check it is well-formed !
Michael Kay: My preference is to use a SAX-like serializer driven by calls such as
startElement("x")
attribute("a", "3")
text("content")
endElement("x")
This avoids the overhead of creating a tree in memory, while still giving the benefits of having the system take care of matters such as escaping special characters.
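To illustrate the shape of the API Michael describes, here is a minimal sketch of such a writer. SimpleXmlWriter and all of its methods are hypothetical names invented for this post, not a real library. It escapes special characters and checks that tags are nested properly, but it does not validate element names or reject illegal characters.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical minimal writer; a sketch of the idea, not a real library class.
public class SimpleXmlWriter {
    private final StringBuilder out = new StringBuilder();
    private final Deque<String> open = new ArrayDeque<>();
    private boolean startTagOpen = false; // true while "<name" still awaits ">" or "/>"

    public SimpleXmlWriter startElement(String name) {
        closeStartTag();
        out.append('<').append(name);
        open.push(name);
        startTagOpen = true;
        return this;
    }

    public SimpleXmlWriter attribute(String name, String value) {
        if (!startTagOpen) {
            throw new IllegalStateException("attribute() must directly follow startElement()");
        }
        out.append(' ').append(name).append("=\"").append(escape(value)).append('"');
        return this;
    }

    public SimpleXmlWriter text(String content) {
        closeStartTag();
        out.append(escape(content));
        return this;
    }

    public SimpleXmlWriter endElement(String name) {
        if (open.isEmpty() || !open.peek().equals(name)) {
            throw new IllegalStateException("endElement(" + name + ") does not match the open element");
        }
        open.pop();
        if (startTagOpen) {
            out.append("/>"); // no content was written, so emit an empty-element tag
            startTagOpen = false;
        } else {
            out.append("</").append(name).append('>');
        }
        return this;
    }

    public String result() {
        if (!open.isEmpty()) {
            throw new IllegalStateException("unclosed element: " + open.peek());
        }
        return out.toString();
    }

    private void closeStartTag() {
        if (startTagOpen) {
            out.append('>');
            startTagOpen = false;
        }
    }

    // Escapes the characters that are special in text and attribute values.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        String xml = new SimpleXmlWriter()
                .startElement("x").attribute("a", "3").text("a < b & c").endElement("x")
                .result();
        System.out.println(xml); // prints <x a="3">a &lt; b &amp; c</x>
    }
}
```

No tree is built here: each call appends directly to the output, and the only state kept is the stack of open element names.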
Robert Koberg: Rob referred to the following link to implement Mike's ideas in Java,
http://www.megginson.com/downloads/xml-writer-0.2.zip
(Mukul: I tried this, but found that this XML writer library doesn't check for well-formedness of XML documents, which is a bit of a drawback of this library).
Andrew Welch: Create an empty SAXTransformerFactory get the TransformerHandler and then just call the event methods on that... (using Xalan or Saxon rather than Xerces)
Alternatively use the streaming api in Java 6 - see XMLStreamWriter.
Here is the working code as per Andrew's suggestions:
// TransformerFactoryImpl is the XSLT processor's own factory class (Xalan or Saxon, as Andrew suggests)
TransformerFactoryImpl tfi = new TransformerFactoryImpl();
TransformerHandler tHandler = tfi.newTransformerHandler();
tHandler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
tHandler.setResult(new StreamResult(System.out));
tHandler.startDocument();
// The SAX contract expects an empty Attributes object, not null, for elements without attributes
AttributesImpl noAttrs = new AttributesImpl();
tHandler.startElement("", "x", "x", noAttrs);
AttributesImpl attrs = new AttributesImpl();
attrs.addAttribute("", "attr1", "attr1", "", "123");
attrs.addAttribute("", "attr2", "attr2", "", "456");
tHandler.startElement("", "y", "y", attrs);
tHandler.startElement("", "z", "z", noAttrs);
tHandler.endElement("", "z", "z");
tHandler.endElement("", "y", "y");
tHandler.endElement("", "x", "x");
tHandler.endDocument();
Unfortunately, this technique does not ensure well-formedness of the XML output either.
I then asked:
Here I am using the transformer functionality for creating XML, which looks more like an XSLT feature (i.e., a transformation task). Should we not have this capability in the XML parser (for example, in Xerces)? Should we have something like xml-writer (which Rob pointed to) built into Xerces (possibly as an enhancement)?
Michael Kay provided the following explanation for this scenario:
XSLT processors include a serializer because it's defined in the XSLT specification, and since they have one, it makes sense to expose it even if you aren't doing a transformation.
Michael also explained why the XSLT serializer does not check the well-formedness of its XML output:
"Because the spec doesn't say it has to. And because XSLT serializers were primarily written to get their input from XSLT transformers, which they trust; so why incur the extra expense?"
Following is the working code using the XMLStreamWriter approach, as suggested by Andrew:
XMLOutputFactory factory = XMLOutputFactory.newInstance();
XMLStreamWriter writer = factory.createXMLStreamWriter(System.out);
writer.writeStartDocument();
writer.writeStartElement("x");
writer.writeStartElement("y");
writer.writeAttribute("attr1", "123");
writer.writeEndDocument();
writer.flush();
writer.close();
This looks better than the SAX writer approach, for the well-formedness requirement.
Michael Glavassevich further commented on this approach:
It keeps the tags nested properly but you can still write all sorts of garbage.
Extending your example:
writer.writeStartDocument();
writer.writeDTD("<!GARBAGE>");
writer.writeComment("bad -- comment");
writer.writeStartElement("x");
writer.writeCharacters("\u0000");
writer.writeProcessingInstruction("xml", "version=\"1.0\"");
writer.writeStartElement("y");
writer.writeAttribute("attr1", "123");
writer.writeAttribute("attr1", "456");
writer.writeEndElement();
writer.writeEndElement();
writer.writeEmptyElement("3");
writer.writeEndDocument();
writer.flush();
writer.close();
produces:
<?xml version="1.0" ?><!GARBAGE><!--bad -- comment--><x><?xml version="1.0"?><y attr1="123" attr1="456"></y></x><3/>
with a NUL char after <x>.
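Since none of the writers discussed in this thread fully guarantees well-formed output, one possible safeguard is to run the generated string back through a parser before using it. The following is only a sketch of that idea, using the SAX parser bundled with the JDK. The class and method names (WellFormedCheck, isWellFormed) are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class for illustration.
public class WellFormedCheck {

    // Returns true if a non-validating SAX parser accepts the string.
    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    // DefaultHandler ignores all content events and rethrows fatal errors
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false; // the parser reported a well-formedness error
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<x><y><z/></y></x>"));
        System.out.println(isWellFormed("<x><y attr1=\"123\" attr1=\"456\"></y></x>"));
    }
}
```

This costs an extra parse, which is exactly the overhead the streaming approaches try to avoid, so it is probably best suited to tests or debugging rather than production paths.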