Mukul Gandhi: April 2008

Monday, April 14, 2008

Efficient XML creation with well-formedness support

After weighing many points (discussed in the thread "Best way to create an XML document" on the xml-dev list) about creating an XML document from scratch, I concluded that the following technique is perhaps the most efficient way to do this with well-formedness support:


public static void main(String[] args) {
  try {
    TransformerFactoryImpl tfi = new TransformerFactoryImpl();
    TransformerHandler tHandler = tfi.newTransformerHandler();

    tHandler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
    String output = "output.xml";
    tHandler.setResult(new StreamResult(new File(output)));

    tHandler.startDocument();
    tHandler.startElement("", "x", "x", new AttributesImpl()); // SAX expects a non-null Attributes object
    AttributesImpl attrs = new AttributesImpl();
    attrs.addAttribute("", "attr1","attr1", "", "123");
    attrs.addAttribute("", "attr2","attr2", "", "456");
    tHandler.startElement("", "y", "y", attrs);
    tHandler.startElement("", "z", "z", new AttributesImpl()); // empty attribute list instead of null
    tHandler.endElement("", "z", "z");
    tHandler.endElement("", "y", "y");
    tHandler.endElement("", "x", "x");
    tHandler.endDocument();

    if (isWellFormed(output)) {      
      /*
        do something with the generated file
      */
    }
    else {
      System.out.println("Generated XML document is not well-formed.");
    }
  }
  catch(Exception ex) {
    ex.printStackTrace();
  }
}

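private static boolean isWellFormed(String output) {
 // Re-parse the generated file; any parse failure means it is not well-formed.
 // try-with-resources closes the stream even when parsing fails.
 try (FileInputStream in = new FileInputStream(output)) {
   XMLReaderAdapter xra = new XMLReaderAdapter();
   xra.parse(new InputSource(in));
   return true;
 }
 catch(Exception ex) {
   return false;
 }
}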

I also agree with this opinion:
Since version 1.6, Java has supported the javax.xml.stream package, so the most straightforward way would be to use an XMLStreamWriter.

Please note: the javax.xml.stream API is also available for earlier versions of Java through JSR 173 (https://sjsxp.dev.java.net/).

Acknowledgements
Martin Gallagher
Alain Couthures
Michael Kay
Robert Koberg
Andrew Welch
Michael Glavassevich
Chris Burdess

Thursday, April 10, 2008

Best way to create an XML document

I posted the following question on the xml-dev mailing list regarding this topic:

Let's suppose I need to create an XML document from scratch in a Java program.

What's the best way to do this?

I have seen that a quick way to do this is to prepare an XML string by hand.

For example:

String xml_str = "<x><y><z/></y></x>";

I want to understand the pros and cons of this approach.

Most of the time I prefer using an API like DOM to create an in-memory representation, and then serializing the tree to a String.

The following are my arguments in favor of the DOM approach:

1) Creating an XML string by hand becomes cumbersome if the XML is large. Maintaining the correct parent-child relationships for a huge document (imagine a 50 MB document) is difficult by hand and makes debugging hard. A DOM API maintains this structure inherently in memory.

2) It's difficult to remember the XML naming rules when writing by hand.

For example, <9abc> is an invalid XML name (because it starts with a digit).

There are more rules for XML names.

A DOM API checks these automatically.

3) A DOM API can check that entity references (like &abc;) are well-formed. Doing this by hand in a string can become difficult.
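The DOM approach argued for above can be sketched with the standard JAXP classes. This is only an illustration, not code from the thread: the class and method names (DomCreationSketch, buildDocument, isRejectedName) are made up for the example, and the identity Transformer is just one of several ways to serialize a DOM tree.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical example class; names are illustrative only.
class DomCreationSketch {

    // Builds x/y/z in memory (y carries attr1="123") and serializes the tree to a String.
    static String buildDocument() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element x = doc.createElement("x");
        Element y = doc.createElement("y");
        y.setAttribute("attr1", "123");
        y.appendChild(doc.createElement("z"));
        x.appendChild(y);
        doc.appendChild(x);

        // An identity transformation is used purely as a serializer here.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    // Returns true if the DOM implementation refuses the name as an element name.
    static boolean isRejectedName(String name) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        try {
            doc.createElement(name);
            return false;
        } catch (DOMException e) {
            return e.code == DOMException.INVALID_CHARACTER_ERR;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(buildDocument());
        System.out.println("9abc rejected: " + isRejectedName("9abc"));
    }
}
```

The nesting is maintained by the tree itself, and a name such as 9abc is refused when the element is created rather than discovered later in a hand-written string.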

The following people added useful comments to this thread.

Martin Gallagher: Creating via DOM can be advantageous if further manipulation via the DOM is required. This will remove the overhead of creating a string and the initial parse to a DOM object.

Alain Couthures: I do appreciate not having too much program lines so I really prefer the string approach. You can always copy/paste the string in a text editor (I use NotePad++ myself...) which will check it is well-formed !

Michael Kay: My preference is to use a SAX-like serializer driven by calls such as

startElement("x")
attribute("a", "3")
text("content")
endElement("x")

This avoids the overhead of creating a tree in memory, while still giving the benefits of having the system take care of matters such as escaping special characters.
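A push-style serializer of the kind Michael describes might be sketched as a thin wrapper over XMLStreamWriter. The class PushSerializerSketch and its method names are assumptions made for this illustration; they are not an API from Saxon or from the thread. Unlike the calls above, endElement() takes no name here because XMLStreamWriter tracks the open element itself.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical wrapper; it only illustrates the call pattern startElement/attribute/text/endElement.
class PushSerializerSketch {
    private final StringWriter buffer = new StringWriter();
    private final XMLStreamWriter writer;

    PushSerializerSketch() throws XMLStreamException {
        writer = XMLOutputFactory.newInstance().createXMLStreamWriter(buffer);
    }

    PushSerializerSketch startElement(String name) throws XMLStreamException {
        writer.writeStartElement(name);
        return this;
    }

    PushSerializerSketch attribute(String name, String value) throws XMLStreamException {
        writer.writeAttribute(name, value);
        return this;
    }

    // Special characters such as < and & are escaped by the underlying writer.
    PushSerializerSketch text(String content) throws XMLStreamException {
        writer.writeCharacters(content);
        return this;
    }

    PushSerializerSketch endElement() throws XMLStreamException {
        writer.writeEndElement();
        return this;
    }

    // Flushes the writer and returns everything written so far.
    String finish() throws XMLStreamException {
        writer.flush();
        writer.close();
        return buffer.toString();
    }

    static String demo() throws XMLStreamException {
        return new PushSerializerSketch()
                .startElement("x")
                .attribute("a", "3")
                .text("a < b & c")
                .endElement()
                .finish();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(demo());
    }
}
```

No tree is built at any point: each call writes straight to the output, and the text passed to text() comes out with its markup characters escaped.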

Robert Koberg: Rob pointed to the following link for implementing Mike's ideas in Java:

http://www.megginson.com/downloads/xml-writer-0.2.zip

(Mukul: I tried this, but found that this XML writer library doesn't check the well-formedness of the documents it writes, which is a bit of a drawback.)

Andrew Welch: Create an empty SAXTransformerFactory get the TransformerHandler and then just call the event methods on that... (using Xalan or Saxon rather than Xerces)

Alternatively use the streaming api in Java 6 - see XMLStreamWriter.

Here is the working code as per Andrew's suggestions:

TransformerFactoryImpl tfi = new TransformerFactoryImpl();
TransformerHandler tHandler = tfi.newTransformerHandler();
tHandler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
tHandler.setResult(new StreamResult(System.out));
tHandler.startDocument();
tHandler.startElement("", "x", "x", new AttributesImpl()); // SAX expects a non-null Attributes object
AttributesImpl attrs = new AttributesImpl();
attrs.addAttribute("", "attr1","attr1", "", "123");
attrs.addAttribute("", "attr2","attr2", "", "456");
tHandler.startElement("", "y", "y", attrs);
tHandler.startElement("", "z", "z", new AttributesImpl()); // empty attribute list instead of null
tHandler.endElement("", "z", "z");
tHandler.endElement("", "y", "y");
tHandler.endElement("", "x", "x");
tHandler.endDocument();

Unfortunately, this technique also cannot ensure that the XML output is well-formed.

I then asked:
Here I am using the transformer functionality for creating XML, which looks more like an XSLT feature (i.e., a transformation task). Should we not have this capability in the XML parser (e.g., in Xerces)? Should we have something like xml-writer (which Rob pointed to) built into Xerces, possibly as an enhancement?

Michael Kay provided the following explanation for this scenario:
XSLT processors include a serializer because it's defined in the XSLT specification, and since they have one, it makes sense to expose it even if you aren't doing a transformation.

Michael also explained why the XSLT serializer does not check the well-formedness of its XML output:
"Because the spec doesn't say it has to. And because XSLT serializers were primarily written to get their input from XSLT transformers, which they trust; so why incur the extra expense?"

Here is the working code using the XMLStreamWriter approach, as suggested by Andrew:

XMLOutputFactory factory = XMLOutputFactory.newInstance();
XMLStreamWriter writer = factory.createXMLStreamWriter(System.out);
writer.writeStartDocument();
writer.writeStartElement("x");
writer.writeStartElement("y");
writer.writeAttribute("attr1", "123");
writer.writeEndDocument();
writer.flush();
writer.close();

As far as the well-formedness requirement is concerned, this looks better than the SAX writer approach.

Michael Glavassevich further commented on this approach:

It keeps the tags nested properly but you can still write all sorts of garbage.

Extending your example:

writer.writeStartDocument();
writer.writeDTD("<!GARBAGE>");
writer.writeComment("bad -- comment");
writer.writeStartElement("x");
writer.writeCharacters("\u0000");
writer.writeProcessingInstruction("xml", "version=\"1.0\"");
writer.writeStartElement("y");
writer.writeAttribute("attr1", "123");
writer.writeAttribute("attr1", "456");
writer.writeEndElement();
writer.writeEndElement();
writer.writeEmptyElement("3");
writer.writeEndDocument();
writer.flush();
writer.close();

produces:

<?xml version="1.0" ?><!GARBAGE><!--bad -- comment--><x><?xml version="1.0"?><y attr1="123" attr1="456"></y></x><3/>

with a NUL char after <x>.
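Output like this can still be caught after the fact by parsing it again, which is the idea behind the isWellFormed check in the later post. A minimal sketch for an in-memory string follows; the class name WellFormednessCheckSketch is hypothetical, and a plain non-validating JAXP SAX parser is assumed.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: re-parses serialized output to find out whether it is well-formed.
class WellFormednessCheckSketch {

    static boolean isWellFormed(String xml) {
        try {
            // DefaultHandler ignores all content events; only parse errors matter here.
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // Any fatal parse error means the text is not well-formed XML.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException("Could not run the parser", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<x><y attr1=\"123\"/></x>"));
        System.out.println(isWellFormed("<x><y attr1=\"123\" attr1=\"456\"></y></x><3/>"));
    }
}
```

The price is a second pass over the document, which is exactly the extra expense a serializer avoids by trusting its input.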

Xalan-J serializer

I thought there was a problem with the Xalan-J serializer, so I posted the following question on the xalan-dev mailing list:

I think there is scope for improvement in the Xalan-J 2.7.1 serializer.

I tried this sample XSLT stylesheet with Xalan-J 2.7.1.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

<xsl:output method="xml" indent="yes" />

<xsl:template match="/">
  <x>
    <y/>
  </x>
</xsl:template>

</xsl:stylesheet>

The output produced by Xalan is:
<?xml version="1.0" encoding="UTF-8"?><x>
<y/>
</x>

Please note that the topmost element tag, <x>, is not placed on its own line.

I would expect the output in this case to be:

<?xml version="1.0" encoding="UTF-8"?>
<x>
  <y/>
</x>

This problem seems to happen with any XML output.

Henry Zongaro gave a good explanation of why this is so:

The problem here is that the serializer considers that the result document might be used as an external general parsed entity. So, suppose the result is named result.xml. If it's referenced inside a document such as the following, inserting whitespace before the x element in result.xml would affect the text content of its parent element, doc.

<!DOCTYPE doc [
<!ENTITY ref SYSTEM "result.xml">
]>
<doc>Some non-whitespace text &ref; Some more non-whitespace text</doc>

Saturday, April 5, 2008

Schema aware XSLT design

I think it would be great to start my blog with a schema-aware XSLT idea.

As we know, XSLT 2.0 introduced the ability to use W3C XML Schemas within XSLT stylesheets. Schemas can be put to use in stylesheets in various ways; the major ones are:

1) Validating the input XML documents before the XSLT transformation. This ensures that invalid input is not processed by the stylesheet. Input validation has the additional benefit of attaching type annotations to XML nodes, which makes many useful type-aware operations possible within the stylesheet.

2) Validating the output trees (in most cases) before serialization. This ensures that the XSLT stylesheet doesn't produce invalid output.

3) Schemas can also be used to validate intermediate trees.

4) In XSLT 2.0, we can specify the types of function and template parameters, the return types of functions, and the types of variables. These can be any user-defined types derived from the imported schemas. This is tremendously useful, as the type system of XSLT 2.0 can now be extended without limit.

5) Apart from enhanced static type checking by the XSLT processor (made possible by schema awareness), the processor has the opportunity to generate more efficient code. Better static type checking also allows faster debugging, since more errors are caught at compile time.
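The first point, validating input before it is transformed, can be approximated from Java with the javax.xml.validation API. This is not schema-aware XSLT: no type annotations reach a stylesheet this way, and the one-element schema and the class name ValidateBeforeTransformSketch are invented purely for the illustration.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

// Hypothetical example: check a document against a schema before handing it to a transformation.
class ValidateBeforeTransformSketch {

    // Made-up schema: a single price element whose content must be an xs:decimal.
    static final String SCHEMA =
            "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
            + "<xs:element name='price' type='xs:decimal'/>"
            + "</xs:schema>";

    static boolean isValid(String xml) throws Exception {
        Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(SCHEMA)));
        Validator validator = schema.newValidator();
        try {
            // With no ErrorHandler set, validation errors are reported as SAXException.
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isValid("<price>12.50</price>"));
        System.out.println(isValid("<price>cheap</price>"));
    }
}
```

A caller would run the transformation only when isValid returns true, so the stylesheet never sees input that violates the schema.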

I would be interested to hear about other benefits of schema-aware stylesheet design.

Wednesday, April 2, 2008

I've started blogging

I've started to blog today. The topics I currently find interesting are XML, XML Schema, XPath, XSLT and XQuery, though occasionally I might post my thoughts on other technical topics as well.

I intend to use this medium to share some random as well as some carefully thought-out ideas in the information technology domain, and to bring them into the open for discussion.

I hope that this blog will be useful.

Please feel free to post any comments.

Mukul Gandhi