Saturday, June 4, 2011

Dealing with multiple roots within an XML Schema

I've been thinking about this problem for a while and have collected some thoughts, which I'm presenting here.

We'll be working with the following XML Schema documents:

a.xsd [1]
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

    <xs:element name="x" type="xs:string"/>

    <xs:element name="y" type="xs:string"/>

    <xs:element name="z">
       <xs:complexType>
          <xs:sequence>
             <xs:element ref="x"/>
             <xs:element ref="y"/>
          </xs:sequence>
       </xs:complexType>
    </xs:element>

</xs:schema>

b.xsd [2]
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

    <xs:include schemaLocation="c.xsd"/>

    <xs:element name="z">
       <xs:complexType>
          <xs:sequence>
             <xs:element ref="x"/>
             <xs:element ref="y"/>
          </xs:sequence>
       </xs:complexType>
    </xs:element>

</xs:schema>

c.xsd [3]
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

    <xs:element name="x" type="xs:string"/>

    <xs:element name="y" type="xs:string"/>

</xs:schema>
The schema documents [1] and [2] are equivalent for the purpose of validating an XML instance document (it's just that the schema document b.xsd includes c.xsd).

Our application requires the following XML document to be successfully validated, by the schemas [1] or [2] above:

z.xml [4]
<z>
   <x>hello</x>
   <y>world</y>
</z>
All of this is just fine, and XML document [4] gets successfully validated by schemas [1] and [2] above.

But the above schema design (either [1] or [2]) can sometimes cause a problem:

A side effect of schema documents [1] and [2] is that they also successfully validate the following XML documents:

<x>...</x>

OR

<y>...</y>
This happens because elements "x" and "y" are also valid roots according to the schema (they are declared globally). But the purpose of declaring "x" and "y" globally is to include them by reference elsewhere in the schema document (as in the declaration of element "z" in schemas [1] and [2]).
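The side effect described above can be checked with a small program. The following is only a minimal sketch: it assumes the JAXP validation API that ships with the JDK, inlines the declarations of a.xsd [1] as a string so that it is self-contained, and uses a class name (MultipleRootsDemo) and helper (isValid) that are invented for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class MultipleRootsDemo {

    // Same declarations as a.xsd [1], inlined so the sketch is self-contained.
    static final String A_XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='x' type='xs:string'/>"
      + "<xs:element name='y' type='xs:string'/>"
      + "<xs:element name='z'><xs:complexType><xs:sequence>"
      + "<xs:element ref='x'/><xs:element ref='y'/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    // Returns true if the given XML text validates against A_XSD.
    static boolean isValid(String xml) {
        Schema schema;
        try {
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(A_XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("the schema itself could not be compiled", e);
        }
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the validator reported an error
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<z><x>hello</x><y>world</y></z>")); // the intended root
        System.out.println(isValid("<x>hello</x>")); // a global declaration, so also a valid root
        System.out.println(isValid("<y>world</y>")); // likewise
        System.out.println(isValid("<w/>"));         // no declaration for "w" at all
    }
}
```

With the schema as written, the first three documents should be reported as valid and only the last one as invalid, which is exactly the behaviour the application does not want for "x" and "y".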

This kind of schema design is sometimes necessary for reasons of modularity (e.g. using one declaration in multiple places) and reusability (e.g. including a foreign schema in our own schema). The benefit grows with the complexity of the schema (e.g. more schema components, nested more deeply).

So how do we live with this design trade-off? Schemas like [1] or [2] above give us the benefits of modularity and reusability, but as a side effect they validate multiple root elements, which risks the application accepting invalid XML documents (in this example, the roots "x" and "y" are invalid for the application, while the root "z" is valid).

In this use case, if we want the application to reject XML documents with roots "x" or "y" but accept documents with root "z", then in my opinion this problem cannot be solved completely within the XML Schema language (the language currently offers no way to forbid validating the top-level element of an instance document against a global element declaration).

Solving this problem requires a small non-schema addition (e.g. a Java SAX add-on alongside schema validation).

Here's a sketch of a Java SAX application whose result can be AND-ed with XML Schema validation (using the schemas above) to achieve the desired overall validation effect (i.e. successful validation for the root element "z" while prohibiting the roots "x" and "y"):

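import java.io.IOException;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SAXUtil extends DefaultHandler {

     // Local names of elements that must not be the document root.
     String[] excludedElems = new String[] {"x", "y"};

     // Parses only as far as the root element. Returns false if the root is
     // on the forbidden list, true otherwise.
     public boolean isRootElemOK(String docUri) throws IOException, ParserConfigurationException {
        boolean rootElemOk = true;

        try {
           SAXParserFactory saxParserFactory = SAXParserFactory.newInstance();
           saxParserFactory.setNamespaceAware(true);
           SAXParser saxParser = saxParserFactory.newSAXParser();
           saxParser.parse(docUri, this);
        }
        catch(SAXException ex) {
           // Any other SAXException (e.g. a well-formedness error) is ignored
           // here and left for the schema validation step to report.
           if (ex instanceof RootElementSAXException) {
              RootElementSAXException expObj = (RootElementSAXException) ex;
              if ("100".equals(expObj.getErrCode())) {
                 rootElemOk = false;
              }
           }
        }

        return rootElemOk;
     }

     // Called for the root element first; always throws, so parsing stops there.
     @Override
     public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
        if (!isElementAllowed(localName)) {
           throw new RootElementSAXException("100"); // forbidden root
        }
        throw new RootElementSAXException("101");    // allowed root
     }

     private boolean isElementAllowed(String localName) {
        boolean elemAllowed = true;
        for (int elemIdx = 0; elemIdx < excludedElems.length; elemIdx++) {
           if (localName.equals(excludedElems[elemIdx])) {
              elemAllowed = false;
              break;
           }
        }
        return elemAllowed;
     }

     // Application-specific exception carrying the outcome of the root check.
     class RootElementSAXException extends SAXException {
        String errorCode;

        public RootElementSAXException(String errorCode) {
           this.errorCode = errorCode;
        }

        public String getErrCode() {
           return errorCode;
        }
     }

} // class SAXUtil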

Following is an algorithmic summary of the above Java validation add-on:

1) A SAX parser is instantiated and parsing is triggered with the parse() method.

2) The SAX parser never goes beyond the root element. This is intentional: the SAX "startElement" callback always throws an exception (the user-defined RootElementSAXException) upon encountering the topmost element. The constructor argument of the exception ("100" or "101" in this case) indicates whether the topmost element was forbidden ("100") or allowed ("101"), as determined by the forbidden-name list "excludedElems" defined in the above Java class.

Notes:

1) To terminate SAX parsing before the whole XML document has been parsed, a SAXException can be thrown from the SAX callback methods. A custom exception class (like RootElementSAXException in the above example) is desirable, to distinguish our application's exception from the parser's own SAXExceptions.

2) The SAX API is recommended for this use case, since it is much more efficient than, e.g., the DOM APIs, which load the whole XML document into memory (not a sensible approach, in my view, for just learning the name of the topmost element of an XML document).

3) The list of excluded element names can be externalized from the Java application, to make the above program reusable for any kind of XML document.

4) We may use something like the Java JAXP validation APIs to help achieve the "and" of the two validation steps (i.e. schema validation and the SAX step) described here, if we want to integrate this approach into a Java application.

5) The Java code snippet presented above can be made fully XML namespace aware (i.e. if the XML elements are in a namespace) by also considering the namespace name parameter in the SAX callback methods (e.g. the parameter "String uri" of the startElement callback method).
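Notes 3 to 5 above could be combined roughly as follows. This is a sketch under several assumptions, not a finished implementation: the class RootAwareValidator and its members are hypothetical names, the forbidden roots are passed in by the caller as qualified names (namespace URI plus local name), the schema and document are given as strings to keep the example self-contained, and Set.of requires Java 9 or later. The document is parsed twice, once up to the root element and once for full schema validation.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Set;
import javax.xml.XMLConstants;
import javax.xml.namespace.QName;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: accepts a document only if its root is not forbidden
// AND the document validates against the schema.
class RootAwareValidator {

    // Same declarations as a.xsd [1], inlined to keep the sketch self-contained.
    static final String A_XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='x' type='xs:string'/>"
      + "<xs:element name='y' type='xs:string'/>"
      + "<xs:element name='z'><xs:complexType><xs:sequence>"
      + "<xs:element ref='x'/><xs:element ref='y'/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    // Thrown from startElement to stop parsing as soon as the root is seen.
    static class RootSeen extends SAXException {
        final QName root;
        RootSeen(QName root) { this.root = root; }
    }

    private final Schema schema;
    private final Set<QName> forbiddenRoots; // supplied by the caller (note 3)

    RootAwareValidator(String xsdText, Set<QName> forbiddenRoots) {
        this.schema = compile(xsdText);
        this.forbiddenRoots = forbiddenRoots;
    }

    private static Schema compile(String xsdText) {
        try {
            return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(xsdText)));
        } catch (SAXException e) {
            throw new IllegalArgumentException("schema could not be compiled", e);
        }
    }

    // Reads only the root element and returns its namespace-qualified name
    // (note 5), or null if the document is not well-formed up to that point.
    static QName rootName(String xml) {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        try {
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName,
                                         Attributes attributes) throws SAXException {
                    throw new RootSeen(new QName(uri, localName));
                }
            });
        } catch (RootSeen seen) {
            return seen.root;
        } catch (SAXException | ParserConfigurationException e) {
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return null; // no element was reported
    }

    // The "and" of the two steps (note 4): root check first, then schema validation.
    boolean accepts(String xml) {
        QName root = rootName(xml);
        if (root == null || forbiddenRoots.contains(root)) {
            return false;
        }
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // schema validation failed
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller might construct it as new RootAwareValidator(RootAwareValidator.A_XSD, Set.of(new QName("x"), new QName("y"))) and then call accepts(...) on each incoming document. Elements in a namespace would be listed with the two-argument form new QName(namespaceUri, localName).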

I hope that this post is useful.

2011-06-26:
The explanation I originally gave in this blog post seems to convey that the XML Schema language's allowing multiple global element declarations in a schema document is inherently a bad/wrong design. One solution for prohibiting certain XML elements from being the root of an XML instance document was presented earlier in this blog post (using an additional, restricted SAX parsing step in an application).

All this is fine. But I want to follow up on my earlier thoughts in this post and argue now that allowing multiple global element declarations is not a bad/wrong design in the XML Schema language. One feature of the language that requires multiple global element declarations is "substitution groups" (i.e. one element substituting for another), and substitution groups are a core and important concept within the XML Schema language.
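To make the substitution group point concrete, here is a small sketch. The schema in it is hypothetical (the element names "shape", "circle" and "drawing" do not come from the schemas earlier in this post), and it again assumes the JAXP validation API bundled with the JDK. Both the head element "shape" and its substitute "circle" have to be global declarations for the substitution to work.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class SubstitutionGroupDemo {

    // Hypothetical schema: "circle" may appear wherever "shape" is referenced.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='shape' type='xs:string'/>"
      + "<xs:element name='circle' type='xs:string' substitutionGroup='shape'/>"
      + "<xs:element name='drawing'><xs:complexType><xs:sequence>"
      + "<xs:element ref='shape' maxOccurs='unbounded'/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    // Returns true if the given XML text validates against XSD.
    static boolean isValid(String xml) {
        Schema schema;
        try {
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("the schema itself could not be compiled", e);
        }
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the validator reported an error
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "circle" substitutes for "shape" inside "drawing".
        System.out.println(isValid("<drawing><shape>a</shape><circle>b</circle></drawing>"));
        // "square" is not declared, so it cannot substitute for anything.
        System.out.println(isValid("<drawing><square>c</square></drawing>"));
        // The price of the global declaration: "circle" is also a valid root.
        System.out.println(isValid("<circle>b</circle>"));
    }
}
```

The first and third documents should validate and the second should not, so the same global declarations that enable substitution also produce the extra valid roots discussed in this post.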

Of course, whether or not one is working with XML Schema substitution groups, one could use the SAX add-on technique I presented earlier to prohibit certain global element declarations from validating the root element of an XML instance, if that suits one's application design.