Saturday, February 25, 2012

modular XML instances and modular XSD schemas

I have lately been exploring design options for constructing modular XML instance documents and modular XSD schema documents, and thought I would write up my findings as a blog post here.

I believe there are primarily two concepts related to constructing modular XML documents (and XSD schemas) when XSD validation is involved:
1. Modularize XML documents using the XInclude construct.
2. Modularize an XSD document via <xs:include> and <xs:import>. The <xs:include> construct maps most directly to modularity in XSD schemas, and <xs:import> is needed (necessary in XSD 1.0, and optional in XSD 1.1) to compose (and also to modularize) XSD schemas coming from two or more distinct XML namespaces.

I don't intend to delve much in this post into the XSD constructs <xs:include> and <xs:import>, since these are well known within the XSD and XML communities. Instead, I focus primarily on XML document modularization via the XInclude construct, and present a few thoughts about design options for validating such XML instance documents with XSD (I don't claim to cover every design option for these use cases, but I hope to cover a few of the important ones).

What is XInclude?
XInclude is a W3C specification that defines how to assemble an XML document from other XML documents. The primary construct of XInclude is the <xi:include> XML element. The following is a small example of an XInclude-aware XML document:

z.xml

<z xmlns:xi="http://www.w3.org/2001/XInclude">
    <xi:include href="x.xml"/>
    <xi:include href="y.xml"/>
</z>

x.xml

<x>
    <a>1</a>
    <b>2</b>
</x>

y.xml

<y>
    <p>5</p>
    <q>6</q>
</y>

We'll supply the XML document z.xml shown above, which is composed from other XML documents via XInclude meta-data, to an XSD validator for validation.
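To make the idea of "expanding" the XInclude meta-data concrete, here is a minimal sketch using the standard JAXP DocumentBuilderFactory with setXIncludeAware(true). The class name XIncludeExpandDemo, the helper names and the temporary directory are assumptions made for illustration only; the three files mirror x.xml, y.xml and z.xml above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class XIncludeExpandDemo {

    // Writes x.xml, y.xml and z.xml (same content as in this post) into a
    // temporary directory and returns the path of z.xml.
    static Path writeSampleFiles() throws IOException {
        Path dir = Files.createTempDirectory("xinclude-demo");
        Files.writeString(dir.resolve("x.xml"), "<x><a>1</a><b>2</b></x>");
        Files.writeString(dir.resolve("y.xml"), "<y><p>5</p><q>6</q></y>");
        Path z = dir.resolve("z.xml");
        Files.writeString(z, "<z xmlns:xi=\"http://www.w3.org/2001/XInclude\">"
                + "<xi:include href=\"x.xml\"/><xi:include href=\"y.xml\"/></z>");
        return z;
    }

    // Parses the document with XInclude processing switched on and returns
    // the names of the root's element children after expansion.
    static List<String> expandedChildNames(Path zPath) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);  // XInclude processing relies on namespaces
        dbf.setXIncludeAware(true);   // replace <xi:include> elements while parsing
        Document doc = dbf.newDocumentBuilder().parse(zPath.toFile());

        List<String> names = new ArrayList<>();
        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i) instanceof Element) {
                names.add(children.item(i).getNodeName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(expandedChildNames(writeSampleFiles()));
    }
}
```

After parsing, the children of "z" are the root elements of the included documents rather than the <xi:include> elements themselves. Without setXIncludeAware(true), the same parse would leave the two <xi:include> elements in place.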

What I essentially discuss here are the XSD schema design options for validating an XML instance document like z.xml above. The following are the XSD design options (each resulting in a successful validation of the XML instance) that currently come to my mind for this need, along with some explanation of the corresponding design rationale:

XS1:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

    <xs:element name="z">
          <xs:complexType>
               <xs:sequence>
                    <xs:any processContents="skip" minOccurs="2" maxOccurs="2"/>
               </xs:sequence>
          </xs:complexType>
    </xs:element>
   
</xs:schema>

This schema is written with the view that the XML document (i.e. z.xml) would be validated with the XInclude meta-data unexpanded. The xs:any wild-card in this schema validates only weakly: it requires that exactly two child elements, of any name, be present inside "z", and (because of processContents="skip") it specifies no other constraint on them. In the unexpanded document the two <xi:include> elements satisfy the wild-card; the same schema would equally accept the root elements of the included documents (i.e. the XML elements "x" and "y") if the XInclude meta-data were expanded.
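The weakness of this wild-card can be seen with a small sketch that runs XS1 through the standard javax.xml.validation API. The class name Xs1WildcardDemo and the inline test documents are assumptions for illustration; the validator here parses the instance without any XInclude processing, so the <xi:include> elements stay unexpanded.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

final class Xs1WildcardDemo {

    // The XS1 schema from this post, written as a single string.
    static final String XS1 =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='z'><xs:complexType><xs:sequence>"
        + "<xs:any processContents='skip' minOccurs='2' maxOccurs='2'/>"
        + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    // Returns true when the given XML text is valid against XS1.
    static boolean isValid(String xml) throws Exception {
        SchemaFactory sf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = sf.newSchema(new StreamSource(new StringReader(XS1)));
        try {
            // With no ErrorHandler set, validation errors are thrown as SAXException.
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String unexpanded = "<z xmlns:xi='http://www.w3.org/2001/XInclude'>"
                + "<xi:include href='x.xml'/><xi:include href='y.xml'/></z>";
        System.out.println(isValid(unexpanded));           // two children of any name
        System.out.println(isValid("<z><foo/><bar/></z>")); // unrelated names also pass
        System.out.println(isValid("<z><foo/></z>"));       // only one child
    }
}
```

The second call is the interesting one: any two child elements are accepted, which is why the later schemas tighten the constraints.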

XS2:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

    <xs:element name="z">
        <xs:complexType>
            <xs:complexContent>
                <xs:restriction base="T1">
                    <xs:sequence>
                        <xs:element name="include" minOccurs="2" maxOccurs="2" targetNamespace="http://www.w3.org/2001/XInclude"/>
                    </xs:sequence>
                </xs:restriction>
            </xs:complexContent>
        </xs:complexType>
    </xs:element>

    <xs:complexType name="T1" abstract="true">
        <xs:sequence>
            <xs:any processContents="skip" maxOccurs="unbounded"/>
        </xs:sequence>
    </xs:complexType>

</xs:schema>

This schema is also written with the view that the XML document (i.e. z.xml) would be validated with the XInclude meta-data unexpanded. But this schema specifies slightly stronger XSD validation constraints than the previous example (stronger in the sense that this schema declares an XML element and specifies its name and namespace). This schema needs an XSD 1.1 processor, since the element declaration specifies a "targetNamespace" attribute. An XSD 1.0 version of this design approach is possible, which would involve using an XSD <xs:import> element to import XSD components from the XInclude namespace.
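One possible shape of the XSD 1.0 variant mentioned above is sketched below, embedded in a small Java program so that it can be tried out. This is my own guess at such a variant, not a schema taken from any specification: the file names xi.xsd and main.xsd, the class name Xs2ImportDemo, and the choice to declare "include" as a global element of the default type are all assumptions. Since an element reference is enough in XSD 1.0, the abstract base type T1 is dropped here.

```java
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

final class Xs2ImportDemo {

    static final String XI_NS = "http://www.w3.org/2001/XInclude";

    // Hypothetical schema document for the XInclude namespace (xi.xsd).
    static final String XI_XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema' targetNamespace='" + XI_NS + "'>"
        + "<xs:element name='include'/>"
        + "</xs:schema>";

    // Hypothetical main schema (main.xsd) that imports xi.xsd and refers to xi:include.
    static final String MAIN_XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema' xmlns:xi='" + XI_NS + "'>"
        + "<xs:import namespace='" + XI_NS + "' schemaLocation='xi.xsd'/>"
        + "<xs:element name='z'><xs:complexType><xs:sequence>"
        + "<xs:element ref='xi:include' minOccurs='2' maxOccurs='2'/>"
        + "</xs:sequence></xs:complexType></xs:element>"
        + "</xs:schema>";

    // Writes both schema documents to a temporary directory and compiles main.xsd,
    // so that the relative schemaLocation on <xs:import> can be resolved.
    static Schema loadSchema() throws Exception {
        Path dir = Files.createTempDirectory("xs2-import-demo");
        Files.writeString(dir.resolve("xi.xsd"), XI_XSD);
        Path main = dir.resolve("main.xsd");
        Files.writeString(main, MAIN_XSD);
        return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI).newSchema(main.toFile());
    }

    // Returns true when the given XML text is valid against the compiled schema.
    static boolean isValid(Schema schema, String xml) throws Exception {
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        Schema schema = loadSchema();
        String unexpanded = "<z xmlns:xi='" + XI_NS + "'>"
                + "<xi:include href='x.xml'/><xi:include href='y.xml'/></z>";
        System.out.println(isValid(schema, unexpanded));           // two xi:include children
        System.out.println(isValid(schema, "<z><foo/><bar/></z>")); // wrong element names
    }
}
```

Unlike XS1, this variant rejects children that are not <xi:include> elements, which is the same strengthening that XS2 achieves with "targetNamespace" in XSD 1.1.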

XS3:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

    <xs:element name="z">
        <xs:complexType>
            <xs:sequence>
                <xs:any processContents="skip" minOccurs="2" maxOccurs="2" namespace="http://www.w3.org/2001/XInclude"/>
            </xs:sequence>
            <xs:assert test="count(*[local-name() = 'include']) = 2"/>
            <xs:assert test="deep-equal((*[1] | *[2])/@*/name(), ('href','href'))"/>
        </xs:complexType>
    </xs:element>

</xs:schema>

This schema is also written with the view that the XML document (i.e. z.xml) would be validated with the XInclude meta-data unexpanded. But this schema enforces XSD validation even more strongly than the example "XS2" above (since this schema also requires the XInclude attribute "href" to be present on the XInclude meta-data, which the previous XSD schema doesn't enforce). This schema validates the names of the XML instance elements that are intended to be XInclude meta-data via XSD 1.1 <assert> elements (this may not be the best XSD validation approach, but such an XSD design idiom is now possible with the XSD 1.1 language).

XS4:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

    <xs:element name="z">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="x">
                    <xs:complexType>
                        <xs:sequence>
                            <xs:element name="a" type="xs:integer"/>
                            <xs:element name="b" type="xs:integer"/>
                        </xs:sequence>
                    </xs:complexType>
                </xs:element>
                <xs:element name="y">
                    <xs:complexType>
                        <xs:sequence>
                            <xs:element name="p" type="xs:integer"/>
                            <xs:element name="q" type="xs:integer"/>
                        </xs:sequence>
                    </xs:complexType>
                </xs:element>
            </xs:sequence>
        </xs:complexType>
    </xs:element>

</xs:schema>

This schema is written with the view that the XML document (i.e. z.xml) would be validated with the XInclude meta-data expanded. This schema specifies the strongest XSD validation constraints of the four approaches (strongest in the sense that the internal structure of the XML element instances "x" and "y" is now completely specified by the XSD document).

But for this XSD validation approach to work, the XInclude meta-data needs to be expanded, and the expanded XML infoset needs to be supplied to the XSD validator. This requires an XInclude processor (like Apache Xerces) that plugs into the XML parsing stage to expand the <xi:include> tags.

For interested readers, the following are a few Java code snippets (the skeletal class structure and imports are omitted to keep the text shorter) that enable XInclude processing and supply the resulting XML infoset (i.e. after the XInclude meta-data expansion) to the Xerces XSD validator:

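// schemaFactory, xsdUri and xmlUri come from the omitted class skeleton
try {
     // the schema document itself is parsed without XInclude processing
     Schema schema = schemaFactory.newSchema(getSaxSource(xsdUri, false));
     Validator validator = schema.newValidator();
     validator.setErrorHandler(new ValidationErrHandler());
     // the instance document is parsed with XInclude expansion enabled
     validator.validate(getSaxSource(xmlUri, true));
}
catch(SAXException se) {
     se.printStackTrace();
}
catch (IOException ioe) {
     ioe.printStackTrace();
}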

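private SAXSource getSaxSource(String docUri, boolean isInstanceDoc) {

     XMLReader reader = null;

     try {
          reader = XMLReaderFactory.createXMLReader();
          if (isInstanceDoc) {
              // expand <xi:include> elements while parsing the instance document
              reader.setFeature("http://apache.org/xml/features/xinclude", true);
              // don't add xml:base attributes to the included elements
              reader.setFeature("http://apache.org/xml/features/xinclude/fixup-base-uris", false);
          }
     }
     catch (SAXException se) {
          // without a configured reader, the XInclude expansion would be silently skipped
          throw new IllegalStateException("could not configure the XML reader", se);
     }

     return new SAXSource(reader, new InputSource(docUri));

}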
     
class ValidationErrHandler implements ErrorHandler {

      public void error(SAXParseException spe) throws SAXException {
           String formattedMesg = getFormattedMesg(spe.getSystemId(), spe.getLineNumber(), spe.getColumnNumber(), spe.getMessage());
           System.err.println(formattedMesg);
      }

      public void fatalError(SAXParseException spe) throws SAXException {
             String formattedMesg = getFormattedMesg(spe.getSystemId(), spe.getLineNumber(), spe.getColumnNumber(), spe.getMessage());
             System.err.println(formattedMesg);
      }

      public void warning(SAXParseException spe) throws SAXException {
           // NO-OP           
      }
       
}

private String getFormattedMesg(String systemId, int lineNo, int colNo, String mesg) {
      return systemId + ", line "+lineNo + ", col " + colNo + " : " + mesg;   
}
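The snippets above switch off the "fixup-base-uris" feature without explanation. My reading (an inference, not something stated above) is that with base URI fixup enabled, the XInclude processor adds an xml:base attribute to each included root element, and XS4 declares no attributes on "x" and "y", so validation would report errors. The following self-contained sketch tries both settings. The class name ExpandedValidationDemo, the helper names and the temporary files are assumptions for illustration, and the reader is obtained from SAXParserFactory rather than XMLReaderFactory.

```java
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

final class ExpandedValidationDemo {

    // The XS4 schema from this post, written as a single string.
    static final String XS4 =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='z'><xs:complexType><xs:sequence>"
        + "<xs:element name='x'><xs:complexType><xs:sequence>"
        + "<xs:element name='a' type='xs:integer'/><xs:element name='b' type='xs:integer'/>"
        + "</xs:sequence></xs:complexType></xs:element>"
        + "<xs:element name='y'><xs:complexType><xs:sequence>"
        + "<xs:element name='p' type='xs:integer'/><xs:element name='q' type='xs:integer'/>"
        + "</xs:sequence></xs:complexType></xs:element>"
        + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    // Writes x.xml, y.xml and z.xml into a temporary directory; returns the path of z.xml.
    static Path writeSampleFiles() throws Exception {
        Path dir = Files.createTempDirectory("xs4-demo");
        Files.writeString(dir.resolve("x.xml"), "<x><a>1</a><b>2</b></x>");
        Files.writeString(dir.resolve("y.xml"), "<y><p>5</p><q>6</q></y>");
        Path z = dir.resolve("z.xml");
        Files.writeString(z, "<z xmlns:xi=\"http://www.w3.org/2001/XInclude\">"
                + "<xi:include href=\"x.xml\"/><xi:include href=\"y.xml\"/></z>");
        return z;
    }

    // Validates the XInclude-expanded document against XS4 and counts the
    // validation errors reported to the error handler.
    static int countErrors(Path z, boolean fixupBaseUris) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        spf.setXIncludeAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();
        // Xerces-specific feature, the same one used in the snippets above
        reader.setFeature("http://apache.org/xml/features/xinclude/fixup-base-uris", fixupBaseUris);

        SchemaFactory sf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Validator validator = sf.newSchema(new StreamSource(new StringReader(XS4))).newValidator();
        final int[] errors = {0};
        validator.setErrorHandler(new DefaultHandler() {
            @Override
            public void error(SAXParseException e) {
                errors[0]++;  // record the error and let validation continue
            }
        });
        validator.validate(new SAXSource(reader, new InputSource(z.toUri().toString())));
        return errors[0];
    }

    public static void main(String[] args) throws Exception {
        Path z = writeSampleFiles();
        System.out.println("fixup off: " + countErrors(z, false) + " error(s)");
        System.out.println("fixup on:  " + countErrors(z, true) + " error(s)");
    }
}
```

An alternative to switching the feature off would be to let the schema allow the xml:base attribute on "x" and "y", at the cost of a slightly looser XS4.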

Summary: Is devising the various XSD design approaches above beneficial for an XSD schema design that involves validating XML instance documents containing <xi:include> meta-data directives? My thinking about the XSD validation options presented above was driven by the following concerns:
1) Providing various degrees of XSD validation strength for <xi:include> directives (essentially the unexpanded and expanded modes).
2) Exploring some of the new XML validation idioms offered by the XSD 1.1 language for the use cases presented above (essentially using the "targetNamespace" attribute on xs:element elements, and using <assert> elements).
3) Exploring the Java SAX and JAXP APIs to enable XInclude meta-data expansion, and providing a SAXSource object that delivers the XInclude-expanded XML infoset to the XSD validation pipeline.

I hope that this post was useful.
