Saturday, April 14, 2012

XSD 1.1 is now a W3C standard

I had been looking forward (as, I hope, had many others) to XSD 1.1 becoming a W3C Recommendation (REC), and it did so on 5th April 2012. This was a long wait for the XSD community, but it has finally happened. The previous XSD standard (XSD 1.0 2nd edition) dates back to 2004.

XSD 1.1 implementations: There currently seem to be two XSD 1.1 implementations, Xerces and Saxon. Xerces is a project from the Apache Software Foundation's XML activity, and Saxon is a product set from Saxonica (via Michael Kay). Both of these implementations pass close to 100% of the W3C XSD 1.1 test suite, so they can be regarded as reliable implementations of the XSD 1.1 standard.

For readers who are not aware of them, the following change lists summarize what is new in the XSD 1.1 language. These lists are non-normative but fairly complete; for the complete set of changes in XSD 1.1 with respect to XSD 1.0, you'll have to read the whole of the XSD 1.1 specification:

XSD 1.1 Structures specification: http://www.w3.org/TR/xmlschema11-1/#changes
XSD 1.1 Datatypes specification: http://www.w3.org/TR/xmlschema11-2/#changes

Wishing the XSD folks (and the wider XML community) happy reading of the new XSD 1.1 specification, and happy experimenting with the available implementations :)

Saturday, February 25, 2012

modular XML instances and modular XSD schemas

I have lately been playing with some design options for constructing modular XML instance documents and modular XSD schema documents, and thought I would write up my findings as a blog post here.

I believe the following are the primary concepts for constructing modular XML documents (and XSD schemas) when XSD validation is involved:
1. Modularize XML documents using the XInclude construct.
2. Modularize an XSD document via <xs:include> and <xs:import>. The <xs:include> construct is the main modularity mechanism in XSD schemas, and <xs:import> is needed (necessary in XSD 1.0, and optional in XSD 1.1) to compose (and also to modularize) XSD schemas coming from two or more distinct XML namespaces.

I don't intend to delve much in this post into the XSD constructs <xs:include> and <xs:import>, since these are well known within the XSD and XML communities. Instead, I focus primarily on XML document modularization via the XInclude construct, and present a few thoughts about design options for validating such XML instance documents with XSD (I don't claim to cover every design option for these use cases, but I feel that I cover a few of the important ones).

What is XInclude?
This is an XML standards specification that defines how to modularize the information in any XML document. The primary construct of XInclude is the <xi:include> XML element. Following is a small example of an XInclude-aware XML document,

z.xml

<z xmlns:xi="http://www.w3.org/2001/XInclude">
    <xi:include href="x.xml"/>
    <xi:include href="y.xml"/>
</z>

x.xml

<x>
    <a>1</a>
    <b>2</b>
</x>

y.xml

<y>
    <p>5</p>
    <q>6</q>
</y>

We'll supply the XML document z.xml above, which is composed from other XML documents via XInclude meta-data, to an XSD validator for validation.
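To make concrete what "expanding" the XInclude meta-data means, here is a minimal Java sketch. It assumes the JAXP DOM parser bundled with the JDK (rather than a separately installed Xerces), and the class and method names are hypothetical. It writes the three documents above to a temporary directory, parses z.xml with XInclude processing switched on, and lists the element children of "z" after expansion.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.SAXException;

// Hypothetical helper class, for illustration only.
class XIncludeExpansionSketch {

    private static void write(Path file, String content) throws IOException {
        Files.write(file, content.getBytes(StandardCharsets.UTF_8));
    }

    // Returns the names of the element children of <z> once the
    // <xi:include> elements have been replaced by the included documents.
    static List<String> expandedChildNames() {
        try {
            Path dir = Files.createTempDirectory("xinclude-sketch");
            write(dir.resolve("x.xml"), "<x><a>1</a><b>2</b></x>");
            write(dir.resolve("y.xml"), "<y><p>5</p><q>6</q></y>");
            Path z = dir.resolve("z.xml");
            write(z, "<z xmlns:xi=\"http://www.w3.org/2001/XInclude\">"
                   + "<xi:include href=\"x.xml\"/>"
                   + "<xi:include href=\"y.xml\"/>"
                   + "</z>");

            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);   // XInclude processing needs namespace awareness
            dbf.setXIncludeAware(true);    // expand <xi:include> while parsing
            Document doc = dbf.newDocumentBuilder().parse(z.toFile());

            List<String> names = new ArrayList<>();
            for (Node n = doc.getDocumentElement().getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() == Node.ELEMENT_NODE) {
                    names.add(n.getNodeName());
                }
            }
            return names;
        } catch (IOException | ParserConfigurationException | SAXException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(expandedChildNames());
    }
}
```

With XInclude processing left off, the same loop would instead see the two xi:include elements, which is the "unexpanded" view that some of the schemas in this post are written for.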

What I discuss here are the XSD schema design options for validating an XML instance document like z.xml above. Following are the XSD design options (each of which leads to successful validation of the XML instance) that currently come to my mind for this need, along with some explanation of the corresponding design rationale:

XS1:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

    <xs:element name="z">
          <xs:complexType>
               <xs:sequence>
                    <xs:any processContents="skip" minOccurs="2" maxOccurs="2"/>
               </xs:sequence>
          </xs:complexType>
    </xs:element>
   
</xs:schema>

This schema is written with a view that the XML document (i.e z.xml) would be validated with XInclude meta-data unexpanded. The xs:any wild-card in this schema validates each child of "z" only weakly: it requires just that *some* XML element be present at that position in the instance document, and places no other constraint on it. It therefore accepts the <xi:include> elements that stand in for the included XML document roots (i.e XML elements "x" and "y").

XS2:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

        <xs:element name="z">
                <xs:complexType>
                     <xs:complexContent>
                         <xs:restriction base="T1">
                              <xs:sequence>
                                   <xs:element name="include"  minOccurs="2" maxOccurs="2" targetNamespace="http://www.w3.org/2001/XInclude"/>
                             </xs:sequence>
                         </xs:restriction>
                    </xs:complexContent>
                </xs:complexType>
        </xs:element>
   
    <xs:complexType name="T1" abstract="true">
          <xs:sequence>
               <xs:any processContents="skip" maxOccurs="unbounded"/>
          </xs:sequence>
    </xs:complexType>
   
</xs:schema>

This schema is also written with a view that the XML document (i.e z.xml) would be validated with XInclude meta-data unexpanded. But it specifies slightly stronger XSD validation constraints than the previous example (stronger in the sense that this schema declares an XML element and specifies its name and namespace). This schema needs an XSD 1.1 processor, since the element declaration specifies a "targetNamespace" attribute. An XSD 1.0 version of this design approach is possible; it would use an XSD <xs:import> element to import XSD components from the XInclude namespace.

XS3:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

       <xs:element name="z">
              <xs:complexType>
                  <xs:sequence>
                       <xs:any processContents="skip" minOccurs="2" maxOccurs="2" namespace="http://www.w3.org/2001/XInclude"/>
                 </xs:sequence>
                 <xs:assert test="count(*[local-name() = 'include']) = 2"/>
                 <xs:assert test="deep-equal((*[1] | *[2])/@*/name(), ('href','href'))"/>
             </xs:complexType>
      </xs:element>
   
</xs:schema>

This schema is also written with a view that the XML document (i.e z.xml) would be validated with XInclude meta-data unexpanded. But it enforces XSD validation even more strongly than the example "XS2" above (since it also requires the XInclude attribute "href" to be present on the XInclude meta-data, which the previous XSD schema doesn't enforce). This schema validates the names of the XML instance elements that are intended to be XInclude meta-data via XSD 1.1 <assert> elements (this may not be the best XSD validation approach, but such an XSD design idiom is now possible with the XSD 1.1 language).

XS4:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

    <xs:element name="z">
         <xs:complexType>
               <xs:sequence>
                    <xs:element name="x">
                         <xs:complexType>
                             <xs:sequence>
                                  <xs:element name="a" type="xs:integer"/>
                                 <xs:element name="b" type="xs:integer"/>
                            </xs:sequence>
                        </xs:complexType>
                    </xs:element>
                    <xs:element name="y">
                         <xs:complexType>
                             <xs:sequence>
                                  <xs:element name="p" type="xs:integer"/>
                                  <xs:element name="q" type="xs:integer"/>
                             </xs:sequence>
                        </xs:complexType>
                   </xs:element>
              </xs:sequence>
         </xs:complexType>
     </xs:element>
   
</xs:schema>

This schema is written with a view that the XML document (i.e z.xml) would be validated with XInclude meta-data expanded. It specifies the strongest XSD validation constraints of the four approaches (strongest in the sense that the internal structure of the XML elements "x" and "y" is now completely specified by the XSD document).

But to make this XSD validation approach work, the XInclude meta-data needs to be expanded, and the expanded XML infoset needs to be supplied to the XSD validator. This requires an XInclude processor (like Apache Xerces) that plugs into the XML parsing stage to expand the <xi:include> tags.

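For the interest of readers, following is a small Java class that enables XInclude processing and supplies the resulting XML infoset (i.e after the XInclude meta-data expansion) to the XSD validator,

import java.io.IOException;

import javax.xml.XMLConstants;
import javax.xml.transform.sax.SAXSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

public class XIncludeValidation {

     public static void main(String[] args) {
          String xsdUri = args[0];   // URI of the schema document (e.g. the XS4 schema)
          String xmlUri = args[1];   // URI of the instance document (e.g. z.xml)
          try {
               // Assumption: the generic JAXP lookup is used here; substitute the
               // SchemaFactory configuration of the validator you actually run.
               SchemaFactory schemaFactory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
               Schema schema = schemaFactory.newSchema(getSaxSource(xsdUri, false));
               Validator validator = schema.newValidator();
               validator.setErrorHandler(new ValidationErrHandler());
               validator.validate(getSaxSource(xmlUri, true));
          }
          catch (SAXException se) {
               se.printStackTrace();
          }
          catch (IOException ioe) {
               ioe.printStackTrace();
          }
     }

     // Wraps a document URI in a SAXSource. For instance documents the reader
     // expands <xi:include> elements before the validator sees the infoset.
     private static SAXSource getSaxSource(String docUri, boolean isInstanceDoc) throws SAXException {
          XMLReader reader = XMLReaderFactory.createXMLReader();
          if (isInstanceDoc) {
               reader.setFeature("http://apache.org/xml/features/xinclude", true);
               // don't add xml:base attributes to the included elements
               reader.setFeature("http://apache.org/xml/features/xinclude/fixup-base-uris", false);
          }
          return new SAXSource(reader, new InputSource(docUri));
     }

     // Reports validation errors on stderr instead of aborting the validation.
     static class ValidationErrHandler implements ErrorHandler {

          public void error(SAXParseException spe) throws SAXException {
               System.err.println(getFormattedMesg(spe.getSystemId(), spe.getLineNumber(), spe.getColumnNumber(), spe.getMessage()));
          }

          public void fatalError(SAXParseException spe) throws SAXException {
               System.err.println(getFormattedMesg(spe.getSystemId(), spe.getLineNumber(), spe.getColumnNumber(), spe.getMessage()));
          }

          public void warning(SAXParseException spe) throws SAXException {
               // NO-OP
          }

     }

     private static String getFormattedMesg(String systemId, int lineNo, int colNo, String mesg) {
          return systemId + ", line " + lineNo + ", col " + colNo + " : " + mesg;
     }

}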

Summary: Is devising the various XSD design approaches above beneficial for an XSD schema design that involves validating XML instance documents containing <xi:include> meta-data directives? My thinking about the XSD validation options presented above was driven by the following concerns:
1) Providing various degrees of XSD validation strength for <xi:include> directives (essentially the un-expanded and expanded modes).
2) Exploring some of the new XML validation idioms offered by the XSD 1.1 language for the use cases presented above (essentially using the "targetNamespace" attribute on xs:element elements, and using <assert> elements).
3) Exploring the Java SAX and JAXP APIs to enable XInclude meta-data expansion, and providing a SAXSource object that carries the XInclude-expanded XML infoset, which is then supplied to the XSD validation pipeline.

I hope that this post was useful.

Sunday, February 5, 2012

"castable as" vs "instance of" XPath 2.0 expressions for XSD 1.1 assertions

I'm continuing with my thoughts related to my previous blog post (ref, http://mukulgandhi.blogspot.in/2012/01/using-xsd-11-assertions-on-complextype.html). The earlier post used the XPath 2.0 "castable as" expression to do some checks on the 'untyped' data of complexType's mixed content (essentially finding if the string/untyped value in an XML instance document is a lexical representation of an xs:integer value).

This post talks about the use of XPath 2.0 "instance of" vs "castable as" expressions in context of XSD 1.1 assertions -- essentially providing guidance about when it may be necessary to use one of these expressions.

The XSD 1.1 "castable as" use case was discussed in my earlier blog post. Here I essentially talk about "instance of" expression when used with XSD 1.1 assertions.

Let's assume that there is an XML instance document like following (XML1):

<X>
   <elem>
     <a>20</a>
     <b>30</b>
   </elem>
   <elem>
     <a>10</a>
     <b>2005-10-07</b>
   </elem>
</X>

The XSD schema should express the following constraints with respect to the above XML instance document (XML1):
1. The elements "a" and "b" can each be typed as an xs:integer or an xs:date (therefore we'll express this with an XSD simpleType with variety 'union').
2. If both the elements "a" and "b" are of type xs:integer (this is allowable as per the simpleType definition described in point 1 above), then the numeric value of element "a" should be less than the numeric value of element "b".
3. If one of the elements "a" or "b" is an xs:integer and the other one is an xs:date, then we would like to express the following constraints,
   - the numeric XML instance value of the xs:integer typed element should be less than 100
   - the xs:date XML instance value should be less than the current date

The following XSD (1.1) schema document describes all of the above validation constraints for a sample XML instance document (XML1) provided above:

[XS1]

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
   
     <xs:element name="X">
        <xs:complexType>
           <xs:sequence>
              <xs:element name="elem" maxOccurs="unbounded">
                 <xs:complexType>
                    <xs:sequence>
                       <xs:element name="a" type="union_of_date_and_integer"/>
                       <xs:element name="b" type="union_of_date_and_integer"/>
                    </xs:sequence>
                    <xs:assert test="if ((data(a) instance of xs:integer) and (data(b) instance of xs:integer))
                                              then (data(a) lt data(b))
                                           else if (not(deep-equal(data(a), data(b))))
                                              then (*[data(.) instance of xs:integer]/data(.) lt 100
                                                         and
                                                      *[data(.) instance of xs:date]/data(.) lt current-date())
                                              else true()"/>
                 </xs:complexType>
              </xs:element>
           </xs:sequence>
        </xs:complexType>
     </xs:element>
   
     <xs:simpleType name="union_of_date_and_integer">
        <xs:union memberTypes="xs:date xs:integer"/>
     </xs:simpleType>
   
</xs:schema>

I think it may be interesting for readers to know why I wrote an assertion like the one above. Following are a few of my thoughts,
1. Since the XML elements "a" and "b" are typed with a simpleType 'union', for an assertion to access the XML instance atomic values that were validated by such a simpleType we need to use the XPath 2.0 "data" function on the relevant XDM node (elements "a" and "b" in this case). Further, to determine whether the XML document's atomic instance value is typed as xs:integer, we need to use the "instance of" expression -- "castable as" is not needed in this case, since the instance document's data is already typed.
2. The rest of the assertion implements what is mentioned in the requirements above.

If you would like more visual and/or design elegance than the single assertion above offers, feel free to break the assertion rules into two or more assertions.

I would also like to show another XSD 1.1 assertions example, which uses neither an XPath 2.0 "castable as" nor an "instance of" expression. It demonstrates that if the XDM node an assertion operates on is already typed, it is usually unnecessary to use the "castable as" expression (since "castable as" is essentially useful to programmatically enforce typing on string/untyped values), although an "instance of" expression may still be needed in some cases.

Following is a slightly modified variant of the XML instance document specified above (XML1):

[XML2]

<X>
   <elem>
     <a>2</a>
     <b>2012-02-04</b>
   </elem>
   <elem>
     <a>10</a>
     <b>2005-10-07</b>
   </elem>
</X>

The XSD schema should express the following constraints with respect to the above XML instance document (XML2):
1. The element "a" is typed as an xs:nonNegativeInteger value, and element "b" is typed as xs:date.
2. If the number of days given by the numeric value of element "a" is added to the xs:date value of element "b", the resulting xs:date value must be less than the current date.

The following XSD (1.1) schema document describes all of the above validation constraints for a sample XML instance document (XML2) provided above:

[XS2]

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
   
     <xs:element name="X">
        <xs:complexType>
           <xs:sequence>
              <xs:element name="elem" maxOccurs="unbounded">
                 <xs:complexType>
                    <xs:sequence>
                       <xs:element name="a" type="xs:nonNegativeInteger"/>
                       <xs:element name="b" type="xs:date"/>
                    </xs:sequence>
                    <xs:assert test="(b + xs:dayTimeDuration(concat('P', a, 'D'))) lt current-date()"/>
                 </xs:complexType>
              </xs:element>
           </xs:sequence>
        </xs:complexType>
     </xs:element>
   
</xs:schema>
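The arithmetic performed by the assertion in XS2 can be mimicked outside XPath. The following is only a sketch in plain Java using java.time, under the assumption that the xs:date values carry no timezone suffix; the class and method names are hypothetical, and "today" is passed in as a parameter so that the result is reproducible.

```java
import java.time.LocalDate;

// Hypothetical helper class, for illustration only.
class DatePlusDaysSketch {

    // Mirrors: (b + xs:dayTimeDuration(concat('P', a, 'D'))) lt current-date()
    // a     - the xs:nonNegativeInteger value of element "a"
    // b     - the lexical xs:date value of element "b" (assumed to have no timezone)
    // today - stands in for current-date()
    static boolean assertionHolds(long a, String b, LocalDate today) {
        LocalDate shifted = LocalDate.parse(b).plusDays(a);
        return shifted.isBefore(today);
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.now();
        System.out.println(assertionHolds(2, "2012-02-04", today));
        System.out.println(assertionHolds(10, "2005-10-07", today));
    }
}
```

Because the assertion depends on current-date(), the validity of an instance can change over time: the first "elem" of XML2 (2012-02-04 plus 2 days) only satisfies the assertion when evaluated on 2012-02-07 or later.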

That's all I had to say today.

I hope this post was useful.

Thursday, January 26, 2012

Using XSD 1.1 assertions on complexType mixed contents

Some interesting ;) thoughts have been coming to my mind lately, and not surprisingly they are again related to XSD. I was playing with XSD 1.1 assertions once again, trying to constrain an XSD complexType{mixed} content model, and I'm sharing some of my findings ... (I guess I hadn't written about this particular topic before, on this blog or on any other forum. If you find that this post duplicates anything I might have written elsewhere, kindly ignore the earlier writing.) Now to the topic.

What is XSD mixed content (you may ignore reading this, if you already know about this)?
 I believe this isn't really an XSD-only topic. Mixed content exists in plain XML: a good old well-formed XML document may have "mixed" content and needn't be validated at all (i.e in a schema-free XML environment). What XSD adds is the ability to report such an XML instance document as 'valid' (more importantly, XSD would report a "mixed" content XML instance as 'invalid' if it is validated against an "element only" content model specified by an XSD complexType definition), and also to constrain XML mixed content in certain ways (particularly with XSD 1.1 in some new ways, which I'll try to talk about further below).

Example of "element only" (content of element "X" here) XML content model [X1]:

<X>
  <Y/>
  <Z/>
</X>
Example of "mixed content" (content of element "X" here) XML content model [X2]: 

<X>
  abc
  <Y/>
  123
  <Z/>
  654
</X> 

Therefore, "mixed content" allows non-whitespace text nodes as siblings of element nodes.

XSD 1.0 schema definition that allows "mixed" content [XS1]:
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

    <xs:element name="X">    
        <xs:complexType mixed="true">
             <xs:sequence>
                 <xs:element name="Y"/>
                 <xs:element name="Z"/>
             </xs:sequence>
        </xs:complexType>
    </xs:element>
    
</xs:schema>

This schema (XS1) would report the XML document "X2" above as 'valid' (since that instance document has "mixed" content, and this schema allows "mixed" content via the property "mixed = 'true'" on the complexType definition).

But if, in the schema document "XS1" above, we remove the property specifier "mixed = 'true'" or set the value of the attribute "mixed" to 'false' (which is also the default value of this attribute), then the modified schema would report the XML instance document "X2" above as 'invalid' (but the XML document "X1" above would be reported as 'valid' -- since it doesn't have "mixed" content).
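The behaviour described above can be checked with the XSD 1.0 validator that ships with the JDK. The following is a minimal sketch; the class name and the helper methods are hypothetical, and the schema is the "XS1" schema above with the value of the "mixed" attribute passed in as a parameter.

```java
import java.io.IOException;
import java.io.StringReader;

import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;

import org.xml.sax.SAXException;

// Hypothetical helper class, for illustration only.
class MixedContentSketch {

    static final String X1 = "<X><Y/><Z/></X>";                  // element-only content
    static final String X2 = "<X> abc <Y/> 123 <Z/> 654 </X>";   // mixed content

    // The "XS1" schema, with the "mixed" attribute set from the parameter.
    static String schema(boolean mixed) {
        return "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
             + "<xs:element name=\"X\">"
             + "<xs:complexType mixed=\"" + mixed + "\">"
             + "<xs:sequence><xs:element name=\"Y\"/><xs:element name=\"Z\"/></xs:sequence>"
             + "</xs:complexType>"
             + "</xs:element>"
             + "</xs:schema>";
    }

    // Returns true if the instance is valid against the schema. With no error
    // handler set, the validator throws a SAXException on the first validity error.
    static boolean isValid(String instance, boolean mixed) {
        try {
            SchemaFactory sf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema s = sf.newSchema(new StreamSource(new StringReader(schema(mixed))));
            s.newValidator().validate(new StreamSource(new StringReader(instance)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("X2, mixed=true  : " + isValid(X2, true));
        System.out.println("X2, mixed=false : " + isValid(X2, false));
        System.out.println("X1, mixed=false : " + isValid(X1, false));
    }
}
```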

New capabilities provided by XSD 1.1 to constrain XML "mixed" content further:

Following are the new things XSD 1.1 makes possible for XML "mixed" content that currently come to my mind,

a)

XSD 1.1 schema "XS2":
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

    <xs:element name="X">    
       <xs:complexType mixed="true">
          <xs:sequence>
             <xs:element name="Y"/>
             <xs:element name="Z"/>
          </xs:sequence>          
          <xs:assert test="deep-equal(text()[matches(.,'\w')]/normalize-space(.), ('abc','123','654'))"/>
       </xs:complexType>
    </xs:element>
    
</xs:schema>
The <assert> element in this schema (XS2) constrains the mixed content in the XML instance document to be a list (with the order of list items being significant) of only a few specified values. The assertion is written only to illustrate the technical capabilities of assertions here, not with any application in mind.
Following are a few other things that XSD 1.1 assertions can achieve in the context of an XML "mixed" content model:

b)
<xs:assert test="((text()[matches(.,'\w')]/normalize-space(.))[2] castable as xs:integer)
                    and
                 ((text()[matches(.,'\w')]/normalize-space(.))[3] castable as xs:integer)"/>

This assertion constrains specific items of an XML "mixed" content model list to be of a specified XSD schema type -- here the 2nd and 3rd items of the list need to be typed as xs:integer, whereas the first item is "untyped".

c)
<xs:assert test="count((text()[matches(.,'\w')]/normalize-space(.))[. castable as xs:integer])
                    =
                 count(text()[matches(.,'\w')]/normalize-space(.))"/>

This assertion constrains all items of the XML "mixed" content model list to be of the same type (xs:integer in this case) -- this uses a well defined pattern "count of xs:integer items is equal to the count of all the items".

d)
<xs:assert test="every $x in text()[matches(.,'\w')][position() gt 1]
                   satisfies 
                (number(normalize-space($x)) gt number($x/preceding-sibling::text()[matches(.,'\w')][1]))"/>

This assertion constrains the list of XML "mixed" content items to be in ascending numeric order (assuming that all items in the list are numeric; it should also be possible to handle a heterogeneously typed list, by enforcing the numeric order only on the numeric list items).
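To make the "list of mixed content items" that these assertions operate on more tangible, here is a rough emulation in plain Java and DOM. It is not XPath 2.0: the regular expressions and the integer check are simplified stand-ins chosen for this sketch (Java's \w is narrower than the XPath/XSD \w), and all names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, for illustration only.
class MixedTextItemsSketch {

    private static final Pattern WORD_CHAR = Pattern.compile("\\w");

    // Roughly text()[matches(.,'\w')]/normalize-space(.) on the root element.
    static List<String> textItems(String xml) {
        try {
            Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
            List<String> items = new ArrayList<>();
            for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() == Node.TEXT_NODE && WORD_CHAR.matcher(n.getNodeValue()).find()) {
                    items.add(n.getNodeValue().trim().replaceAll("\\s+", " "));
                }
            }
            return items;
        } catch (IOException | ParserConfigurationException | SAXException e) {
            throw new IllegalStateException(e);
        }
    }

    // Simplified stand-in for ". castable as xs:integer" on a normalized string.
    static boolean looksLikeInteger(String s) {
        return s.matches("[-+]?[0-9]+");
    }

    // Stand-in for assertion (d): every item is greater than the one before it.
    // Assumes every item passes looksLikeInteger.
    static boolean ascending(List<String> items) {
        for (int i = 1; i < items.size(); i++) {
            if (new BigInteger(items.get(i)).compareTo(new BigInteger(items.get(i - 1))) <= 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> items = textItems("<X> abc <Y/> 123 <Z/> 654 </X>");
        System.out.println(items);
        System.out.println(ascending(items.subList(1, items.size())));
    }
}
```

For the document "X2", the extracted list is the same three-item list that the assertion in (a) compares against, and only its 2nd and 3rd items pass the integer check, as required by (b).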

Summary: XSD 1.0 allowed only "untyped" XML mixed content, which was uniformly permitted anywhere within the scope of an XML element validated by an XSD complexType. No further constraints on "mixed" content were possible in an XSD 1.0 environment. XSD 1.1 allows some new ways to constrain XML "mixed" content further (some of these capabilities were illustrated in the examples above). In my opinion, the likely benefit of constraining XML "mixed" content in some of the ways illustrated above is that it allows XML document authors to model certain semantic content within "mixed" content and make this knowledge available to XML applications. All examples above were tested with Apache Xerces (I hope that these examples would also work with other XSD validators, notably Saxon, which currently also supports XSD 1.1).

I hope that this information was useful.



Tuesday, August 23, 2011

XPath 2.0 and XSD schemas : sharing experiences

I was just playing with XPath 2.0 and thought of sharing my observations, about a specific use case.

We start with the following XSD schema document,
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    
    <xs:element name="X">
       <xs:complexType>
          <xs:sequence>
             <xs:element name="a" type="xs:integer"/>
          </xs:sequence>
          <xs:attribute name="att1" type="xs:boolean"/>
       </xs:complexType>
    </xs:element> 
</xs:schema>

This schema intends to validate an XML instance document like following,
<X att1="0">
  <a>100</a>
</X>

I wrote an XPath (2.0) expression like the following [1],

/X[if (@att1) then true() else false()]/a/text()

and ran it after enabling validation of the input document.

I thought that this would not return any result (i.e an empty sequence).

But the XPath expression above ([1]) returns the result "100". At first I was a little surprised by this result. I thought that since attribute "att1" was declared with type xs:boolean in the schema, the "if condition" should return 'false' in this case. But that's not the correct interpretation of the XPath expression written above ([1]). Following is a little more explanation about this.

The reference @att1 in the XPath expression above (i.e if (@att1) ..) is a node reference (an attribute node) and not a boolean value. I initially thought otherwise, and I was wrong: I incorrectly assumed that atomization of the expression @att1 would take place in this case (more about this below).

The XPath 2.0 spec says that if the first item of a sequence is a node, then the effective boolean value of that sequence is 'true' (this holds whether or not the input XML document was validated against the XSD schema). In an expression like the one above (i.e if (@att1) ..), the effective boolean value of the sequence {@att1} determines whether the "if condition" returns 'true'. Here the sequence has one item (which is therefore also its first item), an attribute node named "att1", so the effective boolean value is 'true' and hence the XPath predicate evaluates to 'true'. I think this explains why the "if condition" {if (@att1)} returns true for the above XML instance document (even if it was validated by the schema given above, and the XPath 2.0 expression above [1] was run in a schema aware mode).
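The "a node in the condition means true" rule can be observed even without an XPath 2.0 processor. The sketch below uses the XPath 1.0 engine that ships with the JDK, which is not schema-aware and has no "if" expression, so it is only an approximation of the setup described above: a predicate [@att1] plays the role of the "if condition", and a string comparison stands in for the typed comparison. The class name is hypothetical.

```java
import java.io.StringReader;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
class NodeExistenceSketch {

    static final String XML = "<X att1=\"0\"><a>100</a></X>";

    // Evaluates an XPath 1.0 expression against XML and returns the
    // string value of the result.
    static String eval(String expr) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expr, new InputSource(new StringReader(XML)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The predicate is true because the attribute node exists,
        // regardless of the attribute's value "0".
        System.out.println(eval("/X[@att1]/a/text()"));
        // Comparing the attribute's value selects nothing for att1="0".
        System.out.println(eval("/X[@att1 = 'true' or @att1 = '1']/a/text()").isEmpty());
    }
}
```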

What I wanted was for the "if condition" to return 'true' if the instance document had the value true/1 for the attribute and 'false' otherwise, given that XSD validation of the instance document took place prior to the evaluation of the XPath expression. To express this correctly, the XPath expression needs to be modified to either of the following styles [2],

/X[if (data(@att1)) then true() else false()]/a/text()

OR

/X[if (@att1 = true()) then true() else false()]/a/text()

To understand why the expressions given above ([2]) work correctly, one needs to understand two things described by the XPath 2.0 spec. The first correct variant relies on the XPath 2.0 "data" function, which returns the typed value of its argument. The second correct variant relies on the process of atomization: in this case the attribute node "att1" is atomized to return a sequence of kind {xs:boolean}.

That's all about this. I hope that my experience with this may be helpful to someone (to understand this, one just has to know the XPath [2.0] spec well, and how it interacts with XSD schemas!).

Thanks for reading this post.

@2011-11-11: updated in place, to correct a few factual errors.

Tuesday, July 26, 2011

[revisiting] Xerces-J XSModel serializer

I started playing a bit with the Xerces-J XSSerializer utility, and thought of writing something about its features. (It's actually a sample within Xerces-J that was introduced in Xerces-J 2.10.0; the version in SVN is slightly better and will be released with a future Xerces release. It serializes an in-memory Xerces XSModel instance into lexical XSD syntax.)

The XSModel serializer has the following two important (and currently the only) serialization features/options:
1. Selecting the XSD language version that the XSModel serializer should work with. By default this is XSD 1.0, but it can be set to XSD 1.1 via the following command line parameter, {-version 1.1}. There are very few XSD 1.1 features that the XSModel serializer currently supports. We'll try to add more XSD 1.1 features to the XSModel serializer in future. But the XSD 1.0 support in Xerces's XSModel serializer is fairly complete.
2. The XSD language prefix in the serialization output can be configured with the option {-prefix <prefix-value>}, for example "-prefix xsd". If this option is not specified, the prefix "xs" is generated by default during XSModel instance serialization.

I've made a few interesting observations while using the Xerces XSSerializer (illustrated with small examples below),

1. I supplied the following XSD document (only the element declaration is shown, since this is the focus of this point) to the XSModel serializer,
<xs:element name="E1">
   <xs:simpleType>
      <xs:list>
         <xs:simpleType>
           <xs:restriction base="xs:string">
              <xs:minLength value="5"/>
           </xs:restriction>
         </xs:simpleType>
      </xs:list>
   </xs:simpleType>
</xs:element>

and the XSModel serializer echoed this element declaration (it converted the lexical schema into an XSModel instance, and then serialized the XSModel back to lexical XSD syntax) as the following,
<xs:element name="E1">
   <xs:simpleType>
      <xs:list>
         <xs:simpleType>
            <xs:restriction base="xs:string">
               <xs:whiteSpace value="preserve"/>
               <xs:minLength value="5"/>
            </xs:restriction>
         </xs:simpleType>
      </xs:list>
   </xs:simpleType>
</xs:element>

The interesting thing I notice in this example is the generation of the built-in facet "whiteSpace" for the XSD type xs:string.

2. Serializing the following XSD element,
<xs:element name="E1">
   <xs:simpleType>
      <xs:list>
         <xs:simpleType>
            <xs:restriction base="xs:integer">
               <xs:minInclusive value="5"/>
            </xs:restriction>
         </xs:simpleType>
      </xs:list>
   </xs:simpleType>
</xs:element>
produces the following round-trip output with the XSModel serializer,
<xs:element name="E1">
   <xs:simpleType>
      <xs:list>
         <xs:simpleType>
            <xs:restriction base="xs:integer">
               <xs:whiteSpace value="collapse"/>
               <xs:fractionDigits value="0"/>
               <xs:minInclusive value="5"/>
               <xs:pattern value="[\-+]?[0-9]+"/>
            </xs:restriction>
         </xs:simpleType>
      </xs:list>
   </xs:simpleType>
</xs:element>
This shows the built-in facets of the XSD type xs:integer ("whiteSpace", "fractionDigits" and others).

I personally like this feature of the XSModel serializer: it is able to generate certain hidden properties of XML Schema components, which schema authors normally don't specify while writing schema documents for applications.

3. I provided the following XSD Schema fragment to XSModel serializer (a complexType referring to a model group),
<xs:element name="E1">
  <xs:complexType>
     <xs:group ref="gp1"/>
  </xs:complexType>
</xs:element>
   
<xs:group name="gp1">
   <xs:sequence>
      <xs:element name="x" type="xs:string"/>
      <xs:element name="y" type="xs:string"/>
   </xs:sequence>
</xs:group>

and the XSModel serializer generated the following round-trip serialization result,
<xs:element name="E1">
   <xs:complexType>
      <xs:sequence>
         <xs:element name="x" type="xs:string"/>
         <xs:element name="y" type="xs:string"/>
      </xs:sequence>
   </xs:complexType>
</xs:element>

<xs:group name="gp1">
   <xs:sequence>
      <xs:element name="x" type="xs:string"/>
      <xs:element name="y" type="xs:string"/>
   </xs:sequence>
</xs:group>
The global "model group" is serialized as expected. But the complexType within the element declaration was serialized with its element declarations expanded. The lexical group reference is not present in the serialized output.

At first the absence of the model group reference in the serialized output may look odd. But the fact is that a Xerces XSModel instance, in its complete compiled form, doesn't know whether a group particle (xs:sequence in this case) came from a group reference, and I had to live with this characteristic of XSModel serialization. The serialized schema in this example is nevertheless equivalent, from a validation perspective, to the original schema document that was supplied to the XSModel serializer. The global group definition in the output is redundant from a validation perspective; it is just a characteristic of the XSModel serializer currently.

That's all I have to say now. Thanks for reading this post.

Saturday, June 4, 2011

Dealing with multiple roots within an XML Schema

I've been thinking on this problem for a while, and have collected some opinions, which I'm presenting here.

We'll be working with the following XML Schema documents:

a.xsd [1]
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

    <xs:element name="x" type="xs:string"/>

    <xs:element name="y" type="xs:string"/>

    <xs:element name="z">
       <xs:complexType>
          <xs:sequence>
             <xs:element ref="x"/>
             <xs:element ref="y"/>
          </xs:sequence>
       </xs:complexType>
    </xs:element>

</xs:schema>

b.xsd [2]
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

    <xs:include schemaLocation="c.xsd"/>

    <xs:element name="z">
       <xs:complexType>
          <xs:sequence>
             <xs:element ref="x"/>
             <xs:element ref="y"/>
          </xs:sequence>
       </xs:complexType>
    </xs:element>

</xs:schema>

c.xsd [3]
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

    <xs:element name="x" type="xs:string"/>

    <xs:element name="y" type="xs:string"/>

</xs:schema>
The schema documents [1] and [2] are equivalent for the purpose of validating an XML instance document (it's just that the schema document b.xsd includes c.xsd).

Our application requires the following XML document to be successfully validated, by the schemas [1] or [2] above:

z.xml [4]
<z>
   <x>hello</x>
   <y>world</y>
</z>
All of this is just fine, and XML document [4] gets successfully validated by the schemas [1] or [2] above.

But the above schema design (either [1] or [2]) may sometimes present the following problem:

The side effect of schema documents [1] or [2] is to also successfully validate the following XML documents,

<x>...</x>

OR

<y>...</y>
This happens because elements "x" and "y" are also valid roots according to the schema (due to the global declarations of elements "x" and "y"). But the purpose of defining elements "x" and "y" in the schema is only to include them by reference elsewhere in the schema document (as in the element declaration "z" in schemas [1] or [2]).
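
The side effect can be reproduced with the standard JAXP validation API (javax.xml.validation). The following is only a minimal sketch: schema [1] is inlined as a string, and the class name RootDemo and the method name validates are illustrative names that are not part of the original setup.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;

import org.xml.sax.SAXException;

class RootDemo {

    // schema [1] (a.xsd) from this post, inlined as a string for the sketch
    static final String A_XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='x' type='xs:string'/>"
      + "<xs:element name='y' type='xs:string'/>"
      + "<xs:element name='z'><xs:complexType><xs:sequence>"
      + "<xs:element ref='x'/><xs:element ref='y'/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    // returns true if the XML text is valid against the schema text
    static boolean validates(String xsd, String xml) {
        try {
            SchemaFactory sf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = sf.newSchema(new StreamSource(new StringReader(xsd)));
            // with no ErrorHandler set, validation errors are thrown as SAXException
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException ex) {
            return false; // invalid instance (or an invalid schema)
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(validates(A_XSD, "<z><x>hello</x><y>world</y></z>")); // true
        System.out.println(validates(A_XSD, "<x>hello</x>"));                    // true (the side effect)
        System.out.println(validates(A_XSD, "<w>hello</w>"));                    // false (no such declaration)
    }
}
```

The second call is the interesting one: a document whose root is "x" is accepted, although the application only ever wants "z" as a root.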

This kind of schema design is sometimes necessary for reasons of modularity (e.g. using one declaration in multiple places) and reusability (e.g. including a foreign schema in our own schema). The design becomes more beneficial as the complexity of the schema grows (e.g. more schema components, nested more deeply).

So how do we live with the following design trade-off? A schema like [1] or [2] above gives us the benefits of modularity and reusability, but it also has the side effect of validating multiple root elements, which risks the application accepting invalid XML documents (in this example, the roots "x" and "y" are invalid for the application, while the root "z" is valid).

In this use case, if we want the application to reject XML documents with roots "x" or "y" but accept documents with root "z", then in my opinion this problem cannot be solved completely with the XML Schema language (the language currently offers no way to forbid validating the top-level element of an instance document against a global element declaration).

Solving this problem requires a small non-schema solution (e.g. a Java SAX add-on alongside schema validation).

Here's a sketch of a Java SAX application which can be and-ed with the XML Schema validation (using the schemas above), to achieve the desired overall XML validation effect (i.e. successful validation for the root element "z" while prohibiting the XML roots "x" and "y"),

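import java.io.IOException;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SAXUtil extends DefaultHandler {

     // local names of the elements that are not allowed as the document root
     String[] excludedElems = new String[] {"x", "y"};

     // returns false only if the root element of the document is an excluded element
     boolean isRootElemOK(String docUri) throws IOException, ParserConfigurationException {
        boolean rootElemOk = true;

        try {
           SAXParserFactory saxParserFactory = SAXParserFactory.newInstance();
           saxParserFactory.setNamespaceAware(true);
           SAXParser saxParser = saxParserFactory.newSAXParser();
           saxParser.parse(docUri, this);
        }
        catch(SAXException ex) {
           // any other SAXException (e.g. a well-formedness error) is left
           // for the schema validation step to report
           if (ex instanceof RootElementSAXException) {
              RootElementSAXException expObj = (RootElementSAXException) ex;
              if ("100".equals(expObj.getErrCode())) {
                 rootElemOk = false;
              }
           }
        }

        return rootElemOk;
     }

     // called for the root element first; always throws, so parsing stops right there
     @Override
     public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
        if (!isElementAllowed(localName)) {
           throw new RootElementSAXException("100");   // forbidden root
        }
        throw new RootElementSAXException("101");      // allowed root
     }

     private boolean isElementAllowed(String localName) {
        boolean elemAllowed = true;
        for (int elemIdx = 0; elemIdx < excludedElems.length; elemIdx++) {
           if (localName.equals(excludedElems[elemIdx])) {
              elemAllowed = false;
              break;
           }
        }
        return elemAllowed;
     }

     // custom exception, to tell our own "stop parsing" signal apart from parser errors
     class RootElementSAXException extends SAXException {
        String errorCode;

        public RootElementSAXException(String errorCode) {
           this.errorCode = errorCode;
        }

        public String getErrCode() {
           return errorCode;
        }
     }

} // class SAXUtil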

Following is an algorithmic summary of the above Java validation add-on,

1) A SAX parser is instantiated and parsing is triggered with the parse() method.

2) The SAX parser never goes beyond the root element. The algorithm is intentionally designed this way: the SAX "startElement" callback method always throws a user-defined exception, RootElementSAXException, upon encountering the topmost element. The constructor argument of RootElementSAXException ("100" for a forbidden root, "101" for an allowed one) indicates whether the topmost element was allowed or not, which is determined by the forbidden element name list "excludedElems" defined in the above Java class.

Notes:

1) To terminate SAX parsing before the whole XML document has been parsed, a SAXException can be thrown from the SAX callback methods. A custom exception class (like RootElementSAXException in the above example) is desirable, to distinguish our application-defined exception from the built-in SAXException errors.

2) It's recommended to use the SAX API for this use case, since it'll be much more efficient than, e.g., the DOM APIs, which would load the whole XML document into memory (which doesn't look like a sensible approach to me, just to learn the name of the topmost element of an XML document).

3) The excluded element name list can be externalized from the Java application, to make the above program reusable for any kind of XML document.

4) We may use something like the Java JAXP validation APIs to help achieve the "and" of the two validation steps described here (i.e. schema validation and the SAX application step), if we want to integrate this approach into a Java application.

5) The Java code snippet presented above can be made XML namespace aware (i.e. if the XML elements are in a namespace), by considering the namespace name parameter in the SAX callback methods (e.g. the method parameter "String uri" in the startElement callback method).
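
Notes 3 to 5 can be combined into one small sketch. It is only an illustration under a few assumptions: the class RootAwareValidator, its methods and the nested RootSeen exception are hypothetical names, the excluded roots are passed in as a set of QName values (so the list lives outside the class and is namespace aware), and the schema and instance are given as strings to keep the example self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Set;

import javax.xml.XMLConstants;
import javax.xml.namespace.QName;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class RootAwareValidator {

    // schema [1] (a.xsd) from this post, inlined as a string for the sketch
    static final String A_XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='x' type='xs:string'/>"
      + "<xs:element name='y' type='xs:string'/>"
      + "<xs:element name='z'><xs:complexType><xs:sequence>"
      + "<xs:element ref='x'/><xs:element ref='y'/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    // thrown from the first startElement call to stop parsing; carries the root's name
    static class RootSeen extends SAXException {
        final QName name;
        RootSeen(QName name) { this.name = name; }
    }

    // parses only as far as the root element; returns its (namespace, local name) pair,
    // or null if the document is not well-formed up to that point
    static QName rootName(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName,
                                         Attributes attributes) throws SAXException {
                    throw new RootSeen(new QName(uri, localName));
                }
            });
            return null;
        } catch (RootSeen seen) {
            return seen.name;
        } catch (SAXException | ParserConfigurationException ex) {
            return null;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    // the "and" of the two steps: root not excluded AND instance valid against the schema
    static boolean isAccepted(String xsd, String xml, Set<QName> excludedRoots) {
        QName root = rootName(xml);
        if (root == null || excludedRoots.contains(root)) {
            return false;
        }
        try {
            SchemaFactory sf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            sf.newSchema(new StreamSource(new StringReader(xsd)))
              .newValidator()
              .validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException ex) {
            return false;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }
}
```

A caller would build the excluded set once (for example from a configuration file) and pass it in, e.g. a set holding new QName("x") and new QName("y") for the schemas of this post. The cheap root check runs first, so a forbidden root is rejected without compiling the schema at all.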

I hope that this post is useful.

2011-06-26:
The explanation I originally gave in this blog post seems to convey that allowing multiple global element declarations in XML Schema documents is an inherently bad/wrong design within the XML Schema language. One solution for prohibiting certain XML elements from being the root of an XML instance document was presented earlier in this blog post (using an additional restricted SAX parsing step in an application).

All this is fine. But I wanted to follow up on my earlier thoughts in this post, arguing now that allowing multiple global element declarations is not a bad/wrong design in the XML Schema language. One feature of the XML Schema language that requires multiple global element declarations is XML Schema "substitution groups" (i.e. one element substituting for another), and "substitution groups" are a core and important concept within the XML Schema language.

Of course, whether or not one is working with XML Schema "substitution groups", one could use the SAX add-on technique I presented earlier to prohibit certain global element declarations from validating the XML instance root element, if that suits someone's application design.
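
To make the substitution group point concrete, here is a small sketch using the JAXP validation API. The element names "shape", "circle" and "drawing", and the class name SubstitutionGroupDemo, are made up for this illustration. Both the head element ("shape") and its substitute ("circle") have to be global element declarations, so a schema using this feature necessarily has more than one global element.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;

import org.xml.sax.SAXException;

class SubstitutionGroupDemo {

    // hypothetical schema: "circle" may appear wherever "shape" is referenced
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='shape' type='xs:string'/>"
      + "<xs:element name='circle' type='xs:string' substitutionGroup='shape'/>"
      + "<xs:element name='drawing'><xs:complexType><xs:sequence>"
      + "<xs:element ref='shape' maxOccurs='unbounded'/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    // returns true if the XML text is valid against the schema above
    static boolean validates(String xml) {
        try {
            SchemaFactory sf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            sf.newSchema(new StreamSource(new StringReader(XSD)))
              .newValidator()
              .validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException ex) {
            return false;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        // "circle" substitutes for "shape" inside "drawing"
        System.out.println(validates("<drawing><shape>a</shape><circle>b</circle></drawing>"));
        // "square" is not declared anywhere, so this one is rejected
        System.out.println(validates("<drawing><square>c</square></drawing>"));
    }
}
```

Note that in this schema "shape" and "circle" are also accepted as document roots, which is exactly the multiple-roots situation discussed in this post.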