Monday, December 9, 2013

Schema aware XPath 2.0 and its uses

I've been meaning to write about this topic here for a while.

(XML) schema-aware XPath 2.0 is nice and, as far as I know, finds uses within the XPath language itself, XSLT 2.0, XQuery 1.0 and XSD 1.1. The schema referred to here means an XSD schema. The schema awareness feature of the XPath 2.0 language is defined in terms of the XSD language, and I don't think it provides bindings to schema languages other than XSD.

What are XML Schemas like XSD used for:
These are very well known concepts, I believe. Simply speaking, one of the important uses of XML Schemas is to validate XML instance documents. A validation assessment produces a validity result for the XML instance with respect to the XML schema, i.e. whether the XML instance is "valid" or "not valid". I believe this is a useful capability when using XML documents.

The XPath language, as is known, is used to fetch specific parts of XML instance documents via XPath expressions, such as XML nodes, text information and various other XML infoset items. An application can use the information retrieved via XPath expressions for various purposes: using it as is, or doing some secondary processing with it (computing something with it, storing the retrieved values in a database, transmitting them somewhere etc). The XPath 1.0 language has its own type system, which supports only a few built-in types. The set of types available within XPath 1.0 is fixed and cannot be extended by the user. The XPath 2.0 language changes this, and allows users to provide their own XML Schemas that integrate with XPath, extending the type system of the XPath language in almost unlimited ways. XPath 2.0 retains the essence of XPath 1.0, adds lots of new features and provides a new facility called schema awareness. It's this XPath schema awareness feature that I'm trying to talk about.

I would find it nice to write a few examples illustrating some of the schema awareness features of the XPath 2.0 language.

Let's assume we have an XML instance document like the following,

<shoe color="c1" size="7"/>

(this is an XML instance document representing a human's shoe, and specifies values of this shoe XML instance's attributes)

An XSD schema for such an XML instance document may be like following,

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

       <xs:element name="shoe">
            <xs:complexType>
                 <xs:attribute name="color" type="xs:string"/>
                 <xs:attribute name="size" type="ShoeSize"/>
                 <xs:attribute name="maxFootSize" type="xs:positiveInteger" fixed="12"/>
            </xs:complexType>
       </xs:element>

       <xs:simpleType name="ShoeSize">
            <xs:restriction base="xs:positiveInteger">
               <xs:maxInclusive value="10"/>
           </xs:restriction>
      </xs:simpleType>

</xs:schema>

This example is perhaps not an important use case from the real world, therefore I won't attempt to explain this domain or the names I've chosen within this example any further. The example is in fact very straightforward, and the XML instance document written above is valid as per the XSD schema given. What about the schema awareness characteristics of such XML instances and XSD schemas?

Let's consider these XPath 2.0 expressions, which are written for the above mentioned example (a few explanations are also provided alongside),
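The validity claim above can be checked with the XSD 1.0 validator that ships with the JDK (javax.xml.validation); the schema uses no XSD 1.1 features, so this should be enough. What follows is a minimal sketch: the class name ShoeValidation and the method isValid are made up for illustration, and the schema is inlined as a string only for brevity.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of any library.
final class ShoeValidation {

    // The shoe schema from this post, inlined as a string.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='shoe'><xs:complexType>"
      + "<xs:attribute name='color' type='xs:string'/>"
      + "<xs:attribute name='size' type='ShoeSize'/>"
      + "<xs:attribute name='maxFootSize' type='xs:positiveInteger' fixed='12'/>"
      + "</xs:complexType></xs:element>"
      + "<xs:simpleType name='ShoeSize'><xs:restriction base='xs:positiveInteger'>"
      + "<xs:maxInclusive value='10'/></xs:restriction></xs:simpleType>"
      + "</xs:schema>";

    // Returns true when the given XML text is valid against the shoe schema.
    static boolean isValid(String xml) {
        Schema schema;
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            // A problem in the schema itself is a programming error here.
            throw new IllegalStateException(e);
        }
        try {
            // With no ErrorHandler set, validation errors surface as SAXException.
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<shoe color='c1' size='7'/>"));
        System.out.println(isValid("<shoe color='c1' size='11'/>"));
    }
}
```

The second call uses a size above the maxInclusive facet of ShoeSize, so it is expected to be rejected.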

1) @size instance of attribute(*, xs:positiveInteger)

returns 'true' upon evaluation, since the LHS of the 'instance of' operator is an XML attribute whose type annotation (ShoeSize) is derived from xs:positiveInteger.

2) @size instance of attribute(*, ShoeSize)

returns 'true' upon evaluation, since the LHS of the 'instance of' operator is an XML attribute with the dynamic type ShoeSize (the XSD simple type ShoeSize is also an xs:positiveInteger, but its value space is smaller than that of xs:positiveInteger).

3) @size le @maxFootSize

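 returns 'true' upon evaluation (the attribute maxFootSize is absent from the instance document, but schema validation supplies its fixed value 12, and 7 is less than or equal to 12).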

4) @color lt 10

Here we're doing a test on an xs:string value. But this expression would fail, since an xs:string value is not comparable to an integer. The XPath 2.0 language prescribes the error code XPTY0004 for such an erroneous expression evaluation.

5) @color = ('c1', 'c2', 'c3')

This test is ok. Here we're testing whether the string value of the attribute color is one of the values in the enumeration specified.

The operations that can be done on string values are usually different from those that can be done on numerics. The rules of XPath 2.0 are quite specific about these. Different languages have different rules for these concerns, but various characteristics are usually common.

An important point to note about the above XPath 2.0 expression examples is that they work because their evaluation occurs within an XPath 2.0 schema-aware context. That is, when these XPath expressions are evaluated, the XPath node instances have type annotations bound to them from the given XSD schema.

That's it for these explanations. Any comments are welcome.
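To see why the type annotations matter for expression 3: without them, XPath 2.0 treats the attribute values as xs:untypedAtomic, and a value comparison such as "le" casts untyped operands to xs:string. The following Java sketch is only an analogy written in plain Java (it does not evaluate any XPath, and the class and method names are made up for illustration); it contrasts comparing "7" and "12" as strings with comparing them as integers.

```java
import java.math.BigInteger;

// Hypothetical illustration, not an XPath engine.
final class TypedComparisonSketch {

    // Schema-unaware view: both attribute values are compared as strings.
    static boolean lessOrEqualAsStrings(String left, String right) {
        return left.compareTo(right) <= 0;
    }

    // Schema-aware view: both values are treated as integers.
    static boolean lessOrEqualAsIntegers(String left, String right) {
        return new BigInteger(left).compareTo(new BigInteger(right)) <= 0;
    }

    public static void main(String[] args) {
        System.out.println(lessOrEqualAsStrings("7", "12"));   // prints false
        System.out.println(lessOrEqualAsIntegers("7", "12"));  // prints true
    }
}
```

The string comparison says that "7" sorts after "12", which is why the integer type annotation coming from the schema is what makes expression 3 give the expected answer.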

Sunday, May 19, 2013

Thanks to Oxygen XML folks

On behalf of the Xerces-J XML Schema team, I would like to thank the folks from the Oxygen XML team for highlighting many important bugs within the Xerces-J XSD 1.1 validator. We've been able to solve many of those reported bugs, and I feel this has made the implementation of the Xerces-J XSD 1.1 validator quite a bit better.

Here's the list of issues reported by the Oxygen folks during roughly the past 1-2 years, which are either resolved or closed:

https://issues.apache.org/jira/issues/?jql=project%20%3D%20XERCESJ%20AND%20issuetype%20%3D%20Bug%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20reporter%20in%20%28radu_coravu%2C%20%22octavian.nadolu%22%29

In the above report, you may ignore bugs dated as old as 2006, which must have been resolved within the existing or an earlier Xerces-J version.

Other than the bugs reported by the Oxygen XML folks, we also received bug reports from other members of the XML community. Thanks to those persons as well.

I'm not sure when we're going to release the next version of Xerces-J, which should have many implementation improvements. Taking a very pessimistic view of this, I expect a new version of Xerces-J sometime later this year, or it might slip to next year.

Thursday, November 15, 2012

new thoughts about XSD 1.1 assertions

I've been thinking on these XSD topics for a while, and thought of summarizing my findings here.

Let me start this post by writing the following XML instance document (which will be the focus of all analysis in this post):

XML-1
<list attr="1 2 3 4 5 6">
    <item>a1</item>
    <item>a2</item>
    <item>a3</item>
    <item>a4</item>
    <item>a5</item>
    <item>a6</item>
</list>

We need to specify an XSD schema for the XML document above (XML-1), providing the following essential validation constraints:
1) The value of attribute "attr" is a sequence of single digit numbers. A number here can be modeled as an XSD type xs:integer, or as a restriction from xs:string (as we'll see below).
2) Each string value within an element "item" is of the form a[0-9], i.e. this string value needs to be the character "a" followed by a single numeric character. We'll simply specify this with the XSD type xs:string for now. We also want each numeric character after "a" to be pair-wise the same as the value at the corresponding index within the attribute value "attr". The above sample XML instance document (XML-1) is valid as per this requirement. Therefore, if we change any single numeric value within the XML instance sample above (either within the attribute value "attr", or the numeric suffix of "a" within an element "item"), the XML instance document must then be reported as 'invalid' (this observation follows from the requirement that is stated in this point).

Now, let me come to the XSD solutions for these XML validation requirements.

First of all, we would need XSD 1.1 assertions to specify these validation constraints (since this is clearly a co-occurrence data constraint issue). Following is the first schema design that quickly came to my mind:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
   
    <xs:element name="list">
        <xs:complexType>
           <xs:sequence>
              <xs:element name="item" type="xs:string" maxOccurs="unbounded"/>
           </xs:sequence>
           <xs:attribute name="attr">
              <xs:simpleType>
                 <xs:list itemType="xs:integer"/>
              </xs:simpleType>
           </xs:attribute>
           <xs:assert test="deep-equal(item/substring-after(., 'a'), data(@attr))"/>
        </xs:complexType>
    </xs:element>
   
</xs:schema>

The above schema is almost correct, except for a little problem with the way assertion is specified. As per the XPath 2.0 spec, the "deep-equal" function when comparing the two sequences for deep equality checks, requires that atomic values at same indices in the two sequences must be equal as per the rules of equality of an XSD atomic type. Within an assertion in the above schema, the first argument of "deep-equal" has a type annotation of xs:string* and the second argument has a type annotation xs:integer* (note that, the XPath 2.0 "data" function returns the typed value of a node) and therefore the "deep-equal" function as used in this case returns a 'false' result.

Assuming that we would not change the schema specification of "item" elements and the attribute "attr", the following assertion would therefore be correct to realize the above requirement:

<xs:assert test="deep-equal(item/substring-after(., 'a'), for $att in data(@attr) return string($att))"/>

(in this case, we've converted the second argument of the "deep-equal" function to have the type annotation xs:string* and did not modify the type annotation of the first argument)

An alternative correct modification to the assertion would be:

<xs:assert test="deep-equal(item/number(substring-after(., 'a')), data(@attr))"/>

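(in this case, we make the first argument of the "deep-equal" function numeric and do not modify the type annotation of the second argument. Note that the "number" function returns xs:double values rather than xs:integer values, so the first argument has the type xs:double*. The comparison with xs:integer* still succeeds, because numeric values of different types are compared after numeric type promotion)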
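The type mismatch discussed above can be mimicked outside XPath. The following is a rough Java analogy, not an XPath 2.0 evaluation: a list of strings never equals a list of integers element by element, but it does once the integers are converted to strings. The class DeepEqualSketch and all of its methods are hypothetical names chosen for this sketch.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the string-versus-integer mismatch.
final class DeepEqualSketch {

    // Mimics substring-after(., 'a'): the text after the first "a", or "" if there is none.
    static String substringAfterA(String s) {
        int idx = s.indexOf('a');
        return idx < 0 ? "" : s.substring(idx + 1);
    }

    // Plays the role of item/substring-after(., 'a'), a sequence of strings.
    static List<String> itemSuffixes(List<String> items) {
        List<String> out = new ArrayList<>();
        for (String item : items) {
            out.add(substringAfterA(item));
        }
        return out;
    }

    // Plays the role of data(@attr) when the list item type is xs:integer.
    static List<BigInteger> attrAsIntegers(String attr) {
        List<BigInteger> out = new ArrayList<>();
        for (String token : attr.trim().split("\\s+")) {
            out.add(new BigInteger(token));
        }
        return out;
    }

    // Plays the role of "for $att in data(@attr) return string($att)".
    static List<String> asStrings(List<BigInteger> values) {
        List<String> out = new ArrayList<>();
        for (BigInteger value : values) {
            out.add(value.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> items = Arrays.asList("a1", "a2", "a3", "a4", "a5", "a6");
        List<BigInteger> attr = attrAsIntegers("1 2 3 4 5 6");
        System.out.println(itemSuffixes(items).equals(attr));            // prints false
        System.out.println(itemSuffixes(items).equals(asStrings(attr))); // prints true
    }
}
```

Java's List.equals compares elements with equals, and a String is never equal to a BigInteger, which is loosely similar to "deep-equal" returning 'false' when the "eq" operator is not defined between the two atomic types.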

I now propose a slightly different way to specify the schema for above requirements. Following is the modified schema document:

XS-2
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  
    <xs:element name="list">
        <xs:complexType>
           <xs:sequence>
              <xs:element name="item" type="xs:string" maxOccurs="unbounded"/>
           </xs:sequence>
           <xs:attribute name="attr">
              <xs:simpleType>
                 <xs:list itemType="NumericChar"/>
              </xs:simpleType>
           </xs:attribute>
           <xs:assert test="deep-equal(item/substring-after(., 'a'), data(@attr))"/>
        </xs:complexType>
    </xs:element>
  
    <xs:simpleType name="NumericChar">
       <xs:restriction base="xs:string">
          <xs:pattern value="[0-9]"/>
       </xs:restriction>
    </xs:simpleType>
  
</xs:schema>

This schema document is right in all respects, and successfully validates the XML document specified above (i.e. XML-1). In this schema we've made the following design decisions:
1) We've specified the itemType of the list (the value of attribute "attr" is an instance of this list) as "NumericChar" (this is a user-defined simpleType, which uses the xs:pattern facet to constrain list items).
2) The "deep-equal" function as now written in the schema XS-2 has the type annotation xs:string* for both of its arguments, and therefore it works fine.

I'll now try to summarize below the pros and cons of schema XS-2 with respect to the other correct solutions specified earlier:
1) If the simpleType definition of attribute "attr" is not used in another schema context (i.e. this is the only use of such a type definition, and there is no need to reuse the type), then the solution with schema XS-2 is acceptable.
2) If a schema author thought that the list items of attribute "attr" need to be numeric (due to the semantic intent of the problem, or because the list's simpleType definition needs to be reused in more than one place and the other place needs a list of integers), then schema solutions like those shown earlier would be needed.

Here's another caution I can point out about the schema solutions proposed above.
The above schemas would allow values within "item" elements like "pqra5" to produce a valid outcome with the "substring-after" function that is written in the assertions. Therefore, the "item" element may be more correctly specified like,

<xs:element name="item" maxOccurs="unbounded">
    <xs:simpleType>
         <xs:restriction base="xs:string">
              <xs:pattern value="a[0-9]"/>
         </xs:restriction>
    </xs:simpleType>
</xs:element>

It is also evident that the XPath 2.0 "data" function allows us to do some useful things with simpleType lists, like getting the list's typed value and specifying certain checks on individual list items (possibly different checks on different list items), or accessing list items by an index (or a range of indices). For example, data(@attr)[2] or data(@attr)[position() gt 3]. This was not possible with XSD 1.0.

I hope that this post was useful, and hoping to come back with another post sometime soon.

Sunday, July 22, 2012

XSD 1.1 assertions with complexType extensions

I thought it would be good to write this post here and share it with XML Schema folks.

There was an interesting debate on the xmlschema-dev list recently, where we argued about the benefit of specifying an XSD 1.1 assertion within an XSD complexType that is derived from another complexType via an extension operation. It was initially thought that an assertion within such a derived complexType would always produce an XML content model restriction effect (which is opposed to the actual intent of complexType extension) -- if this is the only effect of assertions in this case, then using assertions in this case is counter-intuitive. Therefore, would there be any benefit in specifying assertions within a derived XSD complexType when using an extension derivation (the XSD 1.1 language currently provides this facility)?

After some thought, we found a benefit of using assertions for this scenario. Following is an example, illustrating one of the benefits of assertions for this case:

XSD Schema document (XS1):
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

    <xs:element name="X">
       <xs:complexType>
          <xs:complexContent>
             <xs:extension base="T1">
                <xs:sequence>
                   <xs:element name="c" type="xs:string"/>
                </xs:sequence>
                <xs:assert test="a = c">
                   <xs:annotation>
                      <xs:documentation>
                         The value of element "a" must be equal to value of element "c".
                      </xs:documentation>
                   </xs:annotation>
                </xs:assert>
             </xs:extension>
          </xs:complexContent>
       </xs:complexType>
    </xs:element>
    
    <xs:complexType name="T1">
       <xs:sequence>
          <xs:element name="a" type="xs:string"/>
          <xs:element name="b" type="xs:string"/>
       </xs:sequence>
    </xs:complexType>

</xs:schema>

XML instance document (XML1):
<X>
  <a>same</a>
  <b/>
  <c>same</c>
</X>

We want to validate the XML instance document, XML1 above with the schema shown above (XS1). The XML content within element "X", is declared via an XSD complexType that is derived by extension from another complexType. The xs:assert element specified in the schema XS1 above, has the following semantic intent: "to specify a relational constraint between two sibling elements" (elements "a" and "c" in this case).

Summarizing the design thoughts for the schema specified above (XS1):
1) An assertion within an XSD complexType extension derivation doesn't always produce a restriction effect. As illustrated in the example above, the assertion specifies an orthogonal co-occurrence constraint (alongside the traditional xs:extension constraint) -- this is intuitive, and useful.
2) We should be careful, though, to be aware that an xs:assert element within a complexType extension can easily inject a content model restriction effect. If this is not wanted, an assertion shouldn't be used for such derived XSD complex types. Following is an XML Schema example illustrating this scenario:

XSD Schema document (XS2):
(intended to validate XML documents like XML1 above; note that XML1 as written contains an element "b", so this schema's assertion would report it as invalid)
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

    <xs:element name="X">
       <xs:complexType>
          <xs:complexContent>
             <xs:extension base="T1">
                <xs:sequence>
                   <xs:element name="c" type="xs:string"/>
                </xs:sequence>
                <xs:assert test="not(b)">
                   <xs:annotation>
                      <xs:documentation>
                         The element "b" is prohibited.
                      </xs:documentation>
                   </xs:annotation>
                </xs:assert>
             </xs:extension>
          </xs:complexContent>
       </xs:complexType>
    </xs:element>
    
    <xs:complexType name="T1">
       <xs:sequence>
          <xs:element name="a" type="xs:string"/>
          <xs:element name="b" type="xs:string" minOccurs="0"/>
       </xs:sequence>
    </xs:complexType>

</xs:schema>

The schema XS2 above illustrates the following design intents:
1) The xs:assert element within the complexType of element "X" prohibits element "b" from occurring within the XML instance element "X". An assertion like this restricts the complex type "content model" of the base type. If we wouldn't like a content model restricting effect like this, then we shouldn't use an xs:assert with complexType extension.
2) The schema document XS2 specified above can still be thought of as a useful design. The complexType definition of element "X" in schema XS2 is quite like a mixture of both extension and restriction derivation. It is an extension derivation, because some of the element particles of the base type are made available within the derived type via an xs:extension element (element "a" for this example). It is also a restriction derivation, because the element "b" of the base type is prohibited from occurring in the derived type via an xs:assert element. The complexType definition of element "X" in this case is unlike any of the facilities of the XSD 1.0 language, which allows a pure extension derivation or a pure restriction derivation but not both. Assertions can sometimes be thought of as useful in a schema design like this, when we want some of both the complexType extension and restriction derivation effects.

Therefore, here's my final take on these design issues:
1) An assertion is very intuitive (and useful) for specifying co-occurrence constraints between sibling XML elements, and this remains so with the XSD xs:extension element (this is unlike any of the XSD 1.0 facilities). Other content model co-occurrence scenarios are also useful in this case, like specifying co-constraints between an element and an attribute etc. XSD assertions are certainly recommended for this case.
2) An assertion is also very intuitive for specifying a mixture of complexType extension and restriction derivation operations (as illustrated in the schema example XS2 above). XSD assertions are certainly also recommended for this case.
3) If an XSD schema author desires to strictly use the element xs:extension for expressing pure content model extension, then using an assertion within xs:extension is counter-intuitive (since it may inject a content model restriction effect) and is not recommended.

Therefore, if we have to do some new kinds of XML Schema modeling with XSD 1.1 assertions (for example, with xs:extension derivations), assertions are certainly a nice XML Schema construct.
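The two assertion tests used in XS1 and XS2 ("a = c" and "not(b)") happen to be expressible in XPath 1.0 as well, so their truth values can be tried out with the XPath 1.0 engine bundled with the JDK. This is only a sketch of what the tests compute with element "X" as the context node; it is not an XSD 1.1 validation, and the class AssertTestSketch and its method holds are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper that evaluates an assertion-style test against a document element.
final class AssertTestSketch {

    static boolean holds(String test, String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Select the document element; it plays the role of the assertion's context node.
            Node root = (Node) xpath.evaluate(
                "/*", new InputSource(new StringReader(xml)), XPathConstants.NODE);
            return (Boolean) xpath.evaluate(test, root, XPathConstants.BOOLEAN);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String xml1 = "<X><a>same</a><b/><c>same</c></X>";
        System.out.println(holds("a = c", xml1));   // prints true
        System.out.println(holds("not(b)", xml1));  // prints false
    }
}
```

The second result shows the restriction effect of the XS2 assertion: an instance that contains element "b" does not satisfy "not(b)".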

I hope, that this post was useful.

Saturday, April 14, 2012

XSD 1.1 is now a W3C standard

I've been looking forward (as, I hope, have many others) to the XSD 1.1 language becoming a W3C Recommendation (REC). XSD 1.1 did become a REC on 5th April 2012. This was a very long wait for the XSD community, but it has finally happened. The previous XSD standard (XSD 1.0 2nd edition) dates back to 2004.

XSD 1.1 implementations: There currently seem to be two XSD 1.1 implementations, which are Xerces and Saxon. Xerces is a project from the Apache Software Foundation's XML activity, and Saxon is a product set from Saxonica (via Michael Kay). Both of these implementations pass close to 100% of the W3C XSD 1.1 test suite, so these tools are reliable implementations of the XSD 1.1 standard.

For the interest of readers (for those not aware), following are the lists of feature changes within the XSD 1.1 language (these are non-normative, but fairly complete; for the complete list of changes in XSD 1.1 with respect to the XSD 1.0 language, you'll have to read the whole of the XSD 1.1 specification):

XSD 1.1 Structures specification: http://www.w3.org/TR/xmlschema11-1/#changes
XSD 1.1 Datatypes specification: http://www.w3.org/TR/xmlschema11-2/#changes

Wishing the XSD folks (and the wider XML community) happy reading of the new XSD 1.1 specification, and happy experimenting with the available implementations :)

Saturday, February 25, 2012

modular XML instances and modular XSD schemas

I was playing with some new ideas lately, exploring design options for constructing modular XML instance documents and/or modular XSD schema documents, and thought of writing up my findings as a blog post here.

I believe there are primarily the following concepts related to constructing modular XML documents (and XSD schemas) when XSD validation is involved:
1. Modularize XML documents using the XInclude construct.
2. Modularize an XSD document via <xs:include> and <xs:import>. The <xs:include> construct maps significantly to modularity concepts in XSD schemas, and <xs:import> is necessary (necessary in XSD 1.0, and optional in XSD 1.1) to compose (and also to modularize) XSD schemas coming from two or more distinct XML namespaces.

I don't intend to delve much in this post into the XSD constructs <xs:include> and <xs:import>, since these are well known within the XSD and XML communities. In this post, I primarily focus on XML document modularization via the XInclude construct, and present a few thoughts about various design options for validating such XML instance documents via XSD validation (I don't claim to have covered every design option for these use cases, but I feel that I cover a few of the important ones).

What is XInclude?
This is an XML standards specification that defines how to modularize any XML document information. The primary construct of XInclude is the <xi:include> XML element. Following is a small example of an XInclude-aware XML document,

z.xml

<z xmlns:xi="http://www.w3.org/2001/XInclude">
    <xi:include href="x.xml"/>
    <xi:include href="y.xml"/>
</z>

x.xml

<x>
    <a>1</a>
    <b>2</b>
</x>

y.xml

<y>
    <p>5</p>
    <q>6</q>
</y>

We'll be providing the XML document z.xml above, which is composed from other XML documents via XInclude meta-data, to an XSD validator for validation.

I essentially discuss here the XSD schema design options to validate an XML instance document like z.xml above. Following are the XSD design options (that result in successful XML instance validations) that currently come to my mind for this need, along with some explanation of the corresponding design rationale:
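To make the effect of XInclude concrete, here is a minimal sketch that expands z.xml with the XInclude support of the JDK's DOM parser and lists the child elements of "z" after expansion. It assumes the three documents above are written to a temporary directory; the class XIncludeExpansionSketch and its method are hypothetical names used only for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical illustration of XInclude expansion with the JDK DOM parser.
final class XIncludeExpansionSketch {

    private static void write(Path file, String text) throws Exception {
        Files.write(file, text.getBytes(StandardCharsets.UTF_8));
    }

    // Writes x.xml, y.xml and z.xml to a temp directory, parses z.xml with
    // XInclude enabled and returns the local names of the children of "z".
    static List<String> expandedChildNames() {
        try {
            Path dir = Files.createTempDirectory("xinclude-demo");
            write(dir.resolve("x.xml"), "<x><a>1</a><b>2</b></x>");
            write(dir.resolve("y.xml"), "<y><p>5</p><q>6</q></y>");
            write(dir.resolve("z.xml"),
                "<z xmlns:xi='http://www.w3.org/2001/XInclude'>"
              + "<xi:include href='x.xml'/><xi:include href='y.xml'/></z>");

            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // XInclude processing relies on namespaces
            factory.setXIncludeAware(true);  // expand xi:include elements while parsing
            Document doc = factory.newDocumentBuilder().parse(dir.resolve("z.xml").toFile());

            List<String> names = new ArrayList<>();
            for (Node n = doc.getDocumentElement().getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() == Node.ELEMENT_NODE) {
                    names.add(n.getLocalName());
                }
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(expandedChildNames());
    }
}
```

After expansion the children of "z" are the root elements of the included documents rather than the xi:include elements, which is the difference between the "unexpanded" and "expanded" validation modes discussed in this post.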

XS1:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

    <xs:element name="z">
          <xs:complexType>
               <xs:sequence>
                    <xs:any processContents="skip" minOccurs="2" maxOccurs="2"/>
               </xs:sequence>
          </xs:complexType>
    </xs:element>
   
</xs:schema>

This schema is written with a view that the XML document (i.e. z.xml) would be validated with the XInclude meta-data unexpanded. The xs:any wild-card in this schema would only weakly validate each of the two child elements of "z" (the wild-card declaration only requires *some* XML element to be present in the instance document at that position, and doesn't specify any other constraint for its corresponding XML instance elements). The same schema would therefore also accept the included root elements "x" and "y".

XS2:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

        <xs:element name="z">
                <xs:complexType>
                     <xs:complexContent>
                         <xs:restriction base="T1">
                              <xs:sequence>
                                   <xs:element name="include"  minOccurs="2" maxOccurs="2" targetNamespace="http://www.w3.org/2001/XInclude"/>
                             </xs:sequence>
                         </xs:restriction>
                    </xs:complexContent>
                </xs:complexType>
        </xs:element>
   
    <xs:complexType name="T1" abstract="true">
          <xs:sequence>
               <xs:any processContents="skip" maxOccurs="unbounded"/>
          </xs:sequence>
    </xs:complexType>
   
</xs:schema>

This schema is also written with a view that the XML document (i.e. z.xml) would be validated with the XInclude meta-data unexpanded. But this schema specifies slightly stronger XSD validation constraints as compared to the previous example (stronger in the sense that this schema declares an XML element and specifies its name and namespace). This schema would need an XSD 1.1 processor, since the element declaration specifies a "targetNamespace" attribute. An XSD 1.0 version of this design approach is possible, which would involve using an XSD <xs:import> element to import XSD components from the XInclude namespace.

XS3:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

       <xs:element name="z">
              <xs:complexType>
                  <xs:sequence>
                       <xs:any processContents="skip" minOccurs="2" maxOccurs="2" namespace="http://www.w3.org/2001/XInclude"/>
                 </xs:sequence>
                 <xs:assert test="count(*[local-name() = 'include']) = 2"/>
                 <xs:assert test="deep-equal((*[1] | *[2])/@*/name(), ('href','href'))"/>
             </xs:complexType>
      </xs:element>
   
</xs:schema>

This schema is also written with a view that the XML document (i.e. z.xml) would be validated with the XInclude meta-data unexpanded. But this schema enforces XSD validation even more strongly than the example "XS2" above (since this schema also requires the XInclude attribute "href" to be present on the XInclude meta-data, which the previous XSD schema doesn't enforce). This schema validates the names of the XML instance elements that are intended to be XInclude meta-data via XSD 1.1 <assert> elements (this may not be the best XSD validation approach, but such an XSD design idiom is now possible with the XSD 1.1 language).

XS4:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

    <xs:element name="z">
         <xs:complexType>
               <xs:sequence>
                    <xs:element name="x">
                         <xs:complexType>
                             <xs:sequence>
                                  <xs:element name="a" type="xs:integer"/>
                                 <xs:element name="b" type="xs:integer"/>
                            </xs:sequence>
                        </xs:complexType>
                    </xs:element>
                    <xs:element name="y">
                         <xs:complexType>
                             <xs:sequence>
                                  <xs:element name="p" type="xs:integer"/>
                                  <xs:element name="q" type="xs:integer"/>
                             </xs:sequence>
                        </xs:complexType>
                   </xs:element>
              </xs:sequence>
         </xs:complexType>
     </xs:element>
   
</xs:schema>

This schema is written with a view that the XML document (i.e. z.xml) would be validated with the XInclude meta-data expanded. This schema specifies the strongest XSD validation constraints as compared to the previous three approaches (strongest in the sense that the internal structure of the XML element instances "x" and "y" is now completely specified by the XSD document).

But to make this XSD validation approach work, the XInclude meta-data needs to be expanded, and the expanded XML infoset needs to be supplied to the XSD validator for validation. This would require an XInclude processor (like Apache Xerces) that plugs into the XML parsing stage to expand the <xi:include> tags.

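For the interest of readers, following are a few Java code snippets (the skeletal class structure is omitted to keep the text shorter) that enable XInclude processing and supply the resulting XML infoset (i.e. after the XInclude meta-data expansion) to the Xerces XSD validator,

// Imports needed by the snippets below
import java.io.IOException;
import javax.xml.XMLConstants;
import javax.xml.transform.sax.SAXSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

// Assumption: the creation of schemaFactory was not shown originally;
// a standard JAXP XML Schema factory is assumed here.
SchemaFactory schemaFactory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);

try {           
     // The schema document is parsed without XInclude processing
     Schema schema = schemaFactory.newSchema(getSaxSource(xsdUri, false));
     Validator validator = schema.newValidator();
     validator.setErrorHandler(new ValidationErrHandler());
     // The instance document is parsed with XInclude processing enabled
     validator.validate(getSaxSource(xmlUri, true));
}
catch(SAXException se) {
     se.printStackTrace();
}
catch (IOException ioe) {
     ioe.printStackTrace();
}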

private SAXSource getSaxSource(String docUri, boolean isInstanceDoc) {

     XMLReader reader = null;

     try {
          reader = XMLReaderFactory.createXMLReader();
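          if (isInstanceDoc) {
              // Expand <xi:include> elements while parsing
              reader.setFeature("http://apache.org/xml/features/xinclude", true);
              // Do not add xml:base attributes to the included elements
              // (the schema XS4 does not declare such attributes)
              reader.setFeature("http://apache.org/xml/features/xinclude/fixup-base-uris", false);
          }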
     }
     catch (SAXException se) {
          se.printStackTrace();
     }

     return new SAXSource(reader, new InputSource(docUri));

}
     
class ValidationErrHandler implements ErrorHandler {

      public void error(SAXParseException spe) throws SAXException {
           String formattedMesg = getFormattedMesg(spe.getSystemId(), spe.getLineNumber(), spe.getColumnNumber(), spe.getMessage());
           System.err.println(formattedMesg);
      }

      public void fatalError(SAXParseException spe) throws SAXException {
             String formattedMesg = getFormattedMesg(spe.getSystemId(), spe.getLineNumber(), spe.getColumnNumber(), spe.getMessage());
             System.err.println(formattedMesg);
      }

      public void warning(SAXParseException spe) throws SAXException {
           // NO-OP           
      }
       
}

private String getFormattedMesg(String systemId, int lineNo, int colNo, String mesg) {
      return systemId + ", line "+lineNo + ", col " + colNo + " : " + mesg;   
}

Summary: I would ask whether devising the various XSD design approaches above is beneficial for an XSD schema design that involves validating XML instance documents containing <xi:include> meta-data directives. My thought process with regard to the XSD validation options presented above had the following concerns:
1) Providing various degrees of XSD validation strength for <xi:include> directives (essentially the unexpanded and expanded modes).
2) Exploring some of the new XML validation idioms offered by the XSD 1.1 language for the use cases presented above (essentially using the "targetNamespace" attribute on xs:element elements, and using <assert> elements).
3) Exploring the Java SAX and JAXP APIs to enable XInclude meta-data expansion, and providing a SAXSource object containing an XInclude-expanded XML infoset, which is subsequently supplied to the XSD validation pipeline.

I hope that this post was useful.

Sunday, February 5, 2012

"castable as" vs "instance of" XPath 2.0 expressions for XSD 1.1 assertions

I'm continuing with my thoughts related to my previous blog post (ref, http://mukulgandhi.blogspot.in/2012/01/using-xsd-11-assertions-on-complextype.html). The earlier post used the XPath 2.0 "castable as" expression to do some checks on the 'untyped' data of complexType's mixed content (essentially finding if the string/untyped value in an XML instance document is a lexical representation of an xs:integer value).

This post talks about the use of the XPath 2.0 "instance of" and "castable as" expressions in the context of XSD 1.1 assertions, essentially providing guidance about when each of these expressions is needed.

The "castable as" use case was discussed in my earlier blog post. Here I mainly talk about the "instance of" expression when used with XSD 1.1 assertions.
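As a loose analogy outside XPath, "castable as" resembles asking whether a string could be converted to a given type, while "instance of" resembles asking whether a value already has that type. The following Java sketch illustrates only that distinction; the class and method names are hypothetical, and Java's type system is of course not the XDM type system.

```java
import java.math.BigInteger;

public class CastableVsInstanceOf {

    // Rough analogue of "castable as xs:integer" for untyped (string) data:
    // true when the text is a lexical form that can be converted to an integer.
    static boolean castableAsInteger(String untyped) {
        try {
            new BigInteger(untyped.trim());
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    // Rough analogue of "instance of xs:integer": true only when the value
    // already carries the integer type; no conversion is attempted.
    static boolean instanceOfInteger(Object value) {
        return value instanceof BigInteger;
    }

    public static void main(String[] args) {
        Object untyped = "20";               // like untyped data taken from mixed content
        Object typed = new BigInteger("20"); // like data already validated as xs:integer

        System.out.println(castableAsInteger((String) untyped)); // true
        System.out.println(instanceOfInteger(untyped));          // false
        System.out.println(instanceOfInteger(typed));            // true
    }
}
```

The untyped string "20" is convertible but is not yet an integer, which is why the earlier post needed "castable as", whereas schema-typed data can be tested directly with "instance of".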

Let's assume that there is an XML instance document like the following (XML1):

<X>
   <elem>
     <a>20</a>
     <b>30</b>
   </elem>
   <elem>
     <a>10</a>
     <b>2005-10-07</b>
   </elem>
</X>

The XSD schema should express the following constraints with respect to the above XML instance document (XML1):
1. The elements "a" and "b" can each be typed as an xs:integer or an xs:date (therefore we'll express this with an XSD simpleType with variety 'union').
2. If both the elements "a" and "b" are of type xs:integer (this is allowed by the simpleType definition described in point 1 above), then the numeric value of element "a" should be less than the numeric value of element "b".
3. If one of the elements "a" or "b" is an xs:integer and the other one is an xs:date, then we would like to express the following constraints:
   - the numeric XML instance value of the xs:integer typed element should be less than 100
   - the xs:date XML instance value should be less than the current date

The following XSD (1.1) schema document describes all of the above validation constraints for the sample XML instance document (XML1) provided above:

[XS1]

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
   
     <xs:element name="X">
        <xs:complexType>
           <xs:sequence>
              <xs:element name="elem" maxOccurs="unbounded">
                 <xs:complexType>
                    <xs:sequence>
                       <xs:element name="a" type="union_of_date_and_integer"/>
                       <xs:element name="b" type="union_of_date_and_integer"/>
                    </xs:sequence>
                    <xs:assert test="if ((data(a) instance of xs:integer) and (data(b) instance of xs:integer))
                                              then (data(a) lt data(b))
                                           else if (not(deep-equal(data(a), data(b))))
                                              then (*[data(.) instance of xs:integer]/data(.) lt 100
                                                         and
                                                      *[data(.) instance of xs:date]/data(.) lt current-date())
                                              else true()"/>
                 </xs:complexType>
              </xs:element>
           </xs:sequence>
        </xs:complexType>
     </xs:element>
   
     <xs:simpleType name="union_of_date_and_integer">
        <xs:union memberTypes="xs:date xs:integer"/>
     </xs:simpleType>
   
</xs:schema>

I think it may be interesting for readers to know why I wrote the assertion the way I did. Following are a few thoughts:
1. Since the XML elements "a" and "b" are typed with a simpleType of variety 'union', an assertion that needs the atomic values validated by such a simpleType must use the XPath 2.0 "data" function on the relevant XDM node (elements "a" and "b" in this case). To then determine whether that atomic value is typed as xs:integer, we use the "instance of" expression; "castable as" is not needed in this case, since the instance document's data is already typed.
2. The rest of the assertion implements what is mentioned in the requirements above.
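To make the branching of the assertion in [XS1] easier to follow, here is a minimal Java sketch of the same decision logic. It is an illustration only, not how an XSD 1.1 processor evaluates assertions: the class and method names are hypothetical, the union is imitated by trying LocalDate before BigInteger (the member order declared in the schema), xs:date details such as time zones are ignored, and the current date is passed in as a parameter so that runs are repeatable.

```java
import java.math.BigInteger;
import java.time.LocalDate;
import java.time.format.DateTimeParseException;

public class UnionAssertSketch {

    // Stand-in for validation against the union: try the member types in
    // their declared order (xs:date, then xs:integer).
    static Object typedValue(String text) {
        String s = text.trim();
        try {
            return LocalDate.parse(s);
        } catch (DateTimeParseException e) {
            return new BigInteger(s); // NumberFormatException if neither member fits
        }
    }

    // Mirrors the if / else-if / else structure of the assertion in [XS1].
    static boolean assertionHolds(String aText, String bText, LocalDate today) {
        Object a = typedValue(aText);
        Object b = typedValue(bText);

        // Both values are integers: "a" must be less than "b".
        if (a instanceof BigInteger && b instanceof BigInteger) {
            return ((BigInteger) a).compareTo((BigInteger) b) < 0;
        }
        // Values differ: expect one integer below 100 and one date before today.
        if (!a.equals(b)) {
            BigInteger n = a instanceof BigInteger ? (BigInteger) a
                         : b instanceof BigInteger ? (BigInteger) b : null;
            LocalDate d = a instanceof LocalDate ? (LocalDate) a
                        : b instanceof LocalDate ? (LocalDate) b : null;
            // As in the XPath, a missing integer or date makes the test false.
            return n != null && d != null
                    && n.compareTo(BigInteger.valueOf(100)) < 0
                    && d.isBefore(today);
        }
        return true;
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.of(2012, 2, 5); // fixed date, chosen for illustration
        System.out.println(assertionHolds("20", "30", today));         // first "elem" of XML1
        System.out.println(assertionHolds("10", "2005-10-07", today)); // second "elem" of XML1
    }
}
```

Note one consequence that the sketch makes visible: when "a" and "b" are two different xs:date values, neither branch finds an integer, so the check fails, just as the XPath comparison on an empty sequence does.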

If you want further visual and/or design elegance in the assertion above, feel free to break its rules into two or more assertions.

I would also like to show another XSD 1.1 assertion example, one that uses neither the XPath 2.0 "castable as" nor the "instance of" expression. It demonstrates that, if the XDM node an assertion works with is already typed, it is usually unnecessary to use "castable as" (which is essentially useful for programmatically enforcing a type on string/untyped values), while "instance of" may still be needed in some cases, such as the union type above.

Following is a slightly modified variant of the XML instance document specified above (XML1):

[XML2]

<X>
   <elem>
     <a>2</a>
     <b>2012-02-04</b>
   </elem>
   <elem>
     <a>10</a>
     <b>2005-10-07</b>
   </elem>
</X>

The XSD schema should express the following constraints with respect to the above XML instance document (XML2):
1. The element "a" is typed as an xs:nonNegativeInteger value, and element "b" is typed as xs:date.
2. If the number of days given by the numeric value of element "a" is added to the xs:date value of element "b", the result should be an xs:date value that is less than the current date.

The following XSD (1.1) schema document describes all of the above validation constraints for the sample XML instance document (XML2) provided above:

[XS2]

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
   
     <xs:element name="X">
        <xs:complexType>
           <xs:sequence>
              <xs:element name="elem" maxOccurs="unbounded">
                 <xs:complexType>
                    <xs:sequence>
                       <xs:element name="a" type="xs:nonNegativeInteger"/>
                       <xs:element name="b" type="xs:date"/>
                    </xs:sequence>
                    <xs:assert test="(b + xs:dayTimeDuration(concat('P', a, 'D'))) lt current-date()"/>
                 </xs:complexType>
              </xs:element>
           </xs:sequence>
        </xs:complexType>
     </xs:element>
   
</xs:schema>
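The date arithmetic in the [XS2] assertion can be sketched in plain Java as follows. This is an illustration under stated assumptions, not a replacement for schema validation: the class and method names are hypothetical, and the current date is passed in as a parameter (rather than read from the clock, as current-date() does) so that the result is repeatable.

```java
import java.time.LocalDate;

public class DateOffsetSketch {

    // Mirrors: (b + xs:dayTimeDuration(concat('P', a, 'D'))) lt current-date()
    // "a" is a day count, "b" is a date, "today" stands in for current-date().
    static boolean assertionHolds(long a, LocalDate b, LocalDate today) {
        return b.plusDays(a).isBefore(today);
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.of(2012, 2, 5); // fixed date, chosen for illustration
        System.out.println(assertionHolds(2, LocalDate.parse("2012-02-04"), today));
        System.out.println(assertionHolds(10, LocalDate.parse("2005-10-07"), today));
    }
}
```

With the date fixed at 2012-02-05, the first "elem" of XML2 yields 2012-02-06, which is not earlier than that date, so the check fails; the second yields 2005-10-17 and passes. Evaluated against a later current date, both elements satisfy the rule.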

That's all I had to say today.

I hope this post was useful.