Tuesday, August 23, 2011

XPath 2.0 and XSD schemas : sharing experiences

I was just playing with XPath 2.0 and thought of sharing my observations about a specific use case.

We start with the following XSD schema document,
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    
    <xs:element name="X">
       <xs:complexType>
          <xs:sequence>
             <xs:element name="a" type="xs:integer"/>
          </xs:sequence>
          <xs:attribute name="att1" type="xs:boolean"/>
       </xs:complexType>
    </xs:element> 
</xs:schema>

This schema is intended to validate an XML instance document like the following,
<X att1="0">
  <a>100</a>
</X>

I wrote the following XPath 2.0 expression [1],

/X[if (@att1) then true() else false()]/a/text()

and ran it after enabling schema validation of the input document.

I thought that this would not return any result (i.e. an empty sequence).

But the XPath expression above ([1]) returns the result "100". At first, I was a little surprised by this result. I thought that, since the attribute "att1" was declared with type xs:boolean in the schema, the "if condition" should return 'false' in this case. But that's not the correct interpretation of the XPath expression written above ([1]). A little more explanation follows.

The reference @att1 in the XPath expression above (i.e. if (@att1) ..) is a node reference (an attribute node) and not a boolean value. I initially thought otherwise, and I was wrong: I incorrectly assumed that atomization of the expression @att1 would take place in this case (more about this below).

The XPath 2.0 spec says that if the first item in a sequence is a node, then the effective boolean value of that sequence is 'true' (this holds whether or not the input XML document was validated with the XSD schema). In an expression like the one above (i.e. if (@att1) ..), the effective boolean value of the sequence {@att1} determines whether the "if condition" returns 'true'. Here the sequence has one item (which is also its first item), the attribute node named "att1", so the effective boolean value is 'true' and hence the XPath predicate evaluates to 'true'. I think this explains why the "if condition" {if (@att1)} returns true for the above XML instance document, even if it was validated by the schema given above and the XPath 2.0 expression [1] was run in schema-aware mode.
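The "a node is present, so the test is true" rule can be tried out with the XPath engine that ships with the JDK. This is only a sketch under an assumption worth stating loudly: the JDK's javax.xml.xpath API implements XPath 1.0, not 2.0, so the "if" expression itself cannot be run there. The node-existence rule for a predicate such as [@att1] is the same in both versions, though. The class and method names below are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class NodeExistenceDemo {

    // The instance document from this post; att1 has the lexical value "0".
    static final String DOC = "<X att1=\"0\"><a>100</a></X>";

    // Evaluates an XPath 1.0 expression against DOC and returns its string value.
    static String eval(String expr) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return (String) xpath.evaluate(
                    expr, new InputSource(new StringReader(DOC)), XPathConstants.STRING);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The predicate tests that the attribute node exists, not what it contains.
        System.out.println(eval("/X[@att1]/a/text()"));   // 100
        System.out.println(eval("boolean(/X/@att1)"));    // true
        System.out.println(eval("boolean(/X/@missing)")); // false
    }
}
```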

To write the XPath expression as I intended (i.e. the "if condition" should return 'true' if the instance document has the value true/1 for the attribute and 'false' otherwise, given that XSD validation of the instance document took place before the XPath expression was evaluated), the expression needs to be modified to either of the following styles [2],

/X[if (data(@att1)) then true() else false()]/a/text()

OR

/X[if (@att1 = true()) then true() else false()]/a/text()

To understand why the expressions given above ([2]) work correctly, one needs to understand two things described by the XPath 2.0 spec. For the first variant, it is the XPath 2.0 "data" function, which returns the typed value of its argument. For the second variant, it is the process of atomization: the attribute node "att1" is atomized to a sequence of kind {xs:boolean} before the comparison.
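To make the "typed value" idea concrete, here is a small hand-written sketch of the mapping from the xs:boolean lexical forms (true, false, 1, 0) to a boolean value. It is not what an XPath 2.0 processor literally runs; the class and method names are hypothetical, and Java's trim() only approximates XSD whitespace collapsing.

```java
class XsBooleanValue {

    // Maps an xs:boolean lexical form to its value.
    // The lexical space of xs:boolean is exactly: true, false, 1, 0.
    static boolean typedValue(String lexical) {
        String s = lexical.trim(); // approximation of whiteSpace="collapse"
        if (s.equals("true") || s.equals("1")) {
            return true;
        }
        if (s.equals("false") || s.equals("0")) {
            return false;
        }
        throw new IllegalArgumentException("not a valid xs:boolean: " + lexical);
    }

    public static void main(String[] args) {
        // att1="0" in the instance document has the typed value false,
        // even though the attribute node itself is present.
        System.out.println(typedValue("0")); // false
        System.out.println(typedValue("1")); // true
    }
}
```

With the typed value in hand, data(@att1) for att1="0" is false, which is why the variants in [2] reject the sample document's X element.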

That's all about this. I hope that my experience with this may be helpful to someone (to understand this, one just has to know the XPath [2.0] spec correctly, and how it interacts with XSD schemas!).

Thanks for reading this post.

@2011-11-11: updated in place, to correct a few factual errors.

Tuesday, July 26, 2011

[revisiting] Xerces-J XSModel serializer

I started playing a bit with the Xerces-J XSSerializer utility and thought of writing something about its features. It is actually a sample within Xerces-J, introduced in Xerces-J 2.10.0 (the version in SVN is slightly better and will be released with a future Xerces release), and it serializes an in-memory Xerces XSModel instance into lexical XSD syntax.

The XSModel serializer has the following two important serialization features/options (currently the only ones):
1. Selecting the XSD language version that the XSModel serializer should work with. By default this is XSD 1.0, but it can be set to XSD 1.1 via the following command line parameter, {-version 1.1}. The XSModel serializer currently supports very few XSD 1.1 features. We'll try to add more XSD 1.1 features to it in future. But the XSD 1.0 support in Xerces's XSModel serializer is fairly complete.
2. The XSD language prefix in the serialization output can be configured with the option {-prefix <prefix-value>}. For example, "-prefix xsd". If this option is not specified, the prefix "xs" is generated by default during XSModel instance serialization.

I've had a few interesting observations while using the Xerces XSSerializer (illustrated with small examples below),

1. I supplied the following XSD document (only the element declaration is shown, since this is the focus of this point) to the XSModel serializer,
<xs:element name="E1">
   <xs:simpleType>
      <xs:list>
         <xs:simpleType>
           <xs:restriction base="xs:string">
              <xs:minLength value="5"/>
           </xs:restriction>
         </xs:simpleType>
      </xs:list>
   </xs:simpleType>
</xs:element>

and the XSModel serializer echoed this element declaration as the following (the XSModel serializer converted the lexical schema into an XSModel instance, and then serialized the XSModel back to lexical XSD syntax),
<xs:element name="E1">
   <xs:simpleType>
      <xs:list>
         <xs:simpleType>
            <xs:restriction base="xs:string">
               <xs:whiteSpace value="preserve"/>
               <xs:minLength value="5"/>
            </xs:restriction>
         </xs:simpleType>
      </xs:list>
   </xs:simpleType>
</xs:element>

The interesting thing I noticed in this example is the generation of the built-in facet "whiteSpace" for the XSD type xs:string.

2. Serializing the following XSD element,
<xs:element name="E1">
   <xs:simpleType>
      <xs:list>
         <xs:simpleType>
            <xs:restriction base="xs:integer">
               <xs:minInclusive value="5"/>
            </xs:restriction>
         </xs:simpleType>
      </xs:list>
   </xs:simpleType>
</xs:element>
produces the following round-trip output with the XSModel serializer,
<xs:element name="E1">
   <xs:simpleType>
      <xs:list>
         <xs:simpleType>
            <xs:restriction base="xs:integer">
               <xs:whiteSpace value="collapse"/>
               <xs:fractionDigits value="0"/>
               <xs:minInclusive value="5"/>
               <xs:pattern value="[\-+]?[0-9]+"/>
            </xs:restriction>
         </xs:simpleType>
      </xs:list>
   </xs:simpleType>
</xs:element>
This shows the built-in facets for the XSD type xs:integer ("whiteSpace", "fractionDigits" and others).

I personally like this feature of the XSModel serializer: it is able to generate certain hidden properties of XML Schema components, which schema authors normally don't specify while writing schema documents for applications.

3. I provided the following XSD Schema fragment to XSModel serializer (a complexType referring to a model group),
<xs:element name="E1">
  <xs:complexType>
     <xs:group ref="gp1"/>
  </xs:complexType>
</xs:element>
   
<xs:group name="gp1">
   <xs:sequence>
      <xs:element name="x" type="xs:string"/>
      <xs:element name="y" type="xs:string"/>
   </xs:sequence>
</xs:group>

and the XSModel serializer generated the following round-trip serialization result,
<xs:element name="E1">
   <xs:complexType>
      <xs:sequence>
         <xs:element name="x" type="xs:string"/>
         <xs:element name="y" type="xs:string"/>
      </xs:sequence>
   </xs:complexType>
</xs:element>

<xs:group name="gp1">
   <xs:sequence>
      <xs:element name="x" type="xs:string"/>
      <xs:element name="y" type="xs:string"/>
   </xs:sequence>
</xs:group>
The global "model group" is serialized as expected. But the complexType within the element declaration was serialized with its element declarations expanded. The lexical group reference is not present in the serialized output.

At first the absence of the model group reference in the serialized output may look odd. But the fact is that a Xerces XSModel instance, in its complete compiled form, doesn't know whether a group particle (in this case xs:sequence) comes from a group reference, and I had to live with this XSModel serialization characteristic. The serialized schema output in this example is nevertheless equivalent to the original schema document (which was supplied to the XSModel serializer) from a validation perspective. The global group definition in the output is redundant from a validation perspective; it's just a characteristic of the XSModel serializer currently.
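The equivalence claim can be spot-checked with the JAXP validation API bundled with the JDK. The sketch below wraps the two element declarations from this example in complete schema documents (the wrapping, the class name and the sample instances are my own assumptions) and validates the same instances against both. Agreement on a couple of instances is of course only a sanity check, not a proof of equivalence.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

class GroupRefEquivalence {

    static final String OPEN = "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">";
    static final String GROUP =
          "<xs:group name=\"gp1\"><xs:sequence>"
        + "<xs:element name=\"x\" type=\"xs:string\"/>"
        + "<xs:element name=\"y\" type=\"xs:string\"/>"
        + "</xs:sequence></xs:group>";

    // Original form: the complexType refers to the model group.
    static final String WITH_REF = OPEN
        + "<xs:element name=\"E1\"><xs:complexType>"
        + "<xs:group ref=\"gp1\"/>"
        + "</xs:complexType></xs:element>" + GROUP + "</xs:schema>";

    // Round-tripped form: the group's content is expanded in place.
    static final String EXPANDED = OPEN
        + "<xs:element name=\"E1\"><xs:complexType><xs:sequence>"
        + "<xs:element name=\"x\" type=\"xs:string\"/>"
        + "<xs:element name=\"y\" type=\"xs:string\"/>"
        + "</xs:sequence></xs:complexType></xs:element>" + GROUP + "</xs:schema>";

    // Returns true when xml validates against the schema text xsd.
    static boolean isValid(String xsd, String xml) {
        Validator validator;
        try {
            validator = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(xsd)))
                    .newValidator();
        } catch (SAXException e) {
            throw new IllegalStateException("schema did not compile", e);
        }
        try {
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // validation error
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String good = "<E1><x>a</x><y>b</y></E1>";
        String bad = "<E1><y>b</y><x>a</x></E1>"; // wrong order
        System.out.println(isValid(WITH_REF, good) + " " + isValid(EXPANDED, good));
        System.out.println(isValid(WITH_REF, bad) + " " + isValid(EXPANDED, bad));
    }
}
```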

That's all I have to say now. Thanks for reading this post.

Saturday, June 4, 2011

Dealing with multiple roots within an XML Schema

I've been thinking about this problem for a while, and have collected some opinions, which I'm presenting here.

We'll be working with the following XML Schema documents:

a.xsd [1]
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

    <xs:element name="x" type="xs:string"/>

    <xs:element name="y" type="xs:string"/>

    <xs:element name="z">
       <xs:complexType>
          <xs:sequence>
             <xs:element ref="x"/>
             <xs:element ref="y"/>
          </xs:sequence>
       </xs:complexType>
    </xs:element>

</xs:schema>

b.xsd [2]
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

    <xs:include schemaLocation="c.xsd"/>

    <xs:element name="z">
       <xs:complexType>
          <xs:sequence>
             <xs:element ref="x"/>
             <xs:element ref="y"/>
          </xs:sequence>
       </xs:complexType>
    </xs:element>

</xs:schema>

c.xsd [3]
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

    <xs:element name="x" type="xs:string"/>

    <xs:element name="y" type="xs:string"/>

</xs:schema>
The schema documents [1] and [2] are equivalent for the purpose of validating an XML instance document (it's just that the schema document b.xsd includes c.xsd).

Our application requires the following XML document to be successfully validated, by the schemas [1] or [2] above:

z.xml [4]
<z>
   <x>hello</x>
   <y>world</y>
</z>
All of this is just fine, and the XML document [4] gets successfully validated by the schemas [1] or [2] above.

But the above schema design (either [1] or [2]) may sometimes present the following problem:

A side effect of schema documents [1] or [2] is that they also successfully validate the following XML documents,

<x>...</x>

OR

<y>...</y>
This is because elements "x" and "y" are also valid roots defined in the schema (due to the global declarations of elements "x" and "y"). But the purpose of defining elements "x" and "y" in the schema is to include them by reference elsewhere in the schema document (as in the element declaration "z" in schemas [1] or [2]).
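The side effect is easy to reproduce with the JAXP validation API that ships with the JDK. In the sketch below, a.xsd [1] is inlined as a string; the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

class GlobalRoots {

    // a.xsd [1] from this post, inlined.
    static final String A_XSD =
          "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
        + "<xs:element name=\"x\" type=\"xs:string\"/>"
        + "<xs:element name=\"y\" type=\"xs:string\"/>"
        + "<xs:element name=\"z\"><xs:complexType><xs:sequence>"
        + "<xs:element ref=\"x\"/><xs:element ref=\"y\"/>"
        + "</xs:sequence></xs:complexType></xs:element>"
        + "</xs:schema>";

    // Returns true when xml validates against a.xsd.
    static boolean isValid(String xml) {
        Validator validator;
        try {
            validator = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(A_XSD)))
                    .newValidator();
        } catch (SAXException e) {
            throw new IllegalStateException("schema did not compile", e);
        }
        try {
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // validation error
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<z><x>hello</x><y>world</y></z>")); // true
        // Every global element declaration is a possible root:
        System.out.println(isValid("<x>hello</x>")); // true
        System.out.println(isValid("<y>world</y>")); // true
    }
}
```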

This kind of schema design is sometimes necessary for reasons of modularity (e.g. using one declaration in multiple places) and reusability (e.g. including a foreign schema in our own schema). The design becomes more beneficial as the complexity of the schema grows (e.g. more schema components, and deeper nesting of schema components).

So how do we live with the following design trade-off: having a schema like [1] or [2] above (which gives us the benefits of modularity and reusability), with the side effect that these schema documents validate multiple root elements (which risks an application accepting invalid XML documents; in this example, the roots "x" and "y" are invalid for the application, while the root "z" is valid)?

In this use case, if we want the application to reject XML documents with roots "x" or "y" but accept documents with root "z", then in my opinion this problem cannot be solved completely with the XML Schema language (there's currently no way in the XML Schema language to forbid validating the top-level XML element of an instance document against a global schema element declaration).

Solving this problem requires a small non-schema solution (e.g. a SAX Java add-on alongside schema validation).

Here's a sketch of a Java SAX application which can be and-ed with the XML Schema validation (using the schemas above), to achieve the desired overall XML validation effect (i.e. successful validation for the root element "z" and prohibiting the XML roots "x" and "y"),

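import java.io.IOException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SAXUtil extends DefaultHandler {

     // local names of the elements that must not be the document root
     String[] excludedElems = new String[] {"x", "y"};

     // parses only as far as the root element; returns false when the
     // root element's local name is in the forbidden list
     public boolean isRootElemOK(String docUri) throws IOException, ParserConfigurationException {
        boolean rootElemOk = true;

        try {
           SAXParserFactory saxParserFactory = SAXParserFactory.newInstance();
           saxParserFactory.setNamespaceAware(true);
           SAXParser saxParser = saxParserFactory.newSAXParser();
           saxParser.parse(docUri, this);
        }
        catch(SAXException ex) {
           // error code "100" means a forbidden root, "101" an allowed one;
           // any other SAXException (e.g. malformed XML) is not treated
           // as a root element failure here
           if (ex instanceof RootElementSAXException) {
              RootElementSAXException expObj = (RootElementSAXException) ex;
              if ("100".equals(expObj.getErrCode())) {
                 rootElemOk = false;
              }
           }
        }

        return rootElemOk;
     }

     // the first startElement event is the root element; an exception is
     // always thrown here, so parsing never goes past the root start tag
     @Override
     public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
        if (!isElementAllowed(localName)) {
           throw new RootElementSAXException("100");
        }
        throw new RootElementSAXException("101");
     }

     private boolean isElementAllowed(String localName) {
        boolean elemAllowed = true;
        for (int elemIdx = 0; elemIdx < excludedElems.length; elemIdx++) {
           if (localName.equals(excludedElems[elemIdx])) {
              elemAllowed = false;
              break;
           }
        }
        return elemAllowed;
     }

     // application-specific exception, to tell our own early termination
     // apart from the parser's built-in SAXException errors
     static class RootElementSAXException extends SAXException {
        private final String errorCode;

        public RootElementSAXException(String errorCode) {
           super("root element check, code " + errorCode);
           this.errorCode = errorCode;
        }

        public String getErrCode() {
           return errorCode;
        }
     }

} // class SAXUtil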

The following is an algorithmic summary of the above Java validation add-on,

1) A SAX parser is instantiated and parsing is invoked/triggered with the parse() method.

2) The SAX parser cannot go beyond parsing the root element -- the algorithm is intentionally designed this way (since the SAX "startElement" callback method always throws an exception [the user-defined exception RootElementSAXException] upon encountering the topmost element). The constructor parameter of RootElementSAXException ("100" or "101" in this case) determines whether the topmost element was allowed or not (which is decided by the forbidden list of element names, "excludedElems", defined in the above Java class).

Notes:

1) To terminate SAX parsing before the whole XML document has been parsed, a SAXException can be thrown from the SAX callback methods. A custom exception class (like RootElementSAXException in the above example) is desirable, to distinguish our application-defined exception from the built-in SAXException errors.

2) It's recommended to use the SAX API for this use case, since it'll be much more efficient than, for example, using DOM APIs, which would load the whole XML document into memory (which doesn't look like a sensible approach to me for just finding the name of the topmost element of an XML document).

3) The list of excluded element names can be externalized from the Java application, to make the above program reusable for any kind of XML document.

4) We may use something like the Java JAXP validation APIs to achieve the "and" of the two validation steps (i.e. schema validation and the SAX application step) described here, if we want to integrate this approach in a Java application.

5) The Java code snippet presented above can be made XML namespace aware (i.e. if the XML elements are in a namespace), by considering the namespace name parameter in the SAX callback methods (e.g. the method parameter "String uri" in the startElement callback method).
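Notes 4 and 5 can be combined into one sketch. It is a variant rather than the exact approach described in this post: instead of aborting at the root element, it attaches the schema to the SAX parser with SAXParserFactory.setSchema, lets the whole document be validated in a single pass, and records the namespace-qualified name of the first element seen. The class name, the method name and the "{namespace-uri}local-name" convention for the forbidden list are my own assumptions; a.xsd [1] is inlined as a string.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class RootAwareValidator {

    // a.xsd [1] from this post, inlined.
    static final String A_XSD =
          "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
        + "<xs:element name=\"x\" type=\"xs:string\"/>"
        + "<xs:element name=\"y\" type=\"xs:string\"/>"
        + "<xs:element name=\"z\"><xs:complexType><xs:sequence>"
        + "<xs:element ref=\"x\"/><xs:element ref=\"y\"/>"
        + "</xs:sequence></xs:complexType></xs:element>"
        + "</xs:schema>";

    // Forbidden roots as "{namespace-uri}local-name"; "{}x" is x in no namespace.
    static final Set<String> EXCLUDED = new HashSet<>(Arrays.asList("{}x", "{}y"));

    // True only if xml is schema-valid AND its root element is not forbidden.
    static boolean accept(String xml) {
        final String[] root = new String[1];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (root[0] == null) { // the first element seen is the root
                    root[0] = "{" + uri + "}" + localName;
                }
            }
            @Override
            public void error(SAXParseException e) throws SAXException {
                throw e; // DefaultHandler ignores validation errors by default
            }
        };
        try {
            Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(A_XSD)));
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setSchema(schema);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            return false; // not well-formed, or not schema-valid
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        return !EXCLUDED.contains(root[0]);
    }

    public static void main(String[] args) {
        System.out.println(accept("<z><x>hello</x><y>world</y></z>"));
        System.out.println(accept("<x>hello</x>"));
    }
}
```

The trade-off against the SAXUtil class above is that this variant always reads the whole document, so it gives up the early exit in exchange for doing both checks in one parse.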

I hope that this post is useful.

2011-06-26:
The explanation I originally gave in this blog post seems to convey that the XML Schema language allows multiple global element declarations in schema documents, and that this is an inherently bad/wrong design within the XML Schema language. One solution to prohibit certain XML elements from being the root of an XML instance document was presented earlier in this blog post (using an additional restricted SAX parsing step in an application).

All this is fine. But I wanted to follow up on my earlier thoughts in this post, arguing now that allowing multiple global element declarations is not a bad/wrong design in the XML Schema language. One of the features of the XML Schema language which requires multiple global element declarations is XML Schema "substitution groups" (i.e. one element substituting for another) -- and "substitution groups" are a core and important concept within the XML Schema language.

Of course, whether or not one is working with XML Schema "substitution groups", one could use the SAX add-on technique I presented earlier to prohibit certain global element declarations from validating the XML instance root element, if that suits someone's application design.

Saturday, April 30, 2011

XML Schema: facets constraining the cardinality of simpleType->list

I thought I should write a little clarification of a point I mentioned in my blog post, http://mukulgandhi.blogspot.com/2010/10/xsd-11-xml-schema-design-approaches.html.

I seem to have suggested in the above cited post that XML Schema 1.1 assertions are probably necessary to impose restrictions on the cardinality of an XML Schema simpleType list instance. This doesn't appear to be true, as I realized while reading the XML Schema spec lately; it allows the following constraining facets on XML Schema simpleTypes with variety list:
[1]
<xs:length ../>
<xs:minLength ../>
<xs:maxLength ../>

(ref, http://www.w3.org/TR/xmlschema11-2/#defn-coss which says, "If {variety} is list, then the applicable facets are assertions, length, minLength, maxLength, pattern, enumeration, and whiteSpace")

These constraining facets [1] on a simpleType with variety list were available in XML Schema 1.0 too.

These facets [1] may serve the design purpose I had mentioned in the above cited post (and should probably be even more efficient than using assertions, since assertions require compiling the XPath expressions in their "test" attributes and building quite a bit of context information for XPath expression evaluation).
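As a minimal illustration of the facets [1], the sketch below restricts a list of xs:integer to at most three items with xs:maxLength and checks two instances with the XSD 1.0 validator bundled with the JDK. The schema, the element name "nums", the type name "IntList" and the Java class name are all made up for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

class ListCardinality {

    // A list of integers, restricted to at most three items (hypothetical schema).
    static final String XSD =
          "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
        + "<xs:simpleType name=\"IntList\"><xs:list itemType=\"xs:integer\"/></xs:simpleType>"
        + "<xs:element name=\"nums\"><xs:simpleType>"
        + "<xs:restriction base=\"IntList\"><xs:maxLength value=\"3\"/></xs:restriction>"
        + "</xs:simpleType></xs:element>"
        + "</xs:schema>";

    // Returns true when xml validates against XSD.
    static boolean isValid(String xml) {
        Validator validator;
        try {
            validator = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)))
                    .newValidator();
        } catch (SAXException e) {
            throw new IllegalStateException("schema did not compile", e);
        }
        try {
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // validation error
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<nums>1 2 3</nums>"));   // three items
        System.out.println(isValid("<nums>1 2 3 4</nums>")); // four items
    }
}
```

For a list type, the length facets count list items rather than characters, which is what makes them usable as cardinality constraints.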

It is also worth mentioning that an assertion facet on a simpleType with variety list can be useful for other purposes (i.e. assertions are not without purpose!), for example:

<xs:assertion test="count($value) mod 2 = 0"/>

(the list instance must have an even number of items)

Thanks for reading this post!

Saturday, January 1, 2011

Happy New Year 2011

I wish readers of this blog a very Happy New Year 2011.

My new year resolutions are to have more interactions with the online community, particularly with folks at the XML, XML Schema, XSL and XQuery forums. I also wish to see W3C standards progress on the XML Schema 1.1, XPath 3.0, XSLT 3.0 and XQuery 3.0 languages (these are great new XML languages which I've been following recently). I'm also reading through the discussions at the newly set up HTML-XML convergence task force (I hope we'll see a few nice decisions emerging there)!

And needless to mention, I'm looking to work more closely with the Eclipse community.