Thursday, November 15, 2012

new thoughts about XSD 1.1 assertions

I've been thinking on these XSD topics for a while, and thought of summarizing my findings here.

Let me start this post by writing the following XML instance document (which will be the focus of all analysis in this post):

XML-1
<list attr="1 2 3 4 5 6">
    <item>a1</item>
    <item>a2</item>
    <item>a3</item>
    <item>a4</item>
    <item>a5</item>
    <item>a6</item>
</list>

We need to specify an XSD schema for the XML document above (XML-1), providing the following essential validation constraints:
1) The value of attribute "attr" is a sequence of single digit numbers. A number here can be modeled as an XSD type xs:integer, or as a restriction from xs:string (as we'll see below).
2) Each string value within an element "item" is of the form a[0-9]. i.e, this string value needs to be the character "a" followed by a single digit numeric character. We'll simply specify this with XSD type xs:string for now. We want that, each numeric character after "a" should be pair-wise same as the value at corresponding index within attribute value "attr". The above sample XML instance document (XML-1) is valid as per this requirement. Therefore, if we change any numeric value within the XML instance sample above (either within the attribute value "attr", or the numeric suffix of "a") only within the attribute "attr" or the elements "item", the XML instance document must then be reported as 'invalid' (this observation follows from the requirement that is stated in this point).

Now, let me come to the XSD solutions for these XML validation requirements.

First of all, we would need XSD 1.1 assertions to specify these validation constraints (since, this is clearly a co-occurrence data constraint issue.). Following is the first schema design, that quickly came to my mind:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
   
    <xs:element name="list">
        <xs:complexType>
           <xs:sequence>
              <xs:element name="item" type="xs:string" maxOccurs="unbounded"/>
           </xs:sequence>
           <xs:attribute name="attr">
              <xs:simpleType>
                 <xs:list itemType="xs:integer"/>
              </xs:simpleType>
           </xs:attribute>
           <xs:assert test="deep-equal(item/substring-after(., 'a'), data(@attr))"/>
        </xs:complexType>
    </xs:element>
   
</xs:schema>

The above schema is almost correct, except for a little problem with the way assertion is specified. As per the XPath 2.0 spec, the "deep-equal" function when comparing the two sequences for deep equality checks, requires that atomic values at same indices in the two sequences must be equal as per the rules of equality of an XSD atomic type. Within an assertion in the above schema, the first argument of "deep-equal" has a type annotation of xs:string* and the second argument has a type annotation xs:integer* (note that, the XPath 2.0 "data" function returns the typed value of a node) and therefore the "deep-equal" function as used in this case returns a 'false' result.

Assuming that we would not change the schema specification of "item" elements and the attribute "attr", the following assertion would therefore be correct to realize the above requirement:

<xs:assert test="deep-equal(item/substring-after(., 'a'), for $att in data(@attr) return string($att))"/>

(in this case, we've converted the second argument of "deep-equal" function (highlighted with a different color) to have a type annotation xs:string* and did not modify the type annotation of the first argument)

An alternative correct modification to the assertion would be:

<xs:assert test="deep-equal(item/number(substring-after(., 'a')), data(@attr))"/>

(in this case, we convert the type annotation of the first argument of "deep-equal" function to xs:integer* and do not modify the type annotation of the second argument)

I now propose a slightly different way to specify the schema for above requirements. Following is the modified schema document:

XS-2
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  
    <xs:element name="list">
        <xs:complexType>
           <xs:sequence>
              <xs:element name="item" type="xs:string" maxOccurs="unbounded"/>
           </xs:sequence>
           <xs:attribute name="attr">
              <xs:simpleType>
                 <xs:list itemType="NumericChar"/>
              </xs:simpleType>
           </xs:attribute>
           <xs:assert test="deep-equal(item/substring-after(., 'a'), data(@attr))"/>
        </xs:complexType>
    </xs:element>
  
    <xs:simpleType name="NumericChar">
       <xs:restriction base="xs:string">
          <xs:pattern value="[0-9]"/>
       </xs:restriction>
    </xs:simpleType>
  
</xs:schema>

This schema document is right in all respects, and successfully validates the XML document specified above (i.e, XML-1). In this schema we've made following design decisions:
1) We've specified the itemType of list (the value of attribute "attr" is this list instance) as "NumericChar" (this is a user-defined simpleType, that uses the xs:pattern facet to constrain list items).
2) The "deep-equal" function as now written in the schema XS-2, has the type annotation xs:string* for both of its arguments. And therefore, it works fine.

I'll now try to summarize below the pros and cons of schema XS-2 wrt the other correct solutions specified earlier:
1) If the simpleType definition of attribute "attr" is not used in another schema context (i.e, ideally if this simpleType definition is the only use of such a type definition). Or in other words there is no need of re-usability of this type. Then the solution with schema XS-2  is acceptable.
2) If a schema author thought, that list items of attribute "attr" need to be numeric (due to semantic intent of the problem, or if the list's simpleType definition needs to be reused at more than one place and the other place needs a list of integers), then the schema solutions like shown earlier would be needed.

Here's another caution I can point wrt the schema solutions proposed above,
The above schemas would allow values within "item" elements like "pqra5" to produce a valid outcome with the "substring-after" function that is written in assertions. Therefore, the "item" element may be more correctly specified like,

<xs:element name="item" maxOccurs="unbounded">
    <xs:simpleType>
         <xs:restriction base="xs:string">
              <xs:pattern value="a[0-9]"/>
         </xs:restriction>
    </xs:simpleType>
</xs:element>

It is also evident, that XPath 2.0 "data" function allows us to do some useful things with simpleType lists, like getting the list's typed value and specifying certain checks on individual list items (possibly different checks on different list items) or accessing list items by an index (or a range of indices). For e.g, data(@attr)[2] or data(@attr)[position() gt 3]. This was not possible with XSD 1.0.

I hope that this post was useful, and hoping to come back with another post sometime soon.

No comments: