Sunday, April 25, 2010

XSD 1.1: negative "pattern" facets and assertions

While exploring XSD 1.1 assertions further, I've become fairly convinced that many limitations of the XSD "pattern" facet can be overcome with assertions. (Of course, one of the real benefits of XSD 1.1 assertions is the ability to specify co-occurrence constraints in XML Schema documents; here's a nice article explaining XML Schema 1.1 co-occurrence constraints.)

I think one of the things that can get quite difficult to express in XML Schema 1.0 is a negative word list.

For example, if we have this simple XML document:
  <fruit>apple</fruit>

Suppose we want the XML element "fruit" not to contain, say, the words "cherry" or "guava". Although this looks like a pretty straightforward regex use case, it is unfortunately quite cumbersome to express with the regular-expression syntax available in XSD 1.0.

My quick attempt to express this with XSD 1.0 was something like the following:
<xs:pattern value="^(cherry|guava)" />

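But unfortunately, the above pattern facet, and quite a few similar regexes, can't accomplish this seemingly common use case easily. One reason is that the XSD regex dialect has no lookahead and no way to negate a whole word, and a pattern is always matched against the entire value, so "^" does not act as an anchor there. (I think this is doable with XSD 1.0 regexes, but it would certainly be tedious to arrive at the right pattern. Regex experts could of course do this easily, but not me at this moment!)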

Now let me express these validation constraints with XSD 1.1 assertions. Here's a sample XSD 1.1 schema [1] that uses assertions to solve this and a few similar use cases:

  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

     <xs:element name="Example" type="Fruits1" />
  
     <xs:complexType name="Fruits1">
       <xs:sequence>
         <xs:element name="fruit" type="xs:string" />
         <xs:element name="exclude" type="xs:string" />
       </xs:sequence>
       <xs:assert test="not(fruit = tokenize(exclude,','))" />
     </xs:complexType>
  
     <xs:complexType name="Fruits2">
       <xs:sequence>
         <xs:element name="fruit" type="xs:string" />
         <xs:element name="exclude" type="xs:string" />
       </xs:sequence>
       <xs:assert test="not(fruit = (for $x in tokenize(exclude,',') return 
                                              normalize-space($x)))" />
     </xs:complexType>
  
     <xs:complexType name="Fruits3">
       <xs:sequence>
         <xs:element name="fruit" type="xs:string" />
         <xs:element name="exclude" type="xs:string" />
       </xs:sequence>
       <xs:assert test="not(fruit = (for $x in tokenize(exclude,',') return 
                                (string-join(tokenize($x,' '),''))))" />
     </xs:complexType>
  
   </xs:schema>

A sample XML instance document [2], which we'll validate with the above schema, is the following:
  <Example>
    <fruit>apple</fruit>
    <exclude>cherry,guava</exclude>
  </Example>

As stated in the original requirements above, we want the word in the element "fruit" not to be equal to any of the words in the comma-separated list in the "exclude" element.

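In the above XSD schema [1], the complex type "Fruits1" successfully validates the above XML instance document [2]. The "Example" element is declared with this type; to try the other two types, change its "type" attribute accordingly.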

The complex type "Fruits2" can handle an exclude list that has white space before or after the comma separator. For example, the list "cherry, guava" (note the extra white space after the comma) is treated as a proper exclusion list. The schema type "Fruits1" cannot handle this variant, because the token " guava" keeps its leading space and so never equals "guava".

And the complex type "Fruits3" can handle an exclude list like "cherry, g u a v a" (i.e., there could be white-space characters within a word). This is a figment of my imagination :), but such lexical constraints could certainly occur in instance documents.
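
For the simpler document shown at the start of this post, where the excluded words are fixed in the schema rather than carried in an "exclude" element, an assertion facet on a simple type could be enough. Here is a minimal sketch, assuming a hypothetical type name "FruitName" (this variant is not one of the examples tested above):

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

   <xs:element name="fruit" type="FruitName" />

   <!-- "FruitName" is a hypothetical name used only for this sketch -->
   <xs:simpleType name="FruitName">
     <xs:restriction base="xs:string">
        <!-- $value is the element's string value; the general comparison "="
             is true if it equals any item of the sequence on the right -->
        <xs:assertion test="not($value = ('cherry', 'guava'))" />
     </xs:restriction>
   </xs:simpleType>

</xs:schema>
```

With this type, <fruit>apple</fruit> should be valid while <fruit>cherry</fruit> should not. To reject values that merely contain an excluded word, the test could use the XPath contains() function instead of the equality comparison.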

PS: All the examples in this post were tested with Xerces-J.

I hope this post is useful.

Saturday, April 17, 2010

XSD 1.1: xs:precisionDecimal, assertions and Xerces-J updates

Section 1
Recently I went through the XSD primitive data type xs:precisionDecimal (newly introduced in XSD 1.1) in some detail, and tried to use XSD 1.1 assertions to simulate it as a user-defined XSD simple type derived by restriction from xs:decimal. I did this mainly to satisfy my curiosity and to explore XSD assertions further. I do believe a native implementation of xs:precisionDecimal should also exist in an XSD 1.1 implementation, or in language systems that use the XSD type system, for example a stand-alone XPath (2.x) implementation.

Here's an XSD 1.1 schema example, illustrating these concepts:
[1]
   <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

      <xs:element name="example" type="myPrecisionDecimal" />
  
      <xs:simpleType name="myPrecisionDecimal">
        <xs:restriction base="xs:decimal" xmlns:xerces="http://xerces.apache.org">
           <xs:totalDigits value="6" />
           <xs:fractionDigits value="4" />
           <xs:assertion test="string-length(substring-after(string($value), '.')) ge 2" 
                  xerces:message="minScale of this decimal number should be 2" />
        </xs:restriction>
      </xs:simpleType>
  
   </xs:schema>

The XSD type "myPrecisionDecimal" defined above corresponds to the type xs:precisionDecimal as follows:
a) The xs:totalDigits facet in "myPrecisionDecimal" is equivalent to the xs:totalDigits facet of xs:precisionDecimal.
b) The xs:fractionDigits facet in "myPrecisionDecimal" is equivalent to the "maxScale" facet of xs:precisionDecimal.
c) The assertion facet in "myPrecisionDecimal" is equivalent (a user-defined attempt to emulate it!) to the "minScale" facet of xs:precisionDecimal.

When the above schema document [1] is used to validate the following XML instance:
<example>44.4</example>
the following error message is produced:
[Error] test.xml:1:24: cvc-assertion.failure: Assertion failure. minScale of this decimal number should be 2.

It's also worth noting that the above user-defined type "myPrecisionDecimal" cannot be considered a true equivalent of the XSD type xs:precisionDecimal as defined in the XSD 1.1 spec, because xs:precisionDecimal also includes values for positive and negative infinity and for "not a number", and it differentiates between "positive zero" and "negative zero" (these aspects are not defined for xs:decimal). The above example for "myPrecisionDecimal" only demonstrates simulating the "minScale" facet of xs:precisionDecimal, which is not available in the type xs:decimal.
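
One more caveat about the assertion: $value is the typed xs:decimal value, and under the XPath 2.0 casting rules string() returns its canonical form, which drops trailing zeros. An instance such as 44.40 may therefore be seen as "44.4" and fail the test, even though it is written with two fraction digits. If the intent is to constrain the digits as written, a "pattern" facet (which works on the lexical form) might be an alternative. A minimal sketch, with a hypothetical type name and a regex that is my own assumption rather than something tested for this post:

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

   <xs:element name="example" type="myPrecisionDecimalLexical" />

   <!-- "myPrecisionDecimalLexical" is a hypothetical name for this sketch -->
   <xs:simpleType name="myPrecisionDecimalLexical">
     <xs:restriction base="xs:decimal">
        <xs:totalDigits value="6" />
        <xs:fractionDigits value="4" />
        <!-- optional sign, optional integer digits, a decimal point,
             then at least two fraction digits as written in the instance -->
        <xs:pattern value="[+\-]?[0-9]*\.[0-9]{2,}" />
     </xs:restriction>
   </xs:simpleType>

</xs:schema>
```

The trade-off is that a failed pattern is reported with the processor's generic pattern error rather than a custom message.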

Section 2
(Xerces-J, assertions implementation update)

Xerces-J recently implemented an extension attribute "message" (in the namespace http://xerces.apache.org, for the Xerces-J XSD 1.1 implementation) on XSD 1.1 assertion instructions. The value of this attribute is an error message that the XSD 1.1 engine reports when the assertion fails.

An example of this is illustrated in the schema document above [1].

In the absence of the "message" attribute on assertions (or if it's present but contains no non-whitespace characters), the following default error message is produced by Xerces:
[Error] test.xml:1:24: cvc-assertion.3.13.4.1: Assertion evaluation ('string-length(substring-after(string($value), '.')) ge 2') for element 'example' with type 'myPrecisionDecimal' did not succeed.


We can see the benefits of the "message" attribute on assertions, which in my opinion are the following:
a) For complex (and particularly lengthy) XPath expressions in assertions, the default error messages produced by Xerces can be quite verbose, which users may not find convenient to read and debug. The experience with default assertion error messages gets more troublesome when there are numerous assertion evaluations for an XML document. Imagine, say, a maxOccurs="unbounded" specification on XML elements to which assertions apply, or more than 10 different assertions.
b) We can specify domain-specific error messages with the assertions "message" attribute.
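
As an illustration of point b), the "Fruits1" type from the earlier post could carry a domain-specific message. This is only a sketch: the message text is made up, and placing xerces:message on xs:assert (rather than on the xs:assertion facet shown in [1]) is my assumption, based on the attribute being described for assertion instructions in general.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:xerces="http://xerces.apache.org">

   <xs:element name="Example" type="Fruits1" />

   <xs:complexType name="Fruits1">
     <xs:sequence>
       <xs:element name="fruit" type="xs:string" />
       <xs:element name="exclude" type="xs:string" />
     </xs:sequence>
     <!-- the message text below is illustrative only -->
     <xs:assert test="not(fruit = tokenize(exclude,','))"
                xerces:message="The fruit must not be one of the excluded words" />
   </xs:complexType>

</xs:schema>
```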

The advantage of the default assertion error messages produced by Xerces, though, is that they show the user the name of the XSD type and the element/attribute involved in a particular assertion validation episode.

PS: An issue was recently raised with the XSD WG proposing the addition of a "message" attribute on assertions in the XSD 1.1 language itself. The Xerces implementation of the assertions "message" attribute may change in the future, depending on the XSD WG's recommendation on this.

I hope this post is useful.