Thursday, January 26, 2012

Using XSD 1.1 assertions on complexType mixed contents

Some interesting ;) thoughts have been coming to my mind lately, and not surprisingly they are again related to XSD. I was playing with XSD 1.1 assertions once more, trying to constrain an XSD complexType{mixed} content model, and I'm sharing some of my findings here. (I don't think I have written about this particular topic before, on this blog or on any other forum. If anything in this post duplicates something I wrote elsewhere, kindly ignore the earlier version.) Now, on to the topic.

What is XSD mixed content? (You may skip this part if you already know about it.)
I believe this isn't really an XSD-only topic. Mixed content is present in plain XML: a good old well-formed XML document may have "mixed" content and needn't be validated at all (i.e. in a schema-free XML environment). What XSD adds is the ability to report such an XML instance document as 'valid' (more importantly, XSD reports a "mixed" content XML instance as 'invalid' if it is validated against an "element only" content model specified by an XSD complexType definition), and also to constrain XML mixed content in certain ways (with XSD 1.1 in some new ways, which I'll talk about further below).

Example of "element only" (content of element "X" here) XML content model [X1]:

<X>
  <Y/>
  <Z/>
</X>

Example of "mixed content" (content of element "X" here) XML content model [X2]:

<X>
  abc
  <Y/>
  123
  <Z/>
  654
</X> 

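Therefore, "mixed content" allows non-whitespace text nodes as siblings of element nodes. (Whitespace-only text between elements, such as the indentation in X1, is permitted in "element only" content as well.)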

XSD 1.0 schema definition that allows "mixed" content [XS1]:
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

    <xs:element name="X">    
        <xs:complexType mixed="true">
             <xs:sequence>
                 <xs:element name="Y"/>
                 <xs:element name="Z"/>
             </xs:sequence>
        </xs:complexType>
    </xs:element>
    
</xs:schema>

This schema (XS1) would report the XML document "X2" above as 'valid' (since that instance document has "mixed" content, and this schema allows "mixed" content via a property "mixed = 'true'" on a complexType definition).

But if, in the schema document "XS1" above, we remove the property specifier "mixed = 'true'" or set the value of the attribute "mixed" to 'false' (which is also the default value of this attribute), then the modified schema would report the XML instance document "X2" above as 'invalid' (whereas the XML document "X1" above would be reported as 'valid', since it doesn't have "mixed" content).
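One point that may be worth making concrete: as far as I understand the rules, mixed="true" only permits text between the elements, while the element children themselves must still match the declared sequence. The instance below is a hypothetical illustration (it is not one of the documents tested for this post). I would expect XS1 to report it as 'invalid', because "Z" appears before "Y", even though the text itself is allowed.

```xml
<!-- Hypothetical instance: the text is allowed by mixed="true",
     but the element order Z, Y violates the xs:sequence (Y, Z) in XS1 -->
<X>
  abc
  <Z/>
  123
  <Y/>
  654
</X>
```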

New capabilities provided by XSD 1.1 to constrain XML "mixed" content further:

The following is a list of the new things XSD 1.1 makes possible for XML "mixed" content that currently come to my mind:

a)

XSD 1.1 schema "XS2":
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

    <xs:element name="X">    
       <xs:complexType mixed="true">
          <xs:sequence>
             <xs:element name="Y"/>
             <xs:element name="Z"/>
          </xs:sequence>          
          <xs:assert test="deep-equal(text()[matches(.,'\w')]/normalize-space(.), ('abc','123','654'))"/>
       </xs:complexType>
    </xs:element>
    
</xs:schema>
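
The <assert> element in this schema (XS2) constrains the mixed content in the XML instance document to be a list (with the order of the list items being significant) of only a few specified values. In the XPath expression, text()[matches(.,'\w')] selects the text node children of "X" that contain at least one word character (so whitespace-only text nodes are skipped), and normalize-space(.) trims the surrounding whitespace of each selected item. The assertion is written only to illustrate the technical capabilities of assertions, not with any application in mind.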
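To see the order sensitivity of XS2 at work, here is a hypothetical instance (my own illustration, not one of the documents tested for this post). It contains the same three text items as X2, but with 'abc' and '123' swapped. Assuming a conforming XSD 1.1 processor, I would expect X2 to be 'valid' against XS2 and this instance to be 'invalid', since the selected sequence ('123','abc','654') is not deep-equal to ('abc','123','654').

```xml
<!-- Hypothetical instance: same text items as X2, but in a different order -->
<X>
  123
  <Y/>
  abc
  <Z/>
  654
</X>
```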

The following are a few other things that XSD 1.1 assertions could achieve in the context of an XML "mixed" content model:

b)
<xs:assert test="((text()[matches(.,'\w')]/normalize-space(.))[2] castable as xs:integer)
                    and
                 ((text()[matches(.,'\w')]/normalize-space(.))[3] castable as xs:integer)"/>

This assertion constrains specific items of the XML "mixed" content list to be of a specified XSD schema type. Here the 2nd and 3rd items of the list must be castable as xs:integer, whereas the first item is left "untyped".
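For readers who want to try assertion (b) directly, the fragment can be dropped into the same complexType as in XS2. The following is a minimal sketch of such a complete schema. It only recombines pieces already shown in this post, and the comment inside it is mine.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

    <xs:element name="X">
       <xs:complexType mixed="true">
          <xs:sequence>
             <xs:element name="Y"/>
             <xs:element name="Z"/>
          </xs:sequence>
          <!-- the 2nd and 3rd non-whitespace text items must be castable as xs:integer;
               the 1st item is left unconstrained -->
          <xs:assert test="((text()[matches(.,'\w')]/normalize-space(.))[2] castable as xs:integer)
                              and
                           ((text()[matches(.,'\w')]/normalize-space(.))[3] castable as xs:integer)"/>
       </xs:complexType>
    </xs:element>

</xs:schema>
```

With this schema, I would expect the document X2 above to pass (its text items are 'abc', '123', '654'), while an instance whose second text item is, say, 12.5 should fail, because '12.5' is not castable as xs:integer. By the same reasoning, an instance with fewer than three such text items should also fail: the expression [3] then yields an empty sequence, and an empty sequence is not castable as xs:integer when the type is written without a trailing "?".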

c)
<xs:assert test="count((text()[matches(.,'\w')]/normalize-space(.))[. castable as xs:integer])
                    =
                 count(text()[matches(.,'\w')]/normalize-space(.))"/>

This assertion constrains all items of the XML "mixed" content list to be of the same type (xs:integer in this case). It uses a well-known pattern: "the count of xs:integer items is equal to the count of all the items".
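The same constraint can, I believe, also be written with an XPath 2.0 quantified expression instead of the counting pattern. The fragment below is an alternative sketch of (c), not a form tested for this post. The variable name $t is arbitrary.

```xml
<!-- Alternative formulation of (c): every non-whitespace text item
     must be castable as xs:integer -->
<xs:assert test="every $t in text()[matches(.,'\w')]
                   satisfies
                 (normalize-space($t) castable as xs:integer)"/>
```

Both forms are satisfied trivially when "X" has no non-whitespace text at all: the counts are both zero in the first form, and a quantified expression over an empty sequence is true in the second.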

d)
<xs:assert test="every $x in text()[matches(.,'\w')][position() gt 1]
                   satisfies 
                (number(normalize-space($x)) gt number($x/preceding-sibling::text()[matches(.,'\w')][1]))"/>

This assertion constrains the XML "mixed" content list to be in strictly ascending numeric order (assuming that all items in the list are numeric, though it should be possible to handle a heterogeneously typed list as well and enforce numeric order only for the numeric list items).
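As a rough sketch of the heterogeneous case mentioned in the parenthesis above, the filter on the text nodes can be changed so that only numeric items take part in the comparison. This variant is untested, and it rests on an assumption of mine: an item counts as "numeric" when it is castable as xs:decimal (which leaves out values such as NaN, INF and exponent notation).

```xml
<!-- Untested sketch: enforce ascending order among the numeric text items only;
     non-numeric items (e.g. 'abc') are ignored by the filter -->
<xs:assert test="every $x in text()[normalize-space(.) castable as xs:decimal][position() gt 1]
                   satisfies
                 (number(normalize-space($x)) gt
                  number(normalize-space($x/preceding-sibling::text()[normalize-space(.) castable as xs:decimal][1])))"/>
```

Under that assumption, I would expect text items 5, abc, 7 (in this order) to satisfy the assertion, and 7, abc, 5 to violate it, since each numeric item is compared only with the nearest preceding numeric item.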

Summary: XSD 1.0 allowed "untyped" XML mixed content, which was uniformly permitted anywhere within the scope of an XML element validated by an XSD complexType that has mixed content. No further constraints on "mixed" content were possible in an XSD 1.0 environment. XSD 1.1 allows some new ways to constrain XML "mixed" content further (some of these capabilities were illustrated in the examples above). In my opinion, the likely benefit of constraining XML "mixed" content in some of the ways illustrated above is that it allows XML document authors to model certain semantic content within "mixed" content and make this knowledge available to XML applications. The examples X1, X2, XS1, XS2 and (b) to (d) above were tested with Apache Xerces (I hope that these examples also work with other XSD validators, notably Saxon, which currently also supports XSD 1.1).

I hope that this information was useful.