Saturday, July 17, 2010

XSD 1.1: XML schema design approaches contd... PART 3

I'm continuing with the XML Schema design thoughts series, with the third part here. The first two parts are available here:
1) PART 1
2) PART 2

All the examples here have been tested with Xerces-J 2.10.0.

(A disclaimer up front: the examples presented in this blog post are somewhat fictitious and may not serve a real-life use case. They are cooked up only to illustrate XML Schema 1.1 constructs and some of the design thinking behind them. I also use the phrase "element particles" in a lot of places. This simply means XML elements, but "particle" is a formal term defined by the XML Schema spec, designating schema components that carry minOccurs and maxOccurs attributes -- if these attributes are absent, they take their default values, which is 1 for both.)

I'm presenting a sample XML Schema 1.1 document with a corresponding XML document first, and then trying to reflect, from my point of view, on the design inherent in these examples:

XML Schema 1.1 specific constructs are emphasized with a different color.

[XML1]
  <Book>
     <name>XML in a Nutshell</name>
     <ISBN>AB-1001</ISBN>
     <author>Jimmy</author>
     <NoPages>100</NoPages>
  </Book>

[XML Schema 1]
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    
     <xs:element name="Book">
        <xs:complexType>
          <xs:complexContent>
            <xs:extension base="BOOK_FRAGMENT">
               <xs:openContent>
                 <xs:any processContents="lax" />
               </xs:openContent>
               <xs:assert test="not(* except (name, author, ISBN, NoPages)) and 
                                 (if (ISBN)
                                    then not(ISBN/*) 
                                    else true()) and 
                                 (if (NoPages) 
                                     then (not(NoPages/*) and (NoPages/text() castable as xs:positiveInteger))
                                     else true())" />    
            </xs:extension>
          </xs:complexContent>          
        </xs:complexType>
     </xs:element>
   
     <xs:complexType name="BOOK_FRAGMENT">
        <xs:sequence>
          <xs:element name="name" type="xs:string" />
          <xs:element name="author" type="xs:string" />
        </xs:sequence>
     </xs:complexType>

  </xs:schema>

The following use-case requirements motivated me to write this sample (I'm also trying to reflect on the schema design choices I've made, on which I surely invite comments from readers -- if you have the patience to read this post and respond!):
1. XML Schema 1.0 has a limitation: when a complex type (having sequence or choice particles) is derived by extension, the derived complex type can only add element particles at the end of the base type's element list. Suppose we want to re-use a complex type (having a sequence of element particles) by deriving from it by extension, and need to add element particles anywhere in between the elements of the base type. This is what the above schema (XML Schema 1) intends to do, and it does indeed successfully validate the corresponding XML document presented above (XML1).

2. A key design decision in the above schema (XML Schema 1) is to use the XML Schema 1.1 "openContent" instruction (newly introduced in version 1.1). The use of XSD 1.1 assertions here is optional, but very practical (which I'll try to explain!). An "openContent" instruction is essentially a wrapper around an xs:any wild-card: it has the same effect as the wild-card, but the wild-card is applied either interleaved with the elements of the content model or as a suffix after them (please feel free to read the XML Schema 1.1 spec to learn more about XSD 1.1 open content. Or, if you want a lighter [but brilliant] explanation, you may read Roger L. Costello's XML Schema 1.1 write-up available here).
The XML Schema 1.1 spec defines the "openContent" instruction as follows:
  <openContent
     id = ID
     mode = (none | interleave | suffix) : interleave
     {any attributes with non-schema namespace . . .}>
     Content: (annotation?, any?)
  </openContent>
It is the openContent instruction in "interleave" mode (the default openContent mode) that enables additional element particles to be interspersed between the base type's element particles.

3. In the above example, the XML elements "ISBN" and "NoPages" are added to the base type's element particles. They are not appended at the end of the base type's elements, but can appear anywhere within the resulting content model. For this particular example, the placement of the XML elements coming from the derived complex type is arbitrary, and is chosen only to illustrate the workings of the "openContent" instruction in "interleave" mode.

4. It's interesting to see the benefit of XSD 1.1 assertions here. The assertions impose certain constraints on the resulting content model (otherwise the content model is wide open, since the lax wild-card accepts any element). The assertions in the above schema document (XML Schema 1) mean:
  a) The resulting content model can only have the XML elements "name", "author", "ISBN" and "NoPages".
  b) If the element "ISBN" is present, it must not have child elements (i.e. it holds a plain string value), and if the element "NoPages" is present, it must not have child elements and its text must be castable to xs:positiveInteger.

I'm presenting below another variant of the schema above (XML Schema 1), which solves the same problem but in a slightly different way (with advantages and disadvantages described after the example):

[XML Schema 2]
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    
      <xs:element name="Book">
         <xs:complexType>
            <xs:complexContent>
               <xs:extension base="BOOK_FRAGMENT">
                  <xs:openContent>
                     <xs:any processContents="strict"/>
                  </xs:openContent>
                  <xs:assert test="count(distinct-values(for $elem in (* except (name, author)) return $elem/name())) = count(for $elem in (* except (name, author)) return $elem/name())"/>       
               </xs:extension>
            </xs:complexContent>          
         </xs:complexType>
      </xs:element>
   
      <xs:complexType name="BOOK_FRAGMENT">
         <xs:sequence>
           <xs:element name="name" type="xs:string"/>
           <xs:element name="author" type="xs:string"/>
         </xs:sequence>
      </xs:complexType>
   
      <xs:element name="ISBN" type="xs:string" />
   
      <xs:element name="NoPages" type="xs:positiveInteger" />

   </xs:schema>

The example XML document for this schema (XML Schema 2) remains the same (XML1). Here are the advantages (and unfortunately a small disadvantage as well, with a suggested workaround for it...) of the sample XML Schema 2:
1. Here we are using the xs:any wild-card with processContents="strict" (the earlier example used the wild-card in "lax" mode) and providing the corresponding element declarations in the schema (the last two element declarations). This approach has the advantage that the content models of the elements "ISBN" and "NoPages" are enforced natively by the XML schema engine, and the schema author doesn't have to implement the content model constraints herself/himself (for example, that an element has no child elements and holds an atomic value) -- say via assertions. This approach is more robust than trying to achieve a similar effect with assertions.

2. The assertion in the schema document [XML Schema 2] enforces that each element other than "name" and "author" (i.e. each element admitted by the wild-card) occurs at most once. This is accomplished by this simple comparison:
count(distinct-values(names...)) = count(names...)

3. The only drawback I foresee with XML Schema 2 is that the elements "ISBN" and "NoPages" are now global elements (which is necessary for the xs:any wild-card to work with processContents="strict"). This has the implication that the following XML documents would be reported valid as well by the schema document XML Schema 2:
  <ISBN>AB-1001</ISBN>
AND
  <NoPages>100</NoPages>

This is a side-effect of the schema document XML Schema 2, which I personally don't like :(
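
Coming back to point 2 above: the count(distinct-values(...)) = count(...) comparison can be mirrored outside the schema to see what it accepts and rejects. The following is only a rough Java sketch using the JDK's DOM parser. It is not how an XSD processor evaluates assertions, and the class and method names (UniqueExtras, extrasOccurOnce) are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: mirrors the XPath test of [XML Schema 2] in plain Java.
class UniqueExtras {

    // Returns true when every child of the root element, other than
    // "name" and "author", occurs at most once.
    static boolean extrasOccurOnce(String bookXml) {
        try {
            Element book = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(bookXml)))
                    .getDocumentElement();

            // Corresponds to: for $elem in (* except (name, author)) return $elem/name()
            List<String> names = new ArrayList<>();
            for (Node n = book.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() != Node.ELEMENT_NODE) {
                    continue; // skip whitespace text nodes, comments, etc.
                }
                String name = n.getNodeName();
                if (!name.equals("name") && !name.equals("author")) {
                    names.add(name);
                }
            }

            // Corresponds to: count(distinct-values(names)) = count(names)
            return new HashSet<>(names).size() == names.size();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse the XML document", e);
        }
    }
}
```

With the document XML1 the two counts are both 2 ("ISBN" and "NoPages"), so the check passes. With a second "ISBN" element the name list has three entries but only two distinct values, so the check fails.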

To work around this side-effect of the global element declarations, I can imagine the following:
We could perform two validations in sequence: one with the schema document [XML Schema 2] (let's call this validation V1), and a second one with the following schema document (let's call this validation V2):

[XML Schema 3]
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      
      <xs:element name="ISBN" type="xs:string" />
   
      <xs:element name="NoPages" type="xs:positiveInteger" />

  </xs:schema>

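This is a kind of little validation pipeline. The complete, end-to-end schema validation (the one that usually carries the domain meaning) succeeds if validation V1 succeeds but V2 doesn't: a document whose root is "ISBN" or "NoPages" passes both V1 and V2 and is therefore rejected, whereas a "Book" document passes V1 and fails V2, because XML Schema 3 has no declaration for "Book". (I imagine that this kind of pipeline operation could be enforced by a host language, like Java using the XML Schema JAXP APIs.)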
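
A minimal sketch of such a pipeline with the JAXP validation API might look like the following. It rests on some assumptions. The JDK's built-in validator only understands XSD 1.0, so V1_STAND_IN_XSD is a hypothetical XSD 1.0 stand-in for [XML Schema 2]: it fixes the element order and has no openContent or assertion, but it shares the global "ISBN" and "NoPages" declarations and therefore the same side-effect. The class name and method names are made up for illustration as well.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical sketch of the "V1 succeeds but V2 doesn't" pipeline.
class ValidationPipeline {

    // ASSUMPTION: XSD 1.0 stand-in for [XML Schema 2], not the schema from the post.
    static final String V1_STAND_IN_XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='Book'><xs:complexType><xs:sequence>"
        + "<xs:element name='name' type='xs:string'/>"
        + "<xs:element ref='ISBN'/>"
        + "<xs:element name='author' type='xs:string'/>"
        + "<xs:element ref='NoPages'/>"
        + "</xs:sequence></xs:complexType></xs:element>"
        + "<xs:element name='ISBN' type='xs:string'/>"
        + "<xs:element name='NoPages' type='xs:positiveInteger'/>"
        + "</xs:schema>";

    // [XML Schema 3]: only the two global element declarations.
    static final String SCHEMA_3_XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='ISBN' type='xs:string'/>"
        + "<xs:element name='NoPages' type='xs:positiveInteger'/>"
        + "</xs:schema>";

    // Compiles a schema document for the given schema language URI.
    static Schema compile(String schemaLanguage, String xsd) {
        try {
            return SchemaFactory.newInstance(schemaLanguage)
                    .newSchema(new StreamSource(new StringReader(xsd)));
        } catch (SAXException e) {
            throw new IllegalArgumentException("schema does not compile", e);
        }
    }

    // With no ErrorHandler set, the validator throws on the first error.
    static boolean isValid(Schema schema, String xml) {
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // The pipeline rule: accept only if V1 succeeds and V2 fails.
    static boolean accepted(Schema v1, Schema v2, String xml) {
        return isValid(v1, xml) && !isValid(v2, xml);
    }

    // Convenience entry point that uses the XSD 1.0 stand-in as V1.
    static boolean acceptedWithStandIn(String xml) {
        String xsd10 = XMLConstants.W3C_XML_SCHEMA_NS_URI;
        return accepted(compile(xsd10, V1_STAND_IN_XSD), compile(xsd10, SCHEMA_3_XSD), xml);
    }
}
```

Calling acceptedWithStandIn with the document XML1 should give true, while a standalone `<ISBN>AB-1001</ISBN>` or `<NoPages>100</NoPages>` document should give false. To run the real [XML Schema 2] as V1, an XSD 1.1 capable processor is needed on the classpath; with Xerces-J's XSD 1.1 support, the schema language URI passed to compile would, to my knowledge, be "http://www.w3.org/XML/XMLSchema/v1.1".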

Thanks for reading!

I hope that this post is useful.
