Saturday, July 17, 2010

XSD 1.1: XML schema design approaches contd... PART 3

I'm continuing with the XML Schema design thoughts series, with the third part here. The first two parts are available here:
1) PART 1
2) PART 2

All the examples here have been tested with Xerces-J 2.10.0.

(A disclaimer up front: the examples presented in this blog post are somewhat fictitious and may not serve a real-life use-case. They are cooked up only to illustrate XML Schema 1.1 constructs and some of the design thinking behind them. In a lot of places I also use the phrase "element particles". This simply means XML elements, but "particle" is a formal term defined by the XML Schema spec, designating schema components that carry minOccurs and maxOccurs attributes. If these attributes are absent, they take their default values, which is 1 for both.)

I'm presenting a sample XML Schema 1.1 document with a corresponding XML document first, and then trying to reflect, from my point of view, on the design inherent in these examples:

The XML Schema 1.1 specific constructs in the schema below are xs:openContent and xs:assert (the original post emphasized them with a different color).

[XML1]
  <Book>
     <name>XML in a Nutshell</name>
     <ISBN>AB-1001</ISBN>
     <author>Jimmy</author>
     <NoPages>100</NoPages>
  </Book>

[XML Schema 1]
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    
     <xs:element name="Book">
        <xs:complexType>
          <xs:complexContent>
            <xs:extension base="BOOK_FRAGMENT">
               <xs:openContent>
                 <xs:any processContents="lax" />
               </xs:openContent>
               <xs:assert test="not(* except (name, author, ISBN, NoPages)) and 
                                 (if (ISBN)
                                    then not(ISBN/*) 
                                    else true()) and 
                                 (if (NoPages) 
                                     then (not(NoPages/*) and (NoPages/text() castable as xs:positiveInteger))
                                     else true())" />    
            </xs:extension>
          </xs:complexContent>          
        </xs:complexType>
     </xs:element>
   
     <xs:complexType name="BOOK_FRAGMENT">
        <xs:sequence>
          <xs:element name="name" type="xs:string" />
          <xs:element name="author" type="xs:string" />
        </xs:sequence>
     </xs:complexType>

  </xs:schema>

The following use-case requirements motivated me to write this sample (I'm also trying to reflect on the schema design choices I've made, on which I certainly invite comments from readers, if you have the patience to read this post and respond!):
1. XML Schema 1.0 has a limitation: when a complex type (having sequence or choice particles) is derived by extension, the derived complex type can only add element particles at the end of the base type's element list. Suppose that we want to re-use a complex type (having a sequence of element particles) by deriving from it by extension, and need to add element particles anywhere in between the elements of the base type. This is what the above schema (XML Schema 1) intends to do, and it does indeed successfully validate the corresponding XML document presented above (XML1).

2. A key design decision in the above schema (XML Schema 1) is to use the XML Schema 1.1 "openContent" instruction (newly introduced in version 1.1). The use of XSD 1.1 assertions here is optional, but is very practical (which I'll try to explain!). An "openContent" instruction is essentially a wrapper around an xs:any wild-card. It produces the same effect as the wild-card, but with either an interleave or a suffix-appending behavior (please feel free to read the XML Schema 1.1 spec to learn more about XSD 1.1 open content. Or, if you want a lighter [but brilliant] explanation, you may read Roger L. Costello's XML Schema 1.1 write-up available here).
The XML Schema 1.1 spec defines the "openContent" instruction as follows:
  <openContent
     id = ID
     mode = (none | interleave | suffix) : interleave
     {any attributes with non-schema namespace . . .}>
     Content: (annotation?, any?)
  </openContent>
It is the "interleave" mode of openContent (which is the default mode) that enables adding element particles interspersed between the base type's element particles.

3. In the above example, the XML elements "ISBN" and "NoPages" are added to the base type's element particles. They are not appended at the end of the base type's elements, but can appear anywhere within the resulting XML content model. For this particular example, the placement of the XML elements coming from the derived complex type is arbitrary, and is chosen only to illustrate the workings of the "openContent" instruction in "interleave" mode.

4. It's interesting to see the benefit of XSD 1.1 assertions here. The assertions here are able to impose certain constraints on the resultant content model (otherwise the content model is kind-of wide open with no restrictions). The assertions in the above schema document (XML Schema 1) mean:
  a) The resulting content model can only have XML elements -> "name", "author", "ISBN" and "NoPages".
  b) The element "ISBN" needs to be an atomic string value, and the element "NoPages" needs to be an xs:positiveInteger value.
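
To make points 2 to 4 above more concrete, here is a small Python sketch. Python is used only as a neutral scripting stand-in; this is not how Xerces-J evaluates the schema. Under simplifying assumptions, it models how the openContent modes treat the children of "Book", and what the xs:assert in XML Schema 1 checks. The function names (fits_open_content, is_positive_integer, assertion_holds) are made up for this illustration, and the model only covers a base type that is a plain sequence of required elements, combined with a wild-card that matches any element.

```python
import re
import xml.etree.ElementTree as ET

BASE = ["name", "author"]                      # element sequence of BOOK_FRAGMENT
ALLOWED = {"name", "author", "ISBN", "NoPages"}

def fits_open_content(children, mode="interleave"):
    """Simplified model of the openContent modes for a plain sequence BASE
    and a wild-card that matches every element (an assumption of this sketch)."""
    if mode == "none":
        return children == BASE                # no extra elements at all
    if mode == "suffix":
        return children[:len(BASE)] == BASE    # extras only after the base sequence
    # interleave: BASE must occur in order, extras may sit anywhere around it
    remaining = iter(children)
    return all(tag in remaining for tag in BASE)

def is_positive_integer(text):
    """Rough stand-in for 'castable as xs:positiveInteger'."""
    text = text.strip()
    return re.fullmatch(r"\+?[0-9]+", text) is not None and int(text) > 0

def assertion_holds(book):
    """Mirrors the three conditions of the xs:assert in XML Schema 1."""
    if any(child.tag not in ALLOWED for child in book):
        return False                           # not(* except (name, author, ISBN, NoPages))
    if any(len(isbn) for isbn in book.findall("ISBN")):
        return False                           # not(ISBN/*)
    pages = book.findall("NoPages")
    if pages:
        if any(len(p) for p in pages):
            return False                       # not(NoPages/*)
        texts = [p.text for p in pages if p.text]
        # the cast only succeeds for exactly one text node holding a positive integer
        return len(texts) == 1 and is_positive_integer(texts[0])
    return True

XML1 = """<Book>
  <name>XML in a Nutshell</name>
  <ISBN>AB-1001</ISBN>
  <author>Jimmy</author>
  <NoPages>100</NoPages>
</Book>"""

book = ET.fromstring(XML1)
tags = [child.tag for child in book]
print(fits_open_content(tags, "interleave"))
print(fits_open_content(tags, "suffix"))
print(assertion_holds(book))
```

With XML1 the interleave check passes while the suffix check fails, because "ISBN" appears before "author". That is the difference between the two modes that this example relies on.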

I'm presenting below another XML schema variant, which solves the same problem as XML Schema 1 above, but in a slightly different way (with advantages and disadvantages described after the example):

[XML Schema 2]
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    
      <xs:element name="Book">
         <xs:complexType>
            <xs:complexContent>
               <xs:extension base="BOOK_FRAGMENT">
                  <xs:openContent>
                     <xs:any processContents="strict"/>
                  </xs:openContent>
                  <xs:assert test="count(distinct-values(for $elem in (* except (name, author)) return $elem/name())) = count(for $elem in (* except (name, author)) return $elem/name())"/>       
               </xs:extension>
            </xs:complexContent>          
         </xs:complexType>
      </xs:element>
   
      <xs:complexType name="BOOK_FRAGMENT">
         <xs:sequence>
           <xs:element name="name" type="xs:string"/>
           <xs:element name="author" type="xs:string"/>
         </xs:sequence>
      </xs:complexType>
   
      <xs:element name="ISBN" type="xs:string" />
   
      <xs:element name="NoPages" type="xs:positiveInteger" />

   </xs:schema>

The example XML document for this schema (XML Schema 2) remains the same (XML1). Here are the advantages of XML Schema 2 (and unfortunately a small disadvantage as well, with a suggested workaround for it):
1. Here we are using the xs:any wild-card with processContents="strict" (the earlier example used the wild-card in "lax" mode) and providing the corresponding element declarations in the schema (the last two element declarations). The advantage of this approach is that the content models of the elements "ISBN" and "NoPages" are enforced natively by the XML schema engine, and the schema author doesn't have to implement the content model constraints herself/himself (for example, that an element has no child elements and holds an atomic value), say via assertions. This approach is more robust than trying to achieve a similar effect with assertions.

2. The assertion in the schema document [XML Schema 2] enforces that each element other than "name" and "author" can occur only once. This is accomplished by this simple algorithm:
count(distinct-values(names...)) = count(names...)

3. The only drawback I foresee with XML Schema 2 is that the elements "ISBN" and "NoPages" are now global elements (which is necessary for the xs:any wild-card to work with processContents="strict"). The implication is that the following XML documents would be reported valid as well by the schema document XML Schema 2:
  <ISBN>AB-1001</ISBN>
AND
  <NoPages>100</NoPages>

This is a side-effect of schema document XML Schema 2, which I myself personally don't seem to like :(
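
Going back to point 2 above, the uniqueness check is easy to try outside a schema. The following Python sketch is only a stand-in for the XPath 2.0 expression, with a made-up function name (extras_occur_once). It compares the number of distinct names with the total number of names, for all children other than "name" and "author":

```python
import xml.etree.ElementTree as ET

def extras_occur_once(book_xml, base=("name", "author")):
    """True when no child element outside `base` repeats, i.e.
    count(distinct-values(names)) = count(names)."""
    book = ET.fromstring(book_xml)
    names = [child.tag for child in book if child.tag not in base]
    return len(set(names)) == len(names)

ok = "<Book><name>n</name><ISBN>AB-1001</ISBN><author>a</author><NoPages>100</NoPages></Book>"
dup = "<Book><name>n</name><ISBN>AB-1001</ISBN><author>a</author><ISBN>AB-1002</ISBN></Book>"
print(extras_occur_once(ok), extras_occur_once(dup))
```

The second document repeats "ISBN", so the two counts differ and the check fails.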

To solve this limitation of global element declarations, I can imagine a workaround as follows:
We could perform two validations in sequence: one with the schema document [XML Schema 2] (let's call this validation V1), and a second one with the following schema document (let's call this validation V2):

[XML Schema 3]
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      
      <xs:element name="ISBN" type="xs:string" />
   
      <xs:element name="NoPages" type="xs:positiveInteger" />

  </xs:schema>

This is a kind of little validation pipeline. The complete, end-to-end schema validation (the one that usually carries the domain meaning) succeeds if validation V1 succeeds but V2 doesn't (I imagine that this kind of pipeline operation could be enforced by a host language, like Java using the JAXP XML Schema validation APIs).
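
The pass/fail logic of this pipeline can be sketched independently of any particular validator. In the Python sketch below, v1 and v2 are plain callables standing in for the two schema validations. The stand-ins used here only look at the root element name, which is an assumption made to keep the sketch runnable without an XSD 1.1 processor. In a real setup each callable would wrap a schema-aware validation against [XML Schema 2] and [XML Schema 3] respectively. All names in the sketch are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Callable

Validator = Callable[[str], bool]

def pipeline_accepts(v1: Validator, v2: Validator, xml_text: str) -> bool:
    """Accept a document only if validation V1 succeeds and V2 does not."""
    return v1(xml_text) and not v2(xml_text)

def v1_stand_in(xml_text: str) -> bool:
    # Stand-in for XML Schema 2: its global elements are Book, ISBN and NoPages.
    return ET.fromstring(xml_text).tag in {"Book", "ISBN", "NoPages"}

def v2_stand_in(xml_text: str) -> bool:
    # Stand-in for XML Schema 3: it declares only ISBN and NoPages.
    return ET.fromstring(xml_text).tag in {"ISBN", "NoPages"}

book_doc = "<Book><name>XML in a Nutshell</name><author>Jimmy</author></Book>"
isbn_doc = "<ISBN>AB-1001</ISBN>"

print(pipeline_accepts(v1_stand_in, v2_stand_in, book_doc))
print(pipeline_accepts(v1_stand_in, v2_stand_in, isbn_doc))
```

The standalone "ISBN" document passes V1 but also passes V2, so the pipeline rejects it, which is the effect the workaround is after.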

Thanks for reading!

I hope that this post is useful.

Sunday, July 11, 2010

XSD 1.1: XML schema design approaches contd... PART 2

I'm continuing with the XML Schema design approaches series I started in the previous blog post. Here's the second post in this series.

Here's a description of the use-case I'll be illustrating in this post, with both XML Schema 1.0 and 1.1 examples:

We need to write an XML Schema for the following XML content model:
  colors
    -> (violet | indigo | blue | green | yellow | orange | red)+

Here the words "colors", "violet" etc. represent XML elements, which have no attributes and are empty. The above content model means that the children of the element "colors" can repeat and are unordered, and at least one of them is required.

Therefore the following XML document is a valid instance according to this content model:

[XML1]
  <colors>
     <violet/>
     <indigo/>
     <blue/>
     <green/>
     <yellow/>
     <orange/>
     <red/>
  </colors>

AND for example, the following XML document is valid as well, as per the content model described above (here the element "colors" has fewer children than in the previous example, and one of the children of "colors" occurs more than once):

[XML2]
  <colors>
     <violet/>
     <indigo/>
     <blue/>
     <green/>
     <green/>
  </colors>

Here are two XML schema examples that express the above XML content model constraints:

[XML Schema 1] (written in XML Schema 1.0)
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    
     <xs:element name="colors">
        <xs:complexType>
           <xs:choice maxOccurs="unbounded">
              <xs:element name="violet" type="EMPTY" />
              <xs:element name="indigo" type="EMPTY" />
              <xs:element name="blue" type="EMPTY" />
              <xs:element name="green" type="EMPTY" />
              <xs:element name="yellow" type="EMPTY" />
              <xs:element name="orange" type="EMPTY" />
              <xs:element name="red" type="EMPTY" />     
           </xs:choice>
        </xs:complexType>
     </xs:element>
   
     <xs:complexType name="EMPTY"> 
        <xs:complexContent> 
          <xs:restriction base="xs:anyType" /> 
        </xs:complexContent> 
     </xs:complexType>

  </xs:schema>

[XML Schema 2] (written in XML Schema 1.1 -- the 1.1 specific constructs are the two xs:assert elements, which the original post displayed in a different color)
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    
     <xs:element name="colors">
        <xs:complexType>
           <xs:sequence>
             <xs:any maxOccurs="unbounded" processContents="lax" />
           </xs:sequence>
           <xs:assert test="every $x in */name() satisfies ($x = 
                              ('violet','indigo','blue','green','yellow','orange','red'))" />
           <xs:assert test="every $x in * satisfies not($x/node())" />
        </xs:complexType>
     </xs:element>

  </xs:schema>
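
For readers who want to experiment with the two assertions outside a schema processor, here is a rough Python equivalent. It is an illustrative stand-in, not what Xerces-J executes, and the function name colors_valid is made up. It checks the same three things as XML Schema 2: at least one child, every child name in the allowed list, and every child empty.

```python
import xml.etree.ElementTree as ET

COLORS = {"violet", "indigo", "blue", "green", "yellow", "orange", "red"}

def colors_valid(xml_text):
    root = ET.fromstring(xml_text)
    children = list(root)
    # xs:any has minOccurs = 1 by default, so at least one child is required
    if root.tag != "colors" or not children:
        return False
    # every $x in */name() satisfies ($x = ('violet', ...))
    names_ok = all(child.tag in COLORS for child in children)
    # every $x in * satisfies not($x/node()); this sketch only sees child
    # elements and text, because the default parser drops comments
    all_empty = all(len(child) == 0 and not child.text for child in children)
    return names_ok and all_empty

XML2 = """<colors>
  <violet/>
  <indigo/>
  <blue/>
  <green/>
  <green/>
</colors>"""

print(colors_valid(XML2))
print(colors_valid("<colors><pink/></colors>"))
```

Repeated and unordered children are accepted, while an unknown name such as "pink" or a non-empty child is rejected.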

Here's some quick analysis from my point of view, about the differences between the above schema approaches, and if any of the above approaches is better than the other one:
1) "XML Schema 1" is written in a familiar 1.0 style, so people who want to stick with 1.0 can still adopt this technique. We can observe that the first schema is a little more verbose than the second one, which I see as one of the advantages of the second one.

2) If you are comfortable writing XPath 2.0 expressions, then there are a great many ways to express schema validation constraints with XSD 1.1 assertions, which puts a lot of power in the hands of the schema author!

3) Personally speaking, I find the second way of writing the XML schema ("XML Schema 2") a really cool NEW way to express these validation constraints. I'm not suggesting that the 1st way is not good! That technique has great value in its own right and has stood the test of time. I find the second technique a more natural way for the schema author to express the logic of the use-case in question.

4) One of the possibilities I now foresee with XML Schema 1.1 is that the schema author could impose quite a few constraints on the xs:any wild-card via assertions (which is particularly useful with the processContents="lax" mode of the wild-card). A point worth observing is that with processContents="strict", assertions are not really useful, because the schema validator would strictly validate the XML element against an element declaration, which the schema author must provide to satisfy the "strict" mode of the wild-card (and assertions here would actually interfere with the available element declarations, which in my opinion is not a good design). With processContents="skip", assertions would always fail (and the XML instance would become invalid), because the concerned XML elements would be discarded by the XML schema validator, and consequently these elements would not be part of the XPath data-model tree on which assertions operate.

And needless to mention, Xerces-J handles all the above examples fine!

I hope that this post is useful.

Saturday, July 3, 2010

XSD 1.1: XML schema design approaches in XSD 1.1 world... PART 1

I'm planning to write a series of posts only on XML schema design, given the XML Schema 1.1 constructs (putting too many ideas in one post could be boring to read, and quite voluminous. I'll try to make sure that the blog posts, starting from this one, cross-reference each other on related issues, AND I'll say in some future blog post when I'm ending this series!). I'll try to reflect on why XML Schema 1.1 is essential for certain use-cases, and where XML Schema 1.0 falls short.

It is possible that there may be blog posts unrelated to this series between these posts. When this series completes, I'll try to summarize the ideas at the end, to make the whole series available as a unit.

A disclaimer up front: any advice offered here may not necessarily be the best. Improvements are generally always possible! Any feedback would be great (about the correctness of anything described here, alternative ideas OR anything else).

To start with the 1st post in this series, here's a little background about the use-case I'm describing in the subsequent paragraphs:
I've been reading the book "DB2 pureXML Cookbook: Master the Power of the IBM Hybrid Data Server" [1] recently by Matthias Nicola (a member of DB2 pureXML team). This book describes an example (in chapter 2) as follows.
A physical object could be described by two possible XML content models as follows:

a) Metadata as values, aka Name/Value Pairs (often bad):
  <object type="car">
    <field name="brand" value="Honda" />
    <field name="price" value="5000" />
    <field name="currency" value="USD" />
    <field name="year" value="2002" />
  </object>

b) Metadata as element names (good):
  <car>
    <brand>Honda</brand>
    <price currency="USD">5000</price>
    <year>1996</year>
  </car>

I won't describe here why one of the above XML design approaches is good or bad. This is described well in the book cited above [1], which I would encourage folks to read (the book has some nice explanations of DB2 pureXML as well).

Let's say we want to build an XSD schema for the XML document (a) above. To start with, one of the design decisions I took is to define a set of XML Schema types with a hierarchy as follows:
  OBJECT
     -> OBJECT_ON_SALE
            -> CAR
            -> BOOK

The other design decisions I've taken are as follows:
1) We'll use XSD 1.1 type alternatives to select between different schema types, depending on the value of the attribute object/@type.
2) We'll use a hierarchy of XSD 1.1 assertion definitions to enforce certain validation constraints.

The meaning of these design decisions should become clear by looking at the XML instances and the schema document I'm describing below:

The sample XML instances I propose are as follows:
[XML1] (this is the same as XML instance "a" above, and is repeated here for convenience)
  <object type="car">
     <field name="brand" value="Honda" />
     <field name="price" value="5000" />
     <field name="currency" value="USD" />
     <field name="year" value="2002" />
  </object>
OR

[XML2]
  <object type="book">
     <field name="title" value="XML in a Nutshell" />
     <field name="author" value="Jimmy" />
     <field name="author" value="Nick" />
     <field name="publisher" value="Prentice Hall" />
     <field name="price" value="15" />
     <field name="currency" value="USD" />
     <field name="year" value="2008" />
  </object>

I propose the following XSD 1.1 schema (the 1.1 specific constructs are xs:alternative and xs:assert, which the original post highlighted with a different color), which is designed to validate both of the above XML instance documents (after which I'll try to analyze a few design elements of this schema):

[SCHEMA1]
  <?xml version="1.0" encoding="UTF-8"?>
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
             xmlns:xerces="http://xerces.apache.org">
 
     <xs:element name="object" type="OBJECT">
       <xs:alternative test="@type='book'" type="BOOK" />
       <xs:alternative test="@type='car'" type="CAR" />
     </xs:element>
 
     <xs:complexType name="BOOK">
        <xs:complexContent>
           <xs:extension base="OBJECT_ON_SALE">
              <xs:assert test="field/@name = 'title' and
                               field/@name = 'author' and
                               field/@name = 'publisher' and
                               field/@name = 'year'"
                         xerces:message="For a book the fields title/author/publisher/year are mandatory" />
              <xs:assert test="xs:int(field[@name = 'year']/@value) gt 1900 and 
                               xs:int(field[@name = 'year']/@value) lt 2011"
                         xerces:message="A book's publication year must be between 1900 and 2011" />
              <xs:assert test="count(field[not(@name = 'author')]) = 
                               count(distinct-values(field[not(@name = 'author')]/string(@name)))" 
                         xerces:message="A book can have multiple authors, but none of other fields of a book can occur twice" />
           </xs:extension>
        </xs:complexContent>
     </xs:complexType>
 
     <xs:complexType name="CAR">
        <xs:complexContent>
           <xs:extension base="OBJECT_ON_SALE">
              <xs:assert test="field/@name = 'brand' and
                               field/@name = 'year'" 
                         xerces:message="For a car the fields brand/year are required" />
              <xs:assert test="xs:int(field[@name = 'year']/@value) gt 2000 and 
                               xs:int(field[@name = 'year']/@value) lt 2011" 
                         xerces:message="A car's manufacture year must be between 2000 and 2011" />
              <xs:assert test="count(field) = count(distinct-values(field/string(@name)))" 
                         xerces:message="None of the fields of an object 'car' can occur twice" />
           </xs:extension>
        </xs:complexContent>
     </xs:complexType> 
 
     <xs:complexType name="OBJECT_ON_SALE">
        <xs:complexContent>
          <xs:extension base="OBJECT">
             <xs:assert test="field/@name = ('price','currency')" 
                        xerces:message="An object that can be sold, must have the fields price/currency" />
             <xs:assert test="if (field/@name = 'price') then 
                              (field[@name = 'price']/xs:decimal(@value) gt 0 and
                              field/@name = 'currency')
                              else true()" 
                         xerces:message="If a field price is present, the currency field should exist. The value of price must be greater than 0." />
          </xs:extension>
        </xs:complexContent>
     </xs:complexType>
 
     <xs:complexType name="OBJECT">
       <xs:sequence>
         <xs:element name="field" minOccurs="0" maxOccurs="unbounded">
            <xs:complexType>
               <xs:attribute name="name" type="xs:string" />
               <xs:attribute name="value" type="xs:string" />
            </xs:complexType>
         </xs:element>
       </xs:sequence>
       <xs:attribute name="type" type="xs:string" />    
     </xs:complexType>

  </xs:schema>

The key design elements in the above schema document are the inheritance hierarchy and assertions/type-alternatives. The element "object" is validated by the schema type "CAR" or "BOOK". The set of assertions applicable on the type CAR/BOOK are the assertions on this type, as well as assertions inherited from the base types. The schema type applicable on the element "object" is controlled by the type-alternative switch (which works upon the value of attribute "type").

When the XML document [XML1] is validated by the schema document [SCHEMA1], the validation succeeds, as there are no validation errors (I used Xerces-J 2.10.0, or more precisely the latest code-base on Xerces SVN as of today, because a minor bug [which affects this particular example] got fixed a few days after Xerces-J 2.10.0 was released).

Let's try to introduce some data errors in the XML instance document, and see what happens upon validation with the same schema document.

Here's a modified XML instance document:
[XML3]
  <object type="car">
    <field name="price" value="-100" />
    <field name="currency" value="USD" />
    <field name="year" value="1999" />
  </object>

If the above XML document ([XML3], named "test.xml") is validated by the schema document [SCHEMA1], we get the following validation errors with Xerces-J:
[Error] test.xml:5:10: cvc-assertion.failure: Assertion failure. If a field price is present, the currency field should exist. The value of price must be greater than 0.
[Error] test.xml:5:10: cvc-assertion.failure: Assertion failure. For a car the fields brand/year are required.
[Error] test.xml:5:10: cvc-assertion.failure: Assertion failure. A car's manufacture year must be between 2000 and 2011.


If we remove all of the xerces:message attributes from the assertions above, the following error messages are printed by Xerces for the above scenario:
[Error] test.xml:5:10: cvc-assertion.3.13.4.1: Assertion evaluation ('if (field/@name = 'price') then (field[@name = 'price']/xs:decimal(@value) gt 0 and field/@name = 'currency') else true()') for element 'object' with type 'OBJECT_ON_SALE' did not succeed.
[Error] test.xml:5:10: cvc-assertion.3.13.4.1: Assertion evaluation ('field/@name = 'brand' and field/@name = 'year'') for element 'object' with type 'CAR' did not succeed.
[Error] test.xml:5:10: cvc-assertion.3.13.4.1: Assertion evaluation ('xs:int(field[@name = 'year']/@value) gt 2000 and xs:int(field[@name = 'year']/@value) lt 2011') for element 'object' with type 'CAR' did not succeed


It's up to users to decide which error format is appropriate for them (the format without xerces:message prints more error-context information, while the format with xerces:message is useful for printing user-friendly error messages upon assertion failures).
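
To make the interplay of type alternatives and inherited assertions concrete, here is a rough Python model of [SCHEMA1]. This is a sketch under stated assumptions, not how Xerces-J works internally: each schema type becomes a list of (message, check) pairs, a derived type starts with the pairs of its base type, and a dictionary plays the role of the xs:alternative switch. All helper names (names, value, has_all, failed_messages and so on) are made up for this illustration, and the XPath semantics are only approximated (for example, only the first "year" or "price" field is inspected).

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

def names(obj):
    """All field/@name values in document order."""
    return [f.get("name") for f in obj.findall("field")]

def value(obj, name):
    """@value of the first field with this @name, or None when absent."""
    return next((f.get("value") for f in obj.findall("field")
                 if f.get("name") == name), None)

def has_all(obj, *required):
    return set(required) <= set(names(obj))

def year_between(obj, low, high):
    year = value(obj, "year")
    return year is not None and low < int(year) < high

def no_repeats(items):
    return len(items) == len(set(items))

def price_rule(obj):
    if "price" not in names(obj):
        return True
    return Decimal(value(obj, "price")) > 0 and "currency" in names(obj)

# A derived type starts with the checks of its base type (inherited assertions).
OBJECT = []
OBJECT_ON_SALE = OBJECT + [
    ("An object that can be sold, must have the fields price/currency",
     # an XPath general comparison is existential: one matching field is enough
     lambda o: any(n in ("price", "currency") for n in names(o))),
    ("If a field price is present, the currency field should exist. "
     "The value of price must be greater than 0.", price_rule),
]
CAR = OBJECT_ON_SALE + [
    ("For a car the fields brand/year are required",
     lambda o: has_all(o, "brand", "year")),
    ("A car's manufacture year must be between 2000 and 2011",
     lambda o: year_between(o, 2000, 2011)),
    ("None of the fields of an object 'car' can occur twice",
     lambda o: no_repeats(names(o))),
]
BOOK = OBJECT_ON_SALE + [
    ("For a book the fields title/author/publisher/year are mandatory",
     lambda o: has_all(o, "title", "author", "publisher", "year")),
    ("A book's publication year must be between 1900 and 2011",
     lambda o: year_between(o, 1900, 2011)),
    ("A book can have multiple authors, but none of other fields of a book can occur twice",
     lambda o: no_repeats([n for n in names(o) if n != "author"])),
]

# Type alternatives: @type selects the governing type, OBJECT is the fallback.
ALTERNATIVES = {"book": BOOK, "car": CAR}

def failed_messages(xml_text):
    obj = ET.fromstring(xml_text)
    checks = ALTERNATIVES.get(obj.get("type"), OBJECT)
    return [message for message, check in checks if not check(obj)]

XML3 = """<object type="car">
  <field name="price" value="-100" />
  <field name="currency" value="USD" />
  <field name="year" value="1999" />
</object>"""

for message in failed_messages(XML3):
    print(message)
```

For [XML3] this reports three failures in the same order as the Xerces-J output above: the check inherited from OBJECT_ON_SALE first, then the two checks declared on CAR.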

I won't describe in much detail now, the domain meaning of error messages above and the problem scenario itself. I believe, this fictitious problem domain is simple enough to understand these examples.

I'll end this post now. I'll take up the case of schema type "BOOK" in a future blog post (I imagine the concepts I'm trying to illustrate here for the domain object CAR would be similar for the object BOOK).

I'll try to write a few more XSD 1.1 examples in subsequent posts!

I hope that this post is useful.