Saturday, July 3, 2010

XSD 1.1: XML schema design approaches in XSD 1.1 world... PART 1

I'm planning to write a series of posts on XML schema design using the XML Schema 1.1 constructs. Putting too many ideas into one post would make it voluminous and boring to read, so I'll split them up, cross-reference related issues between the posts, and say in a future post when the series ends. I'll try to show why XML Schema 1.1 is essential for certain use cases, and where XML Schema 1.0 falls short.

It is possible that a post unrelated to this series will appear between these posts. When the series is complete, I'll try to summarize the ideas at the end, to make the whole series available as a unit.

I'll say up front that the advice offered here is not necessarily the best; improvements are almost always possible. Any feedback would be great (about the correctness of anything described here, alternative ideas, or anything else).

To start the first post in this series, here's a little background on the use case I describe in the following paragraphs.
I've recently been reading the book "DB2 pureXML Cookbook: Master the Power of the IBM Hybrid Data Server" [1] by Matthias Nicola (a member of the DB2 pureXML team). Chapter 2 of the book describes the following example.
A physical object can be described by two possible XML content models:

a) Metadata as values, aka Name/Value Pairs (often bad):
  <object type="car">
    <field name="brand" value="Honda" />
    <field name="price" value="5000" />
    <field name="currency" value="USD" />
    <field name="year" value="2002" />
  </object>

b) Metadata as element names (good):
    <price currency="USD">5000</price>

I won't go into why one of the above XML design approaches is good or bad. This is explained well in the book cited above [1], which I'd encourage folks to read (the book also has some nice explanations of DB2 pureXML).

Let's say we want to build an XSD schema for XML document (a) above. To start with, one design decision I took was to define a set of XML Schema types with the following hierarchy:
  OBJECT -> OBJECT_ON_SALE -> CAR
                           -> BOOK

The other design decisions I've taken are as follows:
1) We'll use XSD 1.1 type alternatives to select between different schema types, depending on the value of the attribute object/@type.
2) We'll use a hierarchy of XSD 1.1 assertion definitions to enforce certain validation constraints.

The meaning of these design decisions should become clear from the XML instances and the schema document described below.

The sample XML instances I propose are as follows:
[XML1] (this is the same as XML instance "a" above, repeated here for convenience)
  <object type="car">
     <field name="brand" value="Honda" />
     <field name="price" value="5000" />
     <field name="currency" value="USD" />
     <field name="year" value="2002" />
  </object>

[XML2]
  <object type="book">
     <field name="title" value="XML in a Nutshell" />
     <field name="author" value="Jimmy" />
     <field name="author" value="Nick" />
     <field name="publisher" value="Prentice Hall" />
     <field name="price" value="15" />
     <field name="currency" value="USD" />
     <field name="year" value="2008" />
  </object>

I propose the following XSD 1.1 schema, designed to validate both of the XML instance documents above (the XSD 1.1 specific constructs in it are xs:alternative and xs:assert). After the schema, I'll analyze a few of its design elements:
[SCHEMA1]
  <?xml version="1.0" encoding="UTF-8"?>
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
             xmlns:xerces="http://xerces.apache.org">

     <!-- Type alternatives: the value of object/@type selects the governing type -->
     <xs:element name="object" type="OBJECT">
        <xs:alternative test="@type='book'" type="BOOK" />
        <xs:alternative test="@type='car'" type="CAR" />
     </xs:element>

     <xs:complexType name="BOOK">
        <xs:complexContent>
           <xs:extension base="OBJECT_ON_SALE">
              <xs:assert test="field/@name = 'title' and
                               field/@name = 'author' and
                               field/@name = 'publisher' and
                               field/@name = 'year'"
                         xerces:message="For a book the fields title/author/publisher/year are mandatory" />
              <xs:assert test="xs:int(field[@name = 'year']/@value) gt 1900 and 
                               xs:int(field[@name = 'year']/@value) lt 2011"
                         xerces:message="A book's publication year must be between 1900 and 2011" />
              <xs:assert test="count(field[not(@name = 'author')]) = 
                               count(distinct-values(field[not(@name = 'author')]/string(@name)))" 
                         xerces:message="A book can have multiple authors, but none of other fields of a book can occur twice" />
           </xs:extension>
        </xs:complexContent>
     </xs:complexType>

     <xs:complexType name="CAR">
        <xs:complexContent>
           <xs:extension base="OBJECT_ON_SALE">
              <xs:assert test="field/@name = 'brand' and
                               field/@name = 'year'" 
                         xerces:message="For a car the fields brand/year are required" />
              <xs:assert test="xs:int(field[@name = 'year']/@value) gt 2000 and 
                               xs:int(field[@name = 'year']/@value) lt 2011" 
                         xerces:message="A car's manufacture year must be between 2000 and 2011" />
              <xs:assert test="count(field) = count(distinct-values(field/string(@name)))" 
                         xerces:message="None of the fields of an object 'car' can occur twice" />
           </xs:extension>
        </xs:complexContent>
     </xs:complexType>

     <!-- Assertions declared here are inherited by CAR and BOOK -->
     <xs:complexType name="OBJECT_ON_SALE">
        <xs:complexContent>
           <xs:extension base="OBJECT">
              <!-- General comparison: true when any field is named 'price' or 'currency' -->
              <xs:assert test="field/@name = ('price','currency')" 
                         xerces:message="An object that can be sold, must have the fields price/currency" />
              <xs:assert test="if (field/@name = 'price') then 
                               (field[@name = 'price']/xs:decimal(@value) gt 0 and
                               field/@name = 'currency')
                               else true()" 
                         xerces:message="If a field price is present, the currency field should exist. The value of price must be greater than 0." />
           </xs:extension>
        </xs:complexContent>
     </xs:complexType>

     <xs:complexType name="OBJECT">
        <xs:sequence>
           <xs:element name="field" minOccurs="0" maxOccurs="unbounded">
              <xs:complexType>
                 <xs:attribute name="name" type="xs:string" />
                 <xs:attribute name="value" type="xs:string" />
              </xs:complexType>
           </xs:element>
        </xs:sequence>
        <xs:attribute name="type" type="xs:string" />
     </xs:complexType>

  </xs:schema>


The key design elements in the above schema document are the inheritance hierarchy, the assertions, and the type alternatives. The element "object" is validated by the schema type "CAR" or "BOOK". The assertions that apply to CAR or BOOK are those defined on the type itself, plus those inherited from its base types. The schema type that applies to the element "object" is selected by the type-alternative switch (which works on the value of the attribute "type").

When the XML document [XML1] is validated by the schema document [SCHEMA1], validation succeeds with no errors. (I used Xerces-J 2.10.0, or more precisely the latest code base in Xerces SVN as of today, because a minor bug that affects this particular example was fixed a few days after Xerces-J 2.10.0 was released.)

Let's try to introduce some data errors in the XML instance document, and see what happens upon validation with the same schema document.

Here's a modified XML instance document:
  <object type="car">
    <field name="price" value="-100" />
    <field name="currency" value="USD" />
    <field name="year" value="1999" />
  </object>

If the above XML document ([XML3], named "test.xml") is validated by the schema document [SCHEMA1], we get the following validation errors with Xerces-J:
[Error] test.xml:5:10: cvc-assertion.failure: Assertion failure. If a field price is present, the currency field should exist. The value of price must be greater than 0.
[Error] test.xml:5:10: cvc-assertion.failure: Assertion failure. For a car the fields brand/year are required.
[Error] test.xml:5:10: cvc-assertion.failure: Assertion failure. A car's manufacture year must be between 2000 and 2011.

If we remove all the xerces:message attributes from the assertions above, Xerces prints the following error messages for the same scenario:
[Error] test.xml:5:10: cvc-assertion. Assertion evaluation ('if (field/@name = 'price') then (field[@name = 'price']/xs:decimal(@value) gt 0 and field/@name = 'currency') else true()') for element 'object' with type 'OBJECT_ON_SALE' did not succeed.
[Error] test.xml:5:10: cvc-assertion. Assertion evaluation ('field/@name = 'brand' and field/@name = 'year'') for element 'object' with type 'CAR' did not succeed.
[Error] test.xml:5:10: cvc-assertion. Assertion evaluation ('xs:int(field[@name = 'year']/@value) gt 2000 and xs:int(field[@name = 'year']/@value) lt 2011') for element 'object' with type 'CAR' did not succeed

It's up to users which error format is appropriate for them. The format without xerces:message prints more error-context information, while the format with xerces:message can be used to print user-friendly messages when assertions fail.

I won't describe in much detail now the domain meaning of the error messages above, or the problem scenario itself. I believe this fictitious problem domain is simple enough for these examples to be understood.
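
For comparison, here is a sketch of how [XML3] could be repaired so that the three failing assertions pass. The brand value is borrowed from [XML1]; the new price and year values are made up for illustration, and any values that satisfy the assertions would do.

```xml
<object type="car">
  <!-- brand added: CAR requires both a brand field and a year field -->
  <field name="brand" value="Honda" />
  <!-- price changed from -100: OBJECT_ON_SALE requires the price to be gt 0 (illustrative value) -->
  <field name="price" value="100" />
  <field name="currency" value="USD" />
  <!-- year changed from 1999: CAR requires the year to be gt 2000 and lt 2011 (illustrative value) -->
  <field name="year" value="2005" />
</object>
```

Each change maps to one of the three reported failures, and no field name is repeated, so the uniqueness assertion on CAR still holds.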

I'll end this post here. I'll take up the schema type "BOOK" in a future blog post (I imagine the concepts illustrated here for the domain object CAR would be similar for BOOK).

I'll try to write a few more XSD 1.1 examples in subsequent posts!

I hope that this post is useful.


Jim said...

Mukul, thank you very much for this XSD post. I am currently struggling with an XSD issue of my own, and was wondering if you might be able to shed some light. I want to enforce a couple of identity constraints in my XSD. One is that the value of a particular attribute equals the count of all the root's child nodes. The other is that another attribute equals the sum (or subtraction) of several values within the doc. Any help would be greatly appreciated. Cheers. Jim

Mukul Gandhi said...

I think we can use XSD 1.1 assertions to accomplish this. Here's an example.

XML doc:
<x>
   <a attr="3" />
   <b attr="4" />
   <c attr="1" />
</x>

Within the complex type of element "x", we can have assertions such as the following:
<xs:assert test="a/@attr = count(*)" />
<xs:assert test="b/@attr = sum((a/@attr,c/@attr))" />
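
As a rough sketch, the two assertions could sit in a complete XSD 1.1 schema like the one below. The type name Item and the choice of xs:integer for attr are assumptions made for illustration; the real document may need a different content model.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Hypothetical type for the a, b and c children: empty content, one integer attribute -->
  <xs:complexType name="Item">
    <xs:attribute name="attr" type="xs:integer" use="required" />
  </xs:complexType>

  <xs:element name="x">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="a" type="Item" />
        <xs:element name="b" type="Item" />
        <xs:element name="c" type="Item" />
      </xs:sequence>
      <!-- a/@attr must equal the number of child elements of x -->
      <xs:assert test="a/@attr = count(*)" />
      <!-- b/@attr must equal the sum of a/@attr and c/@attr -->
      <xs:assert test="b/@attr = sum((a/@attr, c/@attr))" />
    </xs:complexType>
  </xs:element>

</xs:schema>
```

With the sample values, x has three children and a/@attr is 3, and b/@attr is 4, which is 3 + 1, so both assertions hold.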

You may also explore the usual XSD identity constraints, but I think the XPath expressions allowed there are a bit restricted.

Disha Devaiya said...

Hi Mukul,

I am getting an error during XSD validation.
I am using the xs:alternative tag, but it gives an error like: s4s-elt-must-match.1: The content of 'InventoryId' must match (annotation?, (simpleType | complexType)?, (unique | key | keyref)*)). A problem was found starting at: alternative.

If you can provide any input, it would be really appreciated. Thanks.

Mukul Gandhi said...

@Disha: it surprises me to see a response to a post from so long ago, but it's OK if it helps.

It seems to me that you're not using an XSD 1.1 processor, and "alternative" is an XSD 1.1 construct. You may be using an XSD 1.0 processor, or an XSD 1.1 processor in XSD 1.0 mode. Using a compliant XSD 1.1 processor should solve this problem.

I would also suggest posting questions to relevant forums, or searching the web for answers. That would yield faster help.

Disha Devaiya said...

Thanks Mukul. I tried searching the web, but very limited help is available. I have validated my XSD in the oXygen XML Editor with an XSD 1.1 processor, and it validates successfully. I am using Mule ESB, and I don't know how to change the XSD processor in it.

vijut said...

How do we nest a function? Let's say that, in your example, we want the first 2 digits of the year to be greater than 19.
I tried the expression below, but the compiler throws an error. Could you please correct this?
xs:int(field[@name = 'year']/fn:substring(@value,2) eq 'some' ) > 19