Saturday, January 9, 2010

Xerces-J: more XSD 1.1 tests; negative wild-cards

I'm pretty satisfied with the XSD 1.1 assertions and CTA (type alternatives) implementation in the current Xerces-J SVN code base (I've written a few posts about them earlier on this blog). User feedback would still be great for the Xerces project; instructions for reporting bugs in Xerces-J can be found at http://xerces.apache.org/xerces2-j/jira.html.

I'm now beginning to test some of the other XSD 1.1 features. To start this new set of posts, here are a few use cases for "negative wildcards" (ref. http://www.ibm.com/developerworks/xml/library/x-xml11pt3/index.html#N101C9), which I've found to be working fine with Xerces-J.

XSD 1.0 had the following XML representation of the xs:any wild-card schema component:
  <any
    id = ID
    maxOccurs = (nonNegativeInteger | unbounded)  : 1
    minOccurs = nonNegativeInteger : 1
    namespace = ((##any | ##other) | List of (anyURI | (##targetNamespace | ##local)) )  : ##any
    processContents = (lax | skip | strict) : strict
    {any attributes with non-schema namespace . . .}>
      Content: (annotation?)
  </any>

("anyAttribute" is the other wild-card schema component.)

XSD 1.1 extends the xs:any wild-card definition to the following:
  <any
    id = ID
    maxOccurs = (nonNegativeInteger | unbounded)  : 1
    minOccurs = nonNegativeInteger : 1
    namespace = ((##any | ##other) | List of (anyURI | (##targetNamespace | ##local)) ) 
    notNamespace = List of (anyURI | (##targetNamespace | ##local)) 
    notQName = List of (QName | (##defined | ##definedSibling)) 
    processContents = (lax | skip | strict) : strict
    {any attributes with non-schema namespace . . .}>
      Content: (annotation?)
  </any>

As we can see, xs:any in XSD 1.1 allows two additional specifiers in its definition that were not available in XSD 1.0, namely "notNamespace" and "notQName".

Here's a fictitious example of the use of the "notNamespace" specifier:

XML document, [1]:
  <Example xmlns="http://www.example.com/mySample">
    <a>hi there</a>
    <b>hi there ..</b>
    <c>hi there ...</c>
    <d xmlns="http://www.notallowed.com/sorry">hi there ....</d>
  </Example>

XSD 1.1 Schema, [2]:
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
             targetNamespace="http://www.example.com/mySample"
             elementFormDefault="qualified">

    <xs:element name="Example">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="a" type="xs:string" />
          <xs:element name="b" type="xs:string" />
          <xs:element name="c" type="xs:string" />
          <xs:any notNamespace="http://www.notallowed.com/sorry"
                  processContents="lax"/>
        </xs:sequence>
      </xs:complexType>
    </xs:element>
  
  </xs:schema>

The XSD schema [2] defines an xs:any wild-card that doesn't allow the element it matches in an XML instance document to be in the namespace "http://www.notallowed.com/sorry".

Therefore, when the XML instance [1] is validated against the XSD document [2], we get the following error from the Xerces-J XSD 1.1 schema engine:
test.xml:5:46:cvc-complex-type.2.4.a: Invalid content was found starting with element 'd'. One of '{WC[##other:"http://www.notallowed.com/sorry"]}' is expected.
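
For readers who want to reproduce this, here is a minimal sketch of how such a validation could be driven from Java through the standard JAXP validation API. The class name ValidateSketch and the helper validate are made up for this post. The schema language URI is passed in as a parameter: as far as I know, "http://www.w3.org/XML/XMLSchema/v1.1" selects the XSD 1.1 engine when an XSD 1.1 enabled Xerces-J build is on the classpath (an assumption to check against your build), while the JDK's built-in factory only understands the XSD 1.0 URI.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

// Hypothetical helper, not part of Xerces-J.
class ValidateSketch {

    // Validates the instance text against the schema text and returns the
    // reported errors as "line:column:message" strings (empty list = valid).
    static List<String> validate(String schemaLanguage, String xsd, String xml) {
        final List<String> errors = new ArrayList<String>();
        try {
            SchemaFactory factory = SchemaFactory.newInstance(schemaLanguage);
            Validator validator = factory
                    .newSchema(new StreamSource(new StringReader(xsd)))
                    .newValidator();
            validator.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) {
                    // warnings are ignored in this sketch
                }
                public void error(SAXParseException e) {
                    // collect validity errors instead of stopping at the first one
                    errors.add(e.getLineNumber() + ":" + e.getColumnNumber()
                            + ":" + e.getMessage());
                }
                public void fatalError(SAXParseException e) throws SAXParseException {
                    throw e; // input is not well-formed: give up
                }
            });
            validator.validate(new StreamSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("validation could not be run", e);
        }
        return errors;
    }
}
```

With an XSD 1.1 capable Xerces-J on the classpath, calling ValidateSketch.validate("http://www.w3.org/XML/XMLSchema/v1.1", schemaText, instanceText) with the schema [2] and the document [1] should return one entry carrying the cvc-complex-type.2.4.a message.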

If we change element "d" in the instance document to the following:
<d>hi there ....</d>
or say,
<d xmlns="http://www.allowed.com">hi there ....</d>

the XSD validation succeeds (as element "d" is no longer in the namespace "http://www.notallowed.com/sorry").
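
The rule that decides these three cases can be sketched in a few lines of Java. This is only my reading of the "notNamespace" semantics, not the Xerces-J implementation: the class NotNamespaceSketch and its method are hypothetical, an element in no namespace is represented by the empty string, and the other wild-card constraints (such as "notQName") are ignored.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of the "notNamespace" check, for illustration only.
class NotNamespaceSketch {

    // Returns true when an element in elementNamespace is NOT excluded by the
    // given notNamespace tokens. "" stands for "no namespace".
    static boolean allows(List<String> notNamespace, String targetNamespace,
                          String elementNamespace) {
        Set<String> excluded = new HashSet<String>();
        for (String token : notNamespace) {
            if ("##targetNamespace".equals(token)) {
                excluded.add(targetNamespace);   // keyword: the schema's own namespace
            } else if ("##local".equals(token)) {
                excluded.add("");                // keyword: elements in no namespace
            } else {
                excluded.add(token);             // a plain namespace URI
            }
        }
        return !excluded.contains(elementNamespace);
    }

    public static void main(String[] args) {
        String tns = "http://www.example.com/mySample";
        List<String> notNamespace = Arrays.asList("http://www.notallowed.com/sorry");
        System.out.println(allows(notNamespace, tns, "http://www.notallowed.com/sorry")); // false
        System.out.println(allows(notNamespace, tns, tns));                               // true
        System.out.println(allows(notNamespace, tns, "http://www.allowed.com"));          // true
    }
}
```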

Here's an example of the use of the "notQName" specifier:

XML document, [3]:
  <Example xmlns="http://www.example.com/mySample">
    <a>hi there</a>
    <b>hi there ..</b>
    <c>hi there ...</c>
    <XX>hi there ....</XX>
  </Example>

XSD 1.1 Schema, [4]:
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
             xmlns:tns="http://www.example.com/mySample"
             targetNamespace="http://www.example.com/mySample"
             elementFormDefault="qualified">

    <xs:element name="Example">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="a" type="xs:string" />
          <xs:element name="b" type="xs:string" />
          <xs:element name="c" type="xs:string" />
          <xs:any notQName="tns:XX"
                  processContents="lax"/>
        </xs:sequence>
      </xs:complexType>
    </xs:element>
  
  </xs:schema>

The XSD schema [4] doesn't allow an instance document to have an element "XX" (in the namespace "http://www.example.com/mySample") at the position where the xs:any wild-card allows element content.

Therefore, when the XML document [3] is validated against the XSD schema [4], we get the following error message from Xerces-J:
test.xml:5:7:cvc-complex-type.2.4.a: Invalid content was found starting with element 'XX'. One of '{WC[##any, notQName(tns:XX)]}' is expected.

So if we replace the offending element with:
<abc>hi there ....</abc>
or say,
<XX xmlns="http://www.allowed.com">hi there ....</XX>
(here, the local name in the XML instance document is the same as the one specified in the "notQName" specifier, while the namespace of the element instance differs from the one specified by "notQName", which makes this element instance valid)

the XML validation passes.
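
The same kind of sketch works for "notQName": an element is rejected only when both its namespace and its local name match an excluded name. Again, this is a simplified model under my own assumptions, not the Xerces-J code: NotQNameSketch, resolve and allows are invented names, and the keywords ##defined and ##definedSibling are not modelled.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import javax.xml.namespace.QName;

// Hypothetical model of the "notQName" check, for illustration only.
class NotQNameSketch {

    // Resolves a lexical QName such as "tns:XX" with the schema's prefix
    // bindings. Unbound prefixes fall back to "no namespace" in this sketch;
    // a real schema processor would report an error instead.
    static QName resolve(String lexical, Map<String, String> prefixes) {
        int colon = lexical.indexOf(':');
        String prefix = colon < 0 ? "" : lexical.substring(0, colon);
        String local = lexical.substring(colon + 1);
        String uri = prefixes.get(prefix);
        return new QName(uri == null ? "" : uri, local);
    }

    // QName equality compares namespace URI and local part, never the prefix.
    static boolean allows(Set<QName> notQName, QName element) {
        return !notQName.contains(element);
    }

    public static void main(String[] args) {
        String tns = "http://www.example.com/mySample";
        Map<String, String> prefixes = Collections.singletonMap("tns", tns);
        Set<QName> notQName = Collections.singleton(resolve("tns:XX", prefixes));
        System.out.println(allows(notQName, new QName(tns, "XX")));                      // false
        System.out.println(allows(notQName, new QName(tns, "abc")));                     // true
        System.out.println(allows(notQName, new QName("http://www.allowed.com", "XX"))); // true
    }
}
```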

All these XSD language enhancements in version 1.1 look cool (and useful :)) to me, and they give XSD schema authors some more XML validation capabilities.

Out of curiosity, I was wondering whether we could express a few of the new XSD 1.1 wild-card capabilities with assertions.

For example, some of the features of the "notNamespace" attribute can be written with an assertion like the following:
  <xs:assert test="not(namespace-uri(*[last()]) = (
                        'http://www.notallowed.com/sorry1',
                        'http://www.notallowed.com/sorry2',
                        'http://www.notallowed.com/sorry3')
                       )" />

But using assertions for this need might have the following disadvantages or limitations [5]:
1. The XSD 1.1 engine has to build an XPath tree to evaluate an assertion, which is a memory overhead.
2. It looks like assertions cannot provide the following facility of the "notNamespace" attribute: specifying namespace URIs with the keywords ##targetNamespace and ##local.
3. Using the "notNamespace" attribute on the xs:any wildcard gives us the optimization benefits of the xs:any implementation (for instance, it doesn't have to build the XPath tree that the assertion approach requires). Moreover, it's better to use the native facility of a construct (like xs:any), which keeps the XSD schema's design more natural and easier to understand.

Similarly, some of the features of the "notQName" attribute can be written with an assertion like the following:
  <xs:assert test="not(local-name(*[last()]) eq 'XX'
                       and
                      namespace-uri(*[last()]) eq 'http://www.example.com/mySample')" />

This approach would have similar issues to those listed above [5].
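
To get a feel for what these assertion tests compute, here's a small sketch that evaluates them with the XPath engine that ships with the JDK. That engine only supports XPath 1.0, so the tests are rewritten without the XPath 2.0 sequence comparison and the eq operator (my adaptation, not what an XSD 1.1 processor runs); AssertSketch and holds are names invented for this example. The root element plays the role of the element on which the assertion is declared.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper: evaluates an XPath 1.0 test against the root element.
class AssertSketch {

    static boolean holds(String xpath1Test, String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // needed for namespace-uri() to work
            Element root = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            return (Boolean) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath1Test, root, XPathConstants.BOOLEAN);
        } catch (Exception e) {
            throw new IllegalStateException("test could not be evaluated", e);
        }
    }

    public static void main(String[] args) {
        // XPath 1.0 rewrite of the "notNamespace"-style assertion
        String notNamespaceTest =
                "not(namespace-uri(*[last()]) = 'http://www.notallowed.com/sorry1'"
              + " or namespace-uri(*[last()]) = 'http://www.notallowed.com/sorry2'"
              + " or namespace-uri(*[last()]) = 'http://www.notallowed.com/sorry3')";
        // XPath 1.0 rewrite of the "notQName"-style assertion
        String notQNameTest =
                "not(local-name(*[last()]) = 'XX'"
              + " and namespace-uri(*[last()]) = 'http://www.example.com/mySample')";

        String sorry = "<Example xmlns='http://www.example.com/mySample'><a>hi</a>"
                     + "<d xmlns='http://www.notallowed.com/sorry1'>hi</d></Example>";
        String xx = "<Example xmlns='http://www.example.com/mySample'><a>hi</a>"
                  + "<XX>hi</XX></Example>";

        System.out.println(holds(notNamespaceTest, sorry)); // false
        System.out.println(holds(notNamespaceTest, xx));    // true
        System.out.println(holds(notQNameTest, xx));        // false
        System.out.println(holds(notQNameTest, sorry));     // true
    }
}
```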

I hope that this post was useful.
