Saturday, November 28, 2009

Xerces-J: XSD 1.1 assertions on simple types

This post gives a few examples of assertions on XSD simple types, and on complex types with simple content, tested with the Xerces-J XSD 1.1 implementation. The previous couple of posts on this blog described assertions on XSD complex types having complex content (i.e., elements having "element only" or mixed content, and/or attributes).

1) Here's an example, taken from Roger L. Costello's collections of XSD 1.1 examples, which he's published on his web site:

XML document [1]:
  <Example>
    <even-integer>100</even-integer>        
  </Example>

XSD 1.1 document [2]:
  <schema xmlns="http://www.w3.org/2001/XMLSchema"
          elementFormDefault="qualified">

    <element name="Example">
       <complexType>
          <sequence>
             <element name="even-integer">
                <simpleType>
                  <restriction base="integer">
                     <assertion test="$value mod 2 = 0" />
                  </restriction>
                </simpleType>
             </element>
          </sequence>
       </complexType>
    </element>

  </schema>

The above XSD 1.1 schema [2] constrains xs:integer values to even ones only (this works fine with Xerces!). XSD 1.1 defines a new facet, named assertion, on XSD simple types, which the above example demonstrates.

Please note that the "assertion" facet (applicable both to XSD simple types, and to complex types with simple content) is conceptually different from the "assert" constraint on complex types (this is explained further below).

The XSD 1.1 spec says that the XPath 2 "dynamic context" of an assertion gets augmented with a variable, $value. The XSD type of $value is that of the base simple type (in this example, the type of $value is xs:integer). The detailed rules for using the variable $value in XSD 1.1 schemas are described in the XSD 1.1 spec.

It looks to me that the ability to have an assertion facet on simple types significantly enhances the XSD author's capability to express many new constraints on simple type values, which were not possible in XSD 1.0 (e.g., constraining integer values to be even was not possible in XSD 1.0).

For the above example, we could also specify assertions like the following:
<assertion test="$value mod 2 = 0" />
<assertion test="$value lt 500" />
(i.e, a set of two assertion facet instances)

Or, if the user wishes, a single assertion facet instance that realizes the same objective: <assertion test="($value mod 2 = 0) and ($value lt 500)" />

This enforces that the simple type value must be even, and also less than 500. There is no limit on the number of assertion facet instances that can be specified. In my opinion, the ability to specify any number of assertion facets (and also assert constraints on complex types) makes assertions a tremendously useful XSD validation construct.

Notes: Interestingly, the following facet definition achieves the same result as the second assertion facet instance described above:
<maxExclusive value="500" />
(this was available in, XSD 1.0 as well)
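
As a rough illustration of how several assertion facets combine (a value must satisfy all of them), the two tests above can be mirrored in plain Java. This is only a sketch of the logic, not how Xerces evaluates assertions; the class and method names are made up for this example.

```java
import java.math.BigInteger;

// Hypothetical helper: mirrors the two assertion facets from the example,
// "$value mod 2 = 0" and "$value lt 500", for an xs:integer lexical value.
final class EvenBelow500 {
    private static final BigInteger TWO = BigInteger.valueOf(2);
    private static final BigInteger LIMIT = BigInteger.valueOf(500);

    static boolean isValid(String lexical) {
        BigInteger value;
        try {
            value = new BigInteger(lexical.trim());
        } catch (NumberFormatException e) {
            return false; // not an integer at all
        }
        boolean even = value.mod(TWO).signum() == 0;   // $value mod 2 = 0
        boolean below = value.compareTo(LIMIT) < 0;    // $value lt 500
        return even && below;                          // every facet must hold
    }
}
```

For instance, EvenBelow500.isValid("100") holds, while "21" (odd) and "500" (not less than 500) are rejected.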

2) Complex types with simple contents, using assertions:
XML document [3]:
  <root>
    <x label="a">2</x>
    <x label="b">4</x>
  </root>

Here, the element "x" should have an attribute "label" of type xs:string, while the content of element "x" is simple (of type xs:int in this example).
Additionally, we want the simple content value of "x" to be an even number.

The XSD document for these validation constraints is as follows [4]:
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
 
   <xs:element name="root">
     <xs:complexType>
       <xs:sequence>
         <xs:element name="x" maxOccurs="unbounded" type="X_Type" />
       </xs:sequence>
     </xs:complexType>
   </xs:element>
   
   <xs:complexType name="X_Type">
     <xs:simpleContent>
        <xs:extension base="xs:int">    
          <xs:attribute name="label" type="xs:string" />
          <xs:assert test="$value mod 2 = 0" />
        </xs:extension>
     </xs:simpleContent>
   </xs:complexType>
  
  </xs:schema>

Note the use of the xs:assert instruction in this example.

It's interesting to see what happens if we change the value of one of the "x" elements as follows:
<x label="a">21</x>
(I changed the first "x")

Xerces fails the validation of the XML instance, and returns the following error message to the user:
test.xml:2:22:cvc-assertion.3.13.4.1: Assertion evaluation ('$value mod 2 = 0') for element 'x' with type 'X_Type' did not succeed.

Here, the XML validation did not succeed, because the value 21 is not an even number.

3) The last example of this post is the following:
This again describes the scenario of complex types with simple content. But here, the simple content is defined by "restriction of a complex type". The previous example described complex types with simple content, using derivation by extension.

The XML file remains the same [3], while the new XSD document is the following [5]:
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
 
   <xs:element name="root">
     <xs:complexType>
       <xs:sequence>
         <xs:element name="x" maxOccurs="unbounded" type="X_Type" />
       </xs:sequence>
     </xs:complexType>
   </xs:element>
   
   <xs:complexType name="X_Type">
     <xs:simpleContent>
        <xs:restriction base="x_base">      
           <xs:assertion test="$value mod 2 = 0" />
           <xs:assert test="@label = ('a','b')" />
        </xs:restriction>
     </xs:simpleContent>
   </xs:complexType>
   
   <xs:complexType name="x_base">
     <xs:simpleContent>
        <xs:extension base="xs:int">    
          <xs:attribute name="label" type="xs:string" />
        </xs:extension>
     </xs:simpleContent>
   </xs:complexType>
  
 </xs:schema>

Please notice how assertions are specified on the complex type "X_Type". Here, we have two assertion instructions (xs:assertion and xs:assert). In this example, xs:assertion is a facet on the atomic value of the complex type (the value of the complex type is simple in this case!), while xs:assert is the assertion instruction on the complex type (which has access to the element tree).

The complexType -> simpleContent -> restriction type definition can specify assertions with the following grammar:
... assertion*, ..., assert* (i.e., 0-n xs:assertion components can be followed by 0-n xs:assert components; this ordering is significant, otherwise the XSD 1.1 processor will flag an error).
There could be other constructs as well before xs:assertion here, and some after it, but anything after xs:assertion* needs to come before the trailing xs:assert's. This is described in the relevant XSD 1.1 grammar at http://www.w3.org/TR/2009/CR-xmlschema11-1-20090430/#dcl.ctd.ctsc.

Notes: The XML Schema WG decided to have two different names for assertion instructions (xs:assertion and xs:assert) for this particular scenario, so that XSD schema authors can decide whether they are writing assertions as a facet for simple values, or assertions for complex types (which have access to the element tree). If this naming distinction had not been made in XSD 1.1, then specifying asserts in XSD documents would, in this case, have been ambiguous (i.e., the XSD 1.1 processor could not tell which assertion is a facet, and which is an assertion for the complex type).

Acknowledgements:
I must mention that the XSD 1.1 examples shared by Roger L. Costello helped us fix quite a few bugs in the Xerces assertions implementation. Our sincere thanks are due to Roger.

References:
1. Readers could also find this article useful, http://www.ibm.com/developerworks/library/x-xml11pt2/ about XSD 1.1 co-occurrence constraints, which describes the XSD 1.1 assertions facility in detail.

I hope that this post was useful.

Friday, November 27, 2009

XSD 1.1: another assertions example with Xerces-J !

Here's another XSD 1.1 assertions example, which I came up with today :)

The XML document looks something like this:
  <person_db>
    <person id="1">
      <fname>john</fname>
      <lname>backus</lname>
      <dob>1995-12-10</dob>
    </person>
    <person id="2">
      <fname>rick</fname>
      <lname>palmer</lname>
      <dob>2001-11-09</dob>
    </person>
    <person id="3">
      <fname>neil</fname>
      <lname>cooks</lname>
      <dob>1998-11-10</dob>
    </person>
  </person_db>

Other than constraining the XML document to a structure like the above, the XSD schema should specify the following additional validation constraints as well:
1) Each person's dob field should specify a date that is later than or equal to 1900-01-01.
2) The "person" elements should be sorted numerically by their "id" attribute, in ascending order.

I wanted to achieve these validation objectives completely with XSD 1.1 assertions. Here's the XSD 1.1 document, which I find works fine with Xerces-J:
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
 
     <xs:element name="person_db">
       <xs:complexType>
          <xs:sequence>
            <xs:element name="person" maxOccurs="unbounded" type="Person" />
          </xs:sequence>
          <xs:assert test="every $p in person[position() lt last()] satisfies
                            ($p/@id lt $p/following-sibling::person[1]/@id)" />
       </xs:complexType>
     </xs:element>
   
     <xs:complexType name="Person">
        <xs:sequence>
          <xs:element name="fname" type="xs:string" />
          <xs:element name="lname" type="xs:string" />
          <xs:element name="dob" type="xs:date" />
        </xs:sequence>
        <xs:attribute name="id" type="xs:int" use="required" />
        <xs:assert test="dob ge xs:date('1900-01-01')" />
     </xs:complexType>
  
   </xs:schema>

Notes: It also seems that the above XSD validation requirements could be met with the following changes as well:
1. Remove the assertion from the complex type "Person".
2. Have an additional assertion on the element "person_db", which will now look something like the following:
<xs:assert test="every $p in person[position() lt last()] satisfies
($p/@id lt $p/following-sibling::person[1]/@id)" />
<xs:assert test="every $p in person satisfies ($p/dob ge xs:date('1900-01-01'))" />

i.e., we'll now have two assertions on the element "person_db" (which are actually specified on the element's schema type).

Though, I prefer the first solution, as it seems more elegant to me, and more logically placed.

I am happy that this particular example worked as I expected with Xerces.
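
For readers who want to try the two constraints outside of an XSD 1.1 processor, here is a hedged sketch using the XPath 1.0 engine that ships with the JDK (javax.xml.xpath). XPath 1.0 has neither "every ... satisfies" nor xs:date, so the expressions below are assumed rewrites: the quantifier becomes a negated search for a counter-example, and the date comparison is done numerically on the digits of the date, which only holds for plain YYYY-MM-DD values without time zones. The class name PersonDbCheck is hypothetical.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper: XPath 1.0 approximations of the two XSD 1.1 assertions.
final class PersonDbCheck {
    // "No person (except the last) has an id >= the id of the next person."
    static final String SORTED_BY_ID =
        "not(/person_db/person[position() < last()]"
      + "[@id >= following-sibling::person[1]/@id])";

    // "No person has a dob whose digits, read as a number, are below 19000101."
    static final String DOB_NOT_BEFORE_1900 =
        "not(/person_db/person[translate(dob, '-', '') < 19000101])";

    static boolean holds(String xpathExpr, String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return (Boolean) xpath.evaluate(
                xpathExpr, new InputSource(new StringReader(xml)), XPathConstants.BOOLEAN);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Calling PersonDbCheck.holds(PersonDbCheck.SORTED_BY_ID, xml) on the sample document should report that the ordering constraint holds; swapping two person elements should make it fail.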

I hope that this post was useful.

Friday, November 20, 2009

XSD 1.1: some CTA samples with Xerces-J

I've been trying to write a few XSD 1.1 Conditional Type Assignment (CTA) samples, and to run them with the current Xerces-J schema development SVN code.

To start with, here's the first example (a very simple one indeed) which I find runs fine with Xerces-J:

XML document [1]:
  <root>
    <x>hello</x>
    <x kind="int">10</x>
  </root>

XSD 1.1 document [2]:
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

     <xs:element name="root">
       <xs:complexType>
         <xs:sequence>
           <xs:element name="x" type="xs:anyType" maxOccurs="unbounded">
             <xs:alternative test="@kind='int'" type="xInt_Type" />
             <xs:alternative type="xString_Type" />
           </xs:element>
         </xs:sequence>
       </xs:complexType>
     </xs:element>

     <xs:complexType name="xInt_Type">
       <xs:simpleContent>
         <xs:extension base="xs:int">
           <xs:attribute name="kind" type="xs:string" />
         </xs:extension>
       </xs:simpleContent>
     </xs:complexType>

     <xs:complexType name="xString_Type">
       <xs:simpleContent>
         <xs:extension base="xs:string">
           <xs:attribute name="kind" type="xs:string" />
         </xs:extension>
       </xs:simpleContent>
     </xs:complexType>

  </xs:schema>

Please note the presence of the XSD 1.1 instruction xs:alternative (which is newly introduced in XSD 1.1, and makes this XSD schema a type alternative scenario) within the declaration of element "x" in the above schema [2]. If the value of the "kind" attribute on element "x" is 'int', then the schema type "xInt_Type" is assigned to element "x". If the attribute "kind" is not present on element "x", or if it's present and its value is not 'int', the schema type xString_Type gets assigned to element "x".

Xerces-J successfully validates the above XML document [1] with the given XSD 1.1 Schema [2].

If we introduce the following change to the XML document:
<x kind="int">not an int</x>

Xerces-J would display the following error message:
cvc-datatype-valid.1.2.1: 'not an int' is not a valid value for 'integer'.

The above error message is correct, because the value 'not an int' in the XML document is not of type xs:int.
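
The selection rule described above can be sketched in plain Java: the alternatives are considered in order, the first one whose test holds supplies the type, and an alternative without a test acts as the default. This is a simplified model for this one schema, not Xerces code; the class and method names are assumptions, and Integer.parseInt only approximates the xs:int lexical space.

```java
// Hypothetical model of the type alternatives on element "x" in schema [2].
final class TypeAlternativeSketch {

    // kind is the value of the "kind" attribute, or null if it is absent.
    static String selectType(String kind) {
        if ("int".equals(kind)) {
            return "xInt_Type";     // <xs:alternative test="@kind='int'" type="xInt_Type" />
        }
        return "xString_Type";      // <xs:alternative type="xString_Type" /> (no test: default)
    }

    // Checks the element content against the selected type.
    static boolean isValid(String kind, String content) {
        if ("xInt_Type".equals(selectType(kind))) {
            try {
                Integer.parseInt(content.trim());
                return true;
            } catch (NumberFormatException e) {
                return false;       // e.g. 'not an int'
            }
        }
        return true;                // any character content is a valid xs:string
    }
}
```

With this model, isValid(null, "hello") and isValid("int", "10") hold, while isValid("int", "not an int") does not, which matches the behaviour reported above.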

Notes:
The schema types specified on xs:alternative instructions need to validly derive from the default type specified on the element (also referred to as being "type substitutable" in the XSD 1.1 spec; the default type is xs:anyType in this example), or the type on xs:alternative could be xs:error. xs:error is a new schema type defined in the XSD 1.1 spec, and is particularly useful with XSD type alternatives: it has an empty lexical and value space, so any XML element or attribute which has this type will always be invalid.

So for example, if we write an element declaration like the following (demonstrating type substitutability/derivation of XSD types specified on xs:alternative instructions):
  <xs:element name="x" type="xs:string" maxOccurs="unbounded">
    <xs:alternative test="@kind='int'" type="xInt_Type" />
  ...

Xerces-J would return the following error message, because xInt_Type (which extends xs:int) is not derived from xs:string:
e-props-correct.7: Type alternative 'xInt_Type' is not xs:error or is not validly derived from the type definition, 'string', of element 'x'.

Making use of type xs:error in CTAs:
Let's assume that the XML document remains the same as document [1], and the declaration of element "x" is now written as follows:
  <xs:element name="x" type="xs:anyType" maxOccurs="unbounded">
    <xs:alternative test="@kind='int'" type="xInt_Type" />
    <xs:alternative type="xs:error" />
  </xs:element>

Now Xerces returns an error message like the following:
cvc-datatype-valid.1.2.1: 'hello' is not a valid value for 'error'.

For this particular example, this error would occur if the attribute "kind" is not present, or if the attribute "kind" is present and its value is not 'int'.

Xerces-J CTA implementation, using PsychoPath XPath 2 engine:
The XSD 1.1 spec defines a small XPath 2 language subset to be used by XSD 1.1 CTA instructions. Xerces-J has a native implementation of this XPath 2 subset (implemented by Hiranya Jayathilaka, a fellow Xerces-J committer), which gets selected by Xerces as the default XPath 2 processor if the CTA XPath 2 expressions conform to this subset. This was designed into Xerces to make XPath 2 evaluation efficient for the CTA subset, since evaluating every XPath 2 expression with the PsychoPath engine could have been computationally expensive.

But if the XSD CTA XPath 2 expressions cannot be compiled by the native Xerces-J CTA XPath 2 subset implementation, Xerces will attempt to use the PsychoPath XPath engine to evaluate the CTA XPath expressions as a fallback (which also enables users to use the full XPath 2 language with the Xerces CTA implementation, if they want to).

To test that the PsychoPath engine does work with the Xerces CTA implementation, I modified the type alternative instruction of the XSD example [2] above to the following:
<xs:alternative test="@kind='int' and (tokenize('xxx xx', '\s+')[1] eq 'xxx')" type="xInt_Type" />
I added a dummy XPath "and" clause, which can only succeed with Xerces if the PsychoPath engine evaluates this XPath expression. This additional "and" clause doesn't make any difference to the validity of the XML document [1], as in this example it always evaluates to boolean "true". If we introduce an error into the above XPath expression, say by changing it to the following:
tokenize('xxx xx', '\s+')[1] eq 'xx' (please note the change from eq 'xxx' to eq 'xx', which causes this XPath expression to evaluate to boolean "false"), Xerces reports an XML validity error, which is what is expected of the Xerces CTA implementation.

I hope that this post was useful.

Wednesday, November 18, 2009

XSD 1.1: some XSD 1.1 samples running with Xerces-J

Lately I have been thinking of functionally stress testing the upcoming Xerces-J XSD 1.1 preview release (using the SVN code we have now, and later using the public binaries which will be provided by the Xerces project). I'm curious to know whether there are any non-compliant parts in the Xerces-J XSD 1.1 implementation that I can find, which could serve as inputs for improving the Xerces-J XSD 1.1 code base. To start with, I'll try to write a few XSD 1.1 schemas using the XSD 1.1 assertions and "Conditional Type Assignment (CTA)/type alternative" instructions.

Assertions examples

Example 1
Sample XML [1]
  <x a="xyz">
    <foo>5</foo>
    <bar>10</bar>
  </x>

XSD 1.1 Schema [2]
(Use Case: "the value of the foo element must be less than or equal to the value of the bar element")
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
 
    <xs:element name="x">
      <xs:complexType>
         <xs:sequence>
           <xs:element name="foo" type="xs:int" />
           <xs:element name="bar" type="xs:int" />
         </xs:sequence>
         <xs:attribute name="a" type="xs:string" use="required" />
         <xs:assert test="foo le bar" />
      </xs:complexType>
    </xs:element>
  
  </xs:schema>

Using the Xerces-J XSD 1.1 validator, the XML document [1] above validates fine against the given XSD document [2].
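
For reference, a validator like this is normally obtained through the standard JAXP validation API. The sketch below assumes that the Xerces-J build on the classpath registers the XSD 1.1 schema language under the URI http://www.w3.org/XML/XMLSchema/v1.1 (as released Xerces-J versions with XSD 1.1 support do); on a JDK without such a build, no factory is found and the helper returns an empty Optional. The class name Xsd11Factory is hypothetical.

```java
import java.util.Optional;
import javax.xml.validation.SchemaFactory;

// Hypothetical helper: looks up an XSD 1.1 capable SchemaFactory via JAXP.
final class Xsd11Factory {
    // Assumption: schema-language URI used by Xerces-J builds with XSD 1.1 support.
    static final String XSD11_LANGUAGE = "http://www.w3.org/XML/XMLSchema/v1.1";

    static Optional<SchemaFactory> tryCreate() {
        try {
            return Optional.of(SchemaFactory.newInstance(XSD11_LANGUAGE));
        } catch (IllegalArgumentException e) {
            // No implementation of this schema language is available on the classpath.
            return Optional.empty();
        }
    }
}
```

With a factory in hand, validation follows the usual JAXP pattern: factory.newSchema(new File("test.xsd")).newValidator().validate(new StreamSource(new File("test.xml"))).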

If the assertion is written as follows (a deliberately false assertion, just to check how false assertions and their error messages are reported):
<xs:assert test="(foo + 10) le bar" />

then the XML instance document ([1] above) becomes invalid, and the following error message is returned by Xerces:
test.xml:4:5:cvc-assertion.3.13.4.1: Assertion evaluation ('(foo + 10) le bar') for element 'x' with type '#anonymous' did not succeed.

Use Case: "if the value of the attribute "a" is xyz, then the bar and baz elements are required, but otherwise they are optional".

For the sample document [1], which has foo and bar rather than bar and baz, this would require the following assertion definition:
<xs:assert test="if (@a eq 'xyz') then (foo and bar) else true()" />

This works fine with Xerces-J.
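
The if/then/else in this assertion is a logical implication, so it can also be read as "not(condition) or consequence". The hedged sketch below checks that reading with the JDK's XPath 1.0 engine, using the assumed XPath 1.0 rewrite not(@a = 'xyz') or (foo and bar). Note that the sequence in schema [2] already requires foo and bar; the rule only adds something if they are declared with minOccurs="0", which the sketch assumes. The class name is hypothetical.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper: the conditional assertion rewritten as an implication.
final class ConditionalAssertionCheck {
    // if (@a eq 'xyz') then (foo and bar) else true()
    static final String IMPLICATION =
        "not(/x/@a = 'xyz') or (/x/foo and /x/bar)";

    static boolean holds(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return (Boolean) xpath.evaluate(
                IMPLICATION, new InputSource(new StringReader(xml)), XPathConstants.BOOLEAN);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A document with a="xyz" and both children passes, a="xyz" with a missing child fails, and any other value of "a" passes regardless of the children.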

Acknowledgements: Thanks to Douglass A Glidden for contributing these use cases, on xml-dev list.

Example 2
Sample XML [3]
  <Example>
    <x>hi</x>
    <y>there</y>
    <ASomeNameSuffix/>
  </Example>

XSD 1.1 Schema [4]
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    
    <xs:element name="Example" type="myType" />
 
    <xs:complexType name="myType">
      <xs:sequence>
        <xs:element name="x" type="xs:string" />
        <xs:element name="y" type="xs:string" />
        <xs:any processContents="lax" />
      </xs:sequence>
      <xs:assert test="starts-with(local-name(*[3]), 'A')" />
    </xs:complexType>

  </xs:schema>

In this particular example (Example 2), the immediate sibling element following element "y" is defined via the XSD wild-card instruction, <xs:any/>. The assertion in XSD schema [4] enforces that the name of the sibling element that appears after element "y" must start with the letter "A". I think this (i.e., defining a constraint on the name of an element matched by an xs:any wild-card) could not have been accomplished with XSD 1.0.

Example 3
Sample XML [5]
  <record>
    <wins>20</wins>
    <losses>15</losses>
    <ties>8</ties>
    <!--
      0 to n no's of well-formed elements, allowed here
      by XSD wild-card instruction, <xs:any />
    -->
  </record>

XSD 1.1 Schema [6]
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

    <xs:complexType name="Record">
      <xs:sequence>
        <xs:element name="wins" type="xs:nonNegativeInteger"/>
        <xs:element name="losses" type="xs:nonNegativeInteger"/>
        <xs:element name="ties" type="xs:nonNegativeInteger" minOccurs="0"/>
        <xs:any minOccurs="0" maxOccurs="unbounded" namespace="##any" processContents="lax"/>   
      </xs:sequence>
      <xs:assert test="every $x in ties/following-sibling::* satisfies
                     not(empty(index-of(('x','y','z'), local-name($x))))" />
    </xs:complexType>

    <xs:element name="record" type="Record"/>

  </xs:schema>

The XML document [5] is valid according to the XSD schema [6]. The <xs:any ../> instruction in this schema ([6]) allows 0-n well-formed XML elements after the element "ties". This facility was available in XSD 1.0 as well. (For the interest of readers, XSD 1.1 has weakened wild-cards, which makes the above XSD schema [6] valid; in XSD 1.0 this schema was invalid, due to enforcement of the UPA (unique particle attribution) constraint. An example of this is given in an article here, http://www.ibm.com/developerworks/xml/library/x-xml11pt3/index.html#N10122.)

The assertion in this schema ([6]) enforces that any element after the element "ties", which is allowed by the xs:any wild-card, should have a name (i.e., a name without namespace prefix, an XML local-name) from the list ('x', 'y', 'z'). Something like this was not possible with XSD 1.0, and in my opinion this is nice :)
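
To make the quantified expression more concrete, the same check can be written as a loop over the DOM: walk the children of "record" and, once "ties" has been seen, require every later element to have a local name in the allowed list. This is a sketch with hypothetical names, not what Xerces does internally. It also shows a corner case: if "ties" is absent (the schema allows that), the XPath sequence is empty and the assertion holds trivially.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper mirroring:
//   every $x in ties/following-sibling::* satisfies
//     not(empty(index-of(('x','y','z'), local-name($x))))
final class WildcardNameCheck {
    private static final Set<String> ALLOWED = new HashSet<>(Arrays.asList("x", "y", "z"));

    static boolean trailingNamesAllowed(String recordXml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // so that getLocalName() returns a value
            Element record = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(recordXml)))
                    .getDocumentElement();

            boolean afterTies = false;
            for (Node n = record.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() != Node.ELEMENT_NODE) {
                    continue; // skip text and comments
                }
                if (afterTies && !ALLOWED.contains(n.getLocalName())) {
                    return false; // a sibling after "ties" with a disallowed name
                }
                if ("ties".equals(n.getLocalName())) {
                    afterTies = true;
                }
            }
            return true;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```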

PS: more examples to follow, in the next few posts :)

References:
XSD 1.1 Part 1: Structures
XSD 1.1 Part 2: Datatypes

I must acknowledge (a long enough acknowledgement, but I must do it anyway :)) that Xerces assertions are really powered by the PsychoPath XPath 2 engine, and the credit for bringing the PsychoPath engine to almost 100% compliance with the W3C XPath 2.0 test suite (as of now, PsychoPath is 99%+ compliant) should largely go to Dave Carver and Jesper Steen Møller. I was fortunate enough to contribute somewhat to the PsychoPath XPath implementation (the freedom given to me as an Eclipse Source Editing project committer, thanks to Dave Carver, helped me to drive Xerces assertions development quickly). Needless to mention the original PsychoPath code contribution by Andrea Bittau and his team to the Eclipse Foundation. I must also mention that the numerous reviews and improvements suggested by Khaled Noaman, and the general design advice by Michael Glavassevich (both are Xerces committers), helped tremendously while developing Xerces assertions. I must also mention the contribution of Ken Cai, who wrote the original Xerces-PsychoPath interface, and also an initial implementation of that interface.

Saturday, November 14, 2009

Xerces-J XSD 1.1 update: bug fixes and enhancements

The Xerces-J team made a few enhancements to the XSD 1.1 implementation, which solve a few important XSD namespace URI issues that affected the Xerces assertions and Conditional Type Alternatives (CTA) implementations. These changes went into the Xerces-J SVN repository today.

Here is a summary of these improvements:
1) The Xerces-J XSD 1.1 implementation can now pass the XSD language namespace prefix (which is declared on the XSD <schema> element), along with the XSD language URI, as a prefix-URI binding pair to the PsychoPath XPath 2.0 engine. This enhancement allows the XSD language prefix declared on the schema document's <schema> element to be used in assertions and CTA XPath 2.0 expressions, for example as follows:
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" ...>
    ...
    <xs:assert test="xs:string(test) eq 'xxx'" />
    ...
  </xs:schema>

OR say,

  <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" ...>
    ...
    <xsd:assert test="xsd:string(test) eq 'xxx'" />
    ...
  </xsd:schema>

The earlier code in Xerces SVN (before today's commit) hardcoded the XML Schema prefix to the string "xs" while communicating with the PsychoPath XPath 2 engine interface. As a result, XPath 2 expressions in assertions and CTA which used any other XSD prefix, say "xsd", did not evaluate correctly, even if the prefix "xsd" was bound to the XSD namespace on the XSD root element <schema> (because of this bug, the Xerces code before this fix always returned false for such assertions).

This was a significant Xerces assertions and CTA bug, which got solved today, and the fix is now available in the Xerces-J XSD 1.1 development SVN repository.
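
The underlying idea, that an XPath engine resolves prefixes through bindings supplied by its caller, can be illustrated with the JDK's own XPath 1.0 API. This is not the Xerces/PsychoPath interface, only an analogous sketch: the NamespaceContext below binds whichever prefix the caller chooses to the XSD namespace URI, and the class name is hypothetical.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: passes a prefix-URI binding pair to an XPath engine.
final class PrefixBindingSketch {
    static final String XSD_NS = "http://www.w3.org/2001/XMLSchema";

    // Counts assert elements in the XSD namespace, addressed with the given
    // prefix; the prefix only has meaning through the binding supplied below.
    static int countAsserts(String schemaXml, final String prefix) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(schemaXml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override public String getNamespaceURI(String p) {
                    return prefix.equals(p) ? XSD_NS : XMLConstants.NULL_NS_URI;
                }
                @Override public String getPrefix(String uri) {
                    return XSD_NS.equals(uri) ? prefix : null;
                }
                @Override public Iterator<String> getPrefixes(String uri) {
                    return XSD_NS.equals(uri)
                            ? Collections.singletonList(prefix).iterator()
                            : Collections.<String>emptyIterator();
                }
            });
            Double count = (Double) xpath.evaluate(
                    "count(//" + prefix + ":assert)", doc, XPathConstants.NUMBER);
            return count.intValue();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The prefix used in the expression does not have to match the one used in the schema document; what matters is that both are bound to the same namespace URI.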

2) Another enhancement which went into the Xerces-J SVN repository today is the ability to specify the XPath 2.0 F&O namespace declaration on the XSD document root element, <schema>.

This enhancement makes something like the following XSD 1.1 schema valid:
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:fn="http://www.w3.org/2005/xpath-functions" ...>
    ...
     <xs:assert test="xs:string(test) eq fn:string('xxx')" />
    ...
  </xs:schema>

Here the XML Schema author can qualify XPath 2 function calls in assertion XPath expressions with the XPath 2 F&O namespace prefix, like fn:string('xxx') above. The F&O namespace prefix must be bound to the F&O namespace URI, "http://www.w3.org/2005/xpath-functions", for such an XSD schema to be valid.

The following XSD 1.1 schema is also valid (this happened to work correctly even before this Xerces SVN commit):
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" ...>
    ...
     <xs:assert test="xs:string(test) eq string('xxx')" />
    ...
  </xs:schema>

Here the XML Schema author can use XPath 2 functions in Xerces assertions without specifying any prefix, like string('xxx') in the above example. Function calls without the XPath 2 F&O prefix work correctly for all the XPath 2.0 built-in functions in Xerces assertion XPath 2 expressions.

World community grid

There seems to be a nice initiative, the "world community grid". I think IBM sponsors this community computing grid. I have been participating in this grid for quite a few days now, and it really works! I believe it does make a difference to the community good.

The grid is composed of computers, which could be ordinary personal computers at home or in the office, or any kind of computer that can connect to the web. When a grid client is connected to the web, enabled by user authentication, the client computer participates in numerous public computing projects. Joining the grid lets us donate our computer's processing power to the computations needed by these public projects, normally those that require massive computing simulations in a short time.

Joining the grid doesn't disrupt normal user activity on client computers. The grid client uses memory and the CPU intelligently (it needs very little memory while it works, normally as little as 5-10 MB), without disrupting the user's personal activities. It is also possible to configure how the grid client uses one's CPU: somebody may want to work in the default mode, or give more CPU usage to the grid project tasks. The default mode works well for me.

All these details, and much more, are available on the "world community grid" web page.

Friday, November 13, 2009

XML spec and XSD

A few days ago, I started off a pretty lengthy discussion on the xml-dev mailing list about the following topic:

"Should the W3C XML specification specify XML Schema (a.k.a. XSD) also as an XML validation language, as it specifies DTD (Document Type Definition)?"

The XML spec seems to convey that an XML document is valid *only* if it's valid according to a DTD. I contested this point, and started off a debate on the xml-dev list related to this question. I argued that since there are now newer XML validation languages like XSD, RelaxNG, Schematron etc., the XML spec could now modify the definition of XML validity to refer to other XML schema languages as well, rather than saying that an XML document is valid *only* if a DTD is associated with it.

Unfortunately, many people who spoke on xml-dev, and who have been working with XML for long, did not agree with this idea. But alas, I still feel I had/have a valid point about this :(

I am referring to this threaded discussion again here, for the record on this blog. Please follow this link, if anybody wants to read this whole discussion.

Sunday, November 1, 2009

XSLT 1.0: Regular expression string tokenization, and Xalan-J

Some time ago, XSLT folks were debating on xsl-list (ref, http://www.biglist.com/lists/lists.mulberrytech.com/xsl-list/archives/200910/msg00365.html) about how to implement string tokenizer functionality in XSLT. XPath 2.0 (and therefore XSLT 2.0) has a built-in function for this need (ref, fn:tokenize). The XPath 2.0 string tokenizer function, 'fn:tokenize', takes a string and a tokenizing regular expression pattern as arguments. This is something which cannot be done natively in XSLT 1.0. To do this with XSLT 1.0, we need to write a recursive tokenizing "named XSLT template". But a "named XSLT template" for string tokenization in XSLT 1.0 has the limitation that it cannot natively accept an arbitrary regular expression as a tokenizing delimiter.

I got motivated enough to write a Java extension for regular expression based string tokenization in XSLT 1.0 stylesheets, using the Xalan-J XSLT 1.0 engine.

Here's the Java code and a sample XSLT stylesheet for this functionality:

String tokenizer Xalan-J Java extension:
package org.apache.xalan.xslt.ext;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.apache.xpath.NodeSet;
import org.w3c.dom.Document;

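/**
 * Xalan-J extension: splits a string on a regular expression, so that
 * XSLT 1.0 stylesheets can tokenize strings much like fn:tokenize does.
 */
public class XalanUtil {
    public static NodeSet tokenize(String str, String regExp) throws ParserConfigurationException {
      // String.split does the regular expression based tokenization
      String[] tokens = str.split(regExp);
      NodeSet nodeSet = new NodeSet();
       
      // A DOM Document is needed only as a factory for the text nodes
      DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
      DocumentBuilder docBuilder = dbf.newDocumentBuilder();
      Document document = docBuilder.newDocument();
       
      // Return each token to the stylesheet as a text node
      for (int nodeCount = 0; nodeCount < tokens.length; nodeCount++) {
        nodeSet.addElement(document.createTextNode(tokens[nodeCount]));   
      }
       
      return nodeSet;
    }
}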
Sample XSLT stylesheet, using the above Java extension (named, test.xsl):
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0"                                                    
                xmlns:java="http://xml.apache.org/xalan/java"
                exclude-result-prefixes="java">
                 
   <xsl:output method="xml" indent="yes" />
   
   <xsl:param name="str" />
   
   <xsl:template match="/">
     <words>
       <xsl:for-each select="java:org.apache.xalan.xslt.ext.XalanUtil.tokenize($str, '\s+')">
         <word>
           <xsl:value-of select="." />
         </word>
       </xsl:for-each>
     </words>
   </xsl:template>
   
 </xsl:stylesheet>
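Now, for example, when the above stylesheet is run with Xalan as follows: java -classpath <path to the extension java class> org.apache.xalan.xslt.Process -in test.xsl -xsl test.xsl -PARAM str "hello world" (the stylesheet itself doubles as the input document here, since the template only uses the str parameter), the following output is produced: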
<?xml version="1.0" encoding="UTF-8"?>
<words>
 <word>hello</word>
 <word>world</word>
</words>

This illustrates that regular expression based string tokenization works as designed above, in an XSLT 1.0 environment.

The above Java extension should run fine with a minimum JRE level of 1.4, as it relies on the JDK method java.lang.String.split(String regex), which has been available since JDK 1.4.
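
The behaviour of the extension follows directly from String.split, so a couple of edge cases are worth knowing. The small sketch below (the class name is hypothetical) shows that trailing empty tokens are dropped, while a delimiter at the very start of the string yields an empty first token, which the stylesheet above would presumably turn into an empty word element.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: exposes String.split results as a list for inspection.
final class SplitBehaviour {
    static List<String> tokens(String str, String regExp) {
        return Arrays.asList(str.split(regExp));
    }
}
```

For example, tokens("hello world", "\\s+") gives [hello, world], tokens("hello world  ", "\\s+") gives the same two tokens, and tokens(" hello world", "\\s+") gives an empty string followed by hello and world. Callers who do not want the empty token could trim the string in the stylesheet (for instance with normalize-space) before passing it in.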

PS: For easier reading and less verbosity, the package name in the above Java extension class may be omitted, in which case the corresponding XSLT instruction is written as follows:
xsl:for-each select="java:XalanUtil.tokenize(...
I would personally prefer this coding style for production Java XSLT extensions, though this should not matter, and in my opinion the decision can be left to individual XSLT developers.

I hope that this was useful.