Sunday, November 1, 2009

XSLT 1.0: Regular expression string tokenization, and Xalan-J

Some time ago, XSLT folks were debating on xsl-list (ref, http://www.biglist.com/lists/lists.mulberrytech.com/xsl-list/archives/200910/msg00365.html) how to implement string tokenizer functionality in XSLT. XPath 2.0 (and therefore XSLT 2.0) has a built-in function for this need (ref, fn:tokenize). The XPath 2.0 string tokenizer function, 'fn:tokenize', takes a string and a tokenizing regular expression pattern as arguments. This cannot be done natively in XSLT 1.0. To do it with XSLT 1.0, we need to write a recursive tokenizing "named XSLT template". But a "named XSLT template" written in XSLT 1.0 has a limitation: it cannot natively accept an arbitrary regular expression as the tokenizing delimiter.
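To make the limitation concrete, here is a minimal Java sketch of the recursion that such a named template typically performs. It is only an analogy: the class and method names (RecursiveTokenizer, tokenize, collect) are hypothetical, and indexOf/substring stand in for the XPath 1.0 functions contains(), substring-before() and substring-after(). The delimiter is a literal string, so a run of several spaces is not treated as a single separator.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: mirrors the recursion of an XSLT 1.0
// tokenizing template, using a literal (non-regex) delimiter.
public class RecursiveTokenizer {

    public static List<String> tokenize(String str, String delim) {
        if (delim.isEmpty()) {
            // An empty delimiter would never shrink the input
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<String> tokens = new ArrayList<String>();
        collect(str, delim, tokens);
        return tokens;
    }

    private static void collect(String str, String delim, List<String> out) {
        int pos = str.indexOf(delim);
        if (pos < 0) {
            // No delimiter left: the remainder is the last token
            out.add(str);
            return;
        }
        // Text before the delimiter is a token; recurse on the text after it
        out.add(str.substring(0, pos));
        collect(str.substring(pos + delim.length()), delim, out);
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a,b,c", ","));
        // Three spaces between the words produce two empty tokens
        System.out.println(tokenize("hello   world", " "));
    }
}
```

With a literal single-space delimiter, "hello   world" (three spaces) yields four tokens, two of them empty, which is exactly the case that a pattern such as \s+ handles in one step.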

I got motivated enough to write a Java extension that provides regular expression based string tokenization for XSLT 1.0 stylesheets, using the Xalan-J XSLT 1.0 engine.

Here's the Java code and a sample XSLT stylesheet for this functionality:

String tokenizer Xalan-J Java extension:
package org.apache.xalan.xslt.ext;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.apache.xpath.NodeSet;
import org.w3c.dom.Document;

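public class XalanUtil {
    // Splits 'str' around matches of the regular expression 'regExp' and
    // returns the tokens as a node-set of text nodes, usable from XPath.
    public static NodeSet tokenize(String str, String regExp) throws ParserConfigurationException {
      String[] tokens = str.split(regExp);
      NodeSet nodeSet = new NodeSet();

      // The DOM Document is needed only as a factory for the text nodes
      DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
      DocumentBuilder docBuilder = dbf.newDocumentBuilder();
      Document document = docBuilder.newDocument();

      // Wrap each token in a text node and add it to the result
      for (int nodeCount = 0; nodeCount < tokens.length; nodeCount++) {
        nodeSet.addElement(document.createTextNode(tokens[nodeCount]));
      }

      return nodeSet;
    }
}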
Sample XSLT stylesheet using the above Java extension (named test.xsl):
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0"                                                    
                xmlns:java="http://xml.apache.org/xalan/java"
                exclude-result-prefixes="java">
                 
   <xsl:output method="xml" indent="yes" />
   
   <xsl:param name="str" />
   
   <xsl:template match="/">
     <words>
       <xsl:for-each select="java:org.apache.xalan.xslt.ext.XalanUtil.tokenize($str, '\s+')">
         <word>
           <xsl:value-of select="." />
         </word>
       </xsl:for-each>
     </words>
   </xsl:template>
   
 </xsl:stylesheet>
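Now, for example, when the above stylesheet is run with Xalan as follows: java -classpath <path to the extension java class> org.apache.xalan.xslt.Process -in test.xsl -xsl test.xsl -PARAM str "hello world" (the stylesheet reads only the str parameter, so any well-formed XML document, here test.xsl itself, can serve as the input), the following output is produced: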
<?xml version="1.0" encoding="UTF-8"?>
<words>
 <word>hello</word>
 <word>world</word>
</words>

This illustrates that regular expression based string tokenization, as designed above, works in an XSLT 1.0 environment.

The above Java extension should run fine with a minimum JRE level of 1.4, as it relies on the JDK method java.lang.String.split(String regex), which is available since JDK 1.4.
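Since the extension delegates to String.split, it inherits that method's handling of edge cases. The following small sketch (the class name SplitBehaviour and its helper are hypothetical, written only for this illustration) shows two of them: a delimiter match at the very start of the input produces an empty first token, while trailing empty tokens are dropped.

```java
import java.util.Arrays;

// Hypothetical illustration of java.lang.String.split(String regex),
// the JDK method the XalanUtil extension relies on.
public class SplitBehaviour {

    // Same call the extension makes before wrapping tokens in text nodes
    public static String[] tokens(String str, String regExp) {
        return str.split(regExp);
    }

    public static void main(String[] args) {
        // Runs of whitespace count as one delimiter
        System.out.println(Arrays.toString(tokens("hello   world", "\\s+")));
        // Leading whitespace yields an empty first token;
        // trailing whitespace yields no extra token
        System.out.println(Arrays.toString(tokens(" hello world ", "\\s+")));
        // Any regular expression works as the delimiter
        System.out.println(Arrays.toString(tokens("a1b22c333d", "[0-9]+")));
    }
}
```

A caller who does not want the empty leading token could trim the input before passing it as the str parameter, for example with normalize-space($str) on the XSLT side.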

PS: For easier reading and less verbosity, the package name in the above Java extension class may be omitted, which allows the corresponding XSLT instruction to be written as follows:
xsl:for-each select="java:XalanUtil.tokenize(... I would personally prefer this coding style for production Java XSLT extensions. Though this should not matter, and in my opinion the decision can be left to individual XSLT developers.

I hope that this was useful.

Sunday, October 25, 2009

Mozilla Firefox and XPath namespace axis

C. M. Sperberg-McQueen shared with us that the Mozilla Firefox browser doesn't implement the XPath namespace axis (ref, http://cmsmcq.com/mib/?p=757). CMSMcQ has encouraged us to cast a vote on the Mozilla forum, to push the Firefox team to implement the XPath namespace axis. I agree with CMSMcQ, and I also find the namespace axis to be quite critical functionality for the XPath data model. This is certainly true in XPath 1.0, where the namespace axis is very critical (and Mozilla implements XPath 1.0). In XPath 2.0, the namespace axis is deprecated, but namespace nodes are still a core part of the XPath 2.0 data model as well.

I have already cast my vote in support of this at https://bugzilla.mozilla.org/show_bug.cgi?id=94270.

Others might follow, please.

Java downloads, archive

Anybody looking to download older versions of Java compilers and runtimes, with documentation, can access the Java SDK & JRE archive at http://java.sun.com/products/archive/.

I can see Java compilers from Java 1.1 to 1.6 available there (that looks cool to me).

I have already downloaded a couple of older Java versions, which I need primarily for investigating a port of the PsychoPath XPath 2.0 engine to an older Java release. At present, the PsychoPath engine requires Java 1.5 at minimum. We might try to port the PsychoPath XPath engine to Java 1.4. But if that happens, it is likely to take some time :(

Saturday, October 24, 2009

Martin Fowler: UML Distilled, 3rd Edition

I have been reading Martin Fowler's book, "UML Distilled, 3rd Edition", for the last few months (my book reading has been very slow, given the time I spend on the web these days for most of my learning).

This is a great UML book (only 175 pages, but very good), and I recommend it to anybody wanting to learn about UML (Unified Modeling Language).

How XPath compares values in predicates

A user asked a question similar to the following on the IBM developerWorks XQuery and XPath forum:

What does A = B and A != B mean in XPath expressions?

Michael Kay provided a very nice explanation:
The operators "=" and "!=" in XPath use "implicit existential quantification". So A=B is shorthand for "some $a in A, $b in B satisfies $a eq $b" (the longhand form is legal in XPath 2.0), while A!=B is shorthand for "some $a in A, $b in B satisfies $a ne $b".

So, not(A=B) is true if there is no pair of items from A and B that are equal, while (A!=B) is true if there is a pair of values that are not equal. In practice, you nearly always want not(A=B).
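The difference is easy to check with the XPath 1.0 engine that ships with the JDK (javax.xml.xpath). The sketch below is only an illustration: the class name XPathComparisonDemo and the sample document are made up for this purpose, and the "some ... satisfies" longhand is not used because it requires XPath 2.0.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical illustration of existential "=" and "!=" on node-sets.
public class XPathComparisonDemo {

    // Sample data: a = {1, 2}, b = {2, 3}, and no c elements at all
    static final String XML = "<r><a>1</a><a>2</a><b>2</b><b>3</b></r>";

    // Evaluates an XPath 1.0 expression against XML as a boolean
    public static boolean eval(String expr) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            InputSource source = new InputSource(new StringReader(XML));
            return (Boolean) xpath.evaluate(expr, source, XPathConstants.BOOLEAN);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // true: the pair (2, 2) is equal
        System.out.println(eval("/r/a = /r/b"));
        // true as well: the pair (1, 2) is unequal
        System.out.println(eval("/r/a != /r/b"));
        // false: some pair is equal, so the negation fails
        System.out.println(eval("not(/r/a = /r/b)"));
        // false: an empty node-set offers no pair to compare
        System.out.println(eval("/r/c != /r/a"));
        // true: no pair exists, hence no equal pair
        System.out.println(eval("not(/r/c = /r/a)"));
    }
}
```

Both A = B and A != B are true for the sample data at the same time, and the last two lines show how the two forms diverge when one side is empty.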

Wednesday, October 21, 2009

W3C: The web site has a new look

The W3C web site now has a new look (ref, http://www.w3.org/). I am still trying to come to terms with the new W3C site design, though. It takes some time for a user new to the design, like me, to find the old W3C recommendations.

I guess this redesign of www.w3.org has come after 8-10 years. Many parts of the new W3C site are empty at present and need to be filled with references and user material.

Sunday, September 27, 2009

OO multiple inheritance, and Java

I have been thinking again about multiple inheritance and why Java doesn't support it. I wrote a bit about this topic some time ago.

There are so many resources on the web about this, and it is actually very easy to find the answer via a simple web search. Here is the article where I started looking for an answer, http://www.javaworld.com/javaqa/2002-07/02-qa-0719-multinheritance.html, which pointed me to this white paper by James Gosling and Henry McGilton. I had not read this white paper by Java's creators earlier (I never came across it :)), in spite of being familiar and working with Java for a long time. Sometimes we find gems on the web in unexpected ways (I mean, this paper is a gem for me :)). I'll try to read this paper (hopefully fully, and understanding it) over the next few days.

And here is a link in this white paper, which explains why Java doesn't support multiple inheritance. Another link in the white paper is also interesting: it gives a complete overview of the C and C++ features that were omitted from the Java language (Java has been influenced by C and C++).

My personal opinion is that if we really must use multiple inheritance, we should just write our programs in C++. On the other hand, my experience of using Java for about a decade convinces me that Java is suitable for solving almost any business application problem, and that the absence of multiple inheritance in Java is not a hindrance to designing good programming abstractions for a problem domain. Advantages like Java's byte code portability and web friendliness far outweigh any disadvantages caused by the absence of multiple inheritance. On numerous occasions, I have created Java byte code on Windows and used it without modification on Unix based systems (and vice versa). This is something which is built into the Java language, and it is cool!
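As one illustration of how such abstractions are commonly designed without multiple class inheritance, here is a minimal sketch that combines two behaviours through interfaces and delegation. All names in it (Printer, Scanner, MultiFunctionDevice and so on) are hypothetical and chosen only for this example; it is one possible approach, not the only one.

```java
// Hypothetical illustration: a class takes on two roles by implementing
// two interfaces and delegating to existing implementations.
public class InterfaceCompositionDemo {

    public interface Printer {
        String print(String doc);
    }

    public interface Scanner {
        String scan();
    }

    public static class BasicPrinter implements Printer {
        public String print(String doc) {
            return "printed: " + doc;
        }
    }

    public static class BasicScanner implements Scanner {
        public String scan() {
            return "scanned page";
        }
    }

    // Java allows many interfaces per class, but only one superclass.
    // Behaviour is reused by delegation instead of by inheritance.
    public static class MultiFunctionDevice implements Printer, Scanner {
        private final Printer printer = new BasicPrinter();
        private final Scanner scanner = new BasicScanner();

        public String print(String doc) {
            return printer.print(doc);
        }

        public String scan() {
            return scanner.scan();
        }
    }

    public static void main(String[] args) {
        MultiFunctionDevice device = new MultiFunctionDevice();
        // The same object can be used wherever either role is expected
        Printer asPrinter = device;
        Scanner asScanner = device;
        System.out.println(asPrinter.print("report"));
        System.out.println(asScanner.scan());
    }
}
```

The delegation costs a few forwarding methods, but it avoids the ambiguity that arises when two superclasses define the same member.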