Sunday, November 1, 2009

XSLT 1.0: Regular expression string tokenization, and Xalan-J

Some time ago, XSLT folks were debating on xsl-list (ref) how to implement string tokenizer functionality in XSLT. XPath 2.0 (and therefore XSLT 2.0) has a built-in function for this need (ref, fn:tokenize). The XPath 2.0 function fn:tokenize takes a string and a tokenizing regular expression pattern as arguments. This cannot be done natively in XSLT 1.0. Instead, with XSLT 1.0 we need to write a recursive tokenizing named template. But such a named template has a limitation: it cannot natively accept an arbitrary regular expression as the tokenizing delimiter.
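To make the limitation concrete, here is a minimal Java sketch that mirrors the shape of a typical recursive XSLT 1.0 tokenizing template (take the text before the first delimiter, then recurse on the text after it). It is an illustration only, not the template discussed on xsl-list; the class name RecursiveTokenizer and its method are hypothetical. Because the delimiter is a fixed string, a run of several spaces yields empty tokens, which a pattern such as \s+ would avoid.

```java
import java.util.ArrayList;
import java.util.List;

public class RecursiveTokenizer {
    // Hypothetical helper: splits on a fixed delimiter string, the way a recursive
    // XSLT 1.0 template built on substring-before/substring-after would.
    public static void tokenize(String str, String delimiter, List<String> out) {
        if (delimiter.length() == 0) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        int pos = str.indexOf(delimiter);
        if (pos < 0) {
            // No delimiter left: the remaining text is the last token.
            out.add(str);
            return;
        }
        out.add(str.substring(0, pos));
        // Recurse on the text after the delimiter.
        tokenize(str.substring(pos + delimiter.length()), delimiter, out);
    }

    public static void main(String[] args) {
        List<String> simple = new ArrayList<String>();
        tokenize("a,b,c", ",", simple);
        System.out.println(simple); // prints [a, b, c]

        // Three spaces between the words: a fixed delimiter produces empty tokens.
        List<String> spaced = new ArrayList<String>();
        tokenize("hello   world", " ", spaced);
        System.out.println(spaced.size()); // prints 4
    }
}
```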

I got motivated enough to write a Java extension for regular expression based string tokenization in XSLT 1.0 stylesheets, using the Xalan-J XSLT 1.0 engine.

Here is the Java code and a sample XSLT stylesheet for this functionality:

String tokenizer Xalan-J Java extension:
package org.apache.xalan.xslt.ext;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.apache.xpath.NodeSet;
import org.w3c.dom.Document;

public class XalanUtil {
    public static NodeSet tokenize(String str, String regExp) throws ParserConfigurationException {
      String[] tokens = str.split(regExp);
      NodeSet nodeSet = new NodeSet();
      DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
      DocumentBuilder docBuilder = dbf.newDocumentBuilder();
      Document document = docBuilder.newDocument();
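      for (int nodeCount = 0; nodeCount < tokens.length; nodeCount++) {
          // Assumed loop body: wrap each token in a DOM text node and add it to the node set
          nodeSet.addNode(document.createTextNode(tokens[nodeCount]));
      }
      return nodeSet;
    }
}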
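To experiment with the tokenizing step without Xalan on the classpath, here is a minimal sketch that uses only the JDK's DOM classes. The class name TokenizeSketch and the method tokenizeToNodes are hypothetical, and the method returns a java.util.List rather than Xalan's NodeSet, so it is not a drop-in replacement for the extension above. It also uses Java 5 syntax (generics and the enhanced for loop).

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class TokenizeSketch {
    // Hypothetical helper: splits str on the regular expression regExp and
    // wraps each token in a DOM text node owned by a fresh, empty document.
    public static List<Node> tokenizeToNodes(String str, String regExp)
            throws ParserConfigurationException {
        Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();
        List<Node> nodes = new ArrayList<Node>();
        for (String token : str.split(regExp)) {
            nodes.add(document.createTextNode(token));
        }
        return nodes;
    }

    public static void main(String[] args) throws ParserConfigurationException {
        // Prints one token per line.
        for (Node node : tokenizeToNodes("hello world", "\\s+")) {
            System.out.println(node.getNodeValue());
        }
    }
}
```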
Sample XSLT stylesheet, using the above Java extension (named, test.xsl):
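<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:java="http://xml.apache.org/xalan/java">
   <xsl:output method="xml" indent="yes" />
   <xsl:param name="str" />
   <xsl:template match="/">
       <xsl:for-each select="java:org.apache.xalan.xslt.ext.XalanUtil.tokenize($str, '\s+')">
           <xsl:value-of select="." />
       </xsl:for-each>
   </xsl:template>
</xsl:stylesheet>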
For example, when the above stylesheet is run with Xalan as follows: java -classpath <path to the extension java class> org.apache.xalan.xslt.Process -in test.xsl -xsl test.xsl -PARAM str "hello world", the following output is produced:
<?xml version="1.0" encoding="UTF-8"?>

This illustrates that regular expression based string tokenization was applied as designed above, in an XSLT 1.0 environment.

The above Java extension should run fine with a minimum JRE level of 1.4, as it relies on the JDK method java.lang.String.split(String regex), which has been available since JDK 1.4.
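Since the extension hands the pattern straight to String.split, its edge-case behaviour is that of the JDK method. The sketch below shows a few cases; the class name SplitDemo and its helper are hypothetical. Note that in Java source the pattern is written "\\s+", whereas in the stylesheet it is written '\s+', because XPath string literals do not treat the backslash as an escape character.

```java
import java.util.Arrays;

public class SplitDemo {
    // Hypothetical helper: the same call the extension relies on.
    public static String[] tokenize(String str, String regExp) {
        return str.split(regExp);
    }

    public static void main(String[] args) {
        // Whitespace runs of any length act as a single delimiter.
        System.out.println(Arrays.toString(tokenize("hello   world", "\\s+")));
        // prints [hello, world]

        // A character class allows several alternative delimiters.
        System.out.println(Arrays.toString(tokenize("a,b;c", "[,;]")));
        // prints [a, b, c]

        // Leading delimiter: an empty first token is kept.
        // Trailing delimiter: trailing empty tokens are dropped.
        System.out.println(tokenize(" hello world ", "\\s+").length);
        // prints 3
    }
}
```

If the empty leading token is unwanted, trimming the input before splitting is one option.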

PS: For easier reading and less verbosity, the package name in the above Java extension class may be omitted, in which case the corresponding XSLT instruction would be written as follows:
xsl:for-each select="java:XalanUtil.tokenize(... I would personally prefer this coding style for production Java XSLT extensions. However, this should not matter, and in my opinion the decision can be left to individual XSLT developers.

I hope this was useful.


digger said...

This is a fascinating problem. I was hoping this would work, but I got:

ERROR: 'Error checking type of the expression ''.'
FATAL ERROR: 'Could not compile stylesheet'

(Location of error unknown)XSLT Error (javax.xml.transform.TransformerConfigurationException): Could not compile stylesheet

Mukul Gandhi said...

It seems you are using the Xalan that is bundled with the Sun JDK. To get this to work, I recommend using the latest version of Apache Xalan-J (ref,