Mukul Gandhi: 2008

Sunday, December 28, 2008

A static code quality tool for XSLT code

I wrote a small tool to measure and improve the code quality of XSLT programs/scripts. It's available at http://gandhimukul.tripod.com/xslt/xslquality.html.

I hope that the XSLT community might find this useful.

Any comments/suggestions about this tool would be most welcome.

Saturday, December 13, 2008

Michael Kay on application processing

Roger L. Costello summarized an interesting discussion we had on the xml-dev list, citing Michael Kay's thoughts (I thought I'd share it with readers here):

Question:

What language(s) do you recommend using for application processing?

Michael Kay:

... do everything using XML-based processing languages (XSLT and XQuery).

The idea I'm pushing is that you write the application end-to-end using XML-based languages. That means that you never convert the data into C++ or Java data types. If you only go half way, by calling XQuery from Java and then converting the results to Java values, you lose half the benefits.

Question:

By "write the application end-to-end using XML-based languages" do you mean XML end-to-end or PSVI end-to-end?

Michael Kay:

I'm not too fussed how the XML is represented, the important thing is to avoid converting it unnecessarily into non-XML models such as Java objects or SQL tables. Where "unnecessarily" means "except as required by the need to interface with other applications".

Question:

Isn't XSLT a special-purpose programming language?

Michael Kay:

It's true that XSLT is a specialized language rather than a general-purpose language. But it's capable of a lot more than some people imagine. Some people never really get below the surface to discover its depth.

Question:

What's wrong with using a mix of XML-based processing languages and imperative languages (Java, C++, etc)?

Michael Kay:

[If you use a mix of languages then you end up] spending 75% of your programming effort converting your data between one type system and another.

Writing your application in a single language, if you can do it, will always give you a substantial reduction in complexity versus using multiple languages.

And writing in a language that is well adapted to the data it is required to process will save you a lot of effort (human and machine effort) in doing data conversion.

Question:

Aren't there some applications for which an imperative language is needed? Aren't some problems so complex that they need an imperative language?

Michael Kay:

There are no applications that NEED imperative programming (functional programming has provably the same computational power). Whether things are more EASILY done that way is in large measure a matter of your skills and experience. (I remember working with programmers from an older generation who claimed coding was easier if they used GOTO statements.) There are one or two things I still find easier in imperative languages - notably some graph-walking applications - but they are few and far between.

Many problems that appear to be so complex that you need an imperative language turn out, on examination, to be complex only BECAUSE you are using an imperative language.

Question:

I agree that functional languages like XSLT are quite useful. But many of the underlying hardware architectures are von Neumann based and von Neumann machines work best with imperative programs (languages).

Michael Kay:

If we believed that we would all still be writing in Assembler, or at any rate using GOTO statements.

Question:

Isn't application processing faster using an imperative language? Aren't imperative languages easier for programmers to use?

Michael Kay:

... on both ease-of-use and performance I would go for XSLT or XQuery in preference to lower-level languages every time.

If you're receiving lexical XML from a web service, the time taken to process it in XSLT or XQuery is usually less than the time taken to parse and validate it. I would take a lot of convincing that a data binding approach is likely to be faster, given the cost of marshalling and unmarshalling the data.

And on ease of use, I've seen programmers struggling with regenerating all their Java classes when the schema changes and it's horrendous. (Worse, I've seen people refuse to change the schema because it has become too expensive to contemplate!) Having two different models of the same data, understanding how they relate, and organizing yourself to keep them in sync is simply complexity that you don't need.

Question:

XQuery can be usable when you need to access a small subset of an XML document. However, when one needs to access most of the data, or, worse, access the same data many times, data binding will have speed/memory advantage.

Michael Kay:

Evidence please! I don't see any reason why it should.

(There are still people who maintain that coding in assembler is faster than in C, or that coding in C is faster than in Java. All the evidence is that there are very few people with the skills in the lower-level language to beat the optimizers for the higher-level language. It can be done if you try hard enough, but not by the average programmer.)

Question:

For most web service applications the Java (or other programming language) representation is actually primary, and XML is just being used for interchange. So surely application processing should be done in an imperative language, right?

Michael Kay:

I don't think it matters which is primary, the complexity comes from having two representations and keeping them aligned. But I would have thought the format used for data interchange is primary in the sense that it needs to be agreed with other parties, whereas the Java representation is under local control. (Unless of course you're running the same code at both ends, in which case I'm not sure why you're using XML at all.)

Question:

Suppose I have a Web page that contains a form that users fill in. When the submit button is pressed the form data is sent to a server. Don't we need something like Java to receive the data? Doesn't this break the "all-XML" approach you advocate?

Michael Kay:

It doesn't really matter too much if you have a bit of Java glue to accept the input and fire off the [XSLT] transformation, so long as it doesn't try to manipulate the data. But why use Java for this? There are plenty of higher-level frameworks that will do the job - since you're using forms, Orbeon does the job nicely.

Question:

Why is there such variance in data binding tools with respect to mapping XML Schema structures into data structures in an imperative language?

Michael Kay:

XSD is used for other things [in addition to validation], notably data binding. To a large extent data binding is outside the scope of the XSD specification itself (XSD doesn't tell you how its types map to Java or C++) and this probably explains why there is wide variation between products in this area.

Question:

I would be interested to hear about any non-trivial and practical application other than "XML in, XML out", format changing kind, that are written exclusively in XSLT/XQuery.

Michael Kay:

How about an application for managing the creation, processing, and review of capital spending proposals in a large international corporation. The proposals are entered by form-filling using XForms, and each proposal is an XML document in an XML database. The rules defining the approval process (based on the nature of the capital project) are defined in a business rules document, also XML, (for example "anything over $1m requires CEO approval") and the abstract roles defined in that document (such as "CEO", or "finance controller, Taiwan") are mapped to real people in another XML document generated by a transformation of an XML dump of the LDAP directory. There's also an XML document that defines the corporate reporting structure, or actually two structures one functional and one geographic. These rules are used to construct an approval schedule for the proposal, which is also stored in the XML database, and emails are sent to the relevant people at the relevant times, with clickable URLs that they can use to access the application and approve or deny requests, or ask for more information. There's a full query/reporting system built using the same technology: you define a query by form-filling using XForms, and this generates an XQuery to get the data from the database and an XSLT stylesheet to format the query results.

I believe the only parts of this application which are not written in XSLT, XQuery, XForms, or the Orbeon XML pipeline language (a precursor to XProc) are one or two simple extension functions to do things like sending an email or translating Base64 data from the LDAP directory.

Saturday, December 6, 2008

Feasibility of "do all application coding in the XML languages"?

Roger L. Costello started an interesting discussion (on 01-Dec-08) on the xml-dev mailing list about the following idea:

I am exploring the idea of "do all application coding in the XML languages."

Here is a response from a colleague:

"... in general XSLT is cool but limited. If your transform requires any "higher math" or advanced functionality or external code libraries (such as geometry coordinate system libraries), you almost always have to go back to a higher level language (such as Java) at some point."

Does my colleague make a true or false statement?

XML professionals on the list expressed interesting opinions about this. The thread is archived at http://lists.xml.org/archives/xml-dev/200812/msg00007.html.

I thought I'd share this with readers of this blog.

Saturday, November 22, 2008

Are multiple XPath predicates the same as the boolean "and" operator?

I was unsure about this, and asked the following question on xsl-list.

Supposing I write the following XPath expressions,

1) X[c1][c2] or specifying generically, X[c1][c2]...[cn]

2) X[c1 and c2] or specifying generically, X[c1 and c2 ... and cn]

where c1, c2 etc. are boolean expressions.

are the two forms (1 & 2) above exactly equivalent (i.e., will they return the same nodeset/sequence)? I think yes ... but just wanted to confirm with the list.

if 1 & 2 are exactly equivalent, then what could be the rule of thumb for using which form in certain scenarios?

There was a good discussion on the list about this, and list members shared some useful thoughts.

Below is a summary of the points we discussed on the list.

1. David Carlisle

> are the two forms (1 & 2) above exactly equivalent

No

compare

X[position()=2][position()=2]

and

X[position()=2 and position()=2]

the first one is

()

the second is

X[2]

David further wrote,

Context position (position()) and size (last()) do change, so basically repeated filters are equivalent to "and" unless any of them depend on position() or last(), including the special case of [integer] being equivalent to [position()=integer]. This last case is what makes it tricky to do a static rewrite of this.

If you have

X[... foo ..][... bar ...]

you can only rewrite that to

X[(... foo ..) and (... bar ...)]

if you know that neither expression will evaluate to a number at run time.

2. Vasu Chakkera

If that were true, then the condition

myelement[@myattribute][1] should be same as
myelement[1][@myattribute], which is not true...

The predicate order is important

in a typical "and"

[a and b] = [b and a]

3. Andrew Welch

Only / will change the context node, so I would've thought one predicate after the other is pretty much equivalent apart from cases that rely on size of the selection (which is the only thing that changes after each predicate).
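
The cases David and Vasu describe are easy to check with any XPath 1.0 engine. Below is a minimal sketch in Java using the JDK's built-in javax.xml.xpath API. The sample document, the class name PredicateOrderDemo and the helper ids() are made up for illustration; they are not from the list discussion.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class PredicateOrderDemo {

    // Hypothetical sample: three X elements, only the last two carry attribute a.
    static final String XML = "<r><X id='1'/><X id='2' a='y'/><X id='3' a='y'/></r>";

    /** Evaluates an XPath 1.0 expression against XML and returns the selected ids, comma-separated. */
    static String ids(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression, new InputSource(new StringReader(XML)), XPathConstants.NODESET);
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < nodes.getLength(); i++) {
                if (i > 0) {
                    out.append(',');
                }
                out.append(((Element) nodes.item(i)).getAttribute("id"));
            }
            return out.toString();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] expressions = {
            "/r/X[position()=2][position()=2]",    // selects nothing: the second filter sees a one-item list
            "/r/X[position()=2 and position()=2]", // selects the second X
            "/r/X[@a][1]",                         // first of the X elements that have @a (id 2)
            "/r/X[1][@a]",                         // the first X, kept only if it has @a (nothing here)
            "/r/X[@a][@id='3']",                   // no positional test, so this ...
            "/r/X[@a and @id='3']"                 // ... selects the same node as this
        };
        for (String expression : expressions) {
            System.out.println(expression + " -> [" + ids(expression) + "]");
        }
    }
}
```

Only the predicates that look at position (explicitly, or through a bare number) behave differently when chained; the last two expressions select the same node.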

Mukul: I asked a related follow-up question.

In real-world XSLT/XPath programs, how many predicates do we typically see?

I haven't seen programs using 3, 4 or more predicates.

X[..][..][..][..]

I have used only one or two predicates up to now.

Is an excessively large number of predicates really useful (even though the syntax allows it)?

I think multiple predicates are perhaps useful for complex 'and' conditions.

David shared an interesting observation about this:

He has been using some stylesheets with up to 11 predicates.

He wrote:

> Is an excessively large number of predicates really useful (even though the syntax allows it)?

isn't that like asking if complicated expressions are useful? they are useful if you need them, otherwise they are not.

3 or 4 predicates is totally routine but the most common reason for having larger numbers is to filter attributes

[not(@purpose='iemode')]
[not(@purpose='artifact')]
[not(@purpose='w-dimension')]

is equivalent to

[not(@purpose='iemode') and
 not(@purpose='artifact') and
 not(@purpose='w-dimension')]

but I'd almost always use the first form in XSLT 1.0 because it's easier to indent and easier to refactor, but if starting from the beginning in XSLT 2.0 I'd write it as

[not(@purpose=('iemode','artifact','w-dimension'))]

Mukul: This was a nice discussion, and I learnt a few useful concepts.

Tuesday, November 4, 2008

fn:contains -> multiple strings to compare with

An XSLT user asked the following question on xsl-list:

I want to have something that does this: contains('$d/ris:organ/text()', 'Hamburg' or 'Koblenz' or 'xxx'...) ===> Compare 1 String with multiple strings.

instead of: contains('$d/ris:organ/text()','Hamburg') or contains('$d/ris:organ/text()','Koblenz')...

Andrew Welch suggested the following answer:

some $x in ('Hamburg', 'Koblenz', 'xxx') satisfies
contains($d/ris:organ/text(), $x)

This uses the XPath 2.0 quantified expression, "some".

This is cool.

I was prompted to share Andrew's answer here because I had come up with a lengthy and perhaps inefficient solution for this (I feel a bit stupid, actually :) ):

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                xmlns:my="http://my-functions"
                version="2.0">

 <xsl:output indent="yes" omit-xml-declaration="yes" />

 <xsl:template match="/">
  <xsl:variable name="str" select="'hello xxx dd'" />
  <xsl:variable name="list" select="('Hamburg','Koblenz','xxx')" />

  <xsl:if test="my:contains($str, $list)">
     matches
  </xsl:if>

 </xsl:template>

 <!-- a custom 'contains' implementation -->
 <xsl:function name="my:contains" as="xs:boolean">
  <xsl:param name="str" as="xs:string" />
  <xsl:param name="list" as="xs:string+" />

  <xsl:variable name="temp" as="xs:boolean*">
    <xsl:for-each select="$list">
      <xsl:if test="contains($str, .)">
        <xsl:sequence select="xs:boolean('true')" />
      </xsl:if>
    </xsl:for-each>
  </xsl:variable>

  <xsl:sequence select="if ($temp[1] = xs:boolean('true')) then
xs:boolean('true') else xs:boolean('false')" />

 </xsl:function>

</xsl:stylesheet>
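
For readers who drive their stylesheets from Java, Andrew's "some ... satisfies" expression reads much like an any-match test over a list. The sketch below is plain Java rather than XPath, and is only a rough analogue; the class and method names (ContainsAny, containsAny) are made up for illustration.

```java
import java.util.List;

final class ContainsAny {

    /** Returns true if text contains at least one of the candidate substrings. */
    static boolean containsAny(String text, List<String> candidates) {
        // anyMatch is true as soon as one candidate matches, and false for an
        // empty list, which mirrors "some $x in (...) satisfies contains(..., $x)".
        return candidates.stream().anyMatch(text::contains);
    }

    public static void main(String[] args) {
        List<String> names = List.of("Hamburg", "Koblenz", "xxx");
        System.out.println(containsAny("hello xxx dd", names)); // true
        System.out.println(containsAny("hello world", names));  // false
    }
}
```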

Saturday, September 27, 2008

boolean value of a RTF, in XSLT 1.0

A recent discussion on xsl-list taught me something about a feature of the XSLT 1.0 language.

Let's say there is a variable declaration like the following in XSLT 1.0 code:

<xsl:variable name="temp">
  <!-- anything could be here -->
</xsl:variable>

According to the XSLT 1.0 specification, the content of the variable temp is called an RTF (Result Tree Fragment). Now, what is the result of the function call boolean($temp)?

It always returns true.

This is because the XSLT 1.0 specification treats an RTF like a node-set containing just a single root node, and the boolean value of a node-set with at least one node is true. (This holds whenever the variable has content; a variable with empty content and no select attribute is an empty string rather than an RTF.)

It seems to me that the function call boolean($rtfVariable) is not very useful to applications, as it's always true.

This post applies to XSLT 1.0. I haven't checked what the rules are in XSLT 2.0.
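
The behaviour can be observed with the XSLT 1.0 processor bundled with the JDK. The following is a minimal sketch, not a definitive test: the class name RtfBooleanDemo and the embedded stylesheet are made up for illustration. The variable body is an xsl:if that never fires, so the RTF has a root node but no children. Converting it to a string first gives a test that does depend on the content.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class RtfBooleanDemo {

    // Illustrative stylesheet: $temp is an RTF whose body produces no output.
    static final String XSL =
          "<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/'>"
        + "<xsl:variable name='temp'><xsl:if test='false()'>never</xsl:if></xsl:variable>"
        + "<xsl:value-of select='boolean($temp)'/>"          // boolean value of the RTF itself
        + "<xsl:text>|</xsl:text>"
        + "<xsl:value-of select='boolean(string($temp))'/>"  // boolean value of its string value
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Runs the stylesheet against a dummy document and returns the text output. */
    static String run() {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader("<doc/>")), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

A conforming XSLT 1.0 processor should report true for the first value and false for the second, since the string value of an RTF with no text content is the empty string.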

Friday, September 12, 2008

Should we include <?xml version="1.0" ... in XPath data model, or DOM?

We all know that the following declaration appears at the beginning of most XML documents.

[1]
<?xml version="1.0" encoding="UTF-8"?>

But the details of this declaration are not included in a DOM tree, nor are they part of the XPath (2.0) data model. In my opinion, the declaration is used by the XML parser to initialize certain behaviors (or to enable certain properties).

I had an odd thought: should we not include this information in the XPath data model, perhaps as properties of the document node, and, in the case of DOM, in the DOM tree? If the declaration is not present in the XML document, the relevant properties could be empty sequences.

The XML declaration is available in the XML infoset [2], but it's not included in a DOM, or the XPath data model.

I think, perhaps the XML declaration information is not useful to end user applications, and is only useful to the XML parser.

[2] http://www.w3.org/TR/xml-infoset/

Michael Glavassevich on xml-dev list corrected me, about DOM:

Information from the XML declaration is already stored in the DOM [3][4][5] (since DOM Level 3). Within the API the values have an effect on serialization, in-memory well-formedness checking and in-memory validation.

[3] http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/core.html#Document3-version
[4] http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/core.html#Document3-encoding
[5] http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/core.html#Document3-standalone
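
The three DOM Level 3 attributes Michael refers to are exposed in Java as methods of org.w3c.dom.Document. Here is a minimal sketch using the JDK's default parser; the class name XmlDeclarationDemo and the helper methods are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class XmlDeclarationDemo {

    /** Parses an XML string (as UTF-8 bytes) into a DOM Document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Reports the XML declaration values that DOM Level 3 keeps on the Document node. */
    static String describe(String xml) {
        Document doc = parse(xml);
        return "version=" + doc.getXmlVersion()
             + ", encoding=" + doc.getXmlEncoding()
             + ", standalone=" + doc.getXmlStandalone();
    }

    public static void main(String[] args) {
        System.out.println(describe(
                "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?><doc/>"));
    }
}
```

For the document above this reports version 1.0, encoding UTF-8 and standalone true.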

Saturday, August 30, 2008

Schema aware processing with XSLT 1.0

I wrote an article about implementing some of the schema-aware stylesheet ideas (as defined in the XSLT 2.0 spec) using an XSLT 1.0 processor, Java extensions, and a suitable validating XML parser. A requirement from an XSLT user on this blog motivated me to work on this idea.

The article on my web site shows how a few of the XSLT 2.0 schema-aware facilities, such as validating the output tree (or even a result tree fragment) prior to serialization, can be implemented. This covers one of the most important aspects of schema-aware XSLT 2.0 stylesheet design.

I think it's not possible to implement the following schema-aware XSLT 2.0 facilities using XSLT 1.0 and extensions alone:

1) Passing the validated XML instance tree to the XSLT processor, so that the stylesheet can access the schema-annotated XPath 2.0 data model tree. The XSLT 2.0 and XPath 2.0 languages define various schema-related instructions and expressions (e.g., element(*, typeName)), which cannot be simulated with XSLT 1.0 and extensions.

2) Since we cannot access the schema-annotated XPath 2.0 data model tree in an XSLT 1.0 stylesheet, we cannot refer to XML Schema type names in the stylesheet, which rules out the enhanced static typing features of XSLT stylesheets.

Using a complete Schema aware XSLT 2.0 system allows very rich static typing in XSLT stylesheets out of the box.

Friday, August 22, 2008

Nice use case for xsl:analyze-string instruction

I thought that this was interesting to share.

Recently an XSLT user discussed a problem on xsl-list, which Jeni Tennison had solved using XSLT 1.0 a long time ago.

I presented an XSLT 2.0 solution for the same problem. The 2.0 solution is a lot shorter than the 1.0 solution, and uses the XSLT 2.0 instruction xsl:analyze-string.

The thread is at http://www.biglist.com/lists/lists.mulberrytech.com/xsl-list/archives/200808/msg00383.html.

This problem could be a nice use case for xsl:analyze-string instruction.

Thursday, August 14, 2008

Transforming tree structure from one format into another

An interesting question was asked on xsl-list.

The input XML file looks something like this:


<Objs>
   <obj name="a" child="b"/>
   <obj name="b" child="c"/>
   <obj name="b" child="d"/>
   <obj name="c" child="e"/>
</Objs>

Let's say that the tree described by the XML file has only one root node.

The output XML file would be:


<Obj name="a">
  <Obj name="b">
      <Obj name="c">
         <Obj name="e"/>
      </Obj>
      <Obj name="d"/>
  </Obj>
</Obj>

A tree structure is defined in the input XML by the 'name' and 'child' attributes. The output represents a true logical tree. We should be able to cater to an unlimited number of tree nodes.

We need to write an XSLT stylesheet for this.

At first thought, I imagined that this could be a tough problem. But a little bit of patience helped me to write the stylesheet for this. The solution is presented below.


<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

<xsl:output method="xml" indent="yes" />

<xsl:template match="Objs">
  <!-- the root is an obj whose name never appears as a child -->
  <xsl:variable name="start" select="obj[not(@name = ../obj/@child)]" />
  <xsl:variable name="startName" select="$start[1]/@name" />
  <!-- makeTree emits the root element and recurses into each of its children,
       so a root with several non-leaf children gets one subtree per child -->
  <xsl:call-template name="makeTree">
    <xsl:with-param name="list" select="obj[@name = $startName]" />
  </xsl:call-template>
</xsl:template>

<xsl:template name="makeTree">
 <xsl:param name="list" />

 <Obj name="{$list[1]/@name}">
   <xsl:for-each select="$list">
     <xsl:variable name="child" select="@child" />
     <xsl:choose>
       <xsl:when test="not(../obj[@name = $child])">
         <Obj name="{$child}" />
       </xsl:when>
       <xsl:otherwise>
         <xsl:call-template name="makeTree">
           <xsl:with-param name="list" select="../obj[@name = $child]" />
         </xsl:call-template>
       </xsl:otherwise>
     </xsl:choose>
   </xsl:for-each>
 </Obj>
</xsl:template>

</xsl:stylesheet>

At first, I felt that XSLT 2.0 constructs would be required to solve this problem. But the problem can be solved completely with an XSLT 1.0 stylesheet.

My belief that XSLT is a wonderful language for processing XML data became stronger after solving this problem.
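
The recursion used by the stylesheet (find the name that is nobody's child, then expand each child in turn) can also be sketched outside XSLT. The following Java version is only an illustration of the same idea, assuming the input describes a single tree with no cycles; the class name ObjTreeSketch and its methods are made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ObjTreeSketch {

    /** Renders the subtree rooted at name as nested Obj elements (no indentation). */
    static String render(String name, Map<String, List<String>> children) {
        List<String> kids = children.get(name);
        if (kids == null) {
            return "<Obj name=\"" + name + "\"/>"; // leaf: the name has no obj entries of its own
        }
        StringBuilder sb = new StringBuilder("<Obj name=\"" + name + "\">");
        for (String kid : kids) {
            sb.append(render(kid, children));
        }
        return sb.append("</Obj>").toString();
    }

    /** Each pair is {name, child}, one per obj element of the input. */
    static String buildTree(String[][] pairs) {
        Map<String, List<String>> children = new LinkedHashMap<>();
        List<String> allChildren = new ArrayList<>();
        for (String[] pair : pairs) {
            children.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
            allChildren.add(pair[1]);
        }
        // The root is the first name that never appears as a child.
        for (String name : children.keySet()) {
            if (!allChildren.contains(name)) {
                return render(name, children);
            }
        }
        throw new IllegalArgumentException("no root found");
    }

    public static void main(String[] args) {
        String[][] input = { {"a", "b"}, {"b", "c"}, {"b", "d"}, {"c", "e"} };
        System.out.println(buildTree(input));
    }
}
```

For the sample input from the question, this prints the same nesting as the expected output above (a contains b, b contains c and d, c contains e), on a single line.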

Sunday, August 3, 2008

Multiple values for XSLT keys

An XSLT user asked the following question on xsl-list.

How do I use multiple key values?

Declaration:

<xsl:key name="keyname" match="subroot" use="ccc"/>

During the usage, I want to specify multiple values:

<xsl:variable name="keyname" select="key('keyname', '11' or '22')"/> ==> Here I want to use multiple values 11 and 22.

xsl-list members suggested useful options,

1. David Carlisle

<xsl:variable name="keyname" select="key('keyname', '22')|key('keyname', '11')"/>

2. Michael Kay

In XSLT 2.0, you can supply a sequence:

key('keyname', ('111', '222'))

In 1.0, you can supply a node-set with one value per node - but of course it's hard to set that up, you need the xx:node-set() function.

I built on Mike's idea for an XSLT 1.0 solution:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:exslt="http://exslt.org/common"
                exclude-result-prefixes="exslt"
                version="1.0">

  <xsl:output method="xml" indent="yes" />

  <xsl:key name="x" match="subroot" use="ccc"/>

  <xsl:variable name="x-values">
    <v>11</v>
    <v>22</v>
  </xsl:variable>

  <xsl:template match="/root">
    <result>
      <xsl:for-each select="key('x', exslt:node-set($x-values)/v)">
        <value>
          <xsl:value-of select="eee" />
        </value>
      </xsl:for-each>
    </result>
  </xsl:template>

</xsl:stylesheet>

Mukul: I think this could be better than David's suggestion, because if we want to search the key for a large number of different values, we only have to change the following code fragment:

<xsl:variable name="x-values">
<v>11</v>
<v>22</v>

</xsl:variable>

G. Ken Holman responded to my post, and provided a brilliant idea,

The node-set extension can be avoided to achieve what you want.

<xsl:for-each select="key('x', exslt:node-set($x-values)/v)">

The above can be replaced with standard XSLT 1.0 to read the stylesheet file as a source node tree.

<xsl:for-each
select="key('x',document('')/*/xsl:variable[@name='x-values']/v)">

Ken further wrote,

I grant, though, that if your stylesheet is large then putting this into a small included or imported fragment would keep any overhead of building the tree small.

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <xsl:output method="xml" indent="yes" />

  <xsl:key name="x" match="subroot" use="ccc"/>

  <xsl:include href="ken2values.xsl"/>

  <xsl:template match="/root">
    <result>
      <xsl:for-each select="key('x',$x-values)">
        <value>
          <xsl:value-of select="eee" />
        </value>
      </xsl:for-each>
    </result>
  </xsl:template>

</xsl:stylesheet>

ken2values.xsl
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <xsl:variable name="x-values-data">
    <v>11</v>
    <v>22</v>
  </xsl:variable>

  <xsl:variable name="x-values"
                select="document('')/*/xsl:variable[@name='x-values-data']/v"/>

</xsl:stylesheet>

I think Ken's idea of having an included stylesheet (ken2values.xsl, above) is brilliant, as it is memory efficient and avoids the node-set extension (as mentioned earlier).

Sunday, July 27, 2008

XML Schema 1.1 assertions implementation in Xerces-J

One of the interesting enhancements happening recently in the Xerces code base is the XML Schema 1.1 implementation.

The Xerces-J team motivated me to implement some of the XML Schema 1.1 features in Xerces. I've started with the XML Schema 1.1 "assertions" facility.

I've completed quite a bit of this work, and am hoping that the assertions support I'm writing will be available in Xerces-J in the near future.

2008-11-04: The Xerces team has approved my work so far on the assertions implementation, and has committed my patch to the Xerces code base. In the coming weeks, I will be working on integrating XPath 2.0 processing for assertions.

Saturday, July 26, 2008

An elegant XSLT solution

We usually see some nice posts on xsl-list.

David Carlisle recently posted a very elegant XSLT solution to a question asked on xsl-list. It's archived at http://www.biglist.com/lists/lists.mulberrytech.com/xsl-list/archives/200807/msg00574.html.

I really liked the following expression, in David's solution:

<xsl:attribute name="level">
<xsl:value-of select="sum(.|preceding::sect[1]/@depth)"/>
</xsl:attribute>

I might have solved this problem differently, but not as elegantly as this. Nice thought, David!

Friday, July 18, 2008

XSLT 2.0 shines over 1.0

I was pondering over XSLT 2.0's advantages over XSLT 1.0, and came up with a simple example that illustrates XSLT 2.0's benefits.

Below are 1.0 and 2.0 stylesheets for finding the first n Fibonacci numbers (with analysis afterwards):

XSLT 2.0

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                xmlns:x="http://localhost"
                version="2.0">

  <xsl:output method="text" />

  <xsl:param name="n" />

  <xsl:template match="/">
    <xsl:for-each select="1 to $n">
      <xsl:value-of select="x:fibonacci(position())" /><xsl:text> </xsl:text>
    </xsl:for-each>
  </xsl:template>

  <xsl:function name="x:fibonacci" as="xs:integer">
    <xsl:param name="n" as="xs:integer" />

    <xsl:sequence select="if (($n = 1) or ($n = 2)) then 1 else x:fibonacci($n - 1) + x:fibonacci($n - 2)" />
  </xsl:function>

</xsl:stylesheet>

XSLT 1.0

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <xsl:output method="text" />

  <xsl:param name="n" />

  <xsl:template match="/">
    <xsl:call-template name="iterateAndFib">
      <xsl:with-param name="x" select="1" />
    </xsl:call-template>
  </xsl:template>

  <xsl:template name="iterateAndFib">
    <xsl:param name="x" />

    <xsl:if test="$x &lt;= $n">
      <xsl:call-template name="fibonacci">
        <xsl:with-param name="n" select="$x" />
      </xsl:call-template>
      <xsl:text> </xsl:text>
      <xsl:call-template name="iterateAndFib">
        <xsl:with-param name="x" select="$x + 1" />
      </xsl:call-template>
    </xsl:if>
  </xsl:template>

  <xsl:template name="fibonacci">
    <xsl:param name="n" />

    <xsl:choose>
      <xsl:when test="($n = 1) or ($n = 2)">
        <xsl:value-of select="1" />
      </xsl:when>
      <xsl:otherwise>
        <xsl:variable name="x">
          <xsl:call-template name="fibonacci">
            <xsl:with-param name="n" select="$n - 1" />
          </xsl:call-template>
        </xsl:variable>
        <xsl:variable name="y">
          <xsl:call-template name="fibonacci">
            <xsl:with-param name="n" select="$n - 2" />
          </xsl:call-template>
        </xsl:variable>
        <xsl:value-of select="$x + $y" />
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>

Why I think XSLT 2.0 is better than 1.0:

1. A 2.0 stylesheet can be written in far fewer lines of code than a 1.0 stylesheet. In this case, the 2.0 stylesheet is 22 lines and the 1.0 stylesheet is 51 lines (counting lines with normal indentation).

2. In a 2.0 stylesheet, there is no need for recursion just to iterate. xsl:for-each can iterate directly over a numeric range such as 1 to $n.

3. In the 2.0 stylesheet, we can use the xsl:function construct to write shorter code that is easier to follow. The recursive calls in xsl:function in this example are easy to understand.

In a 1.0 stylesheet, we need to write a named template to achieve recursive calls, which can get cumbersome if the logic is complex.

4. In XSLT 2.0, the data model's type system has many more data types than 1.0 (all built-in XML Schema types, as well as user-defined types, can be used in XSLT 2.0 stylesheets).

I have no doubt that XSLT 2.0 shines over XSLT 1.0.

Norman Walsh wrote in a blog post, "Every experience that I have with XSLT 2.0 increases my enthusiasm for it." I totally agree with Norm.

Saturday, July 12, 2008

Constructing StreamSource and StreamResult for JAXP transformation

Recently, I had a tough time converting file paths containing spaces into correct StreamSource and StreamResult objects for use with the JAXP transformer.

I figured out that the approach below is probably the best way to solve this issue.

String sourceSystemId = "file:///C:/... .xml"; (use this form only if you are sure that the URI string doesn't contain any illegal characters, such as spaces)

OR

String sourceSystemId = new File(pathname).toURI().toString(); (this correctly escapes characters that are illegal in URIs)

Then construct the StreamSource or StreamResult as follows (outputSystemId is built the same way as sourceSystemId):

StreamSource source = new StreamSource(sourceSystemId);

StreamResult result = new StreamResult(outputSystemId);

XPath 2.0 also has a useful function, fn:encode-for-uri($uri-part as xs:string?) as xs:string, which applies the %HH escaping convention to a URI path segment, escaping both disallowed characters and reserved characters such as "/" and ":".
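
Putting the pieces together, a minimal self-contained sketch might look like this. The class name SystemIdDemo, the helper toSystemId() and the file paths are made up for illustration; relative paths are resolved against the current working directory by File.toURI().

```java
import java.io.File;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class SystemIdDemo {

    /** Converts a local path to a URI string, percent-encoding characters such as spaces. */
    static String toSystemId(String pathname) {
        return new File(pathname).toURI().toString();
    }

    public static void main(String[] args) {
        // Hypothetical paths containing spaces.
        StreamSource source = new StreamSource(toSystemId("my docs/in put.xml"));
        StreamResult result = new StreamResult(toSystemId("my docs/out put.xml"));

        // Both system ids are file: URIs in which each space has become %20.
        System.out.println(source.getSystemId());
        System.out.println(result.getSystemId());
    }
}
```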

Saturday, June 21, 2008

XML tools for Eclipse

I was searching for some freeware tools for authoring XML and XSD documents, but my poor search queries didn't return any good results.

So I decided to create this tooling myself. I chose Eclipse as the platform for this, as it is quite robust and very popular.

After a few days of effort, I ended up with a very basic prototype. The details are available at http://gandhimukul.tripod.com/xml/xmleclipse.html. But a lot of work still needs to be done on this code base to make it really useful for XML developers, particularly the content assist functionality, which is presently missing.

I posted my thoughts about this on the xml-dev list.

Dave Carver on the xml-dev list pointed out that such tooling already exists: the "Web Tools Platform (WTP)" for Eclipse.

I gave WTP a try. Personally speaking, I found it very good. It already has lots of the things that I need in XML GUI tools. Moreover, it's free and open source.

I have now added WTP to my toolset for authoring XML, XSD and DTD documents. If required, I might try to tweak its source code to do something new.

Tuesday, June 10, 2008

self axis vs .

Andrew Welch started a nice discussion on xsl-list comparing XPath's self axis with the "." expression.

Andrew wrote that he has seen people writing self::elem//whatever instead of .//whatever.

I looked up the relevant definitions in the XPath 2.0 spec. They are quoted below.

<quote>
the self axis contains just the context node itself

. is known as, "context item expression".
A context item expression evaluates to the context item, which may be either a node (as in the expression fn:doc("bib.xml")/books/book[fn:count(./author)>1]) or an atomic value (as in the expression (1 to 100)[. mod 5 eq 0]).

The context item is the item currently being processed. An item is either an atomic value or a node. When the context item is a node, it can also be referred to as the context node.
</quote>

Wendell Piez wrote,

It also works to test whether the current node is actually an 'elem' when you process it or traverse from it.

Mukul: I personally prefer the style .//whatever, as it is shorter. But I think there are important use cases for the self:: axis.

For example,


<xsl:template match="*[not(self::elem)]">
  <!-- do something -->
</xsl:template>


<xsl:template match="*">
  <xsl:if test="not(self::elem)">
    <!-- do something -->
  </xsl:if>
</xsl:template>
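The same self:: test can be tried outside a stylesheet. Below is a minimal Java sketch using the standard javax.xml.xpath API (which implements XPath 1.0). The class name SelfAxisDemo and the sample document are assumptions made only for this illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class SelfAxisDemo {

    // Hypothetical sample document: three children, two of them named 'elem'.
    static final String XML = "<root><elem/><other/><elem/></root>";

    // Evaluates an XPath expression against XML and returns its numeric value.
    static double evalNumber(String expr) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return (Double) xpath.evaluate(expr,
                new InputSource(new StringReader(XML)), XPathConstants.NUMBER);
    }

    public static void main(String[] args) throws XPathExpressionException {
        // Children that are 'elem' elements
        System.out.println(evalNumber("count(/root/*[self::elem])"));
        // Children that are anything but 'elem'
        System.out.println(evalNumber("count(/root/*[not(self::elem)])"));
    }
}
```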

Vyacheslav Sedov wrote,

In XSLT 2.0 I prefer to use "* except elem".

Sunday, May 25, 2008

One-based indexes in XPath

Justin Johansson asked an interesting question on the xsl-list.

Would someone please give me advice as to why "1-based" indexes are used in XPath, such as para[1] instead of para[0] for the first para item/element?

Why does the spec for XPath (and its/XQuery operator/function library) go against the norm for modern programming languages in which zero is the base for array-like collections?

An interesting discussion happened on xsl-list after this query.

Colin Adams wrote: Zero is not the norm for modern programming languages. It might well be for ancient ones. It is a very poor choice, justifiable only when trying to squeeze the last ounce of speed in a highly numerically-intensive application.

And even there it is not justified - you simply use data structures that have an unused first element, and so avoid the subtract one operation in that way.

I reasoned:
Let's say, we have to select a node as following:

following-sibling::xx[1]

To me, traversal on the following (or, say, preceding) axis makes sense if indexes start from 1.

I also think 0-based indexes in lower-level languages (I consider Java or C to be lower level than XPath, and I am including assembly languages too) are related to hardware addressing.

For example, a memory might have addresses ranging from 0000 to 1111 (this is just a small amount of memory). This probably has to do with the logic of bits, where 0 has a very important meaning.

A lot of programming languages (also mentioned by you) have 0-based indexes (in arrays, strings etc.), so compilers can easily map them to hardware locations.

Indexes in XPath start from 1 because it's more convenient for the users.

Michael Kay commented on my reasoning as follows, "I don't think hardware addressing is the only benefit of 0-based addressing. It also makes computations easier. If you number the rows and columns on a chessboard from 0-7, and the squares from 0-63, then the square number is row*8+column, whereas with 1-based addressing it is (row-1)*8+column.

And we do sometimes use 0-based logic in real life too. In many countries the "first floor" is the one above where you enter the building; and in many societies a child is "1 year old" between 12 months and 24 months after their birth.

But on balance I do think 1-based logic was the right choice for XPath and XSLT."
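The chessboard arithmetic from the comment above can be written out as a small Java sketch. The class and method names are made up for this illustration; the two formulas are the ones quoted.

```java
public class ChessboardIndex {

    // Rows and columns numbered 0..7; squares come out numbered 0..63.
    static int squareZeroBased(int row, int column) {
        return row * 8 + column;
    }

    // Rows and columns numbered 1..8; squares come out numbered 1..64.
    // Note the extra "- 1" that the 1-based scheme needs.
    static int squareOneBased(int row, int column) {
        return (row - 1) * 8 + column;
    }

    public static void main(String[] args) {
        System.out.println(squareZeroBased(0, 0)); // first square, 0-based
        System.out.println(squareZeroBased(7, 7)); // last square, 0-based
        System.out.println(squareOneBased(1, 1));  // first square, 1-based
        System.out.println(squareOneBased(8, 8));  // last square, 1-based
    }
}
```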

Michael further said, "Because the language was designed for users, not for programmers, and users still have this old-fashioned habit of referring to the first chapter in a book as Chapter One. (Though I did once hear Dijkstra refer to the fourth slide in someone's presentation as the third.)

(I fully agree that when handling tables, or subscripting into strings, zero-based addressing would often be much more convenient. There are arguments both ways, and as always, I can't tell you what the actual history of the decision was; I can only post-rationalize it.)".

Owen Rees shared an interesting point: Dijkstra wrote a note "Why numbering should start at zero": http://www.cs.utexas.edu/users/EWD/transcriptions/EWD08xx/EWD831.html.

The book "Informal introduction to ALGOL 68" numbers its chapters from zero. I remember that as being unusual, amusing, and also appropriate given the audience.

Since XSLT does not have multi-dimensional sequences, the issues that arise when multi-dimensional arrays can be treated as one-dimensional arrays do not arise. Zero-based indexing tends to make the index formulae simpler when accessing a multi-dimensional array with a one-dimensional array access expression but on the whole I think it is best to just not do that sort of thing at all.

I don't think that the age of language or design for programmer or user arguments stand up very well.

FORTRAN is a counterexample to the "old languages use zero" argument, it uses 1. Modern versions of FORTRAN allow the lower bound to be specified (as Algol did) but the default is still 1.

Dartmouth BASIC used zero-based indexing, I have no information about why, but I doubt if it has anything to do with being close to assembly code or concern for the last ounce of speed.

I think that both of these languages were originally aimed at people who did not think of themselves as being or intending to become programmers.

One related point here is that those thinking in terms of other programming languages may be misled by the syntactic similarity of the XPath expression para[1] to an array indexing expression in various other languages. Understanding that para[1] is just a shorthand for para[position()=1] moves the issue to the question of why position() is defined the way it is.

Defining last() to be the context size means that context position has to count from one unless you are going to cause either 'last' or 'size' to have a very counterintuitive meaning.
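The equivalence of para[1] and para[position()=1], and the relation between position() and last(), can be checked with a short Java sketch using the standard javax.xml.xpath API (XPath 1.0). The class name PositionDemo and the three-paragraph sample document are assumptions for this illustration only.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class PositionDemo {

    // Hypothetical sample document with three para elements.
    static final String XML =
        "<doc><para>a</para><para>b</para><para>c</para></doc>";

    // Evaluates an XPath expression and returns the string value of the result.
    static String eval(String expr) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expr, new InputSource(new StringReader(XML)));
    }

    public static void main(String[] args) throws XPathExpressionException {
        // The numeric predicate is shorthand for a test on position()
        System.out.println(eval("/doc/para[1]"));
        System.out.println(eval("/doc/para[position()=1]"));
        // last() is the context size, so it also selects the final item
        System.out.println(eval("/doc/para[last()]"));
        System.out.println(eval("count(/doc/para)"));
    }
}
```

With three para elements, last() evaluates to 3 inside the predicate, which is both the context size and the position of the final item. That only works out because positions count from one.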

Wednesday, May 14, 2008

Namespace-based validation dispatching language

A recent discussion on the xml-dev list motivated me to learn more about NVDL.

NVDL stands for Namespace-based Validation Dispatching Language.

NVDL is Part 4 of ISO/IEC 19757 DSDL (Document Schema Definition Languages).

If an XML document is composed of sections belonging to multiple namespaces, then each section (differentiated by the namespace) can be dispatched for validation to a different schema processor (for e.g., RELAX NG, W3C XML Schema, DTD, Schematron etc.).

This is an innovative idea that allows different parts of an XML document to be validated with different schema languages.

This is what Roger Costello wrote about NVDL on xml-dev list:

Here are the evolutionary changes I envision NVDL bringing about in the marketplace:

1. Opens the marketplace to utilizing a variety of schema languages.

Previously, you and all your trading partners were locked into using one schema language (typically W3C XML Schema) if you wanted interoperability. With NVDL that limitation is lifted and you can achieve interoperability while using a variety of schema languages.

2. Promotes using the right schema language for the right job.

XML Schema and Relax NG are two schema languages for expressing grammar-based rules. They are both standards, the former a W3C standard, the latter an ISO standard. Although their capabilities are largely overlapping, there are important differences. "Use the right tool for the right job" is an adage that applies to choosing a schema language. Knowing the differences in capabilities is important to making a good decision in choosing a schema language.

3. Encourages the creation of small, simple, independent schemas, written in any schema language.

4. Moves the application developer's focus from:

"using a schema"

to:

"using XML vocabularies"

Sunday, May 11, 2008

Some differences between XQuery 1.0 and XSLT 2.0

I can think of the following differences between XQuery 1.0 and XSLT 2.0:

1. In XQuery 1.0, functions need to be declared in the query prolog, before the query body that uses them. In XSLT 2.0, functions may be defined anywhere in the stylesheet (provided that the xsl:function element is a child of the xsl:stylesheet element).

2. In XQuery 1.0, the XML Schema namespace, http://www.w3.org/2001/XMLSchema, does not need to be declared in order to use the prefix xs:. In XSLT 2.0, the XML Schema namespace needs to be declared if the stylesheet makes any reference to the prefix xs:.

3. XQuery 1.0 has (it seems to me) stronger static typing than XSLT 2.0. For example, to return an xs:string value from a function in XQuery 1.0, we cannot simply write $people/person[fname = $fName]/lname/text(). Instead we have to write, for example, xs:string($people/person[fname = $fName]/lname/text()).
In XSLT 2.0, an expression like $people/person[fname = $fName]/lname is able to return an xs:string value (if the element 'lname' contains text-only data).

4. Moreover, an XQuery program is not a template-based program description (as an XSLT stylesheet is). The XQuery syntax looks to me like a mix of procedural and declarative syntax.

It's also true that XSLT and XQuery are both functional in nature. But that's a similarity between these two languages!

The following XQuery 1.0 and XSLT 2.0 examples illustrate the above points:

Input XML:

<?xml version="1.0" encoding="UTF-8"?>
  <people>
    <person>
      <fname>Mukul</fname>
      <lname>Gandhi</lname>
    </person>
    <person>
      <fname>Rohit</fname>
      <lname>Rawat</lname>
    </person>
  </people>

XQuery program:

declare namespace my = "http://localhost/functions";

declare function my:getLastName($people as element(), $fName as xs:string)
as xs:string
{
   xs:string($people/person[fname = $fName]/lname/text())  
};

<person>
  <fname>Mukul</fname>
  <lname>{my:getLastName(doc("../Data/test.xml")/people, "Mukul")}</lname>
</person>

XSLT 2.0 stylesheet:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                xmlns:my="http://localhost/functions"
                exclude-result-prefixes="xs my">

<xsl:output method="xml" indent="yes" />

<xsl:template match="/">
  <person>
    <fname>Mukul</fname>
    <lname><xsl:value-of select="my:getLastName(people,'Mukul')" /></lname>
  </person>
</xsl:template>

<xsl:function name="my:getLastName" as="xs:string">
  <xsl:param name="people" as="element()" />
  <xsl:param name="fName" as="xs:string" />

  <xsl:sequence select="$people/person[fname = $fName]/lname" />

</xsl:function>

</xsl:stylesheet>

Michael Kay shared the following observation on xsl-list:
XQuery tends to work better when you want to extract a small amount of information from a large document and ignore the rest. XSLT tends to work better if you want to keep most things the same and make a few small changes. Of course there's a range of tasks between those extremes.

Tuesday, May 6, 2008

Namespace nodes for literal result elements

A recent discussion on xsl-list taught me something new about the XSLT 2.0 language. Following are my thoughts about it.

Suppose, we have this simple stylesheet:


<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:a="http://mydomain" version="2.0">
       
  <xsl:output method="xml" indent="yes" />
  
  <xsl:template match="/">
    <result>
      <x/>
    </result>
  </xsl:template>
  
</xsl:stylesheet>

This stylesheet when run, produces the following output:


<?xml version="1.0" encoding="UTF-8"?>
<result xmlns:a="http://mydomain">
  <x/>
</result>

Please note the xmlns:a namespace declaration in the output.

To get rid of this namespace declaration from the output, we can either specify

exclude-result-prefixes="a" on the xsl:stylesheet element, or

declare the literal result element in the stylesheet as follows:


<result xsl:exclude-result-prefixes="a">
  <x/>
</result>

The question is: Why is the namespace declaration copied to the output?

The answer can be found in the XSLT 2.0 specification, at http://www.w3.org/TR/xslt20/#lre-namespaces. As per the specification, the XSLT namespace (http://www.w3.org/1999/XSL/Transform) is not copied to the output, while any other namespace nodes in scope on the literal result element are copied to the output, subject to a few additional rules specified in the spec.
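This behaviour can also be observed from Java through the standard JAXP transformation API. The sketch below is only an illustration under some assumptions: the class name LreNamespaceDemo and its helper methods are made up, the dummy input document is arbitrary, and the stylesheet is marked version="1.0" because the XSLT processor bundled with the JDK implements XSLT 1.0 (the rule for literal result elements is the same there).

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class LreNamespaceDemo {

    // Builds the stylesheet from this post; 'extraAttr' is spliced into the
    // xsl:stylesheet start tag (for example an exclude-result-prefixes attribute).
    static String stylesheet(String extraAttr) {
        return "<xsl:stylesheet xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\""
             + " xmlns:a=\"http://mydomain\" version=\"1.0\" " + extraAttr + ">"
             + "<xsl:template match=\"/\"><result><x/></result></xsl:template>"
             + "</xsl:stylesheet>";
    }

    // Runs the given stylesheet against a dummy input and returns the serialized output.
    static String run(String xslt) throws TransformerException {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader("<dummy/>")), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        // Namespace 'a' is in scope on <result>, so it is copied to the output
        System.out.println(run(stylesheet("")));
        // With the prefix excluded, the declaration is not copied
        System.out.println(run(stylesheet("exclude-result-prefixes=\"a\"")));
    }
}
```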

Saturday, May 3, 2008

Output validation with XSLT 2.0

An interesting example occurred to me, about Schema-aware XSLT stylesheet design. Below is the code for it.


<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
     version="2.0">
       
  <xsl:output method="xml" indent="yes" />
  
  <xsl:import-schema>
    <xs:schema>
      <xs:element name="x">
        <xs:complexType>      
          <xs:sequence>
            <xs:element name="y" />
          </xs:sequence>      
        </xs:complexType>
      </xs:element>
    </xs:schema>
  </xsl:import-schema>
  
  <xsl:import-schema>
    <xs:schema>
      <xs:element name="p">
        <xs:complexType>      
          <xs:sequence>
            <xs:element name="q" />
          </xs:sequence>      
        </xs:complexType>
      </xs:element>
    </xs:schema>
  </xsl:import-schema>
  
  <xsl:template match="/">
    <xsl:variable name="temp1">
      <x xsl:validation="strict">
        <y/>
      </x>
    </xsl:variable>
    <xsl:variable name="temp2">
      <p xsl:validation="strict">
        <q/>
      </p>
    </xsl:variable>
    <result>
      <xsl:copy-of select="$temp1" />
      <xsl:copy-of select="$temp2" />
    </result>
  </xsl:template>

</xsl:stylesheet>

This stylesheet imports/declares two inline XSD schemas. In the body of the root template, two variables (temp1 and temp2) request strict validation of the element markup.

If we run this example with a Schema-aware XSLT 2.0 processor, we find that invalid content cannot be generated from the stylesheet.

An alternate writing style for the above example could be:


 <xsl:template match="/">
   <xsl:variable name="temp1">
     <x>
       <y/>
     </x>
   </xsl:variable>
   <xsl:variable name="temp2">
     <p>
       <q/>
     </p>
   </xsl:variable>
   <result>
     <xsl:copy-of select="$temp1" validation="strict" />
     <xsl:copy-of select="$temp2" validation="strict" />
   </result>
 </xsl:template>

Here we specify the validation="strict" option on the xsl:copy-of instruction.

The intended meaning is the same in both the above cases.

This, to me, is quite a useful XSLT facility. XSLT 2.0 is very flexible about where in the output tree we want validation to occur.

Monday, April 14, 2008

Efficient XML creation with well-formedness support

After analyzing a lot of points (in the discussion thread "Best way to create an XML document" on the xml-dev list) about creating an XML document from scratch, I concluded that the following technique is perhaps the most efficient way to do this (with well-formedness support):


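import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;

import org.xml.sax.InputSource;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLReaderAdapter;

public class CreateXml {

public static void main(String[] args) {
  try {
    // Obtain the factory through JAXP; with Xalan or Saxon on the classpath
    // this is typically that processor's TransformerFactoryImpl.
    SAXTransformerFactory stf = (SAXTransformerFactory) TransformerFactory.newInstance();
    TransformerHandler tHandler = stf.newTransformerHandler();

    tHandler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
    String output = "output.xml";
    tHandler.setResult(new StreamResult(new File(output)));

    // SAX expects an empty Attributes object (not null) for elements without attributes
    AttributesImpl noAttrs = new AttributesImpl();

    // Drive the serializer with SAX events: <x><y attr1 attr2><z/></y></x>
    tHandler.startDocument();
    tHandler.startElement("", "x", "x", noAttrs);
    AttributesImpl attrs = new AttributesImpl();
    attrs.addAttribute("", "attr1","attr1", "", "123");
    attrs.addAttribute("", "attr2","attr2", "", "456");
    tHandler.startElement("", "y", "y", attrs);
    tHandler.startElement("", "z", "z", noAttrs);
    tHandler.endElement("", "z", "z");
    tHandler.endElement("", "y", "y");
    tHandler.endElement("", "x", "x");
    tHandler.endDocument();

    if (isWellFormed(output)) {      
      /*
        do something with the generated file
      */
    }
    else {
      System.out.println("Generated XML document is not well-formed.");
    }
  }
  catch(Exception ex) {
    ex.printStackTrace();
  }
}

// Re-parses the generated file; any parse error means it is not well-formed.
private static boolean isWellFormed(String output) {
 // try-with-resources closes the stream even if parsing fails
 try (InputStream in = new FileInputStream(output)) {
   XMLReaderAdapter xra = new XMLReaderAdapter();
   InputSource is = new InputSource(in);
   xra.parse(is);
   return true;
 }
 catch(Exception ex) {
   return false;
 }
}

}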

I also agree with this opinion:
Since version 1.6, Java has supported the javax.xml.stream package, so the most straightforward way would be to use an XMLStreamWriter.

Please note: the javax.xml.stream API is available for earlier versions of Java through JSR 173 (https://sjsxp.dev.java.net/)

Acknowledgements
Martin Gallagher
Alain Couthures
Michael Kay
Robert Koberg
Andrew Welch
Michael Glavassevich
Chris Burdess

Thursday, April 10, 2008

Best way to create an XML document

I posted the following question on the xml-dev mailing list, regarding this topic:

Let's suppose I need to create an XML document from scratch in a Java program.

What's the best way to do this?

I have seen that a quick way to do this is to prepare an XML string by hand.

For e.g.,

String xml_str = "<x><y><z/></y></x>";

I want to understand the pros and cons of this approach.

Most of the time I prefer using an API like DOM to create an in-memory representation, and then serializing the tree to a String.

The following are my arguments in favor of using the DOM approach:

1) Creating an XML string by hand can become cumbersome if the XML is huge. Maintaining the correct parent-child relationships for a huge document can be difficult if done by hand (imagine a document of size 50 MB). This would lead to difficult debugging. A DOM API maintains these relationships inherently in memory.

2) It's difficult to remember the correct XML naming conventions if done by hand.

For example, <9abc> is an invalid XML name (because it starts with a digit).

There are more rules for XML names.

Using the DOM API, this is checked automatically.

3) Using the DOM API can check well-formedness of entities (like &abc; etc.). Doing this by hand in a string can become difficult.
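Argument 2 can be illustrated with a short Java sketch. The class name DomNameCheck and its helper method are made up for this example; it relies on Document.createElement raising a DOMException (INVALID_CHARACTER_ERR) when the name is not a legal XML name.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;

public class DomNameCheck {

    // Returns true if the DOM implementation accepts 'name' as an element name.
    static boolean isAcceptedElementName(String name) throws ParserConfigurationException {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        try {
            doc.createElement(name);
            return true;
        } catch (DOMException e) {
            // Raised for names that are not legal XML names
            return false;
        }
    }

    public static void main(String[] args) throws ParserConfigurationException {
        System.out.println(isAcceptedElementName("abc9")); // legal: digit is not first
        System.out.println(isAcceptedElementName("9abc")); // illegal: starts with a digit
    }
}
```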

The following people added useful comments to this thread.

Martin Gallagher: Creating via DOM can be advantageous if further manipulation via the DOM is required. This will remove the overhead of creating a string and the initial parse to a DOM object.

Alain Couthures: I do appreciate not having too much program lines so I really prefer the string approach. You can always copy/paste the string in a text editor (I use NotePad++ myself...) which will check it is well-formed !

Michael Kay: My preference is to use a SAX-like serializer driven by calls such as

startElement("x")
attribute("a", "3")
text("content")
endElement("x")

This avoids the overhead of creating a tree in memory, while still giving the benefits of having the system take care of matters such as escaping special characters.

Robert Koberg: Rob referred to the following link to implement Mike's ideas in Java,

http://www.megginson.com/downloads/

xml-writer-0.2.zip

(Mukul: I tried this, but found that this XML writer library doesn't check for well-formedness of XML documents, which is a bit of a drawback of this library).

Andrew Welch: Create an empty SAXTransformerFactory get the TransformerHandler and then just call the event methods on that... (using Xalan or Saxon rather than Xerces)

Alternatively use the streaming api in Java 6 - see XMLStreamWriter.

Here is the working code as per Andrew's suggestions:

// Needs imports from javax.xml.transform, javax.xml.transform.sax,
// javax.xml.transform.stream and org.xml.sax.helpers.
SAXTransformerFactory stf = (SAXTransformerFactory) TransformerFactory.newInstance();
TransformerHandler tHandler = stf.newTransformerHandler();
tHandler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
tHandler.setResult(new StreamResult(System.out));
// SAX expects an empty Attributes object (not null) for elements without attributes
AttributesImpl noAttrs = new AttributesImpl();
tHandler.startDocument();
tHandler.startElement("", "x", "x", noAttrs);
AttributesImpl attrs = new AttributesImpl();
attrs.addAttribute("", "attr1","attr1", "", "123");
attrs.addAttribute("", "attr2","attr2", "", "456");
tHandler.startElement("", "y", "y", attrs);
tHandler.startElement("", "z", "z", noAttrs);
tHandler.endElement("", "z", "z");
tHandler.endElement("", "y", "y");
tHandler.endElement("", "x", "x");
tHandler.endDocument();

But unfortunately, with this technique also, we cannot ensure well-formedness of the XML output.

I questioned ...
Here I am using the transformer functionality for creating XML, which looks more like an XSLT feature (i.e., a transformation task). Should we not have this capability in the XML parser (e.g., in Xerces)? Should we have something like xml-writer (which Rob pointed to) built into Xerces (possibly as an enhancement)?

Michael Kay provided the following explanation for this scenario:
XSLT processors include a serializer because it's defined in the XSLT specification, and since they have one, it makes sense to expose it even if you aren't doing a transformation.

Michael provided the following explanation for the inability of the XSLT serializer to check the well-formedness of XML output:
"Because the spec doesn't say it has to. And because XSLT serializers were primarily written to get their input from XSLT transformers, which they trust; so why incur the extra expense?"

Using the XMLStreamWriter approach, as suggested by Andrew, following is the working code:

XMLOutputFactory factory = XMLOutputFactory.newInstance();
XMLStreamWriter writer = factory.createXMLStreamWriter(System.out);
writer.writeStartDocument();
writer.writeStartElement("x");
writer.writeStartElement("y");
writer.writeAttribute("attr1", "123");
writer.writeEndDocument();
writer.flush();
writer.close();

This looks better than the SAX writer approach, for the well-formedness requirement.
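For readers who want to try the XMLStreamWriter snippet above as a complete program, here is a self-contained sketch. The class name StreamWriterDemo, the build method and the sample text content are assumptions added for illustration; writing to a StringWriter instead of System.out just makes the result easy to inspect.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

public class StreamWriterDemo {

    // Builds <x><y attr1="123">...</y></x> and returns it as a string.
    static String build() throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        writer.writeStartDocument();
        writer.writeStartElement("x");
        writer.writeStartElement("y");
        writer.writeAttribute("attr1", "123");
        writer.writeCharacters("a < b & c");  // the writer escapes < and &
        writer.writeEndElement();              // closes y
        writer.writeEndElement();              // closes x
        writer.writeEndDocument();
        writer.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(build());
    }
}
```

The writer takes care of tag nesting and of escaping special characters in text content, which is what makes it more convenient than building the string by hand.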

Michael Glavassevich further commented on this approach:

It keeps the tags nested properly but you can still write all sorts of garbage.

Extending your example:

writer.writeStartDocument();
writer.writeDTD("<!GARBAGE>");
writer.writeComment("bad -- comment");
writer.writeStartElement("x");
writer.writeCharacters("\u0000");
writer.writeProcessingInstruction("xml", "version=\"1.0\"");
writer.writeStartElement("y");
writer.writeAttribute("attr1", "123");
writer.writeAttribute("attr1", "456");
writer.writeEndElement();
writer.writeEndElement();
writer.writeEmptyElement("3");
writer.writeEndDocument();
writer.flush();
writer.close();

produces:

<?xml version="1.0" ?><!GARBAGE><x><?xml version="1.0"?><y attr1="123" attr1="456"></y></x><3/>

with a NUL char after <x>.

Xalan-J serializer

I thought there was some problem with Xalan-J serializer. I posted the following question on xalan-dev mailing list:

I think, there is scope of improvement to the Xalan-J 2.7.1 serializer.

I tried this sample XSLT stylesheet with Xalan-J 2.7.1.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

<xsl:output method="xml" indent="yes" />

<xsl:template match="/">
  <x>
    <y/>
  </x>
</xsl:template>

</xsl:stylesheet>

The output produced by Xalan is:
<?xml version="1.0" encoding="UTF-8"?><x>
<y/>
</x>

Please note that the topmost element tag, <x>, is not indented properly.

I would expect the output in this case to be:

<?xml version="1.0" encoding="UTF-8"?>
<x>
  <y/>
</x>

This problem seems to happen with any XML output.

Henry Zongaro gave a good explanation of why this is so:

The problem here is that the serializer considers that the result document might be used as an external general parsed entity. So, suppose the result is named result.xml. If it's referenced inside a document such as the following, inserting whitespace before the x element in result.xml would affect the text content of its parent element, doc.

<!DOCTYPE doc [
<!ENTITY ref SYSTEM "result.xml">
]>
<doc>Some non-whitespace text &ref; Some more non-whitespace text</doc>

Saturday, April 5, 2008

Schema aware XSLT design

I think it would be great to start my blog with a Schema Aware XSLT idea.

We all know that XSLT 2.0 has introduced the concept of utilizing W3C Schemas within XSLT stylesheets. Schemas can be put to use in various ways in the stylesheets. The following are the major ways in which Schemas can be utilized in stylesheets:

1) Validating the input XML documents prior to doing the XSLT transformation. This can ensure that invalid input is not processed by the stylesheet. Input validation has an additional benefit, that it attaches type annotations to XML nodes. This makes possible many useful type aware operations within the stylesheet.

2) Validating the output trees (in most of the cases) prior to serialization. This can ensure that the XSLT stylesheet doesn't produce invalid output.

3) The Schemas can also be utilized to validate intermediate trees.

4) In XSLT 2.0, we can specify types of function/template parameters and return types of functions. We can also specify types of variables. The types can be any user-defined type derived from the imported Schemas. This is tremendously useful, as now the type system of XSLT (2.0) can be extended in an unlimited way.

5) Apart from enhanced static type checking by the XSLT processor (made possible by Schema Awareness), the XSLT processor has the opportunity to generate efficient code. Better static type checking allows faster debugging (mostly at compile time).

I would be interested to know about other benefits of Schema Aware stylesheet design.

Wednesday, April 2, 2008

I've started blogging

I've started to blog today. The topics which I currently find interesting are XML, XML Schema, XPath, XSLT and XQuery. Though occasionally, I might post my thoughts on other technical topics as well.

I intend to use this medium to share some random as well as some carefully thought-out ideas in the information technology domain, and to bring them into the open for discussion.

I hope that this blog will be useful.

Please feel free to post any comments.