Mukul Gandhi: January 2009

Monday, January 26, 2009

Relational/XML hybrid database design

Lately, I have been reading about the database design (particularly, for relational and XML hybrid storage).

I felt that a column design (for complex types, like say purchase order) having XML data (like the native SQL type, 'XML') is quite useful, as compared to distributing the whole schema (say, the purchase order - PO schema) across multiple relational tables and columns (with simple types).

Following are some of the advantages for using the SQL type, 'XML' in RDBMS tables:

1. The database and nearly the whole application becomes shielded from the Schema changes. If say something in the PO Schema changes, we only change the PO XML Schema. The RDBMS data model need not change (the columns of the table would still be same, say a, b, po where the column po is of type 'XML'). The SQL statements in the application programs (say, the JDBC statements) also need not change.

The impact of the Schema change on the application will be quite less, as compared to the situation, if the data were kept in simple type columns.

2. Designing a complex type (say, a purchase order) in a 'XML' column makes good design sense. The complex type most of the times has hierarchical structure, as is XML. So it is a good design decision to design complex types as XML. This also has advantage during validation of the data during data inserts and updates. The XML data as a whole get's written to the database, and an internal schema in the database validates the incoming XML data. This is much more simpler and natural than validating the data, had it been stored in multiple tables and columns.

3. The XML data in the relational/XML hybrid database can be accessed using XQuery, and also updated using XQuery update facility.

I found DB2 with pureXML having an excellent support for all these features.

Friday, January 16, 2009

Normalizing unnecessary whitespace text nodes during XSLT transformation

Let's say that my input XML is following,


<test>
  <a/>
  <b/>
  <c/>
  <d/>
  <e/>
  <f>some data ..</f>
</test>

I need to write an XSLT transformation, which just removes elements, 'c' and 'd' and keeps rest of the structure same.

The result of the transformation should be following [1]:


<test>
  <a/>
  <b/>
  <e/>
  <f>some data ..</f>
</test>

The obvious solution to this problem is, to write a modified identity transformation logic.

i.e.,


<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">
   
   <xsl:output method="xml" indent="yes" />

   <xsl:template match="node() | @*">
     <xsl:copy>
       <xsl:apply-templates select="node() | @*" />
     </xsl:copy>
   </xsl:template>
   
   <xsl:template match="c | d" />
   
</xsl:stylesheet>

But there is a subtle flaw in this logic. The actual output produced by the above stylesheet is,


<?xml version="1.0" encoding="UTF-8"?>
<test>
  <a/>
  <b/>


  <e/>
  <f>some data ..</f>
</test>

There are a kind of two whitepace holes in the output (created by the elements which are removed). This makes the output not 100% same as the desired output [1].

The whitespace holes in the output above can be very well explained. They are actually the newline whitespaces (near the elements 'c' and 'd') present in the original document, which are preserved in the generated output.

Adding a little bit of extra logic in the stylesheet can fix this problem.

The right solution will be following,


<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">
   
   <xsl:output method="xml" indent="yes" />

   <xsl:template match="node() | @*">
     <xsl:copy>
       <xsl:apply-templates select="node() | @*" />
     </xsl:copy>
   </xsl:template>
   
   <xsl:template match="c | d" />
   
   <xsl:template match="text()[(normalize-space() = '') and (preceding-sibling::node()[1]/self::c or preceding-sibling::node()[1]/self::d)]" />
   
</xsl:stylesheet>

Please note the last template in this stylesheet, which fixed the whitespaces problem for me.

Saturday, January 10, 2009

XSLT functions returning void

I just thought, whether XSLT 2.0 functions can return something like a void value.

for e.g. as we could do,

public void myfunction() {

}

in Java.

I do not think there is any syntax in XSLT which allows this declaration.

i.e., can we do something like following in XSLT.


<xsl:function name="my:testfunction" as="a-void-type">
  <!-- something here -->
</xsl:function>

I think there is no way of specifying a void type in XSLT. A function must return something. The best we could do is, that we specify the return type as, as="xs:string?". i.e., the function may return a xs:string value, or it may return nothing (i.e., an empty sequence: ()).

But if we want to return something like void from a function, we can instead implement a named template for this. i.e., something like following,


<xsl:template name="testtemplate">
  <!-- something here -->
</xsl:template>

The named template is conceptually similar to xsl:function (both are callable modules), but there are some subtle differences as well, between them. xsl:function is a lot more loosely coupled module than the named template. The named template inherits the context from the caller, whereas the function has no access to the context information of the caller (though any piece of the context can be passed to the function as parameters).

Sunday, January 4, 2009

XSLT: sorting data by duration

I've this input text file (test.txt):

A started at 03:12:10
A ended at 03:20:20
B started at 03:20:25
B ended at 03:22:21
C started at 03:22:23
C ended at 03:22:55
D started at 03:22:57
D ended at 03:23:21
E started at 03:23:25
E ended at 03:24:40

Here A, B, C etc. are some events, and they start at a particular time and end at another time.

I need to produce an output like following:

D : 0-0-24
C : 0-0-32
E : 0-1-15
B : 0-1-56
A : 0-8-10

i.e., events sorted by the time they took (in ascending order of durations). The duration format in the output is, hr-min-sec.

The following XSLT 2.0 stylesheet worked well for this problem,

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                exclude-result-prefixes="xs"
                version="2.0">

<xsl:output method="text" />

<xsl:variable name="time-data" select="tokenize(unparsed-text('test.txt', 'UTF-8'), '\r?\n')" />

<xsl:template match="/">
  <xsl:variable name="temp-data">
    <xsl:for-each select="$time-data">
      <xsl:variable name="val" select="normalize-space(.)" />
      <xsl:variable name="pos" select="position()" />
      <xsl:if test="position() mod 2 = 1">
        <data key="{tokenize($val, '\s')[1]}">
          <xsl:value-of select="xs:time(tokenize($time-data[$pos + 1], '\s')[last()]) -xs:time(tokenize($val, '\s')[last()])" />
        </data>
      </xsl:if>
    </xsl:for-each>
  </xsl:variable>
  <xsl:for-each select="$temp-data/*">
    <xsl:sort select="xs:dayTimeDuration(.)" />
    <xsl:variable name="hr" select="hours-from-duration(.)" />
    <xsl:variable name="min" select="minutes-from-duration(.)" />
    <xsl:variable name="sec" select="seconds-from-duration(.)" />
    <xsl:value-of select="@key" /> : <xsl:value-of select="concat($hr, '-', $min, '-', $sec)" /> <xsl:text>
</xsl:text>
  </xsl:for-each>
</xsl:template>

</xsl:stylesheet>

The following XSLT sort instruction worked, well for this use case:
<xsl:sort select="xs:dayTimeDuration(.)" order="descending" />

I've used Saxon to solve this.

If anybody bumps by this post, and could think of a better solution (particularly written in a more functional style, and avoiding the temporary tree), I would be very glad to know that.

Mukul Gandhi