Mukul Gandhi: xpath

Tuesday, September 12, 2023

XSLT 3.0, XPath 3.1 and XalanJ

It's been a while since I last wrote a blog post here. I have a few updates about the work that the XalanJ team has been doing over the past few months, which I wish to share with the XML community.

The XalanJ project provides XSLT and XPath processors written in the Java language. An XSLT processor transforms an XML input document (or even plain text files) into other formats like XML, HTML and text.

The XalanJ project released a new version (2.7.3) of XalanJ on 2023-04-01. This release is essentially a bug fix release over the previous one. The XalanJ 2.7.3 release was extensively tested by the XalanJ team, and it has very good compliance with the XSLT 1.0 and XPath 1.0 specs.

Since April 2023, the XalanJ team has been working on implementations of the XSLT 3.0 and XPath 3.1 language specifications. These codebase changes have not yet been released by the XalanJ team, but they are available on a branch of the XalanJ dev repos.

I further wish to write about the XSLT 3.0 user-defined callable component implementation enhancements within XalanJ, which should be available in a future XalanJ release. The callable components of a programming language are essentially functions and procedures. The XSLT 1.0 language has only one kind of user-defined callable component, which is written with the XML element xsl:template.

XSLT 3.0 provides another kind of user-defined callable component, defined with the XML element xsl:function. The xsl:function declaration was first made available in the XSLT 2.0 language. A user-defined function present within an XSLT stylesheet may be called from within an XPath expression.

Following is an example of an XSLT 3.0 stylesheet that makes use of an xsl:function element:

<?xml version="1.0" encoding="UTF-8"?>

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"

xmlns:ns0="http://ns0"

exclude-result-prefixes="ns0"

version="3.0">

<xsl:output method="xml" indent="yes"/>

<xsl:template match="/">

<result>

<one>

<xsl:value-of select="ns0:func1(6, 5, true(), false())"/>

</one>

<two>

<xsl:value-of select="ns0:func1(2, 5, true(), false())"/>

</two>

</result>

</xsl:template>

<xsl:function name="ns0:func1">

<xsl:param name="val1"/>

<xsl:param name="val2"/>

<xsl:param name="a"/>

<xsl:param name="b"/>

<xsl:value-of select="if ($val1 gt $val2) then ($a and $b) else ($a or $b)"/>

</xsl:function>

</xsl:stylesheet>

The above stylesheet defines a user-defined function named "func1", bound to the specified non-null XML namespace. A call to this function must supply four arguments, and the function produces a boolean result based on a few logical conditions.

The above stylesheet produces the following output with XalanJ:

<?xml version="1.0" encoding="UTF-8"?><result>

<one>false</one>

<two>true</two>

</result>

XPath 3.1 provides a new kind of callable component (that wasn't available with XPath 1.0): an inline function expression which, when compiled by an XPath processor, produces an XPath data model (XDM) function item.

An XPath 3.1 function item may be called via an XPath dynamic function call expression.

Following is an XSLT 3.0 stylesheet that specifies an XPath inline function expression, and is an alternative solution to the stylesheet cited above:

<?xml version="1.0" encoding="UTF-8"?>

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"

version="3.0">

<xsl:output method="xml" indent="yes"/>

<xsl:variable name="func1" select="function($val1, $val2, $a, $b) { if ($val1 gt $val2) then ($a and $b) else ($a or $b) }"/>

<xsl:template match="/">

<result>

<one>

<xsl:value-of select="$func1(6, 5, true(), false())"/>

</one>

<two>

<xsl:value-of select="$func1(2, 5, true(), false())"/>

</two>

</result>

</xsl:template>

</xsl:stylesheet>

The above stylesheet specifies an XPath inline function expression assigned to an XSLT variable "func1". This makes XPath expressions like $func1(..) function calls (which the XPath 3.1 language terms dynamic function calls).

The above stylesheet produces the same output with XalanJ as the earlier stylesheet.
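To cross-check the two results by hand, the logic of func1 can be mirrored in plain Java. This is only an analogy under stated assumptions, not XalanJ code: the class name, the Func4 interface and the use of int arguments are all hypothetical. The named method stands in for the xsl:function, and the lambda stored in a field stands in for the inline function bound to $func1.

```java
final class CallableComponentsSketch {

    /** Hypothetical four-argument function type (Java has no built-in one). */
    @FunctionalInterface
    interface Func4 {
        boolean apply(int val1, int val2, boolean a, boolean b);
    }

    // Analogue of the named function ns0:func1 declared with xsl:function.
    static boolean func1(int val1, int val2, boolean a, boolean b) {
        return (val1 > val2) ? (a && b) : (a || b);
    }

    // Analogue of the inline function expression held in the variable $func1.
    static final Func4 FUNC1_INLINE =
            (val1, val2, a, b) -> (val1 > val2) ? (a && b) : (a || b);

    public static void main(String[] args) {
        // 6 gt 5 holds, so the result is (true and false).
        System.out.println("one = " + func1(6, 5, true, false));
        // 2 gt 5 does not hold, so the result is (true or false).
        System.out.println("two = " + FUNC1_INLINE.apply(2, 5, true, false));
    }
}
```

The first call evaluates to false and the second to true, whichever of the two forms is used.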

It's perhaps also interesting to discuss which of the above mentioned callable component approaches an XSLT stylesheet author should choose.

An XPath 3.1 inline function expression is an *XPath expression*, therefore its function body is limited to XPath syntax only.

Whereas, an xsl:function is an XSLT declaration (which may be invoked as a function call from within XPath expressions). The body of an xsl:function may have significantly more complex logic (with any permissible XSLT syntax and XPath expressions) than an XPath inline function expression.

To conclude, I believe that when using XSLT 3.0 and XPath 3.1, we have the following three main kinds of user-defined callable components which may be used by XSLT stylesheet authors:

1) xsl:template (which is very important within an XSLT stylesheet, and is the core of an XSLT stylesheet)

2) xsl:function

3) XPath inline function expression

That's all I wished to say within this blog post.

Monday, April 10, 2023

XPath 2.0 quantified expressions. Implementation with XSLT 1.0

The XPath 2.0 language introduced new syntax and semantics as compared to the XPath 1.0 language, for example the XPath 2.0 quantified expressions.

Following is the XPath 2.0 grammar for quantified expressions (quoted from the XPath 2.0 language specification):

QuantifiedExpr ::= ("some" | "every") "$" VarName "in" ExprSingle ("," "$" VarName "in" ExprSingle)* "satisfies" ExprSingle

An XPath 2.0 quantified expression, when evaluated over a sequence of XPath data model items, returns a boolean 'true' or 'false' value.

I'm able to suggest an XSLT 1.0 code pattern (tested with Apache XalanJ) that can implement the logic of XPath 2.0-like quantified expressions. Following is an example illustrating these concepts:

XML input document:

<?xml version="1.0" encoding="UTF-8"?>

<elem>

</elem>

XSLT 1.0 stylesheet, implementing the XPath 2.0 "every" like quantified expression (i.e, universal quantification):

<?xml version="1.0"?>

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"

xmlns:exslt="http://exslt.org/common"

exclude-result-prefixes="exslt"

version="1.0">

<xsl:output method="text"/>

<xsl:template match="/elem">

<xsl:variable name="temp">

<xsl:for-each select="a">

<xsl:if test="number(.) > 3">

<yes/>

</xsl:if>

</xsl:for-each>

</xsl:variable>

<xsl:value-of select="count(exslt:node-set($temp)/yes) = count(a)"/>

</xsl:template>

</xsl:stylesheet>

The above XSLT stylesheet produces a boolean 'true' result if all XML "a" input elements have a value greater than 3; otherwise a boolean 'false' result is produced.

XSLT 1.0 stylesheet, implementing the XPath 2.0 "some" like quantified expression (i.e, existential quantification):

<?xml version="1.0"?>

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"

xmlns:exslt="http://exslt.org/common"

exclude-result-prefixes="exslt"

version="1.0">

<xsl:output method="text"/>

<xsl:template match="/elem">

<xsl:variable name="temp">

<xsl:for-each select="a">

<xsl:if test="number(.) = 4">

<yes/>

</xsl:if>

</xsl:for-each>

</xsl:variable>

<xsl:value-of select="count(exslt:node-set($temp)/yes) >= 1"/>

</xsl:template>

</xsl:stylesheet>

The above XSLT stylesheet produces a boolean 'true' result if at least one XML "a" input element has a value equal to 4; otherwise a boolean 'false' result is produced.

Within the above XSLT 1.0 stylesheets, we've used the "node-set" extension function (which converts an XSLT 1.0 "result tree fragment" into a node set).

We can therefore conclude that, within an XSLT 1.0 environment, we can largely simulate the logic of many XPath 2.0 language constructs.
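The same two count comparisons can also be tried out directly as XPath 1.0 expressions, without the intermediate result tree fragment. The following is a minimal sketch using the JDK's javax.xml.xpath API; it is a different technique from the stylesheets above (plain XPath predicates instead of xsl:for-each plus node-set), and the class name, method names and sample "a" values are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

final class QuantifiedXPath1Sketch {

    // Evaluates an XPath 1.0 expression to a boolean against an XML string.
    static boolean test(String xml, String xpathExpr) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return (Boolean) xpath.evaluate(
                    xpathExpr, new InputSource(new StringReader(xml)), XPathConstants.BOOLEAN);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    // Like "every $a in /elem/a satisfies $a > 3": matching count equals total count.
    static boolean everyGreaterThan3(String xml) {
        return test(xml, "count(/elem/a[number(.) > 3]) = count(/elem/a)");
    }

    // Like "some $a in /elem/a satisfies $a = 4": at least one match.
    static boolean someEquals4(String xml) {
        return test(xml, "count(/elem/a[number(.) = 4]) >= 1");
    }

    public static void main(String[] args) {
        String sample = "<elem><a>5</a><a>4</a><a>7</a></elem>"; // hypothetical input
        System.out.println(everyGreaterThan3(sample));
        System.out.println(someEquals4(sample));
    }
}
```

As with the XPath 2.0 "every" expression, the count comparison is true for an input with no "a" elements at all.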

Wednesday, September 21, 2022

XPath/XSLT 1.0 data model and beyond

Is the inherent XPath/XSLT 1.0 data model better from the point of view of functional capabilities, or are the data models of the later versions (2.0, 3.0) of these language specifications better?

The XPath/XSLT 1.0 data model focuses on having a well-formed XML document tree as part of the data model. Whereas, the 2.0 and 3.0 versions of these language specifications focus on having a flat sequence of data model items (like atomic/list values or XML nodes). Many of the XPath/XSLT 2.0 and 3.0 use cases still focus on achieving well-formed XML document trees as the output of an XSLT transform.

Although the definition of the data model for the 1.0 versions of these language specifications is fundamentally different from that of the 2.0 versions (one is a coherent XML tree, whereas the newer one is a sequence of data model items), XSLT 1.0 and 2.0/3.0 transforms try to achieve the same end result (i.e, a well-formed XML serialization of the data model instance).

I think XSLT 2.0/3.0 brought in a sequence of data model items as a fundamentally new definition of the data model because XPath 2.0/3.0 data model components need to be strongly typed at a granular level (aligning with the XML Schema specification).

If we need a more strongly typed process for achieving the end result of an XSLT transform, we should select the 2.0/3.0 versions of these language specifications. Otherwise we should opt for the 1.0 versions of these specifications.

The 2.0/3.0 versions of these language specifications have brought in newer XSLT language features, and also a vastly expanded function library. That's an advantage of using the XSLT 2.0/3.0 languages over the 1.0 versions.

At various times, I don't want too much strong typing (in an XML Schema sense) within an XSLT transformation process (because that involves greater design effort upfront), and if my XML transformation requirements are simple I tend to opt for an XSLT 1.0 transform. I certainly go for the XSLT 2.0/3.0 options if I'm not constrained by these factors.

Monday, August 7, 2017

Mathematical table data with XML Schema 1.1

Here's a simple example, using XML Schema 1.1 <assert> to validate elementary school mathematical tables.

XML document:
<?xml version="1.0"?>
<table id="2">
<x>2</x>
<x>4</x>
<x>6</x>
<x>8</x>
<x>10</x>
<x>12</x>
<x>14</x>
<x>16</x>
<x>18</x>
<x>20</x>
</table>

XSD 1.1 document:
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

<xs:element name="table">
<xs:complexType>
<xs:sequence>
<xs:element name="x" minOccurs="10" maxOccurs="10"/>
</xs:sequence>
<xs:attribute name="id" type="xs:positiveInteger" use="required">
<xs:annotation>
<xs:documentation>Mathematical table of @id is represented.</xs:documentation>
</xs:annotation>
</xs:attribute>
<xs:assert test="x[1] = @id"/>
<xs:assert test="every $x in x[position() gt 1] satisfies $x = $x/preceding-sibling::x[1] + @id">
<xs:annotation>
<xs:documentation>An XPath 2.0 expression validating the depicted mathematical table.
</xs:documentation>
</xs:annotation>
</xs:assert>
</xs:complexType>
</xs:element>

</xs:schema>
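
The two assertions can be read as a small procedure: the first "x" must equal @id, and every later "x" must equal its preceding sibling plus @id. The following plain Java sketch mirrors only that logic, as a way to reason about it; it is not an XSD validator, the class and method names are hypothetical, and it does not check the minOccurs/maxOccurs constraint of exactly ten "x" elements.

```java
import java.util.Arrays;
import java.util.List;

final class TableAssertSketch {

    // Mirrors the two xs:assert tests for a table with the given id.
    static boolean satisfiesAssertions(int id, List<Integer> xs) {
        // x[1] = @id
        if (xs.isEmpty() || xs.get(0) != id) {
            return false;
        }
        // every $x in x[position() gt 1] satisfies $x = preceding-sibling::x[1] + @id
        for (int i = 1; i < xs.size(); i++) {
            if (xs.get(i) != xs.get(i - 1) + id) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Integer> tableOfTwo = Arrays.asList(2, 4, 6, 8, 10, 12, 14, 16, 18, 20);
        System.out.println(satisfiesAssertions(2, tableOfTwo));
    }
}
```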

Saturday, February 25, 2012

modular XML instances and modular XSD schemas

I was playing with some new ideas lately related to exploring design options for constructing modular XML instance documents vs/and modular XSD schema documents, and thought to write my findings as a blog post here.

I believe there are primarily the following concepts related to constructing modular XML documents (and XSD schemas) when XSD validation is involved:
1. Modularize XML documents using the XInclude construct.
2. Modularize an XSD document via <xs:include> and <xs:import>. The <xs:include> construct maps significantly to modularity concepts in XSD schemas, and <xs:import> is necessary (necessary in XSD 1.0, and optional in XSD 1.1) to compose (and also to modularize) XSD schemas coming from two or more distinct XML namespaces.

I don't intend to delve much in this post into concepts related to the XSD constructs <xs:include> and <xs:import>, since these are well known within the XSD and XML communities. In this post, I primarily focus on XML document modularization via the XInclude construct, and present a few thoughts about various design options for validating such XML instance documents via XSD validation (I don't claim to have covered every design option for these use cases, but I feel that I cover a few of the important ones).

What is XInclude?
This is an XML standards specification that defines how to modularize any XML document information. The primary construct of XInclude is an <xi:include> XML element. Following is a small example of an XInclude aware XML document,

z.xml

<z xmlns:xi="http://www.w3.org/2001/XInclude">
<xi:include href="x.xml"/>
<xi:include href="y.xml"/>
</z>

x.xml

<x>
<a>1</a>
<b>2</b>
</x>

y.xml

<y>
<p>5</p>
<q>6</q>
</y>

We'll provide the XML document z.xml above, which is composed from other XML documents via XInclude meta-data, to an XSD validator for validation.

I essentially discuss here the XSD schema design options to validate an XML instance document like z.xml above. Following are the XSD design options (that cause successful XML instance validations) that currently come to my mind for this need, along with some explanation of the corresponding design rationale:

XS1:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

 <xs:element name="z">
 <xs:complexType>
 <xs:sequence>
 <xs:any processContents="skip" minOccurs="2" maxOccurs="2"/>
 </xs:sequence>
 </xs:complexType>
 </xs:element>

</xs:schema>

This schema is written with a view that the XML document (i.e z.xml) would be validated with XInclude meta-data unexpanded. The xs:any wild-card in this schema would weakly validate each of the two element children of "z" (i.e the <xi:include> elements that stand in for the included document root elements "x" and "y"). The validation is weak since this wild-card declaration only requires *some* XML element to be present in the instance document, and doesn't specify any other constraint for its corresponding XML instance elements.

XS2:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

 <xs:element name="z">
 <xs:complexType>
 <xs:complexContent>
 <xs:restriction base="T1">
 <xs:sequence>
 <xs:element name="include" minOccurs="2" maxOccurs="2" targetNamespace="http://www.w3.org/2001/XInclude"/>
 </xs:sequence>
 </xs:restriction>
 </xs:complexContent>
 </xs:complexType>
 </xs:element>

 <xs:complexType name="T1" abstract="true">
 <xs:sequence>
 <xs:any processContents="skip" maxOccurs="unbounded"/>
 </xs:sequence>
 </xs:complexType>

</xs:schema>

This schema is also written with a view that the XML document (i.e z.xml) would be validated with XInclude meta-data unexpanded. But this schema specifies slightly stronger XSD validation constraints as compared to the previous example (stronger in the sense that this schema declares an XML element and specifies its name and namespace). This schema needs an XSD 1.1 processor, since the element declaration specifies a "targetNamespace" attribute. An XSD 1.0 version of this design approach is possible, which would involve using an XSD <xs:import> element to import XSD components from the XInclude namespace.

XS3:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

 <xs:element name="z">
 <xs:complexType>
 <xs:sequence>
 <xs:any processContents="skip" minOccurs="2" maxOccurs="2" namespace="http://www.w3.org/2001/XInclude"/>
 </xs:sequence>
 <xs:assert test="count(*[local-name() = 'include']) = 2"/>
 <xs:assert test="deep-equal((*[1] | *[2])/@*/name(), ('href','href'))"/>
 </xs:complexType>
</xs:element>

</xs:schema>

This schema is also written with a view that the XML document (i.e z.xml) would be validated with XInclude meta-data unexpanded. But this schema enforces XSD validation even more strongly than the example "XS2" above (since this schema also requires the XInclude attribute "href" to be present on the XInclude meta-data, which the previous XSD schema doesn't enforce). This schema validates the names of the XML instance elements that are intended to be XInclude meta-data via XSD 1.1 <assert> elements (this may not be the best XSD validation approach, but such an XSD design idiom is now possible with the XSD 1.1 language).

XS4:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

 <xs:element name="z">
 <xs:complexType>
 <xs:sequence>
 <xs:element name="x">
 <xs:complexType>
 <xs:sequence>
 <xs:element name="a" type="xs:integer"/>
 <xs:element name="b" type="xs:integer"/>
 </xs:sequence>
 </xs:complexType>
 </xs:element>
 <xs:element name="y">
 <xs:complexType>
 <xs:sequence>
 <xs:element name="p" type="xs:integer"/>
 <xs:element name="q" type="xs:integer"/>
 </xs:sequence>
 </xs:complexType>
 </xs:element>
 </xs:sequence>
 </xs:complexType>
 </xs:element>

</xs:schema>

This schema is written with a view that the XML document (i.e z.xml) would be validated with XInclude meta-data expanded. This schema specifies the strongest XSD validation constraints as compared to the previous three approaches (strongest in the sense that the internal structure of the XML element instances "x" and "y" is now completely specified by the XSD document).

But to make this XSD validation approach work, the XInclude meta-data needs to be expanded and the expanded XML infoset needs to be supplied to the XSD validator for validation. This requires an XInclude processor (like Apache Xerces) that plugs into the XML parsing stage to expand the <xi:include> tags.

For the interest of readers, following are a few Java code snippets (the skeletal class structure and imports are omitted to keep the text shorter) that enable XInclude processing and supply the resulting XML infoset (i.e post the XInclude meta-data expansion) to the Xerces XSD validator,

// Compile the schema, then validate the XInclude-expanded instance document.
try {
    Schema schema = schemaFactory.newSchema(getSaxSource(xsdUri, false));
    Validator validator = schema.newValidator();
    validator.setErrorHandler(new ValidationErrHandler());
    validator.validate(getSaxSource(xmlUri, true));
}
catch (SAXException se) {
    se.printStackTrace();
}
catch (IOException ioe) {
    ioe.printStackTrace();
}

// Builds a SAXSource for the given document. XInclude expansion is
// switched on only for the instance document, not for the schema.
private SAXSource getSaxSource(String docUri, boolean isInstanceDoc) {

    XMLReader reader = null;

    try {
        reader = XMLReaderFactory.createXMLReader();
        if (isInstanceDoc) {
            // Expand <xi:include> elements during parsing.
            reader.setFeature("http://apache.org/xml/features/xinclude", true);
            // Do not add xml:base attributes to the included elements.
            reader.setFeature("http://apache.org/xml/features/xinclude/fixup-base-uris", false);
        }
    }
    catch (SAXException se) {
        se.printStackTrace();
    }

    return new SAXSource(reader, new InputSource(docUri));

}

// Reports validation errors to System.err; warnings are ignored.
class ValidationErrHandler implements ErrorHandler {

    public void error(SAXParseException spe) throws SAXException {
        String formattedMesg = getFormattedMesg(spe.getSystemId(), spe.getLineNumber(), spe.getColumnNumber(), spe.getMessage());
        System.err.println(formattedMesg);
    }

    public void fatalError(SAXParseException spe) throws SAXException {
        String formattedMesg = getFormattedMesg(spe.getSystemId(), spe.getLineNumber(), spe.getColumnNumber(), spe.getMessage());
        System.err.println(formattedMesg);
    }

    public void warning(SAXParseException spe) throws SAXException {
        // NO-OP
    }

}

private String getFormattedMesg(String systemId, int lineNo, int colNo, String mesg) {
    return systemId + ", line " + lineNo + ", col " + colNo + " : " + mesg;
}
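
The expansion step on its own can also be observed through the standard JAXP DOM API, which exposes an XInclude switch via DocumentBuilderFactory.setXIncludeAware. The following is a minimal sketch, not part of the validation code above: the class and method names are hypothetical, and demo() writes its own small sample files (shaped like z.xml, x.xml and y.xml, with element content following the XS4 schema) to a temporary directory.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

final class XIncludeDomSketch {

    // Parses a file with XInclude expansion enabled.
    static Document parseExpanded(File xmlFile) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // XInclude processing relies on namespaces
            dbf.setXIncludeAware(true);
            return dbf.newDocumentBuilder().parse(xmlFile);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Writes hypothetical sample files to a temp directory and parses z.xml.
    static Document demo() {
        try {
            Path dir = Files.createTempDirectory("xinclude-sketch");
            write(dir, "x.xml", "<x><a>1</a><b>2</b></x>");
            write(dir, "y.xml", "<y><p>5</p><q>6</q></y>");
            Path z = write(dir, "z.xml",
                    "<z xmlns:xi=\"http://www.w3.org/2001/XInclude\">"
                    + "<xi:include href=\"x.xml\"/><xi:include href=\"y.xml\"/></z>");
            return parseExpanded(z.toFile());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static Path write(Path dir, String name, String xml) throws IOException {
        return Files.write(dir.resolve(name), xml.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        Document doc = demo();
        // After expansion, the children of "z" are "x" and "y", not xi:include.
        System.out.println(doc.getDocumentElement().getElementsByTagName("x").getLength());
    }
}
```

A DOM built this way could be wrapped in a DOMSource and handed to a Validator in place of the SAXSource, which is a design choice rather than a requirement.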

Summary: I would ponder whether devising the various XSD design approaches above is beneficial for an XSD schema design that involves validating XML instance documents containing <xi:include> meta-data directives. My thought process with regard to the XSD validation options presented above had the following concerns:
1) Providing various degrees of XSD validation strength for <xi:include> directives (essentially the un-expanded and expanded modes).
2) Exploring some of the new XML validation idioms offered by the XSD 1.1 language for the use cases presented above (essentially using the "targetNamespace" attribute on xs:element elements, and using <assert> elements).
3) Exploring the Java SAX and JAXP APIs to enable XInclude meta-data expansion, and providing a SAXSource object containing an XInclude expanded XML infoset which is subsequently supplied to the XSD validation pipeline.

I hope that this post was useful.

Sunday, February 5, 2012

"castable as" vs "instance of" XPath 2.0 expressions for XSD 1.1 assertions

I'm continuing with my thoughts related to my previous blog post (ref, http://mukulgandhi.blogspot.in/2012/01/using-xsd-11-assertions-on-complextype.html). The earlier post used the XPath 2.0 "castable as" expression to do some checks on the 'untyped' data of complexType's mixed content (essentially finding if the string/untyped value in an XML instance document is a lexical representation of an xs:integer value).

This post talks about the use of XPath 2.0 "instance of" vs "castable as" expressions in context of XSD 1.1 assertions -- essentially providing guidance about when it may be necessary to use one of these expressions.

The XSD 1.1 "castable as" use case was discussed in my earlier blog post. Here I essentially talk about "instance of" expression when used with XSD 1.1 assertions.

Let's assume that there is an XML instance document like following (XML1):

<X>
 <elem>
 <a>20</a>
 <b>30</b>
 </elem>
 <elem>
 <a>10</a>
 <b>2005-10-07</b>
 </elem>
</X>

The XSD schema should express the following constraints with respect to the above XML instance document (XML1):
1. The elements "a" and "b" can be typed as an xs:integer or an xs:date (therefore we'll express this with an XSD simpleType with variety 'union').
2. If both the elements "a" and "b" are of type xs:integer (this is allowable as per the simpleType definition described in point 1 above), then the numeric value of element "a" should be less than the numeric value of element "b".
3. If one of the elements "a" or "b" is an xs:integer and the other one is an xs:date, then we would like to express the following constraints,
 - the numeric XML instance value of an xs:integer typed element should be less than 100
 - the xs:date XML instance value should be less than the current date

The following XSD (1.1) schema document describes all of the above validation constraints for a sample XML instance document (XML1) provided above:

[XS1]

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

 <xs:element name="X">
 <xs:complexType>
 <xs:sequence>
 <xs:element name="elem" maxOccurs="unbounded">
 <xs:complexType>
 <xs:sequence>
 <xs:element name="a" type="union_of_date_and_integer"/>
 <xs:element name="b" type="union_of_date_and_integer"/>
 </xs:sequence>
 <xs:assert test="if ((data(a) instance of xs:integer) and (data(b) instance of xs:integer))
 then (data(a) lt data(b))
 else if (not(deep-equal(data(a), data(b))))
 then (*[data(.) instance of xs:integer]/data(.) lt 100
 and
 *[data(.) instance of xs:date]/data(.) lt current-date())
 else true()"/>
 </xs:complexType>
 </xs:element>
 </xs:sequence>
 </xs:complexType>
 </xs:element>

 <xs:simpleType name="union_of_date_and_integer">
 <xs:union memberTypes="xs:date xs:integer"/>
 </xs:simpleType>

</xs:schema>

I think it may be interesting for readers to know why I wrote an assertion like the one above. Following are a few of my thoughts,
1. Since the XML elements "a" and "b" are typed with a simpleType of variety 'union', for an assertion to access the XML instance atomic values that were validated by such a simpleType we need to use the XPath 2.0 "data" function on the relevant XDM node (elements "a" and "b" in this case). Further, to determine that the XML document's atomic instance value is typed as xs:integer, we need to use the "instance of" expression -- "castable as" is not needed in this case, since the instance document's data is already typed.
2. The rest of the assertion implements what is mentioned in the requirements above.
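
The distinction between the two expressions has a loose analogue in Java: "castable as" asks whether an untyped string could be converted to a type, while "instance of" asks what type an already-typed value has. The sketch below is only that analogy, with hypothetical names; it models xs:integer as BigInteger and xs:date as LocalDate (assuming dates without a timezone), and tries the union members in their declared order.

```java
import java.math.BigInteger;
import java.time.LocalDate;
import java.time.format.DateTimeParseException;

final class UnionTypingSketch {

    // "castable as xs:integer" analogue: could this untyped string be converted?
    static boolean castableAsInteger(String lexical) {
        try {
            new BigInteger(lexical.trim());
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    // Analogue of validation against the union "xs:date xs:integer":
    // the first member type that accepts the string decides the typed value.
    // Throws NumberFormatException if neither member accepts it.
    static Object typedValue(String lexical) {
        String s = lexical.trim();
        try {
            return LocalDate.parse(s);
        } catch (DateTimeParseException e) {
            return new BigInteger(s);
        }
    }

    // "instance of xs:integer" analogue: inspects the type of a typed value.
    static boolean instanceOfInteger(Object typed) {
        return typed instanceof BigInteger;
    }

    public static void main(String[] args) {
        System.out.println(instanceOfInteger(typedValue("30")));
        System.out.println(instanceOfInteger(typedValue("2005-10-07")));
    }
}
```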

If you want further visual and/or design elegance in what is written in the assertion above, feel free to break the assertion rules into two or more assertions.

I would also like to write another XSD 1.1 assertions example, which uses neither an XPath 2.0 "castable as" nor an "instance of" expression. This demonstrates that if an XDM node accessed by an assertion is already typed, it would usually be unnecessary to use the "castable as" expression (since "castable as" is essentially useful to programmatically enforce typing on string/untyped values), though an "instance of" expression may still be needed in some cases (as with the 'union' type above).

Following is a slightly modified variant of the XML instance document specified above (XML1):

[XML2]

<X>
 <elem>
 <a>2</a>
 <b>2012-02-04</b>
 </elem>
 <elem>
 <a>10</a>
 <b>2005-10-07</b>
 </elem>
</X>

The XSD schema should express the following constraints with respect to the above XML instance document (XML2):
1. The element "a" is typed as an xs:nonNegativeInteger value, and element "b" is typed as xs:date.
2. The number of days equal to the numeric value specified in an element "a" if added to the xs:date value specified in an element "b", should result in an xs:date value which must be less than the current date.

The following XSD (1.1) schema document describes all of the above validation constraints for a sample XML instance document (XML2) provided above:

[XS2]

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

 <xs:element name="X">
 <xs:complexType>
 <xs:sequence>
 <xs:element name="elem" maxOccurs="unbounded">
 <xs:complexType>
 <xs:sequence>
 <xs:element name="a" type="xs:nonNegativeInteger"/>
 <xs:element name="b" type="xs:date"/>
 </xs:sequence>
 <xs:assert test="(b + xs:dayTimeDuration(concat('P', a, 'D'))) lt current-date()"/>
 </xs:complexType>
 </xs:element>
 </xs:sequence>
 </xs:complexType>
 </xs:element>

</xs:schema>
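
The date arithmetic performed by this assertion can be cross-checked with java.time. The following is a minimal sketch with hypothetical names; it assumes the xs:date value has no timezone, and it takes the "current date" as a parameter so that the result is reproducible.

```java
import java.time.LocalDate;

final class DatePlusDaysSketch {

    // Mirrors (b + xs:dayTimeDuration(concat('P', a, 'D'))) lt current-date().
    static boolean satisfiesAssertion(long a, String b, LocalDate currentDate) {
        return LocalDate.parse(b).plusDays(a).isBefore(currentDate);
    }

    public static void main(String[] args) {
        // First "elem" of XML2: 2012-02-04 plus 2 days is 2012-02-06.
        System.out.println(satisfiesAssertion(2, "2012-02-04", LocalDate.now()));
    }
}
```

Since the comparison is a strict "less than", an "elem" whose computed date equals the current date does not satisfy the assertion.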

That's all I had to say today.

I hope this post was useful.

Thursday, January 26, 2012

Using XSD 1.1 assertions on complexType mixed contents

There were some interesting ;) thoughts coming to my mind lately, and not surprisingly again related to XSD. I was playing with XSD 1.1 assertions once again, trying to constrain an XSD complexType{mixed} content model, and I'm sharing some of my findings ... (I guess I hadn't written about this particular topic on this blog before, or on any other forum. If you find any duplication between this post and anything I might have written elsewhere, kindly ignore the earlier things I might have written). I come to the topic now.

What is XSD mixed content (you may skip this, if you already know about it)?
I believe this isn't really an XSD only topic. Mixed content is present in plain XML (there can be a good old well-formed XML document which has "mixed" content and needn't be validated at all -- i.e in a schema free XML environment). But XSD allows reporting such an XML instance document as 'valid' (more importantly, XSD would report a "mixed" content XML instance as 'invalid' if validated against an "element only" content model specified by an XSD complexType definition), and also allows constraining XML mixed content in certain ways (particularly with XSD 1.1 in some new ways, which I'll talk about further below).

Example of "element only" (content of element "X" here) XML content model [X1]:

<X>
  <Y/>
  <Z/>
</X>

Example of "mixed content" (content of element "X" here) XML content model [X2]: 

<X>
  abc
  <Y/>
  123
  <Z/>
  654
</X>

Therefore, "mixed content" allows non-whitespace text nodes as siblings of element nodes.

XSD 1.0 schema definition that allows "mixed" content [XS1]:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

    <xs:element name="X">    
        <xs:complexType mixed="true">
             <xs:sequence>
                 <xs:element name="Y"/>
                 <xs:element name="Z"/>
             </xs:sequence>
        </xs:complexType>
    </xs:element>
    
</xs:schema>

This schema (XS1) would report the XML document "X2" above as 'valid' (since that instance document has "mixed" content, and this schema allows "mixed" content via a property "mixed = 'true'" on a complexType definition).

But in the schema document "XS1" above, if we remove the property specifier "mixed = 'true'" or set the value of the attribute "mixed" to 'false' (which is also the default value of this attribute), then such a modified schema would report the XML instance document "X2" above as 'invalid' (but the XML document "X1" above would be reported as 'valid' -- since it doesn't have "mixed" content).

New capabilities provided by XSD 1.1 to constrain XML "mixed" content further:

Following is a list of new features supported by XSD 1.1 for XML "mixed" contents, that currently come to my mind,

a)

XSD 1.1 schema "XS2":

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

    <xs:element name="X">    
       <xs:complexType mixed="true">
          <xs:sequence>
             <xs:element name="Y"/>
             <xs:element name="Z"/>
          </xs:sequence>          
          <xs:assert test="deep-equal(text()[matches(.,'\w')]/normalize-space(.), ('abc','123','654'))"/>
       </xs:complexType>
    </xs:element>
    
</xs:schema>

The <assert> element in this schema (XS2) constrains the mixed content in the XML instance document to be a list (with the order of list items being significant) of only a few specified values. The assertion is written only to illustrate the technical capabilities of an assertion here, and not with any application in mind.
Following are a few other things which XSD 1.1 assertions could achieve in the context of an XML "mixed" content model:

b)

<xs:assert test="((text()[matches(.,'\w')]/normalize-space(.))[2] castable as xs:integer)
                    and
                 ((text()[matches(.,'\w')]/normalize-space(.))[3] castable as xs:integer)"/>

This assertion constrains specific items of an XML "mixed" content model list to be of a specified XSD schema type -- here the 2nd and 3rd items of the list need to be typed as xs:integer, whereas the first item is "untyped".

c)

<xs:assert test="count((text()[matches(.,'\w')]/normalize-space(.))[. castable as xs:integer])
                    =
                 count(text()[matches(.,'\w')]/normalize-space(.))"/>

This assertion constrains all items of the XML "mixed" content model list to be of the same type (xs:integer in this case) -- this uses a well defined pattern "count of xs:integer items is equal to the count of all the items".

d)

<xs:assert test="every $x in text()[matches(.,'\w')][position() gt 1]
                   satisfies 
                (number(normalize-space($x)) gt number($x/preceding-sibling::text()[matches(.,'\w')][1]))"/>

This assertion constrains the list of XML "mixed" content model to be in ascending numeric order (assuming that all items in the list are numeric. Though it should be possible to specify a numeric order on a heterogeneously typed list, and specify numeric order only for numeric list items).
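
All four assertions share one building block, text()[matches(.,'\w')]/normalize-space(.), which yields the list of non-whitespace text items. The following Java DOM sketch shows what that list looks like for the document "X2" and mirrors assertion (c). The names are hypothetical, and it assumes Java's regex \w and an ASCII digit pattern are close enough to the XSD/XPath definitions for these sample values.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class MixedContentSketch {

    private static final Pattern WORD = Pattern.compile("\\w");

    // Analogue of text()[matches(.,'\w')]/normalize-space(.) on the root element.
    static List<String> textItems(String xml) {
        try {
            Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
            List<String> items = new ArrayList<>();
            NodeList kids = root.getChildNodes();
            for (int i = 0; i < kids.getLength(); i++) {
                Node n = kids.item(i);
                if (n.getNodeType() == Node.TEXT_NODE && WORD.matcher(n.getNodeValue()).find()) {
                    items.add(n.getNodeValue().trim().replaceAll("\\s+", " "));
                }
            }
            return items;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Analogue of assertion (c): every item is a lexical integer.
    static boolean allCastableAsInteger(List<String> items) {
        return items.stream().allMatch(s -> s.matches("[+-]?\\d+"));
    }

    public static void main(String[] args) {
        String x2 = "<X>\n  abc\n  <Y/>\n  123\n  <Z/>\n  654\n</X>";
        System.out.println(textItems(x2));
        System.out.println(allCastableAsInteger(textItems(x2)));
    }
}
```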

Summary: XSD 1.0 allowed an "untyped" XML mixed content, that was uniformly available anywhere within the scope of an XML element that was validated by an XSD complexType. No further constraints on "mixed" content were possible in an XSD 1.0 environment. XSD 1.1 allows some new ways to constrain XML "mixed" content further (some of these capabilities were illustrated in the examples above). In my opinion, the likely benefit of constraining XML "mixed" content in some of the ways illustrated above is to allow XML document authors to model certain semantic content in "mixed" content scope and make this knowledge available to XML applications. All examples above were tested with Apache Xerces (I hope that these examples would also be compliant with other XSD validators, notably Saxon, which currently also supports XSD 1.1).

I hope that this information was useful.

Tuesday, August 23, 2011

XPath 2.0 and XSD schemas : sharing experiences

I was just playing with XPath 2.0 and thought of sharing my observations, about a specific use case.

We start with the following XSD schema document,

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    
    <xs:element name="X">
       <xs:complexType>
          <xs:sequence>
             <xs:element name="a" type="xs:integer"/>
          </xs:sequence>
          <xs:attribute name="att1" type="xs:boolean"/>
       </xs:complexType>
    </xs:element>

</xs:schema>

This schema intends to validate an XML instance document like following,

<X att1="0">
  <a>100</a>
</X>

I wrote the following XPath (2.0) expression [1], and ran it after enabling validation of the input document,

/X[if (@att1) then true() else false()]/a/text()

I thought that this would not return any result (i.e. an empty sequence).

But the XPath expression above ([1]) returns the result "100". At first, I was a little surprised by this result. I thought that, since the attribute "att1" was declared with type xs:boolean in the schema, the "if condition" should return 'false' in this case. But that's not the correct interpretation of the XPath expression written above ([1]). Following is a little more explanation about this.

The reference @att1 in the XPath expression above (i.e if (@att1) ..) is a node reference (an attribute node) and is not a boolean value (which I thought initially, and I was wrong -- I incorrectly thought, that atomization of the expression @att1 would take place in this case; more about this below).

The XPath 2.0 spec says that if the first item of a sequence is a node, then the effective boolean value of that sequence is 'true' (this holds regardless of whether the input XML document was validated with the XSD schema). In an expression like the one above (i.e. if (@att1) ..), the effective boolean value of the sequence {@att1} determines whether the "if condition" returns 'true' or not. In this case, the sequence has one item (which is also its first item), an attribute node whose name is "att1". This makes the effective boolean value 'true', and hence the XPath predicate evaluates to 'true'. I think this explains why the "if condition" {if (@att1)} returns true for the above XML instance document (even if it was validated by the schema given above, and the XPath 2.0 expression above [1] was run in a schema aware mode).

To write the XPath expression correctly, as I wanted (i.e the expression of the "if condition" should return 'true' if the instance document had value true/1 for the attribute, and 'false' otherwise AND an XSD validation of instance document took place prior to the evaluation of the XPath expression), the XPath expression would need to be modified to either of the following styles [2],

/X[if (data(@att1)) then true() else false()]/a/text()

/X[if (@att1 = true()) then true() else false()]/a/text()

To understand why the expressions given above ([2]) work correctly, one needs to understand the XPath 2.0 "data" function (for the first correct variant above, [2] -- this returns the typed value of the argument of the "data" function) and the process of atomization (for the second correct variant above, [2] -- in this case the attribute node "att1" is atomized to return a sequence of kind {xs:boolean}) as described by the XPath 2.0 spec.
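To make the difference concrete, here's a small Python sketch of the effective boolean value rules and of atomization. It is a simplified model and not an XPath engine: the `AttributeNode` class, its `typed_value` field and the two functions are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass

@dataclass
class AttributeNode:
    """Hypothetical stand-in for a schema-validated attribute node."""
    name: str
    typed_value: object  # e.g. False for att1="0" typed as xs:boolean

def effective_boolean_value(sequence):
    """Simplified XPath 2.0 effective boolean value of a sequence (a list)."""
    if not sequence:
        return False                          # empty sequence
    first = sequence[0]
    if isinstance(first, AttributeNode):
        return True                           # first item is a node
    if len(sequence) == 1:
        if isinstance(first, bool):
            return first                      # singleton xs:boolean
        if isinstance(first, str):
            return len(first) > 0             # singleton string
        if isinstance(first, (int, float)):
            return first == first and first != 0   # false for 0 and NaN
    raise TypeError("no effective boolean value for this sequence")

def data(sequence):
    """Atomize: replace each node by its typed value."""
    return [item.typed_value if isinstance(item, AttributeNode) else item
            for item in sequence]

# att1="0", after validation against the schema (type xs:boolean)
att1 = AttributeNode("att1", False)
```

With this model, `effective_boolean_value([att1])` is true because the item is a node, while `effective_boolean_value(data([att1]))` is false because the item is now the typed boolean value.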

That's all about this. I hope that my experience with this may be helpful to someone (to understand this, one just has to know the XPath [2.0] spec correctly, and how it interacts with XSD schemas!).

Thanks for reading this post.

@2011-11-11: updated in place, to correct a few factual errors.

Tuesday, December 28, 2010

Schema based XML compare

David A. Lee (producer of XMLSH -- A command line shell for XML) raised an interesting discussion a while ago on the XML-DEV mailing list, about how to do XML Schema aware XML document comparison. The whole of this discussion thread can be read here. Michael Kay suggested using the XPath 2.0 function deep-equal for this kind of use case (where the input document trees need to be validated by a schema before the comparison, to enable a type-aware comparison by this function). Following Michael's idea, I was playing with this concept using IBM's XPath 2.0 engine (which is XML Schema aware and is a component of the WebSphere Application Server feature pack for XML). For the interest of readers, here's a minimal Java program illustrating this approach.

package com.ibm.xpath2;

import javax.xml.namespace.QName;
import javax.xml.transform.stream.StreamSource;

import com.ibm.xml.xapi.XDynamicContext;
import com.ibm.xml.xapi.XFactory;
import com.ibm.xml.xapi.XPathExecutable;
import com.ibm.xml.xapi.XSequenceCursor;
import com.ibm.xml.xapi.XSequenceType;
import com.ibm.xml.xapi.XStaticContext;

public class XMLCompare {

    public static void main(String[] args) throws Exception {
        String dataDir = System.getProperty("dataDir.path");
  
        XFactory factory = XFactory.newInstance();
        factory.setValidating(XFactory.FULL_VALIDATION);
        factory.registerSchema(new StreamSource(dataDir + "/test.xsd"));
        
        XStaticContext staticContext = factory.newStaticContext();
        staticContext.declareVariable(new QName("doc1"),
                factory.getSequenceTypeFactory().documentNode(XSequenceType.OccurrenceIndicator.ONE));
        staticContext.declareVariable(new QName("doc2"),
                factory.getSequenceTypeFactory().documentNode(XSequenceType.OccurrenceIndicator.ONE));
        XDynamicContext dynamicContext = factory.newDynamicContext();
        dynamicContext.bind(new QName("doc1"), new StreamSource(dataDir + "/test1.xml"));
        dynamicContext.bind(new QName("doc2"), new StreamSource(dataDir + "/test2.xml"));
                
        XPathExecutable executable = factory.prepareXPath("deep-equal($doc1, $doc2)", staticContext);
        XSequenceCursor result = executable.execute(dynamicContext);
        if (result.exportAsList().get(0).getBooleanValue()) {
           System.out.println("deep-equal == true");
        }
        else {
           System.out.println("deep-equal == false");
        }
    }
}

Following are the XML and XML Schema documents used for the above example.

test1.xml

<test>10.00</test>

test2.xml

<test>10</test>

test.xsd

<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://www.w3.org/2001/XMLSchema">
  <element name="test" type="double" />
</schema>

For the above examples, if the schema type of element node "test" is xs:double then both the XML documents above are reported deep-equal (since the values 10 and 10.00 are same double values, and the element node was annotated with schema type xs:double and deep-equal function did a type aware comparison of XML documents). But if say the schema type of element node "test" is xs:string, then the XML documents shown above would be reported not deep-equal.
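The effect of the type annotation can be shown with a tiny Python sketch. This is not the IBM API and not an implementation of deep-equal; the function `text_equal` is a hypothetical helper that only shows why the same two lexical values compare differently under xs:double and xs:string.

```python
def text_equal(a, b, schema_type):
    """Compare two element values the way their type annotation dictates."""
    if schema_type == "xs:double":
        return float(a) == float(b)   # compare in the value space of the type
    return a == b                     # xs:string: compare the characters
```

Here `text_equal("10.00", "10", "xs:double")` is true, while the same call with "xs:string" is false.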

I hope that this post is useful.

Sunday, September 5, 2010

XSD 1.1: Xerces-J implementation updates

Over the past one or two months, there have been a few interesting changes to the Xerces-J XML Schema 1.1 implementation. I feel obliged to share these enhancements with the XML Schema community, and also with folks at Eclipse WTP (where we enhanced a few "schema aware" components of the PsychoPath XPath 2.0 engine, to support these recent Xerces enhancements -- I think we improved the design of typed values of XML element and attribute XDM nodes in the PsychoPath XPath2 engine, in case the XDM node has a type annotation of kind XML Schema simpleType, with varieties list or union).

Here's a summary of the XML Schema 1.1 implementation changes that have recently been completed in Xerces (available in the Xerces SVN repos as of now), and which are planned to be part of the Xerces-J 2.11.0 release, planned to take place during the November 2010 time frame.

1. Xerces-J now has a complete implementation of XML Schema 1.1 conditional inclusion functionality. The Xerces-J 2.10.0 release had implementation of XML Schema 1.1 conditional inclusion vc:minVersion and vc:maxVersion attributes. Xerces-J now supports all of "conditional inclusion" attributes as specified by the XML Schema 1.1 spec. The "conditional inclusion" attributes that are now newly supported in Xerces-J are: vc:typeAvailable, vc:typeUnavailable, vc:facetAvailable and vc:facetUnavailable. All of XML Schema 1.1 built-in types and facets are now supported by Xerces-J related to XML Schema 1.1 "conditional inclusion" components.

2. There are a few interesting changes to the Xerces-J XML Schema 1.1 assertions implementation as well, that are planned to be part of the Xerces-J 2.11.0 release. Xerces now has improved assertions evaluation processing on XML Schema (1.1) simple types, with varieties 'list' and 'union'.

2.1 Enhancements to assertions evaluation on simpleType -> list:

Here's an example of XML Schema 1.1 assertions on an xs:list schema component:
[XML Schema 1]

   <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

      <xs:element name="Example" type="EXAMPLE_LIST" />
   
      <xs:simpleType name="EXAMPLE_LIST">
         <xs:list>
            <xs:simpleType>
               <xs:restriction base="xs:integer">
                  <xs:assertion test="$value mod 2 = 0" />
               </xs:restriction>
            </xs:simpleType>
         </xs:list>
      </xs:simpleType>
   
   </xs:schema>

If an XML instance document has a structure something like following:
[XML 1]
<Example>1 2 3</Example>

And if this XML instance document ([XML 1]) is validated by the above XML schema ([XML Schema 1]), Xerces-J would report error messages like following (assuming the name of XML document was, test.xml):
[Error] test.xml:1:25: cvc-assertion.3.13.4.1: Assertion evaluation ('$value mod 2 = 0') for element 'Example' with type '#anonymous' did not succeed. Assertion failed for an xs:list member value '1'.
[Error] test.xml:1:25: cvc-assertion.3.13.4.1: Assertion evaluation ('$value mod 2 = 0') for element 'Example' with type '#anonymous' did not succeed. Assertion failed for an xs:list member value '3'.

An assertion must evaluate on every 'simpleType -> list' item (which is validated by the itemType of xs:list) in an XML instance document. Xerces now does this, and needed error messages are displayed in case of schema assertion failures.
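The per-item evaluation can be sketched in Python as follows. The function name `failing_list_items` is hypothetical; it only mirrors the idea that the assertion `$value mod 2 = 0` is checked once for every item of the list.

```python
def failing_list_items(lexical_value):
    """Return the list items for which '$value mod 2 = 0' is false."""
    items = lexical_value.split()   # xs:list items are whitespace separated
    return [item for item in items if int(item) % 2 != 0]
```

For the instance value "1 2 3" this returns the items '1' and '3', which are the two member values named in the error messages above.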

2.2 Enhancements to assertions evaluation on simpleType -> union:

Here's an example of XML Schema 1.1 assertions on an xs:union schema component:
[XML Schema 2]

   <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
   
      <xs:element name="Example">
         <xs:simpleType>
            <xs:union memberTypes="MYDATE xs:integer" />
         </xs:simpleType>
      </xs:element>
   
      <xs:simpleType name="MYDATE">
         <xs:restriction base="xs:date">
            <xs:assertion test="$value lt current-date()" />
         </xs:restriction>
      </xs:simpleType>

   </xs:schema>

If an XML instance document has a structure something like following:
[XML 2]
<Example>2010-12-05</Example>

And this instance document is validated by the schema document, [XML Schema 2] the following error message is displayed by Xerces:
[Error] temp.xml:1:30: cvc-assertion.union.3.13.4.1: Element 'Example' with value '2010-12-05' is not locally valid. One or more of the assertion facets on an element's schema type, with variety union, have failed.

Xerces tried to validate the atomic value '2010-12-05' with both of the schema types xs:integer and MYDATE. Since neither of these types could successfully validate this atomic value, and an assertion failed in the process of these validation checks, the relevant assertion failure was reported by Xerces.

If the XML schema, [XML Schema 2] tries to validate the XML instance document:
<Example>10</Example>

no validation failures are reported in this case, since the atomic value '10' conforms to the schema type xs:integer, which results in an overall validation success of the atomic value with a 'union' schema type.
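The union behaviour described above can be sketched in Python. This is a simplified model under stated assumptions: `valid_against_union` is a hypothetical function, xs:date is approximated by an ISO yyyy-mm-dd string, xs:integer by Python's int() parsing, and the current date is passed in explicitly so that the result is repeatable.

```python
from datetime import date

def valid_against_union(value, today):
    """Try the member types MYDATE and xs:integer; one success is enough."""
    # MYDATE: an xs:date (approximated) that must be earlier than `today`.
    try:
        if date.fromisoformat(value) < today:
            return True
    except ValueError:
        pass
    # xs:integer (approximated by int() parsing).
    try:
        int(value)
        return True
    except ValueError:
        return False
```

With `today` set to 2010-09-05, the value '2010-12-05' is rejected by both member types, while '10' is accepted as an xs:integer.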

I'm ending this blog post now. Stay tuned for more news here :)

And I hope, that this post was useful.

Saturday, July 17, 2010

XSD 1.1: XML schema design approaches cotd... PART 3

I'm continuing with the XML Schema design thoughts series, with the third part here. The first two parts are available here:
1) PART 1
2) PART 2

All the examples here have been tested with Xerces-J 2.10.0.

(I'm disclaiming in the beginning, that the examples presented in this blog post are somewhat fictitious and may not serve a real life use-case. These examples are kind of cooked-up to only illustrate XML Schema 1.1 constructs, and some of the design thinking behind them. I also use the phrase "element particles" in a lot of places. This simply means XML elements, but "particles" is a formal term defined by the XML Schema spec, designating XML schema components having minOccurs and maxOccurs attributes -- if minOccurs/maxOccurs attributes are absent, then these have default values for the relevant schema components)

I'm presenting a sample XML Schema 1.1 document with a corresponding XML document first, and then trying to reflect on the inherent design of these examples from my point of view:

XML Schema 1.1 specific constructs are emphasized with a different color.

[XML1]

  <Book>
     <name>XML in a Nutshell</name>
     <ISBN>AB-1001</ISBN>
     <author>Jimmy</author>
     <NoPages>100</NoPages>
  </Book>

[XML Schema 1]

  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    
     <xs:element name="Book">
        <xs:complexType>
          <xs:complexContent>
            <xs:extension base="BOOK_FRAGMENT">
               <xs:openContent>
                 <xs:any processContents="lax" />
               </xs:openContent>
               <xs:assert test="not(* except (name, author, ISBN, NoPages)) and 
                                 (if (ISBN)
                                    then not(ISBN/*) 
                                    else true()) and 
                                 (if (NoPages) 
                                     then (not(NoPages/*) and (NoPages/text() castable as xs:positiveInteger))
                                     else true())" />    
            </xs:extension>
          </xs:complexContent>          
        </xs:complexType>
     </xs:element>
   
     <xs:complexType name="BOOK_FRAGMENT">
        <xs:sequence>
          <xs:element name="name" type="xs:string" />
          <xs:element name="author" type="xs:string" />
        </xs:sequence>
     </xs:complexType>

  </xs:schema>

The following use-case requirements motivated me to write this sample (I'm also trying to reflect on the schema design choices I've made, about which I surely invite comments from the readers -- if you've patience to read this post and respond!):
1. XML Schema 1.0 has a limitation that, when a complex type (having sequence or choice particles) is derived by extension, the derived complex type can only add element particles at the end of the element list (of the base type). Suppose that we want to re-use a complex type (having a sequence of element particles) by deriving it by extension, and need to add additional element particles anywhere in between the elements of the base type. This is what the above XML schema (XML Schema 1) example intends to do; and the above schema does indeed successfully validate the corresponding XML document presented above (XML1).

2. A key design decision in the above schema (XML Schema 1) is to use the XML Schema 1.1 "openContent" instruction (newly introduced in the 1.1 version). The use of XSD 1.1 assertions here is optional, but it is very practical to use them (which I'll try to explain!). An XML schema "openContent" instruction is essentially a wrapper around an xs:any wild-card, producing the same effect as the xs:any wild-card, but with an interleave or a suffix appending behavior (please feel free to read the XML Schema 1.1 spec to learn more about XSD 1.1 open content. Or perhaps if you want a lighter [but brilliant] explanation, you may read Roger L. Costello's XML Schema 1.1 write-up available here).
The XML Schema 1.1 spec defines an "openContent" instruction as following:

  <openContent
     id = ID
     mode = (none | interleave | suffix) : interleave
     {any attributes with non-schema namespace . . .}>
     Content: (annotation?, any?)
  </openContent>

It is an openContent instruction with "interleave" mode (which is the default openContent mode), which enables adding additional element particles interspersed between base type's element particles.

3. In the above example, the XML elements "ISBN" and "NoPages" are added to the base type's element particles. They are not appended at the end of the base type's elements, but can be added anywhere within the resulting XML content model. For this particular example, the placement of the XML elements coming from the derived complex type is arbitrary, and is done only to illustrate the workings of the "openContent" instruction in "interleave" mode.

4. It's interesting to see the benefit of XSD 1.1 assertions here. The assertions here are able to impose certain constraints on the resultant content model (otherwise the content model is kind-of wide open with no restrictions). The assertions in the above schema document (XML Schema 1) mean:
a) The resulting content model can only have XML elements -> "name", "author", "ISBN" and "NoPages".
b) The element "ISBN" needs to be an atomic string value, and the element "NoPages" needs to be an xs:positiveInteger value.
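For readers who want to experiment with constraints (a) and (b) outside a schema validator, here's a rough Python sketch. It is only an approximation of the assertion: the function name `book_assertion_holds` is hypothetical, the xs:positiveInteger check uses a simple regular expression, and at most one "NoPages" element is assumed.

```python
import re
import xml.etree.ElementTree as ET

ALLOWED = {"name", "author", "ISBN", "NoPages"}

def book_assertion_holds(xml_text):
    """Approximate the xs:assert of [XML Schema 1] for a Book element."""
    book = ET.fromstring(xml_text)
    # not(* except (name, author, ISBN, NoPages))
    if any(child.tag not in ALLOWED for child in book):
        return False
    # ISBN and NoPages, when present, must not contain child elements
    for child in book:
        if child.tag in ("ISBN", "NoPages") and len(child) > 0:
            return False
    # NoPages/text() castable as xs:positiveInteger (approximated)
    for pages in book.findall("NoPages"):
        text = (pages.text or "").strip()
        if not re.fullmatch(r"\+?\d+", text) or int(text) <= 0:
            return False
    return True
```

The sketch checks only what the assertion checks. The order of "name" and "author" is still the job of the base type's sequence.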

I'm presenting below another XML schema variant (than the example above -- XML Schema 1), which solves the same problem as described above, but in a slightly different way (with advantages and disadvantages described after the example):

[XML Schema 2]

  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    
      <xs:element name="Book">
         <xs:complexType>
            <xs:complexContent>
               <xs:extension base="BOOK_FRAGMENT">
                  <xs:openContent>
                     <xs:any processContents="strict"/>
                  </xs:openContent>
                  <xs:assert test="count(distinct-values(for $elem in (* except (name, author)) return $elem/name())) = count(for $elem in (* except (name, author)) return $elem/name())"/>       
               </xs:extension>
            </xs:complexContent>          
         </xs:complexType>
      </xs:element>
   
      <xs:complexType name="BOOK_FRAGMENT">
         <xs:sequence>
           <xs:element name="name" type="xs:string"/>
           <xs:element name="author" type="xs:string"/>
         </xs:sequence>
      </xs:complexType>
   
      <xs:element name="ISBN" type="xs:string" />
   
      <xs:element name="NoPages" type="xs:positiveInteger" />

   </xs:schema>

The example XML document for this schema (XML Schema 2) remains same (XML1). Here are the advantages (and unfortunately a little disadvantage as well, with a suggested workaround for the drawback...) of the sample, XML Schema 2:
1. Here we are using the xs:any wild-card with processContents="strict" mode (the earlier example used the wild-card with "lax" mode) and providing the corresponding element declarations in the schema (the last two element declarations). This approach has the advantage that the content model of the elements "ISBN" and "NoPages" is enforced natively by the XML schema engine, and the schema author doesn't have to implement the content model constraints herself/himself (for example, that an element has no child elements and has an atomic value) -- say via assertions. This approach is more robust than trying to achieve a similar effect with assertions.

2. The assertion in schema document, [XML Schema 2] enforces that elements in the sequence could occur only once. This is accomplished by this simple algorithm:
count(distinct-values(names...)) = count(names...)

3. The only drawback I foresee with XML Schema 2 is that the elements "ISBN" and "NoPages" are now global elements (which is necessary for the xs:any wild-card to work with processContents="strict" mode). This implies that the following XML documents would be reported valid as well, by the schema document XML Schema 2:

  <ISBN>AB-1001</ISBN>

AND

  <NoPages>100</NoPages>

This is a side-effect of schema document XML Schema 2, which I myself personally don't seem to like :(

To solve this limitation, I can imagine there could be a workaround as following:
We could perform two validations in sequence. One with the schema document, [XML Schema 2] (let's call this validation V1) and the second one with the following schema document (let's call this validation result V2):

[XML Schema 3]

  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      
      <xs:element name="ISBN" type="xs:string" />
   
      <xs:element name="NoPages" type="xs:positiveInteger" />

  </xs:schema>

This is kind of a little validation pipeline. The complete/end-to-end (which usually means, that this has domain meaning) schema validation succeeds in entirety, if validation V1 succeeds but V2 doesn't (I imagine, that this kind-of pipeline operation could be enforced by a host language, like Java using the XML Schema JAXP APIs).
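The decision rule of this little pipeline is simple enough to write down. In the Python sketch below, the function `pipeline_valid` and its two boolean parameters are hypothetical; they stand for the outcomes of validations V1 and V2, which would come from a real XSD validator.

```python
def pipeline_valid(v1_succeeds, v2_succeeds):
    """Valid overall: accepted by [XML Schema 2] (V1), rejected by [XML Schema 3] (V2)."""
    return v1_succeeds and not v2_succeeds
```

A "Book" document passes V1 and fails V2, so it is accepted. A stand-alone "ISBN" or "NoPages" document passes both validations, so it is rejected.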

Thanks for reading!

I hope that this post is useful.

Sunday, July 11, 2010

XSD 1.1: XML schema design approaches cotd... PART 2

I'm continuing with the XML Schema design approaches series, I started in the previous blog post. Here's the second post in this series.

Here's a description of the use-case I'll be illustrating in this post, with both XML Schema 1.0 and 1.1 examples:

We need to write an XML Schema for the following XML content model:

  colors
    -> (violet | indigo | blue | green | yellow | orange | red)+

Here the words "colors", "violet" etc represent XML elements, and they have no attributes and are empty. The above content model means, that children of element "colors" can repeat and are unordered, and at-least one of them is required.

Therefore following XML document is a valid instance according to this content model:

[XML1]

  <colors>
     <violet/>
     <indigo/>
     <blue/>
     <green/>
     <yellow/>
     <orange/>
     <red/>
  </colors>

AND for example, the following XML document is valid as well, as per the content model described above (here the element "colors" has fewer children than in the previous example, and some of the children of "colors" occur more than once):

[XML2]

  <colors>
     <violet/>
     <indigo/>
     <blue/>
     <green/>
     <green/>
  </colors>

Here are two XML schema examples that express the above XML content model constraints:

[XML Schema 1] (written in XML Schema 1.0)

  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    
     <xs:element name="colors">
        <xs:complexType>
           <xs:choice maxOccurs="unbounded">
              <xs:element name="violet" type="EMPTY" />
              <xs:element name="indigo" type="EMPTY" />
              <xs:element name="blue" type="EMPTY" />
              <xs:element name="green" type="EMPTY" />
              <xs:element name="yellow" type="EMPTY" />
              <xs:element name="orange" type="EMPTY" />
              <xs:element name="red" type="EMPTY" />     
           </xs:choice>
        </xs:complexType>
     </xs:element>
   
     <xs:complexType name="EMPTY"> 
        <xs:complexContent> 
          <xs:restriction base="xs:anyType" /> 
        </xs:complexContent> 
     </xs:complexType>

  </xs:schema>

[XML Schema 2] (written in XML Schema 1.1 -- the 1.1 specific constructs are displayed with a different color)

  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    
     <xs:element name="colors">
        <xs:complexType>
           <xs:sequence>
             <xs:any maxOccurs="unbounded" processContents="lax" />
           </xs:sequence>
           <xs:assert test="every $x in */name() satisfies ($x = 
                              ('violet','indigo','blue','green','yellow','orange','red'))" />
           <xs:assert test="every $x in * satisfies not($x/node())" />
        </xs:complexType>
     </xs:element>

  </xs:schema>
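The two assertions of XML Schema 2 can also be mirrored outside a validator. The following Python sketch is an approximation built on the standard library's ElementTree, not a schema validation; the function name `colors_valid` is hypothetical.

```python
import xml.etree.ElementTree as ET

COLORS = {"violet", "indigo", "blue", "green", "yellow", "orange", "red"}

def colors_valid(xml_text):
    """Approximate the content model and assertions of [XML Schema 2]."""
    root = ET.fromstring(xml_text)
    children = list(root)
    # xs:any has minOccurs=1 by default: at least one child element is needed
    if root.tag != "colors" or not children:
        return False
    # first assertion: every child name is one of the seven colors
    names_ok = all(child.tag in COLORS for child in children)
    # second assertion: every child has neither child elements nor text
    empty_ok = all(len(child) == 0 and child.text is None for child in children)
    return names_ok and empty_ok
```

Repeated children such as two "green" elements are accepted, as in [XML2], because neither assertion restricts how often a color occurs.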

Here's some quick analysis from my point of view, about the differences between the above schema approaches, and if any of the above approaches is better than the other one:
1) "XML Schema 1" is written in a familiar 1.0 style, so people who want to stick with 1.0 can still adopt this technique. We can observe that the first schema is a little more verbose than the second one, which I see as one of the advantages of the second one.

2) If you are comfortable writing XPath 2.0 expressions, then there are virtually unlimited possibilities to express schema validation constraints with XSD 1.1 assertions, which is really a lot of power in the hands of the schema author!

3) Personally speaking, I find the second way of writing the XML schema ("XML Schema 2") a really cool NEW way to express these validation constraints. I'm not suggesting that the 1st way is not good! That technique has great value in its own right, and has stood the test of time. I find the second technique a more natural description from the schema author, to express the logic of the use-case in question.

4) One of the possibilities I now foresee with XML Schema 1.1 is that the schema author could impose quite a few constraints on the xs:any wild-card instruction via assertions (which is particularly useful with the processContents="lax" mode of the xs:any wild-card). A point worth observing is that with the processContents="strict" mode of the xs:any wild-card, assertions are not really useful, because the schema validator would strictly validate the XML element with an element declaration, which must be provided by the schema author to satisfy the processContents="strict" mode of the wild-card (and assertions here would actually interfere with the available element declarations, which in my opinion is not a good design). With the processContents="skip" mode of the xs:any wild-card, assertions would always fail (and the XML instance would become invalid), because the concerned XML elements would be discarded by the XML schema validator, and consequently these elements would not be part of the XPath data-model tree, on which assertions operate.

And needless to mention, Xerces-J handles all the above examples fine!

I hope that this post is useful.

Saturday, July 3, 2010

XSD 1.1: XML schema design approaches in XSD 1.1 world... PART 1

I'm thinking of writing a series of posts only on XML schema design, given the XML Schema 1.1 constructs (writing too many ideas in one post could be boring to read, and could be quite voluminous for one post. I'll try to make sure that these blog posts, starting from this one, have cross-references between them for related issues AND I'll convey in some future blog post when I'm stopping writing this series!). I'll try to reflect on why XML Schema 1.1 is essential for certain use-cases, and where XML Schema 1.0 falls short.

It is possible that there may be a blog post unrelated to this series between these posts. When this series completes, I'll try to summarize the ideas at the end, to make the whole series available as a unit.

I'm disclaiming in the beginning, that any advice offered here may not necessarily be best. Improvements are generally always possible! Any feedback would be great (about the correctness of anything described here, alternative ideas OR anything else).

To start with the 1st post in this series, here's a little background about the use-case I'm describing in the subsequent paragraphs:
I've been reading the book "DB2 pureXML Cookbook: Master the Power of the IBM Hybrid Data Server" [1] recently by Matthias Nicola (a member of DB2 pureXML team). This book describes an example (in chapter 2) as follows.
A physical object could be described by two possible XML content models as follows:

a) Metadata as values, aka Name/Value Pairs (often bad):

  <object type="car">
    <field name="brand" value="Honda" />
    <field name="price" value="5000" />
    <field name="currency" value="USD" />
    <field name="year" value="2002" />
  </object>

b) Metadata as element names (good):

  <car>
    <brand>Honda</brand>
    <price currency="USD">5000</price>
    <year>1996</year>
  </car>

I won't describe here why one of the above XML design approaches is good or bad. This is described well in the book cited above [1], which I would encourage folks to read (the book has some nice explanations about DB2 pureXML as well).

Let's say we want to build an XSD schema, for the XML document (a) above. To start with, one of the design decisions I took is, to define a set of XML Schema types with a hierarchy as following:

  OBJECT
     -> OBJECT_ON_SALE
            -> CAR
            -> BOOK

The other few design decisions I've taken are as following:
1) We'll use XSD 1.1 type-alternatives to select between different schema types, depending on value of the attribute object/@type.
2) We'll use a hierarchy of XSD 1.1 assertion definitions, to enforce certain validation constraints.

The meaning of these design decisions would likely become clear to us, by looking at the XML instance and schema document I'm describing below:

The sample XML instance I propose is as following:
[XML1] (this is same as XML instance "a" above, and is repeated here for convenience)

  <object type="car">
     <field name="brand" value="Honda" />
     <field name="price" value="5000" />
     <field name="currency" value="USD" />
     <field name="year" value="2002" />
  </object>

OR

[XML2]

  <object type="book">
     <field name="title" value="XML in a Nutshell" />
     <field name="author" value="Jimmy" />
     <field name="author" value="Nick" />
     <field name="publisher" value="Prentice Hall" />
     <field name="price" value="15" />
     <field name="currency" value="USD" />
     <field name="year" value="2008" />
  </object>

I propose the following XSD 1.1 schema (the 1.1 specific constructs are highlighted with a different color), that is designed to validate both of above XML instance documents (after which I'll try to analyze few design elements of this schema):

[SCHEMA1]

  <?xml version="1.0" encoding="UTF-8"?>
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
             xmlns:xerces="http://xerces.apache.org">
 
     <xs:element name="object" type="OBJECT">
       <xs:alternative test="@type='book'" type="BOOK" />
       <xs:alternative test="@type='car'" type="CAR" />
     </xs:element>
 
     <xs:complexType name="BOOK">
        <xs:complexContent>
           <xs:extension base="OBJECT_ON_SALE">
              <xs:assert test="field/@name = 'title' and
                               field/@name = 'author' and
                               field/@name = 'publisher' and
                               field/@name = 'year'"
                         xerces:message="For a book the fields title/author/publisher/year are mandatory" />
              <xs:assert test="xs:int(field[@name = 'year']/@value) gt 1900 and 
                               xs:int(field[@name = 'year']/@value) lt 2011"
                         xerces:message="A book's publication year must be between 1900 and 2011" />
              <xs:assert test="count(field[not(@name = 'author')]) = 
                               count(distinct-values(field[not(@name = 'author')]/string(@name)))" 
                         xerces:message="A book can have multiple authors, but none of other fields of a book can occur twice" />
           </xs:extension>
        </xs:complexContent>
     </xs:complexType>
 
     <xs:complexType name="CAR">
        <xs:complexContent>
           <xs:extension base="OBJECT_ON_SALE">
              <xs:assert test="field/@name = 'brand' and
                               field/@name = 'year'" 
                         xerces:message="For a car the fields brand/year are required" />
              <xs:assert test="xs:int(field[@name = 'year']/@value) gt 2000 and 
                               xs:int(field[@name = 'year']/@value) lt 2011" 
                         xerces:message="A car's manufacture year must be between 2000 and 2011" />
              <xs:assert test="count(field) = count(distinct-values(field/string(@name)))" 
                         xerces:message="None of the fields of an object 'car' can occur twice" />
           </xs:extension>
        </xs:complexContent>
     </xs:complexType> 
 
     <xs:complexType name="OBJECT_ON_SALE">
        <xs:complexContent>
          <xs:extension base="OBJECT">
             <xs:assert test="field/@name = ('price','currency')" 
                        xerces:message="An object that can be sold, must have the fields price/currency" />
             <xs:assert test="if (field/@name = 'price') then 
                              (field[@name = 'price']/xs:decimal(@value) gt 0 and
                              field/@name = 'currency')
                              else true()" 
                         xerces:message="If a field price is present, the currency field should exist. The value of price must be greater than 0." />
          </xs:extension>
        </xs:complexContent>
     </xs:complexType>
 
     <xs:complexType name="OBJECT">
       <xs:sequence>
         <xs:element name="field" minOccurs="0" maxOccurs="unbounded">
            <xs:complexType>
               <xs:attribute name="name" type="xs:string" />
               <xs:attribute name="value" type="xs:string" />
            </xs:complexType>
         </xs:element>
       </xs:sequence>
       <xs:attribute name="type" type="xs:string" />    
     </xs:complexType>

  </xs:schema>

The key design elements in the above schema document are the type inheritance hierarchy, the assertions and the type alternatives. The element "object" is validated by the schema type "CAR" or "BOOK". The assertions that apply to the type CAR or BOOK are the assertions declared on that type, plus the assertions inherited from its base types. Which schema type applies to the element "object" is decided by the type-alternative switch, which tests the value of the attribute "type".
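One consequence of the type-alternative switch is worth spelling out: if the attribute "type" is absent, or has a value other than 'book' or 'car', no alternative matches and the element is validated by its declared type OBJECT, which carries no assertions. A minimal sketch of one way to reject such instances, assuming the standard XSD 1.1 type xs:error is used as a default (test-less) alternative. This variant is not part of [SCHEMA1], and I have not run it against Xerces-J for this post:

```xml
<xs:element name="object" type="OBJECT">
   <xs:alternative test="@type='book'" type="BOOK" />
   <xs:alternative test="@type='car'" type="CAR" />
   <!-- Assumption, not in [SCHEMA1]: an alternative without a test attribute
        is the default branch. xs:error has no valid instances, so an
        "object" whose @type is neither 'book' nor 'car' fails validation. -->
   <xs:alternative type="xs:error" />
</xs:element>
```

Without the last alternative, an instance such as an "object" with type="boat" would be accepted as long as it contains only "field" elements.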

When the XML document [XML1] is validated by the schema document [SCHEMA1], the validation succeeds, with no validation errors. (I validated with the latest code-base on Xerces SVN as of today rather than with Xerces-J 2.10.0 itself, because a minor bug that affects this particular example was fixed a few days after Xerces-J 2.10.0 was released.)

Let's try to introduce some data errors in the XML instance document, and see what happens upon validation with the same schema document.

Here's a modified XML instance document:
[XML3]

  <object type="car">
    <field name="price" value="-100" />
    <field name="currency" value="USD" />
    <field name="year" value="1999" />
  </object>

If the above XML document ([XML3], named "test.xml") is validated by the schema document [SCHEMA1], we get the following validation errors with Xerces-J:
[Error] test.xml:5:10: cvc-assertion.failure: Assertion failure. If a field price is present, the currency field should exist. The value of price must be greater than 0.
[Error] test.xml:5:10: cvc-assertion.failure: Assertion failure. For a car the fields brand/year are required.
[Error] test.xml:5:10: cvc-assertion.failure: Assertion failure. A car's manufacture year must be between 2000 and 2011.

If we remove all of the xerces:message attributes from the assertions above, Xerces prints the following error messages for the same scenario:
[Error] test.xml:5:10: cvc-assertion.3.13.4.1: Assertion evaluation ('if (field/@name = 'price') then (field[@name = 'price']/xs:decimal(@value) gt 0 and field/@name = 'currency') else true()') for element 'object' with type 'OBJECT_ON_SALE' did not succeed.
[Error] test.xml:5:10: cvc-assertion.3.13.4.1: Assertion evaluation ('field/@name = 'brand' and field/@name = 'year'') for element 'object' with type 'CAR' did not succeed.
[Error] test.xml:5:10: cvc-assertion.3.13.4.1: Assertion evaluation ('xs:int(field[@name = 'year']/@value) gt 2000 and xs:int(field[@name = 'year']/@value) lt 2011') for element 'object' with type 'CAR' did not succeed

It's up to users to decide which error format suits them. The format without xerces:message prints more error-context information, while the format with xerces:message is useful for printing user-friendly error messages upon assertion failures.

I won't describe in much detail here the domain meaning of the error messages above, or the problem scenario itself. I believe this fictitious problem domain is simple enough for these examples to be understood.
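For reference, here is a sketch of how [XML3] could be repaired so that all three failing assertions hold. The values are hypothetical (in particular the brand name), and the comments tie each change to the assertion it satisfies:

```xml
<object type="car">
  <!-- added: the CAR type requires the fields brand and year -->
  <field name="brand" value="Acme" />
  <!-- changed from -100: the price must be greater than 0 -->
  <field name="price" value="100" />
  <field name="currency" value="USD" />
  <!-- changed from 1999: a car's year must be gt 2000 and lt 2011 -->
  <field name="year" value="2005" />
</object>
```

No field name occurs twice in this instance, so the uniqueness assertion on the type CAR is satisfied as well.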

I'll end this post here. I'll take up the case of the schema type "BOOK" in a future blog post (I imagine the concepts I'm trying to illustrate here for the domain object CAR would be similar for the object BOOK).

I'll try to write a few more XSD 1.1 examples in subsequent posts!

I hope that this post is useful.

Saturday, February 6, 2010

PsychoPath XPath2 processor update: fn:name() function fix

While writing the following blog post, http://mukulgandhi.blogspot.com/2010/01/xsd-11-wild-cards-in-compositor-and.html (dated Jan 31, 2010) [1], I unearthed a bug in the PsychoPath XPath 2 processor, whereby the XPath2 fn:name() function didn't evaluate properly when called with zero arguments (it raised a "context undefined" exception, even when a context item existed).

This bug led me to use the fn:local-name() function (whose implementation was correct) instead, in the above mentioned blog post [1].

The good news is that this bug with the fn:name() function is now fixed (ref, https://bugs.eclipse.org/bugs/show_bug.cgi?id=301539).

For the example given in the blog post [1], the XSD 1.1 assertion could now also be written as follows:

  <xs:assert test="(*[1]/name() = ('fname', 'lname')) and 
                 (*[2]/name() = ('fname', 'lname'))" />

(instead of using the "local-name" function, as in the mentioned blog post [1])
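The two functions are interchangeable here only when the child elements carry no namespace prefix. A small sketch of the difference, using a hypothetical instance (the element "person", the prefix "p" and its namespace URI are made up for illustration, and are not taken from the blog post [1]):

```xml
<!-- Hypothetical instance: the children are in a namespace bound to prefix "p" -->
<p:person xmlns:p="http://example.org/person">
  <p:fname>Mukul</p:fname>
  <p:lname>Gandhi</p:lname>
</p:person>
<!-- For the first child element:
       name()        returns "p:fname" (the name as written, prefix included)
       local-name()  returns "fname"
     so a comparison with ('fname', 'lname') succeeds only with local-name() -->
```

For unprefixed elements, both functions return the same string, which is why either form of the assertion works for the example in [1].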

Mukul Gandhi
