Getting Rid of QNames in Content

Tuesday, 14 June 2005

Or, What’s Wrong with XInclude?

QNames are evil (at least in content), so I never really liked the WSDL convention of using them to name and refer to constructs. It makes much more sense to refer to things on the Web as TimBL intended: using URIs.

Using URIs (including fragment identifiers) to refer to portions of documents is an intuitive, scalable, and much less intrusive way to modularise XML formats. XML Base, being unevil, gives you a short form, so you don't have to spell out the whole URI or resort to QNames.

The W3C has already standardised a way to do this in XInclude, but in practice, it’s hard to find many formats that use it, or even encourage its use.

Why? It’s hard to tell, but my guess is that the overhead of a new element is too intrusive for some tastes, and people don’t like using an inclusion mechanism for logical reference. I know that’s a weak answer, but looking at all the formats out there that invent their own referencing mechanism, it’s pretty clear that something’s wrong.

What’s needed, then, is a less intrusive, URI-based way of referencing other parts of an XML document (as well as other documents). This is pretty obvious stuff, but there are some subtleties, so here’s my take.

A Pattern for Modularity in XML Formats

Imagine you have a purchase order format.

<PurchaseOrders>
  <Order id="123">
    <Customer id="abcdef">
      <name>WidgetCo, Inc</name>
      <contact>Bob</contact>
    </Customer>
  </Order>
...
</PurchaseOrders>

This is fine for a small order that you only see once, but if you get a number of orders, you probably don’t want to repeat that customer information each time, and you certainly don’t want the headache of updating it if Bob leaves the company.

Instead of inventing a new way to refer to and compose portions of your document (like WSDL does), you can declare that @id is of type ID, which lets processors (such as XSLT) resolve a URI fragment identifier like #abcdef to that element. Other parts of the document (or other documents) can then point at it, knowing that such software will be able to follow the reference easily.

To actually reference one of those IDs, I use a language-specific mechanism, like @ref here:

<!DOCTYPE PurchaseOrders [<!ATTLIST Customer id ID #IMPLIED>]>
<PurchaseOrders>
  <Order id="123">
    <Customer ref="#abcdef"/>
  </Order>
...
</PurchaseOrders>

Then, you can have a document full of customer definitions, or one document per customer, to point into:

<!DOCTYPE Customer [<!ATTLIST Customer id ID #REQUIRED>]>
<Customer id="abcdef">
  <name>WidgetCo, Inc</name>
  <contact>Bob</contact>
</Customer>
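
Pointing from an order into a separate customer document then just means putting a relative URI in front of the fragment. Here's a sketch, assuming a hypothetical customers.xml and a made-up example.com base URI; whether @ref is actually resolved against xml:base is up to the format's specification and the processor in use:

<PurchaseOrders xml:base="http://example.com/data/">
  <Order id="123">
    <!-- with the base above, this resolves to
         http://example.com/data/customers.xml#abcdef -->
    <Customer ref="customers.xml#abcdef"/>
  </Order>
</PurchaseOrders>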

The cool part is that this works well in XSLT, using a template like this:

<xsl:template match="Customer">
  <xsl:choose>
    <xsl:when test="@ref">
      <xsl:apply-templates select="document(@ref, /)"/>
    </xsl:when>
    <xsl:otherwise>
      <!-- element-specific processing here -->
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
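
One caveat: XSLT 1.0 leaves the meaning of a fragment identifier in document() up to the media type, so some processors may not resolve it. A more portable variant, sketched here on the assumptions that every @ref contains a "#" and that the target document's DTD declares @id as type ID, splits the reference and uses id() for the fragment:

<xsl:template match="Customer[@ref]">
  <!-- split e.g. "customers.xml#abcdef" into document URI and fragment -->
  <xsl:variable name="uri" select="substring-before(@ref, '#')"/>
  <xsl:variable name="frag" select="substring-after(@ref, '#')"/>
  <!-- for-each makes the target document the context, so id() searches there;
       per XSLT 1.0, an empty $uri refers to the document the reference is
       resolved against, which here is the one containing @ref -->
  <xsl:for-each select="document($uri, .)">
    <xsl:apply-templates select="id($frag)"/>
  </xsl:for-each>
</xsl:template>

The referenced Customer element carries no @ref of its own, so it falls through to whatever template handles ordinary customers.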

This is a much less intrusive way to refer to other documents, while still using URIs.
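
For comparison, here is roughly what the same order would look like with XInclude, again assuming a hypothetical customers.xml whose @id is declared as type ID so that the shorthand pointer can find it:

<PurchaseOrders xmlns:xi="http://www.w3.org/2001/XInclude">
  <Order id="123">
    <!-- an XInclude processor replaces this element with a copy of the
         Customer element whose ID is "abcdef" -->
    <xi:include href="customers.xml" xpointer="abcdef"/>
  </Order>
</PurchaseOrders>

Note the extra namespace and element, and that the result is a copy of the customer rather than a pointer to it.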

The only downside is that you can’t really capture the semantics of this in XML Schema. I don’t see that as a deal breaker; if you stopped every time XML Schema wasn’t able to capture your application’s semantics, you wouldn’t get anywhere!

That said, it would be nice if Schema did; it would allow you to describe graphs in XML much more easily.

Another nice-to-have would be more widespread support for @xml:id; it would remove the need to declare the type of @id in the DTD. Maybe while they’re at it, XML Core could standardise @xml:ref too, making this pattern standard, rather than a language-specific mechanism.
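
With an xml:id-aware processor (an assumption; support can't be taken for granted), the customer document would need no DOCTYPE at all, since the xml prefix is predeclared and the attribute is recognised as an ID by name:

<Customer xml:id="abcdef">
  <name>WidgetCo, Inc</name>
  <contact>Bob</contact>
</Customer>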

It isn’t that I don’t like XInclude; in many ways, it’s better than this pattern, because the delineation between the composition mechanism and the underlying data model is cleaner (although that’s still somewhat subjective; a standardised @xml:ref would take care of that), and XInclude is very well-specified. It’s just that it doesn’t seem to be catching on; WSDL alone has at least two reference mechanisms.

Mark Nottingham
