XML Infoset, RDF and Data Modelling

Friday, 28 May 2004

I’ve been talking with a few people about my previous assertion that the Infoset is a bad abstraction for data modelling, and my subsequent post about the informational properties of the Infoset.

The feedback has been positive, especially regarding the notion that the Infoset offers great tools for document markup, but presents more problems than solutions when directly used in non-markup applications; i.e., those that are data-oriented.

The best examples of the kind of unneeded complexity I’m talking about are XML’s implicit ordering of element children, even when it has no significance, and the inflexibility of the Infoset itself; instead of adding your own properties and node types, you have to stuff all of your interesting data into the properties’ values (see the Informational Properties entry for the full scoop).
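
To make the ordering point concrete, here is a minimal sketch in Python. It assumes that the standard library’s ElementTree is a fair stand-in for an Infoset-style tree (it keeps children in document order), and the person, name and email elements are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Two hypothetical records that differ only in the order of their children.
a = ET.fromstring("<person><name>Ann</name><email>ann@example.org</email></person>")
b = ET.fromstring("<person><email>ann@example.org</email><name>Ann</name></person>")


def ordered_children(elem):
    """Children as the tree presents them: an ordered list."""
    return [(child.tag, child.text) for child in elem]


def unordered_properties(elem):
    """The same children, treated as an unordered set of properties."""
    return frozenset(ordered_children(elem))


# The tree model sees two different documents...
same_as_tree = ordered_children(a) == ordered_children(b)
# ...while a data-oriented reading sees the same record.
same_as_data = unordered_properties(a) == unordered_properties(b)

print(same_as_tree, same_as_data)  # prints: False True
```

An application that only cares about the record has to throw the ordering away itself, every time; the model won’t do it for you.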

This isn’t to say that XML isn’t useful for serialising data, but it does call into question the benefits of using the Infoset to model it, in terms of describing its shape or binding it to code. Some other abstraction is needed.

Parallel Stacks

Following these thoughts, it seems reasonable to look for alternate data models that can still be serialised into XML. It turns out we don’t have to go far to find an example.

Ignore, for a moment, the greater vision of the Semantic Web (i.e., “open world” systems, rich inference, Web of Trust, etc.), and concentrate on the core mechanisms. In this light, RDF can easily be viewed as a standard data model*.
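
As a rough illustration of what “RDF as a plain data model” can mean, the sketch below represents a graph as a set of (subject, predicate, object) triples in Python. No RDF library is used; the example.org URIs and the match helper are assumptions made up for this example, not part of any standard.

```python
# Hypothetical namespace; any URI prefix would do.
EX = "http://example.org/"

# A graph is just an unordered set of (subject, predicate, object) statements.
graph = {
    (EX + "ann", EX + "name", "Ann"),
    (EX + "ann", EX + "knows", EX + "bob"),
    (EX + "bob", EX + "name", "Bob"),
}


def match(graph, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return {
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    }


# Adding a new kind of property needs no change to the model itself.
graph.add((EX + "ann", EX + "email", "ann@example.org"))

names = match(graph, p=EX + "name")
about_ann = match(graph, s=EX + "ann")
```

There is no ordering to ignore here, and new properties are simply new statements, which is the contrast with the Infoset that matters for data.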

From this perspective, the W3C is, intentionally or not, developing two parallel stacks of standards, built on two separate data models. Both of them happen to be primarily serialised as XML, but that’s where the similarity ends. To wit:

Data Model	XML Infoset	RDF
Primary Serialisation	XML 1.0	RDF/XML
Alternate Serialisations	XOP, ASN.1 (w/ PER, BER, etc.)	n3, N-Triples
Schema Language	XML Schema	RDF Schema / OWL
Transformation Language	XSLT	rules
Query Language	XPath, XML Query	(watch this space)

As you can see, each stack offers a standard means of performing common tasks that you can leverage in an application. Query is a notable omission for RDF, but there are a number of non-standard offerings, and I believe this situation might change soon.

Babies, Bathwater and Better Mousetraps

In the past few years, all of the industry’s attention has been focused on the left-hand stack, because XML was such a great leap above what preceded it. That’s great.

The right-hand stack hasn’t been noticed as much, because it’s linked to the Semantic Web, which still has a ways to go before it reaches its stated goals. That’s our collective loss. I think it’s appropriate to look at it again, but with more modest goals for right now.

The headaches we’re running into when we use the Infoset stack are a result of its complexity, and of its having been designed for a different task; talk to anybody about the practicalities of XML Schema and you’ll see this.**

What does this mean in the real world? It’s not realistic to try to swap stacks at this stage, but for starters, it might mean that we’ll start seeing RDF Schema in the types section of WSDL files, alongside XML Schema.

<cake><have/><eat/></cake>

To do that, we need to enable people to choose the model and associated tools they use for a particular task; in other words, to map data to both models, to enable the use of all of the tools on it, from either stack.

This can be achieved by defining a mechanism that allows documents described and constrained by XML Schema to also have an RDF schema. GRDDL is the latest answer to this problem, and while it’s a step in the right direction, it doesn’t go far enough: it allows you to extract RDF statements from an XML instance, but it doesn’t go the other way. You can’t serialise an RDF graph as XML by looking at a GRDDL transformation.
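
To show what “going both ways” could look like, here is a toy sketch in Python. It is not GRDDL (which points at a transformation, typically XSLT), and it is not the approach hinted at in this post; the MAPPING structure, the function names and the example.org URIs are all hypothetical. The only idea it illustrates is that a declarative mapping, unlike a one-way transformation, can be read in either direction.

```python
import xml.etree.ElementTree as ET

EX = "http://example.org/"  # hypothetical namespace

# Hypothetical declarative mapping between one XML shape and RDF predicates.
MAPPING = {
    "root": "person",        # element that represents the resource
    "subject_attr": "id",    # attribute holding the subject URI
    "properties": {"name": EX + "name", "email": EX + "email"},
}


def xml_to_triples(xml_text, mapping):
    """Extract (subject, predicate, object) triples from an XML instance."""
    elem = ET.fromstring(xml_text)
    subject = elem.get(mapping["subject_attr"])
    props = mapping["properties"]
    return {(subject, props[c.tag], c.text) for c in elem if c.tag in props}


def triples_to_xml(triples, mapping):
    """Serialise triples about a single subject back into the XML shape."""
    subjects = {s for s, _, _ in triples}
    if len(subjects) != 1:
        raise ValueError("this sketch handles exactly one subject")
    tag_for = {pred: tag for tag, pred in mapping["properties"].items()}
    elem = ET.Element(mapping["root"], {mapping["subject_attr"]: subjects.pop()})
    for _, pred, obj in sorted(triples):  # sorted only to get stable output
        ET.SubElement(elem, tag_for[pred]).text = obj
    return ET.tostring(elem, encoding="unicode")


doc = ('<person id="http://example.org/ann">'
       "<name>Ann</name><email>ann@example.org</email></person>")
triples = xml_to_triples(doc, MAPPING)
rebuilt = triples_to_xml(triples, MAPPING)

# The element order may change, but the statements survive the round trip.
round_trip_ok = xml_to_triples(rebuilt, MAPPING) == triples
```

A real mechanism would have to deal with nesting, multiple resources, datatypes and namespaces, none of which this sketch attempts.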

I’ve had some ideas about this in the past, but now I’m thinking that there’s a better approach. I won’t say more than that until I write some code to prove it.

* Remember that RDF was originally in the Metadata Activity, standing for Resource Description Framework (although it has moved far from its roots).

** It’s interesting to think about SOAP Encoding in this light; neither the format itself nor the motivation behind it is bad; in fact, we really need something that allows people to express simple graphs in XML. The problem that led WS-I to get rid of SOAP Encoding was XML Schema’s inability to describe it.

Mark Nottingham
