XML Infoset, RDF and Data Modelling

Friday, 28 May 2004

I’ve been talking with a few people about my previous assertion that the Infoset is a bad abstraction for data modelling, and my subsequent post about the informational properties of the Infoset.

The feedback has been positive, especially regarding the notion that the Infoset offers great tools for document markup, but presents more problems than solutions when directly used in non-markup applications; i.e., those that are data-oriented.

The best examples of the kind of unneeded complexity I’m talking about are XML’s implicit ordering of element children, even when it has no significance, and the inflexibility of the Infoset itself: instead of adding your own properties and node types, you have to stuff all of your interesting data into the properties’ values (see the Informational Properties entry for the full scoop).
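To illustrate the ordering point, here is a minimal Python sketch. It uses the standard library's ElementTree as a stand-in for an Infoset-style tree (it is not a full Infoset implementation), and the `person` document and helper names are made up for the example. Two documents carrying the same data compare as different once child order is part of the model, and as equal once it is not.

```python
import xml.etree.ElementTree as ET

# Two hypothetical documents with the same data, children in a different order.
A = "<person><name>Ann</name><email>ann@example.org</email></person>"
B = "<person><email>ann@example.org</email><name>Ann</name></person>"

def child_sequence(xml_text):
    """Children as an ordered list, the way a tree model exposes them."""
    return [(child.tag, child.text) for child in ET.fromstring(xml_text)]

def child_set(xml_text):
    """The same children as an unordered set of (property, value) pairs."""
    return frozenset(child_sequence(xml_text))

# Order is significant in the tree view, but not in the set view.
ordered_equal = child_sequence(A) == child_sequence(B)
unordered_equal = child_set(A) == child_set(B)
print(ordered_equal, unordered_equal)
```

An application that only cares about the data has to throw the ordering away itself, which is the kind of extra work being described here.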

This isn’t to say that XML isn’t useful for serialising data, but it does call into question the benefits of using the Infoset to model it, in terms of describing its shape or binding it to code. Some other abstraction is needed.

Parallel Stacks

Following these thoughts, it seems reasonable to look for alternate data models that can still be serialised into XML. It turns out we don’t have to go far to find an example.

Ignore, for a moment, the greater vision of the Semantic Web (i.e., “open world” systems, rich inference, Web of Trust, etc.), and concentrate on the core mechanisms. In this light, RDF can easily be viewed as a standard data model*.

From this perspective, the W3C is, intentionally or not, developing two parallel stacks of standards, built on two separate data models. Both of them happen to be primarily serialised as XML, but that’s where the similarity ends. To wit:

Data Model	XML Infoset	RDF
Primary Serialisation	XML 1.0	RDF/XML
Alternate Serialisations	XOP, ASN.1 (w/ PER, BER, etc.)	n3, N-Triples
Schema Language	XML Schema	RDF Schema / OWL
Transformation Language	XSLT	rules
Query Language	XPath, XML Query	(watch this space)

As you can see, each stack offers a standard means of performing common tasks that you can leverage in an application. Query is a notable omission for RDF, but there are a number of non-standard offerings, and I believe this situation might change soon.
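To give a feel for what query over the RDF model amounts to, here is a small Python sketch of triple-pattern matching. It does not follow any standard or proposed query language; the `ex:` names, the `GRAPH` data and the `match` function are all assumptions made for illustration.

```python
# A tiny graph as a set of (subject, predicate, object) triples.
# All names here are made up for illustration.
GRAPH = {
    ("ex:ann", "ex:name", "Ann"),
    ("ex:ann", "ex:knows", "ex:bob"),
    ("ex:bob", "ex:name", "Bob"),
}

def match(graph, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )

# Everything that has a name, and everyone Ann knows.
names = match(GRAPH, p="ex:name")
known_by_ann = [o for (_, _, o) in match(GRAPH, s="ex:ann", p="ex:knows")]
print(names)
print(known_by_ann)
```

Because the graph is just a set of statements, the pattern never has to say where in a document a value lives, which is the main contrast with a path-based language like XPath.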

Babies, Bathwater and Better Mousetraps

In the past few years, all of the industry’s attention has been focused on the left-hand stack, because XML was such a great leap above what preceded it. That’s great.

The right-hand stack hasn’t been noticed as much, because it’s linked to the Semantic Web, which still has a ways to go before it reaches its stated goals. That’s our collective loss. I think it’s appropriate to look at it again, but with more modest goals for right now.

The headaches we’re running into when we use the Infoset stack are a result of its complexity and of its being designed for a different task; talk to anybody about the practicalities of XML Schema and you’ll see this.**

What does this mean in the real world? It’s not realistic to try to swap stacks at this stage, but for starters, it might mean that we’ll start seeing RDF Schema in the types section of WSDL files, alongside XML Schema.

<cake><have/><eat/></cake>

To do that, we need to enable people to choose the model and associated tools they use for a particular task; in other words, to map data to both models, to enable the use of all of the tools on it, from either stack.

This can be achieved by defining a mechanism that allows documents described and constrained by an XML Schema to also have an RDF Schema. GRDDL is the latest answer to this problem, and while it’s a step in the right direction, it doesn’t go far enough: it allows you to extract RDF statements from an XML instance, but it doesn’t go the other way. You can’t serialise an RDF graph as XML by looking at a GRDDL transformation.
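The one-way nature of extraction can be sketched in a few lines of Python. This is not GRDDL itself, only a hand-written stand-in for the kind of mapping a transformation performs; the `order` document, the `http://example.org/` URIs and the `extract_triples` function are all hypothetical.

```python
import xml.etree.ElementTree as ET

BASE = "http://example.org/terms#"  # hypothetical vocabulary namespace

# A hypothetical XML instance; the id attribute is assumed to name the resource.
ORDER = """
<order id="http://example.org/orders/1">
  <customer>Ann</customer>
  <total>42.00</total>
</order>
"""

def extract_triples(xml_text):
    """Map each child element to a (subject, predicate, object) statement."""
    root = ET.fromstring(xml_text)
    subject = root.get("id")
    return {
        (subject, BASE + child.tag, (child.text or "").strip())
        for child in root
    }

triples = extract_triples(ORDER)
for triple in sorted(triples):
    print(triple)
```

Nothing in the resulting set of statements records the element order, the name of the root element or the choice between attributes and elements, so the mapping cannot simply be run backwards to produce the XML again.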

I’ve had some ideas about this in the past, but now I’m thinking that there’s a better approach. I won’t say more than that until I write some code to prove it.

* Remember that RDF was originally in the Metadata Activity, standing for Resource Description Framework (although it’s moved far from its roots).
** It’s interesting to think about SOAP Encoding in this light; neither the format itself nor the motivation behind it is bad. In fact, we really need something that allows people to express simple graphs in XML. The problem that led WS-I to get rid of SOAP Encoding was XML Schema’s inability to describe it.

5 Comments

Bill de hora said:

Bravo.

Regarding this:

[[[ it might mean that we’ll start seeing RDF Schema in the types section of WSDL files, alongside XML Schema ]]]

DAML-S was being targeted at WSDL once upon a time, but it’s mostly defunct now. OWL is more likely to be used than RDFS for this work. Pessimistically, I could imagine a BPEL being kluged into WSDL also.

Saturday, May 29 2004 at 11:07 AM

Danny said:

I’m very much looking forward to hearing your ideas re. the inverse of GRDDL. But I do think the parallels you draw could be misleading. The two sides work on very different levels. The model described by the Infoset isn’t much more than a grammar, the model described by RDF has a full logical formalism. It’s not unlike comparing Unicode with Java (very crude analogy, I know).

Ok, in current applications there’s a lot of common ground - data expressed in RDF/XML that could equally be expressed in a custom XML language. But a bunch of Java source and a bunch of Unicode might look (and be) identical, though there is a qualitative difference.

So although there is very much a value in using these two tracks together (more cake!!), I don’t think simple substitution e.g. XML Schema/RDF Schema makes all that much sense. I think the difference will become a lot more visible once inference engines get more deployment, and declarative programming with RDF/OWL starts replacing some of the current hardwiring in systems.

On specifics, I’m not entirely sure that rules makes a good parallel for XSLT, a graph-based approach (coming soon…) would be closer. Which leaves a gap in the parallel back from rules/inference. Re. query - HP’s RDQL seems to be getting a lot of support, could be heading for de facto standardisation.

re. Bill’s comment - OWL-S, the direct descendent of DAML-S is quite strongly associated with WSDL (that’s how it’s grounded).

{btw, I’ve got one or two plans involving sparta.py, I’ll let you know if anything comes of them}

Saturday, May 29 2004 at 11:57 AM

Dan Brickley said:

Interesting discussion. I looked into SOAP Encoding / RDF mapping a bit, btw. The models are nearly isomorphic, ie. SOAP Encoding can be treated as an RDF syntax, kinda.

I don’t think the open-world aspect of RDF can be entirely set aside though. People who come to RDF from an XML/SQL/etc background often have expectations of the schema language aspect which cause frustration. Specifically that RDF vocabularies don’t say anything about what must appear in some kind of document. Rather, they help you interpret whatever vocabulary does happen to get used within a document instance. Just cos stuff is missing, doesn’t mean the instance is broken, from an RDF perspective. (http://rdfweb.org/mt/foaflog/archives/2003/07/24/12.22.48/ etc).

So I think this can be problematic for messaging oriented apps sometimes, where they really care about the information payload of a message/document. They don’t just care that PurchaseOrders have ShippingAddresses; they want to see a properly filled out ShippingAddress with all the right bits of info there, in the message/document.

A lot of RDF ‘vs’ XML discussion has focussed on the data model aspect, trees vs graphs, ordering vs unordered. And that has been useful. But the underlyingly different philosophies, esp the open world design of RDF, are also important, and I’m not sure they can be set aside since they are strongly bound up with vocab/namespace design and re-use.

http://swordfish.rdfweb.org/discovery/2004/07/validation/ is somewhat interesting on this front, and I’m moderately optimistic that a piece of tech could be built (on top of the DA WG’s new RDF query language, perhaps) that allows such document-oriented constraints to be expressed. That would go a long way towards balancing RDF’s “anything goes” approach with a more XMLish concern for the contents of specific documents, and validation based on the presence/absence of information…

Friday, August 6 2004 at 2:42 AM

Mark Nottingham said:

Good points, Dan. My take is that open-world and closed-world applications will both be necessary for quite some time. Closed-world is more limited, but it is much easier for most people to understand, at least at this point in time, and AFAIK RDF doesn’t preclude them.

Friday, August 6 2004 at 2:49 AM

Phil Phoenix said:

I can sympathize with the frustration of RDF not providing constraints. Yet syntactical constraint is not part of the RDF goals.

Just as “separation of concerns” has value in system design, it also has value here.

What I consider correct syntax for the address in a PurchaseOrder might be totally different from someone else’s point of view. This can be the case even if we agree what a purchase order resource is.

Obviously, it would be best if we agreed on taxonomy and syntax … still, separating the taxonomy from the syntax constraints gives us more flexibility.

To me, this implies the utilization of specific standards to address taxonomy and syntax separately. Perhaps rules languages applied in conjunction with RDF … i.e., SWRL?

Friday, November 19 2004 at 10:35 AM

Mark Nottingham
