Separating the Data Model from its Serialisation

Wednesday, 10 August 2005

For some time, I’ve noticed that people defining XML formats spend an inordinate amount of time talking about the structure of the format. This is especially apparent in standards working groups, where hours — no, days — can be spent agonizing over whether to make something an attribute or an element.

Part of this is obviously stylistic; people have different thoughts on what makes good XML, and they’re fighting the same battles over and over again. I’ve often thought that an “XML Style” Working Group that set up best practices for certain situations would help (and indeed, RFC 3470 goes some way towards this, for the IETF).

A bigger part of the problem, though, arises when people conflate the data model they’re working with and the syntax they use to represent it. Unfortunately, in some corners of the industry, XML-as-religion has caught on, and everything’s Infoset, Infoset, Infoset.

When this happens, it becomes difficult to separate the problems people have with the underlying data model, which is (or should be) shared among all use cases, from the syntactic conveniences and optimisations that are useful only in a particular use case.

This is because people very often want to do slightly different things with the format, and serialising the data model into bits unavoidably makes some of those things easier, while making others more difficult, depending on how you do it. If you first tackle the data model, you can get it out of the way and then figure out if you need one or more than one serialisation of it. Otherwise, the constant changes to both your data model and its serialisation at the same time (because if the Infoset is your data model, it’s both) make it hard to progress.

Therefore, my preferred approach is to document an abstract, task-neutral data model first, and then talk about how that gets serialised into bits (possibly in several ways, if you have use cases that require different things). For my purposes, the first can be accomplished well with RDF Schema and OWL, which offer a much cleaner, more usable and more capable data model than the Infoset, and the second can then be done as a mapping to the Infoset.
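
To make that concrete, here is a minimal sketch in Python. It is only an illustration: a plain dataclass stands in for the abstract model (it is not RDF Schema or OWL), and the Entry type, its fields, and the function names are all hypothetical. The point it tries to show is that the attribute-versus-element argument only touches the serialisation functions, while the model itself stays put.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass(frozen=True)
class Entry:
    """Hypothetical abstract model: what an entry is, with no syntax implied."""
    identifier: str
    title: str
    updated: str  # an ISO 8601 date, kept as a string for brevity


# (XML name, model field) pairs shared by both serialisations.
FIELDS = (("id", "identifier"), ("title", "title"), ("updated", "updated"))


def to_xml_attributes(entry: Entry) -> str:
    """Serialisation A: every field becomes an attribute."""
    attrs = {xml_name: getattr(entry, field) for xml_name, field in FIELDS}
    return ET.tostring(ET.Element("entry", attrs), encoding="unicode")


def to_xml_elements(entry: Entry) -> str:
    """Serialisation B: every field becomes a child element."""
    root = ET.Element("entry")
    for xml_name, field in FIELDS:
        ET.SubElement(root, xml_name).text = getattr(entry, field)
    return ET.tostring(root, encoding="unicode")


def from_xml(text: str) -> Entry:
    """Map either serialisation back onto the one shared model."""
    root = ET.fromstring(text)
    values = {}
    for xml_name, field in FIELDS:
        # Accept the attribute form first, then fall back to the element form.
        value = root.attrib.get(xml_name)
        if value is None:
            value = root.findtext(xml_name)
        values[field] = value
    return Entry(**values)


if __name__ == "__main__":
    entry = Entry("urn:example:1", "Hello", "2005-08-10")
    print(to_xml_attributes(entry))
    print(to_xml_elements(entry))
```

Both outputs parse back to the same Entry, so two parties can disagree about syntax and still agree on what is being said.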

Just food for thought.

P.S. I’ve heard it said that you don’t need a common data model to interoperate; that “syntax on the wire” is enough. What utter rubbish. At most, you might have two different models derived from a common one, but they still have commonalities.

Mark Nottingham

