mark nottingham

Separating the Data Model from its Serialisation

Wednesday, 10 August 2005

XML

For some time, I’ve noticed that people defining XML formats spend an inordinate amount of time talking about the structure of the format. This is especially apparent in standards working groups, where hours — no, days — can be spent agonizing over whether to make something an attribute or an element.

Part of this is obviously stylistic; people have different thoughts on what makes good XML, and they fight the same battles over and over again. I’ve often thought that an “XML Style” Working Group that set up best practices for certain situations would help (and indeed, RFC 3470 goes some way towards this, for the IETF).

A bigger part of the problem, though, is when people conflate the data model they’re working with and the syntax they use to represent it. Unfortunately, in some corners of the industry, XML-as-religion has caught on, and everything’s Infoset, Infoset, Infoset.

When this happens, it becomes difficult to separate problems that people have with the underlying data model that is (or should be) shared among all use cases, and the syntactic conveniences and optimisations that are useful in a particular use case.

This is because people very often want to do slightly different things with the format, and serialising the data model into bits unavoidably makes some of those things easier, while making others more difficult, depending on how you do it. If you first tackle the data model, you can get it out of the way and then figure out if you need one or more than one serialisation of it. Otherwise, the constant changes to both your data model and its serialisation at the same time (because if the Infoset is your data model, it’s both) make it hard to progress.

Therefore, my preferred approach is to document an abstract, task-neutral data model first, and then talk about how that gets serialised into bits (possibly in several ways, if you have use cases that require different things). For my purposes, the first can be accomplished well with RDF Schema and OWL, which offer a much cleaner, more usable and more capable data model than the Infoset; the second can then be done as a mapping to the Infoset.
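
As a rough illustration of that split, here is a minimal Python sketch. It makes several assumptions that are not from this post: a plain dataclass stands in for a model that would really be written in RDF Schema and OWL, and the `Entry` type, its `title` and `author` properties, and the two XML layouts are all hypothetical. The point is only that the attribute-versus-element choice lives entirely in the serialisers, while the model stays the same.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass(frozen=True)
class Entry:
    """Hypothetical task-neutral model: what an entry is, not how it is written."""
    title: str
    author: str


def to_attribute_xml(entry: Entry) -> str:
    """Serialisation A: properties written as attributes (compact)."""
    elem = ET.Element("entry", {"title": entry.title, "author": entry.author})
    return ET.tostring(elem, encoding="unicode")


def to_element_xml(entry: Entry) -> str:
    """Serialisation B: properties written as child elements."""
    elem = ET.Element("entry")
    ET.SubElement(elem, "title").text = entry.title
    ET.SubElement(elem, "author").text = entry.author
    return ET.tostring(elem, encoding="unicode")


def from_xml(text: str) -> Entry:
    """Map either serialisation back onto the one shared model."""
    elem = ET.fromstring(text)

    def prop(name: str) -> str:
        # Try the attribute form first, then fall back to a child element.
        value = elem.get(name)
        if value is None:
            value = elem.findtext(name)
        if value is None:
            raise ValueError(f"missing property: {name}")
        return value

    return Entry(title=prop("title"), author=prop("author"))


if __name__ == "__main__":
    entry = Entry(title="Hello", author="Mark")
    for serialise in (to_attribute_xml, to_element_xml):
        text = serialise(entry)
        print(text)
        # Both wire formats round-trip to an equal model instance.
        assert from_xml(text) == entry
```

With the model pinned down first, an argument over attributes versus elements becomes an argument about one serialiser, and a second serialisation can be added for a different use case without reopening the model.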

Just food for thought.

P.S. I’ve heard it said that you don’t need a common data model to interoperate; that “syntax on the wire” is enough. What utter rubbish. At most, you might have two different models derived from a common one, but they still have commonalities.


11 Comments

Mark Baker said:

Not sure what you mean by the last sentence in the postscript. If I author FooML, which has nothing which could be mistaken for a data model, then I can deploy this and interoperate with others that understand FooML, no? That’s what’s normally meant by the “syntax on the wire” blurb, at least when I’ve heard it. In my view, you don’t need a data model to interoperate, but you do to facilitate data integration in a scalable manner.

I sense a disconnect. Maybe an example would help?

Wednesday, August 10 2005 at 6:05 AM

Elias Torres said:

First off, I’m all for RDF/OWL. However, I believe that we are not doing a great job of explaining to the rest of the world how different modeling data in XML is from modeling it in RDF. For example, I’ve said the same before about elements vs. attributes, but if you and I tried to model the same structures in RDF, our models would be totally different. Would you use bNodes or URIs? Multiple predicates, lists, sequences, bags, custom collections? hasProperty or property? isSomethingOf or somethingOf? On and on and on. I’m not sure we have been clear enough on the so-called advantages of using RDF vs. XML. After all, leaving inference capabilities aside, what do OWL and RDF give us that XML/XSLT can’t? Let’s speak up.

Wednesday, August 10 2005 at 6:51 AM

Jay Fienberg said:

Great post Mark.

I think the “containment as relation” aspect of XML also limits the types of data models one tends to explore, i.e., the containment model is a hierarchical one, and that makes hierarchical data models seem to “make more sense” than relational ones.

Also, your post could serve as an interesting commentary on the data model section of Dare Obasanjo’s recent comparison of “Microformats vs. XML vs. RDF”:

http://www.25hoursaday.com/weblog/PermaLink.aspx?guid=15341993-4d52-4eb2-8392-b35534e06ea5

Thursday, August 11 2005 at 3:29 AM

Henry Story said:

I find it best to use RDF graph notation like this

_something |------related------> other |------relation2------> more

(though N3 or Turtle avoid problems with ASCII graphs being munged over the wire)

The advantage of thinking in triples like this is that it really helps one focus on the questions:

  • what are we speaking about?
  • how are they related?

This goes a long way to teasing out all kinds of basic problems. You can’t really get much more basic than graphs like this: there are things, and things are related, so everyone should be able to participate.

Thursday, August 11 2005 at 6:30 AM

Mark Baker said:

Yah, I suppose I could buy that there’s always a data model, even if only implicitly.

Thursday, August 11 2005 at 7:18 AM

Danny said:

Roy Fielding put the XML implicit data model idea succinctly: “XML does have a built-in containment relation, which is the essence of a mark-up language.” (One of the many holes I fell down trying to make a similar point to the post here.)

http://www.imc.org/atom-syntax/mail-archive/msg11957.html

Thursday, August 11 2005 at 11:31 AM

Henry Story said:

BTW. Once you start thinking in terms of data model, playing with small triple graphs (as shown above) and moving on to OWL to define the classes and their properties, then you are just a step away from UML, which millions of programmers already understand very well. The idea of designing XML formats with UML is, I think, still very surprising to many people. But it should not be.

Friday, August 12 2005 at 12:10 PM

Danny said:

Mark, yep, ok, slightly reluctant agreement re. RDF/XML.

The model/syntax split sounds a good way of putting it - “separation of concerns” is fairly widely accepted as good practice (originally from Dijkstra, I just discovered).

Henry has a point, RDF/OWL is entity-relationship/UML analysis for the Web.

Saturday, August 13 2005 at 9:19 AM

Terris Linenbach said:

As I’ve experienced multiple times, using XML and w3c schema in particular is a heart-breaking exercise.

First you start with high expectations that your XML schema will be logical, understandable, simple, easy to use, and easily extensible.

And then you get into your business. Does an element represent a class? When do you use attributes? When can you commit to a simple type? I think mixed content should be avoided but I’m not sure. I’m not sure if I should call this element “user” or “userName.”

Why can’t elements refer to multiple complex types (no, it’s not multiple inheritance, bozos)? Oh, I can use groups for that, if I abandon complex types. Darn, I liked those.

It’s utter crap.

As I read your blog Mark and talk to you about work on and off, I of course am aware of RDF. But after all these years I still don’t understand it, and my coworkers haven’t even used w3c schema.

It’s unfortunate but perhaps to be expected that RDF data models are not an Infoset, and vice-versa. If you want to use the DOM I guess you have to convert your actual model into a legacy format like w3c schema. How sad. Really.

I would rather go back to embedded relational engines like Firebird. I see many software projects start out storing XML in local flat files and I just think, oh great, another non-transactional hack that is destined to keep me employed.

So I love XML and w3c schema. It certainly works for me and my family. Maybe what I need is a real XML database, but I want something that is free and is based on XQuery.

I’m not sure if RDF and XQuery are even compatible.

Wednesday, September 14 2005 at 9:46 AM