mark nottingham

Using XML in Data-Oriented Applications

Wednesday, 2 March 2005

XML

So, you’ve got some data that you need to give to somebody else, and you want to use XML to do it; good for you, you’ve seen the light / hopped on the bandwagon / drunk the Kool-Aid.

At first glance, this seems like a pretty straightforward task; after all, it’s just angle brackets, right?

Not so fast. If you’re the only person who ever has to look at the XML or write software to work with it, you’re fine, because you’ll say exactly what you mean (no more, no less). When you start involving other people, however, things get complex pretty quickly.

That’s because you have to agree on what’s being said, and just as importantly, what isn’t. What’s that attribute? Is the ordering of the children of that element significant? What’s the relationship between “item” and “entry”? And so forth.
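
To make that concrete, here is a tiny sketch (the markup is invented for illustration) of the kind of question that the angle brackets alone can’t answer:

```python
# A minimal, hypothetical example: is "id" an attribute or a child element,
# and does the order of the <entry> children matter? The markup doesn't say.
import xml.etree.ElementTree as ET

doc_a = '<items><item id="1"><entry>foo</entry><entry>bar</entry></item></items>'
doc_b = '<items><item><id>1</id><entry>bar</entry><entry>foo</entry></item></items>'

for doc in (doc_a, doc_b):
    item = ET.fromstring(doc).find('item')
    # Nothing in the documents themselves says whether they represent the
    # same data; that agreement has to be made somewhere else.
    print(item.get('id'), [e.text for e in item.findall('entry')])
```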

In a nutshell, I see this problem as one of choosing a data model along with an appropriate constraint (i.e., schema) language, and then figuring out how to get from that data model to angle brackets. Therefore, I present a choose-your-own-adventure guide to using XML in data-oriented formats, with the pros and cons of each choice.

The first question to ask is: what’s your meta model? In other words, what is the basis of your data model? What are the bits that it’s going to be made up of?

1. Your data model is based on XML

If you’re thinking in terms of Infosets, XQuery Data Models and the like, your data model is based on XML.

This is the approach extolled by most Web services proponents, often through the use of XML Schema-aware tools that bind XML to objects. The upside here is that you know what your serialisation is going to look like; the downside is that the Infoset (and its cousins) make a poor basis for a data model, as explored previously, and XML Schema only serves to make it more burdensome. Remember, XML was made for markup first, and data later.
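
As a rough illustration (this isn’t any particular binding tool, just a sketch in Python with an invented purchase-order document), notice how the Infoset’s distinctions (attributes versus children, namespace URIs, stray whitespace) leak straight into the objects that a naive binding produces:

```python
# A minimal sketch, not any particular binding tool: naively mapping an
# Infoset-shaped document onto objects, the way schema-driven binders tend to.
# The namespace and document are invented for illustration.
import xml.etree.ElementTree as ET

xml_doc = """
<purchaseOrder xmlns="http://example.org/po">
  <shipTo country="US"><name>Alice</name></shipTo>
  <comment>rush order</comment>
</purchaseOrder>
"""

def bind(element):
    # Every Infoset distinction (attributes vs. children, namespace URIs,
    # surrounding whitespace) has to show up somewhere in the object.
    return {
        'tag': element.tag,                      # includes the namespace URI
        'attributes': dict(element.attrib),
        'text': (element.text or '').strip(),
        'children': [bind(child) for child in element],
    }

po = bind(ET.fromstring(xml_doc))
print(po['children'][0]['attributes'])           # {'country': 'US'}
```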

This path is, in my opinion, the major reason behind the wailing that we hear when people actually try to use Web services and XML (lots of people seem to agree). It isn’t pretty, and I don’t see it easing significantly, despite the advent of better bindings of XML into languages, or better schema languages. I suspect that Infoset-as-metamodel is the root of the problem.

2. Your data model is based on something else…

You’re not basing your data model on XML, but something else — e.g., RDF, UML, SDO, or a once-off that you cook up yourself — and you can use some other schema language to describe your data.

The upside of this approach is that you’ll probably have a model that’s better tailored to what you’re doing, and it’ll probably be much easier to map it into programming languages.

However, if you want to serialise it as XML — so that you can use XML tools on it, so you don’t have to come up with your own serialisation format from scratch, and so that you can interoperate with a variety of systems — you need to answer another question: how do you get from that model to XML? Two paths to doing this are apparent:

a. … and there’s a static, fixed mapping from that data model to XML

If there’s a way to serialise any instance of a particular data model into XML, that mechanism falls into this basket.

If you do this, you’re not going to be able to get most (or any) of the benefits that XML brings; e.g., human-readability, applicability of common tools like XPath, XQuery, XSLT, XML Schema, etc.

A perfect example: the much-maligned RDF/XML serialisation, a format that not even a mother could love (or, in this case, a father; TBL seems to prefer the W3C’s illegitimate — or at least non-standard — stepchild, n3). You can’t write an XML Schema for it, can’t use XPath or XQuery against it, and it’s almost as unreadable as, well, XML Schema.
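
To see why, here is a small sketch using rdflib (assuming it is installed; the vocabulary and triples are invented). The shape of the RDF/XML you get back is chosen by the serialiser, not by you, so there is no stable structure to write an XPath expression or a schema against:

```python
# A sketch with rdflib (assumed available); the vocabulary is invented.
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/vocab#")

g = Graph()
entry = URIRef("http://example.org/entry/1")
g.add((entry, EX.title, Literal("Hello")))
g.add((entry, EX.author, Literal("Mark")))

# The serialiser decides whether these come out as nested elements,
# attributes, or separate rdf:Description blocks; equivalent graphs can
# legally produce quite different-looking RDF/XML documents.
print(g.serialize(format="xml"))
```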

Other good examples include SOAP’s “section 5” encoding and the Excel XML serialisation. Basically, these approaches use XML purely as a serialisation format, mindlessly dumping an object or other model into angle brackets. The integration into the XML stack is almost accidental where it happens, and for these reasons, I don’t think this is much of an option.

b. … and there’s an application-specific mapping from that data model to XML

This leads us to the last, and possibly most interesting, option. By defining an application-specific, bespoke mapping from the data model to its representation in XML, it’s possible to make the XML human-readable, and retain compatibility with many XML tools (e.g., XPath, XQuery, XSLT).

For example, the WSDL Working Group has described its format not in terms of the Infoset, but in terms of its own data (or “component”) model, along with a mapping to the Infoset. It also has a Schema, and XML tools can be used with it; the best of both worlds.

There are countless other examples; many XML-based specifications are actually described in terms of a separate data model, even if it’s just a set of XPath expressions. Disconnecting from the constraints of the Infoset frees you to think about what the data model should be, not what it should look like in bits.
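
As a small sketch of what 2b can look like (the element names and classes here are invented, not taken from any particular specification), the data model is defined on its own terms and the XML is designed deliberately, so it stays readable and XPath-friendly:

```python
# A minimal sketch of an application-specific mapping: the data model is a
# plain dataclass, and the XML it maps to is designed by hand, not by a tool.
from dataclasses import dataclass
import xml.etree.ElementTree as ET

@dataclass
class Entry:
    title: str
    author: str

def to_xml(entry: Entry) -> str:
    root = ET.Element("entry")
    ET.SubElement(root, "title").text = entry.title
    ET.SubElement(root, "author").text = entry.author
    return ET.tostring(root, encoding="unicode")

def from_xml(doc: str) -> Entry:
    root = ET.fromstring(doc)
    return Entry(title=root.findtext("title"), author=root.findtext("author"))

doc = to_xml(Entry(title="Hello", author="Mark"))
print(doc)   # <entry><title>Hello</title><author>Mark</author></entry>
assert from_xml(doc) == Entry(title="Hello", author="Mark")
```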

Where to go from here?

My take on this, if you haven’t guessed by now, is that 2b is an interesting approach. However, while it’s feasible for working groups producing specifications to come up with their own data models, along with mappings to XML, it’s less reasonable to expect people describing their own data to do so.

RDF seems to have potential here; it’s a very generic and simple data model already standardised by the W3C. All that you’d need would be the ability to annotate the RDF Schema and OWL with triples that tell a processor how to serialise an instance of the data model as Plain Old XML — in a way that’s specific to that format. The same annotations could be used to extract the data model from an XML instance.
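
Purely as a speculative sketch of that idea (the annotation table below is invented; in the straw-man it would be triples annotating the RDF Schema or OWL description), one set of hints could drive serialisation in one direction and extraction in the other:

```python
# Speculative sketch only: a hypothetical annotation table, keyed by RDF
# property URI, tells a processor which element name to emit, and the same
# table drives extraction of the data model from the XML instance.
import xml.etree.ElementTree as ET

HINTS = {
    "http://example.org/vocab#title": "title",     # invented vocabulary
    "http://example.org/vocab#author": "author",
}

def to_pox(properties: dict) -> str:
    root = ET.Element("entry")
    for prop, value in properties.items():
        ET.SubElement(root, HINTS[prop]).text = value
    return ET.tostring(root, encoding="unicode")

def from_pox(doc: str) -> dict:
    root = ET.fromstring(doc)
    return {prop: root.findtext(name) for prop, name in HINTS.items()}

data = {"http://example.org/vocab#title": "Hello",
        "http://example.org/vocab#author": "Mark"}
assert from_pox(to_pox(data)) == data
```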

You could then author a meaningful XML Schema for such an XML serialisation, as long as you acknowledge that it may not reflect all of the constraints of the data model (a situation which is prevalent anyway). Granted, this means more work (e.g., authoring an RDF schema and an XML Schema), but it avoids putting too much strain — or dependence — on XML Schema.

I’m working on a straw-man for doing this now; stay tuned.


5 Comments

Peter Herndon said:

Mark, you wrote, “However, while it’s feasible for working groups producing specifications to come up with their own data models, along with mappings to XML, it’s less reasonable to expect people describing their own data to do so.”

Maybe I’m missing something, but why? Assuming that the person creating the application is a programmer, how/why is this hard? If you are building an application that is data-centric, then you need to understand your data. Given that you understand the data and how they vary, it shouldn’t be that hard to codify those variations. From there, build sample XML documents, and tweak and refine them. When you get to something you like, codify your syntax in a schema, build whatever code-to-XML translation tools you need, and there you are.

Now, I’m coming at this from the perspective of a corporate developer. In my case, the data already exist, I’m simply manipulating them. In your case, working on the Atom WG may have skewed your perspective, as the WG goal is to create a new format for future use, with longevity in mind. That is, you are creating this model from whole cloth, rather than from a pre-existing source (not discounting RSS, but Atom is not a direct evolution of RSS). I can see that this task is a lot harder, and much more creative. But if you already have data, and you are simply trying to describe it with XML, is that task particularly hard? What am I missing?

Wednesday, March 2 2005 at 11:53 AM

Peter Herndon said:

Mm, yes, that makes sense. I’ve never been in a situation where both sides are active producers as well as consumers of XML. I’ve always been in a position where one side produces and the other (me) consumes, and in these positions, the producer has defined the XML however they wish.

Friday, March 4 2005 at 2:35 AM

Mark Finkle said:

Let me play devil’s advocate here. Although 2b may be the most interesting approach, it seems to be the least likely found “in the wild”. I have seen many uses of XML as a glorified Windows INI file (a simple preference file). Right or wrong, this seems to fall into #1. Granted, not a great data model, but data nevertheless.

I have also seen many real examples of XML used as serialization formats. One might even come to the conclusion that #2a is what many applications aspire to achieve. While not ideal, human-readability, XPath and XSLT are being applied quite well.

I am struggling with the practicality of RDF/OWL. Why is it worth the effort? I can’t believe it’s interoperability. People have been translating and converting plain old XML (no DTD, no Schema, no nothing) for years. It’s hard to crack encryption, not a plain-text XML file. Thanks for forcing me to think about this stuff.

Friday, March 4 2005 at 11:38 AM