Wednesday, 2 March 2005
Using XML in Data-Oriented Applications
So, you’ve got some data that you need to give to somebody else, and you want to use XML to do it; good for you, you’ve seen the light / hopped on the bandwagon / drunk the Kool-Aid.
At first glance, this seems like a pretty straightforward task; after all, it’s just angle brackets, right?
Not so fast. If you’re the only person who ever has to look at the XML or write software to work with it, you’re fine, because you’ll say exactly what you mean (no more, no less). When you start involving other people, however, things get complex pretty quickly.
That’s because you have to agree on what’s being said, and just as importantly, what isn’t. What’s that attribute? Is the ordering of the children of that element significant? What’s the relationship between “item” and “entry”? And so forth.
In a nutshell, I see this problem as one of choosing a data model along with an appropriate constraint (i.e., schema) language, and then figuring out how to get from that data model to angle brackets. Therefore, I present a choose-your-own-adventure guide to using XML in data-oriented formats, with the pros and cons of each choice.
The first question to ask is: what’s your meta model? In other words, what’s the basis of your data model? What are the bits that it’s going to be made up of?
1. Your data model is based on XML
If you’re thinking in terms of Infosets, XQuery Data Models and the like, your data model is based on XML.
This is the approach extolled by most Web services proponents, often through the use of XML Schema-aware tools that bind XML to objects. The upside here is that you know what your serialisation is going to look like; the downside is that the Infoset (and its cousins) make a poor basis for a data model, as explored previously, and XML Schema only serves to make it more burdensome. Remember, XML was made for markup first, and data later.
This path is, in my opinion, the major reason behind the wailing that we hear when people actually try to use Web services and XML (lots of people seem to agree). It isn’t pretty, and I don’t see it easing significantly, despite the advent of better bindings of XML into languages, or better schema languages. I suspect that Infoset-as-metamodel is the root of the problem.
2. Your data model is based on something else…
You’re not basing your data model on XML, but something else — e.g., RDF, UML, SDO, or a once-off that you cook up yourself — and you can use some other schema language to describe your data.
The upside of this approach is that you’ll probably have a model that’s more tailored for what you’re doing, and it’ll probably be much easier to map it into programming languages.
However, if you want to serialise it as XML (so that you can use XML tools on it, so you don’t have to come up with your own serialisation format from scratch, and so that you can interoperate with a variety of systems), you need to answer another question: how do you get from that model to XML? Two paths to doing this are apparent:
a. … and there’s a static, fixed mapping from that data model to XML
If there’s a way to serialise any instance of a particular data model into XML, that mechanism falls into this basket.
If you do this, you’re not going to be able to get most (or any) of the benefits that XML brings; e.g., human-readability, applicability of common tools like XPath, XQuery, XSLT, XML Schema, etc.
A perfect example: the much-maligned RDF/XML serialisation, a format that not even a mother could love (or, in this case, a father; TBL seems to prefer the W3C’s illegitimate, or at least non-standard, stepchild, n3). You can’t write an XML Schema for it, can’t use XPath or XQuery against it, and it’s almost as unreadable as, well, XML Schema.
Other good examples include SOAP’s “section 5” encoding and the Excel XML serialisation. Basically, these approaches are using XML as a serialisation format, in the sense that they’re using it to mindlessly serialise an object or other model into XML. The integration into the XML stack is almost accidental where it happens, and for these reasons, I don’t think this is much of an option.
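To make the complaint concrete, here is a minimal sketch of a fixed, model-agnostic mapping in Python. The function name `generic_to_xml`, the `entry`/`value` vocabulary and the sample `order` data are all made up for illustration; this is not how RDF/XML, SOAP encoding or Excel actually do it, just the same general idea in miniature.

```python
import xml.etree.ElementTree as ET

def generic_to_xml(value, tag="value"):
    """Serialise any dict/list/scalar using one fixed rule set (hypothetical)."""
    elem = ET.Element(tag)
    if isinstance(value, dict):
        elem.set("type", "map")
        for key, child in value.items():
            # Every key becomes a generic <entry key="...">, whatever it means.
            entry = ET.SubElement(elem, "entry", {"key": str(key)})
            entry.append(generic_to_xml(child))
    elif isinstance(value, list):
        elem.set("type", "list")
        for child in value:
            elem.append(generic_to_xml(child))
    else:
        # Scalars carry their Python type name and a string rendering.
        elem.set("type", type(value).__name__)
        elem.text = str(value)
    return elem

# Illustrative data only.
order = {"id": 42, "items": [{"sku": "A1", "qty": 2}]}
root = generic_to_xml(order)
print(ET.tostring(root, encoding="unicode"))

# The domain vocabulary never reaches the element names, so a path
# has to spell out the encoding rather than the meaning.
order_id = root.find("entry[@key='id']/value").text
```

Everything comes out as `value` and `entry`, so a schema or path expression written against it describes the encoding rather than the order.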
b. … and there’s an application-specific mapping from that data model to XML
This leads us to the last, and possibly most interesting, option. By defining an application-specific, bespoke mapping from the data model to its representation in XML, it’s possible to make the XML human-readable, and retain compatibility with many XML tools (e.g., XPath, XQuery, XSLT).
For example, the WSDL Working Group has described their format not in terms of the Infoset; they’ve come up with their own data (or “component”) model, along with a mapping to the Infoset. They also have a Schema, and XML tools can be used with it; best of both worlds.
There are countless other examples; many XML-based specifications are actually described in a separate data model, even if it’s just a set of XPath expressions. Disconnecting from the constraints of the Infoset frees you to think about what the data model should be, not what it should look like in bits.
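By way of contrast, a bespoke mapping might look something like the sketch below. The `Order` and `Item` classes, and the choice of an `order` element with `item` children, are assumptions invented for this example; the point is only that the data model is defined on its own terms and the XML shape is chosen by hand.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

# The data model, defined without reference to XML (hypothetical).
@dataclass
class Item:
    sku: str
    qty: int

@dataclass
class Order:
    id: int
    items: list = field(default_factory=list)

def order_to_xml(order):
    """Hand-written mapping from the model to a readable XML shape."""
    root = ET.Element("order", {"id": str(order.id)})
    for item in order.items:
        ET.SubElement(root, "item", {"sku": item.sku, "qty": str(item.qty)})
    return root

def order_from_xml(root):
    """The same mapping in reverse: recover the model from the XML."""
    return Order(
        id=int(root.get("id")),
        items=[Item(e.get("sku"), int(e.get("qty"))) for e in root.findall("item")],
    )

order = Order(42, [Item("A1", 2), Item("B7", 1)])
root = order_to_xml(order)
print(ET.tostring(root, encoding="unicode"))

# Paths now read in terms of the domain, not the encoding.
skus = [e.get("sku") for e in root.findall("item")]
```

The cost is that somebody has to write and maintain `order_to_xml` and `order_from_xml` for every format, which is exactly the burden discussed next.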
Where to go from here?
My take on this, if you haven’t guessed by now, is that 2b is an interesting approach. However, while it’s feasible for working groups producing specifications to come up with their own data models, along with mappings to XML, it’s less reasonable to expect people describing their own data to do so.
RDF seems to have potential here; it’s a very generic and simple data model already standardised by the W3C. All that you’d need would be the ability to annotate the RDF Schema and OWL with triples that tell a processor how to serialise an instance of the data model as Plain Old XML — in a way that’s specific to that format. The same annotations could be used to extract the data model from an XML instance.
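As a very rough illustration of the idea (and not the straw-man itself), the sketch below keeps both the instance data and the mapping annotations as plain triples. The `map:element` and `map:attribute` properties, the `ex:` terms and the function names are all invented for this example, and it only handles a single resource with literal values.

```python
import xml.etree.ElementTree as ET

# Instance data as (subject, predicate, object) triples.
DATA = [
    ("ex:order42", "rdf:type", "ex:Order"),
    ("ex:order42", "ex:orderId", "42"),
    ("ex:order42", "ex:customer", "Alice"),
]

# Hypothetical annotations on the schema terms, also expressed as triples.
MAPPING = [
    ("ex:Order", "map:element", "order"),
    ("ex:orderId", "map:attribute", "id"),
    ("ex:customer", "map:element", "customer"),
]

def lookup(triples, subject, predicate):
    """Return all objects for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

def serialise(subject, data, mapping):
    """Write one resource as XML, driven entirely by the annotations."""
    rdf_type = lookup(data, subject, "rdf:type")[0]
    root = ET.Element(lookup(mapping, rdf_type, "map:element")[0])
    for s, p, o in data:
        if s != subject or p == "rdf:type":
            continue
        attr = lookup(mapping, p, "map:attribute")
        child = lookup(mapping, p, "map:element")
        if attr:
            root.set(attr[0], o)
        elif child:
            ET.SubElement(root, child[0]).text = o
    return root

def extract(root, subject, mapping):
    """Reverse direction: recover triples from XML using the same annotations."""
    triples = []
    for term, kind, name in mapping:
        if kind == "map:element" and name == root.tag:
            triples.append((subject, "rdf:type", term))
        elif kind == "map:attribute" and root.get(name) is not None:
            triples.append((subject, term, root.get(name)))
        elif kind == "map:element":
            for child in root.findall(name):
                triples.append((subject, term, child.text))
    return triples

root = serialise("ex:order42", DATA, MAPPING)
print(ET.tostring(root, encoding="unicode"))
recovered = extract(root, "ex:order42", MAPPING)
```

A real version would need to deal with nested resources, datatypes, namespaces and ordering, none of which this sketch attempts.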
You could then author a meaningful XML Schema for such an XML serialisation, as long as you acknowledge that it may not reflect all of the constraints of the data model (a situation which is prevalent anyway). Granted, this means more work (e.g., authoring an RDF schema and an XML Schema), but it avoids putting too much strain on, or dependence on, XML Schema.
I’m working on a straw-man for doing this now; stay tuned.