mnot’s blog

“Design depends largely on constraints.” — Charles Eames

Thursday, 5 August 2004

The ‘Document’ in Document-Oriented Messaging

(Another instalment in “XML Heresies.”)

One of the foundations of most vendors’ approach to Web services is called document-oriented messaging. This is the notion that interoperability is improved by describing a protocol in terms of the artefacts that are exchanged on the wire, rather than how the code that handles them is written.

As far as it goes, that’s good advice. Implementation-specific specifications lead to brittleness, because you can’t swap out the implementation; the message is too tightly coupled to the code. It also isn’t exactly new; although CORBA and DCOM got this wrong, the IETF has been writing protocols in this fashion for quite some time, very successfully.

What’s interesting is the baggage that comes along for the ride. “Document-Oriented” invariably means “XML” to many people; the reasoning being that you can break the link between implementation-specific data structures and wire formats with what is becoming the lingua franca of document formats.

Everybody Needs a Data Model

That’s great, but you still need some way to tell people what it should look like. When you specify a message’s content — whether you’re doing so in an IETF, W3C or OASIS protocol, or in an ad hoc Web services protocol — you need to do so based upon some common abstraction (a.k.a. data model) — even if it’s the Unicode character set — and a means of constraining it (e.g., a schema language).

Note that this is not at all at odds with the idea of document-oriented messaging. The important thing is that the choice of the data model and schema language should not be driven by implementation-specific considerations (e.g., your favourite language’s object model); they should be vendor- and implementation-neutral, which is one of the reasons that XML has had so much fanfare.

Many alternatives are available to make this job easier; for example, EBNF, DTDs, XML Schema, Relax NG, and OWL all allow you to describe such constraints. With them, you can talk in a shorthand while still being precise about the bits that go on the wire. As a bonus, abstractions give some wiggle room for alternate serialisations, and can also provide tools and patterns for common tasks like extensibility and versioning.
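To make this concrete, here’s how a hypothetical two-element “person” message might be constrained in Relax NG’s compact syntax (the message vocabulary is invented purely for illustration):

```
# Relax NG compact syntax: a hypothetical "person" message
# with two required, ordered children.
element person {
  element name { text },
  element email { text }
}
```

The same constraints could be written as a DTD or an XML Schema; the point is that each of these languages is a shorthand over some underlying data model, not over anyone’s code.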

Is XML Schema Right for Web Services?

The whole idea of Web services is to give people a protocol construction toolkit that allows them to specify messages easily, suck them into code and start working with them on most any platform. So, why is it that Web services went shopping for these things and came back home with the XML Infoset as described by XML Schema, of all things*?

As discussed before, an Infoset isn’t exactly a simple thing; in fact, there are very few bindings to programming languages that capture all of the information in an Infoset, and even fewer of them do it in an intuitive manner**. Instead of mitigating this complexity, XML Schema revels in it.
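Python’s standard-library ElementTree binding is a handy illustration of this: by default, its parser silently discards comments and processing instructions, so part of the Infoset never reaches your code at all.

```python
import xml.etree.ElementTree as ET

doc = '<msg><!-- audit note --><?route fast?><item>1</item></msg>'
root = ET.fromstring(doc)

# Only the element survives; the comment and the PI are gone.
tags = [child.tag for child in root]
print(tags)  # ['item']

# Serialising again shows the information loss on the wire.
print(ET.tostring(root, encoding='unicode'))
# <msg><item>1</item></msg>
```

Other bindings drop or mangle different parts (namespace prefixes, attribute order, the DTD), which is exactly why two “XML-aware” toolkits can disagree about the same document.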

Think of it from an information theoretic standpoint; if the various Information Items and properties of an Infoset are each capable of carrying information, we’ve got a pretty big footprint to work with, and Schema doesn’t give very precise tools for sorting the signal from the noise. Because each different tool chooses a different, incomplete portion of the Infoset to model, interoperability is hard.

For examples of this in existing specs, look at Atom — an extensible metadata container where the order of children is insignificant. It’s impossible to describe this in XML Schema; if you make something unordered, it can’t be extensible. Another ironic example is WSDL 2.0; rather than using the Infoset as a data model, WSDL 2.0 describes a new one — the Component Model — mapping it to an Infoset and then to bits on the wire. Why do Web services folks think it’s OK for end users to use XML Schema if it isn’t good enough for describing WSDL?
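The Atom problem is visible in the schema language itself: in XML Schema 1.0, unordered content is expressed with xs:all, but an xs:all group may contain only plain element declarations (in particular, no xs:any wildcard), so an unordered type cannot also be left open for extension. The element names below are invented for illustration:

```xml
<xs:complexType name="entryType">
  <xs:all>
    <xs:element name="title" type="xs:string"/>
    <xs:element name="updated" type="xs:dateTime"/>
    <!-- Illegal in XML Schema 1.0: a wildcard may not appear
         inside xs:all, so this unordered type cannot also be
         extensible. -->
    <!-- <xs:any namespace="##other" processContents="lax"/> -->
  </xs:all>
</xs:complexType>
```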

In both of these cases, an implementation-generic data model and set of constraints upon it was necessary, and XML was chosen as the way to serialise bits on the wire. Great. However, XML Schema falls short of describing what’s really going on; something more is needed.

Beyond XML

Compared with what came before it, XML is the bee’s knees. It isn’t implementation-specific, yet it’s widely deployed. It’s openly specified and unencumbered, and human-friendly to boot. It also offers some very useful advantages over its predecessors, such as nesting and the potential for versioning and extensibility.

However, XML was made for document markup, not data modelling. XML Schema does what it’s supposed to — it describes constraints upon the XML Infoset — with complete coverage, but it misses the boat; by trying to do all things, it makes doing simple things really difficult. The result is — for data-oriented use cases — a complex data model not designed for the task at hand being described by a sub-optimal constraint language.

Don’t get me wrong; XML is a great foundation for syntax, but data models that directly map to it (such as the Infoset, PSVI, XQDM, etc.) are a horrible basis for a generic, interoperable protocol toolkit.

Many people think that Relax NG is the answer, but I think this view misses the deeper problems caused by its continued reliance on the Infoset as a data model. While some people do actually want to shove markup around, the more prevalent use case by far is simple data.

The real trick, IMO, is getting the advantages of XML — like platform neutrality, versioning, extensibility, nested data structures, self-description and human readability — without the complexity of the Infoset or the problems of XML Schema. A simpler, higher-level data model that has a mapping onto the Infoset while still providing these things could do the job.
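As a sketch of what such a mapping might look like, here is a hypothetical “order” message held in a deliberately simple data model (a flat dictionary of named text values) and mapped onto XML only at the edge; the names and fields are invented:

```python
import xml.etree.ElementTree as ET

def to_xml(tag, data):
    # Map the simple model (a flat dict of named text values)
    # onto an XML element tree, only at serialisation time.
    elem = ET.Element(tag)
    for key, value in data.items():
        ET.SubElement(elem, key).text = value
    return elem

def from_xml(elem):
    # Recover the model; child order on the wire is irrelevant
    # to this model, even though it is significant in the Infoset.
    return {child.tag: child.text for child in elem}

order = {"id": "42", "status": "shipped"}
wire = ET.tostring(to_xml("order", order), encoding="unicode")
assert from_xml(ET.fromstring(wire)) == order
```

Attributes, namespaces and mixed content simply don’t exist in this model; that loss of generality is exactly what buys the simplicity.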

The question at hand is whether we can profile and subset XML Schema and the Infoset to the point where they’re usable for modelling data, or whether we should start fresh. On the former approach, it’s interesting to see that WS-I has created a Schema Profiling Work Plan Working Group (PDF link), although I don’t know whether it’ll be able to go far enough***. Arguments that it would be bad to break a large number of existing schemas are also persuasive.

As far as starting fresh goes, we might be able to just switch horses. A little while back, I made a direct comparison between the two stacks that the W3C is developing; one based on the Infoset, the other on the RDF data model. It’s pretty clear to me that the RDF data model is simpler; the next step, I think, is to see if and how it (along with OWL) provides the purported benefits of XML, such as nesting, extensibility and versioning. The first of these is pretty easy (it’s a directed graph, so it’s arguably superior); the latter two are beginning to be explored. Stay tuned.
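For a feel of how small the RDF data model is, here is an illustrative sketch of it as a set of (subject, predicate, object) triples; the names are invented, and this ignores real RDF machinery such as URIs, typed literals and blank nodes:

```python
# A directed graph as a set of (subject, predicate, object) triples.
graph = set()

def add(s, p, o):
    graph.add((s, p, o))

add("order:42", "status", "shipped")
add("order:42", "customer", "cust:7")
add("cust:7", "name", "Alice")          # "nesting" is just another edge

# Extensibility: an unknown predicate is just one more triple,
# which consumers that don't understand it can simply ignore.
add("order:42", "ext:priority", "high")

known = {(s, p, o) for (s, p, o) in graph if not p.startswith("ext:")}
assert ("cust:7", "name", "Alice") in known
```

There is no ordering, no attribute/element distinction and no mixed content to model; whether that graph can also deliver versioning as cleanly is the part still being explored.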


* Note again the advent and fall of SOAP encoding; it had a simpler data model, but didn't have an effective description mechanism in SOAP 1.1, hence WS-I dropped it in the Basic Profile. You need both parts.

** Before you pipe up and claim that [insert your tool of choice] can do it, consider this; how easily does it expose element ordering? Attributes? Namespace prefixes used? Comments? PIs? The DTD?

*** It bears repeating that this Web site represents my personal opinions and musings only.


Filed under: Protocol Design Semantic Web Web Services XML

10 Comments

James Tauber said:

No surprises that I agree. I say more in my latest blog entry at http://jtauber.com/blog/2004/08/06/more_on_xml_and_rdf

Friday, August 6 2004 at 6:07 AM +10:00

Gregory Graham said:

I'm starting in a new distributed computing research project, and I'm trying to decide between plain XML and RDF. I found your article to be helpful in the thought process.

Friday, August 6 2004 at 8:20 AM +10:00

Randy Charles Morin said:

XML Schema hits the 80-20 mark. End of story. You can always say that any technology is lacking, and developers love to say this and re-invent the wheel. That's easy. It's much harder to accept that the 80-20 mark is met and move on. It's much harder to say, "this is good enough" and start developing real applications. We'd rather say this wheel is not good enough, because it's the easy way out. You don't have to do any real work.

Friday, August 6 2004 at 7:16 PM +10:00

Mark Nottingham said:

Randy,

Sorry, I need more convincing than you saying it’s good enough. Lots of people — including myself — have done the work and found XML Schema lacking, so much so that they’re looking for something better.

Rather than committing the fallacy of accident by assuming that because developers think like this in general we don’t need to question Schema, it would be nice to see you attack the reasoning itself. Perhaps CORBA was good enough, but the world saw enough pain around it to try again; likewise with networking stacks, programming languages and lots of other examples.

P.S. Note that I don’t rule out subsetting, profiling or otherwise fixing Schema; we don’t necessarily have to throw everything out and start fresh.

Friday, August 6 2004 at 8:49 PM +10:00

Sean McGrath said:

Randy,

I gotta disagree.

W3C XML Schema hits the 80/20 mark for schema languages the same way that a boiled egg hits the 80/20 mark for a balanced diet.

W3C XML Schema is awful in more ways than I can fit into this comment.

If you want to see what a real 80/20 point looks like in the field of schema languages, look at Relax NG.

Grammar-based validation is terribly appealing but has major limits. The future of "validation" in its broadest interpretation lies with processing pipelines that can mix lexical, infoset, grammar and inference-based validations into a cohesive expression of "parse".


Sean

Saturday, August 7 2004 at 2:38 AM +10:00

Randy Charles Morin said:

Mark,

CORBA wasn't good enough. CORBA Orbs couldn't communicate at all until IIOP and even then interop was difficult. SOAP/WSDL/XSD doesn't have those problems.

Over the last 3 years, I have created countless SOAP/WSDL services that just work. The data model was always described in XSD. I knew the problem w/ cardinality and order and simply worked w/ it. I've communicated between Java, MSSOAPSDK, Axis and .NET Web Services and I've never seen a problem. It works.

The failings of XSD are well-known, well-documented and can easily be worked w/. Atom, for instance, could be restructured to work w/ XSD, but the mob simply refuses w/ ridiculous excuses like "ordering is too hard". Somehow, XSD isn't good enough, but ordering your elements is too hard.

Last, I've tried to contact the XSD working group and get them to fix the problems. I was ignored. I just wish the factions at W3C would work together, because right now the W3C looks stupid w/ some of its own members and ex-members working against each other.

Saturday, August 7 2004 at 4:57 AM +10:00

Bill de hora said:

Randy: "simply refuses w/ ridiculous excuses like "ordering is too hard". "

I have some sympathy for this view. The discussion on Atom pro un-ordering has been less than convincing imo. But that's only one facet of a myriad of issues with WXS and, dare I say it, Object Oriented approaches to modelling.

Sean: "The future of "validation" in its broadest interpretation lies with processing pipelines that can mix lexical, infoset, grammar and inference-based validations"

Unsurprisingly tho' I find myself agreeing with Sean :) RDF technologies will be useful insofar as they'll help drive the interop problem up the stack. But there will continue to be an interop problem since people won't even agree on vocabulary, never mind semantics.

But here's the thing - RDF versus XML, or RDF as some kind of surrogate for XML, are xml-dev permathreads that must die. Really, where RDF could have significant impact is not in swapping out the XML stack, but in the business logic and mapping rules we've been busy embedding in systems programming languages for the last two decades - in that sense it aligns nicely with data-directed languages like Schematron, SQL, and from way back Prolog (before it got tarnished with the AI brush). Of course, if you really want to type things in terms of programming language primitives, RDF will let you use XSD types, but looking out it's hard to see a continuing role for the likes of C# and Java for anything other than bottom coding.

Also - and this is well below the envelope-packaging and business-modelling domains - investigations suggest RDF could be very useful as a content model for distributed systems and operations management in terms of event propagation, though it has some issues, notably provenance.

Nonetheless, it's been obvious and well understood for a long time that XSD/WXS is a mess, but the 'industry' has persisted in flogging that dead horse regardless. Profiling an already broken approach doesn't seem smart imo, unless your business model relies on maintaining current inefficiencies rather than eliminating them. Of course, sometimes incremental improvements are the most likely to be adopted.

Saturday, August 7 2004 at 6:44 AM +10:00

Mark Nottingham said:

N.B. — I mentioned RDF because it has broad exposure and because of the simplicity of the model; that isn’t to say that there aren’t other potential solutions, or that RDF will solve the world’s problems (it has many of its own).

Interestingly, Yaron highlights SDO as another option on his weblog:
http://www.goland.org/Tech/marksdo.htm
See the comments there as well.

Wednesday, August 11 2004 at 10:54 PM +10:00

Lisa Dusseault said:

I think this explanation of document-oriented messaging is important. It tends to make the protocol more long-lived, and easier to hook into existing systems, easier to upgrade and extend. E.g. you can more easily create a gateway to switch the document from one transport to another, when there's a clear separation between the document and the transport.

Friday, August 13 2004 at 11:23 AM +10:00

Claus von Riegen said:

Mark,
for sure you raise an issue that needs to be solved: which XML Schema constructs should be used to allow different platforms (particularly in terms of programming languages) to interoperate with each other?
But the real problem, in my opinion, lies one level higher. It is not how the message syntax is described - it is how the meaning of the message and its contents is described. Approaches like the UN/CEFACT Core Component technology and the related context-driver methodology come to mind. This does not mean that semantics need to be standardized globally and uniquely - it is the definition and description methodology that is standardized.

Saturday, August 21 2004 at 5:20 PM +10:00
