mark nottingham

The Problem With Infosets

Sunday, 7 March 2004

XML

An interesting issue poked its head up at the W3C Technical Plenary last week. XML Protocol (known as SOAP to mere mortals) is defined in terms of XML Infosets — it describes how to move Infosets around and process them, as the basis of Web services.

Now, the working group could have chosen to describe SOAP in terms of XML 1.0 angle brackets, but the Infoset provides a nice abstraction; instead of saying “a QName followed by the equals character, followed by a single or double-quote delimited string,” it can say “Attribute Information Item,” or simply “Attribute.”

The problem is that Infosets can contain characters that XML 1.0 can’t. Specifically, a character in an Infoset can have any “ISO 10646 character code (in the range 0 to #x10FFFF, though not every value in this range is a legal XML character code).”

This qualification is important. In XML 1.0, the range of legal characters is defined by production 2:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

As you can see, there are characters legal in the Infoset which aren’t legal when serialised in XML. It’s even more apparent when you consider XML 1.1, which defines the range of legal characters as:

Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

Whoops. Here, there are characters that are legal in XML 1.1 that aren’t legal in XML 1.0. This means that with SOAP, for example, you might send a message to an intermediary with an XML 1.1 binding, but it won’t be able to forward it with an XML 1.0 binding, because it contains illegal characters for that serialisation.
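To make the forwarding problem concrete, here is a minimal Python sketch of the check an intermediary would need before re-serialising a message. It encodes only the two Char productions quoted above; the function names are hypothetical and not part of SOAP or any XML library.

```python
def is_xml10_char(cp: int) -> bool:
    """True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_xml11_char(cp: int) -> bool:
    """True if code point cp matches the XML 1.1 Char production."""
    return (
        0x1 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def can_serialise(text: str, version: str) -> bool:
    """True if every character of text is legal in the given XML version."""
    legal = {"1.0": is_xml10_char, "1.1": is_xml11_char}[version]
    return all(legal(ord(ch)) for ch in text)


# A vertical tab (U+000B) is a legal Char in XML 1.1 but not in XML 1.0,
# so this content could arrive over a 1.1 binding and be unforwardable over 1.0.
message = "tick\x0btock"
print(can_serialise(message, "1.1"))  # True
print(can_serialise(message, "1.0"))  # False
```

Note that U+0000 fails both checks: it is within the Infoset's range but legal in neither serialisation.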

What does all of this mean? For most formats, I suspect not much; the characters under discussion are very infrequently used. It does mean that formats can’t just say “I’m defined in terms of Infosets,” because that will get you into some sticky corner cases, interoperability-wise. In short, you can’t abstract away the serialisation of XML, no matter how attractive it might be.

I still think that specifying formats in terms of the Infoset is the best, clearest way to communicate, but it may be that formats have to add a proviso about what characters are legal in those Infosets, and what to do when other characters are encountered. For example, “The Foo Format is described in terms of XML Infosets containing characters that are legal in the XML 1.0 serialisation.”
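As an illustration of what such a proviso might look like in an implementation, the sketch below enforces “legal in the XML 1.0 serialisation” on a piece of text. The two behaviours shown (reject, or substitute a replacement) are assumptions for the sake of example; a real format would have to say which one it wants. The function name is hypothetical.

```python
import re

# Matches any single character outside the XML 1.0 Char production.
_NOT_XML10 = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def enforce_xml10_proviso(text, replacement=None):
    """Apply the proviso to text.

    With no replacement, raise ValueError on the first character that
    XML 1.0 cannot serialise. With a replacement, substitute it for
    every such character and return the result.
    """
    if replacement is None:
        match = _NOT_XML10.search(text)
        if match:
            raise ValueError(
                f"U+{ord(match.group()):04X} at offset {match.start()} "
                "is not legal in XML 1.0"
            )
        return text
    # A function is used so the replacement is taken literally,
    # with no backslash-escape processing by re.sub.
    return _NOT_XML10.sub(lambda _m: replacement, text)
```

Rejecting keeps the Infoset honest; replacing keeps messages flowing but silently changes the content, which is exactly the kind of choice the format needs to spell out.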

This also means that alternate serialisations of XML (this month’s hot topic) that are based upon the Infoset will need to carefully choose the range of characters that they’re capable of serialising. I’d imagine the best thing to do would be to either superset XML 1.1 or just do the whole range from the Infoset. It might get you into some sticky corners when going back to XML 1.0, but at least it’s future-proof.

Overall, this is yet another example of imperfect abstraction after the fact; XML didn’t have the idea of variability in serialisation built in from day one. Although the Infoset is quite attractive, it does bring some problems, some of which only became apparent once we attempted to use it, courtesy of XML 1.1.


3 Comments

Seairth Jacobs said:

: Whoops. Here, there are characters that
are legal in XML 1.0 that aren’t legal
in XML 1.1.

Did you mean that the other way around?

Sunday, March 7 2004 at 7:33 AM

Aaron Swartz said:

Why is this a problem?

Tuesday, March 9 2004 at 9:40 AM