Informational Properties of Infosets

Wednesday, 12 May 2004

Recently, I’ve been thinking about the influences that using the Infoset has on the information you place in it.

To put it another way: if you work with XML at the Infoset level, what tools are you given to express information with? As an informational channel, the structures that XML gives you can express pretty much anything, of course, but they lend themselves to some things better than others.

As such, using the Infoset encourages data to be moulded to fit into certain shapes over others. I’d like to dig these influences out and consider them from a purely abstract perspective. The phenomena I discuss here aren’t new, by any means; however, they’ve always been seen from the perspective of “best practices for XML.” I want to turn it around and find out not how best to fit your data to XML, but what that process does to your data.

I have two motivations for this work. First, the XML Binary Characterization Working Group is interested in figuring out the properties of XML (a worthy task no matter what you think of “Binary XML”), and this seems like a good step in that direction. Second, many Web specifications use the Infoset both to describe their structures and as a payload, so I’d like to better understand what that is, and whether it’s an appropriate base to build upon.

Note that I’m not trying to characterise the information that the Infoset loses, as compared to XML 1.0; I think that’s fairly well-understood.

Here are some preliminary thoughts. Constructive feedback appreciated.

Building Blocks

There are two primary constructs in an Infoset: information items and properties. The Infoset spec defines eleven kinds of information item, and doesn’t appear to allow for the definition of any more; if you want a new kind, you’ll need to call the result something other than an Infoset.

Properties dangle off of information items, and are used to relate them to other information items, as well as carry other data specific to their information item. A fairly large number of properties are defined in the Infoset spec, as denoted by the [property] convention. There isn’t any explicit mechanism for defining new properties; other specifications have done so, but have called the resulting structure something different (e.g., XML Schema and the PSVI).

Because neither information items nor properties are extensible, the Infoset effectively requires that all of its payload be stuffed into various properties’ values.

In practice, this is awkward; rather than saying “‘foo’ has a ‘quantity’ of ‘3’” you have to say “the element information item with the [local name] property ‘foo’ has an attribute information item in its [attributes] property with the [local name] property ‘quantity’ and the [normalized value] property ‘3’.”
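To make the contrast concrete, here is a minimal Python sketch. The two classes are hypothetical and model only the properties named in the sentence above; they are not taken from any library, and a real Infoset API would carry many more properties.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AttributeInfoItem:
    # Models only [local name] and [normalized value]
    local_name: str
    normalized_value: str


@dataclass
class ElementInfoItem:
    # Models only [local name] and [attributes]
    local_name: str
    attributes: List[AttributeInfoItem] = field(default_factory=list)


def attribute_value(element: ElementInfoItem, name: str) -> Optional[str]:
    """Walk [attributes] looking for a matching [local name]."""
    for attr in element.attributes:
        if attr.local_name == name:
            return attr.normalized_value
    return None


# The Infoset-level statement
foo = ElementInfoItem("foo", [AttributeInfoItem("quantity", "3")])

# The statement most people actually want to make
shorthand = {"foo": {"quantity": "3"}}

print(attribute_value(foo, "quantity"))
print(shorthand["foo"]["quantity"])
```

Both lines print the same value; the difference is how much machinery sits between the question and the answer.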

As a result, people manipulating Infosets in software usually interact with an abstraction that simplifies it, sometimes losing information in the process. Even specifications that use the Infoset to talk about XML (its original purpose) will resort to some kind of shorthand, rather than subject their readers to the turgid prose it requires.

In some cases, the choice of information item and property to use to contain a given piece of data is capricious; for example, the content of an attribute information item in the [attributes] property can very often be just as effectively represented as that of an element information item in the [children] property. Effectively, the type of information item and associated properties becomes noise in these cases.
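As a rough illustration, the sketch below uses Python’s standard xml.etree.ElementTree module on two made-up documents that carry the same fact. The element and attribute names are assumptions chosen to match the ‘foo’ example above.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Two hypothetical serialisations of the same piece of data
as_attribute = ET.fromstring('<foo quantity="3"/>')
as_child = ET.fromstring('<foo><quantity>3</quantity></foo>')


def quantity(foo: ET.Element) -> Optional[str]:
    """Return the quantity wherever the author chose to put it."""
    value = foo.get("quantity")            # attribute form
    if value is None:
        value = foo.findtext("quantity")   # child element form
    return value


qty_from_attribute = quantity(as_attribute)
qty_from_child = quantity(as_child)
print(qty_from_attribute, qty_from_child)
```

A consumer that wants to accept both forms has to check both places, even though the distinction carries no meaning for the data itself.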

In others, the content of a property is “live,” in that it can be calculated by examining other properties. This doesn’t do too much harm in a reasonable implementation, but it is extra information.
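One property that can be treated this way is [in-scope namespaces], which can be worked out from an element’s own namespace declarations and those of its ancestors. The class below is a hypothetical sketch of that idea, not an implementation of the Infoset; it ignores the document information item and treats the root element as having no parent.

```python
from typing import Dict, Optional

XML_NS = "http://www.w3.org/XML/1998/namespace"


class Element:
    """Hypothetical element item holding only what this sketch needs."""

    def __init__(self, parent: Optional["Element"] = None,
                 namespace_attributes: Optional[Dict[str, str]] = None):
        self.parent = parent
        # prefix -> namespace name, as declared on this element only;
        # the empty prefix "" stands for the default namespace
        self.namespace_attributes = namespace_attributes or {}

    @property
    def in_scope_namespaces(self) -> Dict[str, str]:
        """Derived on demand from the parent chain and local declarations."""
        if self.parent is None:
            scope = {"xml": XML_NS}  # the xml prefix is always bound
        else:
            scope = dict(self.parent.in_scope_namespaces)
        for prefix, uri in self.namespace_attributes.items():
            if prefix == "" and uri == "":
                scope.pop("", None)  # xmlns="" undeclares the default
            else:
                scope[prefix] = uri
        return scope


root = Element(namespace_attributes={"a": "http://example.com/a"})
child = Element(parent=root, namespace_attributes={"b": "http://example.com/b"})
print(sorted(child.in_scope_namespaces))
```

Nothing is stored for the derived property; it is recomputed from other properties each time it is read.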

Structure

There is always a root document information item that all others descend from, and Infosets’ information items are required to form a tree, resulting in the primary relationship between information items being that of parent and child. Therefore, Infosets are rigidly hierarchical. Indeed, it is not possible to build a graph in an Infoset; one needs to layer a referencing mechanism on top of the Infoset to achieve this.
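Here is a small sketch of what such a layered referencing mechanism might look like, using Python’s standard xml.etree.ElementTree. The “id” and “manager” attribute names are hypothetical, and the rule that “manager” points at an “id” lives entirely in the application, not in the tree.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<people>
  <person id="alice" manager="carol"/>
  <person id="bob" manager="carol"/>
  <person id="carol" manager="alice"/>
</people>
""")

# The tree only knows parent and child. The application has to know that
# "manager" holds a reference and resolve it against an index of "id" values.
by_id = {person.get("id"): person for person in doc.iter("person")}


def manager_of(person_id: str) -> str:
    """Follow one application-defined reference edge."""
    target = by_id[by_id[person_id].get("manager")]
    return target.get("id")


print(manager_of("alice"))
print(manager_of(manager_of("alice")))  # back to alice: a cycle no tree can hold
```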

The information item that does most of the Infoset’s heavy lifting is the element information item (EII), because it can contain a variety of other information items in its [children] property. These include other element information items, comment information items, processing instruction information items, character information items, and unexpanded entity reference information items. It can also contain attribute information items in its [attributes] property.

The ordering of an element information item’s [children] can be significant; however, there is no way to determine whether this is the case by looking at an Infoset alone (this includes the Infoset’s cousin, the PSVI, as far as I can tell).

The structure of the tree, as well as the content of its properties, can be constrained by a schema, using any of several languages, including XML Schema, RELAX NG, and DTDs.

Identity

The Infoset’s basic mechanism for identifying something of interest in the Infoset is a (URI, localname) tuple, loosely called a QName, as per Namespaces in XML (strictly, a QName is the prefixed form that appears in markup, and the tuple it resolves to is an “expanded name”). This means that two properties have to be accessed to determine an information item’s full name: [namespace name] and [local name].
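For illustration, Python’s standard xml.etree.ElementTree packs the two parts into a single “{uri}local” string, and code that needs them separately has to split it again. The helper name and the example namespace below are my own.

```python
import xml.etree.ElementTree as ET
from typing import Tuple

doc = ET.fromstring('<foo:bar xmlns:foo="http://example.com/ns"/>')


def expanded_name(tag: str) -> Tuple[str, str]:
    """Split ElementTree's '{uri}local' tag into (namespace name, local name)."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return "", tag  # element is not in a namespace


namespace_name, local_name = expanded_name(doc.tag)
print(namespace_name)
print(local_name)
```

Note that the “foo” prefix does not appear in either part; only the namespace name and the local name make up the identity.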

As has been noted elsewhere, URIs are the primary identification mechanism for the Web, so it’s a bit strange to have an extra bit of information in an identifier; URIs alone should be able to identify anything. There does not exist any standard, widely-implemented way to transform QName tuples to URIs, or vice versa.
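To see why an ad hoc mapping is not enough, consider simple concatenation of the namespace name and local name, which is one convention in use (RDF/XML does this for its own purposes). The sketch below uses made-up namespace names to show that the mapping cannot be reversed in general.

```python
def concat_uri(namespace_name: str, local_name: str) -> str:
    """One ad hoc convention: append the local name to the namespace name."""
    return namespace_name + local_name


# Two different (namespace name, local name) tuples...
a = concat_uri("http://example.com/ns#", "item")
b = concat_uri("http://example.com/ns#it", "em")

# ...collapse into the same URI, so the tuple cannot be recovered from it.
print(a)
print(a == b)
```

Any round-trippable mapping needs an agreed rule for where the namespace name ends, and that agreement is exactly what is missing.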

In addition, the context of an information item is often used to identify it. For example, the “foo:bar” element information item can have a completely different meaning and content model, depending on its [parent] property. However, this cannot be practically indicated in the Infoset (or PSVI).

Similarly, the [in-scope namespaces] property can be used to contextualise the interpretation of character information items in the [children] property, or the [normalized value] of an attribute information item in the [attributes] property. Unfortunately, there isn’t any way to indicate this in the Infoset, which means that this context must be preserved. The PSVI does allow such content to be identified.
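The common case is a prefixed name appearing inside an attribute value. The sketch below, using Python’s standard xml.etree.ElementTree, tracks prefix bindings while parsing so that such a value can be resolved. The “type” attribute, the document, and the helper name are all hypothetical, and the code only knows to treat “type” this way because I told it to.

```python
import io
import xml.etree.ElementTree as ET

XML = b"""<root xmlns:t="http://example.com/types">
  <value type="t:integer">3</value>
</root>"""


def resolve_type_attributes(source: bytes):
    """Yield (namespace name, local name) for each 'type' attribute value."""
    bindings = []  # stack of (prefix, uri) pairs currently in scope
    events = ("start", "start-ns", "end-ns")
    for event, payload in ET.iterparse(io.BytesIO(source), events=events):
        if event == "start-ns":
            bindings.append(payload)   # payload is a (prefix, uri) pair
        elif event == "end-ns":
            bindings.pop()             # a declaration has gone out of scope
        elif event == "start" and "type" in payload.attrib:
            prefix, _, local = payload.attrib["type"].rpartition(":")
            scope = dict(bindings)     # later bindings override earlier ones
            # Whether an unprefixed value should use the default namespace
            # is itself an application decision; this sketch says yes.
            yield scope.get(prefix), local


resolved = list(resolve_type_attributes(XML))
print(resolved)
```

If the prefix bindings are thrown away before this step, the string “t:integer” can no longer be interpreted.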

Additionally, the [base URI] property can affect the interpretation of such content when it is a URI and the context calls for it. Again, this information is not available in the Infoset, and only the type of the content is available in the PSVI; whether or not the [base URI] property is to be observed is application-specific.
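A short sketch of the two readings, using Python’s standard library; the document and element names are invented, and the code reads the xml:base attribute directly rather than a computed base URI property.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"

doc = ET.fromstring(
    '<feed xml:base="http://example.com/blog/">'
    '<link>2004/05/12/infoset</link>'
    '</feed>'
)

link_text = doc.findtext("link")

# Reading 1: the application treats the content as an opaque string.
as_string = link_text

# Reading 2: the application knows the content is a URI reference
# and chooses to resolve it against the base.
as_uri = urljoin(doc.get(XML_BASE), link_text)

print(as_string)
print(as_uri)
```

Nothing in the parsed tree says which reading is intended; that knowledge has to come from the format’s definition.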

What Next?

Some people will undoubtedly read this and think that this proves the Infoset is a bad base to build a format upon. If the alternative is XML, I very much disagree; much of this painstaking precision is necessary with XML whether or not you use the Infoset.

XML is great as a medium of exchange, but successful exchange implies a shared model for your data; that model may be mapped into other models that are specific to one party, but there must be some shared understanding.

I’m forming a belief that the complexity of the Infoset as a data model forces an unwelcome choice upon its users:

1) You can describe your format in terms of the Infoset, and therefore get easy human-readability and writability, while getting a lot of baggage as part of the bargain. I believe that a lot of the problems evident in the use of XML Schema and XML itself have their root in this complexity.

2) Or, you can layer a model on top of the Infoset that explains how format-specific components are serialised into XML. For example, WSDL 2.0 defines a component model that gets serialised into XML; the markup is still very human-readable, and the model is clear. This is great for particular formats, but it takes a fair amount of work, and it’s very tricky to get the full benefits of Infoset-layer mechanisms like Schema in your component model.

3) The other option is to layer a generic model on top of the Infoset. This is the approach that RDF/XML takes; it insulates the data model from the XML serialisation, and as a consequence loses much of the intuitive readability of XML. Ask anybody about RDF, and they’ll tell you that they love the model, but hate the syntax.
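To give a feel for the last two options, here is a small Python sketch. The LineItem component, its XML serialisation, and the triple layout are all hypothetical; the triples are plain tuples meant only to suggest a generic model, not RDF proper.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LineItem:
    """A format-specific component, in the style of the second option."""
    name: str
    quantity: int

    def to_xml(self) -> ET.Element:
        return ET.Element("item", {"name": self.name, "quantity": str(self.quantity)})

    @classmethod
    def from_xml(cls, element: ET.Element) -> "LineItem":
        return cls(element.get("name"), int(element.get("quantity")))


def to_triples(item: LineItem, subject: str) -> List[Tuple[str, str, str]]:
    """A generic model, in the style of the third option: (subject, predicate, object)."""
    return [
        (subject, "name", item.name),
        (subject, "quantity", str(item.quantity)),
    ]


item = LineItem("foo", 3)
markup = ET.tostring(item.to_xml(), encoding="unicode")
round_tripped = LineItem.from_xml(ET.fromstring(markup))
triples = to_triples(item, "urn:example:item1")

print(markup)
print(round_tripped == item)
print(triples)
```

The component model needs its own mapping code for every component, while the generic model needs none but no longer looks like the markup a person would write by hand.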

The root of this, I think, is that XML was first and foremost a markup language, not a data modelling language; we’ve seen a number of attempts to layer something more appropriate on top of it (e.g., SOAP encoding, RDF/RDFS, XML Schema, etc.) but the human-readability draws people back to the Infoset level every time. There is some motion in the industry to define models on top of the Infoset in an implementation-specific way, but I suspect that they won’t be fully successful until they’re industry-wide.

Another issue that comes to light when you look at things this way is that most of the benefits of having generic XML standards (e.g., XSLT, XPath, Schema, Digital Signature) are offered at the Infoset layer or below, while most people using XML for data want to use an abstraction above the Infoset. This causes tension when people try to use these mechanisms in formats that have been developed in the #2 or #3 styles.

I’d very much like feedback on this; much of it is quite preliminary, and I haven’t drawn any solid conclusions yet; at this point, I’m more interested in exploring this perspective. Keep in mind that I don’t mean to question the value of XML itself — it brings huge value to the table. Rather, I’m questioning what some people are doing with it.

6 Comments

James Tauber said:

It could be my document-centric bias (like many members of the original WG, I came from a publishing and text processing background) but, for the most part, I’ve viewed XML as surface syntax (and by extension, XML schemas as grammars for surface syntax and the Infoset as modelling surface syntactic information).

RDF/RDFS has always seemed to me to be a much better data modelling language. The problem I’ve always had with the syntax of RDF is that it is neither a fixed serialization of the data model nor a generic mapping to-and-from arbitrary XML. It is rather a middle-ground where some common XML patterns are supported but not the generic case.

I’ve always argued that RDF should support a mapping to any XML surface syntax. Back when I was writing PyTREX (never updated to support RELAX NG, unfortunately) I was hoping to annotate the TREX grammar (which I saw as being about surface syntax) with a mapping to RDF (which I saw as the right way to express the underlying data model of the document). This plus something like Sparta would then be the XML data binding.

The Infoset is priceless for modelling the surface syntax. For everything else there’s RDF.

Saturday, May 15 2004 at 6:54 AM

Jay Fienberg said:

Really great piece–very useful. I mentioned it on my blog, http://icite.net/blog/200405/data_markup.html , though I haven’t yet formulated a larger set of comments. Basically, I think there is a really important need to consider what data structures literally “say” vs how people think they read.

Tuesday, May 25 2004 at 3:36 AM

Aleksander Slominski said:

One Infoset API that is also a very lightweight, interface-based document object model is XB1 in XPP3 for Java.

Sunday, August 8 2004 at 3:39 AM

Sylvain said:

Could it be that XML, after years of being praised by many, has finally settled, like some wine, and that the community has enough experience to start understanding the limitations it has by nature?

Maybe some people have marketed too strongly around the “XML can describe anything” idea and produced more noise than anything else.

A bit like the fact for many years now object oriented programming has been touted as the only true way to make software. What has happened? Today languages such as Lisp are coming back to life because it is clear that there is no unique way to do software.

I really appreciate the clarity of your articles, and this one is even more interesting as you use a neutral tone that avoids creating tension between supporters of each solution.

Monday, December 4 2006 at 3:37 AM

Bijan Parsia said:

"""This is the approach that RDF/XML takes; it insulates the data model from the XML serialisation, and as a consequence loses much of the intuitive readability of XML. Ask anybody about RDF, and they’ll tell you that they love the model, but hate the syntax."""

I present myself as a counterexample. I hate the model. Not for everything, but for many things. And there are plenty of situations in which I do not loathe it, but I’m not especially enamoured of it. In fact, I’m hard pressed to think of a context in which I feel love for it.

To prove this is a settled belief, I draw your attention to this message (and thread):

https://lists.w3.org/Archives/Public/www-rdf-logic/2005Jan/0006.html

Specifically:

"""(Note that I’ve not even touched how painful it is for people I’ve taught. We almost always end up falling back on standard logic syntax. This is not Turtle vs. RDF/XML…it’s not the awful xml serialization alone, it’s the relentless triplization.)

So, it sucks for authoring; it sucks for parsing/rendering/transforming/reasoning/storing/querying; it sucks for reading; it sucks for teaching; it sucks for extending; it sucks for metatheory."""

I find the trope that RDF has a great model but a sucky syntax to be one of the more annoying presumptions floating about (one that stifles reasonable design). I assure you that I am nowhere near alone in my feelings toward triples, and that this feeling is based on solid, extensive experience.

Monday, December 4 2006 at 3:54 AM

Bijan Parsia said:

Whoa…how’d I end up commenting on a post from two years ago…hm…ah someone else posted on it and somehow that made it come to my attention.

Funny ole thing, comments.

Monday, December 4 2006 at 4:07 AM

Mark Nottingham
