Informational Properties of Infosets

Wednesday, 12 May 2004

Recently, I’ve been thinking about the influences that using the Infoset has on the information you place in it.

To put it another way: if you work with XML at the Infoset level, what tools are you given to express information with? As an informational channel, the structures that XML gives you can express pretty much anything, of course, but they lend themselves to some things better than others.

As such, using the Infoset encourages data to be moulded to fit into certain shapes over others. I’d like to dig these influences out and consider them from a purely abstract perspective. The phenomena I discuss here aren’t new, by any means; however, they’ve always been seen from the perspective of “best practices for XML.” I want to turn it around and find out not how best to fit your data to XML, but what that process does to your data.

I have two motivations for this work; first, the Binary XML Characterisation Working Group is interested in figuring out the properties of XML (a worthy task no matter what you think of “Binary XML”), and this seems like a good step in that direction. Secondly, many Web specifications use the Infoset both to describe their structures and as a payload, so I’d like to better understand what that is, and if it’s an appropriate base to build upon.

Note that I’m not trying to characterise the information that the Infoset loses, as compared to XML 1.0; I think that’s fairly well-understood.

Here are some preliminary thoughts. Constructive feedback appreciated.

Building Blocks

There are two primary constructs in an Infoset; information items and properties. The Infoset spec defines eleven information items, and doesn’t appear to allow for the definition of any more; if you want a new kind, you’ll need to call the result something other than an Infoset.

Properties dangle off of information items, and are used to relate them to other information items, as well as carry other data specific to their information item. A fairly large number of properties are defined in the Infoset spec, as denoted by the [property] convention. There isn’t any explicit mechanism for defining new properties; other specifications have done so, but have called the resulting structure something different (e.g., XML Schema and the PSVI).

Because neither information items nor properties are extensible, the Infoset effectively requires that all of its payload be stuffed into various properties’ values.

In practice, this is awkward; rather than saying “‘foo’ has a ‘quantity’ of ‘3’” you have to say “the element information item with the [local name] property ‘foo’ has an attribute information item in its [attributes] property with the [local name] property ‘quantity’ and the [normalized value] property ‘3’.”

As a result, people manipulating Infosets in software usually interact with an abstraction that simplifies it, sometimes losing information in the process. Even specifications that use the Infoset to talk about XML (its original purpose) will resort to some kind of shorthand, rather than subject their readers to the turgid prose it requires.

In some cases, the choice of information item and property to use to contain a given piece of data is capricious; for example, the content of an attribute information item in the [attributes] property can very often be just as effectively represented as that of an element information item in the [children] property. Effectively, the type of information item and associated properties becomes noise in these cases.

In others, the content of a property is “live,” in that it can be calculated by examining other properties. This doesn’t do too much harm in a reasonable implementation, but it is extra information.

Structure

There is always a root document information item that all others descend from, and Infosets’ information items are required to form a tree, resulting in the primary relationship between information items being that of parent and child. Therefore, Infosets are rigidly hierarchical. Indeed, it is not possible to build a graph in an Infoset; one needs to layer a referencing mechanism on top of the Infoset to achieve this.

The information item that does most of Infoset’s heavy lifting is the element information item (EII), because it can contain a variety of other information items, in its [children] property. This includes other element information items, comment information items, processing Instruction information items, character information items, and unexpanded entity reference information items. It can also contain Attribute Information Items in its [attributes] property.

The ordering of an element information item’s [children] can be significant; however, there is no way to determine whether this is the case by looking an an Infoset alone (This includes the Infoset’s cousin, the PSVI, as far as I can tell).

The structure of the tree, as well as the content of its properties, can be constrained by a schema, using any of several languages, including XML Schema, RelaxNG, and DTDs.

Identity

The Infoset’s basic mechanism for identifying something of interest in the Infoset is a (URI, localname) tuple called a QName, as per Namespaces in XML. This means that two properties have to be accessed to determine an information item’s full name; [namespace name] and [local name].

As has been noted elsewhere, URIs are the primary identification mechanism for the Web, so it’s bit strange to have an extra bit of information in an identifier; it should be that they can identify anything. There does not exist any standard, widely-implemented way to transform QName tuples to URIs, or vice versa.

In addition, the context of an Information Item is often used to identify it. For example, the “foo:bar” Element Information Item can have a completely different meaning and content model, depending on its [parent] property. However, this cannot be practically indicated in the Infoset (or PSVI).

Similarly, the [in-scope namespaces] property can be used to contextualise the interpretation of character information items in the [children] property, or the [normalized content] of an attribute information item in the [attributes] property. Unfortunately, there isn’t any way to indicate this in the Infoset, which means that this context must be preserved. The PSVI does allow such content to be identified.

Additionally, the [base uri] property can affect the interpretation of such content if it is a URI, if the context specifies it. Again, this information is not available in the Infoset, and only the type of the content is available in the PSVI; whether or not the [base uri] property is to be observed is application-specific.

What Next?

Some people will undoubtedly read this and think that this proves the Infoset is a bad base to build a format upon. If the alternate is XML, I very much disagree; much of this painstaking precision is necessary with XML whether or not you use the Infoset.

XML is great as a medium of exchange, but successful exchange implies a shared model for your data; that model may be mapped into other models that are specific to one party, but there must be some shared understanding.

I’m forming a belief that the complexity of the Infoset as a data model forces an unwelcome choice upon its users:

1) You can describe your format in terms of the Infoset, and therefore get easy human-readability and writability, while getting a lot of baggage as part of the bargain. I believe that a lot of the problems evident in the use of XML Schema and XML itself have their root in this complexity.

2) Or, you can layer a model on top of the Infoset that explains how format-specific components are serialised into XML. This is great for particular formats, but a fair amount of work. For example, WSDL 2.0 defines a component model that gets serialised into XML; the markup is still very human-readable, and the model is clear. However, it takes a fair amount of work to do this, and it’s very tricky to get the full benefits of Infoset-layer mechanisms like Schema in your component model.

3) The other option is to layer a generic model on top of the Infoset. This is the approach that RDF/XML takes; it insulates the data model from the XML serialisation, and as a consequence loses much of the intuitive readability of XML. Ask anybody about RDF, and they’ll tell you that they love the model, but hate the syntax.

The root of this, I think, is that XML was first and foremost a markup language, not a data modelling language; we’ve seen a number of attempts to layer something more appropriate on top of it (e.g., SOAP encoding, RDF/RDFS, XML Schema, etc.) but the human-readability draws people back to the Infoset level every time. There is some motion in the industry to define models on top of the Infoset in an implementation-specific way, but I suspect that they won’t be fully successful until they’re industry-wide.

Another issue that comes to light when you look at things this way is that while most of the benefits of having generic XML standards are offered at the Infoset layer or below — e.g., XSLT, XPath, Schema, Digital Signature — while most people using XML for data want to use an abstraction above the Infoset. This causes tension when people try to use these mechanisms in formats that have been developed in the #2 or #3 styles.

I’d very much like feedback on this; much of it is quite preliminary, and I haven’t drawn any solid conclusions yet; at this point, I’m more interested in exploring this perspective. Keep in mind that I don’t mean to question the value of XML itself — it brings huge value to the table. Rather, I’m questioning what some people are doing with it.

Mark Nottingham

other XML posts