Saturday, 13 August 2005
Adding Semantics to Excel with Microformats and GRDDL
When I worked in the financial industry, I quickly noticed that Excel spreadsheets contain the bulk of the data in the enterprise. It may make IT execs tear their hair out, but having the data nearby and ready for analysis is sloppy, but oh-so-effective. The challenge is to make the data reusable elsewhere.
Unfortunately, spreadsheets are a mish-mash of structured but meaningless data; there’s no easy way to tell which columns contain data and which ones are headers. To make them useful, we have to add some reusable, stable semantics to them.
Adding semantics to an existing document is fairly straightforward when you have a schema for it, or when you can do a format-specific mapping (e.g., with GRDDL). But what happens when you get a generic container format that doesn’t have much structure to hang onto in the first place?
In HTML, the answer has become microformats. I like this approach but have had concerns, because I thought they were just using element and attribute extensibility willy-nilly, which brings HTML validation problems. However, a little while back I ran into Tantek Çelik, who patiently corrected my misperceptions.
Microformats work by exploiting the existing mechanisms in a format to add meaningful tags; in HTML, this means things like the ‘class’ and ‘rel’ attributes. You then can reference (usually in the document’s metadata) a description of those tags, either in prose or another microformat, to unambiguously associate machine-readable semantics with existing content.
That’s great for HTML, but what about other formats? In Excel, Named Ranges are locally-scoped identifiers for cells, ranges and formulas in spreadsheets; they can be used to unambiguously identify something. Think of them as xml:id attributes in XML (where normal
sheet1!A1:10 references are as to XPaths into the document’s structure); they’re a lot less prone to change if the document gets edited.
Another microformats technique is to use a format’s metadata slots to store data and declare that a particular microformat is in use. In HTML, this is the HEAD section’s LINK and META tags. All recent versions of Excel allow you to add custom metadata to a document, so this is covered as well.
As a bonus, Excel 2003 and later also allow XML export, which makes it a lot easier to suck in marked-up documents and extract the semantic goodness from them.
Pulling It All Together
So, in a nutshell, Excel can be used as an incredibly powerful, flexible editor/browser for tabular data, which can be made available and reusable through microformats techniques. But what to do with the data?
For my purposes, turning it into RDF is interesting, so I’ve created a GRDDL-like stylesheet that turns SpreadsheetML documents into RDF/XML, based on some fairly generic hints, microformat-style.
To use it, you need to do the following to your spreadsheet;
- Name all cells containing headers (columns or rows) with the tag ‘Header’
- Name all cells containing non-header data that you want to capture with the tag ‘Data’
- Name each Header and Data cell with exactly one other name that will become the predicate for that object
- Add a custom document property containing the profile and desired property namespace (see the example)
- Save as XML.
Now, run it through the stylesheet and watch the magic. The beauty is that this is within the grasp of a non-technical user; the vocabulary is small, and if they know the tools, it’s pretty intuitive. Excel as RDF editor, here we come!
Note that RDF is just one possible output; the stylesheet is malleable, and the vocabulary can be used with completely different implementations.
If you modify or use the stylesheet, I’d be interested to hear about it, either below or in e-mail.