Tuesday, 18 October 2005
XSLT for the Rest of the Web
While that’s all fine and good on some higher-plane, utopian, RESTful, stateless, DTD- and Schema-described, Cool URIish Web, it’s not that useful on the Web that most of us surf every day. Don’t get me wrong: I believe in all (OK, most) of those things, but most of the data I’m aching to get my automated fingers on doesn’t live there (yet).
So, what to do? While there are toolkits out there for scraping ill-formed HTML, they’re usually very language-specific, fairly procedural, and don’t let me leverage the considerable value in the XML stack. I want a tool that lets me work with the Web natively in XSLT at a pretty low level. Which leads to the question,
What Happens When You Glue XSLT and HTML Tidy Together?
After looking around a bit and making a few false starts, here’s what I came up with: libxslt_web. It’s a set of extension functions for the very fast, very powerful and very badly documented libxslt, companion to libxml2. What do they do? Here’s the list so far:
- get(uri) — HTTP GET the URI, returning the response headers and body as a nodeset.
- post(uri, body, content_type) — HTTP POST the body, using content_type, to URI. Return the response headers and body as a nodeset.
- tidy_parse(node) — parse the node’s content using HTML Tidy and return a nodeset. No matter what.
- form_encode(node) — encode the node’s children using urlencoding (see example in the code).
GET and POST both allow access to the HTTP response headers that came back, as well as the body.
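To make the shape of those functions a little more concrete, here is a rough sketch in modern Python using only the standard library. It is not the libxslt_web code. The element names (`response`, `headers`, `header`, `body`) and the helper `response_to_tree` are hypothetical, and `form_encode` assumes that each child element's tag is the field name and its text is the value, which may not match the real extension.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def form_encode(node):
    """URL-encode the children of `node` as name=value pairs.

    Assumption: each child's tag is the field name and its text is the
    field value. The real extension may expect a different layout.
    """
    pairs = [(child.tag, child.text or "") for child in node]
    return urlencode(pairs)


def response_to_tree(status, headers, body):
    """Wrap an HTTP response in a small XML tree (hypothetical layout),
    so that a stylesheet could reach both headers and body as nodes."""
    response = ET.Element("response", {"status": str(status)})
    headers_el = ET.SubElement(response, "headers")
    for name, value in headers:
        header = ET.SubElement(headers_el, "header", {"name": name})
        header.text = value
    body_el = ET.SubElement(response, "body")
    body_el.text = body
    return response


# Encode a small form description.
form = ET.fromstring("<form><user>alice</user><q>xslt &amp; tidy</q></form>")
encoded = form_encode(form)

# Build a response tree by hand (no network access) and read a header back.
tree = response_to_tree(200, [("Content-Type", "text/html")], "<p>hi")
content_type = tree.find("headers/header[@name='Content-Type']").text
```

With headers exposed as ordinary nodes, picking out something like a `Content-Type` or `Set-Cookie` value becomes a plain path expression rather than a special API call.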
I’ve already played around with scraping my Amazon shopping cart and some bank accounts; one of my main use cases for this is to automate the process of downloading my QIF files from different accounts, so I don’t have to do the Dance of the Thousand Clicks every month to get at my own data. It strikes me that it would also be trivial to implement a microformat parser using this technique.
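Scraping like this depends on tidy_parse always handing back a tree, however bad the markup. As a stand-in illustration only, the sketch below gets similar "no matter what" behaviour from Python's lenient `html.parser` instead of HTML Tidy. The class `LenientTreeBuilder`, the function `lenient_parse` and the `html-fragment` wrapper element are all hypothetical names, and the repair rules are far cruder than Tidy's.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Elements that never have content, so they are never pushed on the stack.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class LenientTreeBuilder(HTMLParser):
    """Build an ElementTree from ill-formed HTML without raising."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("html-fragment")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        attributes = {name: value or "" for name, value in attrs}
        element = ET.SubElement(self.stack[-1], tag, attributes)
        if tag not in VOID:
            self.stack.append(element)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element;
        # a stray end tag with no match is silently ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        current = self.stack[-1]
        if len(current):
            # Text after a child element belongs in that child's tail.
            last = current[-1]
            last.tail = (last.tail or "") + data
        else:
            current.text = (current.text or "") + data


def lenient_parse(markup):
    builder = LenientTreeBuilder()
    builder.feed(markup)
    builder.close()  # flush any text still buffered at end of input
    return builder.root


# Unclosed <li>, <p> and <b> elements plus a stray </i> still yield a tree.
root = lenient_parse("<ul><li>One<li>Two</ul></i><p>unclosed <b>bold")
items = [li.text for li in root.iter("li")]
bold = root.find("p/b").text
```

Unlike Tidy, this sketch does not know that a new `<li>` implicitly closes the previous one, so the second item ends up nested inside the first. The tree is still queryable, which is the property the scraping use case relies on.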
I’ve got a few ideas of where it should go next; there are a number of facets of HTTP that it ignores right now (mostly as a result of deficiencies in Python’s urllib2). Suggestions and feedback are welcome; I’m especially interested in efforts to implement something similar in other XSLT engines.