XSLT for the Rest of the Web
Tuesday, 18 October 2005
I’ve raved before about how useful the XSLT document() function is, once you get used to it. However, the stars have to be aligned just so to use it; the Web site can’t use cookies for anything important, and the content you’re interested in has to be available in well-formed XML.
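As a refresher, `document()` lets a stylesheet pull a second XML resource in by URI mid-transform. A minimal sketch (the feed URI is a made-up example):

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Pull a remote, well-formed XML document into the transform.
       The URI is hypothetical; any well-formed XML resource works. -->
  <xsl:template match="/">
    <titles>
      <xsl:for-each select="document('http://example.org/feed.xml')//item">
        <title><xsl:value-of select="title"/></title>
      </xsl:for-each>
    </titles>
  </xsl:template>
</xsl:stylesheet>
```

The catch, of course, is that this only works when the target really is well-formed XML; that restriction is exactly what the rest of this post is about removing.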
While that’s all well and good on some higher-plane, utopian, RESTful, stateless, DTD- and Schema-described, Cool URI-ish Web, it’s not that useful on the Web that most of us surf every day. Don’t get me wrong — I believe in all (OK, most) of those things, but most of the data I’m aching to get my automated fingers on doesn’t live there (yet).
So, what to do? While there are toolkits out there for scraping ill-formed HTML, they’re usually very language-specific, fairly procedural, and don’t let me leverage the considerable value in the XML stack. I want a tool that lets me work with the Web natively in XSLT at a pretty low level. Which leads to the question,
What Happens When You Glue XSLT and HTML Tidy Together?
After looking around a bit and a few false starts, here’s what I came up with: libxslt_web. It’s a set of extension functions for the very fast, very powerful and very badly documented libxslt, companion to libxml2. What do they do? Here’s the list so far:
- get(uri) — HTTP GET the URI, returning the response headers and body as a nodeset.
- post(uri, body, content_type) — HTTP POST the body, using content_type, to URI. Return the response headers and body as a nodeset.
- tidy_parse(node) — parse the node’s content using HTML Tidy and return a nodeset. No matter what.
- form_encode(node) — encode the node’s children using urlencoding (see example in the code).
GET and POST both allow access to the HTTP response headers that came back, as well as the body.
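To give a feel for how these fit together, here’s a sketch of fetching a page and tidying it into something XPath can query. The `web` namespace URI below is a placeholder (use whatever namespace libxslt_web actually registers), and the exact shape of the response nodeset that `get()` hands to `tidy_parse()` is an assumption on my part:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:web="http://example.org/libxslt_web"
    extension-element-prefixes="web">
  <!-- NOTE: the 'web' namespace URI is a placeholder, not the real one. -->
  <xsl:template match="/">
    <!-- Fetch the page (headers + body as a nodeset), then force the
         body into XML via HTML Tidy, no matter how broken the markup. -->
    <xsl:variable name="response" select="web:get('http://example.org/')"/>
    <xsl:variable name="doc" select="web:tidy_parse($response)"/>
    <page-title>
      <xsl:value-of select="$doc//title"/>
    </page-title>
  </xsl:template>
</xsl:stylesheet>
```

From there it’s ordinary XSLT: variables, XPath, templates — all the machinery of the XML stack, pointed at tag soup.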
What Next?
I’ve already played around with scraping my Amazon shopping cart and some bank accounts; one of my main use cases for this is to automate the process of downloading my QIF files from different accounts, so I don’t have to do the Dance of the Thousand Clicks every month to get at my own data. It strikes me that it would also be trivial to implement a microformat parser using this technique.
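For instance, pulling hCard names out of a page is just an XPath expression over the tidied tree. A fragment sketch, assuming (as above) a placeholder `web` namespace prefix and that `tidy_parse()` output puts elements in no namespace:

```xml
<!-- Match elements whose class attribute contains the 'fn' token,
     the hCard "formatted name" property. -->
<xsl:template match="/">
  <xsl:variable name="doc"
      select="web:tidy_parse(web:get('http://example.org/contacts'))"/>
  <names>
    <xsl:for-each select="$doc//*[contains(concat(' ', @class, ' '), ' fn ')]">
      <name><xsl:value-of select="normalize-space(.)"/></name>
    </xsl:for-each>
  </names>
</xsl:template>
```

The `concat(' ', @class, ' ')` trick is there so that `fn` matches as a whole class token rather than as a substring of, say, `fname`.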
I’ve got a few ideas of where it should go next; there are a number of facets of HTTP that it ignores right now (mostly as a result of deficiencies in Python’s urllib2). Suggestions and feedback are welcome; I’m especially interested in efforts to implement something similar in other XSLT engines.
P.S. If you’re installing on OSX, try DarwinPorts: `port install python` (you need 2.4), `port install libxml2`, `port install libxslt`, and then manually install mxTidy.