XSLT for the Rest of the Web

Tuesday, 18 October 2005

I’ve raved before about how useful the XSLT document() function is, once you get used to it. However, the stars have to be aligned just so to use it; the Web site can’t use cookies for anything important, and the content you’re interested in has to be available in well-formed XML.

While that’s all fine and good on some higher-plane, utopian, RESTful, stateless, DTD- and Schema-described, Cool URIish Web, it’s not that useful on the Web that most of us surf every day. Don’t get me wrong; I believe in all (OK, most) of those things, but most of the data I’m aching to get my automated fingers on doesn’t live there (yet).

So, what to do? While there are toolkits out there for scraping ill-formed HTML, they’re usually very language-specific, fairly procedural, and don’t let me leverage the considerable value in the XML stack. I want a tool that lets me work with the Web natively in XSLT at a pretty low level. Which leads to the question,

What Happens When You Glue XSLT and HTML Tidy Together?

After looking around a bit and a few false starts, here’s what I came up with; libxslt_web. It’s a set of extension functions for the very fast, very powerful and very badly documented libxslt, companion to libxml2. What do they do? Here’s the list so far;

get(uri) — HTTP GET the URI, returning the response headers and body as a nodeset.
post(uri, body, content_type) — HTTP POST the body, using content_type, to URI. Return the response headers and body as a nodeset.
tidy_parse(node) — parse the node’s content using HTML Tidy and return a nodeset. No matter what.
form_encode(node) — encode the node’s children using urlencoding (see example in the code).

GET and POST both allow access to the HTTP response headers that came back, as well as the body.
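
To make the list above a little more concrete, here is a rough sketch of what form_encode and the "headers and body as a nodeset" result might look like. It is not the libxslt_web code itself: it uses only the Python 3 standard library, and the element layout (child tag as field name, a response element holding headers and body) plus the name response_to_xml are assumptions made purely for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def form_encode(node):
    """URL-encode the children of `node` as name=value pairs.

    Assumption: each child's tag is the field name and its text is the
    field value; the real extension may expect a different layout.
    """
    pairs = [(child.tag, child.text or "") for child in node]
    return urlencode(pairs)


def response_to_xml(status, headers, body):
    """Wrap an HTTP response in a small XML tree (hypothetical layout)."""
    root = ET.Element("response", {"status": str(status)})
    header_parent = ET.SubElement(root, "headers")
    for name, value in headers:
        header = ET.SubElement(header_parent, "header", {"name": name})
        header.text = value
    # The body is kept as raw text here; it would still need tidying.
    ET.SubElement(root, "body").text = body
    return root


form = ET.fromstring("<form><q>xslt tidy</q><page>2</page></form>")
print(form_encode(form))  # q=xslt+tidy&page=2

resp = response_to_xml(200, [("Content-Type", "text/html")], "<p>hi")
print(ET.tostring(resp, encoding="unicode"))
```

Keeping the body as plain text inside the tree is deliberate in this sketch: the response may be ill-formed HTML, so it has to go through a separate tidying step before it can be queried as nodes.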

What Next?

I’ve already played around with scraping my Amazon shopping cart and some bank accounts; one of my main use cases for this is to automate the process of downloading my QIF files from different accounts, so I don’t have to do the Dance of the Thousand Clicks every month to get at my own data. It strikes me that it would also be trivial to implement a microformat parser using this technique.
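
As a rough illustration of the microformat idea, the sketch below turns sloppy HTML into a tree and then pulls the formatted names out of hCards (elements with class "fn" inside an element with class "vcard"). It swaps HTML Tidy for Python 3's standard html.parser so that it runs without extra libraries, and the names SoupToTree, parse_soup and hcard_names are made up for this example; treat it as a sketch under those assumptions, not as how the extension functions work.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# HTML elements that never take a closing tag.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class SoupToTree(HTMLParser):
    """Build an ElementTree from sloppy HTML (a stand-in for HTML Tidy)."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("html-fragment")  # hypothetical wrapper
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        attrib = {name: value or "" for name, value in attrs}
        elem = ET.SubElement(self.stack[-1], tag, attrib)
        if tag not in VOID:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; ignore strays.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        current = self.stack[-1]
        if len(current):  # text after a child belongs in that child's tail
            last = current[-1]
            last.tail = (last.tail or "") + data
        else:
            current.text = (current.text or "") + data


def parse_soup(html):
    parser = SoupToTree()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return parser.root


def hcard_names(tree):
    """Return the text of each class="fn" element inside a class="vcard"."""
    names = []
    for card in tree.iter():
        if "vcard" in card.get("class", "").split():
            for elem in card.iter():
                if "fn" in elem.get("class", "").split():
                    names.append("".join(elem.itertext()).strip())
    return names


html = ('<div class="vcard"><p><span class=fn>Jane <b>Doe</span><br>'
        '<a class="url" href="http://example.org/">home</a></div>')
tree = parse_soup(html)
print(hcard_names(tree))  # ['Jane Doe']
```

The input has an unquoted attribute and unclosed b and p elements, yet the result serializes as well-formed XML, which is the property the XSLT side depends on.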

I’ve got a few ideas of where it should go next; there are a number of facets of HTTP that it ignores right now (mostly as a result of deficiencies in Python’s urllib2). Suggestions and feedback are welcome; I’m especially interested in efforts to implement something similar in other XSLT engines.

P.S. If you’re installing on OSX, try darwinports; just port install python (you need 2.4); port install libxml2; port install libxslt and then manually install mxTidy.

12 Comments

M. David Peterson said:

Hi Mark,

I agree with Sylvain. Excellent post.

Another interesting tidbit that you may not have come across yet: since the release of Saxon 8.5 (the current release is 8.5.1), you can do two things with the collection() function that fit right in line with what you are describing. The first is the ability to process an entire directory of files, selecting the file type with a standard pattern such as *.html (or whatever other file type you may want to process). The second is the ability to call an extension function to preprocess these files, so that the transformation then runs on the result.

If you visit this > http://www.xsltblog.com/archives/2005/08/saxon_85_now_av.html http://weblog.saxondotnet.org <

Of course this will only be of interest to those who develop for the .NET platform via Microsoft’s or the Mono Project’s implementation of the CLI Framework.

Hope this info helps!

Wednesday, October 19 2005 at 3:35 AM

M. David Peterson said:

It seems a portion of my post was cut off… probably because I used less-than and greater-than symbols to point to URLs. Here’s the last bit of the post again, which hopefully won’t get cut off…

If you visit this http://www.xsltblog.com/archives/2005/08/saxon_85_now_av.html post you will find this line item entry close to the bottom, which showcases how this works using the “query string” syntax that most of us are already quite used to:

collection("dir?recurse=yes;select=*.html;parser=org.ccil.cowan.tagsoup.Parser")

returns all the *.html files in the given directory, expanded recursively, using John Cowan’s TagSoup parser to convert them on-the-fly to XML

As a side note, this functionality is also available in Saxon.NET 8.5.1, which I haven’t released publicly yet, except to the curious folks who have stumbled upon the fact that I have been posting all of the latest builds to the project’s Subversion repository, which is viewable via a Trac interface. You can access the link via the project’s weblog, http://weblog.saxondotnet.org

Of course this will only be of interest to those who develop for the .NET platform via Microsoft’s or the Mono Project’s implementation of the CLI Framework.

Hope this info helps!

Wednesday, October 19 2005 at 3:40 AM

Sylvain Hellegouarch said:

Hey David,

It’s interesting indeed and could be very handy.

I think having an XSLT toolbox for the web would be useful. For instance, I’ve been playing a lot when writing my blog system with Atom and XSLT.

I tried to keep its functionality very basic (CRUD entries and comments).

My goal was to try to use only Atom 1.0 to store all the data (entries and comments), then use XSLT 1.0 to transform directly those entries into XHTML.

All of this without having to go through an intermediate layer to transform the entries before processing them through XSLT.

That’s why I did use the document() function to fetch the data as needed.

But for instance, let’s assume the entry content has been entered as Markdown-formatted content. Right now, I need to transform the formatted content into XHTML before storing it in the Atom entry.

But when I want the user to update the content, I’ve lost the original formatted source.

It’d have been neat to be able to call up an extension to transform the Markdown content into XHTML when rendering the Atom feed itself.

Anyway, that’s one case where I would have found it handy to have a toolbox such as EXSLT for web tasks.

Sylvain

Wednesday, October 19 2005 at 3:51 AM

l.m.orchard said:

Mark: Ooh, nice stuff.

I’ve been pretty well convinced for a while now that REST + XSLT is a powerful combination, and I’ve been muddling my way through with it for a few years. I’ve got a suite of XSL for scraping RSS and Atom from web pages, and wrote a few articles on XSLT with wishlists in the Amazon API. And with all the AJAX stuff lately, I’m sold on XSLT being a really good tool in the REST stack. Oh yeah, and I’ve been screwing with XSLT filters in WSGI in Python.

Hope that paragraph above wasn’t gratuitous, but I get excited about this stuff. :)

Now… How well have you fared at using document() on URLs with query strings? I noticed the form_encode() function up there. I don’t know if it was the version I was using, but I was having an issue where apparently libxslt was stripping query strings in the request, so I couldn’t pass any parameters.

Does this bug ring a bell at all…?

Wednesday, October 19 2005 at 4:52 AM

Bob DuCharme said:

Instead of using extension functions with libxslt, have you tried using the -html switch with xsltproc, libxslt’s command-line XSLT processor? It tells xsltproc to treat the source HTML as if it were well-formed, and I’ve used it with the document() function with no need for extra libraries.

There’s a bit more on this at http://www.xml.com/pub/a/2005/08/03/libxslt.html.

Bob

Wednesday, October 19 2005 at 7:04 AM

Mark Nottingham said:

Bob,

I have in the past; while it sometimes worked, I found that it isn’t forgiving enough for some pages. Also, I really needed the cookie-handling and POST capabilities.

Wednesday, October 19 2005 at 8:04 AM

Mark Nottingham said:

Thanks!

To be honest, when I want a command-line XSLT processor, I use Saxon, although as I get more familiar with libxslt, I might switch over to xsltproc. So, I hadn’t noticed anything (yet).

Wednesday, October 19 2005 at 8:37 AM

Sylvain Hellegouarch said:

Hi there,

This is excellent. Really an excellent idea. I was feeling doomed while playing with the document() function myself in a web context. It felt like the function was very limited.

Your idea is very interesting, and I’d like to know how you would feel about an effort like EXSLT, but with extensions specific to the web context.

For instance, an extension that fetches a text document and transforms it depending on its format (Markdown, Textile, etc.); the result would be a regular nodeset.

Thoughts?

Sylvain

Wednesday, October 19 2005 at 12:44 PM

Daniel Veillard said:

l.m.orchard: hmm, I think some of this was fixed lately; try filing a Bugzilla report if you still have problems with the latest versions of the libxml2/libxslt combo.

Mark: yeah, the docs are very minimal, especially for programming extensions. Unfortunately I don’t really have time for this at this point, and I would rather focus on the libxml2 docs first, as there are far more people using it.

Daniel

Thursday, October 20 2005 at 7:27 AM

Mark Nottingham said:

Daniel,

Understood; the performance and power more than make up for it. The hardest thing was finding the functions and methods with the semantics I wanted, because the footprint is so large.

FWIW, it took me about four nights of on-and-off hacking to get to this point; finding some of the test suite and a few relevant blog entries in Google halfway through helped tremendously.

Thanks for a great stack!

Thursday, October 20 2005 at 10:32 AM

Terris Linenbach said:

A team I was involved with did this, but not as elegantly, in mid-1999. We didn’t think about putting a Tidy call into the XSLT engine, but rather had a small workflow engine that treated XSLT as just a task, I guess sort of like Cocoon.

Before the courts shut down aggregators, this was quite an innovative space: Yodlee…. vertical one… 1view…

Not sure if I’m spelling those names correctly.

Somehow Microsoft, Yahoo, Excite, AltaVista, and later Google managed to avoid litigation (universality?).

What choice do we have besides XSLT really… I guess XQuery… Why aren’t you using XQuery anyway?

Monday, November 14 2005 at 6:36 AM

l.m.orchard said:

A question, which you may or may not know the answer to: I’m working from a fresh install of OS X Tiger on my laptop and installed DarwinPorts. However, it looks like the version of libxml2 it installed, 2.6.22, is missing a constant, HTML_PARSE_RECOVER.

Having not yet dug in very far, is this a big issue?

Saturday, November 19 2005 at 11:23 AM

mark nottingham
