mark nottingham

Extreme URL Scraping and Debugging

Saturday, 5 June 2004

Because Web sites often don’t make information available to us in the way we’d like, we have to bring the mountain to Mohammed and scrape screens.

I’ve played around with this in the past with xpath2rss, a scraping tool that allows you to generate RSS feeds from HTML. Jon Udell has done likewise with his excellent LibraryLookup experiment (Jon, if you read this, please add [Peninsula Library](javascript:var%20re=/([\/-] is[bs]n=)(\d{7,9}[\dX])/i;if(re.test(location.href)==true){var%20isbn=RegExp.$2;void(win=window.open('http://catalog.plsinfo.org'+'/ipac20/ipac.jsp?index=ISBN&term='+isbn,'LibraryLookup','scrollbars=1,resizable=1,location=1,width=575,height=500'))})).

In many cases, scraping the HTML is less than half the battle; the real challenge is getting the right URL. Of course, a Web description format would help greatly here, but in the meantime some tools would come in handy.

Specifically, many Web sites use POST when they really should use GET. This is a shame, but I’ve found that lots of them still actually support GET, probably because of the tools they’re using. In other words, if you go into a Web form and hack the form method from POST to GET, you’ll find that in many cases it’ll work, making your search results cacheable.
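As a rough illustration of what "hacking POST to GET" amounts to, here is a minimal Python sketch that builds the URL a browser would request if a form were submitted with GET. The function name `form_to_get_url` and the example.org addresses are made up for illustration, and whether a given site actually honours the GET request is something you can only find out by trying it.

```python
from urllib.parse import urlencode, urljoin, urlsplit, urlunsplit


def form_to_get_url(page_url, action, fields):
    """Return the URL a GET submission of the form would request.

    page_url: address of the page containing the form
    action:   the form's action attribute (may be relative)
    fields:   list of (name, value) pairs taken from the form controls
    """
    # Resolve a relative action against the page it came from.
    target = urljoin(page_url, action)
    scheme, netloc, path, _old_query, fragment = urlsplit(target)
    # On a GET submission the encoded form data becomes the query string,
    # replacing any query already present on the action URL.
    return urlunsplit((scheme, netloc, path, urlencode(fields), fragment))


if __name__ == "__main__":
    print(form_to_get_url(
        "http://example.org/search/index.html",
        "results.cgi",
        [("q", "web services"), ("max", "10")],
    ))
```

The resulting URL can be bookmarked, handed to a scraper, or served from a cache, which is the point of preferring GET for searches.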

To do this right now, I have to save the HTML, change the form, put a base URI in, and then send that to my browser. It would be much easier if there were a browser plugin that allowed me to change a POST form to a GET form dynamically, perhaps with a contextual menu on the submit button, or a menu option.
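Until such a plugin turns up, the manual steps can be scripted. The following is only a sketch: it uses regular expressions rather than a real HTML parser, assumes the saved page has no `<base>` element already, and the name `post_forms_to_get` is hypothetical.

```python
import html
import re

# Matches an opening <form ...> tag.
FORM_TAG = re.compile(r"<form\b[^>]*>", re.IGNORECASE)
# Matches method=post, method="post" or method='post' inside such a tag.
METHOD_POST = re.compile(
    r"""(\bmethod\s*=\s*)(["']?)post\2(?=[\s>])""", re.IGNORECASE
)
HEAD_TAG = re.compile(r"<head\b[^>]*>", re.IGNORECASE)


def post_forms_to_get(page, base_url):
    """Rewrite POST forms in saved HTML to GET and add a base URI."""

    def fix_form(match):
        # Only the method attribute changes; the rest of the tag is kept.
        return METHOD_POST.sub(r"\g<1>\g<2>get\g<2>", match.group(0))

    page = FORM_TAG.sub(fix_form, page)
    # The base element lets relative actions and links resolve against
    # the original site when the saved copy is opened locally.
    base = '<base href="%s">' % html.escape(base_url, quote=True)
    if HEAD_TAG.search(page):
        return HEAD_TAG.sub(lambda m: m.group(0) + base, page, count=1)
    return base + page
```

Write the result to a local file and open it in the browser; submitting the form then issues a GET against the original site.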

Another helpful thing would be a mechanism to step through HTTP and HTML redirects, in the same manner that you step through breakpoints in a debugger.
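Outside the browser, something like this can be approximated with a short script. The sketch below is one possible approach using Python's standard `urllib`, not an existing tool: it stops at each hop, prints where the response points, and waits for Enter before following. It treats "HTML redirects" as `<meta http-equiv="refresh">` tags, which is an assumption, and the names `NoRedirect`, `next_hop`, `fetch` and `step_through` are invented for the example.

```python
import re
import urllib.error
import urllib.request
from urllib.parse import urljoin

# Rough pattern for <meta http-equiv="refresh" content="0; url=...">.
META_REFRESH = re.compile(
    r"""<meta[^>]+http-equiv\s*=\s*["']?refresh["']?[^>]*"""
    r"""content\s*=\s*["']?\s*\d+\s*;\s*url\s*=\s*([^"'>\s]+)""",
    re.IGNORECASE,
)


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following 3xx responses automatically."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # urllib then raises HTTPError for the 3xx response


def next_hop(url, status, location, body):
    """Return the absolute URL this response redirects to, or None."""
    if status in (301, 302, 303, 307, 308) and location:
        return urljoin(url, location)
    match = META_REFRESH.search(body)
    if match:
        return urljoin(url, match.group(1))
    return None


def fetch(url):
    """Fetch one URL without following redirects."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        resp = opener.open(url)
    except urllib.error.HTTPError as err:
        resp = err  # 3xx, 4xx and 5xx responses arrive as HTTPError
    try:
        body = resp.read(65536).decode("latin-1")
        return resp.code, resp.headers.get("Location"), body
    finally:
        resp.close()


def step_through(url, max_hops=10):
    """Follow redirects one at a time, pausing before each hop."""
    for _ in range(max_hops):
        status, location, body = fetch(url)
        target = next_hop(url, status, location, body)
        print(status, url, "->", target)
        if target is None:
            return url
        input("Press Enter to follow the redirect...")
        url = target
    return url
```

Calling `step_through("http://example.org/")` from an interactive session walks the chain one response at a time; `max_hops` is there only to stop redirect loops.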

Ideally, these enhancements would be in Safari, but I suspect I’ll have more luck if I ask the LazyWeb for them in Firefox.


5 Comments

RichB said:

The Web Developer Toolbar in Firefox (select Custom installation, then check the Web Developer checkbox) has an option called “Convert POSTs to GETs”

Saturday, June 5 2004 at 5:06 AM

Nathan McFarland said:

To change your POSTs to GETs in firefox use the excellent Web Developer Extension at http://chrispederick.myacen.com/work/firefox/webdeveloper/ .

Saturday, June 5 2004 at 11:50 AM

Phil Wilson said:

Ha, I too was about to tell you about the joys of the web developer extension!

Instead I’ll have to settle for going “Um, the Live HTTP Headers extension is probably about as close as you’re going to get to stepping through”.

Monday, June 7 2004 at 4:38 AM

Matt Vance said:

Here’s another option for changing POSTs to GETs:

http://www.squarefree.com/bookmarklets/forms.html#frmget

Thursday, September 9 2004 at 8:11 AM