Extreme URL Scraping and Debugging
Saturday, 5 June 2004
Because Web sites often don’t make information available to us in the way we’d like, we have to bring the mountain to Mohammed and scrape screens.
I’ve played around with this in the past with xpath2rss, a scraping tool that allows you to generate RSS feeds from HTML. Jon Udell has done likewise with his excellent LibraryLookup experiment (Jon, if you read this, please add [Peninsula Library](javascript:var%20re=/([\/-]|is[bs]n=)(\d{7,9}[\dX])/i;if(re.test(location.href)==true){var%20isbn=RegExp.$2;void(win=window.open('http://catalog.plsinfo.org'+'/ipac20/ipac.jsp?index=ISBN&term='+isbn,'LibraryLookup','scrollbars=1,resizable=1,location=1,width=575,height=500'))})).
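For readers who don't want to decode the bookmarklet, here is a rough Python sketch of what it appears to do: pull something ISBN-shaped out of the current page's URL and build a catalog search URL from it. The function name `catalog_url` is my own invention for illustration; the pattern and the catalog address are taken from the link above.

```python
import re
from urllib.parse import urlencode

# Pattern adapted from the bookmarklet: an ISBN-like run of characters
# that follows "/", "-", "isbn=" or "issn=" in the URL.
ISBN_RE = re.compile(r"([/-]|is[bs]n=)(\d{7,9}[\dX])", re.IGNORECASE)


def catalog_url(page_url, base="http://catalog.plsinfo.org/ipac20/ipac.jsp"):
    """Return a catalog search URL for the ISBN found in page_url, or None.

    catalog_url is a hypothetical helper name, not part of LibraryLookup.
    """
    match = ISBN_RE.search(page_url)
    if match is None:
        return None
    return base + "?" + urlencode({"index": "ISBN", "term": match.group(2)})


if __name__ == "__main__":
    print(catalog_url("http://www.amazon.com/exec/obidos/ASIN/0596002815/"))
```

Note that the whole thing only works because the catalog accepts its search as a plain GET URL.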
In many cases, scraping the HTML is less than half the battle; the real challenge is getting the right URL. Of course, a Web description format would help greatly here, but in the meantime some tools would come in handy.
Specifically, many Web sites use POST when they really should use GET. This is a shame, but I’ve found that lots of them still actually support GET, probably because of the tools they’re using. In other words, if you go into a Web form and hack the form method from POST to GET, you’ll find that in many cases it’ll work, making your search results cacheable.
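As a sketch of the hack, this is roughly how the GET URL for a form can be worked out by hand. The helper name `form_as_get_url` and the example URLs are assumptions of mine. It replaces any query string already on the form's action, which is what I understand current browsers to do for GET forms, and it says nothing about whether a particular server will actually accept GET.

```python
from urllib.parse import urlencode, urljoin, urlsplit, urlunsplit


def form_as_get_url(page_url, action, fields):
    """Build the URL a form would request if its method were GET.

    page_url: URL of the page containing the form (resolves a relative action)
    action:   the form's action attribute
    fields:   dict mapping field names to values
    """
    parts = urlsplit(urljoin(page_url, action))
    # Assumption: the form data replaces any existing query on the action URL.
    return urlunsplit(parts._replace(query=urlencode(fields), fragment=""))


if __name__ == "__main__":
    print(form_as_get_url(
        "http://example.com/search/index.html",
        "results.cgi",
        {"q": "web scraping", "max": "10"},
    ))
```

If the server answers the resulting URL with the same page the POST produced, you have something you can bookmark, cache, and hand to a scraper.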
To do this right now, I have to save the HTML, change the form’s method, put a base URI in, and then load the result in my browser. It would be much easier if there were a browser plugin that let me change a POST form to a GET form dynamically, perhaps with a contextual menu on the submit button, or a menu option.
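Until such a plugin exists, the manual steps can at least be scripted. The following is a minimal sketch, assuming reasonably tidy HTML; it uses regular expressions rather than a real parser, so it will miss odd markup. The function name `rewrite_form` is hypothetical.

```python
import html
import re


def rewrite_form(page, base_url):
    """Switch every POST form in page to GET and add a <base> element."""

    def fix_form(match):
        # Only touch the method attribute inside the <form ...> tag itself.
        return re.sub(
            r'method\s*=\s*(["\']?)post\1',
            r"method=\g<1>get\g<1>",
            match.group(0),
            flags=re.IGNORECASE,
        )

    page = re.sub(r"<form\b[^>]*>", fix_form, page, flags=re.IGNORECASE)

    # The base URI makes relative actions, images and links resolve
    # against the original site when the saved copy is opened locally.
    base_tag = '<base href="%s">' % html.escape(base_url, quote=True)
    page, count = re.subn(
        r"<head\b[^>]*>",
        lambda m: m.group(0) + base_tag,
        page,
        count=1,
        flags=re.IGNORECASE,
    )
    if count == 0:
        # No <head> found: fall back to putting the base tag first.
        page = base_tag + page
    return page
```

Save the output to a file, open it in the browser, and submit the form as usual.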
Another helpful thing would be a mechanism to step through HTTP and HTML redirects, in the same manner that you step through breakpoints in a debugger.
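Outside the browser, something like this can be approximated with a generator that performs one hop at a time. This is only a sketch under a few assumptions: the names `redirect_steps`, `fetch` and `META_REFRESH` are mine, the meta refresh pattern expects `http-equiv` to appear before `content`, and JavaScript redirects are not handled at all.

```python
import re
import urllib.error
import urllib.request
from urllib.parse import urljoin

# Assumption: http-equiv comes before content inside the <meta> tag.
META_REFRESH = re.compile(
    r"""<meta[^>]+http-equiv=["']?refresh["']?[^>]*"""
    r"""content=["']?\s*\d+\s*;\s*url=([^"'>\s]+)""",
    re.IGNORECASE,
)


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # decline to follow, so urllib raises HTTPError instead


def fetch(url):
    """Fetch url without following redirects: (status, Location, body)."""
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        with opener.open(url) as resp:
            body = resp.read(65536).decode("latin-1")
            return resp.status, resp.headers.get("Location"), body
    except urllib.error.HTTPError as err:
        body = err.read(65536).decode("latin-1")
        return err.code, err.headers.get("Location"), body


def redirect_steps(url, fetch=fetch, limit=10):
    """Yield (url, status, kind) per hop; kind is "http", "html" or None."""
    for _ in range(limit):
        status, location, body = fetch(url)
        meta = META_REFRESH.search(body)
        if status in (301, 302, 303, 307, 308) and location:
            kind, target = "http", location
        elif meta:
            kind, target = "html", meta.group(1)
        else:
            yield url, status, None  # final page, nothing left to follow
            return
        yield url, status, kind
        url = urljoin(url, target)
```

Because it is a generator, calling `next()` on it acts like a crude "step" command: each call makes exactly one request and reports where it would go next.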
Ideally, these enhancements would be in Safari, but I suspect I’ll have more luck if I ask the LazyWeb for them in Firefox.