Saturday, 5 June 2004
Extreme URL Scraping and Debugging
Because Web sites often don’t make information available to us in the way we’d like, the mountain won’t come to Mohammed; we have to go to the mountain and scrape screens.
In many cases, scraping the HTML is less than half the battle; the real challenge is getting the right URL. Of course, a Web description format would help greatly here, but in the meantime some tools would come in handy.
Specifically, many Web sites use POST when they really should use GET. This is a shame, but I’ve found that lots of them still actually support GET, probably because of the tools they’re using. In other words, if you go into a Web form and hack the form method from POST to GET, you’ll find that in many cases it’ll work, making your search results cacheable.
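The payoff of flipping a form to GET is that the same field names and values collapse into a single bookmarkable, cacheable URL. A minimal sketch of that equivalence, with a made-up endpoint and field names for illustration:

```python
# Turn a form's action URI and field/value pairs into a GET URL --
# the same data a POST would carry, now visible and cacheable.
# The action URI and field names here are hypothetical.
from urllib.parse import urlencode

def form_to_get_url(action, fields):
    """Encode form fields as a query string appended to the action URI."""
    return action + "?" + urlencode(fields)

url = form_to_get_url("http://example.com/search", {"q": "scraping", "lang": "en"})
print(url)  # http://example.com/search?q=scraping&lang=en
```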
To do this right now, I have to save the HTML, change the form method, put a base URI in, and then load that in my browser. It would be much easier if there were a browser plugin that let me change a form from POST to GET dynamically, perhaps with a contextual menu on the submit button, or a menu option.
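Until such a plugin exists, the hand edit can be scripted. This is a rough sketch of the workflow just described: flip the form method to GET and drop a <base> element in so relative URIs in the saved copy still resolve. Regex-on-HTML is crude, but it mirrors what I do by hand:

```python
# Hypothetical helper automating the manual save-and-edit workflow:
# rewrite method="POST" to method="GET" and insert a <base href>.
import re

def hack_form(html, base_uri):
    """Flip POST forms to GET and add a base URI to saved HTML."""
    html = re.sub(r'method\s*=\s*["\']?post["\']?', 'method="GET"',
                  html, flags=re.IGNORECASE)
    # Put a <base> right after <head> so relative links keep working.
    return re.sub(r'(<head[^>]*>)', r'\1<base href="%s">' % base_uri,
                  html, count=1, flags=re.IGNORECASE)

page = '<html><head><title>t</title></head><body>' \
       '<form method="POST" action="/search"></form></body></html>'
print(hack_form(page, "http://example.com/"))
```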
Another helpful thing would be a mechanism to step through HTTP and HTML redirects, in the same manner that you step through breakpoints in a debugger.
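As a stand-in for that browser feature, redirects can be single-stepped from a script: fetch one hop at a time without auto-following, inspecting each Location header before moving on. A sketch, with a tiny local server standing in for a real site:

```python
# Single-step HTTP redirects one hop at a time, debugger-style.
# The two-hop local server below is only a stand-in for a real site.
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlsplit

class Hop(BaseHTTPRequestHandler):
    # Toy server: /a redirects to /b, /b answers 200.
    def do_GET(self):
        if self.path == "/a":
            self.send_response(302)
            self.send_header("Location",
                             "http://%s:%d/b" % self.server.server_address)
            self.end_headers()
        else:
            self.send_response(200)
            self.end_headers()

    def log_message(self, *args):  # keep the demo quiet
        pass

def step_redirects(url, max_hops=10):
    """Follow redirects one hop at a time, recording (status, url) per hop."""
    hops = []
    for _ in range(max_hops):
        parts = urlsplit(url)
        conn = http.client.HTTPConnection(parts.netloc)
        conn.request("GET", parts.path or "/")
        resp = conn.getresponse()
        hops.append((resp.status, url))
        location = resp.getheader("Location")
        conn.close()
        if 300 <= resp.status < 400 and location:
            url = location  # a real debugger would pause here for inspection
        else:
            break
    return hops

server = HTTPServer(("127.0.0.1", 0), Hop)
threading.Thread(target=server.serve_forever, daemon=True).start()
host, port = server.server_address
hops = step_redirects("http://%s:%d/a" % (host, port))
print(hops)  # one 302 hop, then the final 200
server.shutdown()
```

Stepping through HTML meta-refresh redirects would additionally require parsing the returned body, which this sketch skips.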
Ideally, these enhancements would be in Safari, but I suspect I’ll have more luck if I ask the LazyWeb for them in Firefox.