
RE: how to scrape



I started working on this problem last year.  Originally we wanted to scrape
a number of sites for headlines and links to the articles. This was for a
consumer portal that offered its users customizable headlines on their front
page - we wanted to give them as much choice as possible.

It's easy to solve the problem for most cases.  The harder cases are those
where the headline is not linked to the article, but there's a link below it
(sometimes the word "more...", or an icon) to the full article.

We solved it in two ways: one using Perl, and the other, in Java, using
HTML Tidy (the same one Ken MacLeod mentions), then custom software to
traverse the generated XML tree.  Why custom software, not XSL? XSL doesn't
understand order. For example, I might know that I have headlines at
/html/body/table/tr/td/b
and then the link is at /html/body/table/tr/td/a

but it might not even be in the same td. I tried using XSL to solve the
problem, but I couldn't get it to go back up and then down into the next td
to look for the a tag.
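
To give a rough idea of the traversal, here is a minimal sketch, assuming the
page has already been cleaned into well-formed XML (for example by HTML Tidy)
with no DOCTYPE to resolve. It is not our actual software: the class name
HeadlinePairer and its methods are made up for illustration, and it assumes
headlines sit in b tags with the link in the next a tag in document order.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: pairs each <b> headline with the next <a href>
// that follows it in document order, even if it sits in another td or tr.
class HeadlinePairer {
    // Each entry is {headline text, href}.
    private final List<String[]> pairs = new ArrayList<>();
    // Last headline seen that has not been matched to a link yet.
    private String pendingHeadline;

    static List<String[]> pair(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            HeadlinePairer p = new HeadlinePairer();
            p.walk(doc.getDocumentElement());
            return p.pairs;
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    // Depth-first walk, which visits nodes in document order.
    private void walk(Node node) {
        if (node instanceof Element) {
            Element el = (Element) node;
            String name = el.getTagName();
            if (name.equals("b")) {
                pendingHeadline = el.getTextContent().trim();
            } else if (name.equals("a") && pendingHeadline != null
                    && el.hasAttribute("href")) {
                pairs.add(new String[] {pendingHeadline, el.getAttribute("href")});
                pendingHeadline = null;
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c);
        }
    }
}
```

Calling HeadlinePairer.pair(tidiedXml) returns the headline/link pairs in page
order. Because the walk remembers the last unmatched headline, it doesn't
matter whether the link is in the same td, the next td, or the next row.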

Feel free to contact me if you're interested in how this works, or if you
think XSL can be used, and then I can disagree :)

One thing it seems that you're not going to have to deal with is source
management. Since we are scraping many sites, it's not practical to code for
each site - instead we have a web-based GUI to create configuration files
telling how and what to get from each site, and then the software processes
the pages based on the config. If you can custom-code the scraping for each
site, it's probably simpler.
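
As an illustration of the config-driven idea only, here is a small sketch of
what a per-site configuration might look like and how software could load it.
The format is invented for this example (a Java properties file with url,
headline.tag and link.tag keys); it is not the format our GUI produces, and
the SiteConfig class is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical per-site settings: where to fetch, and which tags hold
// the headline and the link. One such config would exist per site.
class SiteConfig {
    final String url;
    final String headlineTag;
    final String linkTag;

    SiteConfig(String url, String headlineTag, String linkTag) {
        this.url = url;
        this.headlineTag = headlineTag;
        this.linkTag = linkTag;
    }

    // Parses properties-style text such as:
    //   url=http://example.com/news
    //   headline.tag=b
    //   link.tag=a
    static SiteConfig parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalArgumentException("unreadable config", e);
        }
        String url = p.getProperty("url");
        if (url == null) {
            throw new IllegalArgumentException("config needs a url");
        }
        // Assumed defaults: headlines in <b>, links in <a>.
        return new SiteConfig(url,
                p.getProperty("headline.tag", "b"),
                p.getProperty("link.tag", "a"));
    }
}
```

The point of the design is that adding a site means writing a new config, not
new code: the same traversal runs everywhere and only reads its tag names from
the config.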

braxton
robbason@buzzmetrics.com


------------------------------
   Date: Tue, 20 Mar 2001 18:44:35 -0000
   From: "Alis Marsden" <alis@purplepages.ie>
Subject: How to scrape?

Hi,

I'd like to "scrape" either the headlines or full stories from a couple of
different sites that are not currently producing an RSS file or available
through any existing aggregators.

The legal issues are not really a consideration in my case - it is only
going to be done with a couple of sites that have already given me
permission to do so.

I'm guessing I'd do it by spidering the pages somehow, but there really
doesn't seem to be much information on the web about how it could be done.

If anyone has any suggestions or knows of any code examples or literature on
the subject, I'd love to hear about it.

Thanks

With Kindest Regards

Alis Marsden

Purple Pages
http://www.purplepages.ie
e: alis@purplepages.ie
t: + 353 1 4961943
f: + 353 1 4911497