
Re: [syndication] Re: Translate non-structured documents into Xml RSS format



On Monday, September 25, 2000, 10:45:19 PM, Jeff wrote:

> I think that Ben is asking for an HTML scraper. They
> generally use some obscenely complex Perl regular
> expressions to extract the relevant headlines from
> a page. The expressions are specific to the page.

> I know that Ian over at Internet Alchemy runs one.
I still run one, although I don't maintain it as much as I should.

> I'm not a big fan of scraping -- it seems to be
> fragile and error-prone -- if the site changes
> its format the regular expressions could break.
Scrapers can be fragile, but they break less often than you might
think. Many sites use them. I believe Moreover uses WebMethods to
create its large set of feeds.
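
For anyone who hasn't seen one, here is a minimal sketch of the kind of
scraper being discussed. It is written in Python rather than Perl, and
everything in it is an assumption made for illustration: the sample
markup, the "headline" class, the example.com base URL and the function
names are all hypothetical, and it is not the code I actually run. It
pulls link/title pairs out of a page with one page-specific regular
expression and wraps them in RSS <item> elements.

```python
import re
from html import unescape
from xml.sax.saxutils import escape

# Hypothetical page fragment; a real scraper would fetch the page over HTTP.
SAMPLE_HTML = """
<table>
<tr><td class="headline"><a href="/news/1.html">First &amp; foremost</a></td></tr>
<tr><td class="headline"><a href="/news/2.html">Second story</a></td></tr>
</table>
"""

# Page-specific pattern: it matches only this (assumed) markup, which is
# exactly why a redesign of the site can break the scraper.
HEADLINE_RE = re.compile(
    r'<td class="headline"><a href="([^"]+)">(.*?)</a></td>',
    re.IGNORECASE | re.DOTALL,
)


def scrape_headlines(html, base_url):
    """Return a list of (title, absolute_link) tuples found in the page."""
    return [
        (unescape(title.strip()), base_url + link)
        for link, title in HEADLINE_RE.findall(html)
    ]


def to_rss_items(headlines):
    """Render the scraped headlines as RSS <item> elements."""
    return "\n".join(
        "<item><title>%s</title><link>%s</link></item>"
        % (escape(title), escape(link))
        for title, link in headlines
    )


if __name__ == "__main__":
    print(to_rss_items(scrape_headlines(SAMPLE_HTML, "http://example.com")))
```

If the site renames the class or moves the link outside the table cell,
the pattern simply matches nothing, so a scraper like this should at
least treat an empty result as a warning sign.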

Ian