
Re: [syndication] Re: Translate non-structured documents into Xml RSS format

To: syndication@egroups.com
Subject: Re: [syndication] Re: Translate non-structured documents into Xml RSS format
From: Ian Davis <ian@calaba.com>
Date: Tue, 26 Sep 2000 11:55:53 -0700
In-reply-to: <001301c0277c$f48b9430$020d0dc0@vertexdev.com>
Organization: Calaba Ltd.
References: <001301c0277c$f48b9430$020d0dc0@vertexdev.com>
Reply-to: Ian Davis <iand@internetalchemy.org>

On Monday, September 25, 2000, 10:45:19 PM, Jeff wrote:

> I think that Ben is asking for an HTML scraper. They
> generally use some obscenely complex Perl regular
> expressions to extract the relevant headlines from
> a page. The expressions are specific to the page.

> I know that Ian over at Internet Alchemy runs one.
I still run one, although I don't maintain it as much as I should.

> I'm not a big fan of scraping -- it seems to be
> fragile and error-prone -- if the site changes
> its format the regular expressions could break.
Scrapers can be fragile, but they break less often than you might
think. Many sites use them. I believe Moreover uses WebMethods to
create its large set of feeds.
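
To make the idea concrete, here is a minimal sketch of a regex-based
scraper, written in Python rather than Perl. It is not the code I run.
The pattern, the class="headline" markup and the function names are all
assumptions made up for illustration; a real page needs its own pattern,
and the output is only a bare-bones RSS 0.91-style document.

```python
import html
import re
from xml.sax.saxutils import escape

# Hypothetical page-specific pattern: assumes every headline is marked up
# as <a class="headline" href="...">Title</a>. Each site needs its own.
HEADLINE_RE = re.compile(
    r'<a\s+class="headline"\s+href="([^"]+)"\s*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def scrape_headlines(page):
    """Return (title, link) pairs matched by the page-specific pattern."""
    items = []
    for link, title in HEADLINE_RE.findall(page):
        # Decode HTML entities and collapse runs of whitespace in the title.
        clean = re.sub(r"\s+", " ", html.unescape(title)).strip()
        items.append((clean, html.unescape(link)))
    return items


def to_rss(channel_title, channel_link, items):
    """Render scraped items as a minimal RSS 0.91-style document."""
    lines = [
        '<?xml version="1.0"?>',
        '<rss version="0.91">',
        "<channel>",
        f"<title>{escape(channel_title)}</title>",
        f"<link>{escape(channel_link)}</link>",
        f"<description>Scraped headlines from {escape(channel_link)}</description>",
    ]
    for title, link in items:
        lines += [
            "<item>",
            f"<title>{escape(title)}</title>",
            f"<link>{escape(link)}</link>",
            "</item>",
        ]
    lines += ["</channel>", "</rss>"]
    return "\n".join(lines)
```

If the site changes its markup, scrape_headlines simply returns an empty
list, so a scraper built this way should treat "no headlines found" as a
sign that the pattern needs attention rather than publish an empty feed.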

Ian

Follow-Ups:
- Re: Translate non-structured documents into Xml RSS format
  - From: ben@ubiquick.com

References:
- RE: [syndication] Re: Translate non-structured documents into Xml RSS format
  - From: "Jeff Barr" <jeff@vertexdev.com>
