Re: Translate non-structured documents into Xml RSS format
- To: syndication@egroups.com
- Subject: Re: Translate non-structured documents into Xml RSS format
- From: "Rick Winfield" <rick@rickwinfield.com>
- Date: Tue, 26 Sep 2000 07:19:50 -0000
- In-reply-to: <001301c0277c$f48b9430$020d0dc0@vertexdev.com>
- User-agent: eGroups-EW/0.82
That sounds like it. I just found a service today called EchoFactor.com
that does something like this (no idea if they use RSS). They are a
spin-off of Infonautics, and it looks like they use a scraper to
gather headlines and Infonautics technology to classify them. They
claim to be much larger than moreover.com (though I wonder about the
quality of the "channels").
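
For anyone following along, here is a rough sketch of the kind of
regex scraper discussed in this thread, written in Python rather than
Perl. The sample markup, the class="headline" attribute, the
example.com URL, and the function names are all made up for
illustration; a real scraper needs a pattern written for each page,
and the output is only an RSS-0.91-style skeleton, not a validated
feed.

```python
import html
import re
from urllib.parse import urljoin
from xml.sax.saxutils import escape

# Hypothetical page markup; every real site needs its own pattern.
SAMPLE_HTML = """
<html><body>
<a class="headline" href="/news/1.html">First story</a>
<a class="headline" href="/news/2.html">Second &amp; third story</a>
<a href="/about.html">About us</a>
</body></html>
"""

# Page-specific pattern: only anchors written exactly as
# <a class="headline" href="..."> are treated as headlines.
HEADLINE_RE = re.compile(
    r'<a\s+class="headline"\s+href="([^"]+)"\s*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def scrape_headlines(page):
    """Return (link, title) pairs matched by the page-specific pattern."""
    return [
        (link, html.unescape(title.strip()))
        for link, title in HEADLINE_RE.findall(page)
    ]


def to_rss(channel_title, base_url, items):
    """Build a minimal RSS-0.91-style document from (link, title) pairs."""
    parts = [
        '<?xml version="1.0"?>',
        '<rss version="0.91">',
        "<channel>",
        "<title>%s</title>" % escape(channel_title),
        "<link>%s</link>" % escape(base_url),
        "<description>Scraped headlines</description>",
    ]
    for link, title in items:
        # Resolve relative links against the (assumed) site base URL.
        absolute = urljoin(base_url, link)
        parts.append(
            "<item><title>%s</title><link>%s</link></item>"
            % (escape(title), escape(absolute))
        )
    parts += ["</channel>", "</rss>"]
    return "\n".join(parts)


if __name__ == "__main__":
    items = scrape_headlines(SAMPLE_HTML)
    print(to_rss("Example headlines", "http://example.com/", items))
```

Note that if the page merely swaps the attribute order to
<a href="..." class="headline">, the pattern matches nothing and the
channel comes out empty, which is the fragility Jeff mentions below.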
--- In syndication@egroups.com, "Jeff Barr" <jeff@v...> wrote:
> I think that Ben is asking for an HTML scraper. They
> generally use some obscenely complex Perl regular
> expressions to extract the relevant headlines from
> a page. The expressions are specific to the page.
>
> I know that Ian over at Internet Alchemy runs one.
>
> I'm not a big fan of scraping -- it seems to be
> fragile and error-prone -- if the site changes
> its format the regular expressions could break.
>
> Jeff;
>
> Jeff Barr - Home: 425-836-5624 Office: 425-936-3098
> mailto:jeff@v...
> http://www.vertexdev.com/~jeff
> http://jeffbarr.editthispage.com/
> 4610 191st Place NE. Redmond, WA
>
>
> -----Original Message-----
> From: Aaron Swartz [mailto:aswartz@s...]
> Sent: Monday, September 25, 2000 3:17 PM
> To: syndication@egroups.com
> Subject: [syndication] Re: Translate non-structured documents into Xml RSS format
>
>
> ben@u... <ben@u...> wrote:
>
> > I would like to know if anybody has already worked on a bot that
> > could grab unstructured documents and translate them into RSS
> > format.
>
> I'm not quite sure I follow. You mean a spider that would crawl the
> website and output a channel with a listing of all the pages on that
> site? I've never heard of such a thing. It does sound like an
> interesting possibility, however.
>
> What would you use this for, since the site map would rarely change
> (making it not very useful for news)?
>
> --
> Aaron Swartz |"This information is top security.
> <http://swartzfam.com/aaron/>| When you have read it, destroy yourself."
> <http://www.theinfo.org/> | - Marshall McLuhan