
Re: Translate non-structured documents into Xml RSS format



That sounds like it. I just found a service today called EchoFactor.com
that does something like this (no idea if they use RSS). They are a
spin-off of Infonautics, and it looks like they use a scraper to
gather headlines and Infonautics technology to classify them. They
claim to be much larger than moreover.com (I wonder about the quality
of the "channels").
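
For anyone wondering what the scraper approach Jeff describes below looks
like in practice, here is a minimal sketch in Python (he mentions Perl; the
idea is the same). Everything specific in it is an assumption made up for
illustration: the sample page, the class="headline" markup, the
example.com URLs, and the function names. It pulls headline links out of a
page with a page-specific regular expression and writes them out as a
simple RSS 0.91-style channel. It is not how EchoFactor or any other
service actually works.

```python
import html
import re
from xml.sax.saxutils import escape

# Hypothetical page markup. A real site needs its own hand-written pattern.
SAMPLE_PAGE = """
<html><body>
<a class="headline" href="http://example.com/story1">First story &amp; more</a>
<a class="headline" href="http://example.com/story2">Second story</a>
<a href="http://example.com/about">About us</a>
</body></html>
"""

# Page-specific pattern: it only matches links written exactly this way
# (class attribute first, then href).
HEADLINE_RE = re.compile(
    r'<a\s+class="headline"\s+href="([^"]+)">(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def scrape_headlines(page):
    """Return (url, title) pairs for every link matching HEADLINE_RE."""
    return [
        (html.unescape(url), html.unescape(title).strip())
        for url, title in HEADLINE_RE.findall(page)
    ]


def to_rss(channel_title, channel_link, items):
    """Build a small RSS 0.91-style document from (url, title) pairs."""
    lines = [
        '<?xml version="1.0"?>',
        '<rss version="0.91">',
        "  <channel>",
        "    <title>%s</title>" % escape(channel_title),
        "    <link>%s</link>" % escape(channel_link),
        "    <description>Scraped headlines</description>",
        "    <language>en</language>",
    ]
    for url, title in items:
        lines.append("    <item>")
        lines.append("      <title>%s</title>" % escape(title))
        lines.append("      <link>%s</link>" % escape(url))
        lines.append("    </item>")
    lines += ["  </channel>", "</rss>"]
    return "\n".join(lines)


if __name__ == "__main__":
    headlines = scrape_headlines(SAMPLE_PAGE)
    print(to_rss("Example headlines", "http://example.com/", headlines))
```

Note that the pattern stops matching as soon as the site swaps the order
of the two attributes, which is exactly the fragility Jeff is worried
about.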

--- In syndication@egroups.com, "Jeff Barr" <jeff@v...> wrote:
> I think that Ben is asking for an HTML scraper. They
> generally use some obscenely complex Perl regular
> expressions to extract the relevant headlines from
> a page. The expressions are specific to the page.
> 
> I know that Ian over at Internet Alchemy runs one.
> 
> I'm not a big fan of scraping -- it seems to be
> fragile and error-prone -- if the site changes
> its format the regular expressions could break.
> 
> Jeff;
> 
> Jeff Barr - Home: 425-836-5624 Office: 425-936-3098
> mailto:jeff@v...
> http://www.vertexdev.com/~jeff
> http://jeffbarr.editthispage.com/
> 4610 191st Place NE. Redmond, WA
> 
> 
> -----Original Message-----
> From: Aaron Swartz [mailto:aswartz@s...]
> Sent: Monday, September 25, 2000 3:17 PM
> To: syndication@egroups.com
> Subject: [syndication] Re: Translate non-structured documents into Xml
> RSS format
> 
> 
> ben@u... <ben@u...> wrote:
> 
> > I would like to know if anybody has already worked on a bot that
> > could grab unstructured documents and translate them into RSS format.
> 
> I'm not quite sure I follow. You mean a spider that would crawl the website
> and output a channel with a listing of all the pages on that site? I've
> never heard of such a thing, it does sound like an interesting possibility,
> however.
> 
> What would you use this for, since the site map would rarely change (making
> it not very useful for news)?
> 
> --
>         Aaron Swartz         |"This information is top security.
> <http://swartzfam.com/aaron/>|     When you have read it, destroy yourself."
>   <http://www.theinfo.org/>  |             - Marshall McLuhan