
Re: [syndication] (Random Thoughts) Content syndication and content "cleansing"



"Kevin A. Burton" said:

> The problem is that the full article is meant to run on a modern browser
> like IE or Netscape and doesn't look right on this small device.  So
> what is needed is a content cleansing mechanism that removes all the
> junk in a complex HTML document and just displays the content.

Hi Kevin,

Take a look at my project, sitescooper -- http://sitescooper.cx/ .
It does this, as an application focussed specifically on reading on a Palm
handheld (though it can also produce plain HTML or plain text output).
It's in Perl though ;)

(It also supports following links from an HTML page as well as RSS,
but that's kind of irrelevant to this forum. However this functionality
would be needed in your tool as a lot of sites deliver stories in
multi-page format.)
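
As a rough illustration of the multi-page case, here is a minimal Python
sketch (not sitescooper's actual code, which is Perl). It assumes the site
labels its continuation link with the text "next page"; that label, the
helper names and the example.com URLs are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Remember the href of the first <a> whose text matches link_text."""

    def __init__(self, link_text):
        super().__init__()
        self.link_text = link_text.lower()
        self.href = None        # href of the first matching link, if any
        self._current = None    # href of the <a> we are currently inside
        self._buf = []          # text seen inside the current <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current = dict(attrs).get("href")
            self._buf = []

    def handle_data(self, data):
        if self._current is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            # Normalise whitespace and case before comparing link text.
            text = " ".join("".join(self._buf).split()).lower()
            if self.href is None and text == self.link_text:
                self.href = self._current
            self._current = None


def next_page_url(html, base_url, link_text="next page"):
    """Return the absolute URL of the continuation page, or None."""
    finder = NextLinkFinder(link_text)
    finder.feed(html)
    finder.close()
    return urljoin(base_url, finder.href) if finder.href else None


page1 = '<p>Part one.</p><a href="/about">About</a> <a href="story2.html">Next  Page</a>'
print(next_page_url(page1, "http://example.com/news/story1.html"))
```

A real tool would call this in a loop, fetching each page until no
continuation link is found, and would need a per-site setting for the
link text.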

Sitescooper uses "site files" which contain details of markers in the HTML
to identify where story text starts and the sidebar tables end (it also
uses table widths to guess this, but that's not always reliable). These
site files are in a proprietary (text-based) format.
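
To make the marker idea concrete, here is a minimal Python sketch of
marker-based extraction. It is not the sitescooper site file format or its
code; the function name, the comment-style markers and the sample page are
assumptions made up for the example.

```python
import re


def extract_story(html, start_marker, end_marker):
    """Return the text between two literal markers, with tags stripped.

    Returns None if the start marker is missing; if the end marker is
    missing, everything after the start marker is kept.
    """
    start = html.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    end = html.find(end_marker, start)
    if end == -1:
        end = len(html)
    body = html[start:end]
    # Crude tag removal; adequate for a sketch, not for arbitrary HTML.
    text = re.sub(r"<[^>]+>", " ", body)
    # Collapse runs of whitespace left behind by the removed tags.
    return " ".join(text.split())


page = """<html><body>
<table width="120"><tr><td>Sidebar links</td></tr></table>
<!-- story begins -->
<h1>Headline</h1>
<p>First paragraph of the story.</p>
<!-- story ends -->
<div>Footer and ads</div>
</body></html>"""

story = extract_story(page, "<!-- story begins -->", "<!-- story ends -->")
print(story)
```

The per-site part is just the pair of marker strings, which is why it
makes sense to keep them in a separate description file rather than in
the code.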

To allow cooperation between developers of similar tools, however,
there's also http://jmason.org/scraping/ , which is an effort to come up
with a good XML-based syndication file format for the site layout
description files.

Feel free to mail me offline, or join up to the scraping list to discuss
this, or just discuss it here ;)


> Thoughts?  What is the legality here?  Technically it wouldn't be used
> to rip out advertisements but to only display this content to devices
> that couldn't originally see it anyway.

hmm... a tricky issue alright.

My position is that it's fine, as long as the HTTP GETs are issued from the
end-user's machine; that (AFAICS) fits in with most sites' T&Cs.

I also have a page ( http://sitescooper.cx/scoops/ ) which contains
pre-compiled sitescooper retrievals in several Palm formats. Only
certain sites are included there, for this reason: I want to keep
on the safe side of the law here.

--j.

-- 
Justin Mason       Work:  http://www.netnoteinc.com/ <jm@netnoteinc.com>
                   Personal:      http://jmason.org/     <jm@jmason.org>

"It's true that some sharks get cancer. I said this in my book."
	 	   -- William Lane, author of _Sharks Don't Get Cancer_