RE: [syndication] (Random Thoughts) Content syndication and content "cleansing"
This is a good discussion to have.
I really wonder about the legality and ethics of scraping. Fair
warning: I am not a lawyer.
But consider that scraping is like taking a newspaper, clipping
out all of the ads, taping the remnants together, and then
giving this version out to the world. The way I see it,
the sites (and the newspapers) are giving you the content
for free (or at reduced cost in the case of a newspaper)
because the advertisers are paying for some, most, or all
of the costs of giving you the info. When you scrape out
the goodies you are taking the good stuff and ignoring the
"bill" for it (the ad). If everyone did this then the
content provider would realize no revenue for their effort.
Please don't take this as a criticism. Getting a scraper
to work is definitely an accomplishment to be proud of.
Legality and ethics aside, I am also pretty concerned about
the fragility of the scraping process. A scraper can be broken
(for a site) by a simple change or a redesign. Even the sites
that claim to support XML/RSS still have some learning to do:
it is all too common to find XML with unescaped entities.
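To make the unescaped-entity problem concrete, here is a minimal sketch: a bare '&' in feed text breaks a strict XML parser, while escaping the text first does not. The item title below is invented for illustration.

```python
# Demonstrate why unescaped entities break strict XML parsers.
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

title = "News from Barnes & Noble"        # text containing a bare '&'

bad = "<item><title>%s</title></item>" % title
try:
    ET.fromstring(bad)                    # not well-formed: '&' must be '&amp;'
except ET.ParseError:
    print("parse failed on unescaped '&'")

good = "<item><title>%s</title></item>" % escape(title)
ET.fromstring(good)                       # parses cleanly
print(good)  # <item><title>News from Barnes &amp; Noble</title></item>
```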
Longer term, I do think that encouraging sites to export
headlines and summaries via RSS is the way to go. The RSS
file is itself a form of an ad come-on. The sites go to
the trouble of generating it and making it available with
the hope that the headlines will be sufficiently interesting
to draw you to their site.
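For concreteness, here is a sketch of what such a minimal RSS 0.91 feed looks like; the channel and item details are invented for illustration:

```xml
<?xml version="1.0"?>
<!DOCTYPE rss SYSTEM "http://my.netscape.com/publish/formats/rss-0.91.dtd">
<rss version="0.91">
  <channel>
    <title>Example Site News</title>
    <link>http://www.example.com/</link>
    <description>Headlines from an example site</description>
    <language>en-us</language>
    <item>
      <title>First headline</title>
      <link>http://www.example.com/story1.html</link>
      <description>A one-line summary meant to draw the reader in.</description>
    </item>
  </channel>
</rss>
```

Each item carries only a headline, a link, and a short summary; the full story (ads and all) stays on the provider's own pages.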
You may want to take a look at the list of providers on
our site (www.headlineviewer.com). We do no scraping,
although I know that some of our suppliers do. About
90% of these sites are available via RSS or some other XML
format. The rest use a text-based "backend" file. At last
count we had 1234 providers:
368 builtins
639 from the list at Userland
39 from the list at StartsHere
158 from the list at XMLTree
30 at GrokSoup
By the Fall (September) of this year we hope to employ
a part-time "syndication evangelist" to spend time
contacting sites, asking them for RSS, and helping them
get going.
Carmen
Try Headline Viewer at http://www.headlineviewer.com
-----Original Message-----
From: jm@netnoteinc.com [mailto:jm@netnoteinc.com]On Behalf Of
jm-onelist@jmason.org
Sent: Friday, May 19, 2000 3:28 AM
To: syndication@egroups.com
Cc: jetspeed@list.working-dogs.com
Subject: Re: [syndication] (Random Thoughts) Content syndication and
content "cleansing"
"Kevin A. Burton" said:
> The problem is that the full article is meant to run on a modern browser
> like IE or Netscape and doesn't look right on this small device. So
> what is needed is a content cleansing mechanism that removes all the
> junk in a complex HTML document and just displays the content.
Hi Kevin,
Take a look at my project, sitescooper -- http://sitescooper.cx/ --
It does this, with an application specifically focused on reading on a Palm
handheld (though it also produces plain HTML or plain text output).
It's in perl though ;)
(It also supports following links from an HTML page as well as RSS,
but that's kind of irrelevant to this forum. However, this functionality
would be needed in your tool, as a lot of sites deliver stories in
multi-page format.)
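The multi-page case above can be sketched roughly as follows: fetch a page, then keep following its "next page" link until none remains. The toy pages and the fetch() stand-in are invented here; a real tool would issue HTTP GETs and use the site's actual link markup.

```python
# Collect a multi-page story by chasing "next page" links.
import re

PAGES = {   # toy site: each page may link to the next part
    "/story?p=1": 'Part one. <a href="/story?p=2">next</a>',
    "/story?p=2": 'Part two. <a href="/story?p=3">next</a>',
    "/story?p=3": 'Part three.',
}

def fetch(url):
    return PAGES[url]                      # stand-in for an HTTP GET

def collect_story(url):
    parts = []
    while url:
        html = fetch(url)
        # keep the text, drop the navigation link
        parts.append(re.sub(r'<a href="[^"]*">next</a>', '', html).strip())
        m = re.search(r'<a href="([^"]*)">next</a>', html)
        url = m.group(1) if m else None    # stop when there is no next link
    return ' '.join(parts)

print(collect_story("/story?p=1"))
# Part one. Part two. Part three.
```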
Sitescooper uses "site files" which contain details of markers in the HTML
to identify where story text starts and the sidebar tables end (it also
uses table widths to guess this, but that's not always reliable). These
site files are in a proprietary (text-based) format.
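The marker idea can be illustrated with a minimal sketch like the one below. The marker strings and the sample page are hypothetical; real site files carry per-site values in sitescooper's own format, and (as noted above) a site redesign that moves or removes the markers breaks the extraction.

```python
# Extract story text lying between per-site start/end markers.
START_MARKER = '<!-- story begins -->'    # hypothetical per-site marker
END_MARKER = '<!-- story ends -->'        # hypothetical per-site marker

def extract_story(html):
    """Return the text between the markers, or None if they are missing."""
    start = html.find(START_MARKER)
    if start == -1:
        return None                       # marker gone, e.g. after a redesign
    start += len(START_MARKER)
    end = html.find(END_MARKER, start)
    if end == -1:
        return None
    return html[start:end].strip()

page = ("<html><table>sidebar junk</table>"
        "<!-- story begins -->The actual article text.<!-- story ends -->"
        "</html>")
print(extract_story(page))  # The actual article text.
```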
To allow cooperation between developers of similar tools, however,
there's also http://jmason.org/scraping/ , an effort to come up
with a good XML-based file format for the site layout
description files.
Feel free to mail me offline, or join up to the scraping list to discuss
this, or just discuss it here ;)
> Thoughts? What is the legality here? Technically it wouldn't be used
> to rip out advertisements but to only display this content to devices
> that couldn't originally see it anyway.
hmm... a tricky issue alright.
My position is that it's fine, as long as the HTTP GETs are issued from the
end-user's machine; that (AFAICS) fits in with most sites' T&Cs.
I also have a page ( http://sitescooper.cx/scoops/ ) which contains
pre-compiled sitescooper retrievals in several Palm formats; only
certain sites are included there for this reason, as I want to stay
on the safe side of the law.
--j.
--
Justin Mason Work: http://www.netnoteinc.com/ <jm@netnoteinc.com>
Personal: http://jmason.org/ <jm@jmason.org>
"It's true that some sharks get cancer. I said this in my book."
-- William Lane, author of _Sharks Don't Get Cancer_