Re: [syndication] (Random Thoughts) Content syndication and content "cleansing"
jm-onelist@jmason.org wrote:
>
> [scraping@jmason.org added to recipients]
>
> Carmen said:
>
> > But consider that scraping is like taking a newspaper, clipping out all
> > of the ads, taping the remnants together, and then giving this version
> > out to the world. The way I see it, the sites (and the newspapers) are
> > giving you the content for free (or at reduced cost in the case of a
> > newspaper) because the advertisers are paying for some, most, or all of
> > the costs of giving you the info. When you scrape out the goodies you
> > are taking the good stuff and ignoring the "bill" for it (the ad). If
> > everyone did this then the content provider would realize no revenue for
> > their effort. Please don't take this as a criticism. Getting a scraper
> > to work is definitely an accomplishment to be proud of.
>
> I agree... it damages the revenue stream pretty severely and, if it became
> widespread, would encourage sites to impose charged subscriptions. :(
>
> I personally think it would be far from acceptable to scrape other news
> sites' content, remove source advertising and copyright info, and place it
> in HTML on my own site under my own advertising, for example.
Yup. If you cleanse/scrape content for legitimate purposes you should
clearly state your intentions. :)
> However the reason I wrote a scraper is because I wanted to read sites on
> my Palm handheld. This seems to apply for many of the tools I've found
> (Plucker, AvantGo, InfoRover, WebFetch, NewsClipper etc.)
(URL?)
Right. This is what the clean scraping would be about. Getting content
to people who would not get it otherwise. I wouldn't think anyone would
throw the book at anyone on this. The thing that is interesting is that
the content wouldn't be put on the handheld machines anyway. The point
is that this *helps* these big publishing companies gain recognition and
brand loyalty on content that otherwise wouldn't be seen.
This is one of the issues in software piracy. MS and other proprietary
software vendors claim that they lose gazillions of dollars due to
software piracy (mostly foreign) but this isn't really the case. In
normal situations they wouldn't sell any licenses anyway so the point is
that nothing is changing.
> Currently it's impossible to do this (assuming no wireless modems etc)
> *without* using a scraper; and the trimming of images & extraneous HTML is
> a definite bonus when you've got 1MB of space and a 2" screen.
Totally!
> I'm sure there's similar situations where scraping technology is a plus,
> or a requirement.
>
> > Legality and ethics aside, I am also pretty concerned about
> > the fragility of the scraping process. It seems that the scraper
> > can be broken (for a site) if the site makes a simple change
> > or a redesign.
>
> Yep. This is one reason why I wanted to institute a scraping-related
> mailing list, so people writing HTML scrapers could swap and coordinate
> site details, to handle changes like this. The bigger the community, the
> faster the site layout descriptions could be fixed...
Cleansing and scraping are really different issues. When you are
scraping you are trying to use HTML as a protocol layer. IMO this is
fragile and really dangerous. But cleansing just destroys content
(doesn't make anything new) and gives a subset of the content so should
theoretically be much safer than scraping.
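
To make the cleansing idea concrete, here is a minimal sketch in Python, using only the standard library. It is not how sitescooper or any other tool named in this thread works; the class name `Cleanser`, the helper `cleanse`, and the list of dropped elements are assumptions made up for illustration. The point is that a cleanser only removes things: it never looks for a particular table or div, so a site redesign can at worst let some extra markup through.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical list of elements to throw away (an assumption, not a standard).
DROPPED = {"script", "style", "img", "iframe"}
# Dropped elements that never have a closing tag.
VOID = {"img"}


class Cleanser(HTMLParser):
    """Copy the input through, minus DROPPED elements and their contents."""

    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROPPED:
            if tag not in VOID:
                self.skip_depth += 1
            return
        if not self.skip_depth:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>; emit it once, unchanged.
        if tag not in DROPPED and not self.skip_depth:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in DROPPED:
            if tag not in VOID and self.skip_depth:
                self.skip_depth -= 1
            return
        if not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # Entities were decoded by the parser, so re-escape the text.
            self.out.append(escape(data, quote=False))


def cleanse(html_text):
    parser = Cleanser()
    parser.feed(html_text)
    parser.close()  # flush any buffered text
    return "".join(parser.out)
```

A scraper, by contrast, would have to know where on the page the story lives (for example "the third table cell"), which is exactly the part that breaks when the layout changes.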
> > You may want to take a look at the list of providers on
> > our site (www.headlineviewer.com). We do no scraping,
> > although I know that some of our suppliers do.
>
> Yep, I'm aware of RSS -- sitescooper uses it to find stories ;)
I might use sitescooper in Jetspeed. Just black box the whole thing so
that anyone can swap in a URL filter.
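
The "black box" idea might look roughly like the sketch below, written in Python for brevity. None of this is Jetspeed's or sitescooper's actual API; `UrlFilter`, `FilterRegistry`, `register`, and `apply` are hypothetical names. The assumption is that a filter is just a function from (URL, HTML) to HTML, so a scraper, a cleanser, or a do-nothing pass-through can all be swapped in per host.

```python
from typing import Callable, Dict
from urllib.parse import urlsplit

# Hypothetical interface: a filter takes (url, html) and returns html.
UrlFilter = Callable[[str, str], str]


class FilterRegistry:
    """Pick a filter by the host part of the URL, with a fallback."""

    def __init__(self, default: UrlFilter):
        self._default = default
        self._by_host: Dict[str, UrlFilter] = {}

    def register(self, host: str, url_filter: UrlFilter) -> None:
        self._by_host[host.lower()] = url_filter

    def apply(self, url: str, html: str) -> str:
        host = (urlsplit(url).hostname or "").lower()
        return self._by_host.get(host, self._default)(url, html)


def passthrough(url: str, html: str) -> str:
    # Default behaviour: leave the content untouched.
    return html
```

Usage would be along the lines of `registry = FilterRegistry(passthrough)` followed by `registry.register("example.com", some_filter)`; the caller only ever sees `registry.apply(url, html)` and does not need to know which filter ran.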
> (I should have written up a quick para on the site clarifying that BTW)
>
> The headline viewer's pretty cool, and you've done a great job of tracking
> down those RSS URLs!
Check out xmltree.com for about 1700 more of them :)
--
Kevin A Burton (burton@apache.org)
http://relativity.yi.org
Message to SUN: "Please Open Source Java!"
"For evil to win is for good men to do nothing."