
RE: [syndication] RSS vs. HTML Bandwidth and "Scalability"...



To expand on Mark's answer about caching...

The way we've minimized the problem of repetitive requests
for feeds is to put a caching system in place "in front of" the
system that generates the RSS files.

We've successfully used Squid, an open-source caching server
(http://www.squid-cache.org/), to proxy all of the traffic to the
CGI that generates the RSS files. We stress-tested it and it
performed great.
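
For illustration only - the exact directives depend on your Squid
version, and the port numbers here are made up - a Squid 2.x
"accelerator" setup boils down to a few lines of squid.conf:

    # Accept requests on port 80 and forward cache misses to the
    # web server running the RSS-generating CGI on port 8080.
    http_port 80
    httpd_accel_host localhost
    httpd_accel_port 8080
    httpd_accel_single_host on

Squid then answers repeat requests for the feed out of its cache,
and the CGI only runs when the cached copy goes stale.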

This doesn't help with bandwidth issues, but it can have a sizeable
impact if you are generating the RSS files dynamically each time
they are requested.

If you're willing to spend money to reduce the problem, it is
also possible to use services such as Akamai to cache feeds out
on the network.

Obviously, any type of caching solution requires that the feeds
update infrequently enough that a decent ratio of users hit the
cached results without missing new links.
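
As a rough sketch of that tradeoff - this is not our production
code, and build_rss() is just a stand-in for whatever actually
generates the feed - the CGI can tell the cache how long each copy
stays fresh with an Expires header:

    #!/usr/bin/env python
    # Hypothetical feed CGI: emit an Expires header so the cache
    # sitting in front of us serves its stored copy instead of
    # re-running the script.
    import time
    from email.utils import formatdate

    FRESH_SECONDS = 3600  # assume the feed changes roughly hourly

    def build_rss():
        # stand-in for the real feed generator
        return '<rss version="0.91">...</rss>'

    body = build_rss()
    print("Content-Type: text/xml")
    # The cache serves its copy until this timestamp passes, so at
    # most one request per hour actually reaches the CGI.
    print("Expires: " + formatdate(time.time() + FRESH_SECONDS,
                                   usegmt=True))
    print()
    print(body)

Keep FRESH_SECONDS shorter than the real update interval and the
window of users who miss new links stays small.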

Andrea
------------------------------------------------------------------
Andrea Michalek
phone: 215.280.1805
fax: 413.691.6607
email: andrea@1800cto.com

www.1800cto.com
Systems Architecture ~ Technology Strategy ~ Project Leadership

To subscribe to the 1-800-CTO mailing list - send an email to:
1800cto-subscribe@yahoogroups.com

-----Original Message-----
From: Morbus Iff [mailto:morbus@disobey.com]
Sent: Thursday, August 02, 2001 4:08 PM
To: syndication@yahoogroups.com; sjoerd@w3future.com
Subject: [syndication] RSS vs. HTML Bandwidth and "Scalability"...


 From w3future.com/weblog/:

 >In July my RSS file has been downloaded 15741 times. That's 134MByte,
 >55 percent of my total traffic. This is way too much if you compare
 >that to the 2617 times my HTML weblog has been downloaded last month.
 >This looks like a scalability problem. But I have a monthly traffic
 >limit of 1500MByte, so I don't worry.

Initially, I was "hey! what's the problem? people care more about your
content than the pretty design! be happy!" But as I started to write
exactly that, my opinion shifted quickly.

He makes an interesting point when you think about it. Besides search
engine spiders and proxies, I can think of no other programs that hit
a website time and time again to get updates when there aren't updates
to be had.

On the other hand, most "constant on" RSS aggregators hit websites every
hour to get the latest updates. However, I'm not really sure how Meerkat
or NewsIsFree.com handle it (or, for that matter, xmltree.com).

So, what's the solution to this pointless waste of bandwidth?

  a) Embed the time limit in the RSS file. This was
     allowed in the old MS CDF format, as well as in
     scriptingNews (I think). The big problem is that
     aggregators don't listen to it, since there's a
     stunning lack of adoption. Sadly, I fall into this
     group with my AmphetaDesk too. I know Jeff Barr's
     Headline Viewer has an internal option on when
     to update; however, I don't know the default.

  b) Check the HTTP headers from the server. This would
     only work if the content weren't dynamic, and static
     content is rare nowadays. For a while now, I've been
     thinking of checking content-lengths / filesizes and
     comparing them for newness (see the sketch after
     this list).

  c) Implement server control - block repetitive IPs
     on a cron'd schedule and allow them back in when
     the going gets happy. This shifts the "blame"
     onto the server people, though, and we really
     shouldn't be making RSS maintenance any harder
     than it is.
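
To flesh out (b), here's a minimal sketch of what such a check could
look like - this is not code from AmphetaDesk or any other aggregator,
and the feed URL is a placeholder. It asks the server whether the feed
changed before downloading it again, falling back to a size comparison
when the server won't say:

    import urllib.request, urllib.error

    FEED_URL = "http://example.com/index.rss"  # placeholder
    last_modified = None  # remembered between polls
    last_length = None

    def fetch_if_new():
        global last_modified, last_length
        req = urllib.request.Request(FEED_URL)
        if last_modified:
            # Conditional GET: a server that supports it answers
            # "304 Not Modified" with no body - near-zero bandwidth.
            req.add_header("If-Modified-Since", last_modified)
        try:
            resp = urllib.request.urlopen(req)
        except urllib.error.HTTPError as e:
            if e.code == 304:
                return None  # unchanged; skip the download
            raise
        body = resp.read()
        last_modified = resp.headers.get("Last-Modified")
        # Fallback: compare content-lengths / filesizes for newness.
        # This saves re-parsing, not the download itself, and a
        # same-sized change slips through - it's only a heuristic.
        length = resp.headers.get("Content-Length") or str(len(body))
        if length == last_length:
            return None
        last_length = length
        return body

The catch, as noted above, is that dynamically generated feeds often
send neither Last-Modified nor a stable Content-Length, so this
degrades to a full download on every poll.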

What are your thoughts? Any additions to the above?


--
Morbus Iff ( i am your scary godmother )
http://www.disobey.com/ && http://www.gamegrene.com/
please me: http://www.amazon.com/exec/obidos/wishlist/25USVJDH68554
icq: 2927491 / aim: akaMorbus / yahoo: morbus_iff / jabber.org: morbus





