Saturday, 3 May 2003
RSS Traffic Characterisation
I’m setting up a weblog for a fairly well-known colleague, and doing some traffic estimates to try to size his server.
- 5000 people will eventually subscribe to the weblog
- Each person’s aggregator will poll once an hour for twelve hours a day (some less, some more)
- 75% of the hits will generate 1k of downstream traffic (304 Not Modified)
- 25% of the hits will generate ~50k of downstream traffic (200 OK)
So, 5000 * 12 * ((.75 * 1) + (.25 * 50)) = 795,000k/day, and 795,000 / 1024 ≈ 776M/day.
That’s a lot of RSS.
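The back-of-envelope math above can be sketched as a few lines of Python, with the assumed numbers pulled out as named constants so they're easy to vary:

```python
SUBSCRIBERS = 5000        # eventual subscriber count (assumed)
POLLS_PER_DAY = 12        # one poll an hour, twelve hours a day
NOT_MODIFIED_RATE = 0.75  # fraction of hits answered with a 304
SIZE_304_KB = 1           # ~1k downstream per 304 Not Modified
SIZE_200_KB = 50          # ~50k downstream per 200 OK

hits_per_day = SUBSCRIBERS * POLLS_PER_DAY
kb_per_hit = (NOT_MODIFIED_RATE * SIZE_304_KB
              + (1 - NOT_MODIFIED_RATE) * SIZE_200_KB)
kb_per_day = hits_per_day * kb_per_hit

print(f"{kb_per_day:,.0f}k/day ≈ {kb_per_day / 1024:.0f}M/day")
# → 795,000k/day ≈ 776M/day
```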
I’ve worked out a hosting solution for that blog, but the problem remains: I suspect that as RSS gets more popular, traffic like this will be more the norm than the exception.
For example, Web traffic is well-known to be self-similar. That is, it has the same “burstiness” no matter what time scale you look at it on; the highs and lows in your traffic will look roughly the same whether you examine a one-minute snapshot or a one-week snapshot. I suspect this isn’t true for RSS; the tendency for aggregators to poll at a preset time (often the top of the hour, or at hourly intervals from startup) means that there’s a huge, regular burst of traffic, followed by a long period with sparse hits.
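A toy simulation (with assumed numbers, not measurements) makes the shape of this obvious: compare 5000 aggregators that all fire within the first minute of the hour against the same 5000 spread uniformly across the hour, measured as peak hits per minute.

```python
import random

random.seed(42)
SUBSCRIBERS = 5000  # assumed, matching the estimate above

# Synchronised: every poll lands in the first minute of the hour.
synced = [random.random() * 60 for _ in range(SUBSCRIBERS)]
# Spread: each aggregator picks an arbitrary offset into the hour.
spread = [random.random() * 3600 for _ in range(SUBSCRIBERS)]

def peak_per_minute(arrival_seconds):
    """Bucket arrival times into 60 one-minute bins; return the busiest."""
    counts = [0] * 60
    for t in arrival_seconds:
        counts[int(t // 60)] += 1
    return max(counts)

print(peak_per_minute(synced))   # all 5000 hits land in one minute
print(peak_per_minute(spread))   # near the 5000/60 ≈ 83 average
```

Same total load, two very different peaks; it’s the peak, not the average, that a server has to be sized for.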
While this might not seem like a big deal, it plays havoc with the Web infrastructure, which is based on this assumption; servers are sized accordingly, and intermediaries optimize their caching algorithms based on it as well.
The RSS community has already worked to improve delivery; the adoption of ETag validation gives us the 75% 304 rate above (this is likely conservative; my weblog sees about a 90% rate. I suspect the rate depends mostly on shared hits from popularity and on rate of change, i.e., how often you publish).
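The validation logic behind those 304s is simple: compare the client’s If-None-Match header against the feed’s current ETag, and only ship the full body when the client’s copy is stale. A minimal server-side sketch (illustrative names, not any particular server’s API):

```python
def respond(if_none_match, current_etag, body):
    """Return (status, payload) for a conditional GET of the feed."""
    if if_none_match is not None and if_none_match == current_etag:
        return 304, b""          # ~1k of headers, no body
    return 200, body             # the full ~50k feed

feed = b"<rss>...</rss>"
etag = '"v42"'

print(respond('"v42"', etag, feed))  # fresh copy  -> (304, b'')
print(respond('"v41"', etag, feed))  # stale copy  -> (200, body)
print(respond(None, etag, feed))     # first fetch -> (200, body)
```

(Real If-None-Match handling also covers `*` and comma-separated lists of ETags; this sketch keeps only the common single-ETag case.)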
We can do more, however; intelligent use of Cache-Control directives to align with publishing schedules, for example. It may also be that we can deploy new infrastructure, such as a network of Web caches that only stores 304s to mitigate the cost of running them, or a small-scale Akamai-like network for personal publishing. We could also come up with guidelines for randomization of polling to spread the traffic out a bit.
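To make the Cache-Control idea concrete: if the feed is regenerated at known times, max-age can be set to expire exactly at the next scheduled publish, so caches and aggregators have no reason to revalidate any sooner. The publishing slots below are an assumption for illustration.

```python
from datetime import datetime, timedelta

PUBLISH_HOURS = (9, 13, 18)  # hypothetical daily publishing slots

def max_age(now):
    """Seconds from `now` until the next scheduled publish."""
    for h in PUBLISH_HOURS:
        slot = now.replace(hour=h, minute=0, second=0, microsecond=0)
        if slot > now:
            return int((slot - now).total_seconds())
    # Past today's last slot: next one is tomorrow's first.
    slot = (now + timedelta(days=1)).replace(
        hour=PUBLISH_HOURS[0], minute=0, second=0, microsecond=0)
    return int((slot - now).total_seconds())

now = datetime(2003, 5, 3, 10, 30)
print(f"Cache-Control: max-age={max_age(now)}")  # 13:00 is 2.5h away
```

A well-behaved aggregator honouring that header would also, usefully, have its next poll pushed away from the top of the hour, which spreads the burst out for free.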
There’s also the potential for using a HTTP-based invalidation protocol (many have been proposed), but I suspect that trust, scalability and firewall issues will still come into play.
Many people are looking at alternate delivery mechanisms (like NNTP or Jabber), but I’m not ready to abandon HTTP, at least until it’s proven unworkable. Using HTTP has several advantages, including simplicity, ubiquity, deployed infrastructure and codebase, the ability to easily traverse a firewall, and mindshare.
If you run a well-known weblog or other popular RSS feed, I’d love to see your logs (suitably anonymized) or statistics. In particular, I’m looking for:
- average hits per day
- average hits per minute
- peak hits per minute (ideally, a minute-resolution graph for a day or portion thereof)
- average hits per second
- peak hits per second
- ratio of 200 to 304 response status codes
- average size of 200 and 304 responses
Feel free to post a (reasonably sized!) comment here, or send me e-mail.