mnot’s blog

“Design depends largely on constraints.” — Charles Eames

Monday, 21 August 2006

Caching Performance Notes

There have been some interesting developments in Web caching lately, from a performance perspective; event loops are becoming mainstream, and there are lots of new contenders on the scene.

Fortuitously, I’ve been benchmarking proxies with an eye towards the raw performance of their HTTP stacks (i.e., how well they handle connections and parse headers, serving everything from memory), for work, so that we can select a platform for internal service caching.

In particular, I’ve been looking at the maximum response rate that a device can deliver, how much latency it introduces, and what the effect of overload is on both of those metrics. Additionally, how they handle large numbers of idle persistent connections is very interesting, because of their performance-enhancing effects, especially in an AJAX world (see Caching Web 2.0 for more).

Here are some quick impressions; YMMV. This was on a single 3GHz Xeon box, tested using autobench and httperf over Gigabit Ethernet.
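For reference, the runs looked something like the following. The host names and URIs are placeholders, and the flags are from memory of the two tools' man pages, so treat this as a sketch of the methodology rather than the exact invocations used:

```shell
# Probe raw capacity with httperf: a fixed offered rate of one-request
# connections against a small response served from memory.
httperf --hog --server proxy.example.com --port 3128 \
        --uri /small.html --rate 5000 --num-conns 20000 --timeout 5

# Sweep offered load with autobench to find the knee in the curve:
# step the request rate upwards and record throughput and latency.
autobench --single_host --host1 proxy.example.com --uri1 /small.html \
          --low_rate 1000 --high_rate 10000 --rate_step 500 \
          --num_conn 20000 --num_call 1 --timeout 5 --file results.tsv
```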

Update: I asked for dual-CPU boxes, and was fooled by Xeon HT; all of these tests are on a single CPU. D’oh. Hyper-Threading usually gives multi-threaded user apps a 15%-20% boost; I’m going to re-test with truly multiple CPUs (or at least cores) soon. Mea culpa.

Squid

Squid has been around pretty much since the start of the Web, as an outgrowth of the Harvest project. While it’s probably also the most widely deployed caching proxy, it’s been criticised for not performing very well. In particular, when it’s overloaded it behaves very badly: response times climb with load, eventually stretching to multiple seconds, and lots of connections get dropped.

However, Squid 2.6 was recently released, with support for epoll and kqueue. My testing shows it as being much better-behaved under load; response times are perfectly flat at about 180ms during overload, no matter what the request rate. 2.6 was also able to hold 15,000 persistent connections open without any noticeable change in response rate or latency. Impressive.
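To make the model concrete, here is a minimal sketch in Python of the single-process readiness-notification loop that epoll and kqueue enable (the selectors module picks epoll on Linux and kqueue on BSD). The key property behind the flat latency curve: idle registered connections cost essentially nothing until the kernel reports them readable. This is an illustration of the technique, not Squid's actual code:

```python
# A toy event-driven echo server: one process, one loop, no threads.
import selectors
import socket

def serve(listener, sel, stop):
    """Run the event loop until stop() returns True."""
    sel.register(listener, selectors.EVENT_READ, data=None)
    while not stop():
        for key, _mask in sel.select(timeout=0.1):
            if key.data is None:                # the listening socket is readable
                conn, _addr = key.fileobj.accept()
                conn.setblocking(False)
                sel.register(conn, selectors.EVENT_READ, data=b"")
            else:                               # a client socket is readable
                chunk = key.fileobj.recv(4096)
                if chunk:
                    key.fileobj.sendall(chunk)  # echo it back
                else:                           # peer closed the connection
                    sel.unregister(key.fileobj)
                    key.fileobj.close()
```

Thousands of idle connections here are just entries in the selector; the loop only does work when the kernel says a descriptor is ready.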

Squid 2.6 doesn’t perform quite as well in terms of raw capacity (serving about 7,500 small responses per second, vs. 2.5’s 9,000), but hopefully it’ll get better as the 2.6 line matures.

One thing that still concerns me about Squid is that its capacity drops much more than other proxies do when response sizes get larger.

Overall, Squid is a good workhorse that’s somewhat limited by its age; since it’s single-process and single-threaded, it can’t take advantage of multiple cores, putting it at a severe disadvantage against threaded servers. Still, it’s very configurable, has good instrumentation, and is a known quantity. Not a bad option.

Apache

Apache 2.2’s mod_cache along with mod_proxy make it possible to cobble together a caching intermediary from the venerable (and extremely popular) server. This was highlighted in a recent OSCON presentation, where I’ve heard it was touted as a serious competitor to Squid for gateway caching (a.k.a. reverse proxying).

First, the worker MPM. While raw capacity was good at around 14,000 responses a second and overload behaviour was beautiful, this configuration utterly fell down when I tried holding any significant number of idle persistent connections open.

I think this is because the worker MPM uses a thread per connection, and when you run out of threads, you can’t accept any more connections. That can’t be the whole story, though, because even with Apache configured to have 8,000 threads (spread between a number of processes), it wasn’t happy when more than about 1,000 connections were open.

That’s a big problem, so I next tried the event MPM, in hopes of avoiding this problem. It was better, but still was only able to hold about 11,000 connections open before giving up, and introducing about 100ms of latency as well. Additionally, it had lower overall performance, topping out at about 12,000 responses a second, and was unstable under extreme load.

I was really hopeful about Apache, but until the event MPM comes out of experimental status, it doesn’t seem like a good idea.

Varnish

Another contender is Varnish, a brand-new multi-threaded gateway project out of Norway. While it’s light on documentation, what’s on the site looks promising; the folks behind it seem to respect the protocols and have good intentions.

I only briefly tested it, and saw it go up to about 10,000 responses/second before taking a serious dive when overloaded, down to less than 1,000, while response times rocketed to more than a second.

Of course, it’s still an alpha project, so it’s definitely one to keep an eye on.

Lighttpd

Lighttpd isn’t a caching proxy, it’s a high-performance Web server. However, it does have a proxy module (that’s being actively rewritten for version 1.5) and there has been some interest expressed in writing a caching module for it.

There’s a good reason for that; Lighty (as it’s called) is very, very fast — 19,000 responses/second kind of fast. It handles overload very gracefully, and it doesn’t blink when it has a large number of idle connections open.

In short, Lighty would be an excellent basis for a proxy or gateway cache, if we can get the caching part taken care of. Listen to an interview with Lighty’s primary developer for more.


Filed under: Caching Web

15 Comments

Dominic Mitchell said:

I listened to that interview about Lighty, and doesn't it have the same problem that Squid does? I.e., that a single process won't spread over multiple CPUs well?

Monday, August 21 2006 at 3:43 PM +10:00

Mark Nottingham said:

Not really, because it’s been built from the ground up to be event-based, whereas Squid had it bolted on very late in its lifetime.

It is possible to build a high-performance Web server without using kernel-level threads; effectively, you build user-level threads using techniques like coroutines and continuations. E.g., the Twisted framework in Python takes this approach, with excellent results.
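To illustrate that idea, here is a toy sketch in Python using plain generators as cooperative "tasks" under a round-robin scheduler. This shows the user-level-threading technique in miniature; it is not Twisted's actual machinery:

```python
# User-level threads via generators: each task yields wherever it would
# block, and a trivial scheduler resumes them in turn. No kernel threads.
from collections import deque

def task(name, steps, log):
    """A cooperative task: do a slice of work, then hand control back."""
    for i in range(steps):
        log.append((name, i))   # one slice of work
        yield                   # cooperatively yield to the scheduler

def run(tasks):
    """Round-robin scheduler: resume each live task once per pass."""
    ready = deque(tasks)
    while ready:
        t = ready.popleft()
        try:
            next(t)             # resume until the task's next yield
            ready.append(t)     # still alive: requeue it
        except StopIteration:
            pass                # task finished; drop it
```

Running two tasks interleaves their work slices, which is exactly the property an event-driven server exploits to serve many connections from one kernel thread.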

As far as multiple CPUs go: Lighty does have a “hidden” configuration option to use multiple worker processes; they recommend setting it to 2x the number of CPUs you have, assuming you’re not running CGI processes, etc. This is what I did for the numbers above.

Cheers,

Monday, August 21 2006 at 4:53 PM +10:00

adrian chadd said:

There's a magic option you can set to tell lighttpd to use >1 process. It'll then spawn that many processes which are all just bound to the same incoming socket.

Things will get exciting once a caching module is written as there's suddenly some shared state in the mix.
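For illustration, the pattern described above looks roughly like this in Python: bind and listen once, fork, and let each child block in accept() on the inherited socket, so the kernel spreads incoming connections across the processes. This is a hedged sketch of the general pre-fork pattern, not lighttpd's implementation:

```python
# Pre-fork with a shared listening socket.
import os
import socket

def prefork_serve(listener, workers, handle):
    """Fork `workers` children; each accepts one connection from the
    shared listening socket, handles it, and exits."""
    pids = []
    for _ in range(workers):
        pid = os.fork()
        if pid == 0:                      # child: inherits the bound socket
            conn, _addr = listener.accept()
            handle(conn)
            conn.close()
            os._exit(0)                   # exit without running cleanup twice
        pids.append(pid)                  # parent: remember the child pids
    return pids
```

Because all children call accept() on the same descriptor, no dispatching logic is needed; the kernel hands each new connection to exactly one of them. The shared-state problem only appears once the processes need a common cache.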

Monday, August 21 2006 at 4:54 PM +10:00

Hugo Haas said:

Have you ever run some tests on Polipo?

http://www.pps.jussieu.fr/~jch/software/polipo/

I used to use it to improve the performance of Python's urllib2. It's a lightweight HTTP/1.1 caching proxy, which is very good at keeping connections open and trying to use pipelining whenever it can.

I'd be curious to see how it performs under your tests.

Cheers,

Hugo

Tuesday, August 22 2006 at 9:41 AM +10:00

Stelios G. Sfakianakis said:

Mark, you do not give any details about the OS used for these tests. Although this is not very critical to the discussion, I was also puzzled about the Apache behaviour regarding threads (you say you configured Apache with 8000 threads but you had problems with ~1000 connections): maybe you have used FreeBSD and libpthread? I think libthr (the alternative, à la Linux, 1:1 thread library) supposedly gives better performance...

thanks, Stelios

Wednesday, August 23 2006 at 3:56 AM +10:00

Mark Nottingham said:

Hi Stelios,

Most of this was on Linux (2.6), but I cross-checked some results (particularly various uses of epoll vs. kqueue) on FreeBSD 4.

I was puzzled by the Apache result as well, and want to dig deeper into it. It does seem in line with what other people have said in the past about Apache and large numbers of connections.

Cheers,

Wednesday, August 23 2006 at 7:05 AM +10:00

Mark Nottingham said:

I double-checked this, and it was as I suspected: Apache gets memory-starved. In this case, with each process consuming about a quarter of a gigabyte (even with most modules disabled), I run out of 2G of memory at about 1,000 idle connections.

When this occurs, Apache can't allocate memory for workers, and all kinds of havoc ensues.
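A back-of-envelope check of those numbers (the ThreadsPerChild value here is an assumption, purely for illustration):

```python
# If each Apache process costs ~256 MB, a 2 GB box fits ~8 processes.
# At an assumed ThreadsPerChild of 128, reaching 8,000 threads would
# need ~62 processes (~16 GB of RAM); only the ~8 that fit can run,
# capping usable threads, and thus idle connections, near 1,000.
ram_mb = 2 * 1024
per_process_mb = 256
threads_per_child = 128                           # assumption for illustration

processes_needed = 8000 // threads_per_child      # ~62 processes wanted
processes_that_fit = ram_mb // per_process_mb     # 8 processes possible
usable_connections = processes_that_fit * threads_per_child
print(processes_needed, processes_that_fit, usable_connections)
```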

Wednesday, August 23 2006 at 11:56 AM +10:00

Harry Fuecks said:

Looks like Perlbal (uses non-blocking IO) got an in-memory cache with 1.48 - http://code.sixapart.com/svn/perlbal/tags/Perlbal-1.48/lib/Perlbal/ - seems to be for conditional GETs only, and I'm not sure on what basis it flushes the cache.

Also http://lists.danga.com/pipermail/perlbal/2006-August/000233.html - the questions are getting asked...

Thursday, August 24 2006 at 6:35 AM +10:00

Poul-Henning Kamp said:

Hi there,

Thanks for noticing Varnish.

Yes, the code is still very green, but in fact our release 1.0 is due today, and we are confident that a properly configured Varnish will deliver whatever the hardware and operating system will allow.

Yesterday we served 700 Mbit/s, 10 kreq/s on a dual opteron running FreeBSD 6.2-Prerelease. Response times were in the low milliseconds.

And it still had approx. 40% idle CPU.

We are already running live on www.vg.no, Norway's busiest web site, and you can see here what that did for their response time:

https://tech.dignet.no/smokeping?&target=World.Europe.Norway.VerdensGang

Their previous configuration was a dozen boxes running squid.

While performance is a big item for Varnish, I think people will find our features, in particular the VCL programming language for policy implementation, very interesting as well.

Stay tuned, we'll get our web-page and docs updated eventually :-)

Poul-Henning

Tuesday, September 19 2006 at 1:15 AM +10:00

Allan said:

I discovered Varnish following some developer links on the pfSense firewall appliance page, and I was interested to see whether any external review had turned up on Google yet, which led me to this page instead. Just from the technical description on the Varnish site, I don't give the other competitors much of a chance at scale, although I can see Lighty becoming extremely serviceable in the mid-scale.

When I followed PHK's URL to the RRD smokeping results, I could see a recent time period ending 11:50 in some unspecified time zone where the average ping latency was in the 30ms range, then a rapid climb which converged around 140ms and holding. Did they switch varnish out of the loop at that point in time? The data there is not self-explanatory.

PHK has the same insider advantage writing an HTTP cache that the OpenBSD team had when they tackled packet filter (pf). Though I have to point out one point of disagreement with PHK's design screed: the granularity of the OS page-management mechanism (traditionally 4KB) is not a priori domain-appropriate, unless PHK is exploiting a userland mmap-ish facility to simultaneously touch all pages (demand-pull from the VM system) of a longish request-response object, but he makes no mention of this. Otherwise, a lot is left up to chance that the VM system incorporates a reasonable page-prefetch heuristic in compensation for weak domain knowledge.

This is a general point: it's almost always best to have one layer rather than two, as PHK advocates, *if* you don't end up sacrificing domain knowledge in the process, such as the foreknowledge that an entire large response will be blasted out sequentially. That said, this is a well-studied problem in VM design (over multiple decades) that PHK most likely regards by this point, perhaps subconsciously, as water under the bridge.

Tuesday, September 19 2006 at 5:53 AM +10:00

Poul-Henning Kamp said:

Allan,

I think we are on the same page here (pun intended).

Don't get confused about the page granularity thing, the entire object is stored sequentially in one chunk in virtual memory if I can get away with it. My point about page alignment is only about where the object starts in VM, not about how long it is.

That also indirectly addresses your second point, because the entire response, headers and body, is transmitted in a single system call (writev(2) or sendfile(2)), so the kernel knows up front how many bytes it will need to send and where to find them.
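(To illustrate the single-syscall point: Python's os.writev exposes the same POSIX scatter/gather primitive, so headers and body held in separate buffers go out in one call. A pipe stands in for the client socket here:)

```python
# Gather two separate buffers (headers and body) into one write syscall,
# as writev(2) does; the kernel sees the full response size up front.
import os

r, w = os.pipe()
headers = b"HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\n"
body = b"hello"
sent = os.writev(w, [headers, body])     # one syscall, two buffers
assert sent == len(headers) + len(body)
assert os.read(r, sent) == headers + body
os.close(r)
os.close(w)
```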

Obviously, the kernel should do intelligent prefetching from backing store, ideally based on a bandwidth estimate coming out of the network stack's TCP processing. If it doesn't, it should be fixed, rather than having the application try to band-aid this shortcoming.

And equally obviously, if your operating system does not have a scatter/gather I/O facility like writev(2) or sendfile(2), then the situation would be entirely different. But who would try to do high-performance computing on an operating system which is 20 years behind in facilities?

And yes, I know that I have a different attitude to "make the operating system do the right thing" than most people are in a position to maintain. That is why I have enjoyed this excursion out into userland so much: It helps me understand better where the kernel needs to go in the future.

I have to say though, that given modern computer hardware I think most of this is less of an issue than it might immediately seem.

12 years ago, we could fill a 100Mbit/s Ethernet with a 486DX50 (remember those?).

If a dual 2.8GHz Opteron cannot fill at least two, and hopefully more, GigE networks, something is seriously wrong somewhere.

Some may argue that the by now somewhat quaint TCP protocol is one of these "wrongs", but so far it seems that pushing more and more of TCP processing into hardware will keep that problem under control.

The biggest wrong, IMO, is that people still program for the computer architecture of the 1970s and 1980s: "A computer consists of a central processing unit, internal storage and external storage".

Since BSD 4.3, unix has only emulated that view for backwards compatibility.

The computer architecture to program for today is "A computer has a number of processing units with private caches, sharing a page granularity cache of anonymous and named objects, possibly backed by slow storage devices".

Wrapping your head around that mouthful is the key to high performance.

My hope is that Varnish will serve as an example in this direction.

Poul-Henning

Wednesday, September 20 2006 at 2:05 AM +10:00

Dan Kubb said:

Mark, also be sure to check out Nginx (pronounced Engine X):

http://nginx.net/

It's similar in performance to lighttpd, and likewise doesn't have a caching module yet. However, it is more stable and memory-efficient, and (IMHO) has a cleaner code base than lighttpd. It also handles HTTP proxying, name/IP-based virtual servers, SSL, gzip compression and other basics.

Nginx would make a fantastic proxy or gateway cache with the addition of a caching module.

Saturday, December 9 2006 at 12:20 AM +10:00

John said:

Hi,

I'm trying to figure out the best way to load-test two Squid reverse proxies, and am looking for a tool that will let me test with lots of different URLs. Basically, I have a list of about a million URLs from a day's worth of Squid access logs that I would like to use as the targets for my test. Is there a good tool for testing caches that also allows a good spread of URLs?

What tools did you use in the tests above?

Thanks

Friday, July 27 2007 at 9:53 PM +10:00

Warwick Poole said:

Hi Mark

I realize this discussion was started a while back and these things progress quickly... to follow up on Dan's comment above, there is now an Nginx cache solution, called ncache: http://code.google.com/p/ncache

Or rather, this is a release of nginx which includes a new native caching module.

I had some build issues with it, but the developers were very responsive, and this looks like something to eval alongside Varnish and Perlbal.

Monday, January 21 2008 at 10:57 PM +10:00

David said:

I'd love to see an update of this article with the addition of Perlbal, polipo, and nginx, and updated versions of the others.

Friday, April 10 2009 at 3:27 AM +10:00

Creative Commons