mnot’s blog

“Design depends largely on constraints.” — Charles Eames

Sunday, 29 April 2007

Squid is My Service Bus

The QCon presentation (slides) was ostensibly about how we use HTTP for services within Yahoo’s Media Group. When I started thinking about the talk, however, I quickly concluded that everyone’s heard enough about the high-level benefits of HTTP and not nearly enough about what it does on the ground. So, I decided to concentrate on one aspect of the value we get from using HTTP for services: intermediation.


The most obvious advantage of using an HTTP intermediary is caching; if your service is struggling to do 20 or 50 requests a second, the thousands that a modern HTTP cache can handle are a relief, to say nothing of the superior connection handling you’ll get thanks to the easier utilisation of event-looping techniques like epoll and kqueue. This is often the difference between deploying two boxes and twenty or more.
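To make that concrete, a minimal Squid accelerator (reverse proxy) setup is only a few lines of squid.conf. This is a generic sketch using Squid 2.6-style directives, with made-up hostnames and ports rather than anything from the talk:

```
# Accept requests on port 80, acting as an accelerator for the site.
http_port 80 accel defaultsite=media.example.com

# The origin server that Squid shields from direct load.
cache_peer origin.example.com parent 8080 0 no-query originserver

# Cache in memory as well as on disk.
cache_mem 256 MB
cache_dir ufs /var/spool/squid 1024 16 256

# Only answer requests for the accelerated site; refuse everything else.
acl our_site dstdomain media.example.com
http_access allow our_site
http_access deny all
```

With this in place, how much actually gets served from cache is up to the origin: it just has to send sensible Cache-Control and Expires headers.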

My intermediary of choice at the moment is Squid, by far the predominant Open Source Web proxy cache implementation. It’s not particularly performant compared to the competition (it can only serve about 7,000 requests a second out of memory on a Xeon, although it will do 12,000 on a Core2 Duo), it’s single-threaded, and perhaps most damning, its connection handling is still only HTTP/1.0. However, it makes up for these shortcomings in features and flexibility.

Not only can Squid cache content and route requests (using redirectors), it can also enforce security policy (using ACLs and authentication), serve as a metrics collection point (e.g., see the histograms on slide 31), and load-balance between multiple origin servers (to the point where you can dynamically route at the application level through a network). Cache peering protocols like ICP and HTCP can tie caches together, increasing their effective footprint as well as their reliability and efficiency. Cache Digests can further improve performance by predicting what’s in a peer’s cache.
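By way of illustration, here’s a rough sketch of what ACL-based routing and sibling peering look like in squid.conf; the directives are real, but the peer names and the /search/ path are hypothetical:

```
# Route requests for /search/ to a dedicated origin cluster.
acl search_api urlpath_regex ^/search/
cache_peer search-origin.example.com parent 8080 0 no-query originserver name=search
cache_peer_access search allow search_api
cache_peer_access search deny all

# Tie a neighbouring cache in as a sibling via ICP (port 3130);
# proxy-only means we fetch hits from it but don't re-store its objects.
cache_peer peer1.example.com sibling 3128 3130 proxy-only
```

The same cache_peer_access mechanism works with any ACL type, so routing decisions can key off headers, source addresses, or methods as easily as URL paths.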

More specialised features can help reliability and scaling even more; for example, collapsed forwarding prevents storms of requests from overwhelming the server by collapsing multiple concurrent requests for the same URI into one. Squid can retry requests intelligently, and re-route as necessary upon failure, without breaking the semantics of HTTP. It will also pool persistent connections, to reduce the latency of opening new ones. There are some more enhancements along this track in the pipeline that I’ll talk about separately soon.
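Again as a sketch (the directive names are real; the hostnames are not), most of these features are one-line switches in squid.conf:

```
# Collapse concurrent misses for the same URI into a single origin fetch.
collapsed_forwarding on

# Pool persistent server-side connections for reuse.
server_persistent_connections on

# Two origin parents: round-robin spreads the load, and Squid fails
# over to the surviving peer if one stops responding.
cache_peer origin1.example.com parent 8080 0 no-query originserver round-robin
cache_peer orig2.example.com parent 8080 0 no-query originserver round-robin
```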

These are just a few examples; Squid has been under development for more than a decade (growing out of the Harvest project, the granddaddy of pretty much every Web cache *and* search engine out there), and because it’s community-developed, it’s very feature-rich.

The point of this is that Squid (or almost any other HTTP cache implementation, for that matter; because HTTP has a well-defined intermediary role, it’s easy to drop a new one in) can serve as the basis of what most people think of as an Enterprise Service Bus for HTTP. True, it doesn’t support any WS-*, but more and more people consider that a plus, not a minus, and you don’t have to pay a vendor for the privilege of debugging their beta product; it’s free and battle-tested. Oh, and a hell of a lot faster.


Filed under: Caching HTTP Web Web Services

6 Comments

schickb said:

What data format do you prefer coming out of your "back-end" servers? One concern I have with implementing REST internally using web technologies is that the data formats are optimized as much for humans as for machines. I guess front-end machines are cheap, but it just seems like a waste parsing json, xml, or worst of all html.

I'm assuming that the back-end machines are exposing resource representations intended for further transformation, rather than representations meant for simple aggregation and display.

Monday, May 28 2007 at 1:43 PM +10:00

Mark Nottingham said:

XML, mostly. Some JSON, etc. Mostly transformation, but not all. It's not that much of an issue; other inefficiencies dwarf the parsing and transmission costs (XML parsers have become *very* good), and the interoperability advantages are considerable.

This is pretty much the IETF party line (as much as there can be one), and IMO it's confirmed by the work we did in the XML Binary Characterisation WG (despite some people's feelings).

Wednesday, May 30 2007 at 10:14 PM +10:00

schickb said:

I thought one of the XML Binary Characterization WG conclusions was that "Binary XML is needed"? (I am reading this http://www.w3.org/TR/xbc-characterization/ ). Do you use some form of Binary XML?

Thursday, May 31 2007 at 2:07 PM +10:00

Mark Nottingham said:

That's what they thought, yes. I suppose gzipped XML could count as binary...

Thursday, May 31 2007 at 5:58 PM +10:00

scottdawson said:

Using the terminology from your Caching Tutorial, would you consider squid a Gateway Cache in the scenario you are describing?

Saturday, December 22 2007 at 2:05 AM +10:00

Mark Nottingham said:

It depends; sometimes it's deployed like that, but often we use Squid as a straight proxy.

Saturday, December 22 2007 at 8:46 AM +10:00

Creative Commons