mnot’s Web log

“Design depends largely on constraints.” — Charles Eames

Friday, 3 July 2009

Come to the Stockholm IETF!

The Stockholm IETF meeting is shaping up to be an interesting one (and not just because it’s in such a beautiful city).

As announced on the mailing list, we are having a HTTPbis working group meeting. It looks like all of the editors will be there as well, so we’ll have a chance to get good feedback from the community, as well as move forward on the documents in between other meetings.

Additionally, I’m helping to arrange a couple of informal meetings:

The IETF/W3C Liaison has been discussing issues surrounding IRIs for a little while now, and we’re holding an IRI Bar BoF (informal meeting that’s often but not always in a bar) to get more involvement from the wider community (including the IDNAbis effort) so that we can figure out the appropriate standards actions.

I’m particularly interested in this one, since a lot of XML efforts (e.g., Atom) are reflexively using IRIs instead of URIs wherever they can — including cases where they’re not intended for display to humans — even though supporting them is potentially a lot trickier.

There’s also been a fair amount of recent discussion around Atom extensions and revisions, so we’re arranging an Atom Bar BoF as well. My personal feeling is that revising Atom to account for non-blog use cases is necessary, although the energy that the community has to devote to it is probably low. Should be an interesting discussion.

See you there!

Thursday, 25 June 2009

The Resource Expert Droid

A (very) long time ago, I wrote the Cacheability Engine to help people figure out how a Web cache would treat their sites. It has a few bugs, but is generally useful for that purpose.

However, as I’ve got more involved in using HTTP for non-browser things, it’s become apparent that more than caching is important when you’re examining a resource to see how it behaves; things like partial content, syntax checking and other esoteric but important details. Very often, I’d find myself manually debugging a RESTful Web service with telnet — and as they say, that doesn’t scale.

Looking back at that decade-old code, I decided that rather than fixing it up (“lipstick” and “pig” are two words that come to mind), I’d rewrite. The result, after quite a few evenings and weekends, is the Resource Expert Droid.

In a nutshell, RED is a framework for testing HTTP resources; it fetches responses, analyses them, and then based upon the responses it may interact with the resource more to see how it behaves. In this manner, it’s very purposefully encouraging RESTfulness.

Note that I say “resources”, not “servers.” Since a single server can serve you content in a number of different ways (think plain files vs. CGI vs. mod_autoindex), you need to test on a per-resource granularity when you have problems. Of course, this observation isn't new.

A few examples for fun (please take it easy on these!):

Make sure you hover over the messages in the list for a full explanation of each.

REDbot is Open Source, and hosted at GitHub. It’s nowhere near finished yet; there’s still lots more to do (see the issues list), but contributions and suggestions are more than welcome.

Wednesday, 17 June 2009

面向站长和网站管理员的Web缓存加速指南 (Web Caching Acceleration Guide for Site Owners and Webmasters)

The caching tutorial is now available in Chinese, courtesy of Che Dong (and apologies for taking so long in linking to it!).

Norwegian should be coming soon...

Friday, 12 June 2009

What to Look For in a HTTP Proxy/Cache

Part of my job is maintaining Yahoo!’s build of Squid and supporting its users, who use it to serve everything from the internal Web services that make sites go to Flickr’s images.

In that process, I’m often asked “what about X?”, where X is another caching or load balancing product, such as Varnish or lighttpd (yes, Squid can be used as a load balancer).

Generally, these comparisons come down to three factors: performance, features and manageability. Almost invariably, Squid doesn’t do as well as newcomers in performance (although it generally is faster than Apache), but wins on features and manageability, and that’s why it’s so widely used.

I’m not going to argue that Squid is best for every deployment, but I do think that it’s important to evaluate the whole picture, rather than just one metric. So, here are a few initial thoughts about what’s important when you’re evaluating a proxy/cache:

Performance

Performance can mean a lot of things. The least interesting but most widely cited benchmark for this kind of server is “how many 1K responses can it serve from memory per second?”, but that doesn’t tell you how it will do serving 200K (or 200M) responses from disk, which is a much more difficult thing to manage.

Try looking at:

Features

Concurrence

How does the proxy handle multiple requests for the same URL? This is often critical in “reverse proxy” deployments, where a flood of requests can come in for the same thing if it gets suddenly popular, or when you first bring a cache online. If the response isn’t cached and fresh, that flood of requests can quickly overcome your back-end servers.

There are a few techniques for dealing with this. Collapsed forwarding will only allow one request for a URL to go forward at a time, if there isn’t anything in cache; if the response is cacheable, it will be sent to all waiting clients, saving those requests from going forward and swamping the origin server.
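To make collapsed forwarding a bit more concrete, here is a minimal Python sketch of the idea. It is not how Squid (or any other product) implements it; `CollapsedFetcher` and the `fetch` callable are names made up for illustration, and the sketch skips the cacheability check, simply sharing whatever the single forwarded request returns.

```python
import threading


class CollapsedFetcher:
    """Let only one request per URL go forward; waiters share its result."""

    def __init__(self, fetch):
        self._fetch = fetch        # hypothetical callable: url -> response
        self._lock = threading.Lock()
        self._in_flight = {}       # url -> (Event, result holder)

    def get(self, url):
        with self._lock:
            entry = self._in_flight.get(url)
            leader = entry is None
            if leader:
                # First request for this URL: it goes to the origin.
                entry = (threading.Event(), {})
                self._in_flight[url] = entry
        done, holder = entry
        if leader:
            try:
                holder["response"] = self._fetch(url)
            except Exception as exc:
                holder["error"] = exc
            finally:
                with self._lock:
                    del self._in_flight[url]
                done.set()         # wake everyone who was waiting
        else:
            done.wait()            # collapsed: wait for the leader's answer
        if "error" in holder:
            raise holder["error"]
        return holder["response"]
```

With this in front of an origin, a burst of simultaneous requests for the same URL turns into one back-end fetch; later requests (after the first completes) would normally be answered from cache instead.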

If something is cached but stale, stale-while-revalidate lets the cache serve the stale response while it refreshes what it has in the background. Not only does this save you from a flood of validation requests, but it also effectively hides the latency of refreshing your content from your clients, offering better quality of service.
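As a rough illustration, the decision a cache makes for stale-while-revalidate can be sketched in a few lines of Python. The function names and return labels below are invented for this example, and the sketch assumes the window is expressed as a `stale-while-revalidate=N` directive in Cache-Control alongside `max-age`; a real cache has many more inputs.

```python
def parse_cache_control(value):
    """Parse a Cache-Control header value into a dict of directives."""
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"')
    return directives


def swr_decision(cache_control, age):
    """Decide what to do with a cached response that is `age` seconds old."""
    cc = parse_cache_control(cache_control)
    max_age = int(cc.get("max-age") or 0)
    window = int(cc.get("stale-while-revalidate") or 0)
    if age < max_age:
        return "serve_fresh"
    if age < max_age + window:
        # Answer immediately from cache; refresh in the background.
        return "serve_stale_and_revalidate"
    # Too stale: the client has to wait for the origin.
    return "revalidate_first"
```

For example, with `max-age=600, stale-while-revalidate=30`, a response that is 610 seconds old would still be served straight away while the cache refreshes it.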

ACLs

In my experience, one of the biggest things that gets a workout in a proxy/cache is the ACL system. Make sure you have maximum flexibility here; e.g., can you apply access control to something based on whether it’s a cache miss? Can an ACL select things by the request method, URL, headers, client address? Can you combine ACL directives? Can you extend the ACL system?
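One way to picture the kind of flexibility I mean is to treat each ACL as a small predicate that can be combined with others. This Python sketch is purely illustrative; the `Request` type and helper names are assumptions for the example, not any product's configuration language.

```python
import ipaddress
import re
from dataclasses import dataclass, field


@dataclass
class Request:
    method: str
    url: str
    client: str                       # client IP address as text
    headers: dict = field(default_factory=dict)
    cache_hit: bool = False


# Basic selectors: each returns a predicate taking a Request.
def method_is(*methods):
    return lambda r: r.method in methods

def url_matches(pattern):
    rx = re.compile(pattern)
    return lambda r: rx.search(r.url) is not None

def client_in(network):
    net = ipaddress.ip_network(network)
    return lambda r: ipaddress.ip_address(r.client) in net

def is_miss(r):
    return not r.cache_hit


# Combinators: ACLs built out of other ACLs.
def all_of(*acls):
    return lambda r: all(acl(r) for acl in acls)

def any_of(*acls):
    return lambda r: any(acl(r) for acl in acls)

def negate(acl):
    return lambda r: not acl(r)


# Example policy: anyone may get a cache hit, but only internal
# clients using GET or HEAD may cause a request to go to the origin.
allow = any_of(negate(is_miss),
               all_of(method_is("GET", "HEAD"), client_in("10.0.0.0/8")))
```

Extending the system is then just a matter of writing another predicate.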

Streaming and Buffering

A good proxy will offer fine-grained control over how it buffers requests and responses. For example, if you’re deploying as a reverse proxy, you want to be able to buffer up the entire response, so that you can free up resources on the origin server as quickly as possible if the client is slow. Likewise, buffering the request before sending it to the origin server can help conserve resources in some deployments, increasing capacity.

Conversely, however, it’s not good if your proxy requires responses to be buffered before they’re sent; this consumes too many resources on the proxy if you’re sending large responses, and doesn’t work at all for streaming applications (e.g., video).

Cache Behaviour Tuning

Although HTTP has excellent controls to allow both the origin server and the client to say how caches should behave, inevitably there will be cases where you’ll need to… ahem… fine-tune them. This includes tuning the heuristic algorithm, which decides what to do when there are no such instructions.

It also includes overriding the specified behaviour. For example, a reverse proxy probably wants to ignore Cache-Control: no-cache, since the cache is under control of the origin server.

All of these tuning knobs need to be applicable in a fine-grained way; Squid does it with regular expressions against the URL (in refresh_patterns).
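Here is a small Python sketch of what per-URL heuristic tuning can look like. It is loosely inspired by the idea of matching regular expressions against the URL, but it is a simplification and not Squid's actual algorithm; the rule table, its values and the function name are assumptions for illustration. The heuristic itself is the common one of taking a fraction of the time since the response was last modified.

```python
import re

# (URL pattern, minimum seconds, fraction of time since Last-Modified, maximum seconds)
RULES = [
    (re.compile(r"\.(jpg|png|gif)$", re.I), 3600, 0.5, 7 * 86400),
    (re.compile(r""), 0, 0.1, 86400),   # default rule: matches everything
]


def heuristic_lifetime(url, date, last_modified, rules=RULES):
    """Estimate a freshness lifetime (seconds) when the origin gave none.

    `date` and `last_modified` are Unix timestamps taken from the
    response's Date and Last-Modified headers.
    """
    for pattern, min_s, fraction, max_s in rules:
        if pattern.search(url):
            estimate = fraction * max(0, date - last_modified)
            # Clamp the estimate between the rule's floor and ceiling.
            return min(max(estimate, min_s), max_s)
    return 0
```

The first matching rule wins, so more specific patterns go at the top.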

Cache Configuration

The cache as a whole needs to be configurable as well.

For example, when the set of cached objects gets larger than the allocated memory or disk space, the cache needs to evict some. As a mountain of research will attest, some replacement policies are more efficient than others, especially under different workloads.
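For reference, the simplest replacement policy most people start from is least-recently-used (LRU). The following Python sketch shows it with a byte budget; the class name is made up for this example, and real caches usually offer more sophisticated policies that also weigh object size and fetch cost.

```python
from collections import OrderedDict


class LRUCache:
    """Evict the least recently used objects once a byte budget is exceeded."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self._items = OrderedDict()   # oldest entries first

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, body):
        if key in self._items:
            self.used -= len(self._items.pop(key))
        self._items[key] = body
        self.used += len(body)
        # Evict from the cold end until we fit again.
        while self.used > self.capacity and self._items:
            _, evicted = self._items.popitem(last=False)
            self.used -= len(evicted)
```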

Resilience to Errors

Networked systems inevitably fail. Besides the obvious aspects of this (e.g., configurable timeouts), in a cache it’s also important to handle failures as gracefully as possible, to preserve both quality of service and cache efficiency.

Stale-If-Error helps to hide temporary back-end problems by allowing a cache to use a stale cached response (if available) when it can’t get a fresh one, or if the server returns an error code like 500 Internal Server Error. For situations where having something stale is better than nothing at all, this helps.
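The logic is simple enough to sketch in Python. The names below are invented for illustration, and the set of status codes treated as errors (500, 502, 503, 504) is my assumption about a sensible definition; `origin_status` is `None` when the origin could not be reached at all.

```python
ERROR_STATUSES = {500, 502, 503, 504}   # assumed definition of "an error"


def choose_response(origin_status, cached_age, max_age, stale_if_error):
    """Pick which response to send when revalidating a stale entry.

    cached_age is the age in seconds of the cached copy, or None if
    nothing is cached. Returns "origin", "stale_cache" or "error".
    """
    failed = origin_status is None or origin_status in ERROR_STATUSES
    if not failed:
        return "origin"
    within_window = (cached_age is not None
                     and cached_age < max_age + stale_if_error)
    if within_window:
        return "stale_cache"    # something stale beats nothing at all
    return "error"
```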

Quick Abort works from the other side; when the client aborts (because of a network or software problem, or a simple timeout), the cache should be able to be configured to continue downloading the response from the server, so that the next client will have the benefit of having it in cache.

Peering

Caches are often deployed in sets, both to increase capacity and also to assure availability. In these deployments, support for inter-cache protocols like ICP and HTCP means a better hit rate and, perhaps more importantly, the ability to bring a “cold” cache up-to-speed without overloading origin servers.

When evaluating support for peering, keep in mind that HTCP is more capable than ICP, because it takes into account the request headers, not just the URL. Also, HTCP CLR support means that something becoming invalid in one cache can trigger purges from neighbouring caches too (a pattern I’ll talk more about soon). Good implementations should also have a means of assuring that forwarding loops don’t happen.

Finally, Cache Digests are an interesting way to use a Bloom filter; by keeping a lossy digest of peers’ contents, it’s possible to predict whether a given request will be a hit. This is useful when the latency between peers makes “normal” inter-cache protocols too expensive (e.g., deployments between coasts or continents).
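If you haven't met Bloom filters before, this Python sketch shows the basic trick. It is not the Cache Digest wire format; the class name, the bit-array size and the use of SHA-256 to derive four bit positions are all assumptions made for illustration.

```python
import hashlib


class BloomDigest:
    """A lossy summary of a peer's cache contents."""

    def __init__(self, size_bits=1024):
        self.size = size_bits
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, key):
        # Derive four bit positions from one hash of the key.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        for i in range(4):
            chunk = digest[8 * i:8 * i + 8]
            yield int.from_bytes(chunk, "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        # False means "definitely not cached"; True means "probably".
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))
```

A peer that receives such a digest can skip asking about anything for which `might_contain` is false; the price is the occasional false positive, which just turns into a miss at the peer.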

Routing

Proxies often get used as layer 7 routers; usually, to shift traffic around to the right server, for some value of “right.” A good proxy will have a number of tools to help you do this, including active and passive monitoring of peers and origin servers (to determine health and availability), flexible request rewriting (including both the request URI and response Location headers), and controls over how many connections can go to a particular server, as well as how many idle connections to keep open to each server.

Another form of routing is CARP, which routes based upon a consistent hashing algorithm — like DHTs. This allows you to build a massive array of caches to serve a very large working set (e.g., photos, a CDN).
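The core idea can be sketched in Python as "score every cache against the URL and pick the highest". This is a simplified illustration of hash-based routing in the spirit of CARP, not the hash function that CARP actually specifies; `pick_cache` and the use of SHA-256 are assumptions for the example.

```python
import hashlib


def pick_cache(url, caches):
    """Choose the cache responsible for `url` from a list of cache names."""

    def score(cache):
        # Combine the cache's name with the URL and hash the pair.
        h = hashlib.sha256((cache + "|" + url).encode("utf-8")).digest()
        return int.from_bytes(h[:8], "big")

    return max(caches, key=score)
```

The useful property is that taking a cache out of the array only moves the URLs that cache was responsible for; everything else stays where it was, so the rest of the array keeps its hit rate.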

One thing that often goes hand in hand with routing is retries — i.e., being able to try a different origin server (or IP address, or peer) if you can’t get a successful answer on the first try (if allowed by the protocol; this makes sense for GET, not POST, obviously).
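A minimal Python sketch of that retry rule follows. The `send` callable and the function name are hypothetical, and for brevity the sketch only treats connection-level failures (anything raising `OSError`) as a reason to move on to the next server.

```python
# Methods that are safe to repeat against another server.
IDEMPOTENT = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS", "TRACE"}


def forward_with_retries(method, url, servers, send):
    """Try each server in turn; only retry methods that are idempotent.

    `send` is a hypothetical callable(server, method, url) that returns
    a response or raises OSError if the server can't be reached.
    """
    if not servers:
        raise ValueError("no servers to try")
    # Non-idempotent requests (e.g., POST) get exactly one attempt.
    candidates = servers if method in IDEMPOTENT else servers[:1]
    last_error = None
    for server in candidates:
        try:
            return send(server, method, url)
        except OSError as exc:
            last_error = exc
    raise last_error
```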

Getting the Standards Right

Really, this isn’t a feature; it’s a floor to entry. If you’re going to use a proxy/cache, you have to be sure that it’s going to behave in a predictable, interoperable way, and that means conforming to HTTP/1.1, SSL and all of the other applicable standards.

In the case of HTTP, this means not taking shortcuts; for example, variant caching is hard, but it’s necessary to have it for a cache to be useful. A great tool to help evaluate this is Co-Advisor.
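To give a feel for what variant caching involves, here is a Python sketch of building a cache key from the request headers that a response's Vary header names. The function name is made up, and the whitespace normalisation is a deliberately naive assumption; doing this well (e.g., for Accept-Encoding) is exactly the hard part.

```python
def variant_key(url, request_headers, vary):
    """Build a cache key covering the request headers listed in Vary.

    Returns None for "Vary: *", which can never be matched from cache.
    """
    if vary.strip() == "*":
        return None
    names = sorted(n.strip().lower() for n in vary.split(",") if n.strip())
    # Header names are case-insensitive; collapse runs of whitespace in values.
    lowered = {k.lower(): " ".join(v.split())
               for k, v in request_headers.items()}
    return (url, tuple((n, lowered.get(n, "")) for n in names))
```

Two requests for the same URL only share a cached response if their keys are equal; a cache that ignores Vary will happily hand a gzipped body to a client that never asked for one.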

Manageability

Stability

A proxy is worthless if it goes down all of the time, or if you’re worried that it will. Part of this is how mature it is, and part is how well it’s been tested. One of the reasons I like Squid is that it’s used in thousands (if not tens of thousands) of applications around the world; it’s been around for more than a decade, so it’s been hammered on hard.

Because of this breadth of deployment, I can confidently use it in a new (to me) situation, knowing that it’s probably been used in that way before. Contrast this with software that’s been designed for a particular purpose and hasn’t been used outside that narrow profile very much.

Metrics

Managing a cache means knowing what it’s doing, and what went wrong if you have a problem. A good implementation should have extremely extensive metrics available, ideally in many forms (e.g., over HTTP, SNMP, in logs), as well as easy-to-use debugging mechanisms, because at the end of the day all of these platforms are really complex beasts.

Ease of Use

Finally, caches have to be intuitive to use. Typically, they’re designed for a sysadmin or a netadmin, not a developer, and I think this is a shame, because these days that should be a primary audience.

this entry’s page ( 1 comment )

Friday, 5 June 2009

Opera Turbo

HTTP performance is a hot topic these days, so it’s interesting that Opera has announced a “turbo” feature in Opera 10 Beta:

Ever felt a Web site was loading slowly? Do you think it will happen again? Think again: Opera Turbo is a compression technology that provides significant improvements in browsing speeds over limited-bandwidth connections like a crowded Wi-Fi in a cafe or browsing through your mobile phone while commuting.

They go on to give more details in their sales materials as well as a blog entry (both found by a search engine, not linked from the feature description):

That’s why we’ve been working on Opera Turbo, a server-side optimization and compression technology that provides significant improvements in browsing speeds over limited-bandwidth connections by compressing network traffic. This does not only make you surf faster, but also lowers the cost of browsing when you are on a pay per usage plan.

Note the use of “server-side” here. The interesting thing here is that when I turn on Turbo and sniff the network to see what’s going on, all of my connections seem to go to a server like this:

Macintosh:~> nslookup 064.255.180.252
Server:		192.168.1.254
Address:	192.168.1.254#53

** server can't find 064.255.180.252: NXDOMAIN

Macintosh:~> nslookup 64.255.180.252
Server:		192.168.1.254
Address:	192.168.1.254#53

Non-authoritative answer:
252.180.255.64.in-addr.arpa	canonical name = 252.0-24.180.255.64.in-addr.arpa.
252.0-24.180.255.64.in-addr.arpa	name = global-turbo-1-lvs-usa.opera-mini.net.

In other words, this isn’t a “server-side” technology; it’s a proxy.

From a technical standpoint, this is an interesting approach; intermediation is a great way to introduce new features into the request stream (here, they’re compressing content and stripping headers, by the look of it).

However, I’m in Australia, and they’re sending all of my requests (even for Australian content) through a US proxy, which adds several hundred milliseconds to every request, and depending on my provider, may cost me more (some AU providers make local content free). Considering that the people most likely to be drawn in by this technology’s marketing (e.g., those in the Australian bush, or rural India) won’t be served well by this, it seems like it would be important to point this out.

More damningly, a quick test shows that Turbo’s proxy doesn’t honour the Cache-Control: no-transform directive, and moreover, strips it from responses. no-transform is specified to assure that clients and servers have a way of avoiding problems with transcoding proxies — just like Turbo (emphasis added):

no-transform
Implementors of intermediate caches (proxies) have found it useful to convert the media type of certain entity bodies. A non-transparent proxy might, for example, convert between image formats in order to save cache space or to reduce the amount of traffic on a slow link.
Serious operational problems occur, however, when these transformations are applied to entity bodies intended for certain kinds of applications. For example, applications for medical imaging, scientific data analysis and those using end-to-end authentication, all depend on receiving an entity body that is bit for bit identical to the original entity-body.
Therefore, if a message includes the no-transform directive, an intermediate cache or proxy MUST NOT change those headers that are listed in section 13.5.2 as being subject to the no-transform directive. This implies that the cache or proxy MUST NOT change any aspect of the entity-body that is specified by these headers, including the value of the entity-body itself.
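A check along these lines can be sketched in Python: fetch the same resource directly and through the proxy, then compare. This is only an illustration of the kind of test, not the one I ran; the function names are invented, header dictionaries are assumed to use lower-case names, and the three protected headers (Content-Encoding, Content-Range, Content-Type) are my reading of the list in section 13.5.2.

```python
PROTECTED = ("content-encoding", "content-range", "content-type")


def has_no_transform(headers):
    """True if the Cache-Control header carries the no-transform directive."""
    value = headers.get("cache-control", "")
    return any(p.strip().lower() == "no-transform" for p in value.split(","))


def transform_problems(origin_headers, origin_body,
                       proxied_headers, proxied_body):
    """Compare a direct response with the same response fetched via a proxy."""
    if not has_no_transform(origin_headers):
        return []                 # the origin didn't ask for protection
    problems = []
    if not has_no_transform(proxied_headers):
        problems.append("no-transform directive was stripped")
    for name in PROTECTED:
        if origin_headers.get(name) != proxied_headers.get(name):
            problems.append(name + " was changed")
    if origin_body != proxied_body:
        problems.append("entity-body was changed")
    return problems
```

An empty list means the proxy left the response alone (or that the origin never sent no-transform in the first place).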

To put it mildly, this is disappointing, given Opera’s historical focus on standards compliance.

From a privacy standpoint, it gets worse. Calling this a server-side technology is frankly unconscionable. A reasonable person who reads the blurb and follows the in-browser instructions will have no idea that their requests are being routed through Opera, and no disclosure is made about what is done with that data. I’m a little surprised by this, considering that Opera is an EU-based company, and therefore subject to the European Data Protection laws.

It is worth noting that in their blog entry (which, again, has to be found separately), they do say:

Your privacy is important

Even when Turbo is enabled, encrypted traffic does not go through our compression servers. This means that when you are on a SSL site, we bypass these traffic and let you communicate with the SSL site directly. Opera generates statistics of the usage of Opera Turbo, but these are aggregated numbers and no information can be linked to a single user. Opera does not store any users’ private information.

So, their heart is in the right place, but this doesn’t make up for not informing users up-front.
