mnot’s blog

“Design depends largely on constraints.” — Charles Eames

Tuesday, 24 February 2009

Caching When You Least Expect it

There’s a rule of thumb about when an HTTP response can be cached; the Caching Tutorial says:

If the response’s headers tell the cache not to keep it, it won’t.
If the request is authenticated or secure, it won’t be cached.
If no validator (an ETag or Last-Modified header) is present on a response, and it doesn’t have any explicit freshness information, it will be considered uncacheable.

And, generally, this is true; most implementations won’t bother caching something that doesn’t have either explicit freshness or a validator, because these responses can’t be reused except in very unusual circumstances; effectively, they use the lack of this information as a heuristic to avoid “polluting” their cache with responses that won’t be used.

This is so prevalent, in fact, that it’s developed into a bit of common wisdom; it’s easy to think that if something doesn’t have explicit freshness (e.g., a Cache-Control: max-age or Expires header) or a validator, it won’t be cached, ever.

Except…

This generalisation isn’t completely accurate. HTTP’s caching section is confusing, to put it kindly. However, it does clearly say that a cache can store anything that doesn’t have a no-store directive; from 2616:

Unless specifically constrained by a cache-control (section 14.9) directive, a caching system MAY always store a successful response (see section 13.8) as a cache entry, MAY return it without validation if it is fresh, and MAY return it after successful validation. If there is neither a cache validator nor an explicit expiration time associated with a response, we do not expect it to be cached, but certain caches MAY violate this expectation (for example, when little or no network connectivity is available).

The real constraints in HTTP’s caching model are when a stored response can be reused. However, there are some pretty big allowances given for calculating heuristic freshness and using stale responses when the origin server isn’t contactable. This usually hasn’t been an issue, because as it says above, most caches won’t bother storing this kind of response anyway.
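The gap between the rule of thumb and what the spec permits can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function names are made up for this example, headers are assumed to arrive as a dict with lowercase names, and the sketch ignores the status code, the Authorization header and the field-name form of the private directive.

```python
def parse_cache_control(value):
    """Split a Cache-Control header value into a dict of lowercase directives."""
    directives = {}
    for part in (value or "").split(","):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"')
    return directives


def rule_of_thumb_would_store(headers):
    """The common wisdom: store only with explicit freshness or a validator."""
    cc = parse_cache_control(headers.get("cache-control"))
    if "no-store" in cc:
        return False
    has_freshness = "max-age" in cc or "s-maxage" in cc or "expires" in headers
    has_validator = "etag" in headers or "last-modified" in headers
    return has_freshness or has_validator


def spec_allows_store(headers, shared_cache=True):
    """Roughly what RFC 2616 section 13.4 permits: store unless a directive forbids it."""
    cc = parse_cache_control(headers.get("cache-control"))
    if "no-store" in cc:
        return False
    if shared_cache and "private" in cc:
        return False
    return True


# A response with no caching metadata at all (header names assumed lowercase).
bare = {"content-type": "text/html", "connection": "close"}
print(rule_of_thumb_would_store(bare), spec_allows_store(bare))
```

For a response with no caching metadata, the two functions disagree, and that disagreement is exactly the case this post is about.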

Enter ISA

It turns out that one does, and that common wisdom is wrong. Microsoft’s ISA server — commonly deployed at enterprises, including Microsoft, of course — does indeed cache these kinds of responses.

Which means that it can and apparently will store a response like this:

REQ: GET /my-personalised-home-page/ HTTP/1.1
REQ: Host: www.example.com

RES: HTTP/1.1 200 OK
RES: Content-Type: text/html
RES: Connection: close
RES:
RES: <!-- my personalised HTML content here -->

Note the lack of explicit freshness information and validators, as well as the absence of anything that tells a cache that this can’t be reused. Now, it won’t reuse it prolifically, but HTTP does allow its reuse in a number of situations, including when the origin server looks like it’s down (e.g., a network failure).

So, in a nutshell, if you serve personalised Web pages without any caching metadata (like above), expecting them not to be cached, you may be surprised.
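One way to check your own responses for this is to look at which caching-related headers they actually carry. The following is a rough sketch using only the Python standard library; the helper names and the list of headers to look for are choices made for this example, and it assumes you already have the raw response head as a string.

```python
from email.parser import Parser

# Headers that give a cache explicit instructions or a validator.
CACHING_HEADERS = ("Cache-Control", "Expires", "ETag", "Last-Modified")


def caching_headers_present(raw_response_head):
    """Return the caching-related header names found in a raw response head."""
    # Drop the status line; the remainder is an RFC 822 style header block.
    _status_line, _, header_block = raw_response_head.partition("\n")
    headers = Parser().parsestr(header_block, headersonly=True)
    return [name for name in CACHING_HEADERS if headers.get(name) is not None]


def is_silent_about_caching(raw_response_head):
    """True if the response says nothing at all about caching."""
    return not caching_headers_present(raw_response_head)


example = "HTTP/1.1 200 OK\nContent-Type: text/html\nConnection: close\n\n"
print(is_silent_about_caching(example))
```

A personalised response for which `is_silent_about_caching` returns true is one that a cache is allowed to store.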

What does this mean?

I’m sure some people will try to paint this as ISA server being evil or a bad citizen. In fact, it’s the opposite; they’re following the agreed-upon standard for HTTP, and exposing a feature that I’ve had people ask for explicitly (and recently), being frustrated with other cache implementations that don’t store some responses. In fact, in my experience ISA server is one of the better (read: more HTTP conformant) cache implementations out there.

However, if you publish personalised content on the Web, it does mean you need to think carefully about caching. The caching model in HTTP wasn’t designed with Cookie authentication in mind. If you assume that no validators and no freshness means no caching, you could be caught out, badly.

The simplest way to fix this is to set a Cache-Control: private directive on all personalised responses; that way, shared caches know not to store or reuse them, while browser caches still can, so user experience isn’t affected. Cache-Control: no-store also works, but it keeps the response out of the browser cache as well.
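As a minimal sketch of that fix, assuming your application happens to be a Python WSGI app, a small wrapper can add the directive to any response that doesn’t already set Cache-Control. The names `add_private_cache_control` and `personalised_app` are hypothetical, and a real deployment would probably only wrap the personalised parts of a site.

```python
def add_private_cache_control(app):
    """Wrap a WSGI app so responses lacking Cache-Control are marked private."""
    def wrapped(environ, start_response):
        def patched_start_response(status, headers, exc_info=None):
            names = {name.lower() for name, _value in headers}
            if "cache-control" not in names:
                # Leave explicit choices alone; only fill in the missing default.
                headers = list(headers) + [("Cache-Control", "private")]
            return start_response(status, headers, exc_info)
        return app(environ, patched_start_response)
    return wrapped


def personalised_app(environ, start_response):
    """Hypothetical app that, like the example above, sends no caching metadata."""
    start_response("200 OK", [("Content-Type", "text/html")])
    return [b"<!-- my personalised HTML content here -->"]


application = add_private_cache_control(personalised_app)
```

Responses that already carry a Cache-Control header pass through unchanged, so pages that are deliberately public or no-store keep their settings.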

There are a number of other tricks that you can play, but I wouldn’t recommend them on the open Internet; e.g., using Vary: Cookie won’t do much good. Using different URIs for different users is more Web-friendly (and still the best technique for back-end caching), but probably not too useful in the common case, because you still have to address the risk of someone else going looking through the cache for other people’s content.

Moving Forward

For me, the most interesting part of all of this is what it means for the caching model in HTTPbis. I spent some time with the editors late last year in sunny Orange County, trying to untangle the caching model while they diligently edited the other parts. That work hasn’t been published yet, but the upshot was that many parts are poorly specified, sometimes even contradicting one another.

One of the assumptions that I tentatively made in cleaning things up was that only stale responses could be reused in such circumstances, but obviously I’ll need to revisit that now. The challenge moving forward is going to be making the caching model easier to comprehend without breaking existing implementations, based on their actual behaviour rather than general assumptions like the one above.

And, of course, I need to update the Caching Tutorial.


Filed under: Caching HTTP Web

6 Comments

Roy T. Fielding said:

That last rule of thumb in the tutorial is indeed wrong.

All of the caches deployed in 1995-96 would cache a response to GET that had no cache-control or expires. HTTP has always viewed the lack of specific restrictions to be an invitation to cache by heuristics. In other words, we optimize for the common case (no restriction == less bits on the wire in header fields == caching is enabled by default for responses to GET). It is far easier and economically justifiable for the generators of dynamic personalized content to add cache restrictions than it would be for the vast majority of static content providers to remove them.

Tuesday, February 24 2009 at 10:10 AM +10:00

Mark Nottingham said:

Yes; that was the only way to retrofit caching onto the Web. I distinctly remember struggling with Harvest (and burning through a few RZ1000 disks in the process, due to the load put on them!).

Anybody still have a copy sitting around?

Tuesday, February 24 2009 at 10:26 AM +10:00

rs mohan said:

nice

Tuesday, February 24 2009 at 8:13 PM +10:00

Steve Souders said:

Another example: Firefox caches resources that don't contain any Expires or Cache-Control headers. This was covered in the HttpWatch blog ( http://blog.httpwatch.com/2008/10/15/two-important-differences-between-firefox-and-ie-caching/ ). In the absence of any expiration information, Firefox assigns an expiration date to these resources.

2616 allows this, with the following explanation, "Also, if the response does have a Last-Modified time, the heuristic expiration value SHOULD be no more than some fraction of the interval since that time. A typical setting of this fraction might be 10%." Here's the formula Firefox uses to assign an expiration date to resources that don't have any expiration information: Expiration Time = Now + 0.1 * (Time since Last-Modified)
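To make the arithmetic concrete, here is a small Python sketch of that 10% rule. It only illustrates the formula as quoted above and is not Firefox's actual code; the function name `heuristic_expiry` and the example dates are made up for the illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def heuristic_expiry(last_modified, now, fraction=0.1):
    """Expiration Time = Now + fraction * (time since Last-Modified)."""
    modified = parsedate_to_datetime(last_modified)  # parses an HTTP-date string
    return now + fraction * (now - modified)


# A resource last modified 30 days before "now" is treated as fresh for 3 more days.
now = datetime(2009, 2, 24, tzinfo=timezone.utc)
print(heuristic_expiry("Sun, 25 Jan 2009 00:00:00 GMT", now))
```

A file that has not changed for a month therefore gets about three days of heuristic freshness, which is in line with the test below.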

Here's a fun test:
1. Using Firefox, go to http://www.whitehouse.gov/ with some packet sniffer
==> notice that http://www.whitehouse.gov/includes/eop/admin.css has no Expires or Cache-Control header
2. go to about:cache?device=disk
==> notice that admin.css has an Expires date of ~3 days in the future!
3. close all instances of your browser, then start a new instance
4. go to http://www.whitehouse.gov/
==> notice that there is no HTTP request for admin.css, not even a Conditional GET request!

IE has a similar heuristic.

The two most popular browsers don't issue conditional GET requests in as clean-cut a fashion as we might expect, even across sessions.

Thursday, February 26 2009 at 3:29 AM +10:00

Mark Nottingham said:

Yep, that's quite common; almost every cache implements it (called heuristic freshness in 2616).

Thursday, February 26 2009 at 7:00 AM +10:00
