mark nottingham

ETags, ETags, ETags

Tuesday, 7 August 2007

Caching, HTTP, Protocol Design, Web, Web Services

I’ve been hoping to avoid this, but ETags seem to be popping up more and more often recently. For whatever reason, people latch onto them as a litmus test for RESTfulness, as the defining factor of HTTP’s caching model, and much more.

So, let me counter: they’re not all that. In fact, there are a number of pitfalls you need to be wary of if you use them.

First, depending on how they’re generated, you might find different boxes in a farm producing different ETags, with unfavourable results for caching.

Or, you might find that your implementation doesn’t really understand HTTP very well, so it gives the same ETag to two different representations of the same resource, causing downstream caches to bend over backwards to accommodate your broken server. It can happen to the best of us.

If you’re trying to be a good guy and both compressing your content and hashing it to calculate the ETag, beware; the gzip file format has a timestamp in it that means the ETag will change every time you re-compress it. Oops.
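To make the gzip pitfall concrete, here is a minimal Python sketch. It assumes the ETag is a SHA-256 hash of the bytes actually sent; the helper name `etag_of` is made up for this example. The standard library's `gzip.compress` accepts an `mtime` argument (Python 3.8 and later) that sets the timestamp stored in the gzip header, which is used here to simulate compressing at different times.

```python
import gzip
import hashlib


def etag_of(payload: bytes) -> str:
    """Hypothetical helper: a strong ETag made by hashing the bytes sent."""
    return '"' + hashlib.sha256(payload).hexdigest() + '"'


body = b"<html>hello</html>"

# Same content, compressed at two different moments: the header timestamp
# differs, so the compressed bytes (and therefore the hash) differ too.
etag_first = etag_of(gzip.compress(body, mtime=1))
etag_second = etag_of(gzip.compress(body, mtime=2))

# Pinning the timestamp makes the compressed output reproducible,
# so the ETag stays stable across re-compressions.
etag_pinned_a = etag_of(gzip.compress(body, mtime=0))
etag_pinned_b = etag_of(gzip.compress(body, mtime=0))
```

Pinning the timestamp is only one way out; hashing something other than the compressed output avoids the problem as well.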

Even if you get the whole ETag thing right, there’s no guarantee that a cache will use it; although recent versions of Squid understand and use ETags, lots of older implementations don’t.

Another mistake is to think that ETags are only used for caching. If you hand out ETags and your resource supports methods like PUT or DELETE, you’d better be ready to properly handle conditional headers like If-Match on requests; otherwise, requests that carry them will end up doing the wrong thing. I’m heartened somewhat that the APP spec alludes to this in passing, but I do wonder how many people have got this wrong (thanks to Lisa for bringing this up in discussion). Then there’s the whole ETag-on-write issue.
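As a rough illustration of what handling If-Match involves, here is a small Python sketch of the precondition check a server might run before applying a PUT or DELETE. The function name `precondition_ok` and its arguments are assumptions for this example, and the header parsing is deliberately naive (it splits on commas, which would break on an ETag containing a comma).

```python
from typing import Optional


def precondition_ok(if_match: Optional[str], current_etag: Optional[str]) -> bool:
    """Return True if the write may proceed, False if it should get a 412.

    if_match     -- raw If-Match header value, or None if the header is absent
    current_etag -- the resource's current ETag, or None if it does not exist
    """
    if if_match is None:
        return True  # no precondition was sent; a stricter server might refuse
    if current_etag is None:
        return False  # If-Match cannot match a resource that does not exist
    if if_match.strip() == "*":
        return True  # "*" matches any existing representation
    candidates = [tag.strip() for tag in if_match.split(",")]
    # If-Match uses strong comparison, so weak tags (W/"...") never match.
    return any(
        tag == current_etag and not tag.startswith("W/") for tag in candidates
    )
```

A server that hands out ETags but skips a check like this will apply the write even when the client's copy is out of date.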

Finally, there’s the whole mess of weak ETags; although they’re potentially very powerful, they’re also very misunderstood.

All of this is not to say that ETags are useless; far from it. However, I do get confused and concerned when people seem to focus on just one feature of HTTP to the exclusion of other, just as (or more) appropriate ones. Dare I say “cargo cult”?

While ETags are a fine validation mechanism, Last-Modified is also perfectly fine in many situations. Even better, avoid the round trip altogether and give your response some freshness information with Cache-Control: max-age.

So What’s Right With ETags?

Now that I’ve had my rant, there are some good things about ETags. If you need a strong validator (i.e., your response might change more than once a second), they can’t be beat, and if you don’t like how Last-Modified is used as input to freshness heuristics, they’re a fine alternative, as long as you keep the caveats above in mind.

Of course, if you need to do optimistic concurrency, they’re a great option.

IIRC Yves has done some very cool things with weak ETags in Jigsaw, so that small changes don’t upset caches.

Finally, Tim Bray also has some very intelligent things to say about them, pointing out that if you’re clever, you can use an ETag to avoid a bunch of work on the server side during validation. Unfortunately, I don’t see too many people doing this yet.
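One way to read that idea, sketched in Python: if the ETag can be derived from something cheap, the server can answer a validation request before it renders anything. This is only a sketch under assumptions. The per-resource `version` counter, the `handle_get` function and the `render` callback are all hypothetical, and it presumes the resource has a single representation.

```python
from typing import Callable, Dict, Optional, Tuple


def handle_get(
    if_none_match: Optional[str],
    version: int,
    render: Callable[[], bytes],
) -> Tuple[int, Dict[str, str], bytes]:
    """Answer a GET, skipping render() when the client's copy is current."""
    etag = f'"v{version}"'  # cheap: derived from a stored counter, not the body
    if if_none_match is not None:
        client_tags = [t.strip() for t in if_none_match.split(",")]
        # If-None-Match uses weak comparison, so drop any W/ prefix first.
        client_tags = [t[2:] if t.startswith("W/") else t for t in client_tags]
        if "*" in client_tags or etag in client_tags:
            return 304, {"ETag": etag}, b""  # validated without rendering
    return 200, {"ETag": etag}, render()
```

The expensive step, `render()`, only runs when the client has nothing usable, which is where the server-side saving comes from.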


Henrik Frystyk Nielsen said:

As an additional reference, a long time ago we wrote a document [1] on how to use ETags to avoid the lost update problem when editing resources using HTTP HEAD, GET, and PUT. It describes in practical terms how to use ETags with conditional requests in various authoring scenarios.



Wednesday, August 8 2007 at 1:50 AM

Noah Slater said:

“So, let me counter: they’re not all that. In fact, there are a number of pitfalls you need to be wary of if you use them.”

This and the whole of your essay is a non sequitur.

The technical merits of ETags are completely orthogonal to how well they are usually implemented by developers.

While I am not disagreeing that ETags are sometimes poorly used, I think it is misleading for you to use this as the basis for some qualitative conclusion about the merits of ETags.

You are conflating two issues.

Wednesday, August 8 2007 at 3:51 AM

Justin Makeig said:

Like any good hash, the ETag should represent the state of underlying resource, not necessarily its representation. Like Tim Bray says in the above reference, once you’ve created your view and hashed it, you’ve already done all of the hard work. In the case of files on a filesystem, state and representation are probably the same. However, more structured resources will probably have logical keys that can be hashed. Since it’s based on resource state, not system or request data, the logical hash should be consistent across machines in a clustered environment.

Wednesday, August 8 2007 at 6:08 AM

Noah Slater said:

Mark, yes I agree completely. Rhetoric can be useful sometimes, I was just in the mood to call you on it. Heh.

Wednesday, August 8 2007 at 8:44 AM

Kishore Senji said:

Cache-Control: max-age removes the round trip, but once the resource is stale, the client would do a GET. If the resource actually did not change even after the max-age, then sending the whole response is not desirable. With ETag (or Last-Modified) coupled with Cache-Control: max-age or Expires, we would be able to do a conditional GET, saving bandwidth. Just using ETag alone (without Cache-Control or Expires) is not optimal, as it is left to the client to decide when to do the conditional validation on that resource. So, I think ETag or Last-Modified with some cache headers is better.

Nice to know that gzip has a timestamp in it. A quick peek shows that some implementations of gzip (Java) set the timestamp to 0. But as Justin commented, it is better to compute the hash on the input variables rather than the output anyway.

Thursday, August 9 2007 at 6:52 AM

Nikunj Mehta said:

In the Atompub (which is sometimes also called the APP) world, most (sane) implementers do not accept unconditional PUT requests. It looks like Atom feed servers might be the tipping point in ETag handling. While I agree that REST >> ETag, its use is essential for a sane REST application.

I echo the comments of Kishore that the hash should be calculated on the input variables and not the output.

Friday, August 10 2007 at 1:20 AM

Jon Hanna said:

“Like any good hash, the ETag should represent the state of underlying resource, not necessarily its representation.”

No, very definitely no.

ETags are specifically about the entity, which is (the main) part of the representation. If you make your e-tag represent the state of the resource then you will have the same e-tag for all representations. This fails. The case where this is most often seen is with content-encoding, as quite a few implementations of content-encoding in script will send the same e-tag for either representation. It fails in just the same way for any other dimension along which representations may differ.

Now, if you have only one representation, and your means of producing this from the resource is static, then producing the ETag from its state is valid. Similarly you can create an ETag from its state and then add something representing which of the possible representations you are dealing with.

In all though, it’s essential that an e-tag is about a representation, not a resource.
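The "state plus something representing the representation" approach could look roughly like the following Python sketch. The helper `representation_etag` and its parameters are hypothetical; the state version string stands in for whatever cheap summary of resource state is available.

```python
import hashlib


def representation_etag(
    state_version: str, media_type: str, content_coding: str
) -> str:
    """Hypothetical helper: one ETag per (resource state, representation)."""
    # Join with a separator that cannot appear in the parts, so that
    # ("a", "bc") and ("ab", "c") never hash to the same key.
    key = "\x00".join([state_version, media_type, content_coding])
    return '"' + hashlib.sha256(key.encode("utf-8")).hexdigest()[:16] + '"'


plain = representation_etag("rev-42", "text/html", "identity")
gzipped = representation_etag("rev-42", "text/html", "gzip")
updated = representation_etag("rev-43", "text/html", "identity")
```

Here the identity and gzip variants of the same state get different ETags, and a state change alters the ETag of every representation.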

Thursday, November 27 2008 at 12:25 PM