Inherent HTTP Coherence

Mark Nottingham
mnot@pobox.com

Abstract

This document discusses issues surrounding coherence mechanisms in the HTTP, and proposes two solutions to limitations of current implementations.

Background

There are two primary methods of gaining efficiency by use of HTTP caches: freshness and validation. Freshness allows the object to be served directly from the cache without contacting the origin server. Validation avoids repeated transfer of an object already in the cache, if it has not changed.

Validation is a method of establishing coherence; that is, it allows the cache to guarantee that the correct object is being served. There are two validation methods in use today: Last-Modified/If-Modified-Since and ETag/If-None-Match. Both methods are characterised by the issue of a token from the Web server (Last-Modified and ETag, respectively) that a cache can store, to be presented later in an If-Modified-Since or If-None-Match request (respectively). If the object hasn't changed, a 304 Not Modified response is sent, saving the network cost incurred by retransmitting the object. Otherwise, if a new version is available, it is sent.
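The ETag/If-None-Match exchange described above can be sketched in a few lines of Python. This is a minimal illustration, not a complete implementation: the function name respond and the dictionary-based headers are assumptions made for the example, and the comma split is a simplification of the real header grammar.

```python
def respond(request_headers, current_etag, body):
    """Answer a GET, honouring an If-None-Match header if one is present.

    Returns a (status, headers, body) tuple. All names here are
    illustrative; they do not come from any particular server.
    """
    presented = request_headers.get("If-None-Match", "")
    # The header may carry several comma-separated entity tags.
    # (Simplified: a real parser must cope with commas inside quotes.)
    candidates = [tag.strip() for tag in presented.split(",") if tag.strip()]
    if current_etag in candidates or "*" in candidates:
        # The cached copy is still current: send no body.
        return 304, {"ETag": current_etag}, b""
    # First request, or the object has changed: send the full object.
    return 200, {"ETag": current_etag}, body
```

A cache that presents a tag matching the current one receives the 304 response with an empty body; any other request receives the full object and its current tag.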

The specification makes a further distinction: LM/IMS validation is considered weak; that is, it cannot guarantee that the object is truly unique (mostly because of the 1-second resolution of dates in the HTTP). ETag/INM validation, on the other hand, is considered strong, because the server is responsible for guaranteeing that the ETag is unique for any given object instance.
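The effect of the 1-second date resolution can be seen in a small Python sketch. The helper name last_modified and the timestamps are assumptions chosen for illustration; the point is only that two changes within the same second produce the same Last-Modified value.

```python
from email.utils import formatdate


def last_modified(mtime):
    """Format a modification time (seconds since the epoch) as an HTTP date."""
    return formatdate(mtime, usegmt=True)


# Two hypothetical modifications, 0.8 seconds apart, within the same second.
first = last_modified(1_000_000_000.10)
second = last_modified(1_000_000_000.90)

# Both versions carry the same Last-Modified value, so a cache holding
# the first version cannot tell that the second one exists.
same_validator = first == second
```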

Motivation

Although there is a reasonable amount of research into suitable coherence mechanisms, it tends to cover the problem of when to validate (freshness, invalidation protocols, etc.), but not the merits of the validation mechanism itself.

Wills' study [] established that a significant portion of Web objects do not have validators or freshness hints associated with them. Approximately 33% of HTML objects were unchanged during the study, while lacking any information that would have allowed the object to be reused.

Although HTML and other generated objects are not the largest (and therefore theoretically the most beneficial to reuse) objects, they are still interesting for a few reasons;

Problems with Current Coherence Mechanisms

The more widely used validator, Last-Modified, actually performs double duty: the date it contains can also be used to determine freshness. This often makes use of LM/IMS validation inappropriate. For instance, a database-driven site may have no knowledge of the action that caused the contents of a page to change, or what page was served at a particular date in the past.

ETags and strong validation were designed to solve this. However, because strong validators must be guaranteed to be unique for every permutation of an object, the process actually generating the object must integrate the logic to create the validator into the application itself. In effect, the server must keep state, and tie it to a particular version of the object.

With a static file, examining aspects of it that are guaranteed to be unique, such as inode information, is suitable. This is not possible with generated content, whether from a database, scripting engine or other source. Because Web publishers are now responsible for generating the HTTP response, they are required to be knowledgeable about validation and to handle conditional requests correctly.
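For the static-file case, a validator can be derived from file metadata alone. The following Python sketch shows one way this might be done; the function name static_etag and the tag layout (inode, size and modification time in hexadecimal) are assumptions for illustration, not the format of any particular server.

```python
import os


def static_etag(path):
    """Build an entity tag for a file from its inode, size and mtime.

    No file content is read; the tag changes whenever the file's
    size or modification time changes.
    """
    st = os.stat(path)
    return '"%x-%x-%x"' % (st.st_ino, st.st_size, st.st_mtime_ns)
```

Generated content has no equivalent metadata for the server to inspect, which is why the same approach does not carry over to scripts and database-driven pages.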

For instance, if a Web developer wishes to publish a document that is part of a large collection, he or she may do so with one of the many server-side scripting engines available. These tools are typically used for the flexibility and power that is offered. Examples of such tools include Perl, PHP, Cold Fusion, and ASP, all of which serve a significant portion of traffic on the Web.

Use of such a tool means that the server cannot calculate any unique identifier for an object instance on its own. The developer who generates the object must also contrive to generate the identifier, because the server is not aware of the internal criteria that were used to generate the entity. While it is certainly possible for a competent programmer to implement both the generation of validators and handling of conditional requests, in practice few have the time or discipline to do so.

Solution within the Protocol

The traditional means of establishing uniqueness of an entity is through a hash function, such as MD5. In the HTTP, such a function could be used to generate a validator for each instance of an entity, rather than forcing the underlying mechanism to generate it.
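As a rough illustration, a validator can be computed from the generated bytes alone, without any knowledge of how they were produced. The function name hash_validator and the choice to present the digest as a quoted hexadecimal string are assumptions made for this sketch.

```python
import hashlib


def hash_validator(body: bytes) -> str:
    """Derive an entity tag from the response body itself."""
    return '"%s"' % hashlib.md5(body).hexdigest()


# Identical output yields an identical validator, however it was generated.
page_a = b"<html><body>Hello</body></html>"
page_b = b"<html><body>Hello!</body></html>"
unchanged = hash_validator(page_a) == hash_validator(page_a)
changed = hash_validator(page_a) != hash_validator(page_b)
```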

However, the ETag must be transmitted with the HTTP headers of the object; as such, it must be known before the object is sent. Because a hash must have the entire object as its input, this poses a problem: for large objects, there can be a significant delay while the object is generated. Most implementations would prefer to send the object as soon as possible.
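The ordering problem can be made concrete with a short Python sketch. It assumes a hypothetical send callable that transmits one chunk of the body; the hash is updated as each chunk goes out, so the digest only becomes available after the last chunk has already been sent.

```python
import hashlib


def send_streaming(chunks, send):
    """Send body chunks as they are generated, hashing along the way.

    Returns the hex digest, which is complete only once every chunk
    has been passed to `send`; by then it is too late to place the
    value in the response headers.
    """
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
        send(chunk)
    return digest.hexdigest()
```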

This can be solved by two strategies:

Solution by Protocol Extension

By taking the hash transmission and comparison into the protocol itself, validation can be limited to the domain of the server and cache exclusively. Such a scheme would require only one additional request header.

The header could be called 'If-Not-Hash', and would be functionally and lexically similar to If-None-Match, except that the payload would be any number of MD5 hashes, which would be obtained either from Content-MD5 response headers, or (more likely) calculated by the cache itself upon storage of an object.

For example, assume that a server generates a page that is sent to a cache en route to the end user. The cache can store the object and generate a hash, which is stored along with its other metadata. Upon further requests for the same object, the cache can generate an If-Not-Hash request header to be sent, containing the hash(es) that it has associated with stored version(s) of the object.
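The cache side of this exchange might look like the following Python sketch. The exact syntax of If-Not-Hash is not defined here, so the sketch assumes a comma-separated list of quoted, base64-encoded MD5 digests (the encoding used by Content-MD5); the function names are likewise illustrative.

```python
import base64
import hashlib


def content_md5(body: bytes) -> str:
    """Base64-encoded MD5 digest of a body, as used by Content-MD5."""
    return base64.b64encode(hashlib.md5(body).digest()).decode("ascii")


def if_not_hash_value(stored_bodies):
    """Build an If-Not-Hash header value from the stored versions.

    Assumed syntax: quoted digests separated by commas, by analogy
    with If-None-Match.
    """
    return ", ".join('"%s"' % content_md5(body) for body in stored_bodies)
```

In practice a cache would compute each digest once, when the object is stored, and keep it with the object's other metadata rather than rehashing on every request.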

Upon receipt of an If-Not-Hash request header, a server can generate the response as usual, generating a hash before sending the response. If the hash matches one of those contained in the request, it can send a 304 Not Modified response, containing the hash of the appropriate object if necessary.
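A server-side sketch of this step, in Python, follows. It assumes the If-Not-Hash value is a comma-separated list of quoted, base64-encoded MD5 digests, and that generate_body is a hypothetical callable standing in for whatever script or database query produces the object; neither detail is fixed by this document.

```python
import base64
import hashlib


def content_md5(body: bytes) -> str:
    """Base64-encoded MD5 digest of a body, as used by Content-MD5."""
    return base64.b64encode(hashlib.md5(body).digest()).decode("ascii")


def parse_hash_list(value: str) -> set:
    """Split an assumed If-Not-Hash value into a set of bare digests."""
    return {item.strip().strip('"') for item in value.split(",") if item.strip()}


def handle(request_headers, generate_body):
    """Generate the object as usual, then decide between 200 and 304."""
    body = generate_body()
    digest = content_md5(body)
    presented = parse_hash_list(request_headers.get("If-Not-Hash", ""))
    if digest in presented:
        # The cache already holds this exact body: identify it, send nothing.
        return 304, {"Content-MD5": digest}, b""
    return 200, {"Content-MD5": digest}, body
```

Note that the object is still generated in full on every request; what is saved is the transfer of the body, not the cost of producing it.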

Benefits

Potential Problems

Example

Proxied request for a new (uncached) object

GET http://foo.com/bar.html HTTP/1.0
[request headers as normal]

Origin server response

HTTP/1.0 200 OK
[response headers as normal]

Proxied request for the object, with INH validation
The cache can calculate the object hash upon storage, or on first need. INH is used only if no alternate validator is available.

GET http://foo.com/bar.html HTTP/1.0
[request headers as normal]
If-Not-Hash: [md5 hash of cached body or bodies]

Origin server response
Normal response to a conditional request; either

HTTP/1.0 200 OK
[response headers as normal as well as response body]

Or

HTTP/1.0 304 Not Modified
[response headers as normal]

Summary

Two mechanisms for making coherence of objects independent of content are discussed. The first, which can be implemented by pure HTTP/1.1 servers and clients, has limitations, but can be easily implemented.

The extension mechanism makes it possible to validate nearly every object available, not just those based on simple delivery mechanisms, like filesystems. It does so with no intervention needed by the content author, or even the content delivery system developer, since it can be implemented entirely within the Web server itself.

This mechanism does not address higher-level issues such as freshness calculation vs. invalidation, but it does provide a foundation that they can grow upon.