Prefetching (again)

Sunday, 22 May 2005

There’s been quite a kerfuffle over Google’s Web Accelerator, because it prefetches Web content.

It’s amusing to see these issues recycle over time; in the late nineties, prefetching was one of the biggest areas of research in Web caching. There were lots of papers written (search for “prefetch”), and even a commercial implementation; CacheFlow was the prefetching cache (they called it “active content”), and it was pretty controversial in the industry.

After a long period of back-and-forth and considerable experience, the consensus came to be that prefetching was not worth the overhead for conventional (i.e., proxy) Web caching. CacheFlow never got to be as big as they needed to be, and they eventually gave up caching as a core business (which is now dominated by the likes of Cisco and Network Appliance) and went into Web security appliances, calling themselves “Blue Coat”*.

This is why I’ve been skeptical about so-called optimistic prefetching, as in Mozilla (they weren’t the first by far, but they are easily the most widespread).

Prefetching On Behalf of the Server

That’s not to say that there aren’t uses for prefetching. When I worked at [insert huge financial company here], we designed and deployed a global content distribution network to allow customer service reps to look at multi-megabyte PDFs in near-real time (i.e., while they were on the phone with customers). The full details deserve a separate entry, but we needed prefetching to get that first view of the PDF to be acceptably quick, and ended up working with Network Appliance to get it into the product.

Akamai has also used some forms of prefetching for quite some time. In both of these cases, it’s important to distinguish proxy cache prefetching, as CacheFlow did it, from gateway cache prefetching, which makes a lot more sense because it’s done in coordination with the content publisher. We did make motions towards standardising the control mechanisms, but the effort never got off the ground.
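To make the proxy/gateway distinction concrete, here is a minimal sketch of publisher-directed prefetching in a gateway cache. It is an illustration only, not the mechanism we deployed or tried to standardise: the `GatewayCache` class, the `fetch_from_origin` callback, and the choice of a `Link` header with `rel="prefetch"` as the control signal are all assumptions made for the example. The point it shows is that the cache never guesses; it prefetches only what the origin explicitly names.

```python
import re
from typing import Callable, Dict, List, Tuple

# A response is modelled as (headers, body). All names here are hypothetical.
Response = Tuple[Dict[str, str], bytes]

# Matches entries such as: </statements/123.pdf>; rel="prefetch"
LINK_PREFETCH = re.compile(r'<([^>]+)>\s*;\s*rel="?prefetch"?', re.IGNORECASE)


def prefetch_hints(headers: Dict[str, str]) -> List[str]:
    """Return the URLs the publisher marked with rel=prefetch in a Link header."""
    return LINK_PREFETCH.findall(headers.get("Link", ""))


class GatewayCache:
    """Toy gateway cache that prefetches only on the publisher's say-so."""

    def __init__(self, fetch_from_origin: Callable[[str], Response]) -> None:
        self.fetch_from_origin = fetch_from_origin
        self.store: Dict[str, Response] = {}

    def get(self, url: str) -> Response:
        if url in self.store:
            return self.store[url]  # cache hit, possibly thanks to a prefetch
        response = self.fetch_from_origin(url)
        self.store[url] = response
        # Follow one level of hints from the response that was actually requested.
        for hinted in prefetch_hints(response[0]):
            if hinted not in self.store:
                self.store[hinted] = self.fetch_from_origin(hinted)
        return response


if __name__ == "__main__":
    # Fake origin: the account page tells the cache which PDF will be wanted next.
    origin = {
        "/account/123": ({"Link": '</statements/123.pdf>; rel="prefetch"'}, b"<html>"),
        "/statements/123.pdf": ({}, b"%PDF"),
    }
    origin_requests: List[str] = []

    def fetch(url: str) -> Response:
        origin_requests.append(url)
        return origin[url]

    cache = GatewayCache(fetch)
    cache.get("/account/123")         # miss; also pulls in the hinted PDF
    cache.get("/statements/123.pdf")  # served from cache, no origin request
    print(origin_requests)
```

In the demo the origin is contacted twice in total: once for the page and once for the hinted PDF, so the later request for the PDF is a cache hit. A proxy cache in the CacheFlow mould would instead have to infer what to fetch from the content itself, which is where the wasted overhead comes from.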

What might be most interesting about GWA is that it apparently** coordinates caching between both the client and the intermediary. I don’t have much insight into what they’re doing, but tight integration between the two could buy you some interesting benefits. Server-side control mechanisms would help even more, but I suspect Google would be reluctant to give up their heuristics.

* Academic interest in prefetching continued far beyond the point when its lack of utility was apparent; there are certain topics that just won’t die. Another example: cache replacement algorithms were the subject of endless papers, even though the small gains in efficiency that they found were dwarfed by the falling prices of disks.

** I say apparently because AFAIK GWA doesn’t work on Mac OS X.

Mark Nottingham
