Thoughts on Archiving HTTP

Wednesday, 5 May 2010

Steve Souders and others have been working for a while on HAR, an HTTP Archive format.

I love the idea behind HAR, but as I expressed on the mailing list (so far no response), I have a hard time believing that different implementations will parse and process HTTP and then serialise it into HAR in exactly the same way.

My use case is incorporating HAR support into RED, both for import and export. However, RED needs byte-for-byte access to the entire HTTP interaction to do a proper protocol analysis.

Now, you may say that this level of detail isn’t what HAR is designed for, but I think it affects other tools too, like YSlow and Page Speed. For example, last time I checked, LiveHTTPHeaders showed you HTTP headers (especially cache-related ones) after Mozilla’s processing, which can change the values and skew results.

That’s just one example, of course; parsing HTTP is by nature a very complex thing. Any abstraction you choose to express an HTTP message loses information, in the same way that considering XML to be an Infoset makes encryption and signatures so difficult.
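To make the information loss concrete, here is a minimal Python sketch. The parser in it is hypothetical (it is not how any particular tool works); it simply lower-cases field names, trims whitespace and keeps the last value for each name, which is the kind of "helpful" normalisation I'm worried about.

```python
# Raw bytes as they might appear on the wire: two Cache-Control lines that
# differ in case and whitespace.
RAW = (
    b"HTTP/1.1 200 OK\r\n"
    b"Cache-Control: max-age=60\r\n"
    b"cache-control:   no-cache  \r\n"
    b"\r\n"
)


def naive_parse(message):
    """Hypothetical tool-side parser: normalises names and values into a dict."""
    head = message.split(b"\r\n\r\n", 1)[0]
    header_lines = head.split(b"\r\n")[1:]  # skip the status line
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(b":")
        # Lower-casing, stripping and overwriting all discard wire-level detail.
        headers[name.decode("ascii").strip().lower()] = value.decode("ascii").strip()
    return headers


def naive_serialise(headers):
    """Write the parsed headers back out as header lines."""
    return b"".join(
        "{}: {}\r\n".format(name, value).encode("ascii")
        for name, value in headers.items()
    )


parsed = naive_parse(RAW)
rebuilt = naive_serialise(parsed)
print(parsed)
print(rebuilt in RAW)  # False: the original bytes cannot be recovered
```

After the round trip, the duplicate header line, the original capitalisation and the odd whitespace are all gone, and those are exactly the things a protocol checker wants to see.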

The end result is that I think once people really get their hands dirty with HAR, they’ll find interoperability — i.e. getting the same results from tools for the same HTTP interaction, no matter what generated it — is really hard.

Why? Because HAR is really a format for capturing the results of analysing an HTTP message exchange, not the exchange itself.

A Proposal

I think what’s required is an additional format; something that can capture the raw information in an HTTP exchange so that different tools can analyse it without worrying about how it was captured.

HTTP already defines an ‘archive format’ of sorts; the application/http media type in RFC 2616, which shows how to serialise an HTTP message as MIME. In reality, this is just the “bytes on the wire”, which is exactly the level of detail needed.

What’s missing is timing; i.e., how long it took to look up DNS, connect, wait for a response, etc. Potentially, applications would need timing and boundary information for each individual packet received.

This level of detail could be achieved by annotating the HTTP message with timestamps and byte offsets; e.g., by having a set of headers in a MIME wrapper that said something like:

Timing: start=1273021621.2059989
Timing: dns_resolved=+.21233
Timing: connected=+.8195
Timing: packet=+.8195;start=0;end=1253
Timing: packet=+1.022;start=1254;end=2200

and so forth.
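As a rough illustration of how little interpretation a consumer would need, here is a Python sketch that reads Timing lines like the ones above. The syntax is only a strawman, so everything here is an assumption: the names TimingEvent and parse_timing are made up, and I'm treating every "+" value as an offset in seconds from the start timestamp (rather than from the previous event).

```python
from dataclasses import dataclass
from typing import List, Optional

SAMPLE = [
    "Timing: start=1273021621.2059989",
    "Timing: dns_resolved=+.21233",
    "Timing: connected=+.8195",
    "Timing: packet=+.8195;start=0;end=1253",
    "Timing: packet=+1.022;start=1254;end=2200",
]


@dataclass
class TimingEvent:
    name: str
    offset: float                # seconds since the "start" event (assumed)
    at: float                    # absolute time, seconds since the epoch
    start: Optional[int] = None  # first byte offset covered, packets only
    end: Optional[int] = None    # last byte offset covered, packets only


def parse_timing(lines: List[str]) -> List[TimingEvent]:
    events = []
    origin = None
    for line in lines:
        field, _, value = line.partition(":")
        if field.strip().lower() != "timing":
            continue  # ignore any other wrapper headers
        parts = [p.strip() for p in value.split(";")]
        name, _, stamp = parts[0].partition("=")
        params = dict(p.split("=", 1) for p in parts[1:])
        if stamp.startswith("+"):
            if origin is None:
                raise ValueError("relative timestamp before any start line")
            offset = float(stamp[1:])
        else:
            # Assumption: the first absolute timestamp seen is the origin.
            if origin is None:
                origin = float(stamp)
            offset = float(stamp) - origin
        events.append(TimingEvent(
            name=name,
            offset=offset,
            at=origin + offset,
            start=int(params["start"]) if "start" in params else None,
            end=int(params["end"]) if "end" in params else None,
        ))
    return events


events = parse_timing(SAMPLE)
for event in events:
    print(event)
```

With the packet boundaries in hand, a tool can slice the archived application/http bytes itself and work out, say, when the last header byte arrived, without trusting whatever captured the exchange to have computed it.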

Of course, this loses information as well; we don’t have the TCP/IP information available here, but it’s a much lower level of abstraction than HAR, and it requires a lot less interpretation.

So — is the HAR community (or anyone else) interested in talking about a format along these lines? I’m happy to compromise to make it more implementable (e.g., it may be difficult for browser plug-ins to get this kind of information from the APIs available to them), but if no-one is interested, I won’t spend any energy on coding it.

Mark Nottingham
