Wednesday, 5 May 2010
Thoughts on Archiving HTTP
I love the idea behind HAR, but as I expressed on the mailing list (so far no response), I have a hard time believing that different implementations will parse and process HTTP and then serialise it into HAR in exactly the same way.
My use case is incorporating HAR support into RED, both for import and export. However, RED needs byte-for-byte access to the entire HTTP interaction to do a proper protocol analysis.
Now, you may say that this level of detail isn’t what HAR is designed for, but I think it affects others like YSlow and Page Speed. For example, last time I checked, LiveHTTPHeaders showed you HTTP headers (especially cache-related ones) after Mozilla’s processing, which can change the values and skew results.
That’s just one example, of course; parsing HTTP is by nature a very complex thing. Choosing any abstraction to express an HTTP message in loses information, in the same way that considering XML to be an Infoset makes encryption and signatures so difficult.
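To make the information loss concrete, here is a minimal Python sketch. The `naive_parse` function is hypothetical and stands in for whatever normalisation a capturing tool might apply (lower-casing field names, trimming whitespace, joining repeated fields); it is not how any particular tool works. Two responses that differ on the wire come out identical once parsed.

```python
def naive_parse(raw: bytes) -> dict[str, str]:
    """Hypothetical lossy header parser, for illustration only."""
    head = raw.split(b"\r\n\r\n", 1)[0].decode("latin-1")
    fields: dict[str, str] = {}
    for line in head.split("\r\n")[1:]:  # skip the status line
        name, _, value = line.partition(":")
        key = name.strip().lower()       # field-name case is discarded
        value = value.strip()            # surrounding whitespace is discarded
        # Repeated fields are folded into one comma-separated value.
        fields[key] = f"{fields[key]}, {value}" if key in fields else value
    return fields


# Two different byte sequences for "the same" response headers.
wire_a = (b"HTTP/1.1 200 OK\r\n"
          b"Cache-Control: max-age=60\r\n"
          b"Cache-Control: public\r\n\r\n")
wire_b = (b"HTTP/1.1 200 OK\r\n"
          b"cache-control:   max-age=60, public\r\n\r\n")

same_parse = naive_parse(wire_a) == naive_parse(wire_b)  # parsed views agree
same_bytes = wire_a == wire_b                            # raw bytes do not
```

A tool that only sees the parsed view cannot tell whether the server sent one field or two, or how it spelled them, which is exactly the detail a protocol analyser cares about.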
The end result is that I think once people really get their hands dirty with HAR, they’ll find interoperability — i.e. getting the same results from tools for the same HTTP interaction, no matter what generated it — is really hard.
Why? Because HAR is really a format for capturing the results of analysing an HTTP message exchange, not the exchange itself.
I think what’s required is an additional format; something that can capture the raw information in an HTTP exchange so that different tools can analyse it without worrying about how it was captured.
HTTP already defines an ‘archive format’ of sorts; the application/http format in RFC 2616, which shows how to serialise an HTTP message as MIME. In reality, this is just the “bytes on the wire”, which is exactly the level of detail needed.
What’s missing is timing; i.e., how long it took to look up DNS, connect, wait for a response, etc. Potentially, applications would need timing and boundary information for each individual packet received.
This level of detail could be achieved by annotating the HTTP message with timestamps keyed to byte offsets; e.g., by having a set of headers in a MIME wrapper that said something like
Timing: start=1273021621.2059989
Timing: dns_resolved=+.21233
Timing: connected=+.8195
Timing: packet=+.8195;start=0;end=1253
Timing: packet=+1.022;start=1254;end=2200
and so forth.
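As a rough illustration of how little interpretation such annotations would need, here is a Python sketch that reads the example headers above. The `Timing` syntax is only the strawman from this post, and everything else (the `Timeline` class, `parse_timing`, treating `+` values as seconds relative to `start`, treating byte ranges as inclusive) is an assumption made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Timeline:
    start: float = 0.0                                   # absolute epoch seconds
    events: dict[str, float] = field(default_factory=dict)   # name -> offset from start
    packets: list[tuple[float, int, int]] = field(default_factory=list)  # (offset, first, last byte)


def parse_timing(lines: list[str]) -> Timeline:
    """Parse hypothetical 'Timing:' header lines into a Timeline."""
    tl = Timeline()
    for line in lines:
        name, _, value = line.partition(":")
        if name.strip().lower() != "timing":
            continue
        # The first item is the event; anything after ';' is a parameter.
        first, *params = [p.strip() for p in value.split(";")]
        key, _, num = first.partition("=")
        if key == "start":
            tl.start = float(num)
        elif key == "packet":
            attrs = dict(p.split("=", 1) for p in params)
            tl.packets.append((float(num), int(attrs["start"]), int(attrs["end"])))
        else:
            tl.events[key] = float(num)
    return tl


sample = [
    "Timing: start=1273021621.2059989",
    "Timing: dns_resolved=+.21233",
    "Timing: connected=+.8195",
    "Timing: packet=+.8195;start=0;end=1253",
    "Timing: packet=+1.022;start=1254;end=2200",
]
tl = parse_timing(sample)
# Assuming inclusive byte ranges, each packet's size is last - first + 1.
sizes = [last - first + 1 for _, first, last in tl.packets]
```

The raw message bytes stay untouched next to these annotations, so a tool can slice the message at the recorded offsets and do its own parsing.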
Of course, this loses information as well; we don’t have the TCP/IP information available here, but it’s a much lower abstraction than HAR, and it requires a lot less interpretation.
So — is the HAR community (or anyone else) interested in talking about a format along these lines? I’m happy to compromise to make it more implementable (e.g., it may be difficult for browser plug-ins to get this kind of information from the APIs available to them), but if no-one is interested, I won’t spend any energy on coding it.