mark nottingham

Thoughts on Archiving HTTP

Wednesday, 5 May 2010

HTTP

Steve Souders and others have been working for a while on HAR, an HTTP Archive format.

I love the idea behind HAR, but as I expressed on the mailing list (so far no response), I have a hard time believing that different implementations will parse and process HTTP and then serialise it into HAR in exactly the same way.

My use case is incorporating HAR support into RED, both for import and export. However, RED needs byte-for-byte access to the entire HTTP interaction to do a proper protocol analysis.

Now, you may say that this level of detail isn’t what HAR is designed for, but I think it affects others like YSlow and Page Speed. For example, the last time I checked, LiveHTTPHeaders showed HTTP headers (especially cache-related ones) after Mozilla’s processing, which can change the values and skew results.

That’s just one example, of course; parsing HTTP is by nature a very complex thing. Any abstraction chosen to express an HTTP message loses information, in the same way that considering XML to be an Infoset makes encryption and signatures so difficult.
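To illustrate the kind of loss involved, here is a minimal sketch, not taken from any real tool. The sample response and the helpers `parse_headers` and `serialise_headers` are invented for this example; they mimic a naive parser that lower-cases names, trims whitespace and keeps only the last value for a repeated header.

```python
# Hypothetical raw response headers: note the repeated header,
# the differing case and the double space after the first colon.
RAW = (
    b"HTTP/1.1 200 OK\r\n"
    b"cache-control:  max-age=60\r\n"
    b"Cache-Control: no-cache\r\n"
    b"\r\n"
)


def parse_headers(raw: bytes) -> dict:
    """Naive parse: lower-case names, strip whitespace, last value wins."""
    headers = {}
    for line in raw.split(b"\r\n")[1:]:  # skip the status line
        if not line:  # blank line ends the header block
            break
        name, _, value = line.partition(b":")
        headers[name.strip().lower().decode("ascii")] = value.strip().decode("ascii")
    return headers


def serialise_headers(headers: dict) -> bytes:
    """Write the parsed headers back out in a normalised form."""
    return b"".join(
        f"{name}: {value}\r\n".encode("ascii") for name, value in headers.items()
    )


parsed = parse_headers(RAW)
roundtrip = serialise_headers(parsed)
print(parsed)
print(roundtrip)
```

The round trip drops the first `Cache-Control` line entirely and erases the original case and spacing, so a tool reading only the parsed form can no longer tell what was actually sent.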

The end result is that I think once people really get their hands dirty with HAR, they’ll find interoperability — i.e. getting the same results from tools for the same HTTP interaction, no matter what generated it — is really hard.

Why? Because HAR is really a format for capturing the results of analysing an HTTP message exchange, not the exchange itself.

A Proposal

I think what’s required is an additional format; something that can capture the raw information in an HTTP exchange so that different tools can analyse it without worrying about how it was captured.

HTTP already defines an ‘archive format’ of sorts: the application/http media type in RFC 2616, which shows how to serialise an HTTP message as MIME. In reality, this is just the “bytes on the wire”, which is exactly the level of detail needed.
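As a rough sketch of what that wrapping could look like, the following puts a captured message inside a minimal MIME header block without touching its bytes. The function names `wrap_as_application_http` and `unwrap` are made up for this example, and the use of `Content-Length` to delimit the body is an assumption of the sketch rather than something the media type requires; `msgtype` is one of the optional parameters RFC 2616 defines for application/http.

```python
def wrap_as_application_http(raw: bytes, msgtype: str) -> bytes:
    """Prefix raw HTTP bytes with a minimal MIME header block.

    msgtype is "request" or "response". The raw bytes are appended
    untouched, so nothing about the original message is reinterpreted.
    """
    header = (
        f"Content-Type: application/http; msgtype={msgtype}\r\n"
        f"Content-Length: {len(raw)}\r\n"  # assumption: length-delimited body
        "\r\n"
    ).encode("ascii")
    return header + raw


def unwrap(wrapped: bytes) -> bytes:
    """Return the original bytes: everything after the first blank line."""
    _, _, body = wrapped.partition(b"\r\n\r\n")
    return body


request = b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n"
wrapped = wrap_as_application_http(request, "request")
print(wrapped.decode("ascii"))
```

Because the wrapper never parses the payload, unwrapping gives back exactly the bytes that were captured, malformed or not.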

What’s missing is timing; i.e., how long it took to look up DNS, connect, wait for a response, etc. Potentially, applications would need timing and boundary information for each individual packet received.

This level of detail could be achieved by annotating byte offsets within the HTTP message with timestamps; e.g., by having a set of headers in a MIME wrapper that said something like:

Timing: start=1273021621.2059989
Timing: dns_resolved=+.21233
Timing: connected=+.8195
Timing: packet=+.8195;start=0;end=1253
Timing: packet=+1.022;start=1254;end=2200

and so forth.
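A minimal sketch of reading such headers follows. The `Timing` syntax above is only a strawman, so this parser makes assumptions that are not stated in it: every `+` value is taken as seconds elapsed since `start` (not since the previous event), and the `start`/`end` parameters on a `packet` entry are taken as inclusive byte offsets into the message. The names `Timeline` and `parse_timing` are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Timeline:
    start: float = 0.0                            # absolute start time (epoch seconds)
    events: dict = field(default_factory=dict)    # event name -> seconds since start
    packets: list = field(default_factory=list)   # (seconds since start, first byte, last byte)


def parse_timing(lines):
    """Collect the hypothetical Timing headers into a Timeline."""
    timeline = Timeline()
    for line in lines:
        name, _, rest = line.partition(":")
        if name.strip().lower() != "timing":
            continue  # ignore unrelated headers
        first, *params = [part.strip() for part in rest.split(";")]
        event, _, value = first.partition("=")
        if event == "start":
            timeline.start = float(value)
        elif event == "packet":
            attrs = dict(p.split("=", 1) for p in params)
            timeline.packets.append(
                (float(value), int(attrs["start"]), int(attrs["end"]))
            )
        else:
            timeline.events[event] = float(value)  # assumed relative to start
    return timeline


headers = [
    "Timing: start=1273021621.2059989",
    "Timing: dns_resolved=+.21233",
    "Timing: connected=+.8195",
    "Timing: packet=+.8195;start=0;end=1253",
    "Timing: packet=+1.022;start=1254;end=2200",
]
timeline = parse_timing(headers)
for offset, first_byte, last_byte in timeline.packets:
    print(f"+{offset}s: bytes {first_byte}-{last_byte}")
```

Keeping the offsets relative, as they appear in the headers, avoids losing precision by adding small fractions to a large epoch timestamp.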

Of course, this loses information as well; we don’t have the TCP/IP information available here, but it’s a much lower abstraction than HAR, and it requires a lot less interpretation.

So — is the HAR community (or anyone else) interested in talking about a format along these lines? I’m happy to compromise to make it more implementable (e.g., it may be difficult for browser plug-ins to get this kind of information from the APIs available to them), but if no-one is interested, I won’t spend any energy on coding it.


8 Comments

Beat Bolli said:

Why not use PCAP (of tcpdump and wireshark fame) and build on that?

Wednesday, May 5 2010 at 5:06 AM

Erik Hetzner said:

The WARC format comes pretty close to this already:

http://bibnum.bnf.fr/WARC/

See also the previous ARC format:

http://www.archive.org/web/researcher/ArcFileFormat.php

These are used for web archiving.

I have been involved with web archiving for a number of years, so I was surprised to hear of this “HTTP Archive” format. I hadn’t thought of the need for HTTP archiving for tools like Firebug.

Having used ARC and WARC, I think you are correct to have identified the proper solution as being a serialization of the bits on the wire. This method has worked very well for web archiving. It is simple to store, and libraries to parse HTTP messages are ubiquitous. I do not understand why one should build a system to parse the HTTP session and then re-serialize it in a different format, JSON or otherwise. This is bound to cause problems.

Thursday, May 6 2010 at 5:15 AM

Steve Souders said:

How would Firebug generate the type of file you’re describing? In other words, does your proposal also require that browsers build an API that exposes byte-level HTTP data? If so, that’s going to take some time to get done, whereas HAR is making progress now. Should these be two types of files?

Friday, May 7 2010 at 2:14 AM

karl said:

Slightly related: I remember the development of HTTP expressed in RDF for describing HTTP interactions.

https://www.w3.org/TR/HTTP-in-RDF10/

HAR does indeed seem cool and promising, precisely because there seems to be traction around it from different people.

Saturday, May 8 2010 at 2:48 AM

Gordon Mohr said:

I too was surprised to hear of an “HTTP Archive” format that didn’t include the verbatim HTTP transaction, but did include user-agent specific parsing/rendering/execution timings. That information is certainly useful, but it’s not really “HTTP” info.

I’d recommend that HAR be described in a way that makes this distinction clear – perhaps you can keep the acronym by referring to it as “HTTP Agent Record”.

Meanwhile, for a format that does store verbatim over-the-wire HTTP requests/responses (like WARC), it’d be great to reuse HAR’s definitions for fields (such as timings) that aren’t yet covered in WARC-related specs. Helpful steps to allow this might be:

  • definition of a HAR-metadata MIME type, so that the separate ‘metadata’ record in WARC (which describes one or more other verbatim records) could have that type

  • an allowance for such a HAR-metadata record to only include the novel info (not necessarily repeating basics visible in the HTTP transaction)

Storing bytes verbatim is also likely to have benefits when dealing with servers that are buggy or confused about encoding; the normalization to UTF-8 for HAR might otherwise obscure guesswork that is being done by the recording client.

Saturday, May 29 2010 at 2:33 AM