Monday, 11 July 2011
What Proxies Must Do
The explosion of HTTP implementations isn’t just in clients and servers. An oft-overlooked but important part of the Web ecosystem is the intermediary, often called just a “proxy”*.
These days, it’s pretty easy for anyone to build a proxy using Python, Ruby, Perl, Java or Node.JS, and there are a bunch of frameworks that can help you do this, such as node-http-proxy. Additionally, there are lots of off-the-shelf proxies that you can use, from the widely-known Squid and Traffic Server to more niche products like Polipo and tinyproxy.
It’s great that it’s so easy to do this, but building a proxy is trickier than it seems; not only do you have to worry about things like concurrency, performance and stability, you can also hurt the Web if you get it wrong.
Let me explain.
Bad Proxies Hurt the Web
When a Web server doesn’t interoperate with the rest of the Web well it becomes apparent pretty quickly, and the person running it either gets it fixed, or uses another server; it’s pretty simple. Likewise, if your Web browser makes it difficult to browse your favourite site, you know what happens next: the barrier to switching browsers has never been lower.
However, proxies are awkward. When they go bad, Web sites can’t do anything about it, and users can only complain to faceless IT departments who don’t have much time and frankly probably care even less.
So, when a problem is introduced into a proxy, it affects the whole Web, badly.
Polipo, for example**, doesn’t honour the Cache-Control: private directive, which breaks the very important contract between servers and caches; now, when you’re setting your Cache-Control headers, you either have to accept that the very small number of people who use Polipo as a shared cache may see each others’ content, or you have to bend over backwards, wasting bytes (and money!) to send other directives that Polipo will follow. And, even if the Polipo guys decide to fix it, there’s no guarantee that existing deployments will be upgraded anytime soon.
In other words, the privileged position of a proxy has great power (to muck things up), and correspondingly great responsibility to get it right, because at their hearts, protocols are agreements, and when you don’t honour them, you don’t communicate.
So, what should proxies (and proxy frameworks) do? And, what should you look for when you’re shopping for one to deploy? Here’s a baker’s dozen of things to keep in mind.
0. Advertise HTTP/1.1 Correctly
HTTP/1.1 is the current version spoken on the Internet, and as long as a proxy implements it correctly (in particular, handling chunked encoding), it should always advertise itself as HTTP/1.1 conformant.
This means that the top line of requests and responses should always contain HTTP/1.1 as the version identifier, even when talking to something that says it’s HTTP/1.0.
The reason for this is that HTTP/1.1 defines not only how to talk to 1.1 devices, but also to 1.0 ones. When an HTTP/1.0 message contains a 1.1 mechanism like Cache-Control, its meaning doesn’t change, and it should still be respected.
See the spec for more information.
1. Remove Hop-by-hop Headers
The number one thing that proxies must do is remove hop-by-hop headers before forwarding messages — both requests and responses. This means that the Connection header and any header it lists MUST be removed, as well as TE, Transfer-Encoding, Keep-Alive, Proxy-Authorization, Proxy-Authenticate, Trailer and Upgrade.
Proxies that don’t do the right thing here will make it impossible to deploy new hop-by-hop mechanisms, and can introduce security vulnerabilities. For example, if transfer-encoding isn’t stripped, it can cause confusion about the message delimitation, as well as cause interop problems.
For example, in this request:
    GET /foo HTTP/1.1
    TE: gzip
    Host: example.net
    Connection: Keep-Alive, Foo, Bar
    Foo: abc
    Foo: def
    Keep-Alive: timeout=30

the TE, Keep-Alive and both Foo headers must be removed before forwarding it. If a Bar header occurred in the message, it would be removed too, but its absence isn’t an error.
See the spec for more details on getting it right.
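As a rough illustration, hop-by-hop stripping can be sketched like this in Python. The list-of-tuples header representation and the function name are assumptions for the example, not any particular framework’s API:

```python
# Sketch: strip hop-by-hop headers before forwarding a message.
# Headers are modelled as a list of (name, value) tuples.

HOP_BY_HOP = {
    "connection", "te", "transfer-encoding", "keep-alive",
    "proxy-authorization", "proxy-authenticate", "trailer", "upgrade",
}

def strip_hop_by_hop(headers):
    """Return a new header list with hop-by-hop headers removed,
    including any headers nominated by the Connection header."""
    nominated = set()
    for name, value in headers:
        if name.lower() == "connection":
            nominated.update(tok.strip().lower() for tok in value.split(","))
    drop = HOP_BY_HOP | nominated
    return [(n, v) for n, v in headers if n.lower() not in drop]

request = [
    ("TE", "gzip"),
    ("Host", "example.net"),
    ("Connection", "Keep-Alive, Foo, Bar"),
    ("Foo", "abc"),
    ("Foo", "def"),
    ("Keep-Alive", "timeout=30"),
]
print(strip_hop_by_hop(request))  # only the Host header survives
```

Note that the headers nominated by Connection are dropped case-insensitively; header names in HTTP aren’t case-sensitive.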
2. Detect Bad Framing
Proxies also need to be on the lookout for Content-Length headers that are duplicates, as well as ones that conflict with the use of Transfer-Encoding, and either reject the message or remove the bad headers.
This is because there are entire classes of attacks that exploit the differences between how implementations frame messages.
For example, this response:
    HTTP/1.1 200 OK
    Content-Type: text/html; charset=utf-8
    Content-Length: 45
    Content-Length: 20
has an ambiguous length. If a proxy treats it differently than a client, an attacker can inject a response. Likewise, this one:
    HTTP/1.1 200 OK
    Content-Type: text/html; charset=utf-8
    Content-Length: 200
    Transfer-Encoding: chunked
has both a Content-Length and chunked encoding. The chunked encoding has precedence, and the Content-Length has to be removed before forwarding the message.
See the spec for how to do it well.
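A minimal framing check along these lines might look like the following sketch (the function name and header representation are illustrative):

```python
# Sketch: detect ambiguous message framing before forwarding.
# Raises on conflicting Content-Length values; drops Content-Length
# when chunked transfer-coding is also present (chunked wins).

def check_framing(headers):
    lengths = {v.strip() for n, v in headers if n.lower() == "content-length"}
    chunked = any("chunked" in v.lower()
                  for n, v in headers if n.lower() == "transfer-encoding")
    if len(lengths) > 1:
        raise ValueError("conflicting Content-Length headers")
    if chunked and lengths:
        return [(n, v) for n, v in headers if n.lower() != "content-length"]
    return headers

response = [
    ("Content-Type", "text/html; charset=utf-8"),
    ("Content-Length", "200"),
    ("Transfer-Encoding", "chunked"),
]
print(check_framing(response))  # Content-Length has been removed
```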
3. Route Well
The destination for a request can appear in the URL (as an absolute URI) as well as in the Host header. So, it’s important for proxies to behave correctly when both appear. In short, the host and port in an absolute URI always override the Host header. For example:
    GET http://example.net/foo HTTP/1.1
    Host: www.example.com:8000

Here, the host is example.net and the port is 80 (the default for HTTP). When there’s disagreement, a proxy is expected to “fix up” the Host header. See the spec for more details.
4. Insert Via
A lot of proxies treat the Via header as optional; they don’t want to advertise their presence. However, HTTP depends on its use; not only does it tell clients and servers that an intermediary is present, but it also tells them what the HTTP version of the hop beyond the intermediary is, so that they can figure out the capabilities of the chain as a whole.
This helps clients decide whether they can use 1.1-only features like pipelining and Expect: 100-continue.
One of the common complaints about Via is that it exposes information about the network, but it doesn’t have to; the spec allows you to use an arbitrary pseudonym, like this:
Via: 1.0 bob, 1.1 mary, 1.1 private
Once again, see the spec for the fine points.
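Appending an entry to Via is straightforward; here’s a sketch, using a pseudonym rather than a real hostname (the function name is illustrative):

```python
# Sketch: add this hop's entry to Via before forwarding, recording the
# HTTP version of the hop the message was received from. Using a
# pseudonym ("private") avoids exposing network details.

def add_via(headers, received_version="1.1", pseudonym="private"):
    entry = received_version + " " + pseudonym
    for i, (name, value) in enumerate(headers):
        if name.lower() == "via":
            # Append to the existing header rather than adding another.
            headers = list(headers)
            headers[i] = (name, value + ", " + entry)
            return headers
    return list(headers) + [("Via", entry)]

print(add_via([("Via", "1.0 bob, 1.1 mary")]))
# → [('Via', '1.0 bob, 1.1 mary, 1.1 private')]
```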
5. Meet Expectations
Proxies also need to forward requests with the Expect header correctly. Otherwise, clients can hang, waiting (usually for the 100 Continue status code).
See the spec.
6. Pipeline Correctly
HTTP/1.1 servers — including those built into intermediaries — are required to support pipelining. Unfortunately, some proxies haven’t supported pipelining well, very occasionally with disastrous results (e.g., mixing up responses), causing browsers to be very cautious about using pipelining.
Fortunately, this is starting to change, so you can expect more pipelined requests on the Web. This is great for performance, but it raises the bar for implementing an intermediary.
Unfortunately, there isn’t (yet) a clear, easy-to-follow guide to all of the pitfalls for implementing pipelining in a proxy. I have a draft about helping clients; with a little work (help?), it may expand to cover intermediaries too.
However, as long as your server-side handles pipelining well — even if it just buffers the requests and sends them out one at a time — that’s a good starting point.
7. Support Chunking — Both Ways
One of the biggest changes in HTTP/1.1 was the introduction of chunked encoding. This is a huge win when you don’t want to buffer a large message (e.g., one generated by a script), and essential for good performance in some use cases.
Most intermediaries get response chunking right, because it’s so prevalent. However, there are growing use cases for request chunking as well. While it’s OK spec-wise to refuse these with a 411 Length Required, a good intermediary will pass through chunked requests such as this one:

    POST /thing HTTP/1.1
    Host: www.example.com
    Transfer-Encoding: chunked
    Content-Type: text/html

    ...
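For illustration, here’s a minimal decoder for a fully-buffered chunked body. A real intermediary would parse incrementally and handle trailers; this sketch ignores both:

```python
# Sketch: decode a chunked body so it can be re-framed or streamed
# onward. Assumes the whole body is buffered; trailers are ignored.

def decode_chunked(data: bytes) -> bytes:
    body, pos = b"", 0
    while True:
        eol = data.index(b"\r\n", pos)
        size = int(data[pos:eol].split(b";")[0], 16)  # chunk-size[;ext]
        if size == 0:
            return body  # last-chunk reached
        start = eol + 2
        body += data[start:start + size]
        pos = start + size + 2  # skip chunk data and its trailing CRLF

print(decode_chunked(b"5\r\nhello\r\n0\r\n\r\n"))  # → b'hello'
```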
8. Buffer Intelligently
HTTP is a message-oriented protocol, which means that it’s technically fine to buffer an entire request or response before forwarding it. However, this isn’t friendly to a lot of uses that people have for HTTP.
Of course, some amount of buffering is necessary (and indeed unavoidable), but it should be done in a way that the next hop isn’t waiting too long for part of a request or response.
Note that some commonly-used HTTP “reverse” proxies will buffer the entire response and/or request; while this is fine in some deployments, it’s important to understand that it’s a serious limitation for others (e.g., serving large files and/or streaming).
9. Don’t Limit Arbitrarily
It’s necessary for all HTTP implementations to limit the resources used by a single request, to avoid various kinds of attacks. However, those limits should be generous; otherwise, you’re limiting the Web itself.
In particular, URIs should be allowed at least 8000 octets, and HTTP headers should have 4000 as an absolute minimum (in practice, header blocks can get much bigger).
All of this should be configurable, of course. We’re discussing the details in HTTPbis, but those numbers should be considered an absolute floor; most implementations will want to exceed them.
10. Cache Correctly
If your proxy implements a cache, it needs to respect the Cache-Control directives that both clients and servers provide. This shouldn’t be hard; HTTP gives considerable latitude to caches, but there are a few inviolate rules, especially regarding private and no-store. If caches don’t listen to sites, sites will find ways to work around bad caches, and everybody loses, so respect the contract that’s implicit in HTTP.
Likewise, proxy caches need to do the right thing with the Date and Age headers. Date should NOT be changed by proxies; doing so messes up the caching model of HTTP in some pretty subtle ways, and Age is necessary to make sure that content isn’t double-cached (see Edith Cohen’s paper for more details).
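The Age arithmetic behind this can be sketched directly from the spec’s calculation (all times are in seconds; the variable names follow RFC 2616’s current_age algorithm):

```python
# Sketch of the spec's current_age calculation for a cached response
# (variable names follow RFC 2616 section 13.2.3).

def current_age(age_value, date_value, request_time, response_time, now):
    """age_value is the received Age header (0 if absent), date_value
    the Date header; request_time/response_time are when the request
    was sent and the response received, as seconds since the epoch."""
    apparent_age = max(0, response_time - date_value)
    corrected_received_age = max(apparent_age, age_value)
    response_delay = response_time - request_time
    corrected_initial_age = corrected_received_age + response_delay
    resident_time = now - response_time
    return corrected_initial_age + resident_time
```

For example, a response dated at t=1000, requested at t=1005, received at t=1010 with no Age header, has an age of 105 seconds at t=1100; a received Age of 30 would raise that to 125.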
11. Don’t Transform
If you’re writing a proxy, or deploying something as a proxy (i.e., something that goes to arbitrary Web sites, not just your own), you need to honour Cache-Control: no-transform, both in requests and responses.
This allows people to tell you not to mess with their stuff, in a nutshell. While it’s tempting to ignore it and insert that ad / transcode that content / do whatever it is you do, if you ignore it, they’ll just find a way to work around you, and again, everybody loses.
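The check itself is cheap; a sketch (the function name is illustrative):

```python
# Sketch: check for no-transform before modifying a message body.

def may_transform(headers):
    """Return False if any Cache-Control header carries no-transform."""
    for name, value in headers:
        if name.lower() == "cache-control":
            directives = [d.strip().lower() for d in value.split(",")]
            if "no-transform" in directives:
                return False
    return True

print(may_transform([("Cache-Control", "max-age=60, no-transform")]))  # → False
print(may_transform([("Cache-Control", "max-age=60")]))                # → True
```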
12. Bonus: Support Trailers
Finally, while trailers are completely optional in HTTP, and they aren’t widely used to date, there are some interesting use cases for them, such as post-response debugging and tracing. A friendly intermediary will pass them through.
Getting It Right
Whether you’re creating a new proxy or you’re trying to find one to deploy, there are tools to help you. Co-Advisor is a comprehensive test suite for proxies — both with and without caches, and both forward and reverse — that can be used to assess how HTTP conformant a product is. It’s also free for Open Source projects, so there’s no excuse.
If you run Co-Advisor, remember that perfect conformance isn’t necessary; almost every product will have problems. It’s the big stuff that’s important.
* Proxy is actually a more specific term; it means something that directs requests to all sites, usually with explicit browser configuration. A “reverse proxy” is more correctly known as a gateway, and all of these things are intermediaries. I use proxy here more generically, as that seems to be how people use it casually.
** This isn’t intended to pick just on Polipo, of course; there are many other badly-behaved proxies out there.