mark nottingham

How Multiplexing Changes Your HTTP APIs

Sunday, 13 October 2019


When I first learned about SPDY, I was excited about it for a number of reasons, but near the top of the list was its potential impact on APIs that use HTTP.

While HTTP is great for APIs for a number of reasons, one of the problems that many APIs come up against is one of granularity; it’s most useful to have fine-grained APIs, so that you’re only returning the information that the client needs, but that leads to clients needing to make lots of requests to get everything they need.

For example, if your API is about widgets, you probably want to expose a URL for each widget, so that clients can just access the widgets they’re interested in.

GET /widget/5382894223 HTTP/1.1
Accept: application/widget+json
Accept-Encoding: gzip, deflate
User-Agent: BobsWidgetClient/1.5

However, in HTTP/1, requests are expensive, and an API designed like this quickly runs into practical problems; if a client needs to find out about forty specific widgets, 40 requests need to be made. Besides the overhead of transmitting all of those requests (at about 150 bytes each, as per above, that’s almost 6K), the client has an awkward decision about how to transmit them:

  1. It can use a single connection for all 40 requests. Practically speaking, this means that the minimum time to service all of the requests is 40 round trips between the client and server (plus a few more for connection setup). Even if can use HTTP pipelining (which is often problematic), any delay in producing one of the responses will block all of those behind it.

  2. It can use a single connection per request, so that none of them blocks the others. However, this means that the client will use 40 connections, which take time to open, consume resources, and will very often cause TCP congestion problems, which adds – not removes – significant delays and inefficiency, due to retransmission.

  3. It can use a smaller number of connections to spread out the load, reducing (but not removing) the risk of congestion issues without requiring the requests to be completely serialised. This is effectively the compromise position; it won’t necessarily take 40 round trips, but it’ll be at least 40 divided by the number of connections used. Also, delay on one response will still block the responses “behind” it on that same connection – meaning that performance will vary depending on a number of factors. And variance in performance is often worse than consistent delay.

Some APIs try to dodge this choice by offering a URL that clients can GET or POST a “batch” request to, e.g.,

GET /widgets/5382894223,35223231,534232313,5231332435 HTTP/1.1
Accept: application/widgets+json
Accept-Encoding: gzip, deflate
User-Agent: BobsWidgetClient/1.5

However, this approach has a number of downsides. Both the service and clients need to understand a new endpoint, and a new list-based format, bloating the API – especially if there are many different types of resources that need similar treatment. Furthermore, this approach seriously impacts cache efficiency, creating further server load and client-perceived latency.

This has also been one of the factors leading to query languages like GraphQL being developed; if you can describe precisely what you want to the server in an efficient format, it can reply with just what you want.

This awkward choice is also faced by Web browsers when using HTTP/1 to request all of the various assets on a Web page, and led to the design of SPDY, which served as input to HTTP/2. HTTP/2 is fundamentally different to HTTP/1 in several ways, but multiplexing – the ability to have multiple requests and responses in flight on one connection – is the biggest.

Multiplexing associates each request/response exchange with a stream ID, and uses that to make sure that the chunks of each aren’t mixed up when they’re transmitted, no matter how they’re interleaved. It means that you can use only one connection without sacrificing performance due to Head of Line Blocking.

The above is probably not news to most, but I suspect the implications for API design aren’t fully apparent. On the wire, HTTP/2 (and soon, HTTP/3) allows you to express a large number of requests in a very compact way.

To give a very simplistic example, consider this script, which makes 40 concurrent HTTP/2 requests with nghttp2:


for i in {1..40}
    URLS+="https://localhost:8443/widgets/${i}/whatever "

nghttp -y --no-dep $URLS

If you watch that in Wireshark, you’ll see those 40 requests go by in one 1440 byte packet. Here’s the relevant bits of the text dump (with some details elided and lines folded):

No. 11  Time 0.002228  Source ::1  src port 56623  destination ::1
Protocol HTTP2  Length 1440   Information
Magic, SETTINGS[0],
HEADERS[1]: GET /widgets/1/whatever,
HEADERS[3]: GET /widgets/2/whatever,
HEADERS[5]: GET /widgets/3/whatever,
HEADERS[7]: GET /widgets/4/whatever,
HEADERS[75]: GET /widgets/38/whatever,
HEADERS[77]: GET /widgets/39/whatever,
HEADERS[79]: GET /widgets/40/whatever

Frame 11: 1440 bytes on wire (11520 bits),
    1440 bytes captured (11520 bits) on interface 0
Internet Protocol Version 6, Src: ::1, Dst: ::1
Transmission Control Protocol
    Source Port: 56623
    Destination Port: 8443
    [Stream index: 0]
    [TCP Segment Len: 1364]
    TCP payload (1364 bytes)

Most of the individual requests are 23 or so bytes, so you should be able to fit about 400 requests like this in the first 10 packets (the most common initial CWND) of a connection before any further requests need to wait for an ACK. No worries about Head of Line blocking, congestion events due to multiple connections, or bloat.

The responses come back efficiently too; as soon as the server has part of a response available, it can send it, constrained only by available bandwidth. In particular, if a cache is between the server and client, it can fill the available bandwidth while the server answers uncached requests – making the most of the resources available.

This is pretty powerful. Without too much exaggeration, another way to think of HTTP/2 is as a new query language – one that lets you encode a very complex set of requests into a small amount of data that is heavily optimised for transmission, while still allowing standard HTTP components – especially caches – to work with the individual requests.

You might say that the URLs in my example are very similar, which isn’t realistic for many use cases. That’s true, but because of the way that HPACK header compression works, that doesn’t matter too much; forty requests for any URL paths of about that length would encode to be about the same size.

There are, of course, other reasons to use batch endpoints or specialised query languages; in particular, if some of your requests have side effects that are visible in others that you’re making, so that their processing order is important.

It’s also early days for HTTP/2 clients; while it’s well established in browsers, it’s just starting to be available and usable as a library in many languages (hint: try libcurl). Your server implementation will also need to be carefully considered to exploit this kind of request pattern; many will need rethinking before they can handle 40 or 400 concurrent requests on one connection efficiently (but it is possible; caching can help tremendously, as can consider outstanding requests as a pool, rather than in isolation).

With all of those caveats noted, HTTP/2 should put to rest any notion of the need to minimise the number of requests for APIs; the protocol makes them cheap enough to not practically matter. Go ahead and design a highly granular HTTP API to meet the needs of your clients.