Sunday, 20 February 2022
Server-Sent Events, WebSockets, and HTTP
What's the best way to do pub/sub on the Web?
The orange site is currently discussing an article about Server-Sent Events, especially as compared with WebSockets (and the emerging WebTransport). Both the article and discussion are well-informed, but I think they miss out on one aspect that has fairly deep implications.
Intermediation is Important
Many years back when I worked on the infrastructure team of a Very Large Website, WebSockets was beginning to be supported in browsers, and the various properties on that site (news, sports, entertainment, and so on) were excited about the possibilities it offered. They needed to scale WebSockets out and we wanted to support them, so we asked them which library – effectively, which protocol – they wanted us to implement.
Let’s pull a number out of the air and say we asked twenty properties. The problem is that we got probably twelve different answers from them; they couldn’t agree, because there are a lot of different ways you can use WebSockets.
While some of those uses are truly unique, many of the properties just wanted an efficient, reliable publish/subscribe mechanism to stream events to browsers. They’d used long polling for this and found it to be less than optimal – especially because it effectively consumes an entire HTTP connection, and by that time browsers were starting to more strictly limit the number of connections available to each origin.
WebSockets seemed like the answer, but there is no standard pub/sub for WebSockets; instead, you choose a library for your preferred language, deploy its server-side component and ship the corresponding client-side code to the browser.
Given that Very Large Website didn’t have a single language convention, and that the landscape of WebSockets libraries is pretty large even in one language, it’s not surprising that we were left with a problem – choosing which one to support in infrastructure.
For CDNs, this problem is multiplied by the number of customers they have. Intermediaries can’t understand application-specific semantics, so they can’t add much value to them without making some big assumptions.
I suspect this is why most CDN WebSockets products are effectively TCP-layer proxies; they pass WebSockets connections 1:1 back to the configured origin, and perhaps provide some DDoS protection and HTTP/2 connection coalescing (if they’re fancy).
What’s left out is the ability to scale the protocol based upon intermediaries’ understanding of the higher-level semantics of the application. In cases where the application is doing pub/sub, intermediaries can’t do much with WebSockets, even though it would be relatively straightforward for them to do so if they could “grab onto” those semantics.
And while intermediation isn’t required for deploying pub/sub, it’s going to be a massive aid to scaling, reliability, and reducing latency. Just like intermediary caching is for “normal” HTTP.
At heart, this is a coordination problem. Having lots of different, effectively proprietary (even if Open Source) little protocols shuts intermediaries out of adding value to the WebSockets protocol.
The question, then, is how we enable intermediation for pub/sub – in the process, perhaps making pub/sub a standards-defined part of the Web?
Server-Sent Events is one possibility. Fastly already allows SSE to be “fanned out” using HTTP caching’s collapsed forwarding, so really we already have a form of Web pub/sub with support for intermediation – it just hasn’t caught on. Why not?
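To make the fan-out idea concrete, here is a minimal sketch in Python of an intermediary that collapses many subscriptions to the same URL into one upstream stream. Everything in it (the `CollapsingIntermediary` class, its methods, the `/scores` URL, and the stand-in `origin` function) is hypothetical and for illustration only; it is not how any particular CDN implements collapsed forwarding.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


class CollapsingIntermediary:
    """Toy model of collapsed forwarding for event streams (hypothetical)."""

    def __init__(self, fetch_upstream: Callable[[str], Iterable[str]]):
        self._fetch_upstream = fetch_upstream
        self._subscribers: Dict[str, List[List[str]]] = defaultdict(list)
        self.upstream_requests = 0

    def subscribe(self, url: str) -> List[str]:
        """Register a client; returns the list its events are delivered into."""
        inbox: List[str] = []
        self._subscribers[url].append(inbox)
        return inbox

    def pump(self, url: str) -> None:
        """Open ONE upstream stream for url and copy each event to every subscriber."""
        self.upstream_requests += 1
        for event in self._fetch_upstream(url):
            for inbox in self._subscribers[url]:
                inbox.append(event)


def origin(url: str) -> Iterable[str]:
    # Stand-in origin emitting three SSE-formatted events.
    for n in range(3):
        yield f"id: {n}\ndata: update {n} for {url}\n\n"


edge = CollapsingIntermediary(origin)
clients = [edge.subscribe("/scores") for _ in range(1000)]
edge.pump("/scores")
print(edge.upstream_requests, len(clients[0]))
```

A thousand clients receive every event, while the origin only ever sees a single request; that is the scaling benefit an intermediary can offer once it can recognise a stream as shareable.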
There are a few potential reasons. One is that the HTTP/1.1 connection limit makes SSE tricky (to the point of being unworkable); effectively, SSE requires at least HTTP/2. For many Web developers, HTTP/2 is relatively new.
Even with HTTP/2, TCP head-of-line blocking is possible when packets are lost, so if you need to avoid that (e.g., you need as close to “real-time” as possible), you’ll need HTTP/3, which is even newer and more unfamiliar to most.
Another is that the Last-Event-ID mechanism isn’t used in this approach, which means that events could be lost during a reconnection. To support it, the intermediary (reverse proxy or CDN) would have to understand the event stream and tailor its responses appropriately. That’s more of an implementation issue than a protocol issue, however.
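As a rough illustration of what "understanding the event stream" might involve, the Python sketch below keeps a bounded buffer of recent events and replays whatever a reconnecting client missed, based on the Last-Event-ID value it sends. The `ReplayBuffer` class, its integer IDs, and the single-line `data` payloads are assumptions made for the example, not part of SSE or of any intermediary's behaviour.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple


class ReplayBuffer:
    """Keeps the most recent events so reconnecting clients can catch up."""

    def __init__(self, capacity: int = 100):
        self._events: Deque[Tuple[int, str]] = deque(maxlen=capacity)
        self._next_id = 1

    def publish(self, data: str) -> int:
        """Store a single-line event payload and return the ID assigned to it."""
        event_id = self._next_id
        self._next_id += 1
        self._events.append((event_id, data))
        return event_id

    def replay_after(self, last_event_id: Optional[str]) -> List[str]:
        """Return SSE-formatted events newer than the client's Last-Event-ID."""
        try:
            last_seen = int(last_event_id) if last_event_id else 0
        except ValueError:
            last_seen = 0  # unparseable ID: replay everything still buffered
        return [
            f"id: {event_id}\ndata: {data}\n\n"
            for event_id, data in self._events
            if event_id > last_seen
        ]


buffer = ReplayBuffer(capacity=3)
for score in ["1-0", "1-1", "2-1", "3-1"]:
    buffer.publish(score)

# A client that last saw event 2 reconnects and is sent events 3 and 4.
missed = buffer.replay_after("2")
```

The buffer is deliberately bounded: a client that has been away longer than the buffer covers still loses events, so the capacity is a policy decision rather than a detail.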
Those issues can be resolved over time, as the newer HTTP versions are more widely deployed and (perhaps) as intermediaries support SSE more deeply. However, some remaining issues are more fundamental.
One such issue (for some) is that SSE currently only allows textual content in events. Base64 is a workaround here, but not great. In theory, browser vendors could add support for binary events, but this is doubtful, since all of their current focus is on WebTransport.
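A small Python sketch shows why the Base64 workaround is "not great": it works, but it inflates every payload by roughly a third. The helper names here are hypothetical, and the decoder only handles events produced by the matching encoder.

```python
import base64


def encode_binary_event(payload: bytes, event_id: int) -> str:
    """Wrap binary data in a text-only SSE event using Base64."""
    text = base64.b64encode(payload).decode("ascii")
    return f"id: {event_id}\ndata: {text}\n\n"


def decode_binary_event(event: str) -> bytes:
    """Recover the bytes from the first data: line of an event built above."""
    for line in event.split("\n"):
        if line.startswith("data: "):
            return base64.b64decode(line[len("data: "):])
    raise ValueError("event has no data field")


payload = bytes(range(256)) * 3  # 768 bytes of arbitrary binary data
event = encode_binary_event(payload, event_id=7)
encoded_len = len(base64.b64encode(payload))  # 1024 characters for 768 bytes
```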
Also, even when TLS is used, SSE can encounter issues with anti-virus proxies and enterprise firewalls, which sometimes buffer the entire HTTP response before sending it. Some commenters in the orange site threads mention strategies for mitigating this, apparently with good results. Padding responses doesn’t seem like a great solution, but if anti-virus is becoming smarter about recognising SSE, that might be enough. This hints at a broader issue, though.
The high-level issue with SSE-with-collapsed-forwarding is that it’s an implicit solution: the pub/sub semantics are not very explicit in the protocol; they’re really only surfaced in the text/event-stream response media type.
As a result, you’re relying on the intermediary to implement caching in a particular way to achieve the desired effect. This might not be a big deal, in that CDNs and reverse proxies are typically well-coordinated with the origin, but it’s not optimal protocol design.
None of these issues rules out SSE for all use cases, but they do create friction against its adoption.
Extending WebSockets / WebTransport
Another option is defining a new WebSockets sub-protocol for pub/sub. Technically, that’s pretty straightforward; the protocol would be explicitly declared during connection setup using the Sec-WebSocket-Protocol header, and then you’d just need to define the ‘pub’ and ‘sub’ messages on that WebSockets connection.
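A minimal Python sketch of the declaration step follows. The client offers a sub-protocol in the Sec-WebSocket-Protocol header of its opening request, and the server echoes back the one it selects. The token `pubsub.example`, the path, and the host are hypothetical (no such sub-protocol is registered); the other header fields are the standard WebSocket opening handshake.

```python
from typing import Optional, Set

# Hypothetical sub-protocol token; no such name is registered.
PUBSUB_SUBPROTOCOL = "pubsub.example"


def client_handshake(path: str, host: str, key: str) -> str:
    """Build the client's opening request, offering the pub/sub sub-protocol."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Upgrade: websocket\r\n"
        "Connection: Upgrade\r\n"
        f"Sec-WebSocket-Key: {key}\r\n"
        "Sec-WebSocket-Version: 13\r\n"
        f"Sec-WebSocket-Protocol: {PUBSUB_SUBPROTOCOL}\r\n"
        "\r\n"
    )


def select_subprotocol(header_value: str, supported: Set[str]) -> Optional[str]:
    """Pick the first client-offered sub-protocol the server also supports."""
    for token in header_value.split(","):
        token = token.strip()
        if token in supported:
            return token
    return None  # nothing in common: the server omits the header in its reply


request = client_handshake("/stream", "example.com", "dGhlIHNhbXBsZSBub25jZQ==")
chosen = select_subprotocol("chat, pubsub.example", {PUBSUB_SUBPROTOCOL})
```

Because the chosen token is visible in the handshake, an intermediary that recognised it could, in principle, treat the connection as pub/sub rather than as an opaque byte stream.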
This is not a new idea – it has been proposed many times. Because WebSockets is effectively a blank canvas, there are a lot of choices to be made when designing a protocol on top of it, and no one way of doing it has yet gained momentum.
I’d suggest that’s made more difficult because all of these proposals are creating something completely separate to HTTP – they’re not building on top, they’re requiring you to buy into a new protocol.
So, again, this is a coordination problem; if we can get everyone to agree to do it in one way, it should work. So far, that hasn’t happened.
I suspect part of the reason is that intermediaries can’t just change how they implement a protocol without potentially breaking many sites that depend on them, and as a result are looking for a very stable solution. At the same time, Open Source libraries don’t feel those constraints and want the flexibility of refining how they operate.
The third option is to extend the core semantics of HTTP to include pub/sub. This has the merits of making those semantics more explicit to the protocol – and intermediaries.
When I’ve noodled on this, I’ve generally thought of it as a new HTTP method called SUB, and a new non-final (i.e., 1xx) status code called PUB. Because a single request can have many non-final responses, it’s possible to map events to each of the non-final responses.
For example, a SUB request to /foo/stream might result in three events, each carrying its data in the My-App-Data HTTP header. Headers are used because 1xx responses can’t contain body content.
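The exchange described above could look something like the following Python sketch, which spells out a hypothetical request and three non-final responses in HTTP/1.1 syntax and then extracts the events from them. SUB, PUB, and the literal `1xx` placeholder come from the proposal in the text; none of them is a registered method or status code, and the host name and header values are invented for illustration.

```python
from typing import List

# Hypothetical wire format: "SUB", "PUB" and the "1xx" placeholder are not
# registered; example.com and the My-App-Data values are made up.
REQUEST = "SUB /foo/stream HTTP/1.1\r\nHost: example.com\r\n\r\n"

RESPONSES = (
    "HTTP/1.1 1xx PUB\r\nMy-App-Data: first\r\n\r\n"
    "HTTP/1.1 1xx PUB\r\nMy-App-Data: second\r\n\r\n"
    "HTTP/1.1 1xx PUB\r\nMy-App-Data: third\r\n\r\n"
)


def extract_events(raw: str, field_name: str = "My-App-Data") -> List[str]:
    """Collect the event payload from each non-final PUB response head."""
    events = []
    for head in raw.split("\r\n\r\n"):
        if not head:
            continue  # trailing empty chunk after the last blank line
        status_line, *field_lines = head.split("\r\n")
        if not status_line.endswith(" PUB"):
            continue  # ignore anything that is not a PUB response
        for line in field_lines:
            name, _, value = line.partition(":")
            if name.strip().lower() == field_name.lower():
                events.append(value.strip())
    return events


events = extract_events(RESPONSES)
```

Each event costs one status line and one header field, and an intermediary can tell from the method and status code alone that it is looking at a subscription.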
The overhead here is very small (especially in HTTP/2 and HTTP/3). As in other approaches, the client would be responsible for maintaining the connection and re-establishing it when necessary. We could even standardise an HTTP status code for a ‘keepalive’ event to help with that.
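On the client side, a keepalive mostly feeds a timeout. The Python sketch below assumes a hypothetical helper, `KeepaliveWatchdog`, that is told about every non-final response (event or keepalive) and reports when the stream has been silent for too long; timestamps are passed in explicitly so the logic is easy to follow and test.

```python
class KeepaliveWatchdog:
    """Tracks when the stream was last heard from (hypothetical client helper)."""

    def __init__(self, timeout_seconds: float, now: float):
        self.timeout_seconds = timeout_seconds
        self.last_seen = now

    def on_response(self, now: float) -> None:
        """Call for every non-final response, whether an event or a keepalive."""
        self.last_seen = now

    def should_reconnect(self, now: float) -> bool:
        """True once the stream has been silent for longer than the timeout."""
        return now - self.last_seen > self.timeout_seconds


watchdog = KeepaliveWatchdog(timeout_seconds=30, now=0)
watchdog.on_response(now=25)                       # keepalive arrives at t=25
still_alive = not watchdog.should_reconnect(now=50)  # 25s of silence: fine
gone_quiet = watchdog.should_reconnect(now=56)       # 31s of silence: reconnect
```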
Some might be concerned that 1xx isn’t interoperable. However, Google’s work on 103 (Early Hints) has shown that 1xx responses are web-compatible, provided that the connection is encrypted.
Another potential objection is that HTTP headers are generally thought of as textual, leading to the same content limitation as SSE. However, HTTP headers can be binary (although that might not be widely interoperable). Structured Fields might also offer a way out – there has been discussion of creating a binary encoding of them, in which case APIs could offer direct access to binary data. Eventually.
There are also some architectural/philosophical concerns about how non-final responses relate to the state of the resource. However, since we’re defining a new method, it doesn’t have to, so this shouldn’t be a showstopper.
This approach is very similar to SSE – indeed, it could be accessed through the EventSource API with some minor tweaks. The difference is that the semantics are very explicit on the wire, and so HTTP intermediaries will be able to understand and support them if they wish to.
Running Code at the Edge
One more way to do this would be to run code at the ‘edge’ of the network, in a way similar to how code is run in the browser for most WebSockets protocols.
This is a very new capability; until recently, CDNs only provided services like caching and manipulation of header fields at the edge. Now, however, solutions like Fastly Compute@Edge, Cloudflare Workers, and Akamai EdgeWorkers allow you to write code for processing protocols in their intermediaries.
This represents a huge shift in how we think about protocol functions. However, these platforms are very new, and not yet interoperable; if you write code for one of them, it won’t necessarily work on one of the others without at least some rewrites.
Also, these networks have deep understanding of their internal topology and state, and can use that to inform protocol-level decisions at a fine grain. That leaves me questioning whether it’s a good idea for anyone to write code to perform this function on top of one of these platforms when it could be better provided as a broadly useful part of it.
Choosing the Best Path Forward
As an HTTP person, I’m biased: my primary interest is making sure that the HTTP protocol provides as much value as possible, to keep it relevant and preserve the considerable investments the community has made in it. That means making sure that the protocol provides rich functionality, good efficiency, and good interoperability, based upon commonly implemented standards.
In contrast, the WebSockets approach to providing protocol functions is to let them emerge in open source implementations, rather than be specified in open standards. Because the server gets to deploy code on the client, that works pretty well – you choose a library like socket.io, deploy the server and client components, and it just works – but the protocol between the client and server is essentially proprietary.
That’s because at its heart, WebSockets provides a very low abstraction: effectively “TCP for the Web” (words that WS proponents have themselves used). The abstraction you’re working with as a developer is no longer WebSockets, it’s that which is provided by the library you choose.
With that in mind, to me it makes more sense to define a standards-based function for pub/sub in HTTP, rather than WebSockets (or WebTransport).
However, I’m far from certain that will happen, and so the best path forward is uncertain. Should we just extend and refine SSE (if we can get browsers on board)? Because a WebSockets sub-protocol wouldn’t require browser buy-in, that path might be more practical – but will we be able to get enough momentum behind a single proposal? Alternatively, can we get the browsers to implement a new SUB HTTP method? Or will edge compute platforms converge and make all of this unnecessary?
I think the answer is important not just because pub/sub is a broadly useful pattern, but because it might give us a path for introducing other, high-level protocol functions on the Web that can benefit from its architecture (including intermediation).