Will there be a Distributed HTTP?

Tuesday, 18 August 2015

One of the things that came up at the HTTP Workshop was “distributed HTTP”, i.e., moving the Web from a client/server model to a more distributed one. This week, Brewster Kahle (of Archive.org fame) talked about similar thoughts on his blog and at CCC. If you haven’t seen that yet, I’d highly suggest watching the latter.

These are not new discussions, and there are a growing number of systems — e.g. IPFS, Twister, Tahoe-LAFS, Storj, Maidsafe, FileCoin, ICN and NDN — that offer distributed storage in a way that can easily be leveraged into the Web.

Likewise, there’s a lot of interest — as evidenced by things like Peerjs, peerCDN and Peer5 (more here) — in using WebRTC’s DataChannel for peer-to-peer Web content distribution.

So, what’s all the fuss about? Are we going to see a “Distributed HTTP” soon? Below is what I know; if you have answers to the questions and issues below, I’d love to hear from you.

Why Distributed?

There are a number of reasons why various people are interested in a distributed Web protocol.

The most practical are scaling and reliability — if you don’t have one server for your Web traffic, you don’t have to worry about it going down in a flash crowd or when there’s a network problem nearby. The current solution for this problem is to use a CDN, but even CDNs (disclosure: I work at one) are interested in doing something smarter to get content out there, and distributed / p2p protocols are on that list.
People are also much more aware of the ability of governments to see and control what you do online, and having multiple copies of (and paths to) content is one way to make it more available despite attempts to censor it.
Finally, cutting the server out of the equation is seen as an opportunity to reset the Web’s balance of power regarding cookies and other forms of tracking; if you don’t request content from its owner, but instead get it from a third party, the owner can’t track you.

As it is, HTTP is an inherently client/server protocol, in that the authority (the part of the link just after “http://” or “https://”) tells your browser where to go to get the content. Although HTTP (with the help of related systems like DNS) allows servers to delegate that authority to others so that they can serve the content (which is how CDNs are made), the server and its delegates still act as a single point of control, exposure and failure.

Improving all of this sounds really interesting, both as a technical person and as a user. Why is this not just a simple matter of programming?

The Request Privacy Problem

Merely making a HTTP request can be a very revealing thing; it can tell someone what movies you watch, what news you’re reading, what your personal preferences are, and yes, even reveal your secrets. What you do online is big business; so big that your personal data is now classified as an asset class of its own.

This is one of the big reasons why many people are advocating the use of HTTPS to secure the Web, so that you’re not leaking this information to anyone who cares to listen on the network (governments, ISPs, criminals, neighbours).

Of course, that still leaves the server you’re talking to, and their CDN, as well as any advertising and tracking bugs that are included on a Web page. The amount of information that’s revealed to them is pretty astounding — and worth a serious look in the Web architecture — but fundamentally, you know who’s ultimately responsible, because it’s right up in the location bar (hopefully verified with a TLS certificate). Distributed approaches to the Web typically replace this one identified party (plus delegates) with many potentially unidentified servers. Instead of asking youtube.com for your video — and having some level of understanding of its reputation, good or bad — you’re now asking servers that are unrelated to youtube for its content.

There are a lot of ways that this can happen. In a fully distributed approach, you’re asking complete strangers for the content. While it will be encrypted and have integrity (so that you know it’s what you asked for), the fact that you asked for it is very, very difficult to hide. That’s because you need to ask for something, perhaps with a cryptographic hash of the content or some other unique identifier, and the party you’re asking can, with sufficient resources, discover what it identifies by crawling the Web and remembering interesting identifiers.

This means that an attacker could write a peer to participate in the protocol and listen for requests from your address, so that they know what you’re browsing, whether it’s pr0n, activist literature, terrorist instruction manuals or whatever. Not so good for the Web. I wrote about the implications of adding new parties to protocols earlier in the entry about intermediation.
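To make that concern concrete, here is a minimal Python sketch of how a curious peer could map requested identifiers back to content. It assumes content is requested by the SHA-256 hash of its bytes; the URLs, page bodies, peer address and function names are all made up for illustration and don't describe any particular system.

```python
import hashlib


def content_id(body: bytes) -> str:
    """Identifier a client would request: here, the SHA-256 hex digest of the content."""
    return hashlib.sha256(body).hexdigest()


# Hypothetical public pages that the observer has already crawled.
crawled = {
    "https://example.org/news/protest": b"<html>protest coverage</html>",
    "https://example.org/recipes/soup": b"<html>soup recipe</html>",
}

# The observer remembers which identifier corresponds to which page.
known_ids = {content_id(body): url for url, body in crawled.items()}


def observe_request(peer_address: str, requested_id: str) -> str:
    """What a listening peer can infer from a single request it sees."""
    url = known_ids.get(requested_id, "unknown content")
    return f"{peer_address} asked for {url}"


# A client asks the network for some content by its identifier.
requested = content_id(b"<html>protest coverage</html>")
report = observe_request("203.0.113.7", requested)
print(report)
```

The content itself never has to be decrypted or even transferred for this to work; seeing the identifier next to an address is enough.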

In theory, this risk can be mitigated through Private Information Retrieval. However, my understanding (based upon talking to a few researchers who are infinitely smarter than I) is that deploying this for anything like Web caching — with tight bounds on performance and bandwidth use — is impractical for the foreseeable future. If you know differently, I’d love to hear about it.

A more practical way to mitigate this involves selecting one or more known peers/servers to interact with, so that you can control your exposure to a level that you’re comfortable with. For example, you could nominate a personal or corporate server that you trust enough to expose what you’re browsing to, but don’t trust to modify the contents. You could allow such sharing amongst your local network at home, peer-to-peer.

However, even conveying this subtle tradeoff to users is problematic, in a world where many people don’t understand what the “lock” icon really means, or what privacy mode is. Should such a facility default to “on” when most people won’t understand what’s happening? For peer-to-peer approaches, how will people feel about the extra data involved?

Restricting the servers/peers involved also has a corresponding impact on the benefits to scalability, reliability, anonymity and so on.

Some State and Processing Really Wants to Be Centralised

Another issue that comes up with decentralised approaches is that many of the interactions on the Web are more than just “fetch this bit of data.” This is most evident for what HTTP calls POST — a method which basically means “here’s some data, process it in an application-specific fashion and send me the result.”

At first glance, a decentralised Web doesn’t play well with this pattern of interaction, because that processing has to happen somewhere. In HTTP, it’s on a central server; where could it happen on the decentralised Web? While there are a lot of proposals for portable code to be deployed in the cloud, I’m not holding my breath. Luckily, the Web platform has matured tremendously, to the point where you can do very sophisticated processing in the client. I’m talking about JavaScript, of course. For many cases, moving centralised processing to the client is the obvious answer, but there will be some where this proves difficult, either because the amount of code/state to perform the processing is too large, or its owner doesn’t want to make it available as a whole.

Likewise, a successful distributed HTTP will need to have a way to update distributed state from the client — equivalent to HTTP’s PUT and DELETE. This is tricky, especially when that update is more than “store this blob of bits, looked up by its hash,” but instead “update the list of comments / widgets / whatever to include / replace foo.”

That certainly isn’t impossible, but it’s going to require a fairly sophisticated protocol to achieve; I’m not aware of one yet, but would be happy to be shown otherwise.
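A toy Python sketch may help show where the difficulty lies. It assumes a content-addressed store keyed by SHA-256; the `BlobStore` class, the `heads` mapping and the comment data are hypothetical, invented purely for illustration.

```python
import hashlib
import json


class BlobStore:
    """Toy content-addressed store: blobs are immutable and keyed by their SHA-256 hash."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self._blobs[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]


store = BlobStore()

# The easy case: store a blob and look it up by its hash.
v1 = store.put(json.dumps(["first comment"]).encode())

# An "update" cannot change the blob in place; new content gets a new key.
comments = json.loads(store.get(v1))
comments.append("second comment")
v2 = store.put(json.dumps(comments).encode())

# So something mutable has to record which hash is current. Who may change
# this mapping, and how concurrent changes are merged, is the hard part.
heads = {"comments-for-post-42": v1}
heads["comments-for-post-42"] = v2
```

Everything up to `heads` can be done by untrusted peers, since anyone can verify a blob against its hash. The mutable mapping is the piece that seems to want an owner, or a protocol for agreeing on its value.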

Mixed Content, Again

Once you have the nice properties above in some distributed Web protocol, you need to ensure that they’re enforced for an entire page. It doesn’t do much good to have your HTML (or CSS, or images) loaded over a reliable, highly scalable, censorship resistant, anonymising protocol if the rest of the page is still exposed to the very problems that protocol was meant to solve. In security circles, this is called “tranquility” (thanks, Brad Hill).

If this all sounds familiar, it’s because we already have such a bright line in the Web — it’s the difference between “http” and “https”, and the Mixed Content policy. It’s how a browser makes sure that a HTTPS page’s security is not compromised by HTTP content. The most obvious way to achieve this is to create a new URL scheme for the distributed Web, and define a “Mixed Content Plus” to assure that distributed pages are really distributed. However, given the pain that we’re seeing around the transition between HTTP and HTTPS (largely caused by the need to change links, etc.), this may be asking too much — never mind that there will be a long period when most browsers won’t support this new URL scheme (unless it can be managed with registerProtocolHandler and ServiceWorker somehow, although Anne says “not yet”).
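As a rough illustration of what such a check might look like, here is a Python sketch. It is heavily simplified: the `dist` scheme name and the idea of a single linear ranking of schemes are my assumptions for the example, and the real Mixed Content rules are more nuanced than this.

```python
from urllib.parse import urlsplit

# Hypothetical ranking of schemes by the guarantees they offer.
# "dist" is an invented scheme name, used here only for illustration.
SCHEME_LEVEL = {"http": 0, "https": 1, "dist": 2}


def allowed_subresource(page_url: str, resource_url: str) -> bool:
    """Allow a subresource only if its scheme ranks at least as high as the page's."""
    page_level = SCHEME_LEVEL.get(urlsplit(page_url).scheme, 0)
    resource_level = SCHEME_LEVEL.get(urlsplit(resource_url).scheme, 0)
    return resource_level >= page_level


# Today's rule: an HTTPS page should not pull in HTTP content.
print(allowed_subresource("https://example.org/", "http://example.org/app.js"))

# A "Mixed Content Plus" rule: a distributed page should not pull in HTTPS content.
print(allowed_subresource("dist://abc123/index.html", "https://example.org/app.js"))
```

Both calls print `False`, which is the point: every link on a distributed page would need to move to the new scheme, much as links had to change for HTTPS.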

Modifying The Web is Scary

A lot of the proposals in this space look like they’ll require substantial modification to content — not only to change links, but also to fundamentally re-architect functionality in response to the changed capabilities of the protocol.

I think this is worth mentioning because often, modifying content is seen as something to be avoided — as witnessed in the HTTP->HTTPS transition, it can make things difficult. There are two responses to this. You can either reinvent substantial portions of the Web wholesale, and hope that the new thing has enough value to take off on its own merit, or you can propose small, incremental changes to the existing Web, getting them deployed as we’ve deployed many improvements to it over its lifetime. The first path is inherently risky, but the payoff is big. What about incrementalism?

Less Ambitious Approaches

If you’re willing to compromise on some of the use cases above (e.g., if you want to improve scalability and reliability, but pass on the other goals for now), you can leverage existing Web functions like SRI (although you’ll still face some of the problems above, and the WG has punted on the distributed caching use case for many reasons including the “request privacy” issue). Similarly, some folks in the HTTP community have been talking about so-called Blind Proxy Caches that store encrypted blobs, to offer better scalability and availability without sacrificing all of the security guarantees of HTTPS. It’s very early days, and it’s not clear how deployable it will be, but a system with similar tradeoffs already exists in Microsoft’s BranchCache.

That said, it’s interesting to think about how these techniques might intersect with Tor and .onion sites to provide not only scalability but also anonymity.
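For the SRI-based approach mentioned above, the core idea is that the page carries a hash of the resource it expects, so the bytes can come from anywhere and still be checked. Here is a minimal Python sketch that computes and verifies a single SHA-384 integrity value; the function names and the sample script are mine, and it ignores parts of SRI such as multiple hashes per attribute.

```python
import base64
import hashlib


def sri_value(body: bytes) -> str:
    """Build an SRI-style integrity string: 'sha384-' plus the base64 of the digest."""
    digest = hashlib.sha384(body).digest()
    return "sha384-" + base64.b64encode(digest).decode("ascii")


def verify(body: bytes, integrity: str) -> bool:
    """Check bytes fetched from any source against the expected integrity string."""
    return sri_value(body) == integrity


# The page author publishes the integrity value alongside the link.
script = b"console.log('hello');"
integrity = sri_value(script)

# A client can then accept the bytes from a cache or peer if they match.
print(verify(script, integrity))
print(verify(b"console.log('tampered');", integrity))
```

Note that this only covers integrity. Whoever serves the bytes still learns what you asked for, which is the request privacy issue again.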

Finally, another approach that a number of people seem interested in is not making HTTP itself distributed, but instead making decentralised protocols for specific applications (e.g., microblogging, file synchronisation, etc.). This wouldn’t be revising the Web (or at least HTTP) so much as it would be creating new protocols that live inside it.

That’s all I’ve got for now; I’m sure I’ve missed a lot, because I’m not up to speed on many of the existing projects, even though they do fascinate me.

mark nottingham
