Random ruminations

On Building Robust Systems

This blog post is inspired by two works I came across recently: Martin Thomson’s RFC proclaiming Postel’s Maxim is wrong and Barath Raghavan’s recent LIMITS paper, “Abstraction, Indirection, and Sevareid’s Law: Towards Benign Computing”. It is also largely based on a set of e-mails Barath and I passed around, though anything bad is probably a misunderstanding on my part.

Thomson’s RFC, which I have seen on HN, lobste.rs, and a few mailing lists (and hence consider Internet famous), argues that Postel’s robustness principle is bad in the long term. Postel’s law says that implementations should be conservative in what they send and liberal in what they accept. Originally stated in the context of early TCP implementations, this was meant to allow different implementations to coexist. Thomson argues that the flexibility this affords has meant that, in the long term, protocols, e.g., HTTP, become unusually complex (since, as time has gone by, we have had to liberally accept all the options used by any implementation). Unsurprisingly, this is not the first time Postel’s law has been criticized in this manner; e.g., in 2011, Eric Allman of Sendmail fame made similar comments in an ACM Queue article.

Of course, Postel’s law also makes it more likely that systems use common protocols instead of each system relying on its own; implementors extend rather than replace protocols when they need additional features, and as long as implementations adhere to the robustness principle, these extensions do not disrupt communication between extended and traditional implementations.
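To make the principle concrete, here is an entirely toy sketch (the function names and header format are invented for this post, not taken from any real protocol): a conservative sender emits one canonical header form, while a liberal receiver tolerates the variants other implementations might produce.

```python
def send_header(key: str, value: str) -> str:
    """Be conservative in what you send: one canonical form."""
    return f"{key.title()}: {value}\r\n"

def parse_header(line: str) -> tuple[str, str]:
    """Be liberal in what you accept: tolerate stray whitespace,
    odd capitalization, and bare-LF line endings."""
    line = line.rstrip("\r\n")
    key, _, value = line.partition(":")
    return key.strip().lower(), value.strip()

# A strict sender and a liberal receiver interoperate...
assert parse_header(send_header("content-length", "42")) == ("content-length", "42")
# ...but so do sloppy senders, and over time the receiver's
# tolerance quietly becomes part of the de facto protocol.
assert parse_header("CONTENT-length :  42\n") == ("content-length", "42")
```

The second assertion is exactly Thomson’s complaint in miniature: once some sender relies on that tolerance, every future receiver must reproduce it.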

Barath’s paper looks at a very different, but somewhat related, problem: how do we design systems so they are benign, i.e., so they minimize negative consequences for society? While the problem is different, the four principles he comes up with for meeting these goals (scale-out, fails-well, open-design, and self-similarity) are very reminiscent of what Postel’s law was trying to achieve. Essentially, the paper’s argument is that building large-scale systems out of loosely coupled components, whose implementations can be changed, provides at least two important benefits: (a) society can change the implementation of any of the components to remove negative externalities, without affecting the overall system; and (b) this provides greater resilience, because of a diversity of implementations. The last bit is similar to the argument for genetic diversity: having diverse implementations for each subpart makes it more likely that the system as a whole functions correctly. Unsurprisingly, computers used on board space vehicles and aircraft already rely on this mode of redundancy.

The tension between these two papers leads to an interesting question: how do we design systems that allow loose coupling and implementation freedom, but do not lead to the problems identified by Thomson? Achieving this presents some conflicting requirements:

  • We need well-specified protocols: without protocols, getting automated systems to talk to each other is hard. Is this actually true? I am not sure; maybe there is an ideal world where we can use some of the lessons from NLP to allow underspecified protocols.
  • The protocol should allow diverse implementations, i.e., it should specify enough semantics to allow the system as a whole to function, but leave enough unspecified to permit diverse implementations.

And therein lies the challenge: specify too little and Postel’s law becomes essential; specify too much and we end up destroying our ability to have diversity. So, academically (I do live in an ivory tower), how do we know when we have reached this ideal equilibrium for specification, and is this ideal equilibrium useful? Unfortunately, I do not know the answer to any of these questions. However, what I have put together is a list of protocols that I think have failed to attract a diverse set of implementations, and some that have.

My list of failures is basically the set of things that keep the web secure; note that these lists are very subjective and might in fact be wrong.

  • SSH might once have been a fairly well designed protocol; however, the current state of affairs is that a few crypto protocols, as implemented by OpenSSH, are what everyone uses. Should we find a bug in OpenSSH, most Linux boxen are suddenly affected. OpenBSD and Theo de Raadt (the maintainers of OpenSSH) are certainly good, security-conscious programmers, but no one is perfect.
  • SSL: same situation, except we have already experienced the damage caused by OpenSSL being the only reasonable implementation. While the three or four competing OpenSSL replacements (LibreSSL, BoringSSL, OCaml TLS) are trying to remedy this, most of them are merely forks of OpenSSL with minor code fixes, and none of them are trying to diversify the algorithmic base.

My list of successes consists of protocols that are increasingly outdated, having been designed for a less security-conscious world, but that have several dozen competing implementations:

  • HTTP servers, when used with OpenSSL, seem to allow all sorts of decent implementations. No one seems to be much affected by the webserver they are talking to, and somehow even webservers written by undergraduates can be used with clients like Chrome.
  • Telnet, FTP servers, and basically any of the old protocols; I guess strong adherence to Postel’s law helps.
  • OAuth seems to have been an interesting, modern attempt to create a protocol that met both of these requirements; however, it seems to have died somewhere along the way.

Armed with these lists of examples, a broad question to ask is: why are the latter protocols more amenable to diversity? More narrowly, observe that both my examples of things that do poorly involve cryptography. So maybe another question is whether cryptographic protocols can meet these requirements at all.

I think that if we move to cryptographic protocols that allow specification of both a key and some parameters defining an algorithm (for example, the set of blocks used in a block cipher), we might provide implementors with more freedom, at the cost of security. I think this trade-off is reasonable to navigate for most SSL deployments: users can choose not to connect to any servers supporting only weak crypto schemes (this is already what is done), and clients using weak crypto schemes mainly impact their own safety.
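A rough sketch of what such algorithm-agile negotiation might look like; the scheme names, strength ranking, and `negotiate` function below are invented for illustration and do not correspond to any real cipher-suite registry.

```python
# Hypothetical scheme identifiers; a client may refuse weak ones outright,
# which is how "users can choose not to connect" plays out in practice.
WEAK = {"toy-cipher-64"}
PREFERENCE = ["toy-cipher-256", "toy-cipher-128", "toy-cipher-64"]

def negotiate(client_offers, server_offers):
    """Pick the most-preferred scheme both sides support, or fail."""
    common = [alg for alg in PREFERENCE
              if alg in client_offers and alg in server_offers]
    acceptable = [alg for alg in common if alg not in WEAK]
    if not acceptable:
        raise ConnectionError("no mutually acceptable strong scheme")
    return acceptable[0]

# Both sides keep implementation freedom; they only have to agree on
# the negotiation step and the identifier namespace.
assert negotiate(["toy-cipher-256", "toy-cipher-128"],
                 ["toy-cipher-128", "toy-cipher-64"]) == "toy-cipher-128"
```

The diversity comes from the fact that a new implementation only needs to speak the negotiation step and advertise its own identifiers, rather than replicate everyone else’s algorithmic choices.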

I don’t actually have any answers to the broader questions, and more generally I think my cryptography observation is a red herring: perhaps any sufficiently complicated protocol must grapple with the same issues. Therefore, as we seem to be moving to more and more complex protocols, perhaps it is a good time to try to analyze this tradeoff.