I recently came across a statement in Aphyr’s excellent Jepsen blog series that caught my eye:

“In a very real sense, [network] partitions are just really big windows of concurrency.”

This statement seems to imply that distributed systems are “equivalent” to parallel (single-machine) computing systems, for the following reason: partitions, which occur in a network but don’t really occur on a single chip [0], appear to be the key distinguishing property of distributed systems. But if partitions are just a special case of concurrency, then there shouldn’t be any fundamental reasons why algorithms for multicore computational models (such as PRAM) wouldn’t be perfectly suitable for solving all the problems we might encounter in a distributed setting. We know this to be false, so I’ve been trying to puzzle out precisely what properties of distributed computing distinguish it from parallel computing [1].

I’ve been taught that distributed systems have two crucial features:

  • Asynchrony, or “absence of synchrony”: messages from one process to another do not arrive immediately. In a fully asynchronous system, messages may be delayed for unbounded periods of time.
  • Partial failure: some processes in the system may fail while other processes continue executing.

Let’s discuss these two properties separately.

Asynchrony

Parallel systems also exhibit asynchrony, as long as it’s possible for there to be a delay between one process sending a message [2] and the other processes having the opportunity to read that message. Even on a single machine, this delay might be induced by locks within the operating system kernel, or by the cache coherence protocol implemented in hardware on a multicore chip.

With this in mind, let’s return to Aphyr’s statement. What exactly did he mean by “big windows of concurrency”? His article focuses on what happens when multiple clients write to the same database key, so by “concurrency” I think he is referring to situations where multiple processes might simultaneously issue writes to the same piece of state. But if you think about it, the entire execution is a “big window of concurrency” in this sense, regardless of whether the database replicas are partitioned. By “big windows of concurrency” I think Aphyr was really talking about asynchrony (or more precisely, periods of very long message delivery delays). Network partitions are hard to deal with precisely because the messages between replicas aren’t deliverable until after the partition is recovered: when replicas can’t coordinate, it’s challenging (or impossible, if the system chooses to enforce linearizability) for them to correctly process those concurrent writes. Amending Aphyr’s statement then:

“Network partitions are just really big windows of asynchrony.”

Does this amendment resolve our quandary? Someone could rightly point out that because partitions don’t really occur within a single chip [0], parallel systems can effectively provide guarantees on how long message delays can last [3], whereas partitions in distributed systems may last arbitrarily long. Some algorithms designed for parallel computers might therefore break in a distributed setting, but I don’t think this is really the distinction we’re looking for.

Partial Failure

Designers of distributed algorithms codify their assumptions about the possible ways nodes can fail by specifying a ‘failure model’. Failure models might describe how many nodes can fail (for example, quorum-based algorithms assume that fewer than half of the nodes ever fail, otherwise they cannot make progress), or they might spell out how individual crashed nodes behave. The latter constraint forms a hierarchy, where weaker failure models (e.g. ‘fail-stop’, where crashed nodes are guaranteed to never send messages again) are special cases of stronger models (e.g. ‘Byzantine’, where faulty nodes can behave arbitrarily, even possibly mimicking the behavior of correct nodes) [4].

Throughout the Jepsen series, Aphyr tests distributed systems by (i) telling clients to issue concurrent writes, (ii) inducing a network partition between database replicas, and (iii) recovering the partition. Observe that Jepsen never actually kills replicas! This failure model is actually weaker than fail-stop, since nodes are guaranteed to eventually resume sending messages [5]. Aphyr’s statement is beginning to make sense:

“Network partitions that are followed by network recovery are just really big windows of asynchrony.”

This statement is true; from the perspective of a node in the system, a network partition followed by a network recovery is indistinguishable from a random spike in message delays, or peer nodes that are just very slow to respond. In other words, a distributed system that guarantees that messages will eventually be deliverable to all nodes is equivalent to an asynchronous parallel system. But if any nodes in the distributed system actually fail, we’re no longer equivalent to a parallel system.

Who cares?

This discussion might sound like academic hairsplitting, but I claim that these distinctions have practical implications.

As an example, let’s imagine that you need to make a choice between shared memory versus message passing as the communication model for the shiny new distributed system you’re designing. If you come from a parallel computing background you would know that message passing is actually equivalent to shared memory, in the sense that you can use a message passing abstraction to implement shared memory, and vice versa. You might therefore conclude that you are free to choose whichever abstraction is more convenient or performant for your distributed system. If you jumped to this conclusion you might end up making your system more fragile without realizing it. Message passing is not equivalent to shared memory in distributed systems [6], precisely because distributed systems exhibit partial failures; in order to correctly implement shared memory in a distributed system it must always be possible to coordinate with a quorum, or otherwise be able to accurately detect which nodes have failed. Message passing does not have this limitation.
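To see why, here’s a highly simplified, toy sketch of the standard way shared memory is emulated over message passing: a quorum-replicated register. (Everything here is invented for illustration; the “replicas” are in-process objects, and a real emulation such as the ABD algorithm also writes the value it reads back to a quorum before returning. The only point is that every operation needs to reach a majority.)

```python
class ReplicaDown(Exception):
    pass

class Replica:
    """In-process stand-in for a remote node that holds a copy of each key."""
    def __init__(self):
        self.store = {}   # key -> (timestamp, value)
        self.up = True    # flip to False to simulate a crash or partition

    def write(self, key, ts, value):
        if not self.up:
            raise ReplicaDown()
        if ts > self.store.get(key, (0, None))[0]:
            self.store[key] = (ts, value)

    def read(self, key):
        if not self.up:
            raise ReplicaDown()
        return self.store.get(key, (0, None))

class SharedRegister:
    """Emulated shared memory: every operation must hear from a majority quorum."""
    def __init__(self, replicas):
        self.replicas = replicas
        self.quorum = len(replicas) // 2 + 1
        self.ts = 0

    def _on_replicas(self, op):
        replies = []
        for r in self.replicas:
            try:
                replies.append(op(r))
            except ReplicaDown:
                pass
        if len(replies) < self.quorum:
            raise RuntimeError("no quorum reachable: the register is unavailable")
        return replies

    def write(self, key, value):
        self.ts += 1
        self._on_replicas(lambda r: r.write(key, self.ts, value))

    def read(self, key):
        replies = self._on_replicas(lambda r: r.read(key))
        return max(replies, key=lambda reply: reply[0])[1]

replicas = [Replica() for _ in range(3)]
register = SharedRegister(replicas)
register.write("x", 42)
replicas[0].up = False       # one node down: still a majority, operations succeed
print(register.read("x"))    # 42
replicas[1].up = False       # two of three down: no quorum left
try:
    register.read("x")
except RuntimeError as e:
    print(e)                 # no quorum reachable: the register is unavailable
```

If a majority of replicas has failed or been partitioned away, every read and write simply stalls, whereas a design built directly on message passing is free to make progress with whichever nodes it can still reach.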

Another takeaway from this discussion is that Jepsen is actually testing a fairly weak failure model. Despite Jepsen’s simplicity though, Aphyr has managed to uncover problems in an impressive number of distributed databases. If we want to uncover yet more implicit assumptions about how our systems behave, stronger failure models seem like an excellent place to look.


[0] After I posted this blog post, Aphyr and others informed me that some of the latest multicore chips are in fact facing partial failures between cores due to voltage issues. This is quite interesting, because as multicore chips grow in transistor density, the distinction between parallel computing and distributed computing is becoming more and more blurred: modern multicore chips face both unbounded asynchrony (from the growing gap between levels of the memory hierarchy) and partial failure (from voltage issues).

[1] Thanks to Ali Ghodsi for helping me tease out the differences between these properties.

[2] or writing to shared memory, which is essentially the same as sending a message.

[3] See, for example, PRAM or BSP, which assume that every node can communicate with every other node within each “round”. It’s trivial to solve hard problems like consensus in this world, because you can always just take a majority vote and decide within two rounds.

[4] See Ali Ghodsi’s excellent slides for a taxonomy of these failure models.

[5] Note that this is not equivalent to ‘crash-recovery’. Crash-recovery is actually stronger than fail-stop, because nodes may recover or they may not.

[6] Nancy Lynch, “Distributed Algorithms”, Morgan Kaufmann, 1996.

According to a study by Turner et al. [1], wide area network links have an average of 1.2 to 2.7 days of downtime per year. This translates to roughly two to two and a half 9’s of reliability [2].

I was curious how this compared to datacenter links, so I took a look at Gill et al.’s paper [3] on datacenter network failures at Microsoft. Unfortunately some of the data has been redacted, but I was able to reverse engineer the mean link downtime per year with the help of Aurojit Panda’s svg-to-points converter. The results are interesting: across all link types, the average downtime was 0.3 days. This translates to roughly three 9’s of reliability, or roughly an order of magnitude less downtime than WAN links.
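For the curious, here’s the back-of-the-envelope conversion from downtime per year into “nines” of availability, using the figures quoted above:

```python
import math

# Nines of availability = -log10(unavailability).
def nines(downtime_days_per_year: float) -> float:
    unavailability = downtime_days_per_year / 365.0
    return -math.log10(unavailability)

for label, days in [("WAN (best)", 1.2), ("WAN (worst)", 2.7), ("Datacenter", 0.3)]:
    availability = 100 * (1 - days / 365.0)
    print(f"{label}: {availability:.2f}% available, ~{nines(days):.1f} nines")
# WAN (best): 99.67% available, ~2.5 nines
# WAN (worst): 99.26% available, ~2.1 nines
# Datacenter: 99.92% available, ~3.1 nines
```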

Intuitively this makes sense. WAN links are much more prone to drunken hunters, bulldozers, wild dogs, ships dropping anchor and the like than links within a secure datacenter.

Footnotes

[1] Daniel Turner, Kirill Levchenko, Alex C. Snoeren, and Stefan Savage. California Fault Lines: Understanding the Causes and Impact of Network Failures, Table 4. SIGCOMM ‘10.

[2] Note that this statistic is specifically about hardware failure, not overall network availability.

[3] Phillipa Gill, Navendu Jain, and Nachiappan Nagappan. Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications, Figures 8c & 9c. SIGCOMM ‘11.

In 2010, Jeff Dean gave a talk that laid out a list of numbers every programmer should know. His list has since become relatively well known among the systems community.

The other day, a friend mentioned a latency number to me, and I realized that it was an order of magnitude smaller than what I had memorized from Jeff’s talk. The problem, of course, is that hardware performance increases exponentially! After some digging, I actually found that the numbers Jeff quotes are over a decade old [1].

Partly inspired by my officemate Aurojit Panda, who is collecting awesome data on hardware performance, I decided to write a little tool [2] to visualize Jeff’s numbers as a function of time [3].
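To give a flavor of what the tool does (this is an assumption about the approach rather than the tool’s actual code), the extrapolation boils down to compounding an annual improvement rate against a 2001 baseline:

```python
BASELINE_YEAR = 2001

def projected(value_2001_ns: float, annual_improvement: float, year: int) -> float:
    """Assume the quantity shrinks by `annual_improvement` per year, compounded."""
    return value_2001_ns / ((1 + annual_improvement) ** (year - BASELINE_YEAR))

# e.g. a 10,000 ns figure from 2001, with a hypothetical 20%/year improvement rate:
print(round(projected(10_000, 0.20, 2012)))  # ~1,346 ns
```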

Without further ado, here it is.

Footnotes

[1] Jeff’s numbers are from 2001, and were first publicized by Peter Norvig in this article.

[2] Layout stolen directly from ayshen on GitHub.

[3] The hardware trends I’ve gathered are rough estimates. If you want to tweak the parameters yourself, I’ve made it really easy to do so – please send me updates! Better yet, issue a pull request.

tl;dr: handling event ordering correctly in distributed systems is tricky. In this post I cover 7 approaches to coping with concurrency.

Robust distributed systems are notoriously difficult to build. The difficulty arises from two properties in particular:

  • Limited knowledge: each node knows its own state, and it knows what state the other nodes were in recently, but it can’t know their current state.

  • (Partial) failures: individual nodes can fail at any time, and the network can delay or drop messages arbitrarily.

Why are these properties difficult to grapple with? Suppose you’re writing code for a single node. You’re deep in a nested conditional statement, and you need to deal with a message arrival. How do you react? What if the message you’re seeing was actually delayed by the network and is no longer relevant? What if some of the nodes you need to coordinate with have failed, but you aren’t aware of it yet? The set of possible event sequences you need to reason about is huge, and it’s all too easy to forget about the one nasty corner case that will eventually bring your system to a screeching halt.

To make the discussion more concrete, let’s look at an example.

Floodlight bug

The figure above depicts a race condition [1] in Floodlight, a distributed controller for software-defined networks. With Floodlight, switches maintain one hot connection to a master controller and one or more cold connections to replica controllers. The master holds the authority to modify the configuration of the switches, while the other controllers are in slave mode and do not perform any changes to the switch configurations unless they detect that the master has crashed [2].

The race condition is triggered when a link fails (E1), and the switch attempts to notify the controllers (E2,E4) shortly after the master has died (E3), but before a new master has been selected (E6). In this case, all live controllers are in the slave role and will not take responsibility for updating the switch flow table (E5). At some point, heartbeat messages time out and one of the slaves elevates itself to the master role (E6). The new master will proceed to manage the switch, but without ever clearing the routing entries for the failed link (resulting in a persistent blackhole) [3].

If we take a step back, we see that there are two problems involved: leader election (“Who is the master at any point in time?”), and replication (“How should the backups behave?”). Let’s assume that leader election is handled by a separate consensus algorithm (e.g. Paxos), and focus our attention on replication.

Now that we have a concrete example to think about, let’s go over a few solutions to this problem. The first four share the same philosophy: “get the ordering right”.

Take it case-by-case

The straightforward fix here is to add a conditional statement for this event ordering: if you’re a slave, and the next message is a link failure notification, store the notification in memory in case you become master later.
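In Python-flavored pseudocode (the event loop and names here are hypothetical; Floodlight itself is written in Java), the patch might look like this:

```python
# Hypothetical sketch of the case-by-case fix: a slave buffers link-failure
# notifications so it can replay them if it is later promoted to master.

class Controller:
    def __init__(self):
        self.role = "slave"
        self.pending_link_failures = []

    def handle_link_failure(self, notification):
        if self.role == "master":
            self.clear_routes_over(notification["link"])
        else:
            # The corner case from the bug: don't drop the event, remember it.
            self.pending_link_failures.append(notification)

    def on_promoted_to_master(self):
        self.role = "master"
        for notification in self.pending_link_failures:
            self.clear_routes_over(notification["link"])
        self.pending_link_failures.clear()

    def clear_routes_over(self, link):
        ...  # push flow-table updates to the affected switches
```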

This fix seems easy in retrospect. But recall how the bug came about in the first place: the programmer had some set of event orderings in mind when writing the code, but didn’t implement one corner case. How do we know there isn’t another race condition lurking somewhere else in the code? [4]

The number of event orderings you need to consider in a distributed system is truly huge; it scales combinatorially with the number of nodes you’re communicating with. For the rest of this post, let’s see if we can avoid the need to reason on a case-by-case basis altogether.

Replicate the computation

Consider a system consisting of only one node. In this world, there is a single, global order (with no race conditions)!

How can we obtain a global event order, yet still achieve fault tolerance? One way [5] is to have the backup nodes mimic every step of the master node: forward all inputs to the master, have the master choose a serial order for those events, issue the appropriate commands to the switches, and replicate the decision to the backups [6]. The key here is that each backup should execute the computation in the exact same order as the master.

For the Floodlight bug, the backup would still need to hold the link failure message in memory until it detects that the master has crashed. But we’ve gained a powerful guarantee over the previous approach: when the backup takes over for the master, it will be in an up-to-date state, and will know exactly what commands it needs to send to the switches to get them into a correct configuration.
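Here’s a rough sketch of this style of state machine replication (all names are invented; a real system would sit on top of the consensus-based leader election we assumed above):

```python
# Rough sketch of master/backup state machine replication. The master serializes
# events by assigning sequence numbers; backups apply the same events in the same
# order, so a promoted backup already has the master's state.

class ReplicatedController:
    def __init__(self, is_master, backups=()):
        self.is_master = is_master
        self.backups = backups
        self.next_seq = 0         # next sequence number to assign (master only)
        self.applied_seq = -1     # highest sequence number applied locally
        self.buffer = {}          # seq -> event, held until it can be applied in order
        self.network_state = {}   # the replicated state machine's state

    def submit(self, event):
        """Master: choose a global order for the event, apply it, replicate the decision."""
        assert self.is_master
        seq, self.next_seq = self.next_seq, self.next_seq + 1
        self.apply(seq, event)
        for backup in self.backups:
            backup.replicate(seq, event)

    def replicate(self, seq, event):
        """Backup: buffer out-of-order deliveries, then apply strictly in sequence order."""
        self.buffer[seq] = event
        while self.applied_seq + 1 in self.buffer:
            nxt = self.applied_seq + 1
            self.apply(nxt, self.buffer.pop(nxt))

    def apply(self, seq, event):
        # Deterministic transition function. Only the master actually pushes commands
        # to the switches (see footnote [6]); backups just track the state.
        self.applied_seq = seq
        if event["type"] == "link_failure":
            self.network_state.setdefault("failed_links", set()).add(event["link"])

backup = ReplicatedController(is_master=False)
master = ReplicatedController(is_master=True, backups=(backup,))
master.submit({"type": "link_failure", "link": "s1-s2"})
assert backup.network_state == master.network_state   # a promoted backup is up to date
```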

Make your event handlers transactional

Transactions allow us to make a group of operations appear either as if they happened simultaneously, or not at all. This is a powerful idea!

How could transactions help us here? Suppose we did the following: whenever a message arrives, find the event handler for that message, wrap it in a transaction, run the event handler, and hand the result of the transaction to the master controller. The master decides on a global order, checks whether any concurrent transactions conflict with each other (and aborts one of them if they do), updates the switches, sends the serialized transactions to the backups, and waits for ACKs before logging a commit message.

This is very similar to the previous solution, but it gives us two additional benefits:

  • We can potentially handle more events in parallel; most of the transactions will not conflict with each other, and we can simply abort and retry the ones that do.
  • We can now roll back operations. Suppose a network operator issues a policy change to the controller, but realizes that she made a mistake. No problem – she can simply roll back the previous transaction and start again where she began.

Compared to the first approach, this is a significant improvement! Each event is handled in isolation from the other events, so there’s no need to reason about event interleavings; if a conflicting transaction was committed before we get to commit, just abort and retry!
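To make this concrete, here’s a rough sketch of the optimistic flavor of transactional event handling. (Names are invented, and everything is collapsed into a single process to keep it short; in the scheme described above, validation and ordering would happen at the master, which would then stream committed transactions to the backups.)

```python
# Optimistic transactions: each handler runs against recorded read versions, and the
# commit step aborts and retries if anything it read was modified in the meantime.

class ConflictError(Exception):
    pass

class TransactionalStore:
    def __init__(self):
        self.data = {}       # key -> value
        self.version = {}    # key -> version of the last committed write

    def run(self, handler, retries=3):
        """Run handler(read, write) as an optimistic transaction, retrying on conflict."""
        for _ in range(retries):
            read_versions, write_buffer = {}, {}

            def read(key):
                read_versions[key] = self.version.get(key, 0)
                return write_buffer.get(key, self.data.get(key))

            def write(key, value):
                write_buffer[key] = value

            handler(read, write)
            # Validate: abort if anything we read was committed over concurrently.
            if any(self.version.get(k, 0) != v for k, v in read_versions.items()):
                continue  # conflict: retry the whole handler
            for key, value in write_buffer.items():
                self.data[key] = value
                self.version[key] = self.version.get(key, 0) + 1
            return
        raise ConflictError("gave up after repeated conflicts")

store = TransactionalStore()

def handle_link_failure(read, write):
    routes = read("routes") or ["via-s2-s3", "via-s1-s3"]
    write("routes", [r for r in routes if "s2-s3" not in r])

store.run(handle_link_failure)
print(store.data["routes"])   # ['via-s1-s3']
```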

Reorder events when no one will notice

It turns out that we can achieve even better throughput if we use a replication model called virtual synchrony. In short, a virtual synchrony library provides three operations:

  • join() a process group
  • register() an event handler
  • send() an atomic multicast message to the rest of your process group.

These primitives provide two crucial guarantees:

  • Atomic multicast means that if any correct node gets the message, every live node will eventually get the message. That implies that if any live node ever gets the link failure notification, you can rest assured that one of your future masters will get it.
  • The join() protocol ensures that every node always knows who’s a member of its group, and that everyone has the same view of who is alive and who is not. A failure results in a group change, but everyone will agree on the order in which the failure occurred.

With virtual synchrony, we no longer need a single master; atomic multicast means that there is a single order of events observed by all members of the group, regardless of who initiated the message. And with multiple masters, we aren’t constrained by the speed of a single node.

The virtual part of virtual synchrony is that when the library detects that two operations are not causally related to each other, it can reorder them in whatever way it believes most efficient. Since those operations aren’t causally related, we’re guaranteed that the final output won’t be noticeably different.
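To make the guarantees tangible, here’s a toy, single-process simulation. (This is not a real virtual synchrony library, and it elides everything that is hard about implementing atomic multicast and membership agreement over an actual network.)

```python
# Toy simulation of the two guarantees: a Group totally orders all multicasts and
# membership changes, and delivers them to every live member in that same order.

class Member:
    def __init__(self, name):
        self.name = name
        self.delivered = []   # members present for the same events see the same list

    def deliver(self, event):
        self.delivered.append(event)

class Group:
    def __init__(self):
        self.members = []

    def join(self, member):
        self.members.append(member)
        self.multicast(("view_change", tuple(m.name for m in self.members)))

    def fail(self, member):
        self.members.remove(member)
        self.multicast(("view_change", tuple(m.name for m in self.members)))

    def multicast(self, event):
        # Atomic multicast: one global order, delivered to every current member.
        for m in self.members:
            m.deliver(event)

g = Group()
a, b, c = Member("a"), Member("b"), Member("c")
for m in (a, b, c):
    g.join(m)
g.multicast(("link_failure", "s1-s2"))
g.fail(a)                                  # everyone agrees where the failure lands
g.multicast(("link_failure", "s2-s3"))
assert b.delivered[-4:] == c.delivered[-4:] == [
    ("view_change", ("a", "b", "c")),
    ("link_failure", "s1-s2"),
    ("view_change", ("b", "c")),
    ("link_failure", "s2-s3"),
]
```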

OK, let’s move on to the final three approaches, which take a different tack from the first four: “avoid having to reason about event ordering altogether”.

Make yourself stateless

In a database, the “ground truth” is stored on disk. In a network, the “ground truth” is stored in the routing tables of the switches themselves. This implies that the controllers’ view of the network is just soft state; we can always recover it simply by querying the switches for their current configuration!

How does this observation relate to the Floodlight bug? Suppose we didn’t even attempt to keep the backup controllers in sync with the master. Instead, just have them recompute the entire network configuration whenever they realize they need to take over for the master. Their only job in the meantime is to monitor the liveness of the master!

Of course, the tradeoff here is that it may take significantly longer for the newly elected master to get up to speed.

We can apply the same trick to avoid race conditions between concurrent events at the master: instead of maintaining locks between threads, just restart computation of the entire network configuration whenever a new event comes in. Race conditions don’t happen if there is no shared state!
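A minimal sketch of the stateless style (FakeSwitch and its methods are invented stand-ins; a real controller would query the switches over OpenFlow or similar):

```python
class FakeSwitch:
    """Stand-in for a real switch: the ground truth lives here, not in the controller."""
    def __init__(self, sw_id, ports):
        self.id, self.ports, self.installed = sw_id, ports, []

    def query_port_status(self):
        return dict(self.ports)          # port -> is the port up?

    def install(self, rules):
        self.installed = rules

def rebuild_topology(switches):
    """Recover the controller's entire view from the switches themselves."""
    return {sw.id: sw.query_port_status() for sw in switches}

def recompute_configuration(topology):
    # Pure function of the current topology (placeholder logic for the sketch).
    return {sw_id: [f"forward-on:{port}" for port, up in ports.items() if up]
            for sw_id, ports in topology.items()}

def handle_any_event(switches, _event):
    # No per-event case analysis and no state shared between handlers: every event
    # simply triggers a full recompute from ground truth.
    config = recompute_configuration(rebuild_topology(switches))
    for sw in switches:
        sw.install(config[sw.id])

def on_become_master(switches):
    # A newly promoted master needs no replicated state; it bootstraps the same way.
    handle_any_event(switches, None)

switches = [FakeSwitch("s1", {1: True, 2: False}), FakeSwitch("s2", {1: True})]
on_become_master(switches)
print(switches[0].installed)   # ['forward-on:1']
```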

Incidentally, Google’s wide-area network controller is almost entirely stateless, presumably for many of the same reasons.

Force yourself to be stateless

In the spirit of stateless computation, why not write your code in a language that doesn’t allow you to keep state at all? Programs written in declarative languages such as Overlog have no explicit ordering whatsoever. Programmers simply declare rules such as “If the switch has a link failure, then flush the routing entries that go over that link”, and the language runtime handles the order in which the computation is carried out.

With a declarative language, as long as the same set of events is fed to the controller (regardless of their order), the same result will come out. This makes replication really easy: send inputs to all controllers, have each node compute the resulting configuration, and only allow the master node to send out commands to the switches once the computation has completed. The tradeoff is that without an explicit ordering, the performance of declarative languages is difficult to reason about.
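Overlog looks nothing like Python, but the essential property is easy to mimic in any language: make the output a pure function of the set of facts seen so far, so that arrival order cannot affect the result. A toy illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LinkFailure:
    link: str

ALL_ROUTES = {  # switch -> [(route name, links the route traverses)]
    "s1": [("h1->h2", ["s1-s2", "s2-s3"]), ("h1->h3", ["s1-s3"])],
    "s2": [("h2->h3", ["s2-s3"])],
}

def compute_tables(facts: frozenset) -> dict:
    """Flow tables as a pure function of the *set* of facts: order cannot matter."""
    failed = {f.link for f in facts if isinstance(f, LinkFailure)}
    return {sw: [name for name, links in routes if not set(links) & failed]
            for sw, routes in ALL_ROUTES.items()}

# The same facts, delivered in any order, produce the same tables:
print(compute_tables(frozenset({LinkFailure("s2-s3")})))
# {'s1': ['h1->h3'], 's2': []}  -- every route over the failed link is flushed
```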

Guarantee self-stabilization

The previous solutions were designed to always guarantee correct behavior despite failures of the other nodes. This final solution, my personal favorite, is much more optimistic.

The idea behind self-stabilizing algorithms is to have a provable guarantee that no matter what configuration the system starts in, and no matter what failures occur, all nodes will eventually stabilize to a configuration where safety properties are met. This eliminates the need to worry about correct initialization, or detect whether the algorithm has terminated. As a nice side benefit, self-stabilizing algorithms are usually considerably simpler than their order-aware counterparts.

What do self-stabilizing algorithms look like? Self-stabilizing algorithms are actually everywhere in networking – routing algorithms are the most canonical example.

How would a self-stabilizing algorithm help with the Floodlight bug? The answer really depends on what network invariants the control application needs to maintain. If it’s just to provide connectivity [7], we could simply run a traditional link-state algorithm: have each switch periodically send port status messages to the controllers, have the controllers compute shortest paths using Dijkstra’s algorithm, and have the master push the appropriate updates to the switches. Even if there are transient failures, we’re guaranteed that the network will eventually converge to a configuration with no loops or dead ends.
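As a sketch of what this looks like in code (the switch API is invented; a real controller would speak OpenFlow or similar), the entire control loop can be a periodic “rebuild from ground truth and overwrite”:

```python
import heapq
import time

# No history is trusted: every period the controller rebuilds the topology from the
# switches' own port reports and overwrites the routes it computes, so a corrupt or
# stale starting state is washed out within a few periods.

def shortest_path_trees(graph):
    """Plain Dijkstra from every node over {node: {neighbor: cost}}.

    Returns, for each source, the predecessor tree (i.e. each node's next hop back
    toward that source).
    """
    trees = {}
    for src in graph:
        dist, prev, pq = {src: 0}, {}, [(0, src)]
        while pq:
            d, u = heapq.heappop(pq)
            if d > dist.get(u, float("inf")):
                continue
            for v, w in graph.get(u, {}).items():
                if d + w < dist.get(v, float("inf")):
                    dist[v], prev[v] = d + w, u
                    heapq.heappush(pq, (d + w, v))
        trees[src] = prev
    return trees

def control_loop(switches, period_s=5.0):
    while True:
        # 1. Ground truth only: rebuild the graph from current port status reports.
        graph = {sw.id: sw.report_live_links() for sw in switches}
        # 2. Recompute and push routes unconditionally; stale entries are overwritten.
        trees = shortest_path_trees(graph)
        for sw in switches:
            sw.install_routes(trees[sw.id])
        time.sleep(period_s)
```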


Ultimately, the best replication choice depends on your workload and network policies. In any case, I hope this post has convinced you that there’s more than one viable approach to handling concurrency!

Footnotes

[1] Note that this issue was originally discovered by the developers of Floodlight. (We don’t mean to pick on BigSwitch here; we chose this bug because it’s a great example of the difficulties that come up in distributed systems). For more information, see line 605 of Controller.java.

[2] This invariant is crucial to maintain. Think of the switches’ routing tables as shared variables between threads (controllers). We need to ensure mutual exclusion over those shared variables, otherwise we could end up with internally inconsistent routing tables.

[3] The Floodlight bug noted in [1] actually involves neglecting to clear the routing tables of newly connected switches, but the same flavor of race condition could occur for link failures. We chose to focus on link failures because they’re likely to occur much more often than switch connects.

[4] It’s possible in some cases to use a model checker to automatically find race conditions, but the runtime complexity is often intractable and very few systems do this in practice.

[5] There are actually a handful of ways to implement state machine replication. Ours depends on a consensus algorithm to choose the master, but you could also run the consensus algorithm itself to achieve replication. There are also cheaper algorithms such as reliable broadcast. You can also get significantly better read throughput with chain replication, which doesn’t require a quorum for reads, but writes become more complicated.

[6] We still need to maintain the invariant that only the master modifies the switch configurations. Nonetheless, with state machine replication the backup will always know what commands need to be sent to switches if and when it takes over for the master.

[7] Although if your goal is only to provide connectivity, it’s not clear why you’re using SDN in the first place.