Research on distributed systems is often motivated by some variation of the following:

Developers of distributed systems face notoriously difficult challenges, such as concurrency, asynchrony, and partial failure.

That statement seems convincing enough, but it’s rather abstract. In this post we’ll gain a concrete understanding of what makes distribution so challenging, by describing correctness bugs we found in an implementation of the Raft consensus protocol.

Raft is an interesting example because its authors designed it to be understandable and straightforward to implement. As we’ll see, implementing even the relatively straightforward Raft spec correctly requires developers to deal with many difficult-to-anticipate issues.

Fuzz testing setup

To find bugs we’re going to employ fuzz testing. Fuzz tests are nice because they help us exercise situations that developers don’t anticipate with unit or integration tests. In a distributed environment, semi-automated testing techniques such as fuzzing are especially useful, since the number of possible event orderings a system might encounter grows exponentially with the number of events (e.g. failures, message deliveries)—far too many cases for developers to reasonably cover with hand-written tests.

Fuzz testing generally requires two ingredients:

  1. Assertions to check.
  2. A specification of what inputs the fuzzer should inject into the system.


The Raft protocol already has a set of nicely defined safety conditions, which we’ll use as our assertions.

Raft Invariants

Figure 3 from the Raft paper shows Raft’s key invariants. We use these invariants as our assertions. Each assertion should hold at any point in Raft’s execution.

For good measure, we also add in one additional assertion: no Raft process should crash due to an uncaught exception.

We’ll check these invariants by periodically halting the fuzz test and inspecting the internal state of each Raft process. If any of the assertions ever fails, we’ve found a bug.
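To make this concrete, here is a minimal sketch, in Python with illustrative names (DEMi itself inspects live Akka actors, not these toy structures), of what checking two of these invariants over snapshots of process state might look like:

```python
from collections import namedtuple

# Toy snapshot of one Raft process's state; "log" is a list of (term, command).
State = namedtuple("State", ["role", "term", "log"])

def check_election_safety(states):
    # Election Safety: at most one leader may exist per election term.
    leaders_per_term = {}
    for s in states:
        if s.role == "Leader":
            leaders_per_term[s.term] = leaders_per_term.get(s.term, 0) + 1
    assert all(n <= 1 for n in leaders_per_term.values()), "two leaders in one term"

def check_log_matching(states):
    # Log Matching: if two logs contain an entry with the same index and term,
    # the logs are identical in all preceding entries.
    for a in states:
        for b in states:
            for i, (ea, eb) in enumerate(zip(a.log, b.log)):
                if ea[0] == eb[0]:  # same term at the same index
                    assert a.log[: i + 1] == b.log[: i + 1], "log mismatch"
```

The fuzzer would run checks like these against every process each time it halts the execution.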

Input generation

The trickier part is specifying what inputs the fuzzer should inject. Generally speaking, inputs are anything processed by the system, yet created outside the control of the system. In the case of distributed systems there are a few sources of inputs:

  • The network determines when messages are delivered.
  • Hardware may fail, and processes may (re)join the system at random points.
  • Processes outside the system (e.g. clients) may send messages to processes within the system.

To generate the last two types of inputs, we specify a function for creating random external messages (in the case of Raft: client commands) as well as probabilities for how often each event type (external message sends, failures, recoveries) should be injected.
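As a rough sketch of what such a specification might look like (event names and weights here are illustrative, not DEMi’s actual API):

```python
import random

# Each fuzzing step either delivers a pending message or injects an external
# event, chosen according to the probability weights from the fuzz test spec.
EVENT_WEIGHTS = {
    "client_command": 0.3,  # external message from a client
    "failure": 0.1,         # crash a random process
    "recovery": 0.1,        # restart a crashed process
    "deliver": 0.5,         # deliver one pending internal message
}

def next_event(rng):
    events, weights = zip(*EVENT_WEIGHTS.items())
    return rng.choices(events, weights=weights, k=1)[0]
```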

We gain control over the network by interposing on the distributed system’s RPC layer, using AspectJ. For now, we target a specific RPC system: Akka. Akka is ideal because it provides a narrow, general API that operates at a high level of abstraction, based around the actor model.

Our interposition essentially allows us to play god: we get to choose exactly when each RPC message sent by the distributed system is delivered. We can delay, reorder, or drop any message the distributed system tries to send. The basic architecture of our test harness (which we call DEMi) is shown below:

Every time a process sends an RPC message, the Test Harness intercepts it and places it into a buffer. The Test Coordinator later decides when to deliver that message to the recipient. In a fully asynchronous network, the Test Coordinator can arbitrarily delay and reorder messages.

The Test Coordinator also injects external events (external message sends, failures, recoveries) at random according to the probability weights given by the fuzz test specification.

Test Harness
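A minimal sketch of this buffer-and-deliver loop (hypothetical names; DEMi’s real coordinator does much more bookkeeping):

```python
import random

class Coordinator:
    """Toy model of the Test Coordinator's message scheduling."""
    def __init__(self, rng):
        self.pending = []  # intercepted messages: (src, dst, msg)
        self.rng = rng

    def intercept(self, src, dst, msg):
        # The Test Harness calls this instead of sending on the real network.
        self.pending.append((src, dst, msg))

    def deliver_one(self):
        # Fully asynchronous network: any pending message may be delivered
        # next, which lets the coordinator delay and reorder arbitrarily.
        choice = self.rng.randrange(len(self.pending))
        return self.pending.pop(choice)
```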

Interposing at the RPC layer has a few advantages over interposing at a lower layer (e.g. the network layer, a la Jepsen). Most importantly, we get fine-grained control over when each individual (non-segmented) message is delivered. In contrast, iptables is a much blunter tool: it only allows the tester to drop or delay all packets between a given pair of processes [1].

Targeting applications built on Akka gives us one other advantage: Akka provides a timer API that obviates the need for application developers to read directly from the system clock. Timers are a crucial part of distributed systems, since they are used to detect failures. In Akka, timers are modeled as messages, to be delivered to the process that set the timer at a later point in the execution. Rather than waiting for the wall-clock time for each timer to expire, we can deliver it right away, without the application noticing any difference.
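A toy model of this idea (illustrative, not Akka’s actual timer API): because timers are just messages, the test harness can fire them whenever it chooses rather than sleeping on the wall clock.

```python
class TimerQueue:
    """Toy model of timers-as-messages under test-harness control."""
    def __init__(self):
        self.pending = []  # (owner_process, timer_message)

    def set_timer(self, owner, timer_msg, delay_ms):
        # The requested delay is never actually slept on: the coordinator
        # may deliver the timer message at any point it chooses.
        self.pending.append((owner, timer_msg))

    def fire_next(self):
        return self.pending.pop(0)
```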

Target implementation: akka-raft

The Raft implementation we target is akka-raft. akka-raft is written by one of the core Akka developers, Konrad Malawski. akka-raft is fully featured according to the Raft implementation page; it supports log replication, membership changes, and log compaction. akka-raft has existing unit and integration tests, but it has not yet been deployed in production.

UPDATE: Konrad asked me to include a short note, and I’m glad to oblige:

akka-raft is not an officially supported Akka module, but rather just a side project of Konrad’s. The Akka modules themselves are much more rigorously tested before release.

For our fuzz tests we set up a small 4-node cluster (quorum size=3). akka-raft uses TCP as its default transport protocol, so we configure DEMi to deliver pending messages one-at-a-time in a semi-random order that obeys FIFO order between any pair of processes. We also tell DEMi to inject a given number of client commands (as external messages placed into the pending message buffer), and check the Raft invariants at a fixed interval throughout the execution. We do not yet exercise auxiliary features of akka-raft, such as log compaction or cluster membership changes.
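The semi-random, FIFO-preserving delivery order can be sketched as follows (hypothetical helper, not DEMi’s implementation):

```python
import random
from collections import defaultdict, deque

def fifo_schedule(messages, rng):
    """Pick a delivery order that is random across channels but preserves
    FIFO order within each (sender, receiver) pair, mimicking TCP."""
    queues = defaultdict(deque)
    for src, dst, msg in messages:  # messages listed in send order
        queues[(src, dst)].append(msg)
    order = []
    while queues:
        pair = rng.choice(sorted(queues))  # random nonempty channel
        order.append((pair, queues[pair].popleft()))
        if not queues[pair]:
            del queues[pair]
    return order
```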

Bugs we found

For all of the bugs we found below, we first minimized the faulty execution before debugging the root cause [2]. With the minimized execution in hand, we walked through the sequence of message deliveries in the minimized execution one at a time, noting the current state of the process receiving the message. Based on our knowledge of the way Raft is supposed to work, we found the places in the execution that deviated from our understanding of correct behavior. We then examined the akka-raft code to understand why it deviated, and came up with a fix. We submitted all of our fixes as pull requests.

A few of these root causes had already been pointed out by Jonathan Schuster through a manual audit of the code, but none of them had been verified with tests or fixed before we ran our fuzz tests.

On with the results!

raft-45: Candidates accept duplicate votes from the same election term.

Raft is specified as a state machine with three states: Follower, Candidate, and Leader. Candidates attempt to get themselves elected as leader by soliciting a quorum of votes from their peers in a given election term (epoch).

In one of our early fuzz runs, we found a violation of ‘Leader Safety’, i.e. two processes believed they were leader in the same election term. This is a bad situation for Raft to be in, since the leaders may overwrite each other’s log entries, thereby violating the key linearizability guarantee that Raft is supposed to provide.

The root cause here was that akka-raft’s candidate state did not detect duplicate votes from the same follower in the same election term. (A follower might resend votes because it believes an earlier vote was dropped by the network.) Upon receiving the duplicate vote, the candidate counts it as a new vote and steps up to leader before it has actually achieved a quorum of votes.
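The essence of the fix can be sketched by tracking granted votes as a set of voter ids, so a resent vote can’t be double-counted (illustrative Python, not the actual Scala patch):

```python
def record_vote(votes_received, voter, term, current_term):
    """Record a VoteGranted message; set membership deduplicates resends."""
    if term == current_term:
        votes_received.add(voter)
    return len(votes_received)

def has_quorum(votes_received, cluster_size):
    return len(votes_received) >= cluster_size // 2 + 1
```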

raft-46: Processes neglect to ignore certain votes from previous terms.

After fixing the previous bug, we found another execution where two leaders were elected in the same term.

In Raft, processes attach an ‘election term’ number to all messages they send. Receiving processes are supposed to ignore any messages that contain an election term that is lower than what they believe is the current term.

Delayed Term

akka-raft properly ignored lagging term numbers for some, but not all message types. DEMi delayed the delivery of messages from previous terms and uncovered a case where a candidate incorrectly accepted a vote message from a previous election term.

raft-56: Processes forget who they voted for.

akka-raft is written as an FSM. When making a state transition, FSM processes specify both which state they want to transition to, and which instance variables they want to keep once they have transitioned.

Raft FSM

All of the state transitions for akka-raft were correct except one: when a candidate steps down to follower (e.g., because it receives an AppendEntries message, indicating that there is another leader in the cluster), it forgets which process it previously voted for in that term. Now, when another process requests a vote from it in the same term, it will vote again but this time for a different process than it previously voted for, allowing two leaders to be elected.
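A toy model of the invariant the transition must preserve (illustrative Python; akka-raft’s actual fix is a change to the instance variables retained across the FSM transition):

```python
def step_down(state):
    """Candidate -> Follower transition. The buggy version dropped
    voted_for; it must be kept for the remainder of the current term."""
    return {"role": "Follower",
            "term": state["term"],
            "voted_for": state["voted_for"]}  # must survive the transition

def grant_vote(state, candidate):
    """A process may vote at most once per term."""
    if state["voted_for"] in (None, candidate):
        state["voted_for"] = candidate
        return True
    return False
```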

raft-58a: Pending client commands delivered before initialization occurs.

After ironing out leader election issues, we started finding other issues. In one of our fuzz runs, we found that a leader process threw an assertion error.

When an akka-raft candidate first makes the state transition to leader, it does not immediately initialize its state (the nextIndex and matchIndex variables). It instead sends a message to itself, and initializes its state when it receives that self-message.

Through fuzz testing, we found that the candidate could have pending ClientCommand messages in its mailbox, placed there before the candidate transitioned to leader and sent itself the initialization message. Once the candidate is in the leader state, the Akka runtime delivers the ClientCommand message first. Upon processing the ClientCommand message the leader tries to replicate it to the rest of the cluster, and updates its nextIndex hashmap. When the Akka runtime then delivers the initialization self-message, it overwrites the value of nextIndex. When the leader later reads from nextIndex, it may throw an assertion error because the nextIndex values are inconsistent with the contents of its log.
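One way to avoid this ordering hazard is to initialize the leader’s state atomically with the state transition, so no pending message can be processed in between. This is a sketch of a possible fix with illustrative names, not necessarily the patch that was submitted:

```python
def become_leader(peers, last_log_index):
    """Build the leader state in one step, before any mailbox message
    (e.g. a pending ClientCommand) can be delivered."""
    return {"role": "Leader",
            "next_index": {p: last_log_index + 1 for p in peers},
            "match_index": {p: 0 for p in peers}}
```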

raft-58b: Ambiguous log indexing.

In one of our fuzz tests, we found a case where the ‘Log Matching’ invariant was violated, i.e. log entries did not appear in the same order on all machines.

According to the Raft paper, followers should reject AppendEntries requests from leaders that are behind, i.e. prevLogIndex and prevLogTerm for the AppendEntries message are behind what the follower has in its log. The leader should continue decrementing its nextIndex hashmap until the followers stop rejecting its AppendEntries attempts.

This should have happened in akka-raft too, except for one hiccup: akka-raft decided to adopt 0-indexed logs, rather than 1-indexed logs as the paper suggests. This creates a problem: the initial value of prevLogIndex is ambiguous. Followers can’t distinguish between:

  • an AppendEntries for an empty log (prevLogIndex == 0),
  • an AppendEntries for the leader’s 1st command (prevLogIndex == 0), and
  • an AppendEntries for the leader’s 2nd command (prevLogIndex == 1 – 1 == 0).

The last two cases need to be distinguishable. Otherwise followers won’t be able to reject inconsistent logs. This corner case would have been hard to anticipate; at first glance it seems fine to adopt the convention that logs should be 0-indexed instead of 1-indexed.
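A sketch of how the paper’s 1-indexed convention avoids the ambiguity (illustrative helper, not akka-raft code):

```python
def prev_log_index(entries_already_in_log):
    """1-indexed convention: entry i lives at index i (i >= 1), and index 0
    is reserved to mean "no previous entry", so every case is distinct."""
    return entries_already_in_log
```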

raft-42: Quorum computed incorrectly.

We also found a fuzz test that ended in a violation of the ‘Leader Completeness’ invariant, i.e. a newly elected leader had a log that was irrecoverably inconsistent with the logs of previous leaders.

Leaders are supposed to commit log entries to their state machines when they know that a quorum (N/2+1) of the processes in the cluster have that entry replicated in their logs. akka-raft had a bug where it computed the highest replicated log index incorrectly. First it sorted the values of matchIndex (which denote the highest log entry index known to be replicated on each peer). But rather than computing the median (or more precisely, the N/2+1’st largest) of the sorted entries, it computed the mode of the sorted entries. This caused the leader to commit entries too early, before a quorum actually had the entry replicated. In our fuzz test, message delays allowed another leader to become elected, but it did not have all committed entries in its log, due to the previous leader committing too soon.
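The correct computation can be sketched in a few lines (illustrative names):

```python
def highest_replicated_index(match_index, cluster_size):
    """Return the largest log index replicated on at least a quorum of
    nodes: the N/2+1'st largest matchIndex value, i.e. a median-like rank
    statistic -- not the mode."""
    ranked = sorted(match_index, reverse=True)
    quorum = cluster_size // 2 + 1
    return ranked[quorum - 1]
```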

raft-62: Crash-recovery not yet supported, yet inadvertently triggerable.

Through fuzz testing I found one other case where two leaders became elected in the same term.

The Raft protocol assumes a crash-recovery failure model — that is, it allows for the possibility that crashed nodes will rejoin the cluster (with non-volatile state intact).

The current version of akka-raft does not write anything to disk (although the akka-raft developers intend to support persistence soon). That’s actually fine — it just means that akka-raft currently assumes a crash-stop failure model, where crashed nodes are never allowed to come back.

The Akka runtime, however, has a default behavior that doesn’t play nicely with akka-raft’s crash-stop failure assumption: it automatically restarts any process that throws an exception. When the process restarts, all its state is reinitialized.

If, for any reason, a process throws an exception after it has voted for another candidate, the Akka runtime will restart it and it will rejoin the cluster having forgotten who it voted for (since all of its state is volatile). Similar to raft-56, this caused two leaders to be elected in our fuzz test.

raft-66: Followers unnecessarily overwrite log entries.

The last issue I found is only possible to trigger if the underlying transport protocol is UDP, since it requires reordering messages between the same source-destination pair. The akka-raft developers say they don’t currently support UDP, but it’s on their radar.

The invariant violation here was a violation of the ‘Leader Completeness’ safety property, where a leader is elected that doesn’t have all of the needed log entries.

Lamport Time Diagram

Leaders replicate uncommitted ClientCommands to the rest of the cluster in batches. Suppose a follower with an empty log receives an AppendEntries containing two entries. The follower appends these to its log.

Then the follower subsequently receives an AppendEntries containing only the first of the previous two entries. (This message was delayed, as shown in the Lamport Time Diagram). The follower will inadvertently delete the second entry from its log.

This is not just a performance issue: after receiving an ACK from the follower, the leader is under the impression that the follower has two entries in its log. The leader may have decided to commit both entries if a quorum was achieved. If another leader becomes elected, it will not necessarily have both committed entries in its log as it should.
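The receiver behavior the Raft paper prescribes — truncate only when an existing entry actually conflicts (same index, different term) — can be sketched like this (illustrative model; entries are (term, command) pairs):

```python
def append_entries(log, prev_index, entries):
    """Apply an AppendEntries batch. A delayed message carrying only a
    prefix of already-appended entries must leave later entries intact;
    truncation happens only on a genuine term conflict."""
    for i, entry in enumerate(entries):
        idx = prev_index + i  # 0-based position of this entry in the log
        if idx < len(log):
            if log[idx][0] != entry[0]:  # conflicting term at this index
                del log[idx:]            # truncate the conflicting suffix
                log.append(entry)
        else:
            log.append(entry)
    return log
```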


The wide variety of bugs we found gets me really excited about how useful our fuzzing and minimization tool is turning out to be. The development toolchain for distributed systems is seriously deficient, and I hope that testing techniques like this see more widespread adoption in the future.

I left many details of our approach out of this post for brevity’s sake, particularly a description of my favorite part: how DEMi minimizes the faulty executions it finds to make them easier to understand. Check out our paper draft for more details!


[1] RPC layer interposition does come with a drawback: we’re tied to a particular RPC library. It would be tedious for us to adapt our interposition to the impressive range of systems Jepsen has been applied to.

[2] How we perform this minimization is outside the scope of this blog post. Minimization is, in my opinion, the most interesting part of what we’re doing here. Check out our paper draft for more information!

Distributed systems have two distinguishing features:

  • Asynchrony, or “absence of synchrony”: messages from one process to another do not arrive immediately. In a fully asynchronous system, messages may be delayed for unbounded periods of time. In contrast, synchronous networks always provide bounded message delays.
  • Partial failure: some processes in the system may fail while other processes continue executing.

It’s the combination of these two features that makes distributed systems really hard; the crux of many impossibility proofs is that nodes in a fully asynchronous system can’t distinguish message delays from failures.

In practice, networks are somewhere between fully asynchronous and synchronous. That is, most (but not all!) of the time, networks give us sufficiently predictable message delays to allow nodes to coordinate successfully in the face of failures.

When designing a distributed algorithm, however, common wisdom says that you should try to make as few assumptions about the network as possible. The motivation for this principle is that minimizing your algorithm’s assumptions about message delays maximizes the likelihood that it will work when placed in a real network (which may, in practice, fail to meet bounds on message delays).

On the other hand, if your network does in fact provide bounds on message delays, you can often design simpler and more performant algorithms on top of it. An example of this observation that I find particularly compelling is Speculative Paxos, which co-designs a consensus algorithm and the underlying network to improve overall performance.

At the risk of making unsubstantiated generalizations, I get the sense that theorists (who have dominated the field of distributed computing until somewhat recently) tend to worry a lot about corner cases that jeopardize correctness properties. That is, it’s the theorist who’s telling us to minimize our assumptions. In contrast, practitioners are often willing to sacrifice correctness in favor of simplicity and performance, as long as the corner cases that cause the system to violate correctness are sufficiently rare.

To resolve the tension between the theorists’ and the practitioners’ principles, my half-baked idea is that we should attempt to answer the following question: “How asynchronous are our networks in practice?”

Before outlining how one might answer this question, I need to provide a bit of background.

Failure Detectors

In reaction to the overly pessimistic asynchrony assumptions made by impossibility proofs, theorists spent about a decade [1] developing distributed algorithms for “partially synchronous” network models. The key property of the partially synchronous model is that at some point in the execution of the distributed system, the network will start to provide bounds on message delays, but the algorithm won’t know when that point occurs.

The problem with the partially synchronous model is that algorithms built on top of it (and their corresponding correctness proofs) are messy: the timing assumptions of the algorithm are strewn throughout the code, and proving the algorithm correct requires you to pull those timing assumptions through the entire proof until you can finally check at the end whether they match up with the network model.

To make reasoning about asynchrony easier, a theorist named Sam Toueg along with a few others at Cornell proposed the concept of failure detectors. Failure detectors allow algorithms to encapsulate timing assumptions: instead of manually setting timers to detect failures, we design our algorithms to ask an oracle about the presence of failures [2]. To implement the oracle, we still use timers, but now we have all of our timing assumptions collected cleanly in one place.

Failure detectors form a hierarchy. The strongest failure detector has perfect accuracy (it never falsely accuses nodes of failing) and perfect completeness (it always informs all nodes of all failures). Weaker failure detectors might make mistakes, either by falsely accusing nodes of having crashed, or by neglecting to detect some failures. The different failure detectors correspond to different points on the asynchrony spectrum: perfect failure detectors can only be implemented in a fully synchronous network [3], whereas imperfect failure detectors correspond to partial synchrony.
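To make the accuracy/completeness trade-off concrete, here is a toy timeout-based failure detector (illustrative): it is complete (a crashed node stops heartbeating and is eventually suspected) but not accurate (a slow network can cause false accusations).

```python
class HeartbeatDetector:
    """Toy imperfect failure detector based on heartbeat timeouts."""
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_heard = {}

    def heartbeat(self, node, now):
        self.last_heard[node] = now

    def suspected(self, now):
        # A suspected node may be crashed -- or its messages may just be
        # delayed longer than the timeout (a false accusation).
        return {n for n, t in self.last_heard.items()
                if now - t > self.timeout}
```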

Measuring Asynchrony

One way to get a handle on our question is to measure the behavior of failure detectors in practice. That is, one could implement imperfect failure detectors, place them in networks of different kinds, and measure how often they falsely accuse nodes of failing. If we have ground truth on when nodes actually fail in a controlled experiment, we can quantify how often those corner cases theorists are worried about come up.

Anyone interested in getting their hands dirty?


[1] Starting in 1988 and dwindling after 1996.

[2] Side note: failure detectors aren’t widely used in practice. Instead, most distributed systems use ad-hoc network timeouts strewn throughout the code. At best, distributed systems use adaptive timers, again strewn throughout the code. A library or language that encourages programmers to encapsulate timing assumptions and explicitly handle failure detection information could go a long way towards improving the simplicity, amenability to automated tools, and robustness of distributed systems.

[3] Which is equivalent to saying that they can’t be implemented. Unless you can ensure that the network itself never suffers from any failures or congestion, you can’t guarantee perfect synchrony. Nonetheless, some of the most recent network designs get us pretty close.

A typical web page is composed of multiple objects: HTML files, Javascript files, CSS files, images, etc.

When your browser loads a web page, it executes a list of tasks: first it needs to fetch the main HTML, then it can parse each of the HTML tags to know what other objects to fetch, then it can process each of the fetched objects and their effect on the DOM, and finally it can render pixels to your screen.

To load your web page as fast as possible, the browser tries to execute as many of these tasks as it can in parallel. The less time the browser spends sitting idle waiting for tasks to finish, the faster the web page will load.

It is not always possible to execute tasks in parallel. This is because some tasks have dependencies on others. The most obvious example is that the browser needs to fetch the main HTML before it can know what other objects to fetch [1].

In general, the more dependencies a web page has, the longer it will take to load. Prudent web developers structure their web pages in a way that minimizes browsers’ task dependencies.

A particularly nasty dependency is Javascript execution. Whenever the browser encounters a Javascript tag, it stops all other parsing and rendering tasks, waits to fetch the Javascript, executes it until completion, and finally restarts the previously blocked tasks. Browsers enforce this dependency because Javascript can modify the DOM; by modifying the DOM, Javascript might affect the execution of all other parsing and rendering tasks.

Placing Javascript tags at the beginning of an HTML page can incur a huge performance hit, since each script adds 1 RTT plus computation time to the overall page load time.

Fortunately, the HTML standard provides a mechanism that allows developers to mitigate this cost: the defer attribute. The defer attribute tells the browser that it’s OK to fetch and execute a Javascript tag asynchronously.

Unfortunately, using the defer tag is not straightforward. The issue is that it’s hard for the web developer to know whether it’s safe to allow the browser to execute Javascript asynchronously. For instance, the Javascript may actually need to modify the DOM to ensure the correct execution of the page, or it may depend on other resources (e.g. other Javascript tags).

Forcing web developers to reason about these complicated (and often hidden!) dependencies is, at best, a lot to ask for, and at worst, highly error-prone. For this reason few web developers today make use of defer tags.

So here’s my half-baked idea: wouldn’t it be great if we had a compiler that could automatically mark defer attributes? Specifically, let’s apply static or dynamic analysis to infer when it’s safe for Javascript tags to execute asynchronously. Such a tool could go a long way towards improving the performance and correctness of the web.


[1] See the WProf paper for a nice overview of browser activity dependencies.

gdb, although an incredibly powerful tool for debugging single programs, doesn’t work so well for distributed systems.

The crucial difference between a single program and a distributed system is that distributed computation revolves around network messages. Distributed systems spend much of their time doing nothing more than waiting for network messages. When they receive a message, they perform computation, perhaps send out a few network messages of their own, and then return to their default state of waiting for more network messages.

Because there’s a network separating the nodes of the distributed system, you can’t (easily) pause all processes and attach gdb. And, in the words of Armon Dadgar, “even if you could, the network is part of your system. Definitely not going to be able to gdb attach to that.”

Suppose that you decide to attach gdb to a single process in the distributed system. Even then, you’ll probably end up frustrated. You’re going to spend most of your time waiting on a select or receive breakpoint. And when your breakpoint is triggered, you’ll find that most of the messages won’t be relevant for triggering your bug. You need to wait for a specific message, or even a specific sequence of messages, before you’ll be able to trace through the code path that leads to your bug.

Crucially, gdb doesn’t give you the ability to control network messages, yet network messages are what drive the distributed system’s execution. In other words, gdb operates at a level of abstraction that is lower than what you want.

Distributed systems need a different kind of debugger. What we need is a debugger that will allow us to step through the distributed system’s execution at the level of network messages. That is, you should be able to generate messages, control the order in which they arrive, and observe how the distributed system reacts.

Shameless self-promotion: STS supports an “Interactive Mode” that takes over control of the (software) network separating the nodes of a distributed system. This allows you to interactively reorder or drop messages, inject failures, or check invariants. We need something like this for testing and debugging general distributed systems.

As a graduate student, I find that the rate of progress I’m able to make on my current research project is significantly lower than the rate at which I encounter ideas for new research projects. Over time, this means that the number of half-baked ideas jotted down in my notebook grows without bound.

In Academia, we sometimes feel dissuaded from sharing our half-baked ideas. Our fear is that we may get ‘scooped’; that is, we worry that if we share an idea before we have time to flesh it out, someone else may take that idea and turn it into a fully-fledged publication, thereby stealing our opportunity to publish.

Until now, I haven’t publicly shared any of my half-baked ideas. I would like to change that [1].

So, in the hope of generating discussion, I’ll be posting a series of half-baked ideas. Please feel welcome to steal them, criticize them, or add to them!

[1] In part, this is because I have come to believe that academic caginess is petty. More importantly though, I have come to terms with the reality that I will not have time to pursue most of these ideas.

At Berkeley I have the opportunity to work with some of the smartest undergrads around. One of the undergrads I work with, Andrew Or, did some neat work on modeling the performance of network control plane systems (e.g. SDN controllers). He decided to take a once-in-a-lifetime opportunity to join Databricks before we got the chance to publish his work, so in his stead I thought I’d share his work here.

An interactive version of his performance model can be found at this website. Description from the website:

A key latency metric for network control plane systems is convergence time: the duration between when a change occurs in a network and when the network has converged to an updated configuration that accommodates that change. The faster the convergence time, the better.

Convergence time depends on many variables: latencies between network devices, the number of network devices, the complexity of the replication mechanism used (if any) between controllers, storage latencies, etc. With so many variables it can be difficult to build an intuition for how the variables interact to determine overall convergence time.

The purpose of this tool is to help build that intuition. Based on analytic models of communication complexity for various replication and network update schemes, the tool quantifies convergence times for a given topology and workload. With it, you can answer questions such as “How far will my current approach scale while staying within my SLA?”, and “What is the convergence time of my network under a worst-case workload?”.

The tool is insightful (e.g. note the striking difference between SDN controllers and traditional routing protocols) and a lot of fun to play around with; I encourage you to check it out. In case you are curious about the details of the model or would like to suggest improvements, the code is available here. We also have a 6-page write up of the work, available upon request.

I often overhear a recurring debate amongst researchers: is Academia a good place to build real software systems? By “real”, we typically mean “used”, particularly by people outside of academic circles.

There have certainly been some success stories. BSD, LLVM, Xen, and Spark come to mind.

Nonetheless, some argue that these success stories came about at a time when the surrounding software ecosystem was nascent enough for a small group of researchers to be able to make a substantial contribution, and that the ecosystem is normally at a point where researchers cannot easily contribute. Consider for example that BSD was initially released in 1977, when very few open source operating systems existed. Now we have Linux, which has almost 1400 active developers.

Is this line of reasoning correct? Is the heyday of Academic systems software over? Will it ever come again?

Without a doubt, building real software systems requires substantial (wo)manpower; no matter how great the idea is, implementing it will require raw effort.

This fact suggests an indirect way to evaluate our question. Let’s assume that (i) any given software developer can only produce a fixed (constant) amount of coding progress in a fixed timeframe and (ii) the maturity of the surrounding software ecosystem is proportional to collective effort put into it. We can then approximate an answer to our question by looking at the number of software developers in industry vs. the number of researchers over time.

It turns out that the Bureau of Labor Statistics publishes exactly the data we need for the United States. Here’s what I found:

OES data

Hm. The first thing we notice is that it’s hard to even see the line for academic and industrial researchers. To give you a sense of where it’s at, the y-value for May 2013 for computer science teachers and professors is 35,770, two orders of magnitude smaller than the 3,339,440 total employees in the software industry at that time.

What we really care about though is the ratio of employees in industry to number of researchers:

OES ratio data

In the last few years, both the software industry and Academia have been growing at roughly the same rate, whereas researchers in industrial labs appear to be dropping off relative to the software industry. We can see these relative growth rates better by normalizing the datasets (dividing each datapoint by the maximum datapoint in its series — it might be better to take the derivative, but I’m too lazy to figure out how to do that at the moment):

OES normalized data
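For concreteness, here’s a minimal sketch of that normalization step. The series values below are made up for illustration; they are not the BLS figures:

```python
# Normalize each series by dividing every datapoint by the series maximum,
# so relative growth rates become comparable across series of very
# different magnitudes. The numbers here are illustrative, not BLS data.

def normalize(series):
    """Scale a series so its maximum value becomes 1.0."""
    peak = max(series)
    return [point / peak for point in series]

industry = [1_200_000, 2_400_000, 3_339_440]   # hypothetical employee counts
professors = [18_000, 30_000, 35_770]          # hypothetical headcounts

print(normalize(industry))
print(normalize(professors))
```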

The data for the previous graphs only goes back to 1995. The Bureau of Labor Statistics also publishes coarser-grained data going all the way back to 1950 and beyond:

NES data

(See the hump around 2001?)

Not sure if this data actually answers our initial question, but I certainly found it insightful! If you’d like more details on how I did this analysis, or would like to play around with the data for yourself, see my code.

I recently came across a statement in Aphyr’s excellent Jepsen blog series that caught my eye:

“In a very real sense, [network] partitions are just really big windows of concurrency.”

This statement seems to imply that distributed systems are “equivalent” to parallel (single-machine) computing systems, for the following reason: partitions, which occur in a network but don’t really occur on a single chip [0], appear to be the key distinguishing property of distributed systems. But if partitions are just a special case of concurrency, then there shouldn’t be any fundamental reasons why algorithms for multicore computational models (such as PRAM) wouldn’t be perfectly suitable for solving all the problems we might encounter in a distributed setting. We know this to be false, so I’ve been trying to puzzle out precisely what properties of distributed computing distinguish it from parallel computing [1].

I’ve been taught that distributed systems have two crucial features:

  • Asynchrony, or “absence of synchrony”: messages from one process to another do not arrive immediately. In a fully asynchronous system, messages may be delayed for unbounded periods of time.
  • Partial failure: some processes in the system may fail while other processes continue executing.

Let’s discuss these two properties separately.

Asynchrony
Parallel systems also exhibit asynchrony, as long as it’s possible for there to be a delay between one process sending a message [2] and the other processes having the opportunity to read that message. Even on a single machine, this delay might be induced by locks within the operating system kernel, or by the cache coherence protocol implemented in hardware on a multicore chip.

With this in mind, let’s return to Aphyr’s statement. What exactly did he mean by “big windows of concurrency”? His article focuses on what happens when multiple clients write to the same database key, so by “concurrency” I think he is referring to situations where multiple processes might simultaneously issue writes to the same piece of state. But if you think about it, the entire execution is a “big window of concurrency” in this sense, regardless of whether the database replicas are partitioned. By “big windows of concurrency” I think Aphyr was really talking about asynchrony, or more precisely, periods of high message delivery delays: network partitions are hard to deal with precisely because messages between replicas aren’t deliverable until after the partition is recovered. When replicas can’t coordinate, it’s challenging (or impossible, if the system chooses to enforce linearizability) for them to correctly process those concurrent writes. Amending Aphyr’s statement then:

“Network partitions are just really big windows of asynchrony.”

Does this amendment resolve our quandary? Someone could rightly point out that because partitions don’t really occur within a single chip [0], parallel systems can effectively provide guarantees on how long message delays can last [3], whereas partitions in distributed systems may last arbitrarily long. Some algorithms designed for parallel computers might therefore break in a distributed setting, but I don’t think this is really the distinction we’re looking for.

Partial Failure

Designers of distributed algorithms codify their assumptions about the possible ways nodes can fail by specifying a ‘failure model’. Failure models might describe how many nodes can fail—for example, quorum-based algorithms assume that no more than N/2 nodes ever fail, otherwise they cannot make progress—or they might spell out how individual crashed nodes behave. The latter constraint forms a hierarchy, where weaker failure models (e.g. ‘fail-stop’, where crashed nodes are guaranteed to never send messages again) can be reduced to special cases of stronger models (e.g. ‘Byzantine’, where faulty nodes can behave arbitrarily, even possibly mimicking the behavior of correct nodes) [4].

Throughout the Jepsen series, Aphyr tests distributed systems by (i) telling clients to issue concurrent writes, (ii) inducing a network partition between database replicas, and (iii) recovering the partition. Observe that Jepsen never actually kills replicas! This failure model is actually weaker than fail-stop, since nodes are guaranteed to eventually resume sending messages [5]. Aphyr’s statement is beginning to make sense:

“Network partitions that are followed by network recovery are just really big windows of asynchrony.”

This statement is true; from the perspective of a node in the system, a network partition followed by a network recovery is indistinguishable from a random spike in message delays, or peer nodes that are just very slow to respond. In other words, a distributed system that guarantees that messages will eventually be deliverable to all nodes is equivalent to an asynchronous parallel system. But if any nodes in the distributed system actually fail, we’re no longer equivalent to a parallel system.

Who cares?

This discussion might sound like academic hairsplitting, but I claim that these distinctions have practical implications.

As an example, let’s imagine that you need to make a choice between shared memory versus message passing as the communication model for the shiny new distributed system you’re designing. If you come from a parallel computing background you would know that message passing is actually equivalent to shared memory, in the sense that you can use a message passing abstraction to implement shared memory, and vice versa. You might therefore conclude that you are free to choose whichever abstraction is more convenient or performant for your distributed system. If you jumped to this conclusion you might end up making your system more fragile without realizing it. Message passing is not equivalent to shared memory in distributed systems [6], precisely because distributed systems exhibit partial failures; in order to correctly implement shared memory in a distributed system it must always be possible to coordinate with a quorum, or otherwise be able to accurately detect which nodes have failed. Message passing does not have this limitation.
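As a toy illustration of the quorum point (my sketch, not an example from the post or from Lynch): any two majorities of the same replica set must overlap, which is what lets a quorum-based shared-memory emulation guarantee that every read quorum intersects the most recent write quorum.

```python
# Check, by brute force, that any two majority quorums over the same set
# of nodes intersect. This intersection property is the foundation of
# quorum-based shared-memory emulations (e.g. majority-vote registers).
from itertools import combinations

def majority_quorums(nodes):
    """All subsets of `nodes` containing a strict majority."""
    need = len(nodes) // 2 + 1
    return [set(c) for size in range(need, len(nodes) + 1)
            for c in combinations(sorted(nodes), size)]

nodes = {"r1", "r2", "r3", "r4", "r5"}
quorums = majority_quorums(nodes)
assert all(q1 & q2 for q1 in quorums for q2 in quorums)
print(f"{len(quorums)} quorums, every pair intersects")
```

Of course, this only shows why quorums suffice when a majority is reachable; once more than N/2 nodes fail, no majority quorum can be assembled and the emulation stalls.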

Another takeaway from this discussion is that Jepsen is actually testing a fairly weak failure model. Despite Jepsen’s simplicity, Aphyr has managed to uncover problems in an impressive number of distributed databases. If we want to uncover yet more implicit assumptions about how our systems behave, stronger failure models seem like an excellent place to look.

[0] After I posted this blog post, Aphyr and others informed me that some of the latest multicore chips are in fact facing partial failures between cores due to voltage issues. This is quite interesting, because as multicore chips grow in transistor density, the distinction between parallel computing and distributed computing is becoming more and more blurred: modern multicore chips face both unbounded asynchrony (from the growing gap between levels of the memory hierarchy) and partial failure (from voltage issues).

[1] Thanks to Ali Ghodsi for helping me tease out the differences between these properties.

[2] or writing to shared memory, which is essentially the same as sending a message.

[3] See, for example, PRAM or BSP, which assume that every node can communicate with every other node within each “round”. It’s trivial to solve hard problems like consensus in this world, because you can always just take a majority vote and decide within two rounds.

[4] See Ali Ghodsi’s excellent slides for a taxonomy of these failure models.

[5] Note that this is not equivalent to ‘crash-recovery’. Crash-recovery is actually stronger than fail-stop, because nodes may recover or they may not.

[6] Nancy Lynch, “Distributed Algorithms”, Morgan Kaufmann, 1996.

According to a study by Turner et al. [1], wide area network links have an average of 1.2 to 2.7 days of downtime per year. This translates to roughly two and a half 9’s of reliability [2].

I was curious how this compared to datacenter links, so I took a look at Gill et al.’s paper [3] on datacenter network failures at Microsoft. Unfortunately some of the data has been redacted, but I was able to reverse engineer the mean link downtime per year with the help of Aurojit Panda’s svg-to-points converter. The results are interesting: across all link types, the average downtime was 0.3 days. This translates to roughly three and a half 9’s of reliability, an order of magnitude less downtime than WAN links.
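These “nines” figures are easy to sanity-check with a quick back-of-the-envelope calculation (mine, not from either paper):

```python
# Convert mean downtime per year into "nines" of reliability:
# availability = 1 - downtime/total, and the number of nines is
# -log10(1 - availability), i.e. -log10 of the unavailable fraction.
import math

def nines(downtime_days_per_year):
    unavailability = downtime_days_per_year / 365
    return -math.log10(unavailability)

print(round(nines(1.2), 2))  # WAN links, best case
print(round(nines(2.7), 2))  # WAN links, worst case
print(round(nines(0.3), 2))  # datacenter links
```

By this arithmetic, WAN links land between roughly 2.1 and 2.5 nines, and datacenter links around 3.1 — about a full “nine” better, with the exact figure depending on how you round.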

Intuitively this makes sense. WAN links are much more prone to drunken hunters, bulldozers, wild dogs, ships dropping anchor and the like than links within a secure datacenter.


[1] Daniel Turner, Kirill Levchenko, Alex C. Snoeren, and Stefan Savage. California Fault Lines: Understanding the Causes and Impact of Network Failures, Table 4. SIGCOMM ‘10.

[2] Note that this statistic is specifically about hardware failure, not overall network availability.

[3] Phillipa Gill, Navendu Jain, and Nachiappan Nagappan. Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications, Figures 8c & 9c. SIGCOMM ‘11.

In 2010, Jeff Dean gave a talk that laid out a list of numbers every programmer should know. His list has since become relatively well known among the systems community.

The other day, a friend mentioned a latency number to me, and I realized that it was an order of magnitude smaller than what I had memorized from Jeff’s talk. The problem, of course, is that hardware performance increases exponentially! After some digging, I actually found that the numbers Jeff quotes are over a decade old [1].

Partly inspired by my officemate Aurojit Panda, who is collecting awesome data on hardware performance, I decided to write a little tool [2] to visualize Jeff’s numbers as a function of time [3].

Without further ado, here it is.
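The kind of extrapolation involved can be sketched in a few lines. The baseline latency and annual improvement rate below are arbitrary illustrative assumptions, not the parameters the tool actually uses:

```python
# A toy sketch of extrapolating a latency figure over time, assuming a
# constant annual rate of improvement. Both numbers below are made up
# for illustration.

def extrapolate(latency_ns, from_year, to_year, annual_improvement=0.3):
    """Scale a latency down by `annual_improvement` per year elapsed."""
    years = to_year - from_year
    return latency_ns * (1 - annual_improvement) ** years

# e.g. a hypothetical 100 ns figure measured in 2001, projected to 2012:
print(extrapolate(100.0, 2001, 2012))
```

The interesting part, of course, is that different hardware components improve at very different rates, which is why a single decade-old list of numbers drifts out of date unevenly.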


[1] Jeff’s numbers are from 2001, and were first publicized by Peter Norvig in this article.

[2] Layout stolen directly from ayshen on GitHub.

[3] The hardware trends I’ve gathered are rough estimates. If you want to tweak the parameters yourself, I’ve made it really easy to do so — please send me updates! Better yet, issue a pull request.