<h2>Do Toll-Free Phone Lines Violate Network Neutrality? A Blueprint for Providing Toll-Free Cellular Data</h2>
<p><em>2017-05-17</em></p>
<p>The network neutrality debate focuses primarily on supply-side issues, such as whether content providers should be
allowed to pay Internet service providers for “fast lanes”, and whether allowing them to do so would stifle innovation.</p>
<p>It seems fruitful to also look at network neutrality through a
different lens. In this post I explore the following question:</p>
<blockquote>
<p>What implications do network neutrality regulations have on equality of access to the Internet?</p>
</blockquote>
<h3 id="the-moral-dilemma">The Moral Dilemma</h3>
<p>I’ve had a recurring debate with my friends who support network neutrality. Most of them agree with the following premises:</p>
<ul>
<li>
<p>Society should strive to provide equal <em>opportunities</em> to all citizens.</p>
</li>
<li>
<p>Access to the Internet provides opportunities.</p>
</li>
<li>
<p>There are over a billion people in the world who cannot afford the costs of Internet data.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">1</a></sup></p>
</li>
<li>
<p>Therefore, something should be done (immediately) to provide more
affordable access to the Internet.</p>
</li>
</ul>
<p>But when I try to take the argument further (as follows), the debate gets heated:</p>
<ul>
<li>
<p>Some content providers (most notably Facebook and the other members of the Free Basics program)
are willing to pay Internet providers for the costs that consumers incur by accessing their content.<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">2</a></sup></p>
</li>
<li>
<p>Preventing these content providers from subsidizing Internet access (on
network neutrality grounds<sup id="fnref:9" role="doc-noteref"><a href="#fn:9" class="footnote" rel="footnote">3</a></sup>) <em>prolongs</em> inequalities in society.</p>
</li>
</ul>
<h3 id="who-will-pay">Who Will Pay?</h3>
<p>Although my net neutrality friends may agree that we should not
prolong inequalities in society, they disagree that we should allow content providers
to pay for data costs. According to them, the negative implications that might
result outweigh the benefits of improved access.</p>
<p>But who else will pay?</p>
<p>One answer: the government. The government is (in most of the world<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup>) a
democratically elected body, accountable to the people. It could, in theory,
implement an affordable Internet subsidy while
maintaining network neutrality.</p>
<p>Personally, I’m skeptical that government-led data subsidies will ever become
widespread, particularly in countries where
access inequalities are most stark. Governments are slow to move,
struggle to sustain projects that span multiple election cycles,
and are not without their own conflicts of interest (e.g., consider that
governments sell spectrum rights for billions of dollars,<sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup> and may even seek
to maintain high data costs as a form of censorship<sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup>).</p>
<p>Clearly this is a complex issue. In lieu of a solution to the moral dilemma, I’d like to pursue
another line of argumentation with my net neutrality friends:</p>
<h3 id="the-definitional-dilemma">The Definitional Dilemma</h3>
<p>Before the World Wide Web (and still to this day!), phone calls were a very common way for
people to access information. As with billing on the Internet, the consumer of voice
information (the caller) normally pays the phone provider for the costs of
transporting their phone calls.</p>
<p>Some <em>voice</em> content providers (customer service
centers, crime reporting centers, suicide help lines, etc.) decided that they
would like to allow consumers of their services to call in for free.
Instead of having the consumer pay, the content providers pay the phone provider for the costs of
maintaining their toll-free phone line.</p>
<p>The parallels to the Internet are striking. Many of the arguments against Internet content providers also seem to apply to voice content providers.
(What if a small innovative startup cannot afford to pay for a toll-free phone line? etc.).</p>
<p>Yet, toll-free phone lines aren’t bothersome to my net neutrality
friends. I think it would be valuable for us to understand why!</p>
<p>I suspect that the <em>content</em> being served (and the companies that are doing the serving) is the real source of discomfort
for people who oppose Free Basics. The same people don’t have much of an issue with toll-free phone
lines because toll-free phone lines are commonly used by governments, non-profits and companies with ostensibly charitable intentions.</p>
<p>Yet there are plenty of governments, non-profits, and companies with charitable intentions who are eager to
pay to deliver <em>Internet</em> content to consumers who can’t afford it. Which
brings us to an actionable takeaway from this whole discussion:</p>
<h3 id="toll-free-data-via-voice-calls">Toll-Free Data via Voice Calls</h3>
<p>Remember the good old days of dial-up modems? Dial-up modems transferred
Internet data over <em>voice</em> lines, by encoding data as audio signals.</p>
<p>There’s no reason we can’t do the same over cellular voice calls to deliver data to a
mobile phone. We just need software (a “modem”) running on the
phone to decode the audio signals. In fact, this idea has been proposed
before, in a slightly different context.<sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup></p>
<p>Now, if a well-meaning company wants to provide data to consumers free of
charge, they could pay for a toll-free phone line (or a return-your-missed-call phone line<sup id="fnref:11" role="doc-noteref"><a href="#fn:11" class="footnote" rel="footnote">8</a></sup>) for consumers to call, and run a server that transfers data over that
voice line whenever it receives a call. An application running on the phone could periodically make calls (without
the user needing to intervene) to the toll-free phone number to retrieve small
amounts of data.</p>
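<p>To make this concrete, here is a minimal sketch of what the client-side logic might look like, in Scala. Every name in it is hypothetical: the <code class="language-plaintext highlighter-rouge">VoiceCall</code> and <code class="language-plaintext highlighter-rouge">Dialer</code> interfaces stand in for whatever audio “modem” layer actually gets built (e.g., something Hermes-style).</p>
<pre><code>// Hypothetical sketch: VoiceCall and Dialer are stand-ins for the audio
// "modem" layer; they are not a real API.
trait VoiceCall {
  def sendAudio(bytes: Array[Byte]): Unit      // modulate bytes onto the voice channel
  def receiveAudio(maxBytes: Int): Array[Byte] // demodulate the server's response
  def hangUp(): Unit
}
trait Dialer { def dial(number: String): VoiceCall }

class TollFreeDataClient(dialer: Dialer, tollFreeNumber: String) {
  // Place one toll-free call and fetch a small payload. Voice channels have
  // very low bitrates, so payloads must stay small (a few kilobytes).
  def fetchOnce(request: Array[Byte]): Array[Byte] = {
    val call = dialer.dial(tollFreeNumber)
    try {
      call.sendAudio(request) // e.g., "send me today's headlines"
      call.receiveAudio(maxBytes = 16 * 1024)
    } finally call.hangUp()
  }
}
</code></pre>
<p>An application would invoke <code class="language-plaintext highlighter-rouge">fetchOnce</code> from a periodic background job, batching requests to amortize call setup time.</p>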
<p>What’s neat about this scheme is that we can implement it right now, without needing to change business models or wait on regulatory decisions. And it’s not fundamentally any different than toll-free phone lines, or even a regular return-your-missed-call phone system.<sup id="fnref:10" role="doc-noteref"><a href="#fn:10" class="footnote" rel="footnote">9</a></sup></p>
<p>Anyone have spare cycles to build the modem and server infrastructure? I’ve already got some ideas for great applications we could build
on top of toll-free data.</p>
<hr />
<p>Thanks to Bill Thies, Aurojit Panda, and Sachin Gaur for helping me shape these thoughts.</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:2" role="doc-endnote">
<p><a href="https://info.internet.org/en/wp-content/uploads/sites/4/2016/07/state-of-connectivity-2015-2016-02-21-final.pdf">Internet.org State of Connectivity Report</a> <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>In fact there are two startups, <a href="https://www.jana.com/home">Jana</a> and <a href="https://www.movivo.com/">Movivo</a>, who are in the business of making it easy for content providers to reimburse consumers for the costs of their data usage, by sending mobile top-ups to the consumer after they have consumed the data. In India, Airtel rolled out a <a href="https://en.wikipedia.org/wiki/Airtel_Zero">business model</a> that allowed content providers to pay for data, but this has come under substantial criticism from network neutrality advocates. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:9" role="doc-endnote">
<p>As in the <a href="https://www.theguardian.com/technology/2016/may/12/facebook-free-basics-india-zuckerberg">ban of Free Basics within India</a>. <a href="#fnref:9" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p><a href="https://ourworldindata.org/democracy/">Historical data</a> on the fraction of the world population living in democratic countries. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p><a href="https://www.livemint.com/Industry/xt5r4Zs5RmzjdwuLUdwJMI/Spectrum-auction-ends-after-lukewarm-response-from-telcos.html">India ends spectrum auction for 9.5 billion dollars</a> <a href="#fnref:5" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:6" role="doc-endnote">
<p>As in Jordan or Eritrea. <a href="#fnref:6" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:7" role="doc-endnote">
<p><a href="https://www.cs.nyu.edu/~jchen/publications/com31a-dhananjay.pdf">Hermes: Data Transmission over Unknown Voice Channels</a>. <a href="#fnref:7" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:11" role="doc-endnote">
<p>With the (sole?) exception of the USA, phone calls are billed only to the caller, not the receiver. By giving a missed call (for free), the client can signal to the server to call them back, such that the server bears the cost of the phone call rather than the client. <a href="#fnref:11" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:10" role="doc-endnote">
<p>This does not require a toll-free number; you can run it over a normal phone number. <a href="#fnref:10" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<hr />
<h2>Do the benefits of technological innovation trickle down?</h2>
<p><em>2016-08-13</em></p>
<p>In my job interviews earlier this year I had the opportunity to speak with many interesting people.
I posed the following question to several of my interviewers:</p>
<blockquote>
<p>Do technological innovations [that computer scientists produce] benefit society as
a whole, or do they primarily improve the livelihood of people and institutions who are already well off?</p>
</blockquote>
<p>Strikingly, I received a nearly identical response from three people I spoke to. Their response was
something along the lines of the following:</p>
<blockquote>
<p>“The innovations from computer science (are one of the few types of innovation that)
trickle down”</p>
</blockquote>
<p>In this blog post I seek to evaluate their claim.</p>
<h3 id="scoping-the-hypothesis">Scoping the hypothesis</h3>
<p>The question I posed was admittedly vague. What exactly did I mean by
‘benefit’, and how can we possibly make a statement about
all technological innovations?</p>
<p>To scope the discussion, let’s focus on a more specific question:</p>
<blockquote>
<p>Does progress in information technology correlate with an improvement in poverty rates?</p>
</blockquote>
<p>We’d like to know, at a macroscopic level, whether IT helps people get out of poverty. Causation
is difficult to argue, but if there is causation we should expect there to
also be correlation.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>
<h3 id="data-from-the-united-states">Data from the United States</h3>
<p>In the following figure, Kentaro Toyama
<a href="https://www.youtube.com/watch?v=cxutDM2r534">examines</a> the
data for the United States, using the federal
government’s <a href="https://en.wikipedia.org/wiki/Poverty_in_the_United_States">definition of
poverty</a>:</p>
<p><img src="https://eecs.berkeley.edu/~rcs/research/kentaro_data.png" alt="US Poverty Data" height="800px" width="800px" /></p>
<p>As we see, poverty rates in the USA haven’t budged since the early 1970s, and
the <em>absolute</em> number of people living in poverty has actually gone up. This
is despite huge advances in the proliferation of information technologies like
the Internet, the world wide web, and smartphones.</p>
<p>From this, we can reasonably conclude that information technology <em>alone</em> has not had much
of an effect on poverty numbers in the USA; there must be some other factors
preventing such a large number of people from getting out of poverty.</p>
<h3 id="data-from-the-world">Data from the World</h3>
<p>The USA is just one country, and it is an anomalous country in many ways. Perhaps
information technology does more to help the impoverished in the rest of the world?
The remarkable chart below uses the World Bank’s definition<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>
of extreme poverty ($1.25 per day, adjusted for purchasing power parity):
<p><img src="https://eecs.berkeley.edu/~rcs/research/poverty_global.jpg" alt="Worldwide Poverty Data" height="800px" width="800px" /></p>
<p>According to one reading, this data does not provide reason to doubt our
trickle-down hypothesis:
there is a downward trend in the absolute number of people living in
extreme poverty, which may be partially due to the proliferation of information
technologies.</p>
<p>A less optimistic
<a href="https://www.humanosphere.org/basics/2015/09/global-poverty-falling-not-fast-may-think/">reading</a> is that the decline in poverty
numbers we see starting in the 1970s (well before the popularity of the Web) can largely be attributed to
changes in the Chinese government’s policies, and, more recently, India’s own reforms starting in 1991.<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup> It seems entirely feasible that
information technology played a negligible role in these transformations.</p>
<p>I’m not personally convinced that the innovations we (systems researchers) produce, which are often
targeted towards the more privileged members of society, currently play a significant role in
the socioeconomic well-being of the poor.</p>
<h3 id="caveats-caveats-caveats">Caveats, caveats, caveats</h3>
<p>My conclusion certainly does not imply that computer scientists should stop
focusing on innovations targeted at the more privileged members of society.
It’s very difficult to predict what kinds of impact our innovations will have; for example,
the Apple engineers who developed the iPhone were explicitly targeting the
rich, yet the popularity of the iPhone sparked a drive towards cellular data
networks, which are now the dominant mode of Internet access in the developing world.</p>
<p>It’s also undeniably true that information technology positively touches the lives of the poor,
even if the jury is still out on whether it can play a significant role in helping large numbers of them improve their
socioeconomic well-being.</p>
<p>Personally, I’ve decided to pivot away from high tech. I’m
devoting the next year or two to understanding what role IT can play in the lives of the less privileged. We’ll
see what I learn!</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>Strictly speaking, neither correlation nor lack thereof prove anything about causation. But most scientists probably think <a href="https://en.wikipedia.org/wiki/Problem_of_induction">Hume</a> was just being pedantic. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>The World Bank announced plans to move their <a href="https://en.wikipedia.org/wiki/Extreme_poverty#cite_note-7">definition</a> of extreme poverty to $1.90 per day to recognize higher price levels in developing countries than previously estimated. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>See Chapters 2 and 6 of Angus Deaton’s “The Great Escape” for an in-depth analysis of this data. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<hr />
<h2>Technologies for Testing Distributed Systems, Part I</h2>
<p><em>2016-03-04</em></p>
<p>Last month, <a href="https://tagide.com/about.html">Crista Lopes</a> asked the twitterverse:</p>
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Twitter friends: what papers or frameworks do you recommend regarding distributed systems regression or integration testing techniques?</p>— Crista Lopes (@cristalopes) <a href="https://twitter.com/cristalopes/status/690663597752631296">January 22, 2016</a></blockquote>
<script async="" src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>She compiled the answers she received in a <a href="https://tagide.com/blog/research/distributed-systems-testing-the-lost-world/">blog post</a>, which ended on a
dispirited note:</p>
<blockquote>
<p>In spite of unit testing being a standard practice everywhere, things don’t
seem to have gotten any better for testing distributed systems end-to-end.</p>
</blockquote>
<p>From my viewpoint atop the ivory tower, the
state-of-the-art in testing distributed systems doesn’t seem quite as
disappointing as Crista’s blog post might lead you to believe. As I am
now wrapping up my dissertation on testing and debugging distributed systems, I feel
compelled to share some of what I’ve learned about testing over the last five
years.</p>
<p>Crista points out that there are several existing surveys of testing
techniques for distributed systems, e.g. Inés Sombra’s RICON 2014 <a href="https://speakerdeck.com/randommood/testing-in-a-distributed-world">talk</a> or Caitie McCaffrey’s <a href="https://cacm.acm.org/magazines/2016/2/197420-the-verification-of-a-distributed-system/fulltext">CACM article</a>.
Here, I’ll structure the discussion around the <em>challenges</em> posed by
different testing goals, and the tradeoffs different testing technologies make in overcoming those
challenges. I’ll mostly cover end-to-end techniques (per Crista’s original
question), and I’ll focus on academic research rather than
best practices.</p>
<p>Here we go!</p>
<h3 id="regression-testing-for-correctness-bugs">Regression Testing for Correctness Bugs</h3>
<p>Crista’s original question is about regression testing, so I’ll start there.
The regression testing problem for correctness bugs is the following:</p>
<ul>
<li><strong>We’re given</strong>: (i) a safety condition (assertion) that the system has
violated in the past, and (ii) the environmental conditions (e.g.
system configuration) that caused the system to violate the
safety condition.</li>
<li><strong>Our goal</strong>: we want to produce an oracle (automated test case) that will notify us
whenever the old bug resurfaces as we make new changes to the codebase.</li>
</ul>
<p>What’s hard about producing these oracles for distributed systems? A few challenges:</p>
<ul>
<li><strong>a) Non-determinism</strong>: we’d like our regression test to reliably reproduce
the bug whenever it resurfaces. Yet distributed systems depend on two major sources of
non-determinism: the order of messages delivered by the network, and clocks
(e.g., failure detectors need timeouts to know when to send heartbeat messages<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>).</li>
<li><strong>b) Timeliness</strong>: we’d like our regression test to complete within a
reasonable amount of time. Yet if we implement it naïvely, the regression test will
have to <em>sleep()</em> for a long time to ensure that all events have completed.</li>
</ul>
<p>One way to overcome both a) and b) is to <em>interpose</em> on message and
timer APIs. The basic idea here is to first record the behavior of the
non-deterministic components of the system (e.g., track the order of messages delivered by the network)
leading up to the original bug. Then, when we execute the regression test, we guide the behavior of those
non-deterministic components to stay as close as possible to the original recorded execution.<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">2</a></sup></p>
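<p>As a rough illustration, here is a sketch (in Scala) of how replay-guided delivery could be structured. This is an assumption about the shape of such a framework, not the interface of any particular tool; a real implementation must also handle messages that never reappear.</p>
<pre><code>import scala.collection.mutable

// Sketch: replay messages in a previously recorded order. The "fingerprint"
// identifies a message's logical content across runs.
case class MsgEvent(src: String, dst: String, fingerprint: String)

class ReplayScheduler(recorded: Seq[MsgEvent]) {
  private var cursor = 0
  private val pending = mutable.Buffer.empty[(MsgEvent, () => Unit)]

  // The test framework calls this instead of letting the RPC layer
  // deliver the message directly.
  def intercept(event: MsgEvent, deliver: () => Unit): Unit = {
    pending += ((event, deliver))
    drain()
  }

  // Deliver buffered messages only in the recorded order; hold back the rest.
  private def drain(): Unit = {
    var progress = true
    while (progress &amp;&amp; cursor &lt; recorded.length) {
      val idx = pending.indexWhere(_._1 == recorded(cursor))
      if (idx >= 0) {
        pending.remove(idx)._2.apply() // deliver the next expected message
        cursor += 1
      } else progress = false // next recorded event hasn't been sent yet; wait
    }
  }
}
</code></pre>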
<p>Interposition helps us produce reliable results, and it allows us to know
exactly when the test has completed so that we don’t need to <em>sleep()</em> for
arbitrary amounts of time. In some cases we can even run our regression tests
significantly faster than they would actually take in production, by delivering timer
events before the true wall-clock time for those timers has elapsed (without
the system being aware of this fact).<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">3</a></sup></p>
<p>However, interposition brings two additional challenges:</p>
<ul>
<li><strong>c) Engineering effort</strong>: depending on where we choose to interpose, we
might need to expend large amounts of effort to ensure that our executions are
sufficiently deterministic.</li>
<li><strong>d) Limited shelf-life</strong>: if we record interposition events at fine
granularity, the regression test is likely to break as soon as we make small changes to the
system (since the API calls invoked by the system will differ). Ideally we would like our regression
tests to remain valid even as we make substantial changes to the system under
test.</li>
</ul>
<h4 id="solutions">Solutions</h4>
<p>On one extreme, we could use <strong>deterministic replay</strong> tools to reliably
execute our regression test. These tools interpose on syscalls, signals, and certain non-deterministic instructions.<sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">4</a></sup>
If your distributed system has a small enough memory footprint, you can just execute deterministic replay with
all of your nodes on a single physical machine.<sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">5</a></sup><sup>,</sup><sup id="fnref:19" role="doc-noteref"><a href="#fn:19" class="footnote" rel="footnote">6</a></sup> There are also approaches for replaying across multiple machines.<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">7</a></sup><sup>,</sup><sup id="fnref:18" role="doc-noteref"><a href="#fn:18" class="footnote" rel="footnote">8</a></sup></p>
<p>Deterministic replay assumes that you are able to record an execution that leads up to the bug.
To replay that execution, these tools wait for the application to make the same sequence of syscalls as the original execution, and
return the same syscall values that were originally returned by the OS. Since the application must
go through the syscall layer to interact with the outside world (including, for example, to read the current time), we
are guaranteed determinism. That said, one major issue with using deterministic replay for regression testing is that the
execution recording has limited shelf-life (since syscall recordings are very fine-grained).</p>
<p>Another issue with deterministic replay is that you don’t always have a recorded execution
for bugs that you know exist. If you’re willing to wait long enough, it is possible to synthesize an
interleaving of messages / threads that leads up to the known bug.<sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">9</a></sup><sup>,</sup><sup id="fnref:12" role="doc-noteref"><a href="#fn:12" class="footnote" rel="footnote">10</a></sup><sup>,</sup><sup id="fnref:15" role="doc-noteref"><a href="#fn:15" class="footnote" rel="footnote">11</a></sup><sup>,</sup><sup id="fnref:13" role="doc-noteref"><a href="#fn:13" class="footnote" rel="footnote">12</a></sup><sup>,</sup><sup id="fnref:14" role="doc-noteref"><a href="#fn:14" class="footnote" rel="footnote">13</a></sup></p>
<p>Another point in the design space is <strong>application-specific interposition</strong>, where we interpose on a
narrow, high level API such as the RPC layer. We aren’t guaranteed determinism here, but we can achieve decent reliability
with a judicious choice of interposition locations.</p>
<p>One major advantage of application-specific interposition is
reduced recording overhead: since we’re interposing on a high level API, we might be able to turn on execution
recording in production to avoid needing to reproduce bugs in a test environment. Another advantage is extended shelf-life:
we’re interposing on coarse, high level events, and we also have access to application semantics that help us recognize
extraneous changes to the application’s behavior (e.g., we can know that cookies or sequence numbers
should be ignored when deciding whether a message
from the recorded execution is logically equivalent to a message in a replay execution).</p>
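<p>A toy example of such a fingerprint (the message shape here is made up):</p>
<pre><code>// Hypothetical message type: seqNo and timestampMs are expected to differ
// across runs, so the fingerprint masks them out.
case class RpcMessage(kind: String, seqNo: Long, timestampMs: Long, payload: String)

object Fingerprint {
  def of(m: RpcMessage): String = s"${m.kind}|${m.payload}"

  // A replayed message "matches" a recorded one if they agree on
  // everything except the masked, nondeterministic fields.
  def matches(recorded: RpcMessage, replayed: RpcMessage): Boolean =
    of(recorded) == of(replayed)
}
</code></pre>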
<p>[Shameless plug for my own work, which fits into the category of application-specific interposition: our
<a href="https://github.com/NetSys/demi">DEMi tool</a><sup id="fnref:10" role="doc-noteref"><a href="#fn:10" class="footnote" rel="footnote">14</a></sup> allows you to produce regression tests without having to write any code. First, you use it find correctness bugs
through randomized concurrency testing. You can then minimize the buggy execution, and finally you can replay the
execution as a regression test. DEMi interposes on timers to make the execution super fast, and allows you to specify message fields that should be
ignored to help increase the shelf-life of the recording.]</p>
<p>Finally, on the other extreme of the design space, we can just <strong>replay multiple times</strong> without any interposition and hope that the bug-triggering event interleaving shows
up at least once.<sup id="fnref:11" role="doc-noteref"><a href="#fn:11" class="footnote" rel="footnote">15</a></sup> This requires minimal engineering effort and has unbounded shelf-life, but it may be unable to consistently reproduce the buggy execution.</p>
<h3 id="regression-testing-for-performance-latency-bugs">Regression Testing for Performance (Latency) Bugs</h3>
<p>The regression testing problem for latency bugs is similar to above, with a
few differences:</p>
<ul>
<li><strong>We’re given</strong>: (i) an assertion that we know the system has
violated in the past, usually statistical in nature, about how long requests should take to be
processed by the system, and (ii) a description of the system’s workload
at the time the latency problem was detected.</li>
<li><strong>Our goal</strong>: we want to produce an oracle that will notify us
whenever request latency gets notably worse as we make new changes to the
system.</li>
</ul>
<p>A few challenges:</p>
<ul>
<li><strong>a) Flakiness</strong>: Performance characteristics typically exhibit large
variance. Despite that variance, we need our assertion to avoid reporting too
many false positives. Conversely, we need to prevent the assertion from
missing too many true positives. (A sketch of one such assertion follows this list.)</li>
<li><strong>b) Workload characterization</strong>: it can be difficult to reproduce production
traffic mixes in a test environment.</li>
</ul>
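<p>One simple way to tolerate variance is to compare a high percentile against a recorded baseline, and fail only when the regression exceeds a tolerance. A minimal sketch (the 20% tolerance is an arbitrary illustrative choice):</p>
<pre><code>object LatencyAssertion {
  // Assumes samplesMs is non-empty.
  def percentile(samplesMs: Seq[Double], p: Double): Double = {
    val sorted = samplesMs.sorted
    sorted(math.min(sorted.length - 1, (p * sorted.length).toInt))
  }

  // Fail only when the new p99 regresses more than 20% past the baseline,
  // trading a few missed true positives for fewer false alarms.
  def check(baselineMs: Seq[Double], currentMs: Seq[Double]): Boolean =
    percentile(currentMs, 0.99) &lt;= percentile(baselineMs, 0.99) * 1.20
}
</code></pre>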
<p>Regardless of whether your system is distributed or located on a single machine,
request latency is defined by the time it takes to execute the program’s <em>control flow</em> for that request.
In a system where concurrent tasks process a single request, it is useful to
consider the <em>critical path</em>: the longest chain of dependent tasks
starting with the request’s arrival, and ending with completion of the control
flow.</p>
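<p>Given per-task timing data (of the kind distributed tracing, discussed below, produces), computing the critical path over a tree of tasks is a short recursion. A sketch:</p>
<pre><code>// Each task records its own duration; children are the tasks it triggered.
case class Task(id: String, durationMs: Long, children: Seq[Task])

object CriticalPath {
  // Returns the length of the longest chain of dependent tasks from the
  // root (request arrival) to the end of the control flow, plus the path.
  def compute(root: Task): (Long, List[String]) =
    if (root.children.isEmpty) (root.durationMs, List(root.id))
    else {
      val (len, path) = root.children.map(compute).maxBy(_._1)
      (root.durationMs + len, root.id :: path)
    }
}
</code></pre>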
<p>The challenges with observing control flow for distributed systems are the
following:</p>
<ul>
<li><strong>c) Limited Visibility</strong>: the control flow for a single request can touch
thousands of machines, any one of which might be the source of a latency
problem.<sup id="fnref:16" role="doc-noteref"><a href="#fn:16" class="footnote" rel="footnote">16</a></sup> So, we need to aggregate timing information across
machines. Simple aggregation of statistics often isn’t sufficient though, since a
single machine doesn’t have a way of knowing which local tasks were
triggered by which incoming request.</li>
<li><strong>d) Instrumentation Overhead</strong>: It’s possible that the act of measuring execution times can
itself significantly perturb the execution time, leading to false positives or false negatives.</li>
<li><strong>e) Intrusiveness</strong>: if we’re using our production deployment to find performance problems, we need to
avoid increasing latency too much for clients.</li>
</ul>
<h4 id="solutions-1">Solutions</h4>
<p>The main technique for addressing these challenges is <strong>distributed tracing</strong>. The
core idea is simple:<sup id="fnref:25" role="doc-noteref"><a href="#fn:25" class="footnote" rel="footnote">17</a></sup> have the first machine assign an ID to the incoming request,
and attach that ID (plus a pointer to the parent task) to all messages that are generated in response to the
incoming request. Then have each downstream task that is involved in processing those
messages log timing information associated with the request ID to disk.</p>
<p>Propagating the ID across all machines results in a <em>tree</em> of
timing information, where each vertex contains timing information for a single
task (the ingress being the root), and each edge represents a control flow dependency between tasks.
This timing information
can be retrieved asynchronously from each machine.
To minimize instrumentation overhead and intrusiveness, we can <em>sample</em>:
only attach an ID to a fraction of incoming requests. As long as overhead
is low enough, we could overcome the
challenge of workload characterization by running causal tracing on our
production deployment.</p>
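<p>A bare-bones sketch of the propagation logic (field names and the log format are illustrative assumptions, not the wire format of any real tracing system):</p>
<pre><code>case class TraceContext(requestId: Long, parentTaskId: Long)
case class Message(ctx: Option[TraceContext], body: String)

class Tracer(sampleRate: Double) {
  private val rng = new scala.util.Random()

  // The first machine decides, by sampling, whether a request gets traced.
  def startTrace(): Option[TraceContext] =
    if (rng.nextDouble() &lt; sampleRate)
      Some(TraceContext(requestId = rng.nextLong(), parentTaskId = 0L))
    else None

  // Each task logs timing keyed by the request ID...
  def recordTask(ctx: TraceContext, taskId: Long, startNs: Long, endNs: Long): Unit =
    println(s"trace=${ctx.requestId} task=$taskId parent=${ctx.parentTaskId} " +
      s"durMs=${(endNs - startNs) / 1e6}")

  // ...and stamps itself as the parent on all downstream messages,
  // which is what yields the tree structure described above.
  def propagate(ctx: TraceContext, myTaskId: Long, body: String): Message =
    Message(Some(ctx.copy(parentTaskId = myTaskId)), body)
}
</code></pre>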
<p>Here is an illustration<sup id="fnref:17" role="doc-noteref"><a href="#fn:17" class="footnote" rel="footnote">18</a></sup>:</p>
<p><img src="https://eecs.berkeley.edu/~rcs/research/tracing_example.png" alt="Trace Example" height="700px" width="700px" /></p>
<p>What can we do with causal trees? A bunch of cool stuff: characterize the
production workload so that we can reproduce it in a test environment,<sup id="fnref:20" role="doc-noteref"><a href="#fn:20" class="footnote" rel="footnote">19</a></sup>
perform resource accounting<sup id="fnref:21" role="doc-noteref"><a href="#fn:21" class="footnote" rel="footnote">20</a></sup> and make ‘what-if’ predictions for resource planning,<sup id="fnref:22" role="doc-noteref"><a href="#fn:22" class="footnote" rel="footnote">21</a></sup>
track flows across administrative domains,<sup id="fnref:23" role="doc-noteref"><a href="#fn:23" class="footnote" rel="footnote">22</a></sup> visualize traces and express expectations about how flows should
or should not be structured,<sup id="fnref:24" role="doc-noteref"><a href="#fn:24" class="footnote" rel="footnote">23</a></sup> monitor performance isolation in a multi-tenant environment,<sup id="fnref:27" role="doc-noteref"><a href="#fn:27" class="footnote" rel="footnote">24</a></sup> and, most relevant for performance regression
testing, detect and diagnose performance anomalies.<sup id="fnref:26" role="doc-noteref"><a href="#fn:26" class="footnote" rel="footnote">25</a></sup></p>
<p>Distributed tracing does require a fair amount of engineering effort: we need to
modify our system to attach and propagate IDs (it’s unfortunately non-trivial to
‘bolt on’ a tracing system like Zipkin). Perhaps the simplest form of
performance regression testing we can do is to analyze performance statistics
<strong>without correlating</strong> across machines. We can still get end-to-end latency
numbers by instrumenting clients, or by ensuring that
the machine processing the incoming request is the same as the machine sending
an acknowledgment to the client. The key issue then is figuring
out the source of latency once we have detected a problem.</p>
<h3 id="discovering-problems-in-production">Discovering Problems in Production</h3>
<p>Despite our best efforts, bugs invariably make it into production.<sup id="fnref:28" role="doc-noteref"><a href="#fn:28" class="footnote" rel="footnote">26</a></sup> Still,
we’d prefer to discover and diagnose these issues through means that are more
proactive than user complaints.
What are the challenges of detecting problems in production?:</p>
<ul>
<li><strong>a) Runtime overhead</strong>: It’s crucial that our instrumentation doesn’t incur noticeable
latency costs for users.</li>
<li><strong>b) Possible privacy concerns</strong>: In some cases, our monitoring data
will contain sensitive user information.</li>
<li><strong>c) Limited visibility</strong>: We can’t just stop the world to collect
our monitoring data, and no single machine has global visibility into
the state of the overall system.</li>
<li><strong>d) Failures in the monitoring system</strong>: The monitoring system is itself a
distributed system that needs to deal with faults gracefully.</li>
</ul>
<h4 id="solutions-2">Solutions</h4>
<p>An old idea is particularly useful here: <strong>distributed snapshots</strong>.<sup id="fnref:29" role="doc-noteref"><a href="#fn:29" class="footnote" rel="footnote">27</a></sup>
Distributed snapshots are defined by consistent cuts: a subset of the events in
the system’s execution such that if any event e is contained in the subset, all
‘happens-before’ predecessors of e are also contained in the subset.</p>
<p>Distributed snapshots allow us to obtain a global view of the state of all
machines in the system, without needing to stop the world. Once we have a
distributed snapshot in hand, we can check assertions about the state of the
overall system (either offline<sup id="fnref:30" role="doc-noteref"><a href="#fn:30" class="footnote" rel="footnote">28</a></sup> or online<sup id="fnref:31" role="doc-noteref"><a href="#fn:31" class="footnote" rel="footnote">29</a></sup>).</p>
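<p>To give a flavor of the consistency check, here is a sketch that assumes each process reports a vector clock for the last event it included in the snapshot. (Real snapshot protocols, e.g. Chandy-Lamport, construct the cut differently; this only checks one.)</p>
<pre><code>object SnapshotCheck {
  type ProcessId = Int
  type VectorClock = Map[ProcessId, Long] // events seen per process

  // A cut is consistent iff no included event depends on an excluded one:
  // each frontier's knowledge of process j must not exceed how many of j's
  // own events the cut includes. Assumes every process reported a frontier.
  def isConsistentCut(frontiers: Map[ProcessId, VectorClock]): Boolean =
    frontiers.values.forall { vc =>
      vc.forall { case (j, count) =>
        count &lt;= frontiers(j).getOrElse(j, 0L)
      }
    }
}
</code></pre>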
<p>Since runtime overheads limit how much information we can record in production, it can be challenging to diagnose a
problem once we have detected it. <strong>Probabilistic diagnosis</strong> techniques<sup id="fnref:32" role="doc-noteref"><a href="#fn:32" class="footnote" rel="footnote">30</a></sup><sup>,</sup><sup id="fnref:33" role="doc-noteref"><a href="#fn:33" class="footnote" rel="footnote">31</a></sup><sup>,</sup><sup id="fnref:34" role="doc-noteref"><a href="#fn:34" class="footnote" rel="footnote">32</a></sup> seek
to capture carefully selected diagnostic information (e.g. stack traces, thread & message interleavings) that should have high
probability of helping us find the root cause of a problem. One key insight
underlying these techniques is <strong>cooperative debugging</strong>: the realization that even if we don’t collect enough diagnostic information from a single bug report,
it’s quite likely that the bug will happen more than once.<sup id="fnref:bofa" role="doc-noteref"><a href="#fn:bofa" class="footnote" rel="footnote">33</a></sup></p>
<p>Identifying which pieces of hardware in your system have failed (or are exhibiting
flaky behavior) is a non-trivial task when you only have a partial view into
the state of the overall system. <strong>Root cause analysis</strong> techniques (a
frustratingly generic name, IMHO) seek to infer unknown failure events from limited monitoring
data.<sup id="fnref:36" role="doc-noteref"><a href="#fn:36" class="footnote" rel="footnote">34</a></sup><sup>,</sup><sup id="fnref:37" role="doc-noteref"><a href="#fn:37" class="footnote" rel="footnote">35</a></sup></p>
<h3 id="thats-it-for-now-more-to-come">That’s It For Now; More to Come!</h3>
<p>I should probably get back to writing my dissertation. But stay tuned for future posts,
where I hope to cover topics such as:</p>
<ul>
<li>Fault tolerance testing</li>
<li>Test case reduction</li>
<li>Distributed debugging</li>
<li>Tools to help you write correctness conditions</li>
<li>Tools to help you better comprehend diagnostic information</li>
<li>Dynamic analysis for finding race conditions & atomicity violations</li>
<li>Model checking & symbolic execution</li>
<li>Configuration testing</li>
<li>Verification</li>
<li>Liveness issues</li>
<li>…</li>
</ul>
<p>If you’d like to add anything I missed or correct topics I’ve mischaracterized, please feel free to issue a
<a href="https://github.com/colin-scott/colin-scott.github.io/tree/master/_posts/2016-03-04-technologies-for-testing-and-debugging-distributed-systems.markdown">pull
request</a>!</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>“Heartbeat: A Timeout-Free Failure Detector for Quiescent Reliable Communication.” International Workshop on Distributed Algorithms ‘97 <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>The test framework can’t modify the messages sent by the system, but it can control other sources of non-determinism, e.g. the order in which messages are delivered. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>“To Infinity and Beyond: Time-Warped Network Emulation”, NSDI ‘06 <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p>“Hardware and software approaches for deterministic multi-processor replay of concurrent programs”, Intel Technology Journal ‘09 <a href="#fnref:5" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:7" role="doc-endnote">
<p>“ReVirt: Enabling Intrusion Analysis Through Virtual-Machine Logging and Replay”, OSDI ‘02 <a href="#fnref:7" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:19" role="doc-endnote">
<p>“Deterministic Process Groups in dOS”, OSDI ‘10. [Technically deterministic execution, not deterministic replay] <a href="#fnref:19" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>“Replay Debugging For Distributed Applications”, ATC ‘06 <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:18" role="doc-endnote">
<p>“DDOS: Taming nondeterminism in distributed systems”, ASPLOS ‘13. <a href="#fnref:18" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:6" role="doc-endnote">
<p>“Execution Synthesis: A Technique For Automated Software Debugging”, EuroSys ‘10 <a href="#fnref:6" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:12" role="doc-endnote">
<p>“PRES: Probabilistic Replay with Execution Sketching on Multiprocessors”, SOSP ‘09 <a href="#fnref:12" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:15" role="doc-endnote">
<p>“ODR: Output-Deterministic Replay for Multicore Debugging”, SOSP ‘09 <a href="#fnref:15" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:13" role="doc-endnote">
<p>“Analyzing Multicore Dumps to Facilitate Concurrency Bug Reproduction”, ASPLOS ‘10 <a href="#fnref:13" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:14" role="doc-endnote">
<p>“Debug Determinism: The Sweet Spot for Replay-Based Debugging”, HotOS ‘11 <a href="#fnref:14" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:10" role="doc-endnote">
<p>“Minimizing Faulty Executions of Distributed Systems”, NSDI ‘16 <a href="#fnref:10" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:11" role="doc-endnote">
<p>“Testing a Database for Race Conditions with QuickCheck”, Erlang ‘11 <a href="#fnref:11" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:16" role="doc-endnote">
<p>We might not need to aggregate statistics from all the machines, but at the very least we need timings from the first machine to process the request and the last machine to process the request. <a href="#fnref:16" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:25" role="doc-endnote">
<p>“Path-based failure and evolution management”, SOSP ‘04 <a href="#fnref:25" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:17" role="doc-endnote">
<p>“So, you want to trace your distributed system? Key design insights from years of practical experience”, CMU Tech Report <a href="#fnref:17" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:20" role="doc-endnote">
<p>“Using Magpie for request extraction and workload modelling”, SOSP ‘04 <a href="#fnref:20" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:21" role="doc-endnote">
<p>“Stardust: tracking activity in a distributed storage system”, SIGMETRICS ‘06 <a href="#fnref:21" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:22" role="doc-endnote">
<p>“Ironmodel: robust performance models in the wild”, SIGMETRICS ‘08 <a href="#fnref:22" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:23" role="doc-endnote">
<p>“X-trace: a pervasive network tracing framework”, NSDI ‘07 <a href="#fnref:23" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:24" role="doc-endnote">
<p>“Pip: Detecting the Unexpected in Distributed Systems”, NSDI ‘06 <a href="#fnref:24" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:27" role="doc-endnote">
<p>“Retro: Targeted Resource Management in Multi-tenant Distributed Systems”, NSDI ‘15 <a href="#fnref:27" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:26" role="doc-endnote">
<p>“Diagnosing performance changes by comparing request flows”, NSDI ‘11 <a href="#fnref:26" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:28" role="doc-endnote">
<p>‘Hark’, you say! ‘Verification will make bugs a thing of the past!’ –I’m not entirely convinced… <a href="#fnref:28" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:29" role="doc-endnote">
<p>“Distributed Snapshots: Determining Global States of Distributed Systems”, ACM TOCS ‘85 <a href="#fnref:29" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:30" role="doc-endnote">
<p>“WiDS Checker: Combating Bugs in Distributed Systems”, NSDI ‘07 <a href="#fnref:30" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:31" role="doc-endnote">
<p>“D3S: Debugging Deployed Distributed Systems”, NSDI ‘08 <a href="#fnref:31" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:32" role="doc-endnote">
<p>“SherLog: Error Diagnosis by Connecting Clues from Run-time Logs”, ASPLOS ‘10 <a href="#fnref:32" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:33" role="doc-endnote">
<p>“Effective Fault Localization Techniques for Concurrent Software”, PhD Thesis <a href="#fnref:33" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:34" role="doc-endnote">
<p>“Failure Sketching: A Technique for Automated Root Cause Diagnosis of In-Production Failures”, SOSP ‘15 <a href="#fnref:34" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:bofa" role="doc-endnote">
<p>“Cooperative Bug Isolation”, PhD Thesis <a href="#fnref:bofa" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:36" role="doc-endnote">
<p>“A Survey of Fault Localization Techniques in Computer Networks”, SCP ‘05 <a href="#fnref:36" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:37" role="doc-endnote">
<p>“Detailed Diagnosis in Enterprise Networks”, SIGCOMM ‘09 <a href="#fnref:37" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<hr />
<h2>Fuzzing Raft for Fun and Publication</h2>
<p><em>2015-10-07</em></p>
<p>Research on distributed systems is often motivated by some variation of the following:</p>
<blockquote>
<p>Developers of distributed systems face notoriously difficult challenges, such as concurrency, asynchrony, and partial failure.</p>
</blockquote>
<p>That statement seems convincing enough, but it’s rather abstract. In this post
we’ll gain a concrete understanding of what makes distribution so challenging,
by describing correctness bugs we found in an <a href="https://github.com/ktoso/akka-raft">implementation</a> of the <a href="https://ramcloud.stanford.edu/raft.pdf">Raft consensus protocol</a>.</p>
<p>Raft is an interesting example because its authors designed it to be
understandable and straightforward to implement. As we’ll see, implementing even the relatively
straightforward Raft spec correctly requires developers to deal with many difficult-to-anticipate
issues.</p>
<h3 id="fuzz-testing-setup">Fuzz testing setup</h3>
<p>To find bugs we’re going to employ fuzz testing. Fuzz tests are nice because
they help us exercise situations that developers don’t anticipate
with unit or integration tests. In a distributed environment,
semi-automated testing techniques such as fuzzing are especially useful, since the number of possible
event orderings a system might encounter grows exponentially with the number
of events (e.g., failures, message deliveries): far too many cases for developers to reasonably cover with
hand-written tests.</p>
<p>Fuzz testing generally requires two ingredients:</p>
<ol>
<li>Assertions to check.</li>
<li>A specification of what inputs the fuzzer should inject into the system.</li>
</ol>
<h4 id="assertions">Assertions</h4>
<p>The Raft protocol already has a set of nicely defined safety
conditions, which we’ll use as our assertions.</p>
<div class="row">
<div class="span1">
</div>
<div class="span5">
<p><img src="https://www.eecs.berkeley.edu/~rcs/research/raft_invariants.png" alt="Raft Invariants" width="100%" /></p>
</div>
<div class="span5">
<p>
Figure 3 from the <a href="https://ramcloud.stanford.edu/raft.pdf">Raft paper</a> (copied left) shows Raft's key invariants.
We use these invariants as our assertions. Each assertion should hold at
any point in Raft's execution.
</p>
<p>
For good measure, we also add one additional assertion:
no Raft process should crash due to an uncaught exception.
</p>
<p>
We'll check these invariants by periodically halting the fuzz test and
inspecting the internal state of each Raft process. If any of the assertions
ever fails, we've found a bug.
</p>
</div>
</div>
<p><br /></p>
<h4 id="input-generation">Input generation</h4>
<p>The trickier part is specifying what inputs the fuzzer should inject.
Generally speaking, inputs are anything processed by the system, yet created outside the control of
the system. In the case of distributed systems there are a few sources of
inputs:</p>
<ul>
<li>The network determines when messages are delivered.</li>
<li>Hardware may fail, and processes may (re)join the system at random points.</li>
<li>Processes outside the system (e.g. clients) may send messages
to processes within the system.</li>
</ul>
<p>To generate the last two types of inputs, we specify a function for creating random external messages (in the case of Raft: client commands)
as well as probabilities for how often each event type (external message sends, failures, recoveries) should be injected.</p>
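<p>Concretely, a fuzz-test input specification might look something like the sketch below. The event types and weights are illustrative assumptions; DEMi’s actual configuration surface differs.</p>
<pre><code>sealed trait ExternalEvent
case class ClientCommand(payload: String) extends ExternalEvent
case class Kill(node: Int) extends ExternalEvent
case class Recover(node: Int) extends ExternalEvent

// Weighted random choice over event types; client commands are injected
// more often than failures and recoveries.
class FuzzInputs(numNodes: Int, seed: Long) {
  private val rng = new scala.util.Random(seed)
  private val weighted: Seq[(Int, () => ExternalEvent)] = Seq(
    (8, () => ClientCommand(s"put:key${rng.nextInt(10)}=v${rng.nextInt(100)}")),
    (1, () => Kill(rng.nextInt(numNodes))),
    (1, () => Recover(rng.nextInt(numNodes)))
  )
  private val total = weighted.map(_._1).sum

  def next(): ExternalEvent = {
    var roll = rng.nextInt(total)
    val chosen = weighted.find { case (w, _) => roll -= w; roll &lt; 0 }.get
    chosen._2()
  }
}
</code></pre>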
<p>We gain control over the network by <a href="https://github.com/NetSys/demi">interposing</a> on the distributed system’s RPC layer, using <a href="https://en.wikipedia.org/wiki/AspectJ">AspectJ</a>. For now, we target a
specific RPC system: <a href="https://akka.io/">Akka</a>. Akka is ideal because it provides
a narrow, general API that operates at a high level of abstraction
based around the <a href="https://en.wikipedia.org/wiki/Actor_model">actor model</a>.</p>
<p>Our interposition essentially allows us to play god: we get to choose exactly when each
RPC message sent by the distributed system is delivered. We can delay, reorder, or drop any message the
distributed system tries to send. The basic architecture of our test harness
(which we call <a href="https://github.com/NetSys/demi">DEMi</a>) is shown below:</p>
<div class="row">
<div class="span1">
</div>
<div class="span5">
<p>
Every time a process sends an RPC message, the Test Harness intercepts it and places
it into a buffer. The Test Coordinator later decides when to deliver that message to the recipient.
In a fully asynchronous network, the Test Coordinator can arbitrarily delay
and reorder messages.
</p>
<p>
The Test Coordinator also injects external events (external message sends, failures, recoveries)
at random according to the probability weights given by the fuzz test
specification.
</p>
</div>
<div class="span6">
<p><img src="https://www.eecs.berkeley.edu/~rcs/research/test_infrastructure.png" alt="Test Harness" width="100%" /></p>
</div>
</div>
<p>Interposing at the RPC layer has a few advantages over interposing at a
lower layer (e.g. the network layer, a la <a href="https://aphyr.com/tags/jepsen">Jepsen</a>).
Most importantly, we get fine-grained control over when each individual
(non-segmented) message is delivered. In contrast, iptables is a much blunter tool: it only allows the tester
to drop or delay all packets between a given pair of processes [1].</p>
<p>Targeting applications built on Akka gives us one other
advantage: Akka provides a <a href="https://doc.akka.io/docs/akka/2.4.0/scala/scheduler.html">timer API</a> that obviates the need for application
developers to read directly from the system clock. Timers are a crucial part of distributed systems,
since they are used to detect failures. In Akka, timers are modeled
as messages, to be delivered to the process that set the timer at a later
point in the execution. Rather than waiting for the wall-clock time for each
timer to expire, we can deliver it right away, without the application
noticing any difference.</p>
<h3 id="target-implementation-akka-raft">Target implementation: akka-raft</h3>
<p>The Raft implementation we target is <a href="https://github.com/ktoso/akka-raft">akka-raft</a>. akka-raft is written by one of the
core Akka developers, <a href="https://github.com/ktoso">Konrad Malawski</a>. akka-raft is fully featured according to
the Raft <a href="https://raft.github.io/">implementation page</a>; it supports log
replication, membership changes, and log compaction. akka-raft has existing
unit and integration tests, but it has not yet been deployed in production.</p>
<p>UPDATE: Konrad asked me to include a short note, and I’m glad to oblige:</p>
<blockquote>
<p>akka-raft is not an officially supported Akka module, but rather just a side project of Konrad’s. The Akka modules themselves are much more rigorously tested before release.</p>
</blockquote>
<p>For our fuzz tests we set up a small 4-node cluster (quorum size=3). akka-raft
uses TCP as its default transport protocol, so we
configure <a href="https://github.com/NetSys/demi">DEMi</a> to deliver
pending messages one at a time in a semi-random order that obeys FIFO order
between any pair of processes. We also tell DEMi to inject a given number of client
commands (as external messages placed into the pending message buffer), and
check the Raft invariants at a fixed interval throughout the execution. We do
not yet exercise auxiliary features of akka-raft, such as log compaction or
cluster membership changes.</p>
<h3 id="bug-we-found">Bug we found</h3>
<p>For all of the bugs we found below, we first minimized the faulty execution
before debugging the root cause [2]. With the minimized execution in hand,
we walked through the sequence of message deliveries in the
minimized execution one at
a time, noting the current state of the process receiving the message. Based on
our knowledge of the way Raft is supposed to work, we found the places in
the execution that deviated from our understanding of correct behavior.
We then examined the akka-raft code to understand why it deviated, and came up
with a fix. We submitted all of our fixes as pull requests.</p>
<p>A few of these root causes had already been pointed out
by <a href="https://github.com/schuster">Jonathan Schuster</a> through a manual audit of the code, but
none of them had been verified with tests or fixed before we ran our fuzz
tests.</p>
<p>On with the results!</p>
<h4 id="raft-45-candidates-accept-duplicate-votes-from-the-same-election-term"><a href="https://github.com/ktoso/akka-raft/issues/45">raft-45</a>: Candidates accept duplicate votes from the same election term.</h4>
<p>Raft is specified as a state machine with three states: <code class="language-plaintext highlighter-rouge">Follower</code>, <code class="language-plaintext highlighter-rouge">Candidate</code>,
and <code class="language-plaintext highlighter-rouge">Leader</code>. Candidates attempt to get themselves elected as leader by
soliciting a quorum of votes from their peers in a given election term (epoch).</p>
<p>In one of our early fuzz runs, we found a violation of ‘Leader Safety’, i.e. two
processes believed they were leader in the same election term. This is a bad
situation for Raft to be in, since the leaders may overwrite each other’s log
entries, thereby violating the key linearizability guarantee that Raft is supposed to
provide.</p>
<p>The root cause here was that akka-raft’s candidate state did not detect
duplicate votes from the same follower in the same election term.
(A follower might resend votes because it believed that an earlier vote was
dropped by the network). Upon receiving the duplicate vote, the candidate
counts it as a new vote and steps up to leader before it has actually
achieved a quorum of votes.</p>
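<p>One natural fix is to track voter identities rather than a count, so that duplicate votes cannot inflate the tally. A simplified sketch of the idea (akka-raft’s actual fix may differ in detail):</p>
<pre><code>// Simplified sketch: a Set of voter IDs makes duplicate votes idempotent.
class CandidateState(term: Long, clusterSize: Int) {
  private var votesReceived = Set.empty[String]

  // Returns true once a genuine quorum (strictly more than half) has voted.
  def onVote(voterId: String, voteTerm: Long): Boolean = {
    if (voteTerm == term) votesReceived += voterId // Set ignores duplicates
    votesReceived.size > clusterSize / 2
  }
}
</code></pre>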
<h4 id="raft-46-processes-neglect-to-ignore-certain-votes-from-previous-terms"><a href="https://github.com/ktoso/akka-raft/issues/46">raft-46</a>: Processes neglect to ignore certain votes from previous terms.</h4>
<p>After fixing the previous bug, we found another execution where two leaders
were elected in the same term.</p>
<p>In Raft, processes attach an ‘election term’ number to all messages they send.
Receiving processes are supposed to ignore any messages that contain an
election term that is lower than what they believe is the current term.</p>
<div class="row">
<div class="span1">
</div>
<div class="span4">
<p><img src="https://www.eecs.berkeley.edu/~rcs/research/delayed_term.jpg" alt="Delayed Term" height="50" id="myheight" /></p>
</div>
<div class="span6">
<p>
akka-raft properly ignored lagging term numbers for some, but not all message
types. DEMi delayed the delivery of messages from previous
terms and uncovered a case where a candidate incorrectly accepted a vote message from
a previous election term.
</p>
</div>
</div>
<h4 id="raft-56-processes-forget-who-they-voted-for"><a href="https://github.com/ktoso/akka-raft/issues/56">raft-56</a>: Processes forget who they voted for.</h4>
<p>akka-raft is written as an <a href="https://doc.akka.io/docs/akka/snapshot/scala/fsm.html">FSM</a>. When
making a state transition, FSM processes specify both which state
they want to transition to, and which instance variables they want to keep
once they have transitioned.</p>
<div class="row">
<div class="span1">
</div>
<div class="span6">
<p><img src="https://www.eecs.berkeley.edu/~rcs/research/raft_fsm.png" alt="Raft FSM" width="100%" /></p>
</div>
<div class="span5">
<p>
All of the state transitions for akka-raft were correct except
one: when a candidate
steps down to follower (e.g., because it receives an <code>AppendEntries</code> message,
indicating that there is another leader in the cluster), it
<i>forgets</i> which process it previously voted for in that term. Now, when another
process requests a vote from it in the same term, it will vote again but this
time for a different process than it previously voted for, allowing
two leaders to be elected.
</p>
</div>
</div>
<h4 id="raft-58a-pending-client-commands-delivered-before-initialization-occurs"><a href="https://github.com/ktoso/akka-raft/issues/58">raft-58a</a>: Pending client commands delivered before initialization occurs.</h4>
<p>After ironing out leader election issues, we started finding other issues. In
one of our fuzz runs, we found that a leader process threw an assertion error.</p>
<p>When an akka-raft candidate first makes the state transition to leader, it does not
immediately initialize its state (the <code class="language-plaintext highlighter-rouge">nextIndex</code> and <code class="language-plaintext highlighter-rouge">matchIndex</code> variables).
It instead sends a message to itself,
and initializes its state when it receives that self-message.</p>
<p>Through fuzz testing, we found that it is possible that the candidate could have pending <code class="language-plaintext highlighter-rouge">ClientCommand</code> messages
in its mailbox, placed there <i>before</i> the candidate transitioned to leader
and sent itself the initialization message.
Once in the leader state, the Akka runtime will first deliver the <code class="language-plaintext highlighter-rouge">ClientCommand</code> message. Upon processing the <code class="language-plaintext highlighter-rouge">ClientCommand</code>
message the leader tries to replicate it to the rest of the cluster, and updates its
<code class="language-plaintext highlighter-rouge">nextIndex</code> hashmap.
Next, when the Akka runtime delivers the initialization self-message, it will <em>overwrite</em> the
value of <code class="language-plaintext highlighter-rouge">nextIndex</code>. When it reads from <code class="language-plaintext highlighter-rouge">nextIndex</code> later, it’s possible for
it to throw an assertion error because the <code class="language-plaintext highlighter-rouge">nextIndex</code> values are
inconsistent with the contents of the leader’s log.</p>
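<p>One defensive fix (a sketch only; the message names here are hypothetical, and the akka-raft developers may well prefer a different approach) is to stash client commands that arrive before initialization and replay them afterwards, using Akka’s <code>Stash</code> trait:</p>
<pre><code>import akka.actor.{Actor, Stash}

case object InitializeLeaderState          // hypothetical self-message
case class ClientCommand(payload: String)  // hypothetical client command

class Leader extends Actor with Stash {
  private var nextIndex = Map.empty[String, Int]

  def uninitialized: Receive = {
    case InitializeLeaderState =>
      nextIndex = Map.empty[String, Int].withDefaultValue(1)
      unstashAll()                 // replay any stashed ClientCommands
      context.become(initialized)
    case _: ClientCommand =>
      stash()                      // too early: nextIndex doesn't exist yet
  }

  def initialized: Receive = {
    case ClientCommand(_) => // replicate to followers, update nextIndex ...
  }

  def receive = uninitialized
}
</code></pre>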
<h4 id="raft-58b-ambiguous-log-indexing"><a href="https://github.com/ktoso/akka-raft/issues/58">raft-58b</a>: Ambiguous log indexing.</h4>
<p>In one of our fuzz tests, we found a case where the ‘Log Matching’
invariant was violated, i.e. log entries did not appear in the same order on
all machines.</p>
<p>According to the Raft paper, followers should <em>reject</em> <code class="language-plaintext highlighter-rouge">AppendEntries</code> requests from leaders that are
behind, i.e. whose <code class="language-plaintext highlighter-rouge">prevLogIndex</code> and <code class="language-plaintext highlighter-rouge">prevLogTerm</code> lag
behind what the follower has
in its log. The leader should continue decrementing its <code class="language-plaintext highlighter-rouge">nextIndex</code> hashmap
until the followers stop rejecting its <code class="language-plaintext highlighter-rouge">AppendEntries</code> attempts.</p>
<p>This should have happened in akka-raft too, except for one hiccup:
akka-raft decided to adopt 0-indexed logs, rather than 1-indexed logs as the
paper suggests. This creates a problem:
the initial value of <code class="language-plaintext highlighter-rouge">prevLogIndex</code> is ambiguous. Followers can’t distinguish between:</p>
<ul>
<li>an <code class="language-plaintext highlighter-rouge">AppendEntries</code> for an empty log (<code class="language-plaintext highlighter-rouge">prevLogIndex</code> == 0),</li>
<li>an <code class="language-plaintext highlighter-rouge">AppendEntries</code> for the leader’s 1st command (<code class="language-plaintext highlighter-rouge">prevLogIndex</code> == 0), and</li>
<li>an <code class="language-plaintext highlighter-rouge">AppendEntries</code> for the leader’s 2nd command (<code class="language-plaintext highlighter-rouge">prevLogIndex</code> == 1 - 1 == 0).</li>
</ul>
<p>The last two cases need to be distinguishable.
Otherwise followers won’t be able to reject inconsistent logs. This corner case would have
been hard to anticipate; at first glance it seems fine to adopt the convention
that logs should be 0-indexed instead of 1-indexed.</p>
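<p>Under the paper’s 1-indexed convention, <code class="language-plaintext highlighter-rouge">prevLogIndex</code> == 0 is a dedicated sentinel meaning “nothing precedes these entries”, and the ambiguity disappears. A tiny sketch (the helper is hypothetical):</p>
<pre><code>// With 1-indexed logs, prevLogIndex is unambiguous:
//   0          => no preceding entry (empty log, or the leader's 1st command)
//   i, i >= 1  => the new entries follow the existing entry at index i
def prevLogIndexFor(firstNewEntryIndex: Int): Int = firstNewEntryIndex - 1

prevLogIndexFor(1) // leader's 1st command => 0 (sentinel)
prevLogIndexFor(2) // leader's 2nd command => 1, now distinguishable
</code></pre>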
<h4 id="raft-42-quorum-computed-incorrectly"><a href="https://github.com/ktoso/akka-raft/issues/42">raft-42</a>: Quorum computed incorrectly.</h4>
<p>We also found a fuzz test that ended in a violation of the ‘Leader Completeness’ invariant, i.e. a newly elected leader
had a log that was irrecoverably inconsistent with the logs of previous leaders.</p>
<p>Leaders are supposed to
commit log entries to their state machine when they know that a quorum (N/2+1) of
the processes in the cluster have that entry replicated in their logs.
akka-raft had a bug where it computed the highest replicated log index incorrectly.
First it sorted the values of <code class="language-plaintext highlighter-rouge">matchIndex</code> (which denote the highest log entry index
known to be replicated on each peer). But rather than computing the <em>median</em>
(or more specifically, the N/2+1’st) of the sorted entries, it computed the <em>mode</em> of the sorted
entries. This caused the leader to commit entries too early, before a quorum
actually had that entry replicated. In our fuzz test, message delays allowed another leader to be elected, but it did not have all
committed entries in its log, due to the previous leader committing too soon.</p>
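<p>Concretely, the highest safely committable index is the N/2+1’st largest of the sorted <code class="language-plaintext highlighter-rouge">matchIndex</code> values. A sketch (hypothetical names, and assuming <code class="language-plaintext highlighter-rouge">matchIndex</code> has an entry for every cluster member, the leader included):</p>
<pre><code>// Highest index replicated on a quorum: take the (N/2 + 1)'st largest of
// the sorted matchIndex values (i.e. the median, not the mode).
def quorumIndex(matchIndex: Map[String, Int]): Int = {
  val sorted = matchIndex.values.toSeq.sorted // ascending
  val n = sorted.size
  sorted(n - (n / 2 + 1)) // at least N/2+1 members have replicated this index
}

quorumIndex(Map("a" -> 3, "b" -> 3, "c" -> 1)) // => 3: safe to commit index 3
quorumIndex(Map("a" -> 3, "b" -> 1, "c" -> 1)) // => 1: committing 3 is premature
</code></pre>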
<h4 id="raft-62-crash-recovery-not-yet-supported-yet-inadvertently-triggerable"><a href="https://github.com/ktoso/akka-raft/issues/62">raft-62</a>: Crash-recovery not yet supported, yet inadvertently triggerable.</h4>
<p>Through fuzz testing, I found one other case where two leaders were elected in
the same term.</p>
<p>The Raft protocol assumes a crash-recovery failure model – that is, it allows
for the possibility that crashed nodes will rejoin the cluster (with non-volatile state
intact).</p>
<p>The current version of akka-raft does not write anything to disk
(although the akka-raft developers intend to support persistence soon).
That’s actually fine – it just means that akka-raft currently assumes a crash-stop failure
model, where crashed nodes are never allowed to come back.</p>
<p>The Akka runtime, however, has a <a href="https://doc.akka.io/docs/akka/snapshot/scala/fault-tolerance.html">default
behavior</a> that doesn’t play nicely
with akka-raft’s crash-stop failure assumption: it automatically restarts any process that throws an exception.
When the process restarts, all its state is reinitialized.</p>
<p>If, for any reason, a process throws an exception after it has voted for
another candidate, it will later rejoin the cluster having forgotten who it had
voted for (since all state is volatile). Similar to <a href="https://github.com/ktoso/akka-raft/issues/56">raft-56</a>,
this caused two leaders to be elected in our fuzz test.</p>
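<p>Until persistence lands, one way to make the runtime match the crash-stop assumption (a sketch, not akka-raft’s actual code) is to supervise the raft actors with a strategy that stops a crashed actor rather than restarting it:</p>
<pre><code>import akka.actor.{Actor, OneForOneStrategy, Props, SupervisorStrategy}
import akka.actor.SupervisorStrategy.Stop

// Akka's default strategy Restarts a child that throws, silently wiping its
// volatile votedFor state; Stop matches the crash-stop failure model instead.
class RaftSupervisor(childProps: Props) extends Actor {
  override val supervisorStrategy: SupervisorStrategy =
    OneForOneStrategy() { case _: Exception => Stop }

  private val child = context.actorOf(childProps, "raft-member")

  def receive = { case msg => child.forward(msg) }
}
</code></pre>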
<h4 id="raft-66-followers-unnecessarily-overwrite-log-entries"><a href="https://github.com/ktoso/akka-raft/issues/66">raft-66</a>: Followers unnecessarily overwrite log entries.</h4>
<p>The last issue I found is only possible to trigger if the underlying
transport protocol is UDP, since it requires reorderings of messages between
the same source-destination pair. The akka-raft developers say they don’t currently
support UDP, but it’s on their radar.</p>
<p>The invariant violated here was the ‘Leader Completeness’ safety property: a leader was
elected that did not have all of the needed log entries.</p>
<p><img src="https://www.eecs.berkeley.edu/~rcs/research/UDP_bug.jpg" alt="Lamport Time Diagram" width="60%" /></p>
<p>Leaders replicate uncommitted <code class="language-plaintext highlighter-rouge">ClientCommands</code> to the rest of the cluster in batches.
Suppose a follower with an empty log receives an <code class="language-plaintext highlighter-rouge">AppendEntries</code> containing
two entries. The follower appends these to its log.</p>
<p>Then the follower subsequently receives an <code class="language-plaintext highlighter-rouge">AppendEntries</code> containing only the
first of the previous two entries. (This message was delayed, as shown in the
Lamport Time Diagram). The follower will inadvertently delete the second entry from its log.</p>
<p>This is not just a performance issue: after receiving an ACK from the follower, the leader is under the impression that the
follower has two entries in its log. The leader may have decided to commit both
entries if a quorum was achieved. If another leader becomes elected, it will not necessarily have
both committed entries in its log as it should.</p>
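<p>The Raft paper’s receiver rules already guard against this: a follower truncates its log only when an existing entry <em>conflicts</em> with an incoming one (same index, different term); entries beyond the end of the incoming batch are left alone. A sketch with hypothetical types (0-indexed here for brevity):</p>
<pre><code>case class Entry(term: Int, command: String) // hypothetical log entry

// Append per the Raft paper: truncate only at a genuine conflict, so a
// delayed or reordered AppendEntries can never erase later matching entries.
def appendEntries(log: Vector[Entry],
                  firstIndex: Int, // 0-based index of incoming.head
                  incoming: Seq[Entry]): Vector[Entry] = {
  var result = log
  incoming.zipWithIndex.foreach { case (entry, offset) =>
    val idx = firstIndex + offset
    if (idx < result.size && result(idx).term != entry.term)
      result = result.take(idx) :+ entry // conflict: truncate, then append
    else if (idx >= result.size)
      result = result :+ entry           // genuinely new entry: append
    // same index and term: already present; keep the suffix intact
  }
  result
}
</code></pre>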
<h3 id="conclusion">Conclusion</h3>
<p>The wide variety of bugs we found gets me really excited about how useful our
<a href="https://github.com/NetSys/demi">fuzzing and minimization tool</a> is turning out to be. The development toolchain for
distributed systems is seriously deficient, and I hope that testing techniques like
this see more widespread adoption in the future.</p>
<p>I left many details of our approach out of this post for brevity’s sake,
particularly a description of my favorite part: how DEMi minimizes the faulty executions
it finds to make them easier to understand. Check
out our <a href="https://www.eecs.berkeley.edu/~rcs/research/nsdi_draft.pdf">paper draft</a> for more details!</p>
<h3 id="footnotes">Footnotes</h3>
<p>[1] RPC layer interposition does come with a drawback: we’re tied to a
particular RPC library. It would be tedious for us to adapt
our interposition to the impressive range of systems Jepsen has been applied
to.</p>
<p>[2] How we perform this minimization is
outside the scope of this blog post. Minimization is, in my opinion, the most
interesting part of what we’re doing here. Check out our
<a href="https://www.eecs.berkeley.edu/~rcs/research/nsdi_draft.pdf">paper draft</a> for more
information!</p>
<h2>Half-baked idea: Is asynchrony really that bad? Or: how often do failure detectors falsely accuse?</h2>
<p><em>2014-12-09</em></p>
<p>Distributed systems have two distinguishing features:</p>
<ul>
<li>Asynchrony, or “absence of synchrony”: messages from one process to another
do not arrive immediately. In a fully asynchronous system, messages may be
delayed for unbounded periods of time. In contrast, synchronous
networks always provide bounded message delays.</li>
<li>Partial failure: some processes in the system may fail while other processes
continue executing.</li>
</ul>
<p>It’s the combination of these two features that make distributed systems
really hard; the crux of many impossibility proofs is
that nodes in a fully asynchronous system can’t distinguish message
delays from failures.</p>
<p>In practice, networks are somewhere between fully asynchronous and
synchronous. That is, most (but not all!) of the time, networks give us sufficiently predictable message
delays to allow nodes to coordinate successfully in the face of failures.</p>
<p>When designing a distributed algorithm, however, common wisdom says that you should try to
make as few assumptions about the network as possible. The motivation for this
principle is that minimizing your algorithm’s assumptions about message delays maximizes the likelihood that it will work when placed
in a real network (which may, in practice, fail to meet bounds on message delays).</p>
<p>On the other hand, if your network does in fact provide bounds
on message delays, you can often design simpler and more performant
algorithms on top of it. An example of this observation that I find particularly
compelling is <a href="https://syslab.cs.washington.edu/research/specpaxos/index.html">Speculative
Paxos</a>, which
co-designs a consensus algorithm and the underlying network to improve overall
performance.</p>
<p>At the risk of making unsubstantiated generalizations, I get the sense that
theorists (who have dominated the field of distributed computing until somewhat recently) tend
to worry a lot about corner cases that jeopardize correctness properties. That
is, it’s the theorist who’s telling us to minimize our assumptions.
In contrast, practitioners are often willing to sacrifice correctness in favor
of simplicity and performance, as long as the corner cases that cause the
system to violate correctness are sufficiently rare.</p>
<p>To resolve the tension between the theorists’ and the practitioners’ principles, my half-baked idea is that we
should attempt to answer the following question:
“How asynchronous are our networks in practice?”</p>
<p>Before outlining how one might answer this question, I need to provide a bit
of background.</p>
<h3 id="failure-detectors">Failure Detectors</h3>
<p>In reaction to the overly pessimistic asynchrony assumptions made by impossibility
proofs, theorists spent about a decade [1] developing distributed algorithms for “partially synchronous” network models.
The key property of the partially synchronous model is that at some point in the execution of the distributed system, the
network will start to provide bounds on message delays, but the algorithm
won’t know when that point occurs.</p>
<p>The problem with the partially synchronous model is that algorithms built on top
of it (and their corresponding correctness proofs) are messy: the timing assumptions of the algorithm are strewn throughout
the code, and proving the algorithm correct requires you to pull those timing
assumptions through the entire proof until you can finally check at the end
whether they match up with the network model.</p>
<p>To make reasoning about asynchrony easier, a theorist named Sam Toueg along
with a few others at Cornell proposed the concept of <a href="https://www.cs.cornell.edu/home/sam/FDpapers/CT96-JACM.ps">failure detectors</a>.
Failure detectors allow algorithms to encapsulate timing assumptions:
instead of manually setting timers to detect failures, we design our
algorithms to ask an oracle about the presence of failures [2]. To implement the oracle, we
<a href="https://research.microsoft.com/en-us/people/weic/wdag97_hb.pdf">still</a> use timers,
but now we have all of our timing assumptions collected cleanly in one place.</p>
<p>Failure detectors form a hierarchy. The strongest failure detector has perfect
accuracy (it never falsely accuses nodes of failing) and perfect completeness
(it always informs all nodes of all failures). Weaker failure detectors might
make mistakes, either by falsely accusing nodes of having crashed, or by
neglecting to detect some failures. The different failure detectors
correspond to different points on the asynchrony spectrum: perfect failure
detectors can only be implemented in a fully synchronous network [3], whereas
imperfect failure detectors correspond to partial synchrony.</p>
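<p>To make the encapsulation concrete, here’s a minimal sketch of a heartbeat-based failure detector in the eventually-perfect style, with the timeout as the one place a timing assumption lives (all names hypothetical):</p>
<pre><code>// Minimal heartbeat failure detector: the algorithm asks the oracle via
// suspected(); the timing assumption is confined to `timeout`.
class HeartbeatDetector(peers: Set[String], private var timeout: Long) {
  private var lastHeard = Map.empty[String, Long]

  def onHeartbeat(peer: String, now: Long): Unit = {
    if (suspected(now)(peer)) timeout *= 2 // we falsely accused: relax the bound
    lastHeard += peer -> now
  }

  // The oracle: which peers do we currently suspect of having crashed?
  def suspected(now: Long): Set[String] =
    peers.filter(p => now - lastHeard.getOrElse(p, now) > timeout)
}
</code></pre>
<p>Counting how often <code class="language-plaintext highlighter-rouge">suspected()</code> fingers a node that is actually alive, given ground truth, is exactly the measurement proposed below.</p>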
<h3 id="measuring-asynchrony">Measuring Asynchrony</h3>
<p>One way to get a handle on our question is to measure the behavior of failure detectors in practice.
That is, one could implement imperfect failure detectors,
place them in networks of different kinds, and measure how often they falsely
accuse nodes of failing. If we have ground truth on when nodes actually fail
in a controlled experiment, we can quantify how often the corner cases that
theorists worry about actually come up.</p>
<p>Anyone interested in getting their hands dirty?</p>
<h4 id="footnotes">Footnotes</h4>
<p>[1] Starting in <a href="https://groups.csail.mit.edu/tds/papers/Lynch/jacm88.pdf">1988</a> and dwindling after <a href="https://www.cs.cornell.edu/home/sam/FDpapers/CT96-JACM.ps">1996</a>.</p>
<p>[2] Side note: failure detectors aren’t widely used in practice. Instead, most
distributed systems use ad-hoc network timeouts strewn throughout the code. At best, distributed systems use
adaptive timers, again strewn throughout the code. A library or language
that encourages programmers to encapsulate
timing assumptions and explicitly handle failure detection information could go a long
way towards improving the simplicity, amenability to automated tools, and robustness
of distributed systems.</p>
<p>[3] Which is equivalent to saying that they can’t be implemented. Unless you
can ensure that the network itself never suffers from any failures or
congestion, you can’t guarantee perfect synchrony. Nonetheless, some of the most recent <a href="https://www.usenix.org/sites/default/files/conference/protected-files/vliu_nsdi13_slides.pdf">network designs</a> get us
pretty close.</p>
<h2>Half-Baked Idea: Automatically Marking Deferred Javascript</h2>
<p><em>2014-11-30</em></p>
<p>A typical web page is composed of multiple objects: HTML files, Javascript
files, CSS files, images, etc.</p>
<p>When your browser loads a web page, it executes a list of tasks:
first it needs to fetch the main HTML, then it can parse each of the
HTML tags to know what other objects to
fetch, then it can process each of the fetched objects and their effect on the
DOM, and finally it can render pixels to your screen.</p>
<p>To load your web page as fast as possible, the browser tries to execute as
many of these tasks as it can <em>in parallel</em>. The less time the browser
spends sitting idle waiting for tasks to finish, the faster the web page
will load.</p>
<p>It is not always possible to execute tasks in parallel. This is
because some tasks have dependencies on others. The most obvious example
is that the browser needs to fetch the main HTML before it can know
what other objects to fetch [1].</p>
<p>In general, the more dependencies a web page has,
the longer it will take to load. Prudent web developers structure their web
pages in a way that minimizes browsers’ task dependencies.</p>
<p>A particularly nasty dependency is Javascript execution. Whenever the browser
encounters a Javascript tag, it stops all other parsing and rendering tasks, waits to fetch the
Javascript, executes it until completion, and finally restarts the previously
blocked tasks. Browsers enforce this dependency because Javascript can modify the DOM;
by modifying the DOM, Javascript might affect the execution of all other
parsing and rendering tasks.</p>
<p>Placing Javascript tags at the beginning of an HTML page can incur a huge
performance hit, since each script adds 1 RTT plus computation time
to the overall page load time.</p>
<p>Fortunately, the HTML standard provides a mechanism that allows developers to mitigate this
cost: the <a href="https://www.w3schools.com/tags/att_script_defer.asp">defer attribute</a>. The defer attribute tells the browser
that it’s OK to fetch the script in parallel and defer its execution until the HTML has been parsed.</p>
<p>Unfortunately, using the defer attribute is not straightforward. The issue is
that it’s hard for the web developer to know whether it’s safe to allow the browser to execute Javascript asynchronously.
For instance, the Javascript may actually need to modify the DOM to ensure the correct execution of the page, or it
may depend on other resources (e.g. other Javascript tags).</p>
<p>Forcing web developers to reason about these complicated (and often hidden!)
dependencies is, at best, a lot to ask, and at worst, highly error-prone.
For this reason, few web developers today make use of the defer attribute.</p>
<p>So here’s my half-baked idea: wouldn’t it be great if we had a compiler that
could automatically mark defer attributes? Specifically, let’s apply static
or dynamic analysis to infer when it’s safe for Javascript
tags to execute asynchronously. Such a tool could go a long way towards improving the
performance and correctness of the web.</p>
<h4 id="footnotes">Footnotes</h4>
<p>[1] See the <a href="https://www.usenix.org/system/files/conference/nsdi13/nsdi13-final177.pdf">WProf paper</a> for a nice overview of browser activity dependencies.</p>
<h2>Half-Baked Idea: Distributed Systems Need Message-Level Debuggers</h2>
<p><em>2014-11-14</em></p>
<p>gdb, although an incredibly powerful tool for debugging single programs, doesn’t work
so well for distributed systems.</p>
<p>The crucial difference between a single program and a distributed system is that distributed
computation revolves around <em>network messages</em>. Distributed systems spend
much of their time doing nothing more than waiting for network messages. When
they receive a message, they perform computation, perhaps send out a few
network messages of their own, and then return to their default state of
waiting for more network messages.</p>
<p>Because there’s a network separating the nodes of the distributed system, you
can’t (easily) pause all processes and attach gdb. And, in the words of
<a href="https://twitter.com/armon/status/533050582995435520">Armon Dadgar</a>,
“even if you could, the network is part of your system. Definitely not going to be able to gdb
attach to that.”</p>
<p>Suppose that you decide to attach gdb to a single process in the distributed
system. Even then, you’ll probably end up frustrated. You’re going to spend most
of your time waiting on a <code class="language-plaintext highlighter-rouge">select</code> or <code class="language-plaintext highlighter-rouge">receive</code> breakpoint. And when your
breakpoint is triggered, you’ll find that most of the
messages won’t be relevant for triggering <em>your</em> bug. You need to
wait for a specific message, or even a specific sequence of messages, before
you’ll be able to trace through the code path that leads to your bug.</p>
<p>Crucially, gdb doesn’t give you the ability to control network
messages, yet network messages are what drive the distributed
system’s execution. In other words, gdb operates at a level of abstraction that is lower than what you
want.</p>
<p>Distributed systems need a different kind of debugger. What we need is a
debugger that will allow us to step through the distributed system’s
execution at the level of network messages. That is, you should be able to
generate messages, control the order in which they arrive, and observe how the
distributed system reacts.</p>
<p>Shameless self-promotion: <a href="https://ucb-sts.github.io/sts/walkthrough#interactive_mode">STS</a>
supports an “Interactive Mode” that takes over control of the
(software) network separating the nodes of a distributed system. This allows
you to interactively reorder or drop messages, inject failures, or check
invariants. We need something like this for testing and debugging general distributed systems.</p>
<h2>Half-Baked Ideas</h2>
<p><em>2014-11-14</em></p>
<p>As a graduate student, I find that the rate of progress I’m able to make on my current
research project is significantly lower than the rate at which I encounter
ideas for new research projects. Over time, this means that the number of
half-baked ideas jotted down in my notebook grows without bound.</p>
<p>In Academia, we sometimes feel dissuaded from sharing our half-baked ideas. Our
fear is that we may get ‘scooped’; that is, we worry that if we share an idea before we have time to flesh it out,
someone else may take that idea and turn it into a fully-fledged publication, thereby stealing our opportunity to publish.</p>
<p>Until now, I haven’t publicly shared any of my half-baked ideas. I would like to
change that [1].</p>
<p>So, in the hope of generating discussion, I’ll be posting a series of
half-baked ideas. Please feel welcome to steal them, criticize them, or add to
them!</p>
<hr />
<p>[1] In part, this is because I have come to believe that academic caginess is petty. More importantly though, I
have come to terms with the reality that I will not have time to pursue most of these ideas.</p>
<h2>Performance Modeling for Network Control Plane Systems</h2>
<p><em>2014-10-02</em></p>
<p>At Berkeley I have the opportunity to work with some of the smartest undergrads around. One of the undergrads I work with,
<a href="https://plus.google.com/109177137524762864782/about">Andrew Or</a>, did some neat work on modeling the performance of network control plane systems (e.g. SDN controllers).
He decided to take a once-in-a-lifetime opportunity to join <a href="https://databricks.com/">Databricks</a> before we got the chance to publish his work, so in his stead I thought
I’d share his work here.</p>
<p>An interactive version of his performance model can be found at this <a href="https://www.eecs.berkeley.edu/~rcs/research/convergence_modeling/">website</a>. Description from the website:</p>
<blockquote>
<p>A key latency metric for network control plane systems is convergence time: the duration between when a change occurs in a network and when the network has converged to an updated configuration that accommodates that change. The faster the convergence time, the better.</p>
<p>Convergence time depends on many variables: latencies between network devices, the number of network devices, the complexity of the replication mechanism used (if any) between controllers, storage latencies, etc. With so many variables it can be difficult to build an intuition for how the variables interact to determine overall convergence time.</p>
<p>The purpose of this tool is to help build that intuition. Based on analytic models of communication complexity for various replication and network update schemes, the tool quantifies convergence times for a given topology and workload. With it, you can answer questions such as "How far will my current approach scale while staying within my SLA?", and "What is the convergence time of my network under a worst-case workload?".</p>
</blockquote>
<p>The tool is insightful (e.g.
note the striking difference between SDN controllers and traditional routing protocols) and a lot of fun to play around with; I encourage you to check it out.
In case you are curious about the details of the model or would like to suggest
improvements, the code is available <a href="https://github.com/andrewor14/web-model">here</a>. We also have a 6-page write up of the work, available upon request.</p>
<h2>Is Academia A Good Place To Build Real Software?</h2>
<p><em>2014-04-30</em></p>
<p>I often overhear a recurring debate amongst researchers: is Academia a good
place to build real software systems? By “real”, we typically mean “used”,
particularly by people outside of academic circles.</p>
<p>There have certainly been some success stories. <a href="https://www.bsd.org/">BSD</a>,
<a href="https://llvm.org/">LLVM</a>, <a href="https://www.xenproject.org/">Xen</a>, and <a href="https://spark.apache.org/">Spark</a> come to mind.</p>
<p>Nonetheless, some argue that these success stories came about at a time
when the surrounding software ecosystem was nascent enough for a small group of researchers to
be able to make a substantial contribution, and that the ecosystem is normally
at a point where researchers cannot easily contribute. Consider for example that BSD was
initially released in 1977, when very few open source operating systems
existed. Now we have Linux, which has almost <a href="https://www.cnet.com/news/linux-development-by-the-numbers-big-and-getting-bigger/">1400 active
developers</a>.</p>
<p>Is this line of reasoning correct? Is the heyday of Academic systems software over? Will it ever come again?</p>
<p>Without a doubt, building real software systems requires substantial (wo)manpower; no
matter how great the idea is, implementing it will require raw effort.</p>
<p>This fact suggests an indirect way to evaluate our question. Let’s assume that
(i) any given software
developer can only produce a fixed (constant) amount of coding progress in a fixed
timeframe and (ii) the maturity of the surrounding software ecosystem is
proportional to collective effort put into it. We can then approximate an
answer to our
question by looking at the number of software developers in industry vs. the number of
researchers over time.</p>
<p>It turns out that the <a href="https://www.bls.gov/">Bureau of Labor Statistics</a>
publishes exactly the <a href="https://www.bls.gov/data/">data</a> we need for the United States.
Here’s what I found:</p>
<p><img src="https://www.eecs.berkeley.edu/~rcs/research/oes.jpg" alt="OES data" width="100%" /></p>
<p>Hm. The first thing we notice is that it’s hard to even see the line for academic and industrial researchers.
To give you a sense of where it’s at, the y-coordinate at May, 2013 for computer science teachers and professors is 35,770, two
orders of magnitude smaller than the 3,339,440 total employees in the software industry at that time.</p>
<p>What we really care about though is the ratio of employees in industry to number of researchers:</p>
<p><img src="https://www.eecs.berkeley.edu/~rcs/research/oes_ratio.jpg" alt="OES ratio data" width="100%" /></p>
<p>In the last few years, both the software industry and Academia are growing at roughly the same rate, whereas researchers in industrial
labs appear to be dropping off relative to the software industry. We can see this relative growth rate better by normalizing the datasets (dividing each datapoint by the maximum datapoint
in its series – might be better to take the derivative, but I’m too lazy to
figure out how to do that at the moment):</p>
<p><img src="https://www.eecs.berkeley.edu/~rcs/research/oes_normalized.jpg" alt="OES normalized data" width="100%" /></p>
<p>The data for the previous graphs only goes back to 1995. The Bureau of Labor
Statistics also publishes coarser-granularity data going all the way back to 1950 and beyond:</p>
<p><img src="https://www.eecs.berkeley.edu/~rcs/research/nes.jpg" alt="NES data" width="100%" /></p>
<p>(See the hump around 2001?)</p>
<p>Not sure if this data actually answers our initial question, but I certainly found it insightful!
If you’d like more details on how I did this analysis, or would like to play around with the data for
yourself, see my <a href="https://github.com/colin-scott/go-bls-client">code</a>.</p>
<h2>What Distinguishes Distributed Computing From Parallel Computing?</h2>
<p><em>2014-03-30</em></p>
<p>I recently came across a <a href="https://aphyr.com/posts/285-call-me-maybe-riak">statement</a> in <a href="https://twitter.com/aphyr">Aphyr’s</a> excellent
<a href="https://aphyr.com/tags/jepsen">Jepsen</a> blog series that caught my eye:</p>
<blockquote>
<p>“In a very real sense, [network] partitions are just really big windows of concurrency.”</p>
</blockquote>
<p>This statement seems to imply that distributed systems are “equivalent”
to parallel (single-machine) computing systems, for the following reason: partitions,
which occur in a network but don’t really occur on a single chip [0], appear to be the key
distinguishing property of distributed systems. But if partitions are just a
special case of concurrency, then there shouldn’t be any fundamental reasons
why algorithms for multicore computational models
(such as <a href="https://en.wikipedia.org/wiki/Parallel_random-access_machine">PRAM</a>) wouldn’t be perfectly suitable for solving all the
problems we might encounter in a distributed setting.
We know this to be false,
so I’ve been trying to puzzle out precisely what
properties of distributed computing distinguish it from parallel computing
[1].</p>
<p>I’ve been taught that distributed systems have two crucial features:</p>
<ul>
<li>Asynchrony, or “absence of synchrony”: messages from one process to another
do not arrive immediately. In a fully asynchronous system, messages may be
delayed for unbounded periods of time.</li>
<li>Partial failure: some processes in the system may fail while other processes
continue executing.</li>
</ul>
<!--
Observe that in a loose sense, network partitions are a form of partial failure,
because from the perspective of the other nodes in the system, a partitioned node
is indistinguishable from a crashed node.
-->
<p>Let’s discuss these two properties separately.</p>
<h3 id="asynchrony">Asynchrony</h3>
<p>Parallel systems also exhibit asynchrony, as long as it’s possible for
there to be a delay between one process sending a message [2]
and the other processes having the opportunity to read that message. Even on a single
machine, this delay might be induced by locks within the operating system kernel,
or by the cache coherence protocol implemented in hardware on a multicore chip.</p>
<p>With this in mind, let’s return to Aphyr’s statement.
What exactly did he mean by “big windows of concurrency”?
His article focuses on what happens when multiple clients write to the same
database key, so by “concurrency” I think he is referring to situations where multiple processes
might simultaneously issue writes to the same piece of state. But if you
think about it, the entire execution is a “big window of
concurrency” in this sense, regardless of whether the database replicas are partitioned.
By “big windows of concurrency” I think Aphyr was really talking about <em>asynchrony</em> (or more
precisely, periods of high message delivery delays),
since network partitions are hard to deal with precisely because the messages
between replicas aren’t deliverable until after the partition is recovered:
when replicas can’t coordinate, it’s challenging (or impossible, if the system chooses to enforce linearizability)
for them to correctly process those concurrent writes. Amending Aphyr’s statement then:</p>
<blockquote>
<p>“Network partitions are just really big windows of asynchrony.”</p>
</blockquote>
<p>Does this amendment resolve our quandary? Someone could
rightly point out that because partitions don’t really occur within a single chip [0],
parallel systems can effectively provide guarantees on how long message
delays can last [3], whereas partitions in distributed systems may last
arbitrarily long. Some algorithms designed for parallel computers might
therefore break in a distributed setting,
but I don’t think this is really the distinction we’re looking
for.</p>
<h3 id="partial-failure">Partial Failure</h3>
<p>Designers of distributed algorithms codify their assumptions
about the possible ways nodes can fail by specifying a ‘failure model’. Failure models might describe
how many nodes can fail–for example, quorum-based algorithms assume that no more
than N/2 nodes ever fail, otherwise they cannot make progress–or they might
spell out how individual crashed nodes behave. The latter constraint forms a
hierarchy, where weaker failure models (e.g. ‘fail-stop’, where crashed nodes are guaranteed to never
send messages again) can be reduced to special cases of stronger models (e.g.
‘Byzantine’, where faulty nodes can behave arbitrarily, even possibly
mimicking the behavior of correct nodes) [4].</p>
<p>Throughout the Jepsen series, Aphyr tests distributed systems by (i) telling
clients to issue concurrent writes, (ii) inducing a network partition between
database replicas, and (iii) recovering the partition. Observe that Jepsen
never actually kills replicas! This failure model is actually weaker than fail-stop,
since nodes are guaranteed to eventually resume sending messages [5].
Aphyr’s statement is beginning to make sense:</p>
<blockquote>
<p>“Network partitions that are followed by network recovery are just really big windows of asynchrony.”</p>
</blockquote>
<p>This statement is true; from the perspective of a node in the system, a network partition followed by a network recovery
is indistinguishable from a random spike in message delays, or peer nodes that
are just very slow to respond. In other words, a distributed system that
guarantees that messages will eventually be deliverable to all nodes is
equivalent to an asynchronous parallel system. But if any nodes in the
distributed system actually fail, we’re no longer equivalent to a parallel
system.</p>
<h3 id="who-cares">Who cares?</h3>
<p>This discussion might sound like academic hairsplitting, but I claim that
these distinctions have practical implications.</p>
<p>As an example, let’s imagine that you need to make a choice between shared memory
versus message passing as the communication model for the shiny new distributed
system you’re designing. If you come from a parallel computing background you
would know that message passing is actually equivalent to shared memory, in
the sense that you can use a message passing abstraction to implement
shared memory, and vice versa. You might therefore conclude that you
are free to choose whichever abstraction is more convenient or performant for
your distributed system. If you jumped to this conclusion you might end up
making your system more fragile without realizing it.
Message passing is not equivalent to shared memory in distributed systems [6],
precisely because distributed systems exhibit <em>partial failures</em>;
in order to correctly implement shared memory in a distributed system it must
always be possible to coordinate with a quorum, or
otherwise be able to accurately detect which nodes have failed. Message
passing does not have this limitation.</p>
<p>Another takeaway from this discussion is that Jepsen is actually testing a
fairly weak failure mode. Despite Jepsen’s simplicity though, Aphyr has managed to uncover problems in
an impressive number of distributed databases. If we want to uncover yet more implicit assumptions
about how our systems behave, stronger failure modes seem like an
excellent place to look.</p>
<hr />
<p>[0] After I posted this blog post, Aphyr and others informed me that some of
the latest multicore chips are in fact facing partial failures between cores
due to voltage
issues. This is quite interesting, because as multicore chips grow in transistor density, the
distinction between parallel computing and distributed computing is becoming
more and more blurred: modern multicore chips face both unbounded
asynchrony (from the growing gap between levels of the memory hierarchy) and partial failure (from voltage
issues).</p>
<p>[1] Thanks to <a href="https://www.cs.berkeley.edu/~alig/">Ali Ghodsi</a> for helping me tease out the differences between these properties.</p>
<p>[2] or writing to shared memory, which is essentially the same as sending a
message.</p>
<p>[3] See, for example, PRAM or BSP, which assume that every node can
communicate with every other node within each “round”. It’s trivial to solve
<a href="https://groups.csail.mit.edu/tds/papers/Lynch/pods83-flp.pdf">hard</a> problems like
consensus in this world, because you can always just take a majority
vote and decide within two rounds.</p>
<p>[4] See Ali Ghodsi’s excellent <a href="https://www.cs.berkeley.edu/~alig/cs294-91/events-links.pptx">slides</a> for a taxonomy of these failure models.</p>
<p>[5] Note that this is not equivalent to ‘crash-recovery’. Crash-recovery is
actually stronger than fail-stop, because nodes <em>may</em> recover or they may
not.</p>
<p>[6] Nancy Lynch, “Distributed Algorithms”, Morgan Kaufmann, 1996.</p>
<h2>WAN vs. Datacenter Link Reliability</h2>
<p><em>2013-05-12</em></p>
<p>According to a study by Turner et al. [1], wide area network links have an
average of 1.2 to 2.7 days of downtime per year. This translates to roughly
two and a half 9’s of reliability [2].</p>
<p>I was curious how this compared to datacenter links, so I took a look at Gill
et. al’s paper [3] on datacenter network failures at Microsoft. Unfortunately some
of the data has been redacted, but I was able to reverse engineer the mean
link downtime per year with the help of <a href="https://www.eecs.berkeley.edu/~apanda/">Aurojit Panda’s</a>
<a href="https://github.com/apanda/svg-points">svg-to-points</a> converter. The results
are interesting: across all link types, the average downtime was 0.3 days.
This translates to roughly three 9’s of reliability, nearly an order of
magnitude less downtime than WAN links.</p>
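<p>As a quick back-of-the-envelope check (a sketch, assuming downtime is the only source of unavailability):</p>
<pre><code>// Availability "nines" from average downtime, in days per year:
//   nines = -log10(downtime fraction)
def nines(downtimeDaysPerYear: Double): Double =
  -math.log10(downtimeDaysPerYear / 365.0)

nines(1.95) // WAN: (1.2 + 2.7) / 2 days/year => ~2.3 nines
nines(0.3)  // datacenter: 0.3 days/year      => ~3.1 nines
</code></pre>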
<p>Intuitively this makes sense. WAN links are much more prone to
<a href="https://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf">drunken hunters, bulldozers, wild dogs,</a>
<a href="https://www.zetatalk.com/newsletr/issue284.htm">ships dropping anchor</a> and the like than links within a <a href="https://www.wired.com/wiredenterprise/2012/10/data-center-easter-eggs/">secure</a> datacenter.</p>
<h4 id="footnotes">Footnotes</h4>
<p>[1] Daniel Turner, Kirill Levchenko, Alex C. Snoeren, and Stefan Savage. California Fault Lines: Understanding the Causes and Impact of Network Failures, Table 4. SIGCOMM ‘10.</p>
<p>[2] Note that this statistic is specifically about hardware failure, not overall network availability.</p>
<p>[3] Phillipa Gill, Navendu Jain, Nachiappan Nagappan. Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications, Figures 8c & 9c. SIGCOMM ‘11</p>
<h2>Latency Trends</h2>
<p><em>2012-12-24</em></p>
<p>In 2010, Jeff Dean gave a <a href="https://goo.gl/0MznW">talk</a> that laid out
a list of <a href="https://gist.github.com/2843375">numbers</a> every programmer
should know. His list has since become relatively well known among the systems community.</p>
<p>The other day, a friend mentioned a latency number to me, and I realized that
it was an order of magnitude smaller than what I had memorized from
Jeff’s talk. The problem, of course, is that hardware performance increases
exponentially! After some digging, I actually found that the numbers Jeff
quotes are over a decade old [1].</p>
<p>Partly inspired by my officemate <a href="https://www.eecs.berkeley.edu/~apanda/">Aurojit Panda</a>, who is collecting
awesome <a href="https://www.eecs.berkeley.edu/~rcs/research/hw_trends.xlsx">data</a> on
hardware performance, I decided to write a little tool [2] to visualize Jeff’s
numbers as a function of time [3].</p>
<p>Without further ado,
<a href="https://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html">here</a> it
is.</p>
<h4 id="footnotes">Footnotes</h4>
<p>[1] Jeff’s numbers are from 2001, and were first publicized by Peter Norvig in this
<a href="https://norvig.com/21-days.html#answers">article</a>.</p>
<p>[2] Layout stolen directly from <a href="https://github.com/ayshen">ayshen</a> on GitHub.</p>
<p>[3] The hardware trends I’ve gathered are rough estimates. If you want to tweak
the parameters yourself, I’ve made it really easy to do so – please send me
updates! Better yet, issue a <a href="https://github.com/colin-scott/interactive_latencies">pull request</a>.</p>
<h2>7 ways to handle concurrency in distributed systems</h2>
<p><em>2012-10-28</em></p>
<p><em>tl;dr: handling event ordering correctly in distributed systems is tricky.
In this post I cover 7 approaches to coping with concurrency.</em></p>
<p>Robust distributed systems are notoriously difficult to build. The difficulty
arises from two properties in particular:</p>
<ul>
<li>
<p>Limited knowledge: each node knows its own state, and it knows what state
the other nodes were in recently, but it can’t know their current state.</p>
</li>
<li>
<p>(Partial) failures: individual nodes can fail at any time, and the network can delay or drop
messages arbitrarily.</p>
</li>
</ul>
<p>Why are these properties difficult to grapple with? Suppose you’re writing
code for a single node. You’re deep in a nested conditional statement, and you
need to deal with a message arrival. How do you react? What if the message
you’re seeing was actually delayed by the network and is no longer relevant? What if some of the nodes you need to coordinate with have
failed, but you aren’t aware of it yet? The set of possible event sequences you need to reason about is huge,
and it’s all too easy to forget about the one nasty corner case that will
eventually bring your system to a screeching halt.</p>
<p>To make the discussion more concrete, let’s look at an example.</p>
<p><img src="https://www.eecs.berkeley.edu/~rcs/research/example_bug.png" alt="Floodlight bug" width="100%" /></p>
<p>The figure above depicts a race condition [1] in
<a href="https://floodlight.openflowhub.org/">Floodlight</a>, a distributed controller for
software-defined networks. With Floodlight, switches maintain one hot
connection to a master controller and
one or more cold connections to replica controllers. The master holds the
authority to modify the configuration of the switches, while the other
controllers are in slave mode and do not perform any changes to the
switch configurations unless they detect that the master has crashed [2].</p>
<p>The race condition is triggered when a link fails (E1), and the switch
attempts to notify the controllers (E2,E4) shortly after the master has
died (E3), but before a new master has been selected (E6). In this case,
all live controllers are in
the slave role and will not take responsibility for updating the switch
flow table (E5). At some point, heartbeat messages time out and one of the
slaves
elevates itself to the master role (E6). The new master will proceed to
manage
the switch, but without ever clearing the routing entries for
the failed link (resulting in a persistent blackhole) [3].</p>
<p>If we take a step back, we see that there are two problems involved:
leader election (“Who is the master at any point in time?”), and
replication (“How should the backups behave?”). Let’s assume that leader
election is handled by a separate consensus algorithm (<em>e.g.</em>
Paxos), and focus our attention on replication.</p>
<!-- And atomic commit? -->
<p>Now that we have a concrete example to think about, let’s go over a few
solutions to this problem. The first four share the same philosophy: <em>“get
the ordering right”</em>.</p>
<h3 id="take-it-case-by-case">Take it case-by-case</h3>
<p>The straightforward fix here is to add a conditional statement for this event
ordering: if you’re a slave, and the next message is a link failure
notification, store the notification in memory in case you become master
later.</p>
<p>This fix seems easy in retrospect. But recall how the bug came about in the
first place: the programmer had some set of event orderings in mind when
writing the code, but didn’t implement one corner case. How do we know
there isn’t another race condition lurking somewhere else in the code? [4]</p>
<p>The number of event orderings you need to consider in a distributed system is
truly huge; it scales combinatorially with the number of nodes you’re
communicating with. For the rest of this post, let’s see if we can avoid the
need to reason on a case-by-case basis altogether.</p>
<h3 id="replicate-the-computation">Replicate the computation</h3>
<p>Consider a system consisting of only one node. In this world,
there is a single, global order (with no race conditions)!</p>
<p>How can we obtain a global event order, yet still achieve fault tolerance? One way [5] is to have the backup nodes mimic every step of the master node: forward all
inputs to the master, have the master choose a serial order for those events, issue the appropriate commands to the switches, and replicate the decision to the backups [6]. The key here is that each
backup should execute the computation in the exact same order as the master.</p>
<p>For the Floodlight bug, the backup would still need to hold the link failure
message in memory until it detects that the master has crashed. But we’ve
gained a powerful guarantee over the previous approach: when the backup takes
over for the master, it will be in an up-to-date state, and know exactly what
commands it needs to send to the switches to get them into a correct
configuration.</p>
<h3 id="make-your-event-handlers-transactional">Make your event handlers transactional</h3>
<p>Transactions allow us to make a group of operations appear either as if they
happened simultaneously, or not at all. This is a powerful idea!</p>
<p>How could transactions help us here? Suppose we did the following: whenever a
message arrives, find the event handler
for that message, wrap it in a transaction, run the event handler, and hand
the result of the transaction to the master controller. The master
decides on a global order, checks whether any concurrent transactions
conflict with each other (and aborts one of them if they do), updates the switches, sends the
serialized transactions to the backups, and waits for ACKs before logging
a commit message.</p>
<p>This is very similar to the previous solution, but it gives us two benefits over the previous approach:</p>
<ul>
<li>We can potentially handle more events in parallel; most of the transactions will
not conflict with each other, and we can simply abort and retry the ones
that do.</li>
<li>We can now roll back operations. Suppose a network operator issues a
policy change to the controller, but realizes that she made a mistake.
No problem – she can simply roll back the previous transaction and start
again where she began.</li>
</ul>
<p>Compared to the first approach, this is a significant improvement! Each
event is handled in isolation from the other events, so there’s no need to
reason about event interleavings; if a conflicting transaction was
committed before we get to commit, just abort and retry!</p>
<h3 id="reorder-events-when-no-one-will-notice">Reorder events when no one will notice</h3>
<p>It turns out that we can achieve even better throughput if we use a
replication model called virtual synchrony. In short, virtual synchrony
provides a library with three operations:</p>
<ul>
<li><tt>join()</tt> a process group</li>
<li><tt>register()</tt> an event handler</li>
<li><tt>send()</tt> an atomic multicast message to the rest of your process
group.</li>
</ul>
<p>These primitives provide two crucial guarantees:</p>
<ul>
<li>Atomic multicast means that if <em>any</em> correct node gets the message, every live
node will eventually get the message. That implies that if any live
node ever gets the link failure notification, you can rest assure that
one of your future masters will get it.</li>
<li>The <tt>join()</tt> protocol ensures that every node always knows who’s a member of its group,
and that everyone has the same view of who is alive
and who is not. Failures result in a group change, but everyone will agree on the order in which the failure occurred.</li>
</ul>
<p>With virtual synchrony, we no longer need a single master; atomic multicast
means that there is a single order of events observed by all members of the
group, regardless of who initiated the message. And with multiple masters, we
aren’t constrained by the speed of a single node.</p>
<p>The <em>virtual</em> part of virtual synchrony is that when the library detects
that two operations are not causally related to each other, it can reorder
them in whatever way it believes most efficient. Since those operations aren’t
causally related, we’re guaranteed that the final output won’t be noticeably
different.</p>
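<p>As a rough sketch of the programming model (hypothetical signatures, loosely in the style of Isis-like toolkits, not any particular library’s API):</p>
<pre><code>// Sketch of a virtual synchrony interface; signatures are hypothetical.
trait ProcessGroup {
  def join(groupName: String): Unit                // agree on membership views
  def register(handler: Array[Byte] => Unit): Unit // delivered in agreed order
  def send(message: Array[Byte]): Unit             // atomic multicast to group
}
</code></pre>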
<p>OK, let’s move on to the final three approaches, which take a different tack
than the first four: <em>“avoid having to reason about event ordering
altogether”</em></p>
<h3 id="make-yourself-stateless">Make yourself stateless</h3>
<p>In a database, the “ground truth” is stored on disk. In a network, the “ground
truth” is stored in the routing tables of the switches themselves. This
implies that the controllers’ view of the network is just soft state; we can
always recover it simply by querying the switches for their current
configuration!</p>
<p>How does this observation relate to the Floodlight bug? Suppose we didn’t even
attempt to keep the backup controllers in sync with the master. Instead, just
have them recompute the entire network configuration whenever they realize
they need to take over for the master. Their only job in the meantime is to
monitor the liveness of the master!</p>
<p>Of course, the tradeoff here is that it may take significantly longer for the newly elected master to get up to speed.</p>
<p>We can apply the same trick to avoid race conditions between concurrent events
at the master: instead of maintaining locks between threads, just restart
computation of the entire network configuration whenever a new event comes in.
Race conditions don’t happen if there is no shared state!</p>
<p>Incidentally, Google’s <a href="https://www.eecs.berkeley.edu/~rcs/research/google-onrc-slides.pdf">wide-area network
controller</a>
is almost entirely stateless, presumably for many of the same reasons.</p>
<h3 id="force-yourself-to-be-stateless">Force yourself to be stateless</h3>
<p>In the spirit of stateless computation, why not write your code in a language
that doesn’t allow you to keep state at all? Programs written in declarative
languages such as <a href="https://p2.berkeley.intel-research.net/">Overlog</a> have no
explicit ordering whatsoever. Programmers simply declare rules such as “If the
switch has a link failure, then flush the routing entries that go over that
link”, and the language runtime handles the order in which the computation
is carried out.</p>
<p>With a declarative language, as long as the same set of events is fed
to the controller (regardless of their order), the same result will come
out. This makes replication really easy: send inputs to all controllers,
have each node compute the resulting configuration, and only allow the
master node to send out commands to the switches once the computation has
completed. The tradeoff is that without an explicit ordering,
the performance of declarative languages is difficult to reason about.</p>
<h3 id="guarantee-self-stabilization">Guarantee self-stabilization</h3>
<p>The previous solutions were designed to always guarantee correct behavior
despite failures of the other nodes. This final solution, my personal
favorite, is much more optimistic.</p>
<p>The idea behind self-stabilizing algorithms is to have a provable guarantee
that no matter what configuration the system starts in, and no matter what
failures occur, all nodes will eventually stabilize to a configuration where
safety properties are met. This eliminates the need to worry about correct
initialization, or detect whether the algorithm has terminated. As a nice side
benefit, self-stabilizing algorithms are usually considerably simpler than
their order-aware counterparts.</p>
<p>What do self-stabilizing algorithms look like? Self-stabilizing algorithms are
actually everywhere in networking – routing algorithms are the most canonical
example.</p>
<p>How would a self-stabilizing algorithm help with the Floodlight bug? The
answer really depends on what network invariants the control application needs
to maintain. If it’s just to provide connectivity [7], we could simply run a
traditional link-state algorithm: have each switch periodically send port
status messages to the controllers, have the controllers compute shortest
paths using Dijkstra’s, and have the master push the appropriate updates to
the switches. Even if there are transient failures, we’re guaranteed that the
network will eventually converge to a configuration with no loops or
dead ends.</p>
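<p>To sketch the shape of that controller (hypothetical code; unit link weights, so plain BFS stands in for Dijkstra’s): rather than patching state incrementally, it recomputes next hops from scratch on every round of port-status reports, so transient bad state simply washes out.</p>
<pre><code>// Recompute shortest-path next hops toward `dst` from scratch each round.
// Pushing the full result to the switches overwrites any stale entries,
// so the network stabilizes regardless of its starting configuration.
def nextHops(topology: Map[String, Set[String]], dst: String): Map[String, String] = {
  var hops = Map.empty[String, String] // switch -> next hop toward dst
  var visited = Set(dst)
  var frontier = Set(dst)
  while (frontier.nonEmpty) {
    val discovered = for {
      node <- frontier
      nbr  <- topology.getOrElse(node, Set.empty[String]) if !visited.contains(nbr)
    } yield { hops += nbr -> node; nbr }
    visited ++= discovered
    frontier = discovered
  }
  hops
}
</code></pre>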
<hr />
<p>Ultimately, the best replication choice depends on your workload and
network policies. In any case, I hope this post has convinced you that
there’s more than one viable approach to handling concurrency!</p>
<h4 id="footnotes">Footnotes</h4>
<p>[1] Note that this issue was originally discovered by the developers of
Floodlight. (We don’t mean to pick on BigSwitch here; we chose this bug
because it’s a great example of the difficulties that come up in distributed
systems). For more information, see line 605 of
<a href="https://github.com/floodlight/floodlight/blob/2e9427e20ede7dc3941f8c15d2348bfcafdce237/src/main/java/net/floodlightcontroller/core/internal/Controller.java">Controller.java</a>.</p>
<p>[2] This invariant is crucial to maintain. Think of the switches’ routing
tables as shared variables between threads (controllers). We need to ensure
mutual exclusion over those shared variables, otherwise we could end up with
internally inconsistent routing tables.</p>
<p>[3] The Floodlight bug noted in [1] actually involves neglecting to clear the
routing tables of newly connected switches, but the same flavor of race
condition could occur for link failures. We chose to focus on link failures
because they’re likely to occur much more often than switch connects.</p>
<p>[4] It’s possible in some cases to use a <a href="https://www.macesystems.org/">model checker</a> to automatically find race conditions, but the runtime complexity is often intractable and very few systems do this in practice.</p>
<p>[5] There are actually a handful of ways to implement state machine replication. Ours depends on a consensus algorithm to choose the master, but you could also run the consensus algorithm itself to achieve replication. There are also cheaper algorithms such as reliable broadcast. You can also get significantly better read throughput with chain replication, which doesn’t require a quorum for reads, but writes become more complicated.</p>
<p>[6] We still need to maintain the invariant that only the master modifies
the switch configurations. Nonetheless, with state machine replication the
backup will always know what commands need to be sent to switches if and when
it takes over for the master.</p>
<p>[7] Although if your goal is only to provide connectivity, it’s <a href="https://networkheresy.com/2011/11/17/is-openflowsdn-good-at-forwarding/">not clear</a>
why you’re using SDN in the first place.</p>