You'll miss a lot on this web site unless you enable Javascript
in your browser.
This
page contains a highly abridged version of the following article
about TLA+ use at Amazon, published in April, 2015:
How Amazon Web Services Uses Formal Methods
The following paper also describes Amazon's TLA+ use: Why Amazon Chose TLA+Below is the highly abridged version of the first paper. Every word in blue is a direct quote from the paper, though much text has been omitted. I have added in black some connecting prose to substitute for deleted material, as well as a few comments marked by LL.
Since 2011, engineers at Amazon Web Services (AWS) have used formal specification and model checking to help solve difficult design problems in critical systems.
AWS builds systems that are inherently complex.
Errors in the core of our system could cause loss or
corruption of data.
So,
we need to
reach extremely high confidence that it is
correct. We have found the standard verification techniques in
industry are necessary but not sufficient. We routinely use deep
design reviews, code reviews, static code analysis, stress testing,
fault-injection testing, and many other techniques, but we still find
that subtle bugs can hide in complex concurrent fault-tolerant
systems.
Some of the more subtle, dangerous bugs turn
out to be errors in design.
We have found that testing the code is
inadequate as a method for finding these errors, as the
number of reachable states of the code is astronomical.
In order to find subtle bugs in a system design, it is necessary to
have a precise description of that design.
The final executable code is unambiguous but contains an
overwhelming amount of detail. We needed to be able to capture the
essence of a design in a few hundred lines of precise description. As
our designs are unavoidably complex, we needed a highly expressive
language, far above the level of code, but with precise semantics.
And, as we wish to build services quickly, we wanted a
language that is simple to learn and apply, avoiding esoteric
concepts. We also wanted an existing ecosystem of tools.
We were looking for an off-the-shelf method with high return on
investment.
We found what we were looking for in TLA+, a formal specification
language based on simple discrete math, or basic set theory and
predicates, with which all engineers are familiar.
We found it helpful that the same
language is used to describe both the desired correctness properties
of the system (the TLA+ is accompanied by a second language called PlusCal that is closer to a C-style programming language but is much more expressive, as it uses TLA+ for expressions and values. Several engineers at Amazon have found they are more productive using PlusCal than using TLA+. In other cases, the additional flexibility of plain TLA+ has been very useful.
In industry, formal methods have a reputation for requiring a huge
amount of training and effort to verify a tiny piece of relatively
straightforward code, so the return on investment is justified only in
safety-critical domains. Our experience with TLA+ shows that
perception to be wrong. At the time of this writing, Amazon
engineers have used TLA+ on 10 large complex real-world systems.
LL: By the time the paper appeared in print, the
number had grown to 14. I don't have more recent
information. In each, TLA+ has added significant value,
either finding subtle bugs that we are sure we would not have found by
other means, or giving us enough understanding and confidence to make
aggressive performance optimizations without sacrificing
correctness. Amazon now has seven teams using TLA+, with
encouragement from senior management and technical leadership.
Engineers from entry level to principal have been able to learn TLA+
from scratch and get useful results in two to three weeks, in some
cases in their personal time on weekends and evenings, without further
help or training.
TLA+ changes the conventional approach of
thinking about what can go wrong to specifying what must go right
for the system to achieve its goals.
We have found this rigorous
With hindsight, Amazon's path to formal methods seems straightforward;
we had an engineering problem and we found a solution. Reality was
somewhat different. The paper here describes what led author Chris Newcombe to TLA+ after trying a different method that was not up to the task. Newcombe then tried to persuade colleagues at Amazon to adopt TLA+. However, engineers have almost no spare time for such things, unless compelled by need. Fortunately, a need was about to arise.
In January 2012, Amazon launched DynamoDB.
The replication and fault-tolerance mechanisms in DynamoDB were
created by author T.R. He did extensive testing and wrote
detailed informal correctness proofs.
To achieve the highest level of
confidence in the design, T.R. chose to apply TLA+.
T.R. learned TLA+ and wrote a detailed specification of these
components in a couple of weeks.
The model checker verified that a small,
complicated part of the algorithm worked as expected.
T.R. then checked the broader fault-tolerant
algorithm. This time the model checker found a bug that could lead to
losing data if a particular sequence of failures and recovery steps
was interleaved with other processing. This was a very subtle bug;
the shortest error trace exhibiting the bug contained 35 high-level
steps. The improbability of such compound events is not a defense
against such bugs; historically, AWS engineers have observed many
combinations of events at least as complicated as those that could
trigger this bug. The bug had passed unnoticed through extensive
design reviews, code reviews, and testing, and T.R. is convinced that
we would not have found it by doing more work in those conventional
areas. The model checker later found two bugs in other algorithms,
both serious and subtle.
T.R. says that, had he known about TLA+ before starting work on
DynamoDB he would have used it from the start.
Using TLA+
would likely have improved
time to market, in addition to achieving greater confidence in
correctness of the system.
Persuading More EngineersThe article describes the further spread of TLA+. It observes that: TLA+ can be taught by engineers who are still new to it themselves; this is important for quickly scaling adoption in an organization as large as Amazon.
On learning about TLA+, engineers usually ask,
Get design right. Formal methods help engineers get the design
right, which is a necessary first step toward getting the code right.
If the design is broken, then the code is almost certainly broken.
Engineers are unlikely to realize the
design is incorrect while focused on coding;
Gain better understanding. Formal methods help engineers gain a
better understanding of the design. Improved understanding can only
increase the chances they will get the code right; and
Write better code. Formal methods can help engineers write
better
Formal methods are a big success at AWS, helping us prevent subtle but
serious bugs from reaching production, bugs we would not have found
via any other technique. They have helped us devise aggressive
optimizations to complex algorithms without sacrificing quality. At
the time of this writing, seven Amazon teams have used TLA+, all
finding value in doing so. More Amazon teams are starting to use
TLA+. We believe that use of TLA+ will improve both
time-to-market and quality of our systems. Executive management is
now actively encouraging teams to write TLA+ specs for new
features and other significant design changes. In annual planning,
managers now allocate engineering time to TLA+.
|
||