Abridged Version of Amazon Paper

Leslie Lamport

Last modified 3 August 2018

You'll miss a lot on this web site unless you enable Javascript in your browser.

This

page contains a highly abridged version of the following article about TLA+ use at Amazon, published in April, 2015:

    How Amazon Web Services Uses Formal Methods
    Chris Newcombe, Tim Rath, Fan Zhang, Bogdan Munteanu,
         Marc Brooker, and Michael Deardeuff
    Communications of the ACM, Vol. 58 No. 4, Pages 66-73
    available on-line

The following paper also describes Amazon's TLA+ use:

Why Amazon Chose TLA+
Chris Newcombe
Abstract State Machines, Alloy, B, TLA, VDM, and Z
Lecture Notes in Computer Science Volume 8477, 2014, pp 25-39
available from the publisher
A copy can also be obtained on request from the author:
.

Below is the highly abridged version of the first paper. Every word in blue is a direct quote from the paper, though much text has been omitted. I have added in black some connecting prose to substitute for deleted material, as well as a few comments marked by LL.

Since 2011, engineers at Amazon Web Services (AWS) have used formal specification and model checking to help solve difficult design problems in critical systems.

AWS builds systems that are inherently complex. Errors in the core of our system could cause loss or corruption of data. So, we need to reach extremely high confidence that it is correct. We have found the standard verification techniques in industry are necessary but not sufficient. We routinely use deep design reviews, code reviews, static code analysis, stress testing, fault-injection testing, and many other techniques, but we still find that subtle bugs can hide in complex concurrent fault-tolerant systems.

Some of the more subtle, dangerous bugs turn out to be errors in design. We have found that testing the code is inadequate as a method for finding these errors, as the number of reachable states of the code is astronomical.

Precise Designs

In order to find subtle bugs in a system design, it is necessary to have a precise description of that design. The final executable code is unambiguous but contains an overwhelming amount of detail. We needed to be able to capture the essence of a design in a few hundred lines of precise description. As our designs are unavoidably complex, we needed a highly expressive language, far above the level of code, but with precise semantics. And, as we wish to build services quickly, we wanted a language that is simple to learn and apply, avoiding esoteric concepts. We also wanted an existing ecosystem of tools. We were looking for an off-the-shelf method with high return on investment.

We found what we were looking for in TLA+, a formal specification language based on simple discrete math, or basic set theory and predicates, with which all engineers are familiar. We found it helpful that the same language is used to describe both the desired correctness properties of the system (the what) and the design of the system (the how). In TLA+, correctness properties and system designs are just steps on a ladder of abstraction, with correctness properties occupying higher levels, systems designs and algorithms in the middle, and executable code and hardware at the lower levels. TLA+ is intended to make it as easy as possible to show that a system design correctly implements the desired correctness properties, either through conventional mathematical reasoning or more easily and quickly through tools like the TLC model checker that exhaustively checks the desired correctness properties across all possible execution traces. The ladder of abstraction also helps designers manage the complexity of real-world systems. The freedom to choose and adjust levels of abstraction makes TLA+ extremely flexible.

TLA+ is accompanied by a second language called PlusCal that is closer to a C-style programming language but is much more expressive, as it uses TLA+ for expressions and values. Several engineers at Amazon have found they are more productive using PlusCal than using TLA+. In other cases, the additional flexibility of plain TLA+ has been very useful.

Formal Methods for Real-World Systems

In industry, formal methods have a reputation for requiring a huge amount of training and effort to verify a tiny piece of relatively straightforward code, so the return on investment is justified only in safety-critical domains. Our experience with TLA+ shows that perception to be wrong. At the time of this writing, Amazon engineers have used TLA+ on 10 large complex real-world systems. LL: By the time the paper appeared in print, the number had grown to 14. I don't have more recent information. In each, TLA+ has added significant value, either finding subtle bugs that we are sure we would not have found by other means, or giving us enough understanding and confidence to make aggressive performance optimizations without sacrificing correctness. Amazon now has seven teams using TLA+, with encouragement from senior management and technical leadership. Engineers from entry level to principal have been able to learn TLA+ from scratch and get useful results in two to three weeks, in some cases in their personal time on weekends and evenings, without further help or training.

Side Benefit

TLA+ changes the conventional approach of thinking about what can go wrong to specifying what must go right for the system to achieve its goals. We have found this rigorous what needs to go right approach to system design to be significantly less error prone than the ad hoc what might go wrong approach.

More Side Benefits

Writing a formal specification pays dividends over the lifetime of the system. All production services at Amazon are under constant development. Our first priority is always to avoid causing bugs in a production system, so we often have to answer the question: Is this change safe? A major benefit of having a precise, testable model of the core system is that we can quickly verify that even deep changes are safe or learn that they are unsafe without doing harm. In several cases, we have prevented subtle but serious bugs from reaching production. In other cases we have been able to make innovative performance optimizations that we would not have dared to do without having model-checked those changes. LL: Users at Intel also reported that model checking enabled optimizations they would not otherwise have made. We regularly have to bring new people up to speed on systems. To avoid creating subtle bugs, we need all engineers to have the same mental model of the system and for that shared model to be accurate, precise, and complete. A formal specification is precise, short, and can be explored and experimented upon with tools.

First Steps to Formal Methods

With hindsight, Amazon's path to formal methods seems straightforward; we had an engineering problem and we found a solution. Reality was somewhat different.

The paper here describes what led author Chris Newcombe to TLA+ after trying a different method that was not up to the task. Newcombe then tried to persuade colleagues at Amazon to adopt TLA+. However, engineers have almost no spare time for such things, unless compelled by need. Fortunately, a need was about to arise.

First Big Success at Amazon

In January 2012, Amazon launched DynamoDB. The replication and fault-tolerance mechanisms in DynamoDB were created by author T.R. He did extensive testing and wrote detailed informal correctness proofs. To achieve the highest level of confidence in the design, T.R. chose to apply TLA+.

T.R. learned TLA+ and wrote a detailed specification of these components in a couple of weeks. The model checker verified that a small, complicated part of the algorithm worked as expected. T.R. then checked the broader fault-tolerant algorithm. This time the model checker found a bug that could lead to losing data if a particular sequence of failures and recovery steps was interleaved with other processing. This was a very subtle bug; the shortest error trace exhibiting the bug contained 35 high-level steps. The improbability of such compound events is not a defense against such bugs; historically, AWS engineers have observed many combinations of events at least as complicated as those that could trigger this bug. The bug had passed unnoticed through extensive design reviews, code reviews, and testing, and T.R. is convinced that we would not have found it by doing more work in those conventional areas. The model checker later found two bugs in other algorithms, both serious and subtle.

T.R. says that, had he known about TLA+ before starting work on DynamoDB he would have used it from the start. Using TLA+ would likely have improved time to market, in addition to achieving greater confidence in correctness of the system.

Persuading More Engineers

The article describes the further spread of TLA+. It observes that: TLA+ can be taught by engineers who are still new to it themselves; this is important for quickly scaling adoption in an organization as large as Amazon.

Most Frequently Asked Question

On learning about TLA+, engineers usually ask, How do we know that the executable code correctly implements the verified design? The answer is we do not know. Despite this, formal methods help in multiple ways:

Get design right. Formal methods help engineers get the design right, which is a necessary first step toward getting the code right. If the design is broken, then the code is almost certainly broken. Engineers are unlikely to realize the design is incorrect while focused on coding;

Gain better understanding. Formal methods help engineers gain a better understanding of the design. Improved understanding can only increase the chances they will get the code right; and

Write better code. Formal methods can help engineers write better self-diagnosing code in the form of assertions. Formal methods can help improve assertions that help improve the quality of code.

Conclusion

Formal methods are a big success at AWS, helping us prevent subtle but serious bugs from reaching production, bugs we would not have found via any other technique. They have helped us devise aggressive optimizations to complex algorithms without sacrificing quality. At the time of this writing, seven Amazon teams have used TLA+, all finding value in doing so. More Amazon teams are starting to use TLA+. We believe that use of TLA+ will improve both time-to-market and quality of our systems. Executive management is now actively encouraging teams to write TLA+ specs for new features and other significant design changes. In annual planning, managers now allocate engineering time to TLA+.