This page contains Section 4.2 of the following article:
High-Level Specifications: Lessons from
This section was written by Brannon Batson, who was then an Intel engineer. It was preceded by a section describing the use of TLA+ and TLC at Compaq. Batson and some other Compaq engineers moved from Compaq to Intel.
4.2 IntelWe now describe the use of TLA+ and TLC by B[rannon] B[atson] and his colleagues at Intel. The actual specifications are proprietary and have not been viewed by anyone outside Intel.
4.2.1 Overview of the ProblemDesigning a complex system starts with a problem statement and an appropriate set of boundary conditions. A component of a computer system is initially represented abstractly as a black box, with assertions about its functionality and with some guidelines on performance, cost, and complexity. The engineering process involves iteratively refining this abstract model into lower-level models. Each lower-level model is a representation of the design at a certain level of abstraction, and it has a specific purpose. Some models are meant to evaluate tradeoffs between scope, performance, cost, and complexity. Others carry the design down to the low level of detail needed to manufacture the component.
The engineering process therefore creates multiple representations of a design. Validation entails checking these multiple representations against one another. Designers of digital systems have good tools and methods for validating mid-level functional models, written in a hardware description language (HDL), against lower-level models such as circuit net-lists. However, they have not had as much success checking the higher-level functional representations of the design against one another, and against the initial problem statement and functional assertions.
For some components, there is an intuitive correlation between the high-level notions of correctness and the HDL model; such components tend not to be difficult to validate. Other components, like multiprocessor cache-coherence protocols, are sufficiently complex that checking the HDL model against the problem statement is quite challenging. We need formal techniques from the world of mathematics to perform this high-level validation.
Although formal methods are based on mathematics, engineers view them differently from the way mathematicians do. To engineers, formal verification is simply another imperfect validation tool (albeit a powerful one). A TLA+ specification is only an abstraction of the actual system, and model checking can usually validate the specification only for a highly restricted set of system parameters. Validating the specification therefore cannot guarantee that there are no errors in the system. For engineers, formal verification is a way of finding bugs, not of proving correctness.
The main benefit of applying TLA+ to engineering problems comes from the efficiency of the TLC model checker in reaching high levels of coverage and finding bugs. A secondary benefit we have encountered is the ability of TLA+ and TLC to provide good metrics for the complexity of a design. Complexity is a major consideration in evaluating design tradeoffs. However, unlike performance or cost, engineers have not historically had a good way to quantify algorithmic complexity before attempting to validate a design. TLA+ encourages designers to specify the design abstractly, suppressing lower-level details, so the length of the specification provides a measure of a design's complexity. TLC reports the size of the reachable state space, providing another measure of complexity. Experience and intuition will always have a place in evaluating complexity, but TLA+ and TLC provide robust and impartial input to the evaluation. Having this input early in the design process is of considerable value.
4.2.2 Designing with TLA+The core group at Intel started using TLA+ at Compaq while working on the Alpha EV7 and EV8 multiprocessor projects described above. From that experience, the Alpha engineers learned that multiprocessor cache-coherence protocols are an ideal candidate for formal methods because most of the protocol bugs can be found at a high level of abstraction. They also learned that the true value of TLA+ and TLC would be realized when (a) they were applied early enough in the design to provide implementation feedback, and (b) the implementation was based directly on the specification that had been verified. On the EV8 project, the TLA+ specification was completed before the design was stable, and it provided important feedback to the designers.
When the engineers from the Alpha group joined Intel, they began applying their experience in writing TLA+ specifications when collaborating with other Intel engineers on cache-coherence protocols for future Intel products. Intel engineers are now using TLA+ as an integral part of the design process for the protocols that are under study.
Whiteboard PhaseDesigning one cache-coherence protocol from scratch provided the engineers with the opportunity to evaluate TLA+ as a prototyping platform for complex algorithms. Work on this protocol started by exploring the design space on a whiteboard for about two months. In this phase, basic message sequencing was determined, as were some coarse notions of what state had to be recorded at the protocol endpoints. A basic direction was set, based on the guidelines for functionality, performance, and cost.
Because of their background, engineers tend to visualize an algorithm
in terms of a particular implementation. They are better at
gauging implementation complexity than at measuring algorithmic
complexity. One benefit of having engineers write formal
specifications is that it helps them learn how to think about a
protocol abstractly, independent of implementation details. We found
that, even in the whiteboard phase of the protocol design, the Intel
engineers were able to make some judgments on complexity by asking
The whiteboard phase produced a general understanding of the protocol philosophy, an understanding of the constraints placed on the communication medium, the basic message flows, and coarse ideas on what state needed to be maintained. The next step was to introduce the rigor of a formal specification.
TLA+ Scratchpad PhaseThe TLA+ scratchpad phase of the project involved formally describing the abstract system, with appropriate state variables representing high-level components. This phase took about two months, starting with the initial design of the protocol. The difficulty lay not in the use of TLA+ — engineers frequently learn new programming languages—but rather in (a) determining the layer of abstraction and (b) exploring the protocol's corner cases. Task (a) is where TLA+ forces engineers to think about the protocol abstractly, which they often find unnatural. Their ability to think abstractly improves with experience writing TLA+ specifications. Task (b) is inevitable when documenting a protocol formally, as it forces the designers to explore the corner cases. During the scratchpad phase, the designers had to return to the whiteboard a few times when they encountered new race cases while writing the specification.
The actions that formed the major blocks of the specification were chosen early; very few changes were made later. The Intel engineers adopted a methodology used in the earlier Alpha specifications, in which the decomposition of high-level named actions is based on classifying the protocol messages that they process. This methodology has led to fairly readable specifications, since it means that each action changes only a few local state variables. It encouraged the protocol specifications to be designed in a modular way, which also enabled the inter-module interfaces in the specification to be similar to their low-level counterparts in the implementation.
Running TLCThe initial week or so of running TLC was spent finding and fixing typographical errors and type mismatch problems. This time could probably have been shortened by doing more syntax checking when writing the specification, which is what one often does when programming.
The next four weeks saw a continuous process of running TLC, finding bugs, fixing them, and re-running TLC. During this phase of the project, many assumptions and assertions about the protocol were brought into question. This had the effect of educating the engineers about the protocol they had designed. We have found that TLC can be a useful learning tool if we use in-line assertions and global invariants to check everything we think is true. The Intel engineers were able to develop an intuitive understanding of the correctness of the protocol by developing meaningful global invariants and having TLC check them. If an assertion or invariant fails, TLC generates a counterexample that is useful for visualizing a difficult race case. These counterexamples are such a powerful teaching aid that the Intel engineers have developed tools to translate the TLC output into nicely formatted protocol flow diagrams that are easier to read.
Another useful feature of the TLC model checker is its coverage
checking. TLC can print the number of times each action was
The performance of the model checker was sufficient to debug a large
protocol specification. The engineers determined a base configuration
4.2.3 Optimizing with TLCOnce the initial protocol specification was successfully checked by TLC, the Intel engineers were able to use it as a test bed for exploring optimizations. TLA+ is an ideal language to explore changes because its expressiveness usually allows the new version to be written quickly. Model checking the modified specification with TLC not only checks functional correctness, but it also measures any increase in the state space. Such an increase implies additional algorithmic complexity. The engineers spent several months exploring additions to the protocol, testing them with TLC. As a general rule, they would consider adopting only those optimizations that did not appreciably expand the state space. The insight that TLA+ and TLC gave into the complexity of modifications to the protocol was invaluable in iterating towards an optimal solution that adequately weighed algorithmic complexity along with factors like cost and performance.
A significant optimization was later made to the protocol. This optimization followed the normal design cycle described above, though on a compressed schedule. With the original design yielding a good starting point, the entire cycle (whiteboard phase, TLA+ coding, and verification with TLC) was done within six weeks. This modification was accomplished by a recent college graduate with an undergraduate degree in engineering. He was able to learn TLA+ well enough within a matter of weeks to do this work.
4.2.4 Feedback on TLA+ syntaxThe feedback we have received from engineers about the TLA+ language has been mostly positive. Engineers are usually able to pick up and understand a specification within a couple of days. One mistake we made was to present TLA+ to hardware designers as similar to a programming language. This led to some frustration. A better approach seems to be to describe TLA+ as being like a hardware description language. Engineers who design digital systems are well acquainted with methods for specifying finite-state machines, with the associated restrictions of allowing a primed variable to be assigned a value only once within a conjunction, not allowing a primed variable to appear in a conjunction before the assignment of its value, etc. To an engineer, TLA+ looks like a language for specifying finite-state machines.
While writing the protocol specification at Intel, BB was impressed by
the ease of specifying complex data structures in TLA+ as sets and
tuples. The part of the specification that described and manipulated
data structures was a small part of the complete protocol
specification. This compact specification of
For the algorithmic specification, TLA+ naturally encourages nested disjunctions of conjunctions (known to engineers as sums of products of expressions). This method for specifying Boolean formulas has both advantages and disadvantages. One advantage is that it allows expressive comment blocks and assertions to be inserted in-line with a nested conjunct. A disadvantage is that this tends to lead to large specifications. The engineers are experimenting with the use of TLA+ operators to encode large blocks of regular Boolean disjunctions as truth tables, which engineers find more natural to work with.