Model-based Testing Distributed Systems with P Language

P is the name of a domain-specific language for asynchronous event-driven programming. With P, we can specify a system as a set of interacting state machines which talk to each other by sending events. That makes P a suitable language for modeling distributed systems where nodes talk to each other asynchronously via messages. P originally developed by Microsoft and according to its Github repository, it is currently being used extensively inside Amazon AWS for model-checking complex distributed systems. In this post, we want to see how we can use P to catch bugs in our protocols. 

The word P is for protocols [2]. Although one of the main goals of P is to integrate the model-checking and code generation, in this post, we only focus on the model-checking part. It is important to note that unlike other model-checking tools for distributed systems like TLA+, P is not designed for exhaustive model-checking that we use to make sure of the correctness of our protocol in every possible execution. Instead, P is designed for model-based testing. Thus, P cannot be used to prove that a given protocol is correct, and just like any other testing method, its goal is actually to prove that our system is incorrect by finding bugs. 

Outline

Concepts

In this section, we see how programs are  modeled in P. 

Interacting State Machines

In P, we model a protocol as a set of interacting state machines. The state machines interact by sending events to each other. An event may or may not carry a payload. When it does not have a payload, it serves as a simple signal to another state machine. A state machine has a set of states and a set of transitions. The event sent by other state machines will be enqueued to the input queue of the receiving state machine. Each state defines a set of event handlers that will be executed when the state machine dequeues event from its input queue while being in this state. A state also defines blocks of codes to be executed upon arriving at or leaving the state. 

We define one of the state machines as the main state machine. The entry block of this main state machine is the entry point to the execution of our protocol and where the model-checker starts checking the executions. We can instantiate other state machines when we are running the code of a state machine. 

In the example shown in Sketch 1, the main machine creates two instances of the worker machine and then transitions to the SendRequests state. In this state, it sends REQ events to the worker machines and then transitions to the Waiting state where it waits for receiving WORK_DONE events from the worker machines. Once it collects WORK_DONE from both worker machines, it transitions back to the SendRequest state and repeats the process. 


Sketch 1
. An example P protocol. The main machine instantiates two worker machines. It sends event REQ to each of them and waits for them to signal that the work is down by sending a WORK_DONE event back to it. The main machine returns back to sending state once it received WORK_DONE from all worker machines.

Handling Events

Each state must handle any event that it dequeues from the input queue. Otherwise, the model-checker marks it as a bug. In other words, while the model-checker is checking the protocol, if it finds out that a state machine in a particular state is possible to dequeue an event but it does not have an event handler for that event, it notifies us about it. We can let the model-checker know that it is fine to ignore an event in a state, by explicitly saying that using the ignore keyword. 

Sometimes, we want to ignore an event in a particular state, but we don't want to completely ignore it for the rest of the execution. In other words, we just want to ignore it for now in the current state, and let the future states handle it. In these cases, we can tell the model-checker to defer a particular event in a particular state. Note that we cannot defer an event forever. If the model-checker sees that there is execution where an event is never handled, it raises an error. 

Call Transitions

To execute the code of another state we have two options. We can either "step into" the state or "call" that state by pushing the state. Stepping into another state is a normal transition that changes the current state of the state machine. However, calling a state is similar to calling a function, i.e. it puts the called state on top of the call stack without leaving the calling stated. When we transition to a state by calling it, the event handlers of the original state are still active. When we are done with the call, we can pop to return the control back to the original state. Raising an event that a called state does not handle also causes returning to the calling state. 

In the example shown in Sketch 2, we call state Working without leaving state Waiting. Since we don’t leave the state Waiting, the machine is responsive to the PING events even when it is executing the code of the Working state.  

Sketch 2. An example of a call transition. When the state machine enters state Working, it does not leave state Waiting. Thus, it is still responsive to PING events. 

Non-determinism 

Often when we are modeling the environment or faults (as we will see next), we face notions that we are not interested in modeling them or we just can't model them because we don't have enough information about them or they are too complex. In these cases, want can use non-determinism. For example, suppose we want to model the behavior of a client of an elevator system. We don't know why/when a client might want to go up or down (to know that we have to at least model the human brain and whole memory of the client! that might be a little complicated). Here, we can simply capture the behavior of the client via non-determinism as follows: "the client may decide to go up or down". Then, when we are model-checking our system, the model-checker checks all possible executions of the system following both decisions. If any of the decisions lead to a violation, the model-checker may notify us. In P, we can model this non-determinism using $ as follows: 

if ($) {
    //code for going up
} else {
    //code for going down
}

Another way of defining non-determinism is via using choose as we will see later.

If we repeatedly go to a state where the code above is executed, with pure non-determinism, we are considering the possibility of an execution where the client never goes down and always goes up. Sometimes, we don't like this pure non-determinism, and we like our non-deterministic execution to be "fair", i.e., both options occur infinitely often. In other words, we want to avoid an execution where the client from some point in time forever decides to only go up or only go down.  In these cases, we can model fair non-determinism using $$ as follows. 

if ($$) {
    //code for going up
} else {
    //code for going down
}
However, note that since P model-checker does not run an exhaustive model-checking as we said above, even if you use pure non-determinism, it is unlikely that the model-check considers the case where the client always decides to go down. In my experiments, I couldn’t produce a scenario where I get different results when I use $$ instead of $. 

Modeling Faults 

Faults are one of the major contributors to the complexity of distributed systems. Reasoning about the correctness of a protocol in presence of faults can become extremely hard for distributed systems engineers. One of the main purposes of model-checking a distributed protocol is to make sure of its correctness in presence of faults. 

In P, we can model faults as events. Then, we can define state machines that their job is just to send such events to other state machines to disrupt them. We can these machines Fault Injector machines. Our protocol is fault-tolerant if it still satisfies its safety and liveness properties even in presence of the disruption by the fault injector machines. 

Sketch 3 shows, how we can add a fault injector machine to the protocol defined in Sketch 1. The fault injector machine spontaneously decides to send a RESTART event to a non-deterministically-chosen machine among the main and worker machines causing it to crash and restart. To be able to handle RESTART events in all states of these machines, we can handle them in the Init state and instead of transitioning from the Init state to the next state, we can call the next state (see Call Transitions). This way, when a machine receives a RESTART event in any state, the handler of the Init state will be called and simulate a restart by resetting the variables and calling the first state again. To model a complete crash of a machine without recovery, the fault injector can send a halt event to that machine. 

Sketch 3. Adding a fault injector machine to the protocol shown in Sketch 2. The fault injector machine non-deterministically chooses one of the machines and sends a RESTART event to it.
 
Note that to verify liveness properties, often time, we need to assume that faults stop occurring at some point in time. Otherwise, liveness properties are not guaranteed to be satisfied. That is normal. Usually, we want our system to respect safety properties in presence of faults, and eventually satisfy the desired condition by making progress after faults stop occurring. 

Who instantiates the fault injector machine?
We can instantiate it in the entry of the main machine like other machines. If we don't like to have that in the code of our actual machines, we can write a test driver machine and make it the main machine that instantiates other machines.  

Modeling the Environment 

While checking our protocol, often time, we also need to model the context/environment where our protocol executes. For example, we may need to model clients or third-party systems. In P, we model those also as state machines, and they interact with the program state machines like other state machines. 

Sketch 4 adds the timeout mechanism to the protocol shown in Sketch 3. When the main machine transitions to state Waiting, it sends a REQ event to the timer machine. Upon receiving this REQ event, the timer machine transitions to the TimerOn state. In state TimerOn, the timer machine spontaneously sends TIMEOUT events to the main machine and also raises the event which will cause itself to transition back to the Waiting state. Upon receiving a TIMEOUT event, the main machine stops waiting for the worker machines, and transitions back to the SendRequests state. 


Sketch 4. Adding Timer machine to the system.  The timer starts upon a request from the main machine. It then spontaneously fires and sends TIMEOUT event to the main machine. The main machine raises a RESTART event in state Waiting upon receiving a TIMEOUT event. 

Specifying Requirements 

There are two classes of properties that we can enforce: safety and liveness. A safety property requires that a certain "bad condition" must not occur in our system. A liveness property, on the other hand, requires that a certain "good condition" must eventually occur in our system. When we are writing the code for our state machines, we can require local safety properties using assert statements. However, sometimes, we want to define safety requirements that span multiple machines. We can define those requirements using specification state machines

A specification state machine is a certain type of state machine that can observe the communication between other state machines and enforce our desired properties. Specifically, we can define a specification machine that observes any event of a certain type sent in our system. This way, the spec machine reacts to the event even when it is not the destination of the event. The spec machine can also examine the payload of the events. We will see how we can define spec machines later in this post. 

Temperature and Specifying Liveness

Defining safety properties using assert statements, locally or using specification machines, is straightforward. But what about liveness properties? In P, we can define liveness properties using the notion of temperature. For a specification machine, we can define some states as hot states. Remaining in hot states increases the temperature. Once the temperature reaches a certain temperature threshold, the model-checker raises an error.  The idea is that we have to get out of the hot states to prevent the temperature from going over the threshold. Thus, hot states are basically undesired states that we want to get out of them. To require that a certain condition is eventually satisfied in our system, we can design our specification machine to remain in a hot state while the condition is not satisfied, and make it leave the hot state when our desired condition is satisfied. 

Using the modeling explained above we can enforce that some conditions must be eventually satisfied in our system. However, sometimes, we want to enforce a condition like this: whenever A happens, then eventually B must happen. Thus, the cycle of “A, then, B” may repeat again and again. We can model these kinds of conditions using cold states. A cold state reset the temperature of a specification machine done to zero. To model “A, then, B” via cold and hot states, we can do as follows: upon A we transition to a hot state, and upon B we transition to a cold state. 

Let’s see an example. In the worker machines example introduced above, we want to make sure that anytime the main machine sends REQ events, eventually, we see an ALL_DONE event. To model this requirement, we can define a specification machine that observes two events REQ and ALL_DONE. Upon REQ, it transitions to the hot state Waiting. And upon observing ALL_DONE it transitions back to the cold state Init. 
Sketch 5. Adding spec state machine to enforce liveness. The spec machine enters the hot state Waiting upon observing a REQ event and transitions back to the cold state Init. Remaining in a hot state increases the temperature and entering a cold state reset the temperature to zero. If the temperature goes higher than a threshold, the model-checker reports a liveness bug. 


Liveness or safety?

The attentive reader may realize that the way we define “liveness” properties in P via the notion of temperature actually turns our liveness properties into safety properties. As we said above, a safety property requires that “something bad should not occur”. That “something bad” here is crossing the threshold limit. Although the temperature issue technically is not a liveness bug, the P model-checker refers to those bugs as liveness bugs. 

Writing P Programs

Disclaimer! The information provided here is based on my own understanding of the official tutorials, reading the code, and experimenting with P. The reference provided here is not guaranteed to be accurate or complete. Although after reading this section, you should be able to understand P examples and get started with P, it does not eliminate the need for an official reference for P by the development team. I hope we see such an official reference soon. 

Types

Primitive Types

The primitive types include bool, int, float, string, machine, and event

Enums

We can define enums:

enum STATUS {
SUCCESS,
ERROR
}      

Tuples

The type (t1,...,tn) is a tuple with n elements of type t1 through tn, respectively: 

var myTuple : (Type1, Type2, Type3);

myTuple.0 //an instance of Type1
myTuple.1 //an instance of Type2
myTuple.2 //an instance of Type3

Structs

Structs are basically named tuples:

var myTuple : (f1: Type1, f2: Type2, f3: Type3);

myTuple.f1 //an instance of Type1
myTuple.f2 //an instance of Type2
myTuple.f3 //an instance of Type3

Sequences

We can define an extensible array this way:
var mySeq : seq[Type1];
To add an entry: 
mySeq += (index, value)
To remove an entry: 
mySeq -= index
To access/set the ith entry:
mySeq[i]

To get the size: 
sizeof(mySeq)

To clear: 
mySeq = default(seq[int])

Maps

We can define a map this way:
var myMap: map[Type1, Type2];
To insert/update a key:
myMap[key] = value

//or

myMap += (key, value)
To remove a kye:
myMap -= key
To get the size of a container type using sizeof :
sizeof(myMap)

To clear: 
myMap = default(map[Type1, Type2])

Membership Check 

We can check if an element exists in a map variable using in.

if (mykey in myMap) {}

Custom Types 

We can define compound types: 
type Message = (value: string, timestamp: int);
Above, we defined a type named Message that is a structure with two members value and timestamp of types string and int, respectively. 

Events

We can define an event with event:
event MessageReceived: Message;
Above, we defined an event named MessageReceived. This event carries a payload of type Message.

We can require that at most N number of instances of an event may be present in the input queue of any machine using assert keyword:

event MessageReceived assert N: Message;

UnhandledEvent Exception

An important safety property enforced by P is the responsiveness of the state machine, i.e., a state machine must handle any event that it dequeues from its input queue. When a state machine dequeues an event from its input queue and does not have any code to handle it, P runtime throws an UnhandledEvent exception.

halt Event 

There is a special halt event. Sending this event to a state machine models an acceptable termination of the state machine. A halted machine receptive and consumes any event that is sent to it. Thus, sending any future event to the machine does not raise UnhandledEvent exception. 

send MachineTobeHalted, halt;

null Event

We can model spontaneous events by the null event. It is a special event that is internally generated in the runtime. For example, we can use this event to model the ticking of a clock. 

on null do {     send requester_machine, TIMEOUT, term;     raise TIMEOUT, term; }

State Machines

The structure of a state machine is: 
annotation machine StateMachineName {
  • variable defintions
  • state definitions
}
We can annotate a state machine to be the main machine using main keyword. However, when I use the main keyword I get the following error: 

line 6:0 extraneous input 'main' expecting {<EOF>, 'enum', 'event', 'eventset', 'machine', 'interface', 'fun', 'spec', 'type', 'module', 'implementation', 'test'}
Instead of using the main annotation, we can define the main machine when we define our test case as follows (see Examples section): 

test Test0 [main=SM1]: {SM1, SM2, SM3, SM4,...};
We can define a variable:

var myVariable : int;
The structure for defining a state is as follows (the order does not matter):

annotation state StateName {
  • entry {}
  • defer
  • ignore
  • event handlers
  • exit {}
}
A state can be annotated as the initial state using the start keyword. There are other annotations as well. 

The entry statement is a block of code that will be executed upon arriving at the state. The exit statement defines a block of code that will be executed upon leaving the state. The entry block of the start state can receive arguments that are passed when constructing a new instance of the state machine. The entry block of other states may receive an argument as well (see goto). 

entry (payload: PayLoadType) {
}

For events that may be arrived at each state, we can ignore, defer, or act upon. For these responses, can use defer, ignore, and on statements, respectively:

defer event1, event2, ...;
ignore event1, event2, ...;
on event1 do (payload: PayLoadType) {
//code for handling event1
}
on event2 do (payload: PayLoadType) {
//code for handling event2
}
We can use the following statements to write the blocks of code defining the behavior of our state machines. 

Local Variables

We can define local variables in the code blocks using the var keyword. 
entry (writeResp: tWriteTransResp) {     var i: int; i = 0;
    //rest of the code block
}

new

We can create a new instance of a machine using the new keyword:
server = new Server();
We can also pass arguments to the constructor. These arguments will be received by the entry block of the start state.  

this

The this keyword refers to the current instance of a state machine. 

return

We can ignore the rest of a code block using the return keyword. 

entry (writeResp: tWriteTransResp) {     // if the write was a time out lets not confirm it     if(writeResp.status == TIMEOUT)
        return;
    //rest of the code block
}

If 

if (condition) {
} else {
}
To define condition follow C syntax. Single line if does not need {}. We can have else if as well. 

Non-determinism

We can model non-determinism using $:
if ($) {
//option 1
} else {
//option 2
}
or we can use the choose keyword:

choose (mySeq) //non-deterministically choose one of the entries of mySeq
We can also choose a number non-deterministically:
choose (5) //choose an int in [0-4]

Fair Non-determinism

Using $ we can define pure non-determinism, but sometimes, we need fair non-determinism (see above for fair non-determinism). For such cases, we can use $$: 

if ($$) {
//option 1
} else {
//option 2
}

While Loop

while (condition) {
}

goto 

We can transition to a state using goto keyword inside a code block. We can pass a parameter to be received by the entry block of the destination as follows:
goto StateName, payload;

goto [with {}]

We can use goto statement for handling events as well:
on EventName goto StateName

Optionally, we can provide a code block to be executed before transitioning to the destination:

on EventName goto StateName with {
}

push/pop

Calling goto changes the state. However, sometimes we want to call a state without leaving the current state. Such a call is like pushing that state on the top of the call stack. For these cases, we can use the push which must be used as follows: 
on EventName push StateName;
It seems we cannot send payload or use with statement with push. An important point to note is that when we call a state with push, the event handlers of the calling state are still responsive to the future event. 

Using pop, we can leave the current state and return to the calling state. 

send 

We can send an event to a state machine as follows:
send stateMachineInstance, EventName, eventPayload
If the payload is a compound type we can generate an instance of it inline, as follows: 
send stateMachineInstance, EventName, (value = "my message", timestamp = 123456)
If we have a set of state machine instances, we can pick one of them non-deterministically using choose keyword. 
send chooose(seqOfStateMachineInstances), EventName, eventPayload
We could also pick the event or payload non-deterministically using choose keyword. 

raise

We can raise an event using the raise keyword. It terminates the execution of the enclosing code block and is handled immediately.

raise EvenName [, payload];
If the current state does not have an event handler for the raised event, and if the current state is pushed by another state, then raise will cause poping the current state and letting the calling state handle this event. 

It is very important to understand the difference between send and raise. Consider the following code: 

state State1 {
    entry {
        raise SUCCESS; 
    }
    on SUCCESS goto State2;
}

state State2 {
    on BEAT goto State1;
}
By using raise, we immediately transition from State1 to State2, as if there is no State1, i.e., our machine will NEVER stay in State1. Thus, if we send a BEAT event to this machine, it will be always responsive to it as in State2, it handles it properly. The code above is equivalent to the following code: 

state State1 {
    entry {
        goto State2;
    }
}

state State2 {
    on BEAT goto State1;
}
Now suppose we change the code as follows: 

state State1 {
    entry {
        send this, SUCCESS; 
    }
    on SUCCESS goto State2;
}

state State2 {
    on BEAT goto State1;
}
Now, we may stay at State1, and if we send BEAT to the machine while it is in this state, we get a bug by the model-checker, as we don't handle BEAT in State1. Thus, if you want to model non-atomic steps in your code don't use raise/goto. Instead, model those steps by sending events

announce

We can announce an event with the announce keyword. State machines who are observing the announced event will be notified. (see Specification Machines)

Functions 

fun myFunction (param1: Type1, param2: Type2) [: return_type] {
}
If you have defined a return type, then you can return a value using a return statement. 

Foreign functions 

Functions without body are foreign functions. Their body will be implemented outside in C#.

fun ChooseTransaction(): tRecord;
To see an example of a foreign function written in C# checkout this
We can print a formated message as follows: 

print format("Message value is {0} and its timestamp is {1}", myMessage.value, myMessage.timestamp)

assert

We can assert a condition. We can optionally give a message to be printed if the condition is violated. 
assert condition, format("message")

receive case

It causes the current code block to await an event. 
receive {
    case eventName1 : (payload: PayLoadType) {
    }

    case eventName2 : (payload: PayLoadType) {
    }
}

Safety and Liveness Properties

Specification Machines 

A specification machine can observe events in the system and react to them even when it is not the destination of an event. Thus, when other state machines are communicating with each other and send messages, a specification machine can examine those events and their payload and enforce certain invariants. 

We can define a specification machine as follows:
spec SpecMachineName observers EventName1, EventName2 {
  • variable defintions
  • state definitions
}

As mentioned above, a specification machine can examine an event sent from a state machine to another. However, sometimes we like a specification machine to examine an event that does not have a particular recipient. The sender state machine just want to "announce" it and let specification machines that are interested in it use it. In these cases, we can use announce keyword. 

Enforcing Safety 

To enforce safety properties we can use the assert keyword. For example, the following specification machine observes two events M_PING, and PONG. It maintains a mapping of the machines to the number of ping messages sent to each of them. On the PONG messages, it decrements the number of pending pings, and on M_PING it increments the number of ping messages and enforces that this number must not exceed 2. 

spec MySafetyProperty observes M_PING, PONG { var pending: map[machine, int]; start state Init { on M_PING do (payload: machine) { if (!(payload in pending)) pending[payload] = 0; pending[payload] = pending[payload] + 1; assert (pending[payload] <= 3); }
on PONG do (payload: machine) { assert (payload in pending); assert (0 < pending[payload]); pending[payload] = pending[payload] - 1; } } }

Enforcing Liveness

Using specification machines, we can also define properties that we want to be eventually true in our system. For this purpose, we can use hot states as follows:

spec MyLivenessProperty observes M_START, OBS_EVENT_1, OBS_EVENT_2, ... { //variable declaration start state Init { on M_START goto Wait; } hot state Wait { //hot means it is not a good state to remain at forever entry { //initialized your vars } on OBS_EVENT_1 do {
            //suppose here the desired property is satisfied:
            raise UNIT;
        }
on OBS_EVENT_2 do {} on UNIT goto Done; //leaving the hot state. } state Done { } //since it is not hot here, it is a good state to remain at forever }
At some point in the event handlers of a hot state, we should get out of it when our desired condition is met.

Examples 

You can find all examples in my Github repository [4]. 

HelloWorld

Once you model-check Test0 (see next section on how to compile and model-check), you can see the failure trace in HelloWorld_0_0.txt. It will include <ErrorLog> Assertion Failed: Hello World.

machine HelloWorld {
    start state Init {
        entry {
           assert (1==2), format("Hello {0}", "World");
        }
    }
}

test Test0 [main=HelloWorld]: {HelloWorld};

SimpleClientServer

This example demonstrates: 
  • defining events,
  • instantiating new instances of a machine with the new operator,
  • using a receive case statement,
  • using raise and goto, and
  • finally, using functions. 
Similar to the previous example, you should see the output in the txt file generated by the model-checker. In this case, <ErrorLog> Assertion Failed: received EVENT1 with param 34 will be printed in that file. 

event EVENT1: int;
event Request: machine; 
event Response: int; 
 
machine Client {
    var myMap : map[string, int];
    var server: machine; 
 
    start state Init {
        entry {
            server = new Server(); 
            send server, Request, this;
            receive {
                case Response: (payload: int) {
                    raise EVENT1, payload;
                }
            }        
        }
 
        on EVENT1 do (payload: int) {
            goto State2, testFunction(payload);
        } 
    }
 
    state State2 {
        entry (payload: int){
            assert (1==2), format ("received EVENT1 with param {0}", payload);
        }
    }
}
 
machine Server {
    start state init {
        on Request do (payload: machine){
            send payload, Response, 34; 
        }
    }
}
 
fun testFunction (num: int) :int {
    return num;
}

test Test0 [main=Client]: {Client, Server};

WorkerMachines

In this example, we consider the protocol shown in Sketch 1 with 10 worker machines. 

event REQ; 
event WORK_DONE;
event ALL_WORK_DONE;
event REQ_DONE; 

machine MainMachine {
    var workers : seq[machine]; 
    var workers_num: int; 
    var received_num: int; 

    start state Init {
        entry {
            var i: int;
            workers_num = 10;
            i = 0; 
            while (i < workers_num) {
                workers += (i, new Worker(this)); 
                i = i + 1;
            }
            goto SendRequests;
        }
    }

    state SendRequests {
        entry {
            var i: int;
            received_num = 0;
            i = 0; 
            while (i < workers_num) {
                send workers[i], REQ; 
                i = i + 1; 
            }
            send this, REQ_DONE;
        }
        on REQ_DONE goto Waiting; 
    }

    state Waiting {
        on WORK_DONE do {
            received_num = received_num + 1;
            assert received_num <= workers_num, format ("unexpected number of WORK_DONES: max {0}, but received {1}", workers_num, received_num);
            if (received_num == workers_num) {
                send this, ALL_WORK_DONE;
            } 
        }

        on ALL_WORK_DONE goto SendRequests;
    }
}

machine Worker {
    var requester_machine: machine; 

    start state Init {
        entry (id: machine){
            requester_machine = id; 
            goto Waiting;
        }
    }
    state Waiting {
        on REQ goto Working; 
    }

    state Working {
        entry {
            send requester_machine, WORK_DONE; 
            goto Waiting;
        }
    }
}

The model-checker will find a bug for this protocol. 

Pause here and try to answer why we have a bug in this protocol? 

The reason, we have bug is after sending requests to the worker machine, we might stay in state SendRequests for while before transitioning to state Waiting.  We don't have any event handler for event WORK_DONE in this state. During this time, if one of the worker machines is done already and sends its WORK_DONE event to the MainMachine, the event is not handled. Thus, we get a UnhandledEventException as follows: 

<ExceptionLog> PImplementation.MainMachine(2) running action '' in state 'SendRequests' threw exception 'UnhandledEventException'. <ErrorLog> PImplementation.MainMachine(2) received event 'PImplementation.DONE' that cannot be handled.
To solve this bug, we have to add an event handler for WORK_DONE to state SendRequests

 state SendRequests {
        entry {
            var i: int;
            received_num = 0;
            i = 0; 
            while (i < workers_num) {
                send workers[i], REQ; 
                i = i + 1; 
            }
            send this, REQ_DONE;
        }
        on REQ_DONE goto Waiting; 

        on WORK_DONE do {
            received_num = received_num + 1;
        }
    }

WorkerMachines with Call Transitions

In this example, we consider the protocol shown in Sketch 2 with 10 worker machines. Note how the MainMachine uses the null event to send PING messages. This will make MainMachine to spontaneously send PING messages. 

event REQ; 
event WORK_DONE;
event CALL_DONE;
event ALL_WORK_DONE;
event REQ_DONE; 
event PING;
event PONG; 

machine MainMachine {
    var workers : seq[machine]; 
    var workers_num: int; 
    var received_num: int; 

    start state Init {
        entry {
            var i: int;
            workers_num = 10;
            i = 0; 
            while (i < workers_num) {
                workers += (i, new Worker(this)); 
                i = i + 1;
            }
            goto SendRequests;
        }
    }

    state SendRequests {
        entry {
            var i: int;
            received_num = 0;
            i = 0; 
            while (i < workers_num) {
                send workers[i], REQ; 
                i = i + 1; 
            }
            send this, REQ_DONE;
        }
        on REQ_DONE goto Waiting; 

        on WORK_DONE do {
            received_num = received_num + 1;
        }

        ignore PONG;
    }

    state Waiting {
        on WORK_DONE do {
            received_num = received_num + 1;
            assert received_num <= workers_num, format ("unexpected number of WORK_DONES: max {0}, but received {1}", workers_num, received_num);
            if (received_num == workers_num) {
                send this, ALL_WORK_DONE;
            } 
        }

        on ALL_WORK_DONE goto SendRequests;

        on null do {
            send workers[0], PING;
        }

        ignore PONG;
    }
}

machine Worker {
    var requester_machine: machine; 

    start state Init {
        entry (id: machine){
            requester_machine = id; 
            goto Waiting;
        }
    }
    state Waiting {
        on REQ push Working; 
        on PING do {
            send requester_machine, PONG;
        }

        on WORK_DONE do {
            send requester_machine, WORK_DONE;
        }
    }

    state Working {
        entry {
            send this, CALL_DONE;
        }

        on CALL_DONE do {
            raise WORK_DONE;
        }
    }
}

WorkerMachines - Faulty

Now, we want to see how our WorkerMachines protocol performs in presence of faults that may restart any of the machines at any given time.  We add a fault-injector machine to the system as shown in Sketch 3. We also change the machines to handle RESTART events in their Init state. We also have to ignore REQ_DONE in the state Waiting of the main machine. 

To make sure our system works properly, we add an assert statement to the handler of the WORK_DONE even in state Waiting. 

event REQ; 
event WORK_DONE;
event ALL_WORK_DONE;
event REQ_DONE; 
event RESTART; 
event PUSH_START;
event INJECT_FAULT;

machine FaultInjector {
    var machines : seq[machine]; 
    var fault_count: int;
    var max_faults_num: int;

    start state Init {
        entry (targets: seq[machine]){
            var i : int; 
            
            fault_count = 0; 
            max_faults_num = 1000;
            
            i=0; 
            while (i < sizeof(targets)){
                machines += (i ,targets[i]);
                i = i + 1;
            }
            goto InjectFaults; 
        }
    }

    state InjectFaults {
        entry {
            send this, INJECT_FAULT; 
        }
        on INJECT_FAULT do {
            var m : machine; 
            m = choose (machines); 
            send m, RESTART;    
            fault_count = fault_count + 1;
            if (fault_count < max_faults_num)
                send this, INJECT_FAULT;
        }
    }
}


machine MainMachine {
    var workers : seq[machine]; 
    var workers_num: int; 
    var received_num: int; 

    start state Init {
        entry {
            var i: int;
            var targets : seq[machine]; 

            targets += (0, this); 
            workers_num = 10;
            i = 0; 
            while (i < workers_num) {
                workers += (i, new Worker(this)); 
                targets += (i+1, workers[i]); 
                i = i + 1;
            }

            new FaultInjector(targets);
            raise RESTART; 
        }

        on RESTART do {
            received_num = 0;
            raise PUSH_START; 
        }

        on PUSH_START push SendRequests;
    }

    state SendRequests {
        entry {
            var i: int;
            received_num = 0;
            i = 0; 
            while (i < workers_num) {
                send workers[i], REQ; 
                i = i + 1; 
            }
            send this, REQ_DONE;
        }
        on REQ_DONE goto Waiting; 

        on WORK_DONE do  {
            received_num = received_num + 1;
        }
    }

    state Waiting {
        on WORK_DONE do {
            received_num = received_num + 1;
            assert received_num <= workers_num, format ("unexpected number of WORK_DONES: max {0}, but received {1}", workers_num, received_num);
            if (received_num == workers_num) {
                send this, ALL_WORK_DONE;
            } 
        }

        on ALL_WORK_DONE goto SendRequests;

        ignore REQ_DONE; //bug fix
    }
}

machine Worker {
    var requester_machine: machine; 

    start state Init {
        entry (id: machine){
            requester_machine = id; 
            raise RESTART;
        }

        on RESTART push Waiting; 
    }

    state Waiting {
        on REQ goto Working; 
    }

    state Working {
        entry {
            send requester_machine, WORK_DONE; 
            raise WORK_DONE;
        }

        on WORK_DONE goto Waiting;
    }
}
When we model-check this protocol, we get the following assertion failure:

<ErrorLog> Assertion Failed: unexpected number of WORK_DONES: max 10, but received 11
This is because when the main machine restarts and sends a new set of REQs, it still may receive the response to some of its REQs sent before the failure. This will cause the number of received responses to go higher than the number of requests sent in after RESTART. 

WorkerMachines - Fault-tolerant 

To make sure our assertion regarding the number of received responses is satisfied even in presence of faults, we have to distinguish between requests and responses sents for each term of the main machine. Thus, we add a term variable to the main machine.  When the main machine restarts, it increments the term value. We add an integer to the payload of the events to indicate the term. Now, the main machine ignores any response from the older terms. 

event REQ: int; 
event WORK_DONE: int;
event ALL_WORK_DONE: int;
event REQ_DONE: int; 
event RESTART; 
event PUSH_START;
event INJECT_FAULT;

machine FaultInjector {
    var machines : seq[machine];
    var fault_count: int;
    var max_faults_num: int; 

    start state Init {
        entry (targets: seq[machine]){
            var i : int; 

            fault_count = 0; 
            max_faults_num = 10;
            
            i=0; 
            while (i < sizeof(targets)){
                machines += (i ,targets[i]);
                i = i + 1;
            }
            goto InjectFaults; 
        }
    }

    state InjectFaults {
        entry {
            send this, INJECT_FAULT; 
        }
        on INJECT_FAULT do {
            var m : machine; 
            m = choose (machines); 
            send m, RESTART;    
            fault_count = fault_count + 1;
            if (fault_count < max_faults_num)
                send this, INJECT_FAULT;
        }
    }
}

machine MainMachine {
    var workers : seq[machine]; 
    var workers_num: int; 
    var received_num: int; 
    var term : int; 

    start state Init {
        entry {
            var i: int;
            var targets : seq[machine]; 

            term = 0;
            targets += (0, this); 
            workers_num = 10;
            i = 0; 
            while (i < workers_num) {
                workers += (i, new Worker(this)); 
                targets += (i+1, workers[i]); 
                i = i + 1;
            }

            new FaultInjector(targets);
            raise RESTART; 
        }

        on RESTART do {
            term = term + 1 ;
            received_num = 0;
            raise PUSH_START; 
        }

        ignore ALL_WORK_DONE;
        on PUSH_START push SendRequests;
    }

    state SendRequests {
        entry {
            var i: int;
            received_num = 0;
            i = 0; 
            while (i < workers_num) {
                send workers[i], REQ, term; 
                i = i + 1; 
            }
            send this, REQ_DONE, term;
        }
        on REQ_DONE do (t: int) {
            if (t == term) 
                goto Waiting; 
        }

        on WORK_DONE do (t: int) {
            if (t == term) //bug fix (2)
                received_num = received_num + 1;
        }
        ignore ALL_WORK_DONE;
    }

    state Waiting {
        on WORK_DONE do (t: int) {
            if (t == term) {
                received_num = received_num + 1;
                assert received_num <= workers_num, format ("unexpected number of WORK_DONES: max {0}, but received {1}", workers_num, received_num);
                if (received_num == workers_num) {
                    send  this, ALL_WORK_DONE, term;
                } 
            }
        }

        on ALL_WORK_DONE do (t: int){
            if (t == term)
                goto SendRequests;
        } 

        ignore REQ_DONE; //bug fix (1)
    }
}

machine Worker {
    var requester_machine: machine; 

    start state Init {
        entry (id: machine){
            requester_machine = id; 
            raise RESTART;
        }

        on RESTART push Waiting; 
    }

    state Waiting {
        on REQ goto Working; 
    }

    state Working {
        entry (t: int) {
            send requester_machine, WORK_DONE, t; 
            raise WORK_DONE, t;
        }

        on WORK_DONE goto Waiting;
    }
}

WorkerMachines with Timeout 

To model the system shown in Sketch 4 with the timer, we add the following timer machine. 

machine Timer {
    var requester_machine: machine; 
    var term: int; 

    start state Init {
        entry (id: machine){
            requester_machine = id; 
            goto Waiting; 
        }
    }

    state Waiting {
        on REQ goto TimerOn; 
    }

    state TimerOn {
        entry (t: int){
            term = t; 
        }

        on null do {
            send requester_machine, TIMEOUT, term; 
            raise TIMEOUT, term; 
        }

        on TIMEOUT goto Waiting;

        ignore REQ;  
    }
}
Now, for the main machine, we ignore TIMEOUT events in state SendRequests but raise RESTART upon receiving TIMEOUT in state Waiting. Note that we keep the number of terms bounded. 

state Waiting {
        entry {
            send timer, REQ, term; 
        }

        on WORK_DONE do (t: int) {
            if (t == term) {
                received_num = received_num + 1;
                assert received_num <= workers_num, format ("unexpected number of WORK_DONES: max {0}, but received {1}", workers_num, received_num);
                if (received_num == workers_num) {
                    send this, ALL_WORK_DONE, term;
                } 
            }
        }

        on ALL_WORK_DONE do (t: int){
            if (term == t && term < max_term){
                goto SendRequests;
        } 

        on TIMEOUT do (t: int){
            if (term == t){
                raise RESTART; 
            }
        }

        ignore REQ_DONE; //bug fix (1)
    }


WorkerMachines -- Checking Liveness

We want to check that whenever there is a request, then eventually we have an ALL_WORK_DONE event. To model this requirement we define the following spec machine. Note that in the following code, Liveness is just a name, we could use any other name. Our Liveness machine has two states, one cold and one hot. It starts in cold state Init and upon observing a REQ event, it transitions to the hot state Waiting. In state Waiting it ignores future REQ messages, but upon observing an ALL_WORK_DONE message it transitions to the cold state Init. Thus, once our Liveness machine transitions to the hot state, it remains there until it  observes an ALL_WORK_DONE event. If due to some bug in our protocol the Liveness machine gets stuck in the hot state Waiting and temperature goes higher than the temperature threshold, the model-checker terminates and returns an error. 

spec Liveness observes  REQ, ALL_WORK_DONE {
    start cold state Init {
        on REQ goto Waiting; 
    }

    hot state Waiting {
        on ALL_WORK_DONE goto Init; 
        ignore REQ;
    }
}
We can add this spec machine to any of the WorkerMachiens examples above. To do that we have to assert it when we define our test cases as we will see later as follow:

test Test0 [main=MainMachine]: assert Liveness in {MainMachine, Worker, ...};
Adding this requirement to any of the models defined above results in the following error message: 

<ErrorLog> PImplementation.Liveness detected liveness bug in hot state 'Waiting_2' at the end of program execution.
It shows that all of the protocols we defined so far for the WorkersMachine example suffer from a liveness bug! Interestingly, I personally didn't know that, and once I added this requirement and checked the output, I found out that the following scenario can happen which leads to the violation of our liveness property: When the main machine is done with sending the requests, it may stay in the SendRequest state and receive all WORK_DONE from all worker machines. It increments the received_num, but it does not check if the received_num is equal to the workers_num in SendRequests states. Thus, when it transitions to the state Waiting, it will never receive any new WORK_DONE event, so it never sends ALL_WORK_DONE event. 

To fix this bug, we can check the condition right after incrementing the received_num, or we can add this check to the entry of the waiting state as follows: 
    state Waiting {
        entry { //Liveness bug fix
            if (received_num == workers_num) {
                send  this, ALL_WORK_DONE, term;
            } 
        }

    ...

Model-Checking P Programs

Configuring P Projects

Following the official tutorial examples, we can organize our P project files this way: 
  • PSrc: the source code of our state machines describing our protocol. 
  • PSpec: the source code for our specification state machines defining our desired requirements. 
  • PTst: the code of the test driver. It basically describes the deployment and properties to be model-checked. 
  • Foreign code written in C#. 
  • ProjeName.pproj file that describes the project files.
Following is an example for a .pproj file: 

<Project> <ProjectName>TwoPhaseCommit</ProjectName> <InputFiles> <PFile>./PSrc/</PFile> <PFile>./PSpec/</PFile> <PFile>./PTst/</PFile> </InputFiles> <Target>CSharp</Target> <OutputDir>./PGenerated/</OutputDir> </Project>

Writing Test Scripts

We can define our test cases as follows. 
test Test0 [main=MainMachine]:
    assert SpecMachine1, SpecMachine 2, ... in {MainMachine, Worker};

Using the Compiler and Model-checker

Follow instructions here to install P. We have to first compile our P code, and then model-check our test cases using the generated dll file. pc is the name of the P compiler. You can run it as follows:

pc -proj:ProjectName.pproj
In my machine, when I compile my program, it puts the dll in a folder names netcoreapp3.1. 

For model-checking, we have to use coyote. The official tutorial suggests set a function name pmc in the Windows Powershell profile as follows. On my Windows machine, however, setting pmc this way didn't work:

function pmc { dotnet %USERPROFILE%\.dotnet\tools\microsoft.coyote\1.0.5\lib\netcoreapp3.1\coyote.dll test $args}
However, I can execute the model-checker using coyote.exe test command:

coyote.exe test [path to my generated dll] -i 10000 --coverage activity -m PImplementation.Test0.Execut

Summary 

In this post, we saw how we can model distributed systems with P language for testing. In P, we model a protocol and its environment as a set of interacting machines that talk to each other by sending events. P is not for exhaustive model-checking to prove that a given protocol is correct. Instead, it is used for model-based testing that tries to catch bugs in a best-effort manner. To learn more about P you can use the references below.  

References 

[1] Ankush Desai , Vivek Gupta, Ethan Jackson, Shaz Qadeer, Sriram Rajamani, and Damien Zufferey. "P: safe asynchronous event-driven programming." ACM SIGPLAN Notices 48, no. 6 (2013): 321-332.

[2] P: A Domain-Specific Language for Asynchronous Event-Driven Programming [video],  https://www.youtube.com/watch?v=qjwrPMwNrFo

[3] P Github repository, https://github.com/p-org/P

[4] P Examples, https://github.com/roohitavaf/P/tree/master/Tutorial

Comments

Anonymous said…
Very helpful, thanks!
Anonymous said…
thanks for this quick introduction!

Popular posts from this blog

In-memory vs. On-disk Databases

ByteGraph: A Graph Database for TikTok

DynamoDB, Ten Years Later

Eventual Consistency and Conflict Resolution - Part 2