Explainability 000

Towards Efficient Verification of Parallel Applications with Mc SimGrid Joint work with Martin Quinson (Magellan) and Thierry Jéron (Devine)

Mathieu Laurent

October 14, 2024



Explainability 000

#### Distributed computing



- HPC applications are distributed and concurent
- Data shared via messages (e.g. MPI) or synchronizations (e.g. thread)
- Causes non-deterministic bugs
- Software model checking covers all cases

#### Content of this talk



#### 2 Dynamic software model checking

- Principle
- Partial order reduction
- Best First (O)DPOR

#### 3 Explainability



Explainability 000

| A small MF    | Pl example     |  |
|---------------|----------------|--|
| $P_{1}/P_{2}$ | P <sub>3</sub> |  |
| $Send(P_3)$   | Recv()         |  |
|               | Recv()         |  |



Explainability 000



Explainability 000



Explainability 000



Explainability 000



Explainability 000



Explainability 000



Explainability 000



Explainability 000



Explainability 000



Explainability 000



Explainability 000



Explainability 000

#### Stateless model checking



#### Stateless exploration

35 states for the same 2 behaviors.

Explainability

#### Transition dependency

# Two actions $a_1, a_2$ are **independent** if:





Example of two adjacent independent actions

#### Mazurkiewicz's traces [Maz'77]

Equivalence class of executions with adjacent independent actions swapped

Explainability 000

# DPOR approach [Fla'05]



7/15

Explainability 000

# DPOR approach [Fla'05]



Start with an arbitrary execution

Explainability 000 Conclusion 0

# DPOR approach [Fla'05]



Discover dependencies

Explainability 000

# DPOR approach [Fla'05]



Recursive DFS exploration of what has been added

Introduction 00 Dynamic software model checking

Explainability 000 Conclusion 0

# ODPOR approach [Abd'14]



Insert sequences instead of a single step

Explainability 000

# ODPOR approach [Abd'14]



What if the only bug is far from the first guess?

# Best First (O)DPOR



Multiple opened states



- Corresponding to partially explored executions:
- Notion of responsability between subtrees

Explainability

### What for?

- Alleviates the impact of early choices
- Allows the use of heuristic

```
While(True){
   While(!CAS(x, 0, 1)){
      y = 2;
   }
}
```

Example of a busy waiting



Counterexample in the other half

- Works around practical problems (as busy waiting)
- Encodes classical model checking behavior (like fairness)

Explainability

#### Experimental results

| MPI example slightly modified |                          |                |
|-------------------------------|--------------------------|----------------|
| $P_1$                         | $P_2$                    | P <sub>3</sub> |
| $Send(P_3)$                   | $Send(P_3)$              | MPI_Barrier()  |
| <pre>MPI_Barrier()</pre>      | <pre>MPI_Barrier()</pre> | Recv()         |
|                               |                          | Recv()         |



#### Why?

| ***************************************                                                                                              |
|--------------------------------------------------------------------------------------------------------------------------------------|
| *** DEADLOCK DETECTED ***                                                                                                            |
| *********************                                                                                                                |
| ] 1 actor is still active, awaiting something. Here is its status:                                                                   |
| - pid 3 (2@node-10.simgrid.org) simcall CommWait(comm_id:20 src:-1 dst:3 mbox:SMPI-3(id:3))                                          |
| Counter-example execution trace:                                                                                                     |
| Actor 2 in :0:() ==> simcall: iSend(mbox=3)                                                                                          |
| Actor 1 in :0:() ==> simcall: iSend(mbox=3)                                                                                          |
| Actor 1 in :0:() ==> simcall: iRecv(mbox=4)                                                                                          |
| Actor 1 in :0:() ==> simcall: iRecv(mbox=4)                                                                                          |
| Actor 2 in :0:() ==> simcall: iSend(mbox=4)                                                                                          |
| Actor 1 in :0:() ==> simcall: WaitComm(from 2 to 1, mbox=4, no timeout)                                                              |
| Actor 2 in :0:() ==> simcall: iRecv(mbox=5)                                                                                          |
| Actor 3 in :0:() ==> simcall: iSend(mbox=4)                                                                                          |
| Actor 1 in :0:() ==> simcall: WaitComm(from 3 to 1, mbox=4, no timeout)                                                              |
| Actor 1 in :0:() ==> simcall: iSend(mbox=5)                                                                                          |
| Actor 1 in :0:() ==> simcall: iSend(mbox=3)                                                                                          |
| Actor 1 in :0:() ==> simcall: iRecv(mbox=4)                                                                                          |
| Actor 1 in :0:() ==> simcall: iRecv(mbox=4)                                                                                          |
| <pre>Actor 2 in :0:() ==&gt; simcall: WaitComm(from 1 to 2, mbox=5, no timeout) Actor 2 in :0:() ==&gt; simcall: iSend(mbox=4)</pre> |
| Actor 2 in :0:() ==> simcall: isend(mbox=4)<br>Actor 1 in :0:() ==> simcall: WaitComm(from 2 to 1, mbox=4, no timeout)               |
| Actor 1 in :0:() ==> simcall: waitComm(from 2 to 1, mbox=4, no timeout)<br>Actor 2 in :0:() ==> simcall: iRecv(mbox=5)               |
| Actor 3 in :0:() ==> simcall: iRecv(mbox=3)<br>Actor 3 in :0:() ==> simcall: iRecv(mbox=3)                                           |
| Actor 3 in :0:() ==> simcall: NetV(mu0X=3)<br>Actor 3 in :0:() ==> simcall: WaitComm(from 1 to 3, mbox=3, no timeout)                |
| Actor 3 in :0:() ==> simcall: Walcomm(from 1 to 5, mbox=5, no timeout)<br>Actor 3 in :0:() ==> simcall: iSend(mbox=4)                |
| Actor 1 in :0:() ==> simcall: MaitComm(from 3 to 1, mbox=4, no timeout)                                                              |
| Actor 1 in :0:() ==> simcall: iSend(mbox=5)                                                                                          |
| Actor 1 in :0:() ==> simcall: iSend(mbox=3)                                                                                          |
| Actor 2 in :0:() ==> simcall: WaitComm(from 1 to 2, mbox=5, no timeout)                                                              |
| Actor 3 in :0:() ==> simcall: iRecv(mbox=3)                                                                                          |
| Actor 3 in :0:() ==> simcall: WaitComm(from 1 to 3. mbox=3. no timeout)                                                              |
| Actor 3 in :0:() ==> simcall: iRecv(mbox=3)                                                                                          |
| Actor 3 in :0:() ==> simcall: WaitComm(from 2 to 3, mbox=3, no timeout)                                                              |
| Actor 3 in :0:() ==> simcall: iRecv(mbox=3)                                                                                          |
|                                                                                                                                      |

Mc SimGrid output on a simple example with only two MPI\_Barrier().

Explainability

#### The critical transition



Critical transition

Let *E* be an incorrect execution,

Explainability

#### The critical transition

Critical transition

Critical transition

Let *E* be an incorrect execution, the **critical transition** is the unique  $t = (s, a, s') \in E$  s.t.

Explainability

#### The critical transition



Critical transition

Let *E* be an incorrect execution, the **critical transition** is the unique  $t = (s, a, s') \in E$  s.t.

• every execution from s' is incorrect

Explainability

#### The critical transition



#### Critical transition

Let *E* be an incorrect execution, the **critical transition** is the unique  $t = (s, a, s') \in E$  s.t.

- every execution from s' is incorrect
- there exists a correct execution from s

Explainability

# Critical transition: how?

- $s_{k+1}$  violates the property
- c<sub>1</sub> is the root of a correct subtree
- Hence, the critical transition is in  $\{b_1, \ldots, b_{k+1}\}$
- Use reduction and take a decision for the non-explored transitions



# Conclusion

#### What we have done

- New reduction algorithms allowing arbitrary search
- Defining and computing critical transition
- Implementing our reasearch in McSimGrid

#### Future work

- Parallelize the implementation by BeFS ODPOR
- Develop a good benchmark to explore heuristics
- Simplify counter examples using critical section