Re: [PATCH V4 18/20] rv/monitor: Add safe watchdog monitor

Daniel Bristot de Oliveira <bristot@xxxxxxxxxx> · Mon, 4 Jul 2022 14:41:43 +0200

On 7/1/22 17:38, Steven Rostedt wrote:
> On Fri, 1 Jul 2022 07:45:50 -0700
> Guenter Roeck <linux@xxxxxxxxxxxx> wrote:
> 
>>> Do not confuse this with static analyzers or other general purpose tooling
>>> to find bugs. This is not the purpose of the run time verify. It is just to
>>> prove that certain use cases will perform as expected, given a limited
>>> input.
>>>   
>>
>> ... this is, unfortunately, not explained in the patch. I would have much less
>> of a problem with the series if those details were included.
> 
> It's one of those cases where developers get so involved in their code that
> they leave out the things that are so obvious to them, but not obvious to
> others ;-)

I agree that the main point here is that the documentation [patch 20/20] needs
to be extended and correctly linked to the code.

The goal of the model is to specify the minimum but obligatory steps to set a
watchdog (start, set a safe timeout, ping...) so it can be used by
"safety/heath" monitors in safety-critical systems.

Another goal is to reduce the amount of code/dependencies that will require
deeper inspections to qualify the subsystem for usage in a given context via
monitoring (as steven mentioned - more about it here [1]), without having to
reduce the generic subsystem.

Although the method allows one to create a complete model of the watchdog device
layer, covering all use cases, that is not the idea of this monitor. Moreover, a
full model would not be the adequate model for this specific (but relevant) case
raised and discussed in the Elisa workgroup [2].

The goal of the monitor (that uses the model) is to verify that the interaction
between the watchdog device layer and the "safety/heath" monitors follows this
established model, at runtime.

> My new saying is: "We work in a field where the obvious seldom is".
> 
> Hmm, I think I'll go tweet that :-)

/me liked the tweet... and yep, you clarified well the context in which this is
being applied.

>>
>> Not that I would mind such a verifier, if it was possible to define one,
>> but it would have to be tested with a large number of watchdog drivers
>> to ensure that it addresses all use cases, or at least with a substantial
>> percentage of use cases. It would also require that the state machine is
>> readable to give people a chance to fix it if turns out to be broken.

The patch 20/20 has the automaton in the ASCII art format. Both C and ASCII
models were extracted from the same .dot file. I am not including the .dot file
because there was a previous discussion with the doc people that prefer the
ASCII format in the documentation. But I can add as well, not linked with the
rst file. By reading the model in the ASCII format, it is possible to see that
it is broad enough to cover many watchdogs as it uses simple/generic operations.

The model (the .h) and the instrumentation (the .c) can be updated at any time.
I am adding the tooling to facilitate that, like [patch 06/20].

Patch 19/20 adds a safety application that enables the RV monitor, uses the
watchdog, and collects feedback from the monitor to see if the requested actions
are occurring in the model - and it can be used to test the RV monitor with
any watchdog.

The goal is the one I described above, so an exception generated by this monitor
needs to be read accordingly: it does not imply that a watchdog driver is
broken. It means the interaction between the safety/health monitor and the
watchdog is not following specifications and must be checked at the development
phase, or that something went really wrong at runtime phase.

I will add that to the documentation and emphasize the context of this monitor.

>> It would also have to be robust, meaning it would have to reject invalid
>> (unsupported) settings from the start and not only during runtime.
> 
> I would agree than any module would need to state up front exactly what it
> is modeling. In safety critical systems, all the components that are used
> are defined up front. Not sure if we can have the model not load if the
> required drivers to test are not loaded or ones not part of the model are
> (Daniel?).

I can add a check in the time in which the model is being enabled. But for this
specific case, removing that part might be better and adding it later if it
becomes fundamental.

In practice, selecting a watchdog device can assume that the get_timeleft
implementation is a requirement - so that option was just an additional check.

[1] This is one use case for the broader goal of the strategy discussed here:
https://www.youtube.com/watch?v=DkiwkAKOXNs

[2] https://www.youtube.com/watch?v=qFSYlCbHCYk
> Anyway, thanks for the feedback.
> 
> -- Steve