Re: [ELISA Safety Architecture WG] [PATCH v2 0/2] Introduce the pkill_on_warn parameter

Petr Mladek <pmladek@xxxxxxxx> · Tue, 16 Nov 2021 09:41:46 +0100

On Tue 2021-11-16 10:52:39, Alexander Popov wrote:
> On 15.11.2021 18:51, Gabriele Paoloni wrote:
> > On 15/11/2021 14:59, Lukas Bulwahn wrote:
> > > On Sat, Nov 13, 2021 at 7:14 PM Alexander Popov <alex.popov@xxxxxxxxx> wrote:
> > > > On 13.11.2021 00:26, Linus Torvalds wrote:
> > > > > On Fri, Nov 12, 2021 at 10:52 AM Alexander Popov <alex.popov@xxxxxxxxx> wrote:
> > > > Killing the process that hit a kernel warning complies with the Fail-Fast
> > > > principle [1]. pkill_on_warn sysctl allows the kernel to stop the process when
> > > > the **first signs** of wrong behavior are detected.
> > > > 
> > > In summary, I am not supporting pkill_on_warn. I would support the
> > > other points I mentioned above, i.e., a good enforced policy for use
> > > of warn() and any investigation to understand the complexity of
> > > panic() and reducing its complexity if triggered by such an
> > > investigation.
> > 
> > Hi Alex
> > 
> > I also agree with the summary that Lukas gave here. From my experience
> > the safety system are always guarded by an external flow monitor (e.g. a
> > watchdog) that triggers in case the safety relevant workloads slows down
> > or block (for any reason); given this condition of use, a system that
> > goes into the panic state is always safe, since the watchdog would
> > trigger and drive the system automatically into safe state.
> > So I also don't see a clear advantage of having pkill_on_warn();
> > actually on the flip side it seems to me that such feature could
> > introduce more risk, as it kills only the threads of the process that
> > caused the kernel warning whereas the other processes are trusted to
> > run on a weaker Kernel (does killing the threads of the process that
> > caused the kernel warning always fix the Kernel condition that lead to
> > the warning?)
> 
> Lukas, Gabriele, Robert,
> Thanks for showing this from the safety point of view.
> 
> The part about believing in panic() functionality is amazing :)

Nothing is 100% reliable.

With my printk() maintainer hat on, the current panic() implementation
is less reliable because it tries hard to provide some debugging
information, for example, the error message, backtrace, registers,
flushing pending messages to the console, and a crashdump.

See the panic() implementation: the reboot is done by emergency_restart().
The rest is about dumping the information.

Well, the information is important. Otherwise, it is really hard to
fix the problem.

From my experience, especially the access to consoles is not fully
safe. The reliability might improve a lot when a lockless console
is used. I guess that using non-volatile memory for the log buffer
might be even more reliable.

I am not familiar with the code under emergency_restart(). I am not
sure how reliable it is.

> Yes, safety critical systems depend on the robust ability to restart.

If I wanted to implement a super-reliable panic(), I would
use some external device that would cause a power reset when
the watched device is not responding.
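
To illustrate the idea, here is a rough sketch of the userspace side of
such a setup, assuming the external device is exposed through the Linux
watchdog character device. wdt_arm() and wdt_kick() are hypothetical
helper names, and the device path and timeout are up to the integrator.
It is not meant as a complete or recommended design.

```c
/*
 * Sketch only: userspace side of a hardware watchdog, using the Linux
 * watchdog character device. wdt_arm() and wdt_kick() are hypothetical
 * helper names; the device path and the timeout are assumptions.
 */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/watchdog.h>

/*
 * Open the watchdog device and request a timeout in seconds.
 * Returns an open file descriptor, or -1 on failure.
 */
int wdt_arm(const char *path, int timeout_s)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0)
		return -1;

	if (ioctl(fd, WDIOC_SETTIMEOUT, &timeout_s) < 0) {
		/*
		 * "Magic close": writing 'V' before close() asks the driver
		 * to stop the timer. Drivers may ignore it (nowayout).
		 */
		if (write(fd, "V", 1) != 1) {
			/* Nothing more can be done here. */
		}
		close(fd);
		return -1;
	}

	return fd;
}

/*
 * Kick the watchdog so that it does not fire.
 * Returns 0 on success, -1 on failure.
 */
int wdt_kick(int fd)
{
	return ioctl(fd, WDIOC_KEEPALIVE, 0) < 0 ? -1 : 0;
}
```

The monitored workload would call wdt_arm() once, for example on
"/dev/watchdog", and then wdt_kick() only after each successful health
check. When the system hangs or gets stuck in panic(), the kicks stop
and the device resets the machine without relying on any kernel code.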

Best Regards,
Petr

PS: I do not believe much in the pkill approach either.

    It is similar to the OOM killer. And I always had to restart the
    system when it was triggered.

    Also, the kernel is not prepared for the situation where external
    code kills a kthread. And kthreads are used by many subsystems
    to handle work that has to be done asynchronously and/or in
    process context. And I guess that kthreads are a non-trivial
    source of WARN().