On Tue 2021-11-16 10:52:39, Alexander Popov wrote:
> On 15.11.2021 18:51, Gabriele Paoloni wrote:
> > On 15/11/2021 14:59, Lukas Bulwahn wrote:
> > > On Sat, Nov 13, 2021 at 7:14 PM Alexander Popov <alex.popov@xxxxxxxxx> wrote:
> > > > On 13.11.2021 00:26, Linus Torvalds wrote:
> > > > > On Fri, Nov 12, 2021 at 10:52 AM Alexander Popov <alex.popov@xxxxxxxxx> wrote:
> > > > Killing the process that hit a kernel warning complies with the Fail-Fast
> > > > principle [1]. pkill_on_warn sysctl allows the kernel to stop the process when
> > > > the **first signs** of wrong behavior are detected.
> > > >
> > > In summary, I am not supporting pkill_on_warn. I would support the
> > > other points I mentioned above, i.e., a good enforced policy for use
> > > of warn() and any investigation to understand the complexity of
> > > panic() and reducing its complexity if triggered by such an
> > > investigation.
> >
> > Hi Alex
> >
> > I also agree with the summary that Lukas gave here. From my experience
> > the safety system are always guarded by an external flow monitor (e.g. a
> > watchdog) that triggers in case the safety relevant workloads slows down
> > or block (for any reason); given this condition of use, a system that
> > goes into the panic state is always safe, since the watchdog would
> > trigger and drive the system automatically into safe state.
> > So I also don't see a clear advantage of having pkill_on_warn();
> > actually on the flip side it seems to me that such feature could
> > introduce more risk, as it kills only the threads of the process that
> > caused the kernel warning whereas the other processes are trusted to
> > run on a weaker Kernel (does killing the threads of the process that
> > caused the kernel warning always fix the Kernel condition that lead to
> > the warning?)
>
> Lukas, Gabriele, Robert,
> Thanks for showing this from the safety point of view.
>
> The part about believing in panic() functionality is amazing :)

Nothing is 100% reliable.
With my printk() maintainer hat on, the current panic() implementation
is less reliable than it could be because it tries hard to provide
some debugging information, for example, the error message, backtrace,
registers, flushing of pending messages to the console, and crashdump.

See the panic() implementation: the reboot is done by
emergency_restart(). The rest is about dumping the information.

Well, the information is important. Otherwise, it is really hard to
fix the problem.

From my experience, especially the access to consoles is not fully
safe. The reliability might improve a lot when a lockless console
is used.

I guess that using non-volatile memory for the log buffer might
be even more reliable.

I am not familiar with the code under emergency_restart(). I am not
sure how reliable it is.

> Yes, safety critical systems depend on the robust ability to restart.

If I wanted to implement a super-reliable panic(), I would
use some external device that would cause a power-reset when
the watched device is not responding.

Best Regards,
Petr

PS: I do not believe much in the pkill approach either.

    It is similar to the OOM killer. And I always had to restart
    the system when it was triggered.

    Also, the kernel is not prepared for the situation where external
    code kills a kthread. And kthreads are used by many subsystems
    to handle work that has to be done asynchronously and/or in
    process context. And I guess that kthreads are a non-trivial
    source of WARN().