Re: [PATCH] Introduce the pkill_on_warn boot parameter

Steven Rostedt <rostedt@xxxxxxxxxxx> · Thu, 30 Sep 2021 12:59:03 -0400

On Thu, 30 Sep 2021 11:15:41 +0200
Petr Mladek <pmladek@xxxxxxxx> wrote:

> On Wed 2021-09-29 12:49:24, Paul E. McKenney wrote:
> > On Wed, Sep 29, 2021 at 10:01:33PM +0300, Alexander Popov wrote:  
> > > On 29.09.2021 21:58, Alexander Popov wrote:  
> > > > Currently, the Linux kernel provides two types of reaction to kernel
> > > > warnings:
> > > >  1. Do nothing (by default),
> > > >  2. Call panic() if panic_on_warn is set. That's a very strong reaction,
> > > >     so panic_on_warn is usually disabled on production systems.  
> 
> Honestly, I am not sure if panic_on_warn() or the new pkill_on_warn()
> work as expected. I wonder who uses it in practice and what is
> the experience.

Several people use it, as I see reports all the time when someone can
trigger a warn on from user space, and it's listed as a DOS of the
system.

> 
> The problem is that many developers do not know about this behavior.
> They use WARN() when they are lazy to write more useful message or when
> they want to see all the provided details: task, registry, backtrace.

WARN() Should never be used just because of laziness. If it is, then
that's a bug. Let's not use that as an excuse to shoot down this
proposal. WARN() should only be used to test assumptions where you do
not believe something can happen. I use it all the time when the logic
prevents some state, and have the WARN() enabled if that state is hit.
Because to me, it shows something that shouldn't happen happened, and I
need to fix the code.

Basically, WARN should be used just like BUG. But Linus hates BUG,
because in most cases, these bad areas shouldn't take down the entire
kernel, but for some people, they WANT it to take down the system.

> 
> Also it is inconsistent with pr_warn() behavior. Why a single line
> warning would be innocent and full info WARN() cause panic/pkill?

pr_warn() can be used for things that the user can hit. I'll use
pr_warn, for memory failures, and such. Something that says "we ran out
of resources, this will not work the way you expect". That is perfect
for pr_warn. But not something that requires a stack dump.

> 
> What about pr_err(), pr_crit(), pr_alert(), pr_emerg()? They inform
> about even more serious problems. Why a warning should cause panic/pkill
> while an alert message is just printed?

Because really, WARN() == BUG() but like I said, Linus doesn't like
taking down the entire system on these areas.

> 
> 
> It somehow reminds me the saga with %pK. We were not able to teach
> developers to use it correctly for years and ended with hashed
> pointers.
> 
> Well, this might be different. Developers might learn this the hard
> way from bug reports. But there will be bug reports only when
> anyone really enables this behavior. They will enable it only
> when it works the right way most of the time.

The panic_on_warn() has been used for years now. I do not think this is
an issue.

> 
> I wonder if kernel could survive killing of any kthread. I have never
> seen a code that would check whether a kthread was killed and
> restart it.

We can easily check if the thread is a kernel thread or a user thread,
and make the decision on that.

-- Steve