[PATCH v2] kernel/panic/kexec: fix "crash_kexec_post_notifiers" option issue in oops path

vgoyal@xxxxxxxxxx (Vivek Goyal) · Mon, 23 Mar 2015 10:31:58 -0400

On Mon, Mar 23, 2015 at 02:50:46PM +0100, Ingo Molnar wrote:
> 
> * Vivek Goyal <vgoyal at redhat.com> wrote:
> 
> > On Mon, Mar 23, 2015 at 08:19:43AM +0100, Ingo Molnar wrote:
> > > 
> > > * Baoquan He <bhe at redhat.com> wrote:
> > > 
> > > > CC more people ...
> > > > 
> > > > On 03/07/15 at 01:31am, "Hatayama, Daisuke" wrote:
> > > > > The commit f06e5153f4ae2e2f3b0300f0e260e40cb7fefd45 introduced
> > > > > "crash_kexec_post_notifiers" kernel boot option, which toggles
> > > > > whether panic() calls crash_kexec() before panic_notifiers and dump
> > > > > kmsg or after.
> > > > > 
> > > > > The problem is that the commit overlooks panic_on_oops kernel boot
> > > > > option. If it is enabled, crash_kexec() is called directly without
> > > > > going through panic() in oops path.
> > > > > 
> > > > > To fix this issue, this patch adds a check to
> > > > > "crash_kexec_post_notifiers" in the condition of kexec_should_crash().
> > > > > 
> > > > > Also, add a comment in kexec_should_crash() to explain the
> > > > > non-obvious aspects of this patch.
> > > > > 
> > > > > Signed-off-by: HATAYAMA Daisuke <d.hatayama at jp.fujitsu.com>
> > > > > Acked-by: Baoquan He <bhe at redhat.com>
> > > > > Tested-by: Hidehiro Kawai <hidehiro.kawai.ez at hitachi.com>
> > > > > Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt at hitachi.com>
> > > > > ---
> > > > >  include/linux/kernel.h |  3 +++
> > > > >  kernel/kexec.c         | 11 +++++++++++
> > > > >  kernel/panic.c         |  2 +-
> > > > >  3 files changed, 15 insertions(+), 1 deletion(-)
> > > 
> > > This is hack upon hack, but why was this crap merged in the first 
> > > place?
> > > 
> > > I see two problems just by cursory review:
> > > 
> > > 1)
> > > 
> > > Firstly, the real bug in:
> > > 
> > >   f06e5153f4ae ("kernel/panic.c: add "crash_kexec_post_notifiers" option for kdump after panic_notifers")
> > > 
> > > Was that crash_kexec() was called unconditionally after notifiers were 
> > > called, which should be fixed via the simple patch below (untested). 
> > > Looks much simpler than your fix.
> > > 
> > 
> > Hi Ingo,
> > 
> > Agreed. Your patch looks good.
> 
> In case you want that simpler fix and need my SOB:
> 
>   Signed-off-by: Ingo Molnar <mingo at kernel.org>
> 
> (but I have not tested it.)

I will quickly test it.

So this is a general fix but not a replacement for the fix in this patch?

Because the problem the original patch is trying to fix is that
crash_kexec() can be called from outside panic() too (via
kexec_should_crash()), and in that case panic notifiers will not be
called. So this patch just delays the call to crash_kexec() so that it
runs later, after the panic notifiers.
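
To make the ordering concrete, here is a rough user-space model of the
two paths. It is only a sketch, not kernel code: every model_* name, the
trace structure and the "patched" flag are made up for illustration, a
crash kernel is assumed to be loaded, and the real kexec_should_crash()
checks more conditions than panic_on_oops.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names exist in the kernel. */
enum event { EV_NOTIFIERS, EV_KMSG_DUMP, EV_CRASH_KEXEC };

#define MAX_EVENTS 8

struct trace {
	enum event ev[MAX_EVENTS];
	size_t n;
};

/* Append one step to the recorded sequence. */
static void record(struct trace *t, enum event e)
{
	assert(t->n < MAX_EVENTS);
	t->ev[t->n++] = e;
}

/* Models panic(): crash_kexec() runs before or after the notifiers. */
static void model_panic(struct trace *t, bool post_notifiers)
{
	if (!post_notifiers) {
		record(t, EV_CRASH_KEXEC);
		return;	/* assumed: the crash kernel takes over here */
	}
	record(t, EV_NOTIFIERS);
	record(t, EV_KMSG_DUMP);
	record(t, EV_CRASH_KEXEC);
}

/*
 * Models the kexec_should_crash() test made in the oops path.
 * "patched" selects the behaviour proposed in this patch: when the
 * post-notifiers option is set, do not crash here, defer to panic().
 */
static bool model_should_crash(bool panic_on_oops, bool post_notifiers,
			       bool patched)
{
	if (patched && post_notifiers)
		return false;
	return panic_on_oops;
}

/* Models the oops path. */
static void model_oops(struct trace *t, bool panic_on_oops,
		       bool post_notifiers, bool patched)
{
	if (model_should_crash(panic_on_oops, post_notifiers, patched)) {
		/* Direct call: notifiers and kmsg dump are skipped. */
		record(t, EV_CRASH_KEXEC);
		return;
	}
	if (panic_on_oops)
		model_panic(t, post_notifiers);
}
```

With panic_on_oops and the post-notifiers option both set, model_oops()
records only the crash_kexec step when patched is false, and records
notifiers, kmsg dump, crash_kexec in that order when patched is true.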

> 
> > > Secondly, and more importantly, the whole premise of commit 
> > > f06e5153f4ae is broken IMHO:
> > > 
> > >  "This can help rare situations where kdump fails because of unstable
> > >   crashed kernel or hardware failure (memory corruption on critical
> > >   data/code)"
> > > 
> > > wtf?
> > > 
> > > If the kernel crashed due to a kernel crash, then the kernel booting 
> > > up in whatever hardware state should be able to do a clean bootup. The 
> > > fix for those 'rare situations' should be to fix the real bug (for 
> > > example by making hardware driver init (or deinit) sequences more 
> > > robust), not to paper it over by ordering around crash-time sequences 
> > > ...
> > > 
> > > If it crashed due to some hardware failure, there's literally an 
> > > infinite amount of failure modes that may or may not be impacted by 
> > > kexec crash-time handling ordering. We don't want to put a zillion 
> > > such flags into the kernel proper just to allow the perturbation of 
> > > the kernel.
> > 
> > I think one of the motivations behind this patch was the call to
> > kmsg_dump(). Some vendors have been wanting the capability to save
> > kernel logs to some NVRAM before the transition to the second kernel
> > happens. Their argument is that kdump does not succeed all the time,
> > and if kdump does not succeed then at least they have something to
> > work with (kernel logs retrieved from the pstore interface).
> 
> Doesn't pstore attach itself to printk itself? AFAICS it does:
> 
>  fs/pstore/platform.c:   register_console(&pstore_console);
> 
> so the printk log leading up to and including the crash should be 
> available, regardless of this patch. What am I missing?

That's a good point. I was not aware of it. I am Ccing Don Zickus as
he has spent some time on this in the past.

Masami, would you have thoughts on this? IIRC, one reason kmsg_dump()
was written was so that one could dump kernel messages to NVRAM. If one
could simply register pstore as a console, then how will kmsg_dump()
continue to be useful?

> 
> > Not that I agree fully with this, as a problem might happen while we 
> > try to run panic_notifiers or kmsg_dump hooks and we might never
> > transition into the kdump kernel.
> 
> btw., this is the big problem with 'notifiers' in general: they are 
> opaque with barely any semantics defined, and a source of constant 
> confusion.

Agreed. That's the reason Eric never liked the idea of letting panic
notifiers run before crash_kexec().

> 
> > And for literally years some developers have been pushing to allow
> > panic notifiers to run before crash_kexec(). Eric Biederman has been
> > pushing back, saying it reduces the reliability of the kdump
> > operation and so is not acceptable.
> 
> So what do those notifiers do?

IIRC, two main reasons have come up in the past.

- In a cluster of nodes, people wanted to send some sort of notification
  to the main server that a node has crashed and should not be fenced
  off, as it might be saving a dump.

- And saving kernel logs to non volatile store.

There might be more that I am not aware of. Hatayama and Masami, can
you shed more light on this?

BTW, we faced the first problem in our clusters too, and it has now been
fixed. Basically, user space in the second kernel notifies the master
server that this node is still saving a dump, so it should not be
fenced off.

Thanks
Vivek

> 
> Thanks,
> 
> 	Ingo