Re: infinite loop in crash due to double-NMI on x86_64 system

Dave Anderson <anderson@xxxxxxxxxx> · Mon, 28 Jun 2010 15:10:55 -0400 (EDT)

----- "Lucas Silacci" <Lucas.Silacci@xxxxxxxxxxxx> wrote:

> Below is the output of running crash (with the patch) against one of
> these dumps.
> 
> -Lucas
> 
> 
> crash 5.0.5
> Copyright (C) 2002-2010  Red Hat, Inc.
> Copyright (C) 2004, 2005, 2006  IBM Corporation
> Copyright (C) 1999-2006  Hewlett-Packard Co    
> Copyright (C) 2005, 2006  Fujitsu Limited      
> Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
> Copyright (C) 2005  NEC Corporation                  
> Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
> Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
> This program is free software, covered by the GNU General Public License,
> and you are welcome to change it and/or distribute copies of it under 
> certain conditions.  Enter "help copying" to see the conditions.
> 
> This program has absolutely no warranty.  Enter "help warranty" for
> details.
> 
> GNU gdb (GDB) 7.0
> Copyright (C) 2009 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later
> <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"   
> and "show warranty" for details.
> 
> This GDB was configured as "x86_64-unknown-linux-gnu"...
> 
> please wait... (determining panic task)                               
> 
> WARNING: Loop detected in the NMI Exception Stack!                    
> 
> 
> bt: cannot transition from exception stack to current process stack:
>     exception stack pointer: ffffffff8046dc50                       
>       process stack pointer: ffffffff8046ddd8
>          current stack base: ffffffff80422000
> 
>   SYSTEM MAP: /boot/System.map-2.6.16.53-0.8.PTF.434477.9.TDC.0-smp
> DEBUG KERNEL: /boot/vmlinux-2.6.16.53-0.8.PTF.434477.9.TDC.0-smp
> (2.6.16.53-0.8.PTF.434477.9.TDC.0-smp)
>     DUMPFILE: /var/crash/lucas.save/vmcore  [PARTIAL DUMP]
>         CPUS: 4
>         DATE: Tue May 18 12:46:07 2010
>       UPTIME: 07:24:54
> LOAD AVERAGE: 85.74, 82.85, 82.29
>        TASKS: 2449
>     NODENAME: POLO5_1-9
>      RELEASE: 2.6.16.53-0.8.PTF.434477.9.TDC.0-smp
>      VERSION: #1 SMP Fri Aug 31 06:07:27 PDT 2007
>      MACHINE: x86_64  (2660 Mhz)
>       MEMORY: 7.9 GB
>        PANIC: "Kernel panic - not syncing: dumpsw: Dump switch pushed; reason: 0x20  args=0xffffffff8046df08"
>          PID: 0
>      COMMAND: "swapper"
>         TASK: ffffffff8038c340  (1 of 4)  [THREAD_INFO: ffffffff80422000]
>          CPU: 0
>        STATE: TASK_RUNNING (PANIC)
> 
> crash> bt
> PID: 0      TASK: ffffffff8038c340  CPU: 0   COMMAND: "swapper"
>  #0 [ffffffff8046dc50] machine_kexec at ffffffff8011a95b
>  #1 [ffffffff8046dd20] crash_kexec at ffffffff80154351
>  #2 [ffffffff8046dde0] panic at ffffffff801327fa
>  #3 [ffffffff8046ded0] dumpsw_notify at ffffffff8831c0c3
>  #4 [ffffffff8046dee0] notifier_call_chain at ffffffff8032481f
>  #5 [ffffffff8046df00] default_do_nmi at ffffffff80322fab
>  #6 [ffffffff8046df40] do_nmi at ffffffff80323365
>  #7 [ffffffff8046df50] nmi at ffffffff8032268f
>     [exception RIP: smp_send_stop+84]
>     RIP: ffffffff80116e44  RSP: ffffffff8046ddd8  RFLAGS: 00000246
>     RAX: 00000000000000ff  RBX: ffffffff8831c1f8  RCX: 000041049c7256e8
>     RDX: 0000000000000005  RSI: 000000005238a938  RDI: 00000000002896a0
>     RBP: ffffffff8046df08   R8: 00000000000040fb   R9: 000000005238a7e8
>     R10: 0000000000000002  R11: 0000ffff0000ffff  R12: 000000000000000c
>     R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000000
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> --- <NMI exception stack> ---
>  #8 [ffffffff8046ddd8] smp_send_stop at ffffffff80116e44
> bt: WARNING: Loop detected in the NMI Exception Stack!
> bt: cannot transition from exception stack to current process stack:
>     exception stack pointer: ffffffff8046dc50
>       process stack pointer: ffffffff8046ddd8
>          current stack base: ffffffff80422000
> crash> 

What exactly was the sequence of events?  Was the system repeatedly and
erroneously running one NMI after another for some reason, and *then* the
"dump switch" was pressed?  And the dumpsw_notify() function sends another
NMI?  And where does that dumpsw_notify() function live anyway?

I'm just trying to get a grip on whether this will ever happen again, or
whether the patch is fixing a one-time hardware abnormality.

Dave
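
For readers following the thread, here is a rough C sketch of the kind of
check being discussed: deciding whether the RSP saved at the bottom of an
exception stack points back into that same stack. It is not the attached
patch (which is not shown in this message) and not crash's actual code. The
names classify_saved_rsp(), in_stack_range() and the enum are hypothetical,
and the stack sizes a caller would pass in (for example a 4 KB exception
stack and an 8 KB process stack) are assumptions, not values from the dump.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result codes for where a saved RSP leads next. */
enum next_stack {
    NEXT_PROCESS_STACK,         /* normal case: back to the task's stack  */
    NEXT_SAME_EXCEPTION_STACK,  /* corrupted: points back into itself     */
    NEXT_UNKNOWN                /* somewhere else, e.g. another exception
                                   stack; caller has to decide            */
};

/* Return 1 if addr lies in [base, base + size). Written as a subtraction
 * so it does not overflow for addresses near the top of the 64-bit space. */
static int in_stack_range(uint64_t base, uint64_t size, uint64_t addr)
{
    assert(size > 0);
    return addr >= base && addr - base < size;
}

/* Classify the RSP saved in the exception frame at the bottom of an
 * exception stack. An unwinder that simply followed a value classified as
 * NEXT_SAME_EXCEPTION_STACK would decode the same stack again and again. */
static enum next_stack classify_saved_rsp(uint64_t estack_base,
                                          uint64_t estack_size,
                                          uint64_t pstack_base,
                                          uint64_t pstack_size,
                                          uint64_t saved_rsp)
{
    if (in_stack_range(estack_base, estack_size, saved_rsp))
        return NEXT_SAME_EXCEPTION_STACK;
    if (in_stack_range(pstack_base, pstack_size, saved_rsp))
        return NEXT_PROCESS_STACK;
    return NEXT_UNKNOWN;
}
```

With the values from the output above, the saved RSP ffffffff8046ddd8 lies
between the frame addresses ffffffff8046dc50 and ffffffff8046df50 that bt
printed for the NMI exception stack, and well outside a stack based at
ffffffff80422000, so a check of this kind would report the loop instead of
following the pointer.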

> -----Original Message-----
> From: crash-utility-bounces@xxxxxxxxxx
> [mailto:crash-utility-bounces@xxxxxxxxxx] On Behalf Of Dave Anderson
> Sent: Friday, June 25, 2010 12:32 PM
> To: Discussion list for crash utility usage, maintenance and development
> Subject: Re: infinite loop in crash due to double-NMI on x86_64 system
> 
> 
> ----- "Lucas Silacci" <Lucas.Silacci@xxxxxxxxxxxx> wrote:
> 
> > Hi,
> >  
> > I've run into an issue where crash will enter an infinite loop while
> > decoding exception stacks if those stacks get corrupted.
> >  
> > We've seen this on four different systems where the hardware generated
> > multiple NMIs and the second and subsequent NMIs caused the NMI
> > exception stack to be overwritten. When this condition is hit, the
> > bottom rsp on the NMI exception stack (which would normally point you
> > back to the kernel thread stack or possibly a different exception stack)
> > points you back into the middle of the same NMI exception stack. This
> > causes crash to infinitely loop when it tries to decode that exception
> > stack.
> >  
> > Now clearly the root cause of the issue is faulty hardware that
> > generated multiple NMIs. However a very small change in crash can detect
> > this issue and stop the infinite loop from happening thereby allowing
> > you to get to a point in crash where you can actually tell that it was
> > an NMI that caused the system to dump.
> >  
> > The patch is attached to this email. For x86_64 it will detect the
> > condition of any exception stack that points back at itself.
> >  
> > Please feel free to ask me any questions on this.
> 
> Wow, that's pretty interesting -- I've certainly never seen that before.
> Can you show me what the backtrace looks like with your patch applied?
> 
> Thanks,
>   Dave
> 
> --
> Crash-utility mailing list
> Crash-utility@xxxxxxxxxx
> https://www.redhat.com/mailman/listinfo/crash-utility