----- "Lucas Silacci" <Lucas.Silacci@xxxxxxxxxxxx> wrote:

> Below is the output of running crash (with the patch) against one of
> these dumps.
>
> -Lucas
>
>
> crash 5.0.5
> Copyright (C) 2002-2010  Red Hat, Inc.
> Copyright (C) 2004, 2005, 2006  IBM Corporation
> Copyright (C) 1999-2006  Hewlett-Packard Co
> Copyright (C) 2005, 2006  Fujitsu Limited
> Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
> Copyright (C) 2005  NEC Corporation
> Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
> Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
> This program is free software, covered by the GNU General Public License,
> and you are welcome to change it and/or distribute copies of it under
> certain conditions.  Enter "help copying" to see the conditions.
>
> This program has absolutely no warranty.  Enter "help warranty" for
> details.
>
> GNU gdb (GDB) 7.0
> Copyright (C) 2009 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later
> <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> and "show warranty" for details.
>
> This GDB was configured as "x86_64-unknown-linux-gnu"...
>
> please wait... (determining panic task)
>
> WARNING: Loop detected in the NMI Exception Stack!
>
>
> bt: cannot transition from exception stack to current process stack:
>     exception stack pointer: ffffffff8046dc50
>       process stack pointer: ffffffff8046ddd8
>          current stack base: ffffffff80422000
>
>   SYSTEM MAP: /boot/System.map-2.6.16.53-0.8.PTF.434477.9.TDC.0-smp
> DEBUG KERNEL: /boot/vmlinux-2.6.16.53-0.8.PTF.434477.9.TDC.0-smp
>               (2.6.16.53-0.8.PTF.434477.9.TDC.0-smp)
>     DUMPFILE: /var/crash/lucas.save/vmcore  [PARTIAL DUMP]
>         CPUS: 4
>         DATE: Tue May 18 12:46:07 2010
>       UPTIME: 07:24:54
> LOAD AVERAGE: 85.74, 82.85, 82.29
>        TASKS: 2449
>     NODENAME: POLO5_1-9
>      RELEASE: 2.6.16.53-0.8.PTF.434477.9.TDC.0-smp
>      VERSION: #1 SMP Fri Aug 31 06:07:27 PDT 2007
>      MACHINE: x86_64  (2660 Mhz)
>       MEMORY: 7.9 GB
>        PANIC: "Kernel panic - not syncing: dumpsw: Dump switch pushed; reason: 0x20 args=0xffffffff8046df08"
>          PID: 0
>      COMMAND: "swapper"
>         TASK: ffffffff8038c340  (1 of 4)  [THREAD_INFO: ffffffff80422000]
>          CPU: 0
>        STATE: TASK_RUNNING (PANIC)
>
> crash> bt
> PID: 0      TASK: ffffffff8038c340  CPU: 0   COMMAND: "swapper"
>  #0 [ffffffff8046dc50] machine_kexec at ffffffff8011a95b
>  #1 [ffffffff8046dd20] crash_kexec at ffffffff80154351
>  #2 [ffffffff8046dde0] panic at ffffffff801327fa
>  #3 [ffffffff8046ded0] dumpsw_notify at ffffffff8831c0c3
>  #4 [ffffffff8046dee0] notifier_call_chain at ffffffff8032481f
>  #5 [ffffffff8046df00] default_do_nmi at ffffffff80322fab
>  #6 [ffffffff8046df40] do_nmi at ffffffff80323365
>  #7 [ffffffff8046df50] nmi at ffffffff8032268f
>     [exception RIP: smp_send_stop+84]
>     RIP: ffffffff80116e44  RSP: ffffffff8046ddd8  RFLAGS: 00000246
>     RAX: 00000000000000ff  RBX: ffffffff8831c1f8  RCX: 000041049c7256e8
>     RDX: 0000000000000005  RSI: 000000005238a938  RDI: 00000000002896a0
>     RBP: ffffffff8046df08   R8: 00000000000040fb   R9: 000000005238a7e8
>     R10: 0000000000000002  R11: 0000ffff0000ffff  R12: 000000000000000c
>     R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000000
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> --- <NMI exception stack> ---
>  #8 [ffffffff8046ddd8] smp_send_stop at ffffffff80116e44
> bt: WARNING: Loop detected in the NMI Exception Stack!
> bt: cannot transition from exception stack to current process stack:
>     exception stack pointer: ffffffff8046dc50
>       process stack pointer: ffffffff8046ddd8
>          current stack base: ffffffff80422000
> crash>

What exactly was the sequence of events? Was the system repeatedly
and erroneously running one NMI after another for some reason, and
*then* the "dump switch" was pressed? And the dumpsw_notify() function
sends another NMI? And where does that dumpsw_notify() function live
anyway?

I'm just trying to get a grip on whether this will ever happen again,
or whether it's fixing a one-time hardware abnormality?

Dave

> -----Original Message-----
> From: crash-utility-bounces@xxxxxxxxxx
> [mailto:crash-utility-bounces@xxxxxxxxxx] On Behalf Of Dave Anderson
> Sent: Friday, June 25, 2010 12:32 PM
> To: Discussion list for crash utility usage,maintenance and development
> Subject: Re: infinite loop in crash due to double-NMI on x86_64 system
>
>
> ----- "Lucas Silacci" <Lucas.Silacci@xxxxxxxxxxxx> wrote:
>
> > Hi,
> >
> > I've run into an issue where crash will enter an infinite loop while
> > decoding exception stacks if those stacks get corrupted.
> >
> > We've seen this on four different systems where the hardware generated
> > multiple NMIs and the second and subsequent NMIs caused the NMI
> > exception stack to be overwritten. When this condition is hit, the
> > bottom rsp on the NMI exception stack (which would normally point you
> > back to the kernel thread stack or possibly a different exception stack)
> > points you back into the middle of the same NMI exception stack. This
> > causes crash to infinitely loop when it tries to decode that exception
> > stack.
> >
> > Now clearly the root cause of the issue is faulty hardware that
> > generated multiple NMIs.
> > However a very small change in crash can detect
> > this issue and stop the infinite loop from happening thereby allowing
> > you to get to a point in crash where you can actually tell that it was
> > an NMI that caused the system to dump.
> >
> > The patch is attached to this email. For x86_64 it will detect the
> > condition of any exception stack that points back at itself.
> >
> > Please feel free to ask me any questions on this.
>
> Wow, that's pretty interesting -- I've certainly never seen that before.
> Can you show me what the backtrace looks like with your patch applied?
>
> Thanks,
> Dave
>
> --
> Crash-utility mailing list
> Crash-utility@xxxxxxxxxx
> https://www.redhat.com/mailman/listinfo/crash-utility
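
For illustration only, the condition described in the original report (an
exception stack whose saved rsp points back into that same exception stack)
could be checked roughly as sketched here. This is not the attached patch and
not crash's actual code: struct stack_range, in_stack() and
exception_stack_loops() are hypothetical names, and the 4 KB NMI stack range
used in the comments is an assumption chosen only so that it contains the
addresses shown in the bt output.

```c
#include <stdint.h>

/* Hypothetical description of one stack region; not a crash data structure. */
struct stack_range {
    uint64_t base;   /* lowest address of the stack */
    uint64_t size;   /* size of the stack in bytes */
};

/* Return 1 if addr lies inside [base, base + size), otherwise 0. */
static int in_stack(const struct stack_range *s, uint64_t addr)
{
    return addr >= s->base && addr - s->base < s->size;
}

/*
 * Return 1 if the rsp saved at the bottom of an exception stack points
 * back into that same exception stack. A backtrace that follows such a
 * pointer would re-enter the stack it just finished decoding, so the
 * caller should warn and stop instead of following it.
 *
 * Example with assumed bounds: an NMI stack of { 0xffffffff8046d000, 0x1000 }
 * and the saved rsp 0xffffffff8046ddd8 from the report gives 1.
 */
static int exception_stack_loops(const struct stack_range *estack,
                                 uint64_t saved_rsp)
{
    return in_stack(estack, saved_rsp);
}
```

A backtrace loop would call a check like this before switching stacks: if it
returns 1, print the warning and end the trace; otherwise continue on to the
process stack or the next exception stack as usual.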