> -----Original Message-----
> From: crash-utility-bounces@xxxxxxxxxx
> [mailto:crash-utility-bounces@xxxxxxxxxx] On Behalf Of Dave Anderson
> Sent: Monday, June 28, 2010 1:35 PM
> To: Discussion list for crash utility usage,maintenance and development
> Subject: Re: infinite loop in crash due to double-NMI on x86_64 system
>
>
> ----- "Lucas Silacci" <Lucas.Silacci@xxxxxxxxxxxx> wrote:
>
> > > -----Original Message-----
> > > From: crash-utility-bounces@xxxxxxxxxx
> > > [mailto:crash-utility-bounces@xxxxxxxxxx] On Behalf Of Dave Anderson
> > > Sent: Monday, June 28, 2010 12:11 PM
> > > To: Discussion list for crash utility usage,maintenance and development
> > > Subject: Re: infinite loop in crash due to double-NMI on x86_64 system
> > >
> > >
> > > ----- "Lucas Silacci" <Lucas.Silacci@xxxxxxxxxxxx> wrote:
> > >
> > > > Below is the output of running crash (with the patch) against one of
> > > > these dumps.
> > > >
> > > > -Lucas
> > > >
> > > >
> > > > crash 5.0.5
> > > > Copyright (C) 2002-2010 Red Hat, Inc.
> > > > Copyright (C) 2004, 2005, 2006 IBM Corporation
> > > > Copyright (C) 1999-2006 Hewlett-Packard Co
> > > > Copyright (C) 2005, 2006 Fujitsu Limited
> > > > Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
> > > > Copyright (C) 2005 NEC Corporation
> > > > Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
> > > > Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
> > > > This program is free software, covered by the GNU General Public License,
> > > > and you are welcome to change it and/or distribute copies of it under
> > > > certain conditions. Enter "help copying" to see the conditions.
> > > > This program has absolutely no warranty. Enter "help warranty" for
> > > > details.
> > > >
> > > > GNU gdb (GDB) 7.0
> > > > Copyright (C) 2009 Free Software Foundation, Inc.
> > > > License GPLv3+: GNU GPL version 3 or later
> > > > <http://gnu.org/licenses/gpl.html>
> > > > This is free software: you are free to change and redistribute it.
> > > > There is NO WARRANTY, to the extent permitted by law. Type "show copying"
> > > > and "show warranty" for details.
> > > >
> > > > This GDB was configured as "x86_64-unknown-linux-gnu"...
> > > >
> > > > please wait... (determining panic task)
> > > >
> > > > WARNING: Loop detected in the NMI Exception Stack!
> > > >
> > > > bt: cannot transition from exception stack to current process stack:
> > > >     exception stack pointer: ffffffff8046dc50
> > > >       process stack pointer: ffffffff8046ddd8
> > > >          current stack base: ffffffff80422000
> > > >
> > > >   SYSTEM MAP: /boot/System.map-2.6.16.53-0.8.PTF.434477.9.TDC.0-smp
> > > > DEBUG KERNEL: /boot/vmlinux-2.6.16.53-0.8.PTF.434477.9.TDC.0-smp
> > > >               (2.6.16.53-0.8.PTF.434477.9.TDC.0-smp)
> > > >     DUMPFILE: /var/crash/lucas.save/vmcore [PARTIAL DUMP]
> > > >         CPUS: 4
> > > >         DATE: Tue May 18 12:46:07 2010
> > > >       UPTIME: 07:24:54
> > > > LOAD AVERAGE: 85.74, 82.85, 82.29
> > > >        TASKS: 2449
> > > >     NODENAME: POLO5_1-9
> > > >      RELEASE: 2.6.16.53-0.8.PTF.434477.9.TDC.0-smp
> > > >      VERSION: #1 SMP Fri Aug 31 06:07:27 PDT 2007
> > > >      MACHINE: x86_64 (2660 Mhz)
> > > >       MEMORY: 7.9 GB
> > > >        PANIC: "Kernel panic - not syncing: dumpsw: Dump switch pushed; reason: 0x20 args=0xffffffff8046df08"
> > > >          PID: 0
> > > >      COMMAND: "swapper"
> > > >         TASK: ffffffff8038c340 (1 of 4) [THREAD_INFO: ffffffff80422000]
> > > >          CPU: 0
> > > >        STATE: TASK_RUNNING (PANIC)
> > > >
> > > > crash> bt
> > > > PID: 0 TASK: ffffffff8038c340 CPU: 0 COMMAND: "swapper"
> > > >  #0 [ffffffff8046dc50] machine_kexec at ffffffff8011a95b
> > > >  #1 [ffffffff8046dd20] crash_kexec at ffffffff80154351
> > > >  #2 [ffffffff8046dde0] panic at ffffffff801327fa
> > > >  #3 [ffffffff8046ded0] dumpsw_notify at ffffffff8831c0c3
> > > >  #4 [ffffffff8046dee0] notifier_call_chain at ffffffff8032481f
> > > >  #5 [ffffffff8046df00] default_do_nmi at ffffffff80322fab
> > > >  #6 [ffffffff8046df40] do_nmi at ffffffff80323365
> > > >  #7 [ffffffff8046df50] nmi at ffffffff8032268f
> > > >     [exception RIP: smp_send_stop+84]
> > > >     RIP: ffffffff80116e44 RSP: ffffffff8046ddd8 RFLAGS: 00000246
> > > >     RAX: 00000000000000ff RBX: ffffffff8831c1f8 RCX: 000041049c7256e8
> > > >     RDX: 0000000000000005 RSI: 000000005238a938 RDI: 00000000002896a0
> > > >     RBP: ffffffff8046df08 R8: 00000000000040fb R9: 000000005238a7e8
> > > >     R10: 0000000000000002 R11: 0000ffff0000ffff R12: 000000000000000c
> > > >     R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> > > >     ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
> > > > --- <NMI exception stack> ---
> > > >  #8 [ffffffff8046ddd8] smp_send_stop at ffffffff80116e44
> > > > bt: WARNING: Loop detected in the NMI Exception Stack!
> > > > bt: cannot transition from exception stack to current process stack:
> > > >     exception stack pointer: ffffffff8046dc50
> > > >       process stack pointer: ffffffff8046ddd8
> > > >          current stack base: ffffffff80422000
> > > > crash>
> > >
> > > What exactly was the sequence of events? Was the system repeatedly and
> > > erroneously running one NMI after another for some reason, and *then* the
> > > "dump switch" was pressed? And the dumpsw_notify() function sends another
> > > NMI? And where does that dumpsw_notify() function live anyway?
> > >
> > > I'm just trying to get a grip on whether this will ever happen again, or
> > > whether it's fixing a one-time hardware abnormality?
> > >
> > > Dave
> > >
> >
> > As far as I am aware, we have had three separate customers encounter
> > this issue. It appears from the hardware SEL log that multiple PCI
> > SERR's came in at the same time and somehow triggered multiple NMIs.
> > You can see the SEL entries from the output of the "ipmitool sel"
> > command:
> >
> > 0231 11FC 02 01:53:47 12/17/09 3300 04 13 EB 6F A5 15 08
> >     Crit. Interrupt PCI SERR (PCI Bus 15 Device 1 Function 0) was asserted
> > 0232 1210 02 01:53:47 12/17/09 3300 04 13 EB 6F A5 16 20
> >     Crit. Interrupt PCI SERR (PCI Bus 16 Device 4 Function 0) was asserted
> > 0233 1224 02 01:53:47 12/17/09 3300 04 13 EB 6F A5 16 21
> >     Crit. Interrupt PCI SERR (PCI Bus 16 Device 4 Function 1) was asserted
> > 0234 1238 02 01:53:47 12/17/09 3300 04 13 EB 6F A5 16 30
> >     Crit. Interrupt PCI SERR (PCI Bus 16 Device 6 Function 0) was asserted
> > 0235 124C 02 01:53:47 12/17/09 3300 04 13 EB 6F A5 16 31
> >     Crit. Interrupt PCI SERR (PCI Bus 16 Device 6 Function 1) was asserted
> >
> > My understanding of the architecture of the system is that only one NMI
> > should have been asserted to the OS regardless of the number of times
> > there was a hardware error, but clearly that wasn't the case in these
> > three instances.
> >
> > Also, it seemed like my patch made crash a little bit more tolerant of
> > "corrupted" dump images which I thought could only be a good thing.
>
> Right, I understand that...
>
> But you didn't answer my questions re: the "dump switch" procedure and
> the dumpsw_notify() function. Was the system stuck in the NMI handler,
> somebody noticed the repetitive NMIs (?), and so they hit the
> "dump switch"? (whatever that may be...)
>
> Dave
>
> --
> Crash-utility mailing list
> Crash-utility@xxxxxxxxxx
> https://www.redhat.com/mailman/listinfo/crash-utility
>

Sorry, guess I wasn't clear. Nobody hit the dump switch on these
systems. They simply had multiple hardware errors that apparently
triggered the NMI more than once. That's what I was trying to show with
the SEL records: the multiple NMIs came straight from hardware with no
human intervention.

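For anyone following along, here is a rough sketch (in Python, and not crash's actual code or the patch's logic) of the condition visible in the quoted bt output above: the RSP saved in the NMI exception frame points back into the range of addresses already walked on the NMI exception stack, not into the process stack. All addresses are copied from that output. THREAD_SIZE = 8192 is an assumption about this x86_64 kernel, and the helper name classify_saved_rsp is made up for illustration.

```python
# Frame addresses walked on the NMI exception stack (from the quoted bt output).
NMI_FRAME_ADDRS = [
    0xffffffff8046dc50,  # #0 machine_kexec
    0xffffffff8046dd20,  # #1 crash_kexec
    0xffffffff8046dde0,  # #2 panic
    0xffffffff8046ded0,  # #3 dumpsw_notify
    0xffffffff8046dee0,  # #4 notifier_call_chain
    0xffffffff8046df00,  # #5 default_do_nmi
    0xffffffff8046df40,  # #6 do_nmi
    0xffffffff8046df50,  # #7 nmi
]

SAVED_RSP = 0xffffffff8046ddd8           # RSP in the exception frame
PROCESS_STACK_BASE = 0xffffffff80422000  # "current stack base" / THREAD_INFO
THREAD_SIZE = 8192                       # ASSUMPTION: 8 KB process stack


def classify_saved_rsp(rsp, exception_frames, stack_base, thread_size):
    """Report which stack a saved RSP appears to point into (hypothetical helper)."""
    if stack_base <= rsp < stack_base + thread_size:
        return "process stack"
    # Only the range spanned by the observed frames is used here; the real
    # bounds of the exception stack are not shown in the output.
    if min(exception_frames) <= rsp <= max(exception_frames):
        return "exception stack"
    return "unknown"


if __name__ == "__main__":
    print(classify_saved_rsp(SAVED_RSP, NMI_FRAME_ADDRS,
                             PROCESS_STACK_BASE, THREAD_SIZE))
```

With the values from these dumps this prints "exception stack", which matches the "cannot transition from exception stack to current process stack" message: the interrupted context was itself running on the NMI stack.
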
The systems went through a panic (due to multiple NMIs), a reboot, and
then crash was run on the resulting dump. In fact, crash was run
automatically via a startup script, and there was no human intervention
until someone noticed that crash was filling up the root file system
with a temporary file due to the infinite loop.

-Lucas