I ran a test on ~200 dumpfiles, and for the most part, the patch is
quite useful in replacing the "Oops" message with something more
helpful.

However, the "[Hardware Error]" check should be the very last thing
checked. Actually, I'm not even sure whether it should be checked at
all, because there are dozens of pr_emerg(HW_ERR ...) calls in the
kernel, and it appears that they don't all necessarily cause the
kernel to crash.

For example, this sample vmcore currently correctly shows that the
kernel has crashed due to a BUG in mm/slab.c:

  crash> sys
        KERNEL: 2.6.32-220.el6.x86_64_slab_page_corruption/vmlinux.gz
      DUMPFILE: 2.6.32-220.el6.x86_64_slab_page_corruption/musa_vmcore  [PARTIAL DUMP]
          CPUS: 32
          DATE: Thu Feb 14 09:14:12 2013
        UPTIME: 14:18:49
  LOAD AVERAGE: 2.23, 1.94, 2.04
         TASKS: 1621
      NODENAME: musa
       RELEASE: 2.6.32-220.el6.x86_64
       VERSION: #1 SMP Wed Nov 9 08:03:13 EST 2011
       MACHINE: x86_64  (2599 Mhz)
        MEMORY: 128 GB
         PANIC: "kernel BUG at mm/slab.c:533!"
  crash> bt
  PID: 159    TASK: ffff881018c2eac0  CPU: 28  COMMAND: "events/28"
   #0 [ffff881018c359f0] machine_kexec at ffffffff81031fcb
   #1 [ffff881018c35a50] crash_kexec at ffffffff810b8f72
   #2 [ffff881018c35b20] oops_end at ffffffff814f04b0
   #3 [ffff881018c35b50] die at ffffffff8100f26b
   #4 [ffff881018c35b80] do_trap at ffffffff814efda4
   #5 [ffff881018c35be0] do_invalid_op at ffffffff8100ce35
   #6 [ffff881018c35c80] invalid_op at ffffffff8100bedb
      [exception RIP: free_block+357]
      RIP: ffffffff8115ffd5  RSP: ffff881018c35d30  RFLAGS: 00010006
      RAX: ffffea00321db658  RBX: ffff880f5bc52c80  RCX: 0000000000000002
      RDX: 004000000000006c  RSI: ffff880fb58e9ac0  RDI: ffff880e51a1d000
      RBP: ffff881018c35d80   R8: ffff880fb58e9ac0   R9: 0000000000000000
      R10: 000000000000000c  R11: 0000000000000000  R12: 0000000000000006
      R13: ffff880ffaa95828  R14: 0000000000000002  R15: ffffea0000000000
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
   #7 [ffff881018c35d88] drain_array at ffffffff81160211
   #8 [ffff881018c35dd8] cache_reap at ffffffff81161210
   #9 [ffff881018c35e38] worker_thread at ffffffff8108b2b0
  #10 [ffff881018c35ee8] kthread at ffffffff81090886
  #11 [ffff881018c35f48] kernel_thread at ffffffff8100c14a
  crash>

With your patch applied, it incorrectly shows this:

  crash> sys
        KERNEL: 2.6.32-220.el6.x86_64_slab_page_corruption/vmlinux.gz
      DUMPFILE: 2.6.32-220.el6.x86_64_slab_page_corruption/musa_vmcore  [PARTIAL DUMP]
          CPUS: 32
          DATE: Thu Feb 14 09:14:12 2013
        UPTIME: 14:18:49
  LOAD AVERAGE: 2.23, 1.94, 2.04
         TASKS: 1621
      NODENAME: musa
       RELEASE: 2.6.32-220.el6.x86_64
       VERSION: #1 SMP Wed Nov 9 08:03:13 EST 2011
       MACHINE: x86_64  (2599 Mhz)
        MEMORY: 128 GB
         PANIC: "[Hardware Error]: Machine check events logged"

I don't have a problem with the other parts of the patch. I'll move
the hardware error check to the bottom, and only use it if there are
no other relevant strings found, and then re-test that configuration.

Dave

----- Original Message -----
> There are just too many kinds of panic types are categorized under
> the same Oops: xxxx, makes this field really ambiguous and not so useful
>
> PANIC: "Oops: 0000 [#1] SMP " (check log for details)
>
> this patch separated 3 kinds of panicmsg out, as the most happening cases
> among the machines managed by me; the match string are copied
> from kernel source code exactly, after applied, I got panicmsg like:
>
> include/linux/kernel.h:#define HW_ERR
> panicmsg: "[Hardware Error]: CPU 7: Machine Check Exception: 5 Bank
> 11: f200003f000100b2"
> drivers/char/sysrq.c:__handle_sysrq
> panicmsg: "SysRq : Trigger a crash"
> arch/x86/kernel/traps.c:do_general_protection
> panicmsg: "general protection fault: 8800 [#1] SMP"
> arch/x86/mm/fault.c:show_fault_oops
> panicmsg: "BUG: unable to handle kernel paging request at
> 00001248a68eb328"
>
> We need to move the SysRq matching lines to before matching "Oops", because
> SysRq lines usually also has the Oops, need to take precedence for SysRq.
>
> Signed-off-by: Derek Che <drc@xxxxxxxxxxxxx>
> ---
>  task.c |   20 ++++++++++++++++----
>  1 file changed, 16 insertions(+), 4 deletions(-)
>
> diff --git a/task.c b/task.c
> index 4214d7f..1530e7b 100644
> --- a/task.c
> +++ b/task.c
> @@ -5509,19 +5509,31 @@ get_panicmsg(char *buf)
>          }
>          rewind(pc->tmpfile);
>          while (!msg_found && fgets(buf, BUFSIZE, pc->tmpfile)) {
> -                if (strstr(buf, "Oops: ") ||
> -                    strstr(buf, "kernel BUG at"))
> -                        msg_found = TRUE;
> +                if (strstr(buf, "[Hardware Error]: "))
> +                        msg_found = TRUE;
> +        }
> +        rewind(pc->tmpfile);
> +        while (!msg_found && fgets(buf, BUFSIZE, pc->tmpfile)) {
> +                if (strstr(buf, "general protection fault"))
> +                        msg_found = TRUE;
>          }
>          rewind(pc->tmpfile);
>          while (!msg_found && fgets(buf, BUFSIZE, pc->tmpfile)) {
>                  if (strstr(buf, "SysRq : Netdump") ||
>                      strstr(buf, "SysRq : Trigger a crashdump") ||
> -                    strstr(buf, "SysRq : Crash")) {
> +                    strstr(buf, "SysRq : Crash") ||
> +                    strstr(buf, "SysRq : Trigger a crash")) {
>                          pc->flags |= SYSRQ;
>                          msg_found = TRUE;
>                  }
>          }
> +        rewind(pc->tmpfile);
> +        while (!msg_found && fgets(buf, BUFSIZE, pc->tmpfile)) {
> +                if (strstr(buf, "Oops: ") ||
> +                    strstr(buf, "kernel BUG at") ||
> +                    strstr(buf, "BUG: unable to handle kernel "))
> +                        msg_found = TRUE;
> +        }
>          rewind(pc->tmpfile);
>          while (!msg_found && fgets(buf, BUFSIZE, pc->tmpfile)) {
>                  if (strstr(buf, "sysrq") &&
> --
Crash-utility mailing list
Crash-utility@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/crash-utility
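
A minimal sketch of the ordering proposed in the reply above, in which
the "[Hardware Error]" match is consulted only after every other
pattern has failed. This is not the actual task.c code: it scans an
in-memory array of log lines instead of pc->tmpfile, it does not set
the SYSRQ flag, and the names panic_pass, passes, and pick_panic_line
are hypothetical. The order of the non-hardware passes simply follows
the quoted patch.

```c
#include <stddef.h>
#include <string.h>

#define MAX_NEEDLES 5

/* One priority level: a NULL-terminated list of substrings to look for. */
struct panic_pass {
    const char *needles[MAX_NEEDLES];
};

/* Passes are tried in order; the hardware-error pass is deliberately last. */
static const struct panic_pass passes[] = {
    { { "general protection fault", NULL } },
    { { "SysRq : Netdump", "SysRq : Trigger a crashdump",
        "SysRq : Crash", "SysRq : Trigger a crash", NULL } },
    { { "Oops: ", "kernel BUG at", "BUG: unable to handle kernel ", NULL } },
    { { "[Hardware Error]: ", NULL } },   /* fallback only */
};

/*
 * Return the first log line matched by the highest-priority pass,
 * or NULL if no pass matches any line. Each pass rescans the whole
 * log, mirroring the rewind()/fgets() loops in the quoted patch.
 */
const char *pick_panic_line(const char *const lines[], size_t nlines)
{
    size_t npasses = sizeof(passes) / sizeof(passes[0]);

    for (size_t p = 0; p < npasses; p++) {
        for (size_t i = 0; i < nlines; i++) {
            for (size_t k = 0;
                 k < MAX_NEEDLES && passes[p].needles[k] != NULL; k++) {
                if (strstr(lines[i], passes[p].needles[k]))
                    return lines[i];
            }
        }
    }
    return NULL;
}
```

For a log that contains both "[Hardware Error]: Machine check events
logged" and "kernel BUG at mm/slab.c:533!", as the sample vmcore above
does, this ordering would select the BUG line; the hardware-error line
is returned only when it is the sole match.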