Re: [PATCH V2] take Hardware Error & kernel pointer bug as separate panicmsg

Dave Anderson <anderson@xxxxxxxxxx> · Tue, 3 Feb 2015 15:52:37 -0500 (EST)

I ran a test on ~200 dumpfiles, and for the most part, the patch is
quite useful in replacing the "Oops" message with something more
helpful.

However, the "[Hardware Error]" check should be the very last thing checked.
Actually, I'm not even sure whether it should be checked at all, because there
are dozens of pr_emerg(HW_ERR ...) calls in the kernel, and it appears that they
don't all necessarily cause the kernel to crash.  

For example, this sample vmcore currently correctly shows that the kernel has
crashed due to a BUG in mm/slab.c:

  crash> sys
        KERNEL: 2.6.32-220.el6.x86_64_slab_page_corruption/vmlinux.gz
      DUMPFILE: 2.6.32-220.el6.x86_64_slab_page_corruption/musa_vmcore  [PARTIAL DUMP]
          CPUS: 32
          DATE: Thu Feb 14 09:14:12 2013
        UPTIME: 14:18:49
  LOAD AVERAGE: 2.23, 1.94, 2.04
         TASKS: 1621
      NODENAME: musa
       RELEASE: 2.6.32-220.el6.x86_64
       VERSION: #1 SMP Wed Nov 9 08:03:13 EST 2011
       MACHINE: x86_64  (2599 Mhz)
        MEMORY: 128 GB
         PANIC: "kernel BUG at mm/slab.c:533!"
  crash> bt
  PID: 159    TASK: ffff881018c2eac0  CPU: 28  COMMAND: "events/28"
   #0 [ffff881018c359f0] machine_kexec at ffffffff81031fcb
   #1 [ffff881018c35a50] crash_kexec at ffffffff810b8f72
   #2 [ffff881018c35b20] oops_end at ffffffff814f04b0
   #3 [ffff881018c35b50] die at ffffffff8100f26b
   #4 [ffff881018c35b80] do_trap at ffffffff814efda4
   #5 [ffff881018c35be0] do_invalid_op at ffffffff8100ce35
   #6 [ffff881018c35c80] invalid_op at ffffffff8100bedb
      [exception RIP: free_block+357]
      RIP: ffffffff8115ffd5  RSP: ffff881018c35d30  RFLAGS: 00010006
      RAX: ffffea00321db658  RBX: ffff880f5bc52c80  RCX: 0000000000000002
      RDX: 004000000000006c  RSI: ffff880fb58e9ac0  RDI: ffff880e51a1d000
      RBP: ffff881018c35d80   R8: ffff880fb58e9ac0   R9: 0000000000000000
      R10: 000000000000000c  R11: 0000000000000000  R12: 0000000000000006
      R13: ffff880ffaa95828  R14: 0000000000000002  R15: ffffea0000000000
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
   #7 [ffff881018c35d88] drain_array at ffffffff81160211
   #8 [ffff881018c35dd8] cache_reap at ffffffff81161210
   #9 [ffff881018c35e38] worker_thread at ffffffff8108b2b0
  #10 [ffff881018c35ee8] kthread at ffffffff81090886
  #11 [ffff881018c35f48] kernel_thread at ffffffff8100c14a
  crash> 

With your patch applied, it incorrectly shows this:

  crash> sys
        KERNEL: 2.6.32-220.el6.x86_64_slab_page_corruption/vmlinux.gz
      DUMPFILE: 2.6.32-220.el6.x86_64_slab_page_corruption/musa_vmcore  [PARTIAL DUMP]
          CPUS: 32
          DATE: Thu Feb 14 09:14:12 2013
        UPTIME: 14:18:49
  LOAD AVERAGE: 2.23, 1.94, 2.04
         TASKS: 1621
      NODENAME: musa
       RELEASE: 2.6.32-220.el6.x86_64
       VERSION: #1 SMP Wed Nov 9 08:03:13 EST 2011
       MACHINE: x86_64  (2599 Mhz)
        MEMORY: 128 GB
         PANIC: "[Hardware Error]: Machine check events logged"

I don't have a problem with the other parts of the patch.

I'll move the hardware error check to the bottom, and only use it if there
are no other relevant strings found, and then re-test that configuration.    
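
To illustrate why the pass order matters, here is a minimal standalone
sketch. It is not the get_panicmsg() code in task.c: the real function
rescans pc->tmpfile with fgets() and has further passes that are not
modeled here. The names struct pass, find_panicmsg(), patched_order,
proposed_order and sample_log are hypothetical, and the two-line log is
made up from the strings shown above. It assumes "move to the bottom"
means after the four passes visible in the diff.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define COUNT(a) (sizeof(a) / sizeof((a)[0]))

/* One scanning pass: any of these substrings marks a line as the panic message. */
struct pass {
    const char *const *needles;
    size_t count;
};

/* Match strings copied from the quoted diff. */
static const char *const hw_strs[]    = { "[Hardware Error]: " };
static const char *const gpf_strs[]   = { "general protection fault" };
static const char *const sysrq_strs[] = {
    "SysRq : Netdump", "SysRq : Trigger a crashdump",
    "SysRq : Crash", "SysRq : Trigger a crash"
};
static const char *const oops_strs[]  = {
    "Oops: ", "kernel BUG at", "BUG: unable to handle kernel "
};

/* Pass order as posted in the V2 patch: hardware error is checked first. */
const struct pass patched_order[] = {
    { hw_strs,    COUNT(hw_strs)    },
    { gpf_strs,   COUNT(gpf_strs)   },
    { sysrq_strs, COUNT(sysrq_strs) },
    { oops_strs,  COUNT(oops_strs)  },
};

/* Assumed reordering: hardware error is only a last resort. */
const struct pass proposed_order[] = {
    { gpf_strs,   COUNT(gpf_strs)   },
    { sysrq_strs, COUNT(sysrq_strs) },
    { oops_strs,  COUNT(oops_strs)  },
    { hw_strs,    COUNT(hw_strs)    },
};

/* Hypothetical log: a non-fatal hardware error line precedes the real BUG. */
const char *const sample_log[] = {
    "[Hardware Error]: Machine check events logged",
    "kernel BUG at mm/slab.c:533!",
};

/*
 * Run the passes in order, scanning the whole log once per pass, and
 * return the first matching line, or NULL if no pass matches.
 */
const char *find_panicmsg(const char *const *log, size_t nlines,
                          const struct pass *order, size_t npasses)
{
    assert(log != NULL && order != NULL);
    for (size_t p = 0; p < npasses; p++)
        for (size_t i = 0; i < nlines; i++)
            for (size_t n = 0; n < order[p].count; n++)
                if (strstr(log[i], order[p].needles[n]))
                    return log[i];
    return NULL;
}
```

With sample_log, find_panicmsg() returns the hardware error line under
patched_order and the "kernel BUG at" line under proposed_order. A log
that contains only the hardware error line still yields that line under
either ordering, which is the fallback behavior described above.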

Dave

----- Original Message -----
> Too many kinds of panic are categorized under the same "Oops: xxxx"
> message, which makes this field ambiguous and not very useful:
> 
>        PANIC: "Oops: 0000 [#1] SMP " (check log for details)
> 
> This patch separates out 3 kinds of panicmsg, the most frequent cases
> among the machines I manage. The match strings are copied exactly from
> the kernel source code. With the patch applied, I get panicmsg like:
> 
>  include/linux/kernel.h:#define HW_ERR
>           panicmsg: "[Hardware Error]: CPU 7: Machine Check Exception: 5 Bank 11: f200003f000100b2"
>  drivers/char/sysrq.c:__handle_sysrq
>           panicmsg: "SysRq : Trigger a crash"
>  arch/x86/kernel/traps.c:do_general_protection
>           panicmsg: "general protection fault: 8800 [#1] SMP"
>  arch/x86/mm/fault.c:show_fault_oops
>           panicmsg: "BUG: unable to handle kernel paging request at 00001248a68eb328"
> 
> We need to move the SysRq matching lines before the "Oops" matching,
> because logs from a SysRq crash usually also contain an Oops line, so
> the SysRq match needs to take precedence.
> 
> Signed-off-by: Derek Che <drc@xxxxxxxxxxxxx>
> ---
>  task.c | 20 ++++++++++++++++----
>  1 file changed, 16 insertions(+), 4 deletions(-)
> 
> diff --git a/task.c b/task.c
> index 4214d7f..1530e7b 100644
> --- a/task.c
> +++ b/task.c
> @@ -5509,19 +5509,31 @@ get_panicmsg(char *buf)
>  	}
>  	rewind(pc->tmpfile);
>  	while (!msg_found && fgets(buf, BUFSIZE, pc->tmpfile)) {
> -	        if (strstr(buf, "Oops: ") ||
> -		    strstr(buf, "kernel BUG at"))
> -	        	msg_found = TRUE;
> +		if (strstr(buf, "[Hardware Error]: "))
> +			msg_found = TRUE;
> +	}
> +	rewind(pc->tmpfile);
> +	while (!msg_found && fgets(buf, BUFSIZE, pc->tmpfile)) {
> +		if (strstr(buf, "general protection fault"))
> +			msg_found = TRUE;
>  	}
>          rewind(pc->tmpfile);
>          while (!msg_found && fgets(buf, BUFSIZE, pc->tmpfile)) {
>                  if (strstr(buf, "SysRq : Netdump") ||
>  		    strstr(buf, "SysRq : Trigger a crashdump") ||
> -		    strstr(buf, "SysRq : Crash")) {
> +		    strstr(buf, "SysRq : Crash") ||
> +		    strstr(buf, "SysRq : Trigger a crash")) {
>  			pc->flags |= SYSRQ;
>                          msg_found = TRUE;
>  		}
>          }
> +	rewind(pc->tmpfile);
> +	while (!msg_found && fgets(buf, BUFSIZE, pc->tmpfile)) {
> +	        if (strstr(buf, "Oops: ") ||
> +		    strstr(buf, "kernel BUG at") ||
> +		    strstr(buf, "BUG: unable to handle kernel "))
> +	        	msg_found = TRUE;
> +	}
>          rewind(pc->tmpfile);
>          while (!msg_found && fgets(buf, BUFSIZE, pc->tmpfile)) {
>                  if (strstr(buf, "sysrq") &&
> 

--
Crash-utility mailing list
Crash-utility@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/crash-utility