Re: Question re: xen hypervisor backtrace problem

Itsuro ODA <oda@xxxxxxxxxxxxx> · Wed, 15 Oct 2008 08:22:39 +0900

Hi Dave,

> Do you agree with these changes?

Yes.

Thank you.
Itsuro Oda

On Tue, 14 Oct 2008 16:30:18 -0400 (EDT)
Dave Anderson <anderson@xxxxxxxxxx> wrote:

> 
> Hello Oda-san,
> 
> I have a xen-syms vmcore that exercises a path that the hypervisor-related
> changes in lkcd_x86_trace.c cannot handle.  When the back trace reaches
> the "process_softirqs" text return address, which comes from
> "xen/arch/x86/x86_32/entry.S", it cannot go any further.  The backtrace
> therefore fails, and the recovery code then incorrectly searches
> for a (vmlinux) eframe:
> 
>   crash> bt -a
>   PCPU:  0  VCPU: ffbc7080
>   bt: cannot resolve stack trace:
>    #0 [ff1d3ebc] elf_core_save_regs at ff10a810
>    #1 [ff1d3ec4] common_interrupt at ff1222ed
>    #2 [ff1d3ed0] do_nmi at ff1335bb
>    #3 [ff1d3ef0] handle_nmi_mce at ff17442e
>    #4 [ff1d3f24] csched_tick at ff110aa7
>    #5 [ff1d3f80] timer_softirq_action at ff1155d2
>    #6 [ff1d3fa0] do_softirq at ff1143fe
>    #7 [ff1d3fb0] process_softirqs at ff173f61
>   bt: text symbols on stack:
>       [ff1d3ebc] disable_local_APIC at ff11db75
>       [ff1d3ec0] crash_nmi_callback at ff13cc96
>       [ff1d3ec4] common_interrupt at ff1222f2
>       [ff1d3ed0] do_nmi at ff1335c1
>       [ff1d3ef0] handle_nmi_mce at ff174435
>       [ff1d3f18] csched_tick at ff110aa7
>       [ff1d3f80] timer_softirq_action at ff1155d4
>       [ff1d3fa0] do_softirq at ff114405
>       [ff1d3fb0] process_softirqs at ff173f66
>   
>   bt: invalid structure size: task_struct
>       FILE: x86.c  LINE: 1576  FUNCTION: x86_eframe_search()
>   
>   [/usr/bin/crash] error trace: 816373b => 8164497 => 810c40c => 813ed94
>   
>     813ed94: SIZE_verify+126
>     810c40c: x86_eframe_search+1075
>     8164497: handle_trace_error+692
>     816373b: lkcd_x86_back_trace+2370
>   
>   bt: invalid structure size: task_struct
>       FILE: x86.c  LINE: 1576  FUNCTION: x86_eframe_search()
>   
>   crash> 
>   
> Now, the bogus vmlinux eframe search can be avoided by doing this in 
> handle_trace_error():
> 
> --- lkcd_x86_trace.c.orig       2008-10-14 15:46:33.000000000 -0400
> +++ lkcd_x86_trace.c    2008-10-14 16:09:26.000000000 -0400
> @@ -2440,12 +2441,14 @@ handle_trace_error(struct bt_info *bt, i
>          bt->flags |= BT_TEXT_SYMBOLS_PRINT|BT_ERROR_MASK;
>          back_trace(bt);
>  
> -        bt->flags = BT_EFRAME_COUNT;
> -        if ((cnt = machdep->eframe_search(bt))) {
> -               error(INFO, "possible exception frame%s:\n", 
> -                       cnt > 1 ? "s" : "");
> -               bt->flags &= ~(ulonglong)BT_EFRAME_COUNT;
> -               machdep->eframe_search(bt); 
> +       if (!XEN_HYPER_MODE()) {
> +               bt->flags = BT_EFRAME_COUNT;
> +               if ((cnt = machdep->eframe_search(bt))) {
> +                       error(INFO, "possible exception frame%s:\n", 
> +                               cnt > 1 ? "s" : "");
> +                       bt->flags &= ~(ulonglong)BT_EFRAME_COUNT;
> +                       machdep->eframe_search(bt); 
> +               }
>         }
>  }
> 
> With the handle_trace_error() change applied, bt -a no longer fails
> prematurely in the eframe search and goes on to the remaining CPUs:
>   
>   crash> bt -a
>   PCPU:  0  VCPU: ffbc7080
>   bt: cannot resolve stack trace:
>    #0 [ff1d3ebc] elf_core_save_regs at ff10a810
>    #1 [ff1d3ec4] common_interrupt at ff1222ed
>    #2 [ff1d3ed0] do_nmi at ff1335bb
>    #3 [ff1d3ef0] handle_nmi_mce at ff17442e
>    #4 [ff1d3f24] csched_tick at ff110aa7
>    #5 [ff1d3f80] timer_softirq_action at ff1155d2
>    #6 [ff1d3fa0] do_softirq at ff1143fe
>    #7 [ff1d3fb0] process_softirqs at ff173f61
>   bt: text symbols on stack:
>       [ff1d3ebc] disable_local_APIC at ff11db75
>       [ff1d3ec0] crash_nmi_callback at ff13cc96
>       [ff1d3ec4] common_interrupt at ff1222f2
>       [ff1d3ed0] do_nmi at ff1335c1
>       [ff1d3ef0] handle_nmi_mce at ff174435
>       [ff1d3f18] csched_tick at ff110aa7
>       [ff1d3f80] timer_softirq_action at ff1155d4
>       [ff1d3fa0] do_softirq at ff114405
>       [ff1d3fb0] process_softirqs at ff173f66
> 
>   PCPU:  1  VCPU: ff1b6080
>   ...
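
As a rough illustration of the guard in the handle_trace_error() patch above, here is a minimal, self-contained C sketch. Everything prefixed sketch_/SKETCH_ is a hypothetical stand-in and not part of crash: the flag value, the bt structure, and the search callback are assumptions made only so the fragment compiles on its own. It keeps the two-pass shape of the original (count first, then clear the count flag and search again to print) and skips both passes in hypervisor mode.

```c
#include <stdio.h>

/* Hypothetical stand-in for BT_EFRAME_COUNT; the real value lives in crash. */
#define SKETCH_EFRAME_COUNT 0x1ULL

/* Hypothetical, stripped-down stand-in for struct bt_info. */
struct sketch_bt {
    unsigned long long flags;
    int searches;               /* how many times the search callback ran */
};

/* Stand-in for machdep->eframe_search(): returns the number of frames found. */
typedef int (*sketch_search_fn)(struct sketch_bt *bt);

/*
 * Recovery step after a failed back trace.  In hypervisor mode the
 * vmlinux-oriented eframe search is skipped entirely; otherwise the
 * search runs once to count and, if anything was found, once more to print.
 */
static int sketch_recover(struct sketch_bt *bt, int xen_hyper_mode,
                          sketch_search_fn search)
{
    int cnt;

    if (xen_hyper_mode)
        return 0;

    bt->flags = SKETCH_EFRAME_COUNT;
    cnt = search(bt);
    if (cnt) {
        printf("possible exception frame%s:\n", cnt > 1 ? "s" : "");
        bt->flags &= ~SKETCH_EFRAME_COUNT;
        search(bt);
    }
    return cnt;
}

/* Fake search used only to exercise the sketch: always reports two frames. */
static int sketch_fake_search(struct sketch_bt *bt)
{
    bt->searches++;
    return 2;
}
```

Calling sketch_recover() with xen_hyper_mode set never touches the callback, which is the behaviour the patch is after.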
>   
> Carrying it one step further, the relevant part of the stack
> from the back trace above looks like this:
> 
>   crash> rd -s ff1d3ebc 84
>   ff1d3ebc:  disable_local_APIC+5 crash_nmi_callback+38 common_interrupt+82 cpu0_stack+16076 
>   ff1d3ecc:  0003d027 do_nmi+49 cpu0_stack+16120 00000000 
>   ff1d3edc:  ffbca000 ffbcbeb0 00000030 cpu0_stack+16308 
>   ff1d3eec:  0000e010 handle_nmi_mce+91 cpu0_stack+16120 00000100 
>   ff1d3efc:  00000005 000000ff 000005dc ffbdee88 
>   ff1d3f0c:  00000000 00000960 00020000 csched_tick+1239 
>   ff1d3f1c:  0000e008 00000083 ffbc7080 00000030 
>   ff1d3f2c:  0003d027 80000003 000583a8 per_cpu__schedule_data 
>   ff1d3f3c:  c840ceb2 00000000 ffbfda80 00000000 
>   ff1d3f4c:  00000000 00000000 00000100 00000960 
>   ff1d3f5c:  ffbdee80 00000246 000000ff csched_priv+4 
>   ff1d3f6c:  00000000 ffbfda8c __per_cpu_data_end+54972 e4c5d8d9 
>   ff1d3f7c:  0000008b timer_softirq_action+132 00000000 ffbc7080 
>   ff1d3f8c:  per_cpu__timers 00000000 cpu0_stack+16308 0000007b 
>   ff1d3f9c:  eaed7700 do_softirq+53 00000000 ffbc7080 
>   ff1d3fac:  0000007b process_softirqs+6 eb396d84 00000002 
>   ff1d3fbc:  c0678470 c0678470 00000002 eaed7700 
>   ff1d3fcc:  00000000 000d0000 c04011a7 00000061 
>   ff1d3fdc:  00000202 eb396d48 00000069 0000007b 
>   ff1d3fec:  0000007b 00000000 00000000 00000000 
>   ff1d3ffc:  ffbc7080 ffffffff ffffffff ffffffff
>   crash> 
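
The dump above can also be checked mechanically. The following is only a sketch: it takes the sixteen words from ff1d3f9c through ff1d3fd8, with the two symbolic entries replaced by the addresses shown in the "text symbols on stack" output (ff114405 and ff173f66), and looks for the last word that falls inside the hypervisor text. The text range [ff100000, ff180000) is an assumption chosen to cover the addresses seen in this thread, not a value taken from xen-syms, and the range test is a crude substitute for a real symbol lookup. All sketch_/SKETCH_ names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed hypervisor text range; NOT read from xen-syms. */
#define SKETCH_TEXT_START 0xff100000u
#define SKETCH_TEXT_END   0xff180000u

/* Stack address of the first word in sketch_stack_tail[]. */
#define SKETCH_STACK_BASE 0xff1d3f9cu

/* Words ff1d3f9c..ff1d3fd8 from the rd -s output, symbols as addresses. */
static const uint32_t sketch_stack_tail[16] = {
    0xeaed7700u, 0xff114405u, 0x00000000u, 0xffbc7080u,  /* ff1d3f9c */
    0x0000007bu, 0xff173f66u, 0xeb396d84u, 0x00000002u,  /* ff1d3fac */
    0xc0678470u, 0xc0678470u, 0x00000002u, 0xeaed7700u,  /* ff1d3fbc */
    0x00000000u, 0x000d0000u, 0xc04011a7u, 0x00000061u,  /* ff1d3fcc */
};

/* Return the index of the last word inside the text range, or -1 if none. */
static int sketch_last_text_word(const uint32_t *words, size_t n)
{
    int last = -1;
    size_t i;

    for (i = 0; i < n; i++) {
        if (words[i] >= SKETCH_TEXT_START && words[i] < SKETCH_TEXT_END)
            last = (int)i;
    }
    return last;
}
```

Under that assumed range, the last match is the word at ff1d3fb0, the process_softirqs+6 return address; the words after it do not fall in the assumed hypervisor text range.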
>   
> Clearly "process_softirqs" is the last text return address
> reference that the backtrace code can work with.  So to try
> to clean up the backtrace, I added this:
> 
> --- lkcd_x86_trace.c.orig       2008-10-14 15:46:33.000000000 -0400
> +++ lkcd_x86_trace.c    2008-10-14 16:09:26.000000000 -0400
> @@ -1423,6 +1423,7 @@ find_trace(
>                 if (XEN_HYPER_MODE()) {
>                         func_name = kl_funcname(pc);
>                         if (STREQ(func_name, "idle_loop") || STREQ(func_name, "hypercall")
> +                               || STREQ(func_name, "process_softirqs")
>                                 || STREQ(func_name, "tracing_off")
>                                 || STREQ(func_name, "handle_exception")) {
>                                 UPDATE_FRAME(func_name, pc, 0, sp, bp, asp, 0, 0, bp - sp, 0);
> 
> which shows:
>   
>   crash> bt -a
>   PCPU:  0  VCPU: ffbc7080
>    #0 [ff1d3ebc] elf_core_save_regs at ff10a810
>    #1 [ff1d3ec4] common_interrupt at ff1222ed
>    #2 [ff1d3ed0] do_nmi at ff1335bb
>    #3 [ff1d3ef0] handle_nmi_mce at ff17442e
>    #4 [ff1d3f24] csched_tick at ff110aa7
>    #5 [ff1d3f80] timer_softirq_action at ff1155d2
>    #6 [ff1d3fa0] do_softirq at ff1143fe
>    #7 [ff1d3fb0] process_softirqs at ff173f61
>   
>   PCPU:  1  VCPU: ff1b6080
>   ...
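
The name check that the find_trace() patch extends can be sketched on its own as follows. This is an illustration only: sketch_is_last_frame() and the table are hypothetical, plain strcmp() stands in for crash's STREQ(), and "last frame" reflects my reading of the description above (the point past which the hypervisor back trace does not continue) rather than anything the patch context itself shows. The five names are the ones in the patched condition.

```c
#include <stddef.h>
#include <string.h>

/* Function names from the patched condition in find_trace(). */
static const char *const sketch_last_frame_funcs[] = {
    "idle_loop",
    "hypercall",
    "process_softirqs",   /* the name added by the patch */
    "tracing_off",
    "handle_exception",
};

/* Return 1 if func_name is one of the listed names, 0 otherwise. */
static int sketch_is_last_frame(const char *func_name)
{
    size_t i;
    size_t n = sizeof(sketch_last_frame_funcs) /
               sizeof(sketch_last_frame_funcs[0]);

    if (func_name == NULL)
        return 0;

    for (i = 0; i < n; i++) {
        if (strcmp(func_name, sketch_last_frame_funcs[i]) == 0)
            return 1;
    }
    return 0;
}
```

A table keeps further additions to one line each, though the chained STREQ() form in the patch is equivalent for five names.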
>         
> For this vmcore the second patch alone makes the eframe-search patch
> unnecessary, but it seems that the first patch should be left in place
> for other unforeseen possibilities in the future.
> 
> Do you agree with these changes?
> 
> Thanks,
>   Dave
> 

-- 
Itsuro ODA <oda@xxxxxxxxxxxxx>
